Best Arm Identification in Generalized Linear Bandits

Abbas Kazerouni

Department of Electrical Engineering, Stanford University, Stanford, CA 94305

Lawrence M. Wein

Graduate School of Business, Stanford University, Stanford, CA 94305

Motivated by drug design, we consider the best-arm identification problem in generalized linear bandits. More specifically, we assume each arm has a vector of covariates, there is an unknown vector of parameters that is common across the arms, and a generalized linear model captures the dependence of rewards on the covariate and parameter vectors. The problem is to minimize the number of arm pulls required to identify an arm that is sufficiently close to optimal with a sufficiently high probability. Building on recent progress in best-arm identification for linear bandits (Xu et al. 2018), we propose the first algorithm for best-arm identification for generalized linear bandits, provide theoretical guarantees on its accuracy and sampling efficiency, and evaluate its performance in various scenarios via simulation.

Key words: best arm identification, generalized linear bandits, sequential clinical trial


The multi-armed bandit problem is a prototypical model for optimizing the tradeoff between exploration and exploitation. We consider a pure-exploration version of the bandit problem known as the best-arm identification problem, where the goal is to minimize the number of arm pulls required to select an arm that is, with sufficiently high probability, sufficiently close to the best arm. We assume that each arm has an observable vector of covariates or features, and there is an unknown vector of parameters (of the same dimension as the vector of features) that is common across arms. Whereas in a linear bandit the mean reward of an arm is the linear predictor (i.e., the inner product of the parameter vector and the feature vector), in our generalized linear model the mean reward is related to the linear predictor via a link function, which allows for mean rewards that are nonlinear in the linear predictor, as well as binary or integer rewards (via, e.g., logistic or Poisson regression). Hence, with every pull of an arm, the decision maker refines the estimate of the unknown parameter vector and learns simultaneously about all arms.

Our motivation for studying this version of the bandit problem comes from drug design. The field of drug design is immense (Martin 2010). There are different types of drugs (e.g., organic small molecules or biopolymer-based, i.e., biopharmaceutical, drugs), several phases of development, and a variety of experimental and computational methods used to select compounds with favorable properties. Three characteristics of our bandit problem make it ideally suited to certain subproblems within the drug design process. The first characteristic is the desire to run as few experiments as possible with the goal of selecting a particular compound (i.e., an arm in our model) for the next stage of analysis or testing. Hence, the actual performance of the selected arms during testing is of secondary importance (these pre-clinical tests do not involve human subjects), giving rise to a pure exploration problem. Second, drugs can often be described by a vector of features, which may include calculated properties of molecules, or the three-dimensional structure of the molecule and how it relates to the three-dimensional structure of the biological target. Finally, in many of these experimental settings, a generalized linear model provides a better fit than the linear model. This is true when the outcome is binary; e.g., in animal experiments, there is either survival or death, or either the presence or absence of disease, and some in vitro assays are qualitative (i.e., binary output). It is also true when the outcome is count data (e.g., the number of cells or animals that survived or died), or when the relationship between the outcome and the best linear predictor is nonlinear.

There may also be other potential applications of this model, e.g., where the aim is to choose the best (non-personalized) ad for an advertising campaign, and the observed output is binary (e.g., the ad led to a purchase or a click-through).

Literature Review. A recent review (Bubeck and Cesa-Bianchi 2012) categorizes bandit problems into three types: Markovian (maximizing discounted rewards, often in a Bayesian setting, and exemplified by the celebrated result in Gittins (1979)), stochastic (typically minimizing regret in a frequentist setting, with independent and identically distributed rewards (Lai and Robbins 1985)), and adversarial (a worst-case setting, where an adversary chooses the rewards (Auer et al. 2002)). We study the stochastic problem, but consider best-arm identification rather than regret minimization. In passing, we note that the ranking-and-selection problem (Kim and Nelson 2006) in the simulation literature has a similar goal to best-arm identification, although evaluating an alternative requires running simulation experiments.

There is a vast literature on best-arm identification in the multi-armed bandit setting with independent arms, where pulling one arm does not reveal any information about the reward of other arms (Even-Dar et al. 2006, Bubeck et al. 2009, Audibert et al. 2010, Gabillon et al. 2012, Kalyanakrishnan et al. 2012, Karnin et al. 2013, Chen et al. 2014). Different algorithms have been developed for variants of this setting, most of which use gap-based exploration where arms are played to reduce the uncertainty about the gaps between the rewards of pairs of arms. Because playing an arm reveals information only about that arm, each arm needs to be played several times to reduce the uncertainty about its reward. As such, these algorithms can be practically implemented only when the number of available arms is relatively small.

Rather than assume independent arms, our formulation considers parametric arms, where each arm has a covariate vector and there is an unknown parameter vector that is common across arms. There has been considerable work on linear parametric bandits (i.e., the mean reward of an arm is the inner product of its covariate vector and the parameter vector) under the minimum-regret objective (e.g., Auer (2002), Rusmevichientong and Tsitsiklis (2010) and references therein) as well as alternative probabilistic models of arm dependence (e.g., Russo and Van Roy (2014) and references therein). In addition, contextual bandit models allow additional side information in each round, which can model, e.g., patient information in clinical trials or consumer information in online advertising (e.g., Wang et al. (2005), Seldin et al. (2011), Goldenshluger and Zeevi (2013)). Relevant for our purposes, the generalized linear parametric bandit, which uses an inverse link function to relate the linear predictor and the mean reward, has been studied under regret minimization (Filippi et al. 2010, Li et al. 2017).

However, relatively little work exists on best-arm identification in parametric bandits. The first analysis of best-arm identification in linear bandits (Soare et al. 2014) proposed a static exploration algorithm that was inspired by the transductive experiment design principle (Yu et al. 2006). This algorithm determines the sequence of to-be-played actions before making any observations and therefore cannot adapt to the observed rewards. Recently, major progress was made by Xu et al. (2018), who introduced the first adaptive algorithm for best-arm identification in linear bandits. These authors design a gap-based exploration algorithm by employing the standard confidence sets constructed in the literature for linear bandits under regret minimization.

Our Contribution. We adapt the gap-based exploration algorithm of Xu et al. (2018) from the linear setting to the generalized linear case, which requires us to derive confidence sets for reward gaps between different pairs of arms. In the regret-minimization setting, the typical approach to the linear bandit is to develop a confidence set for the unknown parameter vector that governs the rewards of all arms, whereas in the best-arm setting, a confidence set on the reward gaps is needed. For best-arm identification in the linear bandit, Xu et al. (2018) were able to convert the confidence set for the parameter vector into efficient confidence sets for the reward gaps. However, this approach breaks down in the generalized linear bandit; i.e., naively converting the confidence set for the parameter vector in Filippi et al. (2010) into confidence sets for reward gaps between arms leads to extremely loose confidence sets, which strongly degrades the performance of the gap-based exploration algorithm. Rather than use this indirect method, we build gap confidence sets directly from the data.

The remainder of this paper is organized as follows. We first formulate the best-arm identification problem for generalized linear bandits. We then describe our algorithm and establish its theoretical guarantees. Finally, we provide simulation results and offer concluding remarks.

Consider a decision maker who is seeking to find the best among a set of available arms. We let denote the set of possible arms. There is a feature vector associated with arm , for . These feature vectors are known to the decision maker and each summarizes the available information about the corresponding arm. We employ a generalized linear model (McCullagh and Nelder 1989) and assume that the reward of each arm has a particular distribution in an exponential family with mean


where $\theta^*$ is an unknown parameter that governs the reward of all arms and $\mu$ is a strictly increasing function known as the inverse link function. Different choices for the function $\mu$ in (1) result in modeling different reward structures inside the exponential family. For example, choosing $\mu(z) = e^z$ and $\mu(z) = 1/(1 + e^{-z})$ correspond to a Poisson regression model and a logistic regression model, respectively.
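As a minimal illustration of how the inverse link function shapes the mean reward, the sketch below evaluates the exponential (Poisson) and logistic inverse links on a linear predictor. The function names and the numerical values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def poisson_inverse_link(z: float) -> float:
    """Inverse link for Poisson regression: mean = exp(z)."""
    return math.exp(z)


def logistic_inverse_link(z: float) -> float:
    """Inverse link for logistic regression: mean = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_reward(features, theta, inverse_link):
    """Mean reward of an arm: inverse link applied to the linear predictor."""
    linear_predictor = sum(x * t for x, t in zip(features, theta))
    return inverse_link(linear_predictor)


# Hypothetical parameter and feature vectors (not taken from the paper).
theta = [0.5, -0.25]
arm_features = [1.0, 2.0]  # linear predictor = 0.5 - 0.5 = 0.0

print(mean_reward(arm_features, theta, logistic_inverse_link))  # 0.5
print(mean_reward(arm_features, theta, poisson_inverse_link))   # 1.0
```

The same linear predictor yields different mean rewards under the two links, which is why the choice of inverse link determines the reward model.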

The decision maker chooses an arm to play in each round . If arm is played in round , a stochastic reward is observed, which satisfies

Let be the optimal arm; i.e., the arm with the highest expected reward. By exploring different arms, the decision maker is trying to find the optimal arm as soon as possible based on the noisy observations. Let be a stopping time that dictates whether enough evidence has been gathered to declare the optimal arm. The declared optimal arm is denoted by . An exploration strategy can be represented by , where, at any time , is a function mapping from the previous observations to the arm to be played next, and determines whether enough information has been gathered to declare the optimal arm, . Because finding the exact optimal arm may require a prohibitively large amount of exploration, the performance of an exploration strategy is evaluated via the following relaxed criterion.

Definition 1

Given $\epsilon > 0$ and $\delta \in (0, 1)$, an exploration strategy is said to be $(\epsilon, \delta)$-optimal if


In this definition, $\epsilon$ denotes an acceptable region around the optimal arm and $1 - \delta$ represents the confidence in identifying an arm within this region. This criterion relaxes the notion of optimality by allowing the exploration strategy to return a sufficiently good, but not necessarily optimal, arm. With this definition in place, the decision maker’s goal is to design an $(\epsilon, \delta)$-optimal exploration strategy with the smallest possible stopping time.
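The relaxed criterion can be checked empirically in a simulation where the true mean rewards are known. The sketch below is one hedged way to do so: it tests whether a returned arm lies within a tolerance `epsilon` of the best mean reward and estimates the failure frequency over repeated runs, which should stay below the allowed error probability. The helper names, the non-strict inequality, and the example numbers are assumptions made for illustration.

```python
def is_epsilon_optimal(mean_rewards, returned_arm, epsilon):
    """True if the returned arm's mean reward is within epsilon of the best."""
    return max(mean_rewards) - mean_rewards[returned_arm] <= epsilon


def empirical_failure_rate(mean_rewards, returned_arms, epsilon):
    """Fraction of independent runs whose returned arm is not epsilon-optimal."""
    failures = sum(
        not is_epsilon_optimal(mean_rewards, arm, epsilon)
        for arm in returned_arms
    )
    return failures / len(returned_arms)


# Hypothetical true mean rewards and arms returned by four simulated runs.
true_means = [0.75, 0.70, 0.25]
returned = [0, 1, 2, 0]

print(empirical_failure_rate(true_means, returned, epsilon=0.1))  # 0.25
```

In this toy example only the run that returned arm 2 counts as a failure, since arm 1 is within the tolerance of the best arm.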

Before proceeding to the algorithm, we introduce additional notation and state a set of regularity assumptions. We let denote the set of feature vectors and assume that feature vectors, the unknown reward parameter and the rewards are bounded; i.e., there exist such that , , and almost surely for all . We also assume that is continuously differentiable, Lipschitz continuous with constant and satisfies . For example, in the case of logistic regression, and depends on . We define the gap between any two arms to be and define the optimal gap associated to an arm as


Finally, for any positive semi-definite matrix , we let .

In this section, we propose an exploration strategy for the problem formulated in the preceding section. Following Xu et al. (2018), our algorithm consists of the following steps:

  1. Build confidence sets for the pairwise gaps between arms,

  2. Identify the potential best arm and an alternative arm that has the most ambiguous gap with the best arm,

  3. Play an arm to reduce this ambiguity.

These steps are repeated sequentially until the ambiguity in step 2 drops below a certain threshold.

The confidence sets are derived first, and the algorithm is presented afterward.

To build the confidence sets for reward gaps, we follow ideas in Filippi et al. (2010) but develop confidence sets directly for gaps instead of arm rewards.

Let be the history of actions played and random rewards observed prior to period , and let be shorthand notation for the feature vector associated with the arm played in period . For any , let be the empirical covariance matrix and assume it is nonsingular for any for some fixed value . We let be the minimum eigenvalue of and define . Given the observations by the start of period , the Maximum Likelihood (ML) estimate of the reward parameter, , solves the equation


Based on the estimated reward parameter, we can take


as an estimate for the gap between arms , which is a function of the observations made prior to period .
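To make the estimation step concrete, the sketch below computes an ML estimate for a logistic model and the resulting plug-in gap estimate. It assumes the estimating equation takes the canonical GLM form, sum over past rounds of (reward minus inverse link of the linear predictor) times the feature vector equal to zero, as in Filippi et al. (2010). It solves this equation by plain gradient ascent rather than any method specified in the paper. All names, the step size, and the data are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic inverse link."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score(theta, xs, rs, mu):
    """Assumed estimating function: sum_s (r_s - mu(x_s . theta)) * x_s."""
    g = [0.0] * len(theta)
    for x, r in zip(xs, rs):
        resid = r - mu(dot(x, theta))
        for k in range(len(theta)):
            g[k] += resid * x[k]
    return g


def ml_estimate(xs, rs, mu, step=0.5, iters=2000):
    """Drive the score to zero by gradient ascent (illustrative solver)."""
    theta = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(iters):
        g = score(theta, xs, rs, mu)
        theta = [t + step * gk / n for t, gk in zip(theta, g)]
    return theta


def estimated_gap(x_i, x_j, theta_hat, mu):
    """Plug-in estimate of the mean-reward gap between arms i and j."""
    return mu(dot(x_i, theta_hat)) - mu(dot(x_j, theta_hat))


# Hypothetical history: two arms with indicator features, four pulls each.
xs = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
rs = [1, 1, 1, 0, 1, 0, 0, 0]  # empirical means 0.75 and 0.25

theta_hat = ml_estimate(xs, rs, sigmoid)
print(round(estimated_gap([1.0, 0.0], [0.0, 1.0], theta_hat, sigmoid), 6))  # 0.5
```

With indicator features the ML estimate reproduces the empirical means of each arm, so the estimated gap equals the difference of the observed success rates.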

Given , let


where is a tunable parameter and is a time-varying quantity that scales the width of the confidence sets for all pairs of arms. We set


in the theoretical analysis, and set so as to achieve robust performance across a variety of scenarios in the computational study. We consider the following confidence set for the gap between any two arms based on the observations made prior to period :


The confidence set defined in (8) is centered around the estimated gap between the two arms in (5) and, as we will show in the theoretical analysis, contains the true gap with high probability. To further simplify the representation, we let


represent the width of the confidence set in (8).
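The sketch below illustrates how a width of this kind could be computed. It assumes the width is a scale factor multiplied by the norm of the feature difference weighted by the inverse of the matrix of summed outer products of the played feature vectors. The argument `scale` is a hypothetical stand-in for the time-varying and problem-dependent constants, which are not reproduced here, and the small linear solver is included only to keep the example self-contained.

```python
import math


def design_matrix(xs):
    """Sum of outer products x x^T over the played feature vectors."""
    d = len(xs[0])
    M = [[0.0] * d for _ in range(d)]
    for x in xs:
        for a in range(d):
            for b in range(d):
                M[a][b] += x[a] * x[b]
    return M


def solve(M, y):
    """Solve M z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    A = [row[:] + [y[k]] for k, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; more exploration is needed")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = A[r][n] - sum(A[r][c] * z[c] for c in range(r + 1, n))
        z[r] = s / A[r][r]
    return z


def gap_width(x_i, x_j, M, scale):
    """Assumed width: scale * sqrt(y^T M^{-1} y) with y = x_i - x_j."""
    y = [a - b for a, b in zip(x_i, x_j)]
    z = solve(M, y)
    return scale * math.sqrt(sum(a * b for a, b in zip(y, z)))


# Hypothetical history: two indicator-feature arms played four times each.
M = design_matrix([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
print(gap_width([1.0, 0.0], [0.0, 1.0], M, scale=2.0))  # about 1.4142
```

Playing either arm more often increases the corresponding diagonal entry of the matrix and shrinks the width, which matches the intuition that more data in a direction reduces uncertainty in that direction.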

Although we follow the basic approach in Filippi et al. (2010) in deriving these confidence sets, it is worth noting that they cannot be deduced from the results presented in Filippi et al. (2010). More precisely, an immediate use of the results in that paper gives rise to a very loose confidence set that is similar to (8) except that is replaced by the looser factor .

Algorithm 1 GLGapE
Output: approximately best arm
1:  Initial Exploratory Phase: Play random arms and gather observations in , and initialize ’s for the played arms
2:  for  do
3:     //select a gap to examine
5:     if  then
6:        return as the best arm
7:     end if
8:     //select an action
9:     select-arm
10:    Play
11:    Observe
14:  end for

Algorithm 2 select-gap
Output: a gap to be examined
1:  Find the ML estimate
4:  Find
5:  Let
7:  return

Algorithm 3 select-arm
Output: arm to be played
1:  Find as the solution of
2:  Determine for
3:  Find
4:  return

With the confidence sets established, we are now ready to describe our proposed algorithm. Details and successive steps of the proposed algorithm are presented in Algorithms 1-3. Following the gap-based exploration scheme, the proposed algorithm consists of two major components that are described below.

  Selecting a Gap to Explore:

The algorithm starts by playing random arms such that the empirical covariance matrix is nonsingular. At any subsequent period , the algorithm first finds the empirically best arm . Then, to check whether this arm is within distance of the true optimal arm, the algorithm takes a pessimistic approach. In particular, it finds another arm that is the most advantageous over arm within the gap confidence sets; i.e., . Note that this pessimistic gap consists of two components: the estimated gap and the uncertainty in the gap.

If this pessimistic gap is less than , the algorithm stops and declares the empirically best arm as the optimal arm. Otherwise, it selects an arm to reduce the uncertainty component in the identified gap. According to (9), this uncertainty is governed by where and .
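The gap-selection step described above can be sketched as follows. The function takes estimated mean rewards and a callable returning the confidence width for a pair of arms, finds the empirically best arm, and then finds the challenger that maximizes the estimated gap plus the width. The names, the exclusion of the best arm from the challenger search, and the constant width used in the example are assumptions for illustration.

```python
def select_gap(est_means, width):
    """Return (best arm, challenger, pessimistic gap).

    est_means: estimated mean reward of each arm.
    width(i, j): confidence width for the gap between arms i and j.
    """
    arms = range(len(est_means))
    best = max(arms, key=lambda i: est_means[i])

    def pessimistic(j):
        # Estimated advantage of j over the best arm, plus its uncertainty.
        return est_means[j] - est_means[best] + width(best, j)

    challenger = max((j for j in arms if j != best), key=pessimistic)
    return best, challenger, pessimistic(challenger)


# Hypothetical estimates and a constant width of 0.1 for every pair.
best, challenger, gap = select_gap([0.75, 0.25, 0.70], lambda i, j: 0.1)
print(best, challenger)

epsilon = 0.1  # hypothetical tolerance
print("stop" if gap <= epsilon else "keep exploring")
```

Here arm 2 is the challenger because its estimate is close to that of arm 0, and the pessimistic gap of about 0.05 is already below the tolerance, so this toy run would stop.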

  Selecting an Arm:

While there are different ways to reduce , we follow the approach in Xu et al. (2018). With a slight abuse of notation, let us define for any sequence of feature vectors , where represents a generic time period. Let be the sequence of feature vectors that would have minimized the uncertainty in the direction of . For each arm , let denote the relative frequency of appearing in the sequence when . As has been shown in Section 5.1 of Xu et al. (2018),


where is the solution of the linear program


To minimize the uncertainty in the direction of , the algorithm plays the arm


where is the number of times arm has been played prior to period .
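A hedged sketch of the arm-selection rule follows. It assumes, in line with the construction in Xu et al. (2018), that the target play frequencies are the normalized absolute values of the linear-program solution, and that the arm played is the one with positive target frequency whose play count is smallest relative to that frequency. The linear program itself is not solved here; the weight vector `w` is taken as given, and all names and numbers are hypothetical.

```python
def allocation_from_weights(w):
    """Normalize absolute weights into target play frequencies."""
    total = sum(abs(v) for v in w)
    if total == 0:
        raise ValueError("weights must not all be zero")
    return [abs(v) / total for v in w]


def select_arm(counts, p):
    """Pick the arm with p > 0 that minimizes counts[i] / p[i]."""
    candidates = [i for i in range(len(p)) if p[i] > 0]
    return min(candidates, key=lambda i: counts[i] / p[i])


# Hypothetical weights expressing x_0 - x_1 as 1 * x_0 + (-1) * x_1 + 0 * x_2.
w = [1.0, -1.0, 0.0]
p = allocation_from_weights(w)
print(p)  # [0.5, 0.5, 0.0]

counts = [3, 1, 0]  # hypothetical number of times each arm has been played
print(select_arm(counts, p))  # 1
```

Arm 2 is never selected in this example even though it has not been played, because its target frequency is zero; the rule only balances play counts among arms that help reduce uncertainty in the chosen direction.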

In this section, we provide theoretical guarantees for the performance of the proposed algorithm. We first prove that the algorithm indeed finds an optimal arm and then provide an upper bound on its stopping time.

We start by proving that the confidence sets constructed earlier hold with high probability at all times.

Proposition 1

Fix and such that and , where is the feature dimension. Then the following holds with probability at least :




The proof is a slight modification of the proof of Proposition 1 in Filippi et al. (2010).

Let . According to (4), the ML estimate satisfies . By the mean value theorem, there exist points with such that


On the other hand, by the fundamental theorem of calculus, we have



The definition of implies that for any ,

It follows that . Hence, and are positive definite and nonsingular. Therefore, from (15) and (16), it follows that


The inequality implies that , and hence for arbitrary . Thus, by (17) we get


As has been shown in the proof of Proposition 1 in Filippi et al. (2010),


holds with probability at least . Combining (18) and (19) shows that


holds with probability at least . Finally, because , taking the maximum over on the right side of (20) completes the proof.

The following theorem, which is a direct consequence of Proposition 1, shows that the confidence sets in (8) contain the true gaps at all times with high probability. We define event to be


and throughout this section we set according to (7).

Theorem 1

Let be such that . Then event occurs with probability at least .


For any , let and define the event as

Applying Proposition 1 for implies that . The union bound then gives

The proof is completed by noting that .

With the confidence sets established, the next theorem proves optimality of the proposed algorithm.

Theorem 2

Let and be arbitrary. Then the proposed algorithm is optimal; i.e., at its stopping time, the algorithm returns an arm such that


Let be the stopping time of the algorithm and let be the returned arm. Suppose that . Then, according to line 5 of Algorithm 1, we have

From this, we get that

which means that event does not occur. According to Theorem 1, this can happen with probability at most . Thus, happens with probability at most .

In this subsection, we study the sample complexity of the proposed algorithm. The following theorem provides an upper bound on the number of experiments the proposed algorithm needs to carry out before identifying the optimal arm. Before stating the result, let us take to be the smallest gap and for any , define


which represents the complexity of the exploration problem in terms of the problem parameters.

Theorem 3

Let be the stopping time of the proposed algorithm. Then


is satisfied with probability at least . In asymptotic notation, (23) can be expressed as


Theorem 3 provides an upper bound on the stopping time of the proposed algorithm in terms of the parameters of the exploration problem. As expected, the number of experiments required by the proposed algorithm before declaring a near-optimal arm decreases in the reward tolerance () and the error probability (), and increases in the number of features () and arms (). In terms of the dependence on dimension (), reward bound () and the complexity parameter (), the sample complexity in (24) is similar to that derived in Xu et al. (2018) for linear bandits. The main difference is the appearance of the factor in the complexity bound (24), which encodes the difficulty of learning the inverse link function . As the inverse link function becomes flatter on the boundaries of the input domain (i.e., has smaller ), more samples are required to distinguish between different pairs of arms.

The remainder of this subsection is devoted to proving Theorem 3, which requires a set of preliminaries. We start by introducing some additional notation. For any and any , let

and define . Note that with this notation, defined in Algorithm 2 can be represented as . For any , let be the optimal value of the linear program in (11); i.e.,


For any two real numbers , we use the shorthand notation .

Our proof for Theorem 3 relies on a number of results from the literature, which we state here for completeness. The following two lemmas are proved in Xu et al. (2018).

Lemma 1 (Lemma 1 of Xu et al. (2018))

For any , we have


Lemma 2 (Lemma 4 of Xu et al. (2018))

When event holds, satisfies the following bounds:

  1. If either or is the best arm:

  2. If neither nor is the best arm:

The following lemma is proved in Antos et al. (2010).

Lemma 3 (Proposition 6 of Antos et al. (2010))

For any , let and , for some . Define . Then, for any positive such that , we have .

The following lemma establishes an upper bound on the solution of the linear program in (11).

Lemma 4

Let and let be the solution of (11). Then we have


Let be such that and all other elements of are zero. Clearly, satisfies the constraint in (11). Therefore, we have

With the above lemmas in place, we are ready to prove the following theorem.

Theorem 4

If event occurs, then the stopping time of the proposed algorithm satisfies


Suppose that event occurs. Let be an arbitrary arm, and let be the last round in which arm was pulled. Because , Lemma 2 implies that

which in turn gives rise to the following three inequalities:


Rearranging (26) yields


From Lemma 1, we have

which by substituting from (9) gives


Combining (27) and (28) gives


Because arm was selected by Algorithm 3 in round , it follows that


Then, from (29), we get


Note that . Hence, it follows from (22) and (31) that

which completes the proof.

We are now in a position to prove Theorem 3.

Suppose that event holds. Then


where (32) follows by Jensen’s inequality and the logarithmic arithmetic-geometric mean inequality (i.e., ). Applying the inequality for any to (32) yields


Let and define . With a change of variable , (33) can be written as

According to Lemma 3, this is possible only if