Best Arm Identification in Generalized Linear Bandits
Abbas Kazerouni
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, abbask@stanford.edu
Lawrence M. Wein
Graduate School of Business, Stanford University, Stanford, CA 94305, lwein@stanford.edu
Motivated by drug design, we consider the best-arm identification problem in generalized linear bandits. More specifically, we assume each arm has a vector of covariates, there is an unknown vector of parameters that is common across the arms, and a generalized linear model captures the dependence of rewards on the covariate and parameter vectors. The problem is to minimize the number of arm pulls required to identify an arm that is sufficiently close to optimal with a sufficiently high probability. Building on recent progress in best-arm identification for linear bandits (Xu et al. 2018), we propose the first algorithm for best-arm identification for generalized linear bandits, provide theoretical guarantees on its accuracy and sampling efficiency, and evaluate its performance in various scenarios via simulation.
Key words: best arm identification, generalized linear bandits, sequential clinical trial
The multi-armed bandit problem is a prototypical model for optimizing the tradeoff between exploration and exploitation. We consider a pure-exploration version of the bandit problem known as the best-arm identification problem, where the goal is to minimize the number of arm pulls required to select an arm that is, with sufficiently high probability, sufficiently close to the best arm. We assume that each arm has an observable vector of covariates or features, and there is an unknown vector of parameters (of the same dimension as the vector of features) that is common across arms. Whereas in a linear bandit the mean reward of an arm is the linear predictor (i.e., the inner product of the parameter vector and the feature vector), in our generalized linear model the mean reward is related to the linear predictor via a link function, which allows for mean rewards that are nonlinear in the linear predictor, as well as binary or integer rewards (via, e.g., logistic or Poisson regression). Hence, with every pull of an arm, the decision maker refines the estimate of the unknown parameter vector and learns simultaneously about all arms.
Our motivation for studying this version of the bandit problem comes from drug design. The field of drug design is immense (Martin 2010). There are different types of drugs (e.g., organic small molecules or biopolymer-based, i.e., biopharmaceutical, drugs), several phases of development, and a variety of experimental and computational methods used to select compounds with favorable properties. Three characteristics of our bandit problem make it ideally suited to certain subproblems within the drug design process. The first characteristic is the desire to run as few experiments as possible with the goal of selecting a particular compound (i.e., an arm in our model) for the next stage of analysis or testing. Hence, the actual performance of the selected arms during testing is of secondary importance (these preclinical tests do not involve human subjects), giving rise to a pure exploration problem. Second, drugs can often be described by a vector of features, which may include calculated properties of molecules, or the three-dimensional structure of the molecule and how it relates to the three-dimensional structure of the biological target. Finally, in many of these experimental settings, a generalized linear model provides a better fit than the linear model. This is true when the outcome is binary; e.g., in animal experiments, there is either survival or death, or either the presence or absence of disease, and some in vitro assays are qualitative (i.e., binary output). It is also true when the outcome is counting data (e.g., the number of cells or animals that survived or died), or when the relationship between the outcome and the best linear predictor is nonlinear.
There may also be other potential applications of this model, e.g., where the aim is to choose the best (non-personalized) ad for an advertising campaign, and the observed output is binary (e.g., the ad led to a purchase or a click-through).
Literature Review. A recent review (Bubeck and Cesa-Bianchi 2012) categorizes bandit problems into three types: Markovian (maximizing discounted rewards, often in a Bayesian setting, and exemplified by the celebrated result in Gittins (1979)), stochastic (typically minimizing regret in a frequentist setting, with independent and identically distributed rewards (Lai and Robbins 1985)), and adversarial (a worst-case setting, where an adversary chooses the rewards (Auer et al. 2002)). We study the stochastic problem, but consider best-arm identification rather than regret minimization. In passing, we note that the ranking-and-selection problem (Kim and Nelson 2006) in the simulation literature has a similar goal to best-arm identification, although evaluating an alternative requires running simulation experiments.
There is a vast literature on best-arm identification in the multi-armed bandit setting with independent arms, where pulling one arm does not reveal any information about the reward of other arms (Even-Dar et al. 2006, Bubeck et al. 2009, Audibert et al. 2010, Gabillon et al. 2012, Kalyanakrishnan et al. 2012, Karnin et al. 2013, Chen et al. 2014). Different algorithms have been developed for variants of this setting, most of which use gap-based exploration where arms are played to reduce the uncertainty about the gaps between the rewards of pairs of arms. Because playing an arm reveals information only about that arm, each arm needs to be played several times to reduce the uncertainty about its reward. As such, these algorithms can be practically implemented only when the number of available arms is relatively small.
Rather than assume independent arms, our formulation considers parametric arms, where each arm has a covariate vector and there is an unknown parameter vector that is common across arms. There has been considerable work on linear parametric bandits (i.e., where the mean reward of an arm is the inner product of its covariate vector and the parameter vector) under the minimum-regret objective (e.g., Auer (2002), Rusmevichientong and Tsitsiklis (2010) and references therein) as well as alternative probabilistic models of arm dependence (e.g., Russo and Van Roy (2014) and references therein). In addition, contextual bandit models allow additional side information in each round, which can model, e.g., patient information in clinical trials or consumer information in online advertising (e.g., Wang et al. (2005), Seldin et al. (2011), Goldenshluger and Zeevi (2013)). Relevant for our purposes, the generalized linear parametric bandit, which uses an inverse link function to relate the linear predictor and the mean reward, has been studied under regret minimization (Filippi et al. 2010, Li et al. 2017).
However, relatively little work exists on best-arm identification in parametric bandits. The first analysis of best-arm identification in linear bandits (Soare et al. 2014) proposed a static exploration algorithm that was inspired by the transductive experiment design principle (Yu et al. 2006). This algorithm determines the sequence of to-be-played actions before making any observations and fails to adapt to the observed rewards. Recently (Xu et al. 2018), major progress has been made via the first adaptive algorithm for best-arm identification in linear bandits. These authors design a gap-based exploration algorithm by employing the standard confidence sets constructed in the literature for linear bandits under regret minimization.
Our Contribution. We adapt the gap-based exploration algorithm of Xu et al. (2018) from the linear setting to the generalized linear case, which requires us to derive confidence sets for reward gaps between different pairs of arms. In the regret-minimization setting, the typical approach to the linear bandit is to develop a confidence set for the unknown parameter vector that governs the rewards of all arms, whereas in the best-arm setting, a confidence set on the reward gaps is needed. In best-arm identification for the linear bandit (Xu et al. 2018), the authors were able to convert the confidence set for the parameter vector into efficient confidence sets for the reward gaps. However, this approach breaks down in the generalized linear bandit; i.e., naively converting the confidence set for the parameter vector in Filippi et al. (2010) into confidence sets for reward gaps between arms leads to extremely loose confidence sets, which strongly degrades the performance of the gap-based exploration algorithm. Rather than use this indirect method, we build gap confidence sets directly from the data.
The remainder of this paper is organized as follows. In Section 2, we formulate the best-arm identification problem for generalized linear bandits. We describe our algorithm in Section 3 and establish theoretical guarantees in Section 4. We provide simulation results in Section 5 and offer concluding remarks in Section 6.
Consider a decision maker who is seeking to find the best among a set of available arms. We let $\mathcal{A} = \{1, \ldots, K\}$ denote the set of possible arms. There is a feature vector $x_a \in \mathbb{R}^d$ associated with arm $a$, for $a \in \mathcal{A}$. These feature vectors are known to the decision maker and each summarizes the available information about the corresponding arm. We employ a generalized linear model (McCullagh and Nelder 1989) and assume that the reward of each arm $a$ has a particular distribution in an exponential family with mean
(1) $\mathbb{E}[r_a] = \mu(x_a^\top \theta^*),$
where $\theta^* \in \mathbb{R}^d$ is an unknown parameter that governs the reward of all arms and $\mu$ is a strictly increasing function known as the inverse link function. Different choices for the function $\mu$ in (1) result in modeling different reward structures inside the exponential family. For example, choosing $\mu(z) = e^z$ and $\mu(z) = 1/(1 + e^{-z})$ correspond to a Poisson regression model and a logistic regression model, respectively.
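To make the reward model concrete, the sketch below evaluates mean rewards under the two inverse link functions just mentioned. The feature vectors and parameter value are illustrative assumptions of ours, not taken from the paper (the parameter would be unknown in practice).

```python
import numpy as np

def mean_reward(x, theta, inv_link):
    """Expected reward of an arm with feature vector x under parameter theta,
    per the generalized linear model (1)."""
    return inv_link(x @ theta)

# Two standard inverse link functions mentioned above.
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))   # binary rewards (logistic regression)
poisson = lambda z: np.exp(z)                   # count rewards (Poisson regression)

# Illustrative problem instance.
theta = np.array([0.5, -0.25])
arms = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

means = [mean_reward(x, theta, logistic) for x in arms]
best_arm = int(np.argmax(means))   # arm 0: logistic(0.5) is the largest mean here
```

Because $\mu$ is strictly increasing, the ranking of arms by mean reward coincides with their ranking by linear predictor; the link only matters for the scale of the rewards and the gaps between them.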
The decision maker chooses an arm $a_t \in \mathcal{A}$ to play in each round $t$. If arm $a_t$ is played in round $t$, a stochastic reward $y_t$ is observed, which satisfies $\mathbb{E}[y_t \mid a_t] = \mu(x_{a_t}^\top \theta^*)$.
Let $a^* = \arg\max_{a \in \mathcal{A}} \mu(x_a^\top \theta^*)$ be the optimal arm; i.e., the arm with the highest expected reward. By exploring different arms, the decision maker is trying to find the optimal arm as soon as possible based on the noisy observations. Let $\tau$ be a stopping time that dictates whether enough evidence has been gathered to declare the optimal arm. The declared optimal arm is denoted by $\hat{a}_\tau$. An exploration strategy can be represented by $(\pi, \tau, \hat{a}_\tau)$, where, at any time $t$, $\pi_t$ is a function mapping from the previous observations to the arm to be played next, and $\tau$ determines whether enough information has been gathered to declare the optimal arm, $\hat{a}_\tau$. Because finding the exact optimal arm may require a prohibitively large amount of exploration, the performance of an exploration strategy is evaluated via the following relaxed criterion.
Definition 1
Given $\epsilon > 0$ and $\delta \in (0, 1)$, an exploration strategy is said to be $(\epsilon, \delta)$-optimal if
(2) $\mathbb{P}\left( \mu(x_{\hat{a}_\tau}^\top \theta^*) \geq \mu(x_{a^*}^\top \theta^*) - \epsilon \right) \geq 1 - \delta.$
In this definition, $\epsilon$ denotes an acceptable region around the optimal arm and $1 - \delta$ represents the confidence in identifying an arm within this region. This criterion relaxes the notion of optimality by allowing the exploration strategy to return a sufficiently good, but not necessarily optimal, arm. With this definition in place, the decision maker's goal is to design an $(\epsilon, \delta)$-optimal exploration strategy with the smallest possible stopping time.
Before proceeding to the algorithm, we introduce additional notation and state a set of regularity assumptions. We let $\mathcal{X} = \{x_a : a \in \mathcal{A}\}$ denote the set of feature vectors and assume that the feature vectors, the unknown reward parameter and the rewards are bounded; i.e., there exist constants $M, B, R > 0$ such that $\|x_a\|_2 \leq M$, $\|\theta^*\|_2 \leq B$, and $|y_t| \leq R$ almost surely for all $a \in \mathcal{A}$ and $t$. We also assume that $\mu$ is continuously differentiable, Lipschitz continuous with constant $k_\mu$ and satisfies $c_\mu := \inf_{\|x\|_2 \leq M,\, \|\theta\|_2 \leq B} \mu'(x^\top \theta) > 0$. For example, in the case of logistic regression, $k_\mu = 1/4$ and $c_\mu$ depends on $M$ and $B$. We define the gap between any two arms $a, b \in \mathcal{A}$ to be $\Delta(a, b) = \mu(x_a^\top \theta^*) - \mu(x_b^\top \theta^*)$ and define the optimal gap associated to an arm $a$ as
(3) $\Delta_a = \begin{cases} \mu(x_{a^*}^\top \theta^*) - \mu(x_a^\top \theta^*), & a \neq a^*, \\ \min_{b \neq a^*} \left[ \mu(x_{a^*}^\top \theta^*) - \mu(x_b^\top \theta^*) \right], & a = a^*. \end{cases}$
Finally, for any positive semidefinite matrix $A$ and vector $x$, we let $\|x\|_A = \sqrt{x^\top A x}$.
In this section, we propose an exploration strategy for the problem formulated in Section 2. Following Xu et al. (2018), our algorithm consists of the following steps:

1. Build confidence sets for the pairwise gaps between arms,
2. Identify the potential best arm and an alternative arm that has the most ambiguous gap with the best arm,
3. Play an arm to reduce this ambiguity.
These steps are repeated sequentially until the ambiguity in step 2 drops below a certain threshold.
The confidence sets are derived in Subsection 3.1 and the algorithm is presented in Subsection 3.2.
To build the confidence sets for reward gaps, we follow ideas in Filippi et al. (2010) but develop confidence sets directly for gaps instead of arm rewards.
Let $\mathcal{F}_t$ be the history of actions played and random rewards observed prior to period $t$, and let $x_s$ be shorthand notation for the feature vector associated with the arm played in period $s$. For any $t$, let $A_t = \sum_{s=1}^{t-1} x_s x_s^\top$ be the empirical covariance matrix and assume it is nonsingular for any $t > \tau_0$ for some fixed value $\tau_0$. We let $\lambda_t$ be the minimum eigenvalue of $A_t$. Given the observations by the start of period $t$, the maximum likelihood (ML) estimate of the reward parameter, $\hat{\theta}_t$, solves the equation
(4) $\sum_{s=1}^{t-1} \left( y_s - \mu(x_s^\top \theta) \right) x_s = 0.$
Based on the estimated reward parameter, we can take
(5) $\hat{\Delta}_t(a, b) = \mu(x_a^\top \hat{\theta}_t) - \mu(x_b^\top \hat{\theta}_t)$
as an estimate for the gap between arms $a$ and $b$, which is a function of the observations made prior to period $t$.
Given $\delta$, let
(6) 
where $\alpha$ is a tunable parameter and $\beta_t$ is a time-varying quantity that scales the width of the confidence sets for all pairs of arms. We set
(7) 
in the theoretical analysis in Section 4, and tune it so as to achieve robust performance across a variety of scenarios in the computational study in Section 5. We consider the following confidence set for the gap between any two arms based on the observations made prior to period $t$:
(8) 
The confidence set defined in (8) is centered around the estimated gap between the two arms in (5) and, as we will show in Section 4, contains the true gap with high probability. To further simplify the representation, we let
(9) 
represent the width of the confidence set in (8).
Although we follow the basic approach in Filippi et al. (2010) in deriving these confidence sets, it is worth noting that they cannot be deduced from the results presented in that paper. More precisely, an immediate use of the results in Filippi et al. (2010) gives rise to a very loose confidence set that is similar to (8) except with a substantially looser width factor.
With the confidence sets established, we are now ready to describe our proposed algorithm. Details and successive steps of the proposed algorithm are presented in Algorithms 1-3. Following the gap-based exploration scheme, the proposed algorithm consists of two major components that are described below.
 Selecting a Gap to Explore:

The algorithm starts by playing random arms such that the empirical covariance matrix is nonsingular. At any subsequent period $t$, the algorithm first finds the empirically best arm $\hat{a}_t$. Then, to check whether this arm is within distance $\epsilon$ of the true optimal arm, the algorithm takes a pessimistic approach. In particular, it finds another arm $b_t$ that is the most advantageous over arm $\hat{a}_t$ within the gap confidence sets. Note that this pessimistic gap consists of two components: the estimated gap and the uncertainty in the gap.
If this pessimistic gap is less than $\epsilon$, the algorithm stops and declares the empirically best arm as the optimal arm. Otherwise, it selects an arm to reduce the uncertainty component in the identified gap. According to (9), this uncertainty is governed by the weighted norm $\|y\|_{A_t^{-1}}$, where $y = x_{\hat{a}_t} - x_{b_t}$.
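The stop-or-continue decision just described can be sketched as follows. The inputs are hypothetical stand-ins of our own: `est_reward[a]` plays the role of the estimated mean reward of arm `a`, and `width[b][a]` the confidence width (9) for the pair `(b, a)`; in the algorithm both would come from the fitted GLM.

```python
import numpy as np

def stopping_check(est_reward, width, epsilon):
    """Gap-based stopping rule: find the empirically best arm, then the most
    advantageous challenger within the confidence sets; stop if the pessimistic
    gap (estimated gap plus remaining uncertainty) falls below epsilon."""
    best = int(np.argmax(est_reward))
    pessimistic = {
        b: est_reward[b] - est_reward[best] + width[b][best]
        for b in range(len(est_reward)) if b != best
    }
    challenger = max(pessimistic, key=pessimistic.get)
    stop = pessimistic[challenger] <= epsilon
    return stop, best, challenger, pessimistic[challenger]

# Tight confidence sets: the rule stops and declares arm 0.
stop, best, ch, gap = stopping_check(
    [0.60, 0.50, 0.20],
    [[0.00, 0.05, 0.05], [0.05, 0.00, 0.05], [0.05, 0.05, 0.00]],
    epsilon=0.10,
)
```

With wider confidence sets (e.g., all off-diagonal widths equal to 0.3), the same inputs yield a pessimistic gap of 0.2 and the rule continues exploring instead.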
 Selecting an Arm:

While there are different ways to reduce $\|y\|_{A_t^{-1}}$, we follow the approach in Xu et al. (2018). With a slight abuse of notation, let us define $A(x_1, \ldots, x_n) = \sum_{s=1}^{n} x_s x_s^\top$ for any sequence of feature vectors $x_1, \ldots, x_n$, where $n$ represents a generic number of periods. Let $x_1^*, \ldots, x_n^*$ be the sequence of feature vectors that would have minimized the uncertainty in the direction of $y$. For each arm $a$, let $p_a^*(y)$ denote the relative frequency of $x_a$ appearing in this sequence as $n \to \infty$. As has been shown in Section 5.1 of Xu et al. (2018),
(10) where $w^*(y)$ is the solution of the linear program
(11) To minimize the uncertainty in the direction of , the algorithm plays the arm
(12) where $T_a(t)$ is the number of times arm $a$ has been played prior to period $t$.
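The linear program in (11) seeks, per Xu et al. (2018), an l1-cheapest way to express the direction of interest as a combination of arm features; normalizing the magnitudes of the optimal coefficients gives the target play frequencies. The sketch below solves that LP exactly by enumerating basic solutions, which is valid because some LP optimum is attained at a basic solution (at most $d$ nonzeros), though practical only for small instances. The tracking rule mimics the spirit of (12) by playing the arm whose play count most lags its target frequency; details of the paper's exact rule may differ.

```python
import itertools
import numpy as np

def l1_design(X, y):
    """Solve min ||w||_1 s.t. sum_a w_a x_a = y (the LP behind (11)) by
    enumerating basic solutions. X: (d, K) matrix whose columns are arm features."""
    d, K = X.shape
    best_w, best_norm = None, np.inf
    for cols in itertools.combinations(range(K), d):
        sub = X[:, cols]
        if abs(np.linalg.det(sub)) < 1e-12:
            continue                      # degenerate basis, skip
        w = np.zeros(K)
        w[list(cols)] = np.linalg.solve(sub, y)
        if np.abs(w).sum() < best_norm - 1e-12:
            best_norm, best_w = np.abs(w).sum(), w
    p = np.abs(best_w) / best_norm        # target frequencies p*(y)
    return best_w, p

def select_arm(p, plays):
    """Play the arm (among those with p_a > 0) whose count lags its target most."""
    scores = [plays[a] / p[a] if p[a] > 0 else np.inf for a in range(len(p))]
    return int(np.argmin(scores))

X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])           # columns: features of 3 arms
y = X[:, 0] - X[:, 1]                     # direction of the ambiguous gap
w_star, p_star = l1_design(X, y)          # w* = (1, -1, 0), p* = (1/2, 1/2, 0)
```

Here the cheapest representation of $y = x_1 - x_2$ simply reuses the two arms defining the gap, so the third arm is never prescribed; in general, a third arm's feature vector can help shrink $\|y\|_{A_t^{-1}}$ and would receive positive frequency.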
In this section, we provide theoretical guarantees for the performance of the proposed algorithm. We prove that the algorithm indeed finds an $(\epsilon, \delta)$-optimal arm in Subsection 4.1 and provide an upper bound on the stopping time of the algorithm in Subsection 4.2.
We start by proving that the confidence sets constructed in Subsection 3.1 hold with high probability at all times.
Proposition 1
Fix and such that and , where is the feature dimension. Then the following holds with probability at least :
(13) 
where
(14) 
The proof is a slight modification of the proof of Proposition 1 in Filippi et al. (2010).
Let $g_t(\theta) = \sum_{s=1}^{t-1} \mu(x_s^\top \theta)\, x_s$. According to (4), the ML estimate satisfies $g_t(\hat{\theta}_t) = \sum_{s=1}^{t-1} y_s x_s$. By the mean value theorem, there exist points $\bar{\theta}_i$ lying between $\theta^*$ and $\hat{\theta}_t$ such that
(15) 
On the other hand, by the fundamental theorem of calculus, we have
(16) 
where
The definition of implies that for any ,
It follows that . Hence, and are positive definite and nonsingular. Therefore, from (15) and (16), it follows that
(17) 
The inequality implies that , and hence for arbitrary . Thus, by (17) we get
(18) 
As has been shown in the proof of Proposition 1 in Filippi et al. (2010),
(19) 
holds with probability at least . Combining (18) and (19) shows that
(20) 
holds with probability at least . Finally, because , taking the maximum over on the right side of (20) completes the proof. □
The following theorem, which is a direct consequence of Proposition 1, shows that the confidence sets in (8) contain the true gaps at all times with high probability. We define event $\mathcal{E}$ to be
(21) 
and throughout this section we set $\beta_t$ according to (7).
Theorem 1
Let $\delta$ and $\alpha$ satisfy the conditions of Proposition 1. Then event $\mathcal{E}$ occurs with probability at least $1 - \delta$.
For any , let and define the event as
Applying Proposition 1 for implies that . The union bound then gives
The proof is completed by noting that . □
With the confidence sets established, the next theorem proves $(\epsilon, \delta)$-optimality of the proposed algorithm.
Theorem 2
Let $\epsilon > 0$ and $\delta \in (0, 1)$ be arbitrary. Then the proposed algorithm is $(\epsilon, \delta)$-optimal; i.e., at its stopping time, the algorithm returns an arm $\hat{a}_\tau$ such that
Let $\tau$ be the stopping time of the algorithm and let $\hat{a}_\tau$ be the returned arm. Suppose that $\mu(x_{\hat{a}_\tau}^\top \theta^*) < \mu(x_{a^*}^\top \theta^*) - \epsilon$. Then, according to line 5 of Algorithm 1, we have
From this, we get that
which means that event $\mathcal{E}$ does not occur. According to Theorem 1, this can happen with probability at most $\delta$. Thus, $\mu(x_{\hat{a}_\tau}^\top \theta^*) < \mu(x_{a^*}^\top \theta^*) - \epsilon$ happens with probability at most $\delta$. □
In this subsection, we study the sample complexity of the proposed algorithm. The following theorem provides an upper bound on the number of experiments the proposed algorithm needs to carry out before identifying the optimal arm. Before stating the result, let us take $\Delta_{\min} = \min_{a \in \mathcal{A}} \Delta_a$ to be the smallest gap and, for any $\epsilon > 0$, define
(22) 
which represents the complexity of the exploration problem in terms of the problem parameters.
Theorem 3
Let $\tau$ be the stopping time of the proposed algorithm. Then
(23) 
is satisfied with probability at least $1 - \delta$. In asymptotic notation, (23) can be expressed as
(24) 
Theorem 3 provides an upper bound on the stopping time of the proposed algorithm in terms of the parameters of the exploration problem. As expected, the number of experiments required by the proposed algorithm before declaring a near-optimal arm decreases in the reward tolerance ($\epsilon$) and the error probability ($\delta$), and increases in the number of features ($d$) and arms ($K$). In terms of the dependence on the dimension ($d$), the reward bound and the complexity parameter, the sample complexity in (24) is similar to that derived in Xu et al. (2018) for linear bandits. The main difference is the appearance of the factor involving $c_\mu$ in the complexity bound (24), which encodes the difficulty of learning the inverse link function $\mu$. As the inverse link function becomes flatter on the boundaries of the input domain (i.e., has smaller $c_\mu$), more samples are required to distinguish between different pairs of arms.
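The role of $c_\mu$ can be made concrete for the logistic link (an illustration we add here, not a computation from the paper): the minimum slope of $\mu$ over the feasible range of linear predictors, $|z| \le MB$, shrinks as that range widens, which inflates the sample-complexity bound.

```python
import numpy as np

def c_mu_logistic(mb):
    """Minimum slope of the logistic inverse link over |z| <= mb. The slope
    mu'(z) = mu(z)(1 - mu(z)) is symmetric and minimized at the endpoints."""
    s = 1.0 / (1.0 + np.exp(-mb))
    return s * (1.0 - s)

# Wider predictor ranges (larger M*B) give flatter boundary behavior,
# hence smaller c_mu and a larger sample-complexity bound.
slopes = [c_mu_logistic(mb) for mb in (1.0, 2.0, 4.0)]
```

For instance, `c_mu_logistic(0.0)` equals the Lipschitz constant $k_\mu = 1/4$, while the values in `slopes` decrease monotonically toward zero as the domain bound grows.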
The remainder of this subsection is devoted to proving Theorem 3, which requires a set of preliminaries. We start by introducing some additional notation. For any and any , let
and define . Note that with this notation, defined in Algorithm 2 can be represented as . For any , let be the optimal value of the linear program in (11); i.e.,
(25) 
For any two real numbers , we use the shorthand notation .
Our proof for Theorem 3 relies on a number of results from the literature, which we state here for completeness. The following two lemmas are proved in Xu et al. (2018).
Lemma 1 (Lemma 1 of Xu et al. (2018))
For any , we have
where
Lemma 2 (Lemma 4 of Xu et al. (2018))
When event holds, satisfies the following bounds:

If either or is the best arm:

If neither nor is the best arm:
The following lemma is proved in Antos et al. (2010).
Lemma 3 (Proposition 6 of Antos et al. (2010))
For any , let and , for some . Define . Then, for any positive such that , we have .
The following lemma establishes an upper bound on the solution of the linear program in (11).
Lemma 4
Let and let be the solution of (11). Then we have
Let $w$ be the vector with $w_a = 1$, $w_b = -1$ and all other elements of $w$ equal to zero. Clearly, $w$ satisfies the constraint in (11). Therefore, we have
□
With the above lemmas in place, we are ready to prove the following theorem.
Theorem 4
If event $\mathcal{E}$ occurs, then the stopping time of the proposed algorithm satisfies
Suppose that event $\mathcal{E}$ occurs. Let $a$ be an arbitrary arm, and let $t_a$ be the last round in which arm $a$ was pulled. Lemma 2 then implies that
which in turn gives rise to the following three inequalities:
(26) 
Rearranging (26) yields
(27) 
Because arm $a$ was selected by Algorithm 3 in round $t_a$, it follows that
(30) 
Then, from (29), we get
(31) 
We are now in a position to prove Theorem 3.