Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem
Abstract
Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$, and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k$-th time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables, with unknown mean $\mu_i$ and unknown variance $\sigma^2_i$. The objective is to have a policy $\pi$ for deciding, at any time $n$, from which of the $N$ populations to sample, so as to maximize the expected sum of outcomes of $n$ total samples, or equivalently to minimize the regret due to lack of information about the parameters $\{\mu_i\}$ and $\{\sigma^2_i\}$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from [Burnetas and Katehakis(1996b)]. Additionally, finite horizon regret bounds are given.
Wesley Cowan and Michael N. Katehakis
Keywords: Inflated Sample Means, Multi-armed Bandits, Sequential Allocation
1 Introduction and Summary
Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations or `bandits', where the measurements from population $i$ are specified by a sequence of i.i.d. random variables $\{X^i_k\}_{k \geq 1}$, taken to be normal with finite mean $\mu_i$ and finite variance $\sigma^2_i$. The means and variances are taken to be unknown to the controller. It is convenient to define the maximum mean, $\mu^* = \max_i \{\mu_i\}$, and the bandit discrepancies $\Delta_i = \mu^* - \mu_i \geq 0$. It is additionally convenient to define $\sigma^2_*$ as the minimal variance of any bandit that achieves $\mu^*$, that is $\sigma^2_* = \min_i \{\sigma^2_i : \mu_i = \mu^*\}$.
In this paper, given $k$ samples from population $i$ we will take the estimators $\bar{X}^i_k = \sum_{t=1}^{k} X^i_t / k$ and $S^2_i(k) = \sum_{t=1}^{k} (X^i_t - \bar{X}^i_k)^2 / k$ for $\mu_i$ and $\sigma^2_i$ respectively. Note that the use of the biased estimator for the variance, with the factor $1/k$ in place of $1/(k-1)$, is largely for aesthetic purposes; the results presented here adapt to the use of the unbiased estimator as well.
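As a concrete illustration of the two estimators (a hypothetical sketch; the function names `sample_mean`, `biased_var`, and `unbiased_var` are ours, not from the paper):

```python
# Sketch: biased vs. unbiased sample-variance estimators for one bandit.

def sample_mean(xs):
    """Sample mean of the observations xs."""
    return sum(xs) / len(xs)

def biased_var(xs):
    # Divides by k, as in the paper's estimator S^2_i(k).
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def unbiased_var(xs):
    # Divides by k - 1; differs from biased_var by the factor k/(k-1).
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
```

The two differ only by the deterministic factor $k/(k-1)$, which is why results stated for one transfer to the other.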
For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^\pi_i(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \ldots, n$ under policy $\pi$; we take, as a convenience, $T^\pi_i(0) = 0$ for all $i$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function
(1) $V_\pi(n) = \mathbb{E}\left[ \sum_{t=1}^{n} X^{\pi(t)}_{T^\pi_{\pi(t)}(t)} \right] = \sum_{i=1}^{N} \mu_i \, \mathbb{E}\left[ T^\pi_i(n) \right],$
where for simplicity the dependence of $V_\pi(n)$ on the true, unknown, values of the parameters $\{\mu_i\}$ and $\{\sigma^2_i\}$ is suppressed. The pseudo-regret, or simply regret, of a policy is taken to be the expected loss due to ignorance of the parameters $\{\mu_i\}$ and $\{\sigma^2_i\}$ by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^*$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as
(2) $R_\pi(n) = n \mu^* - V_\pi(n) = \sum_{i=1}^{N} \Delta_i \, \mathbb{E}\left[ T^\pi_i(n) \right].$
It follows from Eqs. (1) and (2) that maximization of $V_\pi(n)$ with respect to $\pi$ is equivalent to minimization of $R_\pi(n)$. This type of loss due to ignorance of the means (regret) was first introduced in the context of an $N = 2$ problem by [Robbins(1952)] as the `loss per trial' (for which $L_\pi(n) = R_\pi(n)/n$), constructing a modified (along two sparse sequences) `play the winner' policy $\pi_R$ such that $L_{\pi_R}(n) \to 0$ (a.s.), using for his derivation only the assumption of the Strong Law of Large Numbers. Following [Burnetas and Katehakis(1996b)], when $R_\pi(n) = o(n)$, i.e., if $\pi$ is such that its loss per trial vanishes, we say policy $\pi$ is uniformly convergent (UC) (since then $V_\pi(n)/n \to \mu^*$). However, if under a policy $\pi$, $R_\pi(n)$ grew at a slower pace, such as $O(\sqrt{n})$, or better $O(\ln n)$, etc., then the controller would be assured that $\pi$ is making an effective tradeoff between exploration and exploitation. It turns out that it is possible to construct `uniformly fast convergent' (UFC) policies, also known as consistent or strongly consistent, defined as the policies $\pi$ for which $R_\pi(n) = o(n^\alpha)$ for every $\alpha > 0$, for all values of the parameters.
The existence of UFC policies in the case considered here is well established; e.g., [Auer et al.(2002)Auer, Cesa-Bianchi, and Fischer] (Fig. 4 therein) presented the following UFC policy:
Policy (UCB1-NORMAL). At each $n$:
- Sample from any bandit $i$ for which $T_i(n) < \lceil 8 \ln n \rceil$.
- If $T_i(n) \geq \lceil 8 \ln n \rceil$ for all $i$, sample from the bandit $i$ with maximal index
(3) $\bar{X}^i_{T_i(n)} + \sqrt{16 \, \hat{S}^2_i(T_i(n)) \ln(n-1) / T_i(n)}$ (taking, in this case, $\hat{S}^2_i$ as the unbiased estimator).
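The two steps of this policy can be sketched as follows (a minimal illustration assuming the index form with Auer et al.'s stated constants; the function and variable names are ours):

```python
import math

def ucb1_normal_choose(n, counts, means, unbiased_vars):
    """Pick a bandit index at round n (n >= 2), UCB1-NORMAL style.

    counts[i], means[i], unbiased_vars[i] hold the sample count T_i(n),
    the sample mean, and the unbiased sample variance of bandit i.
    """
    # Forced exploration: any bandit sampled fewer than ceil(8 ln n) times.
    for i, t in enumerate(counts):
        if t < math.ceil(8 * math.log(n)):
            return i
    # Otherwise, maximize the inflated sample mean index.
    def index(i):
        t = counts[i]
        return means[i] + math.sqrt(16 * unbiased_vars[i] * math.log(n - 1) / t)
    return max(range(len(counts)), key=index)
```

Note how the variance estimate scales the exploration bonus, so a noisier bandit is explored more aggressively than a quieter one with the same sample count.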
Additionally, [Auer et al.(2002)Auer, Cesa-Bianchi, and Fischer] (in Theorem 4 therein) gave the following bound:
(4) $R_{\mathrm{UCB1}}(n) \leq 256 \ln n \sum_{i : \Delta_i > 0} \frac{\sigma^2_i}{\Delta_i} + C_1 + C_2 \ln n,$
with
(5) $C_1 = \left(1 + \frac{\pi^2}{2}\right) \sum_{i=1}^{N} \Delta_i,$
(6) $C_2 = 8 \sum_{i=1}^{N} \Delta_i.$
Ineq. (4) readily implies that the regret of UCB1-NORMAL is $O(\ln n)$. Thus, since $\ln n = o(n^\alpha)$ for all $\alpha > 0$ and all parameter values, it follows that UCB1-NORMAL is uniformly fast convergent.
Given that UFC policies exist, the question immediately follows: just how fast can they be? The primary motivation of this paper is the following general result from [Burnetas and Katehakis(1996b)], where they showed that for any UFC policy $\pi$, the following holds:
(7) $\liminf_{n \to \infty} \frac{R_\pi(n)}{\ln n} \geq M(\{\mu_i\}, \{\sigma^2_i\}),$
where the bound itself is determined by the specific distributions of the populations, in this case
(8) $M(\{\mu_i\}, \{\sigma^2_i\}) = \sum_{i : \Delta_i > 0} \frac{2 \Delta_i}{\ln\left(1 + \Delta_i^2 / \sigma_i^2\right)}.$
For comparison, depending on the specifics of the bandit distributions, there is a considerable distance between the logarithmic term of the upper bound of Eq. (4) and the lower bound implied by Eq. (8).
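To make this gap concrete, the following sketch compares the per-bandit logarithmic coefficients, $256\,\sigma^2/\Delta$ from the upper bound versus $2\Delta/\ln(1 + \Delta^2/\sigma^2)$ from the lower bound, at hypothetical parameter values of our choosing:

```python
import math

def ucb1_normal_coeff(delta, sigma2):
    # Coefficient of ln n in the Auer et al. upper bound, per suboptimal bandit.
    return 256 * sigma2 / delta

def lower_bound_coeff(delta, sigma2):
    # Coefficient of ln n in the Burnetas-Katehakis lower bound.
    return 2 * delta / math.log(1 + delta ** 2 / sigma2)

# Example: Delta = 0.5, sigma^2 = 1.
# Upper-bound coefficient: 512; lower-bound coefficient: roughly 4.5.
ratio = ucb1_normal_coeff(0.5, 1.0) / lower_bound_coeff(0.5, 1.0)
```

For small $\Delta/\sigma$ the lower-bound coefficient behaves like $2\sigma^2/\Delta$, so the two differ by roughly a factor of 128 in that regime.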
The derivation of Ineq. (7) implies that in order to guarantee that a policy is uniformly fast convergent, suboptimal populations must be sampled at least a logarithmic number of times. The above bound is a special case of a more general result derived in [Burnetas and Katehakis(1996b)] (part 1 of Theorem 1 therein) for distributions with multiple unknown parameters (such as the current problem of normal populations with both the mean and the variance unknown): for each suboptimal bandit $i$,
$\liminf_{n \to \infty} \frac{\mathbb{E}[T^\pi_i(n)]}{\ln n} \geq \frac{1}{\mathbf{K}_i},$
with
$\mathbf{K}_i = \inf \left\{ I(\theta_i, \theta) : \mu(\theta) > \mu^* \right\},$
where $I(\cdot, \cdot)$ denotes the Kullback-Leibler divergence and $\theta_i$ the parameters of bandit $i$.
Previously, [Lai and Robbins(1985)] had obtained such lower bounds for distributions with a single unknown parameter (such as in the current problem of normal populations with unknown mean but known variance). Allocation policies that achieved the lower bounds were called asymptotically efficient, or optimal, in [Lai and Robbins(1985)].
Ineq. (7) motivates defining a uniformly fast convergent policy as having a uniformly maximal convergence rate (UM), or simply being asymptotically optimal within the class of uniformly fast convergent policies, if equality holds in Ineq. (7), since then no UFC policy can have asymptotically smaller regret.
[Burnetas and Katehakis(1996b)] proposed the following index policy as one that could achieve this lower bound:
Policy (UCB-NORMAL).
- For $n \leq 2N$, sample each bandit twice, and
- for $n > 2N$, sample from the bandit $i$ with maximal index
(9) $\bar{X}^i_{T_i(n)} + S_i(T_i(n)) \sqrt{n^{2/T_i(n)} - 1}.$
[Burnetas and Katehakis(1996b)] were not able to establish the asymptotic optimality of the policy, because they were not able to establish a sufficient condition (Condition A3 therein), which we express here as the following equivalent conjecture (the referenced open problem in the subtitle). Conjecture 1. For each $i$, for every $\epsilon > 0$, and for the relevant range of the remaining parameters, the following is true:
(10) 
We show that the above conjecture is false (cf. Proposition A in the Appendix). This does not imply that the policy fails to be UM (i.e., to be asymptotically optimal), but this failure means that the techniques established in [Burnetas and Katehakis(1996b)] are insufficient to verify its optimality. All is not lost, however. One of the central results of this paper is to establish that, with a small change, the policy may be modified into one that is provably asymptotically optimal. We introduce in this paper the policy defined in the following way:
Policy (UCB-NORMAL*).
- For $n \leq 3N$, sample each bandit three times, and
- for $n > 3N$, sample from the bandit $i$ with maximal index
(11) $\bar{X}^i_{T_i(n)} + S_i(T_i(n)) \sqrt{n^{2/(T_i(n) - 2)} - 1}.$
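The modified policy can be sketched as follows (our illustrative code, using the biased variance estimator $S^2_i$ as in the paper; note the $T_i(n) - 2$ in the exponent, which is why three initial samples per bandit are required):

```python
import math

def ism_index(n, t, mean, biased_var):
    """Inflated-sample-mean index for a bandit sampled t >= 3 times at round n."""
    return mean + math.sqrt(biased_var) * math.sqrt(n ** (2.0 / (t - 2)) - 1)

def choose_bandit(n, counts, means, biased_vars):
    # Initialization: sample each bandit until it has three observations.
    for i, t in enumerate(counts):
        if t < 3:
            return i
    # Afterwards, sample the bandit with the largest ISM index.
    return max(range(len(counts)),
               key=lambda i: ism_index(n, counts[i], means[i], biased_vars[i]))
```

With $t - 2$ rather than $t$ in the exponent, the exploration term is inflated slightly for small sample counts, which is what makes the index amenable to the proof techniques of this paper.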
Remark 1
1) Note that the new policy is only a slight modification of the policy of [Burnetas and Katehakis(1996b)]; the only difference between their indices is in the power on $n$ under the radical, i.e., in replacing $T_i(n)$ in Eq. (9) by $T_i(n) - 2$ in Eq. (11). This change, while seemingly asymptotically negligible (as in practice $T_i(n) \to \infty$ (a.s.) as $n \to \infty$), has a profound effect on what is provable about the new policy.
2) We note that the indices of the policy of Eq. (9) are a significant modification of those of the optimal allocation policy for the case of normal bandits with known variances, cf. [Burnetas and Katehakis(1996b)] and [Katehakis and Robbins(1995)], which are:
$\bar{X}^i_{T_i(n)} + \sigma_i \sqrt{2 \ln n / T_i(n)},$
the difference being replacing the term $\sigma_i \sqrt{2 \ln n / T_i(n)}$ in the latter by $S_i(T_i(n)) \sqrt{n^{2/T_i(n)} - 1}$ in Eq. (9). However, the indices of the new policy are a minor modification of those of Eq. (9), the difference being replacing the term $T_i(n)$ in Eq. (9) by $T_i(n) - 2$ in Eq. (11).
3) The known-variance and unknown-variance policies can be seen as connected in the following way, however, observing that $\sqrt{2 \ln n / T_i(n)}$ is a first-order approximation of $\sqrt{n^{2/T_i(n)} - 1}$.
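This first-order connection can be made explicit (our expansion, writing $k$ for $T_i(n)$ and using $n^{2/k} = e^{(2 \ln n)/k}$):

```latex
n^{2/k} - 1 \;=\; e^{(2\ln n)/k} - 1
\;=\; \frac{2\ln n}{k} + \frac{1}{2}\left(\frac{2\ln n}{k}\right)^{2} + \cdots,
\qquad\text{so}\qquad
\sqrt{n^{2/k} - 1} \;=\; \sqrt{\frac{2\ln n}{k}}\,\bigl(1 + O(\ln n / k)\bigr).
```

Thus for $k$ large relative to $\ln n$ the two exploration terms agree to first order, while for small $k$ the exact form $\sqrt{n^{2/k} - 1}$ is strictly larger.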
Following [Robbins(1952)], and additionally [Gittins(1979)], [Lai and Robbins(1985)] and [Weber(1992)] there is a large literature on versions of this problem, cf. \citetburnetas2003asymptotic, \citetburnetas1997finite and references therein. For recent work in this area we refer to [Audibert et al.(2009)Audibert, Munos, and Szepesvári], [Auer and Ortner(2010)], [Gittins et al.(2011)Gittins, Glazebrook, and Weber], [Bubeck and Slivkins(2012)], [Cappé et al.(2013)Cappé, Garivier, Maillard, Munos, and Stoltz], [Kaufmann(2015)], [Li et al.(2014)Li, Munos, and Szepesvari], [Cowan and Katehakis(2015b)], [Cowan and Katehakis(2015c)], and references therein. For more general dynamic programming extensions we refer to [Burnetas and Katehakis(1997a)], [Butenko et al.(2003)Butenko, Pardalos, and Murphey], [Tewari and Bartlett(2008)], [Audibert et al.(2009)Audibert, Munos, and Szepesvári], [Littman(2012)], [Feinberg et al.(2014)Feinberg, Kasyanov, and Zgurovsky] and references therein. Other related work in this area includes: [Burnetas and Katehakis(1993)], [Burnetas and Katehakis(1996a)], [Lagoudakis and Parr(2003)], [Bartlett and Tewari(2009)], [Tekin and Liu(2012)], [Jouini et al.(2009)Jouini, Ernst, Moy, and Palicot], [Dayanik et al.(2013)Dayanik, Powell, and Yamazaki], [Filippi et al.(2010)Filippi, Cappé, and Garivier], [Osband and Van Roy(2014)], \citetdena2013.
To our knowledge, outside the work in [Lai and Robbins(1985)], \citetbkmab96 and [Burnetas and Katehakis(1997a)], asymptotically optimal policies have only been developed in \citethonda2011asymptotically, and in \citethonda2010 for the problem of finite known support, where optimal policies, cyclic and randomized, that are simpler to implement than those considered in \citetbkmab96 were constructed. Recently in \citetck2015u, an asymptotically optimal policy for uniform bandits of unknown support was constructed. The question of whether asymptotically optimal policies exist in the case discussed herein, of normal bandits with unknown means and unknown variances, was recently resolved in the positive by [Honda and Takemura(2013)], who demonstrated that a form of Thompson sampling with certain priors achieves the asymptotic lower bound.
The structure of the rest of the paper is as follows. In Section 2, Theorem 2 establishes a finite horizon bound on the regret of the proposed policy. From this bound, it follows that the policy is asymptotically optimal, and we provide a bound on the remainder term. Additionally, in Section 3, the Thompson sampling policy of [Honda and Takemura(2013)] and the proposed policy are compared and discussed, as both achieve asymptotic optimality.
2 The Optimality Theorem and Finite Time Bounds
The main results of this paper, namely that Conjecture 1 is false (cf. Proposition A in the Appendix), the asymptotic optimality of the proposed policy, and the bounds on its regret, all depend on the following probability bounds; we note that tighter bounds seem possible, but these are sufficient for this paper.
Let $Z$ and $\chi^2_k$ be independent random variables, $Z$ a standard normal, and $\chi^2_k$ a chi-squared random variable with $k$ degrees of freedom.
For , the following holds for all :
(12) 
Proof [of Proposition 2]. The proof is given in the Appendix.
For the policy defined above, the following bounds hold for all $n$ and all suboptimal bandits $i$:
(13) 
Before giving the proof of this bound, we present two results: the first demonstrating the asymptotic optimality of the proposed policy, the second giving an $\epsilon$-free version of the above bound, which yields a bound on the sub-logarithmic remainder term. It is worth noting the following. The bounds of Theorem 2 can actually be improved, through the use of a modified version of Proposition 2, to eliminate part of the dependence on $\epsilon$, so that the only dependence on $\epsilon$ is through the initial term. The cost of this, however, is a dependence on a larger power of $\ln n$. The particular form of the bound given in Eq. (13) was chosen to simplify the following two results, cf. Remark 4 in the proof of Proposition 2.
The policy defined above is asymptotically optimal in the sense that
(14) $\lim_{n \to \infty} \frac{R(n)}{\ln n} = \sum_{i : \Delta_i > 0} \frac{2 \Delta_i}{\ln\left(1 + \Delta_i^2 / \sigma_i^2\right)}.$
[of Theorem 2] For any admissible $\epsilon$, we have from Theorem 2 that the following holds:
(15) 
Taking the infimum over all such $\epsilon$,
(16) 
and observing the lower bound of Eq. (7) completes the result.
For the policy defined above, the regret achieves the optimal logarithmic rate with a sub-logarithmic remainder, and more concretely
(17) 
where
(18) 
While the above bound admittedly has a more complex form than a bound such as that of Ineq. (4), it demonstrates the asymptotic optimality of the dominating term, and bounds the sublinear remainder term.
[of Theorem 2] The bound follows directly from Theorem 2, taking a particular choice of $\epsilon$ for each $n$, and observing the following bound for the relevant range of $\epsilon$:
(19) 
This inequality is proven separately as Proposition A in the Appendix.
We make no claim that the results of the above theorems are the best achievable for this policy. At several points in the proofs, choices of convenience were made in the bounding of terms, and different techniques may yield tighter bounds still. But they are sufficient to demonstrate asymptotic optimality, and give useful bounds on the growth of the regret.
[of Theorem 1] In this proof, we take the policy as defined above. For notational convenience, we define the index function
(20) $u_i(n, k) = \bar{X}^i_k + S_i(k) \sqrt{n^{2/(k-2)} - 1}.$
The structure of this proof will be to bound the expected value of $T_i(n)$ for all suboptimal bandits $i$, and use this to bound the regret via Eq. (2). The basic techniques follow those in [Katehakis and Robbins(1995)] for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 2. For any $\epsilon$ such that $0 < \epsilon < \Delta_i$, we define the following quantities. For $n \geq 3N$,
(21) 
Hence, we have the following relationship for , that
(22) 
The proof proceeds by bounding, in expectation, each of the four terms.
Observe that, by the structure of the index function ,
(23) 
The last inequality follows by observing that the count may be expressed as a sum of indicators, and seeing that the additional condition bounds the number of non-zero terms in that sum. The additional constant simply accounts for the remaining two terms. Note, this bound holds sample-pathwise.
For the second term,
(24) 
The last inequality follows as, for fixed $k$, the inner event may be true for at most one value of $t$. Recall that $k S^2_i(k) / \sigma_i^2$ has the distribution of a $\chi^2_{k-1}$ random variable. Letting $\chi^2_{k-1}$ denote such a variable, from the above we have
(25) 
The penultimate step is a Chernoff bound on the chi-squared tail terms.
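For reference, a standard Chernoff bound of this type for the chi-squared tail reads (a well-known inequality; the exact constants used in the paper's derivation may differ):

```latex
\Pr\left(\chi^2_k \geq k x\right) \;\leq\; \left(x\, e^{1-x}\right)^{k/2},
\qquad x \geq 1,
```

obtained by optimizing $\Pr(\chi^2_k \geq kx) \leq e^{-\lambda k x}\,\mathbb{E}[e^{\lambda \chi^2_k}] = e^{-\lambda k x} (1 - 2\lambda)^{-k/2}$ over $\lambda \in (0, 1/2)$, with the optimum at $\lambda = (1 - 1/x)/2$.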
To bound the third term, a similar rearrangement to Eq. (24) (using the sample mean instead of the sample variance) yields:
(26) 
Recalling that $\bar{X}^i_k = \mu_i + \sigma_i Z / \sqrt{k}$ for $Z$ a standard normal,
(27) 
The penultimate step is a Chernoff bound on the normal tail terms.
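Similarly, the standard Chernoff bound for the normal tail is (again a standard inequality, recorded here for reference):

```latex
\Pr(Z \geq t) \;\leq\; \inf_{\lambda > 0} e^{-\lambda t}\,\mathbb{E}[e^{\lambda Z}]
\;=\; \inf_{\lambda > 0} e^{-\lambda t + \lambda^2/2} \;=\; e^{-t^2/2},
\qquad t \geq 0,
```

with the infimum attained at $\lambda = t$.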
To bound the fourth term, observe that in the relevant event, from the structure of the policy it must be true that the index of the sampled bandit is at least that of every optimal bandit. Thus, if $j$ is some bandit such that $\mu_j = \mu^*$, its index provides such a bound. In particular, we take $j$ to be a bandit that not only achieves the maximal mean $\mu^*$, but also the minimal variance among optimal bandits, $\sigma^2_*$. We have the following bound,
(28) 
The last step follows as, for values in this range, the stated inequality holds. Hence
(29) 
As an aside, this is essentially the point at which the conjectured Eq. (10) would have come into play for the proof of the opt