# An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support

## Abstract

Consider the problem of a controller sampling sequentially from a finite number $N \geq 2$ of populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$, and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k$-th time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. uniform random variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i$, $b_i$) unknown to the controller. The objective is to have a policy $\pi$ for deciding, based on available data, from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information on the parameters $\{a_i\}$ and $\{b_i\}$. In this paper, we present a simple UCB-type policy that is asymptotically optimal. Additionally, finite horizon regret bounds are given.

Wesley Cowan and Michael N. Katehakis

Keywords: Inflated Sample Means, Upper Confidence Bound, Multi-armed Bandits, Sequential Allocation

## Introduction and Summary

### 1 Main Model

Let $\mathcal{F}$ be a known family of probability densities on $\mathbb{R}$, each with finite mean. We define $\mu(f)$ to be the expected value under density $f$, and $S(f)$ to be the support of $f$. Consider the problem of sequentially sampling from a finite number $N \geq 2$ of populations or ‘bandits’, where measurements from population $i$ are specified by an i.i.d. sequence of random variables $\{X^i_k\}_{k \geq 1}$ with density $f_i \in \mathcal{F}$. We take each $f_i$ as unknown to the controller. It is convenient to define, for each $i$, $\mu_i = \mu(f_i)$ and $\mu^* = \max_i \mu_i$. Additionally, we take $\Delta_i = \mu^* - \mu_i \geq 0$, the discrepancy of bandit $i$.

We note, but for simplicity will not consider explicitly, that both discrete and continuous distributions can be studied when one takes $\{X^i_k\}_{k \geq 1}$ to be i.i.d. with density $f_i$, with respect to some known measure $\nu_i$.

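For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1, \ldots, n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i, \pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function $V_\pi(n)$:

$$V_\pi(n) = \mathbb{E}\left[\sum_{i=1}^{N} \sum_{k=1}^{T^i_\pi(n)} X^i_k\right] = \sum_{i=1}^{N} \mu_i \, \mathbb{E}\left[T^i_\pi(n)\right], \tag{1}$$

where for simplicity the dependence of $V_\pi(n)$ on the unknown densities $\{f_i\}$ is suppressed. The regret of a policy is taken to be the expected loss due to ignorance of the underlying distributions by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^* = \max_i \mu_i$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as

$$R_\pi(n) = n \mu^* - V_\pi(n) = \sum_{i=1}^{N} \Delta_i \, \mathbb{E}\left[T^i_\pi(n)\right]. \tag{2}$$

We are interested in policies for which $V_\pi(n)$ grows as fast as possible with $n$, or equivalently for which $R_\pi(n)$ grows as slowly as possible with $n$.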
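As a small numerical illustration of the regret decomposition, the following sketch assumes the standard identity that the regret equals the sum, over bandits, of the discrepancy times the expected number of pulls. The function name `expected_regret` and the example numbers are hypothetical and do not come from the paper.

```python
def expected_regret(means, expected_pulls):
    """Regret as sum_i Delta_i * E[T_i(n)], where Delta_i = max(means) - means[i].

    `means` are the (unknown to the controller) bandit means and
    `expected_pulls` are the expected pull counts E[T_i(n)] under some policy.
    """
    if len(means) != len(expected_pulls):
        raise ValueError("means and expected_pulls must have the same length")
    best = max(means)
    return sum((best - m) * t for m, t in zip(means, expected_pulls))


# Hypothetical example: three bandits, 100 rounds, most pulls on the best one.
example = expected_regret([0.5, 0.7, 0.6], [10, 80, 10])
print(example)
```

Only pulls of sub-optimal bandits contribute, so a policy controls its regret exactly by controlling how often each sub-optimal bandit is sampled.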

### 2 Preliminaries - Background

We restrict $\mathcal{F}$ in the following way:

Assumption 1. Given any set of bandit densities $\{f_i\}_{i=1}^{N} \subset \mathcal{F}$, for any sub-optimal bandit $i$, i.e., $\mu_i \neq \mu^*$, there exists some $\tilde{f}_i \in \mathcal{F}$ such that $S(f_i) \subset S(\tilde{f}_i)$, and $\mu(\tilde{f}_i) > \mu^*$.

Effectively, this ensures that at any finite time, given a set of bandits under consideration, for any bandit there is a density in $\mathcal{F}$ that would both potentially explain the measurements from that bandit, and make it the unique optimal bandit of the set.

The focus of this paper is on $\mathcal{F}$ as the set of uniform densities over some unknown support.

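Let $I(f, g)$ denote the Kullback-Leibler divergence of density $g$ from $f$,

$$I(f, g) = \int_{S(f)} \ln\left(\frac{f(x)}{g(x)}\right) f(x) \, dx = \mathbb{E}_f\left[\ln\left(\frac{f(X)}{g(X)}\right)\right]. \tag{3}$$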

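It is a simple generalization of a classical result (part 1 of Theorem 1) of [Burnetas and Katehakis(1996b)] that if a policy $\pi$ is uniformly fast (UF), i.e., $R_\pi(n) = o(n^\alpha)$ for all $\alpha > 0$ and for any choice of $\{f_i\} \subset \mathcal{F}$, then the following bound holds:

$$\liminf_{n \to \infty} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\{f_i\}), \tag{4}$$

where the bound itself is determined by the specific distributions of the populations:

$$M_{BK}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\inf_{g \in \mathcal{F}} \{ I(f_i, g) : \mu(g) > \mu^* \}}. \tag{5}$$

For a given set of densities $\mathcal{F}$, it is of interest to construct policies $\pi$ such that

$$\lim_{n \to \infty} \frac{R_\pi(n)}{\ln n} = M_{BK}(\{f_i\}) \quad \text{for every choice of } \{f_i\} \subset \mathcal{F}.$$

Such policies achieve the slowest (maximum) regret (value) growth rate possible among UF policies. They have been called UM or asymptotically optimal or efficient, cf. [Burnetas and Katehakis(1996b)].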

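For a given $f_i \in \mathcal{F}$, let $\hat{f}^i_t$ be an estimator of $f_i$ based on the first $t$ samples from $f_i$. It was shown in [Burnetas and Katehakis(1996b)] that under sufficient conditions on $\{\hat{f}^i_t\}$, asymptotically optimal (UM) UCB-policies could be constructed by initially sampling each bandit some number $n_0$ of times, and then for $n \geq n_0 N$, following an index policy:

$$\pi^0(n + 1) = \arg\max_i \left\{ u_i\left(n, T^i_{\pi^0}(n)\right) \right\}, \tag{6}$$

where the indices $u_i(n, t)$ are ‘inflations of the current estimates for the means’ (ISM), specified as:

$$u_i(n, t) = \sup_{g \in \mathcal{F}} \left\{ \mu(g) : I(\hat{f}^i_t, g) < \frac{\ln n}{t} \right\}. \tag{7}$$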

The sufficient conditions on the estimators are as follows:

Defining

for all choices of and all , , the following hold for each as

These conditions correspond to Conditions A1-A3 given in [Burnetas and Katehakis(1996b)]. However under the stated Assumption 1 on given here, Condition A1 therein is automatically satisfied. Conditions A2 (see also Remark 4(b) in [Burnetas and Katehakis(1996b)]) and A3 are given as C1 and C2, above, respectively. Note, Condition (C1) is essentially satisfied as long as converges to (and hence sufficiently quickly with . This can often be verified easily with standard large deviation principles. The difficulty in proving the optimality of policy is often in verifying that Condition (C2) holds.

The above discussion is a parameter-free variation of that in [Burnetas and Katehakis(1996b)], where $\mathcal{F}$ was taken to be parametrizable, i.e., $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$, taking $\theta$ as a vector of parameters in some parameter space $\Theta$. Further, [Burnetas and Katehakis(1996b)] considered potentially different parameter spaces (and therefore potentially different parametric forms) for each bandit $i$. There, Conditions A1-A3 (hence C1, C2 herein) and the corresponding indices were stated in terms of estimates for the bandit parameters, with $\hat{\theta}^i_t$ an estimate of the parameters $\theta_i$ of bandit $i$, given $t$ samples. In particular, Eq. (7) appears essentially as

$$u_i(n, t) = \sup_{\theta \in \Theta} \left\{ \mu(\theta) : I(\hat{\theta}^i_t, \theta) < \frac{\ln n}{t} \right\}. \tag{8}$$

Previous work in this area includes [Robbins(1952)] and, additionally, [Gittins(1979)], [Lai and Robbins(1985)] and [Weber(1992)]. There is a large literature on versions of this problem, cf. [Burnetas and Katehakis(2003)], [Burnetas and Katehakis(1997b)] and references therein. For recent work in this area we refer to [Audibert et al.(2009)], [Auer and Ortner(2010)], [Gittins et al.(2011)], [Bubeck and Slivkins(2012)], [Cappé et al.(2013)], [Kaufmann(2015)], [Li et al.(2014)], [Cowan and Katehakis(2015)], and references therein. For more general dynamic programming extensions we refer to [Burnetas and Katehakis(1997a)], [Butenko et al.(2003)], [Tewari and Bartlett(2008)], [Audibert et al.(2009)], [Littman(2012)], [Feinberg et al.(2014)] and references therein. To our knowledge, outside the work in [Lai and Robbins(1985)], [Burnetas and Katehakis(1996b)] and [Burnetas and Katehakis(1997a)], asymptotically optimal policies have only been developed in [Honda and Takemura(2013)] for the problem discussed herein, and in [Honda and Takemura(2011)] and [Honda and Takemura(2010)] for the problem of finite known support, where optimal policies, cyclic and randomized, that are simpler to implement than those considered in [Burnetas and Katehakis(1996b)] were constructed.

Other related work in this area includes: [Katehakis and Derman(1986)], [Katehakis and Veinott Jr(1987)], [Burnetas and Katehakis(1993)], [Burnetas and Katehakis(1996a)], [Lagoudakis and Parr(2003)], [Bartlett and Tewari(2009)], [Tekin and Liu(2012)], [Jouini et al.(2009)], [Dayanik et al.(2013)], [Filippi et al.(2010)], [Osband and Van Roy(2014)], [Burnetas and Katehakis(1997a)], [Androulakis and Dimitrakakis(2014)], [Dimitrakakis(2012)].

## Optimal UCB Policies for Uniform Distributions

### 3 The B-K Lower Bounds and Inflation Factors

In this section we take $\mathcal{F}$ as the set of probability densities on $\mathbb{R}$ uniform over some finite interval, taking $f \in \mathcal{F}$ as uniform over $[a_f, b_f]$. Note, as the family of densities is parametrizable, this largely falls under the scope of [Burnetas and Katehakis(1996b)]. However, the results to follow seem to demonstrate a hole in that general treatment of the problem.

Note, some care with respect to support must be taken in applying [Burnetas and Katehakis(1996b)] to this case, to ensure that the integrals remain well defined. But for this $\mathcal{F}$, we have that for a given $f \in \mathcal{F}$, for any $g \in \mathcal{F}$ such that $S(f) \subset S(g)$, i.e., $a_g \leq a_f$ and $b_f \leq b_g$,

$$I(f, g) = \ln\left(\frac{b_g - a_g}{b_f - a_f}\right). \tag{9}$$

If $S(f)$ is not a subset of $S(g)$, we take $I(f, g)$ as infinite.
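The divergence between two uniform densities can be sketched as follows. This is a minimal illustration, assuming the closed form that follows directly from the definition of the Kullback-Leibler divergence: when the support of $f$ lies inside the support of $g$ the divergence is the logarithm of the ratio of the interval widths, and it is taken as infinite otherwise. The name `kl_uniform` and its argument names are illustrative choices.

```python
import math


def kl_uniform(a_f, b_f, a_g, b_g):
    """Kullback-Leibler divergence I(f, g) for f uniform on [a_f, b_f]
    and g uniform on [a_g, b_g].

    Returns ln((b_g - a_g) / (b_f - a_f)) when [a_f, b_f] is contained in
    [a_g, b_g], and +infinity otherwise.
    """
    if not (a_f < b_f and a_g < b_g):
        raise ValueError("each interval must have positive width")
    if a_g <= a_f and b_f <= b_g:
        return math.log((b_g - a_g) / (b_f - a_f))
    return math.inf


print(kl_uniform(0.0, 1.0, 0.0, 2.0))   # support doubled in width
print(kl_uniform(0.0, 1.0, 0.5, 2.0))   # support of f not contained in that of g
```

The divergence depends only on the two widths once containment holds, which is why the location of the wider interval is free to be chosen so as to maximize its mean.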

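For notational convenience, given $\{f_i\} \subset \mathcal{F}$, for each $i$, we take $f_i$ as supported on some interval $[a_i, b_i]$. Note then, $\mu_i = (a_i + b_i)/2$.

Given $t$ samples from bandit $i$, $X^i_1, \ldots, X^i_t$, we take

$$\hat{a}^i_t = \min_{k \leq t} X^i_k, \qquad \hat{b}^i_t = \max_{k \leq t} X^i_k \tag{10}$$

as the maximum-likelihood estimators of $a_i$ and $b_i$ respectively. We may then define $\hat{f}^i_t$ as the uniform density over the interval $[\hat{a}^i_t, \hat{b}^i_t]$. Note, $\hat{f}^i_t$ is the maximum-likelihood estimate of $f_i$.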
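The estimators above are simply the sample minimum and maximum. The sketch below illustrates this on simulated data; the helper name `uniform_mle`, the seed, and the interval $[2, 5]$ are hypothetical choices made only for the example.

```python
import random


def uniform_mle(samples):
    """Maximum-likelihood estimates (a_hat, b_hat) of the endpoints of a
    uniform distribution: the smallest and largest observed values."""
    if len(samples) == 0:
        raise ValueError("at least one sample is required")
    return min(samples), max(samples)


# Hypothetical bandit, uniform on [2, 5]; the controller does not know 2 and 5.
rng = random.Random(0)
observations = [rng.uniform(2.0, 5.0) for _ in range(50)]
a_hat, b_hat = uniform_mle(observations)
print(a_hat, b_hat)
```

The estimated interval always lies inside the true support, so the estimated span never exceeds the true span. This one-sided error is what the index inflation has to compensate for.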

We can now state and prove the following.

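Under Assumption 1 the following are true.

$$M_{BK}(\{f_i\}) = \sum_{i : \mu_i \neq \mu^*} \frac{\Delta_i}{\ln\left(1 + \frac{2 \Delta_i}{b_i - a_i}\right)}, \tag{11}$$

$$u_i(n, t) = \hat{a}^i_t + \frac{1}{2}\left(\hat{b}^i_t - \hat{a}^i_t\right) n^{1/t}. \tag{12}$$

Eq. (11) follows from Eq. (5) and the observation that in this case:

$$\inf_{g \in \mathcal{F}} \{ I(f_i, g) : \mu(g) > \mu^* \} = \ln\left(\frac{2\mu^* - 2 a_i}{b_i - a_i}\right) = \ln\left(1 + \frac{2 \Delta_i}{b_i - a_i}\right).$$

For Eq. (12) we have:

$$\begin{aligned} u_i(n, t) &= \sup_{a \leq \hat{a}^i_t, \; b \geq \hat{b}^i_t} \left\{ \frac{a + b}{2} : \ln\left(\frac{b - a}{\hat{b}^i_t - \hat{a}^i_t}\right) < \frac{\ln n}{t} \right\} \\ &= \sup_{a \leq \hat{a}^i_t, \; b \geq \hat{b}^i_t} \left\{ \frac{a + b}{2} : b - a < \left(\hat{b}^i_t - \hat{a}^i_t\right) n^{1/t} \right\} \\ &= \hat{a}^i_t + \frac{1}{2}\left(\hat{b}^i_t - \hat{a}^i_t\right) n^{1/t}. \end{aligned} \tag{13}$$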
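To get a feel for the size of the lower-bound constant, the following sketch evaluates it for a given set of intervals. It assumes the closed form in which each sub-optimal bandit contributes its discrepancy divided by $\ln(1 + 2\Delta_i/(b_i - a_i))$; the function name `bk_constant` and the example intervals are hypothetical.

```python
import math


def bk_constant(intervals):
    """Coefficient of ln(n) in the regret lower bound for uniform bandits.

    `intervals` is a list of (a_i, b_i) pairs. Each sub-optimal bandit
    contributes Delta_i / ln(1 + 2 * Delta_i / (b_i - a_i)), where Delta_i
    is its gap to the best mean; optimal bandits contribute nothing.
    """
    means = [(a + b) / 2.0 for a, b in intervals]
    best = max(means)
    total = 0.0
    for (a, b), mean in zip(intervals, means):
        gap = best - mean
        if gap > 0:
            total += gap / math.log(1.0 + 2.0 * gap / (b - a))
    return total


# Hypothetical example: a bandit on [0, 1] competing with one on [0, 2].
print(bk_constant([(0.0, 1.0), (0.0, 2.0)]))
```

In this form the constant depends on each sub-optimal bandit only through its discrepancy and its span, so a narrow bandit with a large discrepancy is cheap to rule out.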

We are interested in policies $\pi$ such that $\lim_{n} R_\pi(n)/\ln n$ achieves the lower bound indicated above, for every choice of $\{f_i\} \subset \mathcal{F}$. Following the prescription of [Burnetas and Katehakis(1996b)], i.e. Eq. (12), would lead to the following policy,

Policy BK-UCB : $\pi_{BK}$. At each $n = 1, 2, \ldots$:

i) For $n = 1, 2, \ldots, 2N$ sample each bandit twice, and

ii) for $n \geq 2N$, let $\pi_{BK}(n + 1)$ be equal to:

$$\arg\max_i \left\{ \hat{a}^i_{T^i_{\pi_{BK}}(n)} + \frac{1}{2}\left(\hat{b}^i_{T^i_{\pi_{BK}}(n)} - \hat{a}^i_{T^i_{\pi_{BK}}(n)}\right) n^{1 / T^i_{\pi_{BK}}(n)} \right\}, \tag{14}$$

breaking ties arbitrarily.

It is easy to demonstrate that the estimators $\hat{f}^i_t$ converge sufficiently quickly to $f_i$ in probability that Condition (C1) above is satisfied for $\pi_{BK}$. Proving that Condition (C2) is satisfied, however, is much more difficult, and in fact we conjecture that (C2) does *not* hold for policy $\pi_{BK}$. While this does not indicate that $\pi_{BK}$ fails to achieve asymptotic optimality, it does imply that the standard techniques are insufficient to verify it. However, asymptotic optimality may provably be achieved by a (seemingly) negligible modification, via the following policy.

### 4 Asymptotically Optimal UCB Policy

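We propose the following policy:

Policy UCB-Uniform: $\pi_{CK}$. At each $n = 1, 2, \ldots$:

i) For $n = 1, 2, \ldots, 3N$ sample each bandit three times, and

ii) for $n \geq 3N$, let $\pi_{CK}(n + 1)$ be equal to:

$$\arg\max_i \left\{ \hat{a}^i_{T^i_{\pi_{CK}}(n)} + \frac{1}{2}\left(\hat{b}^i_{T^i_{\pi_{CK}}(n)} - \hat{a}^i_{T^i_{\pi_{CK}}(n)}\right) n^{1 / \left(T^i_{\pi_{CK}}(n) - 2\right)} \right\}, \tag{15}$$

breaking ties arbitrarily.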
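A minimal simulation sketch of a policy of this type is given below. It is an illustration under stated assumptions, not the authors' implementation: it assumes the index takes the form of the sample minimum plus half the sample range, inflated by $n^{1/(t-2)}$ where $t$ is the number of samples taken so far from the bandit, and it assumes three initial samples per bandit. The names `ucb_uniform_index` and `run_ucb_uniform`, the seed, and the example intervals are hypothetical.

```python
import random


def ucb_uniform_index(a_hat, b_hat, n, t, offset=2):
    """Assumed index: a_hat + 0.5 * (b_hat - a_hat) * n ** (1 / (t - offset)).

    `offset=2` is the modified exponent; `offset=0` gives the unmodified
    n ** (1 / t) inflation. Requires t > offset.
    """
    return a_hat + 0.5 * (b_hat - a_hat) * n ** (1.0 / (t - offset))


def run_ucb_uniform(intervals, horizon, seed=0):
    """Simulate the policy on uniform bandits with the given true intervals.

    Returns the number of pulls of each bandit after max(horizon, 3 * N)
    rounds. The true intervals are used only to generate samples.
    """
    rng = random.Random(seed)
    k = len(intervals)
    lo = [float("inf")] * k    # running sample minima (a_hat)
    hi = [float("-inf")] * k   # running sample maxima (b_hat)
    pulls = [0] * k

    def sample(i):
        a, b = intervals[i]
        x = rng.uniform(a, b)
        lo[i] = min(lo[i], x)
        hi[i] = max(hi[i], x)
        pulls[i] += 1

    # Initial phase: three samples from every bandit, so that t - 2 >= 1.
    for i in range(k):
        for _ in range(3):
            sample(i)

    # Index phase: after n samples in total, choose the bandit for round n + 1.
    for n in range(3 * k, horizon):
        scores = [ucb_uniform_index(lo[i], hi[i], n, pulls[i]) for i in range(k)]
        sample(max(range(k), key=scores.__getitem__))  # ties: lowest label
    return pulls


pulls = run_ucb_uniform([(0.0, 1.0), (0.0, 2.0)], horizon=2000)
print(pulls)
```

Passing `offset=0` to `ucb_uniform_index` gives the unmodified inflation, which allows a side-by-side comparison of the two indices on the same estimates.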

In the remainder of this paper, we verify the asymptotic optimality of $\pi_{CK}$ (Theorem 2), and additionally give finite horizon bounds on the regret under this policy (Theorems 1 and 3). Further, while Theorem 3 bounds the order of the remainder term as $O((\ln n)^{3/4})$, this is refined somewhat in Theorem 4 to $o((\ln n)^{2/3 + \beta})$.

## The Optimality Theorem and Finite Time Bounds

For the work in this section it is convenient to define the bandit spans, $S_i = b_i - a_i$. We take $S^*$ to be the minimal span of any optimal bandit, i.e., $S^* = \min_{i : \mu_i = \mu^*} S_i$.

Recall that $\Delta_i = \mu^* - \mu_i$. The primary result of this paper is the following.

For each sub-optimal (i.e., ), let be such that , , and . For as defined above, for all :

(16) |

The proof of Theorem 1 is the central proof of this paper. We delay it briefly, to present two related results that can be derived from the above. The first is that $\pi_{CK}$ is asymptotically optimal.

For $\pi_{CK}$ as defined above, $\pi_{CK}$ is asymptotically optimal in the sense that

(17) |

Fix the as feasible in the hypotheses of Theorem \thechapter. In that case, we have

(18) |

Taking the infimum as yields

(19) |

This, combined with the previous observation about the in Eq. (11) completes the result.

We next give an ‘$\epsilon$-free’ version of the previous bound, which demonstrates that the remainder term on the regret under $\pi_{CK}$ is at worst $O((\ln n)^{3/4})$.

For each sub-optimal (i.e., , let . For all ,

(20) |

Note the following bound, that

(23) |

This first inequality is proven separately as Proposition 5.2 in the Appendix. The second inequality is simply the observation that . Applying this bound to Theorem \thechapter yields the following bound,

(24) |

Taking completes the proof.

[Proof of Theorem 1] For any such that , recall that bandit is taken to be uniformly distributed on the interval . Let be as hypothesized. In this proof, we take as defined above. Additionally, for each we define and . We define the index function

(25) |

We define the following events of interest, and . We now define the following quantities: For ,

(26) |

Hence, we have the following relationship for , that

(27) |

The proof proceeds by bounding, in expectation, each of the three terms.

Observe that, by the structure of the index function ,

(28) |

Hence,

(29) |

The last inequality follows, observing that may be expressed as the sum of indicators, and seeing that the additional condition bounds the number of non-zero terms in the above sum. The additional simply accounts for the term and the term.

Note, this bound is sample-path-wise.

For the second term,

(30) |

The last inequality follows as, for fixed , may be true for at most one value of . It follows then that

(31) |

To bound the term, observe that in the event , from the structure of the policy it must be true that . Thus, if is some bandit such that , . In particular, we take to be the optimal bandit realizing the minimal span . It follows,

(32) |

The last step follows as for in this range, . Hence

(33) |

Here we may make use of the following result:

Let be i.i.d. random variables, with , and finite. For , let and