# Minimal-Exploration Allocation Policies: Asymptotic, Almost Sure, Arbitrarily Slow Growing Regret

## Abstract

The purpose of this paper is to provide further insight into the structure of the sequential allocation (“stochastic multi-armed bandit”, or MAB) problem by establishing probability-one finite-horizon bounds and convergence rates for the sample (or “pseudo”) regret associated with two simple classes of allocation policies.

For any slowly increasing function $g$, subject to mild regularity constraints, we construct two policies (the $g$-Forcing and the $g$-Inflated Sample Mean) that achieve a measure of regret of order $O(g(n))$ almost surely as $n \to \infty$, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function $g$ effectively controls the “exploration” of the classical “exploration/exploitation” tradeoff.


Keywords: Forcing Actions, Inflated Sample Means, Multi-armed Bandits, Sequential Allocation, Online Learning

## 1 Introduction and Summary

The basic problem involves sampling sequentially from a finite number of populations or “bandits,” where each population $i$ is specified by a sequence of real-valued i.i.d. random variables, $\{X^i_k\}_{k \geq 1}$, with $X^i_k$ representing the reward received the $k$-th time population $i$ is sampled. The distributions $F_i$ of the $X^i_k$ are taken to be unknown; they belong to some collection of distributions $\mathcal{F}$. We restrict $\mathcal{F}$ in two ways:

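First, each population has some finite mean $\mu_i = \mathbb{E}[X^i_k]$, unknown to the controller. The purpose of this assumption is to establish for each population the Strong Law of Large Numbers (SLLN): writing $\bar{X}^i_k = \frac{1}{k}\sum_{j=1}^{k} X^i_j$ for the sample mean of the first $k$ rewards from population $i$,

$$\mathbb{P}\left( \lim_{k \to \infty} \bar{X}^i_k = \mu_i \right) = 1. \tag{1}$$

Second, we assert that each population has finite variance $\sigma_i^2 = \operatorname{Var}(X^i_k)$. The purpose of this assumption is to establish for each population the Law of the Iterated Logarithm (LIL),

$$\mathbb{P}\left( \limsup_{k \to \infty} \frac{|\bar{X}^i_k - \mu_i|}{\sqrt{2 \ln \ln k / k}} = \sigma_i \right) = 1. \tag{2}$$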

It will emerge that the important distribution properties for the populations are not the i.i.d. structure, but rather Eqs. (1), (2) alone. This allows for some relaxation of assumptions, as discussed in Section 5. In fact, the LIL (and therefore the assumption of finite variances) is only really required for the derivation of the regret remainder term bounds in the results to follow - the primary asymptotic results depend solely on the SLLN.

Additionally, we define $\mu^* = \max_i \mu_i$, and we take the optimal bandit to be unique; that is, there is a unique $i^*$ such that $\mu_{i^*} = \mu^*$. It is convenient to define the bandit discrepancies as $\Delta_i = \mu^* - \mu_i \geq 0$.

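For any adaptive policy $\pi$, let $\pi(t) = i$ indicate the event that population $i$ is sampled at time $t$, and let $T^i_\pi(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$ denote the number of times $i$ has been sampled during periods $t = 1, \ldots, n$, under policy $\pi$; for convenience we define $T^i_\pi(0) = 0$ for all $i$. One is typically interested in maximizing, in some well-defined sense, the sum of the first $n$ outcomes, $S_\pi(n) = \sum_i \sum_{k=1}^{T^i_\pi(n)} X^i_k$, achieved by an adaptive policy $\pi$. To this end we note that if the controller had complete information (i.e., knew the distributions of the $X^i_k$, for each $i$), she would at every round activate the “optimal” bandit $i^*$. Natural measures of the loss due to this ignorance of the distributions are the quantities below:

$$\tilde{R}_\pi(n) = n\mu^* - \sum_i \mu_i T^i_\pi(n) = \sum_i \Delta_i T^i_\pi(n), \tag{3}$$

$$R_\pi(n) = n\mu^* - \mathbb{E}[S_\pi(n)] = \sum_i \Delta_i \mathbb{E}[T^i_\pi(n)]. \tag{4}$$

The functions $\tilde{R}_\pi(n)$ and $R_\pi(n)$ have been called in the literature the pseudo-regret and the regret, respectively; for notational simplicity their dependence on the unknown distributions is usually suppressed.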
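As a concrete illustration of the pseudo-regret, the following minimal Python sketch computes it from a vector of means and a vector of activation counts. It assumes the pseudo-regret is the sum over bandits of the gap to the best mean times the number of activations; the function name `pseudo_regret` and the numbers are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def pseudo_regret(means: Sequence[float], counts: Sequence[int]) -> float:
    """Sum over bandits of (best mean - mean_i) * activations_i."""
    best = max(means)
    return sum((best - m) * t for m, t in zip(means, counts))


# Hypothetical example: three bandits, 100 activations in total.
means = [0.5, 0.3, 0.1]
counts = [90, 7, 3]
value = pseudo_regret(means, counts)

# Equivalent form: n * best_mean - sum_i mean_i * T_i.
alt = sum(counts) * max(means) - sum(m * t for m, t in zip(means, counts))
print(value, alt)
```

Only the activation counts enter the computation, not the realized rewards, which is the sense in which this measure reflects decisions rather than outcomes.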

The motivation for considering alternative regret measures to the expected regret $R_\pi(n)$ is that, while the investigator might be pleased to know that the policy she is utilizing has minimal *expected* regret, she might reasonably be more interested in the behavior of the policy on the specific sample path she is currently exploring than in its aggregate behavior over the entire probability space.
At an extreme end of this would be a result minimizing regret or pseudo-regret surely (sample-path-wise) or almost surely (with full probability), guaranteeing a sense of optimality independent of outcome. We offer an asymptotic result of this type here in Theorem 2.

Note that $R_\pi(n) = \mathbb{E}[\tilde{R}_\pi(n)]$, and “good policies” are those that achieve a small rate of increase for one of the above regret functions. Further relationships and forms of pseudo-regret are explored in Bubeck and Cesa-Bianchi [3], e.g., the “sample regret” $\hat{R}_\pi(n) = n\mu^* - S_\pi(n)$, where $S_\pi(n)$ is the sum of the first $n$ rewards collected under $\pi$. We find the pseudo-regret in some sense more philosophically satisfying to consider than sample regret, for the reason that, given her ignorance and the inherent randomness, the controller cannot reasonably regret the specific reward gained or lost from an activation of a bandit, as in $\hat{R}_\pi(n)$. She can only reasonably regret the decision to activate that specific bandit, which is captured by the dependence of $\tilde{R}_\pi(n)$ on the activation counts $T^i_\pi(n)$ alone.

Thus, we are particularly interested in high-probability or guaranteed (almost sure) asymptotic bounds on the growth of the pseudo-regret as $n \to \infty$. The main result of this paper is Theorem 1, which establishes, by two examples, that for any arbitrarily slowly increasing function $g$ that satisfies mild regularity conditions there exist “$g$-good policies” $\pi_g$. The latter policies are such that the following is true:

$$\limsup_{n \to \infty} \frac{\tilde{R}_{\pi_g}(n)}{g(n)} \leq C \quad \text{(a.s.)}$$

(i.e., $\tilde{R}_{\pi_g}(n) = O(g(n))$ almost surely) for every set of bandit distributions in $\mathcal{F}$, for some positive finite constant $C$ that may depend on the policy and the distributions.

The results presented here are in fact intuitive, in the following way: it will be shown that in the $g$-Forcing and $g$-index policies, the function $g$ essentially sets the investigator’s willingness to explore and experiment with bandits that do not currently (based on available data) seem to have the highest mean. Even if the controller explores very slowly (i.e., she chose a very slowly growing $g$), as long as she explores long enough she will eventually develop accurate estimates of the means for each bandit, and incur very little regret (or pseudo-regret) past that point. We note here that, for the most part, we do not recommend the actual implementation or use of these policies. The cost of this guaranteed asymptotic behavior is that (depending on $g$ and the bandit specifics) slow pseudo-regret growth is only achieved on impractically large time-scales. We find it interesting, however, that such growth can be guaranteed, independent of the specifics of the bandits, under assumptions as weak as the Strong Law of Large Numbers. This makes these results fairly broad. Additionally, the $g$-Forcing and $g$-index policies individually capture elements present in many other popular policies, and are suggestive of the almost sure asymptotic behavior of these policies. One takeaway from this is, perhaps, that asymptotic behavior by itself is little basis for thinking of a policy as “good”. As essentially any asymptotic behavior is possible (through the choice of $g$), any useful qualification of a policy must consider not only the asymptotic behavior, but also the timescales over which it is practically achieved.

In the remainder of the paper, we define what it means for a policy to be $g$-good (Definition 1), and establish the existence of $g$-good policies (Theorem 1) for any $g$ satisfying mild regularity conditions. The proof is by example, through the construction of $g$-Forcing and $g$-index policies that satisfy its claim. Further, bounds on the corresponding order constants of pseudo-regret growth are established for each policy (Theorems 2 and 4), as well as bounds on the asymptotic remainder terms (Theorems 3 and 5), bounding the remainder from both above and below. We view the proofs of the asymptotic lower bounds, as well as the derivation of the remainder terms via a sort of “bootstrapping” on the earlier order results, as particularly interesting.

In the attempt to generalize some of these results for the $g$-index policy, an interesting effect and seeming “phase change” in the resulting dynamics was discovered. Specifically, as discussed in Remark 2, when there are multiple optimal bandits, for $g$ of order greater than $\sqrt{n \ln \ln n}$ all optimal bandits are sampled roughly equally often, while for $g$ of order less than $\sqrt{n \ln \ln n}$, the $g$-index policy tends to fix on a single optimal bandit, sampling the other optimal bandits much more rarely in comparison.

## 2 Related Literature

Robbins [10] first analyzed the problem of maximizing asymptotically the expected value of the sum of collected rewards, $S_\pi(n)$, using only the assumption of the Strong Law of Large Numbers for the sample means of each population. He constructed a modified (outside two sparse sequences of forced choices) “play the winner” (greedy) policy, $\pi_R$, such that with probability one, $S_{\pi_R}(n)/n$ converges to the largest population mean as $n \to \infty$. From this he was able to claim, using the uniform integrability property for the case of Bernoulli bandits, that

$$\lim_{n \to \infty} \frac{R_{\pi_R}(n)}{n} = 0. \tag{5}$$

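Lai and Robbins [9] considered the case in which the collection of distributions $\mathcal{F}$ consists of univariate density functions $f(x; \theta_i)$ with respect to some measure $\nu$, where $f(\cdot\,; \cdot)$ is known and the unknown scalar parameter $\theta_i$ is in some known set $\Theta$. Let $\underline{\theta} = (\theta_1, \theta_2, \ldots)$ collect the parameters of all populations, and let $I(\theta, \theta')$ denote the Kullback-Leibler divergence between $f(x; \theta)$ and $f(x; \theta')$. They established, under mild regularity conditions ((1.6), (1.7) and (1.9) therein), that if one requires a policy $\pi$ to have a regret that increases at a slower than linear rate:

$$R_\pi(n) = o(n^\alpha) \quad \text{for all } \alpha > 0, \tag{6}$$

then $\pi$ must sample among populations in such a way that its regret satisfies

$$\liminf_{n \to \infty} \frac{R_\pi(n)}{\ln n} \geq M_{LR}(\underline{\theta}), \tag{7}$$

where $M_{LR}(\underline{\theta}) = \sum_{i : \mu_i < \mu^*} \Delta_i / I(\theta_i, \theta_{i^*})$.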

Burnetas and Katehakis [4] extended and simplified the above work for the case in which the collection of distributions is specified by a known function that may depend on an unknown vector parameter as follows. Let . They showed, under certain regularity conditions (part 1 of Theorem 1, therein), that if a policy satisfies Eq. (6), then it must sample among populations in such a way that its regret satisfies:

(8) |

where

(9) |

Further, under certain regularity conditions (cf. conditions “A1-A3” therein) regarding the estimates of the parameters and , they showed that policies which, after taking some small number of samples from each population, always choose the population with the largest value of the population dependent index:

(10) |

are asymptotically efficient (or optimal), i.e.,

(11) |

The index policy above was a simplification of a UCB-type policy first introduced in Lai and Robbins [9] that utilized forced actions. Policies that satisfy the requirements of Eq. (5), Eq. (6), and Eq. (11) were respectively called uniformly consistent (UC), uniformly fast convergent (UF), and uniformly maximal convergence rate (UM) or simply asymptotically optimal (or asymptotically efficient). The lower bound of Eq. (9) provides a baseline for comparison of the quality of policies and, together with Eq. (11) and Eq. (8), provides an alternative way to state the asymptotic optimality of a policy as:

(12) |

Policies that achieve this minimal asymptotic growth rate have been derived for specific parametric models in Lai and Robbins [9], Burnetas and Katehakis [4], Honda and Takemura [7], Honda and Takemura [6], Honda and Takemura [8], Cowan et al. [5] and references therein. In general it is not always easy to obtain such optimal policies; thus, policies that satisfy the less strict requirement of Eq. (6) have been constructed, cf. Auer et al. [2], Audibert et al. [1], Bubeck and Cesa-Bianchi [3] and references therein. Such policies usually bound the regret as follows:

(13) |

where is, often much, bigger than for all .

The results presented herein can seem surprising, and it may appear to contradict (at least for ) the classical lower bound of for UF policies . For example, if we take to be the normal distribution with unknown mean and unknown variance , we have for any UF policy :

On the other hand we establish in the sequel that:

(14) |

However, no such contradiction exists: limits the of a UF policy from below. In such contexts that or are UF, if such contexts exist, the above constants will be bounded from below by . In such contexts that or are not UF, the bound does not apply. In such instances, we may in fact conclude from the results presented herein, and standard results relating modes of convergence, that for the policies constructed here, for , the sequences of random variables , are not uniformly integrable. An example as to how this can occur is given via the proof of Theorem 2 of Cowan et al. [5], where with a non-trivial probability, non-representative initial sampling of each bandit biases expected future activations of sub-optimal bandits super-logarithmically. This effect does not influence the long term almost sure behavior of these policies.

## 3 Main Theorems

We characterize a policy by the rate of growth of its pseudo-regret function with $n$ in the following way.

###### Definition 1

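For a function $g$, a policy $\pi$ is $g$-good if for every set of bandit distributions in $\mathcal{F}$, there exists a constant $C < \infty$ such that

$$\mathbb{P}\left( \limsup_{n \to \infty} \frac{\tilde{R}_\pi(n)}{g(n)} \leq C \right) = 1. \tag{15}$$

Remark 1: Essentially, a policy is $g$-good if $\tilde{R}_\pi(n) = O(g(n))$ (a.s.). Trivially, policies exist that are $n$-good (i.e., $g(n) = n$), for example any policy that samples all populations at a constant rate.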
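The triviality of linear growth can be checked in one line. The following is a sketch under assumed notation: $\Delta_i$ is the gap between the best mean and the mean of bandit $i$, $T^i_\pi(n)$ is the number of activations of bandit $i$ through time $n$, and the pseudo-regret is taken to be the sum of gap times activation count.

```latex
\tilde{R}_\pi(n) = \sum_i \Delta_i T^i_\pi(n)
  \leq \Big(\max_i \Delta_i\Big) \sum_i T^i_\pi(n)
  = n \max_i \Delta_i ,
\qquad\text{so}\qquad
\limsup_{n \to \infty} \frac{\tilde{R}_\pi(n)}{n} \leq \max_i \Delta_i < \infty .
```

Under this reading every policy is $n$-good, which is why only sub-linear $g$ is of interest.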

We next state the following theorem:

###### Theorem 1

For $g$ an unbounded, positive, increasing, concave, differentiable, sub-linear function, there exist $g$-good policies.

The proof of this theorem is given by example with Theorems 2 and 4, which demonstrate two $g$-good policies: the $g$-Forcing and the $g$-index policies.

We note that in the sequel it will be assumed that any $g$ considered is an unbounded, positive, increasing, concave, differentiable, sub-linear function.

### 3.1 A Class of g-Forcing Policies

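Let $g$ be as hypothesized in Theorem 1. We define a $g$-Forcing policy in the following way:

$g$-Forcing policy: A policy $\pi$ that first samples each bandit once, then for each subsequent time $n$,

$$\pi(n+1) = \begin{cases} \arg\min_i T^i_\pi(n) & \text{if } \min_i T^i_\pi(n) < g(n), \\ \arg\max_i \bar{X}^i_{T^i_\pi(n)} & \text{otherwise,} \end{cases} \tag{16}$$

where $T^i_\pi(n)$ is the number of times bandit $i$ has been sampled through time $n$ and $\bar{X}^i_t$ is the sample mean of its first $t$ rewards.

Briefly, at any time, if any population has been sampled fewer than $g(n)$ times, sample it. Otherwise, sample from the population with the current highest sample mean. Ties are broken either uniformly at random, or at the discretion of the investigator. In this way, $g$ can be seen as determining the rate of exploration of currently sub-optimal bandits. This can be viewed as a variant of the policy considered in Robbins [10].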
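The rule just described can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the authors' implementation: rewards are Gaussian with unit variance, forced samples go to a least-sampled bandit, ties go to the lowest index, and the names `g_forcing`, `means`, `horizon` and `seed` are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def g_forcing(means: Sequence[float], g: Callable[[int], float],
              horizon: int, seed: int = 0) -> List[int]:
    """Run a g-Forcing rule on Gaussian bandits; return activation counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # activations per bandit
    sums = [0.0] * k      # running reward totals per bandit

    def pull(i: int) -> None:
        sums[i] += rng.gauss(means[i], 1.0)   # unit variance is an assumption
        counts[i] += 1

    for i in range(k):                        # sample each bandit once
        pull(i)
    for n in range(k, horizon):               # n = number of samples so far
        if min(counts) < g(n):
            i = counts.index(min(counts))     # forced: a least-sampled bandit
        else:
            # greedy: highest current sample mean
            i = max(range(k), key=lambda j: sums[j] / counts[j])
        pull(i)
    return counts


counts = g_forcing([0.5, 0.3, 0.1], g=lambda n: math.log(n + 1), horizon=5000)
print(counts)
```

With a logarithmic `g`, every bandit ends up with at least roughly `g(horizon)` activations regardless of the random rewards; how the remaining activations split depends on the sample path, and on short horizons the greedy phase can sit on a sub-optimal bandit for long stretches.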

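It is convenient to define the following constant, with $\Delta_i$ the bandit discrepancies,

$$\Delta_F = \sum_{i} \Delta_i. \tag{17}$$

The value $\Delta_F$ in some sense represents the pseudo-regret incurred each time the sub-optimal bandits are all activated once. The next result states that $g$-Forcing policies satisfy the conditions of Theorem 1.

###### Theorem 2

For a policy $\pi$ as in (16), $\pi$ is $g$-good, and

$$\mathbb{P}\left( \lim_{n \to \infty} \frac{\tilde{R}_\pi(n)}{g(n)} = \Delta_F \right) = 1. \tag{18}$$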

The above theorem can be strengthened in the following way, bounding the asymptotic remainder terms almost surely:

###### Theorem 3

Proof. [Theorems 2 and 3] Theorems 2, 3 follow immediately from the following proposition, the proof of which is given in Appendix A:

###### Proposition 1

For policy as in (16), the following is true: For every , almost surely there exists a such that, for all ,

(21) |

Using the above relation to bound first the limits as of , then (observing that ), give the desired results.

Proposition 1 is considerably stronger than Theorems 2, 3. However, it somewhat obscures the true nature of what is going on: for sufficiently large $n$, almost surely, sub-optimal bandits are only activated during the “forcing” phase of the policy, when some activations are below $g(n)$. As a result, since $g$ increases slowly (e.g., sub-linearly), for large $n$ each sub-optimal bandit has been activated $\lceil g(n) \rceil$ times - except for a discrepancy that occurs, for a brief stretch of activations, whenever $g(n)$ surpasses the next integer threshold. At this point, the policy raises the activations of each sub-optimal bandit, restoring the previous equality. Hence, in fact, equality holds in Proposition 1 for most large $n$. Discrepancy occurs increasingly rarely with $n$, based on the hypotheses on $g$. If, additionally, the controller specifies a deterministic scheme for tie-breaking, pseudo-regret may be determined explicitly for all sufficiently large $n$. Leaving ties to the discretion of the controller, Proposition 1 is as strong a statement as can be made.

### 3.2 A Class of g-Index Policies

In this section, we consider an index policy related to the classical “UCB” index policies. Let $g$ be as hypothesized. For each bandit $i$, define an index on $(n, t)$,

$$u_i(n, t) = \bar{X}^i_t + \frac{g(n)}{t}, \tag{22}$$

where $\bar{X}^i_t$ is the sample mean of the first $t$ rewards from bandit $i$.

$g$-index policy: A policy $\pi$ that first samples each bandit once, then for each subsequent time $n$,

$$\pi(n+1) = \arg\max_i u_i\big(n, T^i_\pi(n)\big), \tag{23}$$

where $T^i_\pi(n)$ is the number of times bandit $i$ has been sampled through time $n$.

Briefly, at any time, the sample means of each bandit are “inflated” by the $g(n)/T^i_\pi(n)$ term, and the policy always activates the bandit with the largest inflated sample mean. When unsampled, a bandit’s inflated sample mean increases essentially at rate $g$, hence $g$ drives the rate of exploration of current sub-optimal bandits. While this policy is inspired by more traditional “Upper Confidence Bound” policies, we refer to this as an Inflated Sample Mean policy, as it has no deliberate connection to confidence bounds.
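An inflated-sample-mean rule of this kind can be sketched as follows. The index used here, sample mean plus `g(n)` divided by the activation count, is an assumed form chosen to match the description that an unsampled bandit's inflated mean grows at rate $g$; it should not be read as the paper's exact definition. Gaussian unit-variance rewards, lowest-index tie-breaking, and the names `g_index_policy`, `means`, `horizon` and `seed` are likewise hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def g_index_policy(means: Sequence[float], g: Callable[[int], float],
                   horizon: int, seed: int = 0) -> List[int]:
    """Run an inflated-sample-mean rule; return activation counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # activations per bandit
    sums = [0.0] * k      # running reward totals per bandit

    def pull(i: int) -> None:
        sums[i] += rng.gauss(means[i], 1.0)   # unit variance is an assumption
        counts[i] += 1

    for i in range(k):                        # sample each bandit once
        pull(i)
    for n in range(k, horizon):               # n = number of samples so far
        def index(j: int) -> float:
            # assumed index: sample mean inflated by g(n) / activations
            return sums[j] / counts[j] + g(n) / counts[j]
        pull(max(range(k), key=index))
    return counts


counts = g_index_policy([0.5, 0.3, 0.1], g=lambda n: math.sqrt(n), horizon=5000)
print(counts)
```

Unlike a forcing rule, there is no separate exploration phase here: a rarely sampled bandit is revisited only once its inflation term has grown enough to lift its index above the others.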

More general index policies of this type could also be considered, for instance based on an index $\bar{X}^i_t + h(g(n)/t)$ where $h$ is some positive, increasing function of its argument. This is more in line with the common UCB policies, which frequently have inflation terms of the form $\sqrt{\ln n / t}$ (though this is hardly necessary, cf. Cowan et al. [5]) with $\ln n$ serving the “exploration-driving” role of $g$. However, introducing this extra function does not influence the order of the growth of pseudo-regret; it simply changes the relevant order constants, at the cost of complicating the analysis.

Theorem 4 below shows that a $g$-index policy satisfies the conditions of Theorem 1, and gives the minimal order constant for this policy.

###### Theorem 4

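For a policy $\pi$ as in (23), if the optimal bandit is unique, then, with $N$ the total number of bandits,

$$\mathbb{P}\left( \lim_{n \to \infty} \frac{\tilde{R}_\pi(n)}{g(n)} = N - 1 \right) = 1. \tag{24}$$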

The proof of this theorem depends on the following propositions, the proofs of which are given in Appendix B. Interestingly, these results (and therefore Theorem 4) depend only on the assumption of the SLLN, not the LIL.

###### Proposition 2

For each sub-optimal , , (a.s.) a finite constant such that for ,

(25) |

###### Proposition 3

For each sub-optimal , , (a.s.) some finite such that for ,

(26) |

Proof. [Theorem 4]
For each sub-optimal bandit $i$, as an application of Props. 2, 3, taking the limit of $T^i_\pi(n)/g(n)$ first as $n \to \infty$, then as $\epsilon \to 0$, gives $T^i_\pi(n)/g(n) \to 1/\Delta_i$, almost surely. The theorem then follows from the definition of pseudo-regret, Eq. (3).
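To spell out the last step, here is a short sketch under assumed notation: $\Delta_i$ is the gap between the optimal mean and the mean of bandit $i$, $T^i_\pi(n)$ is its activation count, $i^*$ is the unique optimal bandit, $N$ is the number of bandits, and the limit in the proof is read as $T^i_\pi(n)/g(n) \to 1/\Delta_i$ almost surely for each sub-optimal $i$. Since the optimal bandit contributes nothing to the pseudo-regret,

```latex
\frac{\tilde{R}_\pi(n)}{g(n)}
  = \sum_{i \neq i^*} \Delta_i \, \frac{T^i_\pi(n)}{g(n)}
  \;\longrightarrow\; \sum_{i \neq i^*} \Delta_i \cdot \frac{1}{\Delta_i}
  = N - 1 \quad \text{(a.s.)} .
```

Under these assumptions the order constant is simply the number of sub-optimal bandits, whatever their means.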

Remark 2: In the case that the optimal bandit is not unique, it happens that Prop. 2 still holds. It can be shown then that $\pi$ remains $g$-good in this case, and has a limiting order constant of at most $N - k$ ($N$ as the total number of bandits, $k$ as the number of optimal bandits). We leave as an open question, however, that of producing a Prop. 3-type lower bound and the verification of $N - k$ as the minimal order constant. The proof of Prop. 3 for $k = 1$ depends on establishing a lower bound on the activations of the unique optimal bandit: in short, at time $n$, since the sub-optimal bandits are activated at most $O(g(n))$ times (which holds independent of $k$), it follows from its uniqueness that the optimal bandit is activated at least $n - O(g(n))$ times. If, however, $k > 1$ and the optimal bandit is not unique, the optimal bandits must have been activated at least $n - O(g(n))$ times in total at time $n$, but the distribution of these activations among the optimal bandits is hard to pin down. Simple simulations seem to indicate a sort of “phase change”, in that for $g$ of order greater than $\sqrt{n \ln \ln n}$ all optimal bandits are sampled roughly equally often, while for $g$ of order less than $\sqrt{n \ln \ln n}$, the policy tends to fix on a single optimal bandit, sampling the other optimal bandits much more rarely in comparison.

We offer the following as a potential explanation of this observed effect (and justification of the difficult-to-observe $\ln \ln n$ term): Let us hypothesize, for the moment, that under any circumstances, the optimal bandits are activated linearly with time, that is, for any optimal bandit $i$, $T^i_\pi(n) = \Theta(n)$, with the order coefficient depending on the specifics of that bandit. Under policy $\pi$, activations are governed by a comparison of indices. We consider then the fluctuations in value of the two terms of the index, the sample mean $\bar{X}^i_{T^i_\pi(n)}$ and the inflation term $g(n)/T^i_\pi(n)$. Under the assumption that the optimal bandits are activated linearly, and reasonable assumptions on the bandit distributions (to grant the Law of the Iterated Logarithm), the fluctuations in the sample mean over time will be of order $\sqrt{\ln \ln n / n}$. The fluctuations in the inflation term will be of order $g(n)/n$. It would seem to follow then that for $g$ of order less than $\sqrt{n \ln \ln n}$, when comparing indices of optimal bandits, the sample mean is the dominant contribution to the index, while for $g$ of order greater than $\sqrt{n \ln \ln n}$, the inflation term is the dominant contribution to the index. When the inflation term dominates, among the optimal bandits an “activate according to the largest index” policy essentially reduces to an “activate according to the smallest number of activations” policy, which leads to equalization and all optimal bandits being activated roughly equally often. When the sample mean dominates, among the optimal bandits an “activate according to the largest index” policy essentially reduces to an “activate according to the highest sample mean” or “play the winner” policy, which leads to the policy fixing on certain bandits for long periods.

This explanation would additionally suggest that on one side of the phase change, when the inflation term dominates, the only properties of the optimal bandits that matter for the dynamics of the problem are their means, that they all have the optimal mean . But on the other side of the phase change, when the sample mean dominates, other properties such as the variances influence the dynamics, through the Law of the Iterated Logarithm. However at this point in time, this remains, while interesting, speculative.

Based on the above results, we have the following result: For each , (a.s.) some finite such that for ,

(27) |

Similarly, for the optimal bandit ,

(28) |

It follows trivially from these that each bandit is activated infinitely often, i.e., almost surely is equivalent to the sequence , though with some (finite) stretches of term repetition. It follows then, applying the LIL that

(29) |

This provides greater control over the sample mean of each bandit than the Strong Law of Large Numbers alone allows, and allows the previous asymptotic results to be strengthened, as in the following theorem.

In short, we have that for a - index policy ,

It should be observed that, unlike previous results, this theorem is somewhat restrictive in its allowed $g$. However, since the focus is traditionally on logarithmic regret, i.e., $g(n) = \ln n$, it is clear that the above restrictions are nothing serious.

This theorem follows trivially from the following refinements of Props. 2, 3, and the definition of pseudo-regret, Eq. (3). Their proofs are given in Appendix C.

###### Proposition 4

If , for each sub-optimal , the following holds almost surely:

(32) |

###### Proposition 5

If , for each sub-optimal , the following holds almost surely:

(33) |

Again, we leave as an open problem that of extending these results to the case of non-unique optimal bandits.

## 4 Comparison between Policies

We have established two policies, $g$-Forcing and $g$-index, that each achieve $O(g(n))$ pseudo-regret, almost surely. The question of which policy is “better” is not necessarily well posed. For one thing, the asymptotic pseudo-regret growth of either policy can be improved by picking a slower $g$. In this sense, there is certainly no “optimal” policy, as there will always be a slower $g$. For a fixed $g$, however, the question of which policy is better becomes context specific: for some bandit distributions, the order constant of the $g$-Forcing policy (Theorem 2) will be smaller than the order constant of the $g$-index policy (Theorem 4); for other bandit distributions, the comparison will go the other way.

In terms of the results presented here, the pseudo-regret of the -Forcing policy is much more tightly controlled, Proposition 1 bounding the fluctuations in pseudo-regret around by at most a constant - indeed, at most . The bounds on the - index policy however are . But, this additional control of the -Forcing policy comes at a cost. It follows from the proof of Proposition 1 that for sub-optimal , for all large ,

(34) |

However, for the - index policy, following the proof of 4, for all sub-optimal , and large ,

(35) |

It is clear from this that the $g$-Forcing policy is in some sense the more democratic of the two, sampling all sub-optimal bandits equally, regardless of quality. The $g$-index policy is the more meritocratic, sampling sub-optimal bandits more rarely the farther they are from the optimum. This has the effect of boosting the sampling of bandits near the optimum, but this effect is somewhat counterbalanced as they contribute less to the pseudo-regret.
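The context dependence of the comparison can be illustrated numerically. The sketch below rests on two assumptions read off the discussion above rather than taken from the stated theorems: that the forcing policy activates each sub-optimal bandit about $g(n)$ times, giving an order constant equal to the sum of the gaps, and that the index policy activates a sub-optimal bandit with gap $\Delta_i$ about $g(n)/\Delta_i$ times, giving an order constant equal to the number of sub-optimal bandits. The function name `order_constants` and the example means are hypothetical.

```python
from typing import Sequence, Tuple


def order_constants(means: Sequence[float]) -> Tuple[float, int]:
    """Return (assumed forcing constant, assumed index constant)."""
    best = max(means)
    gaps = [best - m for m in means if m < best]   # sub-optimal gaps only
    forcing = sum(gaps)     # each sub-optimal bandit sampled ~ g(n) times
    index = len(gaps)       # each sampled ~ g(n) / gap, contributing ~ g(n)
    return forcing, index


close = order_constants([1.0, 0.9, 0.8])   # small gaps: forcing constant smaller
far = order_constants([5.0, 1.0, 0.0])     # large gaps: index constant smaller
print(close, far)
```

When the gaps are small, uniform forced sampling is cheap and the forcing constant is the smaller of the two; when the gaps are large, the index policy's sparser sampling of poor bandits wins.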

## 5 Relaxing Assumptions: i.i.d. Bandits

The assumption that the results from each bandit are i.i.d. is fairly standard - the problem is generally phrased as a matter of knowledge discovery about a set of unknown distributions, through the use of repeated measurements. However, it is interesting to observe that this assumption actually plays no part in the results and proofs presented in this paper. The sole distributional property that mattered for establishing the policies as $g$-good was the assumption that for each bandit $i$ there existed some finite $\mu_i$ such that $\bar{X}^i_k \to \mu_i$ almost surely with $k$ (though the Law of the Iterated Logarithm was utilized to great effect in bounding the remainder terms). In fact, the expected values of the individual $X^i_k$ need not be $\mu_i$, nor must the $X^i_k$ be independent of each other for a given $i$. Further, it is never necessary that the bandits themselves be independent of each other! In that regard, the results herein are actually quite general statements about minimizing pseudo-regret under arbitrary multidimensional stochastic processes that satisfy that Strong Law of Large Numbers-type requirement.

However, a word of caution is due: removing the restrictions on the $X^i_k$ in this way, while not influencing the proofs of the results presented here, does somewhat call into question the definition of “pseudo-regret” as given in Eq. (3). With the individual sample means freed, it is not necessarily reasonable to define a finite horizon pseudo-regret, $\tilde{R}_\pi(n)$, in terms of the infinite horizon means, $\mu_i$. For instance, it is no longer necessarily true that the optimal, complete knowledge policy on any finite horizon is simply to activate a bandit with infinite horizon mean $\mu^*$ at every point. A more applicable definition of pseudo-regret would have to take into account what is reasonable to know or measure about the state of each bandit in finite time.

Acknowledgement: We would like to acknowledge support for this project from the National Science Foundation (NSF grant CMMI-14-50743).

## Appendix A Proof of Proposition 1

Proof. To prove Proposition 1, it will suffice to show the following: For all and all , there exists (a.s.) a finite time such that,

(36) |

Theorem 1 follows from this result and Eq. (3), with the appropriate choice of .

Without loss of generality, we may restrict ourselves to .

As a preliminary step: Based on the properties of , if is the total number of bandits, there exists a finite, not random, time such that , the following is true:

(37) |

This follows from the observation that , and that .

When implementing a $g$-Forcing policy (hereafter referenced simply as $\pi$), there are essentially two alternating phases (or modes) of the policy: “catch up” and “play the winner”. During “catch up”, some number of bandits have fewer than $g(n)$ activations (the sub-$g$ bandits), and they are activated until all bandits have at least $g(n)$ activations. During “play the winner”, each bandit has at least $g(n)$ activations, and the bandit with the current greatest sample mean is activated. These phases can be seen as governed by the function $\delta(n) = g(n) - \min_i T^i_\pi(n)$, with $T^i_\pi(n)$ the activation count of bandit $i$ at time $n$, so that when $\delta(n) > 0$ the policy is in “catch up” mode, and when $\delta(n) \leq 0$ the policy is in “play the winner” mode.

Having activated bandits according to policy up to time , suppose that , hence the policy enters or is in a period of “catch up”. Let be the number of sub- bandits at time . Because is increasing, and there are sub- bandits at time , it will take at least “catch up” activations before the policy enters a period of “play the winner” (). Consider activating bandits according to policy for activations. Note, , so from Ineq. (37) and increasing property of we have: . Additionally, , as every bandit realizing the minimum activations will have been activated at least once. It follows that

(38) |

Hence, after a period of activations from time , the spread has decreased by at least . Repeating this argument, based on the number of sub- bandits (if any) at time , it is clear that eventually - in finite time - a time is reached such that . At this point, all bandits have been activated at least times, and the policy enters a period of “play the winner”. We observe the loose, but sample-path-wise, bound that,

(39) |

since always, and at every step the number of sub- bandits is at most . Observe that if in fact , then we may take .

Having entered a period of or “play the winner” at time , let such that but . That is, in the transition from time to , surpasses the number of activations of some bandits and the policy enters a period of “catch up”. At such a point, we have the following relations:

(40) |

The first inequality is simply that , the second following since , and the last since . However, since the are integer valued and non-decreasing, the above yields

(41) |

Combining Eqns. (40), (41) yields the important relation that . Note additionally,

(42) |

Again noting the are integer valued, this implies that while there are sub- bandits at time , the only sub- bandits are those that realize the minimum number of activations . All other bandits have activations strictly greater than . Let the number of sub- bandits at time again be denoted . For additional activations under , in the “catch up” phase, we have that and . Hence, . For additional activations after time , each sub- bandit has been activated once, raising the minimum number of activations by 1: . Additionally, , hence .

We see therefore that after , at any point at which becomes positive after being at most zero, it is at most for a finite time - the “catch up” phase - before becoming negative. Hence it follows, that for , , or for each

(43) |

Note, this is true for all . This acts as justification for the description of as the “forcing function”, as the policy forces all activations to grow at least at asymptotically.

Since is unbounded and increasing, all populations are sampled infinitely often over time. Taking the strong law of large numbers to hold, for every and each , there exists almost surely some finite such that for all . It is worth noting here that while such a exists, it is random and unknowable to the investigator. Because of the properties of , we may define a finite such that . By Eq. (43), we have that for all ,

(44) |

Hence we have for each population, for every , there exists almost surely a finite random time past which the sample mean is trapped within the