Asymptotically Optimal Multi-Armed Bandit Activation Policies under Side Constraints
This paper introduces the first asymptotically optimal strategy for the multi-armed bandit (MAB) problem under side constraints. The side constraints model situations in which bandit activations are not cost-free but incur known, bandit-dependent costs (that is, they consume different resources), and the controller is always constrained by limited resource availability. The main result is the derivation of an asymptotic lower bound for the regret of feasible uniformly fast policies and the construction of policies that achieve this lower bound under pertinent conditions. Further, we provide the explicit form of such policies for the case in which the unknown distributions are Normal with unknown means and known variances, and for the case of arbitrary discrete distributions with finite support.
Keywords: Stochastic Bandits, Sequential Decision Making, Regret Minimization, Sequential Allocation.
Consider the problem of sequentially activating one of a finite number of independent statistical bandits, where successive activations of each bandit yield i.i.d. random rewards whose distributions depend on unknown parameters and have positive means. Each activation of a bandit incurs several bandit-dependent and, in general, different types of activation costs. For each cost type there is a cost constraint that must be satisfied in every period. Each constraint represents a resource budget that cannot be exceeded; however, it is assumed that at each activation (period) any unused resource amounts can be carried forward for use in future activations (periods).
The objective is to obtain a feasible policy that asymptotically maximizes the total expected reward or, equivalently, asymptotically minimizes a regret function. We develop a class of such feasible policies that are shown to be asymptotically optimal within a large class of good policies that are uniformly fast (UF) convergent, in the sense of Burnetas and Katehakis (1996) and Lai and Robbins (1985). The results in this paper extend the work in Burnetas et al. (2017), which solved the case where there exists only one type of cost constraint for all bandits, and provide a class of block-UCB (b-UCB) feasible policies that achieve this asymptotic lower bound, have a simpler form, and are easier to compute than those in Burnetas et al. (2017).
There is an extensive literature on the multi-armed bandit (MAB) problem, cf. Lai and Robbins (1985); Katehakis and Robbins (1995); Mahajan and Teneketzis (2008); Audibert et al. (2009); Auer and Ortner (2010); Honda and Takemura (2011); Bubeck and Slivkins (2012); Cowan and Katehakis (2015); Lattimore (2018) and references therein. As far as we know, the first formulation of the MAB problem with a side constraint considered herein was given in Burnetas and Katehakis (1998). Tran-Thanh et al. (2010) considered the problem when the cost of activation of each arm is fixed and becomes known after the arm is used once. Burnetas and Kanavetas (2012) considered a version of this problem with a single constraint and constructed a consistent policy (i.e., with regret $o(n)$).
A different constrained MAB model is given in Badanidiyuru et al. (2018). In this model there is a single resource upper bound for all constraints, and activations stop once it is exhausted. They showed how to construct policies with sub-linear regret. Key differences between this work and our model are: (i) in our model each constraint has a different upper bound per period, which is renewed in each period, and any unused resource amount can be carried over; (ii) in the aforementioned paper all constraint coefficients are (much) smaller than the upper bound, while in our model we allow several to be greater than the upper bound; the corresponding bandits (with large coefficients) can nevertheless be activated if the controller does not utilize the full amount of resources in one or more periods, so as to carry over the excess resources for use in future periods. Some interesting applications of constrained MAB models are problems of dynamic procurement (Singla and Krause, 2013), auctions (Tran-Thanh et al., 2014), and dynamic pricing (Wang et al., 2014; Johnson et al., 2015). Ding et al. (2013) constructed UF policies (i.e., with regret ) for cases in which activation costs are bandit-dependent i.i.d. random variables. For other recent related work we refer to: Guha and Munagala (2007); Tran-Thanh et al. (2012); Thomaidou et al. (2012); Lattimore et al. (2014); Sen et al. (2015); Pike-Burke and Grunewalder (2017); Zhou et al. (2018); Spencer and Kevan de Lopez (2018).
In this paper, we first establish, in Theorem 5, a necessary asymptotic lower bound for the rate of increase of the regret function of f-UF policies. Then we construct a class of “block f-UF” policies and provide conditions under which they are asymptotically optimal within the class of f-UF policies, achieving this asymptotic lower bound, cf. Theorem 6. Finally, we provide the explicit form of an asymptotically optimal f-UF policy and two applications: the case in which the unknown distributions are Normal with unknown means and known variances, and the case in which they are discrete distributions with finite support.
2 Model Formulation
Let , denote the independent bandits, where successive activations of a bandit constitute a sequence of i.i.d. random variables . For each fixed follows a univariate distribution with density with respect to a nondegenerate measure . The density is known and is a parameter belonging to some set . Let denote the set of parameters, , where . Given let be the vector of the expected values, i.e. . The true value of is unknown. We make the assumptions that outcomes from different bandits are independent and all means are positive.
Each activation of bandit incurs different types of cost: , To avoid trivial cases, we will assume that By relabeling the bandits we call as bandit the bandit which has the maximum number of costs that are the minimum among in the constraint type . Similarly, we label as bandit the bandit which has the maximum number of costs that are the maximum among in the constraint type . Again to avoid trivial cases, we will assume that for each constraint type and , for at least one constraint type . For simplicity of the mathematical analysis below we assume that there is no bandit with activating cost that is equal to Equivalently, for each constraint type , there exists , with and (note that ).
Following standard terminology, adaptive policies depend only on past activations and observed outcomes. Specifically, let , denote the bandit activated and the observed outcome at period . Let denote a history of activations and observations available at period t. An adaptive policy is a sequence of history dependent probability distributions on , such that Given , let denote the number of times bandit has been sampled during the first n periods . Let and be respectively the total reward earned and total cost for type incurred up to period , i.e.,
We call an adaptive policy feasible if
The objective is to obtain a feasible policy that asymptotically maximizes the expected total reward or, equivalently, asymptotically minimizes the regret function, cf. (3.1).
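The feasibility requirement with carry-over can be illustrated by a small check. The sketch below is only one reading of the constraint described above: it assumes that a fixed budget per cost type is released in each period and that unused amounts accumulate, so that an activation sequence is feasible when, for every period n and every cost type, the cumulative cost does not exceed n times the per-period budget. The names `cumulative_costs`, `is_feasible`, `costs`, and `budgets`, as well as all numerical values, are hypothetical and do not come from the paper.

```python
from typing import Iterator, List, Sequence


def cumulative_costs(activations: Sequence[int],
                     costs: Sequence[Sequence[float]]) -> Iterator[List[float]]:
    """Yield, after each period, the total cost incurred so far per cost type.

    activations[t] is the index of the bandit activated in period t + 1;
    costs[i][j] is the (known) cost of type j for one activation of bandit i.
    """
    totals = [0.0] * len(costs[0])
    for bandit in activations:
        for j, c in enumerate(costs[bandit]):
            totals[j] += c
        yield list(totals)


def is_feasible(activations: Sequence[int],
                costs: Sequence[Sequence[float]],
                budgets: Sequence[float]) -> bool:
    """Assumed feasibility rule: cumulative cost <= n * budget for every n."""
    for n, totals in enumerate(cumulative_costs(activations, costs), start=1):
        # Small tolerance guards against floating-point round-off.
        if any(total > n * b + 1e-12 for total, b in zip(totals, budgets)):
            return False
    return True


# Two bandits, one cost type, per-period budget 2 (illustrative values).
costs = [[1.0], [3.0]]   # bandit 0 is cheap; bandit 1 costs more than the budget
budgets = [2.0]

print(is_feasible([1, 0], costs, budgets))  # expensive bandit first
print(is_feasible([0, 1], costs, budgets))  # save one unit first, then spend it
```

Under this reading, the bandit whose cost exceeds the per-period budget cannot be activated in the first period, but it becomes available once a cheaper activation has left unused resource to carry over.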
2.1 Optimal Solution Under Known Parameters
It follows from the standard theory of MDPs, cf. Derman (1970), that if all parameters were known, the optimal activation(s) (the same in all periods) for maximizing the expected average reward would be obtained as the solution to the following linear program (LP).
where the variables , for , represent the activation probabilities for bandit of an optimal randomized policy.
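For concreteness, a linear program matching this description can be written in standard form as follows. This is a sketch in assumed notation, since the symbols are not fixed by the surrounding text: $k$ is the number of bandits, $L$ the number of cost types, $\mu_i(\theta_i)$ the mean reward of bandit $i$, $c_i^j$ its cost of type $j$, $c_0^j$ the per-period budget of type $j$, $x_i$ the activation probabilities, and $y_j$ the slack variables.

```latex
\begin{align*}
z^*(\theta) = \max \; & \sum_{i=1}^{k} \mu_i(\theta_i)\, x_i \\
\text{s.t. } & \sum_{i=1}^{k} c_i^{j} x_i + y_j = c_0^{j}, \quad j = 1, \dots, L, \\
& \sum_{i=1}^{k} x_i = 1, \\
& x_i \ge 0, \; i = 1, \dots, k, \qquad y_j \ge 0, \; j = 1, \dots, L.
\end{align*}
```

As an illustration with made-up numbers, take $k = 2$, $L = 1$, means $(1, 2)$, costs $(1, 3)$, and budget $c_0 = 2$. Substituting $x_1 = 1 - x_2$ into the cost constraint gives $x_2 \le 1/2$, so the optimum is $x = (1/2, 1/2)$ with value $3/2$: the more rewarding bandit is used in half of the periods even though its cost exceeds the per-period budget.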
Thus, a basic matrix is an matrix that consists of one or at most bandit (and slack) variables (and ); recall that . Note that any basic feasible solution (BFS) corresponding to such a choice of the matrix is uniquely determined by the vector corresponding to the choice of basic bandit variables: , For simplicity, in the sequel we will not distinguish between and , since each determines the other. Thus, the vector uniquely determines a corresponding (possibly randomized) activation policy with randomization probabilities , in . We use to denote the set of bandits corresponding to a feasible choice of for simplicity written as
Given our assumptions on the , it follows that the feasible region of (2.1) is nonempty and bounded, and hence the LP has a finite number of BFS, at least one of which is optimal.
In the sequel it will be more convenient to work with the dual problem DLP stated below.
For a basic matrix of LP, we let denote the dual vector corresponding to , i.e., , where contains the means of the bandits given by the choice of
A BFS is optimal if and only if the reduced costs (dual slacks) for the corresponding basic matrix are all nonnegative, i.e.,
Note that it is easy to show that the reduced cost can be expressed as a linear combination of the bandit means, i.e., , where is an appropriately defined vector that does not depend on .
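To make the dual and the reduced costs concrete, the following sketch uses assumed notation that is not taken from the source. It supposes that the primal maximizes $\sum_i \mu_i x_i$ subject to $\sum_i c_i^j x_i \le c_0^j$ for each cost type $j = 1, \dots, L$, $\sum_i x_i = 1$, and $x \ge 0$, and it writes $\lambda_j$ for the dual variable of the $j$-th cost constraint, $g$ for the dual variable of the normalization constraint, and $\phi_i$ for the reduced cost of bandit $i$.

```latex
\begin{align*}
\min \; & \sum_{j=1}^{L} c_0^{j} \lambda_j + g \\
\text{s.t. } & \sum_{j=1}^{L} c_i^{j} \lambda_j + g \ge \mu_i, \quad i = 1, \dots, k, \\
& \lambda_j \ge 0, \; j = 1, \dots, L, \qquad g \in \mathbb{R}, \\[4pt]
\phi_i \; = \; & \sum_{j=1}^{L} c_i^{j} \lambda_j + g - \mu_i \;\ge\; 0 .
\end{align*}
```

For a two-bandit illustration with made-up numbers (means $(1, 2)$, costs $(1, 3)$, budget $2$), the basis containing both bandits gives $\lambda + g = 1$ and $3\lambda + g = 2$, hence $\lambda = g = 1/2$. The dual objective is $2 \cdot \tfrac12 + \tfrac12 = \tfrac32$, which equals the primal value at $x = (1/2, 1/2)$, and both reduced costs are zero, so this basis is optimal. Since $(\lambda, g)$ solves a linear system whose right-hand side is the vector of basic means, each $\phi_i$ is linear in the means.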
In the sequel we use the notation to denote the set of choices of corresponding to optimal solutions of the LP for a vector , i.e.,
3 Optimal Policies Under Unknown Parameters
3.1 The Regret Function
In this subsection we consider the case in which is unknown and define the regret of a policy as the finite horizon loss in expected reward with respect to the optimal policy corresponding to the case in which is known, i.e.,
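As in Burnetas et al. (2017), we now state the following. A feasible policy is called consistent if its regret is $o(n)$ as $n \to \infty$ for every parameter value, and it is called uniformly fast (f-UF) if its regret is $o(n^{a})$ as $n \to \infty$ for every $a > 0$ and every parameter value.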
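In symbols, and with notation that is assumed here rather than taken from the source ($z^*(\theta)$ for the optimal value of the LP under known parameters, $W_\pi(n)$ for the total reward of policy $\pi$ up to period $n$, and $R_\pi(\theta, n)$ for the regret), the regret and the two growth conditions in the sense of Burnetas and Katehakis (1996) can be sketched as:

```latex
\begin{align*}
R_\pi(\theta, n) &= n\, z^*(\theta) - \mathbb{E}_\theta\!\left[ W_\pi(n) \right], \\
\text{consistent:} \quad R_\pi(\theta, n) &= o(n)
  \quad \text{as } n \to \infty, \ \text{for all } \theta, \\
\text{uniformly fast:} \quad R_\pi(\theta, n) &= o(n^{a})
  \quad \text{as } n \to \infty, \ \text{for all } a > 0 \text{ and all } \theta .
\end{align*}
```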
3.2 Lower Bound for the Regret
For any , we define the sets and , as follows
where , is a new vector such that only parameter is changed from .
Note that the first set consists of all values of under which the problem with known parameters under the perturbed has a unique optimal solution that includes bandit . The second set consists of all bandits that do not appear in any optimal solution under parameter set but for which, by changing only parameter , there is a uniquely optimal solution that contains them.
We next define the minimum distance of a parameter vector to a new parameter vector that makes bandit optimal, so that it appears in the unique optimal solution when its parameter becomes .
where denotes the Kullback-Leibler distance between the distributions and , i.e., the expectation, under the first distribution, of the logarithm of the ratio of the first density to the second.
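For the two families treated explicitly in this paper, the Kullback-Leibler distance has a simple closed form. The sketch below is illustrative only: the function names `kl_normal_known_variance` and `kl_discrete` are hypothetical, and the discrete case assumes that both distributions are given as probability vectors over the same finite support.

```python
import math
from typing import Sequence


def kl_normal_known_variance(mu: float, mu_alt: float, sigma: float) -> float:
    """KL distance between N(mu, sigma^2) and N(mu_alt, sigma^2).

    With a common, known variance the distance reduces to
    (mu - mu_alt)^2 / (2 * sigma^2).
    """
    return (mu - mu_alt) ** 2 / (2.0 * sigma ** 2)


def kl_discrete(p: Sequence[float], q: Sequence[float]) -> float:
    """KL distance sum_x p(x) * log(p(x) / q(x)) on a common finite support."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue              # the term 0 * log(0 / q) is taken as 0
        if qx == 0.0:
            return math.inf       # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


# Illustrative values only.
print(kl_normal_known_variance(0.0, 1.0, 1.0))
print(kl_discrete([0.5, 0.5], [0.25, 0.75]))
```

Both functions return zero when the two distributions coincide and grow as the alternative parameter moves away, which is the sense in which the distance measures how hard it is to tell the two parameter values apart from observed rewards.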
The next Lemma establishes upper and lower bounds for the new mean under the changed parameter vector in terms of the quantity . The proof is specialized and not the focus of this paper, and is relegated to the appendix.
For any optimal matrix under , such
that for any
the following is true,
The above implies that
In order to establish a lower bound on the regret we need to express it as:
and any optimal basic matrix where the above expression follows from the LP and DLP relations since is an optimal basis: and .
Both terms of the right side of (3.2) are nonnegative, the first due to optimality of and the second due to the feasibility of . It follows that a necessary and sufficient condition for a policy to be f-UF is that for all and any optimal under the following two relations hold.
The following lemma and proposition are used to establish, in Lemma 4, a lower bound for the activation frequencies of any f-UF policy. These readily imply the lower bound for the regret of such policies in Theorem 5. The proof of the lemma is relegated to the appendix.
If there is a uniquely optimal , then the following hold.
If , , then , for out of the resource constraints, and equal to for the remaining resource constraints.
When is a singleton, i.e., , then , for every resource constraint .
The next proposition establishes that an f-UF policy is such that , it must be true that the number of activations from each bandit is at least , for some sequence of positive constants .
For any f-UF policy and for all we have that for , any and for all positive sequences: it is true that
Proof. Let , The definition of and Lemma 1 (ii) imply that we must have a which is uniquely optimal under (i.e., ) and . Then we have two cases for the uniquely optimal solution depending on whether is a singleton or not.
In the first case , and Lemma 2 implies that for an f-UF policy ; thus, the definition of an f-UF policy implies that:
Now for any sequence with (for all ), we obtain the following.
In the second case , for , where bandit is one of . Then, as before, (3.5) holds. It follows from Lemma 2 that for an f-UF policy we must have , for resource constraints, which we label as Using the last result and and (3.4) we obtain:
If we sum (3.5) for all it follows that
Now, let be the corresponding randomization probabilities, then and from (3.9) we have that
From the definition of we can write , and from the DLP we have . Also, from the DLP we obtain that , for and , where for .
After some algebra we can show that
Using , and (3.12) we
In addition, since (3.13) can be written as:
which simplifies into:
Multiplying both sides of the last equation by ,
For any let
where is the th resource constraint. Now, for each resource constraint , label as the minimum for . With thus defined and the above definitions we have:
Furthermore, from (3.9)
Now, (2.3) and the definition of imply the following.
For any bandit in the following arguments hold. For simplicity we present these arguments only for the specific bandit for which