# Bounded Regret for Finitely Parameterized Multi-Armed Bandits

## Abstract

We consider the problem of finitely parameterized multi-armed bandits, where the model of the underlying stochastic environment can be characterized by a common unknown parameter. The true parameter is unknown to the learning agent, but the set of possible parameters, which is finite, is known a priori. We propose a simple and easy-to-implement algorithm, which we call the FP-UCB algorithm, that uses the information about the underlying parameter set for faster learning. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy this structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller preceding constant compared to the standard UCB algorithm. We also validate the superior performance of the FP-UCB algorithm through extensive numerical simulations.

Keywords: Multi-Armed Bandits, Reinforcement Learning, Sequential Decision Making

## I Introduction

Multi-Armed Bandits (MAB) problems are the canonical formalism for studying how an agent learns to take optimal actions through repeated interactions with a stochastic environment. The learning agent receives a reward at each time step, which depends on the action of the agent as well as the stochastic uncertainty associated with the environment. The goal of the learning agent is to take actions so as to maximize the cumulative reward. When the stochastic model of the environment is perfectly known, computing the optimal action is a straightforward optimization problem. The challenge, as in most real-world problems, is that the agent does not know the stochastic model of the environment a priori, and it has to be learned from sequential observations. The learning agent needs to do exploration, i.e., take various actions sequentially to gather information, in order to estimate the stochastic model of the system. At the same time, the learning agent needs to do exploitation of the available information at any given time for maximizing the cumulative reward. This exploration vs. exploitation trade-off is at the core of multi-armed bandits problems.

Multi-armed bandits problems have been studied extensively in the literature. Lai and Robbins in their seminal paper [1] formulated the non-Bayesian stochastic i.i.d. multi-armed bandits problem and characterized the performance of a learning algorithm using the metric of regret. They showed that no learning algorithm will be able to achieve a regret better than $\Omega(\log T)$. They also proposed a learning algorithm that achieves an asymptotic logarithmic regret, matching the fundamental lower bound. Anantharam et al. later extended this to the more general setting of Markovian rewards and multiple plays [2, 3]. A simple index-based algorithm called the UCB algorithm was introduced in [4], which achieves the order-optimal regret in a non-asymptotic manner. This approach led to the development of a number of interesting algorithms, like linear bandits [5], contextual bandits [6], combinatorial bandits [7], and decentralized and multi-player bandits [8].

Thompson (Posterior) Sampling is another class of algorithms that gives superior numerical performance in multi-armed bandits problems. The posterior sampling heuristic was first introduced by Thompson [9], but the first rigorous performance guarantee, an $O(\log T)$ regret, was given in [10]. The Thompson Sampling idea has since been used to develop algorithms for bandits with multiple plays [11], contextual bandits [12], general online learning problems [13], and reinforcement learning [14]. Both classes of algorithms have been used in a number of practical applications, like communication networks [15], smart grids [16], and recommendation systems [17].

Our contribution: We consider a class of MAB problems where the model of the underlying stochastic environment can be characterized by a common unknown parameter. In particular, we consider the setting where the cardinality of the set of possible parameters is finite. This is inspired by many real-world applications. For example, in recommendation systems and e-commerce applications (Amazon, Netflix), it is typical to assume that each user has a certain ‘type’ parameter, and the set of possible parameters is finite. The preference of the user is characterized by her type (for example, preferring science books over fiction books). The set of all possible types and the preferences of each type may be known a priori, but the type of a new user may be unknown. So, instead of learning the preferences of this user over all possible choices, it may be easier to learn the type parameter of this user from a few observations. In this work, we propose an algorithm that explicitly uses the availability of such structural information about the underlying parameter set, which enables faster learning.

We show that the proposed FP-UCB algorithm can achieve a bounded regret under some structural condition on the underlying parameter set. This is in sharp contrast to the increasing regret of the standard multi-armed bandits algorithms. We also show that, if the underlying parameter set does not satisfy the necessary structural condition, the FP-UCB algorithm achieves a regret of $O(\log T)$, but with a smaller preceding constant compared to the standard UCB algorithm. The regret achieved by our algorithm also matches the fundamental lower bound given by [18]. One remarkable aspect of our algorithm is that it is oblivious to whether the underlying parameter set satisfies the necessary condition or not, thus avoiding re-tuning of the algorithm depending on the problem instance. Instead, it achieves the best possible performance given the problem instance.

Related work: The finitely parameterized multi-armed bandits problem was first studied by Agrawal et al. [18]. They proposed an algorithm that achieves a bounded regret when the parameter set satisfies some necessary condition, and a logarithmic regret otherwise. However, their algorithm is rather complicated, which limits its practical implementation and its extension to other settings. Its regret analysis is also involved and asymptotic in nature, unlike the recent, simpler index-based bandits algorithms and their finite-time analyses. [18] also provided a fundamental lower bound for this class of problems. Compared to this work, our FP-UCB algorithm is simple, easy to implement, and easy to analyze, while providing non-asymptotic performance guarantees that match the lower bound.

There is much recent work on exploiting the available structure of the MAB problem for getting tighter regret bounds. In particular, [19, 20, 21, 22] consider problem settings similar to this paper, where the mean reward of each arm is parameterized by a single unknown parameter. [19] assumes that the reward functions are continuous in the global parameter and gives a bounded regret result. [20] gives specific conditions on the mean rewards to achieve a bounded regret. [21] considers a latent bandit problem where the reward distributions are partitioned into a number of clusters and indexed by a latent parameter corresponding to the cluster. [22] characterizes the minimal rates at which suboptimal arms have to be explored depending on the structural information, and proposes an algorithm which achieves these rates. [23] exploits a different kind of structural information: it is shown that if the mean values of the best arm and the second best arm (but not the identities of the arms) are known, then a bounded regret can be achieved. [24, 25] also address problems with similar structural information. There is also work on bandits algorithms that exploit side information [26, 27], and recently in the context of contextual bandits [28]. Our problem formulation, algorithm, and analysis are very different from these works.

## II Problem Formulation

We consider the following sequential decision making problem. In each time step $t$, the agent selects an arm (action) $a(t)$ from the set of possible arms $\{1, 2, \ldots, L\}$. Each arm $i$, when selected, yields a random real-valued reward. More precisely, let $X_i(\tau)$ be the random reward from arm $i$ in its $\tau$th selection. We assume that $X_i(\tau)$ is drawn according to a probability distribution $P_i(\theta^o)$ with mean $\mu_i(\theta^o)$. Here $\theta^o$ is the (true) parameter that determines the distribution of the stochastic rewards. The agent does not know $\theta^o$ or the corresponding mean values $\mu_i(\theta^o)$. The random rewards obtained from playing an arm repeatedly are i.i.d. and independent of the plays of the other arms. We assume that the rewards are bounded, with support in $[0, 1]$. The goal of the agent is to select a sequence of actions that maximizes the expected cumulative reward, i.e., $\mathbb{E}[\sum_{t=1}^{T} X_{a(t)}]$.

Clearly, the optimal choice is to select the best arm (the arm with the highest mean value) all the time, i.e., $a(t) = a^*(\theta^o)$ for all $t$, where $a^*(\theta^o) \in \arg\max_{i} \mu_i(\theta^o)$. However, the agent will be able to make this optimal decision only if she knows the parameter $\theta^o$ or the corresponding mean values $\mu_i(\theta^o)$ for all $i$. The goal of a MAB algorithm is to learn to make the optimal sequence of decisions without knowing the true parameter $\theta^o$ a priori.

We consider the setting where the agent knows the set of possible parameters $\Theta$. We assume that $\Theta$ is finite. If the true parameter were $\theta$, then an agent selecting arm $i$ would get a random reward drawn according to a distribution $P_i(\theta)$ with mean $\mu_i(\theta)$. We assume that the agent knows $\mu_i(\theta)$ for each $\theta \in \Theta$ and all arms $i$. The optimal arm corresponding to the parameter $\theta$ is denoted as $a^*(\theta)$. We emphasize that the agent does not know the true parameter $\theta^o$ (and hence the optimal action $a^*(\theta^o)$), except the fact that it lies in the finite set $\Theta$.
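
To make this setting concrete, the following minimal Python sketch (with hypothetical parameter values that are not from the paper) encodes a finitely parameterized instance: a table of known mean vectors $\mu_i(\theta)$, the optimal-arm map $a^*(\theta)$, and a Bernoulli reward draw under the true parameter.

```python
import random

# Hypothetical example (not from the paper): L = 3 arms, |Theta| = 2 parameters.
# MU[theta][i] is the known mean reward mu_i(theta) of arm i under parameter theta.
MU = {
    "theta1": [0.9, 0.5, 0.4],
    "theta2": [0.2, 0.8, 0.3],
}

def a_star(theta):
    """The optimal arm a*(theta) under parameter theta (assumed unique)."""
    means = MU[theta]
    return max(range(len(means)), key=lambda i: means[i])

def pull(theta_o, i, rng=random):
    """One bounded [0,1] Bernoulli reward from arm i under the true parameter."""
    return 1.0 if rng.random() < MU[theta_o][i] else 0.0
```

The agent would be given `MU` but not which key is the true parameter `theta_o`.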

In the multi-armed bandits literature, it is standard to characterize the performance of an online learning algorithm using the metric of regret. Regret is defined as the performance loss of the algorithm compared to the optimal algorithm with complete information. Since the optimal algorithm with complete information always selects the arm $a^*(\theta^o)$, the expected cumulative regret of a multi-armed bandits algorithm after $T$ time steps is defined as

$$\mathbb{E}[R(T)] := \mathbb{E}\Big[\sum_{t=1}^{T} \big(\mu_{a^*(\theta^o)}(\theta^o) - \mu_{a(t)}(\theta^o)\big)\Big]. \qquad (1)$$

The goal of a multi-armed bandit learning algorithm is to select actions sequentially in order to minimize $\mathbb{E}[R(T)]$.
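
As a quick illustration of definition (1), the expected regret of a fixed arm sequence is just the sum of the optimality gaps of the arms played; the mean values below are hypothetical and for illustration only.

```python
def expected_regret(arm_sequence, mu_o):
    """Expected cumulative regret (1) of a fixed arm sequence:
    sum over time of (best mean - mean of the arm played)."""
    best = max(mu_o)
    return sum(best - mu_o[a] for a in arm_sequence)

# Hypothetical true means mu_i(theta_o) for a 3-arm instance:
mu_o = [0.9, 0.5, 0.4]
# Playing the best arm (arm 0) incurs zero regret; each play of arm 1 adds 0.4.
```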

## III UCB Algorithm for Finitely Parameterized Multi-Armed Bandits

In this section, we present our algorithm for finitely parameterized multi-armed bandits and the main theorem. We first introduce a few notations for presenting the algorithm and the results succinctly.

Let $n_i(t)$ be the number of times arm $i$ has been selected by the algorithm until time $t$, i.e., $n_i(t) = \sum_{\tau=1}^{t} \mathbb{1}\{a(\tau) = i\}$. Here $\mathbb{1}\{\cdot\}$ is an indicator function. Define the empirical mean corresponding to arm $i$ at time $t$ as,

$$\hat{\mu}_i(t) := \frac{1}{n_i(t)} \sum_{\tau=1}^{n_i(t)} X_i(\tau). \qquad (2)$$

Define the set $\mathcal{A} := \{a^*(\theta) : \theta \in \Theta\}$, which is the collection of optimal arms corresponding to all parameters in $\Theta$. Intuitively, a learning agent can restrict to selecting arms from the set $\mathcal{A}$. Clearly, $\mathcal{A} \subseteq \{1, \ldots, L\}$, and this reduction can be useful when $|\mathcal{A}|$ is much smaller than $L$.

Our Finitely Parameterized Upper Confidence Bound (FP-UCB) Algorithm is given in Algorithm 1. Figure 1 gives an illustration of the episodes and time slots of the FP-UCB algorithm.
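
Since Algorithm 1 itself is not reproduced in this text, the following Python sketch is our reading of the FP-UCB episode structure, consistent with the quantities used in the analysis below: in each episode $k$, a candidate arm set $A_k$ is formed from the parameters whose known means all lie within the confidence radius $\sqrt{3\log k / n_j(t_k)}$ of the empirical means, and each arm in $A_k$ (or all of $\mathcal{A}$, if $A_k$ is empty) is played once. The instance `MU` is hypothetical.

```python
import math
import random

# Hypothetical instance (not from the paper): L = 3 arms, two parameters.
MU = {"theta1": [0.9, 0.5, 0.4], "theta2": [0.2, 0.8, 0.3]}

def a_star(theta):
    """Optimal arm under parameter theta (assumed unique)."""
    means = MU[theta]
    return max(range(len(means)), key=lambda i: means[i])

def fp_ucb(theta_o, T, seed=0):
    """Sketch of an FP-UCB-style loop, returning the pull counts n_i(T)."""
    rng = random.Random(seed)
    arms = sorted({a_star(th) for th in MU})   # candidate arm set A
    n = {i: 0 for i in arms}                   # pull counts n_i(t)
    s = {i: 0.0 for i in arms}                 # reward sums
    t = 0

    def play(i):
        nonlocal t
        n[i] += 1
        s[i] += 1.0 if rng.random() < MU[theta_o][i] else 0.0  # Bernoulli reward
        t += 1

    for i in arms:                             # initialization: each arm in A once
        play(i)
    k = 0                                      # episode index
    while t < T:
        k += 1
        # confidence radius sqrt(3 log k / n_j); k+1 keeps the log positive here
        rad = {i: math.sqrt(3 * math.log(k + 1) / n[i]) for i in arms}
        consistent = [th for th in MU
                      if all(abs(s[i] / n[i] - MU[th][i]) < rad[i] for i in arms)]
        A_k = sorted({a_star(th) for th in consistent})
        for i in (A_k or arms):                # if A_k is empty, re-explore all of A
            if t >= T:
                break
            play(i)
    return n
```

With these (hypothetical) means, the suboptimal candidate arm is dropped from the episodes after a short transient, so almost all pulls go to the true optimal arm.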

For stating the main result, we introduce a few more notations. We define the confusion set $B(\theta^o)$ and the corresponding arm set $C(\theta^o)$ as,

$$\begin{aligned}
B(\theta^o) &:= \{\theta \in \Theta : a^*(\theta) \neq a^*(\theta^o) \ \text{and}\ \mu_{a^*(\theta^o)}(\theta^o) = \mu_{a^*(\theta^o)}(\theta)\}, \\
C(\theta^o) &:= \{a^*(\theta) : \theta \in B(\theta^o)\}.
\end{aligned}$$

Intuitively, $B(\theta^o)$ is the set of parameters that can be confused with the true parameter $\theta^o$. If $B(\theta^o)$ is non-empty, selecting the arm $a^*(\theta^o)$ and estimating its empirical mean is not sufficient to identify the true parameter, because the same mean reward can result from other parameters in $B(\theta^o)$. So, if $B(\theta^o)$ is non-empty, more exploration (i.e., selecting suboptimal actions other than $a^*(\theta^o)$) is necessary to identify the true parameter. This exploration will contribute to the regret. On the other hand, if $B(\theta^o)$ is empty, the optimal parameter can be identified with much less exploration, which results in a bounded regret. $C(\theta^o)$ is the corresponding set of arms that need to be explored sufficiently for identifying the optimal parameter.
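
A small sketch of how the confusion sets can be computed from the known mean table. The parameter values are hypothetical, chosen so that one parameter is confusing (same mean on the optimal arm, different optimal arm) and another is not.

```python
# Hypothetical parameter set (for illustration only). Note that mu_0 = 0.9 under
# both theta1 and theta2, but their optimal arms differ.
MU = {"theta1": [0.9, 0.5, 0.4],
      "theta2": [0.9, 0.95, 0.3],
      "theta3": [0.2, 0.1, 0.7]}

def a_star(theta):
    means = MU[theta]
    return max(range(len(means)), key=lambda i: means[i])

def confusion_sets(theta_o):
    """B(theta_o): parameters with a different optimal arm but the same mean
    on arm a*(theta_o). C(theta_o): the optimal arms of those parameters."""
    a_o = a_star(theta_o)
    B = [th for th in MU
         if a_star(th) != a_o and MU[th][a_o] == MU[theta_o][a_o]]
    C = sorted({a_star(th) for th in B})
    return B, C
```

Here `theta2` confuses with `theta1` (so a logarithmic regret is expected), while `theta3` has an empty confusion set (bounded regret).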

We make the following assumption.

###### Assumption 1 (Unique best action).

For all $\theta \in \Theta$, the optimal action $a^*(\theta)$ is unique.

We note that this is a standard assumption in the literature. This assumption can be removed at the expense of more notation. We define $\Delta_i$ as,

$$\Delta_i := \mu_{a^*(\theta^o)}(\theta^o) - \mu_i(\theta^o), \qquad (3)$$

which is the difference between the mean value of the optimal arm and the mean value of arm $i$ under the true parameter $\theta^o$. This is the standard optimality gap notion used in the MAB literature [4]. Without loss of generality, all logarithms are natural logarithms.

For each arm $i$ in $C(\theta^o)$, we define,

$$\beta_i := \min_{\theta:\ \theta \in B(\theta^o),\ a^*(\theta) = i} |\mu_i(\theta^o) - \mu_i(\theta)|. \qquad (4)$$

We use the following lemma to compare our result with the classical MAB results. The proof of this lemma is given in the appendix.

###### Lemma 1.

Let $\Delta_i$ and $\beta_i$ be as defined in (3) and (4), respectively. Then, for each $i \in C(\theta^o)$, $\beta_i \geq \Delta_i$. Moreover, $\Delta_i/\beta_i^2 \leq 1/\Delta_i$ for each $i \in C(\theta^o)$.
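
On a hypothetical instance, one can numerically sanity-check the comparison that Lemma 1 provides ($\beta_i \geq \Delta_i$, hence $\Delta_i/\beta_i^2 \leq 1/\Delta_i$); the mean values below are illustrative only.

```python
# Hypothetical instance (not from the paper); theta2 confuses with theta1 on arm 0.
MU = {"theta1": [0.9, 0.5, 0.4],
      "theta2": [0.9, 0.95, 0.3]}

def a_star(th):
    m = MU[th]
    return max(range(len(m)), key=lambda i: m[i])

theta_o = "theta1"
a_o = a_star(theta_o)                                   # arm 0
B = [th for th in MU
     if a_star(th) != a_o and MU[th][a_o] == MU[theta_o][a_o]]
C = sorted({a_star(th) for th in B})                    # here: [1]

# Delta_i from (3) and beta_i from (4), for each i in C(theta_o):
Delta = {i: MU[theta_o][a_o] - MU[theta_o][i] for i in C}
beta = {i: min(abs(MU[theta_o][i] - MU[th][i]) for th in B if a_star(th) == i)
        for i in C}
# Lemma 1's comparison: beta_i >= Delta_i, hence Delta_i/beta_i^2 <= 1/Delta_i.
```

The intuition matches the lemma's proof idea: a confusing $\theta$ must give arm $i$ a mean at least as large as the true optimal mean, which is $\Delta_i$ above $\mu_i(\theta^o)$.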

We now present the finite time performance guarantee for our FP-UCB algorithm.

###### Theorem 1.

Under the FP-UCB algorithm,

$$\mathbb{E}[R(T)] \leq D_1, \quad \text{if } B(\theta^o) \text{ is empty, and} \qquad (5)$$
$$\mathbb{E}[R(T)] \leq D_2 + 12\log(T) \sum_{i \in C(\theta^o)} \frac{\Delta_i}{\beta_i^2}, \quad \text{if } B(\theta^o) \text{ is non-empty,} \qquad (6)$$

where $D_1$ and $D_2$ are problem-dependent constants that do not depend on $T$.

###### Remark 1 (Comparison with the classical MAB results).

Both UCB-type algorithms and Thompson Sampling-type algorithms give a problem-dependent regret bound of $O(\log T)$. More precisely, assuming that the optimal arm is arm 1, the regret of the UCB algorithm, $\mathbb{E}[R_{\text{UCB}}(T)]$, is given by [4]

$$\mathbb{E}[R_{\text{UCB}}(T)] = O\Big(\sum_{i=2}^{L} \frac{1}{\Delta_i}\log T\Big).$$

On the other hand, the FP-UCB algorithm achieves the regret

$$\mathbb{E}[R_{\text{FP-UCB}}(T)] = O(1), \quad \text{if } B(\theta^o) \text{ is empty, and}$$
$$\mathbb{E}[R_{\text{FP-UCB}}(T)] = O\Big(\sum_{i \in C(\theta^o)} \frac{\Delta_i}{\beta_i^2}\log T\Big), \quad \text{if } B(\theta^o) \text{ is non-empty.}$$

Clearly, for some MAB problems, the FP-UCB algorithm achieves a bounded regret ($O(1)$) as opposed to the increasing regret ($O(\log T)$) of the standard UCB algorithm. Even in the cases where the FP-UCB algorithm incurs an increasing regret ($O(\log T)$), the preceding constant ($\sum_{i \in C(\theta^o)} \Delta_i/\beta_i^2$) is smaller than the preceding constant ($\sum_{i=2}^{L} 1/\Delta_i$) of the standard UCB algorithm, because $C(\theta^o) \subseteq \{2, \ldots, L\}$ and $\Delta_i/\beta_i^2 \leq 1/\Delta_i$ by Lemma 1.

We now give the asymptotic lower bound for the finitely parameterized multi-armed bandits problem from [18], in order to compare it with the performance of our FP-UCB algorithm.

###### Theorem 2 (Lower bound [18]).

For any uniformly good control scheme under the parameter $\theta^o$,

$$\liminf_{T\to\infty} \frac{\mathbb{E}[R(T)]}{\log(T)} \geq \max_{\theta \in B(\theta^o)} \frac{\mu_{a^*(\theta^o)}(\theta^o) - \mu_{a^*(\theta)}(\theta^o)}{D_{a^*(\theta)}(\theta^o \,\|\, \theta)},$$

where $D_{a^*(\theta)}(\theta^o \,\|\, \theta)$ is the KL-divergence between the distributions $P_{a^*(\theta)}(\theta^o)$ and $P_{a^*(\theta)}(\theta)$.

###### Remark 2 (Optimality of the FP-UCB algorithm).

From Theorem 2, the achievable regret of any multi-armed bandits learning algorithm is lower bounded by a constant when $B(\theta^o)$ is empty, and by $\Omega(\log T)$ when $B(\theta^o)$ is non-empty. Our FP-UCB algorithm achieves these bounds and hence achieves the order-optimal performance.

## IV Analysis of FP-UCB Algorithm

In this section, we give the proof of Theorem 1. To reduce notation, without loss of generality, we assume that the true optimal arm is arm 1, i.e., $a^*(\theta^o) = 1$. We will also denote $\mu_i(\theta^o)$ as $\mu_i^o$, for any $i$.

Now, we can rewrite the expected regret from (1) as

$$\mathbb{E}[R(T)] = \mathbb{E}\Big[\sum_{t=1}^{T} (\mu_1^o - \mu_{a(t)}^o)\Big] = \sum_{i=2}^{L} \Delta_i\, \mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{1}\{a(t)=i\}\Big] = \sum_{i=2}^{L} \Delta_i\, \mathbb{E}[n_i(T)].$$

Since the algorithm selects arms only from the set $\mathcal{A}$, this can be written as

$$\mathbb{E}[R(T)] = \sum_{i \in \mathcal{A}} \Delta_i\, \mathbb{E}[n_i(T)]. \qquad (7)$$

We first prove the following important propositions.

###### Proposition 1.

For all $i \in \mathcal{A} \setminus C(\theta^o)$, $i \neq 1$, under the FP-UCB algorithm,

$$\mathbb{E}[n_i(T)] \leq C_i, \qquad (8)$$

where $C_i$ is a problem-dependent constant that does not depend on $T$.

###### Proof.

Consider an arm $i \in \mathcal{A} \setminus C(\theta^o)$, $i \neq 1$. Then, by definition, there exists a $\theta \in \Theta$ such that $a^*(\theta) = i$. Fix a $\theta$ which satisfies this condition. Define

$$\alpha_1(\theta) := |\mu_1(\theta^o) - \mu_1(\theta)|.$$

It is straightforward to note that when $i \notin C(\theta^o)$, the $\theta$ which we considered above is not in $B(\theta^o)$. Hence, by definition, $\alpha_1(\theta) > 0$.

For notational convenience, we will denote $\mu_j(\theta)$ simply as $\mu_j$, for any $j$. Notice that the algorithm picks each arm in $\mathcal{A}$ once during initialization. Define $K_T$ (note that this is a random variable) to be the total number of episodes in the time horizon $T$ for the FP-UCB algorithm. It is straightforward that $K_T \leq T$. Now,

$$\begin{aligned}
\mathbb{E}[n_i(T)] &= 1 + \mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i\}\Big] \\
&\stackrel{(a)}{=} 1 + \mathbb{E}\Big[\sum_{k=1}^{K_T} \big(\mathbb{1}\{i \in A_k\} + \mathbb{1}\{A_k = \emptyset\}\big)\Big] \\
&\leq 1 + \sum_{k=1}^{T} \big(\mathbb{P}(\{i \in A_k\}) + \mathbb{P}(\{A_k = \emptyset\})\big) \qquad (9)\\
&= 1 + \sum_{k=1}^{T} \big(\mathbb{P}(\{i \in A_k, 1 \in A_k\}) + \mathbb{P}(\{i \in A_k, 1 \notin A_k\}) + \mathbb{P}(\{A_k = \emptyset\})\big) \\
&\leq 1 + \sum_{k=1}^{T} \big(\mathbb{P}(\{i \in A_k, 1 \in A_k\}) + \mathbb{P}(\{i \in A_k, 1 \notin A_k\}) + \mathbb{P}(\{i \notin A_k, 1 \notin A_k\})\big) \\
&\leq 1 + \sum_{k=1}^{T} \big(\mathbb{P}(\{i \in A_k, 1 \in A_k\}) + \mathbb{P}(\{1 \notin A_k\})\big). \qquad (10)
\end{aligned}$$

Here (a) follows from the algorithm definition.

We will first analyze the second summation term in (10). First observe that each arm $j \in \mathcal{A}$ is selected once during initialization and at most once per episode. Thus, $n_j(t_k)$ lies between $1$ and $k$, for any $j \in \mathcal{A}$ and episode $k$. Now,

$$\begin{aligned}
\sum_{k=1}^{T} \mathbb{P}(\{1 \notin A_k\}) &\stackrel{(b)}{\leq} \sum_{k=1}^{T} \sum_{j \in \mathcal{A}} \mathbb{P}\Big(|\hat{\mu}_j(t_k) - \mu_j^o| > \sqrt{\tfrac{3\log k}{n_j(t_k)}}\Big) \\
&\stackrel{(c)}{=} \sum_{k=1}^{T} \sum_{j \in \mathcal{A}} \mathbb{P}\Big(\Big|\tfrac{1}{n_j(t_k)}\sum_{\tau=1}^{n_j(t_k)} X_j(\tau) - \mu_j^o\Big| > \sqrt{\tfrac{3\log k}{n_j(t_k)}}\Big) \\
&\stackrel{(d)}{\leq} \sum_{k=1}^{T} \sum_{j \in \mathcal{A}} \sum_{m=1}^{k} \mathbb{P}\Big(\Big|\tfrac{1}{m}\sum_{\tau=1}^{m} X_j(\tau) - \mu_j^o\Big| > \sqrt{\tfrac{3\log k}{m}}\Big) \\
&\stackrel{(e)}{\leq} \sum_{k=1}^{T} \sum_{j \in \mathcal{A}} \sum_{m=1}^{k} 2\exp\Big(-2m\,\tfrac{3\log k}{m}\Big) = \sum_{k=1}^{T} \sum_{j \in \mathcal{A}} 2k^{-5} \leq 4|\mathcal{A}|. \qquad (11)
\end{aligned}$$

Here (b) follows from the algorithm definition and the union bound, and (c) from the definition in (2). Inequality (d) follows by conditioning on the random variable $n_j(t_k)$, which lies between $1$ and $k$ for any $j \in \mathcal{A}$ and episode $k$. Inequality (e) follows from Hoeffding’s inequality [29, Theorem 2.2.6].
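
The two numerical facts behind (11) are easy to sanity-check: the Hoeffding bound with $\epsilon^2 = 3\log k/m$ gives $2\exp(-6\log k) = 2k^{-6}$ independent of $m$, and the resulting per-arm series $\sum_k 2k^{-5}$ stays below the stated bound of 4.

```python
import math

# Hoeffding step (e): with epsilon^2 = 3*log(k)/m, the exponent is m-independent.
k, m = 7, 3
hoeffding = 2 * math.exp(-2 * m * (3 * math.log(k) / m))  # equals 2 * k**-6

# Summing over m = 1..k gives 2*k**-5 per arm; the series converges well below 4
# (its limit is 2*zeta(5), roughly 2.074).
partial = sum(2 * k ** -5 for k in range(1, 100001))
```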

For analyzing the first summation term in (10), define the event $E_k := \{n_1(t_k) < 12\log(k)/\alpha_1^2(\theta)\}$. Denote the complement of this event as $E_k^c$. Now the first summation term in (10) can be written as

$$\begin{aligned}
\sum_{k=1}^{T} \mathbb{P}(\{i \in A_k, 1 \in A_k\}) &= \sum_{k=1}^{T} \mathbb{P}(\{i \in A_k, 1 \in A_k, E_k^c\}) \qquad (12)\\
&\quad + \sum_{k=1}^{T} \mathbb{P}(\{i \in A_k, 1 \in A_k, E_k\}). \qquad (13)
\end{aligned}$$

Analyzing (12), we get,

$$\begin{aligned}
\mathbb{P}(\{i \in A_k, 1 \in A_k, E_k^c\}) &= \mathbb{P}\Big(\bigcap_{j \in \mathcal{A}}\Big\{|\hat{\mu}_j(t_k) - \mu_j^o| < \sqrt{\tfrac{3\log k}{n_j(t_k)}}\Big\},\ \bigcap_{j \in \mathcal{A}}\Big\{|\hat{\mu}_j(t_k) - \mu_j| < \sqrt{\tfrac{3\log k}{n_j(t_k)}}\Big\},\ E_k^c\Big) \\
&\leq \mathbb{P}\Big(\Big\{|\hat{\mu}_1(t_k) - \mu_1^o| < \sqrt{\tfrac{3\log k}{n_1(t_k)}}\Big\},\ \Big\{|\hat{\mu}_1(t_k) - \mu_1| < \sqrt{\tfrac{3\log k}{n_1(t_k)}}\Big\},\ E_k^c\Big) = 0. \qquad (14)
\end{aligned}$$

This is because the events $\{|\hat{\mu}_1(t_k) - \mu_1^o| < \sqrt{3\log k/n_1(t_k)}\}$ and $\{|\hat{\mu}_1(t_k) - \mu_1| < \sqrt{3\log k/n_1(t_k)}\}$ are disjoint under $E_k^c$, that is, when $n_1(t_k) \geq 12\log(k)/\alpha_1^2(\theta)$. To see this, notice that

$$\begin{aligned}
\Big\{|\hat{\mu}_1(t_k) - \mu_1^o| < \sqrt{\tfrac{3\log k}{n_1(t_k)}}\Big\} &\subseteq \Big\{|\hat{\mu}_1(t_k) - \mu_1^o| < \tfrac{\alpha_1(\theta)}{2}\Big\}, \\
\Big\{|\hat{\mu}_1(t_k) - \mu_1| < \sqrt{\tfrac{3\log k}{n_1(t_k)}}\Big\} &\subseteq \Big\{|\hat{\mu}_1(t_k) - \mu_1| < \tfrac{\alpha_1(\theta)}{2}\Big\},
\end{aligned}$$

for $n_1(t_k) \geq 12\log(k)/\alpha_1^2(\theta)$. Moreover, since $|\mu_1^o - \mu_1| = \alpha_1(\theta)$, the sets $\{|\hat{\mu}_1(t_k) - \mu_1^o| < \alpha_1(\theta)/2\}$ and $\{|\hat{\mu}_1(t_k) - \mu_1| < \alpha_1(\theta)/2\}$ are disjoint. Hence, their subsets are also disjoint.

For analyzing (13), define $n'_1(t_k) := \sum_{\tau=1}^{k-1} \mathbb{1}\{1 \in A_\tau\}$, the number of episodes before episode $k$ in which arm 1 was in the candidate set. Note that, according to the FP-UCB algorithm, arm 1 can be selected if $A_k$ is empty as well, so $n_1(t_k) \geq n'_1(t_k)$. Define $k_i(\theta)$ and $m(k)$ as,

$$k_i(\theta) := \min\{k : k \geq 3,\ k > \lceil 12\log(k)/\alpha_1^2(\theta) \rceil\}, \qquad (15)$$
$$m(k) := \max\{1,\ k - \lceil 12\log(k)/\alpha_1^2(\theta) \rceil\}. \qquad (16)$$

Note that $k_i(\theta)$ is a problem-dependent constant and does not depend on $T$. Also, $m(k) \geq 1$ for all $k$. We claim that for all $k \geq k_i(\theta)$,

$$\{n'_1(t_k) < 12\log(k)/\alpha_1^2(\theta)\} \subseteq \{1 \notin A_\tau,\ \text{for some } \tau,\ m(k) \leq \tau \leq k-1\}. \qquad (17)$$

To see this, suppose there exists no $\tau$, $m(k) \leq \tau \leq k-1$, such that $1 \notin A_\tau$. Then, $1 \in A_\tau$ for all $\tau$ with $m(k) \leq \tau \leq k-1$. So, by definition, $n'_1(t_k) \geq k - m(k) = \lceil 12\log(k)/\alpha_1^2(\theta) \rceil$ for $k \geq k_i(\theta)$. So, the complement of the RHS of (17) is a subset of the complement of the LHS of (17). Hence the claim follows.
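
The constants in (15) and (16) are straightforward to compute numerically; a small sketch (with a hypothetical gap $\alpha_1(\theta) = 0.5$) verifying that $k_i(\theta)$ is finite and that the window $[m(k), k-1]$ in claim (17) is well-defined:

```python
import math

def k_i(alpha):
    """Smallest k >= 3 with k > ceil(12*log(k)/alpha**2), as in (15)."""
    k = 3
    while k <= math.ceil(12 * math.log(k) / alpha ** 2):
        k += 1
    return k

def m(k, alpha):
    """max(1, k - ceil(12*log(k)/alpha**2)), as in (16)."""
    return max(1, k - math.ceil(12 * math.log(k) / alpha ** 2))

alpha = 0.5          # hypothetical gap alpha_1(theta)
k0 = k_i(alpha)      # finite, since k grows linearly while log(k) grows slowly
```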

Now,

$$\begin{aligned}
\sum_{k=1}^{T} \mathbb{P}(\{i \in A_k, 1 \in A_k, E_k\}) &\stackrel{(a)}{\leq} \sum_{k=1}^{T} \mathbb{P}\big(n'_1(t_k) < 12\log(k)/\alpha_1^2(\theta)\big) \\
&\stackrel{(b)}{\leq} k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \mathbb{P}\big(n'_1(t_k) < 12\log(k)/\alpha_1^2(\theta)\big) \\
&\stackrel{(c)}{\leq} k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \mathbb{P}\big(\{1 \notin A_\tau,\ \text{for some } \tau,\ m(k) \leq \tau \leq k-1\}\big) \\
&\stackrel{(d)}{=} k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \mathbb{P}\Big(\bigcup_{\tau=m(k)}^{k-1} \bigcup_{j \in \mathcal{A}} \Big\{|\hat{\mu}_j(t_\tau) - \mu_j^o| > \sqrt{\tfrac{3\log\tau}{n_j(t_\tau)}}\Big\}\Big) \\
&\leq k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \sum_{\tau=m(k)}^{k-1} \sum_{j \in \mathcal{A}} \mathbb{P}\Big(|\hat{\mu}_j(t_\tau) - \mu_j^o| > \sqrt{\tfrac{3\log\tau}{n_j(t_\tau)}}\Big) \\
&\stackrel{(e)}{\leq} k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \sum_{\tau=m(k)}^{k-1} \frac{2|\mathcal{A}|}{\tau^5} \qquad (18)\\
&\leq k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \frac{2|\mathcal{A}|\,k}{(m(k))^5} \leq k_i(\theta) + \sum_{k=k_i(\theta)}^{T} \frac{2|\mathcal{A}|\,k}{\big(k - \lceil 12\log(k)/\alpha_1^2(\theta) \rceil\big)^5} \stackrel{(f)}{\leq} k_i(\theta) + K_i(\theta), \qquad (19)
\end{aligned}$$

where $K_i(\theta)$ is a problem-dependent constant that does not depend on $T$.

In the above analysis, (a) follows from the definition of $E_k$ and the observation that $n_1(t_k) \geq n'_1(t_k)$. Considering $T$ to be greater than $k_i(\theta)$, inequality (b) follows; note that this is an artifact of the proof technique and does not affect the theorem statement, since the regret $R(T)$, for any $T$ less than $k_i(\theta)$, can be trivially upper bounded by a constant. Inequality (c) follows from (17), (d) by the FP-UCB algorithm, (e) is similar to the analysis in (11), and (f) follows from the fact that $k/(k - \lceil 12\log(k)/\alpha_1^2(\theta) \rceil)^5 = O(k^{-4})$, which is summable over $k \geq k_i(\theta)$.

Now, using (19) and (14) in (12) and (13), we get,

$$\sum_{k=1}^{T} \mathbb{P}(\{i \in A_k, 1 \in A_k\}) \leq k_i(\theta) + K_i(\theta). \qquad (20)$$

Using (20) and (11) in (10), we get,

$$\mathbb{E}[n_i(T)] \leq C_i,$$

where $C_i := 1 + 4|\mathcal{A}| + k_i(\theta) + K_i(\theta)$, which is a problem-dependent constant that does not depend on $T$. This concludes the proof. ∎

###### Proposition 2.

For any $i \in C(\theta^o)$, under the FP-UCB algorithm,

$$\mathbb{E}[n_i(T)] \leq 2 + 4|\mathcal{A}| + \frac{12\log(T)}{\beta_i^2}. \qquad (21)$$
###### Proof.

Fix an $i \in C(\theta^o)$. Then there exists a $\theta \in B(\theta^o)$ such that $a^*(\theta) = i$. Fix a $\theta$ which satisfies this condition. Define the event $F(t) := \{n_i(t-1) < 12\log T/\beta_i^2\}$. Now,

$$\begin{aligned}
\mathbb{E}[n_i(T)] &= 1 + \mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i\}\Big] \\
&= 1 + \mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i, F(t)\}\Big] + \mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i, F^c(t)\}\Big]. \qquad (22)
\end{aligned}$$

Analyzing the first summation term in (22) we get,

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i, F(t)\}\Big] &= \mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i\}\,\mathbb{1}\{n_i(t-1) < 12\log T/\beta_i^2\}\Big] \\
&\leq 1 + 12\log T/\beta_i^2. \qquad (23)
\end{aligned}$$

We use the same decomposition as in the proof of Proposition 1 for the second summation term in (22). Thus we get,

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=|\mathcal{A}|+1}^{T} \mathbb{1}\{a(t)=i, F^c(t)\}\Big] &= \mathbb{E}\Big[\sum_{k=1}^{K_T} \big(\mathbb{1}\{i \in A_k, F^c(t_{k+1})\} + \mathbb{1}\{A_k = \emptyset, F^c(t_{k+1})\}\big)\Big] \\
&\leq \sum_{k=1}^{T} \mathbb{P}(\{i \in A_k, 1 \in A_k, F^c(t_{k+1})\}) \qquad (24)\\
&\quad + \sum_{k=1}^{T} \mathbb{P}(\{1 \notin A_k, F^c(t_{k+1})\}), \qquad (25)
\end{aligned}$$

following the analysis in (10). First, consider (25). From the analysis in (11) we have

$$\sum_{k=1}^{T} \mathbb{P}(\{1 \notin A_k, F^c(t_{k+1})\}) \leq \sum_{k=1}^{T} \mathbb{P}(\{1 \notin A_k\}) \leq 4|\mathcal{A}|. \qquad (26)$$

For any episode $k$, under the event $F^c(t_{k+1})$, we have

$$n_i(t_k) \geq \frac{12\log T}{\beta_i^2} \geq \frac{12\log t_k}{\beta_i^2} \geq \frac{12\log k}{\beta_i^2},$$

since $t_k$ satisfies $k \leq t_k \leq T$. From (4), it further follows that

$$\sqrt{\frac{3\log k}{n_i(t_k)}} \leq \frac{\beta_i}{2} \leq \frac{|\mu_i(\theta^o) - \mu_i(\theta)|}{2}.$$

So, following the analysis in (14) for (24), we get

$$\begin{aligned}
\mathbb{P}(\{i \in A_k, 1 \in A_k, F^c(t_{k+1})\}) &= \mathbb{P}\Big(\bigcap_{j \in \mathcal{A}}\Big\{|\hat{\mu}_j(t_k) - \mu_j(\theta^o)| < \sqrt{\tfrac{3\log k}{n_j(t_k)}}\Big\},\ \bigcap_{j \in \mathcal{A}}\Big\{|\hat{\mu}_j(t_k) - \mu_j(\theta)| < \sqrt{\tfrac{3\log k}{n_j(t_k)}}\Big\},\ F^c(t_{k+1})\Big) \\
&\leq \mathbb{P}\Big(\Big\{|\hat{\mu}_i(t_k) - \mu_i(\theta^o)| < \sqrt{\tfrac{3\log k}{n_i(t_k)}}\Big\},\ \Big\{|\hat{\mu}_i(t_k) - \mu_i(\theta)| < \sqrt{\tfrac{3\log k}{n_i(t_k)}}\Big\},\ F^c(t_{k+1})\Big) = 0. \qquad (27)
\end{aligned}$$

Using equations (23), (26), and (27) in (22), we get

$$\mathbb{E}[n_i(T)] \leq 2 + 4|\mathcal{A}| + \frac{12\log(T)}{\beta_i^2}.$$

This completes the proof. ∎

We now give the proof of our main theorem.

###### Proof (of Theorem 1).

From (7),

$$\mathbb{E}[R(T)] = \sum_{i \in \mathcal{A}} \Delta_i\, \mathbb{E}[n_i(T)] = \sum_{i \in \mathcal{A} \setminus C(\theta^o)} \Delta_i\, \mathbb{E}[n_i(T)] + \sum_{i \in C(\theta^o)} \Delta_i\, \mathbb{E}[n_i(T)]. \qquad (28)$$

Whenever $B(\theta^o)$ is empty, notice that $C(\theta^o)$ is empty. So, using Proposition 1, (28) becomes

$$\mathbb{E}[R(T)] = \sum_{i \in \mathcal{A}} \Delta_i\, \mathbb{E}[n_i(T)] \leq \sum_{i \in \mathcal{A}} \Delta_i C_i \leq |\mathcal{A}| \max_{i \in \mathcal{A}} \Delta_i C_i.$$

Whenever $B(\theta^o)$ is non-empty, $C(\theta^o)$ is non-empty. Analyzing (28), we get,

$$\begin{aligned}
\mathbb{E}[R(T)] &= \sum_{i \in \mathcal{A} \setminus C(\theta^o)} \Delta_i\, \mathbb{E}[n_i(T)] + \sum_{i \in C(\theta^o)} \Delta_i\, \mathbb{E}[n_i(T)] \\
&\stackrel{(a)}{\leq} \sum_{i \in \mathcal{A} \setminus C(\theta^o)} \Delta_i C_i + \sum_{i \in C(\theta^o)} \Delta_i\, \mathbb{E}[n_i(T)] \\
&\stackrel{(b)}{\leq} \sum_{i \in \mathcal{A} \setminus C(\theta^o)} \Delta_i C_i + \sum_{i \in C(\theta^o)} \Delta_i\Big(2 + 4|\mathcal{A}| + \frac{12\log(T)}{\beta_i^2}\Big) \\
&\leq |\mathcal{A}| \max_{i \in \mathcal{A}} \Delta_i (2 + C_i + 4|\mathcal{A}|) + 12\log(T) \sum_{i \in C(\theta^o)} \frac{\Delta_i}{\beta_i^2}.
\end{aligned}$$

Here (a) follows from Proposition 1, and (b) from Proposition 2. Setting

$$D_1 := |\mathcal{A}| \max_{i \in \mathcal{A}} \Delta_i C_i, \qquad D_2 := |\mathcal{A}| \max_{i \in \mathcal{A}} \Delta_i (2 + C_i + 4|\mathcal{A}|)$$

completes the proof of the theorem. ∎