# Factored Bandits

Julian Zimmert
University of Copenhagen
zimmert@di.ku.dk
Yevgeny Seldin
University of Copenhagen
seldin@di.ku.dk
###### Abstract

We introduce the factored bandits model, which is a framework for learning with limited (bandit) feedback, where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits, together with upper and lower regret bounds for the problem that match up to constants. Furthermore, we show that with a slight modification the proposed algorithm can be applied to utility based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms are dominating up to time horizons that are exponential in the number of arms).

## 1 Introduction

We introduce factored bandits, which is a bandit learning model, where actions can be decomposed into a Cartesian product of atomic actions. As an example, consider an advertising task, where the actions can be decomposed into (1) selection of an advertisement from a pool of advertisements and (2) selection of a location on a web page out of a set of locations, where it can be presented. The probability of a click is then a function of the quality of the two actions, the attractiveness of the advertisement and visibility of the location it was placed at. In order to maximize the reward the learner has to maximize the quality of actions along each dimension of the problem. Factored bandits generalize the above example to an arbitrary number of atomic actions and arbitrary reward functions satisfying some mild assumptions.

In a nutshell, at every round of a factored bandit game the player selects $L$ atomic actions, $a^1, \dots, a^L$, each from a corresponding finite set $\mathcal{A}_\ell$ of possible actions. The player then observes a reward, which is an arbitrary function of $(a^1, \dots, a^L)$ satisfying some mild assumptions. For example, it can be a sum of the qualities of the atomic actions, a product of the qualities, or something else that does not necessarily need to have an analytical expression. The learner does not have to know the form of the reward function.
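For illustration, the round protocol can be sketched in a few lines of Python. All names and the product-form reward below are hypothetical; the learner never observes `mean_reward` itself, only the noisy feedback.

```python
import random

# A minimal sketch of the factored bandit protocol (hypothetical names).
# Two factors: an advertisement and a location on the page.
atomic_sets = [["ad1", "ad2", "ad3"], ["top", "side"]]

# Made-up product-form reward; the learner never observes this function.
quality = {"ad1": 0.9, "ad2": 0.5, "ad3": 0.2, "top": 0.8, "side": 0.4}

def mean_reward(action):
    r = 1.0
    for a in action:
        r *= quality[a]
    return r

def play_round(policy):
    # The player picks one atomic action per factor ...
    action = tuple(policy(s) for s in atomic_sets)
    # ... and observes the mean reward corrupted by 1-sub-Gaussian noise.
    return action, mean_reward(action) + random.gauss(0.0, 1.0)

action, reward = play_round(lambda s: random.choice(s))
```

Any reward function over the joint action could be substituted for the product form here; the protocol itself does not change.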

Our way of dealing with the combinatorial complexity of the problem is through the introduction of a uniform identifiability assumption, by which the best action along each dimension is uniquely identifiable. A bit more precisely, when looking at a given dimension we call the collection of actions along all other dimensions a reference set. The uniform identifiability assumption states that in expectation the best action along a dimension outperforms any other action along the same dimension by a certain margin when both are played with the same reference set, irrespective of the selection of the reference set. This assumption is satisfied, for example, by the reward structure in linear and generalized linear bandits, but it is much weaker than the linearity assumption.

In Figure 1 we sketch the relations between factored bandits and other bandit models. We distinguish between bandits with an explicit reward model, such as linear and generalized linear bandits, and bandits with a weakly constrained reward model, including factored bandits and some relaxations of combinatorial bandits. Rank-1 bandits [katariya2016stochastic] are a special case of factored bandits. In rank-1 bandits the player selects two actions and the reward is the product of their qualities. Factored bandits generalize this to an arbitrary number of actions and significantly relax the assumption on the form of the reward function.

The relation with other bandit models is a bit more involved. There is an overlap between factored bandits and linear, as well as generalized linear bandits [abbasi2011improved, filippi2010], but neither is a special case of the other. If we represent actions by unit vectors, then for (generalized) linear reward functions the models coincide. However, the (generalized) linear bandits allow a continuum of actions, whereas factored bandits relax the (generalized) linearity assumption on the reward structure to uniform identifiability.

There is a partial overlap between factored bandits and combinatorial bandits [cesa2012combinatorial]. The action set in combinatorial bandits is a subset of $\{0,1\}^d$. If the action set is unrestricted, i.e. $\mathcal{A} = \{0,1\}^d$, then combinatorial bandits can be seen as factored bandits with just two actions along each of the $d$ dimensions. However, typically in combinatorial bandits the action set is a strict subset of $\{0,1\}^d$ and one of the parameters of interest is the permitted number of non-zero elements. This setting is not covered by factored bandits. While in the classical combinatorial bandits setting the reward structure is linear, we mention that there exist relaxations of the model, e.g. [chen2016combinatorial].

Dueling bandits are not directly related to factored bandits and, therefore, we depict them with faded dashed blocks in Figure 1. While the action set in dueling bandits can be decomposed into a product of the basic action set with itself (one for the first and one for the second action in the duel), the observations in dueling bandits are the identities of the winners rather than rewards. Nevertheless, we show that the proposed algorithm for factored bandits can be applied to utility based dueling bandits.

The main contributions of the paper can be summarized as follows:

1. We introduce factored bandits and the uniform identifiability assumption.

2. Factored bandits with uniformly identifiable actions are a generalization of rank-1 bandits.

3. We provide an anytime algorithm for playing factored bandits under the uniform identifiability assumption in stochastic environments and analyze its regret. We also provide a lower bound matching up to constants.

4. Unlike the majority of bandit models, our approach does not require explicit specification or knowledge of the form of the reward function (as long as the uniform identifiability assumption is satisfied). For example, it can be a weighted sum of the qualities of atomic actions (as in linear bandits), a product thereof, or any other function not necessarily known to the algorithm.

5. We show that the algorithm can also be applied to utility based dueling bandits, where the additive factor in the regret bound is reduced by a multiplicative factor of $K$ compared to state-of-the-art (where $K$ is the number of actions). It should be emphasized that in state-of-the-art regret bounds for utility based dueling bandits the additive factor is dominating for time horizons that are exponential in $K$, whereas in the new result it is only dominant for time horizons up to a constant independent of $K$.

6. Our work provides a unified treatment of two distinct bandit models, rank-1 bandits and utility based dueling bandits.

The paper is organized in the following way. In Section 2 we introduce the factored bandit model and uniform identifiability assumption. In Section 3 we provide algorithms for factored bandits and dueling bandits. In Section 4 we analyze the regret of our algorithm and provide matching upper and lower regret bounds. In Section 5 we compare our work empirically and theoretically with prior work. We finish with a discussion in Section 6.

## 2 Problem setting

### 2.1 Factored bandits

We define the game in the following way. We assume that the set of actions $\mathcal{A}$ can be represented as a Cartesian product of atomic actions, $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_L$. We call the elements of $\mathcal{A}_\ell$ atomic arms. For rounds $t = 1, 2, \dots$ the player chooses an action $a_t \in \mathcal{A}$ and observes a reward $r_t$ drawn according to an unknown probability distribution (i.e., it is a “stochastic” game). We assume that the mean rewards $\mu(a) = \mathbb{E}[r_t \mid a_t = a]$ are bounded in $[0,1]$ and that the noise $\eta_t = r_t - \mu(a_t)$ is conditionally 1-sub-Gaussian. Formally, this means that

$$\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\left[e^{\lambda \eta_t} \,\middle|\, \mathcal{F}_{t-1}\right] \leq \exp\left(\frac{\lambda^2}{2}\right),$$

where $\mathcal{F}_t$ is the filtration defined by the history of the game up to and including round $t$. We denote the best action by $a^* = \arg\max_{a \in \mathcal{A}} \mu(a)$.

###### Definition 1 (uniform identifiability).

An atomic set $\mathcal{A}_\ell$ has a uniformly identifiable best arm $a^\ell_*$ if and only if

$$\forall a \in \mathcal{A}_\ell \setminus \{a^\ell_*\}: \quad \Delta_\ell(a) := \min_{b \in \prod_{j \neq \ell} \mathcal{A}_j} \mu(a^\ell_*, b) - \mu(a, b) > 0. \tag{1}$$

We assume that all atomic sets have uniformly identifiable best arms. The goal is to minimize the pseudo-regret (also called regret from now on), which is defined as $\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^T \mu(a^*) - \mu(a_t)\right]$. Due to the generality of the uniform identifiability assumption we cannot upper bound the instantaneous regret in terms of the gaps $\Delta_\ell(a^\ell)$. However, a sequential application of (1) provides a lower bound

$$\mu(a^*) - \mu(a) = \mu(a^*) - \mu(a^1, a^2_*, \dots, a^L_*) + \mu(a^1, a^2_*, \dots, a^L_*) - \mu(a) \geq \Delta_1(a^1) + \mu(a^1, a^2_*, \dots, a^L_*) - \mu(a) \geq \dots \geq \sum_{\ell=1}^{L} \Delta_\ell(a^\ell). \tag{2}$$

For the upper bound let $\kappa$ be a problem dependent constant, such that $\mu(a^*) - \mu(a) \leq \kappa \sum_{\ell=1}^L \Delta_\ell(a^\ell)$ holds for all $a$. Since the mean rewards are in $[0,1]$, the condition is always satisfied by $\kappa = (\min_{\ell,\, a^\ell \neq a^\ell_*} \Delta_\ell(a^\ell))^{-1}$, and by equation (2) $\kappa$ is always at least 1. The constant $\kappa$ appears in the regret bounds. In the extreme case when $\kappa$ is large, the regret guarantees are fairly weak. However, in many specific cases mentioned in the previous section, $\kappa$ is typically small or even equal to 1. We emphasize that the algorithms proposed in the paper do not require knowledge of $\kappa$. Thus, the dependence of the regret bounds on $\kappa$ is not a limitation and the algorithms automatically adapt to more favorable environments.
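The decomposition (2) can be checked numerically on a toy two-factor product reward. The quality tables below are hypothetical; `gap` computes the worst-case margin of the best atomic arm over all reference sets, exactly as in Definition 1.

```python
from itertools import product

# Toy product reward mu(a) = q1(a1) * q2(a2) (hypothetical qualities).
q = [{"x": 0.9, "y": 0.5, "z": 0.1}, {"u": 0.8, "v": 0.3}]

def mu(action):
    return q[0][action[0]] * q[1][action[1]]

best = tuple(max(ql, key=ql.get) for ql in q)  # ("x", "u")

def gap(l, arm):
    # Delta_l(arm): margin of the best atomic arm over `arm`,
    # minimized over all reference sets b for the other factors.
    others = [list(ql) for i, ql in enumerate(q) if i != l]
    margins = []
    for b in product(*others):
        def full(x):
            ref = list(b)
            ref.insert(l, x)
            return tuple(ref)
        margins.append(mu(full(best[l])) - mu(full(arm)))
    return min(margins)

# Every action's regret dominates the sum of its atomic gaps, as in (2).
for a in product(*(list(ql) for ql in q)):
    assert mu(best) - mu(a) >= sum(gap(l, a[l]) for l in range(2)) - 1e-12
```

On this instance the inequality is strict for most actions, which is exactly why the constant $\kappa$ is needed for the matching upper bound.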

### 2.2 Dueling bandits

The set of actions in dueling bandits is factored into $\mathcal{A} \times \mathcal{A}$ (one copy for the first and one for the second action in the duel). However, strictly speaking the problem is not a factored bandit problem, because the observations in dueling bandits are not the rewards.¹ When playing two arms, $a$ and $b$, we observe the identity of the winning arm, but the regret is typically defined via the average relative quality of $a$ and $b$ with respect to a “best” arm in $\mathcal{A}$.

¹In principle, it is possible to formulate a more general problem that would incorporate both factored bandits and dueling bandits. But such a definition becomes too general and hard to work with. For the sake of clarity we have avoided this path.

The literature distinguishes between different dueling bandit settings. We focus on utility based dueling bandits [yue2009interactively] and show that they satisfy the uniform identifiability assumption.

In utility based dueling bandits, it is assumed that each arm $a$ has a utility $u(a)$ and that the winning probabilities are defined by $\mathbb{P}(a \text{ wins against } b) = \phi(u(a) - u(b))$ for a monotonously increasing link function $\phi$. Let $w(a, b)$ be 1 if $a$ wins against $b$ and 0 if $b$ wins against $a$. Assume that there exists a best arm $a^*$. Then for any arm $b$ and any $a \neq a^*$, it holds that $\mathbb{E}[w(a^*, b)] - \mathbb{E}[w(a, b)] = \phi(u(a^*) - u(b)) - \phi(u(a) - u(b)) > 0$, which satisfies the uniform identifiability assumption.
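The argument can be illustrated with a logistic link function (an assumption made here for concreteness; any monotonously increasing link works) and hypothetical utilities:

```python
import math

# Hypothetical utilities for three arms.
utility = {"a1": 1.2, "a2": 0.7, "a3": 0.1}

def win_prob(a, b):
    # P(a wins the duel against b) = link(u(a) - u(b)), logistic link.
    return 1.0 / (1.0 + math.exp(-(utility[a] - utility[b])))

best = max(utility, key=utility.get)

# The best arm beats every other arm by a positive margin, whatever
# the reference arm b: exactly the uniform identifiability condition.
for a in utility:
    if a != best:
        assert min(win_prob(best, b) - win_prob(a, b) for b in utility) > 0
```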

## 3 Algorithms

Although in theory an asymptotically optimal algorithm for any structured bandit problem was presented in [combes2017minimal], for factored bandits this algorithm does not only require solving an intractable semi-infinite linear program at every round, but it also suffers from additive constants which are exponential in the number of atomic actions. An alternative naive approach could be an adaptation of sparring [yue2012k], where each factor runs an independent $K$-armed bandit algorithm and does not observe the atomic arm choices of the other factors. The downside of sparring algorithms, both theoretically and practically, is that the mean reward of any atomic arm is non-stationary: it depends on the choices of the other factors, which change over time.

Our Temporary Elimination Algorithm (TEA, Algorithm 1) avoids these downsides. It runs independent instances of the Temporary Elimination Module (TEM, Algorithm 3) in parallel, one for each factor of the problem. Each TEM operates on a single atomic set. The TEA is responsible for synchronization of the TEM instances. Two main ingredients ensure stochastic efficiency. First, we use relative comparisons between arms instead of comparing absolute mean rewards. This cancels out the effect of non-stationary means. The second idea is to use local randomization in order to obtain unbiased estimates of the relative performance without having to actually play each atomic arm with the same reference, which would have led to prohibitive time complexity.

The TEM algorithm runs in externally synchronized phases. Each module selects its active arms in getActiveSet, such that the optimal arm is included with high probability. The length of a phase is chosen such that each module can play each potentially optimal arm at least once per phase. All modules schedule all their arms for the phase in scheduleNext. This is done by choosing arms in a round-robin fashion (with random choices if not all arms can be played equally often) and ordering them randomly. All scheduled plays are executed and the modules update their statistics through calls of the feedback routine. The modules use slowly increasing lower confidence bounds for the gaps in order to temporarily eliminate arms that are suboptimal with high probability. In all algorithms, we use the same confidence function $f$.
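The scheduling step can be sketched as follows. This is a simplified, hypothetical rendering of scheduleNext, not the exact pseudocode of Algorithm 3:

```python
import random

# Every active arm is scheduled in round-robin fashion, leftover slots
# are filled with distinct random arms, and the order is then shuffled.
def schedule_next(active_arms, phase_len):
    slots, extra = divmod(phase_len, len(active_arms))
    schedule = list(active_arms) * slots + random.sample(active_arms, extra)
    random.shuffle(schedule)  # local randomization within the phase
    return schedule

sched = schedule_next(["a", "b", "c"], 7)  # each arm appears 2 or 3 times
```

The final shuffle is what provides the local randomization: across modules, each atomic arm is paired with an effectively random reference set within the phase.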

#### Dueling bandits

For dueling bandits we only use a single instance of TEM. In each phase the algorithm generates two random permutations of the active set and plays the corresponding actions from the two lists against each other. (Both permutations are generated at the start of each phase in Algorithm 2.)

### 3.1 TEM

The TEM tracks the empirical differences between the rewards of all pairs of arms in its atomic set. Based on these differences, it computes lower confidence bounds for all gaps. The active set contains those arms for which all LCB gaps are zero. Additionally, the algorithm keeps track of the arms that were never removed from the active set. During a phase, each arm from the active set is played at least once, but only arms that were never eliminated can be played more than once. This is necessary to keep the additive constants at $O(K \log K)$ instead of $O(K^2)$.
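The temporary elimination rule can be sketched as follows. The confidence radius below is a generic Hoeffding-style width chosen for illustration, not the exact bound used by TEM, and all numbers are made up:

```python
import math

def lcb_gap(mean_leader, mean_arm, n, delta=0.05):
    # Lower confidence bound on the gap to the current empirical leader.
    radius = math.sqrt(2.0 * math.log(1.0 / delta) / n)
    return max(0.0, (mean_leader - mean_arm) - radius)

def active_set(means, counts, delta=0.05):
    leader = max(means, key=means.get)
    # An arm stays active while the LCB on its gap to the leader is zero.
    return [a for a in means
            if lcb_gap(means[leader], means[a], counts[a], delta) == 0.0]

means = {"a": 0.90, "b": 0.85, "c": 0.10}
counts = {"a": 200, "b": 200, "c": 200}
active = active_set(means, counts)  # "c" is temporarily eliminated
```

Because the elimination is only temporary, an arm whose LCB gap later returns to zero (as the slowly increasing bounds evolve) re-enters the active set.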

## 4 Analysis

We start this section with the main theorem, which bounds the number of times the TEM pulls sub-optimal arms. Then we prove upper bounds on the regret for our main algorithms. Finally, we prove a lower bound for factored bandits that shows that our regret bound is tight up to constants.

### 4.1 Upper bound for the number of sub-optimal pulls by TEM

###### Theorem 1.

For any TEM submodule operating on an atomic set of $K$ arms, running within the TEA algorithm, and any suboptimal atomic arm $a \neq a^*$, let $N_t(a)$ denote the number of times the TEM has played the arm $a$ up to time $t$. Then there exist constants $C(a)$, such that

$$\mathbb{E}[N_t(a)] \leq \frac{120}{\Delta(a)^2}\left(\log(2Kt\log^2(t)) + 4\log\left(\frac{48\log(2Kt\log^2(t))}{\Delta(a)^2}\right)\right) + C(a),$$

where the constants $C(a)$ take different values in the case of factored bandits and in the case of dueling bandits.

###### Proof sketch.

[The complete proof is provided in the Appendix.]

#### Step 1

We show that the confidence intervals are constructed in such a way that the probability that all confidence intervals hold at all times is at least $1 - \delta$. This requires a novel concentration inequality (Lemma A) for a sum of conditionally sub-Gaussian random variables, where the sub-Gaussianity parameters can depend on the history. This technique might be useful for other problems as well.

#### Step 2

We split the number of pulls into pulls that happen in rounds where the confidence intervals hold and those where they fail: $N_t(a) \leq N^E_t(a) + N^{\bar{E}}_t(a)$. We can bound the expectation of $N^{\bar{E}}_t(a)$ based on the failure probabilities given by Step 1.

#### Step 3

We define $s'$ as the last round in which the confidence intervals held and the arm was not eliminated. We can split $N^E_t(a)$ at round $s'$ and use the confidence intervals to upper bound the number of pulls up to that round. The upper bound on the remaining pulls requires special handling of arms that were eliminated once and carefully separating the cases where the confidence intervals never fail from those where they might fail. ∎

### 4.2 Regret Upper Bound for Factored Bandit TEA

A regret bound for the Factored Bandit TEA algorithm, Algorithm 1, is provided in the following theorem.

###### Theorem 2.

The pseudo-regret of Algorithm 1 at any time $t$ is bounded by

$$\mathrm{Reg}_t \leq \kappa\left(\sum_{\ell=1}^{L}\sum_{a^\ell \neq a^\ell_*} \frac{120}{\Delta_\ell(a^\ell)}\left(\log(2|\mathcal{A}_\ell| t \log^2(t)) + 4\log\left(\frac{48\log(2|\mathcal{A}_\ell| t \log^2(t))}{\Delta_\ell(a^\ell)^2}\right)\right)\right) + \max_\ell |\mathcal{A}_\ell| \sum_\ell \log(|\mathcal{A}_\ell|) + \sum_\ell \frac{5}{2}|\mathcal{A}_\ell|.$$
###### Proof.

The design of the TEA allows application of Theorem 1 to each instance of the TEM. Using the regret decomposition (2) and the definition of $\kappa$, we have that

$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^T \mu(a^*) - \mu(a_t)\right] \leq \kappa \sum_{\ell=1}^{L}\sum_{a^\ell \neq a^\ell_*} \mathbb{E}[N_T(a^\ell)]\,\Delta_\ell(a^\ell).$$

Applying Theorem 1 to the expected number of pulls and bounding the sums completes the proof. ∎

### 4.3 Dueling bandits

A regret bound for the Dueling Bandit TEA algorithm (DBTEA), Algorithm 2, is provided in the following theorem.

###### Theorem 3.

The pseudo-regret of Algorithm 2 for any utility based dueling bandit problem at any time $t$ satisfies a bound of the same form as in Theorem 2, with a single atomic set and an additional multiplicative factor of 2 in the leading term.

###### Proof.

At every round, each arm in the active set is played once in the first position and once in the second position of the duel. Denote by $N^A_t(a)$ the number of plays of an arm $a$ in the first position, by $N^B_t(a)$ the number of plays in the second position, and by $N_t(a)$ the total number of plays of the arm. We have

$$\mathrm{Reg}_t = \sum_{a \neq a^*}\mathbb{E}[N_t(a)]\,\Delta(a) = \sum_{a \neq a^*}\mathbb{E}[N^A_t(a) + N^B_t(a)]\,\Delta(a) = \sum_{a \neq a^*} 2\,\mathbb{E}[N^A_t(a)]\,\Delta(a).$$

The proof is completed by applying Theorem 1 to bound $\mathbb{E}[N^A_t(a)]$. ∎

### 4.4 Lower bound

We show that without additional assumptions the regret bound cannot be improved. The lower bound is based on the following construction. The mean reward of every arm is a sum of the qualities of its atomic arms. The noise is Gaussian with variance 1. In this problem the regret can be decomposed into a sum over atomic arms of the regret induced by pulling these arms. Assume that we only want to minimize the regret induced by a single atomic set $\mathcal{A}_\ell$. Further, assume that the optimal atomic arms in all other sets are given. Then the problem reduces to a regular $K_\ell$-armed bandit problem. The asymptotic lower bound for the $K$-armed bandit problem under 1-Gaussian noise goes back to [lai1985asymptotically]: for any consistent strategy $\theta$, the asymptotic regret is lower bounded by $\liminf_{T\to\infty}\frac{\mathrm{Reg}^\theta_T}{\log(T)} \geq \sum_{a \neq a^*}\frac{2}{\Delta(a)}$. Due to the regret decomposition, we can apply this bound to every atomic set separately. Therefore, the asymptotic regret for the factored bandit problem is

$$\liminf_{T\to\infty}\frac{\mathrm{Reg}^\theta_T}{\log(T)} \geq \sum_{\ell=1}^{L}\sum_{a^\ell \neq a^\ell_*}\frac{2}{\Delta_\ell(a^\ell)}.$$

This shows that our general upper bound is asymptotically tight up to leading constants and the factor $\kappa$.
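For concreteness, the right-hand side of the lower bound is just a sum of inverse gaps over all suboptimal atomic arms. The gap values below are toy numbers chosen for illustration:

```python
# Numeric form of the asymptotic lower bound rate.
gaps = [[0.1, 0.3], [0.2]]  # Delta_l(a) for the suboptimal arms, per factor
lower_bound_rate = sum(2.0 / d for factor in gaps for d in factor)
# Any consistent strategy incurs at least lower_bound_rate * log(T)
# regret asymptotically on such an instance.
```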

#### κ-gap

We note that there is a problem-dependent gap of $\kappa$ between our upper and lower bounds. Currently we believe that this gap stems from the difference between the information and computational complexity of the problem. Our algorithm operates on each factor of the problem independently of the other factors and is based on the “optimism in the face of uncertainty” principle. It is possible to construct examples in which the optimal strategy requires playing surely sub-optimal arms for the sake of information gain. For example, constructions of this kind were used by [lattimore2016end] to show the suboptimality of optimism-based algorithms. Therefore, we believe that removing $\kappa$ from the upper bound is possible, but requires a fundamentally different algorithm design. What is not clear is whether it is possible to remove $\kappa$ without a significant sacrifice in computational complexity.

## 5 Comparison with prior work

### 5.1 Stochastic rank-1 bandit

Stochastic rank-1 bandits, introduced by [katariya2016stochastic], are a special case of factored bandits. The authors published a refined algorithm for Bernoulli rank-1 bandits using KL confidence sets in [katariyabernoulli]. We compare our theoretical results with the first paper, because it matches our problem assumptions. In our experiments, we provide a comparison to both the original algorithm and the KL version.

In the stochastic rank-1 problem there are only 2 atomic sets, of sizes $K_1$ and $K_2$. The matrix of expected rewards for each pair of arms is of rank 1. This means that for each pair of arms $(i, j)$, there exist $u_i$ and $v_j$ such that the expected reward is $u_i \cdot v_j$. The Stochastic rank-1 Elimination algorithm introduced by [katariya2016stochastic] is a typical elimination style algorithm. It requires knowledge of the time horizon and uses phases that increase exponentially in length. In each phase, all arms are played uniformly. At the end of a phase, all arms that are sub-optimal with high probability are eliminated.
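The rank-1 structure can be sketched in a few lines (toy quality vectors; in the actual bandit problem the learner only observes noisy entries of this matrix):

```python
# Expected reward of the pair (i, j) factorizes as u[i] * v[j].
u = [0.2, 0.9, 0.5]
v = [0.4, 0.1, 0.8]

rewards = [[ui * vj for vj in v] for ui in u]

# With non-negative qualities, the best pair combines the best row
# with the best column -- the factored-bandit view of the problem.
best_i = max(range(len(u)), key=lambda i: u[i])
best_j = max(range(len(v)), key=lambda j: v[j])
assert rewards[best_i][best_j] == max(max(row) for row in rewards)
```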

#### Theoretical comparison

It is hard to make a fair comparison of the theoretical bounds, because TEA operates under much weaker assumptions. Both algorithms have regret bounds of the same leading order in the time horizon. The problem independent multiplicative factors hidden in the $O$-notation are smaller for TEA, even without considering that rank-1 Elimination requires a doubling trick for anytime applications. However, the problem dependent factors are in favor of rank-1 Elimination, where the gaps correspond to the mean difference under uniform sampling of the reference arms. In factored bandits, the gaps are defined as the worst-case difference over reference sets, which is naturally smaller. The difference stems from the different problem assumptions. The stronger assumptions of rank-1 bandits make elimination easier as the number of eliminated suboptimal arms increases. The TEA analysis holds in cases where it becomes harder to identify suboptimal arms after the removal of bad arms. This may happen when highly suboptimal atomic actions in one factor provide more discriminative information on atomic actions in other factors than close to optimal atomic actions in the same factor (in the spirit of the illustration of the suboptimality of optimistic algorithms by [lattimore2016end]). We leave it to future work to improve the upper bound of TEA under stronger model assumptions.

#### Empirical comparison

The details of empirical evaluation are provided in Appendix C. In the evaluation we clearly outperform rank-1 Elimination over different parameter settings and even beat the KL optimized version if the means are not too close to zero or one. This supports that our algorithm does not only provide a more practical anytime version of elimination, but also improves on constant factors in the regret. We believe that our algorithm design can be used to improve other elimination style algorithms as well.

### 5.2 Dueling Bandits: Related Work

To the best of our knowledge, the proposed Dueling Bandit TEA is the first algorithm that satisfies the following three criteria simultaneously for utility based dueling bandits:

• It requires no prior knowledge of the time horizon (nor uses the doubling trick or restarts).

• Its pseudo-regret is bounded by $O\left(\sum_{a \neq a^*}\frac{\log(t)}{\Delta(a)}\right)$.

• There are no additive constants that dominate the regret for time horizons larger than a constant independent of the number of arms.

We want to stress the importance of the last point. For all state-of-the-art algorithms known to us, when the number of actions is more than 20-30, the additive term is dominating for any realistic time horizon $T$. In particular, among the three algorithms introduced by [ailon2014reducing] for the utility based dueling bandit problem, the regret of Doubler scales with $\log^2(T)$. The regret of MultiSBM has an additive term that is dominating for time horizons exponential in the number of arms. The last algorithm, Sparring, has no theoretical analysis.

Algorithms based on the weaker Condorcet winner assumption also apply to the utility based setting, but they all suffer from an equally large or even larger additive term. The RUCB algorithm introduced by [zoghi2014relative] has an additive term in the bound defined through auxiliary constants; by unwrapping these definitions, we see that the RUCB regret bound has an additive term that is again dominating for time horizons exponential in the number of arms. The same applies to the RMED algorithm introduced by [KHKN15], which also carries a large additive term. (The dependencies on the gaps are hidden under the $O$-notation.) The D-TS algorithm by [WL16], based on Thompson Sampling, shows one of the best empirical performances, but its regret bound includes an additive constant of even higher order in the number of arms.

Other algorithms known to us, Interleaved Filter [yue2012k], Beat the Mean [yue2011beat], and SAVAGE [urvoy2013generic], all require knowledge of the time horizon in advance.

For an empirical comparison, we have used the framework provided by KHKN15 and present the details and figures in Appendix C. The evaluation shows that the additive terms are indeed non-negligible and that Dueling Bandit TEA outperforms all baseline algorithms when the number of arms is sufficiently large.

## 6 Discussion

We have presented the factored bandits model and the uniform identifiability assumption, which does not require knowledge of a specific reward model. We presented an algorithm for playing stochastic factored bandits with uniformly identifiable actions and provided matching upper and lower bounds for the problem up to constant factors. Our algorithm and proofs may serve as a template for turning other elimination style algorithms into improved anytime algorithms. Factored bandits with uniformly identifiable actions generalize rank-1 bandits and allow utility based dueling bandits to be analysed in a unified framework. Furthermore, we improve the additive constants in the regret bound compared to state-of-the-art algorithms for utility based dueling bandits.

There are multiple potential directions for future research. One example mentioned in the text is the possibility of improvement of the regret bound when additional restrictions on the form of the reward function are introduced or improvements of the lower bound for algorithms restricted in computational or memory complexity. Another example is a study of the adversarial version of the problem.

## Appendix A Proof of Theorem 1

We require a few lemmas to prove Theorem 1. We provide the proofs of these lemmas in the next section.

###### Lemma (technical inequality).

For the constants $z$ and $y$ arising in the proof and any sufficiently large $x$:

$$\frac{z\left(\log(f(x)) + y\right)}{x} < 1.$$

###### Lemma A (concentration of adaptive sub-Gaussian sums).

Let $X_1, X_2, \dots$ be a sequence of sub-Gaussian random variables adapted to the filtration $(\mathcal{F}_t)_t$. Assume that for all $t$, $X_t$ is conditionally sub-Gaussian given $\mathcal{F}_{t-1}$, with $n_t$ measurable with respect to $\mathcal{F}_{t-1}$ almost surely. Then

$$\mathbb{P}\left[\exists t \in \mathbb{N}: \; \sum_{i=1}^t X_i \geq \sqrt{2\sigma^2 n_t \log\left(\frac{f(n_t)}{\delta}\right)}\right] \leq \delta,$$

where $f$ is the confidence function used in the algorithms. (Note that unlike Lemma B we do not require $n_t$ to be independent of the $X_i$.)

###### Lemma (difference of sample means).

Given random variables $X_1, \dots, X_n$ with means $\mu_1, \dots, \mu_n$, such that all $X_i - \mu_i$ are $\sigma$-sub-Gaussian (e.g. Bernoulli random variables), and given two sample sizes $m \leq k$, let $I_m$ and $I_k$ be disjoint uniform samples of indices in $\{1, \dots, n\}$ without replacement. Then the random variable

$$Z = \frac{1}{m}\sum_{i \in I_m} X_i - \frac{1}{k}\sum_{i \in I_k} X_i$$

is $\sigma\sqrt{\frac{1}{m} + \frac{1}{k}}$-sub-Gaussian.

###### Proof of Theorem 1.

We follow the steps from the sketch.

#### Step 1

We define the following shifted random variables. The reward functions are bounded for all actions, so the shifted variables are conditionally sub-Gaussian, and we can bound the relevant differences. Define the events that the confidence intervals hold and their complements. According to the lemma on differences of sample means, the difference of two disjoint sample averages is sub-Gaussian. So the statistic of interest is a sum of conditionally sub-Gaussian random variables, and therefore we can apply Lemma A. In both cases, the failure probability never increases in time. Using a union bound over all arms, we obtain the bound on the probability that some confidence interval fails.

#### Step 2

We split the number of pulls in two categories: those that appear in rounds where the confidence intervals hold, and those that appear in rounds where they fail: , .

$$N_t(a_i) \leq N^E_t(a_i) + N^{\bar{E}}_t(a_i), \qquad \mathbb{E}[N^E_t(a_i)] = \mathbb{P}[F]\,\mathbb{E}[N^E_t(a_i) \mid F] + \mathbb{P}[\bar{F}]\,\mathbb{E}[N^E_t(a_i) \mid \bar{F}].$$

In the high probability case, we are in the event $F$ with probability at least $1 - \delta$, and $N^{\bar{E}}_t(a_i)$ is 0. In the setting of bounds in expectation, we can exclude the first round and start the sums with the second phase. This is because we do not use the confidence intervals in the first round.

$$\mathbb{E}[N^{\bar{E}}_t(a_i)] \leq \sum_{s=2}^{\infty}\frac{t_{s+1} - t_s}{f(t_s)} \leq \sum_{s=1}^{\infty}\frac{M}{f(Ms)} \leq \frac{M}{f(M)} + \sum_{s=2}^{\infty}\frac{M}{f(Ms)} \leq \frac{1}{2} + \sum_{s=1}^{\infty}\frac{1}{f(s)} \leq \frac{3}{2}.$$

We use the fact that $1/f$ is monotonically decreasing, so the expression gets minimized if all rounds are maximally long.

#### Step 3: bounding the number of pulls under the event

Let $s'$ be the last round at which the arm was not eliminated. We claim that at the beginning of round $s'$ the pull count must surely be smaller than or equal to the bound implied by the confidence intervals. Assume the opposite holds; then, according to Lemma A, the arm would have been excluded at the beginning of round $s'$, which is a contradiction. Let $N^s(a_i)$ denote the number of plays of $a_i$ in round $s$. Then for the different cases we have the following. The first case is trivial, because each arm can only be played a bounded number of times in a single round. The second case follows from the fact that the optimal arm always remains in the active set under the event $E$, so the number of pulls in a single round is naturally bounded. Given that, under the event $E$, the set of never-eliminated arms never resets and the active set only decreases when an arm is eliminated, we can bound the total number of pulls. Finally, the last case follows trivially, because after round $s'$ the arm is no longer played under the event $E$.

#### Step 4: combining everything

In the high probability case, we have with probability at least $1 - \delta$:

$$N_t(a_i) \leq N^E_t(a_i) + N^{\bar{E}}_t(a_i) \leq 2N_{i,*}(s') + C(a_i) \leq \frac{96}{\Delta(a)^2}\left(\log(2K\delta^{-1}) + 4\log\left(\frac{48\log(2K\delta^{-1})}{\Delta(a)^2}\right)\right) + C(a_i), \quad \text{s.t. } \sum_{a \neq a^*} C(a) \leq M\log(K) + K.$$

If additionally the arm was never temporarily eliminated, then the bound improves to

$$N_t(a_i) \leq N^E_t(a_i) + N^{\bar{E}}_t(a_i) \leq N_{i,*}(s') + 1 \leq \frac{48}{\Delta(a)^2}\left(\log(2K\delta^{-1}) + 4\log\left(\frac{48\log(2K\delta^{-1})}{\Delta(a)^2}\right)\right) + 1.$$

In the setting of bounds in expectation, we have

$$\mathbb{E}[N^E_t(a_i) - C(a_i)] \leq 2N_{i,*}(s') + \frac{1}{f(M)}M\,N_{i,*}(s') \leq \frac{120}{\Delta(a)^2}\left(\log(2Kt\log^2(t)) + 4\log\left(\frac{48\log(2Kt\log^2(t))}{\Delta(a)^2}\right)\right).$$

So

$$\mathbb{E}[N_t(a_i)] \leq \frac{120}{\Delta(a)^2}\left(\log(2Kt\log^2(t)) + 4\log\left(\frac{48\log(2Kt\log^2(t))}{\Delta(a)^2}\right)\right) + C(a_i) + \mathbb{E}[N^{\bar{E}}_t(a_i)],$$

where

$$\sum_{a \neq a^*}\mathbb{E}\left[C(a) + N^{\bar{E}}_t(a)\right] \leq M\log(K) + K + \frac{1}{f(M)}MK + \frac{3}{2}K \leq M\log(K) + \frac{5}{2}K.$$

Finally, if additionally the arm was never temporarily eliminated, this bound improves by a factor of 2 in the leading term, analogously to the high probability case. ∎