# Bandits with Side Observations: Bounded vs. Logarithmic Regret

## Abstract

We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free. We prove that, no matter how small $\epsilon$ is, the agent can ensure a regret uniformly bounded in time.

More precisely, we construct an algorithm with a regret smaller than $\sum_i \log(1/\epsilon)/\Delta_i$, up to multiplicative constants and $\log\log$ terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.

## 1 Introduction

We consider the celebrated multi-armed bandit framework (sometimes also called online learning), a repeated decision problem where an agent (or an algorithm, a machine, a player, etc.) sequentially takes decisions from a finite set. Each decision gives the agent a stochastic reward of fixed expectation. The main objective is to derive an algorithm maximizing the cumulative reward or, equivalently, minimizing its normalized version, the so-called “regret”. The latter is simply the difference between the cumulative expected reward of an agent knowing in hindsight the optimal decision, and the cumulative reward of the algorithm.

Online learning can be traced back to the 1930s, when Thompson analysed randomized clinical trials using an analogy with finding the best slot machine in a casino by sequentially pulling their arms so as to minimize the total loss. During the 20th century, many improvements were made, at least on the asymptotic version of the problem. The number of theoretical studies and practical applications of bandits has exploded since the early 2000s. There are several reasons for that. First of all, a simple yet almost optimal algorithm called UCB has been developed. Its simple structure makes it easy to adapt to many different settings. As a consequence, many possible applications of online learning have been developed. Amongst them, we can mention the routing problem: given a network with congested edges, one must find the quickest way from some origin to a destination (this setting incorporates a combinatorial structure); this can be used to send packets in a network, as well as to find the quickest itinerary from a point A to a point B. Online advertising is another application: given a set of possible ads, one must find the ad with the highest probability of click. The last application we mention is concerned with wireless networks and/or cognitive radio, where either a radio can change from an available channel to other channels to improve its reception or emission quality, or alternatively a wireless source, in a relay selection problem where multiple relays are available, can explore those nodes to achieve better transmission rates.

One of the typical and crucial assumptions of all these models is that the agent only observes the outcome of his decisions, but not what the other decisions would have given him. For instance, using a slot machine only gives feedback on the performance of that very machine, displaying an ad only gives information on the probability of clicks on that specific ad, etc. This assumption is called “bandit feedback”. At the other end of the spectrum, the dual assumption (mostly used in non-stationary environments, which we are not concerned with in this paper) is the “full information feedback”, where all the outcomes of all decisions are observed at all stages. However, none of our motivating examples satisfies this strong assumption.

However, we argue that the bandit feedback assumption is also too strong and that in many cases more information is available to the agent. Typically, the agent will always observe the outcome of his own decision, but with some small probability he might also get one (or several, but that is irrelevant to our setting) extra “free” observation. For instance, consider the original multi-armed bandit problem. A gambler is in a casino and wants to find out which slot machine is the best one. From time to time, he might observe other gamblers playing nearby machines. This does not cost him anything, yet he gets feedback on the other machines. This effect also appears in other settings. In wireless networks, a source with an allocated transmission capacity (because of a power-saving allocation protocol for instance) sends data through a relay and may have the opportunity to send another custom packet (so that the energy needed to send this packet is less than the available energy) through another relay in order to estimate transmission rates. In online advertisement (and actually many other industrial markets), companies are willing to spend a small fraction of their data, say with probability $\epsilon$ as in the celebrated $\epsilon$-greedy algorithm, just to acquire new information. An algorithm is only evaluated on the remaining fraction (of proportion $1-\epsilon$) of the data treated. In a multi-armed bandit setting, this means that with probability $\epsilon$, the next decision is “free”. Finally, we can also think that in the congested network problem, an algorithm can from time to time send “fake”, but free, packets to test the congestion; conversely, an app trying to minimize the congestion time of its users might be able to use free information if it notices that a bucket of users (for instance, those that are registered) might explore new roads willingly, i.e., without uninstalling the app.

We therefore focus on the classical multi-armed bandit problem, but where some extra and free information is available from time to time. Clearly, if the probability that this happens is arbitrarily close to 0, the improvement will be negligible. But we aim at constructing “optimal” algorithms, i.e., algorithms whose regret is small and within a multiplicative constant of the best regret achievable by “meaningful” algorithms. All these concepts are explained in detail in the remainder of the paper, which is organized as follows.

The model is introduced in Section 2, where we provide a very naïve algorithm achieving bounded regret (uniformly in time). We exhibit in Section 3 non-trivial lower bounds (we emphasize here that traditional bandit lower bounds are void in our setting). Algorithms are described and analysed in Section 4. Finally, Section 5 is dedicated to experiments illustrating the different guarantees and dependencies in the parameters of the models.

### 1.1 Related Works

This paper is not the first one to consider additional, free information available to the agent while optimizing. There are many different ways of modelling this idea, but our paper is the first one (to our knowledge) that also focuses on the strategic aspects of obtaining this free information to reduce regret, especially in the stochastic case.

There exist models where, when a specific decision is taken, the performance of some other decisions is observed automatically (resp. with some probability) (Alon et al., 2015; Chen et al., 2016; Caron et al., 2012). Those models assume that there exists a directed (resp. weighted) graph whose set of nodes is the set of decisions. When the agent takes a decision, he also observes the outcome of any node linked to the current decision node (resp. with a probability proportional to the weight of the edge). Our passive model could be recast as a specific case of that setting, but our results are much finer than the ones available for the general case.

In (Yu and Mannor, 2009) the rewards are stochastic but their means change at unknown time points. Free additional observations are queried by the algorithm in order to detect these change points. They are however not used to decrease the regret of the base bandit algorithm.

Another line of literature on additional free information in multi-armed bandits studies the “adversarial” case, where no stationarity assumption is made on the sequence of rewards (namely, they are not i.i.d.) (Audibert and Bubeck, 2010; Cesa-Bianchi et al., 2006; Mannor and Shamir, 2011). However, the rates of convergence in the two extreme cases (bandit and full information) have the same dependency in $T$, the total number of stages. To be precise, the regret is either of the order of $\sqrt{KT}$ (in the bandit case) or $\sqrt{T\log K}$ (in the full information case), where $K$ is the number of decisions. Intermediate settings (where some extra observations are available at each stage) interpolate between those two cases.

In the stochastic case though, regret is uniformly bounded with full information and grows logarithmically in the bandit case. As a consequence, even the rate of convergence will depend on the amount of free information.

## 2 Multi-Armed Bandits, Regret Minimization and Feedbacks

In this section, we describe precisely the stochastic multi-armed bandit problem and its objective, the minimization of regret.

### 2.1 Stochastic Multi-Armed Bandits

#### Bandit vs Full-Information

At each successive stage $t \ge 1$, an agent takes a decision $i_t$ (or pulls an arm, using the multi-armed bandit lingo) in the finite set $[K] = \{1, \dots, K\}$. After pulling this arm, the agent receives the reward $X_t^{(i_t)}$, which is sampled from a real reward distribution of expectation $\mu^{(i_t)}$. As a consequence, the stochastic bandit problem is parametrised by the vector of reward distributions or alternatively, in the non-parametric case, by the vector of expected rewards $\mu = (\mu^{(1)}, \dots, \mu^{(K)})$. Throughout the paper, the results are stated using the arbitrary ordering $\mu^{(1)} \ge \mu^{(2)} \ge \dots \ge \mu^{(K)}$. Obviously, those vectors are unknown to the agent, who is aiming at optimizing her cumulative expected reward $\sum_{t=1}^{T} \mu^{(i_t)}$. Actually, instead of this cumulative reward, the objective is normalized into cumulative regret minimization.

The cumulative regret (or simply regret) of an algorithm at stage $T$ is defined as

$$R_T = T \max_{i \in [K]} \mu^{(i)} - \sum_{t=1}^{T} \mu^{(i_t)},$$

i.e., it is the difference between the maximal possible cumulative reward up to stage $T$ and the expectation of the reward gained by the successive choices of arms $i_1, \dots, i_T$. Following the classical notations, we define $\mu^* = \max_{i \in [K]} \mu^{(i)}$ and the gaps $\Delta_i = \mu^* - \mu^{(i)}$. In the non-parametric case, these gaps are the relevant quantities characterising the complexity of a bandit problem.
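As a concrete illustration of the definition above, the following minimal Python sketch computes the regret of a given sequence of pulls. The function names `cumulative_regret` and `regret_from_counts` are ours and purely illustrative; the second one relies on the equivalent decomposition of the regret as a gap-weighted sum of pull counts.

```python
from typing import Sequence


def cumulative_regret(mu: Sequence[float], pulls: Sequence[int]) -> float:
    """Regret of a sequence of pulled arms (0-based indices).

    Computes T * max_i mu[i] - sum_t mu[i_t], written as a sum of per-stage gaps.
    """
    best = max(mu)
    return sum(best - mu[i] for i in pulls)


def regret_from_counts(mu: Sequence[float], counts: Sequence[int]) -> float:
    """Same quantity, computed from the number of pulls of each arm."""
    best = max(mu)
    return sum((best - m) * n for m, n in zip(mu, counts))
```

Both functions return the same value whenever `counts[i]` is the number of occurrences of arm `i` in `pulls`.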

There are different standard assumptions on the feedback available to the agent before taking a new decision. In the bandit setting, she observes only her reward $X_t^{(i_t)}$ (and, specifically, not the rewards of the other arms) at the end of stage $t$. In the full information setting, she observes the full vector of rewards $(X_t^{(1)}, \dots, X_t^{(K)})$. With full information, the Follow The Leader (FTL) algorithm, which selects the argmax of the empirical average, attains a uniformly bounded regret (with respect to $T$). In the bandit setting, FTL gets a linear regret, yet the logarithmic optimal dependency in $T$ is achieved by many algorithms. One of the most popular, called Upper Confidence Bound (UCB), selects the argmax of the empirical average augmented by an error term of order $\sqrt{\log(t)/N_i(t)}$, where $N_i(t)$ is the number of pulls of arm $i$ up to stage $t$, while $\bar X^{(i)}_t$ denotes the empirical average of its rewards.

Many other algorithms are variants of UCB, by modifying the error term, changing some parameters, specifying it for a given class of parametric distributions, etc.

As specified and motivated in the Introduction, we aim at analysing intermediate settings between bandit and full information, in which a subset of the reward vector might be observed. More precisely, at some stages, the agent not only observes an arm by pulling it but might also observe a second arm for free, i.e., without getting a reward (and without incurring any regret). We consider several ways in which these free observations can be obtained: they can be deterministically available periodically (for instance every $1/\epsilon$ rounds) or arrive randomly (at each stage with probability $\epsilon$); the agent can also be a passive observer if she cannot choose from which arm she gets the extra information (the environment chooses it for her, in a manner to be specified later on), or she can be an active observer if she can freely choose the arm to observe.

We end this section with some notations. In the random time arrival of free information, we assume that at each stage $t$ a Bernoulli random variable with expectation $\epsilon_t$ is sampled and a free observation is available if this variable equals 1. The particular setting in which $\epsilon_t = \epsilon$ is constant will be called static random. We denote by $i_t$ the arm pulled at stage $t$; when a free observation is available, a second arm is chosen to be observed. The total number of pulls of arm $i$ up to stage $t$ is $N_i(t)$, and the total number of observations of arm $i$ (pulls plus free observations) is $O_i(t)$.

### 2.2 A Finite Regret Setting

It is not really difficult to devise a naïve algorithm with a (uniformly) bounded regret, at least in the deterministic case, when a free observation is obtained every $1/\epsilon$ rounds. We consider for simplicity the case of $K = 2$ arms in this section, as it gives all the intuitions. Consider the following (heavily sub-optimal) strategy, which we denote by FTL-robin: pull the leading arm (the one with the highest empirical average $\bar X^{(i)}_t$) and, when a free sample is available, observe arms in a round-robin fashion.
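The strategy just described can be simulated in a few lines. The sketch below is only illustrative and makes several assumptions that are not in the text: rewards are Gaussian with unit variance, the period between free observations is an integer `period` (playing the role of $1/\epsilon$), the reward of the pulled arm is also used in its empirical average, and an arm that has never been observed is treated as the leader. The function name `ftl_robin` is ours.

```python
import random


def ftl_robin(means, period, horizon, seed=0):
    """Simulate FTL-robin and return its (pseudo-)regret after `horizon` stages.

    Assumptions: unit-variance Gaussian rewards; one free observation every
    `period` stages, allocated to the arms in round-robin order.
    """
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k      # sum of all observed rewards per arm
    counts = [0] * k      # number of observations (pulls + free) per arm
    best = max(means)
    regret = 0.0
    free_arm = 0          # next arm to receive a free observation
    for t in range(1, horizon + 1):
        # Follow the leader; never-observed arms get an infinite average.
        avg = [sums[i] / counts[i] if counts[i] else float("inf") for i in range(k)]
        leader = max(range(k), key=lambda i: avg[i])
        sums[leader] += rng.gauss(means[leader], 1.0)
        counts[leader] += 1
        regret += best - means[leader]
        # Deterministic free observation, round-robin over the arms.
        if t % period == 0:
            sums[free_arm] += rng.gauss(means[free_arm], 1.0)
            counts[free_arm] += 1
            free_arm = (free_arm + 1) % k
    return regret
```

Running it with a fixed gap and increasing values of `period` gives a rough empirical feel for how the final regret of this naive strategy degrades as free observations become rarer.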

After a period of $2/\epsilon$ stages, both arms have their observation counters increased by at least one. As a consequence, this simple algorithm FTL-robin can be seen as a full-information algorithm which would take $2/\epsilon$ stages to get the observations. To simplify intuitions

###### Lemma 1.

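The regret of the FTL-robin algorithm on the deterministic setting with $K = 2$ arms satisfies

$$\mathbb{E} R_T \le \frac{\bar c}{\epsilon} \frac{1}{\Delta}, \quad \text{where} \quad \Delta = |\mu^{(1)} - \mu^{(2)}|,$$

and there exist distributions such that

$$\frac{\underline{c}}{\epsilon} \frac{1}{\Delta} \le \mathbb{E} R_T,$$

where $\bar c$ and $\underline{c}$ are universal constants that do not involve any parameter of the problem.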

This lemma shows that even the simplest algorithm gets a finite regret in this setting. The proof is almost trivial and omitted. To provide some insights, just assume that $\mu^{(1)} = \Delta > 0$ and that the second arm has a known mean equal to 0. Then the regret of FTL-robin is equal to $\Delta$ times the number of stages at which the empirical average of the first arm is smaller than 0. Basic computations show that this number is of order $1/(\epsilon \Delta^2)$.

The relevant question is then not the asymptotic regime, but the precise optimal dependency on $\epsilon$. Indeed, when $\epsilon$ is smaller than $1/\log T$ (up to constants), this bound gets larger than the regret of another naive approach, which is to use an algorithm for bandits and discard the additional information.

This free information problem is characterized by a transition from “small” $\epsilon$, where the amount of additional information is not enough to improve the performance of bandit algorithms, to “big” $\epsilon$, where the regret is finite and the setting is closer to full information.

We answer the question of what “small” and “big” mean in this context and where the transition occurs, and we display algorithms enjoying both logarithmic regret when $\epsilon$ is small and finite regret when it is big.

## 3 Lower Bounds

We first consider the definition of optimality of an algorithm, that is, what is the minimal regret achievable by any “reasonable” algorithm, in a sense we will make precise. Our lower bounds will highlight a transition from logarithmic (with respect to the horizon $T$) to finite regimes when $\epsilon$ gets big enough.

There are now quite standard techniques to devise lower bounds for stochastic bandit problems, but surprisingly these techniques are inadequate in our case, due to the finiteness of the optimal regret. As a finite regret is possible, a traditional, asymptotic lower bound on $R_T/\log T$ (Lai and Robbins, 1985) could only be 0 and hence would not be informative. We can obtain a finite-time version of this type of bound as in (Garivier et al., 2016) by imposing that our algorithm should perform better than a reference algorithm.

###### Definition 1.

An algorithm is said to be sub-logarithmic with constants $C$, $C_0$ if on all bandit problems it verifies, for all stages $T$,

$$\mathbb{E} R_T \le C \sum_{i=2}^{K} \frac{\log T}{\Delta_i} + C_0 \sum_{i=2}^{K} \Delta_i.$$

There exist sub-logarithmic algorithms (UCB for example, with constants $C = 8$, $C_0 = 1 + \pi^2/3$ (Auer et al., 2002)). A sub-logarithmic algorithm performs at least as well as the UCB baseline. This finite-time constraint on the performance of the algorithm translates into a lower bound: to perform relatively well on all bandit problems, an algorithm cannot outperform the lower bound guarantee on any of them.
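To make the reference guarantee concrete, here is a small sketch that evaluates the right-hand side of the sub-logarithmic condition. It assumes that the condition reads $C \sum_i \log(T)/\Delta_i + C_0 \sum_i \Delta_i$ with sums over sub-optimal arms only; the function name `sublog_bound` is ours, and the default constants are those of the classical UCB1 analysis of Auer et al. (2002).

```python
import math

# Constants of the UCB1 regret bound of Auer et al. (2002).
UCB1_C = 8.0
UCB1_C0 = 1.0 + math.pi ** 2 / 3.0


def sublog_bound(horizon, gaps, c=UCB1_C, c0=UCB1_C0):
    """Right-hand side of the sub-logarithmic condition.

    `gaps` lists the gaps of the sub-optimal arms only (all strictly positive).
    """
    log_term = sum(math.log(horizon) / g for g in gaps)
    return c * log_term + c0 * sum(gaps)
```

For instance, `sublog_bound(10_000, [0.1, 0.2])` gives the value that the expected regret at stage 10 000 must not exceed for an algorithm to qualify as sub-logarithmic with the UCB1 constants on that two-gap problem.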

### 3.1 Passive Observer

When the observer is passive (i.e., she does not choose the arm to observe freely), we assume that the freely observed arm is equal to $i$ with probability $p_t^{(i)}$ chosen by the environment. Consider the static setting in which $\epsilon_t = \epsilon$ for all $t$, and the probabilities $p_t^{(i)}$ do not depend on the stage (we will thereafter omit the subscript $t$).

Standard lower bound techniques proceed as follows: at stage $T$, the expected number of pulls of an arm is linked to the Kullback-Leibler divergence between the bandit problem studied and a related alternative, in which this arm would be the best one (roughly speaking, in order to be able to “test” that the problem is not the alternative one, a minimum number of samples of that arm must be gathered in the original problem).

A bound on this divergence gives a constraint of the form $\mathbb{E}\, O_i(T) \ge h_i(T)/(2\Delta_i^2)$ for some function $h_i$. Then a lower bound for the regret is the minimal value of $\sum_{i} \Delta_i\, \mathbb{E}\, N_i(T)$ respecting all these constraints, which can be computed through some linear program. With this proof technique, we obtain Lemma 2.

###### Lemma 2.

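The regret of a sub-logarithmic algorithm with constants $C$, $C_0$ must verify

$$\mathbb{E}_1 R_T \ge \sum_{i=2}^{K} \max\left\{0,\ \frac{h_i(T)}{2\Delta_i} - \epsilon p^{(i)} T \Delta_i\right\},$$

where $h_i(T) = \log\left(\frac{T\Delta_i^2}{2C\log T \sum_{j \neq i} \frac{\Delta_i}{\Delta_i + \Delta_j}}\right) + \eta_i(T)$ (see appendix for a detailed definition).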

As mentioned above, this lower bound is void as it reaches 0 as soon as $T$ is big enough, namely once $\epsilon p^{(i)} T \ge h_i(T)/(2\Delta_i^2)$ for every sub-optimal arm $i$.

We want to explain why this lower bound fails to provide relevant information, as our algorithms (see Section 4) are somehow inspired by this. Recall that the lower bound only states that any reasonable algorithm must have gathered, for each sub-optimal arm, a given number of observations, namely $h_i(T)/(2\Delta_i^2)$. However, $h_i(T)$ grows sub-linearly, while the number of free observations grows linearly. So if $T$ is large enough, there will be in total enough free observations to allocate $h_i(T)/(2\Delta_i^2)$ of them to arm $i$, and an optimal algorithm should somehow have used only free information to explore.

However, this is only possible if the free observations were gathered at the beginning of the problem and not sparsely over time! Indeed, in the traditional lower bound techniques, the fact that arm $i$ is observed at the beginning or at the end of time is irrelevant (since the cost of one pull is constant throughout time). They totally discard the fact that the cumulative counts (pulls and observations of each arm) must be non-decreasing over time. Tighter, relevant lower bounds can be recovered using this monotonicity.

###### Theorem 1.

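The regret of a sub-logarithmic algorithm with constants $C$, $C_0$ must verify

$$\mathbb{E} R_T \ge \sum_{i=2}^{K} \frac{1}{2\Delta_i}\, r^{(i)}_T$$

where

$$r^{(i)}_T = \log\left(\frac{T\Delta_i^2}{2C\log T \sum_{j \neq i} \frac{\Delta_i}{\Delta_i + \Delta_j}}\right) + \eta_i(T) - 2\epsilon p^{(i)} \Delta_i^2 T$$

if $T \le \frac{1}{2\epsilon p^{(i)} \Delta_i^2}$, and otherwise

$$r^{(i)}_T = \log\left(\frac{1}{\epsilon}\, \frac{1}{4 C p^{(i)} \sum_{j \neq i} \frac{\Delta_i}{\Delta_i + \Delta_j}}\right) - \log\log\left(\frac{1}{2\epsilon p^{(i)} \Delta_i^2}\right) + \eta_i\left(\frac{1}{2\epsilon p^{(i)} \Delta_i^2}\right) - 1.$$

The function $\eta_i(T)$ goes to zero as $T$ grows. See appendix for details.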

Theorem 1 correctly reports a lower bound increasing with the horizon. It shows a transition from a $\log T$ optimal regret for small $T$ to a finite regret, function of $\epsilon$, when $T$ gets bigger. According to Theorem 1, the correct dependency in $\epsilon$ in the regret should be in $\log(1/\epsilon)$, not $1/\epsilon$ as seen for the naive FTL-robin algorithm.
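The two regimes of Theorem 1 can be checked against each other. The following worked step is a sketch under the assumption that the switch between the two expressions occurs at $T_0 = 1/(2\epsilon p^{(i)} \Delta_i^2)$; the shorthands $T_0$ and $S_i$ are ours. Evaluating the first expression at $T_0$ recovers the second one, so the lower bound is continuous at the transition and stays frozen at that value afterwards.

```latex
% Shorthands (ours): T_0 = 1 / (2 \epsilon p^{(i)} \Delta_i^2),
%                    S_i = \sum_{j \neq i} \Delta_i / (\Delta_i + \Delta_j).
\begin{align*}
r^{(i)}_{T_0}
  &= \log\!\left(\frac{T_0 \Delta_i^2}{2 C \log(T_0)\, S_i}\right)
     + \eta_i(T_0) - 2 \epsilon p^{(i)} \Delta_i^2 T_0 \\
  &= \log\!\left(\frac{1}{4 C \epsilon p^{(i)} S_i}\right)
     - \log\log(T_0) + \eta_i(T_0) - 1 .
\end{align*}
```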

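We can also wonder what the most favorable passive setting is. Simple computations show that free observations should be drawn according to the probability vector $p$ where $p^{(i)}$ is proportional to $1/\Delta_i$ (here, we actually ignore the $\log\log$ and $\eta_i$ terms of Theorem 1), leading to a lowest lower bound

$$\mathbb{E}_1 R_T \ge \sum_{i=2}^{K} \frac{1}{2\Delta_i} \log\left(\frac{1}{\epsilon}\, \frac{\sum_{j=2}^{K} \frac{1}{\Delta_j}}{4C \sum_{j \neq i} \frac{1}{\Delta_i + \Delta_j}}\right) + \alpha \ \ge\ \sum_{i=2}^{K} \frac{1}{2\Delta_i} \log\left(\frac{1}{4C\epsilon}\right) + \alpha,$$

where $\alpha$ regroups the $\log\log$ and $\eta_i$ terms in Theorem 1. This lower bound shows in particular that when all sub-optimal arms have the same gap $\Delta$, the optimal sample distribution is uniform and the lower bound is of order $\frac{K}{\Delta} \log(1/\epsilon)$.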
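As an illustration of this most favorable passive setting, the sketch below computes a sampling distribution over the sub-optimal arms. It assumes the weights are proportional to $1/\Delta_i$, which is the reading consistent with the displayed bound (the factor $\sum_{j \ge 2} 1/\Delta_j$ is then the normalizing constant), and that the optimal arm receives no free observation. The function name is ours.

```python
def optimal_passive_distribution(gaps):
    """Sampling probabilities over sub-optimal arms, proportional to 1 / gap.

    `gaps` lists the (strictly positive) gaps of the sub-optimal arms.
    Assumption: the weights are 1 / Delta_i, normalized to sum to one.
    """
    inverse = [1.0 / g for g in gaps]
    total = sum(inverse)
    return [w / total for w in inverse]
```

With equal gaps the function returns the uniform distribution, in line with the remark above; with unequal gaps, arms that are closer to optimal are observed more often.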

### 3.2 Active Observer

An active observer has the possibility to choose the weights $p_t^{(i)}$ at each stage $t$, potentially achieving a much better distribution of the free observations up to stage $T$ than any static distribution. As before, standard techniques give the following lower bound.

###### Lemma 3.

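The regret of a sub-logarithmic algorithm with constants $C$, $C_0$ verifies

$$\mathbb{E} R_T \ge \sum_{i=2}^{k} \frac{h_i(T)}{2\Delta_i} - \Delta_k \left(\epsilon T - \sum_{j > k} \frac{h_j(T)}{2\Delta_j^2}\right),$$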

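where $k$ is the smallest index such that $\sum_{j > k} h_j(T)/(2\Delta_j^2) \le \epsilon T$, i.e., such that the free observations available up to stage $T$ suffice for all arms worse than arm $k$.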

The structure of the solution to the optimization problem in this case is again educational: an optimal algorithm presented with a given amount of free observations would spend them at the beginning, before costly pulls, and would spend them on the worst arms. This intuition drove the construction of the algorithms for the active observer in Section 4: first gather free observations, ideally according to the proportions suggested by the lower bound, then discard the arms for which enough information was gathered, and use a standard optimal bandit algorithm on the remaining ones.
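The structure of this solution can be sketched as a greedy allocation: with a fixed budget of free observations and a required number of observations per arm, the cheapest way to cover the requirements is to spend the budget on the arms with the largest gaps first and to pay for the rest with pulls. The function below is only an illustration of that linear program; the name `allocate_free_observations` and its arguments are ours, and fractional counts are allowed for simplicity.

```python
def allocate_free_observations(required, gaps, budget):
    """Greedy solution of: minimize sum_i gaps[i] * pulls[i]
    subject to pulls[i] + free[i] >= required[i] and sum_i free[i] <= budget.

    Free observations are spent on the arms with the largest gaps first.
    Returns (free, pulls, cost), where cost is the resulting regret.
    """
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    free = [0.0] * len(gaps)
    remaining = float(budget)
    for i in order:
        free[i] = min(float(required[i]), remaining)
        remaining -= free[i]
    pulls = [r - f for r, f in zip(required, free)]
    cost = sum(g * p for g, p in zip(gaps, pulls))
    return free, pulls, cost
```

Because every free observation has the same unit cost in the budget, replacing a pull of the arm with the largest remaining gap is always the most valuable use of it, which is why the greedy order is optimal here.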

As in the passive observer case, although this lower bound can be meaningful for a small horizon $T$, it becomes void for larger horizons. A better lower bound using the monotonicity of the number of pulls and of the regret is provided in the next theorem.

###### Theorem 2.

For let . The regret of any active sub-logarithmic algorithm with constants , verifies

 ERT

When all gaps are equal to the same value $\Delta$, the leading term of this lower bound is of the form

$$\max_{k \,:\, t_k \le T} \frac{k-1}{\Delta} \log\left(\frac{1}{\epsilon}\, \frac{K-k}{K}\right).$$

In particular, this result states that as $T$ goes to infinity, the constraint on $k$ disappears and the regret is asymptotically lower bounded by the maximum of this leading term over all $k$.

## 4 Algorithms and Upper-Bounds

In this section, we exhibit algorithms matching the lower bounds derived in the previous section, up to $\log\log$ terms, showing that these bounds indeed accurately represent the problem complexity.

### 4.1 Passive Observer

A passive observer does not get to choose the arms on which free information is gained. As in the classical stochastic multi-armed bandit, the only decision is therefore which arm to pull. It is then natural to extend known algorithms by taking into account all observations from both provenances.

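As UCB pulls the arm with maximal index (empirical mean plus exploration term), we extend it by using all available observations both in the empirical mean and in the exploration term. Algorithm 1 pulls the arm maximizing the resulting index, in which the total number of observations $O_i(t)$ plays the role of the number of pulls $N_i(t)$.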
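A minimal simulation of this extension is sketched below. It is not the authors' Algorithm 1: the exploration term $\sqrt{2 \log(t) / O_i(t)}$ uses the constant of the classical UCB1 index, which is an assumption on our part, rewards are taken Gaussian with unit variance, and the function name `passive_ucb` and its parameters are hypothetical.

```python
import math
import random


def passive_ucb(means, eps, probs, horizon, seed=0):
    """UCB using every observation (pulls and free ones) in its index.

    With probability `eps` per stage, the environment reveals a free sample of
    an arm drawn with weights `probs`. Returns (regret, pulls, observations).
    """
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k
    obs = [0] * k      # O_i(t): pulls plus free observations
    pulls = [0] * k    # N_i(t): pulls only, the only source of regret
    for t in range(1, horizon + 1):
        def index(i):
            if obs[i] == 0:
                return float("inf")  # force a first observation of each arm
            return sums[i] / obs[i] + math.sqrt(2.0 * math.log(t) / obs[i])

        arm = max(range(k), key=index)
        sums[arm] += rng.gauss(means[arm], 1.0)
        obs[arm] += 1
        pulls[arm] += 1
        if rng.random() < eps:
            j = rng.choices(range(k), weights=probs)[0]
            sums[j] += rng.gauss(means[j], 1.0)
            obs[j] += 1
    best = max(means)
    regret = sum((best - m) * n for m, n in zip(means, pulls))
    return regret, pulls, obs
```

Setting `eps=0.0` recovers a plain UCB run, which gives a convenient baseline when comparing final regrets for several values of `eps`.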

###### Theorem 3.

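Consider the static passive observer case, where the freely observed arm follows the categorical distribution with parameters $(p^{(1)}, \dots, p^{(K)})$ and the probability of getting a free observation is $\epsilon$ for all stages $t$.

Then the regret of UCB verifies both

$$\mathbb{E} R_T \le \sum_{i=2}^{K} \frac{24}{\Delta_i} \log T, \quad \text{and} \quad \mathbb{E} R_T \le \sum_{i=2}^{K} \frac{24}{\Delta_i} \log\frac{50}{\epsilon p^{(i)}} + \sum_{i=2}^{K} \frac{24}{\Delta_i} \max\left\{\log\frac{1}{e\Delta_i^2},\ \log\log\frac{20}{\epsilon p^{(i)}}\right\}.$$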

Hence UCB with passive observations recovers the dependency in $\log(1/\epsilon)$, up to a doubly logarithmic term when $\epsilon$ is small compared to the squared gaps. When the dominant term in this maximum is $\log\frac{1}{e\Delta_i^2}$, the regret due to arm $i$ has the form $\frac{1}{\Delta_i} \log\frac{1}{\epsilon p^{(i)} \Delta_i^2}$, which is sub-optimal with respect to $\Delta_i$ (see Theorem 1). This is due to the sub-optimality of UCB itself: while the regret of UCB on a bandit problem is $O\left(\sum_i \log(T)/\Delta_i\right)$, other algorithms of the same family like UCB2 (Auer et al., 2002), Improved-UCB (Auer and Ortner, 2010) or MOSS (Audibert and Bubeck, 2009; Degenne and Perchet, 2016) get an improved regret of order $\sum_i \log(T\Delta_i^2)/\Delta_i$.

The dependency in $\log(1/\epsilon)$ means that even a very small $\epsilon$ gives useful information to a learner. Obviously there is no gain to be had if $\epsilon$ is smaller than $1/T$, as there is on average less than one additional observation before stage $T$, but a few more free observations are enough to improve the regret.

### 4.2 Active Observer

While a uniform allocation of the free observations over the arms gets the right dependency in $\epsilon$, having the choice of the arm which will be observed allows an algorithm to get the right dependency in the parameters of the bandit problem. In the active setting, the algorithm can choose freely which of the $K$ arms will get an additional observation, when such an observation is available.

To devise an algorithm taking advantage of this possibility, we try to mimic the lower bound for a fixed stage, as in Lemma 3. A good algorithm should use the available free observations first to discard the worst arms, before using costly pulls only on the remaining arms.

We introduce an algorithm combining two subroutines: an Explore-Then-Commit (ETC) (Even-Dar et al., 2006; Perchet and Rigollet, 2013) algorithm on the free observations is used to narrow the set of arms which need to be pulled, and an algorithm of the UCB family is used on this set. As we seek optimality with respect to the problem parameters, we use OCUCB-n (Lattimore, 2016), which is the UCB-type algorithm closest to it. ETC is described in Algorithm 3. OCUCB-n with parameters $\eta$ and $\rho$ pulls at stage $t$ the arm with maximal index

$$\bar X^{(i)}_t + \sqrt{\frac{2\eta \log B^{(i)}_{t-1}}{N_i(t)}} \quad \text{where} \quad B^{(i)}_{t-1} = \max\left\{e,\ \log(t),\ \frac{t \log t}{\sum_{j=1}^{K} \min\{N_i,\ N_j^{\rho} N_i^{1-\rho}\}}\right\}$$

where $N_i$ is a shorthand notation for $N_i(t)$.
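The index above can be computed directly from the empirical means and pull counts. The sketch below assumes that the sum in the denominator of $B^{(i)}_{t-1}$ runs over all arms $j$ and that every arm has been pulled at least once; the function name `ocucb_n_index` and its argument names are ours.

```python
import math


def ocucb_n_index(i, means, counts, t, eta, rho):
    """OCUCB-n style index of arm `i` at stage `t`.

    `means[j]` and `counts[j]` are the empirical mean and number of pulls of
    arm j; all counts are assumed to be at least 1.
    """
    n_i = counts[i]
    # Denominator of B: sum over all arms j of min{N_i, N_j^rho * N_i^(1 - rho)}.
    denom = sum(min(n_i, (n_j ** rho) * (n_i ** (1.0 - rho))) for n_j in counts)
    b = max(math.e, math.log(t), t * math.log(t) / denom)
    return means[i] + math.sqrt(2.0 * eta * math.log(b) / n_i)
```

The arm to pull is then `max(range(K), key=lambda i: ocucb_n_index(i, means, counts, t, eta, rho))`. Since $B$ is at least $e$, the exploration bonus never drops below $\sqrt{2\eta / N_i}$.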

The main algorithm uses a succession of epochs. In each epoch, the ETC subroutine collects (free) information on all the arms still under consideration, while OCUCB-n pulls arms in the currently available subset of the arms. At the end of the epoch, the free observations gathered are used to discard the arms which are not optimal with high enough confidence, forming the subset used in the next epoch. For each sub-optimal arm $i$, there is a finite epoch index, depending on $\epsilon$ and the gaps, after which arm $i$ has been discarded with high probability; hence arm $i$ contributes to the regret only up to that epoch and the regret is finite.

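In order to write a regret upper bound for our active algorithm, we introduce, for each arm $i$ and for the parameter $\rho$, the quantities

$$H_{i,\rho} = \frac{i}{\Delta_i^2} + \sum_{j=i+1}^{K} \frac{1}{\Delta_i^{2(1-\rho)} \Delta_j^{2\rho}}.$$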

These constants transcribe the difficulty of the problem. A number of observations of order will be necessary for ETC to eliminate arm with high confidence.
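For concreteness, the difficulty constants can be evaluated as follows. The sketch reads the definition as $H_{i,\rho} = i/\Delta_i^2 + \sum_{j > i} 1/(\Delta_i^{2(1-\rho)} \Delta_j^{2\rho})$ with arms sorted by increasing gap; the function name `hardness` and the 1-based indexing convention are ours.

```python
def hardness(i, gaps, rho):
    """Difficulty constant H_{i, rho} for arm i (1-based, i >= 2).

    `gaps[j - 1]` is the gap of arm j, with arms sorted by increasing gap,
    so that gaps[0] == 0 for the optimal arm.
    """
    d_i = gaps[i - 1]
    head = i / d_i ** 2
    tail = sum(
        1.0 / (d_i ** (2.0 * (1.0 - rho)) * gaps[j - 1] ** (2.0 * rho))
        for j in range(i + 1, len(gaps) + 1)
    )
    return head + tail
```

When all sub-optimal arms share the same gap $\Delta$, the value does not depend on `rho` and equals $K/\Delta^2$ for every arm.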

###### Theorem 4.

The regret of the active algorithm 2 with parameters and on problems with rewards in is

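$$\mathbb{E} R_T \le C_\eta \sum_{i=2}^{K} \frac{4}{\Delta_i} \max\left\{\log\left(\frac{1}{\epsilon}\right),\ \log\sqrt{H_{i,\rho}}\right\} + 51K + O\left(\sum_{i=2}^{K} \frac{1}{\Delta_i} \left(\log\log\frac{H_{i,1}}{\epsilon}\right)^2\right)$$

with $C_\eta$ a constant that depends only on $\eta$ (see (Lattimore, 2016) for details on $C_\eta$).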

Our analysis of Explore-Then-Commit relies on a new maximal concentration inequality which can be of independent interest.

###### Lemma 4.

Let be a -sub-Gaussian martingale difference sequence then, for every and every integers ,

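$$\mathbb{P}\left\{\exists t \le T,\ \bar Z_t \ge \sqrt{\frac{2\sigma^2}{t} \log\left(\frac{T}{\delta t}\right)}\right\} \le 6\delta\sqrt{\log\left(\frac{1}{\delta}\right)}.$$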

Asymptotically, we obtain

 limsupδ→0P{∃t≤T,¯¯¯¯Zt≥√2σ2tlog(Tδt)}δ√log(1δ)≤√e/8.

This value is .

#### Heuristics and Influence of ϵ

Besides the algorithm already discussed, we also experimented on the following heuristic: choose a bandit algorithm of the UCB family, which pulls the arm with a maximal index; use it to pull the arm with maximal index and if an observation is available, observe the second maximal arm. We provide no regret analysis for this heuristic but study its performance in the experimental section.

Concerning the dependency in $\epsilon$, we can make the following interesting remark. To simplify notations, we will assume that all arms have the same gap $\Delta$ and we remove constants for this analysis. With these simplifications, we proved that the regret at stage $T$ is of the order of $\frac{K}{\Delta} \log(1/\epsilon)$. Obviously, if $\epsilon$ is almost equal to 0, this upper bound is void and the algorithm should not depend on the free observations. One might ask what is the threshold at which free information becomes relevant at stage $T$.

Notice that standard information theory arguments yield that if $\epsilon$ is below a constant fraction of $K/(T\Delta^2)$, then even if the free observations were gathered at the beginning of the problem, only a fraction of the arms could be removed (with high probability) from the set of possible optimal arms. Hence this free information is not useful for at least a constant fraction of the arms, and the regret will have to scale as $\frac{K}{\Delta} \log(T\Delta^2)$, the optimal rate for the bandit problem with of the order of $K$ arms with equal gaps $\Delta$.

On the other hand, if $\epsilon \ge K/(T\Delta^2)$, then (up to multiplicative constants) $\log(T\Delta^2)$ dominates $\log(1/\epsilon)$. As a consequence, the relevant threshold for the probability of free observations after $T$ stages is

$$\epsilon^* = \frac{1}{T}\, \frac{K}{\Delta^2}.$$

## 5 Experiments

All experiments are performed with Gaussian rewards with unit variance.

Influence of $\epsilon$. The goal of this first experiment is to confirm the scaling of the regret with $\epsilon$. That is to say, the regret scales with $\log(1/\epsilon)$. The experiment is performed with a passive observer with either a uniform distribution or the optimal one, as defined in Section 3.1. Also, when free observations are scarce, , the average number of those is approximately during the experiment. Therefore, the regret is similar to the one suffered by a UCB algorithm in a classic multi-armed bandit setting, a behaviour captured by the function . On Figures 1 and 2, experiments are run on four Gaussian arms with expectations , , , ; the error bars are quantiles at and .

Passive Observer: optimal sampling distribution. This second experiment illustrates the induced regret in the passive setting with a probability distribution $p^{(i)} \propto 1/\Delta_i$. This distribution is considered to be optimal because, as mentioned in Section 3.1, it achieves the lowest lower bound. It also suggests a paradigm for algorithms in the active setting, i.e., freely sampling as much as possible the arm with the lowest $\Delta_i$. A way to do so is to run a UCB-type algorithm to choose which arm to pull, and to use another UCB-type algorithm on the other arms to determine which one will be observed if a free observation is available. The results of this type of policy are presented in the next paragraph.

The experiment is run on the same set of arms as previously with a uniform distribution, the optimal distribution and a suboptimal one such that , referred as SubOptimal in Figure 3. Color filled regions are and quantiles.

Active Observer: comparison of algorithms. This subsection is dedicated to the comparison of the algorithms introduced earlier: UCB1-Double, ETC-OCUCB and ETC-OCUCB-2.
UCB1-Double uses a UCB algorithm and selects, for the free observation, the arm with the second highest index. The optimal allocation in the passive setting samples better arms more often, therefore we use the free observation to sample the arm next to optimal (according to its UCB index). The second algorithm, referred to as ETC-OCUCB, is the algorithm studied in the above section. In particular, its ETC subroutine checks for potentially removable arms every pulls, with a fixed parameter and the set of currently active arms. Finally, the algorithm referred to as ETC-OCUCB-2 is a variant of ETC-OCUCB where elimination checks are made every stages, thus behaving less aggressively than ETC-OCUCB. In addition, we introduced in this experiment a parameter so that the epoch length is in ETC-OCUCB. This enables us to adapt the growth of epochs to the horizon, here . Other parameters are: , , and .
The experiment is run on five Gaussian arms with expectations , , , and . Color filled regions are and quantiles.

Figure 4 illustrates that:

• UCB1-Double rapidly reaches its final regret value, after a logarithmic exploration phase during which information is gathered so that the policy does not pull another suboptimal arm after this phase.

• The ETC-OCUCB and ETC-OCUCB-2 algorithms have similar performances, and the parameter offers a control over how often the set of active arms is updated, which offers a slight performance increase for lower .

ETC-OCUCB and ETC-OCUCB-2 maintain two distinct tracks of rewards, one for rewards obtained after pulling an arm and the other for rewards obtained by freely sampling an arm. Therefore, it may be possible to increase their performance by using both sources of information in both subroutines. In the Figure below, these variants are referred to as ETC-OCUCB-all-info and ETC-OCUCB-all-info-2.

This simple modification provides a clear improvement whether for the final regret or the speed at which this value is reached.

## 6 Conclusion

We analysed the multi-armed bandit problem with just a small amount of extra free information. Interestingly, as the regret is uniformly bounded in time, standard lower bounds are void. However, a careful analysis allowed us to exhibit a non-trivial guarantee that no reasonable algorithm can outperform, and we finally provided an optimal algorithm, whose regret matches the lower bound up to doubly logarithmic terms.

Finally, we would like to emphasize that our algorithm can be used even if the observations are not free. Since we used ETC on these observations, we get that our algorithm has a regret smaller (discarding multiplicative constants and $\log\log$ terms) than

$$\sum_{i=2}^{K} \frac{\log(\epsilon T \Delta_i^2)}{\Delta_i} + \sum_{i=2}^{K} \frac{\log(1/\epsilon)}{\Delta_i}$$

where the first term is the guarantee of ETC on $\epsilon T$ samples, and the second one is the guarantee of our algorithm with “free” observations. As a consequence, no matter the value of $\epsilon$ (as long as the $\log\log$ terms do not become dominant), its dependency vanishes, and we recover the expected performance of ETC.
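The cancellation behind this remark is a one-line computation on each arm, written out below as a worked step (constants and $\log\log$ terms are discarded, as in the text).

```latex
\begin{align*}
\frac{\log(\epsilon T \Delta_i^2)}{\Delta_i} + \frac{\log(1/\epsilon)}{\Delta_i}
  = \frac{\log(\epsilon T \Delta_i^2) - \log(\epsilon)}{\Delta_i}
  = \frac{\log(T \Delta_i^2)}{\Delta_i},
\end{align*}
```

so the sum over the sub-optimal arms no longer depends on $\epsilon$.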

#### Acknowledgements

V. Perchet has benefited from the support of the ANR (grant n.ANR-13-JS01-0004-01), of the FMJH Program Gaspard Monge in optimization and operations research (supported in part by EDF), from the Labex LMH and from the CNRS, PEPS project Lacreme.

## Appendix A Lower Bound Proof

Consider a bandit problem with Gaussian arms with distributions with , denoted by problem 1. We define other bandit problems in which an arm is changed to bring it above the optimal arm. Formally, problem with distributions is such that for all , and . The distributions of are the same in all problems.

Let . The Kullback-Leibler divergence between the observations up to time coming from problem 1 and problem is

$$
\mathrm{KL}\left(\mathbb{P}_1^{I_T}, \mathbb{P}_i^{I_T}\right) = \mathbb{E}_1 O_i(T)\,\frac{(\Delta+\Delta_i)^2}{2}
$$

By showing lower bounds on this divergence, we prove constraints on , leading to a lower bound on the regret. By using the principle of contraction of entropy [Garivier et al., 2016], we can relate the divergence between the observations in the two problems to the Kullback-Leibler divergence between Bernoulli variables. Let denote the Kullback-Leibler divergence between Bernoulli distributions with parameters and . .

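$$
\begin{aligned}
\mathrm{KL}\left(\mathbb{P}_1^{I_T}, \mathbb{P}_i^{I_T}\right)
&\ge \mathrm{kl}\left(\frac{\mathbb{E}_1 N_1(T)}{T}, \frac{\mathbb{E}_i N_1(T)}{T}\right) \\
&\ge \frac{\mathbb{E}_1 N_1(T)}{T} \log\frac{1}{\mathbb{E}_i N_1(T)/T} - \log(2)
\end{aligned}
$$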

where we used that .
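The second inequality above has the form kl(p, q) ≥ p log(1/q) − log 2. As a sanity check, the sketch below evaluates the slack of this inequality on a grid of Bernoulli parameters; the function names and the grid are choices made here for illustration, not taken from the paper.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with 0*log(0) = 0."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def entropy_contraction_bound(p, q):
    """Right-hand side p*log(1/q) - log(2) used in the proof."""
    return p * math.log(1 / q) - math.log(2)

p_grid = [k / 20 for k in range(0, 21)]   # p may reach 0 or 1
q_grid = [k / 20 for k in range(1, 20)]   # q kept strictly inside (0, 1)
smallest_slack = min(
    bernoulli_kl(p, q) - entropy_contraction_bound(p, q)
    for p in p_grid
    for q in q_grid
)
print(smallest_slack)  # non-negative when the inequality holds on the grid
```

The inequality holds because kl(p, q) equals p log(1/q) plus a non-negative term minus the binary entropy of p, and that entropy is at most log 2.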

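The expectations of the number of pulls will be bounded through the hypothesis of sub-logarithmic regret: as the regret must be low, the number of pulls of sub-optimal arms must also be low. The algorithm is sub-logarithmic with constants , at all stages , i.e. on all multi-armed bandit problems, for all stages , . Let and .

$$
\begin{aligned}
\frac{\mathbb{E}_1 N_1(T)}{T} &= 1 - \sum_{i \ne 1} \frac{\mathbb{E}_1 N_i(T)}{T} \ge 1 - \frac{1}{T}\left(CK + C C_\Delta \log T\right), \\
\frac{\mathbb{E}_i N_1(T)}{T} &\le \frac{1}{T\Delta}\left(CK + C \sum_{j \ne i} \frac{1}{\Delta+\Delta_j} \log T\right).
\end{aligned}
$$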

We obtain finally the constraint

$$
\mathbb{E}_1 O_i(T) \ge \frac{2}{(\Delta+\Delta_i)^2}\, h_i(T),
$$

where

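$$
\begin{aligned}
h_i(T) &= \log\left(\frac{T\Delta^2}{2C\log T \sum_{j \ne i} \frac{\Delta}{\Delta+\Delta_j}}\right) + \eta_i(T), \\
\eta_i(T) &= -\frac{1}{T}\left(CK + C C_\Delta \log T\right) \log\left(\frac{T\Delta^2}{CK\Delta + C\log T \sum_{j \ne i} \frac{\Delta}{\Delta+\Delta_j}}\right) - \log\left(1 + \frac{CK}{C\log T \sum_{j \ne i} \frac{1}{\Delta+\Delta_j}}\right).
\end{aligned}
$$

### A.1 Properties of $h_i$ and $\eta_i$.

The function is increasing over . Its derivative satisfies

$$
h_i'(t) \ge \frac{1}{t}\left(1 - \frac{1}{\log t + \frac{CK}{C \sum_{j \ne i} \frac{1}{\Delta+\Delta_j}}}\right).
$$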

If , then .

Let . Then for , , so that is increasing over .

### A.2 Passive Static Setting

In this setting, for all (with ), such that . Using the constraint on , we deduce that the regret of a sub-logarithmic algorithm must be bigger than the solution of the optimization problem

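$$
\begin{aligned}
\text{minimize in } n: \quad & \sum_{i=2}^{K} n_i \Delta_i \\
\text{subject to} \quad & \forall i \ge 2,\ n_i \ge \frac{2 h_i(T)}{(\Delta+\Delta_i)^2} - \epsilon T p(i), \\
& n \succeq 0.
\end{aligned}
$$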

The solution is given by . We see that for big enough, the lower bound is 0. This does not reflect the problem at hand since some regret is unavoidable at the beginning, when few free observations are available.

Since is non-decreasing, we can aggregate the constraints on this quantity up to stage to get the stronger constraint

$$
\mathbb{E}_1 N_i(T) \ge \sup_{3 \le t \le T} \left\{ \frac{2 h_i(t)}{(\Delta+\Delta_i)^2} - \epsilon t\, p(i) \right\}
$$

3 is taken as the starting point for to ensure that is increasing.
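To illustrate how aggregating the constraints over stages removes the degenerate behaviour for large horizons, the sketch below evaluates the supremum on integer stages. It is a simplification under stated assumptions: only the dominant logarithmic term of the function h is kept (the correction η is dropped), the constant `c` is set to 1, and all gaps, the frequency `eps`, and the probability `p_i` are hypothetical values.

```python
import math

def h_main(t, delta, gaps_others, c=1.0):
    """Dominant term of h_i(t): log(t*delta^2 / (2*c*log(t)*sum_j delta/(delta+gap_j))).

    gaps_others holds the gaps Delta_j for j != i (the optimal arm has gap 0).
    The correction term eta_i is ignored in this sketch.
    """
    s = sum(delta / (delta + g) for g in gaps_others)
    return math.log(t * delta ** 2 / (2 * c * math.log(t) * s))

def aggregated_bound(horizon, delta, gap_i, gaps_others, eps, p_i, c=1.0):
    """Supremum over integer stages 3 <= t <= horizon of the per-stage constraint."""
    best = -math.inf
    for t in range(3, horizon + 1):
        value = 2 * h_main(t, delta, gaps_others, c) / (delta + gap_i) ** 2 - eps * t * p_i
        best = max(best, value)
    return best

# Hypothetical instance: three arms with gaps 0, 0.2 and 0.4; we look at the arm with gap 0.2.
for horizon in (1000, 5000, 20000):
    print(horizon, aggregated_bound(horizon, 0.2, 0.2, [0.0, 0.4], eps=0.01, p_i=0.5))
```

The per-stage constraint eventually decreases with the stage because of the linear term, whereas the aggregated value is non-decreasing in the horizon by construction and stays at its maximum once the maximizing stage has been passed.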

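##### Small horizon: $T \le 1/(\epsilon p(i)(\Delta+\Delta_i)^2)$.

$$
\mathbb{E}_1 N_i(T) \ge \frac{2 h_i(T)}{(\Delta+\Delta_i)^2} - \epsilon T p(i) \ge \frac{1}{(\Delta+\Delta_i)^2}\left(2 h_i(T) - 1\right).
$$

##### Big horizon: $T \ge 1/(\epsilon p(i)(\Delta+\Delta_i)^2)$.

$$
\mathbb{E}_1 N_i(T) \ge \frac{1}{(\Delta+\Delta_i)^2}\left[2 h_i\left(\frac{1}{\epsilon p(i)(\Delta+\Delta_i)^2}\right) - 1\right],
$$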

where this value is obtained by taking .

In the construction of this lower bound, we can choose separately for each arm. Using in each , we get

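$$
\mathbb{E}_1 R_T \ge \sum_{i=2}^{K} \frac{1}{2\Delta_i}\left[\log\left(\frac{1}{\epsilon}\,\frac{1}{8 C p(i) \sum_{j \ne i} \frac{\Delta_i}{\Delta_i+\Delta_j}}\right) - \log\log\left(\frac{1}{4\epsilon p(i)\Delta_i^2}\right) + \eta_i\left(\frac{1}{4\epsilon p(i)\Delta_i^2}\right) - \frac{1}{2}\right].
$$

### A.3 Active Setting

Using the constraints, we obtain that the regret of any sub-logarithmic algorithm must be at least the value of the following optimization problem

$$
\begin{aligned}
\text{minimize in } n, f: \quad & \sum_{i=2}^{K} n_i \Delta_i \\
\text{subject to} \quad & \forall i \ge 2,\ n_i + f_i \ge \frac{2 h_i(T)}{(\Delta+\Delta_i)^2}, \\
& \sum_{i=2}^{K} f_i \le \epsilon T, \quad n \succeq 0, \quad f \succeq 0.
\end{aligned}
$$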

The solution of this optimization problem has the following structure: there exists a such that for all such that , and ; for all such that , and ; for the possible index with , and . That is, an optimal algorithm uses the free information on bad arms and uses the costly pulls on good arms. The optimal attainable expected regret is then

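$$
\mathbb{E} R_T \ge \sum_{i \le k} \frac{2\Delta_i h_i(T)}{(\Delta+\Delta_i)^2} - \Delta_k\left(\epsilon T - \sum_{j > k} \frac{2 h_j(T)}{(\Delta+\Delta_j)^2}\right),
$$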

where .
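The threshold structure of the solution described above (free observations on the arms with the largest gaps, costly pulls on the remaining ones, and at most one arm split between the two) corresponds to a greedy allocation. The sketch below is a minimal illustration: `requirements` stands for the right-hand sides of the per-arm constraints, and all numerical values are hypothetical.

```python
def allocate_free_observations(gaps, requirements, budget):
    """Greedy solution of the linear program of the active setting.

    gaps: gaps Delta_i of the suboptimal arms.
    requirements: required number of observations per arm (assumed given).
    budget: total number of free observations available (epsilon * T).
    Returns (pulls, free, regret), where regret = sum_i pulls[i] * gaps[i].
    """
    # Serve the arms with the largest gaps first: each free observation
    # replaces a pull whose cost is the gap of the arm.
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    free = [0.0] * len(gaps)
    pulls = [0.0] * len(gaps)
    remaining = budget
    for i in order:
        free[i] = min(requirements[i], remaining)
        remaining -= free[i]
        pulls[i] = requirements[i] - free[i]
    regret = sum(n * g for n, g in zip(pulls, gaps))
    return pulls, free, regret

# Hypothetical instance: three suboptimal arms and 50 free observations.
pulls, free, regret = allocate_free_observations(
    gaps=[0.1, 0.3, 0.5], requirements=[400, 60, 20], budget=50
)
print(pulls, free, regret)
```

In this instance the budget covers the arm with gap 0.5 entirely and 30 of the 60 observations required by the arm with gap 0.3, which plays the role of the split index; the arm with gap 0.1 is explored through pulls only.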

#### Increasing number of pulls

The lower bounds for increasing stages show that free information should progressively replace pulls, starting from worse arms. For big enough the lower bound on is 0. An optimal algorithm should somehow have used only free information to explore. This is impossible, since the algorithm doesn’t know at first which arm is the best. The lower bound exhibits this behaviour because it is written for fixed and ignores that both and must be non-decreasing. We can get a tighter lower bound by using this monotonicity.

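$$
\mathbb{E} R_T \ge \max_{t \le T} \sum_{i \le k_t} \frac{h_i(t)}{2\Delta_i} - \Delta_{k_t}\left(\epsilon t - \sum_{j > k_t} \frac{h_j(t)}{2\Delta_j^2}\right) \ge \max_{t \le T} \sum_{i} \dots
$$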

For , let , such that . We can rewrite the lower bound on the regret to introduce these stages,

$$
\mathbb{E} R_T \ge \max_{k \,:\, t_k \le T} \sum_{i=2}^{k} \frac{h_i(t_k)}{\Delta_i}
$$

and verify

 tk ≥1ϵK