# Approximations of the Restless Bandit Problem

Steffen Grünewälder (s.grunewalder@lancaster.ac.uk) and Azadeh Khaleghi (a.khaleghi@lancaster.ac.uk)
Department of Mathematics and Statistics
Lancaster University
Lancaster, UK

###### Abstract

The multi-armed restless bandit problem is studied in the case where the pay-off distributions are stationary φ-mixing. This version of the problem provides a more realistic model for most real-world applications, but cannot be optimally solved in practice, since it is known to be PSPACE-hard. The objective of this paper is to characterize a sub-class of the problem where good approximate solutions can be found using tractable approaches. Specifically, it is shown that under some conditions on the φ-mixing coefficients, a modified version of UCB can prove effective. The main challenge is that, unlike in the i.i.d. setting, the distributions of the sampled pay-offs may not have the same characteristics as those of the original bandit arms. In particular, the φ-mixing property does not necessarily carry over. This is overcome by carefully controlling the effect of a sampling policy on the pay-off distributions. Some of the proof techniques developed in this paper can be more generally used in the context of online sampling under dependence. The proposed algorithms are accompanied by corresponding regret analyses.


## 1 Introduction

As one of the simplest examples of sequential optimization under uncertainty, multi-armed bandit problems arise in various modern real-world applications, such as online advertisement and Internet routing. These problems are typically studied under the assumption that the pay-offs are independently and identically distributed (i.i.d.), and the arms are independent. However, this assumption does not necessarily hold in many practical situations. Consider, for example, the problem of online advertisement in which the aim is to garner as many clicks as possible from a user. Grouping adverts into categories and associating an arm with each category turns this problem into a multi-armed bandit problem. There is dependence over time and across the arms since, for example, we expect a user to be more likely to select adverts that are related to her selections in the recent past.

In this paper, we consider the multi-armed bandit problem in the case where the pay-offs are dependent and each arm evolves over time regardless of whether or not it is played. This is an instance of the so-called restless bandit problem (Whittle, 1988; Guha et al., 2010; Ortner et al., 2014). Since in this setting an optimal policy can leverage the inter-dependencies between the samples and switch between the arms at appropriate times, it can obtain an overall pay-off much higher than that given by playing the best arm, i.e. the distribution with the highest expected pay-off, see Example 1 in (Ortner et al., 2014). However, finding the best such switching strategy is PSPACE-hard, even in the case where the process distributions are Markovian with known dynamics (Papadimitriou and Tsitsiklis, 1999). Therefore, it is useful to consider relaxations of the problem with the aim to devise computationally tractable solutions that effectively approximate the optimal switching strategy. Approximations of the Markovian restless bandit problem under known dynamics have been previously considered, see, e.g. (Guha et al., 2010) and references therein. Our focus in this paper is on a more general setting, where the rewards have unknown distributions, exhibit long-range dependencies, and have Markov Chains as a special case.

We are interested in a sub-class of the restless bandit problem, where the pay-offs have long-range dependencies. Since the nature of the problem calls for finite-time analysis, we further require that the dependence weakens over time, so as to make use of concentration inequalities in this setting. To this end, a natural approach is to assume that the pay-off distributions are stationary φ-mixing. The so-called φ-mixing coefficients $\varphi_n$, $n \in \mathbb{N}_+$, of a sequence of random variables $X_1, X_2, \dots$ measure the amount of dependence between the sub-sequences of $X_1, X_2, \dots$ separated by $n$ time-steps. A process is said to be φ-mixing if this dependence vanishes as $n$ grows. This notion is more formally defined in Section 2. In Markov Chains φ-mixing coefficients are closely related to mixing times. The mixing time of a Markov Chain is a measure of how fast its distribution approaches the stationary distribution. In particular, it is defined to be the time that it takes for the distribution to be within $1/4$ of the stationary distribution as measured in total variation distance (Levin et al., 2008)[Sec 4.5]. A classical result by Davydov shows that the decrease in distance of the distribution of the Markov Chain to the stationary distribution is controlled (up to a constant factor) by the φ-mixing coefficients of the Markov Chain (Doukhan, 1994)[p.88]. While φ-mixing coefficients are related to the well-studied mixing properties of Markov Chains, φ-mixing processes correspond to a wide variety of stochastic processes of which Markov Chains are a special case (Doukhan, 1994).

As discussed earlier, the optimal, yet notoriously infeasible strategy for this version of the problem is to switch between the arms. In this paper, we first address the question of when a relaxation obtained by identifying the arm with the highest stationary mean would lead to a viable approximation in this setting. For this purpose, we characterize the approximation error in terms of the amount of dependence between the pay-offs, and show that if $\varphi_1$ is small, the optimum of the relaxed problem is close to that given by the optimal switching strategy. Observe that this condition translates directly to the pay-off distributions being weakly dependent. Next, we address the question of how an optimistic approach can be devised to identify the best arm. To this end, we propose a UCB-type algorithm and show that it achieves logarithmic regret with respect to the highest stationary mean. Interestingly, the amount of dependence in the form of $\|\varphi\|$, the sum of the φ-mixing coefficients, appears in the bound, and in the case where the pay-offs are i.i.d., we recover the regret bound of Auer et al. (2002).

Note that even this relaxed version of the problem is far from straightforward. The main challenge lies in obtaining confidence intervals around empirical estimates of the stationary means. Since Hoeffding-type concentration bounds exist for φ-mixing processes, it may be tempting to use such inequalities directly with standard UCB algorithms designed for the i.i.d. setting, to find the best arm. However, as we demonstrate in the paper, unlike in the i.i.d. setting, a policy in this framework may introduce strong couplings between past and future pay-offs in such a way that the distribution of the sampled sequence may not even be φ-mixing. This is the reason why a standard UCB algorithm designed for i.i.d. settings is not suitable here, even when equipped with a concentration bound for φ-mixing processes. In fact, this oversight seems to have occurred in previous literature, specifically in the Improved-UCB-based approach of Audiffren and Ralaivola (2015). We refer to Section 4.1 for a more detailed discussion. We circumvent these difficulties by carefully taking random times (random variables which determine the time at which an arm is sampled) into account and controlling the effect of a policy on the pay-off distributions. Some of our technical results can be more generally used to address the problem of online sampling under dependence.

Finally, while the study of the multi-armed bandit problem with strongly dependent pay-offs at its full generality is beyond the scope of this paper, we provide a complementary example for this regime. Specifically, we consider a setting where the bandit arms are governed by stationary Gaussian processes with slowly decaying covariance functions. Such high-dependence scenarios are quite common in practice. For instance, the throughput of radio channels changes slowly over time and the problem of choosing the best channel can be modelled by a bandit problem with strongly dependent pay-offs. The intuitive reason why in this setting it may also be possible to efficiently obtain approximately optimal solutions is that the strong dependencies can allow for the prediction of future rewards even from scarce observations. We give a simple switching strategy for this instance of the problem and show that it significantly outperforms a policy that aims for the best arm. Our regret bound for this algorithm directly reflects the dependence between the pay-offs: the higher the dependence the lower the regret.

A summary of our main contributions is listed below.

• In an attempt to derive a computationally tractable solution for the restless bandit problem, we first identify a case where the optimal switching strategy can be approximated by playing the arm with the highest stationary mean. To this end, we show that the loss of settling for the highest stationary mean as opposed to finding the best switching strategy is controlled by the amount of inter-dependence as reflected by $\varphi_1$. This is shown in Proposition 7.

• We provide a detailed example, namely Example 2, where we demonstrate the challenges in the non-i.i.d. bandit problem. In particular, we show that a policy in this framework may introduce strong couplings between past and future pay-offs in such a way that the resulting pay-off sequence may have a completely different dependency structure.

• We develop technical machinery that circumvents the difficulties introduced by the inter-dependence between the rewards and allows us to control the effect of a policy on the pay-off distributions. Some of our derivations concerning sampling under dependence can be of independent interest. We further propose a UCB-type algorithm, namely Algorithm 1, which deploys these tools to identify the arm with the highest stationary mean.

• We provide an upper bound on the regret of Algorithm 1 (with respect to the highest stationary mean). This is provided in Theorem 10. The regret bound is a function of the amount of inter-dependence as reflected by the φ-mixing coefficients, and in the case where the pay-offs are i.i.d., we recover the regret bound of Auer et al. (2002)[Thm. 1]. This result along with Proposition 7 allows us to argue that in the case where dependence is low Algorithm 1 can be used to approximate the best switching strategy.

The remainder of the paper is organized as follows. In Section 2 we introduce preliminary notation and definitions. We formulate the problem in Section 3 and give our main results in Section 4. We conclude in Section 5 with a discussion of open problems.

## 2 Preliminaries

We start with some useful notation before discussing basic definitions concerning stochastic processes. Since the bandit problem involves multiple arms (processes), we extend the definition of a φ-mixing process to what we call a jointly φ-mixing process, to be able to model the multi-armed bandit problem. Indeed, in many natural settings, the process is jointly φ-mixing. For example, as we demonstrate in Proposition 5, independent Markov Chains are jointly φ-mixing.

##### Notation.

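Let $\mathbb{N}_+ := \{1, 2, \dots\}$ denote the set of natural numbers and $\mathbb{N}_+ \cup \{\infty\}$ the extended set of natural numbers. We introduce the abbreviation $a_{m..n}$, $m, n \in \mathbb{N}_+$, $m \le n$, for sequences $a_m, \dots, a_n$. Given a finite subset $C \subset \mathbb{N}_+$ and a sequence $a$, we let $a_C := \{a_i : i \in C\}$ denote the set of elements of $a$ indexed by $C$. If $X_C$ is a sequence of random variables indexed by $C \subset \mathbb{N}_+$, we denote by $\sigma(X_C)$ the smallest σ-algebra generated by $X_C$.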

##### Notion of φ-dependence.

Part of our results concern the so-called φ-dependence between σ-algebras defined as follows, see, e.g. (Doukhan, 1994).

###### Definition 1

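Consider a probability space $(\Omega, \mathcal{A}, P)$ and let $\mathcal{U}$ and $\mathcal{V}$ denote σ-subalgebras of $\mathcal{A}$. The φ-dependence between $\mathcal{U}$ and $\mathcal{V}$ is given by

$$\varphi(\mathcal{U}, \mathcal{V}) := \sup\{|P(V) - P(V \mid U)| : U \in \mathcal{U},\ P(U) > 0,\ V \in \mathcal{V}\}.$$

If $X$ and $Y$ are two random variables measurable with respect to $\mathcal{A}$ we simplify notation by letting $\varphi(X, Y)$ denote the φ-dependence between their corresponding σ-algebras; the distinction will be clear from the context. Similarly, if $X_A$ and $X_B$ are finite sequences of random variables, with $A, B \subset \mathbb{N}_+$, their φ-dependence is defined as

$$\varphi(X_A, X_B) := \varphi(\sigma(X_A), \sigma(X_B)).$$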

In words, $\varphi(X_A, X_B)$ measures the maximal difference between the probability of an event $V$ and its conditional probability given an event $U$, where $U$ and $V$ are determined by random variables indexed by $A$ and $B$ respectively. The notion of φ-dependence carries over from probability measures to expectations. In particular, consider a real-valued random variable $X$ defined on some probability space $(\Omega, \mathcal{A}, P)$, and denote by $\mathcal{G}$ some collected information in the form of a σ-subalgebra of $\mathcal{A}$. Let $E(X \mid \mathcal{G})$ denote Kolmogorov's conditional expectation, i.e. a $\mathcal{G}$-measurable random variable such that $\int_B E(X \mid \mathcal{G})\, dP = \int_B X\, dP$ for all $B \in \mathcal{G}$. As follows from Theorem 2 below, due to Bradley (2007)[vol. 1 pp. 124], the difference between $E(X \mid \mathcal{G})$ and $E(X)$ is effectively upper-bounded by $\varphi(\mathcal{G}, \sigma(X))$.
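For random variables with finitely many values, the supremum in Definition 1 can be evaluated by enumerating events. The following is a minimal sketch of this computation for a joint probability mass function; the names `subsets` and `phi_dependence` and the dictionary encoding of the distribution are illustrative assumptions and do not appear in the paper.

```python
from itertools import chain, combinations


def subsets(values):
    """All non-empty subsets of a finite collection of values."""
    items = list(values)
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))


def phi_dependence(joint):
    """phi(sigma(X), sigma(Y)) for a joint pmf given as {(x, y): probability}."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    best = 0.0
    for U in subsets(xs):                      # conditioning event {X in U}
        p_u = sum(p for (x, _), p in joint.items() if x in U)
        if p_u <= 0.0:
            continue                           # Definition 1 requires P(U) > 0
        for V in subsets(ys):                  # target event {Y in V}
            p_v = sum(p for (_, y), p in joint.items() if y in V)
            p_uv = sum(p for (x, y), p in joint.items() if x in U and y in V)
            best = max(best, abs(p_v - p_uv / p_u))
    return best


independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(phi_dependence(independent), phi_dependence(identical))
```

Two independent fair bits give no dependence, whereas two identical fair bits give the value $1/2$, since conditioning on the first bit moves the probability of the second from $1/2$ to $0$ or $1$. The enumeration is exponential in the number of values and is only meant for very small examples.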

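###### Theorem 2

Let $(\Omega, \mathcal{A}, P)$ be a probability space, let $X$ be a real-valued random variable with $\|X\|_\infty < \infty$ and let $\mathcal{G}$ be some σ-subalgebra of $\mathcal{A}$. Then

$$2\varphi(\mathcal{G}, \sigma(X)) = \sup \|E(Y \mid \mathcal{G}) - E(Y)\|_1 / \|Y\|_1,$$

where the supremum is taken over all $\sigma(X)$-measurable random variables $Y$ with $0 < \|Y\|_1 < \infty$. Furthermore, for any $B \in \mathcal{G}$ it holds that

$$\int_B |E(X \mid \mathcal{G}) - E(X)|\, dP \le 2 P(B) \|X\|_\infty\, \varphi(\mathcal{G}, \sigma(X)).$$

Observe that because $X$ is trivially $\sigma(X)$-measurable the first equality given in the theorem implies that

$$\|E(X \mid \mathcal{G}) - E(X)\|_1 \le 2 \|X\|_1\, \varphi(\mathcal{G}, \sigma(X)).$$
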
##### Stochastic Processes & φ-mixing Properties.

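Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space, where $\mathcal{X}$ is the set of values that a pay-off can take (for instance a finite set or a closed interval $[a, b]$, for $a, b \in \mathbb{R}$, $a < b$) and $\mathcal{B}_{\mathcal{X}}$ denotes the Borel σ-algebra on $\mathcal{X}$. We denote by $\mathcal{X}^{\mathbb{N}_+}$ the set of all $\mathcal{X}$-valued infinite sequences indexed by $\mathbb{N}_+$. A stochastic process can be modelled as a probability measure $P$ over the space $(\mathcal{X}^{\mathbb{N}_+}, \mathcal{B})$ where $\mathcal{B}$ denotes the σ-algebra on $\mathcal{X}^{\mathbb{N}_+}$ generated by the cylinder sets. Associated with the stochastic process is a sequence of random variables $X_t$, $t \in \mathbb{N}_+$, where $X_t$ is the projection onto the $t$'th element, i.e. $X_t(\omega) = \omega_t$ for $\omega \in \mathcal{X}^{\mathbb{N}_+}$ and $t \in \mathbb{N}_+$. A process is stationary if for all $n, j \in \mathbb{N}_+$ and all Borel sets $B_1, \dots, B_n \in \mathcal{B}_{\mathcal{X}}$, $P(X_1 \in B_1, \dots, X_n \in B_n) = P(X_{1+j} \in B_1, \dots, X_{n+j} \in B_n)$. The term stochastic process refers to either the process distribution $P$ or the associated sequence of random variables $X_t$, $t \in \mathbb{N}_+$; the reference will be clear from the context.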

###### Definition 3 (Stationary φ-mixing Process)

Consider a stationary stochastic process $X_t$, $t \in \mathbb{N}_+$. Its φ-mixing coefficients are given by

$$\varphi_n := \sup_{u, v \in \mathbb{N}_+} \varphi(X_{1..u}, X_{u+n..u+n+v-1}), \quad n \in \mathbb{N}_+,$$

and measure the φ-dependence between $X_{1..u}$ and $X_{u+n..u+n+v-1}$:

$$\underbrace{X_1, \dots, X_u} \quad \leftarrow \text{gap of length } n \rightarrow \quad \underbrace{X_{u+n}, \dots, X_{u+n+v-1}}$$

The process is said to be φ-mixing if $\lim_{n \to \infty} \varphi_n = 0$.
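To make the decay of these coefficients concrete, the sketch below evaluates the φ-dependence between $X_1$ and $X_{1+n}$ for a stationary symmetric two-state Markov Chain that changes state with probability `eps` at each step. Taking $u = v = 1$ in the supremum shows that this quantity is a lower bound on $\varphi_n$. The function names and the choice of chain are assumptions made for illustration only.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][r] * b[r][j] for r in range(2)) for j in range(2)] for i in range(2)]


def pairwise_phi(eps, n):
    """phi-dependence between X_1 and X_{1+n} for the symmetric two-state chain."""
    step = [[1.0 - eps, eps], [eps, 1.0 - eps]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        power = mat_mul(power, step)           # n-step transition probabilities
    stationary = [0.5, 0.5]
    # With two states, only the events {X_1 = x} and {X_{1+n} = y} can give a
    # non-zero difference; the whole space and the empty set contribute zero.
    return max(abs(stationary[y] - power[x][y]) for x in range(2) for y in range(2))


for n in (1, 5, 25):
    print(n, pairwise_phi(0.1, n))
```

For this chain the value equals $\frac{1}{2}|1 - 2\epsilon|^n$, so the dependence between two single observations decays geometrically in the gap $n$, and it decays more slowly the smaller $\epsilon$ is.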

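When modelling a bandit problem in this paper, we are concerned with $k$ stochastic processes with a joint distribution that is stationary φ-mixing. More specifically, for a fixed $k \in \mathbb{N}_+$, let $(\Omega, \mathcal{A}, P)$ be a probability space where $\Omega := (\mathcal{X}^k)^{\mathbb{N}_+}$ with $\mathcal{X}$ as above, $P$ a probability measure and $\mathcal{A}$ obtained via the cylinder sets. Let $\mathcal{B}_{\mathcal{X}^k}$ denote the Borel σ-algebra on $\mathcal{X}^k$. In much the same way as with the single process described above, associated with the joint process is a sequence of random variables $X_{t,i}$, $t \in \mathbb{N}_+$, $i \in 1..k$, where $X_{t,i}$ is the projection onto the $(t,i)$'th element, i.e. $X_{t,i}(\omega) = \omega_{t,i}$ for $\omega \in \Omega$. The measure $P$ is stationary if $P(X_{1,1..k} \in B_1, \dots, X_{n,1..k} \in B_n) = P(X_{1+j,1..k} \in B_1, \dots, X_{n+j,1..k} \in B_n)$ for every $n, j \in \mathbb{N}_+$ and $B_1, \dots, B_n \in \mathcal{B}_{\mathcal{X}^k}$. As above, the term stochastic process is used interchangeably to correspond to the sequence of random variables $X_{t,i}$ or their corresponding joint measure $P$.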

###### Definition 4 (Jointly Stationary φ-mixing Processes)

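For a fixed $k \in \mathbb{N}_+$, consider a stationary process $X_{t,1..k}$, $t \in \mathbb{N}_+$. Its φ-mixing coefficients are given by

$$\varphi_n := \sup_{u, v \in \mathbb{N}_+} \varphi(X_{1..u,1..k}, X_{u+n..u+n+v-1,1..k}), \quad n \in \mathbb{N}_+,$$

and measure the φ-dependence between $X_{1..u,1..k}$ and $X_{u+n..u+n+v-1,1..k}$. The process is said to be jointly φ-mixing if $\lim_{n \to \infty} \varphi_n = 0$.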

We use $\varphi_n$ to denote the φ-mixing coefficient corresponding to a single process or to a joint process given by Definitions 3 and 4 respectively; the notion will be apparent from the context. Under the assumption that the joint process is stationary φ-mixing, there could be dependence between the $k$ processes, as the mixing requirement needs only to be fulfilled by the joint process. The assumption that the process is jointly φ-mixing is fulfilled by a variety of well-known models, including independent Markov Chains. More generally, as follows from Proposition 5 below, if we have $k$ independent ψ-mixing processes then the joint process is φ-mixing. The ψ-mixing property is defined in the same way as φ-mixing whereby the φ-dependence given by Definition 1 is replaced with $\psi(\mathcal{U}, \mathcal{V}) := \sup\{|P(U \cap V)/(P(U)P(V)) - 1| : U \in \mathcal{U},\ V \in \mathcal{V},\ P(U)P(V) > 0\}$, and the ψ-mixing coefficients are defined in a manner analogous to Definition 3. A ψ-mixing process is also φ-mixing and the ψ-mixing coefficients upper bound the φ-mixing coefficients.

###### Proposition 5

Let $(\Omega, \mathcal{A}, P)$ be some probability space with $k$ mutually independent processes defined on it. If each of these processes is ψ-mixing then the joint process is also ψ-mixing, and for all $n \in \mathbb{N}_+$ the mixing coefficients of the joint process are bounded in terms of upper-bounds on the mixing coefficients of the individual processes.

The proof is provided in Appendix A.2. Using the above result, it can be shown that independent Markov Chains are jointly φ-mixing. More specifically, we have the following example.

###### Example 1 (Independent Markov Chains are jointly φ-mixing.)

Let $X_{t,i}$, $t \in \mathbb{N}_+$, $i \in 1..k$, for some $k \in \mathbb{N}_+$ be mutually independent, stationary, finite-state Markov processes. By Theorem 3.1 of Bradley (2005) each process is ψ-mixing. Moreover, by Proposition 5 above, mutually independent ψ-mixing processes are jointly φ-mixing. As a result the processes $X_{t,i}$, $i \in 1..k$, are jointly stationary φ-mixing. The significance of this observation is that jointly φ-mixing processes have mutually independent Markov processes as a special case.

## 3 Problem Formulation: the jointly φ-mixing bandit problem.

A total of $k$ bandit arms are given, where for each $i \in 1..k$, arm $i$ corresponds to a stationary process that generates a time series of pay-offs $X_{1,i}, X_{2,i}, \dots$. The joint process over the $k$ arms is φ-mixing in the sense of Definition 4, and $\|\varphi\| := \sum_{n \in \mathbb{N}_+} \varphi_n < \infty$. Each process has stationary mean $\mu_i$ and we denote by $\mu^* := \max_{i \in 1..k} \mu_i$ the highest stationary mean. We sometimes refer to the arm with the highest stationary mean as the best arm. At every time-step $t$, a player chooses one of the $k$ arms according to a policy $\pi$ and receives a reward $X_{t,\pi_t}$. The player's objective is to maximize the sum of the pay-offs received. The policy has access only to the pay-offs gained at earlier stages and to the arms it has chosen. Let $\mathcal{F}_t$, $t \ge 0$, be a filtration that tracks the pay-offs obtained in the past $t$ rounds, i.e. $\mathcal{F}_0 := \{\emptyset, \Omega\}$ and $\mathcal{F}_t := \sigma(X_{1,\pi_1}, \dots, X_{t,\pi_t})$ for $t \ge 1$. A policy is a sequence of mappings $\pi_t : \Omega \to 1..k$, $t \ge 1$, each of which is measurable with respect to $\mathcal{F}_{t-1}$. Note that the assumption that $\pi_t$ is measurable with respect to $\mathcal{F}_{t-1}$ is equivalent to the assumption that the policy can be written as a function of the past pay-offs and chosen arms, see, e.g. (Shiryaev, 1991)[Thm. 3, p.174]. Let $\Pi$ denote the space of all possible policies. We also let $\mathcal{G}_t$ denote the filtration that keeps track of all the information available up to time $t$ (including unobserved pay-offs). More specifically, let $\mathcal{G}_0 := \sigma(\mathcal{N})$ and $\mathcal{G}_t := \sigma(X_{1..t,1..k}) \vee \sigma(\mathcal{N})$ for $t \ge 1$, where $\mathcal{N}$ is the family of sets of $P$-measure zero. We define the maximal value that can be achieved in $n$ rounds as

$$v^*_n = \sup_{\pi \in \Pi} \sum_{t=1}^n E X_{t,\pi_t}. \tag{1}$$

The regret that builds up over $n$ rounds for any strategy $\pi$ is

$$R_\pi(n) := v^*_n - \sum_{t=1}^n E X_{t,\pi_t}. \tag{2}$$

To simplify notation, we may use $R(n)$ when the policy is clear from the context.

## 4 Main Results

We consider the restless bandit problem in a setting where the reward distributions are jointly φ-mixing as formulated in Section 3. Recall that, while the optimal strategy in this case is to switch between the arms, obtaining the best switching strategy is PSPACE-hard. We address the question of when and how a good and computationally tractable approximation of the optimal policy can be obtained in this setting. The former question is answered in Section 4.2 where we characterize (in terms of $\varphi_1$) the loss of settling for the highest stationary mean as opposed to following the best switching strategy. We show that for small φ-mixing coefficients, the optimum of this relaxed problem is close to that given by $v^*_n$. To answer the latter, we devise a UCB-type algorithm in Section 4.3 to identify the arm with the highest stationary mean. The main challenge lies in building confidence intervals around empirical estimates of the stationary means. Indeed, as we demonstrate in Section 4.1, unlike in the i.i.d. setting, a policy in this framework may introduce strong couplings between past and future pay-offs in such a way that the resulting pay-off sequence may not even be φ-mixing. As a result, a standard UCB algorithm designed for an i.i.d. setting is not suitable here, even when equipped with a Hoeffding-type concentration bound for φ-mixing processes. We circumvent these difficulties in Sections 4.2 and 4.3 by controlling the effect of a policy on the pay-off distributions. Part of the analysis in these two sections relies on some technical results for φ-mixing processes outlined in Appendix A.1, which may be of independent interest. Finally, our results for the weakly dependent reward distributions are complemented in Section 4.4, where we study an example of a class of strongly dependent processes and give a simple switching strategy that significantly outperforms a policy that aims for the best arm.

### 4.1 Policies, Random Times, and the φ-mixing Property

Recall that a policy is a function which, based on the past (observed) data, samples one of the $k$ bandit arms at time-step $t$. Therefore, since this decision is based on samples generated by a random process, the times at which bandit arms are played can be naturally modelled via random times. More formally, denote by $\tau_{i,j}$ a random variable which determines the time at which the $j$th arm is sampled for the $i$th time, and observe that for any $t \in \mathbb{N}_+$ the event $\{\tau_{i,j} = t\}$ is in $\mathcal{F}_{t-1}$. We denote by $X_{\tau_{i,j},j} := \sum_{t \in \mathbb{N}_+} \chi\{\tau_{i,j} = t\} X_{t,j}$ the pay-off obtained from sampling arm $j$ at random time $\tau_{i,j}$ for any $i \in \mathbb{N}_+$, where $\chi$ is the indicator function, i.e. $\chi\{\tau_{i,j} = t\}(\omega)$ is $1$ for all $\omega$ such that $\tau_{i,j}(\omega) = t$, and is $0$ otherwise.

The main challenge in devising a policy for the φ-mixing bandit problem is that, depending on the policy used, the dependence structure of the pay-off sequence $X_{\tau_{i,j},j}$, $i \in \mathbb{N}_+$, for $j \in 1..k$ may be completely different from that of $X_{t,j}$, $t \in \mathbb{N}_+$. Note that this differs from the simpler i.i.d. setting where the distribution of the arm (with all its characteristics) carries over to that of the sampled sequence. This is illustrated via Example 2 below.

###### Example 2

Consider a two-armed bandit problem where the second arm is deterministically set to , i.e. and the first arm has a process distribution described by a two state Markov Chain with the following transition matrix,

$$T = \begin{pmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}, \quad \text{with some } \epsilon \in (0,1).$$

Observe that for this process, if is small, with high probability the Markov Chain stays in its current state. Now consider a policy , and denote by the sequence of random times at which samples the first arm according to the following simple rule. Set . For subsequent random times, if for then . Otherwise, is set to be significantly larger than to guarantee that the distribution of given is close to the stationary distribution of the Markov Chain, during which time the first arm is sampled. The sequence so generated is highly dependent on , is not -mixing. In fact, the expected pay-offs given the first observation, i.e. , are very different from the stationary mean if is small. In particular, while is at least due to Equation (13) on page 13 and, hence, for we have that .

A more detailed treatment of the above example is given in Appendix A.4.
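The effect of data-dependent sampling times can be checked numerically on a small instance. The sketch below is an illustration in the spirit of Example 2 and not necessarily the rule used there: it assumes a symmetric two-state chain on the states $\{0, 1\}$ with flip probability `eps`, started in its stationary distribution, and a hypothetical rule that samples again at the next time-step after observing $1$ and otherwise waits `wait` time-steps. The function names are likewise assumptions.

```python
def n_step(eps, n):
    """P(X_{t+n} = 1 | X_t = 1) for the symmetric two-state chain on {0, 1}."""
    return 0.5 + 0.5 * (1.0 - 2.0 * eps) ** n


def second_sample_mean(eps, wait):
    """Expected pay-off at the second sampling time under the adaptive rule."""
    # First observation is 1 (probability 1/2): sample again one step later.
    stay = 0.5 * n_step(eps, 1)
    # First observation is 0 (probability 1/2): sample again `wait` steps later;
    # by symmetry P(X_{t+n} = 1 | X_t = 0) = 1 - P(X_{t+n} = 1 | X_t = 1).
    come_back = 0.5 * (1.0 - n_step(eps, wait))
    return stay + come_back


print(second_sample_mean(0.01, 1000))   # strongly dependent chain
print(second_sample_mean(0.5, 1000))    # i.i.d. fair coin flips
```

The stationary mean of this chain is $1/2$. With `eps = 0.5` the chain is i.i.d. and the sampled pay-off keeps this mean, whereas for small `eps` the expected sampled pay-off approaches $3/4$, so the mean of the sampled sequence no longer matches that of the arm.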

Indeed, it is a policy’s access to the (observed) past data which can lead to strong couplings between past and future pay-offs in this framework. This point has been overlooked in the work of Audiffren and Ralaivola (2015) which relies on Improved-UCB to identify the arm with the highest stationary mean, by eliminating potentially sub-optimal arms. The elimination process depends on the data, and the time-steps at which a particular arm is played depend on the remaining arms. Hence, random times and the policy’s memory have to be carefully taken into consideration, as the process distribution of the sampled sequence is different from that of the corresponding arm. This notion has not been accounted for in their algorithm, and the confidence intervals involved correspond to the distributions of the arms, and not to those of the sampled sequences, and are therefore invalid in this non-i.i.d. setting.

### 4.2 Approximation Error

We start by translating φ-mixing properties to those of expectations in order to control the difference between what a switching strategy can achieve as compared to the highest stationary mean. Prior to delving into the bandit problem we consider a single bounded real-valued stationary φ-mixing process $X_t$, $t \in \mathbb{N}_+$, sampled at random times $\tau_i$, $i \in \mathbb{N}_+$, where $\tau_i$ is fully defined by the past observations $X_{\tau_1}, \dots, X_{\tau_{i-1}}$. We control the difference between the mean of the sampled process and the stationary mean of the original process. The following proposition shows that the difference in the means is controlled by the φ-mixing coefficients and the increments in the stopping times; the proof is provided in Appendix A.3.

###### Proposition 6

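Let $\tau_i$, $i \in \mathbb{N}_+$, be a sequence of random times such that $\tau_{i+1} - \tau_i \ge \ell$ a.s. for some $\ell \in \mathbb{N}_+$ and all $i \in \mathbb{N}_+$, and assume that $X_t$, $t \in \mathbb{N}_+$, is a bounded stationary φ-mixing process with mean $\mu$, upper bound $c$ on the absolute value of the process and mixing coefficients $\varphi_n$. Then for any $n \in \mathbb{N}_+$

$$\Bigl| \frac{1}{n} \sum_{i=1}^n E X_{\tau_i} - \mu \Bigr| \le 2 c \varphi_\ell.$$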

In other words, when the sample mean of the sampled process is used as an estimate of the stationary mean, the bias of the estimator is bounded through the φ-mixing coefficient. This result has further implications; in particular, it tells us something about the leverage that a switching policy has. A switching policy effectively selects random times for each arm at which the arm is played, and this result says that the summed pay-off it can gather over $n$ plays of an arm cannot exceed $n$ times the stationary mean of the arm by more than $2nc\varphi_\ell$. Now, a policy is free to play the arms at any time and we only know that $\tau_{i+1} - \tau_i \ge 1$ for any random time $\tau_i$, i.e. $\ell = 1$. This intuition underlies the following proposition.

###### Proposition 7

Consider the jointly stationary φ-mixing bandit problem formulated in Section 3. Let $\mu_1, \dots, \mu_k$ be the means of the stationary distributions and let $\mu^* := \max_{j \in 1..k} \mu_j$. Recall the definition of the φ-mixing coefficient $\varphi_1$ given in Definition 4. For every $n \ge 1$ we have

$$v^*_n - n\mu^* \le 2 n \varphi_1.$$

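Proof  Consider an arbitrary policy $\pi$ and an arbitrary $t \le n$. Then

$$E X_{t,\pi_t} = \sum_{j=1}^k \sum_{i=1}^t E\bigl(X_{t,j} \times \chi\{\tau_{i,j} = t\}\bigr)$$

and since $\chi\{\tau_{i,j} = t\}$ is $\mathcal{G}_{t-1}$-measurable we have with $B := \{\tau_{i,j} = t\}$ that

$$E\bigl(X_{t,j} \times \chi\{\tau_{i,j} = t\}\bigr) = E\bigl(E(X_{t,j} \mid \mathcal{G}_{t-1}) \times \chi\{\tau_{i,j} = t\}\bigr) = \int_B E(X_{t,j} \mid \mathcal{G}_{t-1})\, dP.$$

We can extend the φ-mixing property of the joint process to $\mathcal{G}_{t-1}$ by applying Lemma A.2 and we get

$$\Bigl| \int_B \bigl( E(X_{t,j} \mid \mathcal{G}_{t-1}) - E X_{1,j} \bigr)\, dP \Bigr| \le 2 \varphi_1 P(B).$$

Since the different sets $\{\tau_{i,j} = t\}$ are disjoint

$$E X_{t,\pi_t} - \mu^* \le \sum_{j=1}^k \sum_{i=1}^t \Bigl( E\bigl(X_{t,j} \times \chi\{\tau_{i,j} = t\}\bigr) - P(\tau_{i,j} = t)\, E X_{1,j} \Bigr) \le 2 \varphi_1 \sum_{j=1}^k \sum_{i=1}^t P(\tau_{i,j} = t) \le 2 \varphi_1.$$

Summing over $t = 1, \dots, n$ and taking the supremum over all policies gives the claim.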

Observe that this relaxation introduces an inevitable linear component to the regret as shown by Proposition 7. However, we argue that if the reward distributions are weakly dependent in the sense that $\varphi_1$ is small, we may settle for the best arm instead of following the best switching strategy.

### 4.3 An Optimistic Approach

In this section we propose a UCB-type algorithm to identify the arm with the highest stationary mean in a jointly φ-mixing bandit problem. Consider the bandit problem described in Section 3, where we have $k$ arms each with a bounded stationary pay-off sequence such that the joint process is stationary φ-mixing. Suppose that the processes are weakly dependent in the sense that $\varphi_1 \le \epsilon$ for some small $\epsilon$. As discussed in Section 4.2, in this case a policy to settle for the best arm can serve as a good approximation for the best switching strategy. More specifically, let

$$\overline{R}_\pi(n) := n\mu^* - \sum_{t=1}^n E X_{t,\pi_t}$$

denote the regret of a policy $\pi$ with respect to the arm with the highest stationary mean. From Proposition 7 we have $R_\pi(n) \le \overline{R}_\pi(n) + 2n\varphi_1$ and our objective in this section is to minimize $\overline{R}_\pi(n)$.

Recall that in light of the arguments provided in Section 4.1 it is crucial to take a policy's access to past (observed) data into account when devising a strategy for the bandit problem in this framework. To address the challenge induced by the inter-dependent reward sequences obtained at random times, our approach relies on the following key observation. Suppose we obtain a sequence of $m$ consecutive samples from arm $i$ starting at a random time $\tau$. For a long batch, i.e. large enough $m$, the average expectations become close to the stationary mean $\mu_i$. More formally we have Lemma 8 below.

###### Lemma 8

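For a fixed $i \in 1..k$ and $m \in \mathbb{N}_+$, consider the consecutive samples $X_{\tau,i}, X_{\tau+1,i}, \dots, X_{\tau+m-1,i}$, where $\tau$ is a random time at which the $i$th arm is sampled. Let $\mu_i$ denote the stationary mean of arm $i$. We have

$$\Bigl| \mu_i - \frac{1}{m} \sum_{j=0}^{m-1} E X_{\tau+j,i} \Bigr| \le \frac{2}{m} \|\varphi\|.$$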

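Proof  For simplicity of notation we write $X_t$ for $X_{t,i}$ and $X_{\tau+j}$ for $X_{\tau+j,i}$. Recall that $\mathcal{G}_t$ denotes the filtration that keeps track of all the information available up to time $t$ (including unobserved pay-offs). Observe that $\pi_t$ is $\mathcal{F}_{t-1}$-measurable so that the event $\{\tau = t\}$ is in $\mathcal{G}_{t-1}$ for all $t \in \mathbb{N}_+$. As a result, for any $t \in \mathbb{N}_+$ and $j \in 0..m-1$ we have

$$E(\chi\{\tau = t\} X_{t+j} \mid \mathcal{G}_{t-1}) = \chi\{\tau = t\}\, E(X_{t+j} \mid \mathcal{G}_{t-1}), \tag{3}$$

see, e.g. (Shiryaev, 1991)[p.216]. We obtain,

$$\begin{aligned}
\Bigl| m\mu_i - \sum_{j=0}^{m-1} E X_{\tau+j,i} \Bigr|
&= \Bigl| \sum_{t \in \mathbb{N}_+} \sum_{j=0}^{m-1} E(\chi\{\tau = t\} X_{t+j}) - E\chi\{\tau = t\}\, E X_t \Bigr| && (4) \\
&= \Bigl| \sum_{t \in \mathbb{N}_+} \sum_{j=0}^{m-1} E\, E(\chi\{\tau = t\} X_{t+j} \mid \mathcal{G}_{t-1}) - E\chi\{\tau = t\}\, E X_t \Bigr| && (5) \\
&= \Bigl| \sum_{t \in \mathbb{N}_+} \sum_{j=0}^{m-1} E\bigl(\chi\{\tau = t\}\, E(X_{t+j} \mid \mathcal{G}_{t-1})\bigr) - E\chi\{\tau = t\}\, E X_t \Bigr| && (6) \\
&\le \sum_{t \in \mathbb{N}_+} \sum_{j=0}^{m-1} E\bigl(\chi\{\tau = t\}\, |E(X_{t+j} \mid \mathcal{G}_{t-1}) - E X_t|\bigr) && (7) \\
&\le \sum_{t \in \mathbb{N}_+} \sum_{j=0}^{m-1} 2\varphi_j\, E\chi\{\tau = t\} && (8) \\
&\le 2\|\varphi\|
\end{aligned}$$

where (4) and (5) are due to stationarity and the law of total expectation respectively, (6) follows from (3), (7) from the triangle inequality, and (8) follows from Theorem 2. Dividing by $m$ gives the claim.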
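The $1/m$ rate in Lemma 8 can be observed on a small instance. The sketch below assumes a stationary symmetric two-state chain on $\{0, 1\}$ with flip probability `eps` and a batch that starts one step after the chain is first observed in state $1$, so that by the Markov property the expected $j$th sample of the batch is $\frac{1}{2} + \frac{1}{2}(1 - 2\epsilon)^{j+1}$. The chain, the starting rule and the name `batch_bias` are assumptions chosen for illustration.

```python
def batch_bias(eps, m):
    """|mu - (1/m) * sum_j E X_{tau+j}| for a batch of length m started after a 1."""
    # Each term is the gap between the conditional mean of the j-th sample of
    # the batch and the stationary mean 1/2.
    total = sum(0.5 * (1.0 - 2.0 * eps) ** (j + 1) for j in range(m))
    return total / m


for m in (1, 10, 100, 1000):
    print(m, batch_bias(0.1, m), m * batch_bias(0.1, m))
```

The product of $m$ and the bias stays below the geometric sum $(1 - 2\epsilon)/(4\epsilon)$, which equals $2$ for `eps = 0.1`, so the bias of the batch average vanishes at rate $1/m$ even though the first samples of the batch are strongly biased.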

Inspired by this result, we provide Algorithm 1 which, given the number $k$ of arms and the sum $\|\varphi\|$ of the φ-mixing coefficients, works as follows. First, each arm is sampled once for initialization. Next, from $t = k+1$ on, arms are played in batches of exponentially growing length. Specifically, at each round the arm $j$ with the highest upper-confidence bound on its empirical mean is selected, and played for $2^{s_j}$ consecutive time-steps, where $s_j$ denotes the number of times that arm $j$ has been selected so far. The upper confidence bound is calculated based on a Hoeffding-type bound for φ-mixing processes given by Corollary 2.1 of Rio (1999). The samples obtained by playing the selected arm are used in turn to calculate (from scratch) the arm's empirical mean. The algorithm does not require the values of the individual φ-mixing coefficients, but only their sum $\|\varphi\|$. In fact, any upper-bound on $\|\varphi\|$ may be used, in which case this upper-bound would replace $\|\varphi\|$ in the regret bound of Theorem 10.
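The batching scheme can be sketched in a few lines of code. This is a minimal sketch under stated assumptions and not the paper's Algorithm 1: pay-offs are assumed to lie in $[0, 1]$, the batch length is taken to be $2^{s_j}$, and the confidence width (including the constant `scale`) is a placeholder that merely grows with the supplied sum of mixing coefficients; it is not the bound of Rio (1999). The names `batched_ucb`, `pull` and `phi_norm` are hypothetical.

```python
import math


def batched_ucb(pull, k, horizon, phi_norm):
    """Play k arms for `horizon` steps in doubling batches; return play counts.

    pull(t, j) returns the pay-off in [0, 1] of arm j at time-step t.
    """
    # Placeholder confidence scale (an assumption, not the paper's constant).
    scale = 2.0 * (1.0 + 4.0 * phi_norm)
    means = [0.0] * k       # empirical mean of the most recent batch of each arm
    batch_len = [0] * k     # length of that batch
    selections = [0] * k    # s_j: how often arm j has been selected so far
    plays = [0] * k         # T_j: total number of time-steps spent on arm j
    t = 0

    def play_batch(j, length):
        nonlocal t
        length = min(length, horizon - t)       # truncate the final batch
        total = 0.0
        for _ in range(length):
            t += 1
            total += pull(t, j)
        means[j] = total / length               # recomputed from scratch
        batch_len[j] = length
        selections[j] += 1
        plays[j] += length

    for j in range(k):                          # initialisation: one sample per arm
        if t < horizon:
            play_batch(j, 1)
    while t < horizon:
        ucb = [means[j] + math.sqrt(scale * math.log(t) / batch_len[j]) for j in range(k)]
        best = max(range(k), key=ucb.__getitem__)
        play_batch(best, 2 ** selections[best])
    return plays


# Two arms with constant pay-offs 0.2 and 0.8, a degenerate stationary case.
print(batched_ucb(lambda t, j: (0.2, 0.8)[j], k=2, horizon=100, phi_norm=0.0))
```

Recomputing the empirical mean from the latest batch only, rather than from all past samples of the arm, mirrors the description above: the estimate then depends on a single run of consecutive samples whose bias is controlled by Lemma 8.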

To analyse the regret of Algorithm 1, first recall that in an i.i.d. setting we trivially have $\overline{R}(n) = \sum_{j=1}^k \Delta_j E T_j(n)$ where $T_j(n)$ is the total number of times that arm $j$ is played by the algorithm in $n$ rounds and $\Delta_j := \mu^* - \mu_j$. In our framework, this equality does not necessarily hold due to the inter-dependencies between the pay-offs. However, as shown in Proposition 9 below, an analogous result in the form of an upper-bound holds for our algorithm.

###### Proposition 9

Consider the regret $\overline{R}(n)$ of Algorithm 1 after $n$ rounds of play. We have,

$$\overline{R}(n) \le \sum_{j=1}^k \Delta_j E T_j(n) + 2k \Bigl( \sum_{l=0}^n \varphi_l \Bigr) \log n$$

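Proof  Denote by $\tau_{i,j}$ the random time at which the $j$th arm is sampled for the $i$th time. Note that for any $t \in \mathbb{N}_+$ the event $\{\tau_{i,j} = t\}$ is measurable with respect to the filtration $\mathcal{G}_{t-1}$ that keeps track of all the information available up to time $t-1$. First note that

$$\begin{aligned}
E(\chi\{\tau_{i,j} = t\} X_{t+l,j}) &= E\, E(\chi\{\tau_{i,j} = t\} X_{t+l,j} \mid \mathcal{G}_{t-1}) \\
&= E\bigl(\chi\{\tau_{i,j} = t\}\, E(X_{t+l,j} \mid \mathcal{G}_{t-1})\bigr) \\
&\ge (\mu_j - 2\varphi_l)\, P(\tau_{i,j} = t) && (9)
\end{aligned}$$

where the second equality follows from the fact that the event $\{\tau_{i,j} = t\}$ is $\mathcal{G}_{t-1}$-measurable and (9) follows from Theorem 2. We have,

$$\begin{aligned}
\overline{R}(n) &= n\mu^* - \sum_{t=1}^n E X_{t,\pi_t} \\
&= n\mu^* - \sum_{t=1}^n \sum_{j=1}^k \sum_{m=1}^{\log n} \sum_{l=0}^{\min\{2^m - 1,\, n-t\}} E\bigl(\chi\{\tau_{m,j} = t\} X_{t+l,j}\bigr) \\
&\le n\mu^* - \sum_{t=1}^n \sum_{j=1}^k \sum_{m=1}^{\log n} \sum_{l=0}^{\min\{2^m - 1,\, n-t\}} P(\tau_{m,j} = t)(\mu_j - 2\varphi_l) \\
&\le n\mu^* - \sum_{t=1}^n \sum_{j=1}^k \sum_{m=1}^{\log n} \sum_{l=0}^{\min\{2^m - 1,\, n-t\}} P(\tau_{m,j} = t)\,\mu_j + 2k \Bigl( \sum_{l=0}^n \varphi_l \Bigr) \log n \\
&= \sum_{j=1}^k \Delta_j E T_j(n) + 2k \Bigl( \sum_{l=0}^n \varphi_l \Bigr) \log n
\end{aligned}$$

where the first inequality follows from (9) and the second from the fact that $\sum_{t=1}^n P(\tau_{m,j} = t) \le 1$ for each of the at most $k \log n$ pairs $(m, j)$.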

An upper-bound on $\overline{R}(n)$ is given by Theorem 10 below with proof provided in Appendix B.

###### Theorem 10 (Regret Bound.)

For the regret $\overline{R}(n)$ of Algorithm 1 after $n$ rounds of play we have,

 ¯¯¯¯¯R(n)
##### Remark.

Interestingly, the amount of dependence appears in the bound of Theorem 10. Indeed, in the case where the pay-offs are i.i.d., we recover (up to some constant) the regret bound of Auer et al. (2002)[Thm. 1].

### 4.4 Strongly Dependent Reward Distributions: a complementary example

At the other extreme lie bandit problems with strongly dependent pay-off distributions. Our objective in this section is to give an example in which a simple switching strategy leverages the strong inter-dependencies between the samples. This approach gives a much higher overall pay-off than settling for the best arm, and is computationally efficient. The intuition is that in many cases strong dependencies allow future rewards to be predicted even from scarce observations of a sample path.

We consider a class of stochastic processes for which we can easily control the level of dependency. A natural choice is to use stationary Gaussian processes on . Recall that a Gaussian process is fully specified by its covariance function . Also, for any covariance function Kolmogorov’s consistency theorem guarantees the existence of a Gaussian process with this particular covariance function; see, e.g., Giné and Nickl (2016). A Gaussian process is stationary if it has constant mean on and its covariance can be written as for all . We measure the degree of dependence of the process by means of Hölder-continuity of the covariance function. In particular, we assume that there exists some and such that for all , . A low and correspond to highly dependent processes since the covariance decreases slowly over time. A slowly decreasing covariance also implies large φ-mixing coefficients, since by Rio (1999, Thm. 1.4).

Consider a $k$-armed bandit problem where each arm is distributed according to a stationary Gaussian process with stationary mean . For simplicity, we assume that the processes are mutually independent with the same, unknown, covariance function . While is assumed unknown, we have access to an upper-bound on its rate of decay. That is, we are given constants and such that is -Hölder continuous. We further assume that an upper bound on the stationary means of the processes is known and that . In order to obtain direct control on the regret we would need to make inference about the best possible switching strategy. Instead, we provide guarantees for the regret of policy with respect to the best policy that can choose arms in hindsight, i.e. . Note that .

We provide a simple algorithm, namely, Algorithm 2, that exploits the dependence between the pay-offs. Starting from an exploration phase, the algorithm alternates between exploration and exploitation, denoted Phase I and Phase II respectively. In Phase I it sweeps through all arms to observe the corresponding pay-offs. In Phase II it plays the arm with the highest observed pay-off for rounds, where is a (large) constant given by (2) which reflects the degree of dependence between the samples in the processes. We need not estimate the stationary distributions in this algorithm, as bounds on the differences between the stationary means suffice. Indeed, these differences are of minor relevance unless they are large compared to the dependence between the individual processes. We have the following regret bound whose proof is given in Appendix C.2.
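The alternation between the two phases can be illustrated with a small simulation. The sketch below is not the paper's Algorithm 2: it uses Gaussian AR(1) paths with covariance `rho ** |tau|` as a convenient stand-in for a strongly dependent stationary Gaussian process, all stationary means are set to zero, and the Phase II length `exploit_len` is a free, hypothetical parameter rather than the constant prescribed in the paper. The helper names `ar1_paths` and `switching_payoff` are likewise assumptions made for illustration.

```python
import math
import random


def ar1_paths(k, n, rho, rng):
    """Simulate k independent stationary Gaussian AR(1) paths of length n.

    Each path has N(0, 1) marginals and covariance rho ** |tau| (an assumed
    stand-in for a strongly dependent stationary Gaussian process).
    """
    scale = math.sqrt(1.0 - rho * rho)
    paths = []
    for _ in range(k):
        x = rng.gauss(0.0, 1.0)
        path = [x]
        for _ in range(n - 1):
            x = rho * x + scale * rng.gauss(0.0, 1.0)
            path.append(x)
        paths.append(path)
    return paths


def switching_payoff(paths, exploit_len):
    """Total pay-off of a sweep-then-exploit policy on pre-drawn paths."""
    k, n = len(paths), len(paths[0])
    total, t = 0.0, 0
    while t < n:
        # Phase I: sweep through the arms, one observation each.
        observed = []
        for j in range(k):
            if t == n:
                break
            observed.append(paths[j][t])
            total += paths[j][t]
            t += 1
        best = max(range(len(observed)), key=observed.__getitem__)
        # Phase II: stay on the arm with the highest observed pay-off.
        for _ in range(exploit_len):
            if t == n:
                break
            total += paths[best][t]
            t += 1
    return total


rng = random.Random(0)
paths = ar1_paths(3, 20000, 0.99, rng)
switching = switching_payoff(paths, 10)
best_fixed = max(sum(p) for p in paths)  # best single arm in hindsight
print(switching / 20000, best_fixed / 20000)
```

With `rho` close to one, the arm that looks best during the sweep tends to stay best for the next few rounds, so the switching policy collects a markedly higher average pay-off than any single arm, even though all arms share the same stationary mean.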

###### Proposition 11

Given such that and such that for all , the regret of Algorithm 2 after rounds is at most,

 (n+m⋆)k(k−1)(Δ+√2/πm⋆+√am⋆8π(1−bm⋆)(√8π−(1−√bm⋆Δ)exp(−bm⋆Δ28))),

with .

To interpret the bound consider for simplicity and the case of a highly dependent process and, hence, a very small . If we choose to play the arm with the highest stationary mean at all rounds, then standard bounds on the normal distribution give us a bound of order on the regret. In this case, the regret of Algorithm 2 is of order

$$n\,k^{2}\,c^{1/4}$$

because is insignificant for small , and the bracket on the right side is about . The gap itself is not of high importance because in Phase I the algorithm selects the arm with the highest current pay-off and the value stays stable over a long period as is small. Both regret bounds are linear in because the oracle has a significant advantage in this setting: at any given time the oracle chooses the arm with the highest pay-off in hindsight. However, for a moderate number of arms is significantly smaller than unless is considerably large. The advantage of the switching algorithm vanishes if is large compared to the smoothness of the process, because eventually the exploration phase (Phase I) will dominate and the smoothness of the arms cannot be exploited by this algorithm. This example demonstrates that strong dependence in the stochastic process can be exploited to build switching algorithms that have a significant edge over algorithms that aim to select a single arm. In this regime, algorithms like Algorithm 1 are outperformed by simple switching algorithms.

## 5 Outlook

This paper is an initial attempt to characterize special sub-classes of the restless bandit problem where good approximate solutions can be found using simple, computationally tractable approaches. We provide a UCB-type algorithm to approximate the optimal strategy in the case where the pay-offs are jointly stationary φ-mixing and are only weakly dependent. A natural open problem here is the derivation of a lower-bound. Moreover, while our algorithm only requires knowledge of the sum of the φ-mixing coefficients as opposed to that of each individual coefficient, the online estimation of the mixing coefficients can prove useful. Specifically, in light of Proposition 7, if is estimated from data, the algorithm can have a real-time estimate of its maximum loss with respect to the best switching strategy after rounds of play. Further, the results can be strengthened if the algorithm can adaptively estimate instead of relying on it as input. Another interesting regime corresponds to strongly dependent pay-off distributions. We provide an example using stationary Gaussian processes where a simple switching strategy can leverage the dependencies to outperform a best-arm policy. An open problem would be to weaken the assumptions on the process distributions and obtain results analogous to the weakly dependent case for the strongly dependent framework.

## A Proofs for φ-mixing processes

### A.1 Technical results

The first lemma is simple but fundamental to our derivations. It allows us to control the φ-mixing coefficient of disjoint events.

###### Lemma A.1

Let be a probability space and let be two σ-subalgebras of . If there exists a such that for all and it holds that then for any disjoint sequence for all , and any , we have

$$\sum_{n=0}^{\infty}\bigl|P(B_n)P(C)-P(B_n\cap C)\bigr|\le 2\varphi P(C)$$