On Adaptive Estimation for Dynamic Bernoulli Bandits
Abstract
The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total reward for a gambler who sequentially pulls an arm of a multi-armed slot machine, where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary counts, we focus on dynamic Bernoulli bandits. Standard methods like Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome this slow response to change by deploying adaptive estimation in the standard methods, and propose a new family of algorithms that are adaptive versions of Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the data, which is important for real applications. We examine the new algorithms numerically in different scenarios, and the results show solid improvements in dynamic environments.
Keywords: Dynamic Multi-armed Bandit, Bernoulli Bandits, Adaptive Estimation, UCB, Thompson Sampling
1 Introduction
The multi-armed bandit (MAB) problem is a classic decision problem in which one needs to balance acquiring new knowledge with optimising choices based on current knowledge, a dilemma commonly referred to as the exploration-exploitation trade-off. The problem, originally proposed by Robbins (1952), is to make sequential selections among a (finite) set of arms, $\mathcal{A} = \{1, \dots, K\}$, so as to maximise the total reward obtained over a (possibly infinite) time horizon $T$. The MAB framework is a natural model for many real-world problems. It was originally motivated by the design of clinical trials (Thompson, 1933; see also Press, 2009, and Villar et al., 2015, for some recent developments). Other applications include online advertising (Li et al., 2010; Scott, 2015), adaptive routing (Awerbuch and Kleinberg, 2008), and financial portfolio design (Brochu et al., 2011; Shen et al., 2015). In stochastic MABs, each arm is characterised by an unknown reward distribution. The Bernoulli distribution is a natural choice that appears often in the literature, because in many real applications the rewards can be represented by binary counts. For example, in clinical trials, we obtain a reward of 1 for a successful treatment, and a reward of 0 otherwise (Villar et al., 2015); in online advertising, counts of clicks are often used to measure success (Scott, 2010).
Formally, the MAB problem may be stated as follows: for discrete times $t = 1, \dots, T$, the decision maker selects one arm $a_t$ from the arm set $\mathcal{A}$ and receives a reward $Y_t(a_t)$. The goal is to optimise the arm selection sequence and maximise the total expected reward $\sum_{t=1}^{T} \mathbb{E}[Y_t(a_t)]$, or equivalently, minimise the total regret:
$$R_T = \sum_{t=1}^{T} \mathbb{E}[Y_t(a_t^*)] - \sum_{t=1}^{T} \mathbb{E}[Y_t(a_t)], \qquad (1)$$
where $a_t^*$ is the optimal arm at time $t$. The total regret can be interpreted as the difference between the total expected reward obtained by playing an optimal strategy (selecting the optimal arm at every step) and that obtained by the algorithm. For notational convenience, we let $\mu_t(a)$, $a \in \mathcal{A}$, denote the expected reward of arm $a$ at time $t$, i.e., $\mu_t(a) = \mathbb{E}[Y_t(a)]$. In the rest of this paper, we will also use notations like $Y_t$ and $\mu_t$ when we introduce methods/models that can be applied separately to different arms.
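As a concrete illustration of the total regret in (1), the following minimal Python sketch computes it for a known matrix of expected rewards. The function name `total_regret` and the toy numbers are illustrative assumptions, not taken from the paper; in practice the expected rewards are unknown, so regret can only be evaluated in simulation.

```python
def total_regret(mu, chosen):
    """Sum over time of (best expected reward) - (expected reward of the played arm).

    mu[t][a] is the expected reward of arm a at step t (known only in simulation);
    chosen[t] is the index of the arm played at step t.
    """
    return sum(max(row) - row[a] for row, a in zip(mu, chosen))


# Toy two-armed example (hypothetical numbers) in which the optimal arm
# switches from arm 1 to arm 0 after the first step.
mu = [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]
regret = total_regret(mu, chosen=[1, 1, 0])  # only the second choice is suboptimal
```

An algorithm that keeps playing arm 1 after the switch pays the gap between the two expected rewards at every such step, which is why slow tracking of the expected reward translates directly into regret.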
The classic MAB problem assumes that the reward distribution structure does not change over time. That is to say, in this case, the optimal arm is the same for all $t$. A MAB problem with static reward distributions is also known as the stationary, or static, MAB problem in the literature (e.g., Garivier and Moulines, 2011; Slivkins and Upfal, 2008). A dynamic MAB, where changes are allowed in the underlying reward distributions, is more realistic in real-world applications such as online advertising. An agent always seeks the best web position (that is, the placement of the advertisement on a webpage), and/or advertisement content, to maximise the probability of obtaining clicks. However, due to inherent changes in the marketplace, the optimal choice may change over time, and thus the assumption of static reward distributions is not adequate in this example.
Two main types of change have been studied in the literature of dynamic MAB: abrupt changes (Garivier and Moulines, 2011; Yu and Mannor, 2009), and drifting (Granmo and Berg, 2010; Gupta et al., 2011; Slivkins and Upfal, 2008). For abrupt changes, the expected reward of an arm remains constant for some period and changes suddenly at possibly unknown time instants (Garivier and Moulines, 2011). The study of drifting dynamic bandits follows the seminal work of Whittle (1988), in which restless bandits were introduced. In Whittle's study, the state of an arm can change according to a Markov transition function over time whether it is selected or not. Restless bandits are regarded as intractable, i.e., it is not possible to derive an optimal strategy even if the transitions are deterministic (Papadimitriou and Tsitsiklis, 1999). In recent studies of drifting dynamic bandits, the expected reward of an arm is often modelled by a random walk (e.g., Granmo and Berg, 2010; Gupta et al., 2011; Slivkins and Upfal, 2008).
In this work, we look at the problem of dynamic bandits where the expectation of the reward distribution changes over time, focusing on the Bernoulli reward distribution because of its wide relevance in real applications. In addition, we will emphasise cases where the changes of the reward distribution can really have an effect on the decision making. As a counterexample, consider a two-armed Bernoulli bandit in which the expected reward of Arm 1 oscillates within one interval over time, and the expected reward of Arm 2 oscillates within a second interval that does not overlap the first. The reward distributions for both arms change, but the optimal arm remains the same. We will not regard this example as a dynamic case.
Many algorithms have been proposed in the literature to perform arm selection for MAB. Some of the most popular ones include Greedy (Watkins, 1989), Upper Confidence Bound (UCB; Auer et al., 2002), and Thompson Sampling (TS; Thompson, 1933). These methods have been extended in various ways to improve performance. For example, Garivier and Cappe (2011) proposed the Kullback-Leibler UCB (KL-UCB) method, which satisfies a uniformly better regret bound than UCB. May et al. (2012) introduced the Optimistic Thompson Sampling (OTS) method to boost exploration in TS. Some more extensions will be described in Section 3. Even in their basic forms, all the aforementioned approaches can perform well in practice in many situations (e.g., Chapelle and Li, 2011; Kuleshov and Precup, 2014; Vermorel and Mohri, 2005). One thing that these methods have in common is that they treat all the observations equally when estimating or making inference about $\mu_t$. Specifically, Greedy and UCB use sample averages to estimate $\mu_t$. In static cases, given that $Y_1, \dots, Y_t$ are i.i.d., this choice is a sensible one from a theoretical perspective, and one could invoke various asymptotic results as justification (e.g., the law of large numbers, the central limit theorem, the Berry-Esseen inequality, etc.). From a practical point of view, when $\mu_t$ changes significantly with time, the sample average could become a bottleneck in performance. The problem is that a sample average does not put more weight on the more recent data $Y_t$, which is a direct observation of $\mu_t$. In this paper we will consider using a different estimator for $\mu_t$ that is inspired by adaptive estimation (Haykin, 2002), and propose novel modifications of popular MAB algorithms.
1.1 Contributions and Organisation
We propose algorithms that use adaptive forgetting factors (Bodenham and Adams, 2016) in conjunction with the standard MAB methods. This results in a new family of algorithms for dynamic Bernoulli bandits. These algorithms overcome the shortcomings of using sample averages to estimate dynamically changing rewards. They are easy to implement and require very little tuning effort; they are quite robust to tuning parameters, and their initialisation does not require assumptions about, or knowledge of, the model structure in advance.
The remainder of this paper is structured as follows: Section 2 briefly summarises some adaptive estimation techniques, focusing on Adaptive Forgetting Factors (AFFs). Section 3 introduces the methodology for arm selection. Section 4 presents a variety of numerical results for different dynamic models and MAB algorithms. We summarise our findings in Section 5.
2 Adaptive Estimation Using Forgetting Factors
Solving the MAB problem involves two main steps: learning the reward distribution of each arm (estimation step), and selecting one arm to play (selection step). The foundation of making a good selection is to correctly and efficiently track the expected reward of the arms, especially in the context of time-evolving reward distributions. Adaptive estimation approaches are useful for this task, as they provide an estimator that follows a moving target more closely; here the target is the expected reward (Anagnostopoulos et al., 2012; Bodenham and Adams, 2016). In this section, we introduce how to use an Adaptive Forgetting Factor (AFF) estimator for monitoring a single arm. For the sake of simplicity, when clear we have dropped dependence on arms in the notation.
Assume now that we select one arm all the time until time $t$ and receive rewards $Y_1, \dots, Y_t$. If the reward distribution is static, $Y_1, \dots, Y_t$ are i.i.d. Therefore, it is natural to estimate the expected reward via the sample mean: $\bar{Y}_t = \frac{1}{t} \sum_{i=1}^{t} Y_i$. This sample mean estimator is widely used in algorithms designed for the static MAB problem, such as Greedy and UCB. One problem with this estimator is that it often fails when the reward distribution changes over time. The adaptive filtering literature (Haykin, 2002) provides a generic and practical tool to track a time-evolving data stream, and it has recently been adapted to a variety of streaming machine learning problems (Anagnostopoulos et al., 2012; Bodenham and Adams, 2016). The key idea behind adaptive estimation is to gradually reduce the weight on older data as new data arrive (Haykin, 2002). For example, a fixed forgetting factor estimator employs a discount factor $\lambda$, $0 \le \lambda \le 1$, and takes the form $\hat{Y}_{\lambda,t} = \frac{1}{w_{\lambda,t}} \sum_{i=1}^{t} \lambda^{t-i} Y_i$, where $w_{\lambda,t}$ is a normalising constant. Bodenham and Adams (2016) illustrated that the fixed forgetting factor estimator has some similarities with the Exponentially Weighted Moving Average (EWMA) scheme (Roberts, 1959), which is a basic approach in the change detection literature (Tsung and Wang, 2010).
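To make the contrast between the sample mean and a fixed forgetting factor estimator concrete, here is a minimal Python sketch. It assumes the discounted mean is normalised by the sum of its weights; the function names and the toy stream are hypothetical and chosen only for illustration.

```python
def sample_mean(ys):
    """Plain average: every observation carries the same weight."""
    return sum(ys) / len(ys)


def fixed_ff_mean(ys, lam):
    """Discounted mean in which observation i carries weight lam**(t - i).

    Computed recursively so that no history needs to be stored.
    """
    m = 0.0  # discounted sum of rewards
    w = 0.0  # discounted count, used as the normalising constant
    for y in ys:
        m = lam * m + y
        w = lam * w + 1.0
    return m / w


# Toy stream: the success probability jumps from 0 to 1 after 50 steps.
ys = [0] * 50 + [1] * 10
slow = sample_mean(ys)         # still dominated by the old regime
fast = fixed_ff_mean(ys, 0.8)  # mostly reflects the last few rewards
```

With `lam = 1.0` the discounted mean reduces to the sample mean, while smaller values shorten the effective memory. The difficulty, which motivates the adaptive scheme, is that a good fixed value depends on how quickly the stream changes.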
In this paper, we will use an adaptive forgetting factor where the magnitude of the forgetting factor can be adjusted at each time step for better adaptation. One main advantage of an AFF estimator is that it can respond quickly to the changes of a target without requiring any prior knowledge about the process. In addition, by using dataadaptive tuning of , we sidestep the problem of setting a key control parameter. Therefore, it is very useful when applied to dynamic MABs where we do not have any knowledge about the dynamics of the reward distribution.
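Our AFF formulation follows Bodenham and Adams (2016). We present here only the main methodology. For a data stream $Y_1, \dots, Y_t$, the adaptive forgetting factor mean (denoted by $\hat{Y}_t$) is defined as follows:
$$\hat{Y}_t = \frac{1}{w_t} \sum_{i=1}^{t} \Big( \prod_{p=i}^{t-1} \lambda_p \Big) Y_i, \qquad (2)$$
where the normalising constant $w_t = \sum_{i=1}^{t} \prod_{p=i}^{t-1} \lambda_p$ is selected to give unbiased estimation when the data are i.i.d. For convenience, we set the empty product $\prod_{p=t}^{t-1} \lambda_p = 1$. We can update $\hat{Y}_t$ via the following recursive updating equations:
$$\hat{Y}_t = \frac{m_t}{w_t}, \qquad (3)$$
$$m_t = \lambda_{t-1} m_{t-1} + Y_t, \qquad (4)$$
$$w_t = \lambda_{t-1} w_{t-1} + 1. \qquad (5)$$
The adaptive forgetting factor $\vec{\lambda}_t = (\lambda_1, \dots, \lambda_t)$ is an expanding sequence over time, and the forgetting factor $\lambda_t$ is computed via a single gradient descent step, which is
$$\lambda_t = \lambda_{t-1} - \eta \, \Delta(L_t, \vec{\lambda}_{t-2}), \qquad (6)$$
where $\eta$ ($\eta \ll 1$) is the step size, and $L_t$ is a user-determined cost function of the estimator $\hat{Y}_{t-1}$. Here, we choose $L_t = (\hat{Y}_{t-1} - Y_t)^2$ for good mean tracking performance, which can be interpreted as the one-step-ahead squared prediction error. Other choices are possible, such as the one-step-ahead negative log-likelihood (Anagnostopoulos et al., 2012), but this will not be pursued here. In addition, $\Delta(L_t, \vec{\lambda}_{t-2})$ is a derivative-like function of $L_t$ with respect to $\vec{\lambda}_{t-2}$ (see Bodenham and Adams, 2016, sect. 4.2.1 for details). Note that the index of $\vec{\lambda}$ is $t-2$, as only $\lambda_1, \dots, \lambda_{t-2}$ are involved in $L_t$. We require the following recursions to sequentially compute $\Delta(L_t, \vec{\lambda}_{t-2})$:
$$\dot{m}_t = \lambda_{t-1} \dot{m}_{t-1} + m_{t-1}, \qquad (7)$$
$$\dot{w}_t = \lambda_{t-1} \dot{w}_{t-1} + w_{t-1}, \qquad (8)$$
$$\Delta(L_t, \vec{\lambda}_{t-2}) = 2 (\hat{Y}_{t-1} - Y_t) \, \frac{\dot{m}_{t-1} - \hat{Y}_{t-1} \dot{w}_{t-1}}{w_{t-1}}. \qquad (9)$$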
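The recursions referred to in (3)-(9) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it takes the cost to be the one-step-ahead squared prediction error, differentiates the discounted mean as if all forgetting factors were a single parameter (the derivative-like quantities `dm` and `dw`), and clips each forgetting factor to [0, 1]. The class name `AFFMean`, the clipping, and the initial value `lam0 = 1.0` are assumptions made here for illustration.

```python
class AFFMean:
    """Adaptive forgetting factor mean for a single data stream (illustrative sketch)."""

    def __init__(self, eta=0.01, lam0=1.0):
        self.eta = eta    # gradient step size
        self.lam = lam0   # current forgetting factor
        self.m = 0.0      # discounted sum of observations
        self.w = 0.0      # discounted count (normalising constant)
        self.dm = 0.0     # derivative-like companion of m
        self.dw = 0.0     # derivative-like companion of w
        self.mean = 0.0   # current AFF mean

    def update(self, y):
        # Gradient of the squared prediction error (mean - y)**2 with respect
        # to the forgetting factor, using quantities from the previous step.
        if self.w > 0.0:
            grad = 2.0 * (self.mean - y) * (self.dm - self.mean * self.dw) / self.w
        else:
            grad = 0.0  # nothing to differentiate before the first observation

        lam = self.lam
        # Update the derivative-like terms first: they need the old m and w.
        self.dm = lam * self.dm + self.m
        self.dw = lam * self.dw + self.w
        # Discounted sum, discounted count and the resulting mean.
        self.m = lam * self.m + y
        self.w = lam * self.w + 1.0
        self.mean = self.m / self.w
        # One gradient descent step, clipped to [0, 1] (an assumption of this sketch).
        self.lam = min(1.0, max(0.0, lam - self.eta * grad))
        return self.mean


# Toy stream: the success probability jumps from 0 to 1 after 200 steps.
est = AFFMean(eta=0.01)
for y in [0] * 200 + [1] * 50:
    est.update(y)
```

On this toy stream the forgetting factor stays at 1 while the data are stable, so the estimator behaves like a sample mean, and it drops sharply once prediction errors appear, which discards the stale history.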
In addition to the mean, we may make use of an adaptive estimate of the variance. The adaptive forgetting factor variance is defined as:
(10) 
Note here we choose the same adaptive forgetting factor for mean and variance for convenience, though other formulations are possible. One can use a separate adaptive forgetting factor for the variance if needed. Again, can be computed recursively via the following equations:
(11)  
(12)  
(13) 
The only tuning parameter in AFF estimation is the step size $\eta$ used in (6), and its choice may affect the performance of estimation. In Bodenham and Adams (2016), the authors proved that the expected gradient scales with $\sigma^2$ when the data are i.i.d. with variance $\sigma^2$. That is to say, the forgetting factors computed via (6) will be forced to be either 0 or 1 if $\sigma^2$ is too large. Therefore, before examining the influence of $\eta$, they scaled $\eta$ to $\eta / \sigma^2$ ($\sigma^2$ can be estimated during a burn-in period). However, in this paper, we are only interested in Bernoulli rewards, which means that $\sigma^2$ is less than 1, so it is not essential to devise an elaborate scaling scheme. We apply the AFF estimation in standard MAB algorithms (see Section 3), and examine empirically the influence of $\eta$ on these algorithms in Section 4.2.1.
2.1 Dealing with Missing Observations
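In the MAB setting, we have at least two arms, and for each arm, we will construct an AFF estimator. However, we can only observe one arm at a time. This means that the estimates and intermediate quantities of an unobserved arm will retain their previous values; that is, if arm $a$ is not observed at time $t$, every AFF quantity of that arm at time $t$ (the mean, the variance, the forgetting factor, and the intermediate terms) equals its value at time $t-1$.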
Not being able to update the estimators poses additional challenges in dynamic cases. In static cases, the sample mean estimator converges quickly to the expected reward after a few observations, and therefore it matters little if the arm is not observed further. However, in dynamic cases, even if the estimator tracks the expected reward perfectly at a given moment, its precision may deteriorate quickly once it stops getting new observations. Therefore, it is more challenging to balance exploration and exploitation in dynamic cases.
3 Action Selection
Having discussed how to track the expected reward of arms in the previous section, we now move on to methods for the selection step. We will consider three of the most popular methods: Greedy (Watkins, 1989), UCB (Auer et al., 2002) and TS (Thompson, 1933). They are easy to implement and computationally efficient. Moreover, they have good performance in numerical evaluations (Chapelle and Li, 2011; Kuleshov and Precup, 2014; Vermorel and Mohri, 2005). Each of these methods uses a different mechanism to balance the exploration-exploitation trade-off. Deploying AFF in these methods, we propose a new family of MAB algorithms for dynamic Bernoulli bandits, which are denoted with the prefix AFF to emphasise the use of AFF in estimation. Driven by Greedy, UCB and TS, the new algorithms are AFF-Greedy, AFF-UCB, and AFF-TS/AFF-OTS respectively.
In the literature of dynamic bandits, many approaches have attempted to improve the performance of standard methods by choosing an estimator that uses the reward history wisely. Koulouriotis and Xanthopoulos (2008) applied exponentially-weighted average estimation in Greedy. Kocsis and Szepesvari (2006) introduced the discounted UCB method (also called D-UCB in Garivier and Moulines, 2011), which uses a fixed discounting factor in estimation. Garivier and Moulines (2011) proposed the Sliding Window UCB (SW-UCB) algorithm, where the reward history used for estimation is restricted by a window. The Dynamic Thompson Sampling (DTS) algorithm applies a bound on the reward history used for updating the hyperparameters in the posterior distribution of $\mu_t$ (Gupta et al., 2011). These sophisticated algorithms require accurate tuning of some input parameters, which relies on knowledge of the model/behaviour of $\mu_t$. For example, computing the window size of SW-UCB, or the discounting factor of D-UCB (Garivier and Moulines, 2011), requires knowing the number of switch points (i.e., time instants at which the optimal arm switches). While the idea behind our AFF MAB algorithms is similar, our approaches automate the tuning of the key parameters (i.e., the forgetting factors), and require only a little effort to tune the higher level parameter $\eta$ in (6). Moreover, we use the AFF technique to guide the tuning of the key parameter in the DTS algorithm, which will be discussed later in this section. Other approaches for dynamic bandits include that of Slivkins and Upfal (2008). DaCosta et al. (2008) used the Page-Hinkley test to restart the UCB algorithm in the application of adaptive operator selection.
In what follows, we discuss each AFF-deployed method separately. We review briefly the basics of each method and refer the reader to the references for more details. In addition, we will continue to drop the dependence on arms in the notation when clear. In all the AFF MAB algorithms we propose below, we will use a very short initialisation (or burn-in) period for the initial estimations. Normally, the length of the burn-in period is the number of arms, $K$, that is, selecting each arm once; for the algorithms that require estimates of variance, we use a longer burn-in period by selecting each arm multiple times.
3.1 Greedy
Greedy (Watkins, 1989) is the simplest method for the static MAB problem. The expected reward of an arm is estimated by its sample mean, and a fixed parameter $\epsilon \in (0, 1)$ is used for selection. At each time step, with probability $\epsilon$, the algorithm selects an arm uniformly at random to explore, and with probability $1 - \epsilon$, the arm with the highest estimated reward is picked. Greedy is simple and easy to implement, which makes it appealing for dynamic bandits. However, it has two main issues: first, the sample average is not ideal for tracking a moving reward; second, the parameter $\epsilon$ is the key to balancing the exploration-exploitation dilemma, but it is challenging to tune, as an optimal strategy in dynamic environments may require varying $\epsilon$ over time.
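The selection step described above can be sketched in a few lines of Python. The function name `epsilon_greedy` and the injectable `rng` argument are illustrative choices, not part of the paper.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an arm index: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: uniform over all arms
    # Exploit: arm with the highest estimated reward (ties go to the lowest index).
    return max(range(len(estimates)), key=estimates.__getitem__)


arm = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=random.Random(0))
```

Note that the exploration probability is the same at every step regardless of what the data show, which is the rigidity that the threshold-based rule in AFF-Greedy is meant to address.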
In Algorithm 1, we propose the AFF-Greedy algorithm to overcome the above weaknesses. In the algorithm, we use the AFF mean from (2) to estimate the expected reward. This estimator can respond quickly to changes: for an arm that is frequently observed, it can closely follow the underlying reward; for an arm that has not been observed for a long time, the estimator can catch up quickly once the arm is selected again. At each time step, we first identify the arm with the highest AFF mean; if the absolute difference between this arm's last two forgetting factors is smaller than a threshold $d$, we select it; otherwise, we select an arm from the arm set uniformly at random. The threshold $d$ is used to balance exploration and exploitation. Tuning $d$ is easier than tuning $\epsilon$, as it is related to the step size $\eta$ used in (6). This was confirmed in a large number of simulations. For Bernoulli dynamic bandits, we suggest to set .
We use the forgetting factors in the decision rule as their magnitudes indicate the variability of the data stream. For example, if $\lambda_t$ is close to zero, it can be interpreted as a sudden change occurring at time $t$, and if $\lambda_t$ is close to 1, it indicates that the data stream is stable at time $t$. To understand the decision rule better, we illustrate it using two examples.

Variable arm example: let us say arm $a$ was selected at time $t-1$, and at time $t$, arm $a$ has the highest estimated reward and $|\lambda_{t-1}(a) - \lambda_{t-2}(a)| < d$. By the decision rule, the algorithm will select this arm again. We are interested in two cases: first, both $\lambda_{t-1}(a)$ and $\lambda_{t-2}(a)$ are close to 1; second, both $\lambda_{t-1}(a)$ and $\lambda_{t-2}(a)$ are close to 0. It is easy to understand why the algorithm selects the arm in the first case, as the arm is currently stable and it has the highest estimated reward. In the second case, the expected reward of the arm seems to have been variable over the past two steps. Even if it had kept moving down (that is, the worst possibility), the estimated reward would have fallen as well; since arm $a$ still has the highest estimated reward, Algorithm 1 will select it.

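Idle arm example: let us say arm $a$ has the highest estimated reward at time $t$, and it was not selected at time $t-1$. By the decision rule, Algorithm 1 will select this arm for sure, since its forgetting factor was not updated at the previous step and therefore $|\lambda_{t-1}(a) - \lambda_{t-2}(a)| = 0 < d$.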
From these examples, we can see that exploration and exploitation are balanced in a way that takes into account the variability in the estimation procedure rather than by simply flipping a coin. It boosts gaining knowledge for active but variable arms and idle arms.
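The decision rule and the two examples above can be sketched as follows. This is an illustrative reading of the rule as described in the text, not a reproduction of Algorithm 1: the function name, the argument layout (one list per quantity, indexed by arm), and the threshold value used in the example are all assumptions.

```python
import random


def aff_greedy_select(aff_means, lam_last, lam_before, d, rng=random):
    """Select an arm from per-arm AFF means and the last two forgetting factors.

    lam_last[a] and lam_before[a] are the two most recent forgetting factors
    stored for arm a; d is the exploration threshold.
    """
    best = max(range(len(aff_means)), key=aff_means.__getitem__)
    if abs(lam_last[best] - lam_before[best]) < d:
        return best                           # forgetting factor is steady: exploit
    return rng.randrange(len(aff_means))      # forgetting factor is moving: explore


# Idle arm: arm 1 was not observed at the previous step, so its stored
# forgetting factors coincide and it is selected with certainty.
idle_choice = aff_greedy_select(
    aff_means=[0.2, 0.7, 0.4],
    lam_last=[0.95, 0.50, 0.90],
    lam_before=[0.90, 0.50, 0.90],
    d=0.01,  # hypothetical threshold value
)
```

Unlike a coin flip with fixed probability, exploration here is triggered only when the leading arm's forgetting factor is itself changing, that is, when its estimate is in the middle of adapting.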
3.2 Upper Confidence Bound
Another class of algorithms uses upper confidence bounds for selection. The idea is that, instead of the plain sample average, an exploration bonus is added to account for the uncertainty in the estimation, and the arm with the highest potential of being optimal is selected. This exploration bonus is typically derived using concentration inequalities (e.g., Hoeffding, 1963). The UCB1 algorithm introduced by Auer et al. (2002) is a classic method. In later works, UCB1 is often called simply UCB. For any reward distribution that is bounded in [0,1], the UCB algorithm picks the arm which maximises the quantity $\bar{Y}_t + \sqrt{2 \log t / N_t}$, where $\bar{Y}_t$ is the sample average and $N_t$ is the number of times this arm was played up to time $t$. The exploration bonus is derived using the Chernoff-Hoeffding bound. It has been proved that the UCB algorithm achieves logarithmic regret uniformly over time (Auer et al., 2002).
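For reference, the standard UCB1 index of Auer et al. (2002), the sample mean plus a bonus of the square root of 2 log t divided by the play count, can be sketched as below. The function name and the convention of playing every unplayed arm first are illustrative choices.

```python
import math


def ucb1_select(sample_means, counts, t):
    """Return the arm maximising sample mean + sqrt(2 * log(t) / count).

    counts[a] is the number of times arm a has been played; t is the current
    time step (t >= 1). Arms that have never been played are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        mean + math.sqrt(2.0 * math.log(t) / n)
        for mean, n in zip(sample_means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Two arms with equal sample means: the less-played arm gets the larger bonus.
choice = ucb1_select([0.5, 0.5], counts=[10, 1], t=11)
```

The bonus of an arm shrinks only when that arm is played, and both the sample mean and the count weight old and new rewards equally, which is why this index reacts slowly when the expected rewards move.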
For better adaptation in dynamic environments, we replace the sample average with the AFF mean, and modify the upper bound accordingly. This results in the AFF-UCB algorithm in Algorithm 2. The upper bound used for selection is the AFF mean plus an exploration bonus, which we set to:
(14) 
where is the last time instant that the arm was observed; , , and are quantities related to the AFF estimation (see Section 2; and its recursive updating is ).
From (14), is a combination of two components. It can be interpreted by considering two cases:

if an arm was observed at the previous time step, (i.e., ), ;

if an arm was not observed at the previous time step, .
In the former case, the bonus is derived via the Chernoff-Hoeffding bound in a similar way to the derivation of UCB (see Appendix A for details). However, for an unselected arm, if we use the same expression, its upper bound will be static, since the AFF quantities of that arm do not change. As a consequence, it will only be selected if the arm with the current highest upper bound drops below it. This is not desirable since, in a changing environment, any suboptimal arm can become optimal at any time. This motivates us to deliberately add some inflation to the upper bound of unselected arms to impose exploration, which leads to the expression in the latter case. Note that this inflation decreases with the number of arms, $K$. This makes use of the fact that as $K$ increases, the population of arms will "fill" more of the reward space, and more opportunities will arise for picking high reward arms.
3.3 Thompson Sampling
Recently, researchers (e.g., Scott, 2015) have given more attention to the Thompson Sampling (TS) method, which dates back to Thompson (1933). It is an approach based on Bayesian principles. A (usually conjugate) prior is assigned to the expected reward of each arm at the beginning, and the posterior distribution of the expected reward is sequentially updated through successive arm selection. A decision rule is constructed using this posterior distribution. At each round, a random sample is drawn from the posterior distribution of each arm, and the arm with the highest sample value is selected.
For the static Bernoulli bandit, following the approach of Chapelle and Li (2011), it is convenient to choose the Beta distribution, $\mathrm{Beta}(\alpha_0, \beta_0)$, as a prior. The posterior distribution is then $\mathrm{Beta}(\alpha_t, \beta_t)$ at time $t$, and the parameters $\alpha_t$ and $\beta_t$ can be updated recursively as follows: if an arm is selected at time $t$,
$$\alpha_t = \alpha_{t-1} + Y_t, \qquad (15)$$
$$\beta_t = \beta_{t-1} + 1 - Y_t; \qquad (16)$$
otherwise,
$$\alpha_t = \alpha_{t-1}, \qquad (17)$$
$$\beta_t = \beta_{t-1}. \qquad (18)$$
The simplicity and effectiveness of TS in real applications (Scott, 2015) make it a good candidate for dynamic bandits. However, it has similar tracking issues to Greedy and UCB. For illustration, assume an arm is observed all the time; one can then rewrite the recursions in (15)-(16) as:
$$\alpha_t = \alpha_0 + \sum_{i=1}^{t} Y_i, \qquad \beta_t = \beta_0 + t - \sum_{i=1}^{t} Y_i.$$
As a result, the posterior distribution keeps full memory of all the past observations, making posterior inference less responsive to observations near time $t$.
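The standard Beta-Bernoulli Thompson sampling step described above can be sketched as follows, assuming one pair of Beta parameters per arm held in Python lists. The function names are illustrative; `random.betavariate` from the standard library is used to draw posterior samples.

```python
import random


def ts_select(alpha, beta, rng=random):
    """Draw one sample per arm from Beta(alpha[a], beta[a]) and pick the largest."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=samples.__getitem__)


def ts_update(alpha, beta, arm, y):
    """Conjugate update for the played arm only; other arms are left unchanged."""
    alpha[arm] += y      # count of successes
    beta[arm] += 1 - y   # count of failures


# Uniform Beta(1, 1) priors on two arms, then one success on arm 0.
alpha, beta = [1, 1], [1, 1]
ts_update(alpha, beta, arm=0, y=1)
next_arm = ts_select(alpha, beta, rng=random.Random(0))
```

Because the parameters only ever grow, the posterior of a frequently played arm becomes very concentrated, and a later change in its success probability takes many observations to register.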
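To modify the above updating, we use the intermediate quantities from (4)-(5), namely the discounted reward sum $m_t$ and the discounted count $w_t$, in place of the plain sum of rewards and the plain count. If an arm is selected at time $t$,
$$\alpha_t = \alpha_0 + m_t, \qquad (19)$$
$$\beta_t = \beta_0 + w_t - m_t; \qquad (20)$$
otherwise, $\alpha_t$ and $\beta_t$ are updated via (17)-(18). Using these updates, we propose in Algorithm 3 the AFF-TS algorithm for dynamic Bernoulli bandits.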
3.3.1 Optimistic Thompson Sampling
We now look at some popular extensions of TS. May et al. (2012) introduced an optimistic version of Thompson sampling called Optimistic Thompson Sampling (OTS), where the drawn sample value is replaced by the posterior mean if the former is smaller. That is to say, for each arm, the score used for the decision will never be smaller than the posterior mean. OTS further boosts the exploration of highly uncertain arms compared to TS, as it increases the probability of getting a high score for arms with high posterior variance.
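The optimistic score can be sketched for Beta posteriors as follows; the function names are illustrative, and the posterior mean of a Beta(a, b) distribution is a / (a + b).

```python
import random


def ots_scores(alpha, beta, rng=random):
    """Per-arm score: the posterior sample, floored at the posterior mean."""
    scores = []
    for a, b in zip(alpha, beta):
        draw = rng.betavariate(a, b)
        scores.append(max(draw, a / (a + b)))
    return scores


def ots_select(alpha, beta, rng=random):
    """Pick the arm with the highest optimistic score."""
    scores = ots_scores(alpha, beta, rng)
    return max(range(len(scores)), key=scores.__getitem__)


scores = ots_scores([2.0, 30.0], [2.0, 30.0], rng=random.Random(0))
```

Both arms in the example have posterior mean 0.5, so neither score can fall below 0.5; only the upper half of each posterior is ever used, and the wider posterior (the first arm) is the one more likely to produce a high score.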
However, OTS has the same problem as TS when applied to a dynamic problem, in that it uses the full reward history to update the posterior distribution. We propose the AFF version of OTS in Algorithm 4.
3.3.2 Tuning Parameter in Dynamic Thompson Sampling
The Dynamic Thompson Sampling (DTS) algorithm was introduced by Gupta et al. (2011) specifically for solving the dynamic Bernoulli bandit problem of interest here. The DTS algorithm uses a predetermined threshold $C$ in updating the posterior parameters $\alpha_t$ and $\beta_t$, while using the standard Thompson sampling technique for arm selection. For the arm that is selected at time $t$, if $\alpha_{t-1} + \beta_{t-1} < C$, the posterior parameters are updated via (15)-(16); otherwise, when $\alpha_{t-1} + \beta_{t-1} \ge C$,
$$\alpha_t = (\alpha_{t-1} + Y_t) \frac{C}{C+1}, \qquad \beta_t = (\beta_{t-1} + 1 - Y_t) \frac{C}{C+1}.$$
To understand this rule, let $\hat{\mu}_t = \alpha_t / (\alpha_t + \beta_t)$ denote the posterior mean, and assume an arm is observed all the time. Say at time $s$ the arm reaches the threshold, i.e., $\alpha_t + \beta_t = C$ for $t = s$ and onwards. Following (17)-(21) of Gupta et al. (2011), for $t > s$,
$$\hat{\mu}_t = \Big(1 - \frac{1}{C+1}\Big) \hat{\mu}_{t-1} + \frac{1}{C+1} Y_t, \qquad (21)$$
which is a weighted average of $\hat{\mu}_{t-1}$ and the observation $Y_t$. The recursion of $\hat{\mu}_t$ is similar to the EWMA scheme (Roberts, 1959). Essentially, the DTS algorithm uses the threshold $C$ to bound the total amount of reward history used for updating the posterior distribution. Once the threshold is reached, the algorithm starts putting more weight on newer observations.
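A minimal sketch of a capped posterior update of this kind is given below. It assumes the rule takes the form commonly attributed to Gupta et al. (2011): the ordinary conjugate update while the parameter sum is below the threshold, and the same update rescaled by C / (C + 1) once the threshold is reached. The function name `dts_update` and the numbers in the example are illustrative.

```python
def dts_update(alpha, beta, y, C):
    """Update the Beta parameters of the played arm with reward y in {0, 1}.

    Below the threshold C this is the ordinary conjugate update; at or above
    it, the updated parameters are rescaled so that their sum stays at C.
    """
    if alpha + beta < C:
        return alpha + y, beta + 1 - y
    scale = C / (C + 1.0)
    return (alpha + y) * scale, (beta + 1 - y) * scale


# An arm already at the threshold: the sum stays at C and the posterior mean
# moves towards the new observation with weight 1 / (C + 1).
a, b = dts_update(6.0, 4.0, y=1, C=10)
posterior_mean = a / (a + b)
```

In the example the posterior mean moves from 0.6 to 7/11, which is the weighted average (10/11) * 0.6 + (1/11) * 1, so a smaller threshold gives each new observation more weight.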
Although it was demonstrated in Gupta et al. (2011) that the DTS algorithm has the ability to track changes in the expected reward, the performance of the algorithm is very sensitive to the choice of the threshold $C$. In our numerical simulations (see Section 4.2.2), we found that the performance of the DTS algorithm varies considerably with different values of $C$. However, Gupta et al. (2011) did not provide tuning methods for $C$. To address this issue, we propose below two different ways to tune $C$ adaptively at each time step using AFF estimations (AFF-DTS1 and AFF-DTS2, respectively).
AFF-DTS1
From the numerical results in Gupta et al. (2011, sect. IV.C), the optimal threshold $C$ is related to the speed of change of $\mu_t$. This motivates us to tune $C$ according to the variance of the data stream. We can use the AFF variance defined in (10) as an estimate of the data variance. One option is to let the threshold decrease as the AFF variance increases: since a high variance indicates more dynamics in $\mu_t$, a shorter reward history is required. For example in the numerical examples in Section 4.2.2, we will use .
AFF-DTS2
4 Numerical Results
In this section, we illustrate the performance improvements on Greedy, UCB, and TS using AFFs. We consider two different dynamic scenarios for the expected reward $\mu_t$: abruptly changing and drifting. For the abruptly changing scenario, instead of manually setting up change points in $\mu_t$ as in Yu and Mannor (2009) and Garivier and Moulines (2011), we set up change-point instants for an arm by an exponential clock (see Section 4.1.1). In the drifting scenario, the evolution of the expected reward is driven by a random walk in the interval (0,1). For the random walk case we use two different models: the first model is inspired by Slivkins and Upfal (2008), where $\mu_t$ is modelled by a random walk with reflecting bounds; the second model uses a transformation function on a random walk. For each scenario, we test the performance with 2, 50 and 100 arms; the two-armed examples are used for the purpose of illustration, and the latter examples (50 and 100 arms) are used to evaluate the performance with a large number of arms. We also demonstrate the robustness of the AFF MAB algorithms to tuning, specifically, their sensitivity to the step size $\eta$. Finally, we use a two-armed example to show that the modified DTS algorithms, i.e., AFF-DTS1 and AFF-DTS2, can reduce the performance sensitivity of DTS to the input parameter $C$.
4.1 Performance for Different Dynamic Models
We first use two-armed examples to compare the performance of AFF-Greedy, AFF-UCB, and AFF-TS/AFF-OTS to the standard methods Greedy, UCB, and TS respectively. We consider four different cases: two cases for the abruptly changing scenario, and two for the drifting scenario; each case has 100 independent replications. The length of each simulated experiment is . For the Greedy method, we evaluate over a grid of choice of , , and report performance for the best choice. We use step size for all AFF MAB algorithms. For AFF-Greedy, we set the threshold . For all Thompson sampling based algorithms, we use as the prior.
4.1.1 Abruptly Changing Expected Reward
The expected reward is simulated by the following exponential clock model:
(22) 
The rate parameter determines the frequency at which change points occur. At each change point, the new expected reward is sampled from a uniform distribution. We generate two different cases, Case 1 and Case 2. The parameters used for generating these cases can be found in Table 1. For visualisation purposes, we display in Figure 1 a single simulated path of the expected reward of each arm against time. For Case 1, we distinguish the two arms by varying their frequency of change, but in the long run the two arms have the same average expected reward. In Case 2, Arm 1 has a higher long-run average.
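Table 1: Parameters used in the exponential clock model for Cases 1 and 2. For each case, the three columns give the change-rate parameter, followed by the lower and upper bounds of the uniform distribution from which new expected rewards are drawn.

        |        Case 1         |        Case 2
        | rate    lower  upper  | rate    lower  upper
Arm 1   | 0.001   0.0    1.0    | 0.001   0.3    1.0
Arm 2   | 0.010   0.0    1.0    | 0.010   0.0    0.7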
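To give a feel for the abruptly changing scenario, the following sketch simulates a piecewise-constant expected reward. It is a discrete-time stand-in and not a reproduction of the model in (22): it assumes that at every step a change point occurs independently with a small probability `theta`, and that the new level is drawn uniformly between `low` and `high`. The function name and this per-step approximation of the exponential clock are assumptions; the example values are those listed for Arm 2 in Case 2.

```python
import random


def simulate_abrupt(T, theta, low, high, rng):
    """Piecewise-constant path of expected rewards with random change points.

    At each step a change point occurs with probability theta (an assumed
    discrete-time approximation of an exponential clock); at a change point
    the expected reward is redrawn uniformly from [low, high].
    """
    mu = rng.uniform(low, high)
    path = []
    for _ in range(T):
        if rng.random() < theta:
            mu = rng.uniform(low, high)
        path.append(mu)
    return path


path = simulate_abrupt(T=1000, theta=0.010, low=0.0, high=0.7, rng=random.Random(1))
```

Under this approximation the gaps between change points are geometrically distributed with mean 1 / theta, so a rate of 0.010 gives a change roughly every 100 steps and a rate of 0.001 roughly every 1000 steps.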
In Figure 2, we present comparisons in each case. The top row of Figure 2 displays the cumulative regret over time, averaged over 100 independent replications, and the bottom row shows boxplots of the total regret as in (1). The plots indicate that our algorithms yield improved performance over standard approaches. In particular, the improvement is clearest in Case 1, for which the two arms have the same long-run average expected reward. In the case that one arm's mean dominates in the long run (Case 2), the AFF MAB algorithms perform similarly to the standard methods. However, the AFF MAB algorithms have smaller variance among replications. In both cases, AFF-OTS has the best performance in terms of total regret.
4.1.2 Drifting Expected Reward
For the drifting scenario, we use two different models. The first is the random walk model with reflecting bounds introduced in Slivkins and Upfal (2008), which is:
(23) 
where , and . Slivkins and Upfal (2008) showed that generated by this model is stationary, that is in the long run, will be distributed according to a uniform distribution. The parameter used in the model controls the rate of change in an arm. In the left panel of Figure 3, we illustrate a single sample from (23) with (Case 3). Similar to Case 1, the two arms in Case 3 have the same .
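A random walk with reflecting bounds on [0, 1] can be simulated along the following lines. This sketch is not a reproduction of (23): it assumes Gaussian increments with standard deviation `sigma` and a uniformly drawn starting point, and the function names and parameter values are hypothetical.

```python
import random


def reflect(x):
    """Fold a real number back into [0, 1] by reflecting at both boundaries."""
    x = abs(x) % 2.0
    return 2.0 - x if x > 1.0 else x


def simulate_reflecting_walk(T, sigma, rng, mu0=None):
    """Expected-reward path driven by a random walk reflected into [0, 1].

    Increments are assumed to be Gaussian with standard deviation sigma,
    which plays the role of the rate-of-change parameter.
    """
    mu = rng.random() if mu0 is None else mu0
    path = []
    for _ in range(T):
        mu = reflect(mu + rng.gauss(0.0, sigma))
        path.append(mu)
    return path


walk = simulate_reflecting_walk(T=1000, sigma=0.05, rng=random.Random(0))
```

Reflection keeps the path inside the unit interval without pulling it towards either boundary, which is consistent with the long-run uniform behaviour described in the text.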
The second model we use to simulate drifting arms is:
(24) 
where the expected reward is transformed from the random walk . Since a random walk diverges in the long run, any trajectory will move closer and closer to one of the boundaries 0 or 1. Again, the parameter controls the speed that evolves. In the right panel of Figure 3, we illustrate a single sample from (24) with (Case 4).
The results for the drifting scenario can be found in Figure 4. The top row of Figure 4 displays the cumulative regret averaged over 100 independent replications, and the bottom row shows boxplots of total regret. For Case 3, which is simulated from the model in (23), the AFF MAB algorithms outperform the standard approaches. For Case 4, which is simulated from the model in (24), there is a solid improvement in the performance of TS, while UCB and AFFUCB perform similarly. As in the abruptly changing case, AFFOTS performs very well in both drifting cases in terms of total regret. Deploying adaptive estimation in UCB was more challenging, because the estimate from the AFF estimator is harder to interpret (it is more dynamic, with less memory) and the upper bound must be modified accordingly.
4.1.3 Large Number of Arms
Modern applications of the bandit problem can involve a large number of arms. For example, in online advertising, we need to optimise among hundreds of websites. Therefore, we evaluate the performance of our AFF MAB algorithms with a large number of arms by repeating the earlier experiments with 50 and 100 arms. The results are shown in Figures 5–8. The performance gains hold for a large number of arms and are very pronounced for all methods, including UCB (which was more challenging to improve). For Cases 1 and 3, the results for the fifty-armed and one-hundred-armed examples are very similar to the two-armed ones. For Cases 2 and 4, unlike the two-armed examples, where the improvement from adaptive estimation on UCB is marginal, AFFUCB performs better than UCB with 50 and 100 arms. In addition, AFFOTS performs well in all cases. In summary, with a large number of arms, our algorithms perform much better than the standard methods. Interestingly, in all cases, the results for 50 and 100 arms are very similar. This could be because both 50 and 100 arms are enough to fill the reward space [0,1] densely, so that the decision maker finds high-value arms in both settings.
4.2 Robustness to Tuning
We have already seen the improvements the AFF MAB algorithms can offer in different dynamic scenarios. We now move on to examine the sensitivity of performance to the tuning parameters.
4.2.1 Initialisation in the AFF MAB Algorithms
In this section, we examine the influence of the step size on the AFF MAB algorithms. We present results only for Case 3 (see Section 4.1.2) for the sake of brevity; results for the other cases are very similar and hence omitted. For each AFF MAB algorithm, we run experiments with , , , and , where is the AFF variance defined in (10). Note that , , and are fixed, while can change over time. Figures 9–12 display the results for AFFGreedy, AFFUCB, AFFTS, and AFFOTS, respectively. The results show that the algorithms are not particularly sensitive to the step size .
4.2.2 Using Adaptive Forgetting Factors to Tune Parameter in Dynamic Thompson Sampling
In Section 3.3.2, we discussed how adaptive estimation can be used to tune the input parameter in the DTS algorithm proposed by Gupta et al. (2011), and we offered two self-tuning solutions, AFFDTS1 and AFFDTS2. We use the two-armed abruptly changing example (Case 1 in Section 4.1.1) to illustrate how the AFF versions of the algorithm can reduce the sensitivity to .
We test , 10, 100, and 1000 for DTS, AFFDTS1, and AFFDTS2; in AFFDTS1 and AFFDTS2, this value serves as the initial value of . Step size is used for the AFF-related algorithms. Figure 13 displays boxplots of the total regret. We also plot the result of AFFOTS as a benchmark, since it performs well in all cases studied in the previous section. Figure 13 shows that the performance of AFFDTS1 and AFFDTS2 is very stable, while DTS is very sensitive to . With a bad choice of (i.e., 100 and 1000 in this case), the total regret of DTS is much higher than that of AFFDTS1 and AFFDTS2.
5 Conclusion
We have seen that the performance of popular MAB algorithms can be improved significantly using AFFs. The improvements are substantial when the arms are not distinguishable in the long run, i.e., the arms have the same long-term averaged expected reward , . When one arm has a higher (e.g., the two-armed example in Case 2), the gains for the AFF MAB algorithms seem marginal, but there is no loss in performance. Practitioners could therefore be encouraged to implement our adaptive methods when they do not know how behaves over time. In addition, the performance gains for a large number of arms are very pronounced for all methods. Finally, the AFF MAB algorithms we proposed are easy to implement; they do not require any prior knowledge about the dynamic environment, and they seem to be more robust to tuning parameters.
Combining adaptive estimation with UCB was more challenging. The reason is that one needs to reinterpret the estimate of from a stable long-run average to a “more dynamic” estimator (with less memory), and modify the upper bound accordingly. We should mention here that our algorithm AFFUCB turns out to be similar to DUCB (Garivier and Moulines, 2011). In DUCB, a fixed forgetting factor is used for estimation. However, tuning that forgetting factor requires knowing the number of switch points, which our approach does not require.
We conclude by mentioning some interesting avenues for future work. One extension is to apply AFF-based methods to more challenging problems, e.g., rotting bandits (Levine et al., 2017), contextual bandits (Langford and Zhang, 2008; Li et al., 2010), and applications like online advertising. Another extension could involve a rigorous analysis of how the bias in AFF estimation varies with time and how this can affect the selection in MAB problems.
Appendix A Derivations for Exploration Bonus in AFFUCB
We present how we derive the following exploration bonus (used for selection at time step ) in the AFFUCB algorithm:
The construction of combines two parts, and , so we can separate it into two cases:

if an arm was observed at time , (i.e., ), ;

if an arm was not observed at time , .
We now present how to derive the first part . According to (2)–(5), the AFF mean for an independent data stream is
According to Hoeffding’s inequality, we have
where . We use the fact that is bounded in since is bounded in . Whilst Hoeffding’s inequality is typically used for i.i.d. variables, there are similar expressions for Markov chains (Glynn and Ormoneit, 2002), which fit our framework.
Let denote , and set , we have
is the probability that the difference between and exceeds . The form of is similar to the exploration bonus in UCB (Auer et al., 2002), where was set to to obtain a tighter upper bound as the number of trials increases (that is, exploration is reduced over time). This is sensible in static cases, as the estimates converge with . However, as we are interested in dynamic cases, we favour a bound that keeps a certain level of exploration over time; hence we take a constant and obtain the first part of the exploration bonus
If we use only as the exploration bonus, the value for an unselected arm will be static, since and do not change. As a consequence, the arm will be selected only if the current highest upper bound drops below it. This motivates us to add some inflation deliberately. A naive choice is to add the data variance, and this leads to the second part where is the AFF variance defined in (10).
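To make the Hoeffding step concrete, the sketch below computes a constant-confidence half-width for a weighted mean of independent rewards in [0, 1]. It uses the general one-sided Hoeffding bound for sums of bounded variables, which gives a half-width of sqrt(log(1/delta) * sum(w_i^2) / (2 * (sum w_i)^2)). For simplicity it assumes a fixed forgetting factor `lam`, whereas the AFF estimator lets the factor vary over time; the function names and the values of `lam` and `delta` are hypothetical.

```python
import math


def hoeffding_half_width(weights, delta):
    """Half-width eps with P(weighted mean - its expectation >= eps) <= delta.

    Valid for independent observations bounded in [0, 1].
    """
    w_sum = sum(weights)
    sq_sum = sum(w * w for w in weights)
    return math.sqrt(math.log(1.0 / delta) * sq_sum / (2.0 * w_sum ** 2))


def geometric_weights(n, lam):
    """Weights lam^(n-1), ..., lam, 1: older observations count less."""
    return [lam ** (n - 1 - i) for i in range(n)]


delta = 0.05  # constant confidence level (an assumed value)
static_bonus = hoeffding_half_width(geometric_weights(1000, 1.0), delta)
forgetful_bonus = hoeffding_half_width(geometric_weights(1000, 0.9), delta)
```

With `lam = 1` the half-width shrinks like the inverse square root of the number of observations, as in the static case. With `lam < 1` it levels off at a positive value, so a bonus built this way keeps some exploration alive however long an arm has been observed.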
References
 Anagnostopoulos et al. (2012) Christoforos Anagnostopoulos, Dimitris K. Tasoulis, Niall M. Adams, Nicos G. Pavlidis, and David J. Hand. Online linear and quadratic discriminant analysis with adaptive forgetting for streaming classification. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(2):139–166, 2012.
 Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
 Awerbuch and Kleinberg (2008) Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
 Bodenham and Adams (2016) Dean A. Bodenham and Niall M. Adams. Continuous monitoring for changepoints in data streams using adaptive estimation. Statistics and Computing, 27(5):1257–1270, 2016.
 Brochu et al. (2011) Eric Brochu, Matthew D. Hoffman, and Nando de Freitas. Portfolio allocation for Bayesian optimization. arXiv:1009.5419v2, 2011.
 Chapelle and Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.
 DaCosta et al. (2008) Luis DaCosta, Alvaro Fialho, Marc Schoenauer, and Michele Sebag. Adaptive operator selection with dynamic multiarmed bandits. In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pages 913–920, 2008.
 Garivier and Cappe (2011) Aurelien Garivier and Olivier Cappe. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19, pages 359–376, 2011.
 Garivier and Moulines (2011) Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, volume 6925 of Lecture Notes in Artificial Intelligence, pages 174–188. Springer-Verlag Berlin, 2011.
 Glynn and Ormoneit (2002) Peter W. Glynn and Dirk Ormoneit. Hoeffding’s inequality for uniformly ergodic Markov chains. Statistics and Probability Letters, 56(2):143–146, 2002.
 Granmo and Berg (2010) Ole-Christoffer Granmo and Stian Berg. Solving nonstationary bandit problems by random sampling from sibling Kalman filters. In Proceedings of Trends in Applied Intelligent Systems, PT III, volume 6098 of Lecture Notes in Artificial Intelligence, pages 199–208. Springer-Verlag Berlin, 2010.
 Gupta et al. (2011) Neha Gupta, Ole-Christoffer Granmo, and Ashok Agrawala. Thompson sampling for dynamic multiarmed bandits. In Proceedings of the 10th International Conference on Machine Learning and Applications (ICMLA), volume 1, pages 484–489, 2011.
 Haykin (2002) Simon S. Haykin. Adaptive Filter Theory. Prentice-Hall, Upper Saddle River, N.J., 4th edition, 2002.
 Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 Kocsis and Szepesvari (2006) Levente Kocsis and Csaba Szepesvari. Discounted UCB. In 2nd PASCAL Challenges Workshop, Venice, 2006. URL https://www.lri.fr/~sebag/Slides/Venice/Kocsis.pdf.
 Koulouriotis and Xanthopoulos (2008) D. E. Koulouriotis and A. Xanthopoulos. Reinforcement learning and evolutionary algorithms for nonstationary multiarmed bandit problems. Applied Mathematics and Computation, 196(2):913–922, 2008.
 Kuleshov and Precup (2014) Volodymyr Kuleshov and Doina Precup. Algorithms for the multiarmed bandit problem. arXiv:1402.6028v1, 2014.
 Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for multiarmed bandits with side information. In Advances in Neural Information Processing Systems 20, pages 817–824. Curran Associates, Inc., 2008.
 Levine et al. (2017) Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. arXiv:1702.07274v3, 2017.
 Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
 May et al. (2012) Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069–2106, 2012.
 Papadimitriou and Tsitsiklis (1999) Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
 Press (2009) William H. Press. Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research. Proceedings of the National Academy of Sciences of the United States of America, 106(52):22387–22392, 2009.
 Robbins (1952) Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–536, 1952.
 Roberts (1959) S. W. Roberts. Control chart tests based on geometric moving averages. Technometrics, 1(3):239–250, 1959.
 Scott (2010) Steven L. Scott. A modern Bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
 Scott (2015) Steven L. Scott. Multiarmed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
 Shen et al. (2015) Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-2015), pages 974–980, 2015.
 Slivkins and Upfal (2008) Aleksandrs Slivkins and Eli Upfal. Adapting to a changing environment: The Brownian restless bandits. In 21st Conference on Learning Theory (COLT), pages 343–354, 2008.
 Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Tsung and Wang (2010) Fujee Tsung and Kaibo Wang. Adaptive charting techniques: Literature review and extensions, pages 19–35. Physica-Verlag HD, Heidelberg, 2010.
 Vermorel and Mohri (2005) Joannès Vermorel and Mehryar Mohri. Multiarmed bandit algorithms and empirical evaluation. In Proceedings of the 16th European Conference on Machine Learning, volume 3720, pages 437–448, 2005.
 Villar et al. (2015) Sofía S. Villar, Jack Bowden, and James Wason. Multiarmed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30(2):199–215, 2015.
 Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, Cambridge University, 1989.
 Whittle (1988) Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
 Yu and Mannor (2009) Jia Yuan Yu and Shie Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th International Conference on Machine Learning, pages 1177–1184, 2009.