Stochastic Bandits with Delay-Dependent Payoffs


Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{\mathcal{O}}(\sqrt{kT})$, where $k$ is the number of arms and $T$ is time. Our algorithm uses only $\mathcal{O}(k \ln\ln T)$ switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.

1 Introduction

Multiarmed bandits (see, e.g., Bubeck and Cesa-Bianchi, 2012) are a popular mathematical framework for modeling sequential decision problems in the presence of partial feedback; typical application domains include clinical trials, online advertising, and product recommendation. Consider for example the task of learning the genre of songs most liked by a given user of a music streaming platform. Each song genre is viewed as an arm of a bandit problem associated with the user. A bandit algorithm learns by sequentially choosing arms (i.e., recommending songs) and observing the resulting payoff (i.e., whether the user liked the song). The payoff is used by the algorithm to refine its recommendation policy. The distinctive feature of bandits is that, after each recommendation, the algorithm receives feedback only for the selected arm (i.e., the single genre that was recommended).

In the simplest stochastic bandit framework (Lai and Robbins, 1985) rewards are realizations of i.i.d. draws from fixed and unknown distributions associated with each arm. In this setting the optimal policy is to consistently recommend the arm with the highest reward expectation. On the other hand, in scenarios like song recommendation, users may grow tired of listening to the same music genre over and over. This is naturally formalized as a nonstationary bandit setting, where the payoff of an arm grows with the time since the arm was last played. In this case policies consistently recommending the same arm are seldom optimal. E-learning applications, where arms correspond to questions that students have to answer, are other natural examples of the same phenomenon, as immediately repeating a question that the student has just answered is not very effective.

In this paper we introduce a simple nonstationary stochastic bandit model, B2DEP, in which the expected reward of an arm is a bounded nondecreasing function of the number of rounds that have passed since the arm was last selected by the policy. More specifically, we assume each arm $i$ has an unknown baseline payoff expectation $\mu_i$ (equal to the expected payoff when the arm is pulled for the first time) and an unknown delay parameter $d_i$. If the arm was pulled recently (that is, $\tau \le d_i$ rounds ago), then the expected payoff $\mu_i(\tau)$ may be smaller than its baseline value: $\mu_i(\tau) \le \mu_i$. Vice versa, if $\tau > d_i$, then $\mu_i(\tau)$ is guaranteed to match the baseline value $\mu_i$. In the song recommendation example, the delays $d_i$ model the extent to which listening to a song of genre $i$ affects how much a user is willing to listen to more songs of that same genre.

Since can be viewed as a notion of state for arm , our model can be compared to nonstationary models, such as rested bandits (Gittins, 1979) and restless bandits (Whittle, 1988) —see also (Tekin and Liu, 2012). In restless bandits the reward distribution of an arm changes irrespective of the policy being used, whereas in rested bandits the distribution changes only when the arm is selected by the policy. Our setting is neither rested nor restless, as our reward distributions change differently according to whether the arm is selected by the policy or not.

In Section 4 we use a reduction from the Periodic Maintenance Scheduling Problem (Bar-Noy et al., 2002) to prove that the optimization problem of finding an optimal periodic policy in our setting is NP-hard. In order to circumvent the hardness of computing the optimal periodic policy, in Section 5 we identify a simple class of periodic policies that are efficiently learnable, and whose expected reward is provably within a constant factor of that of the optimal policy. Our approximating class is natural: it contains all ranking policies that cycle over the best $m$ arms (where $m$ is the parameter to optimize) according to the unknown ordering based on the arms’ baseline payoff expectations. As it turns out, learning the best ranking policy can be formulated in terms of minimizing the standard notion of regret. This is unlike the problem of learning the best periodic policy, which instead requires minimizing the harder notion of policy regret (Arora et al., 2012).

Consider the task of learning the best ranking policy. In our music streaming example, a ranking policy is a playlist for the user. As changing the playlist streamed to the user may be costly in practice, we also introduce a switching cost for selecting a different ranking policy. Controlling the number of switches is also beneficial in our nonstationary setting, where the expected reward of a ranking policy may depend on which other ranking policies were played earlier. The learning agent should ensure that a ranking policy is played many times consecutively (i.e., with infrequent switches), so that estimates are calibrated (i.e., computed in the same context of past plays).

A standard bandit strategy like UCB (Auer et al., 2002), which guarantees a regret of $\widetilde{\mathcal{O}}(\sqrt{kT})$ (where $k$ is the number of arms and $T$ is the number of rounds) irrespective of the size of the suboptimality gaps between the expected reward of the optimal ranking policy and that of the other policies, performs a number of switches growing with the squared inverse of these gaps. In Section 6 we show how to learn the best ranking policy using a simple variant of a learning algorithm based on action elimination proposed in (Cesa-Bianchi et al., 2013). Similarly to UCB, this algorithm has a distribution-free regret bound of $\widetilde{\mathcal{O}}(\sqrt{kT})$. However, a bound of $\mathcal{O}(k \ln\ln T)$ on the number of switches is also guaranteed irrespective of the size of the gaps.

In Section 7 we turn to the problem of constructing the class of ranking policies, which amounts to learning the ordering of the arms according to their baseline payoff expectations . Assuming , this can be reduced to the problem of learning the ordering of reward expectations in a standard stochastic bandit with i.i.d. rewards. We show that this is possible with a number of pulls bounded by (ignoring logarithmic factors), where is the smallest gap between and . Note that this bound is not significantly improvable, because samples of arm each are needed to verify that .

Finally, in Section 8 we describe experiments comparing our low-switch algorithm against UCB in both large-gap and in small-gap settings.

2 Related works

Our setting is a variant of the model introduced by Kleinberg and Immorlica (2018). In that work, are concave, nondecreasing functions satisfying . Note that this setting and ours are incomparable. Indeed, unlike (Kleinberg and Immorlica, 2018) we assume a specific parametric form for the functions , which are nondecreasing and bounded by . On the other hand, we do not assume concavity, which plays a key role in their analysis.

Pike-Burke and Grunewalder (2019) consider a setting in which the expected reward functions are sampled from a Gaussian Process with known kernel. The main result is a bound of order on the Bayesian -step lookahead regret, where is a user-defined parameter. This notion of regret is defined by dividing time in length- blocks, and then summing the regret in each block against the greedy algorithm optimizing the next pulls given the agent’s current configuration of delays (i.e., how long ago each arm was last pulled). Similarly to (Pike-Burke and Grunewalder, 2019), we also compete against a greedy block strategy. However, in our case the block length is unknown, and the greedy strategy is not defined in terms of the agent’s delay configuration.

A special case of our model is investigated in the very recent work by Basu et al. (2019). Unlike B2DEP, they assume for all and complete knowledge of the delays . In fact, they even assume that every arm cannot be selected in the next time steps after a pull. Their main result is a regret bound for a variant of UCB competing against the greedy policy. They also show NP-hardness of finding the optimal policy through a reduction similar to ours. It is not clear how their learning approach could be extended to prove results in our more general setting, where could be positive even when and the delays are unknown.

A different approach to nonstationary bandits in recommender systems considers expected reward functions that depend on the number of times the arm was played so far (Levine et al., 2017; Cortes et al., 2017; Bouneffouf and Féraud, 2016; Heidari et al., 2016; Seznec et al., 2019; Warlop et al., 2018). These cases correspond to a rested bandit model, where each arm’s expected reward can only change when the arm is played.

The fact that we learn ranking strategies is reminiscent of stochastic combinatorial semi-bandits (Kveton et al., 2015b), where the number of arms in the schedule is a parameter of the learning problem. In particular, similarly to (Radlinski et al., 2008; Kveton et al., 2015a; Katariya et al., 2016) our strategies learn rankings of the actions, but unlike those approaches in our case the optimal number of elements in the ranking must be learned too.

3 The B2DEP setting

In the classical stochastic multiarmed bandit model, at each round the agent pulls an arm from and receives the associated payoff, which is a -valued random variable independently drawn from the (fixed but unknown) probability distribution associated with the pulled arm. The payoff is the only feedback revealed to the agent at each round. The agent’s goal is to maximize the expected cumulative payoff over any number of rounds.

In the B2DEP (Bandits with DElay DEpendent Payoffs) variant introduced here, when the agent plays an arm $i$ the $[0,1]$-valued payoff has expected value

$$\mu_i(\tau) = \big(1 - f(\tau)\,\mathbb{I}\{1 \le \tau \le d_i\}\big)\,\mu_i$$

where $\mu_i$ is the unknown baseline reward expectation for arm $i$, $f$ is an unknown nonincreasing function, and $\tau$ is the number of rounds that have passed since that arm was last pulled (conventionally, $\tau = 0$ means that an arm is pulled for the first time). When $f$ is identically zero, B2DEP reduces to the standard stochastic bandit model with payoff expectations $\mu_i$. The unknown arm-dependent delay parameters $d_i$ control the number of rounds after which the arm’s expected payoff is guaranteed to return to its baseline value $\mu_i$.
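As an illustration of the payoff model just described, here is a minimal Python sketch. It assumes the discount takes the form "baseline times one minus f(tau)" while the delay is between 1 and the arm's delay parameter; the function names, the example discount `f_example`, and the numeric values are hypothetical and not taken from the paper.

```python
from typing import Callable


def expected_reward(mu: float, d: int, f: Callable[[int], float], tau: int) -> float:
    """Expected payoff of an arm with baseline `mu` and delay parameter `d`.

    `tau` is the number of rounds since the arm was last pulled
    (tau == 0 means the arm is pulled for the first time).
    Assumed form: the baseline is discounted by f(tau) only while 1 <= tau <= d.
    """
    if 1 <= tau <= d:
        return (1.0 - f(tau)) * mu  # pulled recently: discounted payoff
    return mu  # first pull, or more than d rounds have passed


def f_example(tau: int) -> float:
    """Hypothetical nonincreasing discount, strongest right after a pull."""
    return 1.0 / (1 + tau)


# Expected payoff of one arm (mu = 0.8, d = 3) for delays 0, 1, ..., 5.
rewards = [expected_reward(0.8, 3, f_example, tau) for tau in range(6)]
```

With a nonincreasing discount, the expected payoff is nondecreasing in the delay for delays of at least one round, and it equals the baseline once the delay exceeds the delay parameter.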

A policy maps a sequence of past observed payoffs to the index of the next arm to pull. Let be the payoff collected by policy at round . Given an instance of B2DEP, the optimal policy maximizes, over all policies , the long term expected average payoff

Note that the payoff expectations at any time step are fully determined by the current delay vector $(\tau_1, \dots, \tau_k)$, where each integer $\tau_i$ counts how many rounds have passed since arm $i$ was last pulled (setting $\tau_i = 0$ if $i$ was never pulled or if it was last pulled more than $d_i$ steps ago). Hence, because the delay vector takes finitely many values, any delay-based policy (e.g., any deterministic function of the current delay vector) is eventually periodic, meaning that the arm pulled at round $t + P$ coincides with the arm pulled at round $t$ for all $t > t_0$, where $P$ is the period and $t_0$ is the length of the transient.

Consider the greedy policy  defined as follows: At each round ,  pulls the arm with the highest expected reward according to current delays


where if was never pulled before. It is easy to see that  is not always optimal. For example consider the following instance of with : for all , , , . Then  always pulls arm and achieves , whereas where alternates between arm and arm . Hence .
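The suboptimality of the greedy policy can be checked numerically. The sketch below is only an illustration: the two-arm instance, the constant discount `f_const`, and all function names are hypothetical (they are not the instance used in the text), and the payoff model is the assumed form in which the baseline is discounted by f(tau) while the delay is between 1 and the delay parameter.

```python
from typing import List, Optional


def expected_reward(mu, d, f, tau):
    # Assumed payoff model: the discount f(tau) applies only while 1 <= tau <= d.
    return (1.0 - f(tau)) * mu if 1 <= tau <= d else mu


def average_payoff(choose, mus, ds, f, horizon):
    """Average expected payoff of a policy over `horizon` rounds.

    `choose(t, taus)` returns the arm to pull at round t, where taus[i] is the
    number of rounds since arm i was last pulled (0 if it was never pulled).
    """
    last_pull: List[Optional[int]] = [None] * len(mus)
    total = 0.0
    for t in range(horizon):
        taus = [0 if lp is None else t - lp for lp in last_pull]
        arm = choose(t, taus)
        total += expected_reward(mus[arm], ds[arm], f, taus[arm])
        last_pull[arm] = t
    return total / horizon


# Hypothetical two-arm instance with a constant discount of 0.4.
MUS, DS = [1.0, 0.5], [1, 0]


def f_const(tau):
    return 0.4


def greedy(t, taus):
    # Pull the arm with the highest expected reward under the current delays.
    return max(range(len(MUS)),
               key=lambda i: expected_reward(MUS[i], DS[i], f_const, taus[i]))


def alternate(t, taus):
    # Periodic policy that alternates between the two arms.
    return t % 2


greedy_avg = average_payoff(greedy, MUS, DS, f_const, 1000)
alternate_avg = average_payoff(alternate, MUS, DS, f_const, 1000)
```

On this instance the greedy policy keeps pulling arm 0 and averages about 0.60 per round, whereas alternating between the two arms averages 0.75, because arm 0 is always pulled after its delay has expired.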

In the next section we show that the problem of finding the optimal periodic policy for B2DEP is intractable.

4 Hardness results

We show that the optimization problem of finding an optimal policy for B2DEP is NP-hard, even when all the instance parameters are known. Our proof relies on the NP-completeness of the Periodic Maintenance Scheduling Problem (PMSP) shown by Bar-Noy et al. (2002). Although a very similar result can also be proven using the reduction of Basu et al. (2019), introduced for a special case of our B2DEP setting, we give our proof for completeness.

A maintenance schedule on machines is any infinite sequence over , where indicates that no machine is scheduled for service at that time. An instance of the PMSP decision problem is given by integer service intervals such that . The question is whether there exists a maintenance schedule such that the consecutive service times of each machine are exactly times apart. The following result holds (proof in the supplementary material).
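To make the PMSP feasibility condition concrete, the following sketch checks whether a finite period, repeated forever, is a valid maintenance schedule. The encoding (a list of machine indices with 0 for "no machine") follows the description above; the function name and the example instances are hypothetical.

```python
from typing import List, Sequence


def is_feasible_period(period: Sequence[int], intervals: Sequence[int]) -> bool:
    """Check a periodic maintenance schedule against the service intervals.

    `period[t]` is the machine serviced at step t (0 means no machine), and
    the schedule repeats `period` forever. `intervals[i - 1]` is the exact
    number of steps required between consecutive services of machine i.
    """
    length = len(period)
    for machine, gap in enumerate(intervals, start=1):
        times: List[int] = [t for t, m in enumerate(period) if m == machine]
        if not times:
            return False  # the machine is never serviced
        # Consecutive service times, wrapping around the end of the period.
        following = times[1:] + [times[0] + length]
        if any(b - a != gap for a, b in zip(times, following)):
            return False
    return True


# Machine 1 every 2 steps, machines 2 and 3 every 4 steps: feasible.
ok = is_feasible_period([1, 2, 1, 3], [2, 4, 4])
# Machine 2 is serviced every 4 steps here, not every 3 as required.
bad = is_feasible_period([1, 2, 1, 0], [2, 3])
```

Verifying a given schedule is easy; the hardness lies in deciding whether any feasible schedule exists.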

Theorem 1.

It is NP-hard to decide whether an instance of B2DEP has a periodic policy achieving

5 Approximating the optimal policy

In order to circumvent the computational problem of finding the best periodic policy, we introduce a simple class of periodic ranking policies whose best element has a cumulative expected payoff not too far from that of the optimal policy. Without loss of generality, assume that $\mu_1 \ge \cdots \ge \mu_k$. Let $\Pi = \{\pi_1, \dots, \pi_k\}$, where each policy $\pi_m$ cycles over the arm sequence $1, \dots, m$. The average reward of policy $\pi_m$ is defined by

Since  maximizes over , where


We now bound in terms of .

Theorem 2.

where is the largest arm index such that

and if .

The definition of is better understood in the context of the more intuitive delay-based policy . Note indeed that is the first round in which  prefers to pull one of the arms that were played in the first rounds rather than the next arm .


Since maximizes (3),

where the term takes into account that may not divide , and the fact that in the first rounds the expected reward is instead of . Now split the time steps in blocks of length . Because is —by definition— the largest expected reward any policy can achieve in consecutive steps, the expected reward of in any of these blocks is at most . Therefore

where, as before, the term takes into account that may not divide . This concludes the proof. ∎

The proof of Theorem 2 actually shows that both and achieve the claimed approximation. However, by definition is bigger than the total reward of the policy that cycles over . Also, learning is relatively easy, as we show in Section 6.

It is easy to see that is not monotone due to the presence of the coefficients . For example, consider the B2DEP instance defined by , , , , , and . Then .
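The dependence of the average reward on the cutoff can be explored numerically. The sketch below simulates each ranking policy in steady state and picks the best cutoff. It is an illustration under assumptions: the payoff model is the assumed form in which the baseline is discounted by f(tau) while the delay is between 1 and the delay parameter, and the instance (`MUS`, `DS`, `f_const`) is hypothetical, not the one given in the text.

```python
def expected_reward(mu, d, f, tau):
    # Assumed payoff model: the discount f(tau) applies only while 1 <= tau <= d.
    return (1.0 - f(tau)) * mu if 1 <= tau <= d else mu


def ranking_average(m, mus, ds, f, cycles=20):
    """Steady-state average expected payoff of cycling over arms 0, ..., m-1."""
    last_pull = list(range(m))  # warm-up cycle: arm i was pulled at round i
    t = m
    total = 0.0
    for _ in range(cycles):
        for i in range(m):
            tau = t - last_pull[i]  # rounds since arm i was last pulled
            total += expected_reward(mus[i], ds[i], f, tau)
            last_pull[i] = t
            t += 1
    return total / (cycles * m)


# Hypothetical instance, arms sorted by decreasing baseline expectation.
MUS, DS = [1.0, 0.9, 0.1], [1, 0, 0]


def f_const(tau):
    return 0.5


averages = [ranking_average(m, MUS, DS, f_const) for m in range(1, len(MUS) + 1)]
best_cutoff = 1 + max(range(len(averages)), key=averages.__getitem__)
```

On this instance the averages are 0.5, 0.95, and about 0.667 for cutoffs 1, 2, and 3: adding the second arm helps because the first arm recovers its baseline, while adding the weak third arm lowers the average.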

6 Learning the ghost policy

In this section we deal with the problem of learning assuming the correct ordering of the arms (such that ) is known. In the next section, we consider the problem of learning this ordering.

Our search space is the set of ranking policies , where each policy cycles over the arm sequence . Note that, by definition, . The average reward of policy is defined by . Note that every time the learning algorithm chooses to play a different policy , an extra cost is incurred due to the need of calibrating the estimates for . In fact, if we played a policy different from in the previous round, the reward expectation associated with the play of in the current round is potentially different from . This is due to the fact that we cannot guarantee that each arm in the schedule used by was pulled exactly steps earlier. This implies that we need to play each newly selected policy more than once, as the first play cannot be used to reliably estimate .

We now introduce the policy  (Algorithm 1), a simple variant of a learning algorithm based on action elimination proposed in  (Cesa-Bianchi et al., 2013). This policy has a regret bound similar to UCB while guaranteeing a bound on the number of switches, irrespective of the size of the gaps. In Section 8 we compare with UCB.

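1: Input: policy set $\Pi$, confidence $\delta$, horizon $T$
2: Let the initial set of active policies be $\Pi$
3: repeat (the stage number is indexed by $s = 1, 2, \dots$)
4:     for each policy in the active set do
5:         Play the policy for the number of times prescribed for stage $s$
6:         Compute its sample average reward, discarding the first play
7:     end for
8:     Let the empirically best policy be the active policy with the largest sample average reward
9:     Remove from the active set all policies whose sample average reward is significantly smaller than that of the empirically best policy
10: until overall number of pulls exceeds $T$
Algorithm 1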

In each stage , algorithm  plays each policy in the active set for times, where . Then, the algorithm computes the sample average reward based on these plays, excluding the first one because of calibration (lines 4 to 7). After that, the empirically best policy is selected (line 8). Finally, the active set is recomputed (line 9) excluding all policies whose sample average reward is significantly smaller than that of the empirically best policy. The quantity is derived from a standard Chernoff-Hoeffding bound and is equal to where

implying . The terms account for the extra calibration pull each time we switch to a new policy in . We can prove the following bound on the regret of  with respect to .

Theorem 3.

When run on an instance of B2DEP with parameters and , Algorithm 1 guarantees


with probability at least .

Note that this bound is distribution-free. That is, it does not depend on the gaps (which in general could be arbitrarily small). The rate $\sqrt{T}$, as opposed to the $\ln T$ rate of distribution-dependent bounds, cannot be improved upon in general (Bubeck and Cesa-Bianchi, 2012).
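The stage structure of an action elimination scheme with few switches can be sketched in Python as follows. This is a simplified illustration, not the paper's exact algorithm: the stage budgets growing as the horizon raised to the power 1 - 2^(-s), the Hoeffding-style radius, the budget counted in plays rather than pulls, and the early stop when a single policy remains are all assumptions made here, and the names `low_switch_elimination` and `play` are hypothetical.

```python
import math
from typing import Callable, Dict, List


def low_switch_elimination(
    play: Callable[[int], float],
    num_policies: int,
    horizon: int,
    delta: float = 0.1,
) -> List[int]:
    """Return the policies still active when the budget of plays runs out.

    `play(m)` performs one play of ranking policy m (one cycle over its arms)
    and returns the average payoff of that play, a number in [0, 1].
    """
    active = list(range(1, num_policies + 1))
    used, stage = 0, 1
    while used < horizon and len(active) > 1:
        # Assumed schedule: stage budgets grow as horizon ** (1 - 2 ** -stage).
        budget = horizon ** (1.0 - 2.0 ** (-stage))
        n = max(2, math.ceil(budget / len(active)))
        means: Dict[int, float] = {}
        for m in active:  # each active policy is played n times in a row
            payoffs = [play(m) for _ in range(n)]
            used += n
            kept = payoffs[1:]  # discard the first, uncalibrated play
            means[m] = sum(kept) / len(kept)
        # Assumed Hoeffding radius with a union bound over policies and stages.
        radius = math.sqrt(
            math.log(2 * num_policies * stage * (stage + 1) / delta) / (2 * (n - 1))
        )
        best = max(means.values())
        active = [m for m in active if means[m] >= best - 2 * radius]
        stage += 1
    return active


# Hypothetical noiseless payoffs for three ranking policies.
PAYOFF = {1: 0.5, 2: 0.95, 3: 2.0 / 3.0}
survivors = low_switch_elimination(lambda m: PAYOFF[m], 3, 100_000)
```

Because each active policy is played in one consecutive batch per stage, the number of switches per stage is at most the number of active policies, regardless of how close the policies' average rewards are.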


The proof is an adaptation of (Cesa-Bianchi et al., 2013, Theorem 6). Note that by construction. Also, our choice of and Chernoff-Hoeffding bound implies that


simultaneously for all with probability at least . To see this, note that in every stage the estimates are computed using plays. Since a play of consists of pulls, we have that each is estimated using realizations of a sequence of random variables whose expectations have average exactly equal to .

We now claim that, with probability at least , and for all .

We prove the claim by induction on . We first show that the base case holds with probability at least . Then we show that if the claim holds for , then it holds for with probability at least over all random events in stage . Therefore, using a union bound over we get that the claim holds simultaneously for all with probability at least .

For the base case note that by definition, and thus holds. Moreover: , , and , where the two first inequalities hold with probability at least because of (5). This implies as required. We now prove the claim for . The inductive assumption
directly implies that . Thus we have , because maximizes over a set that contains . The rest of the proof of the claim closely follows that of the base case .

We now return to the proof of the theorem. For any and for any we have that

holds with probability at least . Hence, recalling that the number of switches between two different policies in is deterministically bounded by , the regret of the player can be bounded as follows,

where the term accounts for the regret suffered in the plays where we switched between two policies in and paid maximum regret due to calibration for at most steps (as each policy in is implemented with at most pulls). Now, since , and , we obtain that with probability at least the regret is at most of order . ∎

7 Learning the ordering of the arms

In this section we show how to recover, with high probability, the correct ordering of the arms. Initially, we ignore the problem of calibration, and focus on the task of learning the arm ordering when each pull of arm $i$ returns a sample from the true baseline reward distribution with expectation $\mu_i$.

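1: Input: confidence parameter $\delta$
2: Output: a permutation of $\{1, \dots, k\}$
3: Let $A = \{1, \dots, k\}$ be the initial set of active arms
4: repeat (the sampling round is indexed by $s = 1, 2, \dots$)
5:     Sample once all arms in $A$ (sampling round $s$)
6:     Sort the empirical means in decreasing order
7:     for each active arm $i$ in sorted order do
8:         if the confidence interval of $i$ does not overlap those of the arms with larger empirical mean then
9:              if the confidence interval of $i$ does not overlap those of the arms with smaller empirical mean then
10:                  Remove $i$ from $A$
11:                  Rank the arms with larger empirical mean before $i$
12:                  Rank the arms with smaller empirical mean after $i$
13:              end if
14:         end if
15:     end for
16: until $A$ is empty
Algorithm 2 (BanditRanker)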

BanditRanker (Algorithm 2) is an action elimination procedure. The arms in the set of active arms are sampled once each (line 5), and their average rewards are kept sorted in decreasing order (line 6). We use to denote the sample average of rewards obtained from arm after sampling rounds, and define the indexing be such that , where ties are broken according to the original arm indexing.

When the confidence interval around the average reward of an arm $i$ no longer overlaps the confidence intervals of the other arms (lines 8 and 9), $i$ is removed from the set of active arms and not sampled anymore (line 10). Moreover, the set of all arms whose average reward is larger than that of $i$ (if any) is ranked before $i$ (line 11). Similarly, the set of all arms whose average reward is smaller than that of $i$ (if any) is ranked after $i$ (line 12). The algorithm ends when all arms are removed (line 16).

The parameter determining the confidence interval after sampling rounds is defined by


The sequence of removed arms can be stored in a binary tree whose root is the first removed arm and whose left (resp., right) leaf contains all arms whose average reward was bigger (resp., smaller) when the first arm was removed. When a new arm is removed, the leaf to which it belongs is split using the same logic that we used for the root. Eventually, all nodes contain a single arm and the in-order traversal of the tree provides the desired ordering.
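The elimination and leaf-splitting logic can be sketched in Python as follows. This is a simplified illustration rather than the paper's exact procedure: the confidence radius is an assumed Hoeffding-style choice, the tree is stored as an ordered list of leaves (whose left-to-right concatenation is the in-order traversal), and the names `bandit_ranker`, `pull`, and `max_rounds` are hypothetical.

```python
import math
from typing import Callable, List


def bandit_ranker(pull: Callable[[int], float], k: int, delta: float = 0.05,
                  max_rounds: int = 100_000) -> List[int]:
    """Return arms 0, ..., k-1 ordered by decreasing estimated mean reward.

    `pull(i)` returns one reward in [0, 1] from arm i. `max_rounds` is a
    safety cap for instances with tied means.
    """
    leaves = [list(range(k))]  # ordered leaves; a removed arm is a singleton
    active = set(range(k))
    sums = [0.0] * k
    s = 0
    while active and s < max_rounds:
        s += 1
        for i in active:
            sums[i] += pull(i)
        # Assumed radius: Hoeffding bound with a union bound over arms and rounds.
        radius = math.sqrt(math.log(4 * k * s * s / delta) / (2 * s))
        mean = {i: sums[i] / s for i in active}
        # Arms whose confidence interval overlaps no other active arm's interval.
        separated = [i for i in active
                     if all(abs(mean[i] - mean[j]) > 2 * radius
                            for j in active if j != i)]
        for i in separated:
            active.discard(i)
            idx = next(n for n, leaf in enumerate(leaves) if i in leaf)
            leaf = leaves[idx]
            higher = [j for j in leaf if j != i and mean[j] > mean[i]]
            lower = [j for j in leaf if j != i and mean[j] < mean[i]]
            # Split the leaf: better arms, then the removed arm, then worse arms.
            leaves[idx:idx + 1] = [part for part in (higher, [i], lower) if part]
    return [arm for leaf in leaves for arm in leaf]


# Hypothetical noiseless instance: each pull returns the arm's mean exactly.
TRUE_MEANS = [0.2, 0.9, 0.5, 0.7]
ordering = bandit_ranker(lambda i: TRUE_MEANS[i], len(TRUE_MEANS))
```

Arms that are far from their neighbours in the ordering are removed early and stop consuming samples, which is what makes the total number of pulls depend on the individual gaps.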

We introduce the following quantity, measuring the suboptimality gaps between arms that are adjacent in the correct ordering,

where .

We are now ready to state and prove the main result of this section.

Theorem 4.

If Algorithm 2 is run with parameter on a -armed stochastic bandit problem, the correct ordering of the arms is returned with probability at least after a number of pulls of order


Note that, up to logarithmic factors, the bound stated in Theorem 4 is of the same order as the sample used by an ideal procedure that knows and uses the optimal order of samples to determine the position of each arm in the correct ordering.


The proof is an adaptation of Even-Dar et al. (2006, Theorem 8). Using Chernoff-Hoeffding bounds, the choice of ensures that


If an action is eliminated after sampling rounds, then it must be that for all and all . Condition (8) then ensures that, with probability at least , for all such and . This implies that the current ordering of for is correct with respect to . Since , every action is eventually eliminated. Therefore, with probability at least the sequence of eliminated arms and their corresponding sets provide the correct arm ordering.

We now proceed to bounding the number of samples. Under condition (8), for all ,

Therefore, if , then . Recalling the definition (6) of and solving by we get

Thus, after sampling rounds, with probability at least . Similarly, after sampling rounds, with probability at least .

This further implies that after many sampling rounds, action is eliminated and not sampled any more.

Re-define the indexing so that . Hence by definition. We now compute a bound on the overall number of pulls based on our bound on the number of sampling rounds. With probability at least , we have that: pulls are needed to eliminate arm , pulls are needed to eliminate arm , and so on. Hence, with probability at least the total number of pulls needed to eliminate all arms is

where we set conventionally . This concludes the proof of the theorem. ∎

In order to apply BanditRanker to an instance of B2DEP, we assume that an upper bound be available in advance to the algorithm. This ensures that for all . In each sampling round , we partition the arms in in groups of size and make pulls for each group by cycling twice over the arms in an arbitrary order. Then, the first pulls in each group are discarded, while the last pulls are used to estimate the expectations (when does not divide we can add to arms that were already removed, or arms from previous groups, just for the purpose of calibrating). The sample size bound (7) remains of the same order (because the extra pulls only add a factor of two).

8 Experiments

In this section we present an empirical evaluation of our policy in a synthetic environment with Bernoulli rewards. In order to study the impact of the switching cost on ranking policies when the suboptimality gap is small, we also define a setting in which there are two distinct ranking policies that are both optimal (see Figure 1).

Figure 1: Transitions between policies and assuming , where the notation stands for . The expected reward obtained by switching between policies is different from the expected reward obtained by cycling over the same policy.

We plot regrets against the policy . Our policy  is run without any specific tuning (other than the knowledge of the horizon ) and with set to in all experiments. The benchmark  consists of running UCB1 —with the same scaling factor as in the original article by Auer et al. (2002)— over the class of ranking policies, where calibration is addressed by rolling out twice each ranking policy selected by UCB1 and using only the second roll-out to compute reward estimates. Since both  and  are run over , we implicitly assume that BanditRanker successfully ranked the arms in a preliminary stage.

Figure 2: Comparing regrets of  and  against  with arms and baseline expectations and . A unit cost is charged for switching between ranking policies. Curves are averages of runs each using a different sample of delays uniformly drawn from . We plot expectations of sampled arms rather than realized rewards.

Figure 2 shows that when the gap between the best and the second best ranking policy is not too small ( on average in these experiments), then  is competitive against  even in the presence of unit switching costs. This happens because, in order to minimize the number of switches,  samples a suboptimal policy more frequently than . Although this oversampling does not affect the distribution-free regret bound of , it hurts performance unless the suboptimality gap is small enough to cause the switching costs to prevail, a case which is addressed next. Note also that  eventually stops exploration because all policies but one have been eliminated, while  keeps on exploring, albeit at a logarithmic rate.

Figure 3: Comparing regrets of  and  against  with arms such that with unit cost charged for switching between the two policies (upper part) and without any cost for switching (lower part).

In the second experiment we consider two arms with , , , , and chosen so as to simulate a vanishing suboptimality gap between and . Figure 3 (upper part) shows that  performs better than  due to its low-switch regime. On the other hand, Figure 3 (lower part) shows that when the switching cost is zero, switching between two good policies becomes more advantageous than using a single good policy, and the regret of both  and  becomes negative (in this case , which has no control over the number of switches, outperforms ). The reason for this advantage is explained by Fact 1 below (proof in the supplementary material); see also Figure 1.

Fact 1.

If an instance of B2DEP admits two optimal ranking policies, then consistently switching between these two policies achieves an average expected reward higher than sticking to either one.

To summarize, the experiments confirm that, in the presence of switching costs,  works better than  only when the suboptimality gap is very small. The advantage of  over  is however reduced by the fact that switching between two good policies is better than consistently playing either one of the two (Fact 1). Note also that  stops exploring because is known. This preliminary knowledge can be dispensed with using a doubling trick, or some more sophisticated method. Also, it would be interesting to design a method that achieves the better of the performances of  and , according to the size of the suboptimality gap.

9 Conclusions

Motivated by music recommendation in streaming platforms, we introduced a new stochastic bandit model with nonstationary reward distributions. To cope with the NP-hardness of learning the optimal policy caused by nonstationarity, we introduced a restricted class of ranking policies approximating the optimal performance. We then proved sample and regret bounds on the problem of learning the best ranking policy in this class. One of the main problems left open by our work is that of deriving more practical learning algorithms, able to simultaneously learn the ranking of the arms and the best cutoff value , while minimizing their regret with respect to the best ranking policy.


Nicolò Cesa-Bianchi acknowledges partial support by the Google Focused Award Algorithms and Learning for AI (ALL4AI) and by the MIUR PRIN grant Algorithms, Games, and Digital Markets (ALGADIMAR).

Supplementary Material for Bandits with Delay-Dependent Payoffs

1 Proof of Theorem 1


Given an instance of PMSP, we construct a B2DEP instance with arms such that and for all , , and . The long-term average reward for a periodic policy in this setting is

where is the number of times the policy plays arm in a period and is the number of time steps between when arm was played for the -th time in the cycle and the last time it was played (in the same cycle or in the previous cycle, excluding the transient). Clearly, if the PMSP instance has a feasible schedule, then we can design a bandit policy that replicates that schedule (playing arm at all time steps where no machines are scheduled for maintenance). The long-term average reward of this policy is at most . Moreover, if we have a periodic bandit policy with long-term average reward exactly equal to , this means that each arm is eventually played after exactly rounds. Indeed, the only way to have

is by setting for all . ∎

2 Proof of Fact 1


We use the following notation: , where , stands for . Consider two optimal ranking policies and with . Then , where and similarly for . The expected total reward of playing after is , and the expected total reward of playing after is . We want to prove

Rearranging gives . Since , we have

Observing that , the above is equivalent to

which is always true since in our model expected rewards are non-decreasing with delays. ∎


  1. Arora et al. (2012). Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pp. 1747–1754.
  2. Auer et al. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3).
  3. Bar-Noy et al. (2002). Minimizing service and operation costs of periodic scheduling. Mathematics of Operations Research 27, pp. 518–544.
  4. Basu et al. (2019). Blocking bandits. In Advances in Neural Information Processing Systems, pp. 4785–4794.
  5. Bouneffouf and Féraud (2016). Multi-armed bandit problem with known trend. Neurocomputing 205, pp. 16–21.
  6. Bubeck and Cesa-Bianchi (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5 (1), pp. 1–122.
  7. Cesa-Bianchi et al. (2013). Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pp. 1160–1168.
  8. Cortes et al. (2017). Discrepancy-based algorithms for non-stationary rested bandits. arXiv preprint arXiv:1710.10657.
  9. Even-Dar et al. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7, pp. 1079–1105.
  10. Gittins (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological) 41 (2), pp. 148–164.
  11. Heidari et al. (2016). Tight policy regret bounds for improving and decaying bandits. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1562–1570.
  12. Katariya et al. (2016). DCM bandits: learning to rank with multiple clicks. In International Conference on Machine Learning, pp. 1215–1224.
  13. Kleinberg and Immorlica (2018). Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 309–319.
  14. Kveton et al. (2015a). Cascading bandits: learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767–776.
  15. Kveton et al. (2015b). Tight regret bounds for stochastic combinatorial semi-bandits. In 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 535–543.
  16. Lai and Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
  17. Levine et al. (2017). Rotting bandits. In Advances in Neural Information Processing Systems, pp. 3074–3083.
  18. Pike-Burke and Grunewalder (2019). Recovering bandits. In Advances in Neural Information Processing Systems, pp. 14122–14131.
  19. Radlinski et al. (2008). Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pp. 784–791.
  20. Seznec et al. (2019). Rotting bandits are no harder than stochastic ones. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2564–2572.
  21. Tekin and Liu (2012). Online learning of rested and restless bandits. IEEE Transactions on Information Theory 58 (8), pp. 5588–5611.
  22. Warlop et al. (2018). Fighting boredom in recommender systems with linear reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 1764–1773.
  23. Whittle (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability 25 (A), pp. 287–298.