###### Abstract

Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{\mathcal{O}}\big(\sqrt{kT}\big)$, where $k$ is the number of arms and $T$ is time. Our algorithm uses only $\mathcal{O}\big(k\ln\ln T\big)$ switches, which helps when switching between policies is costly. As constructing the class of ranking policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.

Stochastic Bandits with Delay-Dependent Payoffs

Leonardo Cella & Nicolò Cesa-Bianchi

Dipartimento di Informatica, Università degli Studi di Milano

## 1 Introduction

Multiarmed bandits (see, e.g., RegretBandits) are a popular mathematical framework for modeling sequential decision problems in the presence of partial feedback; typical application domains include clinical trials, online advertising, and product recommendation. Consider for example the task of learning the genre of songs most liked by a given user of a music streaming platform. Each song genre is viewed as an arm of a bandit problem associated with the user. A bandit algorithm learns by sequentially choosing arms (i.e., recommending songs) and observing the resulting payoff (i.e., whether the user liked the song). The payoff is used by the algorithm to refine its recommendation policy. The distinctive feature of bandits is that, after each recommendation, the algorithm receives feedback only for the selected arm (i.e., the single genre that was recommended).

In the simplest stochastic bandit framework (lai1985asymptotically) rewards are realizations of i.i.d. draws from fixed and unknown distributions associated with each arm. In this setting the optimal policy is to consistently recommend the arm with the highest reward expectation. On the other hand, in scenarios like song recommendation, users may grow tired of listening to the same music genre over and over. This is naturally formalized as a nonstationary bandit setting, where the payoff of an arm grows with the time since the arm was last played. In this case policies consistently recommending the same arm are seldom optimal. E-learning applications, where arms correspond to questions that students have to answer, are other natural examples of the same phenomenon, as immediately asking again the same question that the student has just answered is not very effective.

In this paper we introduce a simple nonstationary stochastic bandit model, B2DEP, in which the expected reward of an arm is a bounded nondecreasing function of the number of rounds that have passed since the arm was last selected by the policy. More specifically, we assume each arm $i$ has an unknown baseline payoff expectation $\mu_i$ (equal to the expected payoff when the arm is pulled for the first time) and an unknown delay parameter $d_i$. If the arm was pulled recently (that is, $\tau \le d_i$, where $\tau$ is the number of rounds since the arm was last pulled), then the expected payoff $\mu_i(\tau)$ may be smaller than its baseline value: $\mu_i(\tau) \le \mu_i$. Vice versa, if $\tau > d_i$, then $\mu_i(\tau)$ is guaranteed to match the baseline value $\mu_i$. In the song recommendation example, the delays $d_i$ model the extent to which listening to a song of genre $i$ affects how much a user is willing to listen to more songs of that same genre.

Since the number of rounds $\tau$ elapsed since the last pull can be viewed as a notion of state for arm $i$, our model can be compared to nonstationary models, such as rested bandits (gittins1979bandit) and restless bandits (whittle1988restless); see also (TekinRestedRestless). In restless bandits the reward distribution of an arm changes irrespective of the policy being used, whereas in rested bandits the distribution changes only when the arm is selected by the policy. Our setting is neither rested nor restless, as our reward distributions change differently according to whether the arm is selected by the policy or not.

In Section 4 we use a reduction from the Periodic Maintenance Scheduling Problem (hardScheduling) to prove that the optimization problem of finding an optimal periodic policy in our setting is NP-hard. In order to circumvent the hardness of computing the optimal periodic policy, in Section 5 we identify a simple class of periodic policies that are efficiently learnable, and whose expected reward is provably within a constant factor of that of the optimal policy. Our approximating class is natural: it contains all ranking policies that cycle over the best $m$ arms (where $m$ is the parameter to optimize) according to the unknown ordering based on the arms’ baseline payoff expectations.

We focus first on the task of learning the best element in the class of all ranking policies. In our music streaming example, a ranking policy is a playlist for the user. As changing the playlist streamed to the user may be costly in practice, we also introduce a switching cost for selecting a different ranking policy. Controlling the number of switches can also be beneficial in our nonstationary setting, where the expected reward of a ranking policy may depend on which other ranking policies were played earlier. The learning agent should ensure that a ranking policy is played many times consecutively (i.e., infrequent switches), so that estimates are calibrated (i.e., computed in the same context of past plays).

A standard bandit strategy like UCB (UCB1), which guarantees a regret of $\widetilde{\mathcal{O}}\big(\sqrt{kT}\big)$ irrespective of the size of the suboptimality gaps between the expected reward of the optimal ranking policy and that of the other policies, performs a number of switches growing with the squared inverse of these gaps. In Section 6 we show how to learn the best ranking policy using a simple variant of a learning algorithm based on action elimination proposed in (cesa2013online). Similarly to UCB, this algorithm has a distribution-free regret bound of $\widetilde{\mathcal{O}}\big(\sqrt{kT}\big)$. However, a bound of $\mathcal{O}\big(k\ln\ln T\big)$ on the number of switches is also guaranteed irrespective of the size of the gaps.

In Section 7 we turn to the problem of constructing the class of ranking policies, which amounts to learning the ordering of the arms according to their baseline payoff expectations . Assuming , this can be reduced to the problem of learning the ordering of reward expectations in a standard stochastic bandit with i.i.d. rewards. We show that this is possible with a number of pulls bounded by (ignoring logarithmic factors), where is the smallest gap between and . Note that this bound is not significantly improvable, because samples of arm each are needed to verify that .

Finally, in Section 8 we describe experiments comparing our low-switch algorithm against UCB in both large-gap and in small-gap settings.

## 2 Related works

Our setting is a variant of the model introduced by kleinberg2018recharging. In that work, are concave, nondecreasing functions satisfying . Note that this setting and ours are incomparable. Indeed, unlike (kleinberg2018recharging) we assume a specific parametric form for the functions , which are nondecreasing and bounded by . On the other hand, we do not assume concavity, which plays a key role in their analysis.

RecoveringBandits consider a setting in which the expected reward functions are sampled from a Gaussian Process with known kernel. The main result is a bound of order on the Bayesian -step lookahead regret, where is a user-defined parameter. This notion of regret is defined by dividing time in length- blocks, and then summing the regret in each block against the greedy algorithm optimizing the next pulls given the agent’s current configuration of delays (i.e., how long ago each arm was last pulled). Similarly to (RecoveringBandits), we also compete against a greedy block strategy. However, in our case the block length is unknown, and the greedy strategy is not defined in terms of the agent’s delay configuration.

A special case of our model is investigated in the very recent work by blockingBandits. Unlike B2DEP, they assume $f(\tau) = 1$ for all $\tau$ and complete knowledge of the delays $d_i$. In fact, they even assume that every arm $i$ cannot be selected in the next $d_i$ time steps after a pull. Their main result is a regret bound for a variant of UCB competing against the greedy policy. They also show NP-hardness of finding the optimal policy through a reduction similar to ours. It is not clear how their learning approach could be extended to prove results in our more general setting, where $\mu_i(\tau)$ could be positive even when $\tau \le d_i$ and the delays are unknown.

A different approach to nonstationary bandits in recommender systems considers expected reward functions that depend on the number of times the arm was played so far (levine2017rotting; cortes2017discrepancy; bouneffouf2016multi; Heidari_DecayingBandits; Rotting_as_stochastic; linUCRL). These cases correspond to a rested bandit model, where each arm’s expected reward can only change when the arm is played.

The fact that we learn ranking strategies is reminiscent of stochastic combinatorial semi-bandits (kveton2015tight), where the number of arms in the schedule is a parameter of the learning problem. In particular, similarly to (radlinski2008learning; kveton2015cascading; katariya2016dcm) our strategies learn rankings of the actions, but unlike those approaches in our case the optimal number of elements in the ranking must be learned too.

## 3 The B2DEP setting

In the classical stochastic multiarmed bandit model, at each round $t$ the agent pulls an arm from $\{1,\ldots,k\}$ and receives the associated payoff, which is a $[0,1]$-valued random variable independently drawn from the (fixed but unknown) probability distribution associated with the pulled arm. The payoff is the only feedback revealed to the agent at each round. The agent’s goal is to maximize the expected cumulative payoff over any number of rounds.

In the B2DEP (Bandits with DElay DEpendent Payoffs) variant introduced here, when the agent plays an arm $i$ the $[0,1]$-valued payoff has expected value

$$\mu_i(\tau) = \big(1 - f(\tau)\,\mathbb{I}\{0 < \tau \le d_i\}\big)\,\mu_i \tag{1}$$

where $\mu_i$ is the unknown baseline reward expectation for arm $i$, $f$ is an unknown nonincreasing function with values in $[0,1]$, and $\tau$ is the number of rounds that have passed since that arm was last pulled (conventionally, $\tau = 0$ means that an arm is pulled for the first time). When $f$ is identically zero, B2DEP reduces to the standard stochastic bandit model with payoff expectations $\mu_1,\ldots,\mu_k$. The unknown arm-dependent delay parameters $d_i$ control the number of rounds after which the arm’s expected payoff is guaranteed to return to its baseline value $\mu_i$.

A policy maps a sequence of past observed payoffs to the index of the next arm to pull. Let be the payoff collected by policy at round . Given an instance of B2DEP, the optimal policy maximizes, over all policies , the long term expected average payoff

Note that the payoff expectations at any time step are fully determined by the current delay vector $(\tau_1,\ldots,\tau_k)$, where each integer $\tau_i$ counts how many rounds have passed since arm $i$ was last pulled (setting $\tau_i = 0$ if $i$ was never pulled or if it was last pulled more than $d_i$ steps ago). Hence, any delay-based policy (e.g., any deterministic function of the current delay vector) is eventually periodic, meaning that the arm pulled at round $t + P$ coincides with the arm pulled at round $t$ for all $t > t_0$, where $P$ is the period and $t_0$ is the length of the transient.

Consider the greedy policy defined as follows: At each round , pulls the arm with the highest expected reward according to current delays

(2) |

where if was never pulled before. It is easy to see that is not always optimal. For example consider the following instance of with : for all , , , . Then always pulls arm and achieves , whereas where alternates between arm and arm . Hence .
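The suboptimality of the greedy policy can also be checked numerically. The following is a minimal sketch on a hypothetical two-arm instance, not necessarily the one described above: the values in `mu` and `d`, the constant penalty `f`, and all function names are assumptions made for illustration. It also assumes that the baseline of an arm is discounted by $f(\tau)$ only while $0 < \tau \le d_i$. Arms are indexed from 0 in the code.

```python
def expected_payoff(mu_i, d_i, tau, f):
    """Expected payoff of an arm last pulled tau rounds ago (tau == 0: never pulled).

    Assumption: the baseline mu_i is scaled by (1 - f(tau)) only when 0 < tau <= d_i.
    """
    if 0 < tau <= d_i:
        return (1.0 - f(tau)) * mu_i
    return mu_i


def average_reward(choose_arm, mu, d, f, horizon):
    """Average expected reward collected by a policy over `horizon` rounds."""
    k = len(mu)
    last_pull = [None] * k  # round of the most recent pull of each arm
    total = 0.0
    for t in range(horizon):
        # Current delays: 0 encodes "never pulled before".
        taus = [0 if last_pull[i] is None else t - last_pull[i] for i in range(k)]
        payoffs = [expected_payoff(mu[i], d[i], taus[i], f) for i in range(k)]
        arm = choose_arm(t, payoffs)
        total += payoffs[arm]
        last_pull[arm] = t
    return total / horizon


def greedy(t, payoffs):
    # Pull the arm with the highest current expected payoff (ties: lowest index).
    return max(range(len(payoffs)), key=lambda i: payoffs[i])


def alternate(t, payoffs):
    # Periodic schedule 0, 1, 0, 1, ... that ignores the current payoffs.
    return t % len(payoffs)


# Hypothetical instance: arm 0 is better, but loses 30% when pulled twice in a row.
mu, d = [1.0, 0.6], [1, 1]
f = lambda tau: 0.3

greedy_avg = average_reward(greedy, mu, d, f, horizon=1000)
alternate_avg = average_reward(alternate, mu, d, f, horizon=1000)
print(greedy_avg, alternate_avg)
```

On this instance the greedy rule keeps pulling arm 0 and settles at about 0.7 per round, whereas the alternating schedule lets both arms recover and averages 0.8.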

In the next section we show that the problem of finding the optimal periodic policy for B2DEP is intractable.

## 4 Hardness results

We show that the optimization problem of finding an optimal policy for B2DEP is NP-hard, even when all the instance parameters are known. Our proof relies on the NP-completeness of the Periodic Maintenance Scheduling Problem (PMSP) shown by hardScheduling. Although a very similar result can also be proven using the reduction of blockingBandits, introduced for a special case of our B2DEP setting, we give our proof for completeness.

A maintenance schedule on machines is any infinite sequence over , where indicates that no machine is scheduled for service at that time. An instance of the PMSP decision problem is given by integer service intervals such that . The question is whether there exists a maintenance schedule such that the consecutive service times of each machine are exactly times apart. The following result holds (proof in the supplementary material).

###### Theorem 1.

It is NP-hard to decide whether an instance of B2DEP has a periodic policy achieving
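To make the PMSP side of the reduction concrete, the sketch below checks whether one period of a cyclic schedule services every machine at exactly its required interval. The function name and the encoding of a schedule as a Python list are assumptions made for illustration; checking a given schedule is easy, while deciding whether a feasible schedule exists is the hard part.

```python
def is_feasible_schedule(period, intervals):
    """Check one period of a cyclic maintenance schedule.

    period: list over {0, 1, ..., n}; 0 means no machine is serviced in that slot.
    intervals: intervals[i - 1] is the required service interval of machine i.
    The period is repeated forever, so gaps are measured cyclically.
    """
    length = len(period)
    for machine, interval in enumerate(intervals, start=1):
        slots = [t for t, m in enumerate(period) if m == machine]
        if not slots:
            return False  # this machine is never serviced
        for j, t in enumerate(slots):
            nxt = slots[(j + 1) % len(slots)]
            gap = (nxt - t) % length or length  # wrap around the end of the period
            if gap != interval:
                return False
    return True


print(is_feasible_schedule([1, 2, 1, 0], [2, 4]))        # True
print(is_feasible_schedule([1, 2, 1, 0, 2, 0], [2, 3]))  # False
```

In the second call machine 1 is serviced at slots 0 and 2 only, so its cyclic gaps are 2 and 4 rather than 2 and 2.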

## 5 Approximating the optimal policy

In order to circumvent the computational problem of finding the best periodic policy, we introduce a simple class of periodic ranking policies whose best element has a cumulative expected payoff not too far from that of . Without loss of generality, assume that . Let , where each policy cycles over the arm sequence . The average reward of policy is defined by

Since maximizes over , where

(3) |

We now bound in terms of .

###### Theorem 2.

where is the largest arm index such that

and if .

The definition of is better understood in the context of the more intuitive delay-based policy . Note indeed that is the first round in which prefers to pull one of the arms that were played in the first rounds rather than the next arm .

###### Proof.

Since maximizes (3),

where the term takes into account that may not divide , and the fact that in the first rounds the expected reward is instead of . Now split the time steps in blocks of length . Because is —by definition— the largest expected reward any policy can achieve in consecutive steps, the expected reward of in any of these blocks is at most . Therefore

where, as before, the term takes into account that may not divide . This concludes the proof. ∎

The proof of Theorem 2 actually shows that both and achieve the claimed approximation. However, by definition is bigger than the total reward of the policy that cycles over . Also, learning is relatively easy, as we show in Section 6.

It is easy to see that is not monotone due to the presence of the coefficients . For example, consider the B2DEP instance defined by , , , , , and . Then .

## 6 Learning the ghost policy

In this section we deal with the problem of learning the best ranking policy, assuming the correct ordering of the arms (such that $\mu_1 \ge \cdots \ge \mu_k$) is known. In the next section, we consider the problem of learning this ordering.

Our search space is the set of ranking policies , where each policy cycles over the arm sequence . Note that, by definition, . The average reward of policy is defined by . Note that every time the learning algorithm chooses to play a different policy , an extra cost is incurred due to the need of calibrating the estimates for . In fact, if we played a policy different from in the previous round, the reward expectation associated with the play of in the current round is potentially different from . This is due to the fact that we cannot guarantee that each arm in the schedule used by was pulled exactly steps earlier. This implies that we need to play each newly selected policy more than once, as the first play cannot be used to reliably estimate .
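The calibration issue can be illustrated with a small noise-free simulation. This is a sketch under stated assumptions: the instance (`mu`, `d`, the constant penalty `f`) and the helper names are hypothetical, arms are indexed from 0, and the baseline of an arm is assumed to be discounted by $f(\tau)$ only while $0 < \tau \le d_i$. The code compares the first play of a ranking policy right after a switch with its second play and with the steady-state average obtained when every arm in the cycle has delay equal to the cycle length.

```python
def expected_payoff(mu_i, d_i, tau, f):
    # Assumption: baseline mu_i is discounted by f(tau) only while 0 < tau <= d_i.
    return (1.0 - f(tau)) * mu_i if 0 < tau <= d_i else mu_i


def play_sequence(arms, mu, d, f):
    """Expected payoff of each pull in `arms`, starting from never-pulled arms."""
    last_pull, payoffs = {}, []
    for t, arm in enumerate(arms):
        tau = t - last_pull[arm] if arm in last_pull else 0
        payoffs.append(expected_payoff(mu[arm], d[arm], tau, f))
        last_pull[arm] = t
    return payoffs


def steady_state_average(m, mu, d, f):
    """Average reward of cycling over arms 0..m-1: every arm is pulled with delay m."""
    return sum(expected_payoff(mu[i], d[i], m, f) for i in range(m)) / m


# Hypothetical instance, arms already sorted by decreasing baseline.
mu, d = [0.9, 0.8, 0.5], [2, 2, 1]
f = lambda tau: 0.5

# Play the 1-arm policy three times, then switch to the 3-arm policy for two plays.
arms = [0, 0, 0] + [0, 1, 2] + [0, 1, 2]
payoffs = play_sequence(arms, mu, d, f)
first_play = sum(payoffs[3:6]) / 3   # biased by what was played before the switch
second_play = sum(payoffs[6:9]) / 3  # every arm now has delay exactly 3
print(first_play, second_play, steady_state_average(3, mu, d, f))
```

Here the first play after the switch averages about 0.58 because arm 0 was pulled one round earlier, while the second play already matches the steady-state value of about 0.73. This is why the first play of a newly selected policy is discarded when estimating its average reward.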

We now introduce Algorithm 1, a simple variant of a learning algorithm based on action elimination proposed in (cesa2013online). This policy has a regret bound similar to that of UCB while guaranteeing a bound on the number of switches, irrespective of the size of the gaps. In Section 8 we compare Algorithm 1 with UCB.

In each stage , the algorithm plays each policy in the active set for times, where . Then, the algorithm computes the sample average reward based on these plays, excluding the first one because of calibration (lines 4–7). After that, the empirically best policy is selected (line 8). Finally, the active set is recomputed (line 9) excluding all policies whose sample average reward is significantly smaller than that of the empirically best policy. The quantity is derived from a standard Chernoff-Hoeffding bound and is equal to where

implying . The terms account for the extra calibration pull each time we switch to a new policy in . We can prove the following bound on the regret of with respect to .
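The stage structure described above can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 1: the function name, the stage schedule $T^{1-2^{-s}}$, the Hoeffding-style radius, and the noise-free policy means are all assumptions chosen for the example.

```python
import math


def eliminate_in_stages(policies, play, stage_lengths, radius):
    """Action elimination over ranking policies, one batch of plays per stage.

    policies: list of policy identifiers.
    play(policy): runs the policy once and returns its observed average reward.
    stage_lengths: total number of plays available in each stage (assumed schedule).
    radius(n): confidence radius after n usable plays (assumed Hoeffding-style).
    Returns the final active set and the number of times a policy was started.
    """
    active = list(policies)
    switches = 0
    for budget in stage_lengths:
        if len(active) == 1:
            break  # nothing left to compare
        plays_each = max(2, budget // len(active))
        averages = {}
        for policy in active:
            switches += 1
            rewards = [play(policy) for _ in range(plays_each)]
            usable = rewards[1:]  # drop the first play: it is not calibrated
            averages[policy] = sum(usable) / len(usable)
        best = max(averages.values())
        width = 2 * radius(plays_each - 1)
        # Keep only policies whose average is close to the empirical best.
        active = [p for p in active if averages[p] >= best - width]
    return active, switches


# Hypothetical, noise-free average rewards of three ranking policies.
means = {"pi_1": 0.5, "pi_2": 0.8, "pi_3": 0.6}
horizon = 10_000
stages = [int(horizon ** (1 - 2 ** -s)) for s in range(1, 6)]
radius = lambda n: math.sqrt(math.log(200.0) / (2 * n))  # arbitrary confidence level
active, switches = eliminate_in_stages(list(means), lambda p: means[p], stages, radius)
print(active, switches)
```

Because each active policy is started once per stage and the number of stages grows very slowly with the horizon, the number of switches stays small regardless of how close the policies are.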

###### Theorem 3.

When run on an instance of B2DEP with parameters and , Algorithm 1 guarantees

(4) |

with probability at least .

Note that this bound is distribution-free. That is, it does not depend on the gaps (which in general could be arbitrarily small). The rate $\sqrt{T}$, as opposed to the $\ln T$ rate of distribution-dependent bounds, cannot be improved upon in general (RegretBandits).

###### Proof.

The proof is an adaptation of (cesa2013online, Theorem 6). Note that by construction. Also, our choice of and Chernoff-Hoeffding bound implies that

(5) |

simultaneously for all with probability at least . To see this, note that in every stage the estimates are computed using plays. Since a play of consists of pulls, we have that each is estimated using realizations of a sequence of random variables whose expectations have average exactly equal to .

We now claim that, with probability at least , and for all .

We prove the claim by induction on . We first show that the base case holds with probability at least . Then we show that if the claim holds for , then it holds for with probability at least over all random events in stage . Therefore, using a union bound over we get that the claim holds simultaneously for all with probability at least .

For the base case note that by definition, and thus holds. Moreover,

where the two first inequalities hold with probability at least because of (5). This implies

as required. We now prove the claim for . The inductive assumption

directly implies that . Thus we have , because maximizes over a set that contains . The rest of the proof of the claim closely follows that of the base case .

We now return to the proof of the theorem. For any and for any we have that

holds with probability at least . Hence, recalling that the number of switches between two different policies in is deterministically bounded by , the regret of the player can be bounded as follows,

where the term accounts for the regret suffered in the plays where we switched between two policies in and paid maximum regret due to calibration for at most steps (as each policy in is implemented with at most pulls). Now, since , and , we obtain that with probability at least the regret is at most of order . ∎

## 7 Learning the ordering of the arms

In this section we show how to recover, with high probability, the correct ordering of the arms. Initially, we ignore the problem of calibration, and focus on the task of learning the arm ordering when each pull of arm $i$ returns a sample from the true baseline reward distribution with expectation $\mu_i$.

BanditRanker (Algorithm 2) is an action elimination procedure. The arms in the set of active arms are sampled once each (line 5), and their average rewards are kept sorted in decreasing order (line 6). We use to denote the sample average of rewards obtained from arm after sampling rounds, and define the indexing to be such that , where ties are broken according to the original arm indexing.

When the confidence interval around the average reward of an arm no longer overlaps with the confidence intervals of the other arms (lines 8–9), the arm is removed from the set of active arms and no longer sampled (line 10). Moreover, the set of all active arms with a larger average reward (if any) is ranked before it (line 11). Similarly, the set of all active arms with a smaller average reward (if any) is ranked after it (line 12). The algorithm ends when all arms are removed (line 16).

The parameter determining the confidence interval after sampling rounds is defined by

(6) |

The sequence of removed arms can be stored in a binary tree whose root is the first removed arm and whose left (resp., right) leaf contains all arms whose average reward was bigger (resp., smaller) when the first arm was removed. When a new arm is removed, the leaf to which it belongs is split using the same logic that we used for the root. Eventually, all nodes contain a single arm and the in-order traversal of the tree provides the desired ordering.
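The elimination and splitting logic can be sketched as follows. This is an illustration rather than a transcription of Algorithm 2: the function name, the `max_rounds` cap, and the Hoeffding-style radius with a union bound are assumptions, and an ordered list of blocks is used in place of the binary tree (reading the list left to right plays the role of the in-order traversal). Arms are indexed from 0.

```python
import math


def rank_arms(k, pull, delta, max_rounds=100_000):
    """Order arms by decreasing mean using confidence-interval elimination.

    pull(i) returns one reward in [0, 1] for arm i.
    Returns the removed arms in ranked order (all k arms if every arm was removed
    within max_rounds sampling rounds).
    """
    sums = [0.0] * k
    # Ordered partition: a list is a block of active arms, an int is a removed arm.
    order = [list(range(k))]
    for r in range(1, max_rounds + 1):
        active = [i for item in order if isinstance(item, list) for i in item]
        if not active:
            break
        for i in active:
            sums[i] += pull(i)  # one sample per active arm per round
        avg = {i: sums[i] / r for i in active}
        # Assumed confidence radius after r samples of each active arm.
        eps = math.sqrt(math.log(4 * k * r * r / delta) / (2 * r))
        separated = [i for i in active
                     if all(abs(avg[i] - avg[j]) > 2 * eps for j in active if j != i)]
        for i in separated:
            pos = next(p for p, item in enumerate(order)
                       if isinstance(item, list) and i in item)
            block = order[pos]
            above = [j for j in block if avg[j] > avg[i]]
            below = [j for j in block if avg[j] < avg[i]]
            # Split the block: better arms first, then arm i, then worse arms.
            order[pos:pos + 1] = [b for b in (above, i, below) if b != []]
    return [item for item in order if isinstance(item, int)]


# Hypothetical noise-free rewards, used only to keep the example deterministic.
means = [0.2, 0.9, 0.5]
print(rank_arms(3, lambda i: means[i], delta=0.1))  # [1, 2, 0]
```

With noisy rewards the same code applies unchanged; the returned ordering is then correct only on the event that all confidence intervals hold.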

We introduce the following quantity, measuring the suboptimality gaps between arms that are adjacent in the correct ordering,

where .

We are now ready to state and prove the main result of this section.

###### Theorem 4.

If Algorithm 2 is run with parameter on a -armed stochastic bandit problem, the correct ordering of the arms is returned with probability at least after a number of pulls of order

(7) |

Note that, up to logarithmic factors, the bound stated in Theorem 4 is of the same order as the number of samples used by an ideal procedure that knows and uses the optimal order of samples to determine the position of each arm in the correct ordering.

###### Proof.

The proof is an adaptation of ActionElimination. Using Chernoff-Hoeffding bounds, the choice of ensures that

(8) |

If an action is eliminated after sampling rounds, then it must be that for all and all . Condition (8) then ensures that, with probability at least , for all such and . This implies that the current ordering of for is correct with respect to . Since , every action is eventually eliminated. Therefore, with probability at least the sequence of eliminated arms and their corresponding sets provide the correct arm ordering.

We now proceed to bounding the number of samples. Under condition (8), for all ,

Therefore, if , then . Recalling the definition (6) of and solving for we get

Thus, after sampling rounds, with probability at least . Similarly, after sampling rounds, with probability at least .

This further implies that after many sampling rounds, action is eliminated and not sampled any more.

Re-define the indexing so that . Hence by definition. We now compute a bound on the overall number of pulls based on our bound on the number of sampling rounds. With probability at least , we have that: pulls are needed to eliminate arm , pulls are needed to eliminate arm , and so on. Hence, with probability at least the total number of pulls needed to eliminate all arms is

where we set conventionally . This concludes the proof of the theorem. ∎

In order to apply BanditRanker to an instance of B2DEP, we assume that an upper bound is available in advance to the algorithm. This ensures that for all . In each sampling round , we partition the active arms into groups of size and make pulls for each group by cycling twice over the arms in an arbitrary order. Then, the first pulls in each group are discarded, while the last pulls are used to estimate the expectations (when does not divide we can pad the last group with arms that were already removed, or arms from previous groups, just for the purpose of calibrating). The sample size bound (7) remains of the same order (because the extra pulls only add a factor of two).
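The two-pass calibration of a sampling round can be sketched as follows. The function names, the `fillers` argument used for padding, and the noise-free environment are hypothetical. The sketch assumes that the group size exceeds every delay parameter and that the baseline of an arm is discounted only while its delay is positive and at most $d_i$.

```python
def calibrated_round(active, group_size, pull, fillers=()):
    """One sampling round: cycle twice over each group and keep the second pass.

    active: list of arm indices still being sampled.
    group_size: assumed to exceed every delay parameter, so that an arm pulled
        again after a full cycle over its group is back at its baseline.
    pull(i): pulls arm i and returns the observed reward.
    fillers: arms used only to pad a short group (their rewards are ignored).
    """
    samples = {}
    for start in range(0, len(active), group_size):
        group = list(active[start:start + group_size])
        pad = [a for a in fillers if a not in group][:group_size - len(group)]
        cycle = group + pad
        for arm in cycle:  # first pass: rewards discarded (calibration only)
            pull(arm)
        for arm in cycle:  # second pass: each arm has delay len(cycle)
            reward = pull(arm)
            if arm in group:
                samples[arm] = reward
    return samples


def make_env(mu, d, penalty):
    """Noise-free environment (hypothetical) returning expected payoffs."""
    state = {"t": 0, "last": {}}

    def pull(i):
        tau = state["t"] - state["last"][i] if i in state["last"] else 0
        state["last"][i] = state["t"]
        state["t"] += 1
        return mu[i] * (1 - penalty) if 0 < tau <= d[i] else mu[i]

    return pull


mu, d = [0.9, 0.7, 0.4], [2, 2, 2]
pull = make_env(mu, d, penalty=0.5)
pull(0); pull(0)  # arm 0 was just pulled, so its next reward would be discounted
samples = calibrated_round([0, 1, 2], group_size=3, pull=pull)
print(samples)  # {0: 0.9, 1: 0.7, 2: 0.4}
```

Even though arm 0 is discounted during the first pass, the retained second-pass samples all sit at the baseline values, at the price of doubling the number of pulls.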

## 8 Experiments

In this section we present an empirical evaluation of our policy in a synthetic environment with Bernoulli rewards. In order to study the impact of the switching cost on ranking policies when the suboptimality gap is small, we also define a setting in which there are two distinct ranking policies that are both optimal (see Figure 1).

We plot regrets against the policy . Our policy is run without any specific tuning (other than the knowledge of the horizon ) and with set to in all experiments. The benchmark consists of running UCB1 —with the same scaling factor as in the original article by UCB1— over the class of ranking policies, where calibration is addressed by rolling out twice each ranking policy selected by UCB1 and using only the second roll-out to compute reward estimates. Since both and are run over , we implicitly assume that BanditRanker successfully ranked the arms in a preliminary stage.

Figure 2 shows that when the gap between the best and the second best ranking policy is not too small ( on average in these experiments), then UCB1 is competitive against Algorithm 1 even in the presence of unit switching costs. This happens because, in order to minimize the number of switches, Algorithm 1 samples a suboptimal policy more frequently than UCB1. Although this oversampling does not affect the distribution-free regret bound of Algorithm 1, it hurts performance unless the suboptimality gap is small enough to cause the switching costs to prevail, a case which is addressed next. Note also that Algorithm 1 eventually stops exploration because all policies but one have been eliminated, while UCB1 keeps on exploring, albeit at a logarithmic rate.

In the second experiment we consider two arms with , , , , and chosen so as to simulate a vanishing suboptimality gap between and . Figure 3 (upper part) shows that Algorithm 1 performs better than UCB1 due to its low switch regime. On the other hand, Figure 3 (lower part) shows that when the switching cost is zero, switching between two good policies becomes more advantageous than using a single good policy, and the regret of both Algorithm 1 and UCB1 becomes negative (in this case UCB1, which has no control over the number of switches, outperforms Algorithm 1). The reason for this advantage is explained by Fact 1 below (proof in the supplementary material); see also Figure 1.

###### Fact 1.

If an instance of B2DEP admits two optimal ranking policies, then consistently switching between these two policies achieves an average expected reward higher than sticking to either one.

To summarize, the experiments confirm that, in the presence of switching costs, Algorithm 1 works better than UCB1 only when the suboptimality gap is very small. The advantage of Algorithm 1 over UCB1 is however reduced by the fact that switching between two good policies is better than consistently playing either one of the two (Fact 1). Note also that Algorithm 1 stops exploring because the horizon $T$ is known. This preliminary knowledge can be dispensed with using a doubling trick, or some more sophisticated method. Also, it would be interesting to design a method that achieves the best between the performance of Algorithm 1 and UCB1, according to the size of the suboptimality gap.

## References

Supplementary Material for Bandits with Delay-Dependent Payoffs

## 1 Proof of Theorem 1

###### Proof.

Given an instance of PMSP, we construct a B2DEP instance with arms such that and for all , , and . The long-term average reward for a periodic policy in this setting is

where is the number of times the policy plays arm in a period and is the number of time steps between when arm was played for the -th time in the cycle and the last time it was played (in the same cycle or in the previous cycle, excluding the transient). Clearly, if the PMSP instance has a feasible schedule, then we can design a bandit policy that replicates that schedule (playing arm at all time steps where no machines are scheduled for maintenance). The long-term average reward of this policy is at most . Moreover, if we have a periodic bandit policy with long-term average reward exactly equal to , this means that each arm is eventually played after exactly rounds. Indeed, the only way to have

is by setting for all . ∎

## 2 Proof of Fact 1

###### Proof.

We use the following notation: , where , stands for . Consider two optimal ranking policies and with . Then , where and similarly for . The expected total reward of playing after is , and the expected total reward of playing after is . We want to prove

Rearranging gives . Since , we have

Observing that , the above is equivalent to

which is always true since in our model expected rewards are non-decreasing with delays. ∎