Trend Detection based Regret Minimization for Bandit Problems
We study a variation of the classical multi-armed bandit problem. In this problem, the learner has to make a sequence of decisions, picking from a fixed set of choices. In each round, she receives as feedback only the loss incurred from the chosen action. Conventionally, this problem has been studied when the losses of the actions are drawn from an unknown distribution or when they are adversarial. In this paper, we study the problem when the losses of the actions also satisfy certain structural properties and, in particular, exhibit a trend structure. When this is true, we show that using trend detection, we can achieve regret of order with respect to a switching strategy for the version of the problem where a single action is chosen in each round and when actions are chosen each round. This guarantee is a significant improvement over the conventional benchmark. Our approach can, as a framework, be applied in combination with various well-known bandit algorithms, like Exp3. For both versions of the problem, we also give regret guarantees for the anytime setting, i.e. when the length of the choice sequence is not known in advance. Finally, we pinpoint the advantages of our method by comparing it to some other well-known strategies.
Consider the following problem: Suppose you own an apparel store and have purchased a fixed number of ad slots on some website, say Facebook. Every time someone visits the website, you can choose a set of ad impressions to display. Let’s assume that an ad here consists of an image of a clothing item and that each image is associated with a click-through rate unknown to you. Your goal is to choose images to display such that the cumulative click-through rate is maximized. How would you choose these images? This problem comes under the domain of reinforcement learning and, more specifically, multi-armed bandit learning. Contrary to supervised learning (and most of the current research in statistical pattern recognition and artificial neural networks), multi-armed bandit learning is characterized by the interaction between an agent and an uncertain environment. Such a learning algorithm makes its next move based on the history of its past decisions and their outcomes.
More specifically, a multi-armed bandit problem is a sequential learning problem where the learner chooses an action from a set of actions in every round. Associated with each action is a loss unknown to the learner (the case with rewards is symmetric). The goal of the learner is to minimize the loss incurred. The performance of the learning algorithm is measured by regret, compared to a certain benchmark strategy. Conventionally, in multi-armed bandit problems, the benchmark strategy is to always choose the single best action in hindsight, i.e. an action with minimum cumulative loss. This problem has been thoroughly studied in a variety of settings [5, 4, 2, 16]. A distinguishing feature of such problems is the inherent exploration-exploitation trade-off. When the losses are generated from a fixed but unknown distribution, there exist algorithms [4, 16, 14] that can achieve regret guarantee of . On the other hand, when losses for the actions are generated under no statistical assumption, or alternately when losses are generated by an adversary, best possible regret guarantee that can be achieved is . Recently, interest has been developing [15, 9] in the question of achieving non-trivial regret guarantees when the loss model is semi-structured. Intuitively, more structure in the losses should enable more exploitation and hence allow for better regret guarantees. Along the lines of some of the recent work , we also define models exhibiting a certain degree of structure.
Often, real-world problems do not exhibit adversarial behaviour, and in many cases the losses of different actions follow a trend structure, i.e. one action is consistently better than the others in a certain interval. For such more specialized models, the standard techniques prove insufficient since they do not take advantage of these properties. In this paper, we address this deficiency using the paradigm of trend detection. Broadly, we propose a strategy that keeps track of the current trend and restarts the regret minimization algorithm whenever a trend change is detected. This allows us to give regret guarantees with respect to a strategy that chooses the best action in each trend. This is a significantly stronger benchmark than the one conventionally considered. The regret with respect to this benchmark is also called switching regret.
More importantly, our proposed strategy is not specific to a particular regret minimization algorithm unlike the approaches in some recent works. In this paper, we use Exp3 as the underlying regret minimizing algorithm for its simplicity and almost optimal regret guarantee . However, one can use any other algorithm and analyze it in a similar way. Because of this modular structure of the algorithm, we can extend the arguments and proofs for the conventional multi-armed bandits problem to a more general setting where instead of a single action, the learner chooses multiple actions in each round . This problem has been studied in stochastic  and adversarial  setting, but to the best of our knowledge, there are no prior works giving a switching regret guarantee for it.
One of the primary motivations for studying these bandit problems comes from the domain of recommender systems. Many web tasks such as ad serving and recommendations in e-commerce systems can be modeled as bandit problems. In these problems, the system only gets feedback for the actions chosen, for example whether the user selects the recommended items or not. Notice that these systems may recommend one or more items in each round. The motivation for using the paradigm of trend detection comes from the general observation that in many cases, the performance of actions follows a trend structure. In the above-mentioned case of an apparel store, for example, swimsuits might be the best choice during the hottest weeks of the year, or for certain time periods, it might be best to show an item a famous celebrity was recently seen wearing.
Summary of Contribution: For the standard -armed bandit problem, we propose a new algorithm called Exp3.T. This algorithm guarantees switching regret of where is the number of trend changes and is not known to the learner. indicates the degree of structure in loss model. This guarantee also holds for the anytime setting, i.e. when the duration of the run, T, is not known in advance. We extend the analysis of this problem to the case when, instead of a single action, the learner chooses a basis of a uniform matroid in each round. The underlying regret minimization algorithm used in this case is OSMD . The resulting algorithm achieves switching regret of . Finally, we provide empirical evidence for this algorithm’s performance in the standard multi-armed bandit setting.
In general, our algorithm is particularly effective, i.e. gives better regret guarantees, when little is known about the loss structure of actions except that the changes in the best action are not too frequent and actions are likely to be well-distinguishable. We argue that our loss models are more general and reasonable compared to the models conventionally studied: In most real-world cases, we would expect to see a mixture of purely stochastic and purely adversarial data. We show that even such a mixture of models allows us to give tight regret guarantees as long as the structural assumptions still hold.
II Previous Work
The problem of giving regret guarantees with respect to a switching strategy has been considered previously in several works (albeit in more restricted settings), all of which consider the case when the learner chooses exactly one action in each round. Auer et al. proposed Exp3.S [5] along the same lines as Exp3 by choosing an appropriate regularization factor for the forecaster. This enables the algorithm to quickly shift focus to better-performing actions. For the abruptly changing stochastic model, Discounted-UCB and SW-UCB [12, 8] have been proposed along the lines of UCB. In the former algorithm, a switching regret bound is achieved by progressively giving less importance to old losses, while in SW-UCB the authors achieve the same by considering a fixed-size sliding window. Both these algorithms achieve a regret bound of , where is the number of times the distribution changes.
Our work is closest to the algorithm Exp3.R proposed by Féraud et al. [1]. They also follow a paradigm very similar to trend detection, and the high-level ideas used in their paper are similar to ours. However, their algorithm is specific to Exp3 and applies only to the version of the bandit problem where one chooses a single action in each round. Further, the algorithm assumes a certain gap in the performance of actions that depends on knowledge of the run time of the algorithm. This makes it inapplicable to a number of real-world scenarios.
The trend detection idea used in our algorithm is similar to the change detection problem studied in statistical analysis. Similar ideas have also been used for the detection of concept drift in online classification [11, 7]. Common applications include fraud detection, weather prediction, and advertising. In this context, the statistical properties of the target variable change over time, and the system tries to detect this change and learn the new parameters.
III Problem Setting
We consider a multi-armed bandit problem with losses for distinct actions. Let the set of these actions be denoted by . The losses of these actions can be represented by a sequence of loss vectors where . The loss sequence is divided into trends. A trend is defined as a sequence of rounds where a set of actions is significantly better than others for the duration of this trend. We say that the trend has changed when this set of actions changes. Within each trend the losses of actions in set are “separated” from all others by a certain gap. In particular, we consider a finer characterization of loss models than just stochastic or adversarial within a trend. Similar to the loss model introduced by Seldin et al. [15], we focus on models exhibiting a “gap” in losses. Although this model is weaker than the adversarial model, it still covers a large class of possible loss models. We express the gap in our loss models by an abstract term , the separation parameter. Although the exact definition of this parameter changes depending on the actual model, in each case it conveys the same idea that a larger value of this parameter implies a larger gap between losses of actions in set and every other action.
Dynamic Stochastic Regime (DSR): For the stochastic loss model, the loss of each action at round is drawn from an unknown distribution with mean . Let and be any actions in sets and respectively. Then for all rounds in trend , and the separation parameter is defined as:
The loss model is stochastic with separation parameter , when . The identity of best action changes times.
Adversarial Regime with Gap (ARG): We use a modified version of the loss model introduced in [15]. Within each trend , there exists a set of actions which is the best set for any interval of (sufficiently large) constant size, . More precisely, let be the cumulative loss of an action in interval consisting of rounds. Then for any action and we define the separation parameter for trend as:
It is the smallest average gap between any sub-optimal action and any action in set for any interval of size . As in the above model, we say that a model satisfies ARG property with separation parameter when .
Notice that in the first trend, spanning from the first round until some round , each action satisfies the gap conditions defined above for all the constituent rounds (DSR) or intervals of size (ARG), for the respective setting. We define to be the last such round, i.e. these conditions are violated at round , indicating the start of a new trend.
We study two variants of this problem. In the first variant, the algorithm chooses exactly one action every round while in the other, the algorithm can choose any set of actions. For both variants, the algorithm observes losses only of the actions chosen (or the single action chosen for the former variant). We assume the presence of an oblivious adversary which decides on the exact loss sequences before the start of the game. The sequence is of course not known to the algorithm. We also make the standard assumption that losses are bounded in the [0, 1] interval.
For the problem setting as described, our goal is to design an algorithm to minimize the cumulative loss incurred in the rounds that the game is played. For the case when the algorithm chooses exactly one action every round, its performance is measured with respect to a strategy that chooses the best action in each trend. Specifically, let denote the action chosen by the algorithm in round and let denote the corresponding loss incurred by this action. Then the cumulative loss incurred by the algorithm is:
Let be the best action in trend , then the loss incurred by the switching strategy described above is:
where trend occurs in the interval . We define regret incurred by algorithm as follows:
Exactly analogous definitions apply to the case when the algorithm chooses multiple actions in each round.
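To make the benchmark concrete, the following is a minimal sketch of how switching regret could be computed offline for the single-action variant. The function name `switching_regret`, the half-open `(start, end)` encoding of trends, and the choice of the best action per trend as the one with minimum cumulative loss inside that trend are assumptions made for illustration; they are not taken from the paper.

```python
def switching_regret(losses, chosen, trends):
    """Loss of the played sequence minus the loss of the switching benchmark.

    losses[t][a] is the loss of action a in round t, chosen[t] is the action
    played in round t, and trends is a list of half-open (start, end) round
    ranges (an illustrative encoding, not the paper's notation).
    """
    n_actions = len(losses[0])
    algorithm_loss = sum(losses[t][a] for t, a in enumerate(chosen))
    benchmark_loss = 0
    for start, end in trends:
        # Assumption: the benchmark plays the action with the smallest
        # cumulative loss inside each trend.
        per_action = [sum(losses[t][a] for t in range(start, end))
                      for a in range(n_actions)]
        benchmark_loss += min(per_action)
    return algorithm_loss - benchmark_loss


losses = [[0, 1], [0, 1], [1, 0], [1, 0]]
chosen = [0, 1, 1, 1]
print(switching_regret(losses, chosen, [(0, 2), (2, 4)]))  # 1
print(switching_regret(losses, chosen, [(0, 4)]))          # -1
```

In this toy instance the same play sequence has regret 1 against the switching benchmark but regret -1 against the single best action in hindsight, which illustrates why the switching benchmark is the stronger of the two.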
Assumption: For the algorithm considered in this paper, we assume that the loss model, either stochastic or adversarial regime with gap, has separation parameter lower bounded by , a constant known to us i.e. .
IV The Algorithm
The algorithm Exp3.T is composed of two primary ideas: the Exp3 algorithm and a trend detection routine. Exp3 gives an almost optimal regret bound with respect to the single best action in hindsight when the loss model is adversarial. However, when the losses exhibit certain structure or when regret with respect to a stronger benchmark is desired, Exp3 proves to be insufficient. In this algorithm, we overcome this problem by identifying trends in losses and resetting the Exp3 algorithm whenever a change in trend is detected. One advantage of using Exp3 when losses exhibit trend structure is that Exp3 is robust to changes in the losses of actions as long as the best action remains the same. We exploit this property in our algorithm so that it is applicable to a large class of loss models. In the analysis, we use the following regret bound for Exp3.
For any non-increasing sequence , the regret of Exp3 algorithm with actions satisfies
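For reference, the base learner can be sketched as follows. This is a minimal loss-based Exp3 with a fixed learning rate, whereas the lemma above allows a non-increasing sequence of learning rates. The class name `Exp3`, its method names, and the `reset` hook (which a restart-on-trend-change wrapper would call) are hypothetical choices for this sketch, not the paper's pseudocode.

```python
import math
import random


class Exp3:
    """Illustrative Exp3 for losses in [0, 1] with a fixed learning rate eta."""

    def __init__(self, n_actions, eta, rng=None):
        self.n = n_actions
        self.eta = eta
        self.rng = rng or random.Random()
        # Importance-weighted cumulative loss estimate of every action.
        self.cum_est = [0.0] * n_actions

    def probabilities(self):
        # Shift by the minimum estimate so that every exponent is <= 0.
        base = min(self.cum_est)
        weights = [math.exp(-self.eta * (c - base)) for c in self.cum_est]
        total = sum(weights)
        return [w / total for w in weights]

    def choose(self):
        # Sample an action and return it with its sampling probability.
        probs = self.probabilities()
        action = self.rng.choices(range(self.n), weights=probs)[0]
        return action, probs[action]

    def update(self, action, prob, loss):
        # Only the played action is observed; dividing by its probability
        # keeps the loss estimate unbiased.
        self.cum_est[action] += loss / prob

    def reset(self):
        # Hypothetical hook: forget all estimates, e.g. after a trend change.
        self.cum_est = [0.0] * self.n
```

A wrapper in the spirit of Exp3.T would call `reset` whenever the detection routine reports a change, so that estimates from an old trend do not slow down adaptation to the new one.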
Algorithm 1 shows the skeleton of the procedure to achieve the desired switching regret bound. At a high level, the algorithm divides the total run into runs on smaller intervals. Within each interval the algorithm runs Exp3 (parameter ) with loss monitoring (LM) plays randomly interspersed among all rounds. The length of this interval is controlled by parameter . These loss monitoring plays choose different actions for a fixed number of rounds without regard to regret. The loss values collected from this process are used to estimate the mean loss of each action in a given interval. The number of such plays required to give a good estimate of the loss depends on the actual model under consideration and is captured by the parameter . Based on this estimate, the trend detection module outputs with probability at least whether the best action has changed or not, or equivalently whether the trend has changed or not.
The procedure assigns Exp3 plays and fixed action plays to monitor loss (exactly many per action) randomly to rounds at the start of an interval and returns the randomly generated schedule. The random generation of schedule protects the algorithm from making biased estimates of actual losses.
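The scheduling step described above can be sketched as follows. The function name `make_schedule`, its parameters, and the `EXP3_PLAY` marker are assumptions introduced for this sketch; the paper's own procedure may differ in detail.

```python
import random

EXP3_PLAY = None  # hypothetical marker for a round handled by Exp3


def make_schedule(interval_len, n_actions, samples_per_action, rng=None):
    """Randomly place loss monitoring plays among the rounds of one interval.

    Entry t of the returned list is either EXP3_PLAY or the index of the
    action that must be played in round t to monitor its loss.
    """
    rng = rng or random.Random()
    n_monitor = n_actions * samples_per_action
    if n_monitor > interval_len:
        raise ValueError("interval too short for the requested monitoring plays")
    schedule = [EXP3_PLAY] * (interval_len - n_monitor)
    for action in range(n_actions):
        schedule.extend([action] * samples_per_action)
    # A uniformly random permutation makes the monitored rounds of every
    # action a random sample without replacement from the interval.
    rng.shuffle(schedule)
    return schedule
```

Because the permutation is uniform, the rounds in which a given action is monitored form a sample without replacement from the interval, which is the setting in which the concentration bound used in the analysis applies.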
In any interval, the loss monitoring component of Algorithm 1 chooses each action a sufficient number of times, and these choices are randomly distributed over the interval. The samples obtained from these plays are used to give a bound on the deviation of the empirical mean of losses from the true mean. In particular, we use the following lemma by Hoeffding [10] for sampling without replacement from a finite population.
Let be a finite population of real points, denote random sample without replacement from . Then, for all ,
where is the mean of .
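As a numerical illustration, the sketch below assumes the common two-sided form of Hoeffding's bound for values in a range of width R, namely that a sample mean of n points deviates from the population mean by at least eps with probability at most 2 exp(-2 n eps^2 / R^2). Hoeffding showed that this bound also holds when sampling without replacement. Whether the lemma above is stated in exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def hoeffding_tail(n, eps, value_range=1.0):
    """Two-sided Hoeffding bound on P(|sample mean - true mean| >= eps)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / value_range ** 2)


def samples_needed(eps, delta, value_range=1.0):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


n = samples_needed(eps=0.1, delta=0.05)
print(n)  # 185
print(hoeffding_tail(n, 0.1) <= 0.05)  # True
```

The quadratic dependence on 1/eps is what makes the number of monitoring plays grow quickly as the gap between actions shrinks.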
For each interval we maintain information about the empirical mean of losses for each action, i.e. mean over loss values actually seen by the algorithm. By Lemma 2, all of these estimates are close to the actual mean with probability at least where is a parameter of the algorithm. In case of change in trend within an interval , naturally these guarantees are void as the losses do not maintain a uniform pattern. Therefore, a change in trend can be detected by comparing the empirical estimates obtained at the end of the next interval to those obtained prior to the trend change. This idea is represented in Algorithm 2.
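The comparison of empirical estimates across intervals can be sketched as below. This is only an illustration of the idea, not a reproduction of Algorithm 2: the test used here (the previously best action now trails the current best empirical mean by more than a threshold) and the names `empirical_means`, `trend_changed` and `margin` are assumptions.

```python
def empirical_means(samples_per_action):
    """Mean of the monitored losses of every action in one interval."""
    return [sum(s) / len(s) for s in samples_per_action]


def trend_changed(prev_means, curr_means, margin=0.0):
    """Report a change if the previously best action is no longer the best.

    margin is a hypothetical tolerance; one would expect it to be tied to
    the known lower bound on the separation parameter.
    """
    prev_best = min(range(len(prev_means)), key=lambda a: prev_means[a])
    return curr_means[prev_best] - min(curr_means) > margin


prev = empirical_means([[0.1, 0.3], [0.6, 0.8], [0.9, 0.7]])
same = [0.25, 0.70, 0.75]
moved = [0.80, 0.20, 0.70]
print(trend_changed(prev, same, margin=0.1))   # False
print(trend_changed(prev, moved, margin=0.1))  # True
```

A positive margin trades detection delay for robustness: small fluctuations of the estimates inside a trend do not trigger a restart, while a genuine change of the best action does.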
V Regret Analysis
For ease of notation in the analysis, we define the detector complexity, , as the number of loss monitoring samples required for each action so that the trend detection procedure works with probability at least , provided there is no trend change in the actual interval. In what follows, we give detector complexity bounds for different models and in regret computation use as an abstract parameter.
The detector complexity in dynamic stochastic regime satisfies
Fix an action and an interval . Let the expected loss of action on interval be given by the sequence and the actual realization of losses be given by . First we observe that the expected loss of over the interval is given by
Let the set of loss monitoring samples collected by our algorithm for action be denoted by . The algorithm uses these samples to calculate the empirical mean of losses for the action . We denote it by .
Step 1: First we show that the empirical mean of losses over the entire interval is close to the expected mean, . Let be the sequence of actual loss realizations for action in interval . Denote by the mean of these actual realizations. Applying Hoeffding’s inequality,
i.e. the empirical mean of losses for action over the interval is close to the actual mean with probability at least .
Step 2: Now we show that the empirical mean of loss-monitoring samples collected for action is close to the mean of the actual realizations, . This follows from Lemma 2:
Therefore, with probability at least the mean of loss monitoring samples for any action is within of the actual mean. By applying a union bound over all actions, with probability at least the same guarantee holds over all actions, which in turn implies that the trend detection module can detect whether the best action has changed with the same probability. ∎
The detector complexity in the adversarial regime with gap satisfies
when the losses in the given trend are drawn from interval .
The proof of this lemma follows the same lines as that of Lemma 3, except that in this case we do not need Step 1. Further, in this case, we can allow the empirical mean of collected samples to be within of the actual mean of all losses in the interval instead of just . For this particular loss model, if additional information about the range of losses within a trend is available, then using the generalized version of Hoeffding’s inequality we achieve a tighter detector complexity bound. We note that, unless stated otherwise, our losses are always drawn from range .
In the rest of the analysis, instead of or we use the model-oblivious-parameter .
The expected regret of Exp3.T is
We divide the regret incurred by Exp3.T into three distinct components. The first is the regret incurred just by running and restarting Exp3. To bound this component of the total regret we use the regret bound of Lemma 1. Let denote the number of false trend detections, i.e. the number of times when there was no change in trend but the detection algorithm still indicated a change. Then the regret incurred due to Exp3 is
As trend detection fails with probability at most , the expected number of false detections is at most
The second component of the total regret incurred is on account of intervals wasted due to delay in detection of trend change. Specifically, if the trend changes in a given interval , the regret guarantee obtained as part of Exp3 is not with respect to the best action before and after trend change. As we cannot give the required guarantee for this interval, we count this interval as wasted and account it towards regret. Secondly, since the trend detection algorithm detects the change with probability at least , the expected number of trend detection calls required (or alternatively the expected number of intervals) is at most . Therefore, the total number of wasted rounds is at most
The third and final component of regret incurred is due to the loss monitoring plays in each interval. No guarantee can be given about the regret incurred in these rounds and hence all such rounds are also accounted in regret. Since in each interval there are exactly number of such plays, the total number of such rounds is at most
Putting all together, the total regret is
Setting , and , regret incurred by Exp3.T is
Alternatively, . ∎
Extension to Anytime Version
The parameters derived to achieve the desired regret bound in Theorem 5 depend on the knowledge of T, the length of the total run of the algorithm. This dependency can be circumvented by using a standard doubling trick. Particularly, we can divide the total time into periods of increasing size and run the original algorithm on each period. Since the guarantee of this algorithm rests crucially on the probability of correct trend detection, in our case we need to modify the parameter as well.
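The period structure produced by a doubling trick can be sketched as follows. The function `doubling_periods` and its `first_length` argument are hypothetical names; the sketch only shows how the run is partitioned, since the per-period parameter values used by the anytime algorithm are not reproduced here.

```python
def doubling_periods(total_rounds, first_length=1):
    """Lengths of consecutive periods when the period length doubles.

    total_rounds is unknown to the learner; it is passed in only to show
    where the final period gets cut off.
    """
    periods = []
    length, played = first_length, 0
    while played < total_rounds:
        current = min(length, total_rounds - played)
        periods.append(current)
        played += current
        length *= 2
    return periods


print(doubling_periods(7))          # [1, 2, 4]
print(len(doubling_periods(1000)))  # 10
```

The number of periods grows only logarithmically in the total number of rounds, which is why restarting the algorithm at the start of each period costs little in the final bound.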
The expected regret of Anytime Exp3.T with , and is .
We follow the same steps as in the proof of Theorem 5. We divide the regret incurred into three different components: regret due to the Exp3 algorithm, due to the wasted intervals during detection, and due to the loss monitoring plays. Compared to the proof of Theorem 5, the only difference is that here we have to sum the regret of Exp3.T over multiple runs. If is the actual length of play, then the number of times we run Exp3.T is at most . Regret due to the Exp3 algorithm (running and restarting) is:
where and are the number of changes in trend and number of false detections in th run of Exp3.T respectively. As before,
Using this bound in above inequality
The inequalities follow by using parameters and as defined in the algorithm. For ease of representation, we capture all constants with a single constant . Regret incurred due to wasted intervals is
Here we use the fact that , the detector complexity had we known a priori. All the constants involved in the above inequality are captured by . Similarly, regret due to loss monitoring plays is:
where the constant captures the constants involved. Combining the above mentioned bounds we get the desired claim. This bound is only a constant factor worse than the bound proved in Theorem 5.
It is easy to verify that the above analysis holds if is of the order of and this condition is met when is of order at least . If, however, is not a good estimate of in the above sense, the output of trend detection procedure in initial runs will not be correct with sufficiently high probability and hence aforementioned guarantees do not hold. We account for the regret incurred in the first few runs (till ) by simply disregarding all of them and consider them as wasted rounds. ∎
The principle of trend detection and restarting of a base algorithm (Exp3 in our context) according to changes in the trend can be extended to any multi-armed bandit algorithm for adversarial setting. The final regret guarantee obtained naturally depends on the performance of the base algorithm. We notice however that due to the necessary number of exploration rounds, no base algorithm can allow us to achieve regret . In particular, by choosing an appropriate base algorithm, our framework can be adjusted to a number of different loss structures and problem settings. In the following section, we use exactly this principle to design an algorithm to minimize regret with respect to the best actions.
VI Extension to Top- Actions
In this section, we show how to extend the ideas introduced above to a setting where in each round we choose actions out of the available. For this variant of the problem, the Exp3 algorithm cannot be used and hence we use a more general approach proposed by Audibert et al. [3]. This approach, named Online Stochastic Mirror Descent (OSMD), is based on a powerful generalization of gradient descent for sequential decision problems. Similar to Exp3, the regret guarantee given by this technique is with respect to the best combination of actions in hindsight and holds even for adversarial losses. We refer the reader to  for a thorough treatment of the technique. In our proposed algorithm, OSMD.T, we use the technique as a black box and only need the final guarantee.
The regret of OSMD algorithm in the -set setting with and learning rate satisfies
Here is a Legendre function and is a parameter used within the OSMD technique. The trend detection algorithm in this case uses the same idea as in Algorithm 2 except that instead of a single action we now check if the set of best actions have changed with probability at least . Even in this case, we denote by the number of samples needed for each action to ensure that trend detection works with above mentioned probability. Bounds derived in Lemma 3 and Lemma 4 apply in this case too.
There are only a few differences in Algorithm 4 as compared to Algorithm 1. Firstly, instead of using Exp3 for regret minimization we use the more sophisticated technique of OSMD. This algorithm gives tight regret guarantees and is polynomial-time computable. (The OSMD technique can also be used when there are more generic combinatorial constraints on the set of actions chosen each round. For these generic cases, the algorithm need not be polynomial-time computable. However, for the uniform matroid case under consideration here, it is in fact polynomial-time computable .) Secondly, the trend detection algorithm changes slightly as mentioned above. Finally, since we choose actions in every round, we need a factor of fewer loss monitoring plays. Alternately, the size of an interval is chosen to be .
The expected regret of OSMD.T is
The main steps of analysis in this setting are exactly the same as Theorem 5. The component of regret due to OSMD algorithm is
where is the number of false detections as before and given by . This inequality follows by Lemma 7 and considering the fact that the algorithm is restarted at most times. Following the same arguments as in Theorem 5, the regret incurred on account of wasted intervals is at most:
Unlike Theorem 5, each wasted round incurs regret of instead of since we cannot guarantee regret for any of the chosen actions. Finally, since both the number of loss monitoring plays and the length of an interval are reduced by a factor of , the regret incurred on account of loss monitoring plays is:
Putting the above bounds together,
By setting , and we get
Since our proposed algorithm comes under the domain of active learning, it is not possible to reliably use any fixed data set. Instead, to assess the performance of our algorithm we shall use artificially constructed loss generation models; a standard approach for problems of this nature.
For each of the two models introduced, we compare the performance of Exp3.T algorithm with Exp3.R, an algorithm closest in spirit to our work. To emphasize that we obtain switching regret guarantee, a stronger benchmark than conventionally used, we also compare our algorithm with Exp3 i.e. the performance, measured in terms of the cumulative loss, is with respect to a switching strategy that chooses the best action in each trend. Each experiment is run independently 10 times and the mean of the results is shown in figures.
Experiment 1: DSR model Within each trend, we set the bias of the best action to and biases of other actions for the case when is set to while for the case when , they are set to . For each of the loss models, we run the experiment with and actions respectively. We have constructed the dynamic stochastic loss model in our experiments as a representative of a worst case scenario i.e. we do not assume any information about the loss structure except for the separation parameter (refer Fig. 1). The performance of Exp3.T is almost identical to Exp3.R, an algorithm specifically designed for stochastic model. For a smaller gap, however, our algorithm still manages to do marginally better than Exp3.R. We note here that the parameters of Exp3.R algorithm are set such that the assumptions required for the algorithm hold.
Experiment 2: ARG model We design the semi-structured property of ARG model as follows: For case, within each trend the loss of best action is a sequence of 100 consecutive 0s followed by 100 consecutive 1s. In the same rounds, losses of sub-optimal actions are 1 and 0.6 respectively. For case, losses of the best action are same as before but losses of sub-optimal actions are kept constant at 0.9. These loss structures are chosen as representatives of the possible instances of the ARG model. The advantage of our algorithm is clearly highlighted in this more general model. The worse performance of Exp3.R is expected since it assumes more structure than provided by the model; Exp3.T in contrast is able to exploit the little structure available and detect changes much faster.
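The loss patterns described in this experiment can be reproduced with the short sketch below, which also computes the smallest average gap over all windows of a given size. The function names and the window sizes tried are choices made for this illustration; the gap values are computed from the loss pattern itself and are not quoted from the paper.

```python
def arg_losses(n_rounds, subopt_first, subopt_second):
    """Best action: 100 zeros then 100 ones, repeated. A sub-optimal action
    has loss subopt_first in the first 100 rounds of each block and
    subopt_second in the next 100."""
    best, sub = [], []
    for t in range(n_rounds):
        first_half = (t % 200) < 100
        best.append(0.0 if first_half else 1.0)
        sub.append(subopt_first if first_half else subopt_second)
    return best, sub


def min_average_gap(best, sub, window):
    """Smallest average of (sub - best) over all windows of the given size."""
    gaps = [s - b for s, b in zip(sub, best)]
    current = sum(gaps[:window])
    smallest = current
    for t in range(window, len(gaps)):
        current += gaps[t] - gaps[t - window]  # slide the window by one round
        smallest = min(smallest, current)
    return smallest / window


best, sub = arg_losses(1000, 1.0, 0.6)
print(round(min_average_gap(best, sub, 200), 6))  # 0.3
print(round(min_average_gap(best, sub, 100), 6))  # -0.4
best, sub = arg_losses(1000, 0.9, 0.9)
print(round(min_average_gap(best, sub, 200), 6))  # 0.4
```

With windows of 200 rounds the best action is ahead by a fixed average margin in both patterns, whereas a window of 100 rounds can fall entirely inside the stretch where the best action incurs loss 1. This illustrates why the ARG definition requires the interval size to be sufficiently large.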
There exists a subtle case when the guarantees presented in this paper do not hold. This happens when the length of the interval is comparable to the total run time of algorithm i.e. . For example, if the length of interval is , then Exp3.T does not provide any switching regret guarantee since for the first two intervals Exp3.T behaves exactly like Exp3. Therefore in worst case, the regret bounds presented here are void but the bounds of Exp3 still apply.
We have proposed a new paradigm for regret minimization and defined a broader class of loss models where our algorithm is applicable. We have used this paradigm for the regret minimization problem when one chooses either a single action or a basis of a uniform matroid in each round. For these problems we proposed algorithms and gave switching regret bounds of and respectively. Such a paradigm is particularly suitable for regret minimization algorithms where one cannot distinguish exploration and exploitation steps, for example OSMD. Extension of this paradigm to more general problems like online linear optimization is currently in progress.
[1] Robin Allesiardo and Raphaël Féraud. Exp3 with drift detection for the switching bandit problem. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–7. IEEE, 2015.
[2] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
[3] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
[4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[5] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[6] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, volume 5. 2012.
[7] Joao Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection. In Brazilian Symposium on Artificial Intelligence, pages 286–295. Springer, 2004.
[8] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.
[9] Elad Hazan and Satyen Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12(Apr):1287–1311, 2011.
[10] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[11] T Ryan Hoens, Robi Polikar, and Nitesh V Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89–101, 2012.
[12] Levente Kocsis and Csaba Szepesvári. Discounted UCB. In 2nd PASCAL Challenges Workshop, pages 784–791, 2006.
[13] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.
[14] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
[15] Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295, 2014.
[16] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[17] Taishi Uchiya, Atsuyoshi Nakamura, and Mineichi Kudo. Algorithms for adversarial bandit problems with multiple plays. In Algorithmic Learning Theory, pages 375–389. Springer, 2010.