Correlational Dueling Bandits with Application to Clinical Treatment in Large Decision Spaces
Abstract
We consider sequential decision making under uncertainty, where the goal is to optimize over a large decision space using noisy comparative feedback. This problem can be formulated as a $K$-armed dueling bandits problem, where $K$ is the total number of decisions. When $K$ is very large, existing dueling bandits algorithms suffer huge cumulative regret before converging on the optimal arm. This paper studies the dueling bandits problem with a large number of arms that exhibit a low-dimensional correlation structure. Our problem is motivated by a clinical decision-making process in a large decision space. We propose an efficient algorithm, CorrDuel, which optimizes the exploration/exploitation tradeoff in this large decision space of clinical treatments. More broadly, our approach can be applied to other sequential decision problems with large and structured decision spaces. We derive regret bounds, and evaluate performance in simulation experiments as well as on a live clinical trial of therapeutic spinal cord stimulation. To our knowledge, this marks the first time an online learning algorithm has been applied to spinal cord injury treatments. Our experimental results show the effectiveness and efficiency of our approach.
1 Introduction
In many online learning settings, particularly those that involve human feedback, reliable feedback is often limited to pairwise preferences instead of real-valued feedback. Examples include implicit or subjective feedback for information retrieval and recommender systems, such as clicks on search results, and subjective feedback on the quality of recommended care [Chapelle et al., 2012; Sui and Burdick, 2014]. This setup motivates the dueling bandits problem [Yue and Joachims, 2009], which formalizes the problem of online regret minimization via preference feedback (e.g., choosing a pair of arms to be compared at each time step). Many dueling bandits algorithms [Yue and Joachims, 2009; Yue and Joachims, 2011; Zoghi et al., 2014; Ailon et al., 2014; Komiyama et al., 2015; Wu and Liu, 2016] have been developed for efficiently solving this problem with independent arms. However, these algorithms are not efficient in situations involving a large number of dependent arms. Specifically, when the time horizon $T$ is smaller than the number of arms $K$, it is hopeless to achieve low regret without leveraging structure among arms.
Our problem is motivated by clinical research for recovering motor function after severe spinal cord injury. Previous research [Harkema et al., 2011] has shown that electrical stimulation applied to the spinal cord via electrode arrays implanted in the epidural space over the lumbosacral area enables paralyzed patients to achieve full weight-bearing standing, improvements in stepping, and partial recovery of lost autonomic functions. Stimulation consists of electrical pulse trains applied to selected electrodes. The challenge is that the optimal stimulus pattern (the choice of active electrodes and their polarities, the pulse amplitude and width, and the pulse train frequency) varies significantly across patients. Even for the same patient, the response to the same stimulus varies across trials. Hence, clinicians must determine the optimal stimulus for each patient under noisy conditions, which is currently a laborious and ad hoc process.
Figure 1 shows the clinical treatment procedure for stand training of paraplegics. During a treatment/optimization session, new stimuli are recommended by the algorithm to be applied to the electrode array implanted in the patient. The patient then attempts to stand using the given stimuli, and the observing clinicians compare the patient’s standing performance. The total number of different stimulating configurations is due to the complexity of electrodes, and so it is not feasible to search through the whole space. The goal is to develop an algorithm that can automatically select stimuli in order to quickly converge to good treatments.
Motivated by this application, we consider the problem of finding optimal stimuli based on the general setting of the multi-armed bandit problem. The classical bandit problem trades off between exploration and exploitation among a number of different arms, each having a quantifiable but stochastic reward with an initially unknown distribution. In contrast, for our clinical problem, the patient’s motor response to stimulation is hard to quantify. Neither video motion capture nor electromyographic (EMG) recordings of muscle activity can yet provide a consistent and satisfactory measure of motor skill under stimulation. One reasonably reliable measure is that of pairwise comparisons, e.g., whether one stimulus is more effective than another. While the patient’s performance under a specific stimulus is hard to quantify in the clinical setting, we can obtain comparisons of stimuli that are tested within the short time period of one training session.
Our Contributions.
In this paper, we show how to cast the problem of online learning of personalized clinical treatment as a dueling bandits problem with a correlated action space, which we call correlational dueling bandits. We present an algorithm that meets the demands of such clinical settings and can effectively model such correlation dependencies to achieve good performance. Our algorithm takes advantage of the correlations among different arms to update the whole active set of arms instead of only updating the two dueling arms. This approach achieves fast convergence to the (near) optimal decisions despite the large decision space. We deployed CorrDuel as the first algorithmic approach to the control of spinal cord stimulation in clinical experiments. We find that CorrDuel can identify a group of optimal stimuli and help paraplegic human patients to achieve full weight-bearing standing.
2 Related Work
2.1 Multi-Armed Bandits
The stochastic multi-armed bandits problem [Robbins, 1952] refers to an iterative decision making problem in which one repeatedly chooses among $K$ options, such as pulling one of $K$ arms of a bandit machine. In each round, we receive a reward that depends on the arm being selected. Without loss of generality, assume that every reward is bounded in $[0, 1]$. The goal then is to minimize the cumulative regret compared to the best arm.
Popular algorithms for the stochastic setting include UCB (upper confidence bound) algorithms [Auer et al., 2002a; Bubeck and Cesa-Bianchi, 2012] and Thompson Sampling [Chapelle and Li, 2011; Russo and Van Roy, 2014].
In the adversarial setting, the rewards are chosen in an adversarial fashion, rather than sampled independently from some underlying distribution. In this case, regret is rephrased as the difference in the sum of rewards. The predominant algorithm for the adversarial setting is EXP3 [Auer et al., 2002b].
2.2 Correlated Bandits
The set of candidate actions is very large (or even infinite) in many applications. When that is the case, one must exploit dependencies between payoffs of different decisions in order to arrive at an efficient algorithm.
In some applications, the underlying problem comes equipped with a correlational structure. Various methods of introducing dependence include bandits on trees [Kocsis and Szepesvári, 2006], bandits with linear correlations [Dani et al., 2008; Abernethy et al., 2008; Abbasi-Yadkori et al., 2011; Gentile et al., 2014] or Lipschitz continuous payoffs [Kleinberg et al., 2008; Bubeck et al., 2008], and Gaussian payoffs [Srinivas et al., 2010].
2.3 Dueling Bandits
The dueling bandits problem [Yue et al., 2012], a variant of the multi-armed bandits problem, takes (noisy) comparative feedback instead of real-valued feedback. It falls under the general framework of preference learning (learning with preferential feedback). The dueling bandits problem can also be viewed as a special case of partial monitoring problems [Cesa-Bianchi et al., 2006]. Its problem setting naturally fits many applications such as information retrieval and recommender systems. The stochastic dueling bandits problem has been extensively studied in [Yue et al., 2012; Ailon et al., 2014; Zoghi et al., 2014; Komiyama et al., 2015; Wu and Liu, 2016].
Beyond the stochastic $K$-armed dueling bandits setting, other dueling bandit settings include multi-way preference feedback [Sui and Burdick, 2014], continuous-armed convex dueling bandits [Yue and Joachims, 2009], contextual dueling bandits, which also introduce the von Neumann winner solution concept [Dudík et al., 2015], sparse dueling bandits that focus on the Borda winner solution concept [Jamieson et al., 2015], Copeland dueling bandits that focus on the Copeland winner solution concept [Zoghi et al., 2015], and adversarial dueling bandits [Gajane et al., 2015]. It would be interesting to study how to extend our analysis to these other settings as well.
3 Problem Statement
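In the classical dueling bandits problem, at each iteration $t$, the following happens:

The algorithm chooses a pair of actions $b_t^{(1)}$ and $b_t^{(2)}$ from a set of $K$ possible actions.

The algorithm duels $b_t^{(1)}$ and $b_t^{(2)}$ and receives (noisy) feedback corresponding to the winner.
Our procedure can be described as follows. There is a set of $K$ arms $\mathcal{B} = \{b_1, \dots, b_K\}$, and a total number of $T$ tests to be performed. At each time step, a pair of arms is chosen from $\mathcal{B}$ and a (noisy) comparison of them is observed. $T$ is determined before we run the algorithm. The arms are correlated, and $K \gg T$ in general.
We follow the original notation of the dueling bandits problem. For two arms $b_i$ and $b_j$ sampled from $\mathcal{B}$, we write the comparison factor as
$$\epsilon(b_i, b_j) = P(b_i \succ b_j) - \frac{1}{2},$$
where $P(b_i \succ b_j)$ is the probability that $b_i$ dominates $b_j$ and $\epsilon(b_i, b_j) \in (-1/2, 1/2)$ represents the priority between $b_i$ and $b_j$. We define $b_i \succ b_j \Leftrightarrow \epsilon(b_i, b_j) > 0$. We use the notation $\epsilon_{i,j} = \epsilon(b_i, b_j)$ for convenience. Note that $\epsilon(b_i, b_j) = -\epsilon(b_j, b_i)$ and $\epsilon(b_i, b_i) = 0$. We assume the distribution of reward for each arm is stationary so that all comparison factors converge in $[-1/2, 1/2]$. We also assume that the arms are indexed in preferential order $b_1 \succ b_2 \succ \dots \succ b_K$, so that there is one preferred arm.
Our goal is to minimize the total regret:
$$R_T = \sum_{t=1}^{T} \left[ \epsilon(b_1, b_t^{(1)}) + \epsilon(b_1, b_t^{(2)}) \right].$$
The total regret $R_T = 0$ if we constantly choose $b_1$ during the experiment. $R_T$ is linear in $T$ if we constantly choose an arm $b_i \neq b_1$.
We also inherit two properties of the comparison factors from the original dueling bandit problem:
Strong Stochastic Transitivity. For any triplet of arms $b_i \succ b_j \succ b_k$, we assume $\epsilon_{i,k} \geq \max\{\epsilon_{i,j}, \epsilon_{j,k}\}$.
Stochastic Triangle Inequality. For any triplet of arms $b_i \succ b_j \succ b_k$, we assume $\epsilon_{i,k} \leq \epsilon_{i,j} + \epsilon_{j,k}$. This can be viewed as a diminishing returns property.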
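The regret and the two transitivity properties can be illustrated with a small sketch. It assumes the standard dueling bandits formulation of [Yue et al., 2012], in which the regret of one duel is the sum of the comparison factors of the preferred arm against the two chosen arms, strong stochastic transitivity requires the factor between the outer arms of an ordered triplet to be at least the larger of the two inner factors, and the stochastic triangle inequality requires it to be at most their sum. The matrix `eps`, the function names, and the three-arm example are hypothetical and chosen only for illustration.

```python
def total_regret(eps, duels, best=0):
    """Sum eps[best][a] + eps[best][b] over all dueled pairs (a, b)."""
    return sum(eps[best][a] + eps[best][b] for a, b in duels)


def satisfies_sst_and_sti(eps, tol=1e-12):
    """Check both properties for every triplet i < j < m.

    Assumes arms are indexed in preferential order (arm 0 is preferred).
    """
    k = len(eps)
    for i in range(k):
        for j in range(i + 1, k):
            for m in range(j + 1, k):
                # Strong stochastic transitivity.
                if eps[i][m] < max(eps[i][j], eps[j][m]) - tol:
                    return False
                # Stochastic triangle inequality.
                if eps[i][m] > eps[i][j] + eps[j][m] + tol:
                    return False
    return True


# Hypothetical comparison factors: eps[i][j] = P(arm i beats arm j) - 1/2.
eps = [
    [0.0, 0.1, 0.3],
    [-0.1, 0.0, 0.2],
    [-0.3, -0.2, 0.0],
]
always_best = [(0, 0)] * 5   # duel the preferred arm against itself
always_worst = [(2, 2)] * 5  # keep playing the worst arm

print(total_regret(eps, always_best))   # no regret accumulates
print(total_regret(eps, always_worst))  # regret grows with the number of duels
print(satisfies_sst_and_sti(eps))
```

Playing a fixed suboptimal pair adds the same positive amount at every step, which is why the regret is linear in the number of duels in that case.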
Correlational Dueling Bandits. When the size of the decision set, $K$, is large, a very large number of tests is unavoidable before an algorithm for independent arms converges to its optimal solution. In some applications, like our clinical example, each test is expensive and time-consuming. The number of tests (the time horizon of an algorithm) is often predetermined by clinical conditions. We thus augment the dueling bandits problem into correlational dueling bandits, which takes the correlations among arms into consideration. For any pair of arms, we assume that the dependence between them is captured by some similarity function that satisfies:

;

and are not correlated;

.
For all tuples , if we play pair once and observe , we define to be the update of wins of and to be the update of plays of . and represent the dependent structure of the tuple arms. They could be functions of , , and .
In our synthetic experiments, we assume the input space (set of arms) has dependent structure and there exists an underlying utility function over input space which we cannot observe directly. Our observations are the noisy comparisons between pairs of arms (e.g., and ) which can be viewed as the noisy comparison of utility values (e.g., and ). The properties of strong stochastic transitivity, stochastic triangle inequality, and the dependency assumptions on generally hold for a wide range of applications. In the clinical experiments, we extract comparisons from physician’s online judgment.
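The noisy comparison of utility values described above can be sketched as follows. This is a minimal illustration under an assumed noise model: each arm's latent utility is perturbed by independent zero-mean Gaussian noise, and the arm with the larger noisy value wins the duel. The function name `noisy_duel`, the utility values, and the noise level are hypothetical.

```python
import random


def noisy_duel(utility, i, j, noise_std, rng):
    """Return True if arm i wins one noisy comparison against arm j."""
    value_i = utility[i] + rng.gauss(0.0, noise_std)
    value_j = utility[j] + rng.gauss(0.0, noise_std)
    return value_i > value_j


rng = random.Random(0)
utility = [1.0, 0.5, 0.0]  # hypothetical latent utilities, unobserved in practice

num_duels = 2000
wins = sum(noisy_duel(utility, 0, 2, 0.5, rng) for _ in range(num_duels))
# Empirical estimate of P(arm 0 beats arm 2); subtracting 1/2 gives an
# estimate of the comparison factor between the two arms.
print(wins / num_duels)
```

Only the binary outcome of each duel is passed to the learner, so the utility values themselves never need to be measured.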
4 Algorithm
Our algorithm, CorrDuel, shown in Algorithm 1, is a correlational dueling bandits algorithm based on the Beat-the-Mean algorithm [Yue and Joachims, 2011]. It uses observational feedback and the correlational structure to successively remove suboptimal arms, while keeping the optimal one(s) in the sample space with high probability. The inputs to CorrDuel are the set of arms, the total number of iterations $T$, and the correlational structure.
ParametersInitialization (Algorithm 2) defines the set of active arms , whose size shrinks as more tests are completed. For each arm , let be the total number of comparisons between and other arms, and let be the total number of wins against all other arms. Let be the empirical average of for all in , and let be the value of after comparisons between arm and any other arms. Set the confidence interval of as:
where , and is the confidence that lies in . The function decreases as the number of comparisons increases. By properly setting parameter , the optimal reward can be reached within the fixed time horizon as shown in Proposition 1.
ActiveElimination (Algorithm 3) is the key part of CorrDuel. For each test, two arms are randomly chosen from the active set. The randomized selection method enjoys low-variance total regret in general. For each arm, the number of wins, the number of comparisons, and the empirical average are updated, as is the corresponding confidence radius. An arm dominates another arm if their confidence intervals do not overlap, in which case the inferior arm is eliminated from the active set. The algorithm runs until the time horizon is reached or only one active arm remains.
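The elimination step can be sketched as below. The confidence radius used here is a generic Hoeffding-style expression, sqrt(log(1/delta) / n), which is an assumption made for illustration and not necessarily the exact radius used by CorrDuel. The names `confidence_radius` and `eliminate` and the example counts are likewise hypothetical.

```python
import math


def confidence_radius(n, delta):
    """Hoeffding-style radius (assumed form); infinite before any comparison."""
    if n <= 0:
        return float("inf")
    return math.sqrt(math.log(1.0 / delta) / n)


def eliminate(active, wins, plays, delta):
    """Keep the arms whose interval overlaps that of the empirically best arm.

    An arm is dropped when its upper confidence bound lies below the largest
    lower confidence bound among the active arms.
    """
    mean = {a: wins[a] / plays[a] if plays[a] > 0 else 0.5 for a in active}
    radius = {a: confidence_radius(plays[a], delta) for a in active}
    best_lower = max(mean[a] - radius[a] for a in active)
    return [a for a in active if mean[a] + radius[a] >= best_lower]


active = [0, 1, 2]
wins = {0: 90.0, 1: 80.0, 2: 10.0}      # hypothetical (possibly fractional) win counts
plays = {0: 100.0, 1: 100.0, 2: 100.0}  # hypothetical comparison counts
print(eliminate(active, wins, plays, delta=0.05))
```

Counts are stored as floats because a correlation-aware update may add fractional wins and plays to arms that were not directly compared.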
CorrUpdate (Algorithm 4) is the subroutine of ActiveElimination (Algorithm 3) which updates the weights of by rules and . In the classical dueling bandits setting, we assume arms are independent. For independent arms, if we have one comparison between and and gets , we only update the weights for arm and :
(1) 
(2) 
For a large decision space, existing dueling bandits algorithms are extremely slow if one does not exploit dependencies among arms, even if they can achieve provably optimal cumulative regret (w.r.t. independent arms). When the arms are correlated and the correlation between any pair of arms and is measured properly by , we can update all active arms at each iteration.
As shown in Algorithm 4, we update every arm after comparing arms and ( assume ) via:
(3) 
(4) 
where and represent the correlational structure, which is assumed to satisfy:

;

if , ;

if , , .
These updates are based on the assumption that the similarity function is an unbiased estimate of the dependent structure. The CorrUpdate subroutine (Algorithm 4) can efficiently update all arms at each iteration, so CorrDuel enjoys fast convergence towards the near-optimal arms.
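The idea of updating every active arm after a single duel can be sketched as follows. The weights below are one hypothetical choice, not the paper's exact rule: the play weight of arm k is its larger non-negative similarity to the two dueling arms, and the win weight is the share of that play weight attributed to the winner. They are chosen so that the winner receives one win and one play, the loser receives zero wins and one play, and arms unrelated to both receive nothing.

```python
def corr_update(wins, plays, sim, winner, loser, active):
    """Update all active arms in place after `winner` beats `loser`.

    Hypothetical weights (an assumption for illustration):
      play weight = max(a, b)
      win weight  = play weight * a(1 - b) / (a(1 - b) + b(1 - a))
    where a and b are arm k's non-negative similarities to winner and loser.
    """
    for k in active:
        a = max(sim[winner][k], 0.0)
        b = max(sim[loser][k], 0.0)
        play_weight = max(a, b)
        if play_weight == 0.0:
            continue  # arm k is unrelated to both dueling arms
        credit_win = a * (1.0 - b)
        credit_loss = b * (1.0 - a)
        total = credit_win + credit_loss
        share = credit_win / total if total > 0.0 else 0.5
        wins[k] += play_weight * share
        plays[k] += play_weight


# Hypothetical similarity matrix: arm 2 is close to arm 0 and far from arm 1.
sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
wins = [0.0, 0.0, 0.0]
plays = [0.0, 0.0, 0.0]
corr_update(wins, plays, sim, winner=0, loser=1, active=[0, 1, 2])
print(wins, plays)
```

With an identity similarity matrix the loop touches only the two dueling arms, which recovers the independent-arm update.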
Definition 1.
optimal arm. If arm satisfies , then is an optimal arm.
Proposition 1.
If such that for every , then with high probability, the cumulative time to achieve purely optimal arms is bounded by:
Proof.
Proposition 1 follows from Theorem 1 of [Yue and Joachims, 2011]. After iterations, since , we have . Then . Notice is a function of time step .
Note that the iteration time in Proposition 1 does not depend on $K$, which suggests fast convergence of CorrDuel in large decision spaces.
In our application of CorrDuel to the selection of optimal multi-electrode stimulating parameters for paraplegic patients, we define the similarity of different configurations to be the correlation coefficient of the electrical potential fields generated by the two electrode stimulating configurations. Since the correlation coefficient takes values in $[-1, 1]$, we apply the CorrUpdate rule only when the coefficient is not negative. The existence of negative values is based on clinical observations. The correlational property arises from analysis of the electric fields applied by the array, as shown in Figure 2.
The standard notion of correlation coefficient, , is used in our experiments. However, one can use any measure as a basis for as long as , when , and when has an “irrelevant” relation to . The coefficient can take negative values, but the algorithm does not use negative values for its updates.
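A similarity of this kind can be sketched as the Pearson correlation coefficient between two sampled potential fields, with negative values clipped to zero so that they are never used for updates. The field values below are hypothetical placeholders, not real electrode data, and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant field carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def similarity(field_a, field_b):
    """Clip negative correlations to zero so they trigger no update."""
    return max(pearson(field_a, field_b), 0.0)


# Hypothetical potential fields sampled at the same four locations.
field_1 = [0.0, 1.0, 2.0, 3.0]
field_2 = [0.0, 2.0, 4.0, 6.0]  # same spatial pattern, larger amplitude
field_3 = [3.0, 2.0, 1.0, 0.0]  # reversed spatial pattern

print(similarity(field_1, field_2))
print(similarity(field_1, field_3))
```

Because the coefficient is invariant to scaling, two configurations that produce the same field shape at different amplitudes are treated as fully similar under this measure.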
For correlated arms, we perform an update for every arm for which as follows:
(5) 
(6) 
Proposition 2.
If such that for every pair , then with high probability, the cumulative time to achieve purely optimal arms satisfies:
Proof.
If for every pair , since , for every tuple . The result follows from substituting it into Proposition 1. ∎
The CorrUpdate subroutine above updates the dueling pair in the same way as if they were independent, since (5) and (6) collapse to (1) and (2) for and . In the extreme case where and arm is very close to , we have and , so the update for arm will be close to the update for arm . If is far from both and , (5) and (6) guarantee that the update for is very small, since we acquire little information about from faraway comparisons. Also, if and are less dependent (with smaller ), we would expect to acquire larger updates for the points in between.
One can also consider a Bayesian version, e.g., by using Gaussian processes. In this paper, we focus on a frequentist approach, which is a better model of the clinical application.
5 Experiments
We evaluated our approach in two settings, synthetic simulations and a real clinical application of online optimization for spinal cord stimulation therapy. In our controlled synthetic experiments, we seek to address the following questions:

How does the algorithm compare against standard dueling bandit algorithms?

How effective is it in terms of convergence?
We compare the algorithm against Beat-the-Mean, RUCB, and the Sparring algorithm with UCB1. These three algorithms are representative dueling bandits algorithms designed for independent arms; they do not leverage the correlations between arms.
5.1 Simulation Experiments
Setup. We first evaluate the algorithm with simulation experiments. The purpose of this experiment is to validate our algorithm and demonstrate its quick convergence when the arms are dependent. To generate the underlying utility function over correlated arms, we sampled random functions from a zero-mean Gaussian process with a squared exponential kernel over the sample space , uniformly discretized into points (the set of arms), and used each sampled function as the mean function for the arms. We chose as the standard deviation of arms. One realization of the mean function is shown in Figure 3. The utility function is not necessarily convex or simple. Within each iteration, we sample two points in the active set and compare their (noisy) sampled values to get the feedback of the duel. We run the duel for iterations for 10000 trials for each of the four algorithms compared.
Results. We report the stepwise regret instead of the cumulative regret. For every no-regret algorithm, it converges to zero as the number of iterations goes to infinity. As seen in Figure 4, CorrDuel converges much faster than the other three algorithms since it takes advantage of the dependence among arms. The independent-armed dueling bandits algorithms require an exhaustive search period, significantly longer than the time horizon we use here, before concentrating on the (near) optimal arms.
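Sampling a utility function of this kind can be sketched with the standard library alone. The sketch assumes the unit interval as the sample space, 20 arms, a kernel length scale of 0.2, and a small diagonal jitter for numerical stability; all of these values are placeholders chosen for illustration and are not the settings used in the experiments.

```python
import math
import random


def se_kernel(x, y, length_scale):
    """Squared exponential kernel with unit signal variance."""
    return math.exp(-0.5 * ((x - y) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_gp_utility(num_arms, length_scale, rng, jitter=1e-6):
    """Draw one zero-mean GP sample on a uniform grid over [0, 1] (assumed domain)."""
    xs = [k / (num_arms - 1) for k in range(num_arms)]
    cov = [
        [se_kernel(a, b, length_scale) + (jitter if i == j else 0.0)
         for j, b in enumerate(xs)]
        for i, a in enumerate(xs)
    ]
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    # Multiplying the Cholesky factor by standard normals yields a GP sample.
    utility = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(num_arms)]
    return xs, utility


rng = random.Random(1)
xs, utility = sample_gp_utility(num_arms=20, length_scale=0.2, rng=rng)
best_arm = max(range(len(utility)), key=utility.__getitem__)
print(best_arm, round(utility[best_arm], 3))
```

Nearby grid points receive similar utilities because the kernel assigns them high covariance, which is the dependence among arms that a correlation-aware update can exploit.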
5.2 Human Experiments
Background. As depicted in Figure 1, our human clinical experiments involve optimizing a system for stand training under spinal cord stimulation with spinal cord injury patients. The subject practices standing under spinal stimulation using a stand frame for assistance with balance. The training process largely follows the procedures in [Rejc et al., 2015]. Two trainers on the subject’s left and right protect and assist the subject. Within each experiment, a specific stimulating pattern (a combination of active electrode selections, the polarity of the actively selected electrodes, and the stimulation amplitude and frequency) is applied through the implanted electrode array and its controlling circuitry. An anonymous short video
The participants are in stable medical condition and have no musculoskeletal dysfunction that might interfere with stand training. They have no motor response present in leg muscles during transcranial magnetic stimulation, indicating that there are no strongly active neural pathways connecting the cortex and lower limb muscles. No volitional control can be achieved during voluntary movement attempts in leg muscles as measured by EMG activity.
Setup. We use clinical knowledge to restrict the decision space from around to be on the order of . It is still a very large decision space considering that the number of trials, or arm pulls, is on the order of .
A total of 414 experimental comparisons were done with two patients under the CorrDuel algorithm. Each trial lasted for about 5 minutes. Within each trial, one stimulating pattern was generated by the 16-channel electrode. The patterns were unchanged within each trial. For a fixed electrode configuration, the stimulation frequency and amplitude were modulated synergistically in order to find the best values for effective weight-bearing standing. We optimized the electrode patterns with CorrDuel and performed an exhaustive search for stimulation frequency and amplitude over a narrow range.
Stimulation began while the patient was seated. Then the participant initiated the sit-to-stand transition by positioning his feet shoulder-width apart and shifting his weight forward to begin loading the legs.
Results. For the clinical experiments, we cannot create a direct plot of regret since the ground-truth optimal stimulation is unknown. In the experiments, we observed the convergence of CorrDuel, which is not possible for independent-armed dueling bandits algorithms. The set of (near) optimal configurations found by CorrDuel is shown in Figure 5. We compared the performance of CorrDuel to the optimal selections found heuristically for each patient by clinicians, which are shown in Figure 6. We found that the manual selection is a subset of the algorithm’s selection, and there exist high-performing configurations (e.g., the 2nd in Figure 5) found by the algorithm that are not in the manual selection. This suggests that CorrDuel performs no worse than specialized physicians.
6 Conclusion and Discussion
Our analysis and simulation demonstrate that CorrDuel exhibits fast convergence compared to independent-armed dueling bandits algorithms when correlation information is available. We deployed this algorithm in clinical experiments for the control of spinal cord stimulation and showed that CorrDuel performs no worse than specialized physicians. We believe that our result provides an important step towards employing machine learning algorithms in many problems that involve large parameter spaces and sequential decision making. These problems could be facilitated by our algorithm, which simultaneously delivers effective decisions and explores the decision space based on comparative feedback.
The CorrUpdate subroutine is easy to incorporate into the Beat-the-Mean algorithm to obtain an efficient CorrDuel. Although we developed CorrDuel specifically based on Beat-the-Mean, CorrUpdate is a more general approach that could be incorporated into other existing dueling bandits algorithms. For instance, it can be combined with RUCB to realize a variant of RUCB for dependent arms by updating the wins with CorrUpdate.
To our knowledge, our work is the first to apply an algorithmic approach to spinal cord injury treatments. The algorithm could find a proper set of optimal stimulating configurations within the test time horizon. We achieved good performance in both simulations and human experiments. The paraplegic human patients could achieve full weight-bearing standing under the stimulation provided by our algorithm.
Acknowledgements
This research was supported in part by Caltech/JPL PDF IAMS100224, NIHU01EB00761508, NIHU01EB01552105, and a gift from Northrop Grumman.
References
 Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011.
 Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 263–274, 2008.
 Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning (ICML), 2014.
 Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
 Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1–122, 2012.
 Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems (NIPS), 2008.
 Nicolo Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.
 Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 2011.
 Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):6:1–6:41, 2012.
 Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory (COLT), 2008.
 Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory (COLT), 2015.
 Pratik Gajane, Tanguy Urvoy, and Fabrice Clérot. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In International Conference on Machine Learning (ICML), 2015.
 Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In ICML, pages 757–765, 2014.
 Susan Harkema, Yury Gerasimenko, Jonathan Hodes, Joel Burdick, Claudia Angeli, Yangsheng Chen, Christie Ferreira, Andrea Willhite, Enrico Rejc, Robert G Grossman, et al. Effect of epidural stimulation of the lumbosacral spinal cord on voluntary movement, standing, and assisted stepping after motor complete paraplegia: a case study. The Lancet, 377(9781):1938–1947, 2011.
 Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse dueling bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
 Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing (STOC). Association for Computing Machinery, Inc., May 2008.
 Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML, pages 282–293, 2006.
 Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In COLT, 2015.
 Enrico Rejc, Claudia Angeli, and Susan Harkema. Effects of lumbosacral spinal cord epidural stimulation for standing after chronic complete paralysis in humans. PloS one, 10(7):e0133998, 2015.
 Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
 Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
 Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
 Yanan Sui and Joel Burdick. Clinical online recommendation with subgroup rank feedback. In ACM Conference on Recommender Systems (RecSys), 2014.
 Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
 Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In International Conference on Machine Learning (ICML), 2009.
 Yisong Yue and Thorsten Joachims. Beat the mean bandit. In International Conference on Machine Learning (ICML), 2011.
 Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
 Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In International Conference on Machine Learning (ICML), 2014.
 Masrour Zoghi, Zohar S Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.