# KLUCB Approach to Copeland Bandits

###### Abstract

The multi-armed bandit (MAB) problem is a reinforcement learning framework in which an agent tries to maximise her profit by selecting actions on the basis of absolute feedback for each action. The dueling bandits problem is a variation of the MAB problem in which an agent chooses a pair of actions and receives relative feedback for the chosen pair. It is well suited to settings in which quantitative feedback for each action is not available and qualitative feedback is preferred, as in the case of human feedback. Dueling bandits have been successfully applied to online rank elicitation, information retrieval, search engine improvement and clinical online recommendation. We propose a new method called Sup-KLUCB for the K-armed dueling bandit problem, specifically the Copeland bandit problem, by converting it into a standard MAB problem. Instead of running a MAB algorithm independently for each action in a pair, as in the Sparring and Self-Sparring algorithms, we combine a pair of actions and treat it as one action. Previous UCB algorithms such as Relative Upper Confidence Bound (RUCB) can be applied only to Condorcet dueling bandits, whereas this algorithm applies to general Copeland dueling bandits, which include Condorcet dueling bandits as a special case. In our experiments, Sup-KLUCB outperforms the state-of-the-art Double Thompson Sampling (DTS) on Copeland dueling bandits.

## 1 Introduction

A classic multi-armed bandit (MAB) problem is a reinforcement learning problem in which an agent learns to play actions in order to maximise her profit. Initially the agent has no stochastic information about the actions; she learns to play through feedback associated with previous actions. MAB problems pose a dilemma between exploration and exploitation. Since the true parameters are unknown, an algorithm can maintain only estimates of them. Inadequate exploration may lead to playing sub-optimal actions, which increases loss, while excessive exploration leads to slow convergence, which is also undesirable. The performance of a bandit algorithm can be measured by its cumulative regret. In MAB problems, the regret for an action played at a given time instant is defined as the gap between the expected reward of the best action and the expected reward of the action played. Each algorithm therefore tries to minimise cumulative regret. MAB problems have a wide range of applications and have been successfully applied to online advertising, clinical trials, adaptive routing and communication systems.

The dueling bandit problem [1] is a variation of the MAB problem in which an agent chooses a pair of actions and receives relative feedback about which action in the chosen pair is preferred. Unlike the MAB problem, in which the agent receives quantitative feedback for her actions, only qualitative feedback is received in the dueling bandit problem. Such problems are important when feedback is naturally relative (e.g. feedback given by humans) or when absolute feedback is inefficient to provide. Dueling bandits have been successfully applied to information retrieval [2], search engine improvement, clinical online recommendation [3] and online rank elicitation [4].

In a $K$-armed dueling bandit problem, an action that defeats all the other actions is called the Condorcet winner. Such an action need not exist: for example, the best football team might not defeat every other team. In the absence of a Condorcet winner, there are several other criteria for judging the winner. The Copeland winner [5] is the action which defeats the maximum number of other actions. The Borda winner [6] is the action with the largest Borda score, defined for action $i$ by $\frac{1}{K-1}\sum_{j \neq i} p_{ij}$, where $p_{ij}$ is the probability of action $i$ defeating action $j$. We assume a unique winner throughout our discussion and refer to an action as an arm hereafter.

Our paper is organised as follows. In the following sub-section we discuss related work. In section 2, we formally define the MAB problem and the K-armed dueling bandit problem. In section 3, we provide a detailed description of the Sup-KLUCB algorithm and the key intuitions behind it. In section 4, we present our results and a comparison with various algorithms. We conclude in section 5.

### 1.1 Related Work

The standard MAB problem has been studied extensively. The most notable work is by Lai and Robbins [7], who developed asymptotic lower bounds for regret of order $\log T$. Algorithms which attain these bounds are called uniformly good and are asymptotically efficient. Graves and Lai [8] extended the bounds to a controlled Markov chain setting. Various algorithms have been put forth to solve the MAB problem with varying success, the most important being Upper Confidence Bound (UCB) [9] and Kullback-Leibler UCB (KL-UCB) [10].

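The KL-UCB algorithm is an online, horizon-free index policy for stochastic bandit problems. The horizon is the number of steps told in advance to an algorithm, by which it has to produce a single best arm and thereafter continue to exploit that arm. Horizon-free algorithms do not require a specified horizon, and the evaluation process continues indefinitely. A horizon-free algorithm therefore has to minimise regret across all horizons by rapidly converging to the selection of the optimal arm. The authors show that the expected cumulative regret $R(T)$ of the KL-UCB algorithm after $T$ rounds satisfies

$$\limsup_{T \to \infty} \frac{R(T)}{\log T} \le \sum_{a : \mu_a < \mu_{a^*}} \frac{\mu_{a^*} - \mu_a}{d(\mu_a, \mu_{a^*})} \tag{1}$$

where $d(p, q)$ denotes the Kullback-Leibler divergence between Bernoulli distributions of parameters $p$ and $q$ respectively, $\mu_a$ is the expected reward for action $a$, and $a^*$ is the action with the highest expected reward. The authors also show a non-asymptotic upper bound on the number of draws $N_a(T)$ of a sub-optimal arm $a$: for all $\epsilon > 0$ there exist $C_1$, $C_2(\epsilon)$ and $\beta(\epsilon)$ such that

$$\mathbb{E}[N_a(T)] \le \frac{\log T}{d(\mu_a, \mu_{a^*})}(1 + \epsilon) + C_1 \log\log T + \frac{C_2(\epsilon)}{T^{\beta(\epsilon)}} \tag{2}$$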

Several algorithms have been proposed for the K-armed dueling bandit problem. They can be broadly divided into two categories, asymmetric and symmetric [11]. Asymmetric algorithms consider the two arms separately. Usually they first select the arm which has the best performance (the exploitation arm) and then select a second arm to duel against the first, with the aim of identifying an arm which can outperform the first (the exploration arm). Interleaved Filter (IF), Beat the Mean (BtM), SAVAGE, Doubler, Relative Upper Confidence Bound (RUCB) [12], MergeRUCB and Double Thompson Sampling (DTS) [13] are a few examples of asymmetric algorithms. Symmetric algorithms treat the choice of the two arms symmetrically. The Sparring and Self-Sparring algorithms are examples of symmetric algorithms.

The RUCB algorithm [12] extends UCB to dueling bandit problems by using an upper confidence bound on preference probabilities. The cumulative regret of RUCB after $T$ time steps is bounded by $O(K \log T)$. Cumulative regret after iterations (for some ), is bounded by

(3) |

where and where is the best arm. Double Thompson Sampling (DTS) [13], as the name suggests, uses Thompson Sampling twice: once to break ties while selecting the first arm as in RUCB, and once to sample the second arm. The cumulative regret of DTS is bounded by $O(K^2 \log T)$ for general Copeland dueling bandits and by $O(K \log T + K^2 \log\log T)$ for Condorcet dueling bandits. DTS is the state-of-the-art algorithm for small-scale dueling bandits, whereas MergeRUCB (a variant of RUCB) is the state of the art for large-scale dueling bandits. However, the scope of RUCB-type algorithms is limited to Condorcet problems, whereas DTS extends to the Copeland case as well.

Our algorithm reduces the dueling bandit problem to a standard MAB problem. Algorithms which convert the dueling bandit problem to a conventional MAB problem have been proposed before. Doubler [14] was the first approach in this direction. Doubler assumes that the probability of one arm winning over another is a linear function of the underlying utility of each arm. This utility assumption requires a total ordering of the arms. The Sparring algorithm [14] also assumes a total ordering, as in Doubler. Sparring uses separate MAB algorithms to choose the two arms, reducing the dueling bandit problem to two MAB problems which can be related to the adversarial bandit problem. Self-Sparring [15] performs better than Sparring and it is upper bounded by . Self-Sparring chooses each arm independently by calling a stochastic MAB algorithm such as Thompson sampling as a subroutine. Self-Sparring can use $m$ independent MAB instances to duel $m$ arms simultaneously, and the dueling bandit problem is the special case with $m = 2$.

### 1.2 Our Contribution

We propose an algorithm called Sup-KLUCB to solve the K-armed dueling bandit problem, specifically the Copeland bandit problem. Sup-KLUCB is a horizon-free, stochastic reinforcement learning algorithm like KLUCB. A distinctive feature of Sup-KLUCB is its flexibility: with minor changes in the objective function, it can be used to solve various types of dueling bandit problems which have a single unique winner, such as Copeland, Condorcet (a special case of the Copeland problem) and Borda. In this paper we focus on Copeland problems. Finally, we present Monte Carlo simulations in which our algorithm outperforms existing methods.

## 2 Problem Setting

First, we discuss the standard K-armed MAB problem and then move to the K-armed dueling bandit problem.

### 2.1 Multi-Armed Bandit model

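We consider a stochastic multi-armed bandit problem with $K$ arms such that $K$ is finite and $K \ge 2$. We define $\mathcal{A} = \{1, \dots, K\}$ as the set of arms. Time proceeds in rounds indexed by $t = 1, 2, \dots$. In each round, a reward $X_{a,t}$ is received for playing arm $a \in \mathcal{A}$. These rewards are bounded in $[0, 1]$. The sequences $(X_{a,t})_{t \ge 1}$ for all arms are i.i.d. with common expectation $\mu_a$. Rewards across arms are also assumed to be independent. We denote by $N_a(t)$ the number of times arm $a$ was played up to round $t$, i.e. $N_a(t) = \sum_{s=1}^{t} \mathbb{1}\{a_s = a\}$. At each round, a decision rule or algorithm plays an arm $a_t$ depending on past decisions and observed rewards. The set $\Pi$ of all possible decision rules consists of policies $\pi$ such that the event $\{a_t = a\}$ (play arm $a$ at round $t$) belongs to the $\sigma$-field generated by $(a_1, X_{a_1,1}, \dots, a_{t-1}, X_{a_{t-1},t-1})$. We denote by $a^*$ the best arm and by $\mu^* = \mu_{a^*}$ the expected reward associated with it. The regret and expected regret for a policy $\pi$ at round $t$ are defined as:

$$r^{\pi}(t) = \mu^* - \mu_{a_t} \tag{4}$$

$$\mathbb{E}[r^{\pi}(t)] = \sum_{a \in \mathcal{A}} (\mu^* - \mu_a)\, \mathbb{E}[\mathbb{1}\{a_t = a\}] \tag{5}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function. The performance of a decision rule is measured by its expected cumulative regret. The expected cumulative regret up to round $T$ for policy $\pi$ is given by:

$$R^{\pi}(T) = \sum_{t=1}^{T} \mathbb{E}[r^{\pi}(t)] = \sum_{a \in \mathcal{A}} (\mu^* - \mu_a)\, \mathbb{E}[N_a(T)] \tag{6}$$

Any MAB algorithm aims to find a policy that minimises regret, formally:

$$\pi^* = \arg\min_{\pi \in \Pi} R^{\pi}(T) \tag{7}$$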
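As a small illustration of the cumulative regret decomposition, the sketch below computes the regret of a finished run from the true arm means and the number of times each arm was pulled. The function name `cumulative_regret` and the example numbers are hypothetical and serve only to make the definition concrete; in practice the true means are unknown and this quantity is available only in simulation.

```python
def cumulative_regret(means, pull_counts):
    """Sum over arms of (best mean - arm mean) * number of pulls of that arm.

    `means` are the true expected rewards (known only in simulation) and
    `pull_counts` are the realised pull counts after the run.
    """
    best = max(means)
    return sum((best - mu) * n for mu, n in zip(means, pull_counts))


# Three arms, 100 rounds: the best arm (mean 0.9) was pulled 90 times.
example_regret = cumulative_regret([0.9, 0.5, 0.8], [90, 5, 5])
print(example_regret)
```

Only pulls of sub-optimal arms contribute, so a good policy keeps those counts growing slowly with the number of rounds.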

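The Bernoulli Kullback-Leibler divergence mentioned earlier is defined for $p, q \in [0, 1]$ as:

$$d(p, q) = p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q} \tag{8}$$

with, by convention, $0 \log 0 = 0 \log(0/0) = 0$ and $x \log(x/0) = +\infty$ for $x > 0$.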
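The Bernoulli divergence and the upper confidence index that KL-UCB builds on it can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the exploration level is assumed to be $\log t$ (the optional $\log\log t$ term of KL-UCB is omitted), and the index is found by bisection, which works because $d(p, q)$ is increasing in $q$ for $q \ge p$.

```python
import math


def bernoulli_kl(p, q):
    """Bernoulli KL divergence d(p, q) with the usual conventions."""
    def term(x, y):
        if x == 0.0:
            return 0.0          # 0 log(0/y) = 0, including y == 0
        if y == 0.0:
            return math.inf     # x log(x/0) = +inf for x > 0
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl_ucb_index(mean, pulls, t, tol=1e-9):
    """Largest q in [mean, 1] with pulls * d(mean, q) <= log(t) (assumed level)."""
    level = math.log(t) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= level:
            lo = mid            # mid is still inside the confidence region
        else:
            hi = mid            # mid is outside, shrink from above
    return lo


# An arm with empirical mean 0.5: the index shrinks as it is pulled more often.
print(kl_ucb_index(0.5, 10, 100), kl_ucb_index(0.5, 100, 100))
```

The index always lies between the empirical mean and 1, and it approaches the empirical mean as the number of pulls grows relative to $\log t$.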

### 2.2 Dueling Bandits Problem

We consider a $K$-armed dueling bandit problem such that $K \ge 2$ and finite. We define $\mathcal{A} = \{1, \dots, K\}$ as the set of arms. Time proceeds in rounds indexed by $t = 1, 2, \dots$. In each round $t$, an arm pair $(a_t, b_t)$ is played and a noisy comparison $X_t$ is obtained. If arm $a_t$ was preferred over arm $b_t$ then $X_t = 1$, else $X_t = 0$. This comparison is characterised by a preference matrix $P = [p_{ij}]$, where $p_{ij}$ is the probability of arm $i$ being preferred over arm $j$. We assume comparisons are independent and remain stationary over time. We also assume that the order of comparison does not affect the outcome, i.e. $(i, j)$ and $(j, i)$ would lead to the same outcome. Thus $p_{ij} + p_{ji} = 1$. We assign $p_{ii} = \frac{1}{2}$. When we say arm $i$ beats arm $j$, we mean $p_{ij} > \frac{1}{2}$.
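The noisy comparison can be simulated directly from a preference matrix. The sketch below is an illustration under stated assumptions: arms are indexed from 0, the matrix is a plain list of lists, and the helper names `duel` and `empirical_preference` are hypothetical.

```python
import random


def duel(P, i, j, rng):
    """Return 1 if arm i is preferred over arm j in one noisy comparison, else 0."""
    return 1 if rng.random() < P[i][j] else 0


def empirical_preference(P, i, j, n, rng):
    """Fraction of n independent duels in which arm i was preferred over arm j."""
    wins = sum(duel(P, i, j, rng) for _ in range(n))
    return wins / n


# Two arms: arm 0 beats arm 1 with probability 0.8.
P_two = [[0.5, 0.8],
         [0.2, 0.5]]
rng = random.Random(1)
print(empirical_preference(P_two, 0, 1, 20000, rng))
```

Because comparisons are independent and stationary, the empirical win fraction concentrates around the corresponding entry of the preference matrix as the number of duels grows.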

#### 2.2.1 Copeland dueling bandits

A Condorcet winner often does not exist in practice. A straightforward way to declare a winner in such cases is to choose the player or action which secures the maximum number of wins. For example, a football team winning a league has not necessarily defeated all other teams, but it has defeated the maximum number of teams. The Copeland winner in a dueling bandit problem is an arm which defeats the maximum number of arms. The Copeland score of an arm $i$ is defined as $\mathrm{Cpld}(i) = \sum_{j \ne i} \mathbb{1}\{p_{ij} > \frac{1}{2}\}$. Thus an arm with the maximum Copeland score is a Copeland winner. Formally, we say an arm $i^*$ is a Copeland winner if $\mathrm{Cpld}(i^*) \ge \mathrm{Cpld}(j)$ for all $j \in \mathcal{A}$. Now, let us assume arm $i^*$ is the Copeland winner; then the regret at round $t$ for policy $\pi$ is defined as:

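$$r^{\pi}(t) = \mathrm{Cpld}(i^*) - \frac{1}{2}\left[\mathrm{Cpld}(a_t) + \mathrm{Cpld}(b_t)\right] \tag{9}$$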
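To make the Copeland definitions concrete, the sketch below computes Copeland scores, the Copeland winner and a per-round regret for a five-arm example that has no Condorcet winner. The regret form is an assumption: the winner's score minus the average score of the two arms played, divided by $K - 1$ so that it lies in $[0, 1]$, which is a commonly used convention. Arms are indexed from 0, and the function names and the example matrix are hypothetical.

```python
def copeland_scores(P):
    """Number of other arms that each arm beats (p_ij > 1/2)."""
    K = len(P)
    return [sum(1 for j in range(K) if j != i and P[i][j] > 0.5)
            for i in range(K)]


def copeland_winner(P):
    """Index of the unique arm with the maximum Copeland score."""
    scores = copeland_scores(P)
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    if len(winners) != 1:
        raise ValueError("Copeland winner is not unique")
    return winners[0]


def copeland_regret(P, pair):
    """Assumed regret form: normalised gap to the average score of the pair."""
    scores = copeland_scores(P)
    i, j = pair
    return (max(scores) - 0.5 * (scores[i] + scores[j])) / (len(P) - 1)


# Arm 0 beats arms 1, 2, 3 but loses to arm 4, so there is no Condorcet winner.
P_example = [
    [0.5, 0.6, 0.6, 0.6, 0.4],
    [0.4, 0.5, 0.6, 0.4, 0.6],
    [0.4, 0.4, 0.5, 0.6, 0.6],
    [0.4, 0.6, 0.4, 0.5, 0.6],
    [0.6, 0.4, 0.4, 0.4, 0.5],
]
print(copeland_scores(P_example), copeland_winner(P_example))
```

Under this convention, playing the winner against itself incurs zero regret, which is the pair a good policy should converge to.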

Throughout our discussion we assume the existence of a unique winner. As in a standard MAB algorithm, a decision rule plays an arm pair depending on previously played arm pairs and observed rewards. A decision policy $\pi$ is any policy such that the event $\{(a_t, b_t) = (i, j)\}$ belongs to the $\sigma$-field generated by $(a_1, b_1, X_1, \dots, a_{t-1}, b_{t-1}, X_{t-1})$.

As in the MAB problem, any K-armed dueling bandit algorithm aims to find a policy which minimises the cumulative regret, formally:

$$\pi^* = \arg\min_{\pi} \sum_{t=1}^{T} \mathbb{E}\left[r^{\pi}_{\mathcal{P}}(t)\right] \tag{10}$$

where $\mathcal{P}$ can be any type of problem, such as Copeland, Condorcet or Borda. In the next section we discuss our Sup-KLUCB algorithm, which solves the K-armed dueling bandit problem by converting it into a standard MAB problem.

## 3 Sup-KLUCB Algorithm

We now introduce the Sup-KLUCB algorithm, which is applicable to any $K$-armed dueling bandit problem with a single winner. We first define the notation required by our algorithm.

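We define $Sup(i)$ to be any score for arm $i$ which is used to define a single winner, i.e. for Borda problems $Sup(i)$ is the Borda score of arm $i$, and for Condorcet and Copeland problems $Sup(i)$ is the Copeland score of arm $i$. In other words, Sup (short for Superior) is a measure used to rank the arms according to a fixed criterion. We require $Sup(i) \in [0, 1]$; if it lies in some general interval $[a, b]$, it can be normalised to $[0, 1]$. We assume that our problem has only a single winner and, w.l.o.g., that arm 1 is the winner, i.e. $Sup(1) \ge Sup(i)$ for all $i \in \mathcal{A}$, with $Sup(1) = Sup(i)$ only when $i = 1$.

Let us define the set of arm pairs $\mathcal{B} \subseteq \mathcal{A} \times \mathcal{A}$, with cardinality $|\mathcal{B}| = K(K+1)/2$, where $|\cdot|$ is the cardinality of a set. We define a bijective function $f : \mathcal{B} \to \mathcal{M}$, where $\mathcal{M} = \{1, \dots, |\mathcal{B}|\}$. Note that we allow any bijective function, without fixing the exact order of the mapping between elements of $\mathcal{B}$ and elements of $\mathcal{M}$. We denote the inverse of $f$ by $f^{-1}$. Now for $m \in \mathcal{M}$ with $(i, j) = f^{-1}(m)$, we define $\mu_m = Sup(i) \cdot Sup(j)$, where $\cdot$ represents the scalar product of $Sup(i)$ and $Sup(j)$.

Sup-KLUCB converts the $K$-armed dueling bandit problem to a single MAB problem with $|\mathcal{B}|$ arms. Each arm pair in $\mathcal{B}$ is considered as a single arm with expected mean $\mu_m$. Note that for $i \ne j$, if $(i, j) \in \mathcal{B}$ then $(j, i) \notin \mathcal{B}$. This is because including the arm pair $(j, i)$ would not bring any new information, as $p_{ji} = 1 - p_{ij}$. One can argue that the arm pair $(i, i)$ also does not bring any new information, as $p_{ii}$ is fixed, but any decision-making rule like RUCB or DTS plays $(1, 1)$ after enough exploration (say arm 1 is the winner). We are now ready to explain the Sup-KLUCB algorithm.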
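The conversion of arm pairs into single arms can be sketched as follows. This is one possible reading of the construction, not the authors' code: unordered pairs (including each arm paired with itself) are enumerated once, each pair receives an index, and the mean of a pair is assumed to be the product of the Sup scores of its two arms. Arms are indexed from 0 and all names are hypothetical.

```python
from itertools import combinations_with_replacement


def build_pair_arms(K):
    """Enumerate unordered arm pairs (i, j) with i <= j and index them from 0."""
    pairs = list(combinations_with_replacement(range(K), 2))
    index_of = {pair: m for m, pair in enumerate(pairs)}
    return pairs, index_of


def pair_means(sup):
    """Assumed mean of each pair arm: product of the Sup scores of its two arms."""
    pairs, _ = build_pair_arms(len(sup))
    return [sup[i] * sup[j] for i, j in pairs]


# Three arms with Sup scores in [0, 1]; arm 0 is the unique winner.
sup_example = [1.0, 0.5, 0.75]
pairs_example, index_example = build_pair_arms(len(sup_example))
means_example = pair_means(sup_example)
best_pair = pairs_example[means_example.index(max(means_example))]
print(len(pairs_example), best_pair)
```

With a unique winner whose score is strictly the largest, the pair arm with the highest mean is the winner paired with itself, so a standard MAB algorithm run over the pair arms converges to playing that pair.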

We use following notations: is horizon, is round index, are hyper parameters. For , is the number of times arm pair has been played, is the number of times arm won over arm . is reward for played arm pair at each round . We denote our selected arm in each round as .

Having given a broad picture of the algorithm and defined the notation, we now show that for any $K$-armed dueling bandit problem, the arm that a competent algorithm declares as the winner is also the winner under the Sup-KLUCB algorithm.

###### Theorem 1.

Given that there exists a unique winner, an arm is the winner of the Copeland $K$-armed dueling bandit problem if and only if it is also the winner under the Sup-KLUCB algorithm.

###### Proof.

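The winner is unique, i.e. $Sup(1) = Sup(i)$ if and only if $i = 1$, for all $i \in \mathcal{A}$. If arm 1 is the winner of the $K$-armed dueling bandit problem then $Sup(1) > Sup(i)$ for all $i \ne 1$, and for the Copeland problem $Sup(i)$ is the Copeland score of arm $i$ (normalised to $[0, 1]$). Since $Sup(i) \in [0, 1]$ for all $i$, we have $Sup(1) \cdot Sup(1) > Sup(i) \cdot Sup(j)$ for all $(i, j) \ne (1, 1)$. Thus we have an arm $m^* \in \mathcal{M}$ such that $f^{-1}(m^*) = (1, 1)$ and $\mu_{m^*} > \mu_m$ for all $m \ne m^*$. For any stochastic bounded MAB problem, KLUCB has been proved to declare the true winner asymptotically. Thus for our problem, considering all arms in $\mathcal{M}$ to be independent, KLUCB will declare the winner with maximum reward, i.e. $m^*$. Thus a winner in the $K$-armed dueling bandit problem is also the winner in the Sup-KLUCB algorithm. We now prove the converse by contraposition. By our uniqueness assumption, if arm $i$ is not a winner, then $Sup(i) < Sup(1)$ and thus for the arm $m$ such that $f^{-1}(m) = (i, i)$, $\mu_m < \mu_{m^*}$. Hence $i$ will not be declared the winner by KLUCB, being sub-optimal. This concludes our proof. ∎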

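We define a function $g : \mathcal{A} \to [a, b]$, where $a < b$, that defines a unique winner $i^*$ such that $i^* = \arg\max_{i} g(i)$ (like the Copeland winner, Borda winner etc.) or $i^* = \arg\min_{i} g(i)$ (when the criterion is the arm with the minimum number of losses).

###### Lemma 1.1.

Given any function $g$ (winner criterion) as defined above, an arm is the winner of the $K$-armed dueling bandit problem if and only if it is also the winner under the Sup-KLUCB algorithm.

###### Proof.

If the function $g$ is such that the winner acquires the maximum value, i.e. $i^* = \arg\max_{i} g(i)$, then for any arm $i$ we can define $Sup(i) = (g(i) - a)/(b - a)$. Now we have $Sup(i) \in [0, 1]$ and $Sup(i^*) > Sup(i)$ for all $i \ne i^*$. Hence the result follows from Theorem 1. If the function $g$ is such that the winner acquires the minimum value, i.e. $i^* = \arg\min_{i} g(i)$, then for any arm $i$ we can define $Sup(i) = (b - g(i))/(b - a)$. Again we have $Sup(i) \in [0, 1]$ and $Sup(i^*) > Sup(i)$ for all $i \ne i^*$. Hence the result follows from Theorem 1. ∎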

We now explain Sup-KLUCB algorithm. In Algorithm 1, for Copeland problems, we experimentally found and . Now we explain each step of algorithm 1. Steps 1-4 are run once for each arm. In step 6 and 7, we calculate for all and for all respectively. In step 8, we select arm with highest upper confidence calculated using KL divergence . Arm pair is played and we declare winner at round with the arm having maximum value.

## 4 Experiments

We performed Monte Carlo simulations to evaluate the performance of our algorithm. For the Copeland problem, we compared our algorithm with the state-of-the-art DTS algorithm and with RUCB, because it is a UCB-based algorithm. For our Monte Carlo experiment, we chose the number of arms randomly from 3 to 36. We played 25 games with 25 iterations each, with each game played for up to 100,000 time steps. The preference (probability) matrix was generated randomly. Our only assumption is that the winner must be unique. In the figure below, we show the average cumulative regret of the different algorithms. We only show confidence interval, as a higher number would engulf almost the whole graph.
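One way to generate such problem instances is rejection sampling: draw a random preference matrix and keep it only if its Copeland winner is unique. The sketch below illustrates this under stated assumptions; the text does not specify the sampling distribution, so uniform entries are assumed here, arms are indexed from 0, and all function names are hypothetical.

```python
import random


def random_preference_matrix(K, rng):
    """Random K x K preference matrix with p_ii = 1/2 and p_ji = 1 - p_ij."""
    P = [[0.5] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1, K):
            p = rng.random()       # assumed: entries drawn uniformly from [0, 1)
            P[i][j] = p
            P[j][i] = 1.0 - p
    return P


def has_unique_copeland_winner(P):
    """True if exactly one arm attains the maximum Copeland score."""
    K = len(P)
    scores = [sum(1 for j in range(K) if j != i and P[i][j] > 0.5)
              for i in range(K)]
    return scores.count(max(scores)) == 1


def sample_problem(K, rng, max_tries=10000):
    """Resample until the matrix has a unique Copeland winner."""
    for _ in range(max_tries):
        P = random_preference_matrix(K, rng)
        if has_unique_copeland_winner(P):
            return P
    raise RuntimeError("no matrix with a unique Copeland winner was found")


rng = random.Random(0)
K = rng.randint(3, 36)             # number of arms chosen from 3 to 36
P_sampled = sample_problem(K, rng)
print(K, has_unique_copeland_winner(P_sampled))
```

Seeding the generator makes each game reproducible, which is convenient when the same instance has to be replayed for several algorithms.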

In figures (a) and (b), the dashed line is RUCB, the dotted line is the current state-of-the-art DTS, and the solid line is our algorithm, Sup-KLUCB. From figure (a), we can infer that Sup-KLUCB outperforms both RUCB and DTS.

Further, we analysed performance with respect to the number of arms. For this we simulated 100 different games, each iterated 25 times for up to 100,000 time steps. Figure (b) shows that the performance of Sup-KLUCB relative to DTS improves as we increase the number of arms, and that for smaller numbers of arms Sup-KLUCB performs on par with DTS.

## 5 Conclusion

In this paper we proposed a general framework for converting the dueling bandit problem to a MAB problem. This framework has a wide scope of application in terms of problem type: with minor changes in the objective function, it can be used for a variety of problems such as Copeland, Condorcet and Borda. Using KLUCB, we further extended UCB to dueling bandits and outperformed the current state-of-the-art algorithm for the Copeland bandit problem.

Our future work includes a detailed mathematical analysis of the regret, including upper and lower bounds. Using this analysis, we need to firmly establish the hyper-parameters used in our algorithm.

## References

- [1] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 2012.
- [2] Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
- [3] Yanan Sui and Joel Burdick. Clinical online recommendation with subgroup rank feedback. In Proceedings of the 8th ACM Conference on Recommender systems, pages 289–292. ACM, 2014.
- [4] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
- [5] Masrour Zoghi, Zohar Shay Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. CoRR, abs/1506.00312, 2015.
- [6] Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse dueling bandits. AISTATS, 2015.
- [7] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- [8] Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM journal on control and optimization, 35(3):715–743, 1997.
- [9] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- [10] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual Conference On Learning Theory, pages 359–376, 2011.
- [11] Yanan Sui, Masrour Zoghi, Katja Hofmann, and Yisong Yue. Advancements in dueling bandits. In IJCAI, pages 5502–5510, 2018.
- [12] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. ICML, 2014.
- [13] Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
- [14] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
- [15] Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017.