PAC-Battling Bandits with Plackett-Luce: Tradeoff between Sample Complexity and Subset Size
We introduce the probably approximately correct (PAC) version of the problem of Battling Bandits with the Plackett-Luce (PL) model – an online learning framework where, in each trial, the learner chooses a subset of $k$ arms from a fixed pool of $n$ arms and subsequently observes stochastic feedback indicating preference information over the items in the chosen subset, e.g., the most preferred item, or a ranking of the top $m$ most preferred items. The objective is to recover an ‘approximate-best’ item of the underlying PL model with high probability. This framework is motivated by practical settings such as recommendation systems and information retrieval, where it is easier and more efficient to collect relative feedback for multiple arms at once. Our framework can be seen as a generalization of the well-studied PAC-Dueling-Bandit problem over $n$ arms Szörényi et al. (2015); Busa-Fekete et al. (2013). We propose two different feedback models: just the winner information (WI), and a ranking of the top $m$ items (TR), for any $m \le k$. We show that with just the winner information (WI), one cannot recover the ‘approximate-best’ item with sample complexity smaller than $\Omega\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$, which is independent of $k$ and the same as that required in the standard dueling-bandit setting ($k = 2$). However, with top-$m$ ranking (TR) feedback, our lower bound analysis proves an improved sample complexity guarantee of $\Omega\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$, a relative improvement by a factor of $m$ compared to WI feedback, rightfully justifying the additional information gained from the ranking of the top $m$ items. We also provide algorithms for each of the above feedback models; our theoretical analyses prove the optimality of their sample complexities, which match the derived lower bounds (up to logarithmic factors).
The problem of Dueling Bandits has recently gained much attention in the machine learning community Yue et al. (2012); Zoghi et al. (2013); Ailon et al. (2014); Zoghi et al. (2014a, b, 2015); Ramamohan et al. (2016). This is an online learning problem generalizing the standard multi-armed bandit Auer et al. (2002), where learning proceeds in rounds: in each round, the learner selects a pair of arms and observes stochastic feedback revealing the winner of the comparison (duel) between the selected arms.
In this work, we introduce a natural generalization of the Dueling Bandits problem where the learner, instead of choosing two arms, selects a subset of up to $k$ arms in each round, and observes stochastic subset-wise feedback generated from an underlying Plackett-Luce model. We term this the problem of Battling Bandits with the Plackett-Luce model. Such settings commonly occur in various application domains where it is often easier for customers or patients to give a single piece of feedback on a set of offered options (products or medical treatments), as opposed to comparing only two options at a time.
Similar to the regret minimization objective of Dueling Bandits Zoghi et al. (2014a); Ramamohan et al. (2016), one goal here could be to learn to play an appropriately defined best arm as often as possible, maintaining the online exploration-exploitation tradeoff. However, we focus on the ‘pure-exploration’ version of the problem, where the objective is to identify a near-optimal item with high confidence, formally introduced as the PAC-WI objective: the goal is to find an $\epsilon$-approximation of the ‘Best Item’ with probability at least $1 - \delta$ (Section 3.2).
The main objective of this work is to understand the influence of the subset size $k$ on the PAC-WI objective under different feedback models, e.g. winner information (WI) or top-$m$ ranking (TR). More precisely, we seek answers to the following questions: Does the flexibility of playing size-$k$ subsets help gather information faster? What kind of subset-wise feedback helps learning proceed faster? Is it possible to detect the best arm more quickly in the Battling framework than in the Dueling framework? How does the learning rate scale with the subset size $k$? Does ranking feedback aid faster information accumulation compared to winner feedback? This paper takes a step towards addressing such questions for the Plackett-Luce (PL) choice model. The specific contributions of this paper are summarized below.
We introduce the PAC version of Battling Bandits over $n$ arms, a natural generalization of the PAC-Dueling-Bandits problem Yue and Joachims (2011); Szörényi et al. (2015) where, instead of playing two arms, the learner is allowed to play a subset of up to $k$ arms, for some $2 \le k \le n$. Upon playing a subset, the environment generates stochastic feedback – either only the winner of the subset, or a ranking of the top $m$ items of the subset. The objective is to find an $\epsilon$-approximation of the ‘Best Item’ with probability at least $1 - \delta$, defined as the PAC-WI objective, with the minimum possible sample complexity (see Section 3.2 for details).
Winner Information (WI) feedback: We first consider the subset-wise feedback model with only winner information (WI), which allows the learner to play subsets $S$ of $k$ distinct elements, upon which the winner of $S$ is drawn according to an underlying Plackett-Luce model (with parameters unknown to the learner). Our results show that, even with the flexibility of playing larger sets of size $k > 2$, the sample complexity lower bound for achieving the $(\epsilon, \delta)$-PAC-WI objective is still $\Omega\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$ (Theorem 4.1), which is in fact the same as that for the Dueling-Bandit case $k = 2$ (Szörényi et al., 2015), implying that with just the winner information one cannot hope for a faster rate of information aggregation in the current setup. We also design two algorithms (Section 4.2) for the PAC-WI objective and analyze their sample complexity guarantees, which turn out to be optimal within a logarithmic factor of the lower bound derived earlier. We further analyze the setting where the learner is allowed to play subsets of size up to $k$ – a slightly more flexible setting than the earlier one – and in this case we provide an algorithm with optimal sample complexity.
Top Ranking (TR) feedback: We then generalize the WI feedback model to top-$m$ ranking information (TR). As before, the learner plays subsets of $k$ distinct elements, but now the ranking of the top $m$ items ($m \le k$) of the subset is revealed to the learner, drawn according to the underlying Plackett-Luce model (Section 3.1). In this case, we show a sample complexity lower bound of $\Omega\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$ for achieving the PAC-WI objective (Theorem 5.1), which implies that with top-$m$ ranking (TR) feedback it is actually possible to aggregate information $m$ times faster than with just the winner information (WI). We also present two algorithms (Section 5.2) for this purpose along with their sample complexity guarantees, which match the corresponding lower bound up to logarithmic factors, proving their optimality yet again.
Related Work: The $k$-subset generalization of Dueling Bandits has not been studied in the literature before, and we are the first to introduce the framework of Battling Bandits. Hence there is almost no prior work that addresses the current problem setting, although the problem of recovering the score parameters of the Plackett-Luce model is well studied in the offline (batch) setting Chen and Suh (2015); Khetan and Oh (2016); Jang et al. (2017).
There is also a good amount of work studying the PAC ‘best-arm’ identification problem in the classical multi-armed bandit (MAB) setting Even-Dar et al. (2006); Audibert and Bubeck (2010); Karnin et al. (2013). Some recent developments on PAC-Dueling-Bandits for different pairwise preference models have also been made, e.g. preferences with stochastic triangle inequality and strong stochastic transitivity Yue and Joachims (2011), general utility-based preference models Urvoy et al. (2013), the Plackett-Luce model Szörényi et al. (2015), the Mallows model Busa-Fekete et al. (2014a), etc. A few recent works in the PAC setting also focus on learning objectives other than identifying the approximate single ‘best arm’, e.g. recovering the set of top-$m$ best arms Kalyanakrishnan et al. (2012); Busa-Fekete et al. (2013); Mohajer et al. (2017); Chen et al. (2017), or the true underlying ranking of the items Busa-Fekete et al. (2014b); Falahatgar et al. (2017). In very recent work, Chen et al. (2018) addressed the problem of recovering the top-$m$ items of a Plackett-Luce model in an active setting with the subset-wise winner feedback model.
There is also related literature on dynamic assortment optimization, where the goal is to offer a subset of items to customers in order to maximize expected revenue. The demand for any item depends on the substitution behaviour of the customers, captured by a choice model. The problem has been studied under different choice models — e.g. multinomial logit Talluri and Van Ryzin (2004), Mallows and mixtures of Mallows Désir et al. (2016a), Markov chain-based choice models Désir et al. (2016b), the single transition model Nip et al. (2017), general discrete choice models Berbeglia and Joret (2016), etc. A related bandit setting has been studied as the MNL-Bandits problem Agrawal et al. (2016), although it takes items’ prices into account, due to which its notion of the “best item” differs from ours (the Condorcet winner), and the two settings are in general incomparable.
Organization: We give the necessary preliminaries in Section 2. The formal problem statement of Battling Bandits with the Plackett-Luce model is introduced in Section 3, covering both the winner information (WI) and top-$m$ ranking (TR) feedback models (Section 3.1) and the PAC learning objective (Section 3.2). In Section 4, we analyze the problem with winner information (WI) feedback: we first derive the fundamental performance limits via a sample complexity lower bound for the proposed framework (Section 4.1), following which we present three algorithms along with their theoretical performance guarantees (Section 4.2). Section 5 analyzes the same for the more general top-$m$ ranking (TR) feedback model: we present the sample complexity lower bound in Section 5.1, along with algorithms whose guarantees match the lower bound in Section 5.2. Conclusions and directions for future work are discussed in Section 6. All proofs are deferred to the appendix.
Notations. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$. For any subset $S \subseteq [n]$, $|S|$ denotes the cardinality of $S$. $\boldsymbol{\Sigma}_S$ denotes the set of permutations over the items of $S$, where for any permutation $\sigma \in \boldsymbol{\Sigma}_S$, $\sigma(i)$ denotes the element at the $i$-th position of $\sigma$, $i \in [|S|]$. $\mathbf{1}(\varphi)$ denotes an indicator variable that takes the value $1$ if the predicate $\varphi$ is true, and $0$ otherwise.
2.1 Discrete Choice Models
A discrete choice model specifies the relative preferences over two or more discrete alternatives in a given set, such as selecting the most preferred item from a given set of items, rating the top $m$ most preferred movies, ranking candidates in an election, etc. Ben-Akiva and Lerman (1985).
RUM based Choice Models. One of the most popularly studied classes of discrete choice models is that of Random Utility Models (RUMs), which assume an underlying ground-truth utility score $\theta_i \in \mathbb{R}$ for each item $i \in [n]$, and assign a conditional distribution $\mathcal{D}_i(\cdot \mid \theta_i)$ for scoring item $i$. E.g., to model winner information feedback for a given set $S$, one first draws a random utility score $X_i \sim \mathcal{D}_i(\cdot \mid \theta_i)$ for each $i \in S$, and selects as the winner the item whose realized score is maximal among all items in $S$, i.e. item $i$ wins with probability $Pr\big(i = \arg\max_{j \in S} X_j\big)$.
One widely used RUM is the Multinomial-Logit (MNL), famously also called the Plackett-Luce (PL) model, where the $X_i$s follow independent Gumbel distributions with location parameters $\theta'_i$ Azari et al. (2012), such that $Pr(X_i \le x) = e^{-e^{-(x - \theta'_i)}}$, $x \in \mathbb{R}$. Writing $\theta_i := e^{\theta'_i}$, it can be shown that in this case
$$Pr(i \mid S) = \frac{\theta_i}{\sum_{j \in S} \theta_j}, \quad \forall i \in S. \qquad (1)$$
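As a concrete illustration, the PL winner distribution can be simulated with the Gumbel-max trick in its equivalent exponential-race form. The sketch below is ours, not code from the paper, and the name `pl_winner` is illustrative:

```python
import random

def pl_winner(theta, subset, rng):
    """Sample the winner of `subset` under a PL model with scores `theta`:
    item i wins with probability theta[i] / sum_{j in subset} theta[j].
    Implemented as an exponential race (equivalent to the Gumbel-max trick):
    the fastest of independent Exp(theta[i]) clocks is item i with exactly
    that probability."""
    return min(subset, key=lambda i: rng.expovariate(theta[i]))

# Empirical check that winner frequencies match the PL probabilities.
rng = random.Random(0)
theta = {0: 1.0, 1: 2.0, 2: 4.0}
S = [0, 1, 2]
trials = 20000
freq2 = sum(pl_winner(theta, S, rng) == 2 for _ in range(trials)) / trials
# Target probability: theta[2] / (theta[0] + theta[1] + theta[2]) = 4/7
```

With enough trials, `freq2` concentrates around $4/7$, matching the choice probability in (1).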
Clearly, any PL-based choice model induces a total ordering on the arm set $[n]$: if $p_{ij} := Pr(i \mid \{i, j\})$ denotes the pairwise probability of item $i$ being preferred over item $j$, then $p_{ij} \ge \frac{1}{2}$ if and only if $\theta_i \ge \theta_j$; in other words, if $\theta_i \ge \theta_j$ and $\theta_j \ge \theta_l$, then $p_{il} \ge \frac{1}{2}$, for any $i, j, l \in [n]$ Ramamohan et al. (2016).
Similarly, other alternative families of discrete choice models can be obtained by imposing different distributions over the utility scores $X_i$; e.g. if $(X_1, \ldots, X_n)$ are jointly normal with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, the corresponding RUM-based choice model reduces to the Multinomial Probit (MNP), although, unlike MNL, the choice probabilities of the MNP model do not have a closed-form expression Vojacek et al. (2010).
2.2 Choice models with IIA property
A choice model is said to have Independence of Irrelevant Alternatives (IIA) if the ratio of the probabilities of choosing any two items, say $i_1$ and $i_2$, from a choice set is independent of the other alternatives present in the set Benson et al. (2016). More specifically, $\frac{Pr(i_1 \mid S_1)}{Pr(i_2 \mid S_1)} = \frac{Pr(i_1 \mid S_2)}{Pr(i_2 \mid S_2)}$ for any two choice sets $S_1$ and $S_2$ such that $i_1, i_2$ are present in both $S_1$ and $S_2$. One example of such a choice model is the PL model: from (1), it is easy to note that if $Pr(\cdot \mid \cdot)$ is a PL model with parameters $\boldsymbol{\theta}$, then $\frac{Pr(i_1 \mid S)}{Pr(i_2 \mid S)} = \frac{\theta_{i_1}}{\theta_{i_2}}$ for all $S$ containing $i_1$ and $i_2$.
This property turns out to be very effective in estimating the unknown parameters of the PL model by applying Rank Breaking: the idea of extracting a set of pairwise comparisons from partial rankings over a given set of items and applying pairwise estimators to the obtained pairs, treating each comparison independently Khetan and Oh (2016). We exploit exactly this property of the PL model in our proposed algorithms (see Algorithms 1-3) for deriving their correctness and sample complexity guarantees.
Lemma 1 (IIA property of the PL model). Consider a PL choice model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, and fix two distinct items $i, j \in [n]$. Suppose $T$ rounds of battle are played on sets $S_1, \ldots, S_T$ such that $i, j \in S_t$ for all $t \in [T]$, and let $w_t \in S_t$ denote the winner of the $t$-th battle. Define the random variables $Y_t := \mathbf{1}(w_t = i)$, conditioned on the event $\{w_t \in \{i, j\}\}$, $t \in [T]$; then each $Y_t$ follows a Bernoulli distribution with parameter $\frac{\theta_i}{\theta_i + \theta_j}$.
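The lemma can be sanity-checked numerically. The sketch below (ours; helper names are illustrative) plays battles on sets of varying composition that all contain items $i$ and $j$, and verifies that, conditioned on the winner being one of the two, item $i$ wins with frequency close to $\theta_i / (\theta_i + \theta_j)$, regardless of the other set members:

```python
import random

def pl_winner(theta, subset, rng):
    # PL winner: item i wins with probability theta[i] / sum over subset.
    return min(subset, key=lambda i: rng.expovariate(theta[i]))

def conditional_win_rate(theta, sets, i, j, rng):
    """Fraction of battles won by i among those whose winner is i or j.
    By the IIA property this estimates theta[i] / (theta[i] + theta[j])."""
    wins_i = wins_ij = 0
    for S in sets:
        w = pl_winner(theta, S, rng)
        if w == i or w == j:
            wins_ij += 1
            wins_i += (w == i)
    return wins_i / wins_ij

rng = random.Random(0)
theta = {0: 1.0, 1: 3.0, 2: 5.0, 3: 2.0}
# Battles on sets of different compositions and sizes, all containing 0 and 1.
sets = [[0, 1, 2], [0, 1, 3], [0, 1, 2, 3]] * 10000
rate = conditional_win_rate(theta, sets, 0, 1, rng)
# Expected: theta[0] / (theta[0] + theta[1]) = 1/4, independent of the sets.
```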
3 Problem Setup
We consider the probably approximately correct (PAC) version of the sequential (online) decision-making problem of finding the best item of a fixed set of $n$ items from subset-wise comparisons – a generalization of the well-studied PAC-Dueling-Bandit problem Zoghi et al. (2014a); Szörényi et al. (2015), which learns the best item from pairwise comparisons. More formally, the learner is given a finite set of $n$ arms, $[n]$. At each round $t$, the learner selects a set $S_t \subseteq [n]$ of $k$ distinct items, upon which the environment reveals stochastic feedback over $S_t$ according to an underlying Plackett-Luce (PL) choice model with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$ a priori unknown to the learner. Without loss of generality we will henceforth assume $\theta_1 = 1$, since PL choice probabilities are scale independent, as evident from (1), and also $\theta_i < 1$ for all $i > 1$ (i.e. item 1 has the unique highest score). We call this problem Battling Bandits (BB) with the Plackett-Luce (PL) model, or BB-PL in short.
‘Best Item’. We define the ‘Best Item’ $a^*$ to be the one with the highest score parameter: $a^* := \arg\max_{i \in [n]} \theta_i$. Clearly, under the above-defined PL model, $a^* = 1$. Note that here we have $p_{1i} = \frac{\theta_1}{\theta_1 + \theta_i} > \frac{1}{2}$ for all $i \ne 1$, which implies that item $1$ is the Condorcet winner Ramamohan et al. (2016) of the PL model.
3.1 Feedback models: Winner Information (WI), Full Ranking (FR), Top-$m$ Ranking (TR)
We now define the different types of PL-based feedback models considered here for the BB-PL problem:
Winner Information of the selected subset (WI): At each round $t$, the learner selects a set $S_t \subseteq [n]$ of exactly $k$ distinct items. Upon receiving $S_t$, the environment returns the winning item $i_t \in S_t$ with probability
$$Pr(i_t = i \mid S_t) = \frac{\theta_i}{\sum_{j \in S_t} \theta_j}, \quad \forall i \in S_t.$$
Full ranking of the selected subset of items (FR): The algorithm again selects a set $S_t \subseteq [n]$ of exactly $k$ distinct items at each round $t$. Upon receiving $S_t$, the environment returns a full ranking $\sigma_t \in \boldsymbol{\Sigma}_{S_t}$, with $\sigma_t(i)$ denoting the element at position $i$ in $\sigma_t$, such that
$$Pr(\sigma_t \mid S_t) = \prod_{i=1}^{k} \frac{\theta_{\sigma_t(i)}}{\sum_{j=i}^{k} \theta_{\sigma_t(j)}}.$$
Further generalizing the above two feedback models, we define a feedback model that reveals the ranking of only the top $m$ items, for any $m \in [k]$:
Top-$m$ ranking of items (TR): As before, the algorithm selects a set $S_t \subseteq [n]$ of exactly $k$ distinct items at each round $t$. Upon receiving $S_t$, the environment returns a (noisy) ranking of only the top $m$ elements of $S_t$: the learner observes a ranking $\sigma_t = (\sigma_t(1), \ldots, \sigma_t(m))$, with $\sigma_t(i)$ denoting the element at position $i$ in $\sigma_t$, such that
$$Pr(\sigma_t \mid S_t) = \prod_{i=1}^{m} \frac{\theta_{\sigma_t(i)}}{\sum_{j \in S_t \setminus \{\sigma_t(1), \ldots, \sigma_t(i-1)\}} \theta_j}.$$
Clearly one obtains back the full ranking (FR) feedback model using TR with $m = k$, and the winner information (WI) feedback using TR with $m = 1$.
3.2 Performance Objective: Correctness and Sample Complexity
We consider the following PAC learning objective, which aims to find an almost ($\epsilon$-approximate) ‘Best Item’ with probability at least $1 - \delta$, for some given $\epsilon, \delta \in (0, 1)$. More formally:
Probably Approximate Winner Item (PAC-WI): The objective here is to identify an almost ‘Best Item’, i.e. a ‘close competitor’ of the ‘Best Item’ in terms of the pairwise preference. We define this as the PAC-WI objective: the goal is to find an item $i$ such that $p_{1i} \le \frac{1}{2} + \epsilon$ with probability at least $1 - \delta$, i.e. item $i$ can be beaten by the best item with a margin of at most $\epsilon$ in a pairwise duel.
Note that the condition $p_{1i} \le \frac{1}{2} + \epsilon$ is equivalent to $p_{i1} \ge \frac{1}{2} - \epsilon$, which in turn is equivalent to $\theta_i \ge \frac{1 - 2\epsilon}{1 + 2\epsilon}\,\theta_1$, and vice versa.
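For completeness, the equivalence can be verified directly from the PL pairwise probabilities:

```latex
p_{i1} = \frac{\theta_i}{\theta_i + \theta_1} \;\ge\; \frac{1}{2} - \epsilon
\;\Longleftrightarrow\; 2\,\theta_i \;\ge\; (1 - 2\epsilon)(\theta_i + \theta_1)
\;\Longleftrightarrow\; (1 + 2\epsilon)\,\theta_i \;\ge\; (1 - 2\epsilon)\,\theta_1
\;\Longleftrightarrow\; \theta_i \;\ge\; \frac{1 - 2\epsilon}{1 + 2\epsilon}\,\theta_1 .
```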
4 Results for PAC-WI objective with Winner-Information (WI) feedback
We first approach the PAC-WI problem of finding the ‘probably approximate correct winner item’ (Sec. 3.2). We start by proving a lower bound on the sample complexity of finding the PAC-WI item, as given in Thm. 4.1. Following this, we propose two algorithms for this purpose along with their corresponding sample complexity bounds, which in fact match the above lower bound (up to logarithmic factors), proving the optimality of our proposed algorithms (Sec. 4.2). Towards the end, we also analyze the setting where the learner is allowed to play subsets of size up to $k$ – a slightly more flexible setting than the earlier case – and provide an algorithm with optimal sample complexity for it.
4.1 Lower Bound
Consider an instance of Battling Bandits with Plackett-Luce (BB-PL) with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, under the WI feedback model.
Definition 2 (($\epsilon, \delta$)-PAC Consistent Algorithms). We call an algorithm $\mathcal{A}$ $(\epsilon, \delta)$-PAC consistent if, for every instance of the BB-PL model, with probability at least $1 - \delta$ it outputs an arm $i$ such that $i$ is no more than $\epsilon$-worse than the best arm $1$; more formally, $p_{1i} \le \frac{1}{2} + \epsilon$, where $p_{ij}$ denotes the pairwise preference of arm $i$ over arm $j$ under the given instance of the BB-PL model.
We start by recalling Lemma 1 of Kaufmann et al. (2016), restated below for convenience. Consider an instance of the multi-armed bandit problem with arms $a \in \{1, \ldots, K\}$. At any round $t$, let $A_t$ and $Z_t$ respectively denote the arm played and the corresponding observed reward, and let $\mathcal{F}_t := \sigma(A_1, Z_1, \ldots, A_t, Z_t)$ denote the sigma-algebra of the observations up to round $t$.

Lemma 3 (Lemma 1, Kaufmann et al. (2016)). Let $\nu$ and $\nu'$ be two bandit models (sets of reward distributions over the $K$ arms), such that $\nu_a$ denotes the reward distribution of arm $a$ under model $\nu$, and for all arms $a$, the distributions $\nu_a$ and $\nu'_a$ are mutually absolutely continuous. Then for any almost-surely finite stopping time $\tau$ with respect to $(\mathcal{F}_t)_t$,
$$\sum_{a=1}^{K} \mathbb{E}_{\nu}[N_a(\tau)]\, KL(\nu_a, \nu'_a) \;\ge\; \sup_{\mathcal{E} \in \mathcal{F}_\tau} kl\big(Pr_{\nu}(\mathcal{E}), Pr_{\nu'}(\mathcal{E})\big),$$
where $kl(x, y) := x \ln\frac{x}{y} + (1 - x)\ln\frac{1 - x}{1 - y}$ is the binary relative entropy, $N_a(\tau)$ denotes the number of times arm $a$ is played in $\tau$ rounds, and $Pr_{\nu}$ and $Pr_{\nu'}$ denote the probabilities of any event under the bandit models $\nu$ and $\nu'$ respectively.
We now use the above result to derive the sample complexity lower bound for the problem of BB-PL with winner information (WI) feedback model given as follows:
Theorem 4 (Lower bound on sample complexity with WI feedback). Given fixed $\epsilon, \delta \in (0, 1)$, for any $(\epsilon, \delta)$-PAC consistent algorithm $\mathcal{A}$ for BB-PL with the WI feedback model, there exists a problem instance such that the expected sample complexity of $\mathcal{A}$ for identifying the PAC-WI item is at least $\mathbb{E}[N_{\mathcal{A}}] = \Omega\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$, where $N_{\mathcal{A}}$ denotes the number of rounds of battle executed by $\mathcal{A}$.
The above guarantee shows that the sample complexity of identifying the PAC-WI item under the WI feedback model is independent of the subset size $k$ – that is, multiwise ($k$-wise) comparisons, compared to the pairwise case ($k = 2$), do not change the difficulty of the underlying problem, as the required sample complexity for the PAC learning objective is the same for both. This might appear counter-intuitive, since the learner gets to see the winner of a larger set of items; however, it can be justified information-theoretically (as shown in the proof of Thm. 4.1): the winner of a $k$-element subset provides roughly ‘$k$ times less information per item’ than in the pairwise setting ($k = 2$).
4.2 Algorithms for feedback model WI
In this section, we propose algorithms for the PAC-WI objective under the winner information (WI) feedback model. Our first algorithm, Trace-the-Best, is a simple algorithm that outputs the PAC-WI item with sample complexity $O\big(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\big)$, which is in fact order-wise optimal when $\delta \le \frac{1}{n}$, as follows from our derived lower bound guarantee (Thm. 4.1). For the regime $\delta > \frac{1}{n}$, we propose our next algorithm, Divide-and-Battle, which gives an almost optimal sample complexity guarantee of $O\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$ (Thm. 4.2), matching the corresponding lower bound for any $\delta$, when the learner is restricted to playing sets of size exactly $k$. For the more flexible setting where the learner may play sets of any size up to $k$, we propose our third algorithm, Halving-Battle, which gives an optimal sample complexity bound of $O\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$ (Thm. 4.3). We now present detailed descriptions of the algorithms along with their sample complexity guarantees.
Algorithm 1. Trace-the-Best
Our first algorithm, Trace-the-Best, is based on the simple idea of tracking the superiority of the empirical best item seen so far. More specifically, it maintains a ‘running winner’ $r_\ell$ at every iteration $\ell$, and makes it battle with a set of $k - 1$ arbitrarily chosen new items. After battling long enough, if the empirical winner $c_\ell$ turns out to be sufficiently more favourable than the ‘running winner’ $r_\ell$ – i.e. it empirically beats $r_\ell$ with a margin of more than $\epsilon$ – then $c_\ell$ replaces $r_\ell$ as the new ‘running winner’; otherwise $r_\ell$ retains its place. The formal description of Trace-the-Best is given in Algorithm 1.
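The loop above can be sketched in simulation as follows. This is our simplified rendering, not the paper's exact pseudocode: the per-iteration budget `t` is taken as an input, the promotion rule is a plain win-margin test of $\epsilon t$, and `play(S)` stands for one round of WI feedback on $S$:

```python
import random

def pl_winner(theta, subset, rng):
    # One WI-feedback battle: PL winner of `subset` via exponential race.
    return min(subset, key=lambda i: rng.expovariate(theta[i]))

def trace_the_best(items, k, t, eps, play):
    """Keep a running winner r; battle it against k-1 fresh items for t
    rounds; promote the empirical challenger only if it out-wins r by
    more than an eps * t margin."""
    pool = list(items)
    r = pool.pop()
    while pool:
        group = [pool.pop() for _ in range(min(k - 1, len(pool)))]
        s = group + [r]
        wins = {i: 0 for i in s}
        for _ in range(t):
            wins[play(s)] += 1
        challenger = max(group, key=lambda i: wins[i])
        if wins[challenger] - wins[r] > eps * t:
            r = challenger
    return r

rng = random.Random(0)
theta = {i: 1.0 for i in range(10)}
theta[4] = 2.0                      # item 4 is the clear best arm
best = trace_the_best(list(range(10)), k=3, t=2000, eps=0.05,
                      play=lambda s: pl_winner(theta, s, rng))
```

On well-separated instances such as this one, the running winner is caught by the best arm's group and then never displaced.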
Theorem 5 (Trace-the-Best: Correctness and Sample Complexity Analysis). Trace-the-Best (Algorithm 1) returns an $(\epsilon, \delta)$-PAC-WI item with sample complexity $O\big(\frac{n}{\epsilon^2}\ln\frac{n}{\delta}\big)$.
The sample complexity of Trace-the-Best is in fact order-wise optimal when $\delta \le \frac{1}{n}$, as follows from our derived lower bound guarantee (Thm. 4.1).
Note that when $\delta > \frac{1}{n}$, the sample complexity guarantee of Trace-the-Best is off by a factor of $\ln n$. We now propose another algorithm, Divide-and-Battle, that outputs a PAC-WI item with an improved sample complexity of $O\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$.
Algorithm 2. Divide-and-Battle
The algorithm Divide-and-Battle is based on the idea of dividing the set of items into groups of size $k$, playing each group long enough that the best item of each group stands out as the empirical winner, retaining the empirical winner of each group, and recursing until only a single item is left, which is finally declared the PAC-WI item. More specifically, it starts by dividing the initial item set $[n]$ into $\lfloor \frac{n}{k} \rfloor$ mutually exclusive and collectively exhaustive sets $\mathcal{G}_1, \mathcal{G}_2, \ldots$ of size $k$ each. In case the last set falls short of $k$ elements, those elements are stored in a set $\mathcal{R}$ of remaining elements to be analyzed later.
Each set $\mathcal{G}_g$ is then played for a fixed number of rounds $t$, and only the empirical winner of each group (the one that wins the battle the maximum number of times in the $t$ rounds) is retained in a set $\mathcal{S}$; the rest are discarded. The algorithm then recursively applies the same procedure to $\mathcal{S}$ until only a single best item survives, which is declared the PAC-WI item. Algorithm 2 presents the full details.
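A simplified simulation sketch of this recursion follows (ours, with a fixed per-group budget `t` in place of the paper's per-phase tuning; `play(S)` is one round of WI feedback on $S$):

```python
import random

def pl_winner(theta, subset, rng):
    # One WI-feedback battle: PL winner of `subset`.
    return min(subset, key=lambda i: rng.expovariate(theta[i]))

def divide_and_battle(items, k, t, play):
    """Partition the surviving items into groups of size at most k, play
    each group for t rounds, keep only each group's empirical winner,
    and recurse until a single item remains."""
    pool = list(items)
    while len(pool) > 1:
        survivors = []
        for g in range(0, len(pool), k):
            group = pool[g:g + k]
            if len(group) == 1:
                survivors.append(group[0])   # leftover advances directly
                continue
            wins = {i: 0 for i in group}
            for _ in range(t):
                wins[play(group)] += 1
            survivors.append(max(group, key=lambda i: wins[i]))
        pool = survivors
    return pool[0]

rng = random.Random(0)
theta = {i: 1.0 for i in range(8)}
theta[5] = 3.0                      # item 5 is the clear best arm
best = divide_and_battle(list(range(8)), k=3, t=2000,
                         play=lambda s: pl_winner(theta, s, rng))
```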
Theorem 6 (Divide-and-Battle: Correctness and Sample Complexity Analysis). Divide-and-Battle (Algorithm 2) returns an $(\epsilon, \delta)$-PAC-WI item with sample complexity $O\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$.
Note that Divide-and-Battle runs with an order-wise optimal sample complexity guarantee for any $\delta \in (0, 1)$, as follows from our derived lower bound guarantee (Thm. 4.1).
We next propose a slightly different WI feedback model, which allows the learner to play sets of any size up to $k$, and design an $(\epsilon, \delta)$-PAC-WI algorithm for this model that runs with an optimal sample complexity guarantee.
4.3 A slightly different winner information (WI) feedback model for improved sample complexity guarantee
Feedback model WI-2 is formally defined as follows:
Winner information with variable-sized sets (WI-2): Unlike the WI model, at each round $t$ the learner is allowed to select a set $S_t \subseteq [n]$ of any size up to $k$. Upon receiving $S_t$, the environment returns the winning item $i_t \in S_t$ such that
$$Pr(i_t = i \mid S_t) = \frac{\theta_i}{\sum_{j \in S_t} \theta_j}, \quad \forall i \in S_t.$$
The objective is the same as before: recovering an $(\epsilon, \delta)$-PAC-WI item. We now propose a Median-Elimination-based approach Even-Dar et al. (2006) that achieves the PAC objective with $O\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$ samples, as shown in Theorem 4.3.
Algorithm 3. Halving-Battle
We call the algorithm Halving-Battle since it is based on the idea of dividing the current set of items into two equal-sized halves and discarding the one that is unlikely to contain the best item, recursing until the set becomes a singleton. More formally, it first divides the entire set of $n$ items into groups $\mathcal{B}_1, \ldots, \mathcal{B}_G$ of size at most $k$ such that $\cup_{g} \mathcal{B}_g = [n]$ and $\mathcal{B}_g \cap \mathcal{B}_{g'} = \emptyset$ for $g \ne g'$. Each group is then played for a fixed number of rounds $t$, upon which only the items that won at least as often as the empirical median item of their group are retained in a set $\mathcal{S}$; the rest are discarded, with the hope that the win count of at least one $\epsilon$-best item is likely to stay above that of the median element. The algorithm then proceeds recursively on $\mathcal{S}$ until it is left with only a single element, which is returned as the PAC-WI item. The formal description of the entire algorithm is given in Algorithm 3.
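An illustrative sketch of the halving recursion (ours, not the paper's pseudocode; keeping each group's top half by win count serves as a proxy for 'beating the empirical median'):

```python
import random

def pl_winner(theta, subset, rng):
    # One battle under the WI-2 model: PL winner of `subset`.
    return min(subset, key=lambda i: rng.expovariate(theta[i]))

def halving_battle(items, k, t, play):
    """Partition surviving items into groups of size at most k, play each
    group for t rounds, and keep only each group's top half by win count,
    so the survivor set roughly halves every phase."""
    pool = list(items)
    while len(pool) > 1:
        survivors = []
        for g in range(0, len(pool), k):
            group = pool[g:g + k]
            if len(group) == 1:
                survivors.append(group[0])
                continue
            wins = {i: 0 for i in group}
            for _ in range(t):
                wins[play(group)] += 1
            top = sorted(group, key=lambda i: wins[i], reverse=True)
            survivors.extend(top[:(len(group) + 1) // 2])
        pool = survivors
    return pool[0]

rng = random.Random(0)
theta = {i: 1.0 for i in range(8)}
theta[2] = 4.0                      # item 2 is the clear best arm
best = halving_battle(list(range(8)), k=4, t=1500,
                      play=lambda s: pl_winner(theta, s, rng))
```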
Theorem 7 (Halving-Battle: Correctness and Sample Complexity Analysis). Halving-Battle (Algorithm 3) returns an $(\epsilon, \delta)$-PAC-WI item with sample complexity $O\big(\frac{n}{\epsilon^2}\ln\frac{1}{\delta}\big)$.
5 Results for PAC-WI objective with Top Ranking (TR) feedback
We first analyze the lower bound for BB-PL with the top-$m$ ranking (TR) feedback model (Theorem 5.1). Unlike for WI feedback, here we show that with TR the optimal sample complexity scales as $\Theta\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$ (Thm. 5.1), implying an $m$-times faster learning rate for the TR feedback model than for WI (Thm. 4.1). This is also intuitive, since here the ranking of the top $m$ items of the played set is revealed per round of battle, as opposed to just the (noisy) identity of a single best item as in WI. Following this, we also present two algorithms with near-optimal sample complexity guarantees (matching the lower bound up to logarithmic factors), as described in Section 5.2.
5.1 Lower Bound - Top-$m$ Ranking
Recall the definition of $(\epsilon, \delta)$-PAC consistent algorithms from Definition 2. We now derive the sample complexity lower bound for BB-PL with the top-$m$ ranking (TR) feedback model:
Theorem 8 (Sample Complexity Lower Bound for TR feedback model). Given fixed $\epsilon, \delta \in (0, 1)$ and $m \in [k]$, for any $(\epsilon, \delta)$-PAC consistent algorithm $\mathcal{A}$ for BB-PL with the top-$m$ ranking (TR) feedback model, there exists a problem instance such that the expected sample complexity of $\mathcal{A}$ for identifying the PAC-WI item is at least $\mathbb{E}[N_{\mathcal{A}}] = \Omega\big(\frac{n}{m\epsilon^2}\ln\frac{1}{\delta}\big)$, where $N_{\mathcal{A}}$ denotes the number of rounds of battle executed by $\mathcal{A}$.
The sample complexity lower bound for the PAC-WI objective for BB-PL with top-$m$ ranking (TR) feedback is $\frac{1}{m}$ times that of the WI model (Thm. 4.1). Intuitively, this reflects that the difficulty (in terms of the required sample complexity) of the PAC-WI problem for BB-PL decreases as the feedback model reveals more and more information about the relative preferences over the set of battling items.
Using Theorem 5.1 with $m = k$, we can derive the following sample complexity lower bound for BB-PL with the full ranking (FR) feedback model:
Given fixed $\epsilon, \delta \in (0, 1)$, for any $(\epsilon, \delta)$-PAC consistent algorithm $\mathcal{A}$ for BB-PL with the full ranking (FR) feedback model ($m = k$), there exists a problem instance such that the expected sample complexity of $\mathcal{A}$ for identifying the PAC-WI item is at least $\mathbb{E}[N_{\mathcal{A}}] = \Omega\big(\frac{n}{k\epsilon^2}\ln\frac{1}{\delta}\big)$, where $N_{\mathcal{A}}$ denotes the number of rounds of battle executed by $\mathcal{A}$.
5.2 Algorithms for the top-$m$ ranking feedback model
In this section, we propose algorithms for recovering the PAC-WI item under TR feedback for BB-PL. We achieve this by generalizing our two earlier algorithms (Sec. 4.2, Algorithms 1 and 2, for the WI feedback model) to the setting of top-$m$ ranking (TR) feedback. (Our third algorithm, Halving-Battle, is not applicable to the TR setting, as it allows the learner to play sets of sizes smaller than $m$, whereas TR feedback is defined only when the size of the played set is at least $m$; the lower bound analysis of Thm. 5.1 also does not apply to sets of size less than $m$.)
The main trick in exploiting TR feedback lies in correctly utilizing the information gain that comes with the top-$m$ ranking. For this purpose, we use the idea of rank breaking for the PL feedback model Soufiani et al. (2014), which essentially lets us extract pairwise comparisons from multiwise feedback information. More precisely, given any played set $S$ of size $k$, if $\sigma$ denotes an observed top-$m$ ranking of $S$, the idea of rank breaking is to consider each item in $S$ to be beaten, in a pairwise sense, by every item preceding it in the ranking $\sigma$. Our proposed algorithms thus maintain empirical pairwise preferences for each pair of items by rank breaking the TR feedback into multiple pairwise comparisons (see Algorithm 4 for details).
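The rank-breaking step itself is simple to state in code; the sketch below is ours (illustrative naming):

```python
def rank_break(ranking, played_set):
    """Decompose a top-m ranking of `played_set` into pairwise outcomes:
    each ranked item beats every item ranked after it, as well as every
    item of the set that did not make the top-m at all."""
    ranked = list(ranking)
    unranked = [i for i in played_set if i not in ranked]
    pairs = []
    for pos, winner in enumerate(ranked):
        for loser in ranked[pos + 1:] + unranked:
            pairs.append((winner, loser))   # `winner` beats `loser`
    return pairs

# Top-2 ranking (item 2 first, then item 0) of the played set {0, 1, 2, 3}:
pairs = rank_break([2, 0], [0, 1, 2, 3])
```

Each extracted pair can then be fed to the pairwise preference estimates exactly as under WI feedback, which is what makes the IIA-based analysis carry over.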
We will see next that it is precisely due to this information gain with top-$m$ ranking feedback that it now suffices to observe the battles of a particular set for only $\frac{1}{m}$ times as many rounds as under WI feedback, which essentially leads to the $m$-fold reduction in the sample complexity of the proposed algorithms (see Sec. 5.2 for the sample complexity guarantees). The formal descriptions of the two algorithms, Trace-the-Best and Divide-and-Battle, generalized to the setting of TR feedback, are given in Algorithms 5 and 6 respectively.
Before deriving the theoretical guarantees of the two algorithms, we state the following lemma, which follows from the structure of the top-$m$ ranking (TR) feedback (see Sec. 3.1 for details) and is crucially used to obtain the factor-$m$ improvement in the sample complexity guarantees of our proposed algorithms.
Lemma 10 (Rank-Breaking Update). Consider any subset $S \subseteq [n]$ with $|S| = k$. Suppose $S$ is played for $t$ rounds of battle, let $\sigma_\tau$ denote the top-$m$ ranking feedback at round $\tau \in [t]$, and for each item $i \in S$ define $n_i := \sum_{\tau = 1}^{t} \mathbf{1}(i \in \sigma_\tau)$, the number of times item $i$ appears in the top-$m$ ranking in the $t$ rounds. Then $\sum_{i \in S} n_i = tm$, i.e. rank breaking the $t$ rounds of top-$m$ feedback yields $m$ pairwise-usable item appearances per round.
With the top-$m$ ranking (TR) feedback model, Trace-the-Best (Algorithm 5) returns an $(\epsilon, \delta)$-PAC-WI item with sample complexity $O\big(\frac{n}{m\epsilon^2}\ln\frac{n}{\delta}\big)$.