TopRank: A practical algorithm for online stochastic ranking

Tor Lattimore
DeepMind
Branislav Kveton
Adobe Research
Shuai Li
The Chinese University of Hong Kong
Csaba Szepesvári
DeepMind

Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks from the user. Many sample-efficient algorithms have been proposed for this problem that assume a specific click model connecting rankings and user behavior. We propose a generalized click model that encompasses many existing models, including the position-based and cascade models. Our generalization motivates a novel online learning algorithm based on topological sort, which we call TopRank. TopRank (a) is more natural than existing algorithms, (b) has stronger regret guarantees than existing algorithms with comparable generality, (c) has a more insightful proof that leaves the door open to many generalizations, and (d) outperforms existing algorithms empirically.




Preprint. Work in progress.

1 Introduction

Learning to rank is an important problem with numerous applications in web search and recommender systems [13]. Broadly speaking, the goal is to learn an ordered list of K items from a larger collection of size L that maximizes the satisfaction of the user, sometimes conditioned on a query. This problem has traditionally been studied in the offline setting, where the ranking policy is learned from manually annotated relevance judgments. It has been observed that the feedback of users can be used to significantly improve existing ranking policies [2, 20]. This is the main motivation for online learning to rank, where the goal is to adaptively maximize the user satisfaction.

Numerous methods have been proposed for online learning to rank, both in the adversarial [15, 16] and stochastic settings. Our focus is on the stochastic setup, where recent work has leveraged click models to mitigate the curse of dimensionality that arises from the combinatorial nature of the action set. A click model describes how users click on items in rankings; such models are widely studied by the information retrieval community [3]. One of the more popular click models in learning to rank is the cascade model (CM), which assumes that the user scans the ranking from top to bottom, clicking on the first item they find attractive [8, 4, 9, 23, 12, 7]. Another model is the position-based model (PBM), where the probability that the user clicks on an item depends on its position and attractiveness, but not on the surrounding items [10].

The cascade and position-based models have relatively few parameters, which is both a blessing and a curse. On the positive side, a small model is easy to learn. More negatively, there is a danger that a simplistic model will have a large approximation error. In fact, it has been observed experimentally that no single existing click model captures the behavior of an entire population of users [6]. Zoghi et al. [22] recently showed that under reasonable assumptions a single online learning algorithm can learn the optimal list of items in a much larger class of click models that includes both the cascade and position-based models.

We build on the work of Zoghi et al. [22] and generalize it non-trivially in multiple directions. First, we propose a general model of user interaction where the problem of finding the most attractive list can be posed as a sorting problem with noisy feedback. An interesting characteristic of our model is that the click probability does not factor into the examination probability of the position and the attractiveness of the item at that position. Second, we propose an online learning algorithm for finding the most attractive list, which we call TopRank. The key idea in the design of the algorithm is to maintain a partial order over the items that is refined as the algorithm observes more data. The new algorithm is simultaneously simpler and more principled, and it empirically outperforms the algorithm of Zoghi et al. [22]. We also provide an analysis of the cumulative regret of TopRank that is simple, insightful and strengthens the results of Zoghi et al. [22], despite the weaker assumptions.

2 Online learning to rank

We assume that K ≤ L and that the collection of items is [L] = {1, …, L}. A permutation on a finite set X is an invertible function σ : X → X, and the set of all permutations on [L] is denoted by Σ. The set of actions A is the set of permutations in Σ, where for each a ∈ A the value a(k) should be interpreted as the identity of the item placed at the kth position. Equivalently, item i is placed at position a^{-1}(i). The user does not observe items in positions k > K, so the order of a(K+1), …, a(L) is not important and is included only for notational convenience. We adopt the convention throughout that i and j represent items while k represents a position.
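Concretely, an action can be represented as an array mapping positions to items, with an inverse lookup for item positions. This is a minimal sketch with hypothetical helper names; Python indexing is 0-based, while the paper's positions are 1-based.

```python
# A sketch of the ranking notation: a[k] is the item shown at position k.
# The helper name `inverse` is illustrative, not from the paper.

def inverse(a):
    """Return inv with inv[i] = position of item i under action a."""
    inv = [0] * len(a)
    for k, item in enumerate(a):
        inv[item] = k
    return inv

a = [2, 0, 3, 1, 4]    # item 2 at position 0, item 0 at position 1, ...
inv = inverse(a)
assert inv[3] == 2     # item 3 sits at position 2
```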

The online ranking problem proceeds over n rounds. In each round t the learner chooses an action A_t based on its observations so far and observes binary random variables C_{t1}, …, C_{tL}, where C_{ti} = 1 if the user clicked on item i. We assume a stochastic model where the probability that the user clicks on position k in round t only depends on A_t and is given by

P(C_{t A_t(k)} = 1 | A_t) = v(A_t, k),

with v : A × [L] → [0, 1] an unknown function. Another way of writing this is that the conditional probability that the user clicks on item i in round t is v(A_t, A_t^{-1}(i)).
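As an illustration, the position-based model is one instance of this general click model, with v(a, k) = α(a(k)) χ(k). The following sketch simulates clicks under it; the attractiveness and examination values are made up for the example.

```python
import random

# One concrete instance of the general click model: a position-based model
# v(a, k) = alpha(a(k)) * chi(k), used purely as an illustration.

def simulate_clicks(a, alpha, chi, rng):
    """Return click indicators C with C[i] = 1 iff item i was clicked.
    a[k] is the item at position k; positions k >= len(chi) are never
    examined, matching Assumption 1 (no clicks below position K)."""
    C = [0] * len(a)
    for k, item in enumerate(a):
        v = alpha[item] * chi[k] if k < len(chi) else 0.0
        if rng.random() < v:
            C[item] = 1
    return C

rng = random.Random(0)
alpha = [0.9, 0.7, 0.5, 0.3, 0.1]   # hypothetical attractiveness values
chi = [1.0, 0.8, 0.6]               # K = 3 examined positions
clicks = simulate_clicks([0, 1, 2, 3, 4], alpha, chi, rng)
```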

The performance of the learner is measured by the expected cumulative regret, R_n = n max_{a ∈ A} Σ_{k=1}^K v(a, k) − E[Σ_{t=1}^n Σ_{i=1}^L C_{ti}], which is the deficit suffered by the learner relative to the omniscient strategy that knows the optimal ranking in advance.

Remark 1.

We do not assume that the click indicators C_{t1}, …, C_{tL} are independent or that the user can only click on one item.

3 Modeling assumptions

In previous work on online learning to rank it was assumed that v factors as v(a, k) = α(a(k)) χ(a, k), where α : [L] → [0, 1] is the attractiveness function and χ(a, k) is the probability that the user examines position k given ranking a. Further restrictions are made on the examination function χ. For example, in the document-based model it is assumed that χ(a, k) = 1{k ≤ K}. In this work we depart from this standard by making assumptions directly on v. The assumptions are sufficiently relaxed that the model subsumes the document-based, position-based and cascade models, as well as the factored model studied by Zoghi et al. [21]. See the appendix for a proof of this. Our first assumption uncontroversially states that the user does not click on items they cannot see.

Assumption 1.

v(a, k) = 0 for all k > K.

Although we do not assume an explicit factorization of the click probability into attractiveness and examination functions, we do assume there exists an unknown attractiveness function α : [L] → [0, 1] that satisfies the following assumptions. In all classical click models the optimal ranking is to sort the items in order of decreasing attractiveness. Rather than deriving this from other assumptions, we simply assume that v satisfies this criterion. We call action a optimal if α(a(1)) ≥ α(a(2)) ≥ ⋯ ≥ α(a(L)). The optimal action need not be unique if α is not injective, but the sequence α(a(1)), …, α(a(K)) is the same for all optimal actions.

Assumption 2.

Let a* be an optimal action. Then Σ_{k=1}^K v(a, k) ≤ Σ_{k=1}^K v(a*, k) for all actions a.

The next assumption asserts that if a is an action and item i = a(k) is more attractive than item j = a(l) with k < l, then exchanging the positions of i and j can only decrease the likelihood of clicking on the item in slot k. The figure on the right illustrates the two cases. The probability of clicking on the second position is larger in a than in the exchanged ranking σ ∘ a. On the other hand, the probability of clicking on the fourth position is larger in σ ∘ a than in a. The assumption is actually slightly stronger than this because it also specifies a lower bound on the amount by which one probability is larger than another in terms of the attractiveness function.

Assumption 3.

Let i and j be items with α(i) ≥ α(j) and let σ be the permutation that exchanges i and j and leaves the other items unchanged. Then for any action a,

v(a, a^{-1}(i)) ≥ (α(i) / α(j)) v(σ ∘ a, a^{-1}(i)).

Our final assumption is that for any action a that agrees with an optimal action a* on position k, the probability of clicking on the kth position under a is at least as big as the probability of clicking on the kth position under the optimal action.

Assumption 4.

For any action a and optimal action a* with a(k) = a*(k) it holds that v(a, k) ≥ v(a*, k).

4 Algorithm

1:  G_1 ← ∅ and S_{0ij} ← 0, N_{0ij} ← 0 for all i, j ∈ [L]
2:  for t = 1, …, n do
3:     d ← 0 and B ← [L]
4:     while B ≠ ∅ do
5:        d ← d + 1
6:        P_{td} ← min_{G_t}(B) and B ← B \ P_{td}
7:     Choose A_t uniformly at random from A(P_{t1}, …, P_{td})
8:     Observe click indicators C_{ti} ∈ {0, 1} for all i ∈ [L]
9:     for all pairs (i, j) ∈ [L] × [L] do
10:       U_{tij} ← C_{ti} if i and j are in the same block, else 0
11:       S_{tij} ← S_{t−1,ij} + U_{tij} − U_{tji} and N_{tij} ← N_{t−1,ij} + |U_{tij} − U_{tji}|
12:    G_{t+1} ← G_t ∪ {(j, i) : N_{tij} > 0 and S_{tij} ≥ √(2 N_{tij} log((c/δ) √N_{tij}))}
Algorithm 1 TopRank

Before we present our algorithm, we introduce some basic notation that describes it. Given a relation G ⊆ [L]² and X ⊆ [L], let min_G(X) = {i ∈ X : (i, j) ∉ G for all j ∈ X}. When X is nonempty and G does not have cycles, then min_G(X) is nonempty. Let P_1, …, P_d be the partition of [L] so that P_1 = min_G([L]) and P_c = min_G([L] \ (P_1 ∪ ⋯ ∪ P_{c−1})) for any c > 1. We refer to each subset in the partition, P_c for c ∈ [d], as a block. Let A(P_1, …, P_d) be the set of actions where the items in P_1 are placed at the first |P_1| positions, the items in P_2 are placed at the next |P_2| positions, and so on. Specifically, a ∈ A(P_1, …, P_d) if and only if, for every c ∈ [d], the items placed at the block's positions are exactly a permutation of P_c.

Our algorithm is presented in Algorithm 1. We call it TopRank, because it maintains a topological order of the items in each round. The order is represented by the relation G_t, where G_1 = ∅. In each round, TopRank computes a partition of [L] by iteratively peeling off minimum items according to G_t. Then it randomizes the items in each block of the partition and maintains statistics on the relative number of clicks between pairs of items in the same block. A pair (j, i) is added to the relation once item i receives sufficiently more clicks than item j during rounds where the two items are in the same block. The reader should interpret (j, i) ∈ G_t as meaning that TopRank collected enough evidence up to round t to conclude that α(j) < α(i).
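The peeling step can be sketched as follows, assuming (i, j) ∈ G records that item i is believed less attractive than item j; the helper names `min_g` and `partition` are illustrative, not from the paper.

```python
def min_g(G, X):
    """Items in X with no outgoing edge inside X, i.e. items not yet
    shown to be less attractive than some other item in X."""
    return {i for i in X if not any((i, j) in G for j in X)}

def partition(G, items):
    """Peel off minimum blocks until no items remain (assumes G acyclic)."""
    blocks, remaining = [], set(items)
    while remaining:
        block = min_g(G, remaining)
        if not block:               # a cycle: the behavior is arbitrary
            raise ValueError("relation G contains a cycle")
        blocks.append(block)
        remaining -= block
    return blocks

# Hypothetical relation: items 2 and 3 are known worse than item 1.
G = {(2, 1), (3, 1)}
assert partition(G, range(4)) == [{0, 1}, {2, 3}]
```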

Remark 2.

The astute reader will notice that the algorithm is not well defined if G_t contains cycles. The analysis works by proving that this occurs with low probability, and the behavior of the algorithm may be defined arbitrarily whenever a cycle is encountered. Assumption 1 means that items in positions k > K are never clicked. As a consequence, the algorithm never needs to actually compute the blocks whose positions lie entirely beyond K, because items in these blocks are never shown to the user.

Shortly we give an illustration of the algorithm, but first we introduce the notation used in the analysis. Let I_{td} be the set of slots of the ranking where the items in block P_{td} are placed.

Furthermore, let d_t(i) be the index of the block containing item i, so that i ∈ P_{t d_t(i)}. Let M_t be the number of blocks in the partition in round t.
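One plausible reading of the per-round statistics update is sketched below. The names S, N, U, the update rule, and the threshold form are our assumptions based on the algorithm description, not a verbatim transcription.

```python
import math

# One round of the statistics update, under one reading of the algorithm:
# a pair (j, i) is added to G once S >= sqrt(2 N log(c * sqrt(N) / delta)),
# where S and N accumulate over rounds where i and j shared a block.
# Assumes c / delta >= 1 so the logarithm is nonnegative.

def update(S, N, G, clicks, block_of, c, delta):
    L = len(clicks)
    for i in range(L):
        for j in range(L):
            if i == j or block_of[i] != block_of[j]:
                continue
            S[i][j] += clicks[i] - clicks[j]
            N[i][j] += abs(clicks[i] - clicks[j])  # rounds where exactly one clicked
            n = N[i][j]
            if n > 0 and S[i][j] >= math.sqrt(2 * n * math.log(c * math.sqrt(n) / delta)):
                G.add((j, i))  # enough evidence that item j is worse than item i
    return S, N, G

# Toy demonstration: item 0 is clicked, item 1 is not, both in block 0.
S = [[0, 0], [0, 0]]
N = [[0, 0], [0, 0]]
G = set()
update(S, N, G, clicks=[1, 0], block_of=[0, 0], c=1.0, delta=0.9)
```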







Suppose L = 5 and K = 4, and in round t the relation is G_t = {(2, 1), (5, 2), (5, 4)}. This indicates the algorithm has collected enough data to believe that item 2 is less attractive than item 1 and that item 5 is less attractive than items 2 and 4. The relation is depicted in the figure below, where an arrow from j to i means that (j, i) ∈ G_t. In round t the first three positions in the ranking will contain the items from P_{t1} = {1, 3, 4}, but with random order. The fourth position will be item 2, and item 5 is not shown to the user. Note that here M_t = 3, I_{t2} = {4} and item 5 belongs to the third block.

Remark 3.

TopRank is not an elimination algorithm. In the scenario described above, item 5 is not shown to the user, but it could happen that later (3, 1) and (3, 2) are added to the relation, and then TopRank will start randomizing between items 3 and 5 for the fourth position.

5 Regret analysis

Theorem 1.

Let the function v satisfy Assumptions 1–4 and let δ ∈ (0, 1). Then the n-step regret of TopRank is bounded from above as

Furthermore,  .

By choosing δ = 1/n, the theorem shows that the expected regret is at most

The algorithm does not make use of any assumed ordering on the items, so the assumption is only used to allow for a simple expression for the regret. The core idea of the proof is to show that (a) if the algorithm is suffering regret as a consequence of misplacing an item, then it is gaining information about the relation between the items so that G_t will gain elements, and (b) once G_t is sufficiently rich the algorithm is playing optimally. For each t, let F_t be the failure event that the empirical click statistics of some pair of items up to round t deviate significantly from their means.

Lemma 1.

Let and satisfy and . On the event that and and , the following hold almost surely:


For the remainder of the proof we focus on the event that and and . We also discard the measure zero subset of this event where . From now on we omit the ‘almost surely’ qualification on conditional expectations. Under these circumstances the definition of conditional expectation shows that


where in the second equality we added and subtracted the same term. By the design of TopRank, the items in a block are placed into the block's slots uniformly at random. Let σ be the permutation that exchanges the positions of items i and j. Then using Assumption 3,

where the second equality follows from the fact that and the definition of the algorithm ensuring that . The last equality follows from the fact that is a bijection. Using this and continuing the calculation in Eq. 1 shows that

The second part follows from the first since . ∎

The next lemma shows that the failure event occurs with low probability.

Lemma 2.

It holds that .


The proof follows immediately from Lemma 1, the definition of the failure event, the union bound over all pairs of items, and a modification of the Azuma-Hoeffding inequality in Lemma 6. ∎

Lemma 3.

On the event it holds that for all .


Let so that . On the event either or

When and are in different blocks in round , then by definition. On the other hand, when and are in the same block, almost surely by Lemma 1. Based on these observations,

which by the design of implies that . ∎

Lemma 4.

Let be the most attractive item in . Then on event , it holds that for all .


Let . Then holds trivially for any and . Now consider two cases. Suppose that . Then it must be true that and our claim holds. On the other hand, suppose that for some . Then by Lemma 3 and the design of the partition, there must exist a sequence of items in blocks such that . From the definition of , . This concludes our proof. ∎

Lemma 5.

On the event and for all it holds that .


The result is trivial when . Assume from now on that . By the definition of the algorithm, items i and j are not in the same block once the click-difference statistic between them grows too large relative to the number of observations, which means that

On the event and part (a) of Lemma 1 it also follows that

Combining the previous two displays shows that


Using the fact that and rearranging the terms in the previous display shows that

The result is completed by substituting this into Eq. 2. ∎

Proof of Theorem 1.

The first step in the proof is an upper bound on the expected number of clicks in the optimal list . Fix time , block , and recall that is the most attractive item in . Let be the position of item and be the permutation that exchanges items and . By Lemma 4, ; and then from Assumptions 3 and 4, we have that . Based on this result, the expected number of clicks on is bounded from below by those on items in ,

where we also used the fact that TopRank randomizes the items within each block. Using this and the design of TopRank,

Therefore, under event , the conditional expected regret in round is bounded by


The last inequality follows by noting that . To see this, use part (a) of Lemma 1 to show that for and Lemma 4 to show that when , then neither item is shown to the user in round t, so that . Substituting the bound in Eq. 3 into the regret leads to


where we used the fact that the maximum number of clicks over n rounds is nK. The proof of the first part is completed by using Lemma 2 to bound the first term and Lemma 5 to bound the second. The problem-independent bound follows from Eq. 4 and by stopping early in the proof of Lemma 5. The details are given in the appendix. ∎

Lemma 6.

Let be a filtration and be a sequence of -adapted random variables with and . Then with and ,

See Appendix B for the proof.

We also provide a minimax lower bound, the proof of which is deferred to Appendix D.

Theorem 2.

Suppose that L = NK with N > 1 an integer and n sufficiently large. Then for any algorithm there exists a ranking problem such that E[R_n] = Ω(√(KLn)).

The proof of this result only makes use of ranking problems from the document-based model (or 'slate bandit'). This also corresponds to a lower bound for K-sets in online linear optimization with semibandit feedback. Despite the simple setup and overlapping literatures, we do not know of a place where this has been written before.

6 Experiments

We experiment with the Yandex dataset [19], a large dataset of search queries. In each query, the user is shown 10 documents at positions 1 to 10 and the search engine records the clicks of the user. We select frequent search queries from this dataset, and learn their CMs and PBMs using PyClick [3]. The parameters of the models are learned by maximizing the likelihood of observed clicks. Our goal is to rerank the most attractive items with the objective of maximizing the expected number of clicks at the first K positions. This is the same experimental setup as in Zoghi et al. [22]. It is a realistic scenario where the learning agent can only rerank highly attractive items that are suggested by some production ranker [20]. TopRank is compared to BatchRank [22] and CascadeKL-UCB [8]. We used the implementation of BatchRank by Zoghi et al. [22]. We do not compare to ranked bandits [15], because they have already been shown to perform poorly in stochastic click models, for instance by Zoghi et al. [22] and Katariya et al. [7]. The parameter δ in TopRank is set as δ = 1/n, as suggested in Theorem 1.
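For readers who want to experiment without the Yandex data, the following is a self-contained toy run of a simplified TopRank loop on a synthetic position-based model. The constant c, the threshold, and all parameters are our assumptions based on the algorithm description, not a reproduction of the paper's experimental setup.

```python
import math
import random

def min_g(G, X):
    # Items in X not yet believed worse than another item in X.
    return {i for i in X if not any((i, j) in G for j in X)}

def blocks_of(G, L):
    # Peel off minimum blocks; assumes G is acyclic.
    blocks, rem = [], set(range(L))
    while rem:
        b = min_g(G, rem)
        if not b:
            raise ValueError("cycle in relation")
        blocks.append(b)
        rem -= b
    return blocks

def toprank(alpha, chi, n, c=3.43, rng=None):
    """Run a simplified TopRank variant for n rounds on a synthetic
    position-based model with attractiveness alpha and examination chi."""
    rng = rng or random.Random(0)
    L, K = len(alpha), len(chi)
    S = [[0.0] * L for _ in range(L)]
    N = [[0.0] * L for _ in range(L)]
    G, delta, total_clicks = set(), 1.0 / n, 0
    for _ in range(n):
        # Build the ranking: shuffle each block, then concatenate.
        ranking, block_of = [], [0] * L
        for d, b in enumerate(blocks_of(G, L)):
            items = list(b)
            rng.shuffle(items)
            ranking.extend(items)
            for i in b:
                block_of[i] = d
        # Simulate clicks under v(a, k) = alpha[a(k)] * chi[k].
        C = [0] * L
        for k in range(K):
            if rng.random() < alpha[ranking[k]] * chi[k]:
                C[ranking[k]] = 1
        total_clicks += sum(C)
        # Update statistics and the relation for same-block pairs.
        for i in range(L):
            for j in range(L):
                if i != j and block_of[i] == block_of[j]:
                    S[i][j] += C[i] - C[j]
                    N[i][j] += abs(C[i] - C[j])
                    m = N[i][j]
                    if m > 0 and S[i][j] >= math.sqrt(
                            2 * m * math.log(c * math.sqrt(m) / delta)):
                        G.add((j, i))
    return total_clicks, G
```

With two items of very different attractiveness, the relation quickly separates them and the more attractive item dominates the single shown position.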

Figure 1: The n-step regret of TopRank, BatchRank, and CascadeKL-UCB in three problems. The results are averaged over multiple runs. The error bars are the standard errors of our regret estimates.
Figure 2: The n-step regret of TopRank, BatchRank, and CascadeKL-UCB in two click models. The results are averaged over queries, with multiple runs per query. The error bars are the standard errors of our regret estimates.

Fig. 1 illustrates the general trend on specific queries. In the cascade model, CascadeKL-UCB outperforms TopRank. This should not come as a surprise, because CascadeKL-UCB heavily exploits the knowledge of the model. Despite being a more general algorithm, TopRank consistently outperforms BatchRank in the cascade model. In the position-based model, CascadeKL-UCB learns very good policies in about two thirds of queries, but suffers linear regret for the rest. In many of these queries, TopRank outperforms CascadeKL-UCB in as few as one million steps. In the position-based model, TopRank typically outperforms BatchRank.

The average regret over all queries is reported in Fig. 2. We observe similar trends to those in Fig. 1. In the cascade model, the regret of CascadeKL-UCB is about three times lower than that of TopRank, which is about three times lower than that of BatchRank. In the position-based model, the regret of CascadeKL-UCB is higher than that of TopRank after a million steps, and the regret of TopRank is lower than that of BatchRank. In summary, we observe that TopRank improves over BatchRank in both the cascade and position-based models. The worse performance of TopRank relative to CascadeKL-UCB in the cascade model is offset by its robustness to multiple click models.

7 Conclusions

We introduced a new click model for online ranking that subsumes previous models. Despite the increased generality, the new algorithm enjoys stronger regret guarantees, an easier and more insightful proof, and improved empirical performance. We hope the simplifications can inspire even more interest in online ranking. We also proved a lower bound for combinatorial linear semibandits with K-sets that improves on the bound by Uchiya et al. [18]. We do not currently have matching upper and lower bounds. The key to understanding minimax lower bounds is to identify what makes a problem hard. In many bandit models there is limited flexibility, but our assumptions are so weak that the space of all click-probability functions v satisfying Assumptions 1–4 is quite large, and we do not yet know what is the hardest case. This difficulty is perhaps even greater if the objective is to prove instance-dependent or asymptotic bounds, where the results usually depend on solving a regret/information optimization problem [11]. Ranking becomes increasingly difficult as the number of items grows. In most cases where L is large, however, one would expect the items to be structured and this should be exploited. This has been done for the cascade model by assuming a linear structure [23, 12]. Investigating this possibility with more relaxed assumptions seems like an interesting future direction.


  • Abbasi-yadkori et al. [2011] Y. Abbasi-yadkori, D. Pál, and Cs. Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, NIPS, pages 2312–2320. Curran Associates, Inc., 2011.
  • Agichtein et al. [2006] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference, pages 19–26, 2006.
  • Chuklin et al. [2015] A. Chuklin, I. Markov, and M. de Rijke. Click Models for Web Search. Morgan & Claypool Publishers, 2015.
  • Combes et al. [2015] R. Combes, S. Magureanu, A. Proutiere, and C. Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2015.
  • Freedman [1975] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 02 1975.
  • Grotov et al. [2015] A. Grotov, A. Chuklin, I. Markov, L. Stout, F. Xumara, and M. de Rijke. A comparative study of click models for web search. In Proceedings of the 6th International Conference of the CLEF Association, 2015.
  • Katariya et al. [2016] S. Katariya, B. Kveton, Cs. Szepesvári, and Z. Wen. DCM bandits: Learning to rank with multiple clicks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1215–1224, 2016.
  • Kveton et al. [2015a] B. Kveton, Cs. Szepesvári, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015a.
  • Kveton et al. [2015b] B. Kveton, Z. Wen, A. Ashkan, and Cs. Szepesvári. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems 28, pages 1450–1458, 2015b.
  • Lagree et al. [2016] P. Lagree, C. Vernade, and O. Cappe. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, pages 1597–1605, 2016.
  • Lattimore and Szepesvári [2017] T. Lattimore and Cs. Szepesvári. The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 728–737, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
  • Li et al. [2016] S. Li, B. Wang, S. Zhang, and W. Chen. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1245–1253, 2016.
  • Liu [2011] T. Liu. Learning to Rank for Information Retrieval. Springer, 2011.
  • Peña et al. [2008] V. H. Peña, T.L. Lai, and Q. Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.
  • Radlinski et al. [2008] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791, 2008.
  • Slivkins et al. [2013] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: Learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(1):399–436, 2013.
  • Tsybakov [2008] A. B. Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
  • Uchiya et al. [2010] Taishi Uchiya, Atsuyoshi Nakamura, and Mineichi Kudo. Algorithms for adversarial bandit problems with multiple plays. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, ALT’10, pages 375–389, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-16107-3.
  • Yandex [2013] Yandex. Yandex personalized web search challenge, 2013.
  • Zoghi et al. [2016] M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C. Ming Chin, and M. de Rijke. Click-based hot fixes for underperforming torso queries. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 195–204, 2016.
  • Zoghi et al. [2017a] M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, Cs. Szepesvári, and Z. Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of PMLR, pages 4199–4208, 2017a.
  • Zoghi et al. [2017b] M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, Cs. Szepesvári, and Z. Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, pages 4199–4208, 2017b.
  • Zong et al. [2016] S. Zong, H. Ni, K. Sung, N. Rosemary Ke, Z. Wen, and B. Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016.

Appendix A Proof of weaker assumptions

Here we show that Assumptions 1–4 are weaker than those made by Zoghi et al. [21], who assumed that the click probability factors as v(a, k) = α(a(k)) χ(a, k), where α is the attractiveness function and χ(a, k) is the examination probability. They assumed that:

  1. for all actions and positions .

  2. for all positions and actions .

  3. for all actions and positions .

  4. only depends on the unordered set .

  5. When and , then where exchanges items and .

  6. is maximized by .

We now show that these assumptions are stronger than Assumptions 1–4. Let α and χ be attraction and examination functions satisfying the six conditions above. We need to find choices of v and α such that v(a, k) = α(a(k)) χ(a, k) for all actions a and positions k, and where v and α satisfy Assumptions 1–4. First define χ(k) = χ(a, k), where a is any action. By item 4 this does not depend on the choice of a, which also means that the examination probability of a position does not depend on which items occupy it. Then let v(a, k) = α(a(k)) χ(k). Assumption 1 is satisfied trivially, since item 1 implies that v(a, k) = 0 whenever k > K. Assumption 2 is also satisfied trivially by item 6. For Assumption 3 we consider two cases. Let i and j be items with α(i) ≥ α(j), suppose that a is an action, and let σ be the permutation that exchanges i and j. Then

Now suppose that the positions of i and j are reversed; then the claim follows from item 5. Assumption 4 follows easily from item 2 by noting that

Note that we did not use item 3 at all, and one can easily construct a function v that satisfies Assumptions 1–4 while not satisfying items 1–6 above. Zoghi et al. [21] showed that their assumptions are weaker than the position-based and cascade models, which therefore also holds for our assumptions.
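For concreteness, the first case can be written out explicitly. This is a sketch under our reading of Assumption 3, using χ(k) for the position-only examination probability and assuming i sits above j in a:

```latex
v(a, a^{-1}(i)) = \alpha(i)\,\chi(a^{-1}(i))
                = \frac{\alpha(i)}{\alpha(j)}\,\alpha(j)\,\chi(a^{-1}(i))
                = \frac{\alpha(i)}{\alpha(j)}\, v(\sigma \circ a,\, a^{-1}(i)) ,
```

where the last equality holds because the exchange places j at position a^{-1}(i) and, by item 4, the examination probability of that position is unchanged.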

Appendix B Proof of Lemma 6

We first show a bound on the right tail of . A symmetric argument suffices for the left tail. Let and . Define filtration by . Using the fact that we have for any that

Therefore is a supermartingale for any . The next step is to use the method of mixtures [14] with a uniform distribution on . Let . Then Markov’s inequality shows that for any -measurable stopping time with almost surely,