Abstract
We investigate the Plackett–Luce (PL) model based listwise learning-to-rank (LTR) on data with partitioned preference, where a set of items are sliced into ordered and disjoint partitions, but the ranking of items within a partition is unknown. Given $N$ items with $M$ partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of $O(N+S!)$, where $S$ is the maximum size of the top $M-1$ partitions. This computational challenge restrains most existing PL-based listwise LTR methods to a special case of partitioned preference, top-$K$ ranking, where the exact order of the top $K$ items is known. In this paper, we exploit a random utility model formulation of the PL model, and propose an efficient numerical integration approach for calculating the likelihood and its gradients with a time complexity $O(N+S)$. We demonstrate that the proposed method outperforms well-known LTR baselines and remains scalable through both simulation experiments and applications to real-world eXtreme Multi-Label classification tasks.
Ma, Yi, Tang, Zhao, Hong, Chi, and Mei
1 Introduction
Ranking is a core problem in many information retrieval systems, such as recommender systems, search engines, and online advertising. Industry-scale ranking systems are typically applied to millions of items in a personalized way for billions of users. To meet the need for scalability and to exploit the huge amount of user feedback data, learning-to-rank (LTR) has been the most popular paradigm for building ranking systems. Existing LTR approaches can be categorized into three groups: pointwise (Gey, 1994), pairwise (Burges et al., 2005), and listwise (Cao et al., 2007; Taylor et al., 2008) methods. The pointwise and pairwise LTR methods convert the ranking problem into regression or classification tasks on single items or pairs of items, respectively. As real-world ranking data are often presented as (partially) ordered lists of items, the listwise LTR methods instead directly optimize objective functions defined on ranked lists of items, in order to preserve more information about the interrelations among items in a list.
One of the most well-known groups of listwise LTR methods (Cao et al., 2007; Xia et al., 2008) is based on the Plackett–Luce (PL) model (Plackett, 1975; Luce, 1959). These methods define their objective functions as the likelihood of the observed ranked list under a PL model. Despite being useful in many cases, a major limitation of such methods comes from the fact that evaluating the likelihood of general partial rankings under a PL model is usually intractable for a large number of items. This computational challenge restricts the application of existing PL-based listwise LTR methods to limited special cases of partial rankings, such as top-$K$ ranking, where the exact order of the top $K$ items is known.
In this paper, we extend PL-based listwise LTR to a more general class of partial rankings, the partitioned preference (Lebanon and Mao, 2008; Lu and Boutilier, 2014), defined as follows: given $N$ items, a partitioned preference slices the $N$ items into $M$ disjoint partitions, where the order of items within each partition is unknown while the partitions themselves have a global order. Partitioned preference not only is a strictly more general class of partial rankings compared to top-$K$ ranking, but also better characterizes real-world ranking data. For example, in a page of recommended items, we usually only observe binary clicks or a small number of ordinal ratings (e.g., 5-star ratings) as user feedback but do not know the exact order among the clicked items or among items with the same rating. However, computing the exact likelihood of data with partitioned preference under the PL model requires an intractable time complexity of $O(N+S!)$, where $S$ is the maximum size of the top $M-1$ partitions.
To overcome this computational challenge, we propose a novel numerical integration method. The key insight of our method is that, by exploiting a random utility model formulation of the PL model with the Gumbel distribution (Yellott Jr, 1977; McFadden, 1978), we find that both the log-likelihood and its gradients can be rewritten as summations of one-dimensional integrals, which can be efficiently approximated by numerical integration. We formally demonstrate that, as the number of items $N$ grows, the overall time complexity of the proposed numerical approach is $O(N+S)$ in order to maintain a constant level of numerical error $\epsilon$, which is much more efficient than the naive approach with the complexity $O(N+S!)$. We also discuss how our proposed approach might improve the generalized rank-breaking methods (Khetan and Oh, 2018).
We evaluate the effectiveness of the proposed method through both simulation and experiments with real-world datasets. In simulation, we show that the proposed method can better recover the ground-truth parameters of a PL model compared to baseline methods, including a method (Hino et al., 2010) that approximates the PL likelihood with a tractable lower bound. We also test the proposed method on real-world extreme multi-label (XML) classification datasets (Bhatia et al., 2016). We show that the proposed method can efficiently train neural network ranking models over millions of items, and outperforms other popular listwise and pairwise LTR baselines.
2 Related Work
2.1 Learning-to-Rank
Our work falls in the area of LTR (Liu, 2009). The goal of LTR is to build machine learning models to rank a list of items for a given context (e.g., a user) based on the feature representations of the items and the context. The choice of the ranking objective plays an important role in learning ranking models. Existing ranking objectives can be generally categorized into three groups: pointwise (Gey, 1994), pairwise (Joachims, 2002; Burges et al., 2005), and listwise (Cao et al., 2007; Xia et al., 2008; Taylor et al., 2008; Christakopoulou and Banerjee, 2015; Ai et al., 2018; Wang et al., 2018). The PL model has been widely used in listwise LTR methods (Cao et al., 2007; Xia et al., 2008; Schäfer, 2018). However, to the best of our knowledge, existing PL-based listwise methods cannot be applied to partitioned preference data, due to the aforementioned computational complexity of evaluating the likelihood. Our work tackles this computational challenge with a novel numerical approach. Beyond the computational challenge, another major limitation of the PL-based listwise methods is that the underlying independence of irrelevant alternatives (IIA) assumption of the PL model is sometimes overly strong in real-world applications (Seshadri and Ugander, 2019; Wilhelm et al., 2018; Christakopoulou and Banerjee, 2015). More detailed discussions of the IIA assumption are, however, out of the scope of this paper.
XML classification as a ranking problem. Given the features of each sample, the XML classification task requires a machine learning model to tag the most relevant subset of an extremely large label set. XML classification tasks were initially established as a reformulation of ranking problems (Agrawal et al., 2013; Prabhu and Varma, 2014), and their performance is primarily evaluated by various ranking metrics such as Precision@k or nDCG@k (Bhatia et al., 2016). XML classification tasks are special cases of ranking with partitioned preference, where the class labels are considered items, and for each document its relevant labels form one partition and its irrelevant labels form a second, lower-ranked partition. In this work, we apply the proposed method for ranking with partitioned preference to XML classification datasets, and we find that it achieves state-of-the-art performance on datasets where the first partition, i.e., the set of relevant labels, is relatively large.
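As a minimal illustration of this reduction (the helper name and the integer label encoding are our own choices, not from the paper), a multi-label sample can be turned into a two-partition preference:

```python
# Sketch: casting one multi-label sample as a partitioned preference with
# M = 2 partitions, as described above. The function name and the use of
# integer label ids are illustrative assumptions.
def labels_to_partitions(relevant, n_labels):
    """Relevant labels form the top partition; all other labels form the
    second, lower-ranked partition."""
    top = set(relevant)
    bottom = set(range(n_labels)) - top
    return [top, bottom]
```

The within-partition order (e.g., among the relevant labels) is deliberately left unspecified, which is exactly the information a partitioned preference does not carry.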
2.2 Rank Aggregation
Rank aggregation aims to integrate multiple partial or full rankings into one ranking. The multiple rankings are considered as noisy samples from an underlying ground-truth ranking preference. Rank aggregation is a broader research area that includes LTR as a subproblem. Statistical modeling is a popular approach for rank aggregation. Various statistical models (Mallows, 1957; Luce, 1959; Plackett, 1975) have been proposed to model the rank generation process in the real world. Among them, the PL model (Luce, 1959; Plackett, 1975) is one of the most widely used. Evaluating the likelihood of the PL model on various types of partial rankings has been widely studied (Hunter et al., 2004; Maystre and Grossglauser, 2015; Liu et al., 2019; Yıldız et al., 2020; Zhao and Xia, 2020). However, we note that many of these studies (Maystre and Grossglauser, 2015; Liu et al., 2019) are designed for ranking data without any features, and thus are not suitable for LTR tasks. The ones (Yıldız et al., 2020; Zhao and Xia, 2020) that can leverage sample features are not directly applicable to large-scale partitioned-preference data. It is worth noting that our proposed method shares the motivation of approximating the intractable PL likelihood using sampling methods (Liu et al., 2019), as numerical integration is a special case of sampling. However, the integral form of the PL likelihood, inspired by the connection between the PL model and the Gumbel distribution, makes our method more efficient than a general sampling method.
3 Approach
3.1 Problem Formulation: Learning a PL Model from Partitioned Preference
Suppose there are $N$ different items in total and we denote the item set by $[N] = \{1, 2, \ldots, N\}$. The PL model and the partitioned preference are formally defined below.
Definition 1 (Plackett–Luce Model (Plackett, 1975; Luce, 1959)).
Given the utility scores of the $N$ items, $s = (s_1, s_2, \ldots, s_N)$, the probability of observing a certain ordered list of these $N$ items, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, is defined as

$$P_s(\pi) = \prod_{i=1}^{N} \frac{\exp(s_{\pi_i})}{\sum_{j=i}^{N} \exp(s_{\pi_j})}. \quad (1)$$
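As a concrete reading of Eq. (1), the following sketch computes the log-likelihood of a full ranking in log-space (the function name and the plain-list representation are our illustrative choices, not part of the paper):

```python
# Sketch: log-likelihood of a full ranking under the Plackett-Luce model
# (Eq. (1)). Each factor is s_{pi_i} minus the log-sum-exp over the tail.
import math

def pl_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (item indices, best first) under a PL
    model with utility scores `scores`."""
    total = 0.0
    for i in range(len(ranking)):
        tail = [scores[j] for j in ranking[i:]]
        m = max(tail)  # shift by the max for numerical stability
        log_denom = m + math.log(sum(math.exp(t - m) for t in tail))
        total += scores[ranking[i]] - log_denom
    return total
```

As a sanity check, exponentiating and summing over all $N!$ permutations yields 1, as a probability distribution over full rankings must.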
Definition 2 (Partitioned Preference (Lebanon and Mao, 2008; Lu and Boutilier, 2014)).
A group of $M$ disjoint partitions of $[N]$, $S_1, S_2, \ldots, S_M$, is called a partitioned preference if (a) $S_1 \succ S_2 \succ \cdots \succ S_M$, where $S_m \succ S_{m'}$ indicates that any item in the $m$-th partition has a higher rank than any item in the $m'$-th partition; and (b) the rank of items within the same partition is unknown.
Clearly, $\bigcup_{m=1}^{M} S_m = [N]$ and $S_m \cap S_{m'} = \emptyset$ for any $m \neq m'$. We also denote the size of each partition as $N_m = |S_m|$, $m = 1, \ldots, M$. Under a PL model parameterized by $s$ as defined in Eq. (1), the probability of observing such a partitioned preference is given by

$$P_s(S_1 \succ S_2 \succ \cdots \succ S_M) = \sum_{\pi \in \Phi(S_1 \succ S_2 \succ \cdots \succ S_M)} P_s(\pi), \quad (2)$$
where $\Phi$ is a function that maps a partial ranking to the set of all possible permutations of $[N]$ that are consistent with the given partial ranking.
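To make the cost of Eq. (2) concrete, a brute-force evaluation that enumerates every consistent permutation might look as follows (a sketch with illustrative helper names; its cost grows with the factorial of each partition size, which is exactly the intractability addressed in this paper):

```python
# Sketch: naive likelihood of a partitioned preference (Eq. (2)) by
# enumerating every permutation consistent with the partitions.
import itertools, math

def pl_prob(scores, ranking):
    """Probability of a full ranking under the PL model, per Eq. (1)."""
    p = 1.0
    for i in range(len(ranking)):
        denom = sum(math.exp(scores[j]) for j in ranking[i:])
        p *= math.exp(scores[ranking[i]]) / denom
    return p

def naive_partition_likelihood(scores, partitions):
    """`partitions` is a list of disjoint item lists, best partition first.
    Enumerates prod_m |S_m|! permutations, so it explodes quickly."""
    total = 0.0
    # A consistent permutation concatenates one ordering of each partition.
    for pieces in itertools.product(
            *(itertools.permutations(p) for p in partitions)):
        ranking = [i for piece in pieces for i in piece]
        total += pl_prob(scores, ranking)
    return total
```

With a single partition containing all items, every permutation is consistent and the likelihood is exactly 1; with a large partition, the enumeration is hopeless, motivating the integral approach below.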
Typically, the utility scores are themselves parameterized functions, e.g., neural networks, of the feature representations of the items and the context of ranking (e.g., a particular user). Suppose the features for item $i$ are denoted as $x_i$ and the item-independent context features are denoted as $z$. Then the utility score of item $i$ for a given context (e.g., a user) can be written as $s_i = f_\theta(x_i, z)$, where $\theta$ represents the neural network parameters.
However, evaluating the likelihood function naively by Eq. (2) requires a time complexity of $O(N + S!)$, which is made clear in the form of Eq. (3) given by Lemma 1. This implies that the likelihood becomes intractable as long as even one of the top $M-1$ partitions is mildly large, since the factorial term explodes.
Lemma 1.
Let $\Omega$ be a function that maps a set of items to the set of all possible permutations of these items. Then Eq. (2) can be rewritten as

$$P_s(S_1 \succ \cdots \succ S_M) = \prod_{m=1}^{M-1} \sum_{\pi \in \Omega(S_m)} \prod_{i=1}^{N_m} \frac{\exp(s_{\pi_i})}{\sum_{j=i}^{N_m} \exp(s_{\pi_j}) + \sum_{j \in R_m} \exp(s_j)}, \quad (3)$$

which happens to be $\prod_{m=1}^{M-1} P_s(S_m \succ R_m)$, where $R_m = [N] \setminus \bigcup_{m'=1}^{m} S_{m'}$ is the set of items that do not belong to the top $m$ partitions.
3.2 Efficient Evaluation of the Likelihood and Gradients
Next, we present how to efficiently evaluate the likelihood (5) and its gradients. We derive a numerical integration approach based on the random utility model formulation of the PL model with the Gumbel distribution (Yellott Jr, 1977; McFadden, 1978).
The random utility model formulation of PL.
A random utility model assumes that, for each context, the utility of preferring item $i$ is a random variable $U_i = s_i + \epsilon_i$. In particular, $s_i$ is the aforementioned parameterized utility function, and $\epsilon_i$ is a random noise term that contains all the unobserved factors affecting the individual's utility. When each $\epsilon_i$ independently follows a standard Gumbel distribution (or equivalently, each $U_i$ independently follows $\mathrm{Gumbel}(s_i)$, a Gumbel distribution with the location parameter set to $s_i$), we have the following fact (Yellott Jr, 1977): for any permutation $\pi$ of the $N$ items, $P(U_{\pi_1} > U_{\pi_2} > \cdots > U_{\pi_N}) = P_s(\pi)$.
This implies that, after sampling $N$ independent Gumbel variables, the ordered indices returned by sorting the Gumbel variables in decreasing order follow the PL model. Following this result, we develop Proposition 1, which characterizes the preference for a given context between any two disjoint partitions.
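The Gumbel-max view can be sketched in a few lines (a minimal illustration; the function name and the use of Python's `random` module are our choices, not the paper's implementation):

```python
# Sketch: sampling a full ranking from a PL model by sorting utilities
# perturbed with independent standard Gumbel noise (Yellott Jr, 1977).
import math, random

def sample_pl_ranking(scores, rng=random):
    """Returns item indices sorted by s_i + Gumbel(0) noise, largest first.
    The resulting ordered list is distributed according to the PL model."""
    # -log(-log(U)) with U ~ Uniform(0, 1) is a standard Gumbel draw.
    perturbed = [s - math.log(-math.log(rng.random())) for s in scores]
    return sorted(range(len(scores)), key=lambda i: -perturbed[i])
```

Empirically, the frequency with which item $i$ appears first converges to $\exp(s_i)/\sum_j \exp(s_j)$, matching the first factor of Eq. (1).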
Proposition 1.
Given a PL model parameterized by $s$, for any two disjoint sets $S, R \subseteq [N]$ with $S \cap R = \emptyset$, the probability that every item in $S$ is ranked above every item in $R$ is given by

$$P_s(S \succ R) = P\left(\min_{i \in S} U_i > G_R\right) = \int_0^1 \prod_{i \in S} \left(1 - u^{\exp(s_i - b)}\right) \mathrm{d}u, \quad (4)$$

where $G_R$ denotes a random variable following $\mathrm{Gumbel}(b)$ and $b = \log \sum_{j \in R} \exp(s_j)$.
Kool et al. (2020) have shown a weaker version of Proposition 1 where $S \cup R = [N]$. Here we extend it to the case where $S \cup R \subsetneq [N]$; the proof is given in Appendix A.2. In particular, the second equality in Eq. (4) provides an efficient way of computing the likelihood of the preference for a given context between two disjoint partitions.
Computing the likelihood and gradients by numerical integration.
Following the random utility model formulation of PL, we can compute the log-likelihood of the preference for a given context among $M$ partitions, as well as its gradients, efficiently by one-dimensional numerical integration. For the likelihood function, it directly follows from Lemma 1 and Proposition 1 that

$$\log P_s(S_1 \succ \cdots \succ S_M) = \sum_{m=1}^{M-1} \log \int_0^1 \prod_{i \in S_m} \left(1 - u^{\exp(s_i - b_m)}\right) \mathrm{d}u, \quad (5)$$
where $b_m = \log \sum_{j \in R_m} \exp(s_j)$. Therefore, we can compute the log-likelihood function through one-dimensional numerical integration. From Lemma 2, we can see that the gradients of the log-likelihood can also be obtained by numerical integration using Eq. (6).
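A minimal sketch of evaluating one partition-versus-rest probability with the composite midpoint rule follows (the integral form is restated in the comment so the snippet is self-contained; helper names and the interval count are our illustrative choices):

```python
# Sketch: P(S > R) under a PL model via the one-dimensional integral
#   P(S > R) = integral_0^1 prod_{i in S} (1 - u^{exp(s_i - b)}) du,
# with b = log sum_{j in R} exp(s_j), approximated by the composite
# midpoint rule with n_intervals equal-width intervals.
import math

def prob_partition_above(scores, S, R, n_intervals=1000):
    b = math.log(sum(math.exp(scores[j]) for j in R))
    exponents = [math.exp(scores[i] - b) for i in S]
    total = 0.0
    for k in range(n_intervals):
        u = (k + 0.5) / n_intervals  # midpoint of the k-th interval
        integrand = 1.0
        for p in exponents:
            integrand *= 1.0 - u ** p
        total += integrand / n_intervals
    return total
```

For small instances the result can be checked against the brute-force sum over all consistent permutations.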
Lemma 2.
The gradients of the log-likelihood $\mathcal{L}(s) = \log P_s(S_1 \succ \cdots \succ S_M)$ w.r.t. $s$ can be written as, for each $k \in [N]$,

$$\frac{\partial \mathcal{L}(s)}{\partial s_k} = \sum_{m=1}^{M-1} \left( \int_0^1 \prod_{i \in S_m} \left(1 - u^{\exp(s_i - b_m)}\right) \mathrm{d}u \right)^{-1} \int_0^1 \frac{\partial}{\partial s_k} \prod_{i \in S_m} \left(1 - u^{\exp(s_i - b_m)}\right) \mathrm{d}u, \quad (6)$$

where each inner derivative is available in closed form, and the $m$-th summand is nonzero only when $k \in S_m$ (through $s_k$ itself) or $k \in R_m$ (through $b_m$).
Analysis of the computation cost. Suppose the number of numerical integration intervals is set to $n$. Evaluating the likelihood (5) requires a time complexity of $O(N + n \sum_{m=1}^{M-1} N_m)$, and evaluating the gradients (6) requires a time complexity of the same order. Next we quantify the minimum $n$ required to achieve a desired numerical error. The numerical error consists of the discretization error, which is caused by the discretization of the integral, and the round-off error, which is caused by the finite precision of the computer. While there is non-negligible round-off error if we calculate the numerical integration directly using formulas (5) and (6), this can be largely alleviated using common numerical tricks (see Appendix A.5). So we mainly focus on the analysis of the discretization error, which is given in Theorem 1.
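One standard stabilization of this kind (a sketch under our own assumptions; the paper's exact tricks are in its Appendix A.5 and are not reproduced here) is to accumulate each integrand of Eq. (5) in log-space, so that a product of many small factors does not underflow:

```python
# Sketch of a common numerical trick: evaluate the log of the integrand
#   prod_i (1 - u^{p_i})
# as a sum of log(1 - u^{p_i}) = log(-expm1(p_i * log(u))), which stays
# accurate when u^{p_i} is close to 1 and avoids underflow for large |S|.
import math

def log_integrand(u, exponents):
    """log of prod_i (1 - u^{p_i}) for 0 < u < 1, accumulated in log-space."""
    lu = math.log(u)
    return sum(math.log(-math.expm1(p * lu)) for p in exponents)
```

Exponentiating the result recovers the direct product for small cases, while the log-space form remains finite where the direct product would underflow.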
Theorem 1.
Assume that there exist some constants such that, for any $i \in [N]$,
(7) 
Then, for any $\epsilon > 0$:

(a) we need at most $O(1/\sqrt{\epsilon})$ intervals for the $m$-th integral in the likelihood (5) to have a discretization error smaller than $\epsilon$;

(b) if there exist some constants such that, for each $m$, the following conditions hold,
(8) then we need at most $O(1/\sqrt{\epsilon})$ intervals for the $m$-th integral in the gradients (6) to have a discretization error smaller than $\epsilon$.
We note that assumptions (7) and (8) in Theorem 1 respectively require the largest and the smallest neural network output logits, plus a constant, to be not too far away from zero.
Theorem 1 implies that the overall computation cost of the proposed method is $O(N + S)$ to maintain a constant discretization error $\epsilon$ for both the likelihood and its gradients, which is clearly much more efficient than the naive approach with factorial terms. We further highlight several computational advantages of the proposed numerical approach. 1) The whole computation is highly parallelizable: the computation of the integrands and the product over $i \in S_m$ within each integrand can all be done in parallel. 2) The number of intervals $n$ can be adjusted to control the trade-off between computation cost and accuracy. 3) In large-scale ranking data, we often have $S \ll N$, thus the $S$ term will be negligible for large $N$, resulting in a linear complexity w.r.t. $N$.
3.3 Improving the Computational Efficiency of Generalized RankBreaking Methods
We discuss the potential application of the proposed numerical approach to generalized rank-breaking methods (Khetan and Oh, 2018) as a final remark of this section.
While partitioned preference is a general class of partial rankings for a set of items, it cannot represent the class of all possible partial rankings, which is also known as arbitrary pairwise preference (Lu and Boutilier, 2014; Liu et al., 2019). It is challenging to learn arbitrary pairwise preferences using a listwise method. To the best of our knowledge, there is no scalable listwise method that is able to learn industry-scale PL-based ranking models from arbitrary pairwise preferences. Pointwise and pairwise methods are able to deal with any type of partial ranking, at the expense of lower statistical efficiency. Generalized rank-breaking methods (Khetan and Oh, 2018) were recently proposed to better trade off computational and statistical efficiency in LTR.
An arbitrary pairwise preference over $N$ items can be represented as a directed acyclic graph (DAG) of $N$ nodes, where each node is an item and each directed edge represents the preference over a pair of items. The generalized rank-breaking methods first apply a graph algorithm to extract a maximal ordered partition of $[N]$, $S_1 \succ \cdots \succ S_M$: a group of $M$ disjoint partitions of $[N]$ with the largest possible $M$, such that the item preference in the $M$ partitions is consistent with that of the DAG. One difference between data with partitioned preference and data with arbitrary pairwise preference is that the maximal ordered partition is not unique for the latter, as the maximal ordered partition does not preserve all relationships in the DAG. With the extracted partitions, we can maximize the likelihood of these partitions under a PL model to learn the model parameters. Khetan and Oh (2018) propose to calculate the likelihood as shown in Eq. (2), which has a time complexity involving factorials of the partition sizes. To overcome this challenge when learning from large-scale data, existing methods need to approximate the likelihood by dropping the top partitions with large sizes. In contrast, the proposed numerical approach in this paper can be directly applied to the likelihood evaluation step of generalized rank-breaking methods to significantly improve their computational efficiency.
4 Experiments
In this section, we report empirical results on both synthetic and real-world datasets. We compare the proposed method, denoted as PL-Partition, with two groups of baseline methods that can be applied to large-scale partitioned-preference data.
First, we consider two softmax-based listwise methods: PL-LB (Hino et al., 2010) and AttRank (Ai et al., 2018). PL-LB optimizes a lower bound of the likelihood of partitioned preference under the PL model; in particular, for each $m$, the corresponding term in Eq. (5) is replaced by a tractable lower bound. AttRank optimizes the cross-entropy between the softmax outputs and an empirical probability based on the item relevance given by the training labels.
For the second group of baselines, we consider two popular pairwise methods: RankNet (Burges et al., 2005) and RankSVM (Joachims, 2002). RankNet optimizes a logistic surrogate loss on each pair of items that can be compared. RankSVM optimizes a hinge surrogate loss on each pair of items that can be compared.
4.1 Simulation
We conduct experiments on synthetic data generated from a PL model. The goal of this simulation study is twofold: 1) we investigate how accurately the proposed method can recover the ground-truth utility scores of a PL model; 2) we empirically compare the computation costs of different methods over data at different scales.
Synthetic data generated from a PL model. We first generate a categorical probability simplex as the ground-truth utility scores of a PL model over $N$ items. Then we draw samples of full rankings from this PL model. Finally, we randomly split each full ranking into $M$ partitions and remove the order within each partition to get the partitioned preference. We note that the synthetic data is stateless (i.e., there are no features for each sample), as this simulation focuses on the estimation of the PL utility scores rather than the relationship between the scores and sample features. Further, in large-scale real-world applications, we can often only observe the order of a limited number of items per sample. For example, a user can only consume a limited number of recommended items. To respect this pattern, we restrict the total number of items in the top $M-1$ partitions to be at most 500 regardless of $N$. We fix the remaining settings and generate data with varying $N$ and random seeds.
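The generation protocol above can be sketched as follows (a minimal version under our reading of the protocol; the function name, the uniform random cut points, and the use of Python's `random` module are our own choices):

```python
# Sketch of the synthetic-data protocol: sample a full ranking from a PL
# model via the Gumbel trick, then cut it into M ordered partitions and
# discard the within-partition order.
import math, random

def make_partitioned_sample(scores, n_partitions, rng):
    n = len(scores)
    # Gumbel-perturbed utilities; sorting them yields a PL-distributed ranking.
    perturbed = [(s - math.log(-math.log(rng.random())), i)
                 for i, s in enumerate(scores)]
    full = [i for _, i in sorted(perturbed, reverse=True)]
    # Random cut points split the full ranking into ordered partitions.
    cuts = sorted(rng.sample(range(1, n), n_partitions - 1))
    bounds = [0] + cuts + [n]
    return [set(full[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Each returned list of sets is one partitioned-preference sample: the sets are ordered, disjoint, and jointly cover all items.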
Experiment setups. As the synthetic data is stateless, we only need to train $N$ free parameters, with each parameter corresponding to an item. We use the proposed method and the baseline methods to train these parameters using stochastic gradient descent with early stopping. We use the AdaGrad optimizer with an initial learning rate of 0.1 for all methods. We report the mean squared error (MSE) between the softmax of these free parameters and the ground-truth PL utility scores as a measure of how accurately different methods recover the scores.
MSE of the estimated PL utility scores. Figure 1 shows the MSE of the estimated PL utility scores by different methods over various $N$ and $M$. To better compare results across data with different numbers of items $N$, we further include an oracle reference method, PL-TopK, which has access to the full ranking of the items in the top $M-1$ partitions and optimizes the corresponding PL likelihood. First, as expected, the proposed PL-Partition method best recovers the ground-truth utility scores in terms of MSE on all data configurations, as it numerically approximates the PL likelihood of the partitioned preference data. It is worth noting, however, that PL-LB, while also approximating the PL likelihood with a lower bound, performs even worse than the pairwise methods in some configurations, which indicates that the existing lower-bound method is not sufficient to take full advantage of the PL model.
Computation cost. Figure 2 shows the time and memory costs of different methods over various $N$. The results are obtained using a single Nvidia V100 GPU. We report the total time of running 1000 steps of stochastic gradient descent with batch size 20, as well as the peak CUDA memory. The costs of both time and memory for the pairwise methods grow faster than those for the listwise methods as $N$ increases. The two listwise baseline methods, AttRank and PL-LB, have similar memory costs; PL-LB has a larger running time due to the calculation of multiple partition functions. The proposed PL-Partition method has an overhead cost due to the numerical integration. However, we observe that this overhead cost is amortized as $N$ increases: for large $N$, the computational cost of PL-Partition becomes close to that of PL-LB. Overall, this benchmark empirically demonstrates that the proposed method is scalable to large-scale applications.
4.2 RealWorld Datasets
Experiment setups. We also verify the effectiveness of the proposed method on 4 real-world XML datasets (Bhatia et al., 2016): Delicious-1K (D1K), Eurlex-4K (E4K), Wiki10-31K (W31K), and Delicious-200K (D200K). The trailing number in the name of each dataset indicates the number of classes in the dataset.
We first compare the proposed PL-Partition method with the 4 baseline methods. Note that PL-LB and AttRank collapse into exactly the same method on XML classification, as the number of partitions is 2, so we only report the results of PL-LB. For each method, we train a neural network model with the same architecture: a 2-layer fully connected network with ReLU activations and a hidden size of 256. We train the neural networks with stochastic gradient descent using the Adam optimizer. The batch size is fixed to 128. We use the official train-test split of each dataset and further split the training set into training and validation sets (9:1 for D200K and 3:1 for the other datasets). We tune the learning rate by line search and apply early stopping in training, based on the validation sets.
We also compare PL-Partition with two state-of-the-art embedding-based XML classifiers, SLEEC (Bhatia et al., 2015) and LEML (Yu et al., 2014), which are listed on the XML repository leaderboard (Bhatia et al., 2016). SLEEC and LEML share similar model architectures with our setup, i.e., 2-layer neural networks, but use different training objectives: SLEEC uses a nearest-neighbor loss and LEML uses a least-squares loss.
Results. As can be seen in Table 1, the proposed PL-Partition method significantly outperforms the softmax-based listwise method PL-LB on all datasets, indicating the importance of optimizing the proper likelihood for listwise methods. PL-Partition also outperforms the pairwise methods RankSVM and RankNet on D1K, W31K, and D200K, where the number of labels per sample is relatively large. When the number of labels per sample is relatively small, breaking the labels into pairwise comparisons loses little information, and the pairwise methods perform well (E4K).
Table 2 shows the comparison between PL-Partition and the embedding-based XML classifiers SLEEC and LEML. SLEEC is better than LEML on all metrics. PL-Partition achieves performance similar to SLEEC on Precision@k and significantly outperforms the baselines on Propensity-Scored Precision@k. The propensity-scored metrics are believed to be less biased towards the head items; thus the results indicate that PL-Partition outperforms SLEEC on the torso and tail items.
Discussions. We note that for the task of XML classification, tree-based methods (Prabhu and Varma, 2014; Jain et al., 2016) sometimes outperform the embedding-based methods. The focus of this paper is to develop scalable listwise LTR methods for learning neural network ranking models from partitioned preference, rather than methods tailored for XML classification. Therefore we restrict our comparison to embedding-based methods, whose model architectures are similar to the 2-layer neural networks in our experiment setup. We also note that SLEEC outperforms the tree-based methods on D200K on the XML repository leaderboard (Bhatia et al., 2016). This indicates that PL-Partition achieves state-of-the-art performance on D200K, where the top partition size is relatively large. Guo et al. (2019) recently showed that, with advanced regularization techniques, embedding-based methods trained by RankSVM or PL-LB can be significantly improved to surpass state-of-the-art tree-based methods on most XML datasets. Applying such regularization techniques to our proposed LTR objective is an interesting future direction.
5 Conclusion
In this paper, we study the problem of learning neural network ranking models with a Plackett–Luce-based listwise LTR method from data with partitioned preferences. We overcome the computational challenge of calculating the likelihood of partitioned preferences under the PL model by proposing an efficient numerical integration approach. The key insight of this approach comes from the random utility model formulation of the Plackett–Luce model with the Gumbel distribution. Our experiments on both synthetic and real-world data show that the proposed method is more effective and more scalable than popular existing LTR methods.
Appendix A Appendix
A.1 Proof of Lemma 1
Proof.
To simplify the notation, let $w_i = \exp(s_i)$. Then
∎
A.2 Proof of Proposition 1
Proof.
We first show the first equality, $P_s(S \succ R) = P(\min_{i \in S} U_i > G_R)$. If $S \cup R = [N]$, then the event $S \succ R$ is equivalent to the event $\min_{i \in S} U_i > \max_{j \in R} U_j$, so this equality holds true. Otherwise, assume there is an item $k \in [N]$ but $k \notin S \cup R$.
We introduce a few notations to assist the proof. For any , let . Further let be the set of all possible permutations of that are consistent with the partial ranking , i.e.,
Then we can write the LHS as
where we have slightly abused the notation by using the same symbols to refer to both the Gumbel random variables in the first line and the corresponding integral variables in the following lines. We have also omitted the integral variables and the probability densities in the derivation. To further ease the notation, we define and , then
(9) 
where the last equality utilizes the fact that all the Gumbel variables are independent.
By applying Eq. (10) to all the items that do not belong to $S \cup R$, we get
(11) 
Note that this reduces to a situation equivalent to the case $S \cup R = [N]$. Therefore we have shown the first equality.
The proof of the second equality remains the same regardless of whether $S \cup R = [N]$, as the Gumbel variables are independent. We refer the reader to Appendix B of Kool et al. (2020) for the proof. ∎
A.3 Proof of Lemma 2
A.4 Proof of Theorem 1
Before we start our proof of Theorem 1, we first introduce the well-known discretization error bound for the composite midpoint rule of numerical integration in Lemma 3.
Lemma 3 (Discretization Error Bound of the Composite Midpoint Rule).
Suppose we use the composite midpoint rule with $n$ intervals to approximate the following integral for some $a < b$,

$$I = \int_a^b f(u) \, \mathrm{d}u.$$

Assume $f''$ is continuous on $[a, b]$ and $|f''(u)| \le C$ for all $u \in [a, b]$. Then the discretization error is bounded by $\frac{C (b-a)^3}{24 n^2}$.
Proof of part (a). The sketch of the proof is as follows. We first give an upper bound on the discretization error in terms of the number of intervals. Then we can obtain the number of intervals required for any desired level of error.
In particular, we bound the discretization error in two parts. We first bound the absolute value of the integral on the region $[0, \delta]$ for some sufficiently small $\delta$. We then bound the second derivative of the integrand on $[\delta, 1]$ and apply Lemma 3 to bound the discretization error of the integral on $[\delta, 1]$. The total discretization error is then bounded by the sum of the two parts.
Proof.
We first rewrite the likelihood as follows,
(15) 
where in the second-to-last equality we have applied a change of variables for each integral.
To simplify the notations, let us define for any , and . Then can be written as
Further let
It remains to investigate the properties of and its derivatives on to bound the discretization error of .
We first bound the absolute value of the integral on $[0, \delta]$ for some $\delta > 0$. We have
For any , let , then
Next we bound the second derivative of the integrand on $[\delta, 1]$. We have
For , and each , we know that
and
Further,
and
Therefore, for , we have
By Lemma 3, we know that the discretization error of the integral on $[\delta, 1]$ is bounded by
For the total discretization error to be smaller than $\epsilon$, it suffices to have