We investigate the Plackett-Luce (PL) model based listwise learning-to-rank (LTR) on data with partitioned preference, where a set of items are sliced into ordered and disjoint partitions, but the ranking of items within a partition is unknown. Given items with partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of , where is the maximum size of the top partitions. This computational challenge restrains most existing PL-based listwise LTR methods to a special case of partitioned preference, top- ranking, where the exact order of the top items is known. In this paper, we exploit a random utility model formulation of the PL model, and propose an efficient numerical integration approach for calculating the likelihood and its gradients with a time complexity . We demonstrate that the proposed method outperforms well-known LTR baselines and remains scalable through both simulation experiments and applications to real-world eXtreme Multi-Label classification tasks.
Ma, Yi, Tang, Zhao, Hong, Chi, and Mei
Ranking is a core problem in many information retrieval systems, such as recommender systems, search engines, and online advertising. The industry-scale ranking systems are typically applied to millions of items in a personalized way for billions of users. To meet the need of scalability and to exploit a huge amount of user feedback data, learning-to-rank (LTR) has been the most popular paradigm for building the ranking system. Existing LTR approaches can be categorized into three groups: pointwise (Gey, 1994), pairwise (Burges et al., 2005), and listwise (Cao et al., 2007; Taylor et al., 2008) methods. The pointwise and pairwise LTR methods convert the ranking problem into regression or classification tasks on single or pairs of items respectively. As the real-world ranking data are often presented as (partially) ordered lists of items, the listwise LTR methods instead directly optimize objective functions defined on ranked lists of items, in order to preserve more information of the interrelations among items in a list.
One of the most well-known group of listwise LTR methods (Cao et al., 2007; Xia et al., 2008) are based on the Plackett-Luce (PL) model (Plackett, 1975; Luce, 1959). These methods define their objective functions as the likelihood of the observed ranked list under a PL model. Despite being useful in many cases, a major limitation of such methods comes from the fact that, evaluating the likelihood of general partial rankings under a PL model is usually intractable for a large number of items. This computational challenge restricts the application of existing PL-based listwise LTR methods to limited special cases of partial rankings, such as top- ranking, where the exact order of the top items is known.
In this paper, we extend PL-based listwise LTR to a more general class of partial rankings, the partitioned preference (Lebanon and Mao, 2008; Lu and Boutilier, 2014), defined as following: given items, partitioned preference slices the items into disjoint partitions, where order of items within each partition are unknown while the partitions have a global order. Partitioned preference not only is a strictly more general class of partial rankings compared to top- ranking, but also better characterizes real-world ranking data. For example, in a page of recommended items, we usually only observe binary clicks or a small number of ordinal ratings (e.g., 5-star rating) as user feedback but do not know the exact order among the clicked items or items with the same rating scale. However, computing the exact likelihood of data with partitioned preference under the PL model requires an intractable time complexity
To overcome this computational challenge, we propose a novel numerical integration method. The key insight of our method is that, by exploiting a random utility model formulation of the PL model with Gumbel distribution (Yellott Jr, 1977; McFadden, 1978), we find that both the log-likelihood and its gradients can be re-written as the summation of multiple one-dimensional integrals, which can be efficiently approximated by numerical integration. We formally demonstrate that, as the number of items grows, the overall time complexity of the proposed numerical approach is in order to maintain a constant level of numerical error , which is much more efficient than the naive approach with the complexity . We also discuss how our proposed approach might improve the generalized rank-breaking methods (Khetan and Oh, 2018).
We evaluate the effectiveness of the proposed method through both simulation and experiments with real-world datasets. For simulation, we show that the proposed method can better recover the ground-truth parameters of a PL model compared to baseline methods, including a method (Hino et al., 2010) that approximates the PL likelihood with a tractable lower bound. We also test the proposed method on real-world extreme multilabel (XML) classification datasets (Bhatia et al., 2016). We show that the proposed method can efficiently train neural network ranking models for items at million-level, and outperforms other popular listwise and pairwise LTR baselines.
2 Related Work
Our work falls in the area of LTR (Liu, 2009). The goal of LTR is to build machine learning models to rank a list of items for a given context (e.g., a user) based on the feature representation of the items and the context. The choice of the ranking objective plays an important role in learning the ranking models. Existing ranking objectives can be generally categorized in to three groups: pointwise (Gey, 1994), pairwise (Joachims, 2002; Burges et al., 2005), and listwise (Cao et al., 2007; Xia et al., 2008; Taylor et al., 2008; Christakopoulou and Banerjee, 2015; Ai et al., 2018; Wang et al., 2018). The PL model has been widely used in listwise LTR methods (Cao et al., 2007; Xia et al., 2008; Schäfer, 2018). However, to our best knowledge, existing PL-based listwise methods cannot be applied to partitioned preference data, due to the aforementioned computational complexity of evaluating the likelihood. Our work tackles the computational challenge with a novel numerical approach. Beyond the computational challenge, another major limitation of the PL-based listwise methods is that, the underlying independence of irrelevant alternatives (IIA) assumption of the PL model, is sometimes overly strong in real-world applications (Seshadri and Ugander, 2019; Wilhelm et al., 2018; Christakopoulou and Banerjee, 2015). But more detailed discussions on the IIA assumption is out of the scope of this paper.
XML classification as a ranking problem. Given features of each sample, the XML classification task requires a machine learning model to tag the most relevant subset of an extremely large label set. The XML classification tasks were initially established as a reformulation of ranking problems (Agrawal et al., 2013; Prabhu and Varma, 2014), and the performance of which is primarily evaluated by various ranking metrics such as Precision@k or nDCG@k (Bhatia et al., 2016). The XML classification tasks are special cases of ranking with partitioned preference, where the class labels are considered items, and for each document its relevant labels form one partition and irrelevant labels form a second, lower-ranked partition. In this work, we apply the proposed method for ranking with partitioned preference to the XML classification datasets, and we find it achieving the state-of-the-art performance on datasets where the first partition, i.e., the set of relevant labels, is relatively large.
2.2 Rank Aggregation
Rank aggregation aims to integrate multiple partial or full rankings into one ranking. The multiple rankings are considered as noisy samples from underlying ground truth ranking preferences. Rank aggregation is a broader research area that includes LTR as a subproblem. Statistical modeling is a popular approach for rank aggregation. Various statistical models (Mallows, 1957; Luce, 1959; Plackett, 1975) are proposed to model the rank generation process in the real world. Among them, the PL model (Luce, 1959; Plackett, 1975) is one of the most widely-used. Evaluating the likelihood of the PL model on various types of partial rankings has been widely studied (Hunter et al., 2004; Maystre and Grossglauser, 2015; Liu et al., 2019; Yıldız et al., 2020; Zhao and Xia, 2020). However, we note that many of these studies (Maystre and Grossglauser, 2015; Liu et al., 2019) are designed for ranking data without any features, and thus are not suitable for LTR tasks. The ones (Yıldız et al., 2020; Zhao and Xia, 2020) that can leverage sample features are not directly applicable to large-scale partitioned-preference data. It is worth noting that, our proposed method shares the motivation of approximating the intractable PL likelihood using sampling methods (Liu et al., 2019), as numerical integration is a special case of sampling. However, the integral form of the PL likelihood inspired by the connection between the PL model and Gumbel distribution makes our method more efficient than a general sampling method.
3.1 Problem Formulation: Learning PL Model from Partitioned Preference
Suppose there are different items in total and we denote the set by . The PL model and the partitioned preference are formally defined below.
Given the utility scores of the items, , the probability of observing a certain ordered list of these items, , is defined as
A group of disjoint partitions of , , is called a partitioned preference if (a) , where indicates that any item in -th partition has a higher rank than items in the -th partition; (b) the rank of items within the same partition is unknown.
Clearly, and for any . We also denote the size of each partition as , . Under a PL model parameterized by as defined in Eq. (1), the probability of observing such a partitioned preference is given by
where is a function that maps a partial ranking to the set of all possible permutations of that are consistent with the given partial ranking.
Typically, the utility scores are themselves parameterized functions, e.g. neural networks, of the feature representation of the items and the context of ranking (e.g., a particular user). Suppose the features for item are denoted as and the item-independent context features are denoted as . Then the utility score of for a given context (e.g., user) can be written as , where represents the neural network parameters
However, evaluating the likelihood function naively by Eq. (2) requires a time complexity of , which is made clear in the form of Eq. (3) given by Lemma 1. This implies that the likelihood becomes intractable as long as one partition is mildly large (e.g., ).
Let be a function that maps a set of items to the set of all possible permutations of these items. Then Eq. (2) can be re-written as
which happens to be
where is the set of items that do not belong to the top partitions, i.e. .
3.2 Efficient Evaluation of the Likelihood and Gradients
Next, we present how to efficiently the likelihood Eq. (5) and its gradients. We derive a numerical integral approach based on the random utility model formulation of the PL model with Gumbel distribution (Yellott Jr, 1977; McFadden, 1978).
The random utility model formulation of PL.
A random utility model assumes that, for each context, the utility of preferring the item is a random variable . In particular, is the aforementioned parameterized utility function, and is a random noise term that contains all the unobserved factors affecting the individual’s utility. When each independently follows a standard Gumbel distribution (or equivalently, each independently follows , a Gumbel distribution with the location parameter set to ), we have the following fact (Yellott Jr, 1977), for any permutation of items ,
It implies that, after sampling independent Gumbel variables, the ordered indices returned by sorting the Gumbel variables follow the PL model. Following this result, we have developed Proposition 1 that characterizes the preference for a given context between any two disjoint partitions.
Given a PL model parameterized by , for any and , the probability of is given by
where denotes a random variable following and .
Kool et al. (2020) have shown a weaker version of Proposition 1 where . Here we extend it to the case where , whose proof is given in Appendix A.2. In particular, the second equality in Eq. (1) provides an efficient way of computing the likelihood of the preference for a given context between two disjoint partitions.
Compute the likelihood and gradients by numerical integral.
Following the random utility model formulation of PL, we can compute the log-likelihood of the preference for a given context among partitions and its gradient efficiently by one-dimensional numerical integration. For the likelihood function, it directly follows from Lemma 1 and Proposition 1 that
where . Therefore, we can compute the log-likelihood function through one-dimensional numerical integration. From Lemma 2, we can see that the gradients of the log-likelihood can also be obtained by numerical integrations using Eq. (6).
The gradients of the log-likelihood w.r.t. can be written as
Analysis of the computation cost. Suppose the number of numerical integration intervals is set as . Evaluating the likelihood (5) requires a time complexity and evaluating the gradients (6) requires a time complexity . Next we quantify the minimum required by certain desired numerical error. The numerical error consists of the discretization error, which is caused by the discretization of the integral, and the round-off error, which is caused by the finite precision of the computer. While there is non-negligible round-off error if we calculate the numerical integration directly using formula (5) and (6), this can be largely alleviated using common numerical tricks (see Appendix A.5). So we mainly focus on the analysis of the discretization error, which is given in Theorem 1.
Assume that there exist some constants , and and , such that, for any ,
Then for any and ,
we need at most intervals for the -th integral in the likelihood (5) to have a discretization error smaller than ;
if there exists some constant and , such that, for each , defining and , the following conditions hold,
then we need at most intervals for the -th integral in gradients (6) to have a discretization error smaller than .
We note that the assumptions (7) and (8) in Theorem 1 respectively require the largest and the smallest neural network output logit, plus a constant , to be not too far away from zero
Theorem 1 implies that the computation cost of the proposed method is overall to maintain a discretization error at most for both the likelihood and its gradients, which is clearly much more efficient than the naive approach with factorial terms. We further highlight several computational advantages of the proposed numerical approach. 1) The whole computation is highly parallelizable: the computation of the integrands and the product over within each integrand can all be done in parallel. 2) The number of intervals can be adjusted to control the trade-off between computation cost and accuracy. 3) In large-scale ranking data, we often have , thus will be negligible for large , resulting a linear complexity w.r.t. .
3.3 Improving the Computational Efficiency of Generalized Rank-Breaking Methods
We discuss the potential application of the proposed numerical approach to generalized rank-breaking methods (Khetan and Oh, 2018) as a final remark of this section.
While partitioned preference is a general class of partial rankings for a set of items, it is not able to represent the class of all possible partial rankings, which is also known as arbitrary pairwise preference (Lu and Boutilier, 2014; Liu et al., 2019). It is challenging to learn arbitrary pairwise preference using a listwise method. To the best of our knowledge, there is no scalable listwise method that is able to learn industry-scale PL-based ranking models. Pointwise and pairwise methods are able to deal with any types of partial rankings at the expense of lower statistical efficiency. Generalized rank-breaking methods (Khetan and Oh, 2018) are recently proposed to better trade-off the computational and statistical efficiency in LTR.
An arbitrary pairwise preference of items can be represented as a directed acyclic graph (DAG) of nodes, where each node is an item and each directed edge represents the preference over a pair of items. The generalized rank-breaking methods first apply a graph algorithm to extract a maximal ordered partition of , : a group of disjoint partitions of with largest possible M, such that the item preference in the M partitions is consistent with that of the DAG. One difference between data with partitioned preference and data with arbitrary pairwise preference is that the maximal ordered partition is not unique for the latter, as the maximal ordered partition does not preserve all relationships in the DAG. With the extracted partitions, we can maximize the likelihood of these partitions under a PL model to learn the model parameters. Khetan and Oh (2018) propose to calculate the likelihood as shown in Eq. (2), which has a time complexity involves factorials of the partition sizes. To overcome this challenge for learning large-scale data, existing methods need to approximate the likelihood by dropping the top partitions with large sizes. In contrast, the proposed numerical approach in this paper can be directly applied to the likelihood evaluation step of generalized rank-breaking methods to significantly improve the computational efficiency.
In this section, we report empirical results on both synthetic and real-world datasets. We compare the proposed method, denoted as PL-Partition, with two groups of baseline methods that can be applied to large-scale partitioned preference data.
First, we consider two softmax-based listwise methods: PL-LB (Hino et al., 2010) and AttRank (Ai et al., 2018). PL-LB optimizes a lower bound of the likelihood of partitioned preference under the PL model. In particular, for each , the term in Eq. (5) is replaced by its lower bound . AttRank optimizes the cross-entropy between the softmax outputs and an empirical probability based on the item relevance given by training labels.
For the second group of baselines, we consider two popular pairwise methods: RankNet (Burges et al., 2005) and RankSVM (Joachims, 2002). RankNet optimizes a logistic surrogate loss on each pair of items that can be compared. RankSVM optimizes a hinge surrogate loss on each pair of items that can be compared.
We conduct experiments on synthetic data which is generated from a PL model. The goal of this simulation study is two-fold: 1) we investigate how accurate the proposed method can recover the ground truth utility scores of a PL model; 2) we empirically compare the computation costs of different methods over data with different scales.
Synthetic data generated from a PL model. We first generate a categorical probability simplex as the ground truth utility scores for items following and . Then we draw samples of full ranking from a PL model parameterized by . Finally, we randomly split the full rankings into partitions and remove the order within each partition to get the partitioned preference. We note that the synthetic data is stateless (i.e., there is no feature for each sample), as this simulation focuses on the estimation of the PL utility scores rather than the relationship between and sample features. Further, in large-scale real-world applications, we often can only observe the order of limited items per sample. For example, a user can only consume a limited number of recommended items. To respect this pattern, we restrict the total number of items in top partitions to be at most 500 regardless of . We fix and generate data with varying and random seeds.
Experiment setups. As the synthetic data is stateless, we only need to train free parameters, with each parameter corresponding to an item. We use the proposed method and baseline methods to respectively train the parameters using stochastic gradient descent with early-stopping. We use AdaGrad optimizer with initial learning rate of 0.1 for all methods. We report the mean squared error (MSE) between the softmax of these free parameters and the PL utility scores as a measure of how accurately different methods recover .
MSE of the estimated PL utility scores. Figure 1 shows the MSE of the estimated PL utility scores by different methods over various and . To better compare results across data with different numbers of items , we further include an oracle reference method PL-TopK, which has access to the full ranking of the items in the top partitions and optimizes the corresponding PL likelihood. First, as expected, the proposed PL-Partition method best recovers the ground truth utility scores in terms of MSE on all data configurations, as it numerically approximates the PL likelihood of the partitioned preference data. However, it is worth noting that PL-LB, while also trying to approximate the PL likelihood with a lower bound, performs even worse than pairwise methods when is small, which indicates the existing lower bound method is not sufficient to take the full advantage of the PL model.
Computation cost. Figure 2 shows the time and memory costs of different methods over various . The results are obtained using a single Nvidia V100 GPU. We report the total time of running 1000 steps of stochastic gradient descent with batch size 20. We also report the peak CUDA memory. The costs of both time and memory for the pairwise methods grow faster than those for the listwise methods as increases. The two listwise baseline methods, AttRank and PL-LB, have similar memory cost. PL-LB has a larger running time due to the calculation of multiple partition functions. The proposed PL-Partition method has an overhead cost due to the numerical integration. However, we observe that this overhead cost is amortized as increases. When , the computational cost of PL-Partition becomes close to that of PL-LB. Overall, this benchmark empirically demonstrates that the proposed method is scalable for large-scale applications
4.2 Real-World Datasets
Experiment setups. We also verify the effectiveness of the proposed method on 4 real-world XML datasets (Bhatia et al., 2016): Delicious-1K (D-1K), Eurlex-4K (E-4K), Wiki10-31K (W-31K), and Delicious-200K (D-200K). The trailing number in the name of each dataset indicates the number of classes in the dataset
We first compare the proposed PL-Partition method with the 4 baseline methods. Note that PL-LB and AttRank collapse into exactly the same method on XML classification as the number of partitions is 2. So we only report the results of PL-LB. For each method, we train a neural network model with the same architecture, 2-layer fully connected network with ReLU activations and hidden size of 256. We train the neural networks with stochastic gradient descent using the ADAM optimizer. The batch size is fixed to 128. We use the official train-test split of each dataset and further split the training set into training and validation (9:1 for D-200K and 3:1 for other datasets). We tune the learning rate by line-search from and apply early-stopping in training, based on validation sets.
We also compare PL-Partition with two state-of-the-art embedding-based XML classifiers, SLEEC (Bhatia et al., 2015) and LEML (Yu et al., 2014), which are listed on the XML repository leaderboard (Bhatia et al., 2016). SLEEC and LEML share similar model architectures with our setup, i.e., 2-layer neural networks, but use different training objectives: SLEEC uses a nearest-neighbor loss; LEML uses a least-square loss.
Results. As can be seen in Table 1, the proposed PL-Partition method significantly outperforms the softmax-based listwise method PL-LB on all datasets, indicating the importance of optimizing the proper utility function for listwise methods. PL-Partition also outperforms the pairwise methods RankSVM and RankNet on D-1K, W-31K, and D-200K, where the number of labels per sample is relatively large. When the number of labels per sample is relatively small, breaking the labels into pairwise comparisons leads to little loss of information, and pairwise methods perform well (E-4K).
Table 2 shows the comparison between PL-Partition and embedding-based XML classifiers SLEEC and LEML. SLEEC is better than LEML on all metrics. PL-Partition achieves similar performance to SLEEC on Precision@k and significantly outperforms the baselines on Propensity-Scored Precision@k. The propensity-scored metrics are believed to be less biased towards the head items. Thus the results indicate PL-Partition has better performance than SLEEC for the torso or tail items.
Discussions. We note for the task of XML classification, tree-based methods (Prabhu and Varma, 2014; Jain et al., 2016) sometimes outperform the embedding-based methods. The focus of this paper is to develop scalable listwise LTR methods for learning neural network ranking models from partitioned preference, instead of methods tailored for XML classification. Therefore we restrain our comparison to embedding methods only, whose model architectures are similar as the 2-layer neural networks in our experiment setup. We also note that SLEEC outperforms tree-based methods for D-200K on the XML repository leaderboard (Bhatia et al., 2016). This indicates PL-Partition achieves state-of-the-art performance on D-200K, where the top partition size is relatively large. Guo et al. (2019) recently showed that, with advanced regularization techniques, embedding-based methods trained by RankSVM or PL-LB can be significantly improved to surpass state-of-the-art tree-based methods on most XML datasets. It seems an interesting future direction to apply such regularization techniques on our proposed LTR objective.
In this paper, we study the problem of learning neural network ranking models with a Plackett-Luce-based listwise LTR method from data with partitioned preferences. We overcome the computational challenge of calculating the likelihood of partitioned preferences under the PL model by proposing an efficient numerical integration approach. The key insight of this approach comes from the random utility model formulation of Plackett-Luce with Gumbel distribution. Our experiments on both synthetic data and real-world data show that the proposed method is both more effective and scalable compared to popular existing LTR methods.
Appendix A Appendix
a.1 Proof of Lemma 1
To simplify the notation, let . Then
a.2 Proof of Proposition 1
We first show . If , then the event of is equivalent to the event of so this equality holds true. Otherwise, assume there is a but .
We introduce a few notations to assist the proof. For any , let . Further let be the set of all possible permutations of that are consistent with the partial ranking , i.e.,
Then we can write the LHS as
where we slightly abused the notation by using it to refer both the Gumbel random variables in the first line and the corresponding integral variables in the following lines. We have also omitted the integral variables and the probability densities in the derivation. To further ease the notation, we define and , then
where the last equality utilizes the fact that all the Gumbel variables are independent.
Note that, in Eq. (9), given , regardless the choice of . Therefore,
By applying Eq. (10) to all the items that do not belong to , we get
And note that this reduces to a situation equivalent to the case . Therefore we have shown .
The proof for
remains the same no matter if or not, as the Gumbel variables are independent. We refer the reader to the Appendix B of Kool et al. (2020) for the proof. ∎
a.3 Proof of Lemma 2
a.4 Proof of Theorem 1
Lemma 3 (Discretization Error Bound of the Composite Mid-point Rule.).
Suppose we use the composite mid-point rule with intervals to approximate the following integral for some ,
Assume is continuous for and . Then the discretization error is bounded by .
Proof of the part (a). The sketch of the proof is as follows. We first give an upper bound of the discretization error in terms of the number of intervals. Then we can obtain the number of intervals required for any desired level of error.
In particular, we bound the discretization error in two parts. We first bound the absolute value of the integral on the region for some sufficiently small . We then bound the second derivative of the integrand on and apply Lemma 3 to bound the discretization error of the integral on . The total discretization error is then bounded by the sum of the two parts.
We first re-write the likelihood as follows,
where in the last second equality we have applied a change of variable for each integral.
To simplify the notations, let us define for any , and . Then can be written as
It remains to investigate the properties of and its derivatives on to bound the discretization error of .
We first bound the absolute value of the integral on for some . We have
For any , let , then
Next we bound the second derivative of the integrand, , on . We have
For , and each , we know that
Therefore, for , we have
By Lemma 3, we know that the discretization error of the integral on is bounded by
For the total discretization error of to be smaller than , it suffices to have