Analysis of Regression Tree Fitting Algorithms
in Learning to Rank
Abstract
In learning to rank, industry-level applications have been dominated by the gradient boosting framework, which fits a tree using the least square error principle. In classification, another tree fitting principle, weighted least square error, has been widely used, for example in LogitBoost and its variants. However, the relationship between the two principles has not been analyzed in the learning to rank setting. We propose a new principle, named least objective loss based error, that enables us to analyze this relationship as well as several important learning to rank models. We also implement two typical and strong systems and conduct experiments on two real-world datasets. Experimental results show that the proposed method brings moderate improvements over the least square error principle.
1 Introduction
Top practical learning to rank systems adopt the gradient boosting framework and use regression trees as weak learners. These systems perform much better than linear systems on real-world datasets such as Yahoo challenge 2010 ([4]) and Microsoft 30K ([21]). Among these systems, LambdaMART ([24, 3]), a pairwise model, gained an excellent reputation in the Yahoo challenge; MART ([8]), a pointwise model, is a regression model that uses the least square loss as its objective loss function; and McRank ([12]), also pointwise, uses multi-class classification and converts the predictions into a ranking (Li et al. call the McRank model MART in the scenario of classification). For industry applications, gradient boosting combined with regression trees appears to be standard practice.
An important finding in ([6, 12]) created a bridge between learning to rank and classification: NDCG, an important measure in learning to rank, can be bounded by the multi-class classification error. This insight opens a door for learning to rank, as we can borrow state-of-the-art classification techniques.
The least square error (SE) principle for fitting a regression tree uses only the first-order information of the objective loss function, whereas in multi-class classification there is work that additionally uses second-order information. That tree fitting principle is called weighted least square error (WSE); LogitBoost ([9]) and its robust versions ([14, 15]) are examples. A comparison between gradient boosting using SE and LogitBoost using WSE on classification tasks ([8]) shows that the latter is slightly better. As WSE is empirically considered unstable in practice ([9, 15, 8]), [15] derived a stable form of WSE, called RWSE.
However, neither WSE nor RWSE has a clear theoretical explanation, and both are somewhat hard to understand. Li accordingly raised an interesting question in Section 2.3 of ([15]): in determining the output of a leaf of a regression tree, why does the one-step Newton formula from gradient boosting coincide with the weighted averaging from WSE? Moreover, RWSE looks concise, so it might be mistakenly applied to any ranking model beyond LogitBoost and its variants. These issues lead us to consider the problem from another point of view.
We propose a general regression tree fitting principle for ranking models, called least objective loss based error (OLE). It requires only simple computation to derive the exact formula and is easy to understand. Under this principle, we first answer the aforementioned question, and then analyze a variety of ranking systems to build a relationship between SE, (R)WSE and OLE (Figure 1). Experiments on real-world datasets also show moderate improvements.
2 Background
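Given a set of queries $Q = \{q_1, \dots, q_{|Q|}\}$, each query $q_i$ is associated with a set of candidate relevant documents $D_i = \{d_{i1}, \dots, d_{i|D_i|}\}$ and a corresponding vector of relevance scores $\mathbf{y}_i = (y_{i1}, \dots, y_{i|D_i|})$, one for each document in $D_i$. The relevance score is usually a small non-negative integer; a greater value means the document is more relevant to the query. An $M$-dimensional feature vector $\mathbf{x} = (h_1(q, d), \dots, h_M(q, d))$ is created for each query-document pair, where the $h_k$ are predefined feature functions, which usually return real values.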
A ranking function $F$ is designed to score each query-document pair; the documents associated with the same query are ranked by their scores and returned to users. Since these documents have a fixed ground-truth ranking for their query, our goal is to find an optimal ranking function that returns a ranking of the documents as close to the ground truth as possible. Industry-level applications often adopt regression trees to construct the ranking function, and use the Newton formula to calculate the output values of the tree leaves.
Since fitting a regression tree is conducted node by node, in this paper we only discuss the differences between the algorithms that run on each node. We call them node-splitting algorithms. To distinguish them from the objective loss functions of ranking models, we refer to the loss of each node-splitting algorithm as the "error". Examples are the square error (Section 2.1), the (robust) weighted square error (Sections 2.2.1 and 2.2.2), and the objective loss based error (Section 3.1) proposed in this work. Even a minor difference in how two node-splitting algorithms split a tree node can lead to two totally different regression trees.
Several measures have been used to quantify the quality of a ranking with respect to the ground-truth ranking, such as NDCG, ERR, and MAP. In this paper, we use the most popular ones, NDCG and ERR ([4]), as the performance measures.
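To make the NDCG measure concrete, the following is a minimal sketch for a single query. It assumes the common gain $2^{l} - 1$ and a $\log_2$ position discount, which this section does not spell out; the function names are illustrative only.

```python
import math


def dcg_at_k(labels, k):
    """DCG@k with gain 2^l - 1 and log2 position discount (assumed variant)."""
    return sum((2 ** l - 1) / math.log2(pos + 2)
               for pos, l in enumerate(labels[:k]))


def ndcg_at_k(scores, labels, k):
    """NDCG@k for one query: rank documents by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    # A query without any relevant document gets 0 by convention here.
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    print(ndcg_at_k([3.0, 2.0, 1.0], [2, 1, 0], 3))  # perfect ordering
    print(ndcg_at_k([1.0, 2.0, 3.0], [2, 1, 0], 3))  # reversed ordering
```

A perfectly ordered list scores 1, and any other ordering scores less, which is why the measure is usually averaged over queries.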
2.1 Square Error (SE) in Gradient Boosting
We regard gradient boosting ([8]) as a general framework for function approximation that is applicable to any differentiable loss function. Its combination with regression trees as weak learners has been the most successful application in learning to rank.
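Gradient boosting iteratively builds an additive predictor $F(\cdot)$ that minimizes the loss function $\mathcal{L}$. At the $t$-th iteration, a new weak learner $f_t$ is selected and added to the current predictor $F_{t-1}$. Then
$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \eta f_t(\mathbf{x}) \tag{1}$$
where $\eta$ is the learning rate.
In gradient boosting, $f_t$ is chosen as the one most parallel to the pseudo-response $r$, which is defined as the negative derivative of the loss function in functional space:
$$r_i = -\left[\frac{\partial \mathcal{L}}{\partial F(\mathbf{x}_i)}\right]_{F = F_{t-1}} \tag{2}$$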
In practice, the globally optimal $f_t$ usually cannot be obtained, so we turn to operations on each tree node to fit a suboptimal tree. In one node, we enumerate all possible feature-threshold pairs and select the best one to conduct a binary split. This procedure is applied recursively to the child nodes until a predefined condition is reached.
For a given feature, suppose a threshold $v$ splits the samples on the current node into two parts. The samples whose feature values are less than $v$ are denoted as $L$, and the others as $R$. The squared error is defined as
$$SE(v) = \sum_{i \in L} (r_i - \bar{r}_L)^2 + \sum_{i \in R} (r_i - \bar{r}_R)^2 \tag{3}$$
where $\bar{r}_L$ and $\bar{r}_R$ are the average pseudo-responses of the samples on the left and right respectively.
The complexity of a regression tree could be controlled by limiting the tree height or leaf number. In learning to rank, the latter is more flexible, and adopted in this work by default.
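The split search on a single feature described in this section can be sketched as follows. This is a brute-force illustration rather than an efficient implementation: thresholds are taken as midpoints between adjacent distinct feature values, and the names `se_of_split` and `best_se_split` are hypothetical.

```python
def se_of_split(left, right):
    """Squared error of a split: squared deviations from each side's mean."""
    total = 0.0
    for side in (left, right):
        mean = sum(side) / len(side)
        total += sum((r - mean) ** 2 for r in side)
    return total


def best_se_split(feature, response):
    """Return (error, threshold) of the best split, or None if no split exists."""
    pairs = sorted(zip(feature, response))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        err = se_of_split(left, right)
        if best is None or err < best[0]:
            best = (err, threshold)
    return best


if __name__ == "__main__":
    # Pseudo-responses change level between feature values 2 and 3.
    print(best_se_split([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0]))
```

Recomputing both sides for every threshold costs quadratic time per feature; practical systems keep running sums while scanning from left to right.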
2.2 Weighted Square Error in Classification
[6] proved that the negative unnormalized NDCG value is upper-bounded by the multi-class classification error, where NDCG is an important measure in learning to rank. Accordingly, [12] proposed McRank, a ranking system based on multi-class classification in the gradient boosting framework.
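McRank utilizes classic logistic regression, which models the class probability as
$$p_k(\mathbf{x}) = \frac{\exp(F_k(\mathbf{x}))}{\sum_{j} \exp(F_j(\mathbf{x}))} \tag{4}$$
where $F_k$ is an additive predictor function for the $k$-th class.
The objective loss function is the negative log-likelihood, defined as
$$\mathcal{L} = -\sum_{i} \sum_{k} \mathbb{1}(y_i = k) \log p_k(\mathbf{x}_i) \tag{5}$$
where $\mathbb{1}(\cdot)$ is an indicator function.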
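As a small illustration of the class probability model and the negative log-likelihood, the sketch below uses plain Python lists. The conversion of class probabilities into a ranking score by expected relevance is an assumption made for illustration; this section only states that McRank converts predictions into a ranking.

```python
import math


def class_probabilities(scores):
    """Softmax over the per-class scores F_k(x) of one sample."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def negative_log_likelihood(all_scores, labels):
    """Sum over samples of -log p_{y_i}(x_i); labels are 0-based class indices."""
    return -sum(math.log(class_probabilities(s)[y])
                for s, y in zip(all_scores, labels))


def expected_relevance(scores):
    """Assumed ranking score: sum_k k * p_k(x)."""
    return sum(k * p for k, p in enumerate(class_probabilities(scores)))


if __name__ == "__main__":
    per_class_scores = [[2.0, 0.5, -1.0], [-0.5, 0.0, 1.5]]
    print(negative_log_likelihood(per_class_scores, [0, 2]))
    print([expected_relevance(s) for s in per_class_scores])
```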
2.2.1 Weighted Square Error (WSE)
In classification, this loss function (Eqn. 5) resulted in the well-known system LogitBoost, which was the first to use WSE to fit a regression tree. WSE utilizes both first- and second-order derivative information.
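WSE uses a different definition of the response value from that in SE (Eqn. 2), and defines an extra weight for each sample:
$$r_i = -\frac{\mathcal{L}'_i}{\mathcal{L}''_i}, \qquad w_i = \mathcal{L}''_i \tag{6}$$
where $\mathcal{L}'_i$ and $\mathcal{L}''_i$ are the first and second derivatives of the loss with respect to $F(\mathbf{x}_i)$.
The splitting principle is minimizing the following weighted error
$$WSE(v) = \sum_{i \in L} w_i (r_i - \bar{r}_L)^2 + \sum_{i \in R} w_i (r_i - \bar{r}_R)^2 \tag{7}$$
where $L$ and $R$ are the samples on the left and right of the threshold $v$, and
$$\bar{r}_L = \frac{\sum_{i \in L} w_i r_i}{\sum_{i \in L} w_i}, \qquad \bar{r}_R = \frac{\sum_{i \in R} w_i r_i}{\sum_{i \in R} w_i} \tag{8}$$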
However, LogitBoost is usually considered unstable. For its loss function, the response and weight values given by Eqn. 6 are $r_{i,k} = \frac{y_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$ and $w_{i,k} = p_{i,k}(1 - p_{i,k})$ when fitting a tree for the $k$-th class, where $y_{i,k} = \mathbb{1}(y_i = k)$ and $p_{i,k} = p_k(\mathbf{x}_i)$. The response may become huge and cause instability when $p_{i,k}$ in the denominator is close to 0 or 1. Though [9] described some heuristics to smooth the response values, LogitBoost was still believed to be numerically unstable ([9, 8, 15]). As a result, McRank actually adopts SE and gradient boosting to fit regression trees (Section 2.1), rather than LogitBoost.
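The instability can be seen numerically. The sketch below evaluates the LogitBoost response and weight for a positive sample as its predicted probability approaches 0; the helper name is hypothetical. The response grows without bound, while the product of response and weight stays bounded.

```python
def logitboost_response_weight(y, p):
    """Response and weight of one sample for one class, with y in {0, 1}."""
    w = p * (1.0 - p)
    r = (y - p) / w
    return r, w


if __name__ == "__main__":
    for p in (0.5, 1e-3, 1e-6, 1e-9):
        r, w = logitboost_response_weight(1, p)
        print(f"p={p:g}  response={r:.3e}  weight={w:.3e}  product={r * w:.6f}")
```

Only sums of such bounded products and sums of weights are needed by a split criterion that avoids individual responses, which is the idea behind the robust form.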
2.2.2 Robust Weighted Square Error (RWSE)
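[15] derived a robust form, in which the split is selected to maximize the gain
$$Gain(v) = \frac{\left(\sum_{i \in L} r_i w_i\right)^2}{\sum_{i \in L} w_i} + \frac{\left(\sum_{i \in R} r_i w_i\right)^2}{\sum_{i \in R} w_i} - \frac{\left(\sum_{i \in L \cup R} r_i w_i\right)^2}{\sum_{i \in L \cup R} w_i} \tag{9}$$
Since all denominators in Eqn. 9 are sums of a set of weights $w_i$, which are unlikely to be close to zero in practical applications, RWSE is more stable than WSE.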
After a regression tree is fitted by either SE or (R)WSE, the samples that fall into the same leaf are assigned a value by the weighted average of their responses (Eqn. 8).
Li raised an interesting question regarding Eqn. 8: it can be interpreted as a weighted average in (R)WSE, while in gradient boosting it is interpreted as a one-step Newton update, which looks like a coincidence. In the next section, we propose a unified splitting principle, which not only clearly explains the relationship between these principles, but can also be extended to more complex loss functions. Our method also yields Li's robust version directly.
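The relation between the weighted error and the robust gain can be checked on toy numbers. The sketch below assumes samples are already sorted by the chosen feature and that a split index separates left from right; all function names are illustrative. For any split, the weighted error plus the gain is the same constant, so minimizing one is maximizing the other.

```python
def leaf_value(resp, weight, idx):
    """Weighted average of the responses of the samples in idx."""
    return sum(resp[i] * weight[i] for i in idx) / sum(weight[i] for i in idx)


def wse(resp, weight, split):
    """Weighted squared error of splitting into [0, split) and [split, n)."""
    n = len(resp)
    total = 0.0
    for idx in (range(split), range(split, n)):
        mean = leaf_value(resp, weight, idx)
        total += sum(weight[i] * (resp[i] - mean) ** 2 for i in idx)
    return total


def rwse_gain(resp, weight, split):
    """Robust gain: only sums of r*w and sums of w appear, never a single r."""
    def term(idx):
        num = sum(resp[i] * weight[i] for i in idx)
        return num * num / sum(weight[i] for i in idx)
    n = len(resp)
    return term(range(split)) + term(range(split, n)) - term(range(n))


if __name__ == "__main__":
    resp = [1.0, -2.0, 0.5, 3.0]
    weight = [0.2, 0.1, 0.25, 0.05]
    for s in (1, 2, 3):
        print(s, wse(resp, weight, s), rwse_gain(resp, weight, s),
              wse(resp, weight, s) + rwse_gain(resp, weight, s))
```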
3 Greedy Tree Fitting Algorithm in Learning to Rank
3.1 Objective Loss Based Error (OLE)
We were motivated by the success of AdaBoost ([20]). In each iteration, AdaBoost selects the weak learner with minimal weighted error, and it can be proved that this weak learner ensures the maximum improvement of the objective loss. We borrow a similar strategy for learning to rank, although with totally different formulas.
Finding an optimal regression tree exactly is computationally infeasible, as the number of possible trees is combinatorially huge. We therefore focus on the most basic unit in fitting a regression tree: how to conduct a binary partition that improves the objective loss the most. This is a more tractable approach.
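Given a set of $N$ samples and a selected feature, we first assume there are at most $N-1$ potential positions to define a threshold for the selected feature. Based on a threshold $v$, the samples are split into two parts, $L$ and $R$. Second, we assume that once a partition is conducted, the samples on the two sides receive their updated scores. In other words, we fit a temporary two-leaf tree on the current samples and update the outputs of the two leaves separately (diagonal approximation). We update the samples, calculate the objective loss, and take the resulting improvement as the measure of the current partition. The best partition and its threshold are selected after enumerating at most $N-1$ possibilities. We ignore the fact that either side may not be a real leaf in the final fitted regression tree, so that the method stays feasible.
For a threshold $v$, let the outputs of the temporary two-leaf tree be $\omega_L$ and $\omega_R$; the objective loss then becomes a function of $\omega_L$ and $\omega_R$. Once the values of $\omega_L$ and $\omega_R$ are determined, the samples on the two sides are updated, and the objective loss can be computed straightforwardly. However, even on a moderately sized dataset, this computation is prohibitive. So we approximate the objective loss by its second-order Taylor expansion at the point 0.
$$\mathcal{L}(\boldsymbol{\omega}) \approx \mathcal{L}(\mathbf{0}) + \mathbf{g}^\top \boldsymbol{\omega} + \frac{1}{2} \boldsymbol{\omega}^\top \mathbf{H} \boldsymbol{\omega} \tag{10}$$
where $\boldsymbol{\omega} = (\omega_L, \omega_R)^\top$, and $\mathbf{g}$ and $\mathbf{H}$ are the gradient and Hessian of $\mathcal{L}$ with respect to $\boldsymbol{\omega}$ at $\mathbf{0}$.
The local optimum is obtained as $\boldsymbol{\omega}^* = -\mathbf{H}^{-1}\mathbf{g}$ by setting the first-order derivative to zero. More specifically, under the diagonal approximation, $\omega_L$ and $\omega_R$ are optimized independently as follows
$$\omega_L^* = -\frac{\partial \mathcal{L} / \partial \omega_L}{\partial^2 \mathcal{L} / \partial \omega_L^2} \tag{11}$$
$$\omega_R^* = -\frac{\partial \mathcal{L} / \partial \omega_R}{\partial^2 \mathcal{L} / \partial \omega_R^2} \tag{12}$$
$$v^* = \arg\max_v \; \frac{1}{2}\left[\frac{\left(\partial \mathcal{L} / \partial \omega_L\right)^2}{\partial^2 \mathcal{L} / \partial \omega_L^2} + \frac{\left(\partial \mathcal{L} / \partial \omega_R\right)^2}{\partial^2 \mathcal{L} / \partial \omega_R^2}\right] \tag{13}$$
where the bracketed quantity is the reduction of the approximated loss, $\mathcal{L}(\mathbf{0}) - \mathcal{L}(\boldsymbol{\omega}^*)$, obtained by substituting Eqn. 11 and 12 into Eqn. 10.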
The resulting formula is not equivalent to SE or (R)WSE in general, and can lead to a totally different regression tree from those produced by other node-splitting algorithms.
For learning to rank, we analyze their equivalence for pointwise, pairwise and listwise models.
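As a sanity check of the Newton step and the predicted loss reduction, the sketch below evaluates one candidate partition for a plain squared loss, which is chosen here only because its second-order expansion is exact. The function names and toy data are assumptions for illustration.

```python
def squared_loss(targets, scores):
    return sum((y - f) ** 2 for y, f in zip(targets, scores))


def side_derivatives(targets, scores, idx):
    """d/dw and d^2/dw^2 of sum_{i in idx} (y_i - F_i - w)^2 at w = 0."""
    g = -2.0 * sum(targets[i] - scores[i] for i in idx)
    h = 2.0 * len(idx)
    return g, h


def ole_split_gain(targets, scores, left, right):
    """Predicted loss reduction and the two leaf outputs for one partition."""
    g_l, h_l = side_derivatives(targets, scores, left)
    g_r, h_r = side_derivatives(targets, scores, right)
    w_l, w_r = -g_l / h_l, -g_r / h_r  # one Newton step per side
    gain = 0.5 * (g_l ** 2 / h_l + g_r ** 2 / h_r)
    return gain, w_l, w_r


if __name__ == "__main__":
    targets = [3.0, 1.0, 0.0, 2.0]
    scores = [0.5, 0.5, 0.5, 0.5]
    gain, w_l, w_r = ole_split_gain(targets, scores, [0, 1], [2, 3])
    updated = [scores[0] + w_l, scores[1] + w_l, scores[2] + w_r, scores[3] + w_r]
    print(gain, squared_loss(targets, scores) - squared_loss(targets, updated))
```

For losses that are not quadratic, the predicted and actual reductions differ, and the gain is only the second-order estimate.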
3.2 Derivative Additive Loss Functions
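In order to calculate all the derivatives in Eqn. 13 in an efficient left-to-right incremental way, we explore the cases in which the derivatives $\partial \mathcal{L} / \partial \omega_L$ and $\partial^2 \mathcal{L} / \partial \omega_L^2$ can be decomposed into operations on each sample, where $\omega_L$ is the common output added to the scores of the samples $L$ on one side of a split.
Definition 1.
A loss function is defined as derivative additive if $\frac{\partial \mathcal{L}}{\partial \omega_L} = \sum_{i \in L} \frac{\partial \mathcal{L}}{\partial F(\mathbf{x}_i)}$ and $\frac{\partial^2 \mathcal{L}}{\partial \omega_L^2} = \sum_{i \in L} \frac{\partial^2 \mathcal{L}}{\partial F(\mathbf{x}_i)^2}$, and likewise for $\omega_R$ and $R$.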
Example 1.
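The loss function of the MART system, $\mathcal{L} = \sum_i (y_i - F(\mathbf{x}_i))^2$, is derivative additive, since $\frac{\partial \mathcal{L}}{\partial \omega_L} = -2\sum_{i \in L} (y_i - F(\mathbf{x}_i))$ and $\frac{\partial^2 \mathcal{L}}{\partial \omega_L^2} = \sum_{i \in L} 2$ are both sums of per-sample terms.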
Example 2.
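The loss function of the McRank system (Eqn. 5) is derivative additive, since for the tree of the $k$-th class $\frac{\partial \mathcal{L}}{\partial \omega_L} = \sum_{i \in L} \big(p_k(\mathbf{x}_i) - \mathbb{1}(y_i = k)\big)$ and $\frac{\partial^2 \mathcal{L}}{\partial \omega_L^2} = \sum_{i \in L} p_k(\mathbf{x}_i)\big(1 - p_k(\mathbf{x}_i)\big)$ are both sums of per-sample terms.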
Example 3.
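The loss function of the RankBoost system is not derivative additive, since its exponential loss $\mathcal{L} = \sum_{(i,j):\, y_i > y_j} \exp\big(-(F(\mathbf{x}_i) - F(\mathbf{x}_j))\big)$ couples the scores of two documents in every term.
To explain clearly, we use a toy example with three documents $d_1, d_2, d_3$, sorted by their relevances $y_1 > y_2 > y_3$. Assume that in the current $t$-th iteration their scores are $F_1, F_2, F_3$. The exponential loss is
$$\mathcal{L} = a + b + c$$
where, to simplify, $a = e^{-(F_1 - F_2)}$, $b = e^{-(F_1 - F_3)}$, $c = e^{-(F_2 - F_3)}$.
Suppose one partition is $L = \{d_1, d_2\}$ and $R = \{d_3\}$, with the output values $\omega_L$ and $\omega_R$ respectively; then the current loss is
$$\mathcal{L}(\omega_L, \omega_R) = a\, e^{-(\omega_L - \omega_L)} + (b + c)\, e^{-(\omega_L - \omega_R)} = a + (b + c)\, e^{-(\omega_L - \omega_R)}$$
The key reason is that if two samples appearing in the same term of the objective loss are also placed in the same leaf, they receive the same output from that leaf, which then does not contribute to the objective loss. In this example, the two $\omega_L$, coming from $d_1$ and $d_2$, cancel out. As a result, the exponential loss is not derivative additive.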
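The effect can be verified numerically on a three-document toy case. The sketch below is an illustration with made-up scores: it compares the exact second derivative of a pairwise exponential loss with respect to a shared leaf output (by central finite differences) against the sum of per-sample second derivatives. The two differ by twice the term of the pair that falls into the same leaf.

```python
import math


def exp_loss(scores, pairs):
    """Pairwise exponential loss; each pair (i, j) means i is more relevant."""
    return sum(math.exp(-(scores[i] - scores[j])) for i, j in pairs)


def second_derivative_wrt_leaf(scores, pairs, leaf, eps=1e-4):
    """Central finite difference in the output shared by the samples in leaf."""
    def shifted(w):
        return [s + w if i in leaf else s for i, s in enumerate(scores)]
    return (exp_loss(shifted(eps), pairs) - 2 * exp_loss(scores, pairs)
            + exp_loss(shifted(-eps), pairs)) / eps ** 2


def summed_per_sample_second_derivatives(scores, pairs, leaf):
    """Sum over i in leaf of d^2 L / d F_i^2, ignoring cross terms."""
    total = 0.0
    for i, j in pairs:
        term = math.exp(-(scores[i] - scores[j]))
        total += term * ((i in leaf) + (j in leaf))
    return total


if __name__ == "__main__":
    scores = [0.3, 0.2, 0.1]          # hypothetical current scores
    pairs = [(0, 1), (0, 2), (1, 2)]  # documents sorted by relevance
    leaf = {0, 1}                     # first two documents share a leaf
    print(second_derivative_wrt_leaf(scores, pairs, leaf))
    print(summed_per_sample_second_derivatives(scores, pairs, leaf))
```

By the chain rule the first derivative with respect to a shared leaf output is always the sum of per-sample first derivatives, so in this toy case it is the second derivative that breaks additivity.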
Example 4.
The loss function of LambdaMART system is not derivative additive.
This famous system has no explicit objective loss function, but has exact first- and second-order derivatives. Its first derivative has a unit similar to that of the RankBoost model, both containing terms of the form $F(\mathbf{x}_i) - F(\mathbf{x}_j)$, since LambdaMART is a pairwise system. Based on the detailed analysis of RankBoost, the loss function of LambdaMART, if it exists, is not derivative additive. (Footnote 2: Strictly speaking, the one-step Newton formula in LambdaMART (Line 11 in Alg. 1 of ([24])) is conceptually incorrect, as its denominator is treated as derivative additive. We have not found an explanation in their paper, but it can be viewed as an approximation of the exact formula.)
Example 5.
The loss function of ListMLE is not derivative additive.
As terms that sum over the scores of several documents, such as $\sum_{j \ge i} \exp(F(\mathbf{x}_j))$, appear throughout its loss function, the loss function of ListMLE is also not derivative additive.
Example 6.
Some special listwise models have derivative additive loss functions.
In ([19]), several pointwise systems are modified to use listwise information, so they are considered listwise systems, such as consistent-MART, consistent KL divergence based, and consistent cosine distance based systems. As the extra listwise information is actually used only in a preprocessing step, after which the systems run in a pointwise style, these so-called listwise systems also have derivative additive loss functions.
3.3 (R)WSE ⊂ OLE
(R)WSE was proposed for LogitBoost, which is a classification system. From the angle of learning to rank, however, LogitBoost can be classified as pointwise. We prove that (R)WSE is actually a special case of OLE.
Theorem 1.
For derivative additive loss functions, OLE simplifies to (R)WSE.
Proof.
Substituting the per-sample sums of a derivative additive loss into Eqn. 13 directly yields the robust weighted least square error of ([15]). As Li proved that the robust version is equivalent to the original WSE, our method is equivalent to WSE for all derivative additive loss functions. ∎
By Li's explanation of robustness, our OLE method (Eqn. 13) is intrinsically robust, as its denominators, being sums over a set of samples, are unlikely to be zero.
Recall that a model is classified as pointwise if it does not use the relationship between samples, but only individual samples; see the typical pointwise models in Examples 1 and 2. Pointwise systems therefore have derivative additive loss functions, and in this case (R)WSE is always equivalent to OLE.
A pairwise system considers the relationship only between two samples. If a pair of associated samples is placed in two different tree nodes, the objective loss function is derivative additive; otherwise, it is not. In practical applications, it is not difficult to overcome this inconvenience by using incremental updating.
A listwise system makes more samples interact with each other, and is relatively more difficult to handle. But in splitting a tree node, incremental updating still works.
We are now able to answer the question from ([15]): Eqn. 8 appears to have two explanations, one as a weighted average and the other as a one-step Newton update. We proved that (R)WSE is the special case of OLE obtained for derivative additive objectives, and LogitBoost uses a derivative additive objective, so for LogitBoost (R)WSE and OLE are equivalent and the two explanations coincide. Moreover, for other, more complex objective losses, (R)WSE may have no theoretical support, but it may serve as an approximation of our method.
3.4 SE = (R)WSE = OLE for MART
SE is generally not equivalent to (R)WSE or OLE, even when the objective loss function satisfies the condition of Theorem 1. However, we find that MART is an ideal intersection of OLE, (R)WSE and SE.
The MART system adopts the least square loss as its objective loss, and the classic gradient boosting framework to fit regression trees. Many commercial search engines use this model to construct their ranking systems.
Theorem 2.
For the MART system, whose objective loss is the least-square loss $\mathcal{L} = \sum_i (y_i - F(\mathbf{x}_i))^2$, $SE = (R)WSE = OLE$.
Proof.
Given a chosen feature function and the pseudo-response $r_i$ of each document $d_i$, there are at most $N-1$ positions for $N$ samples to define a threshold $v$, each being the middle value of two adjacent feature values.
Starting from the definitions, we prove that minimizing the objective loss is the same as minimizing the least square error splitting principle (Eqn. 3).
So optimizing the objective loss is equivalent to optimizing SE, and as mentioned before MART is a pointwise system, which implies $(R)WSE = OLE$. ∎
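The algebra behind this equivalence can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' original derivation: $S \in \{L, R\}$ denotes one side of the split, $r_i = 2(y_i - F(\mathbf{x}_i))$ is the pseudo-response of the least-square loss, and $g_S$, $h_S$ are the first and second derivatives of the loss with respect to the output of side $S$.

```latex
\begin{align*}
\sum_{i \in S} (r_i - \bar{r}_S)^2
  &= \sum_{i \in S} r_i^2 - \frac{\big(\sum_{i \in S} r_i\big)^2}{|S|}, \\
SE(v)
  &= \sum_{i} r_i^2
     - \left[ \frac{\big(\sum_{i \in L} r_i\big)^2}{|L|}
            + \frac{\big(\sum_{i \in R} r_i\big)^2}{|R|} \right], \\
g_S &= -\sum_{i \in S} r_i, \qquad h_S = 2|S|, \qquad
\omega_S^* = -\frac{g_S}{h_S} = \frac{\bar{r}_S}{2}, \\
\frac{1}{2}\left[ \frac{g_L^2}{h_L} + \frac{g_R^2}{h_R} \right]
  &= \frac{1}{4}\left[ \frac{\big(\sum_{i \in L} r_i\big)^2}{|L|}
            + \frac{\big(\sum_{i \in R} r_i\big)^2}{|R|} \right].
\end{align*}
```

Since $\sum_i r_i^2$ does not depend on the threshold, $SE(v)$ equals a constant minus four times the predicted loss reduction, so the threshold that minimizes SE is the one that maximizes the objective loss improvement.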
This theorem means that, for the MART system, the three node-splitting principles lead to the same binary partition in any selected tree node, and hence to the same regression tree in the current boosting iteration.
The key step here is that the average pseudo-response for $L$ and $R$, defined as in Eqn. 3, is exactly double the optimum computed by the Newton step, which additionally uses the second derivative. For other loss functions, this relationship does not necessarily hold.
This theorem shows that the classic tree fitting algorithm in gradient boosting is well suited to the least-square loss function, on which the MART system is based, and this could explain why MART performs excellently in many practical applications.
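Besides, there is a by-product formula for the MART system. By plugging its first-order derivatives $\partial \mathcal{L} / \partial \omega_L = -\sum_{i \in L} r_i$, $\partial \mathcal{L} / \partial \omega_R = -\sum_{i \in R} r_i$, and second-order derivatives $2|L|$, $2|R|$, into Eqn. 13, we obtain a simpler splitting principle than Eqn. 3.
$$v^* = \arg\max_v \left[ \frac{\big(\sum_{i \in L} r_i\big)^2}{|L|} + \frac{\big(\sum_{i \in R} r_i\big)^2}{|R|} \right] \tag{14}$$
This form is more intuitive for incremental computation of the optimal threshold from left to right.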
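A left-to-right scan based on this simpler criterion can be sketched as below, together with a brute-force squared-error search for comparison. Function names and the toy data are illustrative assumptions; the scan keeps one running sum, so each feature costs a sort plus a single pass.

```python
def best_split_incremental(feature, response):
    """Maximize (sum_L r)^2 / |L| + (sum_R r)^2 / |R| with one running sum."""
    pairs = sorted(zip(feature, response))
    n = len(pairs)
    total = sum(r for _, r in pairs)
    left_sum = 0.0
    best_score, best_threshold = None, None
    for i in range(1, n):
        left_sum += pairs[i - 1][1]  # move one sample to the left side
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        right_sum = total - left_sum
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best_score is None or score > best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
    return best_threshold


def best_split_brute_force_se(feature, response):
    """Minimize the squared error by recomputing both sides for each split."""
    pairs = sorted(zip(feature, response))
    best_err, best_threshold = None, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        err = 0.0
        for side in (pairs[:i], pairs[i:]):
            mean = sum(r for _, r in side) / len(side)
            err += sum((r - mean) ** 2 for _, r in side)
        if best_err is None or err < best_err:
            best_err = err
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
    return best_threshold


if __name__ == "__main__":
    feature = [5.0, 1.0, 3.0, 2.0, 4.0]
    response = [2.0, -1.0, 0.5, -0.8, 1.9]
    print(best_split_incremental(feature, response))
    print(best_split_brute_force_se(feature, response))
```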
4 Experiments
4.1 Datasets and Systems
As suggested by [18], we use two real-world datasets, Yahoo challenge 2010 and Microsoft 10K, to make our results more stable. The statistics of these datasets are reported in Table 1.

Yahoo Challenge 2010. Since Yahoo hosted this far-reaching learning to rank contest in 2010, this dataset has been important for comparison. It contains two sets, and here we use the bigger one (set 1). The Yahoo dataset was released with only one split into training, validation, and test sets; we add two extra splits and report average results.

Microsoft 10K. Another publicly released dataset, even larger than the Yahoo data in terms of the number of documents. As a 5-fold split is provided by the official release, we report average results.
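Table 1: Statistics of the datasets.

| Dataset | #Query | #Doc. | #D. / #Q. | #Feat. |
|---|---|---|---|---|
| Yahoo | 20K | 473K | 23 | 519 |
| Micro10K | 6K | 723K | 120 | 136 |
| McRank ([12]) | 10-26K | 474-1741K | 18-88 | 367-619 |
| MART ([24]) | 31K | 4154K | 134 | 416 |
| Ohsumed | 106 | 16K | 150 | 45 |
| letor 4.0 | 2.4K | 85K | 34 | 46 |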
The two datasets above were empirically found to be different. The Microsoft dataset seems more difficult than Yahoo, as some models reportedly perform badly on it ([21]). It has comparatively fewer features (136 versus 519 for Yahoo) and a larger average number of documents per query (120 versus 23). The two real-world datasets should be capable of providing convincing results.
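Table 2: Yahoo challenge 2010. Each cell reports SE/OLE.

| System | #leaf | Learning rate | NDCG@1 | NDCG@3 | NDCG@10 | ERR |
|---|---|---|---|---|---|---|
| McRank | 10 | 0.06 | 71.32/71.76 | 71.47/72.22 | 77.97/78.52 | 45.44/45.71 |
| McRank | 10 | 0.10 | 71.38/71.73 | 71.82/72.37 | 78.25/78.67 | 45.52/45.77 |
| McRank | 10 | 0.12 | 71.67/71.87 | 71.96/72.52 | 78.33/78.77 | 45.59/45.79 |
| McRank | 20 | 0.06 | 71.52/71.90 | 71.80/72.60 | 78.26/78.84 | 45.54/45.83 |
| McRank | 20 | 0.10 | 71.65/72.03 | 72.14/72.77 | 78.50/79.00 | 45.64/45.88 |
| McRank | 20 | 0.12 | 71.80/72.10 | 72.23/72.79 | 78.58/78.99 | 45.66/45.89 |
| LambdaMART | 10 | 0.06 | 71.15/71.62 | 71.60/71.94 | 77.82/78.15 | 45.66/45.80 |
| LambdaMART | 10 | 0.10 | 71.29/71.81 | 71.76/72.11 | 77.96/78.29 | 45.72/45.90 |
| LambdaMART | 10 | 0.12 | 71.30/71.76 | 71.76/72.19 | 77.93/78.34 | 45.67/45.87 |
| LambdaMART | 20 | 0.06 | 71.51/71.75 | 72.02/72.25 | 78.13/78.40 | 45.77/45.92 |
| LambdaMART | 20 | 0.10 | 71.37/72.10 | 71.92/72.56 | 78.04/78.58 | 45.72/46.02 |
| LambdaMART | 20 | 0.12 | 71.44/71.76 | 71.91/72.39 | 78.06/78.57 | 45.71/45.96 |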
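Table 3: Microsoft 10K. Each cell reports SE/OLE.

| System | #leaf | Learning rate | NDCG@1 | NDCG@3 | NDCG@10 | ERR |
|---|---|---|---|---|---|---|
| McRank | 10 | 0.06 | 47.43/47.14 | 46.14/46.46 | 48.60/49.17 | 35.90/36.13 |
| McRank | 10 | 0.10 | 47.49/47.69 | 46.41/46.79 | 49.00/49.47 | 36.13/36.32 |
| McRank | 10 | 0.12 | 47.42/47.67 | 46.50/46.78 | 49.02/49.51 | 36.13/36.30 |
| McRank | 20 | 0.06 | 47.69/47.94 | 46.44/47.06 | 49.07/49.67 | 36.09/36.45 |
| McRank | 20 | 0.10 | 47.52/48.04 | 46.76/47.24 | 49.36/49.80 | 36.26/36.54 |
| McRank | 20 | 0.12 | 47.87/47.89 | 46.86/47.12 | 49.51/49.68 | 36.34/36.45 |
| LambdaMART | 10 | 0.06 | 47.42/47.98 | 46.21/46.54 | 48.22/48.70 | 36.44/36.68 |
| LambdaMART | 10 | 0.10 | 47.79/47.64 | 46.55/46.57 | 48.57/48.85 | 36.54/36.71 |
| LambdaMART | 10 | 0.12 | 47.45/47.79 | 46.32/46.62 | 48.54/48.94 | 36.42/36.67 |
| LambdaMART | 20 | 0.06 | 48.01/48.19 | 46.52/46.87 | 48.78/49.13 | 36.74/36.88 |
| LambdaMART | 20 | 0.10 | 47.99/48.07 | 46.66/46.79 | 48.95/49.10 | 36.72/36.81 |
| LambdaMART | 20 | 0.12 | 47.63/47.67 | 46.69/46.51 | 49.02/49.03 | 36.70/36.61 |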
As (R)WSE has been shown to be a special case of OLE, we only compare SE and OLE in the scenario of learning to rank. We adopt two famous ranking systems with regression trees as weak learners. To be consistent in implementation details, we used the same code template for both, so they differ only in their objective loss functions and regression tree fitting principles.
1. Pointwise McRank ([12]). This multi-class classification based system was reported to be strong on real-world datasets ([24]), and is naturally one of our baseline systems.
2. Pairwise LambdaMART ([24]). This famous pairwise system gained its reputation in Yahoo Challenge 2010, where a combined system mainly built on LambdaMART won the championship. In our work, we only compare with single LambdaMART systems, which are trained using the NDCG loss. LambdaMART systems could perhaps be improved further with different configurations, but that is not our concern here.
Since the proof shows that MART is an ideal intersection of these principles, we do not use this system. For each system and algorithm, we set the configurations as follows: the number of leaves is set to 10 or 20, and the learning rate in Eqn. 1 is set to 0.06, 0.1, or 0.12. So there are six configurations for each system. After examining the testing performance on the real-world datasets, we observed that several hundred iterations (or regression trees) almost lead to convergence, so we set the maximum number of iterations to 1000 for LambdaMART and 2500 for McRank, as the latter converges more slowly. We report the popular measures NDCG@(1, 3, 10) and ERR.
4.2 Experimental Comparison on Two Systems
Instead of providing limited testing results using the best parameters from validation data, we provide two kinds of testing results, one from converged training and another from the whole training procedure.
First, in Tables 2 and 3, we compare the exact performance of SE and OLE for six configurations at the predefined iteration. Empirically, when training on large datasets, systems converge easily after sufficient iterations. Second, in Figure 2, we further provide a complete comparison over the whole training procedure with different configurations.
In Tables 2 and 3, for McRank and LambdaMART respectively, among 72 comparisons, OLE gains 68 and 56 improvements of at least 0.1 point, and most of them are 0.3 to 0.4. These improvements are reasonable, as our baselines are strong and the datasets are large. These statistics are based on six typical configurations, and demonstrate that OLE is workable for the McRank and LambdaMART models in a general case.
We further analyze the four measures separately, NDCG@(1, 3, 10) and ERR. Though both McRank (Figure 2, shown in the supplementary material due to the space limit) and LambdaMART are improved consistently with OLE, NDCG@1 (solid red line) and ERR (dotted blue line) show relatively smaller improvements. ERR is more difficult to improve than NDCG.
Improvements of NDCG@3 and NDCG@10 on McRank and LambdaMART are more robust across a variety of configurations. NDCG@1 is computed on the first document predicted by the models, and ERR is computed on the whole ranked list, which usually contains several dozen documents; in practice, however, the first page with 10 links returned by a search engine is what users care about most. So we think it may be more useful to improve the NDCG@3 and NDCG@10 measures.
As OLE is supposed to converge faster than SE, we also collected statistics of the objective losses in the final iteration. OLE indeed leads to smaller objective losses, but not by much: about 0.32% to 1%. As this work only focuses on the splitting rule in a single node, we also tried different strategies for generating nodes: width-first search and depth-first search. Interestingly, depth-first search performs poorly for both the baselines and our method. This is an open question left to future exploration. We thus adopted width-first search and limit the number of leaves.
Regarding the running time, there is no loss for systems with derivative additive objective losses compared to SE in gradient boosting. Such systems are typically pointwise. For pairwise and listwise systems, however, OLE incurs the extra overhead of maintaining exact second derivatives of the objective loss function. With incremental updating, this overhead is about 30% of SE for pairwise systems, and it may be more for listwise systems.
5 Conclusion
In this paper, we propose a minimum objective loss based tree construction algorithm in the boosting framework, and analyze two existing tree construction principles, least square error and (robust) weighted square error. The former is widely used in gradient boosting and practical learning to rank systems, while the latter is well known from LogitBoost and classification. We successfully build a relationship between our method and WSE in LogitBoost, and prove that WSE is just a special case of our method. This provides theoretical support for (robust) LogitBoost and pointwise ranking systems. Based on our analysis, we show that MART is an ideal connection between SE, WSE, and OLE, and obtain a more concise formula for MART. Finally, for an empirical comparison of the principles, we implement two strong ranking systems and examine them with a variety of regression tree configurations on two large public datasets. Our results indicate that the proposed method improves McRank and LambdaMART, while for MART it coincides with the least square error principle.
References
 [1] (2006) Nonlinear programming: theory and algorithms, 3rd edition. John Wiley & Sons.
 [2] (2005) Learning to rank using gradient descent. In ICML, pp. 89–96.
 [3] (2011) Learning to rank using an ensemble of lambda-gradient models. JMLR - Proceedings Track, pp. 25–35.
 [4] (2011) Yahoo! learning to rank challenge overview. JMLR - Proceedings Track, pp. 1–24.
 [5] (2010) Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval (3), pp. 216–235.
 [6] (2006) Subset ranking using regression. In COLT, pp. 605–619.
 [7] (2003) An efficient boosting algorithm for combining preferences. JMLR, pp. 933–969.
 [8] (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
 [9] (2000) Additive logistic regression: a statistical view of boosting. Annals of Statistics.
 [10] (2010) Direct loss minimization for structured prediction. In NIPS, pp. 1594–1602.
 [11] (2007) Direct optimization of ranking measures. CoRR abs/0704.3359. Informal publication.
 [12] (2007) McRank: learning to rank using multiple classification and gradient boosting. In NIPS, pp. 897–904.
 [13] (2009) ABC-boost: adaptive base class boost for multi-class classification. In ICML, pp. 625–632.
 [14] (2010) Fast abc-boost for multi-class classification. arXiv preprint arXiv:1006.5051.
 [15] (2010) Robust logitboost and adaptive base class (abc) logitboost. UAI.
 [16] (2009) Learning to rank for information retrieval. Foundations and Trends in IR (3), pp. 225–331.
 [17] (1975) The analysis of permutations. Applied Statistics.
 [18] (2010) LETOR: a benchmark collection for research on learning to rank for information retrieval. Information Retrieval (4), pp. 346–374.
 [19] (2011) On NDCG consistency of listwise ranking methods. In AISTATS, pp. 618–626.
 [20] (2012) Boosting: foundations and algorithms. MIT Press.
 [21] (2013) Direct optimization of ranking measures for learning to rank models. In SIGKDD, pp. 856–864.
 [22] (2011) Parallel boosted regression trees for web search ranking. In WWW, pp. 387–396.
 [23] (2009) Learning to rank by optimizing NDCG measure. In NIPS, pp. 1883–1891.
 [24] (2010) Adapting boosting for information retrieval measures. Information Retrieval (3), pp. 254–270.
 [25] (2009) Statistical consistency of top-k ranking. In NIPS, pp. 2098–2106.
 [26] (2008) Listwise approach to learning to rank: theory and algorithm. In ICML, pp. 1192–1199.