# A review on ranking problems in statistical learning

Tino Werner¹

¹ Institute for Mathematics, Carl von Ossietzky University Oldenburg, P.O. Box 2503, 26111 Oldenburg (Oldb), Germany, tino.werner1@uni-oldenburg.de

###### Abstract

Ranking problems define a widespread class of statistical learning problems with many applications, including fraud detection, document ranking, medicine, credit risk screening, image ranking and media memorability. In this article, we systematically describe different types of non-probabilistic supervised ranking problems, i.e., ranking problems that require the prediction of an order of the response variables, and the corresponding loss functions resp. goodness criteria. We discuss the difficulties that arise when trying to optimize those criteria. To give a detailed and comprehensive overview of existing machine learning techniques for solving such ranking problems, we group the suitable techniques into SVM-, tree-, Boosting- and Neural Network-type approaches and recapitulate the corresponding optimization problems in a unified notation. We also discuss to which of the ranking problems the respective algorithms are tailored and identify their strengths and limitations. Computational aspects and open research problems are also considered.

Ranking problems; Supervised learning; Empirical risk minimization; Structural risk minimization; Surrogate losses

## 1 Introduction

Search engines like Google provide a list of websites that are suitable for the user's query in the sense that the first websites that are displayed are expected to be the most relevant ones. Mathematically speaking, the search engine has to solve a ranking problem, which Google does by means of the PageRank algorithm (Page et al. (1999)).

In their seminal paper, Clémençon et al. (2008) proposed a statistical framework for ranking problems and proved that the common approach of empirical risk minimization (ERM) is indeed suitable for ranking problems. Most of the ranking techniques that already existed at that time follow the ERM principle and can be embedded directly into the framework of Clémençon et al. (2008).

In general, the responses in data sets corresponding to those problems are binary, therefore a natural criterion for such binary ranking problems is the probability that an instance belongs to the class of interest. While ranking can be generally seen in between classification and regression, those binary ranking problems are very closely related to binary classification tasks (see also Balcan et al. (2008)). For binary ranking problems, there exists vast literature, including theoretical work as well as learning algorithms that use SVMs (Brefeld and Scheffer (2005), Herbrich et al. (1999), Joachims (2002)), Boosting (Freund et al. (2003), Rudin (2009)), neural networks (Burges et al. (2005)) or trees (Clémençon and Vayatis (2008), Clémençon and Vayatis (2010)).

As for document ranking, the labels may also be discrete, but with $d>2$ classes, for example in the OHSUMED data set (Hersh et al. (1994)). For such general $d$-partite ranking problems, theoretical work has also been developed (Clémençon et al. (2013c)), as well as tree-based learning algorithms (Clémençon and Robbiano (2015a), Clémençon and Robbiano (2015b), see also Robbiano (2013)).

Recently, Clémençon investigated a new branch of ranking problems, namely the continuous ranking problems where the name already indicates that the response variable is continuous, with potential applications in natural sciences or quantitative finance (cf. Clémençon and Achab (2017)). This continuous ranking problem can be located on the other flank of the spectrum of ranking problems that is closest to regression.

The continuous ranking problem is especially interesting when trying to rank instances whose response is difficult to quantify. A common technique is to introduce latent variables, which are used for example to measure or quantify intelligence (Borsboom et al. (2003)), personality (Anand et al. (2011)) or the family background (Dickerson and Popli (2016)). While in these cases the latent variables are treated as features, a continuous ranking problem would arise once a response variable that is hard to measure is implicitly fitted by replacing it with some latent score, which is much more general than ranking binary responses by means of their probability of belonging to class 1. An example is given in Lan et al. (2012) where images have to be ranked according to their compatibility with a given query. Another application of continuous ranking problems arises in risk-based auditing for the detection of tax evasion, where the limited personnel resources of tax offices should be used as efficiently as possible. Risk-based auditing can be seen as a general strategy for internal auditing, fraud detection and resource allocation that incorporates different types of risks to be more tailored to the real-world situation; see Pickett (2006) for a broad overview, Moraru and Dumitru (2011) for a short survey of different risks in auditing, Khanna (2008) for a study on bank-internal risk-based auditing and Bowlin (2011) for a study on risk-based auditing for resource planning.

This paper is organized as follows. Starting in section 2 with the definition of several different ranking problems that are distinguished by the nature of the response variable and by the goal of the analyst, it becomes evident that suitable loss functions have at least a pair-wise structure in this case. We describe in detail the loss functions corresponding to the different types of ranking problems and related quality criteria which are optimized especially for ranking problems with discrete response variable. We also carefully discuss the combined ranking problems and distinguish between ranking and ordinal regression. In section 3, we provide a systematic overview of different machine learning algorithms by grouping them into SVM-, tree-, Boosting- and Neural Network-type approaches. We review these approaches and discuss their strengths, limitations and computational aspects. We conclude with open research problems for supervised ranking.

## 2 Supervised ranking problems

In this work, we always have data $\mathcal{D}=(X,Y)$ with $X_i\in\mathcal{X}\subset\mathbb{R}^p$ and $Y_i\in\mathcal{Y}\subset\mathbb{R}$, $i=1,\dots,n$, where $X_i$ denotes the $i$-th row of the regressor matrix $X$.

Solutions to ranking problems do not necessarily need to recover the responses $Y_i$ based on the observations $X_i$. In fact, the goal is in general to predict the right ordering of the responses, although there exist some relaxations of this (hard) ranking problem, e.g., only the top $K$ instances have to be ranked exactly while the predicted ranking of the other instances is not a quantity of interest. Clémençon et al. (2005) and Clémençon et al. (2008) provided the theoretical statistical framework for empirical risk minimization in the ranking setting.

### 2.1 Different types of ranking problems

We try to rank the instances $X_i$ by comparing their predicted responses $\hat{Y}_i$, i.e., $X_i$ will be ranked higher than $X_j$ if $\hat{Y}_i>\hat{Y}_j$. Then, Clémençon et al. (2008) provide the following definitions.

###### Definition 2.1.

With the convention above,

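a) a ranking rule is a mapping $r:\mathcal{X}\times\mathcal{X}\rightarrow\{-1,1\}$ where $r(x,x')=1$ indicates that $x$ is ranked higher than $x'$ and vice versa.

b) a ranking rule induced by a scoring rule is given by

$$r(x,x')=2I(s(x)\geq s(x'))-1$$

with a scoring function $s:\mathcal{X}\rightarrow\mathbb{R}$ where $r(x,x')=1$ precisely if $s(x)\geq s(x')$.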
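Part b) of the definition can be illustrated with a minimal sketch. It assumes the convention that the induced rule returns 1 when the first argument receives a score at least as high as the second one and -1 otherwise; the linear scoring function and its weights are hypothetical and only serve as an example.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def induced_ranking_rule(
    s: Callable[[Vector], float],
) -> Callable[[Vector, Vector], int]:
    """Turn a scoring function s into a pairwise ranking rule r.

    Assumed convention: r(x, x') = 1 if s(x) >= s(x'), and -1 otherwise.
    """

    def r(x: Vector, x_prime: Vector) -> int:
        return 1 if s(x) >= s(x_prime) else -1

    return r


# Hypothetical linear scoring function s(x) = <w, x> with made-up weights.
WEIGHTS = [0.5, -1.0]


def linear_score(x: Vector) -> float:
    return sum(w_k * x_k for w_k, x_k in zip(WEIGHTS, x))


r = induced_ranking_rule(linear_score)
```

Since the rule only depends on the sign of a score difference, any strictly increasing transformation of the scoring function induces the same ranking rule.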

In this work, we will refer to the problem of correctly ranking all instances as the hard ranking problem, which is a global problem. A weaker problem is the localized ranking problem that intends to find the correct ordering of the best $K$ instances, so misrankings at the bottom of the list are not taken into account. However, misclassifications have to be additionally penalized in this setting. These two problems are stronger than classification problems.

In contrast, sometimes it suffices to tackle the weak ranking problem where we only need to reliably detect the best $K$ instances but where their pair-wise ordering is not a quantity of interest. For example, in the tax fraud detection context, we try to find the most suspicious taxpayers whose income tax statements need to be rigorously verified. If one knew that exactly $K$ instances will be reviewed (which is not a realistic assumption), it would not be necessary to predict which of them is the most suspicious one. This problem has been identified in Clémençon and Vayatis (2007) as a classification problem with a mass constraint, since we require to get exactly $K$ class 1 objects if class 1 is defined as the "interesting" class.

We will always denote the index set of the true best $K$ instances by $Best_K$ and its empirical counterpart, i.e., the indices of the instances that have been predicted to be the best $K$ ones, by $\widehat{Best}_K$.

Worked out theory for the weak and localized ranking problem is given in Clémençon and Vayatis (2007).

On the other hand, one distinguishes between three other types of ranking problems depending on the set $\mathcal{Y}$. If $Y$ is binary-valued, w.l.o.g. $\mathcal{Y}=\{-1,1\}$, then a ranking problem that intends to retrieve the correct ordering of the probabilities of the instances to belong to class 1 is called a bipartite ranking problem (binary ranking problem). If $Y$ can take $d$ different values, a corresponding ranking problem is referred to as a $d$-partite ranking problem and for continuously-valued responses, one faces a continuous ranking problem.

Further discussions on possible combinations of these types of ranking problems and their relation to classification and regression follow in section 2.5.

### 2.2 Loss functions for supervised ranking

Empirical risk minimization needs the definition of a suitable risk function. The hard ranking risk, i.e., the risk function of the hard ranking problem, introduced in Clémençon et al. (2005) is given by

$$R^{hard}(r):=\mathbb{E}[I((Y-Y')r(X,X')<0)], \tag{2.1}$$

so in fact, this is nothing but the probability of a misranking of $X$ and $X'$. Thus, empirical risk minimization intends to find an optimal ranking rule by solving the optimization problem

$$\min_{r\in\mathcal{R}}\left(L_n^{hard}(r)\right)$$

where

$$L_n^{hard}(r)=\frac{1}{n(n-1)}\sum\sum_{i\neq j}I((Y_i-Y_j)r(X_i,X_j)<0) \tag{2.2}$$

and where $\mathcal{R}$ is some class of ranking rules $r:\mathcal{X}\times\mathcal{X}\rightarrow\{-1,1\}$. For the sake of notation, the additional arguments in the loss functions are suppressed. Note that $L_n^{hard}$, i.e., the hard empirical risk, is also the hard ranking loss function, which reflects the global nature of hard ranking problems.

In this review, we restrict ourselves to ranking rules that are induced by scoring rules. Considering some parameter space $\Theta$, it suffices to empirically find the best scoring function $s_\theta$ (and with it, the empirically optimal induced ranking rule) by solving the parametric optimization problem

$$\min_{\theta\in\Theta}\left(L_n^{hard}(\theta)\right)$$

with

$$L_n^{hard}(\theta)=\frac{1}{n(n-1)}\sum\sum_{i\neq j}I((Y_i-Y_j)(s_\theta(X_i)-s_\theta(X_j))<0). \tag{2.3}$$
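The empirical hard ranking loss in (2.3) can be evaluated directly from the responses and the scores. The following is a minimal sketch of the naïve pairwise evaluation; the function name is an assumption, and the scores are taken as already computed by some scoring function.

```python
from typing import Sequence


def hard_ranking_loss(y: Sequence[float], scores: Sequence[float]) -> float:
    """Naive O(n^2) evaluation of the empirical hard ranking loss.

    Counts ordered pairs (i, j), i != j, whose responses and scores are
    ordered in opposite directions, and divides by n * (n - 1).
    """
    n = len(y)
    if n != len(scores):
        raise ValueError("y and scores must have the same length")
    if n < 2:
        raise ValueError("at least two instances are required")
    misranked = 0
    for i in range(n):
        for j in range(n):
            if i != j and (y[i] - y[j]) * (scores[i] - scores[j]) < 0:
                misranked += 1
    return misranked / (n * (n - 1))
```

A perfectly ordered score vector yields a loss of 0 and a completely reversed one a loss of 1, provided there are no ties.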

For the weak ranking problem, Clémençon and Vayatis (2007) introduce the upper $u$-quantile $Q(s,1-u)$ for the random variable $s(X)$ for binary responses. Since a weak ranking problem can also be formulated for continuous-valued responses, Werner (2019) introduced the transformed responses

$$\tilde{Y}_i^{(K)}:=2I(\mathrm{rk}(Y_i)\leq K)-1$$

where the ranks come from a descending ordering. Then the misclassification risk corresponding to the weak ranking problem in the sense of Clémençon and Vayatis (2007) is given by

$$R^{weak,K}(s)=P\left(\tilde{Y}^{(K)}(s(X)-Q(s,1-u^{(K)}))<0\right)$$

with the empirical counterpart

$$L_n^{weak,K}(s)=\frac{1}{n}\sum_{i=1}^n I\left(\tilde{Y}_i^{(K)}(s(X_i)-\hat{Q}(s,1-u^{(K)}))<0\right)$$

for the empirical quantile $\hat{Q}(s,1-u^{(K)})$. To approximate the quantile, one needs to set $u^{(K)}=K/n$, i.e., for a given quantile $u^{(K)}$, one looks at the top $K$ instances that represent this upper quantile.

###### Remark 2.1.

Werner (2019) mentions that due to the mass constraint, each false positive generates exactly one false negative, so the loss can be equivalently written as

$$L_n^{weak,K}(s)=\frac{2}{n}\sum_{i\in Best_K}I\left(\tilde{Y}_i^{(K)}(s(X_i)-\hat{Q}(s,1-u^{(K)}))<0\right).$$

Note that the weak ranking loss is not standardized, i.e., it does not necessarily take values in the whole interval $[0,1]$. More precisely, as shown in Werner (2019), its maximal value is always $2K/n$, so we can only hit the value one if $K=n/2$ for even $n$ and if all instances that belong to the "upper half" are predicted to be in the "lower half" and vice versa. For better comparison of the losses, Werner (2019) proposes the standardized weak ranking loss

$$L_n^{weak,K,norm}(s)=\frac{1}{K}\sum_{i\in Best_K}I\left(\tilde{Y}_i^{(K)}(s(X_i)-\hat{Q}(s,1-u^{(K)}))<0\right). \tag{2.4}$$

###### Remark 2.2.

Having gotten rid of the ratio $K/n$, the standardized weak ranking loss function has a very intuitive interpretation. For a fixed $K$, a standardized weak ranking loss of $c/K$ means that $c$ of the instances of $Best_K$ have not been recovered by the model.
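The interpretation of the standardized weak ranking loss as the share of missed top instances can be made concrete with a short sketch. It assumes that there are no ties among the relevant responses and scores, so that the empirical quantile condition is equivalent to membership in the predicted top-K set; the helper names are hypothetical.

```python
from typing import Sequence, Set, Tuple


def top_k_indices(values: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest entries (ties are broken by index order)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def weak_ranking_losses(
    y: Sequence[float], scores: Sequence[float], k: int
) -> Tuple[float, float]:
    """Return (weak ranking loss, standardized weak ranking loss)."""
    n = len(y)
    best_true = top_k_indices(y, k)
    best_pred = top_k_indices(scores, k)
    # Number of true top-k instances that the scores fail to recover.
    missed = len(best_true - best_pred)
    # Mass constraint: every missed instance is matched by one false positive.
    weak = 2 * missed / n
    weak_norm = missed / k
    return weak, weak_norm
```

For instance, with eight instances and K = 2, missing one of the two true top instances gives a weak ranking loss of 0.25 but a standardized loss of 0.5.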

A suitable loss function for the localized ranking problem was proposed in Clémençon and Vayatis (2007), too. In our notation, it is given by

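$$L_n^{loc,K}(s):=\frac{n_-}{n}L_n^{weak,K}(s)+\frac{1}{n(n-1)}\sum\sum_{i\neq j}I\left(\{(s(X_i)-s(X_j))(Y_i-Y_j)<0\}\cap\{\min(s(X_i),s(X_j))\geq\hat{Q}(s,1-u^{(K)})\}\right). \tag{2.5}$$

In the first summand, $n_-$ indicates the number of negatives, so the quotient $n_-/n$ is just an estimate of $P(Y=-1)$. Note that Clémençon and Vayatis (2007) introduced this loss for binary-valued responses. Werner (2019) proposes to set $n_-=n-K$ for continuously-valued responses since localizing artificially labels the top $K$ instances as class 1 objects, hence we get $(n-K)$ negatives. Again, the second summand may be rewritten as

$$\frac{2}{n(n-1)}\sum\sum_{i<j,\ i,j\in\widehat{Best}_K}I((s(X_i)-s(X_j))(Y_i-Y_j)<0).$$

Like the weak ranking loss, this loss is not standardized. Taking a closer look at it, the maximal achievable loss given a fixed $K$ is

$$\max_s\left(L_n^{loc,K}(s)\right)=\frac{2K(n-K)}{n^2}+\frac{K(K-1)}{n(n-1)},$$

so a standardized version, provided in Werner (2019), is simply

$$L_n^{loc,K,norm}(s)=\frac{L_n^{loc,K}(s)}{\max_s\left(L_n^{loc,K}(s)\right)}.$$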

###### Remark 2.3.

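Note that even in the case $K=n/2$ for even $n$, the localized ranking loss cannot take the value one, as mentioned in Werner (2019). This is true since the maximal achievable loss then equals

$$\frac{2\frac{n}{2}\left(n-\frac{n}{2}\right)}{n^2}+\frac{\frac{n}{2}\left(\frac{n}{2}-1\right)}{n(n-1)}=\frac{1}{2}+\frac{n-2}{4(n-1)}<\frac{3}{4}.$$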

A simple example for clarification is given below in Example 2.1, which we borrow from Werner (2019). We once more take a look at the U-statistics that arise for the hard and the localized ranking problem. Clémençon et al. (2008) already mentioned that these pair-wise loss functions can be generalized to loss functions with $m$ input arguments. This leads to U-statistics of order $m$. But if the whole permutations that represent the ordering of the response values should be compared at once (i.e., $m=n$), then this again boils down to a U-statistic of order 2. Let

$$\mathrm{Perm}(1:n):=\{\pi\mid\pi\text{ is a permutation of }\{1,\dots,n\}\}$$

and let $\pi,\hat{\pi}\in\mathrm{Perm}(1:n)$ be the true resp. the estimated permutation, then the empirical hard ranking loss can be equivalently written as

$$L_n^{hard}(\pi,\hat{\pi})=\frac{2}{n(n-1)}\sum\sum_{i<j}I((\pi_i-\pi_j)(\hat{\pi}_i-\hat{\pi}_j)<0).$$

###### Example 2.1.

Assume that we have a data set with the true response values

and the fitted values

Then we order the vectors according to , so that and get the permutations

For example, is the largest value of , having rank 1. So we reorder such that is the first entry. But since this is only the second-largest entry of , we have a rank of 2, leading to the first component and so forth.

Setting , we obviously get

The standardized weak ranking loss is then

which is most intuitive since one of the indices of the four true best instances is not contained in the predicted set . The second part of the localized loss is then

This makes it obvious why the misclassification loss has to be included, since this loss would be the same if the instances of rank 4 and 5 were not switched. The complete localized ranking loss is

The standardized localized ranking loss is then

Finally, the hard ranking loss is

Setting , the weak ranking loss is zero and the localized ranking loss is

The standardized localized ranking loss is

The hard ranking loss is a global loss and does not change when changing $K$.

This simple example shows how important the selection of $K$ can be.
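The dependence of the weak and the localized ranking loss on the cut-off can also be explored numerically. The sketch below uses hypothetical data, not the values of Example 2.1. It assumes the localized loss (2.5) with the number of negatives set to n minus K, as proposed for continuous responses, and no ties, so that the quantile condition means that both instances belong to the predicted top-K set.

```python
from typing import List, Sequence


def top_k(values: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest entries, in descending order of value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def weak_loss(y: Sequence[float], scores: Sequence[float], k: int) -> float:
    """Weak ranking loss: twice the share of missed true top-k instances."""
    missed = len(set(top_k(y, k)) - set(top_k(scores, k)))
    return 2 * missed / len(y)


def localized_loss(y: Sequence[float], scores: Sequence[float], k: int) -> float:
    """Localized ranking loss with n_minus = n - k (assumption, see lead-in)."""
    n = len(y)
    pred = top_k(scores, k)
    # Misranked unordered pairs among the instances predicted to be top-k.
    discordant = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if (y[pred[a]] - y[pred[b]]) * (scores[pred[a]] - scores[pred[b]]) < 0
    )
    pairwise = 2 * discordant / (n * (n - 1))
    return (n - k) / n * weak_loss(y, scores, k) + pairwise


# Hypothetical responses and scores (not the data of Example 2.1).
y = [8, 7, 6, 5, 4, 3, 2, 1]
scores = [0.8, 0.9, 0.7, 0.1, 0.6, 0.3, 0.2, 0.0]
losses_by_k = {
    k: (weak_loss(y, scores, k), localized_loss(y, scores, k)) for k in (2, 4, 5)
}
```

In this toy data set, the two best instances are detected but swapped, so the weak loss for K = 2 is zero while the localized loss is not; for K = 4, one true top instance is missed and both losses are positive.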

### 2.3 Fast computation of the hard ranking loss

A naïve evaluation of the hard ranking loss requires $\frac{n(n-1)}{2}$ comparisons. This becomes infeasible for data sets with many observations, therefore Werner (2019) provided a solution that requires only $\mathcal{O}(n\ln(n))$ evaluations.

The idea is to take a look at the concordance measure

$$\tau:=\frac{2}{n(n-1)}\sum\sum_{i<j}\operatorname{sign}((Y_i-Y_j)(\hat{Y}_i-\hat{Y}_j))$$

called Kendall's Tau. Unlike the ranking loss, which is high if there are many misrankings and which is $[0,1]$-valued, Kendall's Tau is high if many pairs are concordant, i.e., if the pair-wise ranking is correct in most cases, and takes values in $[-1,1]$.

This leads to a bijection between these two quantities if we do not face ties as given in the following lemma from Werner (2019). We also recapitulate its proof.

###### Lemma 2.1 (Hard ranking loss and Kendall’s Tau).

Assume the vectors $Y$ and $\hat{Y}$ have the same length $n$ and do not contain ties. Then it holds that

$$L_n^{hard}=\frac{1-\tau}{2}.$$

###### Proof.

In the case of perfect concordance, the ranking loss function is zero whereas Kendall's $\tau$ takes the value 1. If we produce one misranking, w.l.o.g. by swapping the largest and the second largest entry of $\hat{Y}$, the indicator function in the ranking loss jumps from zero to 1 for the corresponding pair of indices, increasing the total ranking loss by $\frac{2}{n(n-1)}$. The same manipulation results in the summands for the same indices in Kendall's $\tau$ changing from 1 to -1, decreasing it by $\frac{4}{n(n-1)}$.

By induction, the claimed formula is valid.
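The relation can also be checked directly, without induction. The following short derivation assumes the tau-a form of Kendall's Tau (normalization by the number of pairs $n(n-1)/2$) and the absence of ties, so that $\operatorname{sign}(z)=1-2I(z<0)$ holds for every pair; $\hat{Y}_i$ stands for the fitted values or scores.

```latex
\begin{align*}
\tau &= \frac{2}{n(n-1)}\sum\sum_{i<j}
        \operatorname{sign}\big((Y_i-Y_j)(\hat{Y}_i-\hat{Y}_j)\big) \\
     &= \frac{2}{n(n-1)}\sum\sum_{i<j}
        \Big(1-2I\big((Y_i-Y_j)(\hat{Y}_i-\hat{Y}_j)<0\big)\Big) \\
     &= 1-2\cdot\frac{1}{n(n-1)}\sum\sum_{i\neq j}
        I\big((Y_i-Y_j)(\hat{Y}_i-\hat{Y}_j)<0\big)
      = 1-2L_n^{hard}.
\end{align*}
```

Rearranging gives $L_n^{hard}=(1-\tau)/2$. The third line uses that each unordered pair appears twice in the sum over $i\neq j$.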

This indeed turns out to be useful in practice since there exists an R command that provides fast computation of Kendall's $\tau$, namely the command cor.fk from the package pcaPP (Filzmoser et al. (2018)). The algorithm essentially goes back to Knight (1966) and relies on the idea of fast sorting algorithms. So in fact, one first computes Kendall's Tau using cor.fk and then uses the bijection to compute the hard ranking loss, which reduces the number of calculations necessary for the computation of the hard ranking loss from $\mathcal{O}(n^2)$ in the naïve implementation to $\mathcal{O}(n\ln(n))$.
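The idea of obtaining the hard ranking loss from Kendall's Tau via sorting can be sketched in plain Python. This is not the cor.fk implementation but an illustrative merge-sort inversion count in the same spirit; it assumes that neither vector contains ties, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def _sort_and_count(seq: Sequence[float]) -> Tuple[List[float], int]:
    """Merge sort that also counts inversions (pairs a < b with seq[a] > seq[b])."""
    if len(seq) < 2:
        return list(seq), 0
    mid = len(seq) // 2
    left, inv_left = _sort_and_count(seq[:mid])
    right, inv_right = _sort_and_count(seq[mid:])
    merged: List[float] = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            # right[j] is smaller than all remaining elements of left.
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau_no_ties(y: Sequence[float], scores: Sequence[float]) -> float:
    """Kendall's Tau in O(n log n), assuming no ties in y or scores."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    # After sorting by y, every inversion in the scores is a discordant pair.
    _, discordant = _sort_and_count([scores[i] for i in order])
    n_pairs = n * (n - 1) // 2
    return 1 - 2 * discordant / n_pairs


def hard_ranking_loss_fast(y: Sequence[float], scores: Sequence[float]) -> float:
    """Hard ranking loss via the bijection L = (1 - tau) / 2."""
    return (1 - kendall_tau_no_ties(y, scores)) / 2
```

The result agrees with the naïve pairwise count whenever there are no ties; with ties, the bijection no longer holds and the pairwise definition should be used instead.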

### 2.4 Quality criteria for ranking

So far, we presented loss functions for ranking problems that lead to algorithms in the spirit of the ERM resp. SRM paradigm. On the other hand, there also exist quality measures that are popular in classification settings but which already have been transferred to the ranking setting. Before we go into detail, we recapitulate the definition of a common and well-known quality criterion for classification.

###### Definition 2.2.

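Let $Y$ take values in $\{-1,1\}$ where the total number of positives is $n_+$ and the total number of negatives is $n_-$. Let $\hat{Y}_i$, $i=1,\dots,n$, be predicted values.

a) The true positive rate (TPR) and the false positive rate (FPR) are given by

$$\mathrm{TPR}=\frac{1}{n_+}\sum_{i}I(\hat{Y}_i=1)I(Y_i=1),\quad\mathrm{FPR}=\frac{1}{n_-}\sum_{i}I(\hat{Y}_i=1)I(Y_i=-1).$$

b) The Receiver Operating Characteristic curve (ROC curve) is the plot of the true positive rate against the false positive rate.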

c) The AUC is the abbreviation for the area under the ROC curve.

For theoretical aspects of the empirical AUC and its optimization, we refer to Agarwal et al. (2005), Cortes and Mohri (2004) and Calders and Jaroszewicz (2007). We continue presenting the reparametrization of the ROC curve as it has been introduced in Clémençon et al. (2008) and used in subsequent papers of Clémençon and coauthors.

###### Definition 2.3.

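For a scoring function $s$, the true positive rate and the false positive rate are given by

$$\mathrm{TPR}_s(t)=P(s(X)\geq t\mid Y=1),\quad\mathrm{FPR}_s(t)=P(s(X)\geq t\mid Y=-1).$$

Setting

$$q_{s,\alpha}:=\inf\{t\in\mathbb{R}\mid\mathrm{FPR}_s(t)\leq\alpha\},$$

the ROC curve is the plot of $\mathrm{TPR}_s(q_{s,\alpha})$ against the level $\alpha$.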

The ROC curve is a standard tool to validate binary classification rules. If the classification depends on a threshold, different points of the ROC curve are generated by changing the threshold and calculating the TPR and the FPR. Since the goal is to achieve a TPR as high as possible for the price of a FPR as low as possible, one usually chooses the threshold corresponding to the upper-leftmost point of the empirical ROC curve. A combined quality measure that incorporates all points of the ROC curve is the AUC where a classification rule is better the higher the empirical AUC is. Random guessing clearly has a theoretical AUC of 0.5.
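The construction of the empirical ROC curve by sweeping the threshold, and the resulting AUC, can be sketched as follows. The code assumes labels coded as 1 (positive) and -1 (negative) and at least one instance of each class; the function names are hypothetical. The pairwise function illustrates the link to ranking: the empirical AUC is the share of positive-negative pairs that the scoring function orders correctly, with ties counted as one half.

```python
from typing import List, Sequence, Tuple


def roc_points(y: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Empirical ROC points (FPR, TPR) for all thresholds given by the scores."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Classify an instance as positive if its score is at least t.
        tp = sum(1 for label, s in zip(y, scores) if s >= t and label == 1)
        fp = sum(1 for label, s in zip(y, scores) if s >= t and label == -1)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the piecewise linear curve through the ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def auc_pairwise(y: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly by the scores."""
    pos = [s for label, s in zip(y, scores) if label == 1]
    neg = [s for label, s in zip(y, scores) if label == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Both routes give the same value on a given data set, which is why maximizing the AUC and minimizing the bipartite ranking error lead to the same scoring functions.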

For the bipartite localized ranking problem, Clémençon and Vayatis (2007) provide the following localized version of the AUC. It is important to note a strong equivalence between the AUC and the ranking error in the sense that minimizing this error is equivalent to maximizing the AUC corresponding to the scoring function (see Clémençon and Vayatis (2007)).

###### Definition 2.4.

The localized AUC is defined as

.

As for $d$-partite ranking problems, i.e., $Y$ can take $d$ different values, Clémençon et al. (2013c) proposed the VUS (volume under the ROC surface) as quality criterion.

###### Definition 2.5.

Let w.l.o.g. take values in and let again take values in . For a scoring function , define

for .

a) The ROC surface is the ”continuous extension” (Clémençon et al. (2013c)) of the plot

for .

b) The VUS is the volume under the ROC surface.

In this definition, the term ”continuous extension” means to connect the points by hyperplane parts as described in Clémençon et al. (2013c). The ROC surface can be interpreted as joint plot of the class-wise true positive rates since if the value of the scoring function is between and (artificially define and ), the instance is assigned to class .

Other well-known quality criteria for ranking problems are for example the MAP (mean average precision) and the NDCG (normalized discounted cumulative gain), but since we focus on losses in this article, we do not consider algorithms that optimize quality criteria which, unlike the AUC, are not directly related to a ranking loss.

### 2.5 Discussion of the different ranking problems

In this subsection, we recapitulate a discussion from Werner (2019) of the different types of ranking problems introduced earlier from a qualitative point of view and the differences between ranking and ordinal regression.

Ordinal regression problems are indeed very closely related to ranking problems. As already pointed out in Robbiano (2013), especially multipartite ranking problems (Clémençon et al. (2013c)) share the main ingredient, i.e., the computation of a scoring function that should provide pseudo-responses with a suitable ordering. However, the main difference is that the multipartite ranking problem is already solved once the ordering of the pseudo-responses is correct while the ordinal regression problem still needs thresholds such that a discretization of the pseudo-responses into the classes of the original responses is correct.

Note that due to the discretization, ordinal regression problems can also be perfectly solved even if the rankings provided by the scoring function are not perfect. For example, consider observations with indices that belong to class . If for a scoring rule we had the predicted ordering but the true ordering is different, then we can still choose thresholds such that all instances that belong to class (and no other instance) are classified into this class, provided that . Though, as Robbiano (2013) already pointed out, the ordinal regression is based on another loss function.

Concerning informativity, one can state that multipartite ranking problems are more informative than ordinal regression problems due to the chunking that is done in the latter ones. But in fact, in an intermediary step, i.e., when having computed the scoring function, the ordinal regression problem is as informative as multipartite ranking problems. This is also true for standard logit or probit models (the two classes generally are not ordered, but when artificially replacing the true labels by and where the particular assignment does not affect the quality of the models, they can indeed be treated as ordinal regression models) where the real-valued pseudo-responses computed by the scoring function are discretized at the end to have again two classes.

The continuous ranking problem can be treated as a special case where no pseudo-responses are needed since the original responses are already real-valued, but again, instead of optimizing some regression loss function, the goal is actually to optimize a ranking loss function.

For further discussions on the relation of ranking and ordinal regression (also called ”ordinal classification” and ”ordinal ranking” in the reference), see Lin (2008).

From this point of view, the three combined problems for the continuous case, i.e., weak, hard and localized continuous ranking problems, are easy to distinguish and are all meaningful. Hard bipartite and hard partite ranking problems are essentially optimized by the corresponding algorithms that we will describe in subsection 3.2 and localized bipartite ranking problems can be solved using the tree-type algorithms of Clémençon as pointed out for instance in Clémençon et al. (2013b).

Clearly, these localized bipartite problems directly reflect the motivation from risk-based auditing or document retrieval. It has been mentioned in Clémençon and Robbiano (2015b) that their tree-type algorithm is not able to optimize the VUS locally. To the best of our knowledge, this has not been achieved until now. But indeed, localized partite ranking problems can also be interesting in document retrieval settings where the classes represent different degrees of relevance. Then it would be interesting for example to just recover the correct ranking of the relevant instances, i.e., the ones from the ”best” classes.

As mentioned earlier, weak ranking problems can be identified with binary classification with a mass constraint (Clémençon and Vayatis (2007)). In the case of weak bipartite ranking problems, it may sound strange to essentially mix up two classification paradigms, but one can think of performing binary classification by computing a scoring function and by predicting each instance whose score exceeds some threshold as an element of class 1, as it is done for example in logit or probit models. One can think of choosing the threshold such that there are exactly $K$ instances classified into class 1 instead of optimizing the AUC or some misclassification rate.
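Classification with a mass constraint can be sketched in a few lines: instead of a fixed threshold, the cut-off is chosen so that exactly the desired number of instances receives the label 1. The function name and the label coding 1 and -1 are assumptions; ties at the cut-off are broken by index order.

```python
from typing import List, Sequence


def classify_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Label the k highest-scoring instances as class 1 and all others as -1."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must lie between 0 and the number of instances")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    return [1 if i in top else -1 for i in range(len(scores))]
```

For example, `classify_top_k([0.2, 0.9, 0.5, 0.7], 2)` labels the second and the fourth instance as class 1.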

The only combination that does not seem to be meaningful at all would be weak partite ranking problems. By its inherent nature, a weak ranking problem imposes a binarity which cannot be reasonably given for the partite case. Even in the document retrieval setting, a weak partite ranking problem may be thought of as trying to find the $K$ most important documents, which would imply that the information already given by the classes would be boiled down to essentially two classes, so this combination is not reasonable.

### 2.6 Other ranking problems

In this subsection, we want to clarify the scope of this review by briefly referring to examples of probabilistic resp. unsupervised ranking problems.

First, probabilistic approaches predict a probability distribution on the set of permutations of the instances, i.e., the set . Two prominent models are the Mallows model and the Plackett-Luce model. The Mallows model (Mallows (1957)) is based on distances between different permutations, in general based on Kendall’s Tau, which leads to a maximum likelihood approach. The Plackett-Luce model (Luce (1959), Plackett (1975)) performs a Bayes estimation. However, we do not go into detail since the types of algorithms that we encounter in this review are different.

Unsupervised ranking refers to the situation where instances $X_1,\dots,X_n$ are given and should be ranked according to their degree of anomaly as described in Clémençon and Robbiano (2014), who therefore refer to it as anomaly ranking. They propose a so-called mass-volume (MV) curve as quality criterion for the unsupervised ranking problem and show in (Clémençon and Robbiano, 2014, Thm. 1) under which conditions the ROC curve is related to the MV curve in terms of a bijection. They extend their TreeRank algorithm for supervised binary ranking problems (see Subsection 3.2) to this case. See also Goix et al. (2015), Clémençon et al. (2016) and Clémençon and Thomas (2017) as well as references therein for further details on anomaly ranking.

The PageRank algorithm (Page et al. (1999)) can also be identified with unsupervised ranking since it does not invoke any response variable but is based on a graphical model including an adjacency matrix that represents the links connecting the different websites.

## 3 Current techniques to solve ranking problems

This section is divided into four parts. Each subsection is devoted to a particular underlying machine learning algorithm for the discussed ranking approach, i.e., Support Vector Machines (SVM), trees, Boosting and Neural Networks resp. Deep Learning.

### 3.1 SVM-type approaches

Joachims (2002) provided the RankingSVM algorithm for document retrieval which is essentially based on a similar approach for ordinal regression introduced in Herbrich et al. (1999). In their situation, a set of documents is given. The goal is that for a given query, a scoring function has to be computed such that the ordering of the documents according to the scoring function is as concordant as possible with the true ordering according to the relevance of the documents w.r.t. the query.

Their goal is to solve a hard bipartite ranking problem, but they do not formulate the hard ranking loss but rather the constraint inequalities $s(X_i)>s(X_j)$ for $X_i$ being more relevant than $X_j$, given each of the queries. As they argue, trying to find a scoring function such that every inequality is satisfied would be NP-hard, so they introduce slack variables and formulate the problem as a standard SVM problem with all the relaxed inequalities as constraints, so that one gets a standard SVM-type solution for a kernel as identified in Clémençon et al. (2013b). Due to the equivalence of SVM problems and structural risk minimization problems with a Hinge loss, Clémençon et al. (2013b) translated the criterion in Joachims (2002) into the regularized pair-wise empirical loss

.

where is some Reproducing Kernel Hilbert Space (RKHS) defined by a kernel (see for example Schölkopf et al. (2001)). They call their algorithm RankingSVM.
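To make the structure of a regularized pair-wise Hinge loss concrete, the following minimal sketch evaluates such an objective for a linear scoring function. It is an illustration under stated assumptions, not the exact criterion of the cited papers: the linear scoring function, the averaging over preference pairs and the squared Euclidean penalty with the hypothetical weight `lam` are all choices made here for the example.

```python
from typing import Sequence


def pairwise_hinge_objective(
    w: Sequence[float],
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    lam: float,
) -> float:
    """Average pairwise Hinge loss plus a squared-norm penalty (illustrative)."""
    scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for x in X]
    loss = 0.0
    n_pairs = 0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                # Instance i should be scored above instance j by a margin of 1.
                n_pairs += 1
                loss += max(0.0, 1.0 - (scores[i] - scores[j]))
    penalty = lam * sum(w_k * w_k for w_k in w)
    return (loss / n_pairs if n_pairs else 0.0) + penalty
```

The Hinge term is zero for a pair as soon as the more relevant instance is scored at least one unit above the less relevant one, which is the relaxed form of the constraint inequalities described above.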

An implementation of RankingSVM is available in the C software package $SVM^{light}$ (Joachims (1999)), and an improved implementation in the software package $SVM^{rank}$ relies on the cutting-plane algorithm from Joachims (2006). As for the computation of the solutions, note that Chapelle and Keerthi (2010) argued that the implementation for RankingSVM requires the computation of all pairwise differences which leads to a complexity of . They propose a truncated Newton step which is computed via conjugate gradients in order to remedy this issue and arrive at the MATLAB implementation PRSVM, essentially reducing the respective complexity to for . Chen et al. (2017) accelerate the computation of the kernel matrix for the case by invoking the kernel approximation which generates an approximate kernel Hilbert space and provides an SVM solution of the form

They propose two methods to get a suitable kernel approximation. The first is a Nyström approximation where rows of , say, ,…,, are sampled uniformly, followed by a singular value decomposition of the matrix . Truncating the SVD by taking just the first columns of the orthonormal matrix and the upper left submatrix of the diagonal matrix, one gets a rank-approximation, reducing the complexity to . Another strategy is to Fourier transform the kernel, i.e.,

,

and to draw samples according to , providing a kernel approximation using Bochner's theorem. Although the approximation error is higher than for the Nyström approximation (for equal ), the complexity is just . Chen et al. (2017) provide publicly available MATLAB code.
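The Fourier-based strategy can be illustrated with random Fourier features. The sketch below is not the code of Chen et al. (2017); it assumes a Gaussian kernel $\exp(-\gamma\|x-x'\|^2)$, whose spectral distribution by Bochner's theorem is a centered normal distribution with variance $2\gamma$ per coordinate. The function name, the parameter names and the seed handling are hypothetical.

```python
import math
import random
from typing import List, Sequence


def random_fourier_features(
    X: Sequence[Sequence[float]],
    n_features: int,
    gamma: float,
    seed: int = 0,
) -> List[List[float]]:
    """Map each row x to z(x) such that <z(x), z(x')> approximates
    the Gaussian kernel exp(-gamma * ||x - x'||^2)."""
    rng = random.Random(seed)
    p = len(X[0])
    sigma = math.sqrt(2.0 * gamma)
    # Frequencies drawn from the spectral distribution of the kernel.
    W = [[rng.gauss(0.0, sigma) for _ in range(p)] for _ in range(n_features)]
    # Random phases, uniform on [0, 2 * pi].
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)
    return [
        [
            scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + phase)
            for w, phase in zip(W, b)
        ]
        for x in X
    ]
```

A linear ranking method applied to the transformed features then approximates the corresponding kernel method without forming the full kernel matrix; the approximation improves as the number of random features grows.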

Rakotomamonjy (2004) and Ataman and Street (2005) use the fact that the binary hard ranking problem can be solved by maximizing the AUC of the scoring function. Since the responses are binary-valued, Rakotomamonjy (2004) explicitly distinguishes between positive and negative instances by writing resp. for the features. The empirical AUC can be estimated by

for . Using the definition of as equality constraint, Rakotomamonjy (2004) formulate the problem as SVM-type problem by considering linear scoring functions . They show that the solution essentially has the form

in the general case when using kernels. In Rakotomamonjy (2004), the algorithm is applied to different data sets, including a cancer and a credit data set. They conclude that their algorithm also provides good accuracy. Ataman and Street (2005) try a MATLAB and a WEKA implementation; the algorithm from Rakotomamonjy (2004) can be found in a MATLAB toolbox (http://asi.insa-rouen.fr/enseignants/~arakoto/toolbox/, Canu et al. (2005)).

Brefeld and Scheffer (2005) provide a very similar approach, but they provide both a so-called ”1-Norm” and a ”2-Norm” problem, namely

for the target function of the SVM, where , and the corresponding solutions. However, a recommendation for the choice of is not given. Due to the evaluation of the kernel matrix and the quadratically growing number of constraints, their algorithm is of complexity . They do, however, make some suggestions on how to reduce the complexity.

Cao et al. (2006) argue that a major weakness of RankingSVM is that misrankings at the top of the list incur the same loss as misrankings at the bottom. Therefore, they propose a weighted variant of the Hinge loss in which the weights increase with the importance of the documents and the queries. They apply their algorithm to the OHSUMED data set.
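The idea of a weighted pair-wise Hinge loss can be sketched as follows. This is only an illustration under stated assumptions: the weight function `pair_weight` is a hypothetical, user-supplied placeholder and does not reproduce the specific weighting scheme of Cao et al. (2006).

```python
def weighted_pairwise_hinge(scores, labels, pair_weight):
    """Weighted pair-wise Hinge loss.

    For every pair (i, j) with labels[i] > labels[j], the Hinge loss
    max(0, 1 - (scores[i] - scores[j])) is multiplied by
    pair_weight(labels[i], labels[j]).
    """
    total = 0.0
    for s_i, y_i in zip(scores, labels):
        for s_j, y_j in zip(scores, labels):
            if y_i > y_j:
                margin = s_i - s_j
                total += pair_weight(y_i, y_j) * max(0.0, 1.0 - margin)
    return total
```

With `pair_weight = lambda y_i, y_j: 1.0`, the function reduces to the unweighted pair-wise Hinge loss; a weight that grows with the label of the higher-ranked instance penalizes misrankings at the top of the list more strongly.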

Jung et al. (2011) provide Ensemble RankingSVM by combining different RankingSVM models.

However, since SVM-type solutions are not sparse, there are several approaches to construct SVM-type ranking functions with feature selection.

Tian et al. (2011) consider essentially the same problem as Rakotomamonjy (2004), but with the crucial difference that the target function is

for , so has been replaced by a concave loss. They solve the problem with a multi-stage convex relaxation technique. They conclude that by the norm, the algorithm indeed performs feature selection, which follows from the fact that an SVM problem can equivalently be written as a regularized problem with the Hinge loss. Since the number of constraints grows quadratically with the number of observations, they propose to cluster the observations first and to perform the computations only on the cluster representatives.

Another approach is given in Lai et al. (2013a), where the quadratic penalty of the equivalent regularized formulation is replaced with an $\ell_1$ regularization term and the squared Hinge loss is used. They solve the problem by invoking Fenchel duality (hence the name FenchelRank) and prove convergence of the solution. After experiments on real data sets for document retrieval, they report sparse solutions as well as superiority of FenchelRank over non-sparse algorithms. They implement their method in MATLAB. An iterative gradient procedure for this problem has been developed in Lai et al. (2013b) and shows comparable performance.

As an extension of FenchelRank, Laporte et al. (2014) tackle the analogous problem with nonconvex regularization to get even sparser models. They solve the problem with a so-called majorization-minimization method where the nonconvex regularization term is represented by the difference of two convex functions. In addition, for convex regularization, they present an approach that relies on differentiability and Lipschitz continuity of the penalty term so that the ISTA algorithm can be applied. They provide publicly available MATLAB code.

Another approach that does not provide an SVM-type solution at first glance is given in Pahikkala et al. (2007). They intend to predict the differences of the responses by the differences of the scores assigned to the respective features, i.e., to essentially solve

 1n(n−1)∑∑i

for some function and some kernel with corresponding RKHS . Since this problem is clearly not tractable, as Pahikkala et al. (2007) point out, they instead minimize the regularized least-squares-type criterion

 1n(n−1)∑∑i

Using the representer theorem (see e.g. Schölkopf et al. (2001)), the solution has the form

for some . The algorithm is called RankRLS (”regularized least squares”). The complexity of the algorithm is of order resulting from matrix inversion and matrix multiplication. Note that Pahikkala et al. (2010) provided a greedy method to compute the respective inverse by successively selecting up to features which results in an overall complexity of their greedy RankRLS algorithm of . Pahikkala et al. (2010) provided a link leading to implementations of both RankRLS and greedy RankRLS, but it does not seem to be available anymore.
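The least-squares idea behind RankRLS can be illustrated with a small primal sketch. It assumes a linear scoring function, a squared loss between response differences and score differences over all pairs, and a ridge penalty with parameter `lam`; these choices, the function names and the tiny Gaussian-elimination solver are illustrative assumptions and do not reproduce the kernelized algorithm of Pahikkala et al. (2007).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_pairwise_ridge(features, responses, lam=1e-3):
    """Minimize sum_{i<j} ((y_i - y_j) - w'(x_i - x_j))^2 + lam * ||w||^2.

    The normal equations are (sum dx dx' + lam * I) w = sum dx * dy,
    where dx and dy denote pair-wise differences.
    """
    n, p = len(features), len(features[0])
    gram = [[lam if r == c else 0.0 for c in range(p)] for r in range(p)]
    rhs = [0.0] * p
    for i in range(n):
        for j in range(i + 1, n):
            dx = [a - b for a, b in zip(features[i], features[j])]
            dy = responses[i] - responses[j]
            for r in range(p):
                rhs[r] += dx[r] * dy
                for c in range(p):
                    gram[r][c] += dx[r] * dx[c]
    return solve_linear_system(gram, rhs)
```

Since only differences enter the criterion, an intercept in the responses has no influence on the fitted coefficients, which reflects that only the ordering of the scores matters.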

Summarizing, there exists a rich variety of SVM-type ranking algorithms that minimize the hard ranking loss, including approaches that provide sparse solutions. The approach of Cao et al. (2006) minimizes a weighted hard ranking loss and can be seen as the closest SVM-type approach for localized ranking problems. However, note that the algorithms are tailored to bipartite ranking problems. Furthermore, SVM solutions are in general hard to interpret. In contrast to the AUC-maximizing approaches, the other algorithms make use of a surrogate loss function for the hard ranking loss which is either a pair-wise Hinge or pair-wise squared loss.

### 3.2 Tree-type approaches

Clémençon and Vayatis (2008), Clémençon and Vayatis (2010) and Clémençon and Vayatis (2009), for instance, also concentrate on AUC maximization to solve binary ranking problems, as does, for example, Rakotomamonjy (2004), but in a stricter and more sophisticated way. Given the true conditional probability $\eta$ and a scoring function $s$, they introduce metrics on the ROC space which are

and

where $\mathrm{ROC}^*$ is the optimal ROC curve and $\mathrm{ROC}_s$ the ROC curve induced by the scoring function $s$. Note that the absolute value in the supremum is not necessary since, by definition, the optimal ROC curve dominates every competitor ROC curve. The idea in the cited references is to optimize the ROC according to $d_\infty$, i.e., in an $L_\infty$ sense, because an $L_1$ optimization is nothing but an AUC optimization due to

$$d_1(s,\eta)=\int_0^1 \mathrm{ROC}^*(\alpha)\,d\alpha-\int_0^1 \mathrm{ROC}_s(\alpha)\,d\alpha=\mathrm{AUC}^*-\mathrm{AUC}_s.$$

An AUC-optimization is not appropriate according to the authors since different ROC curves can have the same AUC.
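A small numerical illustration of this argument: the two piece-wise linear ROC curves below are hypothetical examples (not taken from the cited papers) that have the same AUC but differ by 0.25 in the supremum distance. The helper names are assumptions made for this sketch.

```python
def interpolate(nodes, alpha):
    """Evaluate a piece-wise linear curve given by (alpha, value) nodes."""
    for (a0, b0), (a1, b1) in zip(nodes, nodes[1:]):
        if a0 <= alpha <= a1:
            return b0 + (b1 - b0) * (alpha - a0) / (a1 - a0)
    raise ValueError("alpha lies outside the range of the nodes")


def auc(nodes):
    """Area under a piece-wise linear curve (trapezoidal rule, exact here)."""
    return sum((a1 - a0) * (b0 + b1) / 2.0
               for (a0, b0), (a1, b1) in zip(nodes, nodes[1:]))


def sup_distance(nodes_1, nodes_2):
    """Supremum distance of two piece-wise linear curves.

    The difference is again piece-wise linear, so the maximum of its
    absolute value is attained at one of the nodes of either curve.
    """
    grid = sorted({a for a, _ in nodes_1} | {a for a, _ in nodes_2})
    return max(abs(interpolate(nodes_1, a) - interpolate(nodes_2, a))
               for a in grid)


# Two hypothetical ROC curves with strictly increasing abscissae.
roc_a = [(0.0, 0.0), (0.25, 0.75), (1.0, 1.0)]
roc_b = [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)]
```

Both curves have an AUC of 0.75, so an AUC-based criterion cannot distinguish them, whereas `sup_distance(roc_a, roc_b)` equals 0.25.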

Clémençon and coauthors provide tree-type algorithms which turn out to be an impressively flexible class of ranking algorithms that can be applied to all hard ranking problems as well as to localized binary ranking problems.

As for binary ranking problems, they provided TreeRank and RankOver (Clémençon and Vayatis (2008), Clémençon and Vayatis (2010)). The idea behind the TreeRank algorithm is to divide the feature space into disjoint parts and to construct a piece-wise constant scoring function

for . This results in a ROC curve that is piece-wise linear with nodes (not counting and ) as shown in (Clémençon and Vayatis, 2008, Prop. 13). The TreeRank algorithm then recursively adds nodes between all existing nodes such that the ROC curve approximates the optimal ROC curve by splitting each region in two parts. More precisely, one starts with the region and the coefficients . In each stage of the tree and in every iteration , one computes the estimates

and optimizes the entropy measure

$$\mathrm{Ent}(C_{d,k}):=(\alpha_{d,k+1}-\alpha_{d,k})\,\hat\beta(C_{d,k})-(\beta_{d,k+1}-\beta_{d,k})\,\hat\alpha(C_{d,k})$$

by finding a subset of which maximizes this empirical entropy. The coefficients are updated recursively.

Similarly, the RankOver algorithm constructs a piece-wise linear approximation of the optimal ROC curve by computing a piece-wise constant scoring function, too, but instead of partitioning the feature space, it generates a partition of the ROC space. However, the authors seem to prefer TreeRank over it since their subsequent algorithms are based on the former, so we do not explain more details of RankOver.

Clémençon and Vayatis (2008) already mention that TreeRank may be used as weak ranker for a Boosting-type approach.

Extensions that combine the TreeRank algorithm with bagging resp. use it in a RandomForest-like sense are given in Clémençon et al. (2009), Clémençon et al. (2013a). A crucial question is how to combine the rankings predicted by the different trees. This leads to a so-called Kemeny aggregation (Kemeny (1959), see also Korba et al. (2017) for theoretical aspects of rank aggregation) where a consensus ranking is computed. Given some distance measure which in Clémençon et al. (2009) and Clémençon et al. (2013a) may be a Spearman correlation or Kendall’s Tau, the consensus ranking, represented by a permutation , is the solution of

for the predicted permutations for tree , respectively. As for the RandomForest-type approach (”Ranking forest”), Clémençon et al. (2013a) make two suggestions how to randomize the features in each node.
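To make the aggregation step concrete, the following sketch computes a consensus ranking by brute force, assuming the Kendall distance (the number of discordant pairs) as the distance measure and rankings given as rank vectors. The function names are hypothetical; since the search enumerates all permutations, it is only feasible for a handful of items and is not meant to reflect the implementation used in the cited works.

```python
from itertools import combinations, permutations


def kendall_distance(sigma, tau):
    """Number of item pairs that the two rank vectors order differently."""
    return sum(1 for i, j in combinations(range(len(sigma)), 2)
               if (sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0)


def kemeny_consensus(rankings):
    """Rank vector minimizing the summed Kendall distance to all rankings.

    rankings: list of rank vectors, where ranking[i] is the rank of item i
    (1 = top). All permutations of the ranks are enumerated.
    """
    n_items = len(rankings[0])
    candidates = permutations(range(1, n_items + 1))
    return min(candidates,
               key=lambda sigma: sum(kendall_distance(sigma, r)
                                     for r in rankings))
```

For example, the three rank vectors `(1, 2, 3)`, `(1, 3, 2)` and `(2, 1, 3)` lead to the consensus `(1, 2, 3)`, which agrees with the pair-wise majority preferences.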

As for the pruning of ranking trees, we refer to Clémençon et al. (2011) and Clémençon et al. (2013a) who recommend using the penalized empirical AUC as the pruning criterion, i.e., for a tree , one selects the subtree which maximizes

where denotes the scoring function corresponding to tree .

The TreeRank algorithm was available in the package TreeRank, which has since been removed. Nevertheless, the source code is still available.

Theoretically, these tree-type algorithms provide an advantage over the algorithms that optimize the AUC since they approximate the optimal ROC curve in an $L_\infty$ sense while the competitors just optimize the ROC in an $L_1$ sense (see (Clémençon and Vayatis, 2010, Sec. 2.2)). On the other hand, they suffer from strong assumptions since it is required that the optimal ROC curve is known. Additionally, this optimal ROC curve has to fulfill some regularity conditions, namely differentiability and concavity for the TreeRank algorithm and twice differentiability with bounded second derivatives for the RankOver algorithm.

These tree-type algorithms are tailored to bipartite ranking problems. However, as pointed out in Clémençon et al. (2013b), they can be used for local AUC optimization (see Def. 2.4), so they are applicable for both hard and localized bipartite ranking problems while the AUC-maximizing competitors show inferior local ranking performance in the simulation studies of Clémençon et al. (2013b).

As for the partite ranking problems, Clémençon et al. (2013c) argued that they can be regarded as a collection of bipartite ranking problems if one considers approaches like one-versus-all or one-versus-one. In Clémençon et al. (2013c), they apply different algorithms tailored to bipartite ranking problems like TreeRank, RankBoost or RankingSVM and evaluate their performance in terms of the VUS criterion.

However, since the algorithms are not designed for VUS-optimization, Clémençon and Robbiano (2015b) modify their TreeRank algorithm such that the splits at each node are performed first in a one-versus-one sense (but only for adjacent classes) and then the optimal split among them is selected according to the VUS criterion. The resulting TreeRankTournament algorithm therefore solves the hard partite ranking problem. Clémençon and Robbiano (2015a) provide a bagged and a RandomForest-type version of this algorithm, analogously to the bipartite case.

Recently, Clémençon and Achab (2017) provided pioneering work on the hard continuous ranking problem, which had not been considered before. Let w.l.o.g. . Then each subproblem

for , i.e., given should be stochastically larger than given , is a bipartite ranking problem, so the continuous ranking problem can be regarded as a so-called ”continuum” of bipartite ranking problems (Clémençon and Achab (2017)).

As a suitable performance measure, they provide the area under the integrated ROC curve

where indicates the ROC curve of scoring function for the bipartite ranking problem corresponding to and where is the marginal distribution of . Alternatively, they make use of Kendall’s $\tau$ as a performance measure for continuous ranking.

The approach presented in Clémençon and Achab (2017) manifests itself in the tree-type CRank algorithm that divides the input space and therefore the training data into disjoint regions. In each step/node, the binary classification problem corresponding to the median of the current part of the training data is formulated and solved. Then, all instances whose predicted label was positive are delegated to the left child node, the others to the right child node. Stopping when a predefined depth of the tree is reached, the instance of the leftmost leaf is ranked highest and so forth, so the rightmost leaf indicates the bottom instance.

Clémençon and Achab (2017) already announced a forthcoming paper where a RandomForest-type approach for CRank will be presented.

All these tree-type approaches focus on a sophisticated optimization of the AUC or another appropriate criterion. At the price of getting models that are difficult to interpret, these techniques are very flexible and are applicable to most types of ranking problems.

### 3.3 Boosting-type approaches

In the case of bipartite ranking, what is sometimes called the ”plug-in approach”, which estimates the conditional probability , can be realized for example by LogitBoost (see e.g. Bühlmann and Van De Geer (2011)), i.e., minimizing the loss

The resulting function is then used as a (valued) scoring function for the ranking. However, the plug-in approach has disadvantages when facing high-dimensional data and it furthermore just optimizes the ROC curve in an $L_1$ sense as pointed out in Clémençon and Vayatis (2008), Clémençon and Vayatis (2010). Taking a closer look at this loss function, it is indeed a convex surrogate of the misclassification loss and does not respect a pair-wise structure. Moreover, one merely applies an algorithm that solves a classification problem, which is less informative than a ranking problem; this is another reason why this approach cannot be optimal. As mentioned in Clémençon et al. (2013b), a kernel logistic regression is also conceivable in the same plug-in sense (with the same weaknesses).

Freund et al. (2003) developed a Boosting-type algorithm (RankBoost) which combines weak rankers in an AdaBoost-style (for the latter, see Freund and Schapire (1997)) benefitting from the binarity of the response variable. First, they propose a distribution on the space which, for data , is represented as a matrix that essentially contains weights. These weights can be thought of as representing the importance of ranking the corresponding pair correctly. As for the weak rankers, each of which is nothing but a scoring function , they consider either the identity function or a function that maps the features essentially into the set according to some threshold. More precisely, the weak ranker is chosen such that the quality measure

is maximized where again denotes the ranking rule introduced in Definition 2.1, meaning that the sum runs over all pairs where is ranked higher than . As the AdaBoost algorithm minimizes the exponential surrogate of the 0/1-classification loss, Clémençon et al. (2013b) pointed out that RankBoost minimizes the pair-wise surrogate loss function

.

Note that there is a small mistake in Section 3.2.1 of Clémençon et al. (2013b) since the minus sign in the exponential function is missing. But if and , the sign of the product is positive which would imply a high loss due to a positive exponent without the minus sign.
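The role of the minus sign can be checked numerically. The sketch below assumes that the pair-wise exponential surrogate has the form exp(-(Y_i - Y_j)(s(X_i) - s(X_j))) and averages it over all pairs with Y_i > Y_j; the restriction to these pairs, the normalization and the function name are assumptions made for illustration.

```python
import math


def pairwise_exponential_loss(scores, labels):
    """Average of exp(-(y_i - y_j) * (s_i - s_j)) over pairs with y_i > y_j."""
    total = 0.0
    n_pairs = 0
    for s_i, y_i in zip(scores, labels):
        for s_j, y_j in zip(scores, labels):
            if y_i > y_j:
                total += math.exp(-(y_i - y_j) * (s_i - s_j))
                n_pairs += 1
    # Without any pair of differently labelled instances, the loss is set to 0.
    return total / n_pairs if n_pairs > 0 else 0.0
```

A correctly ordered pair, e.g. labels `[1, 0]` with scores `[1.0, 0.0]`, yields the small loss exp(-1), whereas the reversed scores yield exp(1); without the minus sign, the correct ordering would be the one that is penalized.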

It is shown in Rudin and Schapire (2009) that in the case of binary outcome variables, RankBoost and the classifier AdaBoost are equivalent under very weak assumptions. Therefore, RankBoost can also be seen as an AUC maximizer in the bipartite ranking problem. Freund et al. (2003) apply RankBoost to document retrieval. The algorithm is available in the RankLib library (Dang (2013)).

An extension of RankBoost has been provided in Rudin (2009). They intend to optimize essentially

for some (Rudin (2009) originally distinguishes positive and negative instances, but Clémençon et al. (2013b) used the notation as in the display above). The argument behind this power loss given in Rudin (2009) is that the higher is chosen, the higher the difference between the loss of misrankings at the top of the list and misrankings at the bottom of the list becomes. The algorithm parallels the RankBoost algorithm in combining weak rankers, but since the weights are not always analytically computable, they may use a line search. They call their algorithm p-Norm-Push. The case has been studied in Rakotomamonjy (2012).
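The effect of the power can be illustrated with a short sketch written in the notation with separate positive and negative instances. It assumes the objective to be the sum over all negative instances of the p-th power of the summed exponential losses against all positive instances; the function name and the example values are hypothetical.

```python
import math


def p_norm_push_loss(scores_pos, scores_neg, p):
    """Sum over negatives of (sum over positives of exp(-(s_pos - s_neg)))^p."""
    total = 0.0
    for s_neg in scores_neg:
        inner = sum(math.exp(-(s_pos - s_neg)) for s_pos in scores_pos)
        total += inner ** p
    return total


# Two hypothetical score configurations with identical loss for p = 1.
pos = [0.0, 0.0]
neg_balanced = [0.0, 0.0]
neg_one_high = [math.log(1.5), math.log(0.5)]
```

For `p = 1`, both configurations have a loss of 4. For `p = 2`, the balanced configuration has a loss of 8 while the configuration with one highly scored negative instance has a loss of 10, so a single negative instance near the top of the list is penalized more strongly as the power grows.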

So, while RankBoost is tailored to hard bipartite ranking problems (and may be used for partite ranking problems in the sense of Clémençon et al. (2013c)), the p-Norm-Push is closest to handle localized bipartite ranking problems. However, the results of the simulation study in Clémençon et al. (2013b) reveal that the localized AUC criterion for the corresponding predictions is not better than for RankBoost. To the best of our knowledge, the p-Norm-Push has never been applied to partite ranking problems.

Another generalization of RankBoost has been proposed in Zheng et al. (2008), again for hard bipartite ranking problems. They suggest using a sufficiently regular surrogate of the ranking loss like a squared or a squared Hinge loss and applying Gradient Boosting (Friedman (2001), Bühlmann and Hothorn (2007)) to this surrogate loss. As the weak learner, they consider a so-called ”regression weak learner” to fit the gradients in each iteration. They apply their algorithm to document retrieval data.

In contrast to the already reviewed Boosting-type approaches which are designed for bipartite ranking problems, Werner (2019) argues that in the context of risk-based auditing (see e.g. Alm et al. (1993), Gupta and Nagadevara (2007), Hsu et al. (2015)), it is more reasonable to solve a continuous ranking problem. The risk-based auditing context is in fact an example where even the type of the suitable ranking problem is not determined in advance. One can formulate the problem as a binary ranking problem where the response variable is either tax compliance or a wrong report of the tax liabilities. However, just as classification is less informative than ranking, since the classes do not have to be ordered whereas ranking incorporates an ordering, ranking in turn is less informative than regression, since regression tries to predict the actual response values whereas ranking only tries to find the right ordering. An analogous argument holds for binary ranking problems and continuous ranking problems. If one states a binary ranking problem, one would only learn which taxpayer is most likely to misreport his or her income, without any information on the amount. On the other hand, if one sets up a continuous ranking problem where the amount of damage is the variable of interest, one can directly get information about the compliance of the taxpayer by looking at the sign of the response value. In particular, if information on the compliance is available, then one can assume that the information on the amount of additional payment or back-payment has also been collected, so imposing a binary ranking problem would lead to a large loss of information.

The Boosting-type and most of the SVM-type approaches that we reviewed so far invoke surrogate losses of the hard ranking loss (or even of the 0/1-classification loss). It is discussed in Werner (2019) whether an analogous approach is appropriate for a Gradient Boosting algorithm (see e.g. Bühlmann and Hothorn (2007)) for the hard continuous ranking problem. They conclude that due to the support of the response variable which is no longer just or some finite set as in the partite ranking problem, exponential or Hinge surrogates would dramatically fail to be meaningful surrogates for the hard ranking loss. Another weakness would be the necessity to evaluate the gradients of the pair-wise loss (which are sums themselves) in each Boosting iteration, making the algorithm computationally expensive.

To handle these issues, Werner and Ruckdeschel (2019) proposed a so-called ”gradient-free Gradient Boosting” approach to make Gradient Boosting accessible to non-regular loss functions like the hard ranking loss. Their approach is based on Boosting with component-wise linear baselearners (Bühlmann (2006)) which minimizes the squared loss by successively selecting the simple linear regression model, i.e., the linear regression model based on one single column, that minimizes the squared loss w.r.t. the resulting combined model most. Werner and Ruckdeschel (2019) propose to alternatingly perform of these standard iterations for some and one ”singular iteration” where the linear baselearner which improves the hard ranking loss of the combined model most is selected.

However, although they derive estimation and consistency results for their ”SingBoost” algorithm (based on similar theorems in Bühlmann (2006) resp. Bühlmann and Van De Geer (2011)), they discuss that the resulting Boosting solution suffers from overfitting (as Gradient Boosting solutions without early stopping generally do) and that the predictor set corresponding to the solution is not stable. They argue that a combination with Stability Selection (Meinshausen and Bühlmann (2010), Hofner et al. (2015)) would be necessary, which is outlined in Werner (2019). Nevertheless, the approach presented in Werner (2019) is mainly designed for the hard continuous ranking problem.

As for the computational complexity of SingBoost, they show that using the formula from Lemma 2.1 which is applicable since ties occur with probability zero for continuous-valued response variables, the algorithm requires computations for the number of Boosting iterations which is only insignificantly more than the complexity of Boosting of . Their algorithm is implemented in , though not yet publicly available.

### 3.4 Approaches with neural networks and Deep Learning

Burges et al. (2005) suggest defining a pair-wise variant of the cross-entropy loss as a surrogate for the hard ranking loss. More precisely, their pair-wise cross-entropy loss is given by

where

and where is the analog for the theoretical differences. From a probabilistic point of view, the are interpreted as posterior probabilities that instance is ranked higher than instance . The main contribution of Burges et al. (2005) is to generalize the back-propagation algorithm used when fitting neural networks.
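A minimal sketch of such a pair-wise cross-entropy loss is given below. It assumes that the modeled probability of instance i being ranked higher than instance j is the logistic transform of the score difference, as in Burges et al. (2005); the function names and the numerically stable evaluation of log(1 + exp(o)) are choices made for this illustration.

```python
import math


def pair_probability(score_i, score_j):
    """Modeled probability that instance i is ranked higher than instance j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pairwise_cross_entropy(score_i, score_j, target_prob):
    """Cross-entropy between the target and the modeled pair probability.

    With o = score_i - score_j, the loss
    -t * log(P) - (1 - t) * log(1 - P) simplifies to -t * o + log(1 + exp(o)).
    """
    o = score_i - score_j
    # Stable evaluation of log(1 + exp(o)) for large |o|.
    softplus = max(o, 0.0) + math.log1p(math.exp(-abs(o)))
    return -target_prob * o + softplus
```

For a target probability of 1, the loss equals `-log(pair_probability(score_i, score_j))`, so it decreases as the score difference grows; for a target probability of 0.5 and equal scores, it equals log(2).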

They propose a two-layer neural network and define the following pair-wise linear combination of features:

$$s(X_i):=h^{(3)}\left(\sum_j w^{(32)}_{ij}\,h^{(2)}\left(\sum_k w^{(21)}_{jk}X_k+b^{(2)}_j\right)+b^{(3)}_i\right).$$

The $h^{(k)}$ are considered to be activation functions. The back-propagation algorithm then is based on the partial derivatives of the pair-wise loss w.r.t. the weights resp. the offsets.

Again, this RankNet algorithm is tailored to the hard bipartite ranking problem and the experiments in Burges et al. (2005) are based on document retrieval data. It is available in the RankLib library (Dang (2013)).

Song et al. (2016) introduce an approach based on gradients of the expected loss. Their work is based on Hazan et al. (2010) who proved that

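$$\nabla_\theta \mathbb{E}[L(Y,s_\theta(X))]=\lim_{\epsilon\to 0}\left(\frac{1}{\epsilon}\,\mathbb{E}\left[\nabla_\theta F(X,Y_{\mathrm{direct}},\theta)-\nabla_\theta F(X,Y_\theta,\theta)\right]\right)$$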

where

and

for some function that is linear in . Song et al. (2016) extend this result to non-linear and non-convex functions.

In fact, Song et al. (2016) apply their results to bipartite ranking problems by setting

for the ranking rule introduced in Def. 2.1 and by invoking the loss function

$$L(Y,\hat Y):=1-\frac{1}{n_+}\sum_{j:\,rk(\hat Y_j)=1}\frac{1}{n_+}\sum_i I(rk(\hat Y_i)\le j)\,I(Y_i=1)$$

where is the scoring function that is learned by the Deep Neural Network. Song et al. (2016) prove how their theoretical results can be applied to the case with the given functions and and show that a back-propagation strategy with a suitable Bellman recursion is available.

The approach in Song et al. (2016) is designed for bipartite hard ranking problems.

Engilberge et al. (2019) propose to use Deep Learning and essentially combine two Deep Neural Networks. They discuss several smooth surrogate losses, for example for losses corresponding to Spearman correlation, Mean Average Precision or Recall at $k$, and argue that since they are all rank-based, i.e., depend on and , it is hard to optimize them due to non-differentiability. Therefore, they propose to invoke a real-valued scoring function such that the fitted scoring function approximates the true ranking vector as well as possible by considering the loss function

According to (Engilberge et al., 2019, Sec. 3.2), needs to be trained on synthetic training data, using a sorting Deep Neural Network.

Having real-valued scores, they propose the surrogate loss

for a loss based on Spearman’s correlation and in the case of multilabel responses with classes , they propose the surrogate loss

based on the Mean Average Precision, where is a binary vector with ones where the respective component of is from class