Perceptron-like Algorithms and Generalization Bounds for Learning to Rank
Abstract
Learning to rank is a supervised learning problem where the output space is the space of rankings but the supervision space is the space of relevance scores. We make theoretical contributions to the learning to rank problem both in the online and batch settings. First, we propose a perceptron-like algorithm for learning a ranking function in an online setting. Our algorithm is an extension of the classic perceptron algorithm for the classification problem. Second, in the setting of batch learning, we introduce a sufficient condition for convex ranking surrogates to ensure a generalization bound that is independent of number of objects per query. Our bound holds when linear ranking functions are used: a common practice in many learning to rank algorithms. En route to developing the online algorithm and generalization bound, we propose a novel family of listwise large margin ranking surrogates. Our novel surrogate family is obtained by modifying a well-known pairwise large margin ranking surrogate and is distinct from the listwise large margin surrogates developed using the structured prediction framework. Using the proposed family, we provide a guaranteed upper bound on the cumulative NDCG (or MAP) induced loss under the perceptron-like algorithm. We also show that the novel surrogates satisfy the generalization bound condition.
1 Introduction
Learning to rank is a supervised learning problem where the output space is the space of rankings of a set of objects. In the learning to rank problem that frequently arises in information retrieval, the objective is to rank documents associated with a query, in the order of the relevance of the documents for the given query. During training, a number of queries, each with their associated documents and relevance levels, are provided. A ranking function is learnt by using the training data with the hope that it will accurately order documents for a test query, according to their respective relevance levels. In order to measure the accuracy of a ranked list, in comparison to the actual relevance scores, various ranking performance measures, such as NDCG [16], MAP [1] and others, have been suggested.
All major performance measures are non-convex and discontinuous in the scores. Therefore, optimizing them during the training phase is a computationally difficult problem. For this reason, several existing ranking methods are based on minimizing surrogate losses, which are easy to optimize. Ranking methods can be broadly grouped into three categories. In the pointwise approach, the problem is formulated as a regression or classification problem, with the objective of predicting the true relevance level of individual documents [11]. In the pairwise approach, document pairs are taken as instances, and the problem is reduced to binary classification (which document in a pair is more relevant?). Examples include RankSVM [15], RankBoost [13], and RankNet [3]. In the listwise approach, the entire list of documents associated with a query is taken as an instance, and listwise surrogates are minimized during training. Examples include ListNet [4] and AdaRank [22].
The listwise method for ranking has become popular since the major performance measures themselves are listwise in nature. Usually, listwise surrogates are used in conjunction with linear ranking functions so that powerful optimization algorithms can be used. Despite the plethora of existing ranking methods, the comparison between them is mainly based on empirical performance on a limited set of publicly available data sets. Moreover, it has been observed that non-linear ranking functions, in conjunction with even simple surrogates, are hard to beat in practice [7]. Important theoretical questions, such as online algorithms with provable guarantees and batch algorithms with generalization error bounds, remain open [8], even for linear ranking functions.
Listwise large margin surrogates form an important sub-class of listwise surrogates. Their use is motivated by the success of large margin surrogates in supervised classification problems. However, existing popular listwise large margin surrogates in the learning to rank literature are derived using the structured prediction framework [9, 23, 5]. In standard structured prediction, the supervision space is the same as the output space of the function being learned. To fit the structured prediction framework to the learning to rank problem (where the supervision is in form of relevance vectors but the output space consists of full rankings of the documents associated with a query), the relevance vectors are arbitrarily mapped to full rankings. Though such an approach can yield good empirical results, it does not lead to well-defined surrogates in the learning to rank setting since the mapping from relevance scores to full rankings is left unspecified (or is arbitrarily chosen).
One important reason for investigating listwise large margin ranking surrogates is to develop an analogue of the perceptron algorithm used in classification [14]. In classification, large margin surrogates have been used to learn classifiers in an online setting using the perceptron. Large margin surrogates have special properties that allow for the establishment of theoretical bounds on the cumulative zero-one loss (viz. the total number of mistakes) without making any statistical assumptions on the data. Perceptron-like algorithms have been developed for ranking but in a different setting [12]. To the best of our knowledge, the perceptron algorithm has not been extended to the learning to rank setting described in this paper where, instead of mistake bounds, we desire bounds on cumulative losses as measured by popular listwise ranking measures such as NDCG and MAP.
The three main contributions of this paper are the following. First, we modify a popular pairwise large margin ranking surrogate to develop a family of listwise large margin ranking surrogates. The family is parameterized by a set of weight vectors that gives us the flexibility to upper bound losses induced by NDCG and MAP. Unlike surrogates designed from a structured prediction perspective, ours directly use the relevance scores and do not require an arbitrary map from relevance scores to full rankings. Second, we use the novel family of surrogates to develop a perceptron-like algorithm for learning to rank. We provide theoretical bounds on the cumulative NDCG and MAP induced losses. If there is a perfect linear ranking function which can rank every instance correctly, the loss bound is independent of the number of training instances, just as in the classic perceptron case. Third, we analyze the generalization bound of the proposed family to understand its performance in a batch setting. In doing so, we provide a sufficient condition for any ranking surrogate (with linear ranking functions) to have a generalization bound independent of the number of documents per query. We show that the proposed family and a few other popular ranking surrogates satisfy the sufficient condition.
We defer all proofs to the supplementary appendix.
2 Problem Definition
In learning to rank, an instance consists of a query $q$, associated with a list of $m$ documents and a corresponding relevance label vector of length $m$. The documents are represented as $d$-dimensional feature vectors. The relevance labels represent how relevant the documents are to the query. The relevance vector can be binary or multi-graded (taking one of several integer relevance levels). Formally, the input space is $\mathcal{X} \subseteq \mathbb{R}^{m \times d}$, representing lists of $m$ documents as $d$-dimensional feature vectors, and the supervision space is $\mathcal{Y} \subseteq \mathbb{R}^{m}$, representing relevance label vectors. It is important to note that the supervision is not in the form of full rankings. In fact, a list of documents usually has multiple correct full rankings corresponding to the relevance vector.
The objective is to learn a ranking function which ranks the documents associated with a query. The prevalent technique in the literature is to learn a scoring function and obtain a ranking by sorting the score vector. For an $X \in \mathcal{X}$, a linear scoring function is $f_w(X) = Xw = s^w \in \mathbb{R}^m$, where $w \in \mathbb{R}^d$. The quality of the learnt ranking function is evaluated on an independent test query by comparing the ranks of the documents according to the scores with their ranks according to the actual relevance labels, using various performance measures. For example, the Normalized Discounted Cumulative Gain (NDCG) measure, for a set of $m$ documents in a test query, with multi-graded relevance vector $R$ and score vector $s$ induced by the ranking function, is defined as follows:
$$\mathrm{NDCG}(s, R) = \frac{1}{Z(R)} \sum_{i=1}^{m} G(R_i)\, D(\pi_s^{-1}(i)) \qquad (1)$$
where $G(r) = 2^r - 1$, $D(i) = \frac{1}{\log_2(i+1)}$, $Z(R) = \max_{\pi} \sum_{i=1}^{m} G(R_i)\, D(\pi^{-1}(i))$. Further, $R_i$ is the relevance level of document $i$ and $\pi_s^{-1}(i)$ is the rank of document $i$ in the permutation $\pi_s$ ($\pi_s$ is the permutation induced by score vector $s$). For example, if document $1$ is placed 3rd in permutation $\pi_s$, then $\pi_s^{-1}(1) = 3$.
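The NDCG computation can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the commonly used gain $G(r) = 2^r - 1$ and discount $D(i) = 1/\log_2(i+1)$, breaks score ties by document index, and returns 0 when every document has zero gain (an arbitrary convention). The function names `ndcg` and `ndcg_loss` are hypothetical.

```python
import numpy as np

def ndcg(scores, relevance):
    """NDCG of the ranking obtained by sorting `scores` in descending order."""
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    m = len(scores)
    gains = 2.0 ** relevance - 1.0                         # assumed G(r) = 2^r - 1
    discounts = 1.0 / np.log2(np.arange(1, m + 1) + 1.0)   # assumed D(i) = 1/log2(i+1)
    order = np.argsort(-scores, kind="stable")             # order[k] = document at rank k+1
    dcg = float(np.sum(gains[order] * discounts))
    ideal = float(np.sum(np.sort(gains)[::-1] * discounts))  # normalizer Z(R)
    return dcg / ideal if ideal > 0 else 0.0               # 0.0 is an arbitrary convention

def ndcg_loss(scores, relevance):
    """NDCG induced loss: maximum NDCG value (1) minus the achieved NDCG."""
    return 1.0 - ndcg(scores, relevance)
```

A score vector that orders the documents by decreasing relevance attains NDCG equal to 1, so its induced loss is 0; any other ordering of documents with distinct gains yields a strictly positive loss.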
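Another popular performance measure, Mean Average Precision (MAP), is defined only for binary relevances:
$$\mathrm{MAP}(s, R) = \frac{1}{r} \sum_{j:\, R_{\pi_s(j)} = 1} \frac{\sum_{i \le j} \mathbb{1}[R_{\pi_s(i)} = 1]}{j} \qquad (2)$$
where $r$ is the total number of relevant documents in the set of $m$ documents. Note that $\pi_s(i)$ indicates the document which is placed at position $i$ in permutation $\pi_s$. Thus, if $\pi_s(3) = 1$, that means the document in 3rd position in $\pi_s$ is document 1.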
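The same measure can be sketched in code for a single query. The sketch below assumes the usual reading of the definition (precision at each position holding a relevant document, averaged over the relevant documents); the name `average_precision`, the tie-breaking by document index, and the value 0 returned when no document is relevant are choices made here for illustration.

```python
import numpy as np

def average_precision(scores, relevance):
    """Average precision of the ranking induced by `scores` for binary `relevance`."""
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=int)
    r = int(relevance.sum())                      # number of relevant documents
    if r == 0:
        return 0.0                                # arbitrary convention for this sketch
    order = np.argsort(-scores, kind="stable")    # order[k] = document at position k+1
    rel_sorted = relevance[order]
    hits = np.cumsum(rel_sorted)                  # relevant documents seen up to each position
    positions = np.arange(1, len(scores) + 1)
    precision_at_k = hits / positions
    # Sum precision only at positions that hold a relevant document.
    return float(np.sum(precision_at_k * rel_sorted) / r)
```

For scores `[3, 2, 1]` and relevance `[1, 0, 1]`, the relevant documents sit at positions 1 and 3, giving precisions 1 and 2/3 and an average of 5/6.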
All ranking performance measures are actually gains intended to be maximized. When we say “NDCG induced loss”, we mean a loss function that simply subtracts NDCG from its maximum possible value (which is $1$). Similar losses can be induced from other performance measures defined as gains.
3 A Novel Family of Listwise Surrogates
We define a novel family of loss functions, which we call the SLAM family: its members are Surrogate, Large margin, Listwise and Lipschitz losses, are Adaptable to Multiple ranking measures, and can handle Multiple graded relevance.
In RankSVM [15], a loss is incurred on a pair of documents in a list if a relevant document does not outscore an irrelevant document by a margin. We use this idea to develop the SLAM family. In our definition of the loss function, we will use the score vector $s \in \mathbb{R}^m$, corresponding to a list of documents, and the relevance vector $R$. If the score vector is induced by a linear scoring function, parameterized by $w$, as defined in Sec. 2, we write $s^w$ instead of $s$. The family of convex loss functions is defined as follows:
$$\phi^{v}_{SLAM}(s, R) = \min_{\delta \in \mathbb{R}^m} \sum_{i=1}^{m} v_i \delta_i \qquad (3)$$
$$\text{s.t.} \quad \delta_i \ge 0 \ \ \forall i, \qquad s_i + \delta_i \ge \Delta + s_j \ \text{ if } R_i > R_j, \ \ \forall i, j.$$
Here, $\Delta$ is a margin-scaling constant and $v$ is an element-wise non-negative weight vector yielding different members of the family. Though $\Delta$ can be varied for empirical purposes, we fix $\Delta = 1$ for subsequent analysis.
In a batch setting, the estimation of the parameter vector is done via minimization of regularized empirical loss:
(4) |
where are iid samples drawn from an unknown joint distribution on . We point out again that .
Lemma 1.
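For any fixed relevance vector $R$, the function $\phi^{v}_{SLAM}(\cdot, R)$ is convex.
We prove this lemma in the appendix directly from the definition of convexity. However, the reformulation below makes it easy to see that $\phi^{v}_{SLAM}(\cdot, R)$ is convex, since it is a non-negative weighted sum of maxima of affine functions of $s$:
$$\phi^{v}_{SLAM}(s, R) = \sum_{i=1}^{m} v_i \max\Big(0, \ \max_{j = 1, \ldots, m} \big\{ \mathbb{1}(R_i > R_j)\,(1 + s_j - s_i) \big\}\Big) \qquad (5)$$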
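The max form of the loss is easy to evaluate directly. The following is a minimal sketch, assuming the reformulation reads $\sum_i v_i \max(0, \max_j \mathbb{1}(R_i > R_j)(\Delta + s_j - s_i))$ with margin $\Delta = 1$; the function name `slam_loss` and the example weight vectors are hypothetical and are not the weight vectors defined later in the paper.

```python
import numpy as np

def slam_loss(scores, relevance, weights, margin=1.0):
    """Weighted sum, over documents i, of the worst margin violation of i
    against any strictly less relevant document j."""
    s = np.asarray(scores, dtype=float)
    R = np.asarray(relevance)
    v = np.asarray(weights, dtype=float)
    # less_relevant[i, j] is True when document j is strictly less relevant than i.
    less_relevant = R[:, None] > R[None, :]
    # violation[i, j] = margin + s_j - s_i on those pairs, 0 elsewhere.
    violation = np.where(less_relevant, margin + s[None, :] - s[:, None], 0.0)
    delta = np.maximum(0.0, violation.max(axis=1))   # one slack value per document
    return float(np.dot(v, delta))
```

For example, with relevance `[1, 0, 0]` and weights `[1, 0, 0]`, the scores `[2, 0, 1]` incur no loss because the relevant document beats both others by at least the margin, while the scores `[0, 1, 0.5]` incur loss 2, the largest violation for the relevant document.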
3.1 Properties of the SLAM Family
The similarity with the RankSVM surrogate is understood by observing the constraints in Eq. 3. Similar to RankSVM, a loss is induced if a more relevant document fails to outscore a less relevant document by a margin. However, one of the main modifications is that there is a single slack variable corresponding to each document. Thus, unlike RankSVM, the loss is not added for each pair of documents. Rather, the maximum loss corresponding to each document and all documents less relevant than it is measured (as seen in Eq. 5). Moreover, each slack variable is weighted by the corresponding entry of the weight vector before they are added. The weight vector imparts a listwise nature to our surrogate.
As noted in Sec. 1, all popular ranking measures are listwise in nature, where correct ranking at the top of the list is much more critical than near the bottom. The critical property that a surrogate must possess to be considered listwise is this: the loss must be calculated viewing the entire list of documents as a whole, with errors at the top penalized much more than errors at the bottom. Since a perfect ranking places the most relevant documents at the top, errors corresponding to the most relevant documents should be penalized more by the SLAM surrogates, in order for the family to be considered listwise. The weight vector we design does exactly that: the weight of the most relevant document in the list is the maximum entry of the weight vector. Thus, even though our loss definition uses intuitive pairwise comparisons between documents, it is truly a listwise loss. We define two weight vectors, one for MAP and one for NDCG, in Sec. 4.
We want to re-emphasize the structural difference between the SLAM family and listwise surrogates obtained via the structured prediction framework. There are multiple listwise surrogates in the learning to rank literature. The popular large margin listwise surrogates are direct extensions of the structured prediction framework developed for classification [21]. As we pointed out in Sec. 1, structured prediction models for ranking assume that the supervision space is the space of full rankings of a document list. Usually a large number of full rankings are compatible with a relevance vector, in which case the relevance vector is arbitrarily mapped to a full ranking. In fact, here is a quote from one of the relevant papers [9], “It is often the case that this is not unique and we simply take of one of them at random” (“this” refers to a correct full ranking pertaining to the query). Though such surrogates can yield competitive empirical results, they are theoretically less suitable in a learning to rank setting where supervision is given as relevance vectors but the ranking function returns full rankings.
4 Weight Vectors Parameterizing the SLAM Family
As we stated in Sec 3, different weight vectors lead to different members of the family. The weight vectors play a crucial role in the subsequent theoretical analysis.
First, the weight vectors need to be such that the surrogate family is truly listwise. For this, as explained in Sec 3.1, maximum weights need to be assigned to the most relevant documents. Second, the weight vectors need to be such that different members of the family are upper bounds on (losses induced by) different ranking performance measures. The upper bound property will be crucial in deriving guarantees for a perceptron-like algorithm in learning to rank. Moreover, it makes sense to formally relate the loss being minimized to the performance measure being maximized. Recall that surrogates like hinge loss and logistic loss are upper bounds on the 0-1 loss in classification. However, the weight vectors also need to be as small as possible, because the magnitude of the generalization bound for members of the family ends up being directly proportional to the sum of the components of the corresponding weight vectors (see Sec. 6).
Thus, we will require weight vectors to be as small as possible, as long as the corresponding members of the family still upper bound the losses induced by different ranking performance measures. Upper bounds on ranking performance measures have also been investigated by [10]. However, our analysis technique is completely different, and yields different results.
We will provide two weight vectors that result in upper bounds on the MAP and NDCG induced losses, respectively. Since weight vectors are defined with the knowledge of relevance vectors, we can assume w.l.o.g. that documents are sorted according to their relevance levels. Thus, $R_1 \ge R_2 \ge \ldots \ge R_m$, where $R_i$ is the relevance of document $i$.
Upper bounding MAP loss: It is to be noted that is defined for binary relevance vectors. Let be a binary relevance vector, where is the number of relevant documents (thus, and ). We define vector as
(6) |
We have the following theorem on upper bound.
Theorem 2.
Let be the weight vector as defined in Eq. 6. Let be the MAP value determined by relevance vector and permutation induced by sorting of score vector . Then the following holds,
(7) |
We say a vector dominates a vector () if and for at least one .
For a given binary relevance vector , let . Then, , the following relation holds,
(8) |
We remind the reader that the weight vector is itself a function of the relevance vector.
Thus, the weight vector of Eq. 6 is optimal in the sense that every other upper-bounding weight vector dominates it. This implies that it leads to the tightest possible upper bound on the MAP induced loss when a member of the SLAM family is the surrogate used. The proof of Eq. 8 follows as a direct consequence of the way the weight vector is derived.
Upper bounding NDCG loss: For a given relevance vector , we define vector as
(9) |
The definition of functions are as given in Section 2. We have the following inequality.
Theorem 3.
Let be the weight vector as defined in Eq. 9. Let be the NDCG value determined by relevance vector and permutation induced by sorting of score vector . Then the following inequality holds,
(10) |
We note that this choice of weight vector is not optimal. However, the upper bound property still holds and it satisfies the condition required for the corresponding surrogate to have an $m$-independent generalization bound (as detailed in Sec. 6).
It can also be easily calculated that and . This fact will be crucial in the generalization bound analysis.
5 Perceptron-like Algorithm for Learning to Rank
We present a perceptron-like algorithm for learning a ranking function in an online setting, using the family. We also provide theoretical bounds on accumulated losses induced by two major ranking performance measures: NDCG and MAP. Though perceptron has been extended to a different ranking setting [12], to the best of our knowledge, cumulative loss guarantees for a perceptron-like algorithm (evaluated using popular performance measures such as NDCG and MAP) have not been provided before. The online gradient descent algorithm used in this section has been used by numerous authors (see the seminal paper of [24] and the survey article of [19]).
Since our proposed perceptron like algorithm works for both NDCG and MAP induced losses, we denote a performance measure induced loss as RankingMeasureLoss (RML). Thus, RML can be NDCG induced loss or MAP induced loss.
To make subsequent calculations easy to understand, we re-write the family from Eq.5. Also, we write for to emphasize that we are using linear ranking functions.
Denoting , we have
(11) |
where
It is easy to see Eq.11 and Eq.5 are the same. We remind the reader that for our choice of weight vectors and as defined in Eq.9 and Eq.6 respectively, we have, , the following inequalities,
(12) | ||||
It should also be noted that and are functions of .
In the online learning setting, at round , the input received is and ground truth received is . We define the following function
(13) |
Here, is the function parameter learnt at time point , and or depending on whether is NDCG induced loss or MAP induced loss respectively. Since weight vector depends on relevance vector , depends on .
It is clear from Eq.12 and Eq.13 that . It should also be noted that is convex in both cases, i.e., when and . Due to the convexity of the sequence of functions , we can run the online gradient descent (OGD) algorithm to learn the sequence of parameters , starting with . The OGD update rule, , for some and step size , requires a subgradient that, in our case, is:
When .
When
(14) |
where
Here, is the standard basis vector along coordinate and is as defined in Eq.11 (with ).
Note that means that there is at least one document with relevance less than at least another document but with greater score. That is, there is at least one pair of documents, indexed by , with but .
Since predicted ranking at round is obtained by sorting the score vector , we have, from the update rule, the following prediction at round
where is the set of rounds on which . Since sorted order of a vector is invariant under scaling by a positive constant, and do not depend on as long as . Thus, we can take in our algorithm. We now obtain a perceptron-like algorithm for the learning to rank problem.
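Algorithm 1: Perceptron-like algorithm for learning to rank
Initialize $w_1 = 0$
For $t = 1$ to $T$
    Receive $X_t$
    Set $s_t = X_t w_t$ and predict the ranking obtained by sorting $s_t$
    Receive $R_t$
    If $\mathrm{RML}(s_t, R_t) \neq 0$ [footnote 1]
        $w_{t+1} = w_t - z_t$    // see def. of $z_t$ in Eq.(14)
    else
        $w_{t+1} = w_t$
End For
Footnote 1: The first argument in RML is actually the sorted order of $s_t$, as detailed in Sec.2. Thus, the condition of the algorithm depends on the permutation induced by $s_t$.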
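The loop above can be sketched in NumPy as follows. Several pieces are assumptions made only for this illustration: the subgradient is derived here from the hinge-style max form of the surrogate (for each document, the most violating less relevant document contributes a weighted feature difference) and is not claimed to match Eq. 14 term by term; the test `has_ranking_error` stands in for the condition "RML is non-zero" and counts score ties as errors so that the all-zero initial parameter triggers an update; and `uniform_on_relevant` is a placeholder weight vector, not one of the weight vectors defined in Sec. 4.

```python
import numpy as np

def has_ranking_error(s, R):
    """True if some less relevant document scores at least as high as a more
    relevant one (ties counted as errors; an assumption of this sketch)."""
    more_relevant = R[:, None] > R[None, :]
    not_outscored = s[:, None] <= s[None, :]
    return bool(np.any(more_relevant & not_outscored))

def slam_subgradient(w, X, R, v):
    """One subgradient in w of the hinge-style surrogate (margin 1) at scores X @ w."""
    s = X @ w
    z = np.zeros_like(w, dtype=float)
    for i in range(len(R)):
        less = np.flatnonzero(R < R[i])        # documents less relevant than i
        if less.size == 0 or v[i] == 0:
            continue
        j = less[np.argmax(s[less])]           # most violating competitor of i
        if 1.0 + s[j] - s[i] > 0:              # margin violated: term is active
            z += v[i] * (X[j] - X[i])
    return z

def uniform_on_relevant(R):
    """Hypothetical placeholder weights: equal mass on documents with R > 0."""
    mask = (R > 0).astype(float)
    return mask / max(1.0, mask.sum())

def perceptron_rank(stream, d, weight_fn=uniform_on_relevant):
    """Run the mistake-driven update with unit step size over (X, R) pairs."""
    w = np.zeros(d)
    updates = 0
    for X, R in stream:
        X, R = np.asarray(X, dtype=float), np.asarray(R)
        s = X @ w                              # predicted ranking = sort of s
        if has_ranking_error(s, R):
            w = w - slam_subgradient(w, X, R, weight_fn(R))
            updates += 1
    return w, updates
```

As in the classic perceptron, the parameter changes only on rounds where the predicted ranking is wrong, and each update is a weighted sum of feature differences between more relevant and less relevant documents.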
5.1 Theoretical Bound on Cumulative Loss
We provide a theoretical bound on the cumulative loss (as measured by RML) of the perceptron for the learning to rank problem. This result is similar to the theoretical bound on the accumulated 0-1 loss of the classic perceptron in the binary classification problem. The technique is based on regret analysis of online convex optimization algorithms. In this analysis, $\|\cdot\|$ is used to represent the Euclidean norm (or $\ell_2$ norm), unless otherwise stated. We begin by stating a standard bound from the literature [24, 19].
Proposition (OGD regret).
Let be parameterized by any . Then the following regret bound holds for OGD, after rounds,
(15) |
where is the learning parameter and .
We first control the norm of the subgradient .
Proposition 4.
Let be the bound on the maximum norm of the feature vectors, as defined in Sec. 2. Let with . Then the following norm bound on the subgradient holds,
(16) |
Choosing the step size appropriately, we have our main theorem for the proposed perceptron algorithm. Note that since the predictions are independent of the step size, the same bound holds for Algorithm 1 even though it uses a fixed step size.
Theorem 5.
Suppose the perceptron algorithm receives a sequence of instances . Let be the bound on the maximum norm of feature vectors. Then for defined in Sec.5, defined in Eq.13, , and being the bound on number of documents per query, the following bound holds.
(17) |
In particular, if there exists an s.t. , we have,
(18) |
The perceptron RML bound in Eq.17 is meaningful only if is a meaningful, finite quantity. It can be seen from the definition of in Eq.6 that . Thus, when is MAP induced loss, the perceptron bound is meaningful and is (hiding the dependence). For , depends on maximum relevance level. Assuming maximum relevance level is finite (in practice, maximum relevance level is usually between and ), . Thus, when is NDCG induced loss, the perceptron bound is meaningful and is .
As with the perceptron for binary classification, the bound in Eq. 18 leads to an interesting conclusion. Let us assume that there is a linear scoring function parameterized by a unit vector , such that all documents for all queries are ranked not only correctly, but correctly with a margin :
Corollary 6.
If the margin condition above holds, then the accumulated loss, for both NDCG and MAP induced losses, is upper bounded by , a constant independent of the number of training instances.
We point out that the bound on the cumulative loss in Eq. 18 is dependent on . It is often the case that though a list has documents, the focus is on the top documents in the order sorted by score. We define a modified set of weights s.t. holds . We provide the definition of and in the appendix. We note that .
Overloading notation with , let with and .
Corollary 7.
In the setting of Theorem 5 and being the cut-off point for NDCG, the following bound holds
(19) |
Assuming maximum relevance level is finite, . Thus, the variance term in the perceptron bound is , a significant improvement from original variance term.
6 Generalization Error Bound
In the batch setting, the ranking function parameter is learnt by solving Eq.4. We analyze how “good” the learnt parameter is w.r.t. the parameter minimizing the expected surrogate loss. We formalize this notion by establishing a generalization error bound.
Our main theorem on generalization error is applicable to any convex ranking surrogate with linear ranking function. But first, we take a closer look at the concept of a “linear ranking function” that is prevalent in the learning to rank literature, and show that it is actually a low dimensional parameterization of the full space of linear ranking functions.
As stated in Sec.2, ranking is obtained by sorting a score vector obtained via a linear scoring function . Specifically, is a dimensional vector which maps the matrix to a dimensional score vector . The space of linear scoring function consists of linear maps . The linear function space can be fully parameterized by matrices , where . The representation will be of the form
where . Thus, a full parameterization of the linear scoring function is of dimension .
The popularly used form of linear scoring function, viz. $X \mapsto Xw$ with $w \in \mathbb{R}^d$, is actually a low, $d$-dimensional, subspace of the full $m^2 d$-dimensional space of linear maps. It corresponds to choosing each matrix such that the $i$th row of the $i$th matrix is the vector $w$ and the rest of the rows are zero vectors. Thus, it is a $d$-dimensional parameterization. Most importantly, the dimension is independent of $m$.
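The embedding of the shared-vector scoring function into the full parameterization can be checked numerically. The sketch below assumes the full parameterization computes the $i$th score as the trace inner product between the input matrix and an $m \times d$ matrix $W_i$ (any linear map from $m \times d$ matrices to $m$-vectors can be written this way); the helper names `full_linear_scores` and `embed_shared_vector` are hypothetical.

```python
import numpy as np

def full_linear_scores(X, Ws):
    """Fully parameterized linear map: i-th score is <X, W_i> = sum(X * W_i)."""
    return np.array([np.sum(X * W) for W in Ws])

def embed_shared_vector(w, m):
    """Build W_1, ..., W_m where W_i has w in row i and zeros in every other row."""
    d = len(w)
    Ws = []
    for i in range(m):
        W = np.zeros((m, d))
        W[i] = w
        Ws.append(W)
    return Ws

rng = np.random.default_rng(0)
m, d = 4, 3
X = rng.normal(size=(m, d))
w = rng.normal(size=d)

# The m * m * d free entries of (W_1, ..., W_m) collapse to the d entries of w.
same = np.allclose(full_linear_scores(X, embed_shared_vector(w, m)), X @ w)
```

Here `same` evaluates to `True`: the restricted family reproduces the score vector `X @ w` while using only `d` free parameters, regardless of `m`.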
In learning theory, one of the factors influencing the generalization error bound is the richness of the class of hypothesis functions. Since the parameterization of the linear ranking function is of dimension independent of , intuition would suggest that, under some conditions, ranking surrogates with linear ranking function should have an independent complexity term in the generalization bound.
Before we state our main theorem on generalization error bound, we need some notations. For input matrix , relevance vector , weight vector and any convex (in first argument) surrogate loss function , we denote
(20) |
where . The expectation is taken over the underlying joint distribution on . We also define
(21) |
and
(22) |
where are iid samples from the underlying joint distribution on .
We now have our main theorem on generalization error bound.
Theorem 8.
Lipschitz continuity of w.r.t in norm means that there is a constant such that , for all . By duality, it follows that . Now, by chain rule, we have . It turns out that if each row of is bounded in norm by and then and the bound in Theorem 8 becomes . This immediately gives the following corollary, which provides a sufficient condition for an independent generalization bound to hold.
Corollary 9.
A sufficient condition for the ranking surrogate to have an $m$-independent generalization bound is for it to have an $m$-independent Lipschitz bound, w.r.t. the score vector, in norm. That is, there is a constant , independent of $m$, such that
We point out that the generalization bound in Theorem 8 depends on the Lipschitz constant of w.r.t . However, the condition in Corollary 9 depends on the Lipschitz constant of w.r.t (the tilde in serves a reminder that Lipschitz continuity is meant w.r.t. , not ).
The only comparable result in the existing literature is the generalization bound given by [6] for ranking surrogates with linear ranking function. Their generalization bound is , where is the Lipschitz constant of the surrogate w.r.t in -norm. The generalization bound, however, is inherently dependent on and ours is always better since . A comparison of the proof techniques reveals that [6] proceed via Gaussian complexity and use Slepian’s lemma that forces them to use Lipschitz constant and introduces the dependence. We use stochastic convex optimization results of [20] thereby avoiding the explicit dependence. However, the price we pay is that our result only holds for convex surrogates whereas that of [6] holds for any Lipschitz surrogate.
We also note that [17] obtained generalization bounds for certain listwise surrogates. However, their analysis technique went via Rademacher complexity theory and is limited to specific listwise surrogates, while ours is a general result, applicable to all convex surrogates using linear ranking function.
We now show that family satisfies the sufficient condition.
Let . The gradient of the surrogate w.r.t. the score vector is as follows:
(24) |
where
and is a standard basis vector along coordinate .
Since , we have . Further, and are both bounded by . Hence the family members corresponding to both NDCG and MAP have -independent generalization bound.
We now go back and analyze why the linear scoring function , with is the only correct choice in the learning to rank setting. Though we mentioned that the full parameterization of the linear scoring function is of dimension , we will formally prove that the correct full parameterization, under a natural permutation invariance condition, is of dimension .
An important property in ranking is permutation invariance. This means that score assigned to documents should be independent of the order in which documents are listed. Formally, a linear scoring function can be used for ranking if it satisfies the permutation invariance property:
We now show that the vector space of linear functions that satisfy the permutation invariance property has dimension no more than . Because functions of the form are obviously permutation invariant and constitute a space of dimension , we then easily get that the dimension has to be exactly .
Theorem 10.
The space of linear, permutation invariant functions from to has dimension at most .
Proof: Using the full parameterization model, the permutation invariance property translates into: , where is permutation matrix of order .
Let , where denotes the index of the element in the th position of the permutation induced by . Then, , . Using , we get . Since will preserve the first column and create any permutation of the other columns, this indicates that all columns of are the same, except maybe the first column. We can repeat this argument for .
Let . Then, , . will put the second column of in first position and create any other permutation of the other columns. Hence, the first column of will match the second column of , and the second column of will match both the first and third columns of . Hence, all columns of matrix and are the same and the matrices themselves are the same. The argument can be repeated to show and is a rank 1 matrix.
Hence the linear function space has maximum dimension of .
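Permutation invariance can also be probed numerically. The sketch below reads the property as "permuting the rows of the input permutes the scores in the same way", which is an assumption about the formal statement; the helper `is_permutation_invariant` and the randomly drawn matrices are illustrative only. A random check of this kind can refute invariance but cannot prove it.

```python
import numpy as np

def is_permutation_invariant(f, m, d, trials=20, seed=0):
    """Randomized check that f(P X) equals P f(X) for row permutations P."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        X = rng.normal(size=(m, d))
        perm = rng.permutation(m)
        # X[perm] reorders the documents; the scores should reorder identically.
        if not np.allclose(f(X[perm]), f(X)[perm]):
            return False
    return True

m, d = 4, 3
rng = np.random.default_rng(1)
w = rng.normal(size=d)
Ws = [rng.normal(size=(m, d)) for _ in range(m)]   # generic full parameterization

shared = lambda X: X @ w                                       # d parameters
generic = lambda X: np.array([np.sum(X * W) for W in Ws])      # m*m*d parameters

shared_ok = is_permutation_invariant(shared, m, d)
generic_ok = is_permutation_invariant(generic, m, d)
```

With these draws, `shared_ok` is `True` and `generic_ok` is `False`, in line with the claim that a generic member of the full linear space does not respect document order while the shared-vector form does.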
6.1 Application to Existing Surrogates
In this subsection, we check whether a few popular convex ranking surrogates, which learn linear ranking function, satisfy the sufficient condition established above. We select only a few from the plethora of surrogates existing in learning to rank literature, representing both pairwise and listwise surrogates. All relevant calculations are shown in the appendix.
RankSVM minimizes a pairwise large margin surrogate and is designed for binary relevance vectors. The norm of the gradient, w.r.t. the score vector, grows with the number of documents per query and hence fails to satisfy the sufficient condition for an $m$-independent generalization bound.
ListNet optimizes a listwise cross-entropy loss (as the surrogate) in conjunction with a linear ranking function. Our calculations show that the surrogate is Lipschitz, w.r.t. the score vector, in the norm required by the sufficient condition, with a Lipschitz constant that is independent of $m$ and is bounded by an absolute constant. However, we point out that since the surrogate is not large-margin in nature, its use in online learning will not result in a perceptron-like algorithm. The gradient of the surrogate varies with the point where the gradient is calculated, which makes the online predictions sensitive to the choice of the learning rate.
We also analyze the large margin listwise surrogates suggested by [9] and [23], which are realizations of the structured prediction framework. To make the surrogates theoretically suitable for the learning to rank problem, we assume relevance levels within each relevance vector to be distinct. This gives a one-to-one mapping from the space of relevance scores to the space of full rankings, without any arbitrariness.
[9] minimize a listwise large margin surrogate that can handle multi-graded relevance vectors. The norm of the gradient, w.r.t. the score vector, grows with the number of documents per query and hence fails to satisfy the sufficient condition for an $m$-independent generalization bound. If the dependence is removed by simple normalization, the loss does not remain an upper bound on the NDCG induced loss.
[23] minimize a listwise large margin surrogate that is designed for binary relevance vectors. The norm of the gradient, w.r.t. the score vector, is constant and hence satisfies the sufficient condition for an $m$-independent generalization bound.
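The contrast between a pairwise sum and a per-document maximum can be illustrated with a small experiment. The sketch assumes an unnormalized pairwise hinge surrogate (one term per relevant/irrelevant pair) as a stand-in for the RankSVM loss, the hinge-style max form for the listwise family, and a hypothetical weight vector that puts mass $1/r$ on each of the $r$ relevant documents; none of these is claimed to match the exact expressions in the appendix.

```python
import numpy as np

def pairwise_hinge_grad(s, R):
    """Gradient in s of the sum over pairs (R_i > R_j) of max(0, 1 + s_j - s_i)."""
    g = np.zeros_like(s)
    for i in range(len(s)):
        for j in range(len(s)):
            if R[i] > R[j] and 1.0 + s[j] - s[i] > 0:
                g[j] += 1.0
                g[i] -= 1.0
    return g

def listwise_max_grad(s, R, v):
    """Subgradient in s of sum_i v_i * max(0, max_{j: R_i > R_j} (1 + s_j - s_i))."""
    g = np.zeros_like(s)
    for i in range(len(s)):
        less = np.flatnonzero(R < R[i])
        if less.size == 0:
            continue
        j = less[np.argmax(s[less])]       # only the worst competitor contributes
        if 1.0 + s[j] - s[i] > 0:
            g[j] += v[i]
            g[i] -= v[i]
    return g

def l1_norms(m, n_relevant=2):
    """L1 norms of both gradients at s = 0 for a list of m documents."""
    R = np.zeros(m, dtype=int)
    R[:n_relevant] = 1
    s = np.zeros(m)
    v = R / n_relevant                     # hypothetical weights summing to 1
    pairwise = np.abs(pairwise_hinge_grad(s, R)).sum()
    listwise = np.abs(listwise_max_grad(s, R, v)).sum()
    return float(pairwise), float(listwise)
```

At the all-zero score vector with two relevant documents, the pairwise gradient has $\ell_1$ norm $2r(m - r)$, that is 32 for $m = 10$ and 392 for $m = 100$, whereas the listwise gradient has $\ell_1$ norm 2 for every $m$, at most twice the sum of the weights.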
7 Conclusion
We provided the first perceptron-like algorithm for learning to rank that enjoys guaranteed loss bounds under losses induced by ranking performance measures such as NDCG and MAP. The loss bounds become independent of the number of training examples under a suitable margin condition. We also provided generalization bounds for general convex surrogate loss functions with linear ranking functions. Our analysis implied a sufficient condition for having a generalization bound that does not scale with $m$, the number of documents per query. A key role in both the online bounds and generalization bounds is played by a novel family of listwise surrogates that we introduced in this paper by modifying a well-known pairwise surrogate.
Several interesting questions for further exploration are suggested by our results. First, is it possible to derive a perceptron-like algorithm whose cumulative loss bound (under NDCG or MAP induced losses) does not scale with the number of documents per query? Second, is it possible to extend our main generalization bound to all Lipschitz surrogates and not just convex ones? Third, do the online and batch algorithms implied by our novel loss family enjoy good practical performance, possibly with the use of kernels to tackle non-linearities? Our preliminary experiments suggest that this is the case, but a full empirical comparison with the existing state of the art is outside the scope of this paper and will be pursued in subsequent work.
Acknowledgments
We gratefully acknowledge the support of NSF under grant IIS-1319810. Thanks to Prateek Jain for pointing out the simple argument required to prove Theorem 10.
References
- [1] R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM Press, New York, 1999.
- [2] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- [3] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of ICML, pages 89–96, 2005.
- [4] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of ICML, pages 129–136, 2007.
- [5] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured learning for non-smooth ranking losses. In Proceedings of KDD, pages 88–96, 2008.
- [6] O. Chapelle and M. Wu. Gradient descent optimization of smoothed information retrieval metrics. Information retrieval, 13(3):216–235, 2010.
- [7] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. Journal of Machine Learning Research-Proceedings Track, pages 1–24, 2011.
- [8] Olivier Chapelle, Yi Chang, and Tie-Yan Liu. Future directions in learning to rank. In JMLR Workshop and Conference Proceedings, pages 91–100, 2011.
- [9] Olivier Chapelle, Quoc Le, and Alex Smola. Large margin optimization of ranking measures. In NIPS Workshop: Machine Learning for Web Search, 2007.
- [10] W. Chen, T.Y. Liu, Y. Lan, Z.M. Ma, and H. Li. Ranking measures and loss functions in learning to rank. Advances in NIPS, pages 315–323, 2009.
- [11] David Cossock and Tong Zhang. Subset ranking using regression. In Proceedings of COLT, pages 605–619, 2006.
- [12] Koby Crammer, Yoram Singer, et al. Pranking with ranking. In NIPS, pages 641–647, 2001.
- [13] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933–969, December 2003.
- [14] Yoav Freund and Robert E Schapire. Large margin classification using the perceptron algorithm. Machine learning, pages 277–296, 1999.
- [15] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in NIPS, pages 115–132, 1999.
- [16] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), pages 422–446, 2002.
- [17] Yanyan Lan, Tie-Yan Liu, Zhiming Ma, and Hang Li. Generalization analysis of listwise learning-to-rank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 577–584, 2009.
- [18] T.Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of SIGIR workshop, pages 3–10, 2007.
- [19] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, pages 107–194, 2011.
- [20] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009.
- [21] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the twenty-first international conference on Machine learning, page 104, 2004.
- [22] Jun Xu and Hang Li. AdaRank: a boosting algorithm for information retrieval. In Proceedings of SIGIR, pages 391–398, 2007.
- [23] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In Proceedings of ACM SIGIR, pages 271–278, 2007.
- [24] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
In the appendix, we provide proofs of the theorems stated in the main sections. Unless otherwise stated, and are used interchangeably, with understood from the context.
.1 Proof of Lemma 1
Proof.
Let satisfies the constraints of Eq. (3)}. Then defines a polyhedron and hence is a convex set.
Let , for some non-negative vector . Thus, is a convex function.
Let us formulate a function h as follows:
(25) |
We will first show that is a jointly convex function.
For joint convexity, we have to show that: + , for .
Let , . (If either of the vectors is not in , then the right side of convexity equation is and the inequality is trivially true). Then , since is a convex set. Hence, = = (due to convexity of g).
As is jointly convex, and is the minimum of over in a convex set C(R), is convex in [2]. ∎
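The structure of this argument can be written out in generic form. The following is a sketch under assumed notation, not a reproduction of the paper's equations: $s$ stands for the score vector, $\delta$ for the auxiliary variable being minimized over, $C$ for the convex feasible set, $g$ for the convex objective, and $\phi$ for the resulting surrogate; $z_i = (s_i, \delta_i)$ and $\lambda \in [0,1]$. These symbol names are hypothetical and may differ from those used in Eq. (3) and Eq. (25).

```latex
% Assumed notation (hypothetical): s = scores, \delta = auxiliary variable,
% C = convex feasible set, g = convex objective, z_i = (s_i, \delta_i).
\[
h(s,\delta) =
\begin{cases}
g(\delta) & \text{if } (s,\delta) \in C,\\
+\infty   & \text{otherwise.}
\end{cases}
\]

% Joint convexity of h: for z_1, z_2 \in C the convex combination is again
% in C (C is convex); if either point lies outside C the right-hand side
% is +\infty and the inequality is trivial.
\begin{align*}
h\bigl(\lambda z_1 + (1-\lambda) z_2\bigr)
  &= g\bigl(\lambda \delta_1 + (1-\lambda)\delta_2\bigr) \\
  &\le \lambda\, g(\delta_1) + (1-\lambda)\, g(\delta_2) \\
  &= \lambda\, h(z_1) + (1-\lambda)\, h(z_2).
\end{align*}

% Partial minimization: let \phi(s) = \inf_{\delta} h(s,\delta), assumed finite
% at s_1 and s_2. For any \varepsilon > 0 choose \delta_i such that
% h(s_i,\delta_i) \le \phi(s_i) + \varepsilon. Then
\begin{align*}
\phi\bigl(\lambda s_1 + (1-\lambda) s_2\bigr)
  &\le h\bigl(\lambda z_1 + (1-\lambda) z_2\bigr) \\
  &\le \lambda\, h(z_1) + (1-\lambda)\, h(z_2) \\
  &\le \lambda\, \phi(s_1) + (1-\lambda)\, \phi(s_2) + \varepsilon .
\end{align*}
% Letting \varepsilon \to 0 shows that \phi is convex in s.
```

This is the standard fact that minimizing a jointly convex function over some of its variables leaves a convex function of the remaining ones, which is the result cited from [2].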
.2 Proof of Theorem 2
Proof.
As stated in Sec. 4, the documents pertaining to every query are sorted according to relevance labels. Let be an arbitrary relevance vector, corresponding to relevant documents and irrelevant documents in a list. MAP loss is incurred only if at least one irrelevant document is placed above at least one relevant document. With reference to in Eq. 5, for any and , we have , since documents are sorted according to relevance labels. Thus, w.l.o.g., we can take .
Let a score vector be such that an irrelevant document has the highest score among documents. Then, . The maximum possible MAP induced loss in the case where at least one irrelevant document has the highest score occurs when all irrelevant documents outscore all relevant documents. The MAP loss in that case is: . Since has to upper bound MAP and since can be infinitesimally greater than all of (thus, ), we need the following inequality for the upper bound property to hold:
.
Similarly, let a score vector be such that an irrelevant document has a higher score than all but the first relevant document. Then . The maximum possible MAP induced loss in the case where at least one irrelevant document has a higher score than all but the first relevant document occurs when all irrelevant documents are placed above all relevant documents except the first one. The MAP loss in that case is: . Following the same line of logic for upper bounding as before, we get
.
Likewise, if we keep repeating the logic, we get a sequence of inequalities, with the last inequality being
.
To get the smallest possible ’s, we take equality in all the inequalities and, by back calculation, we get .
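The worst-case MAP induced losses that drive this sequence of inequalities can be computed directly. The sketch below is illustrative only: the function names and the counts `r` (relevant documents) and `n_irrelevant` (irrelevant documents) are hypothetical, and the closed form is derived here from the standard definition of average precision for a single binary-relevance list, not copied from the paper. It builds the list in which the top `k` relevant documents sit above all irrelevant documents, which in turn sit above the remaining relevant documents; `k = 0` and `k = 1` correspond to the first two cases discussed above.

```python
def average_precision(ranked_labels):
    """Average precision of a ranked list of binary labels
    (1 = relevant, 0 = irrelevant), top position first."""
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def map_loss(ranked_labels):
    """MAP induced loss for a single list: 1 - average precision."""
    return 1.0 - average_precision(ranked_labels)


def worst_case_loss(r, n_irrelevant, k):
    """Loss when k relevant documents are ranked first, followed by all
    irrelevant documents, followed by the remaining r - k relevant ones."""
    ranking = [1] * k + [0] * n_irrelevant + [1] * (r - k)
    return map_loss(ranking)


def worst_case_loss_closed_form(r, n_irrelevant, k):
    """Same quantity in closed form: the first k relevant documents have
    precision 1; the j-th relevant document (j > k) sits at position
    n_irrelevant + j and has precision j / (n_irrelevant + j)."""
    tail = sum(j / (n_irrelevant + j) for j in range(k + 1, r + 1))
    return 1.0 - (k + tail) / r


# Example with 2 relevant and 3 irrelevant documents.
losses = [worst_case_loss(2, 3, k) for k in range(3)]
```

In this example the loss shrinks as more relevant documents are ranked above the irrelevant ones and reaches zero when `k = r`, which matches the decreasing sequence of worst cases considered in the proof.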
Proof of dominance: Let, for some , s.t.