Generalized Ensemble Model for Document Ranking in Information Retrieval

Yanshan Wang¹, In-Chan Choi², Hongfang Liu¹
¹Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
²School of Industrial Management Engineering, Korea University, Seoul 136-701, South Korea
Corresponding author: Y. Wang (email: Wang.Yanshan@mayo.edu).

Abstract

A generalized ensemble model (gEnM) for document ranking is proposed in this paper. The gEnM linearly combines basis document retrieval models and tries to retrieve relevant documents at high positions. In order to obtain the optimal linear combination of multiple document retrieval models or rankers, an optimization program is formulated by directly maximizing the mean average precision. Both supervised and unsupervised learning algorithms are presented to solve this program. For the supervised scheme, two approaches are considered based on the data setting, namely batch and online setting. In the batch setting, we propose a revised Newton’s algorithm, gEnM.BAT, by approximating the derivative and Hessian matrix. In the online setting, we advocate a stochastic gradient descent (SGD) based algorithm—gEnM.ON. As for the unsupervised scheme, an unsupervised ensemble model (UnsEnM) by iteratively co-learning from each constituent ranker is presented. Experimental study on benchmark data sets verifies the effectiveness of the proposed algorithms. Therefore, with appropriate algorithms, the gEnM is a viable option in diverse practical information retrieval applications.

ensemble model, mean average precision, document ranking, Information Retrieval, nonlinear optimization

I Introduction

Ranking is a core task for Information Retrieval (IR) in practical applications such as search engines and advertising recommendation systems. The aim of the ranking task is to retrieve the most relevant objects (documents, for example) with regard to a given query. With the continuous growth of information on the modern World Wide Web, this task has become more challenging than ever before. In the ranking task, the general problem is the over-inclusion of relevant documents that a user is willing to receive [1]. During the last decade, a large number of models have been proposed to solve this problem. In general, those models are evaluated by two IR performance measures, namely Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) [2]. Compared to the framework in which models are proposed and then tested by IR measures, approaches that directly optimize IR measures have been shown to be more effective [3, 4]. These approaches apply efficient algorithms to solve an optimization problem whose objective function is one of the IR measures.

Structured SVM is a widely used framework for optimizing bounds on IR measures. Examples include [5] and [6]. Many other methods, such as SoftRank [7, 8], first approximate the ranking measures through smooth functions and then optimize the surrogate objective functions. Yet, the drawbacks of those methods have been shown in two aspects: (i) the relationship between the surrogate objective functions and the ranking measures was not sufficiently studied; and (ii) the algorithms for solving the optimization problems are not trivial to employ in practice [3].

Recently, a general framework that directly optimizes IR measures has been reported [3]. This framework can effectively overcome those drawbacks. However, it only optimizes the IR measure of one ranker, and the information provided by other rankers is not fully utilized. In the classification area, an ensemble classifier that linearly combines multiple classifiers has been shown to perform better than any of the constituent classifiers. A number of sophisticated algorithms, such as AdaBoost [9], have been proposed for obtaining the ensemble classifier. Thus, the hypothesis that performance can be improved by combining multiple rankers may hold as well. As a matter of fact, AdaRank [10, 11] and LambdaMART are two well-known models in the IR area utilizing AdaBoost. AdaRank repeatedly constructs weak rankers (features) and finally combines them linearly into a strong ranker with proper weights assigned to the constituent rankers. However, the drawbacks of AdaRank are its inexplicit theoretical justification and the determination of the iteration number. While LambdaMART enjoys the theoretical advantage of directly optimizing IR measures by linearly combining any two rankers, it cannot be extended to multiple rankers straightforwardly.

In those previous studies, the direct optimization of NDCG is well studied but the direct optimization of MAP is rarely tackled, to the best of our knowledge. The main difficulty of directly optimizing MAP is that the objective function defined by MAP is nonsmooth, nondifferentiable and nonconvex. Ensemble Model (EnM) [12] solves this problem by using a boosting algorithm and a coordinate descent algorithm. However, the solutions cannot be theoretically guaranteed to be optimal, or even locally optimal.

In this paper, we propose a generalized ensemble model (gEnM) for document ranking. It is an ensemble ranker that linearly combines multiple rankers. By appropriately adjusting the weights of those constituent rankers, one may improve the overall performance of document ranking. To compute the weights, we formulate a constrained nonlinear program which directly optimizes the MAP. The difficulty of solving this nonlinear program lies in the nondifferentiable and noncontinuous objective function. To overcome this difficulty, we first introduce a differentiable surrogate to approximate the objective function, and then formulate an approximated unconstrained nonlinear program. Both supervised and unsupervised algorithms are employed for solving the nonlinear program. In the supervised scheme, batch and online data settings are considered. These schemes and settings are designed for different IR environments. For the batch setting, the algorithm gEnM.BAT is a revised Newton’s method that approximates the derivative and Hessian matrix. For the online setting, an online algorithm, gEnM.ON, is proposed based on the stochastic gradient descent algorithm. The gEnM.ON is the first online algorithm for obtaining an ensemble ranker, to the best of our knowledge. In the unsupervised scheme, an unsupervised gEnM (UnsEnM) inspired by iRANK [13] is proposed. The UnsEnM utilizes the collaborative information among constituent rankers.

The advantage of UnsEnM over iRANK is that it is applicable to any number of constituent rankers. Compared to the EnM, the generalized version gEnM differs in three aspects: (i) the assumption for EnM is relaxed for gEnM; (ii) the batch algorithm proposed for gEnM performs better; and (iii) both an online algorithm and an unsupervised algorithm are proposed for gEnM, whereas only a batch algorithm is available for EnM.

The remainder of this paper is organized as follows. In the next section, the problem of direct optimization of MAP is described and formulated. The approximation to this problem is also provided, along with the theoretical proofs. The algorithms, including gEnM.BAT, gEnM.ON and UnsEnM, are presented in Section 5. The computational results of the proposed algorithms tested on public data sets are demonstrated in Section 6. The last section concludes this paper with discussions.

II Generalized Ensemble Model

II-A Problem Description

Consider the task of constructing a linear combination of rankers that results in better performance than each constituent. We call this linear combination the ensemble ranker or ensemble model hereinafter. Given a search query in this task, a sequence of documents is retrieved by the constituent rankers according to their relevance to the query. The relevance is measured by the ranking scores calculated by each ranker. For explicit description, let $\phi_k(d)$ denote the ranking score (relevance score) of document $d$ calculated by the $k$th ranker, and let $K$ be the number of constituent rankers. With appropriate weights over those constituent rankers, the ranking score of the ensemble ranker is defined as the weighted sum of the constituent ranking scores, i.e.,

$$H(d) = \sum_{k=1}^{K} \alpha_k \phi_k(d),$$

where the weights satisfy $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \ge 0$. The documents ranked by the ensemble ranker are thus ordered according to the ensemble ranker scores. Our goal is to uncover an optimal weight vector

$$\alpha = (\alpha_1, \ldots, \alpha_K)$$

with which more relevant documents can be ranked at high positions.

A toy example shown in Table I describes this problem. According to the ranking scores, the ranking lists returned by Ranker 1 and 2 are {2,1,3} and {3,1,2}, respectively, and the corresponding MAPs are both 0.72. In order to make full use of the ranking information provided by both rankers, a conventional heuristic is to sum up the ranking scores (i.e., use uniform weights of 0.5 and 0.5), which generates Ensemble 1 with MAP equal to 0.72. Obviously, this procedure is not optimal since alternative weights can yield a higher precision. For example, Ensemble 2 uses weights 0.7 and 0.3 and results in a higher MAP of 0.89, as listed in the table.

| | Ranker 1 | Ranker 2 | Ensemble 1 | Ensemble 2 |
|---|---|---|---|---|
| Document 1 | 0.35 | 0.2 | 0.55 | 0.305 |
| Document 2 | 0.4 | 0.1 | 0.5 | 0.31 |
| Document 3 | 0.25 | 0.7 | 0.95 | 0.385 |
| MAP | 0.72 | 0.72 | 0.72 | 0.89 |

TABLE I: A toy example. The values in the middle three rows represent the ranking scores for an identical query. The rankers are measured by MAP, as listed in the fifth row. The ranking scores of Ensemble 1 and 2 are defined by 0.5*Ranker 1+0.5*Ranker 2 and 0.7*Ranker 1+0.3*Ranker 2, respectively; the Ensemble 1 column lists the unnormalized sum Ranker 1+Ranker 2, which yields the same ranking. The relevant document list is assumed to be {2,3}.
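
The toy example can be checked with a few lines of Python. The sketch below recomputes the rankings for the four columns of Table I and evaluates each one in two ways; all function names are illustrative and do not come from the paper. Averaging precision@k over all three cutoffs reproduces the tabulated 0.72 and 0.89, whereas averaging only at the ranks of the relevant documents (the more common AP convention) gives 0.83 and 1.0. Under either convention, Ensemble 2 is the only column that improves on the constituent rankers.

```python
# Scores from Table I; document ids are 1-based as in the table.
RANKER_1 = {1: 0.35, 2: 0.40, 3: 0.25}
RANKER_2 = {1: 0.20, 2: 0.10, 3: 0.70}
RELEVANT = {2, 3}


def ensemble(weights, rankers):
    """Weighted sum of the constituent scores for every document."""
    docs = rankers[0].keys()
    return {d: sum(w * r[d] for w, r in zip(weights, rankers)) for d in docs}


def ranking(scores):
    """Document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


def precision_at(order, relevant, k):
    """Fraction of relevant documents among the top k."""
    return sum(1 for d in order[:k] if d in relevant) / k


def mean_precision_all_cutoffs(order, relevant):
    """Average of precision@k over every cutoff k = 1..n."""
    n = len(order)
    return sum(precision_at(order, relevant, k) for k in range(1, n + 1)) / n


def average_precision(order, relevant):
    """Average of precision@k taken only at the ranks of relevant documents."""
    hits = [k for k, d in enumerate(order, start=1) if d in relevant]
    return sum(precision_at(order, relevant, k) for k in hits) / len(relevant)


rankers = [RANKER_1, RANKER_2]
candidates = {
    "Ranker 1": RANKER_1,
    "Ranker 2": RANKER_2,
    "Ensemble 1": ensemble([0.5, 0.5], rankers),
    "Ensemble 2": ensemble([0.7, 0.3], rankers),
}
results = {}
for name, scores in candidates.items():
    order = ranking(scores)
    results[name] = (
        order,
        round(mean_precision_all_cutoffs(order, RELEVANT), 2),
        round(average_precision(order, RELEVANT), 2),
    )
    print(name, results[name])
```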

This toy example implies that there exist optimal weights assigned for the constituent rankers to construct an ensemble ranker. Different from proposing new probabilistic or nonprobabilistic models, this ensemble model motivates an alternative way for solving ranking tasks. In order to formulate this task as an optimization problem, the metric—MAP—is used as the objective function since it reflects the performance of IR system and tends to discriminate stably among systems compared to other IR metrics [14]. Therefore, our goal is changed to calculate the weights with which the MAP is maximized. In the following, we will describe and solve this problem mathematically.

II-B Problem Definition

Let $D$ be a set of documents, $Q$ a set of queries and $\Phi$ a set of rankers. $D_i$ denotes the relevant document list for the $i$th query, $d_{ij}$ the $j$th relevant document in $D_i$, $q_i$ the $i$th query and $\phi_k$ the $k$th ranker. $L$ represents the number of queries, $|D_i|$ the number of relevant documents associated with $q_i$ and $K$ the number of rankers. The ensemble ranker is defined as $H = \sum_{k=1}^{K} \alpha_k \phi_k$, which linearly combines the constituent rankers with weights $\alpha_k$. We assume the relevant documents have been sorted in descending order according to the ranking scores. On the basis of these notations and the definition of MAP, the aforementioned problem can be formulated as:

$$\text{(P1)} \quad \max_{\alpha} \; \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, H)}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} \alpha_k = 1, \qquad \alpha_k \ge 0, \quad k = 1, \ldots, K,$$

where $R(d_{ij}, H)$ represents the ranking position of document $d_{ij}$ given by the ensemble model $H$. In this constrained nonlinear program, (i) the objective function is a general definition of MAP; and (ii) the constraints indicate that the linear combination is convex and that the weights can be interpreted as a distribution. Since the position function is defined by the ranking scores, it can be written as

$$R(d_{ij}, H) = 1 + \sum_{d \in D,\, d \ne d_{ij}} I\{s_{d_{ij}, d} < 0\} \tag{1}$$

where $s_{x,y} = s_x - s_y$ and $I\{\cdot\}$ is an indicator function which equals 1 if its argument is true and 0 otherwise. Here, $s_x$ denotes the ranking score of document $x$ given by the ensemble model $H$ and $s_{x,y}$ the difference of the ranking scores between documents $x$ and $y$. Since $s_{x,y}$ is linear with respect to the weights, it can be rewritten as

$$s_{x,y} = \sum_{k=1}^{K} \alpha_k \left( \phi_k(x) - \phi_k(y) \right) \tag{2}$$

where $\phi_k(x)$ denotes the relevance score of document $x$ for the given query calculated by model $\phi_k$.

Here, we give an example plot that illustrates the graph of the objective function. This example employed the MED data set with settings identical to those in [12], except that only two constituent rankers, LDI and pLSI, were used to comprise the ensemble ranker for plotting purposes. The weights were restricted to the constraints in Problem P1 with a precision of three digits after the decimal point. In detail, the objective function was evaluated on a grid of weights, with one weight assigned to LDI and the complementary weight to pLSI, and the LDI weight increased in fixed steps. Figure 1 shows part of the graph of the objective function. From this plot, it is clearly observed that (i) the objective function is highly nonsmooth and nonconvex; and (ii) there are numerous local optima in the objective function. Though the nondifferentiability is not obvious in this graph, the position function implies that the objective function is nondifferentiable with respect to the weights. Therefore, general gradient-based algorithms, such as Lagrangian relaxation and Newton’s method, cannot be applied to this problem directly to find the optimum, or even local optima [3].

Fig. 1: An illustrated example of the objective function with two constituent rankers in Problem P1.

From this analysis of the objective function, the position function plays an important role in the differentiability. Thus, we will discuss how to approximate it with a differentiable function and how to solve optimization Problem P1 in the next two sections.
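
The piecewise-constant behavior of the objective can be illustrated with a minimal Python sketch. It assumes that the position of a document is one plus the number of documents with a strictly higher ensemble score, and that the j-th relevant document (sorted by score) contributes j divided by its position; the helper names and the grid are illustrative choices, and the data are the single query of Table I rather than the MED collection. The resulting values follow this formulation of the objective (the MAP row of Table I appears to use a different averaging convention).

```python
def ensemble_scores(weights, ranker_scores):
    """Weighted sum of the constituent scores for every document."""
    docs = ranker_scores[0].keys()
    return {d: sum(w * r[d] for w, r in zip(weights, ranker_scores)) for d in docs}


def position(doc, scores):
    """1 + number of other documents whose score is strictly higher."""
    return 1 + sum(1 for other in scores
                   if other != doc and scores[doc] - scores[other] < 0)


def average_precision(scores, relevant):
    """The j-th relevant document (sorted by score) contributes j / position."""
    ordered = sorted(relevant, key=scores.get, reverse=True)
    total = sum(j / position(d, scores) for j, d in enumerate(ordered, start=1))
    return total / len(ordered)


def map_objective(weights, queries):
    """Mean AP over queries; each query is (ranker_scores, relevant_docs)."""
    return sum(average_precision(ensemble_scores(weights, rs), rel)
               for rs, rel in queries) / len(queries)


# One query, taken from Table I (relevant documents are 2 and 3).
queries = [([{1: 0.35, 2: 0.40, 3: 0.25}, {1: 0.20, 2: 0.10, 3: 0.70}], {2, 3})]

# Midpoints of 100 equal bins on [0, 1]; a is the weight of Ranker 1.
alphas = [(i + 0.5) / 100 for i in range(100)]
values = [map_objective([a, 1.0 - a], queries) for a in alphas]
distinct = sorted({round(v, 4) for v in values})
print(distinct)
```

On this toy query the objective takes only two distinct values over the whole grid, so its derivative with respect to the weights is zero wherever it exists, which is why a gradient-based method has nothing to follow without a smooth surrogate.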

III Approximation

In this section, we propose a differentiable surrogate for the position function and further approximate the Problem P1 with an easier nonlinear program.

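Since the position function is defined by an indicator function (Equation 1), we can use a sigmoid function of the score difference $s_{x,y}$ between documents $x$ and $y$ to approximate this indicator function, i.e.,

$$I\{s_{x,y} < 0\} \approx \frac{\exp(-\beta s_{x,y})}{1 + \exp(-\beta s_{x,y})} \tag{3}$$

where $\beta > 0$ is a scaling constant. It is obvious that this approximation is in the range of $(1/2, 1)$ if $s_{x,y} < 0$ and $(0, 1/2)$ if $s_{x,y} > 0$. The following theorem shows that we can get a tight bound by this approximation.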

Theorem 1.

The difference between the sigmoid function and the indicator function is bounded as:

where , and represents for notational simplicity henceforth

Proof.

For , we have and , thus,

For , we have and , thus,

Since , we can get

(4)

This completes the proof. ∎
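
The two cases of the proof can be written out as a short derivation. The notation below is assumed for illustration and may differ from the authors' symbols: $s$ is the score difference between two documents, $\beta > 0$ the scaling constant, and $\delta > 0$ the smallest absolute score difference over all document pairs.

```latex
\begin{align*}
\sigma_\beta(s) &= \frac{\exp(-\beta s)}{1+\exp(-\beta s)} \\[4pt]
s > 0:\quad
  \bigl|\sigma_\beta(s) - I\{s<0\}\bigr|
  &= \frac{\exp(-\beta s)}{1+\exp(-\beta s)}
   = \frac{1}{1+\exp(\beta s)} \\[4pt]
s < 0:\quad
  \bigl|\sigma_\beta(s) - I\{s<0\}\bigr|
  &= 1 - \frac{\exp(-\beta s)}{1+\exp(-\beta s)}
   = \frac{1}{1+\exp(-\beta s)}
   = \frac{1}{1+\exp(\beta |s|)} \\[4pt]
|s| \ge \delta \;\Longrightarrow\;
  \bigl|\sigma_\beta(s) - I\{s<0\}\bigr|
  &\le \frac{1}{1+\exp(\beta\delta)}
\end{align*}
```

Under these assumptions the error of each pairwise term decays exponentially in the product of the scaling constant and the smallest score gap.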

This theorem tells us that the sigmoid function is asymptotic to the indicator function, especially when the scaling constant is chosen to be large enough. By using this approximation, the position function can be correspondingly approximated as

(5)

which becomes differentiable and continuous.
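
A minimal Python sketch of the smoothed position function is given below. It assumes the sigmoid surrogate exp(-beta*s) / (1 + exp(-beta*s)) for the indicator of s < 0, rewritten with tanh to avoid overflow; the names `soft_indicator`, `smooth_position` and the list of scaling constants are illustrative. The scores are the Ensemble 2 column of Table I.

```python
import math


def soft_indicator(s, beta):
    """Sigmoid surrogate for I{s < 0}: equals exp(-beta*s) / (1 + exp(-beta*s))."""
    return 0.5 * (1.0 - math.tanh(0.5 * beta * s))


def exact_position(doc, scores):
    """1 + number of other documents with a strictly higher score."""
    return 1 + sum(1 for o in scores if o != doc and scores[doc] - scores[o] < 0)


def smooth_position(doc, scores, beta):
    """1 + sum of sigmoid surrogates over all other documents."""
    return 1.0 + sum(soft_indicator(scores[doc] - scores[o], beta)
                     for o in scores if o != doc)


scores = {1: 0.305, 2: 0.31, 3: 0.385}  # Ensemble 2 column of Table I
betas = [10, 100, 1000, 10000]
max_gaps = []
for beta in betas:
    gap = max(abs(smooth_position(d, scores, beta) - exact_position(d, scores))
              for d in scores)
    max_gaps.append(gap)
    print(beta, gap)
```

Summing the per-pair sigmoid error over the other n - 1 documents gives a position error of at most (n - 1) / (1 + exp(beta * delta)), where delta is the smallest absolute score difference; the printed gaps shrink accordingly as the scaling constant grows.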

Then it is trivial to show the approximation error of position function, i.e.,

(6)

Suppose 1000 documents exist in the document set and . By setting , the approximation error of the position function is bounded by

(7)

which is tight enough for our problem.

In this way, the original Problem P1 can be approximated by the following problem

s.t.

Using the settings identical to Figure 1, Figure 2 plots the graphs of the original objective function (OOF) in Problem P1 and the approximated objective function (AOF) in Problem P2. As shown in the plot, the trend of the AOF is close to that of the OOF. The weights generating the optimal MAP almost remain unchanged in these two curves. From this example, it is illustratively shown that the original noncontinuous and nondifferentiable objective function can be effectively approximated by a continuous and differentiable function. The following lemma and theorem will theoretically prove this conclusion.

Fig. 2: Comparison of the OOF in Problem P1 and AOF in Problem P2. ()
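
The comparison between the two objectives can be mimicked on the toy query of Table I. This is a sketch under stated assumptions, not the MED experiment: the scaling constant 2000 and the evaluation grid are hypothetical choices, and `objective` is an illustrative helper that uses exact positions when `beta` is `None` and sigmoid-smoothed positions otherwise.

```python
import math


def soft_indicator(s, beta):
    """Sigmoid surrogate for I{s < 0}, written with tanh to avoid overflow."""
    return 0.5 * (1.0 - math.tanh(0.5 * beta * s))


def objective(weights, ranker_scores, relevant, beta=None):
    """AP for one query; exact positions if beta is None, smoothed otherwise."""
    docs = list(ranker_scores[0])
    score = {d: sum(w * r[d] for w, r in zip(weights, ranker_scores)) for d in docs}
    ordered = sorted(relevant, key=score.get, reverse=True)
    total = 0.0
    for j, x in enumerate(ordered, start=1):
        diffs = [score[x] - score[y] for y in docs if y != x]
        if beta is None:
            pos = 1 + sum(1 for s in diffs if s < 0)
        else:
            pos = 1.0 + sum(soft_indicator(s, beta) for s in diffs)
        total += j / pos
    return total / len(ordered)


rankers = [{1: 0.35, 2: 0.40, 3: 0.25}, {1: 0.20, 2: 0.10, 3: 0.70}]  # Table I
relevant = {2, 3}
beta = 2000.0  # hypothetical scaling constant
alphas = [(i + 0.5) / 100 for i in range(100)]  # weight of Ranker 1
oof = [objective([a, 1 - a], rankers, relevant) for a in alphas]
aof = [objective([a, 1 - a], rankers, relevant, beta) for a in alphas]

best_aof = max(range(len(alphas)), key=aof.__getitem__)
print("alpha maximizing AOF:", alphas[best_aof], "OOF there:", oof[best_aof])
print("largest |AOF - OOF| on the grid:", max(abs(a - o) for a, o in zip(aof, oof)))
```

On this small example the weight that maximizes the smoothed objective also attains the maximum of the original one; the two curves differ most near the weights at which two documents swap positions.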
Theorem 2.

The error between the OOF in Problem P1 and the AOF in Problem P2 is bounded as

(8)

where and denote the objective function in Problem P2 and Problem P1, respectively.

Proof.

For the approximation error, we have

where denotes for notational simplicity. Since and are strictly positive, we have

According to Equation 6, we have

(9)

This completes the proof. ∎

This theorem indicates that the OOF in Problem P1 can be accurately approximated by the surrogate defined by the position function (5) in Problem P2. For example, if , , , and , the absolute discrepancy between the objectives in Problem P1 and P2 is bounded by

This discrepancy is within an acceptable level and will decrease with the growth of the query size and the value of .

The constraints on the weights in Problem P2 are of practical significance because these weights can be regarded as probabilities drawn from a distribution over the constituent rankers. However, adding constraints increases the difficulty of solving this optimization problem. Intuitively, the normalization of the weights assigned to ranking scores is nonessential because the ranking position is determined by the relative values of the ranking scores. Taking the toy example in Table I, any positive multiple of the weights 0.7 and 0.3 produces the same ranking as Ensemble 2. The lemmas and theorems below prove the hypothesis that this constrained nonlinear program can be approximated by an unconstrained nonlinear program.
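
The intuition that normalization does not affect the ranking can be checked directly. The sketch below rescales the Ensemble 2 weights by several arbitrary positive constants (illustrative values) and confirms that the ordering of the Table I documents does not change; `ranking` is a hypothetical helper.

```python
def ranking(weights, ranker_scores):
    """Document ids ordered by descending weighted-sum score."""
    docs = ranker_scores[0].keys()
    total = {d: sum(w * r[d] for w, r in zip(weights, ranker_scores)) for d in docs}
    return sorted(total, key=total.get, reverse=True)


rankers = [{1: 0.35, 2: 0.40, 3: 0.25}, {1: 0.20, 2: 0.10, 3: 0.70}]  # Table I
normalized = ranking([0.7, 0.3], rankers)
rescaled = [ranking([0.7 * c, 0.3 * c], rankers) for c in (0.01, 1.0, 10.0, 1234.5)]
print(normalized, rescaled)
```

Only the exact ranking is scale-invariant in this sense; a sigmoid surrogate with a fixed scaling constant does change when the weights are rescaled, which is why the equivalence between the constrained and unconstrained programs is stated as an approximation.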

Lemma 1.

Problem P2 is equivalent to the following problem:

where , and

Since , it can be straightforwardly proved that Problem P3 is equivalent to Problem P2.

Remark 1.

If we let , Theorem 1 applies for both and as well.

The following theorem states that Problem P3 can be surrogated by an easier problem.

Theorem 3.

Consider the following problem

where . Let and denote the objective function in Problem P3 and Problem P4, respectively. Then, we have the following bound for the absolute difference between and

(10)

where , and .

Proof.

From Lemma 1 and Lemma 1, we can derive the following bound.

Since and are strictly positive, we have

According to the general triangle inequality, we can draw an upper bound for the term in numerator

Then, it is trivial to get

(11)

This completes the proof. ∎

Since the differences and are small enough, Problem P4 can accurately approximate Problem P3. This theorem tells us that the AOF is also determined by the ranking positions, i.e., the relative values of ranking scores, thus the normalization constraints in Problem P2 can be removed. Taking Lemma 1 and Theorem 2 into account, we can trivially draw the following corollary.

Corollary 1.

Problem P1 can be approximated by Problem P4.

In the next section, we focus on proposing algorithms that solve Problem P4.

IV Algorithm

In order to solve Problem P4, we propose algorithms according to the data settings: batch setting and online setting. In the batch setting, all the queries and ranking scores given by constituent rankers are processed as a batch. Based on the batch data, the weights over constituent rankers are computed by maximizing the MAP. Two algorithms, gEnM.BAT and gEnM.IP, are reported in this setting. The batch algorithms merit consideration for systems containing complete data. Take an academic search engine as an example. The titles can be seen as queries while the abstracts and contents of publications can be regarded as relevant documents, so a batch can be established to train the proposed model.

In many IR environments such as recommendation systems in E-commerce, however, the queries and ranking scores are generated in real time and therefore form data sequences over time. Thus, we also propose an online algorithm, gEnM.ON, for dealing with these data sequences. The online algorithm is more scalable to large data sets with limited storage than the batch algorithm. In the online algorithm, the queries as well as the corresponding ranking scores are input as a data stream and processed in a serial fashion.

A common assumption for the aforementioned frameworks is that the relevant documents are known. However, the relevant documents are unknown in many modern IR systems such as search engines. For this IR environment, we further propose an unsupervised ensemble model, UnsEnM, which makes use of a co-training framework.

IV-A Batch Algorithm: gEnM.BAT

Although many sophisticated methods can be applied for finding a local optimum, we first propose a revised Newton’s method. The major modification is the approximation of the gradient and Hessian matrix.

For notational simplicity, we utilize:

(12)
(13)
(14)
(15)

Under those notations, the first and second derivative of the objective function in Problem P4 can be written as

(16)

and

(17)

respectively. According to the second derivative, the Hessian matrix is defined by

(18)

As stated by Theorem 6 in Appendix B, the addends in the first derivative can be estimated by zeros under certain conditions. This approximation also applies to the second derivative as well as the Hessian matrix since both contain the first derivative term. The advantages of using this approximation are two-fold: (i) the computation of the Hessian is simplified since many addends are set to zero under certain conditions; and (ii) the computations of , , and can be carried out offline before evaluating the derivative and Hessian, which makes the learning algorithm inexpensive.

Since the objective function in Problem P4 is nonconvex, multiple local optima may exist in the variable space. Therefore, different starting points are chosen to preclude the algorithm from getting stuck in one local optimum. The largest local optimum and the corresponding weights are returned as the final solution. To accelerate the algorithm, we can distribute different starting points onto different cores for parallel computing. The batch algorithm is summarized as follows. We note that and represent the vectors with elements and , respectively, and that indexes initial values.

Algorithm 1 gEnM.BAT (Generalized Ensemble Model by Revised Newton’s Algorithm in Batch Setting.)
Input: Query set , document set , relevant document set with respect to , ranking scores with respect to the query, th method and document , a number of initial points and a threshold for stopping the algorithm.
1: for each do
2:    Set iteration counter ;
3:    Evaluate ;
4:    repeat
5:       Set ;
6:       Compute gradient and Hessian matrix (Algorithm 2);
7:       Update ;
8:       Evaluate ;
9:    until
10:    Store
11: end for
12: return ’s.

Algorithm 2 Approximated Derivative and Hessian Computation Algorithm.
Input: Query set , document set , relevant document set with respect to , ranking scores with respect to the query, th method and document , current .
1: for do
2:    for do
3:       Set , , and to zeros;
4:       for do
5:          ;
6:          ;
7:
8:          if then
9:             ;
10:             ;
11:             ;
12:          else
13:             ;
14:             ;
15:             ;
16:          end if
17:       end for
18:    end for
19: end for
20: Compute gradient (Equation 40) and Hessian matrix (Equation 18);
21: return and .

A drawback of the conventional Newton’s method is that it is designed for unconstrained nonlinear programs, while our problem requires nonnegative weights. Thus, applying the above algorithms may result in negative weights. The strategy for avoiding this shortcoming is to set the final negative weights to zero. As a matter of fact, the rankers with negative weights play a negative role in the ensemble model. Thus, ignoring those rankers is reasonable in practice.
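
The multi-start structure and the final clipping step can be sketched in Python. This is not gEnM.BAT itself: plain backtracking gradient ascent stands in for the revised Newton step, the gradient is the exact one of a sigmoid-smoothed AP rather than the approximated derivative of Algorithm 2, and the scaling constant, number of starts and all function names are assumptions made for illustration.

```python
import math
import random


def soft_indicator(s, beta):
    """Sigmoid surrogate for I{s < 0}, written with tanh to avoid overflow."""
    return 0.5 * (1.0 - math.tanh(0.5 * beta * s))


def smooth_ap(weights, ranker_scores, relevant, beta):
    """Smoothed AP of one query and its gradient with respect to the weights."""
    docs = list(ranker_scores[0])
    score = {d: sum(w * r[d] for w, r in zip(weights, ranker_scores)) for d in docs}
    ordered = sorted(relevant, key=score.get, reverse=True)
    value, grad = 0.0, [0.0] * len(weights)
    for j, x in enumerate(ordered, start=1):
        pos, dpos = 1.0, [0.0] * len(weights)
        for y in docs:
            if y == x:
                continue
            p = soft_indicator(score[x] - score[y], beta)
            pos += p
            for k, r in enumerate(ranker_scores):
                # d p / d w_k = -beta * p * (1 - p) * (r[x] - r[y])
                dpos[k] -= beta * p * (1.0 - p) * (r[x] - r[y])
        value += j / pos
        for k in range(len(weights)):
            grad[k] -= j * dpos[k] / pos ** 2
    n = len(ordered)
    return value / n, [g / n for g in grad]


def smooth_map(weights, queries, beta):
    """Average of smooth_ap over all queries (value and gradient)."""
    parts = [smooth_ap(weights, rs, rel, beta) for rs, rel in queries]
    value = sum(v for v, _ in parts) / len(parts)
    grad = [sum(g[k] for _, g in parts) / len(parts) for k in range(len(weights))]
    return value, grad


def multi_start_ascent(queries, n_rankers, starts=5, steps=100, beta=50.0, seed=0):
    """Backtracking gradient ascent from several random starts; keeps the best."""
    rng = random.Random(seed)
    best_value, best_weights = float("-inf"), None
    for _ in range(starts):
        w = [rng.random() for _ in range(n_rankers)]
        for _ in range(steps):
            value, grad = smooth_map(w, queries, beta)
            step = 1.0
            while step > 1e-6:
                cand = [wi + step * gi for wi, gi in zip(w, grad)]
                if smooth_map(cand, queries, beta)[0] > value:
                    w = cand
                    break
                step *= 0.5
            else:
                break  # no improving step found: treat w as a local optimum
        w = [max(0.0, wi) for wi in w]  # negative weights are set to zero
        value, _ = smooth_map(w, queries, beta)
        if value > best_value:
            best_value, best_weights = value, w
    return best_value, best_weights


# One query, taken from Table I (relevant documents are 2 and 3).
queries = [([{1: 0.35, 2: 0.40, 3: 0.25}, {1: 0.20, 2: 0.10, 3: 0.70}], {2, 3})]
best_value, best_weights = multi_start_ascent(queries, n_rankers=2)
print(best_value, best_weights)
```

Because every start is independent, the outer loop is the natural unit to distribute across cores, as suggested for the batch algorithm.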

IV-B Online Algorithm: gEnM.ON

In the previous two subsections, we have presented the learning algorithms for generating gEnM by batch data sets. In contrast to the batch setting, the online setting provides the gEnM a long sequence of data. The weights are calculated sequentially based on the data stream that consists of a series of time steps . For example, the gEnM is constructed based on the new queries and corresponding rankings given at different times in a search engine. The final goal is also to maximize the overall MAP on the data sets.

(19)

As a matter of fact, the presented batch algorithms can be applied directly in the online setting by regarding the whole observed sequences as a batch at each step. In doing so, however, the overall complexity is extremely high since the batch algorithm should be run once at each time step.

In the online setting, the subsequent queries are not available at present. An alternative optimization technique should be considered to avoid focusing too much on the present training data. To distinguish from the notation in the batch setting, we let be the query and suppose are the given query at time in the online setting. Here, we assume that these sequences are given with the ground truth distribution . Thus, the objective function of MAP can be defined as the expectation of average precision, i.e.,

(20)

where

The expectation cannot be maximized directly because the true distribution is unknown. However, we can estimate the expectation by the empirical MAP, which simply uses a finite number of training observations. A plausible approach for solving this empirical MAP optimization problem is to use the stochastic gradient descent (SGD) algorithm, which is a drastic simplification of the expensive gradient descent algorithm. Though the SGD algorithm is a less accurate optimization algorithm than the batch algorithm, it is faster in terms of computational time and cheaper in terms of memory [15]. Another advantage is that the SGD algorithm is more adaptive to a changing environment in which examples are given sequentially [16].

For our problem, the SGD learning rule is formulated as

(21)

where is called learning rate, i.e., a positive value depending on . This updating rule is validated to increase the objective value at each step in terms of expectation, which can be verified by the following theorem.

Theorem 4.

Using the updating rule (21), the expectation of average precision increases at each step, i.e.,

Proof.

Since , we only need to show .
Since

we need to verify . According to the denotation of , we have

where .
Since

(22)

we can conclude that

This completes the proof. ∎

The learning rate plays an important role in the updating (Equation 22), hence an adequate will enhance the online algorithm to converge. Define in this article, then we have the following well-known properties:

(23)
(24)

Since it is difficult to analyze the whole process of online algorithm [15], we will show the convergence property around the global or local optimum in the following analysis.

Lemma 2.

If is in the neighborhood of the optimum , we have

(25)

The proof of is straightforward referring to Equation 35. This lemma states that the gradient drives the current point towards the maximum . In the stochastic process, the following inequality holds

(26)
Lemma 3.

If is in the neighborhood of the optimum , we have

(27)

The proof is given in the Appendix. For the stochastic nature, the expectation of also converges almost surely, i.e.,

(28)
Theorem 5 ([17]).

In the neighborhood of the maximum , the recursive variables converge to the maximum, i.e.,

(29)
Proof.

Define a sequence of positive numbers whose values measure the distance from the optimum, i.e.,

(30)

The sequence can be written as an expectation under the stochastic nature, i.e.,

(31)

Since the first term on the right hand side is negative according to (26), we can obtain the following bound:

(32)

Conditions (24) and (28) imply that the right hand side converges. According to the quasi-martingale convergence theorem [18], we can also verify that converges almost surely. This result implies the convergence of the first term in (31).

Since does not converge according to (23), we can get

(33)

This result leads to the convergence of the online algorithm, i.e.,

This completes the proof. ∎

Based on the learning rule (21), the online algorithm for achieving the ensemble model is summarized below.

Algorithm 3 gEnM.ON (Generalized Ensemble Model by Online Algorithm.)
Input: Query set , document set , relevant document set with respect to , ranking scores with respect to the query, th method and document , a number of initial points and a threshold for stopping the algorithm.
1: for each do
2:    Set iteration counter ;
3:    Evaluate ;
4:    repeat
5:       for each do
6:          Set ;
7:          Compute gradient with respect to (Algorithm 2);
8:          Update ;
9:       end for
10:       Evaluate ;
11:    until
12:    Store
13: end for
14: return ’s.
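
The inner loop of the online algorithm can be sketched as a single pass over a query stream. Several choices below are assumptions rather than details from the paper: the step size 1/t (one schedule whose sum diverges while the sum of its squares converges), the exact gradient of a sigmoid-smoothed AP in place of the approximated gradient of Algorithm 2, the scaling constant, and the made-up second and third queries. The update adds the gradient because the objective is maximized.

```python
import math


def soft_indicator(s, beta):
    """Sigmoid surrogate for I{s < 0}, written with tanh to avoid overflow."""
    return 0.5 * (1.0 - math.tanh(0.5 * beta * s))


def smooth_ap(weights, ranker_scores, relevant, beta):
    """Smoothed AP of one query and its gradient with respect to the weights."""
    docs = list(ranker_scores[0])
    score = {d: sum(w * r[d] for w, r in zip(weights, ranker_scores)) for d in docs}
    ordered = sorted(relevant, key=score.get, reverse=True)
    value, grad = 0.0, [0.0] * len(weights)
    for j, x in enumerate(ordered, start=1):
        pos, dpos = 1.0, [0.0] * len(weights)
        for y in docs:
            if y == x:
                continue
            p = soft_indicator(score[x] - score[y], beta)
            pos += p
            for k, r in enumerate(ranker_scores):
                # d p / d w_k = -beta * p * (1 - p) * (r[x] - r[y])
                dpos[k] -= beta * p * (1.0 - p) * (r[x] - r[y])
        value += j / pos
        for k in range(len(weights)):
            grad[k] -= j * dpos[k] / pos ** 2
    n = len(ordered)
    return value / n, [g / n for g in grad]


def online_ascent(stream, weights, beta=50.0):
    """One pass over the query stream with step size 1/t (assumed schedule)."""
    history = [list(weights)]
    for t, (ranker_scores, relevant) in enumerate(stream, start=1):
        _, grad = smooth_ap(weights, ranker_scores, relevant, beta)
        eta = 1.0 / t
        weights = [w + eta * g for w, g in zip(weights, grad)]  # ascent step
        history.append(list(weights))
    return weights, history


# Hypothetical stream: the Table I query followed by two made-up queries.
stream = [
    ([{1: 0.35, 2: 0.40, 3: 0.25}, {1: 0.20, 2: 0.10, 3: 0.70}], {2, 3}),
    ([{1: 0.60, 2: 0.30, 3: 0.10}, {1: 0.20, 2: 0.45, 3: 0.35}], {1}),
    ([{1: 0.15, 2: 0.25, 3: 0.60}, {1: 0.40, 2: 0.35, 3: 0.25}], {3}),
]
start = [0.5, 0.5]
final_weights, history = online_ascent(stream, start)
print(final_weights)
```

Each query is touched once and then discarded, so memory use is independent of the stream length; repeating the pass until the evaluated objective stops changing, and restarting from several initial points, would bring the sketch closer to the structure of Algorithm 3.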

IV-C Unsupervised Algorithm: UnsEnM

The preceding algorithms for both the batch setting and the online setting are based on the knowledge of labeled data, which is regarded as supervised learning. As a matter of fact, in the community of conventional information retrieval systems, labeled data are difficult to obtain in general. Under this condition, unsupervised learning plays a crucial role. The inspiration for an unsupervised algorithm for solving Problem P4 comes from the idea of co-training, which is based on the belief that each constituent ranker in the ensemble model can provide valuable information to the other constituent rankers such that they can co-learn from each other [13]. In order to utilize this collaborative learning scheme, the gEnM requires that all constituent rankers be generated by unsupervised learning. In each round, the ranking scores of one of the constituent rankers are provided as fake labeled data for the other rankers to refine the weights. By iteratively learning from the constituent rankers, the ensemble model may achieve an overall improvement in terms of MAP.

We modify the objective function in Problem P4 by adding a penalty term so that the refined ranking does not depend too much on the fake labels. The modified objective function is defined as

where .

Let denote the objective function in Problem P8. The second derivatives of can be written as follows: