Combination of Multiple Bipartite Ranking for Web Content Quality Evaluation
Web content quality estimation is crucial to various web content processing applications. Our previous work applied Bagging + C4.5, a combination of many point-wise ranking models, to achieve the best results in the ECML/PKDD Discovery Challenge 2010. In this paper, we combine multiple pair-wise bipartite ranking learners to solve the multi-partite ranking problem of web quality estimation. In the encoding stage, we present the ternary coding and the binary coding, which extend each rank value to a vector of L − 1 dimensions (L is the number of different ranking values). For decoding, we discuss how to combine the ranking results of multiple bipartite ranking models with predefined weighting and adaptive weighting. Experiments on the ECML/PKDD 2010 Discovery Challenge datasets show that binary coding + predefined weighting yields the highest performance among all four combinations and, furthermore, outperforms the best results reported in the ECML/PKDD 2010 Discovery Challenge competition.
Keywords: Web Content Quality Estimation, Multipartite Ranking, Bipartite Ranking, Encoding Design, Decoding Design
1 Introduction
In the past, most data quality measures were developed on an ad hoc basis to solve specific problems, and the fundamental principles necessary for developing stable metrics in practice were insufficient Pipino2002. Research on Web content quality assessment should focus on computational models that can automatically predict Web content quality.
Web spam can significantly deteriorate the quality of search engine results, but high quality is more than just the opposite of Web spam. The ECML/PKDD 2010 Discovery Challenge (DC2010) addresses more aspects of Web sites: it aims to develop site-level classification for the genre of Web sites (editorial, news, commercial, educational, “deep Web” or Web spam, and more) as well as their readability, authoritativeness, trustworthiness and neutrality Benczur20101.
Learning-to-rank algorithms are traditionally classified into three categories. In the simplest point-wise approach, each instance is assigned a ranking score as an absolute quantity using classical regression or classification techniques Crammer2002 ; Li2007. In the pair-wise approach, the order of a pair of instances is treated as a binary label and learned by a classification method (e.g. RankSVM Joachims2002). RankBoost Freund2003 maintains weak ranking functions, each of which can order the instances, and then combines them into a single ranking. Finally, the most complex list-wise approaches Valizadegan2009 try to directly optimize a ranking-specific evaluation metric (e.g. Normalized Discounted Cumulative Gain, NDCG).
In DC2010, Geng et al. Geng2010 estimated web quality with the weighted output of bagged C4.5 trees and achieved the best results among all submitted reports; this approach can be regarded as a combination of point-wise ranking methods. However, pair-wise ranking models often show better performance than point-wise ones and lower model complexity than list-wise ones. In this study, we combine multiple binary pair-wise rankers with efficient rank coding and decoding for web quality assessment, which converts the multi-partite ranking problem into multiple bipartite ranking problems. Specifically, we use bipartite RankBoost as the base learner. In the encoding stage, we present the ternary coding and the binary coding, which extend each rank value to a vector with L − 1 dimensions (L is the number of ratings). For decoding, we discuss how to combine the ranking results of multiple bipartite ranking models with predefined weighting and adaptive weighting. DC2010 Benczur20101 provides a chance to validate our algorithm, called MultiRanking.ED, where the task is to rank webpages in three languages (English, French and German) according to their quality. The results show that MultiRanking.ED achieves better results. In particular, binary coding + predefined weighting yields the highest performance among all four combinations and surpasses the best results obtained with Bagging + C4.5 Breiman1996 ; Quinlan1993 in the DC2010 competition by our team Geng2010.
The remainder of the paper is organized as follows: Section 2 introduces the multi-partite ranking problem and the RankBoost algorithm; Section 3 presents multi-partite ranking with efficient encoding and decoding; Section 4 provides the experimental results; the last section concludes the paper and outlines future work.
2 Multipartite Ranking
2.1 Framework of Multipartite Ranking
In the bipartite ranking problem, given the dataset from the instance space where and , then the objective of the ranking algorithm is to minimize the expected empirical error on the function :
where is the indicator function. It is worth noting that the area under the ROC curve (AUC) Fawcett2006 can be computed as . As the direct minimization of the 0-1 loss is computationally intractable, the ranking algorithm minimizes a convex upper bound of the expected empirical error (footnote 1: the expected empirical error also includes a regularization term in some ranking algorithms, such as RankSVM):
Formally, we assume that means that should be ranked above while means the opposite; indicates they have the same importance. In the multipartite ranking, we have the relation , or for all pairs . The dataset can be divided into subset , where and for .
For -partite ranking, the dataset can be divided into according to the ratings of the instances. Generally, we can define the empirical error or C-index Furnkranz2009 by extending (2) to the multi-partite case:
where . In the case of the multipartite problem, (1) will be extended as C-index measure:
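As an illustration of the pairwise measures discussed in this subsection, the following minimal Python sketch counts how many differently rated pairs a scoring function orders correctly. The function names `c_index` and `pairwise_error` are hypothetical, and counting tied scores as half-correct is an assumption of the sketch rather than something stated above. With only two ratings, the value returned by `c_index` coincides with the AUC.

```python
from itertools import combinations


def c_index(scores, ratings):
    """Fraction of differently rated pairs that the scores order correctly.

    Tied scores count as half-correct (an assumption of this sketch).
    """
    correct, total = 0.0, 0
    for i, j in combinations(range(len(scores)), 2):
        if ratings[i] == ratings[j]:
            continue  # equally rated pairs carry no preference
        total += 1
        # Orient the pair so that `hi` is the instance with the larger rating.
        hi, lo = (i, j) if ratings[i] > ratings[j] else (j, i)
        if scores[hi] > scores[lo]:
            correct += 1.0
        elif scores[hi] == scores[lo]:
            correct += 0.5
    if total == 0:
        raise ValueError("need at least two instances with different ratings")
    return correct / total


def pairwise_error(scores, ratings):
    """Empirical pairwise ranking error; with two ratings this equals 1 - AUC."""
    return 1.0 - c_index(scores, ratings)
```

The quadratic loop over pairs is kept for readability; it is adequate for small validation sets but not tuned for large ones.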
2.2 Evaluation Measure
In DC2010, evaluation is in terms of the NDCG (Normalized Discounted Cumulative Gain) with the following ratings and the discount function given the sorted ranking sequence and the ratings of the instances (note that here ):
where is the normalization factor that is DCG in the ideal permutation ().
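The NDCG computation described above can be sketched as follows. The exact gain values and discount function used in DC2010 are not reproduced here, so the defaults below (gain `2**r - 1` and a logarithmic discount) are assumptions taken from the most common NDCG variant; both are passed as parameters so that other choices can be substituted. The names `dcg` and `ndcg` are hypothetical.

```python
import math


def dcg(ratings_in_ranked_order, gain, discount):
    """Discounted cumulative gain of ratings listed from rank 1 downwards."""
    return sum(gain(r) * discount(pos)
               for pos, r in enumerate(ratings_in_ranked_order, start=1))


def ndcg(scores, ratings,
         gain=lambda r: 2 ** r - 1,                       # assumed gain
         discount=lambda pos: 1.0 / math.log2(pos + 1)):  # assumed discount
    """DCG of the ranking induced by `scores`, normalized by the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [ratings[i] for i in order]
    ideal = sorted(ratings, reverse=True)  # ideal permutation
    z = dcg(ideal, gain, discount)         # normalization factor
    return dcg(ranked, gain, discount) / z if z > 0 else 0.0
```

A ranking that sorts the instances by decreasing rating obtains the value 1, and any other ranking obtains a value between 0 and 1.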
2.3 RankBoost for Ranking
Algorithm 1: RankBoost (final step: output the final ranking).
RankBoost Freund2003 maintains a distribution over pairs of instances that is passed to the weak learner and approximates the true ordering by combining many simple weak rankers. It minimizes the following expected loss:
where and may be regarded as the probability on for all in the initial distribution.
The weak ranker focuses on the binary rating ( and , or the positive and the negative) that gives the relative ordering of the examples and ignores the specific scores. The ranker has the following simple form:
where is the j-th feature value of , is the threshold and is the default value of the ranker. With the convexity of , it is easily verified that is the upper bound of where
The upper bound of in each iteration is minimized when , which will yield . The weak ranker should choose the optimal values of , and to maximize the weighted risk loss . Algorithm 1 shows the framework of RankBoost (footnote 2: for bipartite ranking, Freund2003 gives a more efficient implementation called RankBoost.B).
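To make the boosting loop concrete, here is a minimal sketch of bipartite RankBoost with threshold weak rankers, written after the description in Freund2003 rather than taken from our implementation. The function names (`stump`, `rankboost`, `score`) are hypothetical. The sketch assumes that every feature value is present, so the default value of the weak ranker is never needed, and it searches all observed feature values as thresholds. It keeps an explicit distribution over all positive-negative pairs, which is simple but less efficient than RankBoost.B.

```python
import math


def stump(x, feature, theta):
    """Weak ranker: 1 if the chosen feature exceeds the threshold, else 0."""
    return 1.0 if x[feature] > theta else 0.0


def rankboost(pos, neg, rounds=10):
    """Bipartite RankBoost sketch: each positive should outrank each negative."""
    pairs = [(p, n) for p in range(len(pos)) for n in range(len(neg))]
    dist = {pair: 1.0 / len(pairs) for pair in pairs}  # uniform initial weights
    model = []                                         # (alpha, feature, theta)
    for _ in range(rounds):
        best = None
        for j in range(len(pos[0])):
            for theta in sorted({x[j] for x in pos + neg}):
                # r = sum over pairs of D(pair) * (h(positive) - h(negative))
                r = sum(d * (stump(pos[p], j, theta) - stump(neg[n], j, theta))
                        for (p, n), d in dist.items())
                if best is None or abs(r) > abs(best[0]):
                    best = (r, j, theta)
        r, j, theta = best
        r = max(min(r, 1 - 1e-10), -1 + 1e-10)         # keep the logarithm finite
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        model.append((alpha, j, theta))
        # Down-weight correctly ordered pairs, up-weight misordered ones.
        for (p, n) in pairs:
            margin = stump(pos[p], j, theta) - stump(neg[n], j, theta)
            dist[(p, n)] *= math.exp(-alpha * margin)
        z = sum(dist.values())                         # normalization constant
        for pair in pairs:
            dist[pair] /= z
    return model


def score(model, x):
    """Final ranking score: weighted sum of the weak rankers."""
    return sum(alpha * stump(x, j, theta) for alpha, j, theta in model)
```

For example, `rankboost([[3.0, 0.1], [2.5, 0.9]], [[1.0, 0.5], [0.5, 0.2]], rounds=5)` returns a model whose `score` places both positive instances above both negative ones, because the first feature separates the two groups.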
3 Multi-Ranking with Encoding and Decoding
In this section, we decompose the multi-partite ranking problem into multiple bipartite ranking problems through encoding and decoding. In the coding stage, the ratings are coded as a -bit 0-1 sequence, and each binary ranking algorithm may be a point-wise or a pair-wise ranking algorithm. In the decoding stage, the final ranking score is a weighted average of the outputs of all binary ranking algorithms, and sorting the examples in descending order of this score gives the final ranking.
3.1 Coding Design
In this section, we describe the coding design for rank learning. Given a set of ratings to be learned, partite parts are formed. A codeword of is designed for each rating, where each bit represents the response of a given dichotomizer (footnote 3: the dichotomizer generally denotes a binary classifier in the classification literature; informally, we introduce this term into machine-learned ranking and call a bipartite ranker a dichotomizer). The codewords are arranged as rows to construct a coding matrix , where . The l-th row corresponds to the rating () and the j-th column to the dichotomizer (). We categorize the coding design into the binary coding and the ternary coding based on the range of the coding values.
3.1.1 Binary Coding
The standard binary coding design is used for one-vs-all strategy in the multi-class classification, where each dichotomizer is built to distinguish one class from the rest of the classes. For the -partite ranking problem, we extend each rating to a vector with dimensions. Formally, the coding method will encode to ():
Fig. 1 gives the binary coding design with the dichotomizers for 4 ratings. It can be explained by the fact that the algorithm executes the dichotomizers sequentially. First, it judges whether holds for the instance . Then, it tests whether and hold, respectively. The F&H method Frank2001 uses this encoding strategy to implement ordinal regression. Unlike their method, we use a pair-wise method (RankBoost) instead of a point-wise method. In practice, pair-wise methods often achieve better performance than point-wise methods.
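The binary coding can be illustrated with a short sketch. It assumes that ratings are indexed from 0 to L − 1 and that dichotomizer j asks whether the rating exceeds j, which is the F&H-style decomposition; the indexing convention and the names `binary_coding_matrix` and `binary_targets` are assumptions of the sketch.

```python
def binary_coding_matrix(num_ratings):
    """Rows are ratings 0..L-1, columns are dichotomizers 0..L-2."""
    return [[1 if rating > j else 0 for j in range(num_ratings - 1)]
            for rating in range(num_ratings)]


def binary_targets(ratings, j):
    """Bipartite labels for dichotomizer j: positive iff the rating exceeds j."""
    return [1 if r > j else 0 for r in ratings]
```

For 4 ratings this gives the rows `[0, 0, 0]`, `[1, 0, 0]`, `[1, 1, 0]` and `[1, 1, 1]`, so every dichotomizer is trained on all instances and only the split point between negatives and positives moves.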
3.1.2 Ternary Coding
In classification, the ternary coding designs are the one-vs-one strategy Hastie1998 and the sparse random strategy Allwein2000. In one-vs-one coding, the examples from each pair of classes are used to train a model. The negative code () in the coding means that a particular class is not considered by the given classifier. We consider a -bit coding design including the preference pairs, where the k-th column contains the instances from . Formally, we define the following ternary coding ():
where the upper triangular part of the coding matrix is zero; we call this the upper triangular coding. Optionally, the coding could also contain all the instances from , in which case the lower triangular part of the coding matrix is 1; we call this the lower triangular coding. Both ternary codings are depicted in Fig. 2.
In Fig. 2, the matrix is coded with three dichotomizers for the 4-partite problem. The white regions are coded by , the black regions by , and the gray regions correspond to (the regions that are not considered by the respective dichotomizer ). In the left part of Fig. 2, the set of the partial relation will be considered by the dichotomizer .
We notice that LPC (Learning by Pairwise Comparison) Furnkranz2009 can be formulated as a ternary coding with L(L − 1)/2 bits. As an example, Fig. 3 shows a coding with 6 bits when LPC solves a 4-partite problem. Each column corresponds to a pair of ratings such as , and it uses only the examples whose rating is either or .
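The LPC formulation can be sketched as a ternary coding matrix with one column per pair of ratings. The sketch assumes ratings indexed from 0 and uses −1 as the code for ratings that a dichotomizer ignores; the constant `IGNORED` and the function names are hypothetical. It covers only the LPC coding and not the upper or lower triangular codings.

```python
from itertools import combinations

IGNORED = -1  # assumed code for ratings that a dichotomizer does not consider


def lpc_coding_matrix(num_ratings):
    """One column per pair (low, high): low -> 0, high -> 1, others ignored."""
    pairs = list(combinations(range(num_ratings), 2))
    matrix = [[IGNORED] * len(pairs) for _ in range(num_ratings)]
    for col, (low, high) in enumerate(pairs):
        matrix[low][col] = 0
        matrix[high][col] = 1
    return pairs, matrix


def training_subset(ratings, column_codes):
    """Indices and bipartite labels of the instances one dichotomizer uses."""
    return [(i, column_codes[r]) for i, r in enumerate(ratings)
            if column_codes[r] != IGNORED]
```

For a 4-partite problem, `lpc_coding_matrix(4)` yields 6 columns, and the column for the pair (0, 1) keeps only the instances rated 0 or 1.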
3.2 Decoding Design
In classification, the most frequently applied decoding designs are built on a certain distance metric between the output vector and the coding vector, such as Hamming decoding Nilsson1965 for the binary decoding and loss-based decoding Allwein2000 for the ternary decoding. ECOC (error-correcting output codes) classifies an instance to the label whose coding vector is nearest to the output vector under the specific distance metric. For ranking, however, the objective of the decoding design is to fuse the outputs of multiple dichotomizers into a final ranking score instead of predicting the class label. The decoding design therefore amounts to determining the weight of each dichotomizer.
Recall that we assume the training dataset , where . Inspired by McRank Li2007 where the ranking algorithm fused the posterior probability of the instance conditioned on the class label into a score value, we define the scoring function:
where represents the weight function of the dichotomizer . After the weighted scores have been computed for all instances, the instances are sorted in descending order of . If we set for all dichotomizers, this is similar to the fusion used in the F&H method. In our experiments, we set both and to implement our algorithms. We find that it is clearly better to set than if we assign the predefined weights to . A linear transformation of the scoring function will not change the ranking results. It seems that the adaptive weighting function, which measures the ability of the dichotomizers, is more intuitive than the predefined weighting. For each dichotomizer, we take the of 3-holdout validation instead of 3-fold cross-validation in consideration of the training time. In the F&H method, is set to 1 empirically. LPC trains a separate model for each pair of ratings such as , and the prior probabilities of the rating pairs are used as the weights for the fusion:
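The decoding step described in this subsection can be sketched as a weighted sum of the dichotomizer outputs followed by a descending sort. The weights are passed in as plain numbers, so the same code serves predefined weights, adaptive weights such as a holdout measure per dichotomizer, or the constant weight 1 of the F&H method. The names `fuse_scores` and `rank_descending` are hypothetical, and whether the dichotomizer outputs are rescaled before fusion is not addressed by the sketch.

```python
def fuse_scores(dichotomizer_scores, weights):
    """dichotomizer_scores[j][i] is the score of instance i from dichotomizer j."""
    n = len(dichotomizer_scores[0])
    return [sum(w * s[i] for w, s in zip(weights, dichotomizer_scores))
            for i in range(n)]


def rank_descending(fused):
    """Instance indices sorted by decreasing fused score."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

With the scores `[[0.9, 0.2, 0.5], [0.1, 0.8, 0.7]]`, the weights `[2, 1]` rank instance 0 first, whereas the weights `[1, 2]` rank instance 2 first, which shows how the weighting alone can change the final ranking.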
4 Experiments
4.1 Description of Dataset
In our experiments, we used all labeled samples of the English language as training data for the English quality task. DC2010 provides only a few labeled samples for the French and German tasks to emphasize cross-lingual methods. We put all the labeled examples, including English, French and German, into the training set for the multilingual quality tasks (French and German tasks). We noticed that, due to website redirects, there exist some duplicate instances with different ratings in the collection. In this case, we chose the highest rating as the ranking value of the instance and kept a single copy of the instance in the dataset. After removing the duplicated samples, we obtained an English training set with 2113 samples, a French training set with 2334 samples, and a German training set with 2238 samples (footnote 4: we re-extracted the features of the datasets and found small differences from the results of our last competition, where there were some errors; the new results are nevertheless comparable to the best results in the competition). The rating of an instance ranges from to ; it is measured as an aggregate function of genre, trust, factuality, bias and spam, and is more fine-grained than that of the LETOR dataset Qin2010 (with 3 ratings).
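The duplicate handling described above can be sketched as follows. How duplicates are identified is not detailed here, so the sketch assumes that each sample carries a hypothetical `site_key` that is equal for duplicated instances; the function name `deduplicate_keep_highest` is also hypothetical.

```python
def deduplicate_keep_highest(samples):
    """Keep one copy per key, carrying the highest rating seen for that key.

    `samples` is an iterable of (site_key, features, rating) tuples.
    """
    best = {}
    for key, features, rating in samples:
        if key not in best or rating > best[key][1]:
            best[key] = (features, rating)
    # Dictionaries preserve insertion order, so first-seen order is kept.
    return [(key, feats, rating) for key, (feats, rating) in best.items()]
```

For instance, two entries for the same site with ratings 2 and 5 collapse into a single entry with rating 5.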
The test dataset includes English instances, French instances and German instances. The organizers of the competition provided the participants with instances sampled from the test dataset, together with their ratings, for optimizing the ranking algorithms. They would then randomly extract another group of instances from the test dataset to test the ranking algorithms for the final competition results. Tab. 1 describes the sampling sets from the test dataset, where the six sampling sets are denoted as
4.2 Experiment Results and Discussions
In the following experiments, we set the number of weak rankers to 100 and use all attribute values as candidate thresholds of the weak ranker unless otherwise specified.
For the bipartite ranking problem, the ranker should rank the positive instances as close to the head of the ranked sequence as possible. The ranking problem becomes easier as the ratio of positive instances to negative ones increases. In the extreme case where all instances are positive, any permutation is regarded as correct. Fig. 4 shows the holdout NDCG (used as the adaptive weighting) for each dichotomizer under the binary coding. For the binary coding, the ratio of the instances decreases with increasing , and the performance of the dichotomizer decreases step by step with .
How can we improve the overall performance according to the actual behavior of the dichotomizers? The key idea is to compensate the dichotomizer with the large . In the 4-partite ranking problem, an instance with rating 3 in the dichotomizer probably cannot be ranked at the head of the sequence because of the disadvantageous situation it encounters. But in other dichotomizers, which perform better than , an instance with rating 3 is likely to be ranked near the head. The instance with rating 3 keeps its advantage when it is given a high weight as compensation. Fig. 5 compares the predefined weighting and the adaptive weighting under the binary coding, where the decreasing weighting amounts to a negative compensation and gives inferior performance compared with the predefined weighting.
Fig. 6 compares the three coding methods under the two decoding methods. For both the predefined weighting and the adaptive weighting, the binary coding outperforms the ternary codings. Moreover, let us compare the lower triangular coding and the upper triangular coding. Interestingly, the lower triangular coding is more effective than the upper triangular coding for the predefined weighting decoding, while the opposite holds for the adaptive weighting decoding. Compared with the upper triangular coding, the lower triangular coding is more easily disturbed by the weight compensation.
The weak ranker determines its output or by comparing the specified attribute of the instance with the threshold. A candidate set of thresholds can be provided for searching for the optimal threshold on the dataset. Fig. 7 shows that the NDCG measure increases slightly when more thresholds are given, where the horizontal axis represents the number of thresholds for each attribute. This can be explained by the fact that the algorithm finds a better threshold when the search region is enlarged.
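One possible way to build such a candidate set for a single attribute is sketched below. How the candidates are chosen is not specified above, so picking them evenly spaced among the sorted distinct attribute values is an assumption of the sketch, as is the function name `candidate_thresholds`.

```python
def candidate_thresholds(values, num_thresholds=None):
    """Pick candidate thresholds from the distinct values of one attribute.

    With num_thresholds=None, every distinct value is a candidate; otherwise
    the candidates are spread evenly over the sorted distinct values.
    """
    distinct = sorted(set(values))
    if num_thresholds is None or num_thresholds >= len(distinct):
        return distinct
    if num_thresholds == 1:
        return [distinct[len(distinct) // 2]]  # single candidate: the median
    step = (len(distinct) - 1) / (num_thresholds - 1)
    return [distinct[round(k * step)] for k in range(num_thresholds)]
```

Calling `candidate_thresholds(range(10), 4)` keeps the values 0, 3, 6 and 9, while omitting `num_thresholds` keeps all ten values, which corresponds to the largest search region.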
In classification, AdaBoost Reyzin2006 methods are known to usually not overfit the training data even as the number of weak classifiers becomes large. RankBoost can be regarded as the application of AdaBoost to pairs of instances. In Fig. 8, we see that the NDCGs vary gently (behaving nearly identically) and resist overfitting, which is consistent with Freund2003.
Finally, Tab. 2 compares Bagging + C4.5 with MultiRanking.ED, which adopts the binary coding and the predefined weighting decoding. Comparing the first and second columns, all rows except the second show that MultiRanking.ED performs better than Bagging + C4.5, which gave the best competition results in DC2010. The middle two columns also present the performance of MultiRanking.ED under different numbers of thresholds.
5 Conclusion
In this study, we address web quality evaluation by combining multiple bipartite pair-wise ranking models through an efficient encoding and decoding strategy. For coding, we present the binary coding and the ternary coding, which extend the ratings to (L − 1)-dimensional vectors. For decoding, we combine the ranking sequences with the predefined weighting and the adaptive weighting. The ECML/PKDD Discovery Challenge 2010 datasets, which contain English, French and German webpage quality tasks, provide a chance to validate our proposed algorithm. We experimentally discussed the factors that may influence the NDCG measure, including the number of weak rankers, the number of thresholds and the different encoding and decoding strategies. The final results show that our algorithm MultiRanking.ED with the binary coding and the predefined weighting decoding surpasses its counterparts and achieves the best performance.
In future work, we will explore ways to combine multiple rank sequences effectively. Another direction is to validate the method on other state-of-the-art datasets.
This work is supported in part by the Innovation Scientists and Technicians Troop Construction Projects of Henan Province under Grant No. 094200510009 and the National Natural Science Foundation of China (NSFC) under Grants No. 61103138 and No. 61005029.
- (1) L. L. Pipino, Y. W. Lee, R. Y. Wang, Data quality assessment, Commun. ACM 45 (4) (2002) 211–218.
- (2) A. A. Benczur, C. Castillo, M. Erdelyi, Z. Gyongyi, J. Masanes, M. Matthews, ECML/PKDD 2010 discovery challenge data set, in: Crawled by the European Archive Foundation, 2010.
- (3) K. Crammer, Y. Singer, Pranking with ranking, in: NIPS, 2002, p. 14.
- (4) P. Li, C. Burges, Q. Wu, McRank: Learning to rank using multiple classification and gradient boosting, in: NIPS, 2007.
- (5) T. Joachims, Optimizing search engines using clickthrough data, in: SIGKDD, 2002.
- (6) Y. Freund, R. Iyer, R. E. Schapire, Y. Singer, An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research 4 (2003) 933–969.
- (7) H. Valizadegan, R. Jin, R. Zhang, J. Mao, Learning to rank by optimizing NDCG measure, in: NIPS, 2009.
- (8) G.-G. Geng, X.-B. Jin, X.-C. Zhang, D.-X. Zhang, Evaluating web content quality via multi-scale features, in: ECML/PKDD 2010 Workshop on Discovery Challenge 2010, 2010.
- (9) L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
- (10) J. R. Quinlan, C4.5: programs for machine learning, Morgan Kaufmann Publishers, 1993.
- (11) T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Letters 27 (2006) 861–874.
- (12) J. Fürnkranz, E. Hüllermeier, S. Vanderlooy, Binary decomposition methods for multipartite ranking, ECML PKDD ’09, 2009, pp. 359–374.
- (13) E. Frank, M. Hall, A simple approach to ordinal classification, in: Proceedings of the 12th European Conference on Machine Learning, ECML ’01, 2001, pp. 145–156.
- (14) T. Hastie, R. Tibshirani, Classification by pairwise grouping, in: NIPS, 1998.
- (15) E. L. Allwein, R. E. Schapire, Y. Singer, Reducing multiclass to binary: A unifying approach for margin classifiers, in: ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000, pp. 9–16.
- (16) N. Nilsson, Learning Machines, McGraw-Hill, 1965.
- (17) T. Qin, T.-Y. Liu, J. Xu, H. Li, LETOR: A benchmark collection for research on learning to rank for information retrieval, in: Information Retrieval Journal, 2010.
- (18) L. Reyzin, R. E. Schapire, How boosting the margin can also boost classifier complexity, in: ICML’06, 2006, pp. 753–760.