Rank Aggregation via Nuclear Norm Minimization
The process of rank aggregation is intimately intertwined with the structure of skew-symmetric matrices. We apply recent advances in the theory and algorithms of matrix completion to skew-symmetric matrices. This combination of ideas produces a new method for ranking a set of items. The essence of our idea is that a rank aggregation describes a partially filled skew-symmetric matrix. We extend an algorithm for matrix completion to handle skew-symmetric data and use that to extract ranks for each item. Our algorithm applies to both pairwise comparison and rating data. Because it is based on matrix completion, it is robust to both noise and incomplete data. We show a formal recovery result for the noiseless case and present a detailed study of the algorithm on synthetic data and Netflix ratings.
|David F. Gleich|
|Sandia National Laboratories (Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.)|
|University of Chicago|
nuclear norm, skew symmetric, rank aggregation
One of the classic data mining problems is to identify the important items in a data set; see Tan and Jin (2004) for an interesting example of how these might be used. For this task, we are concerned with rank aggregation. Given a series of votes on a set of items by a group of voters, rank aggregation is the process of permuting the set of items so that the first element is the best choice in the set, the second element is the next best choice, and so on. In fact, rank aggregation is an old problem and has a history stretching back centuries (Condorcet, 1785); one famous result is that any rank aggregation requires some degree of compromise (Arrow, 1950). Our point in this introduction is not to detail a history of all the possible methods of rank aggregation, but to give some perspective on our approach to the problem.
Direct approaches involve finding a permutation explicitly – for example, computing the Kemeny optimal ranking (Kemeny, 1959) or solving the minimum feedback arc set problem. These problems are NP-hard (Dwork et al., 2001; Ailon et al., 2005; Alon, 2006). An alternate approach is to assign a score to each item and then compute a permutation by ordering the items by their scores, e.g. Saaty (1987). In this manuscript, we focus on the second approach. A key advantage of the computations we propose is that they are convex problems and efficiently solvable.
While the problem of rank aggregation is old, modern applications – such as those found in web applications like Netflix and Amazon – pose new challenges. First, the data collected are usually cardinal measurements on the quality of each item, such as 1–5 stars, received from voters. Second, the voters are neither experts in the rating domain nor experts at producing useful ratings. These properties manifest themselves in a few ways, including skewed and indiscriminate voting behaviors (Ho and Quinn, 2008). We focus on using aggregate pairwise data about items to develop a score for each item that predicts the pairwise data itself. This approach eliminates some of the issues with directly utilizing voters’ ratings, and we argue this point more precisely when we describe the pairwise aggregation methods.
To explain our method, consider a set of $n$ items, labeled from $1$ to $n$. Suppose that each of these items has an unknown intrinsic quality $s_i$, where $s_i > s_j$ implies that item $i$ is better than item $j$. While the $s_i$’s are unknown, suppose we are given a matrix $\mathbf{Y}$ where $Y_{ij} = s_i - s_j$. By finding a rank-2 factorization of $\mathbf{Y}$, for example
$$\mathbf{Y} = \mathbf{s}\mathbf{e}^T - \mathbf{e}\mathbf{s}^T, \tag{1}$$
we can extract unknown scores. The matrix $\mathbf{Y}$ is skew-symmetric and describes any score-based global pairwise ranking. (There are other possible rank-2 factorizations of a skew-symmetric matrix, a point we return to later when we extract the scoring vector.)
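To make the rank-2 structure concrete, the following is a minimal NumPy sketch. It assumes the pairwise matrix has entries $Y_{ij} = s_i - s_j$; the score values are made up for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical scores for n = 5 items (illustrative values only).
s = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
e = np.ones_like(s)

# Pairwise difference matrix: Y[i, j] = s[i] - s[j] = (s e^T - e s^T)[i, j].
Y = np.outer(s, e) - np.outer(e, s)

is_skew = np.allclose(Y, -Y.T)        # Y = -Y^T
rank = np.linalg.matrix_rank(Y)       # rank 2 whenever s is not constant

# Sorting the items by score (best first) gives the rank aggregation.
order = np.argsort(-s)
```

Here `order` lists the item indices from best to worst, which is the permutation a score-based method ultimately reports.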
Thus, given a measured $\hat{\mathbf{Y}}$, the goal is to find a minimum rank approximation of $\hat{\mathbf{Y}}$ that models the elements, and ideally one that is rank-2. Phrased in this way, it is a natural candidate for recent developments in the theory of matrix completion (Candès and Tao, to appear; Recht et al., to appear). In the matrix completion problem, certain elements of the matrix are presumed to be known. The goal is to produce a low-rank matrix that respects these elements – or at least minimizes the deviation from the known elements. One catch, however, is that we require matrix completion over skew-symmetric matrices for pairwise ranking matrices. Thus, we must solve the matrix completion problem inside a structured class of matrices. This task is a novel contribution of our work. Recently, Gross (2010) also developed a technique for matrix completion with Hermitian matrices.
With a “completed” matrix $\mathbf{Y}$, the norm of the residual gives us a certificate for the validity of our fit – an additional piece of information available in this model.
To continue, we briefly summarize our main contributions and our notational conventions.
We propose a new method for computing a rank aggregation based on matrix completion, which is tolerant to noise and incomplete data.
We solve a structured matrix-completion problem over the space of skew-symmetric matrices.
We prove a recovery theorem detailing when our approach will work.
We perform a detailed evaluation of our approach with synthetic data and an anecdotal study with Netflix ratings.
We try to follow standard notation conventions. Matrices are bold, upright roman letters, vectors are bold, lowercase roman letters, and scalars are unbolded roman or Greek letters. The vector $\mathbf{e}$ consists of all ones, and the vector $\mathbf{e}_i$ has a $1$ in the $i$th position and $0$’s elsewhere. Linear maps on matrices are written as script letters. An index set $\Omega$ is a group of index pairs. Each $\omega \in \Omega$ is a pair $(r, s)$ and we assume that the $\omega$’s are numbered arbitrarily, i.e. $\Omega = \{\omega_1, \ldots, \omega_k\}$. Please refer to the notation table for reference.
| Sym. | Interpretation |
|---|---|
| $\mathcal{A}(\cdot)$ | a linear map from a matrix to a vector |
| $\mathbf{e}$ | a vector of all ones |
| $\mathbf{e}_i$ | a vector with 1 in the $i$th entry, 0 elsewhere |
| $\lVert\cdot\rVert_*$ | the nuclear norm |
| $\mathbf{R}$ | a rating matrix (voters-by-items) |
| $\mathbf{Y}$ | a fitted or model pairwise comparison matrix |
| $\hat{\mathbf{Y}}$ | a measured pairwise comparison matrix |
| $\Omega$ | an index set for the known entries of a matrix |

Table: Notation for the paper.

| Mean | Log-odds (all) | Arithmetic Mean (30) |
|---|---|---|
| LOTR III: Return … | LOTR III: Return … | LOTR III: Return … |
| LOTR I: The Fellowship … | LOTR I: The Fellowship … | LOTR I: The Fellowship … |
| LOTR II: The Two … | LOTR II: The Two … | LOTR II: The Two … |
| Lost: Season 1 | Star Wars V: Empire … | Lost: S1 |
| Battlestar Galactica: S1 | Raiders of the Lost Ark | Star Wars V: Empire … |
| Fullmetal Alchemist | Star Wars IV: A New Hope | Battlestar Galactica: S1 |
| Trailer Park Boys: S4 | Shawshank Redemption | Star Wars IV: A New Hope |
| Trailer Park Boys: S3 | Star Wars VI: Return … | LOTR III: Return … |
| Tenchi Muyo! … | LOTR III: Return … | Raiders of the Lost Ark |
| Shawshank Redemption | The Godfather | The Godfather |
| Veronica Mars: S1 | Toy Story | Shawshank Redemption |
| Ghost in the Shell: S2 | Lost: S1 | Star Wars VI: Return … |
| Arrested Development: S2 | Schindler’s List | Gladiator |
| Simpsons: S6 | Finding Nemo | Simpsons: S5 |
| Inu-Yasha | CSI: S4 | Schindler’s List |

Table: The top 15 movies from Netflix generated by our ranking method (middle and right). The left list is the ranking using the mean rating of each movie and is emblematic of the problems global ranking methods face when infrequently compared items rocket to the top. We prefer the middle and right lists. See the discussion of the Netflix results and the residual figure for information about the conditions and additional discussion. LOTR III appears twice because of the two DVD editions, theatrical and extended.
Before proceeding further, let us outline the rest of the paper. First, we describe a few methods to take voter-item ratings and produce an aggregate pairwise comparison matrix. Additionally, we argue why pairwise aggregation is a superior technique when the goal is to produce an ordered list of the alternatives. Next, we describe formulations of the noisy matrix completion problem using the nuclear norm. In our setting, the lasso formulation is the best choice, and we use it throughout the remainder. We then briefly describe algorithms for matrix completion and focus on the svp algorithm (Jain et al., 2010). We show that the svp algorithm preserves skew-symmetric structure. This process involves studying the singular value decomposition of skew-symmetric matrices. Thus, by the end of that section, we have shown how to formulate and solve for a scoring vector based on the nuclear norm. The following sections describe alternative approaches and show our recovery results. At the end, we show our experimental results. In summary, our overall methodology is
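Ratings ($\mathbf{R}$) → Pairwise comparisons ($\hat{\mathbf{Y}}$) → Ranking scores ($\mathbf{s}$) → Rank aggregation (by sorting the scores)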
Finally, we provide our computational and experimental codes so that others may reproduce our results:
To begin, we describe methods to aggregate the votes of many voters, given by the matrix $\mathbf{R}$, into a measured pairwise comparison matrix $\hat{\mathbf{Y}}$. These methods have been well-studied in statistics (David, 1988). In the next section, we show how to extract a score for each item from the matrix $\hat{\mathbf{Y}}$.
Let $\mathbf{R}$ be a voter-by-item matrix. This matrix has rows corresponding to each of the voters and columns corresponding to the items of the dataset. In all of the applications we explore, the matrix $\mathbf{R}$ is highly incomplete. That is, only a few items are rated by each voter. Usually all the items have a few votes, but there is no consistency in the number of ratings per item.
Instead of using $\mathbf{R}$ directly, we compute a pairwise aggregation. Pairwise comparisons have a lengthy history, dating back to the first half of the previous century (Kendall and Smith, 1940). They also have many nice properties. First, Miller (1956) observes that most people can evaluate only 5 to 9 alternatives at a time. This fact may relate to the common choice of a 5-star rating (e.g. the ones used by Amazon, eBay, Netflix, YouTube). Thus, comparing pairs of movies is easier than ranking a set of movies. Furthermore, only pairwise comparisons are possible in certain settings such as tennis tournaments. Pairwise comparison methods are thus natural for analyzing ranking data. Second, pairwise comparisons are a relative measure and help reduce bias from the rating scale. For these reasons, pairwise comparison methods have been popular in psychology, statistics, and social choice theory (David, 1988; Arrow, 1950). Such methods have also been adopted by the learning to rank community; see the contents of Li et al. (2008). A final advantage of pairwise methods is that they are much more complete than the ratings matrix. For Netflix, $\mathbf{R}$ is 99% incomplete, whereas $\hat{\mathbf{Y}}$ is only 0.22% incomplete and most entries are supported by many comparisons. See the histogram figure for information about the number of pairwise comparisons in Netflix and MovieLens.
More critically, an incomplete array of user-by-product ratings is a strange matrix – not every 2-dimensional array of numbers is best viewed as a matrix – and using the rank of this matrix (or its convex relaxation) as a key feature in the modeling needs to be done with care. Consider that if, instead of rating values 1 to 5, the values 0 to 4 are used to represent the exact same information, the rank of this new rating matrix can change. Furthermore, whether we use a rating scale where 1 is the best rating and 5 is the worst, or one where 5 is the best and 1 is the worst, a low-rank model would give the exact same fit with the same input values, even though the connotations of the numbers are reversed.
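The sensitivity of the rank to the rating scale can be seen on a tiny example. The sketch below uses a hypothetical, fully observed 4-voter by 3-item rating array (the values are invented for illustration) and shifts the scale from 1 to 5 down to 0 to 4.

```python
import numpy as np

# Hypothetical ratings on a 1-5 scale; every row is a multiple of (1, 2, 1).
R_1to5 = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1],
                   [2, 4, 2]], dtype=float)

# The same information on a 0-4 scale.
R_0to4 = R_1to5 - 1.0

rank_1to5 = np.linalg.matrix_rank(R_1to5)   # 1
rank_0to4 = np.linalg.matrix_rank(R_0to4)   # 2

# Per-voter pairwise score differences are unaffected by the shift.
D_1to5 = R_1to5[:, :, None] - R_1to5[:, None, :]
D_0to4 = R_0to4[:, :, None] - R_0to4[:, None, :]
same_differences = np.allclose(D_1to5, D_0to4)
```

The rank moves from 1 to 2 under a relabeling of the scale, while the score differences that feed a pairwise comparison matrix do not change at all.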
On the other hand, the pairwise ranking matrix that we construct below is invariant under monotone transformation of the rating values and depends only on the degree of relative preference of one alternative over another. It circumvents the previously mentioned pitfalls and is a more principled way to employ a rank/nuclear norm model.
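We now describe five techniques to build an aggregate pairwise matrix $\hat{\mathbf{Y}}$ from the rating matrix $\mathbf{R}$. Let $\alpha$ denote the index of a voter, and $i$ and $j$ the indices of two items. The entries of $\mathbf{R}$ are $R_{\alpha i}$. To each voter, we associate a pairwise comparison matrix $\mathbf{Y}^{\alpha}$. The aggregation is usually computed by something like a mean over $\mathbf{Y}^{\alpha}$. In the formulas below, sums and probabilities run over the voters $\alpha$ who have rated both $i$ and $j$.
Arithmetic mean of score differences The score difference is $Y^{\alpha}_{ij} = R_{\alpha i} - R_{\alpha j}$. The arithmetic mean of all voters who have rated both $i$ and $j$ is
$$\hat{Y}_{ij} = \frac{\sum_{\alpha} (R_{\alpha i} - R_{\alpha j})}{\#\{\alpha \mid R_{\alpha i}, R_{\alpha j} \text{ exist}\}}.$$
These comparisons are translation invariant.
Geometric mean of score ratios Assuming $\mathbf{R} > 0$, the score ratio refers to $Y^{\alpha}_{ij} = R_{\alpha i} / R_{\alpha j}$. The (log) geometric mean over all voters who have rated both $i$ and $j$ is
$$\hat{Y}_{ij} = \frac{\sum_{\alpha} (\log R_{\alpha i} - \log R_{\alpha j})}{\#\{\alpha \mid R_{\alpha i}, R_{\alpha j} \text{ exist}\}}.$$
These are scale invariant.
Binary comparison Here $Y^{\alpha}_{ij} = \operatorname{sign}(R_{\alpha i} - R_{\alpha j})$. Its average is the difference between the probability that alternative $i$ is preferred to $j$ and the probability of the reverse:
$$\hat{Y}_{ij} = \Pr\{\alpha \mid R_{\alpha i} > R_{\alpha j}\} - \Pr\{\alpha \mid R_{\alpha i} < R_{\alpha j}\}.$$
These are invariant to a monotone transformation.
Strict binary comparison This method is almost the same as the last method, except that we eliminate cases where users rated movies equally. That is,
$$Y^{\alpha}_{ij} = \begin{cases} 1 & R_{\alpha i} > R_{\alpha j}, \\ \text{(excluded)} & R_{\alpha i} = R_{\alpha j}, \\ -1 & R_{\alpha i} < R_{\alpha j}. \end{cases}$$
Again, the average has a similar interpretation to binary comparison, but only among people who expressed a strict preference for one item over the other. Equal ratings are ignored.
Logarithmic odds ratio This idea translates binary comparison to a logarithmic scale:
$$\hat{Y}_{ij} = \log \frac{\Pr\{\alpha \mid R_{\alpha i} \ge R_{\alpha j}\}}{\Pr\{\alpha \mid R_{\alpha i} \le R_{\alpha j}\}}.$$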
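As an illustration of the aggregation step, here is a minimal NumPy sketch of three of the techniques above (arithmetic mean, log geometric mean, binary comparison). The function name `pairwise_aggregate`, the use of `np.nan` for missing ratings, and the sign convention (a positive entry means item $i$ is preferred to item $j$) are assumptions made for this example, not the authors' code.

```python
import numpy as np

def pairwise_aggregate(R, method="am"):
    """Aggregate a voters-by-items rating array R (np.nan = not rated).

    Returns (Y_hat, counts). Y_hat[i, j] is the aggregate preference of item i
    over item j, or np.nan if no voter rated both. counts[i, j] is the number
    of voters who rated both items.
    """
    n_items = R.shape[1]
    rated = ~np.isnan(R)
    Y_hat = np.full((n_items, n_items), np.nan)
    counts = np.zeros((n_items, n_items), dtype=int)
    for i in range(n_items):
        for j in range(n_items):
            both = rated[:, i] & rated[:, j]
            counts[i, j] = both.sum()
            if i == j:
                Y_hat[i, j] = 0.0           # skew-symmetric: zero diagonal
                continue
            if counts[i, j] == 0:
                continue                     # entry stays missing
            ri, rj = R[both, i], R[both, j]
            if method == "am":               # arithmetic mean of differences
                Y_hat[i, j] = np.mean(ri - rj)
            elif method == "gm":             # log geometric mean of ratios
                Y_hat[i, j] = np.mean(np.log(ri) - np.log(rj))
            elif method == "bc":             # binary comparison
                Y_hat[i, j] = np.mean(np.sign(ri - rj))
            else:
                raise ValueError(f"unknown method: {method}")
    return Y_hat, counts

# Hypothetical ratings from 4 voters on 3 items.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 2.0],
              [np.nan, 2.0, 1.0],
              [5.0, 4.0, 3.0]])
Y_am, counts = pairwise_aggregate(R, "am")
```

The `counts` array is what a minimum-comparison threshold would be applied to: entries of `Y_am` supported by too few voters can be treated as missing before the completion step.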
Figure: A histogram of the number of pairwise comparisons between movies in MovieLens (left) and Netflix (right). The number of pairwise comparisons is the number of users with ratings on both movies. These histograms show that most items have more than a small number of comparisons between them. For example, 18.5% and 34.67% of all possible pairwise entries have more than 30 comparisons between them. Largely speaking, this figure justifies dropping infrequent ratings from the comparison. This step allows us to take advantage of the ability of the matrix-completion methods to deal with incomplete data.
Thus far, we have seen how to compute an aggregate pairwise matrix $\hat{\mathbf{Y}}$ from ratings data. While $\hat{\mathbf{Y}}$ has fewer missing entries than $\mathbf{R}$ – roughly 1-80% missing instead of almost 99% missing – it is still not nearly complete. In this section, we discuss how to use the theory of matrix completion to estimate the scoring vector underlying the comparison matrix $\hat{\mathbf{Y}}$. These same techniques apply even when $\hat{\mathbf{Y}}$ is not computed from ratings and is measured through direct pairwise comparisons.
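Let us now state the matrix completion problem formally (Candès and Recht, 2009; Recht et al., to appear). Given a matrix $\mathbf{A}$ where only a subset of the entries are known, the goal is to find the lowest rank matrix $\mathbf{X}$ that agrees with $\mathbf{A}$ in all the non-zeros. Let $\Omega$ be the index set corresponding to the known entries of $\mathbf{A}$. Now define $\mathcal{A}(\mathbf{X})$ as a linear map corresponding to the elements of $\Omega$, i.e. $\mathcal{A}(\mathbf{X})$ is a vector where the $i$th element is defined to be
$$[\mathcal{A}(\mathbf{X})]_i = X_{\omega_i}, \tag{2}$$
and where we interpret $X_{\omega_i}$ as the entry of the matrix $\mathbf{X}$ for the index pair $(r, s) = \omega_i$. Finally, let $\mathbf{b} = \mathcal{A}(\mathbf{A})$ be the values of the specified entries of the matrix $\mathbf{A}$. This idea of matrix completion corresponds with the solution of
$$\text{minimize } \operatorname{rank}(\mathbf{X}) \quad \text{subject to } \mathcal{A}(\mathbf{X}) = \mathbf{b}. \tag{3}$$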
Unfortunately, like the direct methods at permutation minimization, this approach is NP-hard (Vandenberghe and Boyd, 1996).
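To make the problem tractable, an increasingly well-known technique is to replace the rank function with the nuclear norm (Fazel, 2002). For a matrix $\mathbf{A}$, the nuclear norm is defined
$$\lVert\mathbf{A}\rVert_* = \sum_{i=1}^{\operatorname{rank}(\mathbf{A})} \sigma_i(\mathbf{A}), \tag{4}$$
where $\sigma_i(\mathbf{A})$ is the $i$th singular value of $\mathbf{A}$. The nuclear norm has a few other names: the Ky-Fan $n$-norm, the Schatten $1$-norm, and the trace norm (when applied to symmetric matrices), but we will just use the term nuclear norm here. It is a convex underestimator of the rank function on the unit spectral-norm ball $\{\mathbf{A} : \lVert\mathbf{A}\rVert_2 \le 1\}$, i.e. $\lVert\mathbf{A}\rVert_* \le \operatorname{rank}(\mathbf{A})\,\lVert\mathbf{A}\rVert_2 \le \operatorname{rank}(\mathbf{A})$, and is the largest convex function with this property. Because the nuclear norm is convex,
$$\text{minimize } \lVert\mathbf{X}\rVert_* \quad \text{subject to } \mathcal{A}(\mathbf{X}) = \mathbf{b} \tag{5}$$
is a convex relaxation of (3) analogous to how the $1$-norm is a convex relaxation of the $0$-norm.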
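The underestimation property can be checked numerically. The following minimal sketch computes the nuclear norm as the sum of singular values for a randomly generated rank-2 matrix scaled into the unit spectral-norm ball; the matrix itself is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary rank-2 example, scaled so that its spectral norm is 1.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))
A /= np.linalg.norm(A, 2)

sigma = np.linalg.svd(A, compute_uv=False)
nuclear = sigma.sum()                  # sum of the singular values
rank = np.linalg.matrix_rank(A)

# On the unit spectral-norm ball the nuclear norm never exceeds the rank.
underestimates = nuclear <= rank + 1e-12
```

NumPy also exposes this quantity directly as `np.linalg.norm(A, 'nuc')`.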
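In (5) we have $\mathcal{A}(\mathbf{X}) = \mathbf{b}$, which is called a noiseless completion problem. Noisy completion problems only require $\mathcal{A}(\mathbf{X}) \approx \mathbf{b}$. We present four possibilities inspired by similar approaches in compressed sensing. For the compressed sensing problem with noise:
$$\text{minimize } \lVert\mathbf{x}\rVert_1 \quad \text{subject to } \mathbf{A}\mathbf{x} \approx \mathbf{b}$$
there are four well known formulations: lasso (Tibshirani, 1996), qp (Chen et al., 1998), ds (Candès and Tao, 2007) and bpdn (Fuchs, 2004). For the noisy matrix completion problem, the same variations apply, but with the nuclear norm taking the place of the $1$-norm. With regularization parameters $\tau$, $\lambda$, and $\sigma$, these include:
lasso: $\text{minimize } \lVert\mathcal{A}(\mathbf{X}) - \mathbf{b}\rVert_2 \ \text{ subject to } \lVert\mathbf{X}\rVert_* \le \tau$
qp (Mazumder et al., 2009): $\text{minimize } \lVert\mathcal{A}(\mathbf{X}) - \mathbf{b}\rVert_2^2 + \lambda \lVert\mathbf{X}\rVert_*$
bpdn (Mazumder et al., 2009): $\text{minimize } \lVert\mathbf{X}\rVert_* \ \text{ subject to } \lVert\mathcal{A}(\mathbf{X}) - \mathbf{b}\rVert_2 \le \sigma$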
Returning to rank-aggregation, recall the perfect case for the matrix $\mathbf{Y}$: there is an unknown quality $s_i$ associated with each item $i$ and $\mathbf{Y} = \mathbf{s}\mathbf{e}^T - \mathbf{e}\mathbf{s}^T$. We now assume that the pairwise comparison matrix $\hat{\mathbf{Y}}$ computed in the previous section approximates the true $\mathbf{Y}$. Given such a $\hat{\mathbf{Y}}$, our goal is to complete it with a rank-2 matrix. Thus, our objective:
$$\text{minimize } \lVert\mathcal{A}(\mathbf{Y}) - \mathbf{b}\rVert_2 \quad \text{subject to } \lVert\mathbf{Y}\rVert_* \le 2, \ \mathbf{Y} = -\mathbf{Y}^T \tag{6}$$
where $\mathcal{A}(\cdot)$ corresponds to the filled entries of $\hat{\mathbf{Y}}$. We adopt the lasso formulation because we want $\operatorname{rank}(\mathbf{Y}) = 2$, and $\lVert\cdot\rVert_*$ underestimates rank as previously mentioned. This problem only differs from the standard matrix completion problem in one regard: the skew-symmetric constraint. With a careful choice of solver, this additional constraint comes “for-free” (with a few technical caveats). It should also be possible to use the skew-Lanczos process to exploit the skew-symmetry in the SVD computation.
Algorithms for matrix completion seem to sprout like wildflowers in spring: Lee and Bresler (2009); Cai et al. (2008); Toh and Yun (2009); Dai and Milenkovic (2009); Keshavan and Oh (2009); Mazumder et al. (2009); Jain et al. (2010). Each algorithm fills a slightly different niche, or improves a performance measure compared to its predecessors.
We first explored crafting our own solver by adapting projection and thresholding ideas used in these algorithms to the skew-symmetrically constrained variant. However, we realized that many algorithms do not require any modification to solve the problem with the skew-symmetric constraint. This result follows from properties of skew-symmetric matrices we show below.
Thus, we use the svp algorithm by Jain et al. (2010). For the matrix completion problem, they found their implementation outperformed many competitors. It is scalable and handles a lasso-like objective for a fixed rank approximation. For completeness, we restate the svp procedure in Algorithm 1.
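To give a feel for the iteration, the following is a minimal NumPy sketch of a singular value projection loop in the spirit of Jain et al. (2010): take a gradient step on the known entries, then project back to rank $k$ with a truncated SVD. It is not their implementation. The function name `svp_complete`, its parameters, the conservative default step length of 1, the stopping rule, and the dense SVD are all assumptions made for illustration.

```python
import numpy as np

def svp_complete(rows, cols, b, n, k=2, step=1.0, tol=1e-10, max_iter=500):
    """Sketch of singular value projection for n-by-n matrix completion.

    rows, cols, b: the known entries, X[rows[t], cols[t]] should match b[t].
    k: target rank. step: gradient step length (assumed default of 1).
    """
    X = np.zeros((n, n))                       # initial iterate
    for _ in range(max_iter):
        residual = X[rows, cols] - b           # A(X) - b
        if np.linalg.norm(residual) <= tol * np.linalg.norm(b):
            break
        G = X.copy()
        G[rows, cols] -= step * residual       # gradient step on known entries
        U, sig, Vt = np.linalg.svd(G)
        X = (U[:, :k] * sig[:k]) @ Vt[:k]      # project onto rank-k matrices
    return X

# Hypothetical test problem: a score-difference matrix with a symmetric
# pattern of known entries (if (i, j) is known, so is (j, i)).
rng = np.random.default_rng(1)
n = 12
s = rng.random(n)
Y = np.subtract.outer(s, s)                    # Y[i, j] = s[i] - s[j]
upper = np.triu(rng.random((n, n)) < 0.7, 1)
mask = upper | upper.T
rows, cols = np.nonzero(mask)
b = Y[rows, cols]
X = svp_complete(rows, cols, b, n)
```

With a step length of 1 each iteration fills the unknown entries with the current estimate and truncates, so the residual on the known entries does not increase. A larger step can speed up convergence but is not explored in this sketch.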
If the constraint comes from a skew-symmetric matrix, then this algorithm produces a skew-symmetric matrix as well. Showing this involves a few properties of skew-symmetric matrices and two lemmas.
We begin by stating a few well-known properties of skew-symmetric matrices. Let $\mathbf{A} = -\mathbf{A}^T$ be skew-symmetric. Then all the eigenvalues of $\mathbf{A}$ are pure-imaginary and come in complex-conjugate pairs. Thus, a skew-symmetric matrix must always have even rank. Let $\mathbf{B}$ be a square real-valued matrix, then the closest skew-symmetric matrix to $\mathbf{B}$ (in any norm that is invariant under transposition, such as the Frobenius or spectral norm) is $\mathbf{A} = (\mathbf{B} - \mathbf{B}^T)/2$. These results have elementary proofs. We continue by characterizing the singular value decomposition of a skew-symmetric matrix.
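Lemma 1
Let $\mathbf{A} = -\mathbf{A}^T$ be an $n \times n$ skew-symmetric matrix with eigenvalues $\imath\lambda_1, -\imath\lambda_1, \imath\lambda_2, -\imath\lambda_2, \ldots, \imath\lambda_j, -\imath\lambda_j$, where $\lambda_i > 0$ and $j = \lfloor n/2 \rfloor$. Then the SVD of $\mathbf{A}$ is given by
$$\mathbf{A} = \mathbf{U} \operatorname{diag}(\lambda_1, \lambda_1, \lambda_2, \lambda_2, \ldots, \lambda_j, \lambda_j)\, \mathbf{V}^T$$
for $\mathbf{U}$ and $\mathbf{V}$ given in the proof.
Using the Murnaghan-Wintner form of a real matrix (Murnaghan and Wintner, 1931), we can write
$$\mathbf{A} = \mathbf{X}\mathbf{T}\mathbf{X}^T$$
for a real-valued orthogonal matrix $\mathbf{X}$ and real-valued block-upper-triangular matrix $\mathbf{T}$, with $2$-by-$2$ blocks along the diagonal. Due to this form, $\mathbf{T}$ must also be skew-symmetric. Thus, it is a block-diagonal matrix that we can permute to the form:
$$\mathbf{T} = \operatorname{diag}\left( \begin{bmatrix} 0 & \lambda_1 \\ -\lambda_1 & 0 \end{bmatrix}, \begin{bmatrix} 0 & \lambda_2 \\ -\lambda_2 & 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 & \lambda_j \\ -\lambda_j & 0 \end{bmatrix} \right),$$
with one additional zero diagonal entry when $n$ is odd.
Note that the SVD of the matrix
$$\begin{bmatrix} 0 & \lambda \\ -\lambda & 0 \end{bmatrix} \quad \text{is} \quad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}.$$
We can use this expression to complete the theorem:
$$\mathbf{A} = \underbrace{\mathbf{X} \operatorname{diag}\left( \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \ldots \right)}_{\mathbf{U}} \operatorname{diag}(\lambda_1, \lambda_1, \ldots, \lambda_j, \lambda_j) \underbrace{\operatorname{diag}\left( \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}, \ldots \right) \mathbf{X}^T}_{\mathbf{V}^T}.$$
Both the matrices $\mathbf{U}$ and $\mathbf{V}$ are real and orthogonal. Thus, this form yields the SVD of $\mathbf{A}$.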
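The properties above are easy to observe numerically. The sketch below builds a skew-symmetric matrix as the skew part of a random matrix (an arbitrary example) and checks that its eigenvalues are purely imaginary and that its singular values come in equal pairs whose magnitudes match the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = (B - B.T) / 2                              # skew-symmetric part of B

eigvals = np.linalg.eigvals(A)                 # purely imaginary, +/- pairs
sigma = np.linalg.svd(A, compute_uv=False)     # sorted in decreasing order
pairs = sigma.reshape(-1, 2)                   # consecutive singular values

real_parts_vanish = np.allclose(eigvals.real, 0.0, atol=1e-10)
values_are_paired = np.allclose(pairs[:, 0], pairs[:, 1])
magnitudes_match = np.allclose(np.sort(np.abs(eigvals.imag))[::-1], sigma)
```

Because the singular values are paired, any truncation that keeps an even number of them keeps whole pairs, which is the observation the next result relies on.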
We now use this lemma to show that – under fairly general conditions – the best rank-$k$ approximation to a skew-symmetric matrix is also skew-symmetric.
Lemma 2
Let $\mathbf{A}$ be an $n$-by-$n$ skew-symmetric matrix, and let $k$ be even. Suppose there is a gap between the $k$th and $(k+1)$-st singular values of $\mathbf{A}$. Then the best rank-$k$ approximation of $\mathbf{A}$ in an orthogonally invariant norm is also skew-symmetric.
This lemma follows fairly directly from Lemma 1. Recall that the best rank-$k$ approximation of $\mathbf{A}$ in an orthogonally invariant norm is given by the $k$ largest singular values and vectors. By assumption of the theorem, there is a gap in the spectrum between the $k$th and $(k+1)$-st singular value. Thus, taking the SVD form from Lemma 1 and truncating to the $k$ largest singular values produces a skew-symmetric matrix.
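A small numerical check of this statement, as a sketch: truncating the SVD of a random skew-symmetric matrix at an even rank gives a skew-symmetric result, while truncating at an odd rank splits a singular-value pair and does not. The helper name `best_rank_k` and the random test matrix are assumptions for illustration.

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A via the truncated SVD."""
    U, sig, Vt = np.linalg.svd(A)
    return (U[:, :k] * sig[:k]) @ Vt[:k]

rng = np.random.default_rng(3)
B = rng.standard_normal((8, 8))
A = B - B.T                                    # skew-symmetric

A2 = best_rank_k(A, 2)    # even rank: keeps a whole singular-value pair
A3 = best_rank_k(A, 3)    # odd rank: splits the second pair

skew_err_even = np.linalg.norm(A2 + A2.T)      # close to machine precision
skew_err_odd = np.linalg.norm(A3 + A3.T)       # clearly nonzero
```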
Finally, we can use this second result to show that the svp algorithm for the lasso problem preserves skew-symmetry in all of its iterates.
Theorem 3
Given a set of skew-symmetric constraints $\mathcal{A}(\cdot) = \mathbf{b}$, the solution of the lasso problem from the svp solver is a skew-symmetric matrix if the target rank is even and the dominant singular values stay separated as in the previous lemma.
In this proof, we revert to the notation and use to denote the matrix with non-zeros in and values from . We proceed by induction on the iterates generated by the svp algorithm. Clearly the initial iterate is skew-symmetric. In step 3, we compute the SVD of a skew-symmetric matrix. The result, which is the next iterate, is skew-symmetric based on the previous lemma and conditions of this theorem.
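The svp solver thus solves (6) for a fixed rank problem. A final step is to extract the scoring vector $\mathbf{s}$ from a rank-2 singular value decomposition. If we had the exact matrix $\mathbf{Y}$, then $\frac{1}{n}\mathbf{Y}\mathbf{e} = \mathbf{s} - \frac{\mathbf{s}^T\mathbf{e}}{n}\mathbf{e}$, which yields the score vector centered around 0. Using a simple result noted by Langville and Meyer (forthcoming), $\mathbf{s} = \frac{1}{n}\mathbf{Y}\mathbf{e}$ is also the best least-squares approximation to $\mathbf{s}$ in the case that $\mathbf{Y}$ is not an exact pairwise difference matrix. Formally, $\frac{1}{n}\mathbf{Y}\mathbf{e} = \operatorname{argmin}_{\mathbf{s}} \lVert \mathbf{Y} - (\mathbf{s}\mathbf{e}^T - \mathbf{e}\mathbf{s}^T) \rVert$. The outcome that a rank-2 $\mathbf{Y}$ from svp is not of the form $\mathbf{s}\mathbf{e}^T - \mathbf{e}\mathbf{s}^T$ is quite possible because there are many rank-2 skew-symmetric matrices that do not have $\mathbf{e}$ as a factor. However, the above discussion justifies using $\mathbf{s} = \frac{1}{n}\mathbf{Y}\mathbf{e}$ derived from this completed matrix.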
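The score extraction step is a single matrix-vector product. The sketch below assumes the completed matrix has the convention $Y_{ij} = s_i - s_j$; the helper name `extract_scores` and the example scores are hypothetical.

```python
import numpy as np

def extract_scores(Y):
    """Scores s = (1/n) Y e from a completed pairwise matrix Y."""
    n = Y.shape[0]
    return Y @ np.ones(n) / n

# Exact case: the result is the true score vector centered around 0.
s_true = np.array([3.0, 1.0, 2.0, 5.0])        # hypothetical scores
Y = np.subtract.outer(s_true, s_true)           # Y = s e^T - e s^T
s_hat = extract_scores(Y)

# Inexact case: a skew-symmetric perturbation of Y.
rng = np.random.default_rng(4)
N = rng.standard_normal((4, 4))
Y_noisy = Y + 0.1 * (N - N.T)
s_noisy = extract_scores(Y_noisy)

def misfit(s):
    """Frobenius-norm distance from Y_noisy to the difference matrix of s."""
    return np.linalg.norm(Y_noisy - np.subtract.outer(s, s))
```

Centering does not affect the ordering, so sorting `s_hat` in decreasing order ranks the items exactly as `s_true` does. In the inexact case, `misfit(s_noisy)` is no larger than the misfit of any other score vector when the norm is the Frobenius norm.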
Our complete ranking procedure is given by Algorithm 2.
Now, we briefly compare our approach with other techniques to compute ranking vectors from pairwise comparison data. An obvious approach is to find the least-squares solution $\min_{\mathbf{s}} \sum_{(i,j) \in \Omega} (\hat{Y}_{ij} - (s_i - s_j))^2$. This is a linear least squares method, and is exactly what Massey (1997) proposed for ranking sports teams. The related Colley method introduces a bit of regularization into the least-squares problem (Colley, 2002). By way of comparison, the matrix completion approach has the same ideal objective; however, we compute solutions using a two-stage process: first complete the matrix, and then extract scores.
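For comparison, the direct least-squares fit over the known entries can be sketched in a few lines of NumPy. The function name `least_squares_scores` and the toy data are assumptions for illustration; this is a generic linear least-squares formulation, not code from Massey (1997).

```python
import numpy as np

def least_squares_scores(rows, cols, values, n):
    """Fit s minimizing sum_t (values[t] - (s[rows[t]] - s[cols[t]]))^2."""
    m = len(values)
    M = np.zeros((m, n))
    M[np.arange(m), rows] = 1.0                # +1 for item i
    M[np.arange(m), cols] -= 1.0               # -1 for item j
    s, *_ = np.linalg.lstsq(M, values, rcond=None)
    return s - s.mean()                        # scores are defined up to a shift

# Hypothetical noiseless comparisons on a connected set of pairs.
s_true = np.array([0.2, 0.9, 0.5, 0.1])
rows = np.array([0, 1, 2, 0])
cols = np.array([1, 2, 3, 3])
values = s_true[rows] - s_true[cols]
s_fit = least_squares_scores(rows, cols, values, n=4)
```

The scores are only determined up to an additive constant, which is why the result is centered before it is returned.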
A related methodology with skew-symmetric matrices underlies recent developments in the application of Hodge theory to rank aggregation (Jiang et al., 2010). By analogy with the Hodge decomposition of a vector space, they propose a decomposition of pairwise rankings into consistent, globally inconsistent, and locally inconsistent pieces. Our approach differs because our algorithm applies without restriction on the comparisons. Freeman (1997) also uses an SVD of a skew-symmetric matrix to discover a hierarchical structure in a social network.
We know of two algorithms to directly estimate the item value from ratings (de Kerchov and van Dooren, 2007; Ho and Quinn, 2008). Both of these methods include a technique to model voter behavior. They find that skewed behaviors and inconsistencies in the ratings require these adjustments. In contrast, we eliminate these problems by using the pairwise comparison matrix. Approaches using a matrix or tensor factorization of the rating matrix directly often have to determine a rank empirically (Rendle et al., 2009).
The problem with the mean rating from Netflix in the table of top movies is often corrected by requiring a minimum number of ratings on an item. For example, IMDB builds its top-250 movie list based on a Bayesian estimate of the mean and requires a minimum number of ratings (imdb.com/chart/top). Choosing this parameter is problematic as it directly excludes items. In contrast, choosing the minimum number of comparisons to support an entry in $\hat{\mathbf{Y}}$ may be easier to justify.
A hallmark of the recent developments on matrix completion is the existence of theoretical recoverability guarantees (see Candès and Recht (2009), for example). These guarantees give conditions under which the solution to the optimization problems posed earlier is, or is near, the low-rank matrix from which the samples originated. In this section, we apply a recent theoretical insight into matrix completion based on operator bases to our problem of recovering a scoring vector from a skew-symmetric matrix (Gross, 2010). We only treat the noiseless problem to present a simplified analysis. Also, the notation in this section differs slightly from the rest of the manuscript, in order to match the statements in Gross (2010) better. In particular, $\Omega$ is not necessarily the index set, $\imath$ represents $\sqrt{-1}$, and most of the results are for the complex field.
The goal of this section is to apply Theorem 3 from Gross (2010) to skew-symmetric matrices arising from score difference vectors. We restate that theorem for reference.
Theorem 4 (Theorem 3, Gross (2010))
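Let $\mathbf{A}$ be a rank-$r$ Hermitian matrix with coherence $\nu$ with respect to an operator basis $\{\mathbf{W}_i\}_{i=1}^{n^2}$. Let $\Omega \subset [1, n^2]$ be a random set of size $|\Omega| > O(n r \nu (1+\beta)(\log n)^2)$. Then the solution of
$$\text{minimize } \lVert\mathbf{X}\rVert_* \quad \text{subject to } \operatorname{trace}(\mathbf{X}^* \mathbf{W}_i) = \operatorname{trace}(\mathbf{A}^* \mathbf{W}_i), \ i \in \Omega$$
is unique and is equal to $\mathbf{A}$ with probability at least $1 - n^{-\beta}$.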
The definition of coherence follows shortly. On the surface, this theorem is useless for our application. The matrix we wish to complete is not Hermitian; it is skew-symmetric. However, given a real-valued skew-symmetric matrix $\mathbf{Y}$, the matrix $\imath\mathbf{Y}$ is Hermitian; and hence, we will work to apply this theorem to this particular Hermitian matrix. Again, we adopt this approach for simplicity. It is likely that a statement of Theorem 4 with Hermitian replaced with skew-Hermitian also holds, although verifying this would require a reproduction of the proof from Gross (2010).
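The observation that multiplying a real skew-symmetric matrix by $\sqrt{-1}$ gives a Hermitian matrix can be checked directly. The sketch below uses a hypothetical centered score vector and also confirms that the move to the Hermitian matrix leaves the singular values, and hence the nuclear norm, unchanged.

```python
import numpy as np

s = np.array([0.5, -1.0, 0.25, 0.25])          # hypothetical centered scores
Y = np.subtract.outer(s, s)                     # real skew-symmetric
H = 1j * Y                                      # candidate Hermitian matrix

is_hermitian = np.allclose(H, H.conj().T)
eigs = np.linalg.eigvalsh(H)                    # real eigenvalues of H
sing = np.linalg.svd(Y, compute_uv=False)       # singular values of Y

same_spectrum = np.allclose(np.sort(np.abs(eigs))[::-1], sing, atol=1e-10)
same_nuclear_norm = np.isclose(np.linalg.norm(H, 'nuc'), np.linalg.norm(Y, 'nuc'))
```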
The following theorem gives us a condition for recovering the score vector using matrix completion. As stated, this theorem is not particularly useful because $\mathbf{s}$ may be recovered from noiseless measurements by exploiting the special structure of the rank-2 matrix $\mathbf{Y}$. For example, if we know $Y_{ij} = s_i - s_j$ then given $s_i$ we can find $s_j$. This argument may be repeated with an arbitrary starting point as long as the known index set corresponds to a connected set over the indices. Instead we view the following theorem as providing intuition for the noisy problem.
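The propagation argument for the noiseless case can be sketched as a breadth-first traversal of the comparison graph. The function name `propagate_scores`, the dictionary input format, and the choice to fix the starting score at 0 are assumptions for this example.

```python
from collections import deque
import numpy as np

def propagate_scores(known, n, start=0):
    """Recover scores from noiseless entries known[(i, j)] = s_i - s_j.

    Fixes s[start] = 0 and walks the comparison graph. Items that are not
    connected to `start` are left as np.nan.
    """
    neighbors = {i: [] for i in range(n)}
    for (i, j), y in known.items():
        neighbors[i].append((j, -y))           # s_j = s_i - Y_ij
        neighbors[j].append((i, y))            # s_i = s_j + Y_ij
    s = np.full(n, np.nan)
    s[start] = 0.0
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j, delta in neighbors[i]:
            if np.isnan(s[j]):
                s[j] = s[i] + delta
                queue.append(j)
    return s

# Hypothetical scores and a connected set of three known entries.
s_true = np.array([0.4, 0.1, 0.8, 0.6])
pairs = [(0, 2), (2, 1), (3, 1)]
known = {(i, j): s_true[i] - s_true[j] for i, j in pairs}
s_rec = propagate_scores(known, n=4)
```

Only $n - 1$ well-chosen entries are needed here, which is why the noiseless recovery theorem is best read as intuition for the noisy problem rather than as a practical guarantee.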
Consider the operator basis for Hermitian matrices:
Let be centered, i.e., . Let where and . Also, let be a random set of elements with size where . Then the solution of
is equal to with probability at least .
The proof of this theorem follows directly by Theorem 4 if has coherence with respect to the basis . We now show this result.
Definition 6 (Coherence, Gross (2010))
Let be , rank-, and Hermitian. Let be an orthogonal projector onto . Then has coherence with respect to an operator basis if both
For with :
Let , , and . Note that because is Hermitian with no real-valued entries, both quantities and are . Also, because is symmetric, . The remaining basis elements satisfy:
Thus, has coherence with from Theorem 5 and with respect to . And we have our recovery result. Although, this theorem provides little practical benefit unless both and are , which occurs when is nearly uniform.
We implemented and tested this procedure in two synthetic scenarios, along with Netflix, MovieLens, and Jester joke-set ratings data. In the interest of space, we only present a subset of these results for Netflix.
Figure: An experimental study of the recoverability of a ranking vector. These show that we need about $6n \log n$ entries of $\mathbf{Y}$ to get good recovery in both the noiseless (left) and noisy (right) case. See the text for more information.
The first experiment is an empirical study of the recoverability of the score vector in the noiseless and noisy case. In the noiseless case, Figure Rank Aggregation via Nuclear Norm Minimization (left), we generate a score vector with uniformly distributed random scores between 0 and 1. These are used to construct a pairwise comparison matrix . We then sample elements of this matrix uniformly at random and compute the difference between the true score vector and the output of steps 4 and 5 of Algorithm 2. If the relative -norm difference between these vectors is less than , we declare the trial recovered. For , the figure shows that, once the number of samples is about , the correct is recovered in nearly all the 50 trials.
Next, for the noisy case, we generate a uniformly spaced score vector between 0 and 1. Then , where is a matrix of random normals. Again, we sample elements of this matrix randomly, and declare a trial successful if the order of the recovered score vector is identical to the true order. In the recoverability figure (right), we indicate the fraction of successful trials as a gray value between black (all failure) and white (all successful). Again, the algorithm is successful for a moderate noise level, i.e., the value of , when the number of samples is larger than .
Figure: The performance of our algorithm (left) and the mean rating (right) in recovering the ordering given by item scores in an item-response theory model with 100 items and 1000 users. The various thick lines correspond to the average number of ratings each user performed (see the in-place legend). See the text for more information.
Inspired by Ho and Quinn (2008), we investigate recovering item scores in an item-response scenario. Let be the center of user ’s rating scale, and be the rating sensitivity of user . Let be the intrinsic score of item . Then we generate ratings from users on items as:
where is the discrete levels function:
and is a noise parameter. In our experiment, we draw , , , and . Here, is a standard normal, and is a noise parameter. As input to our algorithm, we sample ratings uniformly at random by specifying a desired number of average ratings per user. We then look at the Kendall correlation coefficient between the true scores and the output of our algorithm using the arithmetic mean pairwise aggregation. A value of 1 indicates a perfect ordering correlation between the two sets of scores.
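The evaluation measure used here, the Kendall correlation between two score vectors, can be sketched in plain NumPy as follows. This version counts concordant minus discordant pairs and assumes no ties; the function name `kendall_tau` is an assumption, and a library routine such as `scipy.stats.kendalltau` would serve the same purpose.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall rank correlation for two score vectors without ties."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    iu = np.triu_indices(n, 1)                       # each pair once
    dx = np.sign(np.subtract.outer(x, x)[iu])
    dy = np.sign(np.subtract.outer(y, y)[iu])
    return float(np.sum(dx * dy)) / (n * (n - 1) / 2)

true_scores = [1.0, 2.0, 3.0, 4.0]
tau_same = kendall_tau(true_scores, [10.0, 20.0, 30.0, 40.0])   # 1.0
tau_reversed = kendall_tau(true_scores, [4.0, 3.0, 2.0, 1.0])   # -1.0
tau_one_swap = kendall_tau(true_scores, [1.0, 3.0, 2.0, 4.0])   # 4/6
```

Only the ordering matters: any monotone rescaling of either score vector leaves the value unchanged.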
Figure Rank Aggregation via Nuclear Norm Minimization shows the results for users and items with and ratings per user on average. We also vary the parameter between and . Each thick line with markers plots the median value of in 50 trials. The thin adjacency lines show the th and th percentiles of the 50 trials. At all error levels, our algorithm outperforms the mean rating. Also, when there are few ratings per-user and moderate noise, our approach is considerably more correlated with the true score. This evidence supports the anecdotal results from Netflix in Table Rank Aggregation via Nuclear Norm Minimization.
See the table of top Netflix movies for the results produced by our technique in a few circumstances using all users. The arithmetic mean results in that table use only elements of $\hat{\mathbf{Y}}$ with at least 30 pairwise comparisons (it is an am all 30 model in the code below). See also the residual figure for an analysis of the residuals generated by the fit for different constructions of the matrix $\hat{\mathbf{Y}}$. Each residual evaluation of Netflix is described by a code. For example, sb all 0 is a strict-binary pairwise matrix from all Netflix users with a minimum comparison count of 0 in Algorithm 2 (i.e. accept all pairwise comparisons). Alternatively, am 6 30 denotes an arithmetic-mean pairwise matrix from Netflix users with at least 6 ratings, where each entry in $\hat{\mathbf{Y}}$ had 30 users supporting it. The other abbreviations are gm: geometric mean; bc: binary comparison; and lo: log-odds ratio.
These residuals show that we get better rating fits by only using frequently compared movies, but that there are only minor changes in the fits when excluding users that rate few movies. The difference between the score-based residuals (red points) and the svp residuals (blue points) show that excluding comparisons leads to “overfitting” in the svp residual. This suggests that increasing the parameter should be done with care and good checks on the residual norms.
To check that a rank-2 approximation is reasonable, we increased the target rank in the svp solver to to investigate. For the arithmetic mean (6,30) model, the relative residual at rank- is and at rank- is . Meanwhile, the nuclear norm increases from around 14000 to around 17000. These results show that the change in the fit is minimal and our rank-2 approximation and its scores should represent a reasonable ranking.
Existing principled techniques such as computing a Kemeny optimal ranking or finding a minimum feedback arc set are NP-hard. These approaches are inappropriate in large-scale rank aggregation settings. Our proposal is to (i) measure pairwise scores and (ii) solve a matrix completion problem to determine the quality of items. This idea is both principled and functional with significant missing data. The results of our rank aggregation on the Netflix problem (see the accompanying table) reveal popular and high-quality movies. These are interesting results and could easily have a home on a “best movies in Netflix” web page. Computing a rank aggregation with this technique is not NP-hard. It only requires solving a convex optimization problem with a unique global minimum. Although we did not record computation times, the most time-consuming piece of work is computing the pairwise comparison matrix . In a practical setting, this could easily be done with a MapReduce computation.
To compute these solutions, we adapted the svp solver for matrix completion (Jain et al., 2010). This process involved (i) studying the singular value decomposition of a skew-symmetric matrix (Lemmas 1 and 2) and (ii) showing that the svp solver preserves a skew-symmetric approximation through its computation (Theorem 3). Because the svp solver computes with an explicitly chosen rank, these techniques work well for large-scale rank aggregation problems.
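The two ingredients above can be sketched numerically. The code below is a minimal illustration, not the adapted svp solver itself: it alternates a gradient step on the observed entries with a truncated-SVD projection, and assumes a symmetric observation mask, an even target rank, and a unit step size. The function name `svp_skew_complete`, the toy score vector, and the score extraction `X @ e / n` (which assumes a rank-2 matrix of the form s e^T - e s^T with centered scores s) are all illustrative assumptions.

```python
import numpy as np

def svp_skew_complete(Y, mask, rank=2, step=1.0, iters=20):
    """Sketch of singular value projection on a skew-symmetric matrix.

    Y holds the observed pairwise entries and mask is a symmetric 0/1
    array marking which entries are observed (an assumption of this
    sketch). rank should be even, because the singular values of a
    skew-symmetric matrix occur in pairs.
    """
    X = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        # Gradient of the squared error on the observed entries only.
        grad = mask * (X - Y)
        # Project the gradient step onto matrices of the target rank.
        U, s, Vt = np.linalg.svd(X - step * grad, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return X

# Hypothetical centered scores and the rank-2 skew-symmetric matrix they induce.
s_true = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
n = len(s_true)
e = np.ones(n)
Y = np.outer(s_true, e) - np.outer(e, s_true)

# Singular values of a skew-symmetric matrix come in equal pairs.
sigma = np.linalg.svd(Y, compute_uv=False)

# Hide one symmetric pair of entries and complete the matrix.
mask = np.ones((n, n))
mask[0, 4] = mask[4, 0] = 0.0
X = svp_skew_complete(Y * mask, mask)
scores = X @ e / n  # assumed score extraction for a rank-2 skew-symmetric X
```

Because the mask is symmetric, each gradient step keeps the iterate skew-symmetric, and truncating at an even rank keeps both members of each singular-value pair, so `X` remains skew-symmetric up to rounding error.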
We believe the combination of pairwise aggregation and matrix completion is a fruitful direction for future research. We plan to explore optimizing the svp algorithm to exploit the skew-symmetric constraint, extending our recovery result to the noisy case, and investigating additional data.
- Ailon et al. (2005) N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In STOC ’05, pages 684–693, 2005. ISBN 1-58113-960-8. doi: 10.1145/1060590.1060692.
- Alon (2006) N. Alon. Ranking tournaments. SIAM J. Discret. Math., 20(1):137–142, 2006. ISSN 0895-4801. doi: 10.1137/050623905.
- Arrow (1950) K. J. Arrow. A difficulty in the concept of social welfare. J. Polit. Econ., 58(4):328–346, August 1950. ISSN 00223808. URL http://www.jstor.org/stable/1828886.
- Cai et al. (2008) J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. arXiv, math.OC:0810.3286v1, October 2008. URL http://arxiv.org/abs/0810.3286.
- Candès and Tao (2007) E. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat., 35(6):2313–2351, 2007. doi: 10.1214/009053606000001523.
- Candès and Recht (2009) E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, December 2009. doi: 10.1007/s10208-009-9045-5.
- Candès and Tao (to appear) E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, to appear.
- Chen et al. (1998) S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comp., 20(1):33–61, 1998. doi: 10.1137/S1064827596304010.
- Colley (2002) W. N. Colley. Colley’s bias free college football ranking method: The Colley matrix explained. Technical report, Princeton University, 2002.
- Condorcet (1785) J.-A.-N. d. C. Condorcet. Essai sur l’application de l’analyse à la probabilité des décisions… de L’imprimerie Royale, Paris, 1785. URL http://gallica2.bnf.fr/ark:/12148/bpt6k417181.
- Dai and Milenkovic (2009) W. Dai and O. Milenkovic. Set: an algorithm for consistent matrix completion. arXiv, September 2009. URL http://arxiv.org/abs/0909.2705.
- David (1988) H. A. David. The method of paired comparisons. Number 41 in Griffin’s Statistical Monographs and Courses. Charles Griffin, 1988. ISBN 0195206169.
- de Kerchov and van Dooren (2007) C. de Kerchov and P. van Dooren. Iterative filtering for a dynamical reputation system. arXiv, cs.IR:0711.3964, 2007. URL http://arXiv.org/abs/0711.3964.
- Dwork et al. (2001) C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW ’01, pages 613–622, New York, NY, USA, 2001. ACM. ISBN 1-58113-348-0. doi: 10.1145/371920.372165.
- Fazel (2002) M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, March 2002. URL http://faculty.washington.edu/mfazel/thesis-final.pdf.
- Freeman (1997) L. C. Freeman. Uncovering organizational hierarchies. Computational and Mathematical Organization Theory, 3(1):5–18, 1997. ISSN 1381-298X. doi: 10.1023/A:1009690520577.
- Fuchs (2004) J.-J. Fuchs. Recovery of exact sparse representations in the presence of noise. In ICASSP ’04, volume 2, pages ii–533–6 vol.2, May 2004. doi: 10.1109/ICASSP.2004.1326312.
- Gross (2010) D. Gross. Recovering low-rank matrices from few coefficients in any basis. arXiv, cs.NA:0910.1879v5, 2010. URL http://arxiv.org/abs/0910.1879.
- Ho and Quinn (2008) D. E. Ho and K. M. Quinn. Improving the presentation and interpretation of online ratings data with model-based figures. Amer. Statist., 62(4):279–288, November 2008. doi: 10.1198/000313008X366145. URL http://pubs.amstat.org/doi/abs/10.1198/000313008X366145.
- Jain et al. (2010) P. Jain, R. Meka, and I. Dhillon. Guaranteed rank minimization via singular value projection. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 937–945, 2010. URL http://books.nips.cc/papers/files/nips23/NIPS2010_0682.pdf.
- Jiang et al. (2010) X. Jiang, L.-H. Lim, Y. Yao, and Y. Ye. Statistical ranking and combinatorial Hodge theory. Mathematical Programming, 127(1):1–42, 2010. ISSN 0025-5610. doi: 10.1007/s10107-010-0419-x.
- Kemeny (1959) J. G. Kemeny. Mathematics without numbers. Daedalus, 88(4):577–591, Fall 1959. ISSN 00115266. URL http://www.jstor.org/stable/20026529.
- Kendall and Smith (1940) M. G. Kendall and B. B. Smith. On the method of paired comparison. Biometrika, 31(3-4):324–345, 1940. doi: 10.1093/biomet/31.3-4.324.
- Keshavan and Oh (2009) R. H. Keshavan and S. Oh. A gradient descent algorithm on the grassman manifold for matrix completion. arXiv, October 2009. URL http://arxiv.org/abs/0910.5260.
- Langville and Meyer (forthcoming) A. N. Langville and C. D. Meyer. Who’s #1: The Science of Rating and Ranking. Princeton University Press, Princeton, NJ, forthcoming.
- Lee and Bresler (2009) K. Lee and Y. Bresler. Admira: Atomic decomposition for minimum rank approximation. arXiv, May 2009. URL http://arxiv.org/abs/0905.0044.
- Li et al. (2008) H. Li, T.-Y. Liu, and C. Zhai, editors. Proceedings of the SIGIR 2008 Workshop: Learning to Rank for Information Retrieval. 2008. URL http://research.microsoft.com/en-us/um/beijing/events/lr4ir-2008/PROCEEDINGS-LR4IR%202008.PDF.
- Massey (1997) K. Massey. Statistical models applied to the rating of sports teams. Master’s thesis, Bluefield College, 1997.
- Mazumder et al. (2009) R. Mazumder, T. Hastie, and R. Tibshirani. Regularization methods for learning incomplete matrices. arXiv, June 2009. URL http://arxiv.org/abs/0906.2034v1.
- Miller (1956) G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev., 101(2):343–352, 1956. URL http://www.psych.utoronto.ca/users/peterson/psy430s2001/Miller%20GA%20Magical%20Seven%20Psych%20Review%201955.pdf.
- Murnaghan and Wintner (1931) F. D. Murnaghan and A. Wintner. A canonical form for real matrices under orthogonal transformations. PNAS, 17(7):417–420, July 1931. URL http://www.pnas.org/content/17/7/417.full.pdf+html.
- Recht et al. (to appear) B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solution of linear matrix equations via nuclear norm minimization. SIAM Rev., to appear.
- Rendle et al. (2009) S. Rendle, L. Balby Marinho, A. Nanopoulos, and L. Schmidt-Thieme. Learning optimal ranking with tensor factorization for tag recommendation. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 727–736, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9. doi: 10.1145/1557019.1557100.
- Saaty (1987) T. L. Saaty. Rank according to Perron: A new insight. Math. Mag, 60(4):211–213, October 1987. ISSN 0025570X. URL http://www.jstor.org/stable/2689340.
- Tan and Jin (2004) P.-N. Tan and R. Jin. Ordering patterns by combining opinions from multiple sources. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 695–700, New York, NY, USA, 2004. ACM. ISBN 1-58113-888-1. doi: 10.1145/1014052.1014142.
- Tibshirani (1996) R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996. ISSN 00359246. URL http://www.jstor.org/stable/2346178.
- Toh and Yun (2009) K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Opt. Online, November 2009. URL http://www.optimization-online.org/DB_FILE/2009/03/2268.pdf.
- Vandenberghe and Boyd (1996) L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Rev., 38(1):49–95, March 1996. ISSN 0036-1445. doi: 10.1137/1038003.