Preference Completion: Large-scale Collaborative Ranking from Pairwise Comparisons
In this paper we consider the collaborative ranking setting: a pool of users each provides a small number of pairwise preferences between possible items; from these we need to predict each user's preferences for items they have not yet seen. We do so by fitting a rank-$r$ score matrix to the pairwise data, and provide two main contributions:
(a) we show that an algorithm based on convex optimization provides good generalization guarantees once each user provides as few as pairwise comparisons – essentially matching the sample complexity required in the related matrix completion setting (which uses actual numerical as opposed to pairwise information), and
(b) we develop a large-scale non-convex implementation, which we call AltSVM, that trains a factored form of the matrix via alternating minimization (which we show reduces to alternating SVM problems), and scales and parallelizes very well to large problem settings. It also outperforms common baselines on many moderately large popular collaborative filtering datasets in both NDCG and in other measures of ranking performance.
This paper considers the following recommendation system problem: given a set of items, a set of users, and non-numerical pairwise comparison data, find the underlying preference ordering of the users. In particular, we are interested in the setting where data is of the form “user $i$ prefers item $j$ over item $k$”, for different ordered user-item-item triples $(i, j, k)$. Pairwise preference data is widespread; indeed, almost any setting where a user is presented with a menu of options – and chooses one of them – can be considered to be providing a pairwise preference between the chosen item and every other item that is presented.
Crucially, we are interested in the collaborative filtering setting, where (a) on the one hand the number of such pairwise preferences we have for any one user is woefully insufficient to infer anything for that user in isolation; and (b) on the other hand, we aim for personalization, i.e. for every user to possibly have different inferred preferences from every other. To reconcile these two requirements, our method relates the preferences of users to each other via a low-rank matrix, which we (implicitly) assume governs the observed preferences. Essentially, we fit a low-rank users $\times$ items score matrix $X$ to pairwise comparison data by trying to ensure that $X_{ij} - X_{ik}$ is positive when user $i$ prefers item $j$ to item $k$.
We present two algorithms to infer the score matrix from training data; once inferred, this can be used for predicting future preferences. While there has been some recent work on fitting low-rank score matrices to pairwise preference data (which we review and compare to below), in this paper we present the following two contributions:
(a) A statistical analysis for the convex relaxation: we bound the generalization error of the solution to our convex program. Essentially, we show that the minimizer of the empirical loss also almost minimizes the true expected loss. We also give a lower bound showing that our error rate is sharp up to logarithmic factors.
(b) A large-scale non-convex implementation: We provide a non-convex algorithm that we call Alternating Support Vector Machine (AltSVM). This non-convex algorithm is more practical than the convex program in a large-scale setting; it explicitly parameterizes the low-rank matrix in factored form and minimizes the hinge loss. Crucially, each step in this algorithm can be formulated as a standard SVM that updates one of the two factors; the algorithm proceeds by alternating updates to both factors. We apply a stochastic version of dual coordinate descent [7, 22] with lock-free parallelization. This exploits the problem structure and ensures it parallelizes well. We show that our algorithm outperforms several existing collaborative ranking algorithms in both speed and prediction accuracy, and it achieves significant speedups as the number of cores increases.
1.1 Related Work
Ranking/learning preferences is a classical problem that has been considered in a large amount of work. There are many different settings for this problem, which we discuss below.
Learning to Rank
The main problem in this community has been to estimate a ranking function from given feature vectors and relevance scores. Depending on the application, a feature vector may correspond to a user-item pair or a single item. While there have been algorithms that use pairwise comparisons [6, 12] of the training samples, our setting is different in that our data consists only of pairwise comparisons. We refer the reader to the survey by Liu [15].
One ranking with pairwise comparisons
In a single-user model, we are asked to learn a single ranking given pairwise comparisons. Jamieson & Nowak and Ailon consider an active query model with noiseless responses; Jamieson & Nowak give an algorithm for exactly recovering the true ranking under a low-rank assumption similar to ours, while Ailon approximately recovers the true ranking without such an assumption. Wauthier et al. and Negahban et al. learn a ranking from noisy pairwise comparisons; Negahban et al. consider a Bradley-Terry-Luce model similar to ours and attempt to learn an underlying score vector, while Wauthier et al. get by without structural assumptions, but only attempt to learn the ranking itself. Hajek et al. considered the problem of learning a single ranking from more general partial rankings under the Plackett-Luce model and provided a minimax-optimal algorithm.
Many rankings with pairwise comparisons
Given multiple users with different rankings, one could of course attempt to learn their rankings by simply applying an algorithm from the previous section to each user individually. However, it is more efficient – both statistically and computationally – to postulate some global structure and use it to relate the many users’ rankings. This is the same idea that has been applied so successfully in collaborative filtering. Rendle et al.  and Liu et al.  were the first to take this approach. They modeled the observations as coming from a BTL model with low-rank structure (i.e., very similar to our model) and gave algorithms for learning the model parameters. Yi et al.  took a purely optimization-based approach. Rather than assuming a probabilistic model, they minimized a convex objective using the hinge loss on a low-rank matrix. In a slightly different model, Hu et al.  and Shi et al.  consider the problem of learning from latent feedback. Recently, Lu & Negahban  analyzed an algorithm which is very similar to ours for the Bradley-Terry-Luce model independently from our work.
Many rankings with 1-bit ratings
Instead of moving to pairwise comparisons, some work has suggested avoiding the difficulties of numerical ratings by instead asking users to give 1-bit ratings to items; that is, each user only indicates whether they like or dislike an item. In this setting, the work of Davenport et al.  is most closely related to ours, in that they assume an underlying low-rank structure and give an algorithm based on convex optimization. Also, our theoretical analysis owes a lot to their work. Xu et al.  consider a slightly different goal: rather than attempting to recover the preferences of each user, they try to cluster similar users and similar items together. Yun et al.  proposed an optimization problem motivated from robust binary classification and used stochastic gradient descent to solve the problem in a large-scale setting.
Many rankings with numerical ratings
The goal in this setting is the same as ours, except that the data is in the form of numerical ratings instead of pairwise comparisons. Weimer et al. attempted to directly optimize Normalized Discounted Cumulative Gain (NDCG), a widely used performance measure for ranking problems. Balakrishnan & Chopra and Volkovs & Zemel converted this problem into a learning-to-rank problem and solved it using existing algorithms. While these works considered the low-rank matrix model, different models were proposed by Weston et al. and Lee et al.: Weston et al. proposed a tensor model to rank items for different queries and users, and Lee et al. proposed a weighted sum of low-rank matrix models.
2 Empirical Risk Minimization (ERM)
Let us first formulate the problem mathematically. The task is to estimate rankings of multiple users on multiple items. We denote the number of users by $d_1$, and the number of items by $d_2$. We are given a set of triples $\Omega \subset [d_1] \times [d_2] \times [d_2]$, where the preference of user $i$ between items $j$ and $k$ is observed if $(i, j, k) \in \Omega$. The observed comparisons are then given by $\{Y_{ijk} \in \{1, -1\} : (i, j, k) \in \Omega\}$, where $Y_{ijk} = 1$ if user $i$ prefers item $j$ over item $k$, and $Y_{ijk} = -1$ otherwise. Let $\Omega_i = \{(j, k) : (i, j, k) \in \Omega\}$ denote the set of item pairs that user $i$ has compared.
We predict rankings for multiple users by estimating a score matrix $X \in \mathbb{R}^{d_1 \times d_2}$ such that $X_{ij} > X_{ik}$ means that user $i$ prefers item $j$ over item $k$. Then the sorting order for each row provides the predicted ranking for the corresponding user.
We propose (as have others) that $X$ is low-rank or close to low-rank, the intuition being that each user bases their preferences on a small set of features that are common among all the items. Then the empirical risk minimization (ERM) framework can naturally be formulated as
$$\min_{X} \sum_{(i,j,k) \in \Omega} \mathcal{L}\big(Y_{ijk}(X_{ij} - X_{ik})\big) \quad \text{subject to} \quad \operatorname{rank}(X) \le r, \tag{1}$$
where $\mathcal{L}(\cdot)$ is a monotonically non-increasing loss function which induces $X_{ij} > X_{ik}$ if $Y_{ijk} = 1$, and $X_{ij} < X_{ik}$ otherwise (e.g., hinge loss, logistic regression loss, etc.).
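To make the objective concrete, the following is a minimal sketch of how the empirical risk of a candidate score matrix could be evaluated. The function names and the `(i, j, k, y)` tuple layout are illustrative assumptions, not taken from the paper, and the rank constraint is not enforced here.

```python
import math

def hinge(t):
    """Hinge loss max(0, 1 - t); non-increasing in the margin t."""
    return max(0.0, 1.0 - t)

def logistic(t):
    """Logistic loss log(1 + exp(-t)); non-increasing in the margin t."""
    return math.log1p(math.exp(-t))

def empirical_risk(X, comparisons, loss):
    """Sum of loss(y * (X[i][j] - X[i][k])) over observed (i, j, k, y).

    X is a list of rows (users by items); y is +1 if user i prefers
    item j over item k and -1 otherwise (assumed encoding).
    """
    return sum(loss(y * (X[i][j] - X[i][k])) for i, j, k, y in comparisons)

# Toy example: 2 users, 3 items.
X = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 0.5]]
comparisons = [(0, 0, 1, +1),   # margin  2.0 -> hinge 0
               (0, 2, 0, +1),   # margin -1.0 -> hinge 2
               (1, 1, 2, +1)]   # margin  2.5 -> hinge 0
hinge_risk = empirical_risk(X, comparisons, hinge)
logistic_risk = empirical_risk(X, comparisons, logistic)
```

Only the second comparison contradicts the scores, so it is the only one that contributes to the hinge risk.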
Solving (1) is NP-hard because of the rank constraint. As a first alternative, we propose a straightforward convex relaxation.
3 Convex Relaxation
Our first method is the convex relaxation of (1), which involves a nuclear norm constraint on the $d_1 \times d_2$ score matrix $X$:
$$\min_{X} \sum_{(i,j,k) \in \Omega} \mathcal{L}\big(Y_{ijk}(X_{ij} - X_{ik})\big) \quad \text{subject to} \quad \|X\|_* \le \sqrt{\lambda d_1 d_2}. \tag{2}$$
Here, for any matrix $X$, the nuclear/trace norm $\|X\|_*$ denotes the sum of its singular values; it is a well-recognized convex surrogate for low-rank structure (most famously in matrix completion).
The only parameter of this algorithm is $\lambda$, which governs the trade-off between better optimizing the likelihood of the observed data, and the strictness in imposing approximate low-rank structure. Since we motivated our algorithm with the assumption that $X$ has low rank, we should point out how our algorithm's parameter $\lambda$ compares to the rank: note that if $X$ is a rank-$r$ matrix whose largest absolute entry is bounded by $O(1)$ then $\|X\|_* \le O(\sqrt{r d_1 d_2})$. In other words, $\lambda$ is a parameter that takes into account both the rank of $X$ and the size of its elements, and it is roughly proportional to the rank.
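The relation between the nuclear norm and the rank stated above follows from two standard norm inequalities. As a short worked derivation, with notation assumed here ($X$ a $d_1 \times d_2$ matrix of rank $r$ with nonzero singular values $\sigma_1, \dots, \sigma_r$):

```latex
\|X\|_* \;=\; \sum_{s=1}^{r} \sigma_s
\;\le\; \sqrt{r}\,\Big(\sum_{s=1}^{r} \sigma_s^2\Big)^{1/2}
\;=\; \sqrt{r}\,\|X\|_F
\;\le\; \sqrt{r\, d_1 d_2}\;\max_{i,j} |X_{ij}|.
```

The first inequality is Cauchy-Schwarz, and the second bounds each of the $d_1 d_2$ squared entries by the largest one. A rank-$r$ matrix with bounded entries therefore has nuclear norm of order at most $\sqrt{r d_1 d_2}$.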
3.1 Analytic results
We analyze (2) by assuming a standard model for pairwise comparisons. Then we provide a statistical guarantee of the method under the model.
Recall the classical Bradley-Terry-Luce model [3, 17] for pairwise preferences of a single user, which assumes that the probability of item $j$ being preferred over item $k$ is given by a logistic function of the difference of the underlying preference scores of the two items. For multiple users, we assume that there is some true score matrix $X^* \in \mathbb{R}^{d_1 \times d_2}$ and
$$\Pr(Y_{ijk} = 1) = \frac{\exp(X^*_{ij} - X^*_{ik})}{1 + \exp(X^*_{ij} - X^*_{ik})}.$$
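As an illustration of this observation model, the sketch below draws synthetic comparisons from a Bradley-Terry-Luce model given a score matrix. The helper names and the toy scores are assumptions made for the example only.

```python
import math
import random

def btl_probability(score_j, score_k):
    """P(item j preferred over item k) = logistic(score_j - score_k)."""
    return 1.0 / (1.0 + math.exp(-(score_j - score_k)))

def sample_comparison(X_star, i, j, k, rng):
    """Draw one outcome in {+1, -1} for user i comparing items j and k."""
    p = btl_probability(X_star[i][j], X_star[i][k])
    return 1 if rng.random() < p else -1

# One user, three items with scores 1, 0, -1 (toy values).
X_star = [[1.0, 0.0, -1.0]]
rng = random.Random(0)
p = btl_probability(X_star[0][0], X_star[0][2])   # logistic(2)
draws = [sample_comparison(X_star, 0, 0, 2, rng) for _ in range(1000)]
```

Larger score gaps make the outcome more deterministic, while equal scores give a fair coin flip.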
Assume that each user-item-item triple independently belongs to with probability , and let be the expected size of . We will assume that the are approximately balanced in the sense that no user-item pair is observed too frequently:
Assumption 3.1. There is a constant such that for every ,
Note that if in Assumption 3.1 then the are all equal, meaning that each user-item-item triple has an equal chance to be observed.
In order to state our error bounds, we first introduce some notation: let be the distribution of (i.e. the complete distribution of all pairwise preferences, even those that are not observed).
Our main upper bound shows that if is sufficiently large then our algorithm finds a solution with almost minimal risk. Given a loss function , define the expected risk of by
where the expectation is with respect to the distribution parametrized by the true parameters .
Theorem 3.1. Suppose that $\mathcal{L}$ is 1-Lipschitz, and let and be distributed as for some matrix . Under Assumption 3.1,
where is a universal constant.
We recall that the parameter is related to rank in that if is a rank- matrix whose largest absolute entry is bounded by then . In other words, is a parameter that takes into account both the rank of and the size of its elements, and it is roughly proportional to the rank. In particular, Theorem 3.1 shows that once we observe pairwise comparisons, then we can accurately estimate the probability of any user preferring any item over any other. In other words, we need to observe about comparisons per user, which is substantially less than the comparisons that we would have required if each user were modelled in isolation. Moreover, our lower bound (below) shows that at least comparisons per user are required, which is only a logarithmic factor from the upper bound.
Theorem 3.2. Suppose that . Let be any algorithm that receives as input and produces as output. For any and , there exists with such that when and are distributed according to then with probability at least ,
where is a constant depending only on .
3.1.1 Maximum likelihood estimation of
By specializing the loss function $\mathcal{L}$, Theorem 3.1 has a simple corollary for maximum-likelihood estimation of $X^*$. Recall that if $\mu$ and $\nu$ are two probability distributions on a finite set $S$ then the Kullback-Leibler divergence between them is
$$D(\mu \,\|\, \nu) = \sum_{s \in S} \mu(s) \log \frac{\mu(s)}{\nu(s)},$$
under the convention that $0 \log 0 = 0$. We recall that although $D$ is not a metric it is always non-negative, and that $D(\mu \,\|\, \nu) = 0$ implies $\mu = \nu$.
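For reference, a minimal sketch of the Kullback-Leibler divergence between two distributions on a finite set, represented here as equal-length lists of probabilities (an assumed representation). It applies the $0 \log 0 = 0$ convention and returns infinity when the second distribution assigns zero mass to an outcome the first can produce.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_s p(s) * log(p(s) / q(s))."""
    total = 0.0
    for p_s, q_s in zip(p, q):
        if p_s == 0.0:
            continue              # convention: 0 log 0 = 0
        if q_s == 0.0:
            return math.inf       # p puts mass where q has none
        total += p_s * math.log(p_s / q_s)
    return total

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
point_mass = kl_divergence([1.0, 0.0], [0.5, 0.5])
skewed = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

The divergence is zero for identical distributions and strictly positive otherwise, and it is not symmetric in its arguments.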
Corollary 3.3. Let and be distributed as for some matrix . Define the loss function $\mathcal{L}$ by $\mathcal{L}(x) = \log(1 + \exp(-x))$. Under Assumption 3.1,
where is a universal constant.
Note that the loss function in Corollary 3.3 is exactly the negative logarithm of the logistic function, and so in Corollary 3.3 is the maximum-likelihood estimate for . Thus, Corollary 3.3 shows that the distribution induced by the maximum-likelihood estimator is close to the true distribution in Kullback-Leibler divergence.
4 Large-scale Non-convex Implementation
While the convex relaxation is statistically near optimal, it is not ideal for large-scale datasets because it requires the solution of a convex program with $d_1 d_2$ variables (one per entry of the score matrix). In this section we develop a non-convex variant which both scales and parallelizes very well, and which empirically outperforms several existing baseline methods.
Our approach is based on the following steps:
We represent the low-rank matrix in explicit factored form $X = UV^\top$ and replace the regularizer appropriately. This results in a non-convex optimization problem in $U \in \mathbb{R}^{d_1 \times r}$ and $V \in \mathbb{R}^{d_2 \times r}$, where $r$ is the rank parameter.
We solve the non-convex problem by alternating between updating $U$ while keeping $V$ fixed, and vice versa. With the hinge loss (which we found works best in experiments), each of these becomes an SVM problem; hence we call our algorithm AltSVM.
The problem is of course not symmetric in $U$ and $V$ because users rank items but not vice versa. For the $U$ update, each user vector naturally decouples and can be updated in parallel (and in fact this just reduces to the case of rankSVM [12]).
For the $V$ update, we show that this can also be made into an SVM problem; however, it involves coupling of all item vectors and all user ratings. We employ several tricks (detailed below) to speed up and effectively parallelize this step.
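The non-convex problem can be written as
$$\min_{U \in \mathbb{R}^{d_1 \times r},\, V \in \mathbb{R}^{d_2 \times r}} \sum_{(i,j,k) \in \Omega} \mathcal{L}\big(Y_{ijk}\, u_i^\top (v_j - v_k)\big) + \frac{\lambda}{2}\big(\|U\|_F^2 + \|V\|_F^2\big), \tag{3}$$
where we replace the nuclear norm regularizer using the property $\|X\|_* = \min_{X = UV^\top} \frac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)$. Here $u_i$ and $v_j$ denote the $i$th and $j$th rows of $U$ and $V$, respectively. While this is a non-convex problem for which it is hard to find the global optimum, it is computationally more efficient since only $(d_1 + d_2) r$ variables are involved. We propose to use the L2 hinge loss, i.e., $\mathcal{L}(x) = \max(0, 1 - x)^2$.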
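A minimal sketch of evaluating a factored objective of this kind is given below. It assumes the regularizer enters as $(\lambda/2)(\|U\|_F^2 + \|V\|_F^2)$ and uses the squared (L2) hinge loss; the function names and the `(i, j, k, y)` layout are illustrative, not the authors' code.

```python
def squared_hinge(t):
    """L2 hinge loss max(0, 1 - t)^2."""
    return max(0.0, 1.0 - t) ** 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def factored_objective(U, V, comparisons, lam):
    """Squared-hinge loss over comparisons plus Frobenius regularization.

    U[i] is the vector of user i, V[j] the vector of item j, and each
    comparison is (i, j, k, y) with y in {+1, -1} (assumed encoding).
    """
    reg = 0.5 * lam * (sum(dot(u, u) for u in U) + sum(dot(v, v) for v in V))
    loss = 0.0
    for i, j, k, y in comparisons:
        diff = [a - b for a, b in zip(V[j], V[k])]   # v_j - v_k
        loss += squared_hinge(y * dot(U[i], diff))
    return reg + loss

U = [[1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
agree = factored_objective(U, V, [(0, 0, 1, +1)], lam=2.0)     # margin +1
disagree = factored_objective(U, V, [(0, 0, 1, -1)], lam=2.0)  # margin -1
```

Here the regularizer contributes 3.0 in both cases; a comparison that contradicts the scores adds a loss of 4.0 on top of it.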
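In the alternating minimization of (3), the subproblem for $U$ is to solve
$$\min_{U \in \mathbb{R}^{d_1 \times r}} \sum_{(i,j,k) \in \Omega} \mathcal{L}\big(Y_{ijk}\, u_i^\top (v_j - v_k)\big) + \frac{\lambda}{2}\|U\|_F^2 \tag{4}$$
while $V$ is fixed. This can be decomposed into $d_1$ independent problems for the $u_i$'s, where each solves
$$\min_{u_i \in \mathbb{R}^{r}} \sum_{(j,k) \in \Omega_i} \mathcal{L}\big(Y_{ijk}\, u_i^\top (v_j - v_k)\big) + \frac{\lambda}{2}\|u_i\|_2^2. \tag{5}$$
This part is in general a small-scale problem, as the dimension is $r$ and the sample size is $|\Omega_i|$, the number of item pairs compared by user $i$.
On the other hand, solving for $V$ with $U$ fixed can be written as
$$\min_{V \in \mathbb{R}^{d_2 \times r}} \sum_{(i,j,k) \in \Omega} \mathcal{L}\big(Y_{ijk}\, \langle A^{(ijk)}, V \rangle\big) + \frac{\lambda}{2}\|V\|_F^2, \tag{6}$$
where $A^{(ijk)} \in \mathbb{R}^{d_2 \times r}$ is such that the $l$th row of $A^{(ijk)}$ is $u_i$ if $l = j$, $-u_i$ if $l = k$, and $0$ otherwise. It is a much larger SVM problem than (5), as the dimension is $d_2 r$ and the sample size is $|\Omega|$.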
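The feature construction for the item subproblem can be sketched as follows. Only the two nonzero rows are stored, in a dictionary keyed by row index; this sparse representation and the helper names are assumptions made for illustration, and the sketch assumes the two compared items are distinct.

```python
def item_feature(u_i, j, k):
    """Sparse feature matrix for one comparison (j != k assumed).

    Row j holds u_i, row k holds -u_i, and every other row is zero,
    so only two rows are stored.
    """
    return {j: list(u_i), k: [-x for x in u_i]}

def frobenius_inner(feature, V):
    """Inner product <A, V>, summing only over the stored rows of A."""
    return sum(a * v
               for row, vec in feature.items()
               for a, v in zip(vec, V[row]))

u_i = [1.0, 2.0]                              # user vector, r = 2
V = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]      # three item vectors
A = item_feature(u_i, j=0, k=2)
score_gap = frobenius_inner(A, V)             # equals u_i . (v_0 - v_2)
```

The inner product recovers the score difference between the two items for that user, which is why the item update is again a linear SVM in the entries of $V$, with a feature touching only two rows.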
We note that the feature matrices are highly sparse, since in each feature matrix only $2r$ out of the $d_2 r$ elements are nonzero. This motivates us to apply the stochastic dual coordinate descent algorithm [7, 22], which not only converges fast but also takes advantage of feature sparsity in linear SVMs. Each coordinate descent step takes computation, and iterations over coordinates provide linear convergence .
where is the convex conjugate of . At each coordinate descent step for , we find the value of minimizing (7) while all the other variables are fixed. If we maintain , then the coordinate descent step is simply to find minimizing
and update .
The dual problem of (6) is to solve
where is the dual vector for the subproblem (6). Similarly to , the coordinate descent step for is to replace by where minimizes
and maintain .
The detailed description of AltSVM is presented in Algorithm 1. In each subproblem, we run the stochastic dual coordinate descent, in which a pairwise comparison is chosen uniformly at random, and the dual coordinate descent for or is computed. We note that each coordinate descent step takes the same computational cost in both subproblems, while the subproblem sizes are much different.
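To illustrate the kind of update each subproblem performs, here is a minimal sketch of stochastic dual coordinate descent for a generic linear SVM with squared hinge loss, in the standard form of Hsieh et al. [7]: minimize $\frac{1}{2}\|w\|^2 + C \sum_n \max(0, 1 - y_n w^\top x_n)^2$. It is not the paper's Algorithm 1; the parameter `C`, the dense feature lists, and the function name are assumptions for the example.

```python
import random

def dual_cd_l2svm(xs, ys, C=1.0, epochs=50, seed=0):
    """Stochastic dual coordinate descent for an L2-loss linear SVM.

    Maintains w = sum_n alpha[n] * ys[n] * xs[n] so that each step only
    needs one inner product and one vector update.
    """
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    alpha = [0.0] * n
    w = [0.0] * dim
    diag = 1.0 / (2.0 * C)                 # extra diagonal term of the dual
    for _ in range(epochs * n):
        i = rng.randrange(n)               # pick one sample uniformly
        x, y = xs[i], ys[i]
        q_ii = sum(v * v for v in x) + diag
        grad = y * sum(wv * xv for wv, xv in zip(w, x)) - 1.0 + diag * alpha[i]
        new_alpha = max(alpha[i] - grad / q_ii, 0.0)   # project onto alpha >= 0
        step = (new_alpha - alpha[i]) * y
        alpha[i] = new_alpha
        for d in range(dim):
            w[d] += step * x[d]            # keep w consistent with alpha
    return w, alpha

xs = [[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]]
ys = [1, 1, -1, -1]
w, alpha = dual_cd_l2svm(xs, ys)
```

With sparse features, the inner product and the update of `w` would touch only the nonzero coordinates, which is what makes each step cheap in the setting described above.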
For each subproblem, we parallelize the stochastic dual coordinate descent algorithm asynchronously without locking. Given processors, each processor randomly samples a triple $(i, j, k)$ and updates the corresponding dual variable and the user or item vectors. We note that this update touches only a sparse subset of the parameters. In the user part, a coordinate descent step for one sample updates only $r$ out of the $d_1 r$ variables. In the item part, one coordinate descent step for a sample updates only $2r$ out of the $d_2 r$ variables. This motivates us not to lock the variables when they are updated, so that we ignore the conflicts. This lock-free parallelism was shown to be effective by Niu et al. [19] for stochastic gradient descent (SGD) on the sum of sparse functions. Moreover, Hsieh et al. [8] showed that stochastic dual coordinate descent also scales well without locking. We implemented the algorithm using the OpenMP framework. In our implementations, we also parallelized steps 3 and 13 of Algorithm 1. We show in the next section that our proposed algorithm scales up favorably.
4.2 Remark on the implementation
In Algorithm 1, the subproblem for the item vectors $V$ is solved first, followed by the subproblem for the user vectors $U$. We empirically observed that this order gives better convergence on practical datasets. We also note that each subproblem reuses the dual variables from the previous outer iteration. When the algorithm has almost converged, the features (the differences $v_j - v_k$ for solving $U$, and the matrices built from $u_i$ for solving $V$) do not change much. By reusing the dual variables from the previous iteration, we can start from a feasible solution close to the optimum.
5 Experimental results
5.1 Pairwise data
We used the MovieLens 100k dataset, which contains 100,000 ratings given by 943 users on 1682 movies. The ratings are given as integers from one to five, but we converted them into preference data by declaring that a user preferred one movie to another if they gave it a higher rating (if two movies received the same rating, we treated it as though the user did not provide a preference). Then we held out of the data as a test set.
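The conversion from ratings to preferences described above can be sketched as follows for a single user. The dictionary input format and the function name are assumptions for illustration; ties produce no comparison.

```python
from itertools import combinations

def ratings_to_comparisons(user_ratings):
    """Turn one user's {item: rating} map into (preferred, other) pairs.

    A pair is emitted only when the two ratings differ; equal ratings
    are treated as if the user gave no preference.
    """
    pairs = []
    for (j, r_j), (k, r_k) in combinations(sorted(user_ratings.items()), 2):
        if r_j > r_k:
            pairs.append((j, k))
        elif r_k > r_j:
            pairs.append((k, j))
    return pairs

# Items 10 and 30 are tied, so only two comparisons result.
pairs = ratings_to_comparisons({10: 5, 20: 3, 30: 5})
```

A user with $n$ rated items yields at most $n(n-1)/2$ comparisons, fewer when ratings tie.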
We compared our algorithm to the following two:
Bayesian Personalized Ranking (BPR) : This algorithm is based on a similar model to ours, but a different optimization procedure (essentially, a variant of stochastic gradient descent).
Matrix completion from pairwise differences: A standard matrix completion algorithm that observes – for various triples $(i, j, k)$ – the difference between user $i$'s ratings for item $j$ and item $k$. Note that this algorithm has an advantage over (2) because it sees the magnitude of this difference instead of only its sign. Nevertheless, the matrix completion algorithm does not perform any better than (2). A similar phenomenon was also observed in .
We evaluate our performance by computing the proportion of pairwise comparisons in the test set for which we correctly infer the user’s preference.
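This evaluation measure can be sketched in a few lines. The `(i, j, k, y)` encoding of held-out comparisons is an assumption for the example, and a predicted tie is counted as incorrect here, which is one possible convention.

```python
def pairwise_accuracy(X, test_comparisons):
    """Fraction of held-out comparisons whose sign the scores predict.

    Each comparison is (i, j, k, y), with y = +1 if user i prefers item j
    over item k and -1 otherwise; a predicted tie counts as a miss.
    """
    correct = sum(1 for i, j, k, y in test_comparisons
                  if y * (X[i][j] - X[i][k]) > 0)
    return correct / len(test_comparisons)

X = [[2.0, 0.0, 1.0]]                       # one user, three items
test = [(0, 0, 1, +1),    # predicted correctly
        (0, 2, 0, +1),    # predicted incorrectly
        (0, 1, 2, -1),    # predicted correctly
        (0, 0, 2, -1)]    # predicted incorrectly
accuracy = pairwise_accuracy(X, test)
```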
This is similar to the AUC statistic measured by Rendle et al., and if the data were fully observed then it would measure Kendall's tau distance between each user's true preferences and the learned ones. However, our main reason for choosing this measure of performance is that, as an average accuracy over all pairwise comparisons, it resembles the quantity that we study in our theoretical bounds.
Unsurprisingly, we were more accurate at correctly inferring strong preferences; therefore, we have also shown the accuracy obtained by only measuring performance on pairs whose rankings differ by two or more. Both the methods we considered do measurably better at predicting these orderings.
5.2 Large-scale experiments on rating data
Now we demonstrate that our algorithm performs well as a collaborative ranking method on rating data. We used the datasets specified in Table 1. Given a training set of ratings for each user, our algorithm will only use non-tying pairwise comparisons from the set, while other competing algorithms use the ratings themselves. Hence, they have access to more information than our algorithm. The competing algorithms are those with publicly available code provided by the authors.
Local Collaborative Ranking (LCR): The main idea is to predict preferences from a weighted sum of multiple low-rank matrix models. (Code: http://prea.gatech.edu. We ran the code with each of the 48 sets of loss function and parameters given in the main code, and the best result is reported. We could not run this algorithm on the Netflix dataset due to time constraints.)
RobiRank: This algorithm uses stochastic gradient descent to optimize a loss function motivated by robust binary classification. (Code: https://bitbucket.org/d_ijk_stra/robirank. We used the part for collaborative ranking from binary relevance scores, and left the parameter settings as provided with the implementation.)
Global Ranking: To see the effect of personalized ranking, we compare the results with a global ranking of the items. We fixed $U$ to all ones and solved for $V$.
The algorithms are compared in terms of two standard performance measures of ranking, which are NDCG and Precision@$K$. NDCG@$K$ is the ranking measure for numerical ratings. NDCG@$K$ for user $i$ is defined as
$$\text{NDCG@}K(i) = \frac{\text{DCG@}K(i, \pi_i)}{\text{DCG@}K(i, \pi_i^*)}, \quad \text{where} \quad \text{DCG@}K(i, \pi_i) = \sum_{k=1}^{K} \frac{2^{M_{i\pi_i(k)}} - 1}{\log_2(k + 1)},$$
and $\pi_i(k)$ is the index of the $k$th ranked item of user $i$ in our prediction. $M_{ij}$ is the true rating of item $j$ by user $i$ in the given dataset, and $\pi_i^*$ is the permutation that maximizes DCG@$K$. This measure counts only the top $K$ items in our predicted ranking and puts more weight on the prediction of highly ranked items. We measured NDCG@10 in our experiments. Precision@$K$ is the ranking measure for binary ratings. Precision@$K$ for user $i$ is defined as
$$\text{Precision@}K(i) = \frac{1}{K} \sum_{k=1}^{K} M_{i\pi_i(k)},$$
where $M_{ij}$ is the binary rating on item $j$ by user $i$ given in the dataset. This measures the fraction of relevant items among the top $K$ predicted recommendations. These two measures are averaged over all of the users.
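Both measures can be sketched for a single user as below. The sketch assumes the common DCG form with gain $2^{\text{rating}} - 1$ and a $\log_2$ position discount, and Precision@$K$ as the fraction of relevant items in the top $K$; the dictionary-based inputs and function names are illustrative.

```python
import math

def dcg_at_k(ratings_in_rank_order, k):
    """Discounted cumulative gain of the first k ratings in ranked order."""
    return sum((2.0 ** r - 1.0) / math.log2(pos + 2)
               for pos, r in enumerate(ratings_in_rank_order[:k]))

def ndcg_at_k(scores, ratings, k):
    """NDCG@k for one user; scores and ratings are {item: value} maps."""
    predicted = sorted(ratings, key=lambda item: scores[item], reverse=True)
    ideal = sorted(ratings.values(), reverse=True)
    best = dcg_at_k(ideal, k)
    if best == 0.0:
        return 0.0
    return dcg_at_k([ratings[item] for item in predicted], k) / best

def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scored items that are in `relevant`."""
    top = sorted(scores, key=lambda item: scores[item], reverse=True)[:k]
    return sum(1 for item in top if item in relevant) / k

ratings = {"a": 5, "b": 3, "c": 1}
ndcg_good = ndcg_at_k({"a": 0.9, "b": 0.5, "c": 0.1}, ratings, k=3)
ndcg_bad = ndcg_at_k({"a": 0.1, "b": 0.5, "c": 0.9}, ratings, k=3)
prec = precision_at_k({"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}, {"a", "c"}, k=2)
```

A prediction that orders the items exactly by their true ratings attains the maximum NDCG of 1, while a reversed ordering scores strictly lower.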
We first compare our algorithm with numerical rating based algorithms, CofiRank and LCR. We follow the standard setting that is used in the collaborative ranking literature [28, 2, 26, 13]. For each user, we subsampled ratings, used them for training, and took the rest of the ratings for test. The users with fewer than ratings were dropped. Table 2 compares AltSVM with numerical rating based algorithms. While is too small so that a global ranking provides the best NDCG, our algorithm performs the best with larger . We also ran our algorithm with subsampled pairwise comparisons with the largest numerical gap (AltSVM-sub), which are as many as for each user (the number of numerical ratings used in the other algorithms). Even with this, we could achieve better NDCG. We can also observe that the statistical performance is better with the hinge loss than with the logistic loss.
We have also experimented with collaborative ranking on binary ratings. We compare our algorithm against RobiRank, which is a recently proposed algorithm for collaborative ranking with binary ratings. We ran an experiment on a binarized version of the MovieLens1m dataset. In this case, the movies rated by a user are assumed to be relevant to the user, and the other items are not. Since it is inefficient to take all possible comparisons, which number about half a million per user on average, we subsampled comparisons for each user. Both algorithms are set to estimate rank-100 matrices. Table 3 shows that our algorithm provides better performance than RobiRank.
5.3 Computational speed and Scalability
We now show the computational speed and scalability of our practical algorithm, AltSVM. The experiments were run on a single 16-core machine in the Stampede Cluster at University of Texas.
Figures 2a and 2b show NDCG@10 over time of our algorithms with 1, 4, and 16 threads, compared to CofiRank. Figure 2c shows Precision@10 over time of our algorithm with . We note that our algorithm converges faster, while the sample size for our algorithm is larger than the number of training ratings that are used in the competing algorithms. Table 4 shows the scalability of AltSVM. We measured the time to achieve tolerance on the binarized MovieLens1m dataset. As can be seen in the table, we could achieve significant speedup.
We considered the collaborative ranking problem where one fits a low-rank matrix to the pairwise comparisons by multiple users. We showed that the convex relaxation of the empirical risk minimization provides good generalization guarantees. For the large-scale practical settings, we also proposed a non-convex algorithm, which alternately solves two SVM problems. Our algorithm was shown to outperform the existing ones and parallelizes well.
- Ailon  Ailon, Nir. Active learning ranking from pairwise preferences with almost optimal query complexity. In Advances in Neural Information Processing Systems (NIPS), pp. 810–818, 2011.
- Balakrishnan & Chopra  Balakrishnan, Suhrid and Chopra, Sumit. Collaborative ranking. In ACM International Conference on Web Search and Data Mining (WSDM), 2012.
- Bradley & Terry  Bradley, Ralph Allan and Terry, Milton E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, pp. 324–345, 1952.
- Davenport et al.  Davenport, Mark A, Plan, Yaniv, Berg, Ewout van den, and Wootters, Mary. 1-bit matrix completion. Information and Inference, 3(3):189–223, 2014.
- Hajek et al.  Hajek, Bruce, Oh, Sewoong, and Xu, Jiaming. Minimax-optimal inference from partial rankings. In Advances in Neural Information Processing Systems (NIPS), 2014.
- Herbrich et al.  Herbrich, Ralf, Graepel, Thore, and Obermayer, Klaus. Large Margin Rank Boundaries for Ordinal Regression, chapter 7, pp. 115–132. MIT Press, January 2000.
- Hsieh et al.  Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning (ICML), 2008.
- Hsieh et al.  Hsieh, Cho-Jui, Yu, Hsiang-Fu, and Dhillon, Inderjit S. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent. In International Conference on Machine Learning (ICML), 2015.
- Hu et al.  Hu, Yifan, Koren, Yehuda, and Volinsky, Chris. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining (ICDM), pp. 263–272. IEEE, 2008.
- Jamieson & Nowak [2011a] Jamieson, K. G. and Nowak, R. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems (NIPS), 2011a.
- Jamieson & Nowak [2011b] Jamieson, Kevin G. and Nowak, Robert D. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems (NIPS), 2011b.
- Joachims  Joachims, Thorsten. Optimizing search engines using clickthrough data. In SIGKDD, 2002.
- Lee et al.  Lee, Joonseok, Bengio, Samy, Kim, Seungyeon, Lebanon, Guy, and Singer, Yoram. Local collaborative ranking. In International World Wide Web Conference (WWW), 2014.
- Liu et al.  Liu, Nathan N, Zhao, Min, and Yang, Qiang. Probabilistic latent preference analysis for collaborative filtering. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 759–766. ACM, 2009.
- Liu  Liu, Tie-Yan. Learning to Rank for Information Retrieval. Now Publishers Inc., 2009.
- Lu & Negahban  Lu, Yu and Negahban, Sahand. Individualized rank aggregation using nuclear norm regularization. ArXiv e-prints: 1410.0860, Oct 2014.
- Luce  Luce, Duncan R. Individual Choice Behavior. Wiley, 1959.
- Negahban et al.  Negahban, Sahand, Oh, Sewoong, and Shah, Devavrat. Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems (NIPS), 2012.
- Niu et al.  Niu, Feng, Recht, Benjamin, Ré, Christopher, and Wright, Stephen. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), 2011.
- Rendle et al.  Rendle, Steffen, Freudenthaler, Christoph, Gantner, Zeno, and Schmidt-Thieme, Lars. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461. AUAI Press, 2009.
- Seginer  Seginer, Yoav. The expected norm of random matrices. Combinatorics Probability and Computing, 9(2):149–166, 2000.
- Shalev-Shwartz & Zhang  Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research (JMLR), pp. 567–599, 2013.
- Shi et al.  Shi, Yue, Karatzoglou, Alexandros, Baltrunas, Linas, Larson, Martha, Oliver, Nuria, and Hanjalic, Alan. Climf: collaborative less-is-more filtering. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pp. 3077–3081. AAAI Press, 2013.
- Srebro et al.  Srebro, Nathan, Rennie, Jason, and Jaakkola, Tommi. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2004.
- Vershynin  Vershynin, Roman. Compressed sensing: theory and applications, chapter Introduction to the non-asymptotic analysis of random matrices. Cambridge University Press, 2012.
- Volkovs & Zemel  Volkovs, Maksims N. and Zemel, Richard S. Collaborative ranking with 17 parameters. In Advances in Neural Information Processing Systems (NIPS), 2012.
- Wauthier et al.  Wauthier, Fabian L., Jordan, Michael I., and Jojic, Nebojsa. Efficient ranking from pairwise comparisons. In International Conference on Machine Learning (ICML), 2013.
- Weimer et al.  Weimer, Markus, Karatzoglou, Alexandros, Le, Quoc V., and Smola, Alex. Cofirank: maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems (NIPS), 2007.
- Weston et al.  Weston, Jason, Wang, Chong, Weiss, Ron, and Berenzweig, Adam. Latent collaborative retrieval. In International Conference on Machine Learning (ICML), 2012.
- Xu et al.  Xu, Jiaming, Wu, Rui, Zhu, Kai, Hajek, Bruce, Srikant, R, and Ying, Lei. Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. In ACM Sigmetrics, 2013.
- Yi et al.  Yi, Jinfeng, Jin, Rong, Jain, Shaili, and Jain, Anil. Inferring users' preferences from crowdsourced pairwise comparisons: A matrix completion approach. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.
- Yun et al. [2014a] Yun, Hyokun, Raman, Parameswaran, and Vishwanathan, S. V. N. Ranking via robust binary classification and parallel parameter estimation in large-scale data. In Advances in Neural Information Processing Systems (NIPS), 2014a.
- Yun et al. [2014b] Yun, Hyokun, Yu, Hsiang-Fu, Hsieh, Cho-Jui, Vishwanathan, S. V. N., and Dhillon, Inderjit S. NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In VLDB, 2014b.
Appendix A Proof of Theorem 3.1
We write for the function being optimized; i.e.,
Note that for any fixed , (where denotes the expectation taken with respect to future samples from , as distinct from which denotes the expectation over the samples used to generate ). Let be the set of matrices with nuclear norm at most . The proof of Theorem 3.1 proceeds in three main steps.
By some algebraic manipulations, we reduce the problem to showing a uniform law of large numbers for the family of functions .
Using symmetrization and duality properties of , we reduce the problem to bounding the norm of a matrix whose entries are sums of random signs.
We bound the norm of using various concentration inequalities and a theorem of Seginer .
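The duality used in the second step can be checked numerically. The sketch below assumes the relevant duality is the standard one between the nuclear norm and the operator norm: the supremum of the inner product with a fixed matrix over a nuclear-norm ball equals the radius times the operator norm of that matrix. The names `sup_inner_product_over_nuclear_ball`, `M`, and `radius` are ours and purely illustrative, and the random sign matrix is only a stand-in for the matrix appearing in the proof.

```python
import numpy as np

def sup_inner_product_over_nuclear_ball(M, radius):
    """Return (value, maximizer) of sup <X, M> over {X : ||X||_* <= radius}."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # The supremum equals radius * ||M||_op and is attained at the
    # rank-one matrix built from the top singular vectors of M.
    X_star = radius * np.outer(U[:, 0], Vt[0, :])
    return radius * s[0], X_star

rng = np.random.default_rng(0)
M = rng.choice([-1.0, 1.0], size=(30, 20))  # illustrative matrix of random signs
radius = 2.5
value, X_star = sup_inner_product_over_nuclear_ball(M, radius)

# The maximizer is feasible and attains the claimed value.
nuclear_norm = np.linalg.norm(X_star, ord="nuc")
attained = float(np.sum(X_star * M))

# Random feasible points never beat the claimed supremum.
worst_gap = np.inf
for _ in range(200):
    X = rng.standard_normal(M.shape)
    X *= radius / np.linalg.norm(X, ord="nuc")
    worst_gap = min(worst_gap, value - float(np.sum(X * M)))
```

This is why a supremum over the nuclear-norm ball can be replaced by a single operator norm, which is the quantity controlled in the third step.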
Since , by definition, minimizes , for any we can bound
In other words, it suffices to show a uniform law of large numbers for .
Let be i.i.d. -valued variables and let be the indicator that . By Giné-Zinn’s symmetrization (as in ),
Since is 1-Lipschitz, we obtain
where in the last line, we recognized that has the same distribution as . Now, let denote the matrix where . Then
Putting everything together, we have (for any )
Together with the following lemma (which we prove in Appendix B), this completes the proof of Theorem 3.1.
Appendix B Proof of Lemma A.1
We will decompose into two parts, , with
Then . Since and have the same distribution,
and so we are reduced to studying , which has i.i.d. entries. Now, we apply Seginer’s theorem :
where denotes the th row of and denotes the th column, and denotes the Euclidean norm.
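As a rough numerical illustration of this type of bound, the sketch below compares the operator norm of a sparse random sign matrix with the sum of its largest row norm and largest column norm. The dimensions, the sparsity level `density`, and the entry distribution are assumptions chosen for illustration; they are not the matrix from the proof. The simulation only suggests that the ratio stays of constant order and does not estimate the constant in the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, density = 300, 200, 0.05  # illustrative sizes and sparsity

def max_row_col_norms(A):
    """Largest Euclidean norm among the rows and among the columns of A."""
    return np.linalg.norm(A, axis=1).max(), np.linalg.norm(A, axis=0).max()

ratios = []
for _ in range(20):
    signs = rng.choice([-1.0, 1.0], size=(d1, d2))
    mask = rng.random((d1, d2)) < density
    A = signs * mask  # i.i.d., mean-zero, sparse entries
    op_norm = np.linalg.norm(A, ord=2)
    row_max, col_max = max_row_col_norms(A)
    # Deterministic lower bound: ||A|| dominates every row and column norm.
    assert op_norm >= max(row_max, col_max) - 1e-9
    ratios.append(op_norm / (row_max + col_max))

min_ratio, max_ratio = min(ratios), max(ratios)
```

The lower bound on the ratio is deterministic, so the content of the theorem is the matching upper bound in expectation.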
We will separate the task of bounding into two parts: if denotes the number of non-zero coordinates in and denotes then ; with the Cauchy-Schwarz inequality, this implies that
First, we will show that every row of is sparse. Let and let be the indicator that . Recalling that , we have (by Assumption 3.1) . Since takes non-negative integer values, we have . By Bernstein’s inequality, for any fixed
Integrating by parts, we have
Next, we will consider the size of the elements in . First of all, (this fairly crude bound will lose us a factor of ). Now, Bernstein’s inequality applied to gives
Taking a union bound over and , if then
Integrating by parts,
Going back to (12), we have shown that
The same argument applies to (but with instead of ), and so we conclude from (11) that
Appendix C Proof of Theorem 3.2
C.1 A sketch of the proof
The proof of Theorem 3.2 uses Fano’s inequality.
We construct matrices . These matrices all have small nuclear norm, and for every pair the KL-divergence between the induced observation distributions is . We construct these matrices randomly, using concentration inequalities and a union bound to show that we can take of the order .
We apply Fano’s inequality to show that if we generate data according to a randomly chosen , then any algorithm has a reasonable chance to choose a different (using the fact that the KL-divergence is ). Since the KL-divergence is , this implies that the algorithm incurs a substantial penalty whenever it makes a wrong choice.
In any application of Fano’s inequality, the key is to construct a large number of admissible models that are close to one another in KL-divergence. Specifically, if we can construct distributions with for all , then given a single sample from some , no algorithm can accurately identify which it came from. In order to apply this, denote by the distribution of the data when the true parameters are . We will construct such that for all ,
for some constant , where denotes the expected risk when the true parameters are given by . Given a single observation from some , (13) will imply (by Fano’s inequality) that no algorithm can correctly identify which was the true parameter. On the other hand, (14) will imply that if the algorithm makes a mistake – say it chooses for – then its risk will be larger than the best in the class. In particular, if we can prove (13) and (14) with then it will imply Theorem 3.2.
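To make the role of the two conditions concrete, the following sketch evaluates one standard form of Fano’s inequality: for a hypothesis drawn uniformly from a finite family whose pairwise KL-divergences are at most some value, every test errs with probability at least one minus (that value plus log 2) divided by the log of the family size. The function name and the numbers are hypothetical and are not taken from the proof.

```python
import math

def fano_error_lower_bound(num_hypotheses, max_kl):
    """Lower bound on the error probability of any test among
    num_hypotheses candidates with pairwise KL-divergences <= max_kl,
    using P(error) >= 1 - (max_kl + log 2) / log(num_hypotheses)."""
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (max_kl + math.log(2)) / math.log(num_hypotheses)
    return max(0.0, bound)  # the bound is vacuous when it is negative

# Illustrative numbers: 1024 hypotheses, KL at most a quarter of log(1024).
ell = 1024
bound = fano_error_lower_bound(ell, 0.25 * math.log(ell))
```

The bound is only useful when the KL-divergence is a small fraction of the log of the number of hypotheses, which is why the construction needs many models that are close to one another.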
We construct a set of matrices satisfying (13) and (14) using a probabilistic method. Supposing that , we choose a parameter and set to be an integer that is approximately . We define by filling its top block with independent, uniform entries, and then copying that top block times to fill the matrix. Then let be independent copies of . First of all, each because .
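A minimal sketch of this block-copy construction follows. Since the symbols are not legible here, we assume that the top block has `k` rows and full width, that its entries are uniform on plus or minus a parameter `gamma`, and that `k` divides the number of rows; all names and sizes are illustrative. Under these assumptions the matrix has rank at most `k`, which yields a simple nuclear-norm bound.

```python
import numpy as np

def block_repeated_sign_matrix(d1, d2, k, gamma, rng):
    """Fill a k x d2 block with i.i.d. uniform +/-gamma entries, then
    stack copies of that block to obtain a d1 x d2 matrix."""
    if d1 % k != 0:
        raise ValueError("this sketch assumes k divides d1")
    block = gamma * rng.choice([-1.0, 1.0], size=(k, d2))
    return np.tile(block, (d1 // k, 1))

rng = np.random.default_rng(2)
d1, d2, k, gamma = 60, 80, 5, 0.1  # illustrative values
X = block_repeated_sign_matrix(d1, d2, k, gamma, rng)

rank = np.linalg.matrix_rank(X)
nuclear_norm = np.linalg.norm(X, ord="nuc")
# ||X||_* <= sqrt(rank) * ||X||_F, and ||X||_F = gamma * sqrt(d1 * d2) here.
nuclear_bound = gamma * np.sqrt(k * d1 * d2)
```

Copying the block keeps the rank small while every entry stays at the same scale, so the nuclear norm is controlled by the choice of `k` and `gamma`.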
Now, let us consider . For a single triple, there is probability of having different from , in which case they differ by . If is bounded above, each different entry contributes to the KL-divergence between and . Since about entries are observed in , we see that
On the other hand, and differ by , because for a constant fraction of triples , the chance that is 1 differs by in and , and on the event that differs in these two models the loss differs by another factor.
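The per-entry KL contribution discussed above can be sanity-checked in a simple special case. The sketch assumes, purely for illustration, Bernoulli observations with a logistic link (the observation model in the paper may differ) and a hypothetical base score difference `base`. It compares the KL-divergence with the standard chi-square upper bound and tracks how it scales with the squared perturbation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

base = 0.3  # hypothetical score difference, bounded away from infinity
results = []
for delta in [0.4, 0.2, 0.1, 0.05]:
    p, q = sigmoid(base), sigmoid(base + delta)
    kl = bernoulli_kl(p, q)
    chi2 = (p - q) ** 2 / (q * (1.0 - q))  # standard upper bound on the KL
    results.append((delta, kl, chi2, kl / delta ** 2))
```

In this special case the ratio of the KL-divergence to the squared perturbation stays bounded, consistent with each differing entry contributing a quadratically small amount when the scores are bounded.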
C.2 Some concentration lemmas
We begin by quoting some standard concentration results (see, e.g. ).
A random variable is