A Channel Coding Perspective of Collaborative Filtering

Preliminary results related to this submission were presented by us at ISIT 2009, Seoul, Korea.
We consider the problem of collaborative filtering from a channel coding perspective. We model the underlying rating matrix as a finite alphabet matrix with a block constant structure. The observations are obtained from this underlying matrix through a discrete memoryless channel, with a noisy part representing noisy user behavior and an erasure part representing missing data. Moreover, the clusters over which the underlying matrix is constant are unknown. We establish a sharp threshold result for this model: if the largest cluster size is smaller than (where the rating matrix is of size ), then the underlying matrix cannot be recovered by any estimator, but if the smallest cluster size is larger than , then we exhibit a polynomial time estimator with diminishing probability of error. In the case of uniform cluster size, not only the order of the threshold but also the constant is identified.
As new content mushrooms at a brisk pace, finding relevant information is increasingly a challenge. Consequently, recommendation systems are commonly being used to assist users: Amazon recommends books, Netflix recommends movies, LinkedIn recommends professional contacts, Google recommends webpages for a given query, etc. Such recommendation systems exploit various aspects to make suggestions: popularity amongst peers, similarity of content, available user-item ratings, etc. This paper is about collaborative filtering using the rating matrix: we are interested in making recommendations using only available ratings given by users to the items they have experienced. In a practical system, such a rating based collaborative filter is typically complemented by content-based analysis specific to the data.
There is a vast literature on recommendation systems and collaborative filtering; see for example the special issue  and the survey paper . Given the massive datasets and the lack of good statistical models of user behavior, the dominant stream of work has been to propose methods and demonstrate their scalability on real data sets. Recently, however, the Netflix Prize  popularized the problem in other research communities, and several researchers have started exploring provably good methods. This paper falls in the latter category: we deal with fundamental limits of collaborative filters. In the remainder of this section, we first discuss related models and results, and then outline our model and results.
I-A Related Work
The Netflix data consists of a rating matrix where the rows correspond to movies and the columns correspond to users. Only a small fraction of the entries are known, and the goal is to estimate the missing entries; that is, this is a matrix completion problem. Several algorithms have been proposed and tested on this data set; see for example . Mathematically, without any further restriction, this is an ill-posed problem. Motivated by this, some authors have recently considered the matrix completion problem under the restriction of low-rank matrices. (This problem also arises in other contexts, such as location estimation in sensor networks.) The problem has attracted much attention, and in the past year a number of results have been reported. In , using the nuclear norm minimization proposed in , an upper bound on the number of samples needed for asymptotic recovery is derived in terms of the size and rank of the matrix. In , a lower bound is established on the number of samples needed by any algorithm. The order of this lower bound is shown to be achievable in . In , the problem of matrix recovery from linear measurements (of which sampling is a special case) is considered and a new algorithm is proposed. In , the problem of matrix completion under bounded noise is considered; a semi-definite programming based algorithm is proposed and shown to have recovery error proportional to the noise magnitude.
In this paper, we take an alternative channel coding viewpoint of the problem. Our results differ from the above works in several aspects outlined below.
We consider finite alphabet for the ratings and a different model for the rating matrix based on row and column clusters.
We consider noisy user behavior, and our goal is not to complete the missing entries, but to estimate an underlying “block constant” matrix (in the limit as the matrix size grows).
Since we consider a finite alphabet, even in the presence of noise, error free recovery is asymptotically feasible. Hence, unlike , which considers real-valued matrices, we do not allow any distortion.
We next outline our model and results.
I-B Summary of Our Model and Results
We consider a finite alphabet for the ratings. In this section, we briefly outline our model and results without any mathematical details; the details can be found in subsequent sections.
To motivate our model, consider an ideal situation where every user rates every item without any noise. In this ideal scenario, it is reasonable to expect that similar users rate similar items by the same value. We therefore assume that the users (items) are clustered into groups of similar users (items, respectively). The rating matrix in this ideal situation (say with size ) is then a block constant matrix (where the blocks correspond to cartesian product of row and column clusters). The observations are obtained from by passing its entries through a discrete memoryless channel (DMC) consisting of an erasure channel modeling missing data and a noisy DMC representing noisy user behavior. Moreover, the row and column clusters are unknown. The goal is to make recommendations by estimating based on the observations. The performance metric we use is the probability of block error: we make an error if any of the entries in the estimate is erroneous. Our goal is to identify conditions under which error free recovery is possible in the limit as the matrix size grows large. Thus we view the recommendation system problem as a channel coding problem.
The cluster sizes in our model represent the resolution: the larger the cluster, the smaller the degrees of freedom (or the rate of the channel code). If the channel is noisier and the erasures more frequent, then we can only support a small number of codewords. The challenge is to find the exact order. For our model, we show that if the largest cluster size (defined precisely in Section III) is smaller than , where is a constant dependent on the channel parameters, then for any estimator the probability of error approaches one. On the other hand, if the smallest cluster size (defined precisely in Section III) is larger than , where is a constant dependent on the channel parameters, then we give a polynomial time algorithm that has diminishing probability of error. Thus we identify the order of the threshold exactly. In the case of uniform cluster size, the constants and are identical, and thus in this special case even the constant is identified precisely. Moreover, for the special case of binary ratings and uniform cluster size, the algorithm used to show the achievability part does not depend on the cluster size or the erasure parameter, and needs only the knowledge of a worst case parameter for the noisy part of the channel. These results are obtained by averaging over (as per the probability law specified in Section II).
The achievability part of our result is shown by first clustering the rows and columns, and then estimating the matrix entries assuming that the clustering is correct. The clustering is done by computing a normalized Hamming metric for every pair of rows and comparing it with a threshold to determine whether the rows are in the same cluster or not. The converse is proved by considering the case when the clusters are known exactly. Our results for the average case show that the threshold is determined by the problem of estimating entries; relatively speaking, clustering is an easier task (see Figure 1 for an illustration).
I-C Organization of the Paper
The precise model for and the observations is stated in Section II. The case of uniform cluster size and binary ratings leads to sharper bounds and results. Hence results for this case are given in Section III. The case of general alphabets and non-uniform cluster sizes is considered in Section IV. The conclusion is given in Section V, while all the proofs are collected together in Section VI.
All the logarithms are to the natural base unless specified otherwise. denotes the KL divergence () between probability mass functions and . By we mean that for large enough, . By we denote the indicator variable, which is 1 if is true and 0 otherwise.
II Model and Assumptions
The main elements of our model are a block constant ensemble of rating matrices (whose blocks of constancy are not known) and an observation matrix obtained from the underlying rating matrix via a noisy channel and erasures. The noise in the observations represents the inherent noise in user-item ratings as well as the error in our model. The erasures denote missing entries. To be more precise, suppose is the unknown rating matrix with entries from a finite alphabet, where is the number of buyers and is the number of items. Let and be partitions of and respectively. We call the sets clusters and we call ’s (’s) the row (column) clusters. We denote the corresponding row and column cluster sizes by and , and the number of row clusters and the number of column clusters by and respectively. Thus , .
We state our results under two sets of conditions: A1)-A4) and B1)-B3) below. Conditions A1)-A4) are a special case of conditions B1)-B3). The results under A1)-A4) are sharper and illustrate the important concepts more easily; hence they are stated separately. We begin by stating and discussing A1)-A4), and then state B1)-B3). (A few additional conditions needed in the results are stated at appropriate places.)
Conditions A1)-A4): The conditions A1)-A4) below correspond to a binary rating matrix with equal size clusters and a uniform probability of sampling entries.
The entries of are from .
The row (column) clusters are of equal size: , for all .
is constant over the cluster and the entries are i.i.d. Bernoulli(1/2) across the clusters.
The observed data ( denotes erasure) is obtained by passing the entries of through the cascade of a binary symmetric channel (BSC) with probability of error and an erasure channel with erasure probability .
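The observation model A1)-A4) is easy to simulate. The following sketch is our own illustration, not part of the paper's algorithms; the names `k`, `l` (numbers of row and column clusters), `r`, `c` (cluster sizes), `p` (BSC crossover probability), and `eps` (erasure probability) are our notation, not necessarily the paper's. It builds a block constant binary matrix with uniform cluster sizes and passes its entries through a BSC followed by an erasure channel.

```python
import random

def generate_observations(k, l, r, c, p, eps, seed=0):
    """Simulate A1)-A4): a (k*r) x (l*c) binary block constant rating
    matrix X (one i.i.d. Bernoulli(1/2) value per block), observed
    through a BSC(p) cascaded with an erasure channel ('?' = erasure)."""
    rng = random.Random(seed)
    # one i.i.d. Bernoulli(1/2) value per (row cluster, column cluster)
    block = [[rng.randint(0, 1) for _ in range(l)] for _ in range(k)]
    m, n = k * r, l * c
    X = [[block[i // r][j // c] for j in range(n)] for i in range(m)]
    Y = [['?' if rng.random() < eps
          else X[i][j] ^ (rng.random() < p)   # BSC flips with prob p
          for j in range(n)] for i in range(m)]
    return X, Y
```

With `p = eps = 0` the observations coincide with the underlying matrix, which makes the block structure of the model easy to inspect.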
The cluster sizes are representative of the resolution of : large cluster sizes correspond to a coarse structure with fewer degrees of freedom in choosing , while small cluster sizes correspond to a fine structure. Condition A2) suggests that we can think of the cluster size as representative of the resolution of , and it plays a central role in our results. If we think of all permissible as a channel code, then a higher corresponds to a smaller rate code. However, in order to interpret precisely, we also need to take into account condition A3). When the entries of the cluster are filled with i.i.d. Bernoulli(1/2) random variables as per A3), it is possible that the rows in two clusters turn out to be the same, in which case these two row clusters can be merged to form a single bigger cluster. The following lemma shows that if the number of clusters is , then this happens with small probability, and hence we should think of as the representative cluster size.
If , , then
and a similar result holds for the column clusters.
Each row is uniformly distributed over possibilities, and rows in different clusters are independent. Hence the probability that any given pair of rows is the same is . Since there are pairs, we then have
Since , we have
Hence if for some , then
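The union bound used in this proof is easy to evaluate numerically. In the sketch below, `K` and `n` are our own (hypothetical) names for the number of row clusters and the row length, under the assumption that each cluster row is uniform over the 2^n binary strings:

```python
def merge_prob_bound(K, n):
    """Union bound from the lemma: with K row clusters whose rows are
    i.i.d. uniform binary strings of length n, the probability that
    some pair of cluster rows coincides is at most (K choose 2) * 2^-n."""
    return K * (K - 1) / 2 * 2.0 ** (-n)
```

For example, with 10 clusters and rows of length 20 the bound is already below 10^-4, consistent with the lemma's message that accidental cluster merging is a small-probability event.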
Condition A3) also implies that in any row or column of a large matrix, the numbers of 0s and 1s are roughly the same. This essentially means that the opinions are diverse for any user or item. While this may seem unrealistic (and can indeed be fixed), we prefer the Bernoulli(1/2) model for the following reason: under this assumption no recommendations can be extracted from any row or column alone, and thus collaborative filtering is necessary. Such a model is desirable for the evaluation of collaborative filtering schemes. Moreover, one can pre-process the data so that rows and columns whose fraction of 1s is far from 1/2 are removed (because they are relatively easy to recommend), after which assumption A3) is reasonable. We note that in condition A3) we only specify the probability law of given the clusters; the clusters are deterministic, even though they are unknown.
The BSC in A4) models the inherent noise in user-item ratings as well as modeling error, while the erasure channel models the missing data.
Conditions B1)-B3): These conditions are more general, allowing any finite alphabet and non-uniform cluster sizes.
The entries of are from a finite alphabet .
is constant over the cluster and the entries across the clusters are i.i.d. with a uniform distribution over .
The observed data ( denotes erasure) is obtained from as follows
The entries of are passed through a DMC with probability law and output alphabet , resulting in .
The entries are then passed through an erasure channel with erasure probability .
III Binary Rating Matrix
In this section, we state our results under conditions A1)-A4). The main result of this section appears in Section III-A. It is obtained by studying two quantities: the probability of error when the clustering is known (Section III-B) and the probability of error in clustering for a specific algorithm (Section III-C).
III-A Main Result
Our main result stated below identifies a threshold on the cluster size above which error free recovery is asymptotically feasible but below which error free recovery is not possible.
Suppose conditions A1)-A4) are true and the clusters are unknown. Let . Suppose that and , .
then for any estimator.
The proof is given in Section VI-A.
The result identifies as the cluster size threshold. The first part states that if the cluster size is too small, then any estimator makes an error with high probability. The second part states that if the cluster size is large enough, then diminishing probability of error can be achieved with a polynomial time estimator, which does not need knowledge of and needs only knowledge of a worst case bound on . The result is reminiscent of the channel coding theorem in the context of our model.
The proof of Part 1) of Theorem 1 relies on lower bounding by considering the case of known clustering (see Theorem 2 in Section III-B). The proof of Part 2) of Theorem 1 relies on showing that for the average case, the probability of error in clustering is much smaller than the probability of error in filling values when the clusters are known (see Theorem 3 in Section III-C). We illustrate this in Figure 1 by plotting various bounds: for , ranging from 10 to 150, and , we plot
upper and lower bounds for probability of error when clustering is known (from Theorem 2),
upper bound on probability of clustering error (from Theorem 3),
and the asymptotic threshold (from Theorem 1).
It is seen that around the asymptotic threshold, the probability of clustering error is dominated by the probability of error in filling values under known clustering.
III-B Known Clustering
In this section, we consider the case when the clusters are known. Under this assumption, the decoder only has to estimate the value in a cluster, and the minimum probability of error estimator under A3) is just a majority decoder. The analysis of this decoder is elementary and we state a stronger result for a fixed with possibly unequal cluster sizes. Let
where and are the row and column cluster sizes in .
Suppose conditions A1), A3) are true and in addition assume that the clusters are known. Let
Then the probability of error in filling in values satisfies
Suppose we are given a sequence of rating matrices of increasing size, that is, . Then the following are true.
The proof is given in Section VI-B.
We note that when all the clusters are of the same size (which happens with high probability as per Lemma 1), then the above result states that there is a sharp threshold: if the cluster size is smaller than , then exact recovery is not possible, but if it is larger, then we can make probability of error as small as we wish.
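The known-clustering decoder discussed in this subsection reduces to a per-block majority vote over the non-erased observations. A minimal sketch for the binary alphabet (the function name and the tie-break to 0 are our own choices; under A3) the block value is uniform, so any fixed tie-breaking rule yields the same error probability as a fair coin):

```python
def majority_fill(Y, row_clusters, col_clusters):
    """Given observations Y (entries 0/1 or '?' for erasures) and known
    row/column clusters (lists of index lists), estimate each block's
    value by a majority vote over its non-erased observations.
    Ties (including fully erased blocks) are resolved to 0."""
    Xhat = [[0] * len(Y[0]) for _ in Y]
    for R in row_clusters:
        for C in col_clusters:
            samples = [Y[i][j] for i in R for j in C if Y[i][j] != '?']
            v = 1 if sum(samples) > len(samples) / 2 else 0
            for i in R:
                for j in C:
                    Xhat[i][j] = v
    return Xhat
```

The decoder touches each entry a constant number of times, so its running time is linear in the matrix size once the clusters are given.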
III-C Probability of Clustering Error
To get an upper bound on the probability of error , in this section we analyze a specific collaborative filter: we first cluster the rows and columns using the algorithm described below, and then we fill in the values using the majority decoder, assuming that the clustering is correct. The majority decoder was analyzed in Section III-B, so to prove Part 2) of Theorem 1 we only need to analyze the probability of error in clustering.
Clustering Algorithm: We cluster rows and columns separately. For rows , the normalized Hamming distance over commonly sampled entries is
where is the number of commonly sampled positions in rows and , given by
Let be equal to 1 if rows , belong to the same cluster and let it be 0 otherwise. The algorithm gives an estimate:
where is a threshold whose choice is discussed later. A similar algorithm is used to cluster the columns. We are interested in the probability that we make an error in row clustering, averaged over the probability law on the rating matrices, defined as
We note that this is a conservative definition of clustering error. As seen in Lemma 1, there is a small chance that rows in different clusters are identical, resulting in the merging of two clusters into a larger one. The above definition of error does not account for this and declares more errors. We use this conservative definition to simplify the analysis.
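The pairwise test underlying the clustering algorithm can be sketched as follows. The function name and the handling of rows with no common samples are our own choices; under A3)-A4) the expected normalized distance is 2p(1-p) for rows in the same cluster and 1/2 for rows in different clusters, so the threshold is chosen between these two values.

```python
def same_cluster(y1, y2, tau):
    """Decide whether two observed rows y1, y2 (entries 0/1 or '?')
    belong to the same cluster: compute the normalized Hamming
    distance over the commonly sampled positions and compare it with
    the threshold tau. Returns 1 (same cluster) iff the distance is
    below tau; rows with no common samples are declared different."""
    common = [(a, b) for a, b in zip(y1, y2) if a != '?' and b != '?']
    if not common:
        return 0
    dist = sum(a != b for a, b in common) / len(common)
    return 1 if dist < tau else 0
```

Running this test over all row pairs (and analogously over column pairs) yields the clustering estimate analyzed in Theorem 3.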
Suppose conditions A1)-A4) are true. Let , be constants and let be the smaller root of the quadratic equation
where . Suppose the threshold . Let
Then for the above clustering algorithm,
for a positive constant .
The proof is given in Section VI-C.
The proof uses the union bound and considers pairwise errors. The pairwise errors consist of two cases: an error when the pair of rows is in the same cluster, and an error when they are in different clusters. The probability of the first kind of error is exponentially decaying in . The probability of the second kind of error is upper bounded by the minimum of and : while is tight for finite and large , the bound is useful for establishing asymptotic results (like Theorem 1) for all . For example, in Figure 1 the upper bound on the clustering error is dominated by , while the proof of Part 2) of Theorem 1 uses . We note that both and have terms that decay exponentially in as well as . The terms decaying exponentially in are related to Lemma 1 and the conservative definition of clustering error discussed before the statement of Theorem 3. These terms are the origin of the condition in Part 2) of Theorem 1 and could perhaps be avoided with a more sophisticated analysis; however, we prefer to work with this condition since, as per Lemma 1, it is anyway needed for interpreting as the representative cluster size.
IV General Finite Alphabet and Non-uniform Clusters
In this section, we consider a general finite alphabet and non-uniform cluster sizes. We work with assumptions B1)-B3) described in Section II and generalize the results of Section III. To state our results, we first introduce some notation. For , define
If , are i.i.d. uniform on and we pass them through the DMC to get outputs , , then
The following useful lemma sheds light on the relationship between and .
For any DMC, , with equality iff .
The proof is given in Section VI-E.
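The two match probabilities defined above are easy to compute for any concrete DMC from its transition matrix. In the sketch below, `q_same` and `q_diff` are our own names for them: the probability that two channel outputs agree when the inputs are equal (and uniform), versus when the inputs are i.i.d. uniform and independent. A variance argument shows `q_same >= q_diff`, with equality iff the output distribution does not depend on the input, providing a numerical check of the inequality in Lemma 2.

```python
def match_probabilities(W):
    """For a DMC given as a row-stochastic matrix W[x][y] = P(y | x)
    with uniform inputs, return (q_same, q_diff):
      q_same = P(Y1 == Y2 | X1 == X2)  = (1/k) * sum_x sum_y W(y|x)^2
      q_diff = P(Y1 == Y2), X1, X2 i.i.d. uniform
             = sum_y ((1/k) * sum_x W(y|x))^2
    Since sum_y Var_x W(y|x) >= 0, we always have q_same >= q_diff."""
    k = len(W)
    n_out = len(W[0])
    q_same = sum(sum(p * p for p in row) for row in W) / k
    q_diff = sum((sum(W[x][y] for x in range(k)) / k) ** 2
                 for y in range(n_out))
    return q_same, q_diff
```

For the BSC(p) this recovers the familiar values 1 - 2p(1-p) and 1/2 used in Section III.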
We next state our main result for general finite alphabet and non-uniform cluster size.
Suppose conditions B1)-B3) are true and the clusters are unknown. Then there exist constants , such that
then for any estimator.
Achievability: Suppose that there exist some such that and . (By Lemma 2, this ensures that .) If , , , and
then for the following polynomial time estimator:
Cluster rows and columns using the algorithm of Section III-C using the threshold (which does not depend on ).
Employ maximum likelihood decoding in a cluster assuming the clustering is correct.
The above result again identifies as the exact order of the cluster size threshold for asymptotic recovery. Similar to the binary alphabet and uniform cluster size case in Section III, the constants , arise from the case when the clusters are known (see Theorem 5 below). The gap between the constants can be made arbitrarily small: the proof of Theorem 5 identifies a constant (see equation (29)) such that for any ,
is a valid choice in Theorem 4.
We next consider the case when the clusters are known and extend Theorem 2.
Suppose conditions B1)-B3) are true and in addition assume that the clusters are known. Also let
where are as defined above. Then for a sequence of rating matrices of increasing size , the following are true.
The proof is given in Section VI-D.
Finally, we study the performance of the clustering algorithm and extend Theorem 3.
Suppose conditions B1)-B3) are true and in addition suppose that there exist some such that and . (By Lemma 2, this ensures that .) If we choose the threshold , then
for some positive constants , . Consequently, if , then as .
The proof is given in Section VI-F.
V Conclusion

We take a channel coding perspective of collaborative filtering and identify the threshold on the cluster size for perfect reconstruction of the underlying rating matrix. The result is similar in flavor to some recent results on the completion of real-valued matrices. The advantage of our model is that the proofs are relatively simple, relying on Chernoff bounds, and that noisy user behavior is easily handled.
In the typical applications of recommendation systems, there is a lack of good models. We believe that our model has two characteristics that make it suitable for analytical comparison of various methods: a) in our model the user opinions are diverse and no single user/item reveals much information about itself, that is, collaborative filtering is necessary; b) as we have shown, the model is analytically tractable. There are several directions where this model may turn out to be useful: analysis of bit error probability instead of block error probability, analysis of local popularity based mechanisms, etc.
VI Proofs of Results
VI-A Proof of Theorem 1
When are known, under our model all feasible rating matrices are equally likely. Hence the ML decoder gives the minimum probability of error, and so we have . To prove Part 1), we lower bound . Let be the event that . Proceeding as in Lemma 1, we have for , , ,
Hence . Now,
But on the event , and hence we get
But from Part 1) of Theorem 2, for
This proves Part 1).
Next we prove Part 2). Let denote the event that the clustering is identified correctly. We note that the probability of error in estimating averaged over the probability law on the block constant matrices satisfies
VI-B Proof of Theorem 2
Suppose that in cluster we have non-erased samples. Then the probability of a correct decision in this cluster is given by
Averaging over the number of non-erased samples, the probability of a correct decision in cluster is given by
Since the erasure and BSC are memoryless
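The two averaging steps above can also be carried out exactly for a single cluster, which is a useful sanity check on the bounds that follow. The sketch below is our own illustration (the function name is ours; `size` is the number of entries in the cluster, `eps` the erasure probability, `p` the BSC crossover probability, and ties are resolved by a fair coin, correct with probability 1/2 under A3)):

```python
from math import comb

def prob_correct_cluster(size, eps, p):
    """Exact probability, under A3)-A4), that the majority decoder
    recovers one cluster's value: average over the Binomial(size, 1-eps)
    number s of non-erased samples, then over the BSC flips among them.
    A tie (possible only for even s, including s = 0) is correct with
    probability 1/2."""
    total = 0.0
    for s in range(size + 1):
        p_s = comb(size, s) * (1 - eps) ** s * eps ** (size - s)
        # strict majority of the s samples unflipped: fewer than s/2 flips
        p_maj = sum(comb(s, f) * p ** f * (1 - p) ** (s - f)
                    for f in range(s + 1) if f < s - f)
        p_tie = (comb(s, s // 2) * p ** (s // 2) * (1 - p) ** (s // 2)
                 if s % 2 == 0 else 0.0)
        total += p_s * (p_maj + 0.5 * p_tie)
    return total
```

As expected, the probability of a correct decision increases with the cluster size, which is the mechanism behind the cluster size threshold.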
Upper Bound: The desired upper bound is obtained by deriving a lower bound on . First we note that from (11),
But for and , . Substituting this in the previous equation, we have
We note that for , . Hence
where the first inequality holds for . This is true since . The upper bound follows by noting that
Lower Bound: The lower bound on is obtained from an upper bound on . From (11),
If is even, we have
Now from (12),
Using this bound on in (13), we have
Asymptotics: Now consider a sequence of rating matrices of increasing size. The upper bound on error in (1) is a decreasing function of . Hence if
The lower bound on error (1) is a decreasing function of , and hence substituting the above upper bound on , we have
where are some positive constants. Hence as .
VI-C Proof of Theorem 3
Recall that is the number of commonly sampled positions in rows and , given by
From the Chernoff bound [10, Theorem 1], we have
To get a handle on the probability of error, we first analyze it conditioned on the erasure sequence and . Let denote the erasure matrix:
Rows in Same Cluster: Consider rows of and suppose , i.e. are in the same cluster. We wish to evaluate the probability of error . In this case, the random variable is given by
For any column such that , the indicator has mean . Hence, the above summation has i.i.d. Bernoulli random variables of mean . An application of the Chernoff bound [10, Theorem 1] yields
Rows in Different Clusters: Next consider the case , i.e. rows and are in different clusters. We wish to evaluate . For and fixed , , the random variable is given by
Note that for any column such that , the indicator has mean
if , and
Define as the number of columns such that and . Then from (23), we observe that the first sum in (23) has i.i.d. Bernoulli random variables of mean and the second sum has i.i.d. Bernoulli random variables of mean , all the random variables being independent. Using the Chernoff bound, we may then write