
# Matrix Approximation under Local Low-Rank Assumption

## Abstract

Matrix approximation is a common tool in machine learning for building accurate prediction models for recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is low-rank. We propose a new matrix approximation model where we assume instead that the matrix is only locally low-rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy on recommendation tasks.


## 1 Introduction

Matrix approximation is a common task in machine learning. Given a few observed entries $M_{a,b}$, $(a,b) \in A$, of a matrix $M$, matrix approximation constructs a matrix $\hat{M}$ that approximates $M$ at its unobserved entries. In general, the problem of completing a matrix based on a few observed entries is ill-posed, as there are an infinite number of matrices that perfectly agree with the observed entries of $M$. Thus, we need additional assumptions, such as $M$ being a low-rank matrix. More formally, we approximate a matrix $M \in \mathbb{R}^{n_1 \times n_2}$ by a rank-$r$ matrix $\hat{M} = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and $r \ll \min(n_1, n_2)$. In this note, we assume that $M$ behaves as a low-rank matrix in the vicinity of certain row-column combinations, instead of assuming that the entire matrix $M$ is low-rank. We therefore construct several low-rank approximations of $M$, each being accurate in a particular region of the matrix. Smoothing the local low-rank approximations, we express $\hat{M}$ as a linear combination of low-rank matrices that approximate the unobserved matrix $M$. This mirrors the theory of non-parametric kernel smoothing, which was primarily developed for continuous spaces, and generalizes well-known compressed sensing results to our setting.

## 2 Global and Local Low-Rank Matrix Approximation

We describe in this section two standard approaches for low-rank matrix approximation (LRMA). The original (partially observed) matrix is denoted by $M \in \mathbb{R}^{n_1 \times n_2}$, and its low-rank approximation by $\hat{M} = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and $r \ll \min(n_1, n_2)$.

### Global LRMA

Incomplete SVD is a popular approach for constructing a low-rank approximation $\hat{M} = UV^\top$ by minimizing the squared error over the set $A$ of observed entries of $M$:

$$(U, V) = \operatorname*{argmin}_{U, V} \sum_{(a,b) \in A} \left( [UV^\top]_{a,b} - M_{a,b} \right)^2. \tag{1}$$

Another popular approach is minimizing the nuclear norm of a matrix $X$ (defined as the sum of the singular values of $X$) subject to constraints constructed from the training set:

$$\hat{M} = \operatorname*{argmin}_{X} \|X\|_* \quad \text{s.t.} \quad \|\Pi_A(X - M)\|_F < \alpha, \tag{2}$$

where $\Pi_A$ is the projection defined by $[\Pi_A(X)]_{a,b} = X_{a,b}$ if $(a,b) \in A$ and $0$ otherwise, and $\|\cdot\|_F$ is the Frobenius norm.

Minimizing the nuclear norm $\|X\|_*$ is an effective surrogate for minimizing the rank of $X$. One advantage of (2) over (1) is that we do not need to constrain the rank of $\hat{M}$ in advance. However, problem (1) is substantially easier to solve than problem (2).
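For concreteness, objective (1) can be minimized by gradient descent on the factors $U$ and $V$, updating only through the observed entries. The following is a minimal NumPy sketch; the learning rate, iteration count, and initialization scale are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def incomplete_svd(M, mask, r, lr=0.02, epochs=2000, seed=0):
    """Fit a rank-r factorization U V^T to the observed entries of M
    (mask[a, b] == 1 iff (a, b) is observed), as in Eq. (1)."""
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = 0.1 * rng.standard_normal((n1, r))
    V = 0.1 * rng.standard_normal((n2, r))
    for _ in range(epochs):
        E = mask * (U @ V.T - M)   # residual, zeroed at unobserved entries
        U, V = U - lr * (E @ V), V - lr * (E.T @ U)
    return U, V
```

Simultaneous (rather than alternating) factor updates keep the sketch short; alternating least squares or stochastic gradient steps are common in practice.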

### Local LRMA

In order to facilitate a local low-rank matrix approximation, we need to assume that there exists a metric structure over $[n_1] \times [n_2]$, where $[n]$ denotes the set of integers $\{1, \dots, n\}$. Formally, the distance $d((a,b), (a',b'))$ reflects the similarity between the rows $a$ and $a'$ and the columns $b$ and $b'$. In the global matrix factorization setting above, we assume that the matrix $M$ has a low-rank structure. In the local setting, however, we assume that the model is characterized by multiple low-rank matrices. Specifically, we assume a mapping $\mathcal{T}: [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}$ that associates with each row-column combination $(a,b)$ a low-rank matrix $\mathcal{T}(a,b)$ that describes the entries of $M$ in its neighborhood (in particular this applies to the observed entries $A$): $M_{a,b} = \mathcal{T}(a,b)_{a,b}$, where $(a,b) \in [n_1] \times [n_2]$. Note that in contrast to the global estimate in Global LRMA, our model now consists of multiple low-rank matrices, each describing the original matrix $M$ in a particular neighborhood. Figure 1 illustrates this model.

Without additional assumptions, it is impossible to estimate the mapping $\mathcal{T}$ from a set of observations. Our additional assumption is that the mapping $\mathcal{T}$ is slowly varying. Since the domain of $\mathcal{T}$ is discrete, we assume that $\mathcal{T}$ is Hölder continuous. Following common approaches in non-parametric statistics, we define a smoothing kernel $K_h(s_1, s_2)$, where $s_1, s_2 \in [n_1] \times [n_2]$, as a non-negative symmetric unimodal function that is parameterized by a bandwidth parameter $h > 0$. A large value of $h$ implies that $K_h(s, \cdot)$ has a wide spread, while a small $h$ corresponds to a narrow spread of $K_h(s, \cdot)$. We use, for example, the Epanechnikov kernel, defined as $K_h(s_1, s_2) = \frac{3}{4}\left(1 - d(s_1, s_2)^2\right) \mathbf{1}\{d(s_1, s_2) < h\}$. We denote by $K_h^{(a,b)}$ the matrix whose $(i,j)$-entry is $K_h((a,b), (i,j))$.

Incomplete SVD (1) and compressed sensing (2) can be extended to the local setting as follows:

$$\text{Incomplete SVD:} \quad \hat{\mathcal{T}}(a,b) = \operatorname*{argmin}_{X} \left\| K_h^{(a,b)} \odot \Pi_A(X - M) \right\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = r, \tag{3}$$

$$\text{Compressed Sensing:} \quad \hat{\mathcal{T}}(a,b) = \operatorname*{argmin}_{X} \|X\|_* \quad \text{s.t.} \quad \left\| K_h^{(a,b)} \odot \Pi_A(X - M) \right\|_F < \alpha, \tag{4}$$

where $\odot$ denotes the component-wise (Hadamard) product of two matrices, $[X \odot Y]_{a,b} = X_{a,b} \cdot Y_{a,b}$.
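The local objective (3) differs from the global one only in that each observed residual is weighted by the anchor's kernel value. A minimal NumPy sketch via gradient descent, with illustrative (not tuned) hyperparameters:

```python
import numpy as np

def local_incomplete_svd(M, mask, K_anchor, r, lr=0.02, epochs=2000, seed=0):
    """Kernel-weighted variant of incomplete SVD, as in Eq. (3): the
    residual at each observed entry is scaled by the anchor's kernel
    matrix K_anchor before the gradient step."""
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = 0.1 * rng.standard_normal((n1, r))
    V = 0.1 * rng.standard_normal((n2, r))
    W = K_anchor * mask   # kernel weights restricted to observed entries
    for _ in range(epochs):
        E = W * (U @ V.T - M)
        U, V = U - lr * (E @ V), V - lr * (E.T @ U)
    return U @ V.T        # the local estimate T_hat(a, b)
```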

The two optimization problems above describe how to estimate $\hat{\mathcal{T}}(a,b)$ for a particular choice of $(a,b)$. Conceptually, this technique can be applied for each test entry $(a,b)$, resulting in the matrix approximation $\hat{M}$, where $\hat{M}_{a,b} = \hat{\mathcal{T}}(a,b)_{a,b}$. However, this requires solving a non-linear optimization problem for each test index $(a,b)$ and is thus computationally prohibitive. Instead, we use Nadaraya-Watson local regression with a set of $q$ local estimates $\hat{\mathcal{T}}(s_1), \dots, \hat{\mathcal{T}}(s_q)$, anchored at points $s_1, \dots, s_q \in [n_1] \times [n_2]$, in order to obtain a computationally efficient estimate for all $s \in [n_1] \times [n_2]$:

$$\hat{\hat{\mathcal{T}}}(s) = \sum_{i=1}^{q} \frac{K_h(s_i, s)}{\sum_{j=1}^{q} K_h(s_j, s)} \, \hat{\mathcal{T}}(s_i). \tag{5}$$

Equation (5) is simply a weighted average of $\hat{\mathcal{T}}(s_1), \dots, \hat{\mathcal{T}}(s_q)$, where the weights ensure that values of $\hat{\mathcal{T}}$ at anchor points close to $s$ contribute more than anchor points further away from $s$.
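The weighted average in (5) can be computed entrywise for the whole matrix at once. A sketch assuming the $q$ local estimates and their kernel matrices have been precomputed (names are illustrative):

```python
import numpy as np

def nadaraya_watson(local_estimates, kernel_mats):
    """Entrywise weighted average of Eq. (5).
    local_estimates: list of q (n1 x n2) local estimates T_hat(s_i).
    kernel_mats: list of q (n1 x n2) matrices with
                 kernel_mats[i][a, b] = K_h(s_i, (a, b))."""
    num = sum(K * T for K, T in zip(kernel_mats, local_estimates))
    den = sum(kernel_mats)
    return num / den  # assumes every entry has positive weight under some anchor
```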

Note that the local version can be faster than global SVD since (a) the low-rank approximations are independent of one another and can therefore be computed in parallel, and (b) the rank used in the local SVD models can be significantly lower than the rank used in a global one. If the kernel has limited support ($K_h(s_i, s)$ is often zero), the regularized SVD problems are sparser than the global SVD problem, resulting in additional speedup.

## 3 Experiments

We compare local-LRMA to global-LRMA and other state-of-the-art techniques on popular recommendation system datasets: MovieLens 10M and Netflix. We split the data into train and test sets with a 9:1 ratio. A default prediction value of 3.0 was used whenever we encountered a test user or item without training observations. We use the Epanechnikov kernel, assuming a product form $K_h((a,b), (c,d)) = K_{h_1}(a, c) \, K_{h_2}(b, d)$. For the distance function $d$, we use the arccos distance, the angle between the vectors representing two rows (or columns). Anchor points were chosen randomly among observed training entries. Regularization is used for the local low-rank approximations.
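The exact formula for the arccos distance was lost from this extract; the sketch below implements the standard form, the angle between two representation vectors, as an assumption:

```python
import numpy as np

def arccos_distance(u, v):
    """Standard arccos distance between two vectors: the angle between
    u and v (assumed form; clipping guards against rounding error)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Identical directions give distance 0, and orthogonal directions give $\pi/2$, so the distance is small for similar rows (or columns) regardless of vector magnitude.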

Figure 2 graphs the RMSE of local-LRMA and global-LRMA, as well as the recently proposed DFC (Divide-and-Conquer Matrix Factorization), as a function of the number of anchor points $q$. Both local-LRMA and global-LRMA improve as the rank increases, but local-LRMA outperforms global-LRMA at any rank. Moreover, local-LRMA outperforms global-LRMA on average even with a few anchor points (though the performance of local-LRMA improves further as the number of anchor points increases).
