
# Online high rank matrix completion

## Abstract

Recent advances in matrix completion enable data imputation in full-rank matrices by exploiting low dimensional (nonlinear) latent structure. In this paper, we develop a new model for high rank matrix completion (HRMC), together with batch and online methods to fit the model and an out-of-sample extension to complete new data. The method works by (implicitly) mapping the data into a high dimensional polynomial feature space using the kernel trick; importantly, the data occupies a low dimensional subspace in this feature space, even when the original data matrix is of full rank. We introduce an explicit parametrization of this low dimensional subspace, and an online fitting procedure, to reduce computational complexity compared to the state of the art. The online method can also handle streaming or sequential data and adapt to non-stationary latent structure. We provide guidance on the sampling rate required for these methods to succeed. Experimental results on synthetic data and motion capture data validate the performance of the proposed methods.


## 1 Introduction

In the past ten years, low rank matrix completion (LRMC) has been widely studied [CandesRecht2009, 5466511, mazumder2010spectral, MC_ShattenP_AAAI125165, MC_IRRN, hardt2014understanding, MC_Universal2014, liu2016low, Fan2017290]. For instance, Candès and Recht [CandesRecht2009] showed that any incoherent low-rank matrix can be exactly recovered, with high probability, from a sufficient number of uniformly randomly sampled entries by solving a convex nuclear norm minimization (NNM) problem. However, LRMC cannot recover high rank or full-rank matrices, even when the data lies on a low dimensional (nonlinear) manifold. To address this problem, a few researchers have recently developed new high rank matrix completion (HRMC) methods [eriksson2012high, li2016structured, yang2015sparse] for data drawn from multiple subspaces [SSC_PAMIN_2013, NIPS2016_6357, FAN2018SFMC] or nonlinear models [NLMC2016, pmlr-v70-ongie17a, FANNLMC]. These HRMC methods can outperform LRMC methods on many real problems such as subspace clustering with missing data, motion data recovery [NIPS2016_6357, pmlr-v70-ongie17a], image inpainting, and classification [NLMC2016, FANNLMC].

All the aforementioned LRMC and HRMC methods are offline methods. However, for many problems, we obtain one sample at a time and would like to update the model as each new sample arrives using online optimization. In addition, compared to offline methods, online methods [rendle2008online, mairal2009online, yun2015streaming] often have lower space and time complexities and can adapt to changes in the latent data structure. For these reasons, online matrix completion has recently gained increasing attention [balzano2010online, dhanjal2014online, kawale2015efficient, lois2015online].

## 2 Related work and our contribution

#### Online matrix completion.

Sun and Luo [sun2016guaranteed] and Jin et al. [jin2016provable] proposed to use stochastic gradient descent (SGD) to solve the low rank factorization (LRF) problem $\operatorname*{minimize}_{U,V}\sum_{(i,j)\in\Omega}\big(M_{ij}-(UV^\top)_{ij}\big)^2$ with variables $U\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{n\times r}$, where $\Omega$ denotes the locations of observed entries of $M$. Specifically, given an observed entry $M_{ij}$, the $i$-th row of $U$ and the $j$-th row of $V$ are updated by gradient descent. Yun et al. [yun2015streaming] studied the streaming or online matrix completion problem in which the columns of the matrix are presented sequentially. The GROUSE method proposed in [balzano2010online] used incremental gradient descent on the Grassmannian manifold of subspaces to learn a low rank factorization from incomplete data online. These online methods share a common limitation: they cannot recover high rank matrices. Mairal et al. [mairal2009online] also studied the online factorization problem with the goal of learning a dictionary $D$ for sparse coding: each sample is approximated as $x\approx Dz$ with a sparse coefficient vector $z$. A sparse factorization based matrix completion algorithm was proposed in [FAN2018SFMC]. It is possible to recover a high rank matrix online by combining ideas from [mairal2009online] with [FAN2018SFMC].
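To make the SGD-on-LRF idea concrete, here is a minimal sketch (function names, sizes, learning rate, and seed are our own illustrative choices, not the cited authors' implementations): each observed entry triggers an update of one row of $U$ and one row of $V$.

```python
import numpy as np

def sgd_lrf_step(U, V, i, j, m_ij, lr=0.1):
    """One SGD step for the low rank factorization (LRF) objective
    sum over observed (i, j) of (M_ij - (U V^T)_ij)^2: given one observed
    entry M_ij, only row i of U and row j of V are updated."""
    err = m_ij - U[i] @ V[j]
    U[i], V[j] = U[i] + lr * err * V[j], V[j] + lr * err * U[i]

# toy demo: rank-2 ground truth, roughly half the entries observed
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
mask = rng.random(M.shape) < 0.5
U = 0.1 * rng.standard_normal((30, 2))
V = 0.1 * rng.standard_normal((30, 2))
for epoch in range(300):
    for i, j in zip(*np.nonzero(mask)):
        sgd_lrf_step(U, V, i, j, M[i, j])
rel_err = np.linalg.norm(mask * (M - U @ V.T)) / np.linalg.norm(mask * M)
```

Because each step touches only two factor rows, the per-entry cost is $O(r)$, which is what makes this family of methods attractive for streaming data.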

#### High rank matrix completion.

Elhamifar [NIPS2016_6357] proposed to use a group-sparse constraint to complete a high rank matrix consisting of data drawn from a union of low-dimensional subspaces. Alameda-Pineda et al. [NLMC2016] proposed a nonlinear matrix completion method for classification. The method performs matrix completion on a matrix consisting of (nonlinear) feature-label pairs, where the unknown labels are regarded as missing entries. The method is inapplicable to general matrix completion problems in which the locations of the missing entries are not necessarily confined to a single block. Ongie et al. [pmlr-v70-ongie17a] assumed the columns of $X$ lie on an algebraic variety and proposed a method called VMC to recover the missing entries of $X$ through minimizing the rank of $\phi(X)$, where $\phi(X)$ is a feature matrix given by a polynomial kernel. Fan and Chow [FANNLMC] assumed the data are drawn from a nonlinear latent variable model and proposed a nonlinear matrix completion method (NLMC) that minimizes the rank of $\phi(X)$, where $\phi(X)$ is composed of high-dimensional nonlinear features induced by a polynomial kernel or an RBF kernel.

#### Challenges in HRMC.

First, existing HRMC methods lack strong theoretical guarantees on the sample complexity required for recovery. For example, in VMC, the authors provide a lower bound on the sampling rate (equation (6) of [pmlr-v70-ongie17a]) only for low-order polynomial kernels, and the bound involves an unknown parameter owing to the algebraic variety assumption. In NLMC [FANNLMC], the authors only provided a coarse lower bound on the sampling rate in terms of $d$, the dimension of the latent variables. Second, existing HRMC methods are not scalable to large matrices. For example, VMC and NLMC require a singular value decomposition of an $n\times n$ kernel matrix in every iteration. The method of [NIPS2016_6357] is also inefficient because of the sparse optimization over an $n\times n$ coefficient matrix. Third, existing HRMC methods have no out-of-sample extension, which means they cannot efficiently complete new data. Last but not least, existing HRMC methods are offline methods and cannot handle online data.

#### Contributions.

In this paper, we aim to address these challenges. First, we propose a novel high rank matrix completion method based on kernelized factorization (KFMC), which is more efficient and accurate than state-of-the-art methods. Second, we propose an online version of KFMC, which can outperform online LRMC significantly. Third, we propose an out-of-sample extension for KFMC, which enables us to use the pre-learned high rank model to complete new data directly. Finally, we analyze the sampling rate required for KFMC to succeed.

## 3 Methodology

### 3.1 High rank matrices

We assume the columns of $X\in\mathbb{R}^{m\times n}$ are given by

$$x = f(s) = [f_1(s), f_2(s), \ldots, f_m(s)]^\top, \tag{1}$$

where $s\in\mathbb{R}^d$ ($d\ll m$) consists of uncorrelated variables and each $f_i$, $i=1,\ldots,m$, is a $p$-order polynomial with random coefficients. For example, when $d=2$ and $p=2$, $x_i=c_i^\top\bar{s}$ for $i=1,\ldots,m$, where $\bar{s}=[1,s_1,s_2,s_1^2,s_2^2,s_1s_2]^\top$ and $c_i\in\mathbb{R}^6$ is a random coefficient vector. Lemma 1 shows that $X$ is of high rank when $d$ or $p$ is large.
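A minimal sketch of this generative model (sizes, seed, and helper names are illustrative assumptions) for $d=2$, $p=2$: each coordinate is a random linear combination of the six monomials of $s$ up to degree 2, and the resulting rank matches the count of monomials.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, m, n = 2, 2, 20, 100

def monomials_deg2(s):
    """Monomials of s = (s1, s2) up to degree p = 2."""
    s1, s2 = s
    return np.array([1.0, s1, s2, s1**2, s2**2, s1 * s2])

S = rng.standard_normal((d, n))                 # latent variables, d << m
C = rng.standard_normal((m, 6))                 # random polynomial coefficients
X = C @ np.stack([monomials_deg2(S[:, j]) for j in range(n)], axis=1)

rank_X = np.linalg.matrix_rank(X)               # generically min(m, n, 6) = 6 here
```

With this small $p$ the rank is only 6; increasing $p$ (or $d$) increases the number of monomials and drives $X$ toward full rank, which is the high rank regime this paper targets.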

###### Lemma 1.

Suppose the columns of $X$ satisfy (1). Then with probability 1, $\operatorname{rank}(X)=\min\{m,n,\binom{d+p}{p}\}$.

###### Proof.

Expand the polynomial $f_i$ for each $i$ to write $f_i(s)=c_i^\top\bar{s}$, where $\bar{s}\in\mathbb{R}^{\binom{d+p}{p}}$ is the vector of all monomials of $s$ of degree at most $p$ and $c_i\in\mathbb{R}^{\binom{d+p}{p}}$ collects the corresponding coefficients. Each column of $X$ satisfies $x=C\bar{s}$, where $C=[c_1,\ldots,c_m]^\top\in\mathbb{R}^{m\times\binom{d+p}{p}}$. The matrix $X$ can be written as $X=C\bar{S}$, where $\bar{S}=[\bar{s}_1,\ldots,\bar{s}_n]$. The variables are uncorrelated and the coefficients are random, so generically both $C$ and $\bar{S}$ are full rank. Hence $\operatorname{rank}(X)=\min\{m,n,\binom{d+p}{p}\}$. ∎

In this paper, our goal is to recover $X$ from a few randomly sampled entries, whose locations we denote by $\Omega$. When $d$ or $p$ is large, $X$ is generically high rank and cannot be recovered by conventional LRMC methods.

Remark. Throughout this paper, we use the terms “low rank” or “high rank” matrix to mean a matrix whose rank is low or high relative to its side length.

Let $\phi:\mathbb{R}^m\rightarrow\mathbb{R}^{\bar{m}}$ be a $q$-order polynomial feature map. Here $\bar{m}=\binom{m+q}{q}$. Write $\phi(X)=[\phi(x_1),\phi(x_2),\ldots,\phi(x_n)]$ and consider its rank:

###### Theorem 1.

Suppose the columns of $X$ satisfy (1). Then with probability 1, $\operatorname{rank}(\phi(X))=\min\{\bar{m},n,\binom{d+pq}{pq}\}$.

###### Proof.

Define the $pq$-order polynomial map $\psi:=\phi\circ f$. Expanding as above, write the vector $\phi(x)=\psi(s)=\Psi\bar{s}$, where $\bar{s}\in\mathbb{R}^{\binom{d+pq}{pq}}$ is the vector of all monomials of $s$ of degree at most $pq$ and $\Psi\in\mathbb{R}^{\bar{m}\times\binom{d+pq}{pq}}$ collects the coefficients, and write the matrix $\phi(X)=\Psi\bar{S}$ with $\bar{S}=[\bar{s}_1,\ldots,\bar{s}_n]$. As above, $\Psi$ and $\bar{S}$ are generically full rank, so $\operatorname{rank}(\phi(X))=\min\{\bar{m},n,\binom{d+pq}{pq}\}$ with probability 1. ∎

While $\bar{m}\gg m$, Theorem 1 shows that $\phi(X)$ is generically low rank when $d$ is small and $n$ is large. For example, when $d=2$, $p=4$, $q=2$, $m=20$, and $n=210$, generically $\operatorname{rank}(X)=15$ while $\operatorname{rank}(\phi(X))=45$: $X$ is high rank but $\phi(X)$ is low rank.
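This rank behavior is easy to check numerically. The sketch below (helper names and the seed are ours; we use $n=100$ columns rather than the full example for speed, which does not change the feature-space rank) builds data with $d=2$, $p=4$ and applies an explicit degree-2 polynomial feature map:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_feats(x, q):
    """All monomials of the entries of x up to degree q (a q-order feature map)."""
    feats = [1.0]
    for deg in range(1, q + 1):
        feats += [np.prod(x[list(c)]) for c in combinations_with_replacement(range(len(x)), deg)]
    return np.array(feats)

rng = np.random.default_rng(0)
d, p, q, m, n = 2, 4, 2, 20, 100
S = rng.standard_normal((d, n))
C = rng.standard_normal((m, 15))   # 15 = C(d+p, p) monomials of degree <= p in d = 2 variables
X = C @ np.stack([poly_feats(S[:, j], p) for j in range(n)], axis=1)    # m x n
PX = np.stack([poly_feats(X[:, j], q) for j in range(n)], axis=1)       # 231 x n

r_X = np.linalg.matrix_rank(X)     # min(20, 100, C(6, 4))  = 15: (nearly) full rank
r_PX = np.linalg.matrix_rank(PX)   # min(231, 100, C(10, 8)) = 45: low rank in feature space
```

The matrix $X$ has rank 15 out of a possible 20, while its 231-dimensional feature embedding has rank only 45 out of 100 columns, which is the low rank structure KFMC exploits.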

### 3.2 Kernelized factorization

To recover the missing entries of $X$, we propose to solve

$$\operatorname*{minimize}_{X,A,Z}\ \frac12\|\phi(X)-AZ\|_F^2 \quad\text{subject to}\quad X_{ij}=M_{ij},\ (i,j)\in\Omega, \tag{2}$$

where $A\in\mathbb{R}^{\bar{m}\times r}$, $Z\in\mathbb{R}^{r\times n}$, and $r\geq\binom{d+pq}{pq}$. The solution to (2) completes the entries of $X$ using the natural low rank structure of $\phi(X)$. Problem (2) implicitly defines an estimate for $\phi(X)$: $\widehat{\phi(X)}=AZ$.

For numerical stability, we regularize $A$ and $Z$ and solve

$$\operatorname*{minimize}_{X,A,Z}\ \frac12\|\phi(X)-AZ\|_F^2+\frac{\alpha}{2}\|A\|_F^2+\frac{\beta}{2}\|Z\|_F^2 \quad\text{subject to}\quad X_{ij}=M_{ij},\ (i,j)\in\Omega, \tag{3}$$

where $\alpha$ and $\beta$ are regularization parameters, instead of (2).

It is possible to solve (3) directly, but the computational cost is quite high when $\bar{m}$ and $n$ are large. The following lemma shows that there is no need to model $A$ explicitly.

###### Lemma 2.

For any $X$ generated by (1), there exist $D\in\mathbb{R}^{m\times r}$ and $Z\in\mathbb{R}^{r\times n}$ such that $\phi(X)=\phi(D)Z$.

###### Proof.

Suppose the columns of $D$ are also generated by (1) (e.g. take any $r$ columns of $X$), so $\phi(D)$ and $\phi(X)$ share their column space and (with probability 1) $\phi(D)$ is full rank. More precisely, write $\phi(X)=BC_X$ and $\phi(D)=BC_D$, where the columns of $B$ form a basis for the column space, and both $C_X$ and $C_D$ are full rank. Define $Z:=C_D^\dagger C_X$ and the result follows. ∎

Hence any solution of the following also solves (3):

$$\operatorname*{minimize}_{X,D,Z}\ \frac12\|\phi(X)-\phi(D)Z\|_F^2+\frac{\alpha}{2}\|\phi(D)\|_F^2+\frac{\beta}{2}\|Z\|_F^2 \quad\text{subject to}\quad X_{ij}=M_{ij},\ (i,j)\in\Omega, \tag{4}$$

where $D\in\mathbb{R}^{m\times r}$ is much smaller than $A\in\mathbb{R}^{\bar{m}\times r}$ of (3). Use the trace function $\operatorname{Tr}$ to rewrite the objective in (4) as

$$\frac12\operatorname{Tr}\big(\phi(X)^\top\phi(X)-2\phi(X)^\top\phi(D)Z+Z^\top\phi(D)^\top\phi(D)Z\big)+\frac{\alpha}{2}\operatorname{Tr}\big(\phi(D)^\top\phi(D)\big)+\frac{\beta}{2}\|Z\|_F^2.$$

Now we use the kernel trick to avoid explicitly computing the feature map $\phi$. Define $k(x,y):=\phi(x)^\top\phi(y)$, so $\phi(X)^\top\phi(X)=K_{XX}$, $\phi(X)^\top\phi(D)=K_{XD}$, and $\phi(D)^\top\phi(D)=K_{DD}$, where $K_{XX}\in\mathbb{R}^{n\times n}$, $K_{XD}\in\mathbb{R}^{n\times r}$, and $K_{DD}\in\mathbb{R}^{r\times r}$ are the corresponding kernel matrices. The most widely-used kernels are the polynomial kernel (Poly) and the radial basis function kernel (RBF)

$$\text{Poly:}\ \ k(x,y)=(x^\top y+c)^q, \qquad \text{RBF:}\ \ k(x,y)=\exp\Big(-\frac{1}{\sigma^2}\|x-y\|^2\Big), \tag{5}$$

with hyperparameters $c$, $q$, and $\sigma$. The (implicit) feature maps of Poly and RBF are the $q$-order and infinite-order polynomial maps, respectively. Rewrite (4) to define kernelized factorization matrix completion (KFMC):

$$\operatorname*{minimize}_{X,D,Z}\ \ell(Z,D,X) \quad\text{subject to}\quad X_{ij}=M_{ij},\ (i,j)\in\Omega, \tag{KFMC}$$

where $\ell(Z,D,X):=\frac12\operatorname{Tr}(K_{XX})-\operatorname{Tr}(K_{XD}Z)+\frac12\operatorname{Tr}(Z^\top K_{DD}Z)+\frac{\alpha}{2}\operatorname{Tr}(K_{DD})+\frac{\beta}{2}\|Z\|_F^2$. For the RBF kernel, $\operatorname{Tr}(K_{DD})=r$ (and likewise $\operatorname{Tr}(K_{XX})=n$) is a constant and can be dropped from the objective.
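The objective can be evaluated without ever forming $\phi$ explicitly. A minimal sketch for the polynomial kernel (function names, sizes, and default hyperparameters are our own illustrative choices):

```python
import numpy as np

def poly_kernel(A, B, c=1.0, q=2):
    """K[i, j] = (a_i^T b_j + c)^q for columns a_i of A and b_j of B."""
    return (A.T @ B + c) ** q

def kfmc_objective(X, D, Z, alpha=0.1, beta=0.1, c=1.0, q=2):
    """l(Z, D, X) = 0.5 Tr(K_XX) - Tr(K_XD Z) + 0.5 Tr(Z^T K_DD Z)
                    + 0.5 alpha Tr(K_DD) + 0.5 beta ||Z||_F^2."""
    K_XX = poly_kernel(X, X, c, q)
    K_XD = poly_kernel(X, D, c, q)
    K_DD = poly_kernel(D, D, c, q)
    fit = 0.5 * np.trace(K_XX) - np.trace(K_XD @ Z) + 0.5 * np.trace(Z.T @ K_DD @ Z)
    return fit + 0.5 * alpha * np.trace(K_DD) + 0.5 * beta * np.linalg.norm(Z) ** 2

# sanity check: with D = X and Z = I, phi(D) Z = phi(X), so the fit term vanishes
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
residual = kfmc_objective(X, X, np.eye(8), alpha=0.0, beta=0.0)
```

Note that all computations involve only $n\times n$, $n\times r$, and $r\times r$ matrices, never the $\bar{m}$-dimensional features.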

### 3.3 Optimization for KFMC

The optimization problem (KFMC) is nonconvex and has three blocks of variables. We propose using coordinate descent over these three blocks to find a stationary point.

#### Update Z.

To begin, complete the missing entries of $X$ arbitrarily and randomly initialize $D\in\mathbb{R}^{m\times r}$. Let $I_r$ denote the $r\times r$ identity matrix. Fix $X$ and $D$ and update $Z$ as

$$Z \leftarrow \operatorname*{arg\,min}_Z\ \ell(Z,D,X) = \operatorname*{arg\,min}_Z\ -\operatorname{Tr}(K_{XD}Z)+\frac12\operatorname{Tr}(Z^\top K_{DD}Z)+\frac{\beta}{2}\|Z\|_F^2 = (K_{DD}+\beta I_r)^{-1}K_{XD}^\top. \tag{6}$$
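In code, the update (6) is a single linear solve; a sketch (names are ours):

```python
import numpy as np

def update_Z(K_XD, K_DD, beta):
    """Closed-form Z update (6): Z = (K_DD + beta I_r)^{-1} K_XD^T."""
    r = K_DD.shape[0]
    return np.linalg.solve(K_DD + beta * np.eye(r), K_XD.T)

# the solution zeroes the gradient K_DD Z + beta Z - K_XD^T of the objective in (6)
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
K_DD = B @ B.T                         # symmetric PSD, like a kernel matrix
K_XD = rng.standard_normal((40, 6))
Z = update_Z(K_XD, K_DD, beta=0.1)
grad = K_DD @ Z + 0.1 * Z - K_XD.T
```

Using a solve rather than an explicit inverse is the standard, numerically preferable choice here.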

#### Update D.

There is no closed form solution for the minimization of $\ell$ with respect to $D$ because of the kernel matrices. Instead, we propose the additive update $D\leftarrow D-\Delta_D$. We compute $\Delta_D$ using a relaxed Newton method, described below for the Poly and RBF kernels.

For the polynomial kernel, rewrite the terms in the objective in which $D$ appears as

$$\ell(Z,D,X):=-\operatorname{Tr}\big((W_1\odot(X^\top D+c))Z\big)+\frac12\operatorname{Tr}\big(Z^\top(W_2\odot(D^\top D+c))Z\big)+\frac{\alpha}{2}\operatorname{Tr}\big(W_2\odot(D^\top D+c)\big), \tag{7}$$

defining $W_1:=(X^\top D+c)^{\odot(q-1)}$ and $W_2:=(D^\top D+c)^{\odot(q-1)}$, where $\odot$ is elementwise multiplication and $(\cdot)^{\odot u}$ denotes the elementwise $u$-power. Inspired by iteratively reweighted optimization, fix $W_1$ and $W_2$ to approximate the gradient and Hessian of $\ell$ with respect to $D$ as

$$g_D := -X(W_1\odot Z^\top)+D\big(ZZ^\top\odot W_2+\alpha\, W_2\odot I_r\big), \qquad H_D := ZZ^\top\odot W_2+\alpha\, W_2\odot I_r.$$

$H_D$ is positive definite by the Schur product theorem. Now choose $\tau>1$ for numerical stability and define the update

$$\Delta_D := \frac{1}{\tau}\, g_D H_D^{-1}. \tag{8}$$

The effectiveness of our update for $D$ is guaranteed by the following lemma. (The proof of the lemma and a discussion of the role of $\tau$ are in the supplementary material.)

###### Lemma 3.

The update (8) is a relaxed Newton’s method and ensures sufficient decrease in the objective:

$$\ell(Z,D-\Delta_D,X)-\ell(Z,D,X)\leq-\frac{1}{2\tau}\operatorname{Tr}\big(g_D H_D^{-1} g_D^\top\big).$$
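A sketch of the polynomial-kernel $D$ step (8) follows (sizes, seed, and the conservative $\tau$ are illustrative assumptions): with the weights held fixed, one relaxed Newton step should decrease the objective.

```python
import numpy as np

def poly_K(A, B, c=1.0, q=2):
    return (A.T @ B + c) ** q

def kfmc_loss(X, D, Z, alpha=0.1, c=1.0, q=2):
    return (0.5 * np.trace(poly_K(X, X, c, q)) - np.trace(poly_K(X, D, c, q) @ Z)
            + 0.5 * np.trace(Z.T @ poly_K(D, D, c, q) @ Z)
            + 0.5 * alpha * np.trace(poly_K(D, D, c, q)))

def update_D_poly(X, D, Z, alpha=0.1, c=1.0, q=2, tau=20.0):
    """Relaxed Newton step (8) for D with the weights W1, W2 held fixed."""
    r = D.shape[1]
    W1 = (X.T @ D + c) ** (q - 1)                      # n x r
    W2 = (D.T @ D + c) ** (q - 1)                      # r x r
    H = (Z @ Z.T) * W2 + alpha * (W2 * np.eye(r))      # approximate Hessian (PD)
    g = -X @ (W1 * Z.T) + D @ H                        # approximate gradient
    return D - (1.0 / tau) * g @ np.linalg.inv(H)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))
D = rng.standard_normal((10, 8))
Z = 0.1 * rng.standard_normal((8, 50))
before = kfmc_loss(X, D, Z)
after = kfmc_loss(X, update_D_poly(X, D, Z), Z)
```

The only matrix inverted is the $r\times r$ matrix $H_D$, so the step costs far less than operating in the $\bar{m}$-dimensional feature space.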

For the RBF kernel, the gradient is

$$\nabla_D\ell=\frac{1}{\sigma^2}(XQ_1-D\Gamma_1)+\frac{2}{\sigma^2}(DQ_2-D\Gamma_2). \tag{9}$$

(Throughout, we abuse notation to write $\nabla_D\ell$ for $\nabla_D\ell(Z,D,X)$.) Here $Q_1=-2(K_{XD}\odot Z^\top)$, $\Gamma_1=\operatorname{diag}(\mathbf{1}_n^\top Q_1)$, $Q_2=K_{DD}\odot(ZZ^\top)$, and $\Gamma_2=\operatorname{diag}(\mathbf{1}_r^\top Q_2)$, where $\mathbf{1}_n\in\mathbb{R}^n$ and $\mathbf{1}_r\in\mathbb{R}^r$ are composed of 1s. The following lemma (proved in the supplementary material) indicates that $XQ_1$ in (9) is nearly a constant compared to $D\Gamma_1$, $DQ_2$, and $D\Gamma_2$, provided that $m$ and $\sigma$ are large enough:

###### Lemma 4.

$\left\|\dfrac{\partial\,(XQ_1)}{\partial D}\right\|\leq\epsilon$, where $\epsilon$ is a small constant.

Therefore, we can compute an approximate Hessian by neglecting $XQ_1$. As in (8), we define

$$\Delta_D := \frac{1}{\tau}\,\nabla_D\ell\,\Big(\frac{1}{\sigma^2}(2Q_2-\Gamma_1-2\Gamma_2)\Big)^{-1}. \tag{10}$$

#### Update X.

Finally, fixing $Z$ and $D$, we wish to minimize (KFMC) over $X$, which again has no closed-form solution. Again, we suggest updating $X$ using a relaxed Newton method $X\leftarrow X-\Delta_X$. For the polynomial kernel,

$$g_X = qX(W_3\odot I_n)-qD(W_4^\top\odot Z) = qX\odot(\mathbf{1}_m w^\top)-qD(W_4^\top\odot Z), \tag{11}$$

where $W_3=(X^\top X+c)^{\odot(q-1)}$, $W_4=(X^\top D+c)^{\odot(q-1)}$, $\mathbf{1}_m\in\mathbb{R}^m$ consists of 1s, and $w\in\mathbb{R}^n$ consists of the diagonal entries of $W_3$. As above, we define

$$\Delta_X := \frac{1}{\tau}\, g_X\odot\big(\mathbf{1}_m (w^{\odot-1})^\top\big). \tag{12}$$

When the RBF kernel is used, we get

$$\nabla_X\ell=\frac{1}{\sigma^2}(DQ_3-X\Gamma_3), \tag{13}$$

where $Q_3=-2(Z\odot K_{XD}^\top)$ and $\Gamma_3=\operatorname{diag}(\mathbf{1}_r^\top Q_3)$. (For the RBF kernel, $\operatorname{Tr}(K_{XX})$ is constant, so only the $K_{XD}$ term contributes.) As in (10), define

$$\Delta_X := \frac{1}{\tau}\,\nabla_X\ell\,\Big(-\frac{1}{\sigma^2}\Gamma_3\Big)^{-1}. \tag{14}$$

Here the computational cost is not high in practice because the matrix to be inverted is diagonal.
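A sketch of the polynomial-kernel $X$ step, cf. (11)-(12) (names, sizes, seed, and the conservative $\tau$ are our illustrative assumptions). The key detail is that only the unobserved entries change, which enforces the constraint $X_{ij}=M_{ij}$ on $\Omega$:

```python
import numpy as np

def update_X_poly(X, D, Z, observed, c=1.0, q=2, tau=20.0):
    """Relaxed Newton step on X for the polynomial kernel, cf. (11)-(12).
    Only the unobserved entries (observed == False) are changed."""
    w = (np.sum(X * X, axis=0) + c) ** (q - 1)    # n-vector: diagonal of W3
    W4 = (X.T @ D + c) ** (q - 1)                 # n x r
    g = q * X * w - q * D @ (W4.T * Z)            # gradient with the weights held fixed
    X_new = X - (1.0 / tau) * g / w               # scale column j by 1 / w_j
    X_new[observed] = X[observed]                 # keep observed entries fixed
    return X_new

def loss_in_X(X, D, Z, c=1.0, q=2):
    K_XX, K_XD, K_DD = (X.T @ X + c) ** q, (X.T @ D + c) ** q, (D.T @ D + c) ** q
    return 0.5 * np.trace(K_XX) - np.trace(K_XD @ Z) + 0.5 * np.trace(Z.T @ K_DD @ Z)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 40))
D = rng.standard_normal((8, 6))
Z = 0.1 * rng.standard_normal((6, 40))
observed = rng.random((8, 40)) < 0.5
X_new = update_X_poly(X, D, Z, observed)
before, after = loss_in_X(X, D, Z), loss_in_X(X_new, D, Z)
```

The preconditioner here is diagonal (one scalar $w_j$ per column), so the step costs no more than a gradient step.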

We can also use a momentum update to accelerate the convergence of the $D$ and $X$ updates:

$$\begin{cases} \hat\Delta_D\leftarrow\eta\hat\Delta_D+\Delta_D, & D\leftarrow D-\hat\Delta_D,\\[2pt] \hat\Delta_X\leftarrow\eta\hat\Delta_X+\Delta_X, & X\leftarrow X-\hat\Delta_X, \end{cases} \tag{15}$$

where $\eta\in[0,1)$ is a constant. The optimization method is summarized as Algorithm 1. The following lemma (with proof in the supplement) shows the method converges.

###### Lemma 5.

For sufficiently small $\eta$, Algorithm 1 converges to a stationary point.
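The momentum scheme (15) is generic and can be sketched in a few lines; the toy quadratic, step size, and $\eta$ below are illustrative assumptions, not part of the paper:

```python
import numpy as np

def momentum_step(param, delta, buf, eta=0.5):
    """One update of (15): the buffer accumulates past steps, then is applied."""
    buf = eta * buf + delta
    return param - buf, buf

# toy illustration on f(x) = 0.5 x^T A x, whose minimizer is 0
A = np.diag([1.0, 10.0])
x, buf = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(100):
    x, buf = momentum_step(x, 0.05 * (A @ x), buf, eta=0.5)
```

In Algorithm 1 the same buffer pattern is applied to $\Delta_D$ and $\Delta_X$, with one buffer per variable block.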

### 3.4 Online KFMC

Suppose we obtain one incomplete sample $x$ at a time and must update the model of matrix completion in a timely manner, i.e., solve the optimization online. In (4), we can put the constraint into the objective function directly and get the following equivalent problem:

$$\operatorname*{minimize}_{[X]_{\bar\Omega},\,D,\,Z}\ \sum_{j=1}^n\Big(\frac12\|\phi(x_j)-\phi(D)z_j\|^2+\frac{\alpha}{2n}\|\phi(D)\|_F^2+\frac{\beta}{2}\|z_j\|^2\Big), \tag{16}$$

where $[X]_{\bar\Omega}$ denotes the unknown entries of $X$. Denote

$$\ell([x_j]_{\omega_j},D):=\min_{z_j,\,[x_j]_{\bar\omega_j}}\ \frac12\|\phi(x_j)-\phi(D)z_j\|^2+\frac{\alpha}{2n}\|\phi(D)\|_F^2+\frac{\beta}{2}\|z_j\|^2, \tag{17}$$

where $[x_j]_{\omega_j}$ ($[x_j]_{\bar\omega_j}$) denotes the observed (unknown) entries of $x_j$ and $\omega_j$ ($\bar\omega_j$) denotes the corresponding locations. Then (16) minimizes the empirical cost function

$$g_n(D):=\frac{1}{n}\sum_{j=1}^n\ell([x_j]_{\omega_j},D). \tag{18}$$

The expected cost is

$$g(D):=\mathbb{E}_{[x]_\omega}\big[\ell([x]_\omega,D)\big]=\lim_{n\to\infty}g_n(D). \tag{19}$$

To approximately minimize (19) online, we propose the following optimization for a given incomplete sample $x$:

$$\operatorname*{minimize}_{[x]_{\bar\omega},\,D,\,z}\ \hat\ell(z,[x]_{\bar\omega},D):=\frac12\|\phi(x)-\phi(D)z\|^2+\frac{\alpha}{2}\|\phi(D)\|_F^2+\frac{\beta}{2}\|z\|^2. \tag{20}$$

With randomly initialized $D$, we first compute $z$ and $[x]_{\bar\omega}$ via alternately minimizing $\hat\ell$, which is equivalent to

$$\operatorname*{minimize}_{[x]_{\bar\omega},\,z}\ \frac12 k_{xx}-k_{xD}z+\frac12 z^\top K_{DD}z+\frac{\beta}{2}\|z\|^2. \tag{21}$$

Specifically, in each iteration, is updated as

$$z=(K_{DD}+\beta I_r)^{-1}k_{xD}^\top. \tag{22}$$

We propose to update $[x]_{\bar\omega}$ by Newton's method, i.e., $x\leftarrow x-\Delta_x$. When the polynomial kernel is used, we obtain

$$\nabla_x\hat\ell=w_1x-D(w_2^\top\odot z), \tag{23}$$

where $w_1=q(x^\top x+c)^{q-1}$ and $w_2=q(x^\top D+c)^{\odot(q-1)}$. Then

$$\Delta_x=\frac{1}{\tau w_1}\nabla_x\hat\ell. \tag{24}$$

When the RBF kernel is used, we have

$$\nabla_x\hat\ell=\frac{1}{\sigma^2}(Dq-\gamma x), \tag{25}$$

where $q=-2(z\odot k_{xD}^\top)$ and $\gamma=\mathbf{1}_r^\top q$. Then

$$\Delta_x=-\frac{\sigma^2}{\tau\gamma}\nabla_x\hat\ell. \tag{26}$$

The derivations of (24) and (26) are similar to those of (8) and (10). We repeat (22)–(26) until convergence.
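The inner loop above can be sketched for the RBF kernel as follows (function names, the curve used as a dictionary, and all hyperparameters are our illustrative assumptions; signs are arranged so the Newton scaling is positive):

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)   # pairwise squared distances
    return np.exp(-d2 / sigma2)

def complete_sample_rbf(x_obs, observed, D, beta=0.1, sigma2=0.5, tau=2.0, iters=50):
    """Alternate the z update (22) with a relaxed Newton step on the
    unobserved entries of x, cf. (25)-(26). Unobserved entries start at 0."""
    r = D.shape[1]
    x = np.where(observed, x_obs, 0.0)
    for _ in range(iters):
        k_xD = rbf_kernel(x[:, None], D, sigma2).ravel()
        K_DD = rbf_kernel(D, D, sigma2)
        z = np.linalg.solve(K_DD + beta * np.eye(r), k_xD)     # (22)
        w = z * k_xD
        gamma = 2.0 * w.sum()          # = 2 k^T (K_DD + beta I)^{-1} k > 0
        step = (x - D @ (2.0 * w) / gamma) / tau               # relaxed Newton step
        x = np.where(observed, x, x - step)
    return x

# demo: dictionary samples a smooth curve t -> (t, t^2, t^3, t^4); complete a
# sample from the same curve whose last two coordinates are missing
t = np.linspace(-1.0, 1.0, 20)
D = np.stack([t, t**2, t**3, t**4])
x_true = D[:, 17]
observed = np.array([True, True, False, False])
x_hat = complete_sample_rbf(x_true, observed, D)
err0 = np.linalg.norm(x_true[~observed])            # error of the zero initialization
err = np.linalg.norm((x_hat - x_true)[~observed])
```

The fixed point of the $x$ step is a kernel-weighted combination of dictionary columns, so the recovered entries are pulled toward the nonlinear structure encoded by $D$.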

After $z$ and $[x]_{\bar\omega}$ are computed, we compute $D$ via minimizing $\hat\ell$, which is equivalent to

$$\operatorname*{minimize}_{D}\ -k_{xD}z+\frac12 z^\top K_{DD}z+\frac{\alpha}{2}\operatorname{Tr}(K_{DD}). \tag{27}$$

We propose to use SGD to update $D$, i.e., $D\leftarrow D-\Delta_D$. When the polynomial kernel is used, we have

$$\nabla_D\hat\ell=-x(w_1\odot z^\top)+D(zz^\top\odot W_2)+\alpha D(W_2\odot I_r), \tag{28}$$

where $w_1=q(x^\top D+c)^{\odot(q-1)}$ and $W_2=q(D^\top D+c)^{\odot(q-1)}$. Then we have

$$\Delta_D=\frac{1}{\tau}\,\nabla_D\hat\ell\,\big/\,\big\|zz^\top\odot W_2+\alpha W_2\odot I_r\big\|_2. \tag{29}$$

Here we cannot use the method of (8) because the single-sample outer product $zz^\top$ is not as stable as $ZZ^\top$. In addition, the following lemma (proved in the supplementary material) ensures the effectiveness of updating $D$:

###### Lemma 6.

The update $D\leftarrow D-\Delta_D$ with (29) does not diverge, and $\hat\ell$ decreases, provided that $\tau$ is sufficiently large; the precise condition is given in the supplementary material.

When the RBF kernel is used, the gradient is

$$\nabla_D\hat\ell=\frac{1}{\sigma^2}(xQ_1-D\Gamma_1)+\frac{2}{\sigma^2}(DQ_2-D\Gamma_2), \tag{30}$$

where $Q_1=-2(k_{xD}\odot z^\top)$, $\Gamma_1=\operatorname{diag}(Q_1)$, $Q_2=K_{DD}\odot(zz^\top)$, and $\Gamma_2=\operatorname{diag}(\mathbf{1}_r^\top Q_2)$. Similar to (29) and Lemma 6, we obtain

$$\Delta_D=\frac{1}{\tau}\,\nabla_D\hat\ell\,\Big/\,\Big\|\frac{1}{\sigma^2}(2Q_2-\Gamma_1-2\Gamma_2)\Big\|_2. \tag{31}$$

Similar to offline KFMC, we also use momentum to accelerate the optimization of online KFMC. The optimization steps are summarized in Algorithm 2. The error of online matrix completion can be reduced by multi-pass optimization or by increasing the number of samples. In Algorithm 2, the sequence of per-sample costs $\ell([x_j]_{\omega_j},D)$ defined in (17) may not decrease monotonically because each incomplete sample can introduce high uncertainty. However, the sequence of empirical cost functions $g_n(D)$ defined in (18) is convergent, because for each $j$, $\hat\ell$ is minimized through optimizing $z_j$, $[x_j]_{\bar\omega_j}$, and $D$.
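A sketch of the per-sample dictionary step, cf. (28)-(29), for the polynomial kernel (function names, sizes, seed, and hyperparameters are our illustrative assumptions):

```python
import numpy as np

def online_update_D_poly(D, x, z, alpha=0.1, c=1.0, q=2, tau=5.0):
    """One online dictionary step, cf. (28)-(29): a gradient step whose
    length is normalized by the spectral norm of the local Hessian."""
    r = D.shape[1]
    w1 = q * (x @ D + c) ** (q - 1)                    # r-vector
    W2 = q * (D.T @ D + c) ** (q - 1)                  # r x r
    H = np.outer(z, z) * W2 + alpha * (W2 * np.eye(r))
    grad = -np.outer(x, w1 * z) + D @ H                # (28)
    return D - grad / (tau * np.linalg.norm(H, 2))     # (29)

# the per-sample cost (27) should decrease after one step
def cost(D, x, z, alpha=0.1, c=1.0, q=2):
    k_xD = (x @ D + c) ** q
    K_DD = (D.T @ D + c) ** q
    return -k_xD @ z + 0.5 * z @ K_DD @ z + 0.5 * alpha * np.trace(K_DD)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
D = rng.standard_normal((10, 6))
z = 0.1 * rng.standard_normal(6)
before = cost(D, x, z)
after = cost(online_update_D_poly(D, x, z), x, z)
```

Normalizing by $\|H\|_2$ rather than inverting $H$ reflects the instability caveat above: the rank-one $zz^\top$ makes a full Newton step unreliable for a single sample.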

### 3.5 Out-of-sample extension of KFMC

The matrix $D$ obtained from offline matrix completion (Algorithm 1) or online matrix completion (Algorithm 2) can be used to recover the missing entries of new data without updating the model. We can also compute $D$ from complete training data: the corresponding algorithm is similar to Algorithms 1 and 2, but does not require the update of $X$. We can complete a new (incomplete) sample $x_{\mathrm{new}}$ by solving

$$\operatorname*{minimize}_{[x_{\mathrm{new}}]_{\bar\omega_{\mathrm{new}}},\,z_{\mathrm{new}}}\ \frac12\|\phi(x_{\mathrm{new}})-\phi(D)z_{\mathrm{new}}\|^2+\frac{\beta}{2}\|z_{\mathrm{new}}\|^2, \tag{32}$$

where $[x_{\mathrm{new}}]_{\bar\omega_{\mathrm{new}}}$ denotes the unknown entries of $x_{\mathrm{new}}$. This out-of-sample extension of KFMC is displayed as Algorithm 3.
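Since $D$ is fixed, the out-of-sample solve (32) only alternates the $z$ update (22) with the $x$ step (23)-(24); a sketch for the polynomial kernel (names, sizes, seed, and hyperparameters are our illustrative assumptions):

```python
import numpy as np

def complete_new_poly(x_obs, observed, D, beta=0.1, c=1.0, q=2, tau=10.0, iters=30):
    """Out-of-sample completion for (32): alternate the z update (22) with the
    relaxed Newton x step (23)-(24), changing only the unobserved entries."""
    r = D.shape[1]
    x = np.where(observed, x_obs, 0.0)
    for _ in range(iters):
        k_xD = (x @ D + c) ** q
        K_DD = (D.T @ D + c) ** q
        z = np.linalg.solve(K_DD + beta * np.eye(r), k_xD)    # (22)
        w1 = q * (x @ x + c) ** (q - 1)                       # scalar
        w2 = q * (x @ D + c) ** (q - 1)                       # r-vector
        grad = w1 * x - D @ (w2 * z)                          # (23)
        x = np.where(observed, x, x - grad / (tau * w1))      # (24)
    return x

def oos_cost(x, D, beta=0.1, c=1.0, q=2):
    r = D.shape[1]
    k_xD = (x @ D + c) ** q
    K_DD = (D.T @ D + c) ** q
    z = np.linalg.solve(K_DD + beta * np.eye(r), k_xD)
    return 0.5 * (x @ x + c) ** q - k_xD @ z + 0.5 * z @ K_DD @ z + 0.5 * beta * z @ z

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))
x_true = rng.standard_normal(8)
observed = np.ones(8, dtype=bool); observed[-2:] = False
x_hat = complete_new_poly(x_true, observed, D)
before = oos_cost(np.where(observed, x_true, 0.0), D)
after = oos_cost(x_hat, D)
```

Because no $D$ update is needed, completing each new sample costs only a few $r\times r$ solves, which is what makes the out-of-sample extension cheap compared to re-running HRMC from scratch.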