1-Bit Matrix Completion under Exact Low-Rank Constraint

Department of Electrical Engineering, Stanford University, Stanford, CA 94304, USA
Abstract

We consider the problem of noisy 1-bit matrix completion under an exact rank constraint on the true underlying matrix M^{*}. Instead of observing a subset of the noisy continuous-valued entries of a matrix M^{*}, we observe a subset of noisy 1-bit (or binary) measurements generated according to a probabilistic model. We consider constrained maximum likelihood estimation of M^{*}, under a constraint on the entry-wise infinity-norm of M^{*} and an exact rank constraint. This is in contrast to previous work which has used convex relaxations for the rank. We provide an upper bound on the matrix estimation error under this model. Compared to the existing results, our bound has faster convergence rate with matrix dimensions when the fraction of revealed 1-bit observations is fixed, independent of the matrix dimensions. We also propose an iterative algorithm for solving our nonconvex optimization with a certificate of global optimality of the limiting point. This algorithm is based on low rank factorization of M^{*}. We validate the method on synthetic and real data with improved performance over existing methods.

I INTRODUCTION

The problem of recovering a low rank matrix from an incomplete or noisy sampling of its entries arises in a variety of applications, including collaborative filtering [1] and sensor network localization [2, 3]. In many applications, the observations are not only missing, but are also highly discretized, e.g. binary-valued (1-bit) [4, 5], or multiple-valued [6]. For example, in the Netflix problem where a subset of the users’ ratings is observed, the ratings take integer values between 1 and 5. Although one can apply existing matrix completion techniques to discrete-valued observations by treating them as continuous-valued, performance can be improved by treating the values as discrete [4].

In this paper we consider the problem of completing a matrix from a subset of its entries, where instead of observing continuous-valued entries, we observe a subset of 1-bit measurements. Given M^{*}\in\mathbb{R}^{m\times n}, a subset of indices \Omega\subseteq[m]\times[n], and a twice differentiable function f\,:\,\mathbb{R}\rightarrow[0,1], we observe (“w.p.” stands for “with probability”)

 Y_{ij}=\begin{cases}+1&\mbox{ w.p. }f(M^{*}_{ij}),\\ -1&\mbox{ w.p. }1-f(M^{*}_{ij}),\end{cases}\quad\mbox{ for }(i,j)\in\Omega. (1)

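One important application is the binary quantization of Y_{ij}=M^{*}_{ij}+Z_{ij}, where Z is a noise matrix with i.i.d. entries. If we take f to be the cumulative distribution function of -Z_{11}, then the model in (1) is equivalent to observing

 Y_{ij}=\begin{cases}+1&\mbox{ if }M^{*}_{ij}+Z_{ij}\geq 0,\\ -1&\mbox{ if }M^{*}_{ij}+Z_{ij}<0,\end{cases}\quad\mbox{ for }(i,j)\in\Omega. (2)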

Recent work in the 1-bit matrix completion literature has followed the probabilistic model in (1)-(2) for the observed matrix Y and has estimated M^{*} via solving a constrained maximum likelihood (ML) optimization problem. Under the assumption that M^{*} is low-rank, these works have used convex relaxations for the rank via the trace norm [4] or max-norm [5]. An upper bound on the matrix estimation error is given under the assumptions that the entries are sampled according to a uniform distribution [4], or in [5], following a non-uniform distribution.
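As an illustration of the observation model referred to above, the following minimal sketch draws 1-bit measurements, assuming that Y_{ij}=+1 with probability f(M^{*}_{ij}) and Y_{ij}=-1 otherwise, which is the reading consistent with the likelihood (3). The helper names (`logistic_cdf`, `probit_cdf`, `sample_one_bit`), the toy matrix and the 0-based indexing are our own assumptions and are not taken from [4, 5].

```python
import math
import random

def logistic_cdf(x, sigma=1.0):
    """Logit link f(x) = 1 / (1 + exp(-x / sigma))."""
    return 1.0 / (1.0 + math.exp(-x / sigma))

def probit_cdf(x, sigma=1.0):
    """Probit link f(x) = Phi(x / sigma), with Phi the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def sample_one_bit(M_star, omega, f, rng):
    """Draw Y_ij in {+1, -1} for (i, j) in omega with P(Y_ij = +1) = f(M*_ij)."""
    return {(i, j): (1 if rng.random() < f(M_star[i][j]) else -1)
            for (i, j) in omega}

rng = random.Random(0)
M_star = [[0.8, -0.3], [0.1, -0.9]]      # hypothetical toy matrix
omega = [(0, 0), (0, 1), (1, 1)]         # revealed indices (0-based here)
Y = sample_one_bit(M_star, omega, logistic_cdf, rng)
```

Passing `probit_cdf` instead of `logistic_cdf` gives the probit variant of the same model.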

In this paper, we follow [4, 5] in seeking an ML estimate of M^{*} but use an exact rank constraint on M^{*} rather than a convex relaxation for the rank. We follow the sampling model of [7] for \Omega, which includes the uniform sampling of [4] as well as non-uniform sampling. We provide an upper bound on the Frobenius norm of the matrix estimation error, and show that our bound yields a faster convergence rate with matrix dimensions than the existing results of [4, 5] when the fraction of revealed 1-bit observations is fixed independent of the matrix dimensions. Lastly, we present an iterative algorithm for solving our nonconvex optimization problem with a certificate of global optimality under mild conditions. Our algorithm outperforms [4, 5] in the presented simulation example.

Notation: For matrix A with (i,j)-th entry A_{ij}, we use the notation \|A\|_{\infty}=\underset{i,j}{\max}|A_{ij}| for the entry-wise infinity-norm, \|A\|_{F} for the Frobenius norm and \|A\|_{2} for its operator norm. We use A_{i,\cdot} to denote the i-th row and A_{\cdot,j} to denote the j-th column. Taking \mathcal{S} to be a set, we use |\mathcal{S}| to denote the cardinality of \mathcal{S}. The notation [n] represents the set of integers \{1,\ldots,n\}. We denote by \mathbf{1}_{n}\in\mathbb{R}^{n} the vector of all ones, by \tilde{\mathbf{1}}_{n} the unit vector \mathbf{1}_{n}/\sqrt{n}, and by \mathbb{I}_{\mu} the indicator function, i.e. \mathbb{I}_{\mu}=1 when \mu is true, else \mathbb{I}_{\mu}=0.

II MODEL ASSUMPTIONS

We wish to estimate unknown M^{*} using a constrained ML approach. We use M\in\mathbb{R}^{m\times n} to denote the optimization variable. Then the negative log-likelihood function for the given problem is

 F_{\Omega,Y}(M)=-\sum_{(i,j)\in\Omega}\Big\{\mathbb{I}_{(Y_{ij}=1)}\log(f(M_{ij}))+\mathbb{I}_{(Y_{ij}=-1)}\log(1-f(M_{ij}))\Big\} (3)

Note that (3) is a convex function of M when the function f is log-concave. Two common choices for which the function f is log-concave are: (i) Logit model with logistic function f(x)=1/(1+e^{-x/\sigma}) and parameter \sigma>0, or equivalently Z_{ij} in (2) is logistic with scale parameter \sigma; (ii) Probit model with f(x)=\Phi(x/\sigma) where \sigma>0 and \Phi(x) is the cumulative distribution function of {\cal N}(0,1). We assume that M^{*} is a low-rank matrix with rank bounded by r, and that the true matrix M^{*} satisfies \|M^{*}\|_{\infty}\leq\alpha, which helps make the recovery of M^{*} well-posed by preventing excessive “spikiness” of the matrix. We refer the reader to [4, 5] for further details.
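A minimal sketch of evaluating the negative log-likelihood (3) under the logit model is given below. The function names and the toy data are assumptions made for illustration; observations are stored as a dictionary from revealed index pairs to +1 or -1.

```python
import math

def logistic_cdf(x, sigma=1.0):
    """Logit link f(x) = 1 / (1 + exp(-x / sigma))."""
    return 1.0 / (1.0 + math.exp(-x / sigma))

def neg_log_likelihood(M, Y, f):
    """F_{Omega,Y}(M) from (3); Y maps each revealed (i, j) to +1 or -1."""
    total = 0.0
    for (i, j), y in Y.items():
        p = f(M[i][j])
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total

Y = {(0, 0): 1, (0, 1): -1, (1, 1): -1}      # hypothetical observations
M_zero = [[0.0, 0.0], [0.0, 0.0]]
M_good = [[2.0, -2.0], [0.0, -2.0]]          # signs agree with Y on Omega
nll_zero = neg_log_likelihood(M_zero, Y, logistic_cdf)   # each term is log 2
nll_good = neg_log_likelihood(M_good, Y, logistic_cdf)
```

At M=0 every revealed entry contributes \log 2, and a matrix whose signs agree with Y on \Omega attains a smaller value, as expected.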

The constrained ML estimate of interest is the solution to the optimization problem (s.t.: subject to):

 \widehat{M}=\arg\min_{M}F_{\Omega,Y}(M)\;\;\mbox{s.t.}\;\|M\|_{\infty}\leq\alpha,\;{\rm rank}(M)\leq r. (4)

In many applications, such as sensor network localization, collaborative filtering, or DNA haplotype assembly, the rank r is known or can be reliably estimated [8].

We now discuss our assumptions on the set \Omega. Consider a bipartite graph G=([m],[n],E), where the edge set E\subseteq[m]\times[n] is related to the index set of revealed entries \Omega as (i,j)\in E iff (i,j)\in\Omega. Abusing the notation, we use G for both the graph and its bi-adjacency matrix where G_{ij}=1 if (i,j)\in E, G_{ij}=0 if (i,j)\not\in E. We denote the association of {G} to {\Omega} by {G}\backslash{\Omega}. Without loss of generality we take m\geq n. We assume that each row of G has d nonzero entries (thus |\Omega|=md) with the following properties on its SVD:

• (A1) The left and right top singular vectors of {G} are \mathbf{1}_{m}/\sqrt{m} and \mathbf{1}_{n}/\sqrt{n}, respectively. This implies that \sigma_{1}(G)=d\sqrt{m/n}\geq d, where \sigma_{1}(G) denotes the largest singular value of G, and that each column of G has (md/n) nonzero entries.

• (A2) We have \sigma_{2}({G})\leq C\sqrt{d}, where \sigma_{2}(G) denotes the second largest singular value of G and C>0 is some universal constant.

Thus we require {G} to have a large enough spectral gap. As discussed in [7], an Erdős-Rényi random graph with average degree d\geq c\log(m) satisfies this spectral gap property with high probability, and so do stochastic block models for certain choices of inter- and intra-cluster edge connection probabilities. Thus, this sampling scheme is more general than the uniform sampling assumption used in [4], and it also includes the stochastic block model [7], which results in non-uniform sampling.
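As a simple sanity check of the two spectral conditions on G, consider the fully observed case (our own illustrative example, not one discussed in [7]), in which every entry is revealed and G is the all-ones matrix:

```latex
G = \mathbf{1}_m \mathbf{1}_n^{\top}
  = \sqrt{mn}\;\tilde{\mathbf{1}}_m \tilde{\mathbf{1}}_n^{\top},
\qquad d = n,
\qquad \sigma_1(G) = \sqrt{mn} = d\sqrt{m/n},
\qquad \sigma_2(G) = 0 \le C\sqrt{d}.
```

Here each row has d=n nonzero entries, each column has m=md/n nonzero entries, and the spectral gap is as large as possible.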

III PERFORMANCE UPPER BOUND

We now present a performance bound for the solution to (4). With \dot{f}(x):=({\rm d}f(x)/{\rm d}x) and \ddot{f}(x):=({\rm d}^{2}f(x)/{\rm d}x^{2}), define

 \gamma_{\alpha}\leq\min\left(\inf_{|x|\leq\alpha}\left\{\frac{\dot{f}^{2}(x)}{f^{2}(x)}-\frac{\ddot{f}(x)}{f(x)}\right\},\;\inf_{|x|\leq\alpha}\left\{\frac{\dot{f}^{2}(x)}{(1-f(x))^{2}}+\frac{\ddot{f}(x)}{1-f(x)}\right\}\right), (5)
 L_{\alpha}\geq\sup_{|x|\leq\alpha}\left\{\frac{\left|\dot{f}(x)\right|}{f(x)(1-f(x))}\right\}\,, (6)

where \alpha is the bound on the entry-wise infinity-norm of \widehat{M} (see (4)). For the logit model, we have L_{\alpha}=1/\sigma, and \gamma_{\alpha}=\frac{e^{\alpha/\sigma}}{\sigma^{2}(1+e^{\alpha/\sigma})^{2}}\approx e^{-\alpha/\sigma}>0. For the probit model we obtain L_{\alpha}\leq\frac{4}{\sigma}\left(\frac{\alpha}{\sigma}+1\right), \gamma_{\alpha}\geq\frac{\alpha}{\sqrt{2\pi}\sigma^{3}}\exp\left(-\frac{\alpha^{2}}{2\sigma^{2}}\right)>0. For further reference, define the constraint set

 {\mathcal{C}}:=\left\{M\in\mathbb{R}^{m\times n}\,:\,\|M\|_{\infty}\leq\alpha,\;{\rm rank}(M)\leq r\right\}\,. (7)
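The logit-model values of L_{\alpha} and \gamma_{\alpha} quoted above can be checked numerically. The sketch below, with hypothetical helper names and arbitrarily chosen \alpha=1, \sigma=0.5, approximates the infimum in (5) and the supremum in (6) on a grid over [-\alpha,\alpha] and compares them with the closed forms.

```python
import math

def logistic_terms(x, sigma):
    """Return f, f', f'' for the logistic link f(x) = 1 / (1 + exp(-x / sigma))."""
    f = 1.0 / (1.0 + math.exp(-x / sigma))
    df = f * (1.0 - f) / sigma
    d2f = df * (1.0 - 2.0 * f) / sigma
    return f, df, d2f

def grid_constants(alpha, sigma, num=2001):
    """Approximate the inf in (5) and the sup in (6) on a grid over [-alpha, alpha]."""
    gamma, L = float("inf"), 0.0
    for k in range(num):
        x = -alpha + 2.0 * alpha * k / (num - 1)
        f, df, d2f = logistic_terms(x, sigma)
        gamma = min(gamma,
                    df ** 2 / f ** 2 - d2f / f,
                    df ** 2 / (1.0 - f) ** 2 + d2f / (1.0 - f))
        L = max(L, abs(df) / (f * (1.0 - f)))
    return gamma, L

alpha, sigma = 1.0, 0.5
gamma_grid, L_grid = grid_constants(alpha, sigma)
e = math.exp(alpha / sigma)
gamma_closed = e / (sigma ** 2 * (1.0 + e) ** 2)   # closed form quoted in the text
L_closed = 1.0 / sigma
```

For the logistic link both expressions inside the braces of (5) reduce to f(x)(1-f(x))/\sigma^{2}, so the infimum is attained at |x|=\alpha and the grid value agrees with the closed form.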
Theorem III.1

Suppose that M^{*}\in{\mathcal{C}}, and G\backslash\Omega satisfies assumptions (A1) and (A2), with m\geq n. Further, suppose Y is generated according to (1) and f(x) is log-concave in x. Then with probability at least 1-C_{1}\exp(-C_{2}m), any global minimizer \widehat{M} of (4) satisfies

 \frac{1}{\sqrt{mn}}\|\widehat{M}-M^{*}\|_{F}\leq\max\left(\frac{C_{1\alpha}r\sigma_{2}(G)}{\sigma_{1}(G)},\frac{C_{2\alpha}m\sqrt{r^{3}n}}{\sigma_{1}^{2}(G)}\right) (8)
 \leq\max\left(\frac{C_{1\alpha}Cr\sqrt{m}}{\sqrt{|\Omega|}},\frac{C_{2\alpha}m^{3}\sqrt{r^{3}n}}{|\Omega|^{2}}\right)\,, (9)

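provided \gamma_{\alpha}>0. Here, C_{1},C_{2}>0 are universal constants, C>0 is given by assumption (A2), and

 C_{1\alpha}\equiv 4\sqrt{2}\alpha,\quad C_{2\alpha}\equiv\frac{32.16\sqrt{2}L_{\alpha}}{\gamma_{\alpha}},

with \gamma_{\alpha} and L_{\alpha} given by (5), (6).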

Proof of this theorem is given in Sec. VI. Of particular interest is the case where p=\frac{|\Omega|}{mn} is fixed and we let m and n become large, with m/n\equiv\delta\geq 1 fixed. In this case we have the following Corollary.

Corollary III.2

Assume the conditions of Theorem III.1. Let p=\frac{|\Omega|}{mn} be fixed independent of m and n. Then with probability at least 1-C_{1}\exp(-C_{2}m), any global minimizer \widehat{M} of (4) satisfies

 \frac{1}{\sqrt{mn}}\|\widehat{M}-M^{*}\|_{F}\leq\mathcal{O}\left(\frac{\delta}{p^{2}}\sqrt{\frac{r^{3}}{n}}\right). (10)
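For completeness, the step from (9) to (10) can be written out as follows. This is our own expansion of the substitution |\Omega|=pmn and m=\delta n, with all constants absorbed into the \mathcal{O}(\cdot) notation.

```latex
\frac{C_{1\alpha} C\, r\sqrt{m}}{\sqrt{|\Omega|}}
  = \frac{C_{1\alpha} C\, r}{\sqrt{pn}},
\qquad
\frac{C_{2\alpha}\, m^{3}\sqrt{r^{3}n}}{|\Omega|^{2}}
  = \frac{C_{2\alpha}\, m\sqrt{r^{3}n}}{p^{2}n^{2}}
  = C_{2\alpha}\,\frac{\delta}{p^{2}}\sqrt{\frac{r^{3}}{n}} .
```

Since p\leq 1, r\geq 1 and \delta\geq 1, we have r/\sqrt{pn}\leq(\delta/p^{2})\sqrt{r^{3}/n}, so the second term determines the order of the maximum.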

III-A Comparison with previous work

Consider M^{*}\in\mathbb{R}^{n\times n}, with p fraction of its entries sampled, such that \|M^{*}\|_{\infty}\leq\alpha (also assumed in [4, 5]) and \text{rank}(M^{*})\leq r. Then m=n, and |\Omega|=pn^{2}. The bounds proposed in [4] (and in [5] in the case of uniform sampling) yield

 \frac{1}{n^{2}}\|\widehat{M}-M^{*}\|_{F}^{2}\leq\mathcal{O}\left(\sqrt{\frac{r}{pn}}\right)\,, (11)

whereas, applying our result (Corollary III.2), we obtain

 \frac{1}{n^{2}}\|\widehat{M}-M^{*}\|_{F}^{2}\leq\mathcal{O}\left(\frac{r^{3}}{p^{4}n}\right)\,. (12)

Comparing (11) and (12), we see that our bound has a faster convergence rate in n for fixed rank r and fixed fraction of revealed entries p. Notice that if the fraction of revealed entries scales with n according to p\sim\Theta(1/n), the bound of [4] remains bounded while our bound grows with n; for our bound to remain bounded we need p to be of order at least n^{-1/4}. We believe this to be an artifact of our proof, as our numerical results (Fig. 1) show that our method outperforms [4], especially for low values of p and higher values of the rank r.
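The two scalings mentioned in this comparison follow from a short calculation, sketched here with constants suppressed. Setting m=n (so \delta=1) in (10) and squaring gives (12), and requiring the right-hand side of (12) to stay bounded gives the condition on p:

```latex
\left(\frac{1}{p^{2}}\sqrt{\frac{r^{3}}{n}}\right)^{2} = \frac{r^{3}}{p^{4}n},
\qquad
\frac{r^{3}}{p^{4}n} = \mathcal{O}(1)
\;\Longleftrightarrow\;
p \gtrsim \left(\frac{r^{3}}{n}\right)^{1/4} .
```

For fixed r this is the requirement that p be of order at least n^{-1/4}.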

IV OPTIMIZATION

We will solve the optimization problem (4) using a log-barrier penalty function approach [9, Sec. 11.2]. The constraint \max_{i,j}|M_{ij}|\leq\alpha translates to the log-barrier penalty function -\log\left(1-(M_{ij}/\alpha)^{2}\right). This leads to the regularized objective function

 \overline{F}_{\Omega,Y}(M)=F_{\Omega,Y}(M)-\lambda\sum_{(i,j)}\log\left(1-(M_{ij}/\alpha)^{2}\right) (13)

and the optimization problem

 \widehat{M}=\arg\min_{M}\overline{F}_{\Omega,Y}(M)\;\mbox{subject to}\;{\rm rank}(M)\leq r. (14)

We can account for the rank constraint in (14) via the factorization technique of [10, 11, 12] where instead of optimizing with respect to M in (4), M is factorized into two matrices U\in\mathbb{R}^{m\times k} and V\in\mathbb{R}^{n\times k} such that M=UV^{\top}. One then chooses k=r+1 and optimizes with respect to the factors U,V. The reformulated objective function is then given by

 \check{F}_{\Omega,Y}(U,V)=F_{\Omega,Y}(UV^{\top})-\lambda\sum_{(i,j)}\log\left(1-(U_{i,\cdot}V_{j,\cdot}^{\top}/\alpha)^{2}\right) (15)

where U_{i,\cdot} denotes the i\mbox{-}th row of U, and V_{j,\cdot} the j\mbox{-}th row of V. The parameter \lambda>0 sets the accuracy of approximation of \max_{i,j}|M_{ij}|\leq\alpha via the log-barrier function. We solve this factored version using a gradient descent method with backtracking line search, in a sequence of central path following solutions [9, Sec. 11.2], where one gradually reduces \lambda toward 0. Initial values of U,V are randomly picked and scaled to satisfy \|UV^{\top}\|_{\infty}\leq 0.95\alpha. Starting with a large \lambda_{0}, we solve for \lambda=\lambda_{0},\lambda_{0}/2,\lambda_{0}/4,\cdots via central path following and use 5-fold cross validation error over \lambda as the stopping criterion in selecting \lambda.
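The procedure described above can be sketched in a few lines. The following is a minimal illustration under several assumptions of ours, and is not the implementation used for the reported experiments: it uses the logit link, a tiny hypothetical problem, a fixed number of halvings of \lambda in place of the cross-validated stopping rule, and helper names (`matmul_t`, `objective`, `gradients`, `descend`) that are purely illustrative.

```python
import math
import random

def matmul_t(U, V):
    """Return U V^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]

def objective(U, V, Y, lam, alpha, sigma):
    """Objective (15) with the logistic link; inf outside the barrier's domain."""
    M = matmul_t(U, V)
    if max(abs(x) for row in M for x in row) >= alpha:
        return float("inf"), M
    val = -lam * sum(math.log(1.0 - (x / alpha) ** 2) for row in M for x in row)
    for (i, j), y in Y.items():
        val += math.log1p(math.exp(-y * M[i][j] / sigma))   # -log f(y * M_ij)
    return val, M

def gradients(U, V, M, Y, lam, alpha, sigma):
    """Gradients of (15) in U and V via the chain rule through M = U V^T."""
    m, n, k = len(U), len(V), len(U[0])
    G = [[lam * 2.0 * x / (alpha ** 2 - x ** 2) for x in row] for row in M]
    for (i, j), y in Y.items():
        G[i][j] += -y / (sigma * (1.0 + math.exp(y * M[i][j] / sigma)))
    gU = [[sum(G[i][j] * V[j][c] for j in range(n)) for c in range(k)] for i in range(m)]
    gV = [[sum(G[i][j] * U[i][c] for i in range(m)) for c in range(k)] for j in range(n)]
    return gU, gV

def descend(U, V, Y, lam, alpha, sigma, iters=100):
    """Gradient descent with backtracking (Armijo) line search for a fixed lam."""
    val, M = objective(U, V, Y, lam, alpha, sigma)
    for _ in range(iters):
        gU, gV = gradients(U, V, M, Y, lam, alpha, sigma)
        sq = sum(g * g for row in gU + gV for g in row)
        t = 1.0
        while t > 1e-12:
            Un = [[u - t * g for u, g in zip(ru, rg)] for ru, rg in zip(U, gU)]
            Vn = [[v - t * g for v, g in zip(rv, rg)] for rv, rg in zip(V, gV)]
            new_val, new_M = objective(Un, Vn, Y, lam, alpha, sigma)
            if new_val <= val - 1e-4 * t * sq:      # sufficient decrease, still feasible
                U, V, val, M = Un, Vn, new_val, new_M
                break
            t *= 0.5
        else:
            break                                    # no acceptable step was found
    return U, V, val

rng = random.Random(0)
m, n, r, alpha, sigma = 6, 5, 1, 1.0, 0.5
k = r + 1
# Hypothetical rank-1 truth and 1-bit observations on a random index set.
a = [rng.uniform(-1, 1) for _ in range(m)]
b = [rng.uniform(-1, 1) for _ in range(n)]
Y = {}
for i in range(m):
    for j in range(n):
        if rng.random() < 0.7:
            p = 1.0 / (1.0 + math.exp(-a[i] * b[j] / sigma))
            Y[(i, j)] = 1 if rng.random() < p else -1

# Random start, rescaled so that ||U V^T||_inf = 0.95 * alpha.
U = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(m)]
V = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
s = math.sqrt(0.95 * alpha / max(abs(x) for row in matmul_t(U, V) for x in row))
U = [[s * u for u in row] for row in U]
V = [[s * v for v in row] for row in V]

history = []
lam = 1.0
for _ in range(5):                                   # lam = lam0, lam0/2, lam0/4, ...
    start, _M = objective(U, V, Y, lam, alpha, sigma)
    U, V, end = descend(U, V, Y, lam, alpha, sigma)
    history.append((lam, start, end))
    lam *= 0.5
M_hat = matmul_t(U, V)
```

Because the objective is treated as infinite outside the barrier's domain, the line search never accepts a step that leaves the region \|UV^{\top}\|_{\infty}<\alpha, so every iterate remains strictly feasible. Each stage is warm-started from the solution of the previous value of \lambda.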

Remark IV.1

The hard rank constraint results in a nonconvex constraint set. Thus, (4) and (14) are nonconvex optimization problems; similarly for minimization of (15) for which the rank constraint is implicit in the factorization of M. However, the following result is shown in [10, Proposition 4], based on [11], for nonconvex problems of this form. If (U^{*},V^{*}) is a local minimum of the factorized problem, then \widetilde{M}=U^{*}{V^{*}}^{\top} is the global minimum of problem (14), so long as U^{*} and V^{*} are rank-deficient. (Rank deficiency is a sufficient condition, not necessary.) This result is utilized in [12] and [5] for problems of this form.

V NUMERICAL EXPERIMENTS

V-A Synthetic Data

In this section, we test our method on synthetic data and compare it with the methods of [4, 5]. We set m=n and construct M^{*}\in\mathbb{R}^{n\times n} as M^{*}=M_{1}M_{2}^{\top} where M_{1} and M_{2} are n\times r matrices with i.i.d. entries drawn from a uniform distribution on [-0.5,0.5] (as in [5, 4]). Then we scale M^{*} to achieve \|M^{*}\|_{\infty}=1=\alpha. We take r\in\{5,10\} and matrix sizes n\in\{100,200,400\}. We generate the set \Omega of revealed indices via the Bernoulli sampling model of [4] with p fraction of revealed entries. We consider the probit model with \sigma=0.18, as in [5, 4]. For Fig. 1, we take n=200 and vary p. The resulting relative mean-square error (MSE) \|\widehat{M}-M^{*}\|_{F}^{2}/\|M^{*}\|_{F}^{2}, averaged over 20 Monte Carlo runs, is shown in Fig. 1. As expected, the performance improves with increasing p. For comparison, we have also implemented the methods of [4, 5], labeled “trace norm” and “max-norm”, respectively. As can be seen, our proposed approach significantly outperforms [4, 5], especially for low values of p and high values of r.
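The data-generation step and the error metric described above can be sketched as follows. The function names and the small size n=20 are assumptions chosen to keep the example fast; they are not the settings used for the figures.

```python
import random

def make_low_rank(n, r, rng):
    """M* = M1 M2^T with i.i.d. Uniform[-0.5, 0.5] factors, rescaled so ||M*||_inf = 1."""
    M1 = [[rng.uniform(-0.5, 0.5) for _ in range(r)] for _ in range(n)]
    M2 = [[rng.uniform(-0.5, 0.5) for _ in range(r)] for _ in range(n)]
    M = [[sum(a * b for a, b in zip(u, v)) for v in M2] for u in M1]
    peak = max(abs(x) for row in M for x in row)
    return [[x / peak for x in row] for row in M]

def bernoulli_mask(n, p, rng):
    """Reveal each index independently with probability p."""
    return [(i, j) for i in range(n) for j in range(n) if rng.random() < p]

def relative_mse(M_hat, M_star):
    """||M_hat - M*||_F^2 / ||M*||_F^2."""
    num = sum((a - b) ** 2 for ra, rb in zip(M_hat, M_star) for a, b in zip(ra, rb))
    den = sum(b ** 2 for rb in M_star for b in rb)
    return num / den

rng = random.Random(1)
M_star = make_low_rank(n=20, r=5, rng=rng)
omega = bernoulli_mask(n=20, p=0.5, rng=rng)
zero = [[0.0] * 20 for _ in range(20)]
baseline = relative_mse(zero, M_star)    # the all-zero estimate has relative MSE 1
```

The all-zero estimate gives a relative MSE of 1, which is a convenient reference level when reading error curves of this kind.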

In Fig. 2 we show the relative MSE for n=100,200,400, p=0.2,0.4,0.6 for the probit model using our approach. We also plot the line 1/n in Fig. 2 to show the scale of the upper bound \mathcal{O}\left(\frac{r^{3}}{p^{4}n}\right) established in Section III. As we can see, the empirical estimation errors follow approximately the same scaling, suggesting that our analysis is tight, up to a constant.

We additionally plot the MSE for n=200 and r=5 in Fig. 3, with varying p and keeping p+q=0.7, under the probit model. This enables us to study the performance of the model under nonuniform sampling. Note that when p=q=0.35, the spectral gap is largest and MSE is the smallest, and as p gets larger, the spectral gap decreases, leading to larger MSE.

V-B MovieLens (100k) Dataset

As in [4], we consider the MovieLens (100k) dataset (http://www.grouplens.org/node/73). This dataset consists of 100,000 movie ratings from 943 users on 1682 movies, with ratings on a scale from 1 to 5. Following [4], these ratings were converted to binary observations by comparing each rating to the average rating for the entire dataset. We used three splits of the data into training/test subsets and used 20 random realizations of these splits. The performance is evaluated by checking to see if the estimate of M^{*} accurately predicts the sign of the test set ratings (whether the observed ratings were above or below the average rating). As in [4], we determine the needed parameter values by performing a grid search and selecting the values that lead to the best performance; we fixed \alpha=1, and varied \lambda (i.e. central path following), \sigma and rank r. Our performance results are shown in Table I using a logistic model for three approaches: proposed, [4, 5]. These results support our findings on synthetic data that our method is preferable over [4, 5] for sparser data.
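The binarization and the sign-prediction metric described above can be sketched as follows. The toy ratings and the estimate are hypothetical and are not MovieLens data; assigning -1 to a rating exactly equal to the average is our own assumption about tie handling.

```python
def binarize(ratings):
    """Map each rating to +1 if it exceeds the global average rating, else -1."""
    mean = sum(ratings.values()) / len(ratings)
    return {key: (1 if value > mean else -1) for key, value in ratings.items()}

def sign_accuracy(M_hat, Y_test):
    """Fraction of held-out entries whose sign is predicted by the estimate."""
    hits = sum(1 for (i, j), y in Y_test.items()
               if (1 if M_hat[i][j] > 0 else -1) == y)
    return hits / len(Y_test)

# Hypothetical toy ratings keyed by (user, movie); the global mean is 3.0.
ratings = {(0, 0): 5, (0, 1): 2, (1, 0): 4, (1, 1): 1}
Y = binarize(ratings)
M_hat = [[0.7, -0.2], [-0.1, -0.9]]      # hypothetical estimate of M*
acc = sign_accuracy(M_hat, Y)
```

In this toy case the estimate predicts three of the four signs correctly, so `acc` equals 0.75.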

VI PROOF OF THEOREM III.1

Our proof is based on a second-order Taylor series expansion and a matrix concentration inequality.

Let \theta={\rm vec}(M)\in\mathbb{R}^{mn} and \tilde{F}_{\Omega,Y}(\theta)=F_{\Omega,Y}(M). The objective function F_{\Omega,Y}(M) is continuous in M and the set {\mathcal{C}} is compact, therefore, F_{\Omega,Y}(M) achieves a minimum in {\mathcal{C}}. If \widehat{\theta}={\rm vec}(\widehat{M}) minimizes \tilde{F}_{\Omega,Y}(\theta) subject to the constraints, then \tilde{F}_{\Omega,Y}(\widehat{\theta})\leq\tilde{F}_{\Omega,Y}(\theta^{*}) where \theta^{*}={\rm vec}(M^{*}). By the second-order Taylor’s theorem, expanding around \theta^{*} we have

 \tilde{F}_{\Omega,Y}(\theta)=\tilde{F}_{\Omega,Y}(\theta^{*})+\langle\nabla_{\theta}\tilde{F}_{\Omega,Y}(\theta^{*}),\theta-\theta^{*}\rangle+\frac{1}{2}\langle\theta-\theta^{*},\left(\nabla^{2}_{\theta\theta}\tilde{F}_{\Omega,Y}(\tilde{\theta})\right)(\theta-\theta^{*})\rangle (16)

where \tilde{\theta}=\theta^{*}+\gamma(\theta-\theta^{*}) for some \gamma\in[0,1], with corresponding matrix \tilde{M}=M^{*}+\gamma(M-M^{*}). We need several auxiliary results before we can prove Theorem III.1.

Using (3), it follows that

 \frac{\partial F_{\Omega,Y}(M)}{\partial M_{\ell k}}=\left(-\frac{\dot{f}(M_{\ell k})}{f(M_{\ell k})}\mathbb{I}_{(Y_{\ell k}=1)}+\frac{\dot{f}(M_{\ell k})}{1-f(M_{\ell k})}\mathbb{I}_{(Y_{\ell k}=-1)}\right)\mathbb{I}_{((\ell,k)\in\Omega)}, (17)
 \frac{\partial^{2}F_{\Omega,Y}(M)}{\partial M_{\ell k}^{2}}=\bigg[\left(\frac{\dot{f}^{2}(M_{\ell k})}{f^{2}(M_{\ell k})}-\frac{\ddot{f}(M_{\ell k})}{f(M_{\ell k})}\right)\mathbb{I}_{(Y_{\ell k}=1)}+\left(\frac{\ddot{f}(M_{\ell k})}{1-f(M_{\ell k})}+\frac{\dot{f}^{2}(M_{\ell k})}{(1-f(M_{\ell k}))^{2}}\right)\mathbb{I}_{(Y_{\ell k}=-1)}\bigg]\,\mathbb{I}_{((\ell,k)\in\Omega)} (18)

and

 \frac{\partial^{2}F_{\Omega,Y}(M)}{\partial M_{\ell_{1}k_{1}}\partial M_{\ell_{2}k_{2}}}=0\mbox{ if }(\ell_{1},k_{1})\neq(\ell_{2},k_{2}). (19)

Let w\equiv{\rm vec}(M-M^{*})=\theta-\theta^{*}. Note that by our notation,

 \nabla_{\theta}\tilde{F}_{\Omega,Y}(\theta^{*})={\rm vec}\left(\frac{\partial F_{\Omega,Y}(M^{*})}{\partial M_{\ell k}}\right)\,.

We then have

 \langle\nabla_{\theta}\tilde{F}_{\Omega,Y}(\theta^{*}),w\rangle=\langle\nabla_{M}F_{\Omega,Y}(M^{*}),M-M^{*}\rangle (20)

where \langle A,B\rangle:={\rm tr}(A^{\top}B). Let Z\equiv\nabla_{M}F_{\Omega,Y}(M^{*}). Therefore, by (17) evaluated at M=M^{*},

 Z_{ij}=\left(-\frac{\dot{f}(M^{*}_{ij})}{f(M^{*}_{ij})}\mathbb{I}_{(Y_{ij}=1)}+\frac{\dot{f}(M^{*}_{ij})}{1-f(M^{*}_{ij})}\mathbb{I}_{(Y_{ij}=-1)}\right)\mathbb{I}_{((i,j)\in\Omega)}\,.

Using (1) and (6), we have

 \mathbb{E}[Z_{ij}]=0,\;|Z_{ij}|\leq L_{\alpha}\;\implies\;\mathbb{E}[Z_{ij}^{2}]\leq L_{\alpha}^{2}\,. (21)
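The two facts in (21) follow from a one-line computation, spelled out here for (i,j)\in\Omega, where Z is the gradient of F_{\Omega,Y} evaluated at M^{*}, Y_{ij}=1 has probability f(M^{*}_{ij}), and we write f=f(M^{*}_{ij}) and \dot{f}=\dot{f}(M^{*}_{ij}) for brevity:

```latex
\mathbb{E}[Z_{ij}]
  = -\frac{\dot{f}}{f}\, f + \frac{\dot{f}}{1-f}\,(1-f) = 0,
\qquad
|Z_{ij}| \le \frac{|\dot{f}|}{\min\{f,\,1-f\}}
         \le \frac{|\dot{f}|}{f(1-f)} \le L_{\alpha} .
```

The middle inequality uses \max\{f,1-f\}\leq 1, and the last one uses |M^{*}_{ij}|\leq\alpha together with (6). For (i,j)\notin\Omega we have Z_{ij}=0.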

We need the following result from [13] concerning spectral norms of random matrices for Lemma VI.2.

Lemma VI.1

[13, Theorem 8.4] Take any two numbers m and n such that 1\leq n\leq m. Suppose that A=[A_{ij}]_{1\leq i\leq m,1\leq j\leq n} is a matrix whose entries are independent random variables that satisfy, for some \sigma^{2}\in[0,1],

 \mathbb{E}[A_{ij}]=0,\;\mathbb{E}[A_{ij}^{2}]\leq\sigma^{2},\mbox{ and }|A_{ij}|\leq 1\;a.s.

Suppose that \sigma^{2}\geq m^{-1+\varepsilon} for some \varepsilon>0. Then

 P\left(\|A\|_{2}\geq 2.01\sigma\sqrt{m}\right)\leq C_{1}(\varepsilon)e^{-C_{2}\sigma^{2}m},

where C_{1}(\varepsilon) is a constant that depends only on \varepsilon and C_{2} is a positive universal constant. The same result is true when m=n and A is symmetric or skew-symmetric, with independent entries on and above the diagonal, all other assumptions remaining the same. Lastly, all results remain true if the assumption \sigma^{2}\geq m^{-1+\varepsilon} is changed to \sigma^{2}\geq m^{-1}(\log(m))^{6+\varepsilon}.

Lemma VI.2

Let w\equiv{\rm vec}(M-M^{*})=\theta-\theta^{*}, and M,M^{*}\in{\mathcal{C}}. Then with probability at least 1-C_{1}(\varepsilon)\exp(-C_{2}m), we have

 \left|\langle\nabla_{\theta}\tilde{F}_{\Omega,Y}(\theta^{*}),w\rangle\right|\leq 2.01L_{\alpha}\sqrt{2rm}\|M-M^{*}\|_{F}\,,

where \varepsilon\in(0,1), C_{1}(\varepsilon) is a constant that depends only on \varepsilon and C_{2} is a positive universal constant.

Proof.

Using (20), we have

 |\langle\nabla_{\theta}\tilde{F}_{\Omega,Y}(\theta^{*}),w\rangle|=|\langle\nabla_{M}F_{\Omega,Y}(M^{*}),M-M^{*}\rangle|\leq\|\nabla_{M}F_{\Omega,Y}(M^{*})\|_{2}\|M^{*}-M\|_{*}. (22)

Let \tilde{Z}\equiv L_{\alpha}^{-1}\nabla_{M}F_{\Omega,Y}(M^{*}). Then by (21) we have \mathbb{E}[\tilde{Z}_{ij}]=0, |\tilde{Z}_{ij}|\leq 1 and \mathbb{E}[\tilde{Z}_{ij}^{2}]\leq 1. Applying Lemma VI.1 to \tilde{Z} with \sigma=1, we obtain \|\tilde{Z}\|_{2}\leq 2.01\sqrt{m}, that is, \|\nabla_{M}F_{\Omega,Y}(M^{*})\|_{2}\leq 2.01L_{\alpha}\sqrt{m}, with probability at least 1-C_{1}(\varepsilon)\exp(-C_{2}m) for some positive constants C_{1}(\varepsilon) and C_{2}. We note that for any matrix A of rank at most r, \|A\|_{*}\leq\sqrt{r}\|A\|_{F}, with \|A\|_{*} denoting the nuclear norm. Since {\rm rank}(M^{*}-M)\leq 2r, we have \|M^{*}-M\|_{*}\leq\sqrt{2r}\|M^{*}-M\|_{F}, yielding the desired result.
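The nuclear-norm inequality used in this proof is a direct consequence of the Cauchy-Schwarz inequality applied to the singular values. For a matrix A of rank at most r with singular values \sigma_{1}(A)\geq\cdots\geq\sigma_{r}(A)\geq 0:

```latex
\|A\|_{*} = \sum_{i=1}^{r} \sigma_{i}(A)
  \le \sqrt{r}\,\Big(\sum_{i=1}^{r} \sigma_{i}^{2}(A)\Big)^{1/2}
  = \sqrt{r}\,\|A\|_{F} .
```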

Lemma VI.3

Let w={\rm vec}(M-M^{*})=\theta-\theta^{*} and M,M^{*}\in{\mathcal{C}}. Then for any \tilde{\theta}=\theta^{*}+\gamma(\theta-\theta^{*}) and any \gamma\in[0,1], we have

 \langle w,\left[\nabla_{\theta\theta}^{2}\tilde{F}_{\Omega,Y}(\tilde{\theta})\right]w\rangle\geq\gamma_{\alpha}\left\|\left(M-M^{*}\right)_{\Omega}\right\|_{F}^{2}.

Proof.

Using (5), (18) and (19), we have

 \langle w,\left[\nabla_{\theta\theta}^{2}\tilde{F}_{\Omega,Y}(\tilde{\theta})\right]w\rangle=\sum_{(i,j)\in\Omega}\left(\frac{\partial^{2}F_{\Omega,Y}(\tilde{M})}{\partial M_{ij}^{2}}\right)(M_{ij}-M^{*}_{ij})^{2}\geq\gamma_{\alpha}\sum_{(i,j)\in\Omega}(M_{ij}-M^{*}_{ij})^{2}=\gamma_{\alpha}\left\|\left(M-M^{*}\right)_{\Omega}\right\|_{F}^{2}\,, (23)

which completes the proof.

We need a result similar to [7, Theorem 4.1] regarding closeness of a fixed matrix to its sampled version, which is proved therein for square matrices M^{*} under an incoherence assumption on M^{*}. In Lemma VI.4 we prove a similar result for rectangular Z with bounded \|Z\|_{\infty}. Define

 \|Z\|_{\max}\equiv\inf\{\max(\|U\|_{2,\infty}^{2},\|V\|_{2,\infty}^{2}):\,Z=UV^{\top}\}\,,

where for a matrix A, \|A\|_{2,\infty} denotes the largest \ell_{2} norm of the rows of A, i.e., \|A\|_{2,\infty}\equiv\max_{i}\|A_{i,\cdot}\|_{2}.

For Z\in\mathbb{R}^{m\times n} with m\geq n, define the operator {\cal R}_{\Omega} entry-wise as

 Z_{\Omega}\equiv{\cal R}_{\Omega}(Z),\quad[{\cal R}_{\Omega}(Z)]_{ij}=\begin{cases}Z_{ij}&\mbox{ if }(i,j)\in\Omega,\\ 0&\mbox{ otherwise. }\end{cases}
Lemma VI.4

Let G\backslash\Omega satisfy assumptions (A1) and (A2) in Section II. Let Z\in\mathbb{R}^{m\times n} with rank(Z)\leq r. Then we have

 \left\|\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}-I\right)(Z)\right\|_{2}\leq\frac{\sqrt{mn}\sigma_{2}(G)}{\sigma_{1}(G)}\|Z\|_{\max} (24)
 \leq\frac{\sqrt{rmn}\sigma_{2}(G)}{\sigma_{1}(G)}\|Z\|_{\infty}\leq Cm\sqrt{\frac{nr}{|\Omega|}}\|Z\|_{\infty}. (25)

Proof.

By definition of \|Z\|_{\max}, there exist U\in\mathbb{R}^{m\times k} and V\in\mathbb{R}^{n\times k} for some 1\leq k\leq\min(m,n) such that Z=UV^{\top}, \|U\|_{2,\infty}^{2}\leq\|Z\|_{\max} and \|V\|_{2,\infty}^{2}\leq\|Z\|_{\max}. Since \text{rank}(Z)\leq r, we have k\leq r, but this fact is not needed in our proof. By the variational definition of operator norm,

 \left\|\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}(Z)-Z\right\|_{2}=\max_{x,y:\,\|x\|_{2}=1=\|y\|_{2}}y^{\top}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}(Z)-Z\right)x.

We also have {\cal R}_{\Omega}(Z)=Z\circ G where \circ denotes the Hadamard (elementwise) product. Letting U_{\cdot,\ell} and V_{\cdot,\ell} respectively denote the \ell-th column of U and V, we write

 Z=\sum_{\ell=1}^{k}U_{\cdot,\ell}V_{\cdot,\ell}^{\top}\,.

We therefore have

 y^{\top}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}(Z)-Z\right)x=\sum_{\ell=1}^{k}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}(y\circ U_{\cdot,\ell})^{\top}G(x\circ V_{\cdot,\ell})-(y^{\top}U_{\cdot,\ell})(x^{\top}V_{\cdot,\ell})\right)\,.

Normalize \mathbf{1}_{m} to unit norm as \tilde{\mathbf{1}}_{m}=\mathbf{1}_{m}/\sqrt{m}, and similarly for \tilde{\mathbf{1}}_{n}. Let y\circ U_{\cdot,\ell}=\alpha_{\ell}\tilde{\mathbf{1}}_{m}+\beta_{\ell}\tilde{\mathbf{1}}_{m\perp}^{\ell} where \tilde{\mathbf{1}}_{m\perp}^{\ell} is a unit norm vector orthogonal to \tilde{\mathbf{1}}_{m}. Then \alpha_{\ell}=\tilde{\mathbf{1}}_{m}^{\top}(y\circ U_{\cdot,\ell})=y^{\top}U_{\cdot,\ell}/\sqrt{m}. Hence

 y^{\top}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}(Z)-Z\right)x=\sum_{\ell=1}^{k}\Big(\frac{\sqrt{mn}}{\sigma_{1}(G)}\Big[\frac{1}{\sqrt{m}}y^{\top}U_{\cdot,\ell}\tilde{\mathbf{1}}_{m}^{\top}G(x\circ V_{\cdot,\ell})+\beta_{\ell}\tilde{\mathbf{1}}_{m\perp}^{\ell\top}G(x\circ V_{\cdot,\ell})\Big]-(y^{\top}U_{\cdot,\ell})(x^{\top}V_{\cdot,\ell})\Big)=\sum_{\ell=1}^{k}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}\beta_{\ell}\tilde{\mathbf{1}}_{m\perp}^{\ell\top}G(x\circ V_{\cdot,\ell})\right), (26)

where we used the facts that \tilde{\mathbf{1}}_{m}^{\top}G=\sigma_{1}(G)\tilde{\mathbf{1}}_{n}^{\top} and \tilde{\mathbf{1}}_{n}^{\top}(x\circ V_{\cdot,\ell})=x^{\top}V_{\cdot,\ell}/\sqrt{n}. Since \tilde{\mathbf{1}}_{m} is the top left singular vector of G, we have

 |\tilde{\mathbf{1}}_{m\perp}^{\ell\top}Gz|\leq\sigma_{2}(G)\|z\|_{2}\mbox{ for any }z\in\mathbb{R}^{n}.

Using the above inequality in (26) we obtain

 y^{\top}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}(Z)-Z\right)x\leq\frac{\sqrt{mn}}{\sigma_{1}(G)}\sigma_{2}(G)\sum_{\ell=1}^{k}|\beta_{\ell}|\|x\circ V_{\cdot,\ell}\|_{2}\leq\frac{\sqrt{mn}}{\sigma_{1}(G)}\sigma_{2}(G)\sqrt{\sum_{\ell=1}^{k}\beta_{\ell}^{2}}\sqrt{\sum_{\ell=1}^{k}\|x\circ V_{\cdot,\ell}\|_{2}^{2}}. (27)

We have \beta_{\ell}=\tilde{\mathbf{1}}_{m\perp}^{\ell\top}(y\circ U_{\cdot,\ell}). Hence, |\beta_{\ell}|\leq\|y\circ U_{\cdot,\ell}\|_{2}. Therefore,

 \sum_{\ell=1}^{k}\beta_{\ell}^{2}\leq\sum_{\ell=1}^{k}\|y\circ U_{\cdot,\ell}\|_{2}^{2}=\sum_{i=1}^{m}\sum_{\ell=1}^{k}y_{i}^{2}U_{i,\ell}^{2}=\sum_{i=1}^{m}y_{i}^{2}\|U_{i,\cdot}\|_{2}^{2}\leq\|U\|_{2,\infty}^{2}\sum_{i=1}^{m}y_{i}^{2}\leq\|Z\|_{\max} (28)

where we used \sum_{i=1}^{m}y_{i}^{2}=1. Similarly, we have

 \sum_{\ell=1}^{k}\|x\circ V_{\cdot,\ell}\|_{2}^{2}=\sum_{j=1}^{n}\sum_{\ell=1}^{k}x_{j}^{2}V_{j,\ell}^{2}=\sum_{j=1}^{n}x_{j}^{2}\|V_{j,\cdot}\|_{2}^{2}\leq\|V\|_{2,\infty}^{2}\sum_{j=1}^{n}x_{j}^{2}\leq\|Z\|_{\max}\,. (29)

It then follows from (27)-(29) that

 y^{\top}\left(\frac{\sqrt{mn}}{\sigma_{1}(G)}{\cal R}_{\Omega}(Z)-Z\right)x\leq\frac{\sqrt{mn}\sigma_{2}(G)}{\sigma_{1}(G)}\|Z\|_{\max}. (30)

This establishes (24). Now use \|Z\|_{\max}\leq\sqrt{r}\|Z\|_{\infty} [5] and |\Omega|=md to establish (25).

Lemma VI.5

Let M,M^{*}\in{\mathcal{C}}. Then we have

 \left\|\left(M-M^{*}\right)_{\Omega}\right\|_{F}\geq\frac{\sigma_{1}(G)}{\sqrt{2rmn}}\left\|M-M^{*}\right\|_{F}-2\alpha\sqrt{r}\sigma_{2}(G).

Proof.

Let Z\equiv M-M^{*}, a=\sqrt{mn}/\sigma_{1}(G), and b=(\sigma_{2}(G)/\sigma_{1}(G))\sqrt{rmn}. Then by Lemma VI.4 and the fact that rank(Z)\leq{\rm rank}(M)+{\rm rank}(M^{*})\leq 2r, we have

 \left|a\|Z_{\Omega}\|_{2}-\|Z\|_{2}\right|\leq\|aZ_{\Omega}-Z\|_{2}\leq b\|Z\|_{\infty}. (31)

Using \|Z\|_{\infty}=\|M-M^{*}\|_{\infty}\leq\|M\|_{\infty}+\|M^{*}\|_{\infty}\leq 2\alpha, (31) can be expressed as \|Z\|_{2}\leq a\|Z_{\Omega}\|_{2}+2\alpha b. Since \|A\|_{2}\leq\|A\|_{F} \forall A, we then have \|Z\|_{2}\leq a\|Z_{\Omega}\|_{F}+2\alpha b. Since \|A\|_{F}\leq\sqrt{{\rm rank}(A)}\|A\|_{2} \forall A, we have \|Z\|_{F}\leq\sqrt{2r}\|Z\|_{2}\leq\sqrt{2r}a\|Z_{\Omega}\|_{F}+2\sqrt{2r}\alpha b, leading to the desired result.
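The final rearrangement in this proof, left implicit above, reads as follows with a=\sqrt{mn}/\sigma_{1}(G) and b=(\sigma_{2}(G)/\sigma_{1}(G))\sqrt{rmn} as defined in the proof:

```latex
\|Z_{\Omega}\|_{F}
  \ge \frac{\|Z\|_{F} - 2\sqrt{2r}\,\alpha b}{\sqrt{2r}\,a}
  = \frac{\sigma_{1}(G)}{\sqrt{2rmn}}\,\|Z\|_{F} - \frac{2\alpha b}{a},
\qquad
\frac{b}{a} = \sqrt{r}\,\sigma_{2}(G) .
```

Substituting b/a gives the bound stated in Lemma VI.5.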

Proof of Theorem III.1: Consider \tilde{F}_{\Omega,Y}(\theta)={F}_{\Omega,Y}(M). The objective function F_{\Omega,Y}(M) is continuous in M and the set {\mathcal{C}} is compact, therefore, F_{\Omega,Y}(M) achieves a minimum in {\mathcal{C}}. Now suppose that \widehat{M}\in{\mathcal{C}} minimizes {F}_{\Omega,Y}(M). Then {F}_{\Omega,Y}(\widehat{M})\leq{F}_{\Omega,Y}(M) \forall M\in{\mathcal{C}}, including M=M^{*}. Define

 c_{g}\equiv 2.01L_{\alpha}\sqrt{2rm},\quad c_{h}\equiv\frac{\gamma_{\alpha}\sigma_{1}^{2}(G)}{16rmn}. (32)

Using (16) and Lemmas VI.2 and VI.3, we have w.h.p. (specified in Lemma VI.2)

 {F}_{\Omega,Y}(M)\geq{F}_{\Omega,Y}(M^{*})-c_{g}\|M-M^{*}\|_{F}+\frac{\gamma_{\alpha}}{2}\|(M-M^{*})_{\Omega}\|_{F}^{2}.

Since \widehat{M} minimizes {F}_{\Omega,Y}(M), we have

 0\geq{F}_{\Omega,Y}(\widehat{M})-{F}_{\Omega,Y}(M^{*})\geq-c_{g}\|\widehat{M}-M^{*}\|_{F}+\frac{\gamma_{\alpha}}{2}\|(\widehat{M}-M^{*})_{\Omega}\|_{F}^{2}. (33)

Set \eta=2\alpha r(\sigma_{2}(G)/\sigma_{1}(G))\sqrt{2mn} and \eta_{0}=\sigma_{1}(G)/\sqrt{2rmn}. Then Lemma VI.5 implies \|(M-M^{*})_{\Omega}\|_{F}\geq\eta_{0}\left[\|M-M^{*}\|_{F}-\eta\right]. Now consider two cases: (i) \|\widehat{M}-M^{*}\|_{F}<2\eta, (ii) \|\widehat{M}-M^{*}\|_{F}\geq 2\eta. In case (i), the upper bound 2\eta on \|\widehat{M}-M^{*}\|_{F} holds by assumption. Turning to case (ii), we have

 \|\widehat{M}-M^{*}\|_{F}-\eta\geq\|\widehat{M}-M^{*}\|_{F}-\frac{1}{2}\|\widehat{M}-M^{*}\|_{F}=\frac{1}{2}\|\widehat{M}-M^{*}\|_{F}. (34)

Using (33), (34) and Lemma VI.5 with M=\widehat{M}, we have

 0\geq{F}_{\Omega,Y}(\widehat{M})-{F}_{\Omega,Y}(M^{*})\geq-c_{g}\|\widehat{M}-M^{*}\|_{F}+{c_{h}}\|\widehat{M}-M^{*}\|_{F}^{2}=\|\widehat{M}-M^{*}\|_{F}\left[-c_{g}+{c_{h}}\|\widehat{M}-M^{*}\|_{F}\right]. (35)

In order for (35) to be true, we must have \|\widehat{M}-M^{*}\|_{F}\leq c_{g}/c_{h}; otherwise the right-hand side of (35) is positive, violating (35). Combining the two cases, we obtain

 \|\widehat{M}-M^{*}\|_{F}\leq\max\left(2\eta,\frac{c_{g}}{c_{h}}\right)=\max\left(4\alpha r\sqrt{2mn}\frac{\sigma_{2}(G)}{\sigma_{1}(G)},\frac{32.16\sqrt{2}L_{\alpha}(rm)^{1.5}n}{\gamma_{\alpha}\sigma_{1}^{2}(G)}\right) (36)
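The second argument of the maximum in (36) can be verified directly. The values of c_{g} and c_{h} used below are the ones implied by the surrounding derivation (c_{g} from Lemma VI.2, and c_{h} from combining (33) and (34) with Lemma VI.5), stated here as our reading of the proof:

```latex
c_{g} = 2.01\,L_{\alpha}\sqrt{2rm},
\qquad
c_{h} = \frac{\gamma_{\alpha}}{2}\Big(\frac{\eta_{0}}{2}\Big)^{2}
      = \frac{\gamma_{\alpha}\,\sigma_{1}^{2}(G)}{16\,rmn},
\qquad
\frac{c_{g}}{c_{h}}
  = \frac{2.01 \cdot 16\,\sqrt{2}\,L_{\alpha}\,(rm)^{1/2}\,rmn}{\gamma_{\alpha}\,\sigma_{1}^{2}(G)}
  = \frac{32.16\sqrt{2}\,L_{\alpha}\,(rm)^{1.5}\,n}{\gamma_{\alpha}\,\sigma_{1}^{2}(G)} .
```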

This is the bound stated in (8) of the theorem after division by \sqrt{mn}. The high probability stated in the theorem follows from Lemma VI.2 after setting \varepsilon=0.5. Finally, we use \sigma_{2}(G)/\sigma_{1}(G)\leq C/\sqrt{d}=C\sqrt{m}/\sqrt{|\Omega|} and 1/\sigma_{1}^{2}(G)\leq 1/d^{2}=m^{2}/|\Omega|^{2} to derive (9). \hfill\;\;\blacksquare

References

• [1] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
• [2] Y. Shang, W. Ruml, Y. Zhang, and M. Fromherz, “Localization from connectivity in sensor networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 11, pp. 961–974, 2004.
• [3] A. Karbasi and S. Oh, “Robust localization from incomplete local information,” IEEE Transactions on Networking, vol. 21, pp. 1131 – 1144, August 2013.
• [4] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bit matrix completion,” Information and Inference, vol. 3(3), pp. 189–223, 2014.
• [5] T. Cai and W.-X. Zhou, “A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion,” Journal of Machine Learning Research, vol. 14, pp. 3619–3647, 2013.
• [6] A. S. Lan, C. Studer, and R. G. Baraniuk, “Matrix recovery from quantized and corrupted measurements,” in IEEE ICASSP, Florence, Italy, May 2014.
• [7] S. Bhojanapalli and P. Jain, “Universal matrix completion,” in Proceedings of the 31st International Conference on Machine Learning, ser. ICML ’14, 2014.
• [8] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” Information Theory, IEEE Transactions on, vol. 56, no. 6, pp. 2980–2998, 2010.
• [9] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge Univ Press, 2004.
• [10] F. Bach, J. Mairal, and J. Ponce, “Convex sparse matrix factorizations,” 2008, arXiv:0812.1869v1 .
• [11] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming (series B), vol. 95, pp. 329–357, 2003.
• [12] B. Recht and C. Re, “Parallel stochastic gradient algorithms for large-scale matrix completion,” Math. Program. Comput., vol. 5, no. 2, pp. 201–226, 2013.
• [13] S. Chatterjee, “Matrix estimation by universal singular value thresholding,” 2013, arXiv:1212.1247v5 .