Noisy Inductive Matrix Completion Under Sparse Factor Models

Abstract

Inductive Matrix Completion (IMC) is an important class of matrix completion problems that allows direct inclusion of available features to enhance estimation capabilities. These models have found applications in personalized recommendation systems, multilabel learning, dictionary learning, etc. This paper examines a general class of noisy matrix completion tasks where the underlying matrix follows an IMC model, i.e., it is formed by a mixing matrix (a priori unknown) sandwiched between two known feature matrices. The mixing matrix here is assumed to be well approximated by the product of two sparse matrices—referred to here as “sparse factor models.” We leverage the main theorem of [1] and extend it to provide theoretical error bounds for the sparsity-regularized maximum likelihood estimators for the class of problems discussed in this paper. The main result is general in the sense that it can be used to derive error bounds for various noise models. In this paper, we instantiate our main result for the case of Gaussian noise and provide corresponding error bounds in terms of the squared loss.

 Akshay Soni, Troy Chevalier Yahoo! Research Sunnyvale, CA, USA {akshaysoni,troyc}@yahoo-inc.com

 Swayambhoo Jain Department of ECE University of Minnesota – Twin Cities jainx174@umn.edu

1 Introduction

In recent years, matrix completion has been an active area of research, on both the theoretical and the algorithmic fronts. Applications include collaborative filtering, multiclass learning, dimensionality reduction, etc. In the most general terms, the aim of these problems is to estimate the elements of a matrix from noisy observations collected at only a subset of its locations. Of course, such estimation problems are ill-posed without further assumptions, since the values of the matrix at the unobserved locations can be arbitrary. A common approach is to augment the inference method with the assumption that the underlying matrix to be estimated exhibits some form of intrinsic low-dimensional structure, which in some cases makes the problem feasible. For example, the assumption that the target matrix is low-rank has been a main focus of research on these problems. Even though the rank-constrained problem is non-convex and difficult to analyze, convex relaxations have been well studied in noiseless [2], additive-uncertainty [3], and even nonlinear observation settings (e.g., one-bit quantized [4]). However, for computational reasons, the most popular algorithms for solving these problems are variants of alternating minimization [1]; the low-rank structure is imposed explicitly via bilinear models, i.e., X∗ is assumed to be a product of two factor matrices, as

 X∗=D∗A∗, (1)

where X∗∈ℝn1×n2, D∗∈ℝn1×r, A∗∈ℝr×n2, and r≤min(n1,n2). Slight modifications of this formulation also allow for explicit control of the rank r and facilitate imposing additional constraints like positivity [5], sparsity [6], or even tree-sparsity [7] on the factors.

In terms of the well-known user-item recommendation problem [8]: X∗ pertains to the rating matrix, and D∗ and A∗ denote the matrices of user and item latent factors, respectively. Apart from the rating information, we generally also have other user and item side-features (like age, gender, etc.) that can be used to further improve the quality of estimation. These features can be included via a Bayesian framework [9, 10] or by directly changing the model, as in Inductive Matrix Completion (IMC) [11, 12], where we have the following model assumption on X∗:

 X∗=AW∗B=AP∗Q∗B, (2)

where A∈ℝn1×r1 and B∈ℝr2×n2 are known matrices, for example, user and item feature matrices, W∗∈ℝr1×r2 is a mixing matrix whose low-rank structure is explicitly imposed via the bilinear model W∗=P∗Q∗, with P∗∈ℝr1×r and Q∗∈ℝr×r2.
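As a concrete, purely illustrative sketch of the model in (2), with dimensions and sparsity levels chosen arbitrarily for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 100, 80   # size of the target matrix X*
r1, r2 = 20, 15    # number of row (user) and column (item) features
r = 5              # inner dimension of the bilinear factorization W* = P* Q*

A = rng.standard_normal((n1, r1))  # known row-feature matrix
B = rng.standard_normal((r2, n2))  # known column-feature matrix

# Sparse factors P* (r1 x r) and Q* (r x r2): most entries zeroed out
P = rng.uniform(-1.0, 1.0, (r1, r)) * (rng.random((r1, r)) < 0.3)
Q = rng.uniform(-1.0, 1.0, (r, r2)) * (rng.random((r, r2)) < 0.3)

W = P @ Q          # unknown mixing matrix, rank at most r
X = A @ W @ B      # target matrix following the IMC model (2)
```

In the estimation task studied below, only a (possibly noisy) subset of the entries of X would be observed, and the goal is to recover X from those samples together with the known A and B.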

The goal of this paper is to provide estimation error bounds for matrix completion under the inductive setting, as in (2), by using the framework of regularized maximum likelihood estimation [1]. For generality, we derive these bounds under the assumption that both P∗ and Q∗ can be sparse, and that the noise can follow any generic distribution. This kind of sparse structure arises in recommendation problems where sparse latent vectors are desired due to the requirement of low computational and storage complexity [13], and in variants of dictionary learning problems where, apart from the coefficient matrix being sparse, the dictionary itself is modeled as sparse [14].

1.1 Related Works

Our inference task here essentially entails learning two structured factors in a bilinear model. With a few notable exceptions (e.g., low-rank matrices and certain non-negative matrices [15, 16]), the non-convex bilinear models studied in this paper are difficult to analyze. Several recent efforts in the dictionary learning literature have established theoretical guarantees on identifiability, as well as algorithmic analyses of a number of optimization approaches [17, 18, 19, 20]. Most closely related to our work are [1, 6, 17], where the authors provide a framework for deriving estimation error bounds for models of the form (1), under various noise distributions and sparsity structures on the factors, by using a regularized maximum-likelihood approach based on ideas from [21]. The overall theme of the estimation bounds obtained using this framework is that the error of the estimate of X∗ decays as the ratio of the number of degrees of freedom to be estimated to the number of observations.

The idea of including side features in factorization models to improve recommendations has been considered using Bayesian formulations in [22], or via directly incorporating these features in a generative model, as in IMC [12] and Factorization Machines [23]. The statistical performance of IMC has been theoretically analyzed via convex relaxations based on the nuclear norm in noiseless [12] and noisy settings [24]. From the algorithmic perspective, a provable alternating-minimization based approach with bilinear models has recently been proposed and analyzed [11]. Our results in this paper provide additional insights into these problems under a general setting of noise and sparsity of the bilinear factors. It is to be noted here that the aim of all these related works is to accurately estimate the mixing matrix W∗; we provide error bounds for the estimation of X∗.

Many recent works have used the IMC framework for large-scale recommendations [25] or extended IMC to automatically construct better features [26].

1.2 Specific Contributions

In this paper we provide estimation error guarantees for matrix completion using a regularized maximum likelihood estimator under the model described in (2), for the cases where the factor matrices are dense (either one or both), or when each factor is itself sparse. In order to arrive at these bounds, we extend the fundamental theorem proposed in [1] to the inductive matrix completion setting studied here. This extension is detailed in Theorem 2.1. The main utility of this theorem is that it can provide estimation error guarantees for a variety of noise distributions; for example, Gaussian, Poisson, Laplace, and even the extreme quantized setting of 1-bit observations. Here we instantiate it for Gaussian noise and obtain the corresponding error guarantees.

1.3 Preliminaries

To set the stage for the statement of our main result, we recall several information-theoretic preliminaries. When p and q denote the pdfs (or pmfs) of a real-valued random variable Y, the Kullback–Leibler divergence (or KL divergence) of q from p is denoted D(p∥q) and given by

 D(p∥q)=Ep[log(p(Y)/q(Y))],

where the logarithm is taken to be the natural log. By definition, D(p∥q) is finite only if the support of p is contained in the support of q. Further, the KL divergence satisfies D(p∥q)≥0, with D(p∥q)=0 when p=q. We also use the Hellinger affinity, denoted A(p,q) and given by

 A(p,q)=Ep[√(q(Y)/p(Y))]=Eq[√(p(Y)/q(Y))].

It is known that 0≤A(p,q)≤1. When p and q are parameterized by the elements Xi,j and X′i,j of matrices X and X′, respectively, so that p=pXi,j and q=pX′i,j, we use the shorthand notation D(pX∥pX′)≜∑i,jD(pXi,j∥pX′i,j) and −logA(pX,pX′)≜−∑i,jlogA(pXi,j,pX′i,j). Finally, for a matrix M we denote by ∥M∥0 its number of nonzero elements, and by ∥M∥max the magnitude of its largest element (in absolute value). Also, n1∨n2≜max{n1,n2}.
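As a quick numerical sanity check of these definitions (our own illustration, using two Gaussian densities with a common, arbitrarily chosen variance; the closed forms on the last two lines are the standard equal-variance Gaussian expressions):

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

a, b, sigma = 0.3, -0.5, 1.2
y = np.linspace(-15.0, 15.0, 200001)   # fine grid for numerical integration
dy = y[1] - y[0]
p, q = gauss_pdf(y, a, sigma), gauss_pdf(y, b, sigma)

kl = np.sum(p * np.log(p / q)) * dy    # D(p||q) = E_p[log p(Y)/q(Y)]
aff = np.sum(np.sqrt(p * q)) * dy      # A(p,q) = E_p[sqrt(q(Y)/p(Y))]

# Closed forms for equal-variance Gaussians:
kl_closed = (a - b) ** 2 / (2 * sigma ** 2)
aff_closed = np.exp(-(a - b) ** 2 / (8 * sigma ** 2))
```

The numerical values agree with the closed forms, and the affinity indeed lies in [0, 1].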

2 Problem Statement and Recovery Result

Our focus here is to estimate a matrix X∗∈ℝn1×n2 that admits a model as in (2) from noisy observations collected at a subset of its locations. Sampling locations are randomly chosen in the sense that, for an integer m satisfying 4≤m≤n1n2 and γ≜m/(n1n2), each location (i,j)∈[n1]×[n2] is observed independently via a Bernoulli(γ) model. Going forward, we assume the following bounds on the participating matrices:

 ∥X∗∥max≤Xmax/2, ∥P∗∥max≤1, ∥Q∗∥max≤Qmax, ∥A∥max≤Amax, and ∥B∥max≤Bmax,

for some positive constants Xmax, Qmax, Amax, and Bmax.

Given S and X∗, we write the joint pdf (or pmf) of the observations as

 pX∗S(YS)≜∏(i,j)∈SpX∗i,j(Yi,j), (3)

where pX∗i,j(Yi,j) denotes the corresponding scalar pdf (or pmf), and we use the shorthand YS to denote the collection of elements of Y indexed by (i,j)∈S. Then our task may be described succinctly as follows: given S and corresponding noisy observations YS of X∗S distributed according to (3), our goal is to estimate X∗ under the assumption that it admits a model as in (2).

Our approach will be to estimate X∗ via sparsity-penalized maximum likelihood; we consider estimates of the form

 ˆX=argminX=APQB∈X {−logpXS(YS)+λP∥P∥0+λQ∥Q∥0}, (4)

where λP,λQ>0 are user-specified regularization parameters, XS is shorthand for the collection of entries of X indexed by S, and X is an appropriately constructed class of candidate estimates. To facilitate our analysis here, we take X to be a countable class of estimates constructed as follows: for a specified β≥1, we set Llev≜(n1∨n2)β and construct P to be the set of all matrices P∈ℝr1×r whose elements are either zero, or are discretized to one of Llev uniformly spaced levels in the range [−1,1], and Q to be the set of all matrices Q∈ℝr×r2 whose elements either take the value zero, or are discretized to one of Llev uniformly spaced levels in the range [−Qmax,Qmax]. Then, we let

 X≜{X=APQB : P∈P, Q∈Q, ∥X∥max≤Xmax}. (5)
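The construction of the discretized factor classes can be sketched as follows (small, arbitrary values of β, Qmax, and the dimensions; the `quantize` helper is our own illustration, not from the paper):

```python
import numpy as np

n1, n2, beta = 100, 80, 1.0
L_lev = int(max(n1, n2) ** beta)   # number of uniformly spaced amplitude levels
Qmax = 2.0

levels_P = np.linspace(-1.0, 1.0, L_lev)    # allowed amplitudes for entries of P
levels_Q = np.linspace(-Qmax, Qmax, L_lev)  # allowed amplitudes for entries of Q

def quantize(M, levels):
    """Snap each nonzero entry of M to its nearest level; zeros stay zero."""
    Mq = levels[np.abs(M[..., None] - levels).argmin(axis=-1)]
    Mq[M == 0] = 0.0
    return Mq

rng = np.random.default_rng(1)
P = rng.uniform(-1.0, 1.0, (20, 5)) * (rng.random((20, 5)) < 0.3)
Pq = quantize(P, levels_P)  # Pq is a member of the countable class P
```

Each entry moves by at most half the level spacing, i.e., 1/(L_lev − 1); a per-entry quantization error of this order is exactly what the approximation argument in the proof of Corollary 2.1 relies on.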

Our main result establishes error bounds under general noise or corruption models. We state the result here as a theorem; its proof appears in Section 3.

Theorem 2.1.

Let X, P, Q, β, Llev, and m be as defined above, and let CD be any constant satisfying

 CD≥maxX∈Xmaxi,j D(pX∗i,j∥pXi,j).

Then for any λP and λQ satisfying

 λP≥2(1+2CD/3)(β+4)log(n1∨n2), (6)
 λQ≥2(1+2CD/3)(β+4)log(n1∨n2), (7)

the complexity penalized maximum likelihood estimator (4) satisfies the (normalized, per-element) error bound

 ES,YS[−2logA(pˆXλ,pX∗)]/(n1n2) ≤ 8CDlog(m)/m + 3⋅minX∈X{D(pX∗∥pX)/(n1n2) + ((λP+λQ)+4CD(β+4)log(n1∨n2)/3)⋅(∥P∥0+∥Q∥0)/m}.

Before proceeding to the specific case of Gaussian noise, we note a few salient points about this result. First, as alluded to above, our result is general and can be used to analyze the error performance under a variety of noise models. Specializing to a given noise model requires us to compute upper bounds on the KL divergences in terms of the problem parameters, and a lower bound on the negative log Hellinger affinity in terms of the desired error metric. Second, the above result is an oracle bound, in that it is specified in terms of a minimum over X. We can construct a valid upper bound by evaluating the oracle term at any X∈X. We construct a specific X∈X that is close to X∗, and evaluate the oracle bound at that point to obtain a non-trivial upper bound.

Finally, we note that the optimization problem (4) is non-convex, as is often the case, not only because of the ℓ0 penalty or the discretized spaces (P and Q) used for the proofs in this paper, but due to the fundamental challenge posed by bilinear models.

2.1 Implications for Gaussian Noise

Below we provide the scaling behavior of the error bound with the problem parameters for Gaussian noise with known variance σ2, by instantiating Theorem 2.1. The joint likelihood of the observations can be written as

 pX∗S(YS)=1(2πσ2)|S|/2exp(−12σ2 ∥YS−X∗S∥2F), (8)

where we have used the shorthand notation ∥YS−X∗S∥2F≜∑(i,j)∈S(Yi,j−X∗i,j)2. To proceed, we fix β as

 β=max{1, 1+log(6√m⋅(r⋅r1⋅r2)⋅AmaxBmaxQmax/Xmax)/log(n1∨n2)} (9)

which determines the number of discretization levels for the elements of each factor matrix, and we set the regularization parameters equal to the right-hand side of (6). In this setting we have the following result; its proof appears in Section 3.2.
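For intuition, the objective of (4) under the Gaussian likelihood (8) can be evaluated as below. This is an evaluation-only sketch of our own (the actual minimization over the discretized classes is non-convex and not attempted here), with all dimensions and parameter values chosen arbitrarily:

```python
import numpy as np

def penalized_objective(Y, mask, A, P, Q, B, sigma, lam_P, lam_Q):
    """Negative Gaussian log-likelihood on the observed entries (up to an
    additive constant) plus l0 penalties on the factors, as in (4)."""
    X = A @ P @ Q @ B
    resid = (Y - X)[mask]
    data_fit = resid @ resid / (2.0 * sigma ** 2)
    return data_fit + lam_P * np.count_nonzero(P) + lam_Q * np.count_nonzero(Q)

# Tiny usage example with the true factors and noiseless observations:
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4)); B = rng.standard_normal((3, 5))
P = np.array([[1.0, 0], [0, 0], [0.5, 0], [0, -1.0]])  # 4 x 2, three nonzeros
Q = np.array([[0, 1.0, 0], [2.0, 0, 0]])               # 2 x 3, two nonzeros
Y = A @ P @ Q @ B                                      # noiseless observations
mask = rng.random((6, 5)) < 0.5                        # Bernoulli sampling of entries
val = penalized_objective(Y, mask, A, P, Q, B, sigma=1.0, lam_P=0.1, lam_Q=0.2)
```

With the true factors and no noise, the data-fit term vanishes and only the sparsity penalties remain.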

Corollary 2.1.

Let X∗, ˆX, λP, λQ, and β be as defined above, with CD=2X2max/σ2. The estimate ˆX obtained via (4) satisfies

 ES,YS[∥X∗−ˆX∥2F]/(n1n2) = O((σ2+X2max)⋅((∥P∗∥0+∥Q∗∥0)/m)⋅log(n1∨n2)).

A few comments are in order regarding these error guarantees. We may interpret the quantity ∥P∗∥0+∥Q∗∥0 as the number of degrees of freedom to be estimated, and in this sense we see that the error rate of the penalized maximum likelihood estimator exhibits characteristics of the well-known parametric rate (modulo the logarithmic factor). If P∗ and Q∗ are instead dense, then we get the familiar error bound that decays as r(r1+r2)/m. Compare this with the case where the matrix is just low-rank (as in standard matrix completion tasks), where the error rates are of the form r(n1+n2)/m [1]. Clearly, the IMC model has better error performance when r1≪n1 and r2≪n2, and better still if the factor matrices are sparse.
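The comparison above is just parameter counting: a dense P∗ (r1×r) and Q∗ (r×r2) contribute r(r1+r2) free parameters, whereas the factors of an unstructured rank-r matrix (of sizes n1×r and r×n2) contribute r(n1+n2). With illustrative dimensions of our own choosing:

```python
n1, n2 = 10_000, 5_000   # matrix dimensions
r1, r2 = 50, 40          # feature dimensions
r = 10                   # rank of the mixing matrix

dof_lowrank = r * (n1 + n2)  # plain low-rank matrix completion
dof_imc     = r * (r1 + r2)  # IMC with dense factors: independent of n1, n2

print(dof_lowrank, dof_imc)  # 150000 900
```

The IMC parameter count does not grow with the matrix dimensions, which is the source of the improved rate.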

The presence of the X2max factor is akin to the incoherence conditions prevalent in the matrix-completion literature. In essence, it pertains to the spikiness of the underlying matrix we are trying to estimate [27].

3 Proofs

3.1 Proof of Theorem 2.1

We utilize the following lemma from [1] and apply it to our problem.

Lemma 3.1.

Let X∗ be an n1×n2 matrix whose elements we aim to estimate, and let X be a countable collection of candidate reconstructions of X∗, each with a corresponding penalty pen(X)≥1, so that the collection of penalties satisfies the summability condition ∑X∈X2−pen(X)≤1. Given S, YS, CD, m, and γ as explained earlier, we have that for any ξ≥(1+2CD/3)log2, the complexity penalized maximum likelihood estimator

 ˆXξ=argminX∈X{−logpXS(YS)+ξ⋅pen(X)},

satisfies the (normalized, per-element) error bound

 ES,YS[−2logA(pˆXξ,pX∗)]/(n1n2) ≤ 8CDlog(m)/m + 3⋅minX∈X{D(pX∗∥pX)/(n1n2) + (ξ+4CDlog(2)/3)⋅pen(X)/m},

where, as denoted, the expectation is with respect to the joint distribution of S and YS.

Now, consider any discretized matrix factors P∈P and Q∈Q, as described in Section 2. For P∈P, we encode each nonzero element using log2(LPloc) bits to denote its location, where LPloc≜r1⋅r, and log2(Llev) bits for its amplitude. We can then encode any P∈P with ∥P∥0 nonzero entries by using ∥P∥0(log2(LPloc⋅Llev)) bits. Similarly, we can encode Q∈Q using ∥Q∥0(log2(LQloc⋅Llev)) bits, where LQloc≜r⋅r2. Now, we let X be the set of all such X=APQB, and let the binary code for each X∈X be the concatenation of the binary code for P followed by the binary code for Q. It follows that we may assign to all X∈X penalties pen(X) whose lengths satisfy

 pen(X)=∥P∥0(log2(LPloc⋅Llev))+∥Q∥0(log2(LQloc⋅Llev)), (11)

given that A and B are known matrices.

It is easy to see that such codes are (by construction) uniquely decodable, so we have ∑X∈X2−pen(X)≤1 by the well-known Kraft–McMillan inequality [28]. Now let S be any subset of [n1]×[n2]. For randomly subsampled and noisy observations YS, our estimates take the form

 ˆXξ = argminX=APQB∈X{−logpXS(YS)+ξ⋅pen(X)}.

Further, using Lemma 3.1 we have

 ES,YS[−2logA(pˆXξ,pX∗)]/(n1n2) ≤ 8CDlog(m)/m + 3⋅minX∈X{D(pX∗∥pX)/(n1n2) + (ξ+4CDlog(2)/3)(log2Lloc+log2Llev)⋅(∥P∥0+∥Q∥0)/m},

where Lloc≜max{LPloc,LQloc}. Finally, letting

 λP = ξ⋅(log2LPloc+log2Llev) λQ = ξ⋅(log2LQloc+log2Llev)

and using the fact that

 log2LPloc+log2Llev ≤ 8log(r1)+2βlog(n1∨n2) log2LQloc+log2Llev ≤ 8log(r2)+2βlog(n1∨n2) log2Lloc+log2Llev ≤ (β+4)⋅log(n1∨n2)⋅2log2(e)

which follows from our selections of Llev and β, and the facts that r, r1, and r2 are no larger than n1∨n2; it follows (after some straightforward simplification) that λP and λQ satisfy (6) and (7), and that the estimate (4) satisfies

 ES,YS[−2logA(pˆX,pX∗)]/(n1n2) ≤ 8CDlog(m)/m + 3⋅minX∈X{D(pX∗∥pX)/(n1n2) + ((λP+λQ)+4CD(β+4)log(n1∨n2)/3)⋅(∥P∥0+∥Q∥0)/m},

as claimed.

3.2 Proof of Corollary 2.1

Proof.

For σ>0 as specified and any X∈X, using the model (8) we have

 D(pX∗i,j∥pXi,j)=(X∗i,j−Xi,j)2/(2σ2)

for any fixed (i,j). It follows that D(pX∗∥pX)=∥X∗−X∥2F/(2σ2), and using the fact that the amplitudes of the entries of X∗ and of all X∈X are no larger than Xmax, it is clear that we may choose CD=2X2max/σ2. Further, for any X∈X and any fixed (i,j), it is easy to show that in this case

 −2logA(pXi,j,pX∗i,j)=(X∗i,j−Xi,j)2/(4σ2),

so that −2logA(pX,pX∗)=∥X∗−X∥2F/(4σ2). It follows that

 ES,YS[−2logA(pˆX,pX∗)]=ES,YS[∥X∗−ˆX∥2F]/(4σ2).
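For completeness, the two Gaussian identities used above follow from a direct calculation (a standard computation, sketched here in our notation):

```latex
\begin{align*}
D\big(p_{X^*_{i,j}} \,\|\, p_{X_{i,j}}\big)
  &= \mathbb{E}_{p_{X^*_{i,j}}}\!\left[\frac{(Y-X_{i,j})^2 - (Y-X^*_{i,j})^2}{2\sigma^2}\right]
   = \frac{\big(\sigma^2 + (X^*_{i,j}-X_{i,j})^2\big) - \sigma^2}{2\sigma^2}
   = \frac{(X^*_{i,j}-X_{i,j})^2}{2\sigma^2}, \\[4pt]
A\big(p_{X_{i,j}}, p_{X^*_{i,j}}\big)
  &= \int_{\mathbb{R}} \sqrt{p_{X_{i,j}}(y)\, p_{X^*_{i,j}}(y)}\, dy
   = \exp\!\left(-\frac{(X^*_{i,j}-X_{i,j})^2}{8\sigma^2}\right)
   \;\Longrightarrow\;
   -2\log A = \frac{(X^*_{i,j}-X_{i,j})^2}{4\sigma^2}.
\end{align*}
```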

Incorporating this into Theorem 2.1, we obtain that for regularization parameters as in (6) with substituted, the sparsity penalized maximum-likelihood estimate satisfies the per-element mean-square error bound

 ES,YS[∥X∗−ˆX∥2F]/(n1n2) ≤ 64X2maxlog(m)/m + 6⋅minX∈X{∥X∗−X∥2F/(n1n2) + (2σ2(λP+λQ)+16X2max(β+4)log(n1∨n2)/3)⋅(∥P∥0+∥Q∥0)/m}.

We now establish the error bound for the case where P∗ and Q∗ are sparse, with cardinalities ∥P∗∥0 and ∥Q∗∥0, respectively. Consider a candidate reconstruction of the form X∗q=AP∗qQ∗qB, where the elements of P∗q are the closest discretized surrogates of the nonzero entries of P∗, and the entries of Q∗q are the closest discretized surrogates of the nonzero entries of Q∗. Denote △P∗≜P∗q−P∗ and △Q∗≜Q∗q−Q∗. Then it is easy to see that

 X∗q−X∗=A(P∗△Q∗+△P∗Q∗+△P∗△Q∗)B.

Given the range limits on allowable P and Q, and that each range is quantized to Llev levels, we have that ∥△P∗∥max≤2/Llev and ∥△Q∗∥max≤2Qmax/Llev. Now, we can obtain a bound on the magnitudes of the elements of X∗q−X∗ that holds uniformly over all (i,j), as follows:

 ∥X∗q−X∗∥max = ∥A(P∗△Q∗+△P∗Q∗+△P∗△Q∗)B∥max ≤ 6⋅r⋅r1⋅r2⋅AmaxBmaxQmax/Llev, (12)

where the inequality follows from an application of the triangle inequality, followed by the bounds on ∥△P∗∥max and ∥△Q∗∥max and the entry-wise bounds on the elements of allowable P and Q, and because Llev≥2. Now, it is straightforward to show that our choice of β in (9) implies

 Llev≥12⋅r⋅r1⋅r2⋅AmaxBmaxQmax/Xmax, (13)

so each entry of X∗q−X∗ is bounded in magnitude by Xmax/2. It follows that each element of the candidate X∗q constructed above is bounded in magnitude by Xmax, so X∗q is indeed a valid element of the set X.

Further, the approximation error analysis above also implies directly that

 ∥X∗−X∗q∥2F/(n1n2) = (1/(n1n2))∑i∈[n1],j∈[n2](AP∗qQ∗qB−AP∗Q∗B)2i,j ≤ 36⋅(r⋅r1⋅r2)2(AmaxBmaxQmax)2/L2lev ≤ X2max/m,

where the last line follows from the fact that our specific choice of β in (9) also implies

 Llev≥6√m⋅(r⋅r1⋅r2)⋅AmaxBmaxQmax/Xmax. (14)

Now, evaluating the oracle term at the candidate X∗q, and using the facts that ∥P∗q∥0=∥P∗∥0 and ∥Q∗q∥0=∥Q∗∥0, we have

 ES,YS[∥X∗−ˆX∥2F]/(n1n2) ≤ 70X2maxlog(m)/m + 48(σ2+2X2max)(β+4)log(n1∨n2)⋅((∥P∗∥0+∥Q∗∥0)/m).

References

• [1] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Noisy matrix completion under sparse factor models,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3636–3661, June 2016.
• [2] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
• [3] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from noisy entries,” Journal of Machine Learning Research, vol. 11, pp. 2057–2078, 2010.
• [4] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bit matrix completion,” Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.
• [5] A. Soni and J. Haupt, “Estimation error guarantees for poisson denoising with sparse and structured dictionary models,” in 2014 IEEE International Symposium on Information Theory, June 2014, pp. 2002–2006.
• [6] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Error bounds for maximum likelihood matrix completion under sparse factor models,” in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, Dec 2014, pp. 399–403.
• [7] A. Soni and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 133–149, Jan 2014.
• [8] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
• [9] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King, “Sorec: social recommendation using probabilistic matrix factorization,” in Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008, pp. 931–940.
• [10] Ian Porteous, Arthur U Asuncion, and Max Welling, “Bayesian matrix factorization with side information and dirichlet process mixtures.,” in AAAI, 2010.
• [11] Prateek Jain and Inderjit S Dhillon, “Provable inductive matrix completion,” arXiv preprint arXiv:1306.0626, 2013.
• [12] Miao Xu, Rong Jin, and Zhi-Hua Zhou, “Speedup matrix completion with side information: Application to multi-label learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2301–2309.
• [13] E. Acar, G. Gürdeniz, M. A. Rasmussen, D. Rago, L. O. Dragsted, and R. Bro, “Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics,” in 2012 IEEE 12th International Conference on Data Mining Workshops, Dec 2012, pp. 1–8.
• [14] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, March 2010.
• [15] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra, “Computing a nonnegative matrix factorization–provably,” Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 145–162, 2012.
• [16] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix factorization and dimensionality reduction on physical space,” IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3239–3252, 2012.
• [17] J. D. Haupt, N. D. Sidiropoulos, and G. B. Giannakis, “Sparse dictionary learning from 1-bit data,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 7664–7668.
• [18] R. Gribonval and K. Schnass, “Dictionary identification – Sparse matrix-factorization via ℓ1-minimization,” IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.
• [19] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them,” Linear algebra and its applications, vol. 416, no. 1, pp. 48–67, 2006.
• [20] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,” in Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, 2013, pp. 3087–3090.
• [21] A. R. Barron, C. Huang, J. Q. Li, and X. Luo, “The MDL principle, penalized likelihoods, and statistical risk,” Festschrift for Jorma Rissanen. Tampere University Press, Tampere, Finland, 2008.
• [22] Deepak Agarwal and Bee-Chung Chen, “flda: matrix factorization through latent dirichlet allocation,” in Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010, pp. 91–100.
• [23] Steffen Rendle, “Factorization machines,” in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995–1000.
• [24] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S Dhillon, “Matrix completion with noisy side information,” in Advances in Neural Information Processing Systems, 2015, pp. 3447–3455.
• [25] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and Inderjit S Dhillon, “Tumblr blog recommendation with boosted inductive matrix completion,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 203–212.
• [26] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inderjit S. Dhillon, “Goal-directed inductive matrix completion,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), aug 2016.
• [27] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix completion: Optimal bounds with noise,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1665–1697, 2012.
• [28] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley & Sons, 2006.