Exponential Family Matrix Completion under Structural Constraints
We consider the matrix completion problem of recovering a structured matrix from noisy and partial measurements. Recent works have proposed tractable estimators with strong statistical guarantees for the case where the underlying matrix is low–rank, and the measurements consist of a subset, either of the exact individual entries, or of the entries perturbed by additive Gaussian noise, which is thus implicitly suited for thin–tailed continuous data. Arguably, common applications of matrix completion require estimators for (a) heterogeneous data–types, such as skewed–continuous, count, binary, etc., (b) for heterogeneous noise models (beyond Gaussian), which capture varied uncertainty in the measurements, and (c) heterogeneous structural constraints beyond low–rank, such as block–sparsity, or a superposition structure of low–rank plus elementwise sparseness, among others. In this paper, we provide a vastly unified framework for generalized matrix completion by considering a matrix completion setting wherein the matrix entries are sampled from any member of the rich family of exponential family distributions; and impose general structural constraints on the underlying matrix, as captured by a decomposable norm regularizer . We propose a simple convex regularized –estimator for this generalized framework, and provide a unified and novel statistical analysis for this class of estimators. We finally corroborate our theoretical results on simulated datasets.
Matrix Completion, Exponential Families, High Dimensional Prediction, Low Rank Approximation, Nuclear Norm Minimization
In the general problem of matrix completion, we seek to recover a structured matrix from noisy and partial measurements. This problem class encompasses a wide range of practically important applications such as recommendation systems, recovering gene–protein interactions, and analyzing document collections in language processing, among others. In recent years, leveraging developments in sparse estimation and compressed sensing, there has been a surge of work on computationally tractable estimators with strong statistical guarantees, specifically for the setting where a subset of entries of a low–rank matrix are observed either deterministically, or perturbed by additive noise that is Gaussian Candes and Plan (2010), or more generally sub–Gaussian Keshavan et al. (2010b); Negahban and Wainwright (2012). While such a Gaussian noise model is amenable to the subtle statistical analyses required for the ill–posed problem of matrix completion, it is not always practically suitable for all data settings encountered in matrix completion problems. For instance, such a Gaussian error model might not be appropriate in recommender systems based on movie ratings that are either binary (likes or dislikes), or range over the integers one through five. The first question we ask in this paper is whether we can generalize the statistical estimators for matrix completion as well as their analyses to general noise models? Note that a noise model captures the uncertainty underlying the matrix measurements, and is thus an important component of the problem specification given any application; and it is thus vital for broad applicability of the class of matrix completion estimators to extend to general noise models.
Though this might seem like a narrow technical, although important question, it is related to a broader issue. A Gaussian observation model implicitly assumes the matrix values are continuous–valued (and that they are thin–tail–distributed). But in modern applications, matrix data span the gamut of heterogeneous data–types, for instance, skewed–continuous, categorical–discrete including binary, count–valued, among others. This, thus gives rise to the second question of whether we can generalize the standard matrix completion estimators and statistical analyses, suited for thin–tailed continuous data, to more heterogeneous data–types? Note that there has been some recent work for the specific case of binary data by Davenport et al. (2012), but generalizations to other data–types and distributions is largely unexplored.
The recent line of work on matrix completion, moreover, enforces the constraint that the underlying matrix be either exactly or approximately low–rank. Aside from the low–rank constraints, further assumptions to eliminate overly “spiky” matrices are required for well–posed recovery under partial measurements Candes and Recht (2009). Early work provided generalization error bounds for various low–rank matrix completion algorithms, including algorithms based on nuclear norm minimization Candes and Recht (2009); Candes and Tao (2010); Candes and Plan (2010); Recht (2011), max–margin matrix factorization Srebro et al. (2004), spectral algorithms Keshavan et al. (2010a, b), and alternating minimization Jain et al. (2013). These work made stringent matrix incoherence assumptions to avoid “spiky” matrices. These assumptions have been made less stringent in more recent results Negahban and Wainwright (2012), which moreover extend the guarantees to approximately low–rank matrices. Such (approximate) low–rank structure is one instance of general structural constraints which are now understood to be necessary for consistent statistical estimation under high–dimensional settings (with very large number of parameters and very few observations). Note that the high–dimensional matrix completion problem is particularly ill–posed, since the measurements are typically both very local (e.g. individual matrix entries), and partial (e.g. covering a decaying fraction of entries of the entire matrix). However, the specific (approximately) low–rank structural constraint imposed in the past work on matrix completion does not capture the rich variety of other qualitatively different structural constraints such as row–sparseness, column–sparseness, or a superposition structure of low–rank plus elementwise sparseness, among others. For instance, in the classical introductory survey on matrix completion Laurent (2009), the authors discuss structural constraints of a contraction matrix, and a Euclidean distance matrix. Thus, the third question we ask in this paper is whether we can generalize the recent line of work on low–rank matrix completion to the more general structurally constrained case.
In this paper, we answer all of the three questions above in the affirmative, and provide a vastly unified framework for generalized matrix completion. We address the first two questions by considering a general matrix completion setting wherein observed matrix entries are sampled from any member of a rich family of natural exponential family distributions. Note that this family of distributions encompass a wide variety of popular distributions including Gaussian, Poisson, binomial, negative–binomial, Bernoulli, etc. Moreover, the choice of the exponential family distribution can be made depending on the form of the data. For instance, thin–tailed continuous data is typically modeled using the Gaussian distribution; count–data is modeled through an appropriate distribution over integers (Poisson, binomial, etc.), binary data through Bernoulli, categorical–discrete through multinomial, etc. We address the last question by considering general structural constraints upon the underlying matrix, as captured by a general regularization function . Our general matrix completion setting thus captures heterogeneous noise–channels, for heterogeneous data–types, and heterogeneous structural constraints.
In a key contribution, we propose a simple regularized convex –estimator for recovering the structurally constrained underlying matrix in this general setting; and moreover provide a unified and novel statistical analysis for our general matrix completion problem. Following a standard approach Negahban (2012), we (a) first showed that the negative log–likelihood of the subset of observed entries satisfies a form of Restricted Strong Convexity (RSC) (Definition 4); and (b) under this RSC condition, our proposed –estimator satisfies strong statistical guarantees. We note that proving these individual components for our general matrix completion problem under general structural constraints required a fairly delicate and novel analysis, particularly the first component of showing the RSC condition, which we believe would be of independent interest. A key corollary of our general framework is matrix completion under sub–Gaussian samples and low–rank constraints, where we show that our theorem recovers results comparable to the existing literature Candes and Plan (2010); Keshavan et al. (2010b); Negahban and Wainwright (2012). Finally, we corroborate our theoretical findings via simulated experiments.
1.1 Notations and Preliminaries
In this subsection we describe the notations and definitions frequently used throughout the paper. Matrices are denoted by capital letters, , , , etc. For a matrix , and are the column and row of respectively, and denotes the entry of . The transpose, trace, and rank of a matrix are denoted by , , and , respectively. The inner product between two matrices is given by .
For a matrix of rank , with singular values , commonly used matrix norms include the nuclear norm , the spectral norm , the Frobenius norm , and the maximum norm .
Given any matrix norm , the dual norm, is given by .
Definition 1 (Natural Exponential Family).
A distribution of a random variable , in a normed vector space is said to belong to the natural exponential family, if its probability density function, characterized by the parameter in the dual vector space, is given by:
where , called the log–partition function, is strictly convex, and analytic.
Definition 2 (Bregman Divergence).
Let be a strictly convex function differentiable in the relative interior of . The Bregman divergence (associated with ) between and is defined as:
Definition 3 (Subspace compatibility constants).
Given a matrix norm , we define the following maximum and minimum subspace compatibility constants of w.r.t the subspace :
In the rest of the paper, to avoid notational clutter, the dependence of and on is suppressed. Thus, and are denoted as and , respectively.
Definition 4 (Restricted Strong Convexity).
A loss function is said to satisfy Restricted Strong Convexity with respected to a subspace , if for some ,
Definition 5 (Sub–Gaussian Distributions).
A mean zero random variable is said to have a sub–Gaussian distribution with parameter if , the distribution satisfies . Further, if is sub–Gaussian with parameter and , then (Vershynin (2010)).
2 Exponential Family Matrix Completion
Denote the underlying target matrix by . We then assume that individual entries are observed indirectly via a noisy channel: specifically, via a sample drawn from the corresponding member of natural exponential family (see Definition 1):
where is a strictly convex, and analytic function called the log–partition function.
Consider the random matrix , where each entry is drawn independently from the corresponding distribution in (1); it can be seen that:
where we overload the notation to denote as , and the base measure as .
Uniformly Sampled Observations: In a “fully observed” setting, we would observe all the entries of the observation matrix . However, we consider a partially observed setting, where we observe entries over a subset of indices . We assume a uniform sampling model, so that
Note that, under the above described sampling scheme, an index can be sampled multiple times, in such cases we include the multiple instances of in (and not just the unique indices in ). Given , we define the following matrix as
The matrix completion task can then be stated as the estimation of from , where . As noted earlier, this problem is ill–posed in general. However, as we will show, under structural constraints imposed on the parameter matrix , we are able to design an –estimator with a near optimal deviation from .
Gaussian (fixed ) is typically used to model continuous data, , such as measurements with additive errors, affinity datasets. Here, .
Bernoulli is a popular distribution of choice to model binary data, , with . Some examples of data suitable for Bernoulli model include social networks, gene protein interactions, etc.
Binomial (fixed ) is used to model number of successes in trials. Here, , and . Some applications include predicting success/failure rate, survey outcomes, etc.
Poisson is used to model count data , such as arrival times, events per unit time, click–throughs among others. Here, .
Exponential is often used to model positive valued continuous data , specially inter arrival times between events. Here, .
Denote the gradient map:
It can then be verified that the mean and variance of the distribution are given by , and , respectively. The Fenchel conjugate of the log partition function , is denoted by: .
A useful consequence of the exponential family is that the negative log–likelihood is convex in the natural parameters , and moreover has a bijection with a large class of Bregman divergences (Definition 2). The following relationship was first noted by Forster and Warmuth (2002), and later established by Banerjee et al. (2005) [Theorem 4]:
2.3 Discussion and directions for future work
We consider the standard matrix–completion setting where the distribution of the observation matrix in (2) corresponds to its entries being drawn independently from the other entries. Further, the probability of observing a specific entry , under uniform sampling is independent of the noise channel or the distribution . However, in some applications, it might be beneficial to have a sampling scheme involving dependencies; for instance, when a user gives a movie a bad rating, we might want to vary the sampling scheme to sample an entirely different region of the matrix. It would be interesting to extend the analysis of this paper to such a dependent sampling setting.
The form of the observation random matrix distribution in (2), given the individual observations in (1), can be seen to have connotations of multi–task learning: here recovering each individual matrix entry together with its corresponding noise model forms a single task, and all these tasks can be performed jointly given the shared structural constraint on . It would be interesting to generalize this to form a more general statistical framework for partial multi-task learning.
We use the general class of exponential family distributions as the underlying probabilistic model capturing the measurement uncertainties. The richness of the class of exponential family distributions has been used in other settings to provide general statistical frameworks. Kakade et al. (2010) provide a generalization of compressed sensing problem to general exponential family distributions. Note however that results from compressed sensing cannot be immediately extended to matrix completion case, since the sampling operator does not satisfy the typical assumptions (restricted isometry or restricted eigenvalues) made in such settings; see (Candes and Recht, 2009) for additional discussion. There have been extensions of classical probabilistic PCA Tipping and Bishop (1999) from Gaussian noise models to exponential family distributions Collins et al. (2001); Mohamed et al. (2008); Gordon (2002). There have also been recent extensions of probabilistic graphical model classes, beyond Gaussian and Ising models, to multivariate extensions of exponential family distributions (Yang et al., 2012, 2013).
More complicated probabilistic models have also been proposed in the context of collaborative filtering Mnih and Salakhutdinov (2007); Salakhutdinov and Mnih (2008), but these typically involve non–convex optimization, and it is difficult to extend the rigorous statistical analyses of the form in this paper (and in the matrix completion literature) to these models.
3 Main Result and Consequences
As noted in the introduction, we consider the matrix completion setting with general structural constraints on the underlying target matrix . To formalize the notion of such structural constraints, we follow (Negahban, 2012), and assume that satisfies , for some subspace , which contains parameter matrices that are structured similar to the target (the corresponding structural constraints such as low rankness, low rankness+sparsity etc); we also allow the flexibility of working with a superset of the model subspace that is potentially easier to analyze. Moreover, we use their definition of a decomposable norm regularization function, , which suitably captures these structural constraints:
(Decomposable Norm Regularizer) We assume that is a matrix norm, and is decomposable over , i.e. if , then,
We provide some examples of such decomposable regularizers and structural constraint subspaces, and refer to (Negahban, 2012) for more examples and discussion.
Example 1. Low–rank: This is a common structure assumed in numerous matrix estimation problems, particularly those in collaborative filtering, PCA, spectral clustering, etc. The corresponding structural constraint subspaces in this case correspond to a linear span of specific one–rank matrices; we discuss these in further detail in Section 3.2, where we derive a corollary of our general theorem to the specific case of recovery guarantees for low–rank constrained matrix completion. The nuclear norm , has been shown to be decomposable with respect to these constraint subspaces Fazel et al. (2001).
Example 2. Block sparsity: Another important structural constraint for a matrix is block–sparsity, where each row is either all zeros or mostly non–zero, and the number of non–zero rows is small. The structural constraint subspaces in this case correspond to a linear span of specific Frobenius–norm–one matrices that are non–zero in a small subset of the rows (dependent on ); it has been shown that norms Negahban and Wainwright (2008); Obozinski et al. (2011) are decomposable with respect to such structural constraint subspaces. Recalling that is the row of , the norm is defined as:
Example 3. Low–rank plus sparse: This structure is often used to model low–rank matrices which are corrupted by a sparse outlier noise matrix. The structural constraint subspaces corresponding to these consist of the linear span of weighted sum of specific rank–one matrices and sparse matrices with non–zero components on specified positions; and appropriate regularization function decomposable with respect to such structural constraints is the infimum convolution of the weighted nuclear norm with weighted elementwise norm, Candes et al. (2011); Yang and Ravikumar (2013):
The second assumption we make is on the curvature of the log–partition function. This is required to establish a form of RSC (Definition 4) for the loss function.
The second derivative of the log–partition function has atmost an exponential decay, i.e,
It can be verified that commonly used members of natural exponential family obey this assumption.
Finally, we make an assumption to avoid “spiky” target matrices. As Candes and Recht (2009) show with numerous examples, low–rank and presumably other such structural constraints as above, by themselves are not sufficient for accurate recovery, in part due to the infeasibility of recovering overly “spiky” matrices with very few large entries. Early work Candes and Plan (2010); Keshavan et al. (2010a, b), assumed stringent matrix incoherence conditions to preclude such matrices, while more recent work Davenport et al. (2012); Negahban and Wainwright (2012), relax these assumptions to restricting the spikiness ratio, defined as follows:
There exists a known , such that
3.1 –estimator for Generalized Matrix Completion
We propose a regularized –estimate as our candidate parameter matrix . The norm regularizer used is a convex surrogate for the structural constraints, and is assumed to satisfy A 1. For a suitable ,
In the above estimator, for simplicity we have assumed that the domain of the minimizing function spans all or . In cases where this is violated, additional constraints to restrict to the domain could be imposed on the estimator and the results and analysis in the following section still hold. The above optimization problem is a convex program, and can be solved by any off–the shelf convex solvers.
3.2 Main Results
Without loss of generality, suppose that . Let be the dual norm of the regularizer . Further, let and be the maximum and minimum subspace compatibility constants of w.r.t the model subspace (Definition 3)
where the expectation is over the random sampling index set , and over the Rademacher sequence ; here , are the standard basis. This quantity captures the interaction between the sampling scheme and the structural constraint as captured by the regularizer (specifically its dual ). Note that it depends only on (), and on the size of .
In the above theorem, and are constants from Assumptions 2 and 3, respectively. Our proof uses elements from Negahban (2012), as well as Negahban and Wainwright (2012) where they analyze the case of low–rank structure and additive noise, and establish a form of restricted strong convexity (RSC) for squared loss over subset of matrix entries (closely relates to the special case, when the exponential family distribution assumed in (2) is Gaussian). However, showing such an RSC condition for our general loss function over a subset of structured matrix entries involved some delicate arguments. Further, we provide a much simpler proof that moreover only required a low–spikiness condition rather than a multiplicative spikiness and structural constraint. Moreover, we are able to provide an RSC condition broadly for general structures, and the negative log–likelihood loss associated with general exponential family distribution.
An important special case of the problem is when the parameter matrix , is assumed to be of a low–rank . The most commonly used convex regularizer to enforce low–rank is the nuclear norm. The intuition behind the low–rank assumption on the parameter matrix is as follows: the parameter , can be thought of as the true measure of affinity between the entities corresponding to row and column , respectively; and the data , is a sample from a noisy channel parametrized by . It is hypothesized that , are obtained from a weighted interaction of a small number of latent variables as, .
Let and , be the left and right singular vectors, respectively of . Let the column and row span of be and , respectively. Define:
It can be verified that, , however, .
Remark 1: Note that the above results hold for the minimizer of the convex program in (6), optimized for any ; in particular it holds with , where . While in practice is chosen through cross–validation, the theoretical bound in Corollary 1 can be tightened to the following (if ):
Similar bound can be obtained for Theorem 1.
Remark 2: is a measure of noise per entry; . Note that as we do not make stronger matrix incoherence assumptions, only an approximate recovery is guaranteed even as .
4.1 Proof of Theorem 1
The proof of our main theorem involves two key steps:
When the RSC condition holds, the result follows from a few simple calculations; we handle the case where RSC does not hold separately.
Let . We state two results of interest.
Recall the notation . We now consider two cases, depending on whether the following condition holds for the constant dictated by Theorem 1:
Theorem 2 (Restricted Strong Convexity).
4.2 Proof of Corollary 1
From the definition of in (7), we have . Let and , be the projection matrices onto the column and row spaces (, ) of , respectively. We have, , . Also, . Thus, , ; and hence,
If is sampled using uniform sampling and , then for a Rademacher sequence ,
This follows from Lemma of Negahban and Wainwright (2012), using .
Thus, for large enough , using in Theorem 2, for some we have:
Finally, to prove the corollary, we derive a bound on using the Ahlswede–Winter Matrix bound (Appendix A.3). Let ; and let , such that, .
From the equivalence of sub-Gaussian definitions in Lemma of Vershynin (2010), there exists a constant such that , . Since, has a single element, , we have, . Further,
where follows from Fubini’s Theorem, follows as is -sub–Gaussian, and follows from the uniform sampling model. Similarly, . Define
In Lemma 5, using , , and , we have:
If for large enough , then for any constant , using ,
4.3 Proof of Theorem 2
This proof uses symmetrization arguments and contractions (Ledoux and Talagrand (1991) Ch.4& 6). We observe that, , , s.t.
where holds as , and (Assumption 2).
We provide simulated experiments to corroborate our theoretical guarantees, focusing specifically on Corollary 1, where we consider the special case where the underlying parameter matrix is low–rank, but the underlying noise model for the matrix elements could be any of the general class of exponential family distributions. We look at three well known members of exponential family suitable for different data–types, namely Gaussian, Bernoulli, and binomial, which are popular choices for modeling continuous valued, binary, and count valued data, respectively.
5.1 Experimental Setup
We create low–rank ground truth parameter matrices, of sizes (for simplicity we considered square matrices, ); we set the rank to . The observation matrices, , are then sampled from the different members of exponential family distributions parameterized by . For each , we uniformly sample a subset entries of the observation matrix , and estimate from (6).
For each member of the exponential family of distributions considered, we can measure the performance of our –estimator in parameter space as , or in observation space using an appropriate error metric , where is the maximum likelihood estimate of the recovered distribution,