Inference and Uncertainty Quantification for
Noisy Matrix Completion00footnotetext: Author names are sorted alphabetically.
Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform statistical inference on the unknown matrix (e.g. constructing a valid and short confidence interval for an unseen entry).
This paper takes a step towards inference and uncertainty quantification for noisy matrix completion. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting de-biased estimators admit nearly precise non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals / regions for, say, the missing entries and the low-rank factors. Our inferential procedures do not rely on sample splitting, thus avoiding unnecessary loss of data efficiency. As a byproduct, we obtain a sharp characterization of the estimation accuracy of our de-biased estimators, which, to the best of our knowledge, are the first tractable algorithms that provably achieve full statistical efficiency (including the preconstant). The analysis herein is built upon the intimate link between convex and nonconvex optimization — an appealing feature recently discovered by [chen2019noisy].
Keywords: matrix completion, statistical inference, confidence intervals, uncertainty quantification, convex relaxation, nonconvex optimization
- 1 Introduction
- 2 Models and notation
3 Inferential procedures and main results
- 3.1 Background: convex and nonconvex estimators
- 3.2 Constructing de-biased estimators
- 3.3 Main results: distributional guarantees
- 3.4 Lower bounds and optimality for inference
- 3.5 Back to estimation: the de-biased estimator is optimal
- 3.6 Numerical experiments
- 3.7 A bit of intuition
- 3.8 Inference based on spectral estimates?
- 4 Prior art
- 5 Architecture of the proof
- 6 Discussion
- A Preliminaries
- B Summary of the proposed estimators
- C Proof of Lemma 3
- D Analysis of the low-rank factors
- E Analysis of the entries of the matrix
- F Proof of Corollary 1
- G Proof of Theorem 3
- H Proof of lower bounds
- I Proofs in Section A
- J Technical lemmas
1.1 Motivation: inference and uncertainty quantification?
Low-rank matrix completion is concerned with recovering a low-rank matrix, when only a small fraction of its entries are revealed to us [srebro2004learning, ExactMC09, KesMonSew2010]. Tackling this problem in large-scale applications is computationally challenging, due to the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfectness of data acquisition mechanisms, wherein the acquired samples are contaminated by a certain amount of noise.
Fortunately, if the entries of the unknown matrix are sufficiently de-localized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms — including both convex and nonconvex approaches — that allow to fill in unseen entries faithfully given only partial noisy samples [CanPla10, Negahban2012restricted, MR2906869, Se2010Noisy, chen2015fast, ma2017implicit, chen2019noisy]. Nevertheless, modern decision making would often require one step further. It not merely anticipates a faithful estimate, but also seeks to quantify the uncertainty or “confidence” of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how to use it to compute a short interval that is likely to contain a missing entry?
Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, the state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. Of necessity, it is generally very challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.
It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems in high-dimensional statistics, including Lasso [zhang2014confidence, van2014asymptotically, javanmard2014confidence], generalized linear models [van2014asymptotically, ning2017general, battey2018distributed], graphical models [jankova2015confidence, ren2015asymptotic, ma2017inter]), amongst others. Very little work, however, has looked into noisy matrix completion along this direction. While non-asymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, most, if not all, of the estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees, or come with an unspecified (but often enormous) pre-constant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial over-coverage.
1.2 A glimpse of our contributions
This paper takes a substantial step towards efficient inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the commonly used convex and nonconvex estimators. The resulting de-biased estimators admit nearly accurate non-asymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. While details of our main findings are postponed to Section 3, we would like to immediately single out a few important merits of the proposed inferential procedures and theory:
Our results enable two types of uncertainty assessment, namely, we can construct (i) confidence intervals for each entry — either observed or missing — of the unknown matrix; (ii) confidence regions for the low-rank factors of interest (modulo some unavoidable global ambiguity).
Despite the complicated statistical dependency, our procedure and theory do not rely on sample splitting, thus avoiding the unnecessary widening of confidence intervals / regions due to insufficient data usage.
The confidence intervals / regions constructed based on the proposed procedures are, in some sense, optimal.
We present a unified approach that accommodates both convex and nonconvex estimators seamlessly.
As a byproduct, we characterize the Euclidean estimation errors of the proposed de-biased estimators. Such error bounds are sharp and match an oracle lower bound precisely (including the pre-constant). To the best of our knowledge, this is the first theory that demonstrates that a computationally feasible algorithm can achieve the statistical limit including the pre-constant.
All of this is built upon the intimate link between convex and nonconvex estimators [chen2019noisy], as well as the recent advances in analyzing the stability of nonconvex optimization against random noise [ma2017implicit].
2 Models and notation
To cast the noisy matrix completion problem in concrete statistical settings, we adopt a model commonly studied in the literature [ExactMC09]. We also introduce some useful notation.
Denote by the unknown rank- matrix of interest,111We restrict our attention to squared matrices for simplicity of presentation. Most findings extend immediately to the more general rectangular case with different and . whose (compact) singular value decomposition (SVD) is given by . We set
where denotes the th largest singular value of a matrix . Further, we let and stand for the balanced low-rank factors of , which obey
What we observe is a random subset of noisy entries of ; more specifically, we observe
where is a subset of indices, and denotes independently generated noise at the location . From now on, we assume the random sampling model where each index is included in independently with probability (i.e. data are missing uniformly at random). We shall use to represent the orthogonal projection onto the subspace of matrices that vanish outside the index set .
Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition [ExactMC09, chen2015incoherence] on the singular subspaces of (i.e. and ):
where is termed the incoherence parameter and denotes the largest norm of all rows in . A small implies that the energy of and are reasonably spread out across all of their rows.
Here, (or ) means for some constant , means for some constant , means for some constants , and means . We write to indicate that for some small constant (much smaller than 1), and use to indicate that for some large constant (much larger than 1).
3 Inferential procedures and main results
The proposed inferential procedure lays its basis on two of the most popular estimation paradigms — convex relaxation and nonconvex optimization — designed for noisy matrix completion. Recognizing the complicated bias of these two highly nonlinear estimators, we shall first illustrate how to perform bias correction, followed by a distributional theory that establishes the near-Gaussianity and optimality of the proposed de-biased estimators.
3.1 Background: convex and nonconvex estimators
We first review in passing two tractable estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion. The readers familiar with this literature can proceed directly to Section 3.2.
Recall that the rank function is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate [fazel2002matrix, RecFazPar07]; for example, one can consider the following penalized least-squares convex program
or using our notation ,
Here, is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and is some regularization parameter. Under mild conditions, the solution to the convex program (3.1) provably attains near-optimal estimation accuracy (in an order-wise sense), provided that a proper regularization parameter is adopted [chen2019noisy].
|where determines the step size or the learning rate.|
It is recognized that the convex approach, which typically relies on solving a semidefinite program, is still computationally expensive and not scalable to large dimensions. This motivates an alternative route, which represents the matrix variable via two low-rank factors and attempts solving the following nonconvex program directly
Here, we choose a regularizer of the form primarily to mimic the nuclear norm (see [srebro2005rank, mazumder2010spectral]). A variety of optimization algorithms have been proposed to tackle the nonconvex program (3.4) or its variants [sun2016guaranteed, chen2015fast, ma2017implicit]; the readers are referred to [chi2018nonconvex] for a recent overview. As a prominent example, a two-stage algorithm — gradient descent following suitable initialization — provably enjoys fast convergence and order-wise optimal statistical guarantees for a wide range of scenarios [ma2017implicit, chen2019noisy, chen2019nonconvex]. The current paper focuses on this simple yet powerful algorithm, as documented in Algorithm 1 and detailed in Appendix A.1.
Intimate connections between convex and nonconvex estimates.
Denote by any minimizer of the convex program (3.1), and denote by the estimate returned by Algorithm 1 aimed at solving (3.4). As was recently shown in [chen2019noisy], when the regularization parameter is properly chosen, these two estimates obey (see (A.12) in Appendix A.2 for a precise statement)
Here, is the best rank- approximation of the convex estimate , where . In truth, the three matrices of interest in (3.5) are exceedingly close to, if not identical with, each other. This salient feature paves the way for a unified treatment of convex and nonconvex approaches: most inferential procedures and guarantees developed for the nonconvex estimate can be readily transferred to perform inference for the convex one, and vice versa.
3.2 Constructing de-biased estimators
|either or .|
|for the nonconvex case, we take and ; for the convex case, let and , which are the balanced low-rank factors of obeying and .|
|the proposed de-biased estimator as in (3.7).|
|the proposed de-shrunken estimator as in (3.8).|
We are now well equipped to describe how to construct new estimators based on the convex estimate and the nonconvex estimate , so as to enable statistical inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we shall abuse notation by using the shorthand for both convex and nonconvex estimates; see Table 1 and Appendix B for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.
Given that both (3.1) and (3.4) are regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from non-negligible bias. In order to enable desired statistical inference, it is natural to first correct the estimation bias.
A de-biased estimator for the matrix.
A natural de-biasing strategy that immediately comes to mind is the following simple linear transformation (recall the notation in Table 1):
where we identify with . Heuristically, if and are statistically independent, then serves as an unbiased estimator of , i.e. ; this arises since the noise has zero mean and under the uniform random sampling model, with the identity operator. Despite its (near) unbiasedness nature at a heuristic level, however, the matrix is typically full-rank, with non-negligible energy spread across its entire spectrum. This results in dramatically increased variability in the estimate, which is undesirable for inferential purposes.
To remedy this issue, we propose to further project onto the set of rank- matrices, leading to the following de-biased estimator
where , and can again be found in Table 1. This projection step effectively suppresses the variability outside the -dimensional principal subspace. As we shall see shortly, the proposed estimator (3.7) properly de-biases the provided estimate , while optimally controlling the extent of uncertainty.
The estimator (3.7) can be viewed as performing one iteration of singular value projection (SVP) [MekJaiDhi2009, ding2018leave] on the current estimate .
An equivalent form: a de-shrunken estimator for the low-rank factors.
It turns out that the de-biased estimator (3.7) admits another almost equivalent representation that offers further insights. Specifically, we consider the following de-shrunken estimator for the low-rank factors
where we recall the definition of and in Table 1. To develop some intuition regarding why this is called a de-shrunken estimator, let us look at a simple scenario where is the SVD of and , . It is then self-evident that
In words, and are obtained by de-shrinking the spectrum of and properly.
3.3 Main results: distributional guarantees
The proposed estimators admit tractable distributional characterizations in the large- regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around two types of inferential problems:
Each entry of the matrix : the entry can be either missing (i.e. predicting an unseen entry) or observed (i.e. de-noising an observed entry). For example, in the Netflix challenge, one would like to infer a user’s preference about any movie, given partially revealed ratings [ExactMC09]. Mathematically, this seeks to determine the distribution of
The low-rank factors : the low-rank factors often reveal critical information about the applications of interest (e.g. community memberships of each individual in the community detection problem [abbe2017entrywise], or angles between each object and a global reference point in the angular synchronization problem [singer2011angular]). Recognizing the global rotational ambiguity issue,222For any rotation matrix , we cannot distinguish from , if only pairwise measurements are available. we aim to pin down the distributions of and up to global rotational ambiguity. More precisely, we intend to characterize the distributions of
for the global rotation matrix that best “aligns” and , i.e.
Here and below, denotes the set of orthonormal matrices in .
Clearly, the above two inferential problems are tightly related: an accurate distributional characterization for the low-rank factors (3.11) often results in a distributional guarantee for the entries (3.10). As such, we shall begin by presenting our distributional characterizations of the low-rank factors. Here and throughout, represents the th standard basis vector in .
Theorem 1 (Distributional guarantees w.r.t. low-rank factors).
Suppose that the sample size and the noise obey
Then one has the following decomposition
In addition, the residual matrices satisfy, with probability at least , that
A more complete version can be found in Theorem 5.
Another interesting feature — which we shall make precise in the proof of this theorem — is that: for any given , the two random vectors and are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.
Theorem 1 is a non-asymptotic result. In words, Theorem 1 decomposes the estimation error (resp. ) into a Gaussian component (resp. ) and a residual term (resp. ). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to and . To see this, it is helpful to leverage the Gaussianity (3.15a) and compute that: for each , the th row of obeys
in other words, the typical size of the th row of is no smaller than the order of . In comparison, the size of each row of (see (3.16)) is much smaller than (and hence smaller than the size of the corresponding row of ) with high probability, provided that (3.13) is satisfied.
Equipped with the above master decompositions of the low-rank factors and Remark 4, we are ready to present a similar decomposition for the entry .
Theorem 2 (Distributional guarantees w.r.t. matrix entries).
For each , define the variance as
where (resp. ) denotes the th (resp. th) row of (resp. ). Suppose that
Then the matrix defined in Table 1 satisfies
where and the residual obeys with probability exceeding .
Remark 5 (The symmetric case).
In the symmetric case where the noise , the truth , and the sampling pattern are all symmetric (i.e. and ), the variance (cf. (3.17)) for the diagonal entries has a different formula; more specifically, it is straightforward to extend our theory to show that
This additional multiplicative factor of 2 arises since and are identical (and hence not independent) in this symmetric case. The variance formula for any () remains unchanged.
Several remarks are in order. To begin with, we develop some intuition regarding where the variance comes from. By virtue of Theorem 1, one has the following Gaussian approximation
Assuming that the first-order expansion is reasonably tight, one has
Here, (i) relies on (3.20) and the near independence between and ; (ii) uses the variance formula in Theorem 1; (iii) arises from the definitions of and (cf. (2.2)). This computation explains (heuristically) the variance formula .
Given that Theorem 2 reveals the tightness of Gaussian approximation under conditions (3.18), it in turn allows us to construct nearly accurate confidence intervals for each matrix entry . This is formally summarized in the following corollary, the proof of which is deferred to Appendix F. Here and throughout, we use to denote the interval .
Corollary 1 (Confidence intervals for the entries ).
In words, Corollary 1 tells us that for any fixed significance level , the interval
is a nearly accurate two-sided confidence interval of .
In addition, we remark that when (and hence ), the above Gaussian approximation is completely off. In this case, one can still leverage Theorem 1 to show that
where are independent and identically distributed according to . However, it is nontrivial to determine whether is vanishingly small or not based on the observed data, which makes it challenging to conduct efficient inference for entries with small (but a priori unknown) .
Last but not least, the careful readers might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with for example: our conditions read
The first condition matches the minimal sample complexity limit (up to some logarithmic factor), while the second one coincides with the regime (up to log factor) in which popular algorithms (like spectral methods or nonconvex algorithms) work better than a random guess [Se2010Noisy, chen2015fast, ma2017implicit]. The take-away message is this: once we are able to compute a reasonable estimate in an overall sense, then we can reinforce it to conduct entrywise inference in a statistically efficient fashion. The discussion of the dependency on and is deferred to Section 6.
3.4 Lower bounds and optimality for inference
It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the de-biased estimator is optimal in some sense; for instance, it nearly attains the minimum covariance among all unbiased estimators. To formalize this claim, we shall
Quantify the performance of two ideal estimators with the assistance of an oracle;
Demonstrate that the performance of our de-biased estimators is arbitrarily close to that of the ideal estimators.
In what follows, we denote by (resp. ) the th row of (resp. ).
An ideal estimator for ().
Suppose that there is an oracle informing us of , and that we observe the same set of data as in (2.3). Under such an idealistic setting and for any given , the following least-squares estimator achieves the minimum covariance among all unbiased estimators for the th row of (see e.g. [shao2003mathematical, Theorem 3.7])
In other words, for any unbiased estimator of (conditional on ), one has
where is precisely the Cramér-Rao lower bound (conditional on ) under this ideal setting. As it turns out, with high probability, this lower bound concentrates around , as stated in the following lemma. The proof is postponed to Appendix H.1.
Fix an arbitrarily small constant . Suppose that for some sufficiently large constant independent of . Then with probability at least , one has
An ideal estimator for ().
Suppose that there is another oracle informing us of and ; that is, everything about except and everything about except . In addition, we observe the same set of data as in (2.3), except that we do not get to see .333The exclusion of is merely for ease of presentation. One can consider the model where all with are observed with a slightly more complicated argument. Under this idealistic model, the Cramér-Rao lower bound [shao2003mathematical, Theorem 3.3] for estimating can be computed as
This means that any unbiased estimator of must have variance no smaller than . This quantity admits a much simpler lower bound as follows, whose proof can be found in Appendix H.2.
Fix an arbitrarily small constant . Suppose that for some sufficiently large constant independent of . Then with probability at least ,
where is defined in Theorem 2.
Similar to Lemma 1, Lemma 2 reveals that the variance of our de-biased estimator (cf. Theorem 2) — which certainly does not have access to the side information provided by the oracle — is arbitrarily close to the Cramér-Rao lower bound aided by an oracle.
All in all, the above lower bounds demonstrate that the degrees of uncertainty underlying our de-shrunken low-rank factors and de-biased matrix are, in some sense, statistically minimal.
3.5 Back to estimation: the de-biased estimator is optimal
While the emphasis of the current paper is on inference, we would nevertheless like to single out an important consequence that informs the estimation step. To be specific, the decompositions and distributional guarantees derived in Theorem 1 and Theorem 2 allow us to track the estimation accuracy of , as stated in the following theorem. The proof of this result is postponed to Appendix G.
Theorem 3 (Estimation accuracy of ).
In stark contrast to prior statistical estimation guarantees (e.g. [CanPla10, Negahban2012restricted, MR2906869, chen2019noisy]), Theorem 3 pins down the estimation error of the proposed de-biased estimator in a sharp manner (namely, even the pre-constant is fully determined). Encouragingly, there is a sense in which the proposed de-biased estimator achieves the best possible statistical estimation accuracy, as revealed by the following result.
Theorem 4 (An oracle lower bound on estimation errors).
Fix an arbitrarily small constant . Suppose that , and that . Then with probability exceeding , any unbiased estimator of obeys
Intuitively, the term reflects approximately the underlying degrees of freedom in the true subspace of interest (i.e. the tangent space of the rank- matrices at the truth ), whereas the factor captures the effect due to sub-sampling. This result has already been established in [CanPla10, Section III.B] (together with [ExactMC09, Theorem 4.1]). We thus omit the proof for conciseness. The key idea is to consider an oracle informing us of the true tangent space . ∎
The implication of the above two theorems is remarkable: the de-biasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing the estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the pre-constant. As far as we know, this is the first theory about a polynomial time algorithm that matches the statistical limit in terms of the pre-constant. This intriguing finding is further corroborated by numerical experiments; see Section 3.6 for details (in particular, Figure 3).
3.6 Numerical experiments
We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in Theorem 1 and Theorem 2. Note that our main results hold for the de-biased estimators built upon and . As we will formalize shortly in Section 5.1, these two de-biased estimators are extremely close to each other; see also Figure 4 for experimental evidence. Therefore, in order to save space, we use the de-biased estimator built upon the convex estimate throughout the experiments.
Fix the dimension and the regularization parameter throughout the experiments. We generate a rank- matrix , where are random orthonormal matrices and apply the proximal gradient method [parikh2014proximal] to solve the convex program (3.1).
We begin by checking the validity of Theorem 1. Suppose that one is interested in estimating the inner product between and (). In the Netflix challenge, this might correspond to the similarity between the th user and the th one. As a straightforward consequence of Theorem 1, the normalized estimation error
is extremely close to a standard Gaussian random variable. Here, similar to (3.22), we let
be the empirical estimate of the theoretically predicted variance . As a result, a confidence interval of would be . For each , we define to be the empirical coverage rate of over Monte Carlo simulations. Correspondingly, denote by (resp. ) the average (resp. the standard deviation) of over indices . Table 2 collects the simulation results for different values of . As can be seen, the reported empirical coverage rates are reasonably close to the nominal level . In addition, Figure 1 depicts the Q-Q (quantile-quantile) plots of and vs. the standard Gaussian random variable over 200 Monte Carlo simulations for , and . It is clearly seen that all of these are well approximated by a standard Gaussian random variable.