Inference and Uncertainty Quantification for
Noisy Matrix Completion^{*}
^{*}Author names are sorted alphabetically.
Abstract
Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform statistical inference on the unknown matrix (e.g. constructing a valid and short confidence interval for an unseen entry).
This paper takes a step towards inference and uncertainty quantification for noisy matrix completion. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly precise non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals / regions for, say, the missing entries and the low-rank factors. Our inferential procedures do not rely on sample splitting, thus avoiding unnecessary loss of data efficiency. As a byproduct, we obtain a sharp characterization of the estimation accuracy of our debiased estimators, which, to the best of our knowledge, are the first tractable algorithms that provably achieve full statistical efficiency (including the pre-constant). The analysis herein is built upon the intimate link between convex and nonconvex optimization, an appealing feature recently discovered by [chen2019noisy].
Keywords: matrix completion, statistical inference, confidence intervals, uncertainty quantification, convex relaxation, nonconvex optimization
Contents
 1 Introduction
 2 Models and notation

3 Inferential procedures and main results
 3.1 Background: convex and nonconvex estimators
 3.2 Constructing debiased estimators
 3.3 Main results: distributional guarantees
 3.4 Lower bounds and optimality for inference
 3.5 Back to estimation: the debiased estimator is optimal
 3.6 Numerical experiments
 3.7 A bit of intuition
 3.8 Inference based on spectral estimates?
 4 Prior art
 5 Architecture of the proof
 6 Discussion
 A Preliminaries
 B Summary of the proposed estimators
 C Proof of Lemma 3
 D Analysis of the lowrank factors
 E Analysis of the entries of the matrix
 F Proof of Corollary 1
 G Proof of Theorem 3
 H Proof of lower bounds
 I Proofs in Section A
 J Technical lemmas
1 Introduction
1.1 Motivation: inference and uncertainty quantification?
Low-rank matrix completion is concerned with recovering a low-rank matrix when only a small fraction of its entries are revealed to us [srebro2004learning, ExactMC09, KesMonSew2010]. Tackling this problem in large-scale applications is computationally challenging, due to the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfection of data acquisition mechanisms, wherein the acquired samples are contaminated by a certain amount of noise.
Fortunately, if the entries of the unknown matrix are sufficiently delocalized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms (including both convex and nonconvex approaches) that make it possible to fill in unseen entries faithfully given only partial noisy samples [CanPla10, Negahban2012restricted, MR2906869, Se2010Noisy, chen2015fast, ma2017implicit, chen2019noisy]. Nevertheless, modern decision making often requires going one step further: one seeks not merely a faithful estimate, but also a reasonably accurate quantification of the uncertainty, or "confidence", of the provided estimate. For instance, given an estimate returned by the convex approach, how can one use it to compute a short interval that is likely to contain a missing entry?
Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. As a consequence, it is generally very challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.
It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for a variety of problems in high-dimensional statistics, including the Lasso [zhang2014confidence, van2014asymptotically, javanmard2014confidence], generalized linear models [van2014asymptotically, ning2017general, battey2018distributed], and graphical models [jankova2015confidence, ren2015asymptotic, ma2017inter], amongst others. Very little work, however, has looked into noisy matrix completion along this direction. While non-asymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, most, if not all, of the estimation error bounds are supplied only at an orderwise level. Such orderwise error bounds either lose a significant factor relative to the optimal guarantees, or come with an unspecified (but often enormous) pre-constant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial overcoverage.
1.2 A glimpse of our contributions
This paper takes a substantial step towards efficient inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the commonly used convex and nonconvex estimators. The resulting debiased estimators admit nearly accurate non-asymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. While details of our main findings are postponed to Section 3, we would like to immediately single out a few important merits of the proposed inferential procedures and theory:

Our results enable two types of uncertainty assessment: we can construct (i) confidence intervals for each entry (either observed or missing) of the unknown matrix; (ii) confidence regions for the low-rank factors of interest (modulo some unavoidable global ambiguity).

Despite the complicated statistical dependency, our procedure and theory do not rely on sample splitting, thus avoiding the unnecessary widening of confidence intervals / regions due to insufficient data usage.

The confidence intervals / regions constructed based on the proposed procedures are, in some sense, optimal.

We present a unified approach that accommodates both convex and nonconvex estimators seamlessly.

As a byproduct, we characterize the Euclidean estimation errors of the proposed debiased estimators. Such error bounds are sharp and match an oracle lower bound precisely (including the pre-constant). To the best of our knowledge, this is the first theory demonstrating that a computationally feasible algorithm can achieve the statistical limit including the pre-constant.
All of this is built upon the intimate link between convex and nonconvex estimators [chen2019noisy], as well as the recent advances in analyzing the stability of nonconvex optimization against random noise [ma2017implicit].
2 Models and notation
To cast the noisy matrix completion problem in concrete statistical settings, we adopt a model commonly studied in the literature [ExactMC09]. We also introduce some useful notation.
Ground truth.
Denote by the unknown rank matrix of interest (we restrict our attention to square matrices for simplicity of presentation; most findings extend immediately to the more general rectangular case), whose (compact) singular value decomposition (SVD) is given by . We set
(2.1) 
where denotes the th largest singular value of a matrix . Further, we let and stand for the balanced low-rank factors of , which obey
(2.2) 
Observation models.
What we observe is a random subset of noisy entries of ; more specifically, we observe
(2.3) 
where is a subset of indices, and denotes independently generated noise at the location . From now on, we assume the random sampling model where each index is included in independently with probability (i.e. data are missing uniformly at random). We shall use to represent the orthogonal projection onto the subspace of matrices that vanish outside the index set .
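To make the observation model concrete, it can be simulated directly: each index enters the sample set independently with probability p, and observed entries are corrupted by independent zero-mean noise. The following numpy sketch is illustrative only; the dimensions and symbols (n, r, p, sigma) are placeholder choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, sigma = 50, 3, 0.5, 0.1   # illustrative sizes, not from the paper

# Ground-truth low-rank matrix: product of two random factors
X_star = rng.standard_normal((n, r))
Y_star = rng.standard_normal((n, r))
M_star = X_star @ Y_star.T

# Each index (j, k) lands in Omega independently with probability p
mask = rng.random((n, n)) < p            # indicator of Omega
noise = sigma * rng.standard_normal((n, n))
M_obs = mask * (M_star + noise)          # P_Omega(M* + noise); zero outside Omega

# Note: E[p^{-1} P_Omega(A)] = A entrywise, which is the unbiasedness
# property used later in the debiasing step.
```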
Incoherence conditions.
Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition [ExactMC09, chen2015incoherence] on the singular subspaces of (i.e. and ):
(2.4) 
where is termed the incoherence parameter and denotes the largest norm of all rows in . A small implies that the energy of and is reasonably spread out across all of their rows.
Asymptotic notation.
Here, (or ) means for some constant , means for some constant , means for some constants , and means . We write to indicate that for some small constant (much smaller than 1), and use to indicate that for some large constant (much larger than 1).
3 Inferential procedures and main results
The proposed inferential procedure builds upon two of the most popular estimation paradigms designed for noisy matrix completion: convex relaxation and nonconvex optimization. Recognizing the complicated bias of these two highly nonlinear estimators, we shall first illustrate how to perform bias correction, followed by a distributional theory that establishes the near-Gaussianity and optimality of the proposed debiased estimators.
3.1 Background: convex and nonconvex estimators
We first briefly review two tractable estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for designing inferential procedures for noisy low-rank matrix completion. Readers familiar with this literature can proceed directly to Section 3.2.
Convex relaxation.
Recall that the rank function is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate [fazel2002matrix, RecFazPar07]; for example, one can consider the following penalized least-squares convex program
(3.1) 
or using our notation ,
(3.2) 
Here, is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and is some regularization parameter. Under mild conditions, the solution to the convex program (3.1) provably attains near-optimal estimation accuracy (in an orderwise sense), provided that a proper regularization parameter is adopted [chen2019noisy].
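Because the penalized least-squares program combines a smooth fidelity term with a nuclear-norm penalty, it can be solved by proximal gradient iterations, where the proximal operator of the nuclear norm is singular value soft-thresholding. The sketch below is a minimal illustration under that standard reading; the function names are ours.

```python
import numpy as np

def svt(Z, tau):
    """Singular value soft-thresholding: prox of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_grad_step(Z, M_obs, mask, lam, eta):
    """One proximal gradient step on
       min_Z  0.5 * ||P_Omega(Z - M_obs)||_F^2 + lam * ||Z||_*  ."""
    grad = mask * (Z - M_obs)            # gradient of the smooth part
    return svt(Z - eta * grad, eta * lam)
```

Iterating `prox_grad_step` from any starting point converges to a minimizer of the penalized program for a suitable step size.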
(3.3a)  
(3.3b)  
where determines the step size or the learning rate. 
Nonconvex optimization.
It is recognized that the convex approach, which typically relies on solving a semidefinite program, is still computationally expensive and does not scale to large dimensions. This motivates an alternative route, which represents the matrix variable via two low-rank factors and attempts to solve the following nonconvex program directly
(3.4) 
Here, we choose a regularizer of the form primarily to mimic the nuclear norm (see [srebro2005rank, mazumder2010spectral]). A variety of optimization algorithms have been proposed to tackle the nonconvex program (3.4) or its variants [sun2016guaranteed, chen2015fast, ma2017implicit]; readers are referred to [chi2018nonconvex] for a recent overview. As a prominent example, a two-stage algorithm (gradient descent following a suitable initialization) provably enjoys fast convergence and orderwise optimal statistical guarantees for a wide range of scenarios [ma2017implicit, chen2019noisy, chen2019nonconvex]. The current paper focuses on this simple yet powerful algorithm, as documented in Algorithm 1 and detailed in Appendix A.1.
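For concreteness, one gradient step on the factored objective can be sketched as follows. This is a hedged illustration with our own variable names, assuming the regularizer takes the common squared-Frobenius form that mimics the nuclear norm; it omits the spectral initialization that Algorithm 1 prescribes.

```python
import numpy as np

def grad_step(X, Y, M_obs, mask, lam, eta):
    """One gradient descent step on the factored objective (assumed form):
       f(X, Y) = 0.5 * ||P_Omega(X Y^T - M_obs)||_F^2
                 + (lam / 2) * (||X||_F^2 + ||Y||_F^2)."""
    R = mask * (X @ Y.T - M_obs)         # residual on observed entries only
    gX = R @ Y + lam * X                 # gradient w.r.t. X
    gY = R.T @ X + lam * Y               # gradient w.r.t. Y
    return X - eta * gX, Y - eta * gY
```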
Intimate connections between convex and nonconvex estimates.
Denote by any minimizer of the convex program (3.1), and denote by the estimate returned by Algorithm 1 aimed at solving (3.4). As was recently shown in [chen2019noisy], when the regularization parameter is properly chosen, these two estimates obey (see (A.12) in Appendix A.2 for a precise statement)
(3.5) 
Here, is the best rank approximation of the convex estimate , where . In truth, the three matrices of interest in (3.5) are exceedingly close to, if not identical with, each other. This salient feature paves the way for a unified treatment of convex and nonconvex approaches: most inferential procedures and guarantees developed for the nonconvex estimate can be readily transferred to perform inference for the convex one, and vice versa.
3.2 Constructing debiased estimators
Table 1: shorthand notation used throughout.
either or .
for the nonconvex case, we take and ; for the convex case, let and , which are the balanced low-rank factors of obeying and .
the proposed debiased estimator as in (3.7).
the proposed deshrunken estimator as in (3.8).
We are now well equipped to describe how to construct new estimators based on the convex estimate and the nonconvex estimate , so as to enable statistical inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we shall abuse notation by using the shorthand for both convex and nonconvex estimates; see Table 1 and Appendix B for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.
Given that both (3.1) and (3.4) are regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from non-negligible bias. In order to enable the desired statistical inference, it is natural to first correct the estimation bias.
A debiased estimator for the matrix.
A natural debiasing strategy that immediately comes to mind is the following simple linear transformation (recall the notation in Table 1):
(3.6) 
where we identify with . Heuristically, if and are statistically independent, then serves as an unbiased estimator of , i.e. ; this arises since the noise has zero mean and under the uniform random sampling model, with the identity operator. Despite being nearly unbiased at a heuristic level, however, the matrix is typically full-rank, with non-negligible energy spread across its entire spectrum. This results in dramatically increased variability in the estimate, which is undesirable for inferential purposes.
To remedy this issue, we propose to further project onto the set of rank matrices, leading to the following debiased estimator
(3.7) 
where , and can again be found in Table 1. This projection step effectively suppresses the variability outside the dimensional principal subspace. As we shall see shortly, the proposed estimator (3.7) properly debiases the provided estimate , while optimally controlling the extent of uncertainty.
Remark 1.
The estimator (3.7) can be viewed as performing one iteration of singular value projection (SVP) [MekJaiDhi2009, ding2018leave] on the current estimate .
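Putting the two steps together (the linear debiasing transformation (3.6), followed by the rank-r projection (3.7), i.e. one SVP iteration as noted in Remark 1), a minimal numpy sketch reads as follows. Since the body of (3.6) is not reproduced here, the debiasing step below, Z + p^{-1} P_Omega(M_obs - Z), is one natural reading of the surrounding discussion and should be treated as an assumption.

```python
import numpy as np

def debias_and_project(Z, M_obs, mask, p, r):
    """Debiased estimator sketch: add back the rescaled observed residual
       (assumed form of (3.6)), then project onto the set of rank-r
       matrices via a truncated SVD (the projection step (3.7))."""
    Z_unbiased = Z + (mask * (M_obs - Z)) / p   # linear debiasing step
    U, s, Vt = np.linalg.svd(Z_unbiased, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]          # best rank-r approximation
```

When every entry is observed (p = 1, full mask), the debiasing step replaces the estimate by the observations, and the output is simply the best rank-r approximation of the observed matrix.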
An equivalent form: a deshrunken estimator for the low-rank factors.
It turns out that the debiased estimator (3.7) admits another almost equivalent representation that offers further insights. Specifically, we consider the following deshrunken estimator for the lowrank factors
(3.8) 
where we recall the definitions of and in Table 1. To develop some intuition regarding why this is called a deshrunken estimator, let us look at a simple scenario where is the SVD of and , . It is then self-evident that
In words, and are obtained by deshrinking the spectrum of and properly.
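As an illustration of the deshrinking idea, suppose (as is heuristically the case for nuclear-norm penalization) that each of the top-r singular values of the estimate has been shrunk by roughly lam / p. A toy sketch that adds this amount back and re-balances the factors might look as follows; this is our own illustrative approximation, not the exact formula (3.8).

```python
import numpy as np

def deshrink(Z_hat, lam, p, r):
    """Illustrative deshrinking (assumption: the penalty has shrunk each of
       the top-r singular values of Z_hat by roughly lam / p). Returns
       balanced factors U * Sigma^{1/2}, V * Sigma^{1/2} with the
       shrinkage added back."""
    U, s, Vt = np.linalg.svd(Z_hat, full_matrices=False)
    s_d = s[:r] + lam / p                 # undo the (approximate) shrinkage
    Xd = U[:, :r] * np.sqrt(s_d)
    Yd = Vt[:r].T * np.sqrt(s_d)
    return Xd, Yd
```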
3.3 Main results: distributional guarantees
The proposed estimators admit tractable distributional characterizations in the large regime, which facilitate the construction of confidence regions for many quantities of interest. In particular, this paper centers around two types of inferential problems:

Each entry of the matrix : the entry can be either missing (i.e. predicting an unseen entry) or observed (i.e. denoising an observed entry). For example, in the Netflix challenge, one would like to infer a user’s preference about any movie, given partially revealed ratings [ExactMC09]. Mathematically, this seeks to determine the distribution of
(3.10) 
The low-rank factors : the low-rank factors often reveal critical information about the applications of interest (e.g. community memberships of each individual in the community detection problem [abbe2017entrywise], or angles between each object and a global reference point in the angular synchronization problem [singer2011angular]). Recognizing the global rotational ambiguity issue (for any rotation matrix , we cannot distinguish from if only pairwise measurements are available), we aim to pin down the distributions of and up to global rotational ambiguity. More precisely, we intend to characterize the distributions of
(3.11) for the global rotation matrix that best “aligns” and , i.e.
(3.12) Here and below, denotes the set of orthonormal matrices in .
Clearly, the above two inferential problems are tightly related: an accurate distributional characterization for the low-rank factors (3.11) often results in a distributional guarantee for the entries (3.10). As such, we shall begin by presenting our distributional characterizations of the low-rank factors. Here and throughout, represents the th standard basis vector in .
Theorem 1 (Distributional guarantees w.r.t. the low-rank factors).
Suppose that the sample size and the noise obey
(3.13) 
Then one has the following decomposition
(3.14a)  
(3.14b) 
with defined in (2.2), defined in Table 1, and defined in (3.12). Here, the rows of (resp. ) are independent and obey
(3.15a)  
(3.15b) 
In addition, the residual matrices satisfy, with probability at least , that
(3.16) 
Remark 3.
A more complete version can be found in Theorem 5.
Remark 4.
Another interesting feature — which we shall make precise in the proof of this theorem — is that: for any given , the two random vectors and are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.
Theorem 1 is a non-asymptotic result. In words, Theorem 1 decomposes the estimation error (resp. ) into a Gaussian component (resp. ) and a residual term (resp. ). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to and . To see this, it is helpful to leverage the Gaussianity (3.15a) and compute that: for each , the th row of obeys
in other words, the typical size of the th row of is no smaller than the order of . In comparison, the size of each row of (see (3.16)) is much smaller than (and hence smaller than the size of the corresponding row of ) with high probability, provided that (3.13) is satisfied.
Equipped with the above master decompositions of the low-rank factors and Remark 4, we are ready to present a similar decomposition for the entry .
Theorem 2 (Distributional guarantees w.r.t. matrix entries).
For each , define the variance as
(3.17) 
where (resp. ) denotes the th (resp. th) row of (resp. ). Suppose that
(3.18a)  
(3.18b) 
Then the matrix defined in Table 1 satisfies
(3.19) 
where and the residual obeys with probability exceeding .
Remark 5 (The symmetric case).
In the symmetric case where the noise , the truth , and the sampling pattern are all symmetric (i.e. and ), the variance (cf. (3.17)) for the diagonal entries has a different formula; more specifically, it is straightforward to extend our theory to show that
This additional multiplicative factor of 2 arises since and are identical (and hence not independent) in this symmetric case. The variance formula for any () remains unchanged.
Several remarks are in order. To begin with, we develop some intuition regarding where the variance comes from. By virtue of Theorem 1, one has the following Gaussian approximation
Assuming that the firstorder expansion is reasonably tight, one has
(3.20) 
According to Remark 4, and are nearly independent. It is thus straightforward to compute the variance of (3.20) as
Here, (i) relies on (3.20) and the near independence between and ; (ii) uses the variance formula in Theorem 1; (iii) arises from the definitions of and (cf. (2.2)). This computation explains (heuristically) the variance formula .
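Step (i) above rests on an elementary fact: for independent zero-mean random vectors $g$ and $h$ and fixed vectors $a$, $b$, the variance of the sum splits,

```latex
\operatorname{Var}\!\left(a^{\top} g + b^{\top} h\right)
  = a^{\top} \operatorname{Cov}(g)\, a \;+\; b^{\top} \operatorname{Cov}(h)\, b .
```

Applying this identity with the (nearly independent) Gaussian rows from Theorem 1 in the roles of $g$ and $h$ produces the two terms in the variance formula.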
Given that Theorem 2 reveals the tightness of Gaussian approximation under conditions (3.18), it in turn allows us to construct nearly accurate confidence intervals for each matrix entry . This is formally summarized in the following corollary, the proof of which is deferred to Appendix F. Here and throughout, we use to denote the interval .
Corollary 1 (Confidence intervals for the entries ).
In words, Corollary 1 tells us that for any fixed significance level , the interval
(3.23) 
is a nearly accurate two-sided confidence interval of .
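Concretely, given the debiased entry estimate and the variance from (3.17) (or its empirical surrogate), the interval (3.23) can be computed in a few lines; the function and argument names below are ours.

```python
from statistics import NormalDist

def entry_confidence_interval(m_jk, v_jk, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for an entry, centered at
       the debiased estimate m_jk with variance estimate v_jk:
       [m_jk - z * sqrt(v_jk), m_jk + z * sqrt(v_jk)],
       where z is the (1 - alpha/2) standard normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. about 1.96 for alpha = 0.05
    half = z * v_jk ** 0.5
    return m_jk - half, m_jk + half
```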
In addition, we remark that when (and hence ), the above Gaussian approximation is completely off. In this case, one can still leverage Theorem 1 to show that
(3.24) 
where are independent and identically distributed according to . However, it is nontrivial to determine whether is vanishingly small or not based on the observed data, which makes it challenging to conduct efficient inference for entries with small (but a priori unknown) .
Last but not least, careful readers might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with for example: our conditions read
(3.25) 
The first condition matches the minimal sample complexity limit (up to some logarithmic factor), while the second one coincides with the regime (up to log factor) in which popular algorithms (like spectral methods or nonconvex algorithms) work better than a random guess [Se2010Noisy, chen2015fast, ma2017implicit]. The takeaway message is this: once we are able to compute a reasonable estimate in an overall sense, we can reinforce it to conduct entrywise inference in a statistically efficient fashion. The discussion of the dependency on and is deferred to Section 6.
3.4 Lower bounds and optimality for inference
It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the debiased estimator is optimal in some sense; for instance, it nearly attains the minimum covariance among all unbiased estimators. To formalize this claim, we shall

Quantify the performance of two ideal estimators with the assistance of an oracle;

Demonstrate that the performance of our debiased estimators is arbitrarily close to that of the ideal estimators.
In what follows, we denote by (resp. ) the th row of (resp. ).
An ideal estimator for ().
Suppose that there is an oracle informing us of , and that we observe the same set of data as in (2.3). Under such an idealistic setting and for any given , the following least-squares estimator achieves the minimum covariance among all unbiased estimators for the th row of (see e.g. [shao2003mathematical, Theorem 3.7])
(3.26) 
In other words, for any unbiased estimator of (conditional on ), one has
(3.27) 
where is precisely the Cramér-Rao lower bound (conditional on ) under this ideal setting. As it turns out, with high probability, this lower bound concentrates around , as stated in the following lemma. The proof is postponed to Appendix H.1.
Lemma 1.
Fix an arbitrarily small constant . Suppose that for some sufficiently large constant independent of . Then with probability at least , one has
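For intuition, the oracle estimator (3.26) is just an ordinary least-squares regression of the observed entries of the j-th row on the corresponding rows of the revealed factor. A hypothetical sketch (with V standing in for the factor the oracle reveals):

```python
import numpy as np

def oracle_row_ls(j, M_obs, mask, V):
    """Oracle least squares for the j-th row of the left factor:
       given the true right factor V, regress the observed entries of
       row j on the corresponding rows of V."""
    obs = mask[j]                      # which entries of row j are observed
    A = V[obs]                         # design matrix: rows of V in Omega
    b = M_obs[j, obs]                  # observed (noisy) entries of row j
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol
```

In the noiseless, fully observed case this recovers the j-th row of the left factor exactly; with missing data and noise, its conditional covariance is the Cramér-Rao benchmark discussed above.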
An ideal estimator for ().
Suppose that there is another oracle informing us of and ; that is, everything about except and everything about except . In addition, we observe the same set of data as in (2.3), except that we do not get to see (the exclusion of is merely for ease of presentation; one can handle the model where all with are observed via a slightly more complicated argument). Under this idealistic model, the Cramér-Rao lower bound [shao2003mathematical, Theorem 3.3] for estimating can be computed as
(3.28) 
This means that any unbiased estimator of must have variance no smaller than . This quantity admits a much simpler lower bound as follows, whose proof can be found in Appendix H.2.
Lemma 2.
Fix an arbitrarily small constant . Suppose that for some sufficiently large constant independent of . Then with probability at least ,
where is defined in Theorem 2.
Similar to Lemma 1, Lemma 2 reveals that the variance of our debiased estimator (cf. Theorem 2), which certainly does not have access to the side information provided by the oracle, is arbitrarily close to the Cramér-Rao lower bound aided by an oracle.
All in all, the above lower bounds demonstrate that the degrees of uncertainty underlying our deshrunken low-rank factors and debiased matrix are, in some sense, statistically minimal.
3.5 Back to estimation: the debiased estimator is optimal
While the emphasis of the current paper is on inference, we would nevertheless like to single out an important consequence that informs the estimation step. To be specific, the decompositions and distributional guarantees derived in Theorem 1 and Theorem 2 allow us to track the estimation accuracy of , as stated in the following theorem. The proof of this result is postponed to Appendix G.
Theorem 3 (Estimation accuracy of ).
In stark contrast to prior statistical estimation guarantees (e.g. [CanPla10, Negahban2012restricted, MR2906869, chen2019noisy]), Theorem 3 pins down the estimation error of the proposed debiased estimator in a sharp manner (namely, even the pre-constant is fully determined). Encouragingly, there is a sense in which the proposed debiased estimator achieves the best possible statistical estimation accuracy, as revealed by the following result.
Theorem 4 (An oracle lower bound on estimation errors).
Fix an arbitrarily small constant . Suppose that , and that . Then with probability exceeding , any unbiased estimator of obeys
(3.30) 
Proof.
Intuitively, the term reflects approximately the underlying degrees of freedom in the true subspace of interest (i.e. the tangent space of the rank matrices at the truth ), whereas the factor captures the effect due to subsampling. This result has already been established in [CanPla10, Section III.B] (together with [ExactMC09, Theorem 4.1]). We thus omit the proof for conciseness. The key idea is to consider an oracle informing us of the true tangent space . ∎
The implication of the above two theorems is remarkable: the debiasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing the estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the pre-constant. As far as we know, this is the first theory about a polynomial-time algorithm that matches the statistical limit in terms of the pre-constant. This intriguing finding is further corroborated by numerical experiments; see Section 3.6 for details (in particular, Figure 3).
3.6 Numerical experiments
Table 2: Empirical coverage rates, reported as the average and the standard deviation over indices.
average coverage rate | standard deviation
0.9387 | 0.0197
0.9400 | 0.0193
0.9459 | 0.0161
0.9460 | 0.0162
0.9227 | 0.0244
0.9273 | 0.0226
0.9411 | 0.0173
0.9418 | 0.0171
We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in Theorem 1 and Theorem 2. Note that our main results hold for the debiased estimators built upon and . As we will formalize shortly in Section 5.1, these two debiased estimators are extremely close to each other; see also Figure 4 for experimental evidence. Therefore, in order to save space, we use the debiased estimator built upon the convex estimate throughout the experiments.
We fix the dimension and the regularization parameter throughout the experiments. We generate a rank matrix , where are random orthonormal matrices, and apply the proximal gradient method [parikh2014proximal] to solve the convex program (3.1).
We begin by checking the validity of Theorem 1. Suppose that one is interested in estimating the inner product between and (). In the Netflix challenge, this might correspond to the similarity between the th user and the th one. As a straightforward consequence of Theorem 1, the normalized estimation error
(3.31) 
is extremely close to a standard Gaussian random variable. Here, similar to (3.22), we let
(3.32) 
be the empirical estimate of the theoretically predicted variance . As a result, a confidence interval of would be . For each , we define to be the empirical coverage rate of over Monte Carlo simulations. Correspondingly, denote by (resp. ) the average (resp. the standard deviation) of over indices . Table 2 collects the simulation results for different values of . As can be seen, the reported empirical coverage rates are reasonably close to the nominal level . In addition, Figure 1 depicts the Q-Q (quantile-quantile) plots of and vs. the standard Gaussian random variable over 200 Monte Carlo simulations for , and . It is clearly seen that all of these are well approximated by a standard Gaussian random variable.
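The Monte Carlo coverage computation described above reduces to counting how often the normal interval around each trial's estimate contains the truth; a small sketch (names ours):

```python
import numpy as np
from statistics import NormalDist

def empirical_coverage(true_val, estimates, variances, alpha=0.05):
    """Fraction of Monte Carlo trials in which the (1 - alpha) normal
       interval around each estimate covers the true value."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = np.asarray(estimates, dtype=float)
    half = z * np.sqrt(np.asarray(variances, dtype=float))
    covered = np.abs(estimates - true_val) <= half
    return covered.mean()
```

If the Gaussian approximation is accurate and the variance estimates are consistent, this quantity should hover near the nominal level (e.g. 0.95), which is exactly what the coverage rates in Table 2 probe.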