Noisy Matrix Completion: Understanding Statistical Guarantees for Convex Relaxation via Nonconvex Optimization

Yuxin Chen (Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA; Email: yuxin.chen@princeton.edu)    Yuejie Chi (Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Email: yuejiechi@cmu.edu)    Jianqing Fan, Cong Ma, Yuling Yan (Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Email: {jqfan, congm, yulingy}@princeton.edu)

Author names are sorted alphabetically.
Abstract

This paper studies noisy low-rank matrix completion: given partial and noisy entries of a large low-rank matrix, the goal is to estimate the underlying matrix faithfully and efficiently. Arguably one of the most popular paradigms to tackle this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the theoretical support of this approach is still far from optimal in the noisy setting, falling short of explaining its empirical success.

We make progress towards demystifying the practical efficacy of convex relaxation vis-à-vis random noise. When the rank and the condition number of the unknown matrix are bounded by a constant, we demonstrate that the convex programming approach achieves near-optimal estimation errors — in terms of the Euclidean loss, the entrywise loss, and the spectral norm loss — for a wide range of noise levels. All of this is enabled by bridging convex relaxation with the nonconvex Burer–Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise. More specifically, we show that an approximate critical point of the nonconvex formulation serves as an extremely tight approximation of the convex solution, thus allowing us to transfer the desired statistical guarantees of the nonconvex approach to its convex counterpart.

Keywords: matrix completion, minimaxity, stability, convex relaxation, nonconvex optimization, Burer–Monteiro approach.

1 Introduction

Suppose we are interested in a large low-rank data matrix, but only get to observe a highly incomplete subset of its entries. Can we hope to estimate the underlying data matrix in a reliable manner? This problem, often dubbed low-rank matrix completion, spans a diverse array of science and engineering applications (e.g. collaborative filtering [rennie2005fast], localization [so2007theory], system identification [liu2009interior], magnetic resonance parameter mapping [zhang2015accelerating], joint alignment [chen2016projected]), and has inspired a flurry of research activity in the past decade. In the statistics literature, matrix completion also falls under the category of factor models with a large amount of missing data, which finds numerous statistical applications such as controlling false discovery rates for dependent data [efron2007correlation, efron2010correlated, fan2012estimating, fan2019farmtest], factor-adjusted variable selection [kneip2011factor, fan2018factor], principal component regression [jolliffe1982note, bai2006confidence, paul2008preconditioning, fan2017sufficient], and large covariance matrix estimation [fan2013large, fan2019robust]. Recent years have witnessed the development of many tractable algorithms that come with statistical guarantees, with convex relaxation being one of the most popular paradigms [fazel2004rank, ExactMC09, CanTao10]. See [davenport2016overview, chen2018harnessing] for an overview of this topic.

This paper focuses on noisy low-rank matrix completion, assuming that the revealed entries are corrupted by a certain amount of noise. Setting the stage, consider the task of estimating a rank-$r$ data matrix $M^{\star} \in \mathbb{R}^{n\times n}$ (it is straightforward to rephrase our discussions for a general rectangular matrix of size $n_1 \times n_2$; the current paper sets $n_1 = n_2 = n$ throughout for simplicity of presentation), and suppose that this needs to be performed on the basis of a subset of noisy entries

$$M_{ij} = M^{\star}_{ij} + E_{ij}, \qquad (i,j) \in \Omega, \tag{1}$$

where $\Omega \subseteq \{1,\ldots,n\}\times\{1,\ldots,n\}$ denotes a set of indices, and $E_{ij}$ stands for the additive noise at the location $(i,j)$. As we shall elaborate shortly, solving noisy matrix completion via convex relaxation, while practically exhibiting excellent stability (in terms of the estimation errors against noise), is far less understood theoretically compared to the noiseless setting.

1.1 Convex relaxation: limitations of prior results

Naturally, one would search for a low-rank solution that best fits the observed entries. One choice is the regularized least-squares formulation given by

$$\underset{Z\in\mathbb{R}^{n\times n}}{\text{minimize}} \quad \frac{1}{2}\sum_{(i,j)\in\Omega}\big(Z_{ij}-M_{ij}\big)^{2} + \lambda\,\mathrm{rank}(Z), \tag{2}$$

where $\lambda>0$ is some regularization parameter. In words, this approach optimizes a certain trade-off between the goodness of fit (through the squared loss expressed in the first term of (2)) and the low-rank structure (through the rank function in the second term of (2)). Due to the computational intractability of rank minimization, we often resort to convex relaxation in order to obtain computationally feasible solutions. One notable example is the following convex program:

$$\underset{Z\in\mathbb{R}^{n\times n}}{\text{minimize}} \quad \frac{1}{2}\sum_{(i,j)\in\Omega}\big(Z_{ij}-M_{ij}\big)^{2} + \lambda\,\|Z\|_{*}, \tag{3}$$

where $\|Z\|_{*}$ denotes the nuclear norm (i.e. the sum of singular values) of $Z$ — a convex surrogate for the rank function. A significant portion of existing theory supports the use of this paradigm in the noiseless setting: when the noise $E_{ij}$ vanishes for all $(i,j)\in\Omega$, the solution to (3) is known to be faithful (i.e. the estimation error becomes zero) even under near-minimal sample complexity [ExactMC09, CanPla10, CanTao10, Gross2011recovering, recht2011simpler, chen2015incoherence].
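To make this paradigm concrete, here is a minimal NumPy sketch (ours, not from the paper) of one standard way to solve (3): proximal gradient descent, whose proximal step is singular-value soft-thresholding. The function names and the unit step size are illustrative choices.

```python
import numpy as np

def svt(Z, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def solve_nuclear_norm_ls(M_obs, mask, lam, eta=1.0, iters=500):
    """Proximal gradient descent on
        0.5 * || mask * (Z - M_obs) ||_F^2 + lam * ||Z||_* ,
    where `mask` is the 0/1 indicator of Omega and `eta` is the step size."""
    Z = np.zeros_like(M_obs)
    for _ in range(iters):
        grad = mask * (Z - M_obs)            # gradient of the smooth part
        Z = svt(Z - eta * grad, eta * lam)   # proximal step
    return Z
```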

By contrast, the performance of convex relaxation remains largely unclear when it comes to noisy settings (which are often more practically relevant). Candès and Plan [CanPla10] first studied the stability of an equivalent variant of (3) against noise (technically, [CanPla10] deals with the constrained version of (3), which is equivalent to the Lagrangian form (3) with a proper choice of the regularization parameter). The estimation error of the solution to (3) derived therein is significantly larger than the oracle lower bound, and hence does not explain well the effectiveness of (3) in practice. In fact, the numerical experiments reported in [CanPla10] already indicated that the performance of convex relaxation is far better than their theoretical bounds. This discrepancy between numerical performance and existing theoretical bounds gives rise to the following natural yet challenging questions: Where does the convex program (3) stand in terms of its stability vis-à-vis additive noise? Can we establish statistical performance guarantees that match its practical effectiveness?

We note in passing that several other convex relaxation formulations have been thoroughly analyzed for noisy matrix completion, most notably by Negahban and Wainwright [Negahban2012restricted] and by Koltchinskii et al. [MR2906869]. These works have significantly advanced our understanding of the power of convex relaxation. However, the estimators studied therein, particularly the one in [MR2906869], are quite different from the one (3) considered here; as a consequence, the analysis therein does not lead to improved statistical guarantees of (3). Moreover, the performance guarantees provided for these variants are also suboptimal when restricted to the class of “incoherent” or “de-localized” matrices, unless the magnitudes of the noise are fairly large. See Section 1.4 for more detailed discussions as well as numerical comparisons of these algorithms.

1.2 A detour: nonconvex optimization

While the focus of the current paper is convex relaxation, we take a moment to discuss a seemingly distinct algorithmic paradigm: nonconvex optimization, which turns out to be remarkably helpful in understanding convex relaxation. Inspired by the Burer–Monteiro approach [burer2003nonlinear], the nonconvex scheme starts by representing the rank-$r$ decision matrix (or parameters) as $Z = XY^{\top}$ via low-rank factors $X, Y \in \mathbb{R}^{n\times r}$, and proceeds by solving the following nonconvex (regularized) least-squares problem [KesMonSew2010]

$$\underset{X,Y\in\mathbb{R}^{n\times r}}{\text{minimize}} \quad \frac{1}{2}\sum_{(i,j)\in\Omega}\big[(XY^{\top})_{ij}-M_{ij}\big]^{2} + \mathcal{R}(X,Y). \tag{4}$$

Here, $\mathcal{R}(X,Y)$ denotes a certain regularization term that promotes additional structural properties.

To see its intimate connection with the convex program (3), we make the following observation: if the solution to (3) has rank $r$, then it must coincide with the solution to

$$\underset{X,Y\in\mathbb{R}^{n\times r}}{\text{minimize}} \quad \frac{1}{2}\sum_{(i,j)\in\Omega}\big[(XY^{\top})_{ij}-M_{ij}\big]^{2} + \frac{\lambda}{2}\|X\|_{\mathrm{F}}^{2} + \frac{\lambda}{2}\|Y\|_{\mathrm{F}}^{2}. \tag{5}$$

This can be easily verified by recognizing the elementary fact that

$$\|Z\|_{*} = \min_{X,Y\in\mathbb{R}^{n\times r}:\, XY^{\top}=Z} \ \frac{1}{2}\big(\|X\|_{\mathrm{F}}^{2} + \|Y\|_{\mathrm{F}}^{2}\big) \tag{6}$$

for any rank-$r$ matrix $Z$ [srebro2005rank, mazumder2010spectral]. Note, however, that it is very challenging to predict when the key assumption in establishing this connection — namely, the rank-$r$ assumption on the solution to the convex program (3) — can possibly hold (and in particular, whether it can hold under a minimal sample complexity requirement).
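As a quick numerical sanity check of the identity (6) (an illustrative snippet of ours, not taken from the paper), the balanced factorization built from the SVD attains the nuclear norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3
Z = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # a rank-r matrix

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
X = U[:, :r] * np.sqrt(s[:r])     # balanced factors X = U Sigma^{1/2}
Y = Vt[:r].T * np.sqrt(s[:r])     #                  Y = V Sigma^{1/2}

nuclear_norm = s.sum()
factor_cost = 0.5 * (np.linalg.norm(X, 'fro')**2 + np.linalg.norm(Y, 'fro')**2)
print(np.allclose(X @ Y.T, Z), np.isclose(nuclear_norm, factor_cost))   # True True
```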

Despite the nonconvexity of (4), simple first-order optimization methods, in conjunction with proper initialization, are often effective in solving (4). Examples include gradient descent on manifolds [KesMonSew2010, Se2010Noisy, wei2016guarantees], gradient descent [sun2016guaranteed, ma2017implicit], and projected gradient descent [chen2015fast, zheng2016convergence]. Apart from its practical efficiency, the nonconvex optimization approach is also appealing in theory. To begin with, algorithms tailored to (4) often enable exact recovery in the noiseless setting. Perhaps more importantly, for a wide range of noise settings, the nonconvex approach achieves appealing estimation accuracy [chen2015fast, ma2017implicit], which could be significantly better than the bounds derived for convex relaxation discussed earlier. See [chi2018nonconvex, chen2018harnessing] for a summary of recent results. Such intriguing statistical guarantees motivate us to take a closer inspection of the underlying connection between the two contrasting algorithmic frameworks.

1.3 Empirical evidence: convex and nonconvex solutions are often close

In order to obtain a better sense of the relationship between the convex and nonconvex approaches, we begin by comparing the estimates returned by the two approaches via numerical experiments. Fix the dimension $n$ and the rank $r$. We generate $M^{\star} = U^{\star}V^{\star\top}$, where $U^{\star}, V^{\star} \in \mathbb{R}^{n\times r}$ are random orthonormal matrices. Each entry of $M^{\star}$ is observed with probability $p$ independently, and is then corrupted by independent Gaussian noise $\mathcal{N}(0,\sigma^{2})$. Throughout the experiments, the regularization parameter $\lambda$ is set in accordance with our theory (cf. Theorem 1). The convex program (3) is solved by the proximal gradient method [parikh2014proximal], whereas we attempt to solve the nonconvex formulation (5) by gradient descent with spectral initialization (see [chi2018nonconvex] for details). Let $Z_{\mathsf{cvx}}$ (resp. $Z_{\mathsf{ncvx}}$) be the solution returned by the convex program (3) (resp. the nonconvex program (5)). Figure 1 displays the relative estimation errors of both methods as well as the relative distance between the two estimates. The results are averaged over 20 independent trials.

Figure 1: The relative estimation errors of $Z_{\mathsf{cvx}}$ (the estimate of the convex program (3)) and $Z_{\mathsf{ncvx}}$ (the estimate returned by the nonconvex approach tailored to (5)), together with the relative distance between them, vs. the standard deviation $\sigma$ of the noise. The results are reported for the setting described above and are averaged over 20 independent trials.
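For readers who wish to reproduce a small-scale version of this comparison, the following self-contained sketch follows the setup just described; all specific parameter values, step sizes, and iteration counts below are illustrative choices of ours rather than the ones used to produce Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, sigma = 200, 5, 0.2, 1e-2                 # illustrative values
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))  # random orthonormal factors
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
M_star = U0 @ V0.T
mask = rng.random((n, n)) < p                      # Omega
M_obs = mask * (M_star + sigma * rng.standard_normal((n, n)))
lam = 2.5 * sigma * np.sqrt(n * p)                 # lambda ~ sigma*sqrt(np); constant is illustrative

def svt(Z, tau):                                   # prox of tau * nuclear norm
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Convex route: proximal gradient descent on (3), unit step size.
Z_cvx = np.zeros((n, n))
for _ in range(500):
    Z_cvx = svt(Z_cvx - mask * (Z_cvx - M_obs), lam)

# Nonconvex route: gradient descent on the factored objective with spectral initialization.
U, s, Vt = np.linalg.svd(M_obs / p, full_matrices=False)
X, Y = U[:, :r] * np.sqrt(s[:r]), Vt[:r].T * np.sqrt(s[:r])
eta = 0.25 / s[0]
for _ in range(500):
    R = mask * (X @ Y.T - M_obs)
    gX, gY = (R @ Y + lam * X) / p, (R.T @ X + lam * Y) / p
    X, Y = X - eta * gX, Y - eta * gY
Z_ncvx = X @ Y.T

rel = lambda A: np.linalg.norm(A, 'fro') / np.linalg.norm(M_star, 'fro')
# Relative errors of each estimate, and the relative distance between them (cf. Figure 1).
print(rel(Z_cvx - M_star), rel(Z_ncvx - M_star), rel(Z_cvx - Z_ncvx))
```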

Interestingly, the distance between the convex and the nonconvex solutions seems extremely small; in comparison, the relative estimation errors of both $Z_{\mathsf{cvx}}$ and $Z_{\mathsf{ncvx}}$ are substantially larger. In other words, the estimate returned by the nonconvex approach serves as a remarkably accurate approximation of the convex solution. Given that the nonconvex approach is often guaranteed to achieve intriguing statistical guarantees vis-à-vis random noise [ma2017implicit], this suggests that the convex program is equally stable — a phenomenon that was not captured by prior theory [CanPla10]. Can we leverage existing theory for the nonconvex scheme to improve the statistical analysis of the convex relaxation approach?

Before continuing, we remark that the above numerical connection between convex relaxation (3) and nonconvex optimization (5) has already been observed multiple times in the prior literature [fazel2002matrix, srebro2005rank, RecFazPar07, mazumder2010spectral, Se2010Noisy]. Nevertheless, all prior observations on this connection were either completely empirical, or provided in a way that does not lead to improved statistical error bounds for the convex paradigm (3). In fact, the difficulty of rigorously justifying the above numerical observation has been noted in the literature; see e.g. [Se2010Noisy]. (The seminal work [Se2010Noisy] by Keshavan, Montanari and Oh stated that “In view of the identity (6) it might be possible to use the results in this paper to prove stronger guarantees on the nuclear norm minimization approach. Unfortunately this implication is not immediate [...] Trying to establish such an implication, and clarifying the relation between the two approaches is nevertheless a promising research direction.”)

1.4 Models and main results

The numerical experiments reported in Section 1.3 suggest an alternative route for analyzing convex relaxation for noisy matrix completion. If one can formally justify the proximity between the convex and the nonconvex solutions, then it is possible to propagate the appealing stability guarantees from the nonconvex scheme to the convex approach. As it turns out, this simple idea leads to significantly enhanced statistical guarantees for the convex program (3), which we formally present in this subsection.

1.4.1 Models and assumptions

Before proceeding, we introduce a few model assumptions that play a crucial role in our theory.

Assumption 1.
  • (Random sampling) Each index $(i,j)$ belongs to the index set $\Omega$ independently with probability $p$.

  • (Random noise) The noise matrix $E = [E_{ij}]$ is composed of i.i.d. zero-mean sub-Gaussian random variables with sub-Gaussian norm at most $\sigma$, i.e. $\|E_{ij}\|_{\psi_{2}} \leq \sigma$ (see [Vershynin2012, Definition 5.7]).

In addition, let $M^{\star} = U^{\star}\Sigma^{\star}V^{\star\top}$ be the singular value decomposition (SVD) of $M^{\star}$, where $U^{\star}, V^{\star} \in \mathbb{R}^{n\times r}$ consist of orthonormal columns and $\Sigma^{\star} = \mathrm{diag}(\sigma_{1}^{\star},\ldots,\sigma_{r}^{\star})$ is a diagonal matrix obeying $\sigma_{\max} := \sigma_{1}^{\star} \geq \cdots \geq \sigma_{r}^{\star} =: \sigma_{\min} > 0$. Denote by $\kappa := \sigma_{\max}/\sigma_{\min}$ the condition number of $M^{\star}$. We impose the following incoherence condition on $M^{\star}$, which is known to be crucial for reliable recovery of $M^{\star}$ [ExactMC09, chen2015incoherence].

Definition 1.

A rank-$r$ matrix $M^{\star}$ with SVD $M^{\star} = U^{\star}\Sigma^{\star}V^{\star\top}$ is said to be $\mu$-incoherent if

$$\|U^{\star}\|_{2,\infty} \leq \sqrt{\frac{\mu r}{n}} \qquad \text{and} \qquad \|V^{\star}\|_{2,\infty} \leq \sqrt{\frac{\mu r}{n}}.$$

Here, $\|A\|_{2,\infty}$ denotes the largest $\ell_{2}$ norm of all rows of a matrix $A$.
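To make the definition concrete, the following snippet (ours) computes the smallest incoherence parameter $\mu$ of a given rank-$r$ matrix under the definition above.

```python
import numpy as np

def incoherence(M, r):
    """Smallest mu such that the rank-r matrix M is mu-incoherent, i.e. the
    largest squared row norm of U (and of V) is at most mu * r / n."""
    n = M.shape[0]
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    mu_U = n / r * np.max(np.sum(U**2, axis=1))
    mu_V = n / r * np.max(np.sum(V**2, axis=1))
    return max(mu_U, mu_V)

# A random low-rank matrix is typically quite incoherent (small mu), whereas a
# matrix dominated by a few entries can have mu as large as n / r.
```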

Remark 1.

It is worth noting that several other conditions on the low-rank matrix have been proposed in the noisy setting. Examples include the spikiness condition [Negahban2012restricted] and the bounded norm condition [MR2906869]. However, these conditions alone are often unable to ensure identifiability of the true matrix even in the absence of noise.

1.4.2 Theoretical guarantees: when both the rank and the condition number are constants

With these in place, we are positioned to present our improved statistical guarantees for convex relaxation. For convenience of presentation, we shall begin with a simple yet fundamentally important class of settings in which the rank and the condition number are both fixed constants. As it turns out, this class of problems arises in a variety of engineering applications. For example, in a fundamental problem in cryo-EM called angular synchronization [singer2011angular], one needs to deal with well-conditioned rank-2 or rank-3 matrices; in a joint shape mapping problem that arises in computer graphics [huang2013consistent, chen2014matching], the matrix under consideration has low rank and a condition number equal to 1; and in structure from motion in computer vision [tomasi1992shape], one often seeks to estimate a matrix with very low rank and a small condition number. Encouragingly, our theory delivers near-optimal statistical guarantees for such practically important scenarios.

Theorem 1.

Let be rank- and -incoherent with a condition number , where the rank and the condition number satisfy . Suppose that Assumption 1 holds and take in (3) for some large enough constant . Assume the sample size obeys for some sufficiently large constant , and the noise satisfies for some sufficiently small constant . Then with probability exceeding :

  1. Any minimizer of (3) obeys

    (7a)
    (7b)
  2. Letting be the best rank- approximation of , we have

    (8)

    and the error bounds in (7) continue to hold if is replaced by .

Remark 2.

Here and throughout, $f(n) \lesssim g(n)$ or $f(n) = O(g(n))$ means $|f(n)| \leq C\,g(n)$ for some constant $C>0$ when $n$ is sufficiently large; $f(n) \gtrsim g(n)$ means $f(n) \geq C\,|g(n)|$ for some constant $C>0$ when $n$ is sufficiently large; and $f(n) \asymp g(n)$ if and only if $f(n) \lesssim g(n)$ and $f(n) \gtrsim g(n)$. In addition, $\|A\|_{\infty}$ denotes the entrywise $\ell_{\infty}$ norm of a matrix $A$, whereas $\|A\|$ is the spectral norm.

Remark 3.

The factor in (8) can be replaced by for an arbitrarily large fixed constant (e.g. ).

To explain the applicability of the above theorem, we first remark on the conditions required for this theorem to hold; for simplicity, we assume that .

  • Sample complexity. To begin with, the sample size needs to exceed the order of , which is information-theoretically optimal up to some logarithmic term [CanTao10].

  • Noise size. We then turn attention to the noise requirement, i.e. . Note that under the sample size condition , the size of the noise in each entry is allowed to be substantially larger than the maximum entry in the matrix. In other words, the signal-to-noise ratio w.r.t. each observed entry could be very small. According to prior literature (e.g. [Se2010Noisy, Theorem 1.1] and [ma2017implicit, Theorem 2]), such noise conditions are typically required for spectral methods to perform noticeably better than random guessing.

Further, Theorem 1 has several important implications about the power of convex relaxation. The discussions below again concentrate on the case where .

  • Near-optimal stability guarantees. Our results reveal that the Euclidean error of any convex optimizer of (3) obeys

    (9)

    implying that the performance of convex relaxation degrades gracefully as the signal-to-noise ratio decreases. This result matches the oracle lower bound derived in [CanPla10, Eq. (III.13)] and improves upon their statistical guarantee. Specifically, Candès and Plan [CanPla10] provided a stability guarantee in the presence of arbitrary bounded noise; when applied to the random noise model assumed here, their results yield a bound that could be substantially more conservative than our bound (9).

  • Nearly low-rank structure of the convex solution. In light of (8), the optimizer of the convex program (3) is almost, if not exactly, rank-. When the true rank is known a priori, it is not uncommon for practitioners to return the rank- approximation of . Our theorem formally justifies that there is no loss of statistical accuracy — measured in terms of either or — when performing the rank- projection operation.

  • Entrywise and spectral norm error control. Moving beyond the Euclidean loss, our theory uncovers that the estimation errors of the convex optimizer are fairly spread out across all entries, thus implying near-optimal entrywise error control. This is a stronger form of error bound, as optimal Euclidean estimation accuracy alone does not preclude the possibility of the estimation errors being spiky and localized. Furthermore, the spectral norm error of the convex optimizer is also well controlled. Figure 2 displays the relative estimation errors in both the entrywise $\ell_{\infty}$ norm and the spectral norm, under the same setting as in Figure 1. As can be seen, both forms of estimation error scale linearly with the noise level, corroborating our theory.

  • Implicit regularization. As a byproduct of the entrywise error control, this result indicates that the additional constraint suggested by [Negahban2012restricted] is automatically satisfied and is hence unnecessary. In other words, the convex approach implicitly controls the spikiness of its entries, without resorting to explicit regularization. This is also confirmed by the numerical experiments reported in Figure 3, where we see that the estimation error of (3) and that of the constrained version considered in [Negahban2012restricted] are nearly identical.

    Figure 2: The relative estimation error of $Z_{\mathsf{cvx}}$, measured in both the entrywise $\ell_{\infty}$ norm and the spectral norm, vs. the standard deviation $\sigma$ of the noise. The results are reported for the same setting as in Figure 1 and are averaged over 20 independent trials.
  • Statistical guarantees for fast iterative optimization methods. Various iterative algorithms have been developed to solve the nuclear norm regularized least-squares problem (3) up to an arbitrarily prescribed accuracy, examples including SVT (or proximal gradient methods) [cai2010singular], FPC [ma2011fixed], SOFT–IMPUTE [mazumder2010spectral], and FISTA [beck2009fast, toh2010accelerated], to name just a few. Our theory immediately provides statistical guarantees for these algorithms. As we shall make precise in Section 2, any point whose objective value (as defined in (3)) is sufficiently close to the optimum enjoys the same error bounds as in (7), with the exact minimizer replaced by this approximate one. In other words, when these convex optimization algorithms converge w.r.t. the objective value, they are guaranteed to return a statistically reliable estimate.

   
Figure 3: The relative estimation errors of several convex estimators, measured in two different norms (panels (a) and (b)), vs. the standard deviation $\sigma$ of the noise. Here the estimator can be either the modified convex estimator in [MR2906869], the constrained convex estimator in [Negahban2012restricted], or the vanilla convex estimator (3). The results are reported for the same setting as in Figure 1 and are averaged over 20 Monte-Carlo trials. For the modified convex estimator in [MR2906869], we choose the regularization parameter therein as suggested by their theory. For the constrained one in [Negahban2012restricted], both the regularization parameter and the spikiness constraint are set as recommended by [Negahban2012restricted]. As for (3), we set $\lambda$ as in Theorem 1.

To better understand our contributions, we take a moment to discuss two important but different convex programs studied in [Negahban2012restricted] and [MR2906869]. To begin with, under a spikiness assumption on the low-rank matrix, Negahban and Wainwright [Negahban2012restricted] proposed to enforce an extra entrywise constraint when solving (3), in order to explicitly control the spikiness of the estimate. When applied to our model with , their results read (up to some logarithmic factor)

(10)

where is the estimate returned by their modified convex algorithm. While this matches the optimal bound when , it becomes suboptimal when (under our models). Moreover, as we have already discussed, the extra spikiness constraint becomes unnecessary in the regime considered herein. This also means that our result complements existing theory about the convex program in [Negahban2012restricted] by demonstrating its minimaxity for an additional range of noise. Another work by Koltchinskii et al. [MR2906869] investigated a completely different convex algorithm, which is effectively a spectral method (namely, one round of soft singular value thresholding on a rescaled zero-padded data matrix). The algorithm is shown to be minimax optimal over the class of low-rank matrices with bounded norm (note that this is very different from the set of incoherent matrices studied here). When specialized to our model, their error bound is the same as (10) (modulo some log factor), which also becomes suboptimal as decreases. As can be seen from the numerical experiments in Figure 3, the estimation error of this thresholding-based spectral algorithm does not decrease as the noise shrinks, and its performance seems uniformly outperformed by that of convex relaxation (3) and the constrained estimator in [Negahban2012restricted]. In fact, this is part of our motivation to pursue an improved theoretical understanding of the formulation (3).

Finally, we make note of a connection between our result and prior theory developed for the noiseless case. Specifically, when the noise vanishes (i.e. $\sigma = 0$), one can take a diminishing sequence of regularization parameters; the resulting estimation errors associated with this sequence should then decrease to zero as $\lambda \to 0$ (which implies exact recovery in the limit). This parallels the connection between Lasso in sparse linear regression and basis pursuit in compressed sensing.

1.4.3 Theoretical guarantees: extensions to more general settings

So far we have presented results for the case where the true matrix has bounded rank and condition number. Our theory actually accommodates a significantly broader range of scenarios, in which the rank and the condition number are both allowed to grow with the dimension $n$.

Theorem 2.

Let be rank- and -incoherent with a condition number . Suppose Assumption 1 holds and take in (3) for some large enough constant . Assume the sample size obeys for some sufficiently large constant , and the noise satisfies for some sufficiently small constant . Then with probability exceeding ,

  1. Any minimizer of (3) obeys

    (11a)
    (11b)
    (11c)
  2. Letting be the best rank- approximation of , we have

    (12)

    and the error bounds in (11) continue to hold if is replaced by .

Remark 4 (The noise condition).

The incoherence condition (cf. Definition 1) guarantees that the largest entry of the matrix is no larger than . As a result, the noise condition stated in Theorem 2 covers all scenarios obeying

Therefore, the typical size of the noise is allowed to be much larger than the size of the largest entry of , provided that . In particular, when , this recovers the noise condition in Theorem 1.

Notably, the sample size condition for noisy matrix completion (i.e. ) is more stringent than that in the noiseless setting (i.e. ), and our statistical guarantees are likely suboptimal with respect to the dependency on and . This sub-optimality is mainly due to the analysis of nonconvex optimization, a key ingredient of our analysis of convex relaxation. In fact, the state-of-the-art nonconvex analysis [Se2010Noisy, chen2015fast, ma2017implicit] requires the sample size to be much larger than the optimal one (e.g. ) even in the noiseless setting. It would certainly be interesting, and in fact important, to see whether it is possible to develop a theory with optimal dependency on and . We leave this for future investigation.

Despite the above sub-optimality issue, implications similar to those of Theorem 1 hold for this general setting. To begin with, the nearly low-rank structure of the convex solution is preserved (cf. (12)). In addition, the estimation error of the convex estimate is spread out across entries (cf. (11b)), thus uncovering an implicit regularization phenomenon underlying convex relaxation (which implicitly enforces a spikiness constraint on the solution). Last but not least, the upper bounds (11) and (12) continue to hold for approximate minimizers of the convex program (3), thus yielding statistical guarantees for numerous iterative algorithms aimed at minimizing (3).

2 Strategy and novelty

In this section, we introduce the strategy for proving our main theorem, i.e. Theorem 2. Theorem 1 follows immediately. Informally, the main technical difficulty stems from the lack of closed-form expressions for the primal solution to (3), which in turn makes it difficult to construct a dual certificate. This is in stark contrast to the noiseless setting, where one clearly anticipates the ground truth to be the primal solution; in fact, this is precisely why the analysis for the noisy case is significantly more challenging. Our strategy, as we shall detail below, mainly entails invoking an iterative nonconvex algorithm to “approximate” such a primal solution.

Before continuing, we introduce a few more notations. Let $\mathcal{P}_{\Omega}$ represent the projection onto the subspace of matrices supported on $\Omega$, namely,

$$\big[\mathcal{P}_{\Omega}(A)\big]_{ij} = \begin{cases} A_{ij}, & \text{if } (i,j)\in\Omega, \\ 0, & \text{otherwise}, \end{cases} \tag{13}$$

for any matrix $A \in \mathbb{R}^{n\times n}$. For a rank-$r$ matrix $Z$ with singular value decomposition $Z = U\Sigma V^{\top}$ (where $U, V \in \mathbb{R}^{n\times r}$), denote by $T$ its tangent space, i.e.

$$T := \big\{ UA^{\top} + BV^{\top} \,:\, A, B \in \mathbb{R}^{n\times r} \big\}. \tag{14}$$

Correspondingly, let $\mathcal{P}_{T}$ be the orthogonal projection onto the subspace $T$, that is,

$$\mathcal{P}_{T}(A) = UU^{\top}A + AVV^{\top} - UU^{\top}AVV^{\top} \tag{15}$$

for any matrix $A \in \mathbb{R}^{n\times n}$. In addition, let $T^{\perp}$ and $\mathcal{P}_{T^{\perp}}$ denote the orthogonal complement of $T$ and the projection onto $T^{\perp}$, respectively. With regard to the ground truth $M^{\star} = U^{\star}\Sigma^{\star}V^{\star\top}$, we denote

$$X^{\star} := U^{\star}(\Sigma^{\star})^{1/2} \qquad \text{and} \qquad Y^{\star} := V^{\star}(\Sigma^{\star})^{1/2}. \tag{16}$$

The nonconvex problem (5) is equivalent to

$$\underset{X,Y\in\mathbb{R}^{n\times r}}{\text{minimize}} \quad f(X,Y) := \frac{1}{2p}\big\|\mathcal{P}_{\Omega}\big(XY^{\top}-M\big)\big\|_{\mathrm{F}}^{2} + \frac{\lambda}{2p}\|X\|_{\mathrm{F}}^{2} + \frac{\lambda}{2p}\|Y\|_{\mathrm{F}}^{2}, \tag{17}$$

where we have inserted an extra factor $1/p$ (compared to (5)) to simplify the presentation of the analysis later on.
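For later reference, the operators (13)–(15) can be implemented as follows (the function names are ours):

```python
import numpy as np

def P_Omega(A, mask):
    """Projection (13) onto matrices supported on Omega; `mask` is its 0/1 indicator."""
    return mask * A

def P_T(A, U, V):
    """Projection (15) onto the tangent space T in (14); U, V have orthonormal columns."""
    return U @ (U.T @ A) + (A @ V) @ V.T - U @ (U.T @ A @ V) @ V.T

def P_T_perp(A, U, V):
    """Projection onto the orthogonal complement of T."""
    return A - P_T(A, U, V)
```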

2.1 Exact duality

In order to analyze the convex program (3), it is natural to start with the first-order optimality condition. Specifically, suppose that $Z$ is a (primal) solution to (3) with SVD $Z = U\Sigma V^{\top}$. (Here and below, we use $Z$ — rather than $Z_{\mathsf{cvx}}$ — for notational simplicity, whenever it is clear from the context.) As before, let $T$ be the tangent space of $Z$, and let $T^{\perp}$ be the orthogonal complement of $T$. Then the first-order optimality condition for (3) reads: there exists a matrix $W$ (called a dual certificate) such that

$$\mathcal{P}_{\Omega}\big(Z - M\big) = -\lambda\big(UV^{\top} + W\big), \qquad W \in T^{\perp}, \tag{18a}$$
$$\|W\| \leq 1. \tag{18b}$$

This condition is not only necessary to certify the optimality of $Z$, but also “almost sufficient” in guaranteeing the uniqueness of the solution; see Appendix B for in-depth discussions.

The challenge then boils down to identifying such a primal-dual pair satisfying the optimality condition (18). For the noise-free case, the primal solution is clearly $M^{\star}$ if exact recovery is to be expected; the dual certificate can then be either constructed exactly as the least-squares solution to a certain underdetermined linear system [ExactMC09, CanTao10], or produced approximately via a clever golfing scheme pioneered by Gross [Gross2011recovering]. For the noisy case, however, it is often difficult to hypothesize on the primal solution, as it depends on the random noise in a complicated way. In fact, the lack of a suitable guess of the primal solution (and hence of its tangent space) was the major hurdle that prior works faced when carrying out the duality analysis.

2.2 A candidate primal solution via nonconvex optimization

Motivated by the numerical experiment in Section 1.3, we propose to examine whether the optimizer of the nonconvex problem (5) stays close to the solution to the convex program (3). Towards this, suppose that $X, Y \in \mathbb{R}^{n\times r}$ form a critical point of (5), and set $Z = XY^{\top}$. (Once again, we abuse the notation for simplicity, whenever it is clear from the context.) Then the first-order condition reads

$$\mathcal{P}_{\Omega}\big(XY^{\top}-M\big)\,Y + \lambda X = 0, \tag{19a}$$
$$\big[\mathcal{P}_{\Omega}\big(XY^{\top}-M\big)\big]^{\top} X + \lambda Y = 0. \tag{19b}$$

To develop some intuition about the connection between (18) and (19), let us take a look at the case with $r = 1$. Denote $x := X \in \mathbb{R}^{n}$ and $y := Y \in \mathbb{R}^{n}$, and assume that the two rank-1 factors are “balanced”, namely, $\|x\|_{2} = \|y\|_{2}$. It then follows from (19) that $Z = xy^{\top}$ has a singular value $\|x\|_{2}\|y\|_{2}$, whose corresponding left and right singular vectors are $u := x/\|x\|_{2}$ and $v := y/\|y\|_{2}$, respectively. In other words, one can express

$$\mathcal{P}_{\Omega}\big(xy^{\top}-M\big) = -\lambda\big(uv^{\top} + W\big), \tag{20}$$

where $W$ is orthogonal to the tangent space of $xy^{\top}$; this is precisely the condition (18a). It remains to argue that (18b) is valid as well. Towards this end, the first-order condition (19) alone is insufficient, as there might be non-global critical points (e.g. saddle points) that are unable to approximate the convex solution well. Fortunately, as long as the candidate $(x, y)$ is not far away from the ground truth, one can guarantee $\|W\| \leq 1$ as required in (18b).
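For concreteness, here is the short calculation behind (20) under the notation just introduced (an informal sketch of ours). Dividing (19a) by $\|y\|_{2}$, dividing (19b) by $\|x\|_{2}$, and using the balancedness $\|x\|_{2} = \|y\|_{2}$, gives

$$\mathcal{P}_{\Omega}\big(xy^{\top}-M\big)\,v = -\lambda u \qquad \text{and} \qquad \big[\mathcal{P}_{\Omega}\big(xy^{\top}-M\big)\big]^{\top} u = -\lambda v.$$

Defining $W := -\tfrac{1}{\lambda}\,\mathcal{P}_{\Omega}\big(xy^{\top}-M\big) - uv^{\top}$, one then checks that $Wv = u - u = 0$ and $W^{\top}u = v - v = 0$, so that $W$ lies in the orthogonal complement of the tangent space of $xy^{\top}$. This is exactly the decomposition (20).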

The above informal argument about the link between the convex and the nonconvex problems can be made rigorous. To begin with, we introduce the following conditions on the regularization parameter $\lambda$.

Condition 1 (Regularization parameter).

The regularization parameter satisfies

  (a) (Relative to noise)

  (b) (Relative to nonconvex solution) .

Remark 5.

Condition 1 requires that the regularization parameter should dominate a certain norm of the noise, as well as of the deviation of from its mean ; as will be seen shortly, the latter condition can be met when is sufficiently close to .

With the above condition in place, the following result demonstrates that a critical point of the nonconvex problem (5) readily translates to the unique minimizer of the convex program (3). This lemma is established in Appendix C.1.

Lemma 1 (Exact nonconvex vs. convex optimizers).

Suppose that $(X, Y)$ is a critical point of (5) satisfying , and that the sampling operator $\mathcal{P}_{\Omega}$ is injective when restricted to elements of the tangent space $T$ of $Z = XY^{\top}$, namely,

$$\mathcal{P}_{\Omega}(H) = 0 \quad \text{and} \quad H \in T \qquad \Longrightarrow \qquad H = 0. \tag{21}$$

Under Condition 1, the point $Z = XY^{\top}$ is the unique minimizer of (3).

In order to apply Lemma 1, one needs to locate a critical point of (5) that is sufficiently close to the truth, for which one natural candidate is the global optimizer of (5). The caveat, however, is the lack of theory characterizing directly the properties of the optimizer of (5). Instead, what is available in prior theory is the characterization of some iterative sequence (e.g. gradient descent iterates) aimed at solving (5). It is unclear from prior theory whether the iterative algorithm under study (e.g. gradient descent) converges to the global optimizer in the presence of noise. This leads to technical difficulty in justifying the proximity between the nonconvex optimizer and the convex solution via Lemma 1.

2.3 Approximate nonconvex optimizers

Fortunately, perfect knowledge of the nonconvex optimizer is not pivotal. Instead, an approximate solution to the nonconvex problem (5) (or equivalently (17)) suffices to serve as a reasonably tight approximation of the convex solution. More precisely, we desire two factors $(X, Y)$ that result in nearly zero (rather than exactly zero) gradients:

$$\nabla_{X} f(X, Y) \approx 0 \qquad \text{and} \qquad \nabla_{Y} f(X, Y) \approx 0,$$

where $f$ is the nonconvex objective function as defined in (17). This relaxes the condition discussed in Lemma 1 (which only applies to critical points of (5), as opposed to approximate critical points). As it turns out, such points can be found via gradient descent tailored to (5). The sufficiency of the near-zero gradient condition is made possible by slightly strengthening the injectivity assumption (21), which is stated below.

Condition 2 (Injectivity).

Let be the tangent space of . There is a quantity such that

(22)

The following lemma states quantitatively how an approximate nonconvex optimizer serves as an excellent proxy of the convex solution, which we establish in Appendix C.2.

Lemma 2 (Approximate nonconvex vs. convex optimizers).

Suppose that obeys

(23)

for some sufficiently small constant . Further assume that any singular value of and lies in . Then under Conditions 1 and 2, any minimizer of (3) satisfies

(24)
Remark 6.

In fact, this lemma continues to hold if is replaced by any obeying , where is the objective function defined in (3) and and are low-rank factors obeying conditions of Lemma 2. This is important in providing statistical guarantees for iterative methods like SVT [cai2010singular], FPC [ma2011fixed], SOFT–IMPUTE [mazumder2010spectral], FISTA [beck2009fast], etc. To be more specific, suppose that results in an approximate optimizer of (3), namely, for some sufficiently small . Then for any obeying , one has

(25)

As a result, as long as the above-mentioned algorithms converge in terms of the objective value, they must return a solution obeying (25), which is exceedingly close to if is small.

It is clear from Lemma 2 that, as the size of the gradient gets smaller, the nonconvex estimate becomes an increasingly tighter approximation of any convex optimizer of (3), which is consistent with Lemma 1. In contrast to Lemma 1, due to the lack of strong convexity, a nonconvex estimate with a near-zero gradient does not imply the uniqueness of the optimizer of the convex program (3); rather, it indicates that any minimizer of (3) lies within a sufficiently small neighborhood surrounding (cf. (24)).

2.4 Construction of an approximate nonconvex optimizer

So far, Lemmas 1-2 are both deterministic results, based on Conditions 1-2. As we will soon see, under Assumption 1, we can derive simpler conditions that — with high probability — guarantee Conditions 1-2. We start with Condition 1(a).

Lemma 3.

Suppose $np \geq C\log n$ for some sufficiently large constant $C>0$. Then with high probability, one has $\|\mathcal{P}_{\Omega}(E)\| \lesssim \sigma\sqrt{np}$. As a result, Condition 1(a) holds as long as $\lambda \geq C_{\lambda}\,\sigma\sqrt{np}$ for some sufficiently large constant $C_{\lambda} > 0$.

Proof.

This follows from [chen2015fast, Lemma 11] with a slight and straightforward modification to accommodate the asymmetric noise here. For brevity, we omit the proof. ∎
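As a quick numerical illustration (ours, not part of the formal argument), the spectral norm of the sampled noise is indeed on the order of $\sigma\sqrt{np}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 1000, 0.1, 1.0

E = sigma * rng.standard_normal((n, n))      # noise matrix
mask = rng.random((n, n)) < p                # Omega
spec_norm = np.linalg.norm(mask * E, ord=2)  # || P_Omega(E) ||
print(spec_norm, sigma * np.sqrt(n * p))     # same order of magnitude
```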

Turning attention to Condition 1(b) and Condition 2, we have the following lemma, the proof of which is deferred to Appendix C.3.

Lemma 4.

Under the assumptions of Theorem 2, with probability exceeding we have

hold simultaneously for all obeying

(26)

Here, denotes the tangent space of , and is some absolute constant.

This lemma is a uniform result, namely, the bounds hold simultaneously for all points obeying (26), irrespective of any statistical dependency between such a point and the data. As a consequence, to demonstrate the proximity between the convex and the nonconvex solutions (cf. (24)), it remains to identify a point with a vanishingly small gradient (cf. (23)) that is sufficiently close to the truth (cf. (26)).

As we already alluded to previously, a simple gradient descent algorithm aimed at solving the nonconvex problem (5) might help us produce an approximate nonconvex optimizer. This procedure is summarized in Algorithm 1. Our hope is this: when initialized at the ground truth and run for sufficiently many iterations, the GD trajectory produced by Algorithm 1 will contain at least one approximate stationary point of (5) with the desired properties (23) and (26). We shall note that Algorithm 1 is not practical, since it starts from the ground truth $(X^{\star}, Y^{\star})$; it is an auxiliary procedure introduced mainly to simplify the theoretical analysis. While we could certainly make it practical by adopting spectral initialization as in [ma2017implicit, chen2019nonconvex], doing so would require lengthier proofs without further improving our statistical guarantees.

  Initialization: $X_{0} = X^{\star}$; $Y_{0} = Y^{\star}$.
  Gradient updates: for $t = 0, 1, 2, \ldots$ do

$$X_{t+1} = X_{t} - \eta\,\nabla_{X} f(X_{t}, Y_{t}), \tag{27a}$$
$$Y_{t+1} = Y_{t} - \eta\,\nabla_{Y} f(X_{t}, Y_{t}). \tag{27b}$$

Here, $\eta > 0$ is the step size.
Algorithm 1 Construction of an approximate primal solution.
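A minimal sketch of Algorithm 1 (ours): gradient descent on (17) initialized at the ground-truth factors, keeping track of the iterate with the smallest gradient norm, which serves as the approximate critical point required by (23). The bookkeeping of the best iterate is a convenience of this sketch rather than part of the formal algorithm.

```python
import numpy as np

def gradients(X, Y, M_obs, mask, p, lam):
    """Gradients of the nonconvex objective f in (17)."""
    R = mask * (X @ Y.T - M_obs)
    return (R @ Y + lam * X) / p, (R.T @ X + lam * Y) / p

def algorithm1(X_star, Y_star, M_obs, mask, p, lam, eta, iters):
    """Run the updates (27) from (X*, Y*) and return the iterate along the
    trajectory with the smallest gradient norm."""
    X, Y = X_star.copy(), Y_star.copy()
    best, best_norm = (X.copy(), Y.copy()), np.inf
    for _ in range(iters):
        gX, gY = gradients(X, Y, M_obs, mask, p, lam)
        g_norm = np.sqrt(np.linalg.norm(gX, 'fro')**2 + np.linalg.norm(gY, 'fro')**2)
        if g_norm < best_norm:
            best, best_norm = (X.copy(), Y.copy()), g_norm
        X, Y = X - eta * gX, Y - eta * gY
    return best, best_norm
```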

2.5 Properties of the nonconvex iterates

In this subsection, we build upon the literature on nonconvex low-rank matrix completion to justify that the estimates returned by Algorithm 1 satisfy the requirement stated in (26). Our theory is largely established upon the leave-one-out strategy introduced by Ma et al. [ma2017implicit], which is an effective analysis technique for controlling the error of the estimates. This strategy has recently been extended by Chen et al. [chen2019nonconvex] to the more general rectangular case with an improved sample complexity bound.

Before continuing, we introduce some useful notation. Notice that the matrix product of $X$ and $Y^{\top}$ is invariant under global orthonormal transformations, namely, for any orthonormal matrix $R \in \mathcal{O}^{r\times r}$ one has $XY^{\top} = (XR)(YR)^{\top}$. Viewed in this light, we shall consider distance metrics modulo global rotation. In particular, the theory relies heavily on a specific global rotation matrix $H_{t}$ defined as follows

$$H_{t} := \underset{R \in \mathcal{O}^{r\times r}}{\arg\min} \ \big\|X_{t}R - X^{\star}\big\|_{\mathrm{F}}^{2} + \big\|Y_{t}R - Y^{\star}\big\|_{\mathrm{F}}^{2}, \tag{28}$$

where $\mathcal{O}^{r\times r}$ is the set of $r\times r$ orthonormal matrices.
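The minimization in (28) is an orthogonal Procrustes problem and admits a closed-form solution via an SVD; a short sketch under the notation above:

```python
import numpy as np

def align_rotation(X, Y, X_star, Y_star):
    """Solve (28): the orthonormal H minimizing ||X H - X*||_F^2 + ||Y H - Y*||_F^2."""
    C = X.T @ X_star + Y.T @ Y_star     # r x r cross-correlation of the stacked factors
    U, _, Vt = np.linalg.svd(C)
    return U @ Vt                        # closed-form Procrustes solution
```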

We are now ready to present the performance guarantees for Algorithm 1.

Lemma 5 (Quality of the nonconvex estimates).

Instate the notation and hypotheses of Theorem 2. With probability at least , the iterates of Algorithm 1 satisfy

(29a)
(29b)
(29c)
(30)

where are some absolute constants, provided that and that .

This lemma, which we establish in Appendix D, reveals that for a polynomially large number of iterations, all iterates of the gradient descent sequence — when initialized at the ground truth — remain fairly close to the true low-rank factors. This holds in terms of the estimation errors measured by the Frobenius norm, the spectral norm, and the norm. In particular, the proximity in terms of the norm error plays a pivotal role in implementing our analysis strategy (particularly Lemmas 2-4) described previously. In addition, this lemma (cf. (30)) guarantees the existence of a small-gradient point within this sequence , a somewhat straightforward property of GD tailored to smooth problems [nesterov2012make]. This in turn enables us to invoke Lemma 2.

As immediate consequences of Lemma 5, with high probability we have

(31a)
(31b)
(31c)

for all . The proof is deferred to Appendix D.12.

2.6 Proof of Theorem 2

Let $(X, Y)$ be an iterate of Algorithm 1 with a sufficiently small gradient (whose existence is guaranteed by Lemma 5), and take the rotation as in (28). It is straightforward to verify that the rotated point obeys (i) the small-gradient condition (23), and (ii) the proximity condition (26). We are now positioned to invoke Lemma 2: for any optimizer of (3), one has