Implicit regularization and solution uniqueness in overparameterized matrix sensing
Abstract
We consider whether algorithmic choices in overparameterized linear matrix factorization introduce implicit regularization. We focus on noiseless matrix sensing over rank-$r^\star$ positive semidefinite (PSD) matrices in $\mathbb{R}^{n \times n}$, with a sensing mechanism that satisfies restricted isometry properties (RIP). The algorithm we study is factored gradient descent, where we model the low-rankness and PSD constraints with the factorization $X = UU^\top$, for $U \in \mathbb{R}^{n \times r}$. Surprisingly, recent work argues that the choice of $r$ is not pivotal: even setting $r = n$ is sufficient for factored gradient descent to find the rank-$r^\star$ solution, which suggests that operating over the factors leads to an implicit regularization. In this contribution, we provide a different perspective on the problem of implicit regularization. We show that, under certain conditions, the PSD constraint by itself is sufficient to lead to a unique rank-$r^\star$ matrix recovery, without implicit or explicit low-rank regularization. I.e., under assumptions, the set of PSD matrices that are consistent with the observed data is a singleton, regardless of the algorithm used.
1 Introduction
In this work, we study how overparameterization relates to regularization (zhang2016understanding). By overparameterization, we mean that the number of parameters to estimate is larger than the available data, thus leading to an underdetermined system.^1 E.g., deep neural networks are usually designed overparameterized, with an ever-growing number of layers and, eventually, a larger number of parameters (telgarsky2016benefits). What is surprising though is the lack of overfitting in such networks: while there could be many different parameter realizations that lead to zero training error, the algorithms select models that also generalize well to unseen data, despite overparameterization (keskar2016large; poggio2017theory; soltanolkotabi2017theoretical; dinh2017sharp; cooper2018loss).

^1 It helps to picture overparameterization via a simple linear system of equations: when the number of parameters is larger than the number of equations, there is an infinite number of solutions, and which one we choose depends on an additional regularization bias.
The authors of (li2017algorithmic) show that the success of overparameterization can be theoretically fleshed out in the context of shallow, linear neural networks. They consider the case of low-rank and positive semidefinite (PSD) factorization in matrix sensing (recht2010guaranteed): given measurements $y = \mathcal{A}(X^\star)$ –where $X^\star \in \mathbb{R}^{n \times n}$ has rank $r^\star$ and is PSD, and the linear map $\mathcal{A}$ satisfies the restricted isometry property– they prove that a square and full-rank factorized gradient descent algorithm over $U \in \mathbb{R}^{n \times n}$, where $X = UU^\top$, converges to $X^\star$. I.e., whereas the algorithm has the expressive power to find any matrix $X$ that is consistent with the noiseless data (and, due to overparameterization, there are infinitely many such $X$'s), in contrast, it automatically converges to the minimum-rank solution. This argument was previously conjectured in (gunasekar2017implicit).
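The setting above can be sketched numerically. The following is a minimal sketch –not the authors' code– of factored gradient descent with a deliberately square factor $U \in \mathbb{R}^{n \times n}$; all sizes, the initialization scale, and the step size are illustrative choices.

```python
import numpy as np

# Noiseless matrix sensing with a *square* (overparameterized) factor U,
# in the spirit of the setting attributed to (li2017algorithmic).
rng = np.random.default_rng(0)
n, m = 8, 40                                   # m < n^2 = 64: underdetermined in X

v = rng.standard_normal(n)
v /= np.linalg.norm(v)
X_star = np.outer(v, v)                        # rank-1, PSD ground truth

# Symmetric Gaussian sensing matrices A_i; (A(X))_i = <A_i, X>.
As = []
for _ in range(m):
    G = rng.standard_normal((n, n))
    As.append((G + G.T) / 2)
y = np.array([np.sum(A * X_star) for A in As])

def loss_and_grad(U):
    res = np.array([np.sum(A * (U @ U.T)) for A in As]) - y
    # d/dU <A_i, U U^T> = 2 A_i U for symmetric A_i
    grad = (2.0 / m) * sum(r * A for r, A in zip(res, As)) @ U
    return 0.5 * np.mean(res ** 2), grad

U = 0.1 * np.eye(n)                            # small, full-rank initialization
eta = 0.01                                     # illustrative step size
loss_hist = []
for _ in range(3000):
    loss, grad = loss_and_grad(U)
    loss_hist.append(loss)
    U -= eta * grad

X_hat = U @ U.T
rel_err = np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star)
```

Even though $U$ has full expressive power, the iteration drives the residual toward zero, and in runs like the above the estimate lands close to the rank-1 ground truth, in line with the implicit-regularization reading of (li2017algorithmic).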
This could be seen as a first step towards understanding overparameterization in general nonlinear models, whose objectives are more involved and complex. Such network simplifications have been followed in other recent works in machine learning and theoretical computer science, such as in convolutional neural networks (du2017convolutional), and landscape characterization of generic objectives (baldi1989neural; boob2017theoretical; safran2017spurious).
In this work, we provide a different perspective on the interpretation of overparameterization in matrix sensing. We show that, in the noiseless case, the PSD constraint by itself could be sufficient to lead to a unique matrix recovery from observations, without the use of implicit or explicit lowrankness. In other words, the set of PSD matrices that satisfy the measurements is a singleton, irrespective of the algorithm used.
Notation. Vectors are denoted with plain lowercase letters; matrices are denoted with capital letters; and mappings, from one Euclidean space to another, are denoted with capital calligraphic letters. Given $x \in \mathbb{R}^n$, its $\ell_2$ norm is defined as $\|x\|_2 = \left(\sum_{i=1}^n x_i^2\right)^{1/2}$, where $x_i$ denotes its $i$th entry; similarly, we define the $\ell_1$ norm as $\|x\|_1 = \sum_{i=1}^n |x_i|$. The $\ell_0$ pseudonorm, $\|x\|_0$, is defined as the number of nonzero entries in $x$. Given $x \in \mathbb{R}^n$, $\mathrm{diag}(x)$ is the diagonal matrix with diagonal entries the vector $x$. For two matrices $A, B$ with appropriate dimensions, we define their inner product as $\langle A, B \rangle = \mathrm{Tr}(A^\top B)$, where $\mathrm{Tr}(\cdot)$ is the trace operator. Given $X \in \mathbb{R}^{n \times n}$, the nuclear norm is defined as $\|X\|_* = \sum_i \sigma_i(X)$, where $\sigma_i(X)$ is the $i$th singular value. The spectral norm is denoted as $\|X\|_2 = \sigma_1(X)$, where $\sigma_1(X)$ is the maximum singular value.
1.1 Related work
Implicit regularization in matrix sensing. This area was initiated by the conjecture in (gunasekar2017implicit): the authors suggest that nonconvex gradient descent on a full-dimensional factorization $X = UU^\top$, where $U \in \mathbb{R}^{n \times n}$, converges to the minimum nuclear norm solution. (li2017algorithmic) sheds light on this conjecture: they theoretically explain the regularization inserted by algorithms, even beyond learning matrix factorization models, such as one-hidden-layer neural nets with quadratic activation; see also (du2018power).
Implicit regularization beyond matrix sensing. For the general linear regression setting, (wilson2017marginal) shows that, under specific assumptions, adaptive gradient methods, like AdaGrad and Adam, converge to a different solution than simple (stochastic) gradient descent (SGD); see also (gunasekar2018characterizing). SGD has been shown to converge to the so-called minimum-norm solution; see also (soudry2017implicit) for the case of logistic regression. This behavior is also demonstrated using DNNs in (wilson2017marginal), where simple gradient descent generalizes as well as the adaptive methods.
No spurious local minima. There is a recent line of work, focusing on nonconvex problems, that states conditions under which problem formulations have no spurious local minima, once we transform the problem from its convex formulation to a nonconvex one. Characteristic examples include that factored gradient descent does not introduce spurious local minima in matrix completion (ge2016matrix) and matrix sensing (bhojanapalli2016global; park2017non), and that all local minima are global in some tensor decompositions (ge2015escaping) and in dictionary learning (sun2016complete); see (ge2017no; sun2015nonconvex) for a complete overview of these results. Further, there is literature that characterizes the landscape of factorization problems using the strict saddle property, to indicate that we can easily escape any saddle point (ge2015escaping; zhu2018global). In this work, we take a different path: we show that, by construction, the set of solutions that satisfy our observations is a singleton, demystifying the behavior of factored gradient descent in overparameterized matrix sensing.
2 Nonnegativity and sparsity: the vector analog of PSD and low-rankness
We briefly describe the work of (bruckstein2008uniqueness), as we borrow ideas from that paper. Consider the problem of finding a nonnegative, sparse solution to an overparameterized linear system of equations: $Ax = b$. Here, the sensing matrix is $A \in \mathbb{R}^{m \times n}$, where $m < n$; the unknown $x^\star \in \mathbb{R}^n$ satisfies $x^\star \geq 0$ (entrywise) and is sufficiently sparse, $\|x^\star\|_0 \leq k$; and the measurements are $b = Ax^\star$.
This scenario suggests the following optimization problem as a solution:
$\min_{x \in \mathbb{R}^n} \ f(x) \quad \text{subject to} \quad Ax = b, \ x \geq 0. \qquad (1)$
Here, $f(\cdot)$ is a function metric that measures the quality of the candidate solutions. Examples are $f(x) = \|x\|_2$ (i.e., the minimum $\ell_2$-norm solution that satisfies the constraints), $f(x) = \|x\|_1$ (i.e., the solution with small $\ell_1$ norm, which promotes sparsity), and $f(x) = \|x\|_0$ (i.e., the solution with the smallest number of nonzeros). These tasks are encountered in statistics, computer vision, and signal processing applications (zass2007nonnegative; shashua2005non; hazan2005sparse), and they are popular in the compressed sensing literature (donoho2006compressed; foucart2013mathematical), where $x^\star$ is assumed sparse.
Let us disregard for the moment the positivity constraints on $x$. By definition, an overparameterized linear inverse problem has an infinite number of solutions. Unless we use the information that $x^\star$ is sparse, its reconstruction using only $A$ and $b$ is an ill-posed problem, and there is no hope of finding the true vector without ambiguity.
Therefore, to reconstruct $x^\star$ in an overparameterized setting, prior knowledge should be exploited by the optimization solver. Compressed sensing is an example where additional constraints restrict the feasible set to a singleton: under proper assumptions on the sensing matrix $A$ –such as the restricted isometry property (candes2008restricted), or the coherence property (bruckstein2008uniqueness)– and assuming a sufficient number of measurements $m$, one can show that the feasible set contains only one element, for sufficiently small sparsity $k$.
Reinserting the positivity constraints in our discussion, (bruckstein2008uniqueness) show that, when a sufficiently sparse solution $x^\star$ generates $b$, and assuming the row span of $A$ intersects the positive orthant, then the nonnegativity constraint by itself is sufficient to identify the sparse $x^\star$, and reduces the set of feasible solutions to a singleton. In other words, the inclusion of a sparsity-inducing $f(\cdot)$ in (1) is not needed, even if we know a priori that $x^\star$ is sparse; nonnegativity is sufficient to find a unique solution to the feasibility problem:
$\text{find} \ x \quad \text{such that} \quad Ax = b, \ x \geq 0,$
that matches $x^\star$. This way, we can still use convex optimization solvers –linear programming in this particular case– and avoid hard nonconvex problem instances.
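The vector phenomenon is easy to demonstrate. Below is an illustrative sketch, under our own assumptions (an all-ones row to make the row span meet the positive orthant, illustrative sizes and iteration count): projected gradient descent on the unregularized nonnegative least-squares problem recovers the sparse $x^\star$.

```python
import numpy as np

# Nonnegative feasibility as in (bruckstein2008uniqueness), solved here with
# projected gradient descent; no sparsity-promoting term appears anywhere.
rng = np.random.default_rng(1)
m, n, k = 18, 30, 3

x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.uniform(0.5, 1.5, size=k)

A = rng.standard_normal((m, n))
A[0] = 1.0                                   # all-ones row: row span meets positive orthant
b = A @ x_true

eta = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L for the quadratic loss
x = np.ones(n)                               # dense, strictly positive start
for _ in range(20000):
    x = np.maximum(x - eta * A.T @ (A @ x - b), 0.0)   # gradient step + projection

rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
```

The solver only enforces $Ax = b$ and $x \geq 0$, yet in runs like the above the iterate collapses onto the sparse generator, consistent with the singleton-feasible-set reading.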
3 The matrix sensing problem for PSD matrices
Let us now describe the matrix sensing problem, draw the connections with the vector case, and study the overparameterization $X = UU^\top$, for $U \in \mathbb{R}^{n \times r}$ with $r \geq r^\star$. Following (li2017algorithmic), we consider the PSD-constrained case, where the optimum solution is both low-rank and PSD.
A rough description is as a problem of a linear system of equations over matrices. It is derived by the generative model $y = \mathcal{A}(X^\star)$, where $X^\star \in \mathbb{R}^{n \times n}$ is the low-rank, PSD ground truth. Let the true rank of $X^\star$ be $r^\star$. The mapping $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ is such that the $i$th entry of $\mathcal{A}(X)$ is given by $(\mathcal{A}(X))_i = \langle A_i, X \rangle$, for independently drawn symmetric measurement matrices $A_i$.
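A small sanity sketch of this sensing model (illustrative shapes, not from the references): the map $\mathcal{A}$, its adjoint $\mathcal{A}^*(z) = \sum_i z_i A_i$, and a check of the adjoint identity $\langle \mathcal{A}(X), z \rangle = \langle X, \mathcal{A}^*(z) \rangle$.

```python
import numpy as np

# Linear sensing map over matrices, with symmetric measurement matrices A_i.
rng = np.random.default_rng(2)
n, m = 6, 10

As = []
for _ in range(m):
    G = rng.standard_normal((n, n))
    As.append((G + G.T) / 2)                 # symmetric A_i

def A_op(X):
    # (A(X))_i = <A_i, X> = Tr(A_i^T X)
    return np.array([np.sum(Ai * X) for Ai in As])

def A_adj(z):
    # A*(z) = sum_i z_i A_i
    return sum(zi * Ai for zi, Ai in zip(z, As))

X = rng.standard_normal((n, n))
z = rng.standard_normal(m)
lhs = A_op(X) @ z                            # <A(X), z>
rhs = np.sum(X * A_adj(z))                   # <X, A*(z)>
```

The adjoint pair $(\mathcal{A}, \mathcal{A}^*)$ is exactly what the factored gradient recursion later in this section relies on.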
We study the PSD-constrained formulation, where we aim to find $X^\star$ via:
$\min_{X \in \mathbb{R}^{n \times n}} \ f(X) \quad \text{subject to} \quad \mathcal{A}(X) = y, \ X \succeq 0. \qquad (2)$
$f(\cdot)$ again represents a function metric that promotes low-rankness; standard choices include the nuclear norm $\|X\|_*$ (which imposes "sparsity" on the set of singular values and hence low-rankness), and the nonconvex $\mathrm{rank}(X)$ metric.
Practical methods for this scenario include the PSD-constrained basis pursuit algorithm for matrices (chen2001atomic; goldstein2010high), which solves (2) for $f(X) = \|X\|_*$ using interior-point methods (liu2009interior); and projected gradient descent algorithms, which solve an equivalent form of (2) for wisely chosen $r$:
$\min_{X \in \mathbb{R}^{n \times n}} \ \|\mathcal{A}(X) - y\|_2^2 \quad \text{subject to} \quad X \succeq 0, \ \mathrm{rank}(X) \leq r, \qquad (3)$
for $r \geq r^\star$ (kyrillidis2011recipes; kyrillidis2014matrix; khanna2017iht). In the latter, the objective $f(\cdot)$ appears in the constraint set, as $\mathrm{rank}(X) \leq r$ or $\|X\|_* \leq t$.
Recently, we have witnessed a series of works (zhao2015nonconvex; park2016provable; park2016non; sun2016guaranteed; bhojanapalli2016global; park2016finding; kyrillidis2017provable; ge2017no; hsieh2017non) that operate directly on the factorization $X = UU^\top$ and do not include any PSD or rank constraints. This is based on the observation that, for any rank-$r$ and PSD $X$, the factorization $X = UU^\top$, for $U \in \mathbb{R}^{n \times r}$, guarantees that $X$ is at the same time PSD and at most rank-$r$. This reparameterizes (2) as:
$\text{find} \ U \in \mathbb{R}^{n \times r} \quad \text{subject to} \quad \mathcal{A}(UU^\top) = y,$
and (3) as: $\min_{U \in \mathbb{R}^{n \times r}} \ \|\mathcal{A}(UU^\top) - y\|_2^2.$
Observe that in both cases there are no metrics that explicitly favor low-rankness, nor any PSD constraints; these are implicitly encoded by the factorization $X = UU^\top$. Algorithmic solutions for the above criteria include factored gradient descent (bhojanapalli2016dropping; park2016finding), which obeys the following recursion:
$U_{t+1} = U_t - \eta \, \mathcal{A}^*\big(\mathcal{A}(U_t U_t^\top) - y\big) \, U_t, \qquad (4)$
where $\mathcal{A}^*$ denotes the adjoint mapping and $\eta > 0$ is a step size.
Current theory (bhojanapalli2016dropping; park2016finding) assumes that $r^\star$ is known a priori, in order to set the dimensions of the factor $U \in \mathbb{R}^{n \times r^\star}$ accordingly. The only work that deviates from this perspective is the recent work in (li2017algorithmic), where the authors prove that even a square $U \in \mathbb{R}^{n \times n}$ in (4) still converges to the low-rank ground truth $X^\star$, with proper initialization and step size selection. The result relies on restricted isometry assumptions on $\mathcal{A}$. In a manner, this suggests that, operating on the factorized space, the algorithm implicitly favors low-rank solutions, even if there is expressive power to select a full-rank $X$ as a solution. The following subsection provides a different perspective on the matter: the implicit PSD constraint in $UU^\top$ could be sufficient to reduce the feasibility set to a singleton, no matter what algorithm is used for the solution.
When positivity constraints are sufficient for unique recovery under RIP. We note that the Restricted Isometry Property (RIP) assumption is made in (zhao2015nonconvex; park2016provable; park2016non; sun2016guaranteed; bhojanapalli2016global; park2016finding; kyrillidis2017provable; ge2017no; hsieh2017non; li2017algorithmic). There are various versions of RIP, with the most well-known being the RIP-$\ell_2/\ell_2$ (chen2015exact). Due to the construction of our sensing matrices as outer products of Gaussians –and thus the connection with rank-one measurements in matrix sensing– for our theory we will use a variant of RIP, the RIP-$\ell_2/\ell_1$ (chen2015exact). The equivalence or superiority of one RIP definition over the other is not known, to the best of our knowledge. The RIP of linear maps on low-rank matrices is key in our discussion (candes2011tight; liu2011universal):
Definition 1 (RIP-$\ell_2/\ell_1$ in chen2015exact)
A linear map $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ satisfies the RIP-$\ell_2/\ell_1$ with constant $\delta_r \in (0, 1)$, if $(1 - \delta_r) \|X\|_F \leq \tfrac{1}{m} \|\mathcal{A}(X)\|_1 \leq (1 + \delta_r) \|X\|_F$ is satisfied for all matrices $X \in \mathbb{R}^{n \times n}$ such that $\mathrm{rank}(X) \leq r$. Here, $\|\cdot\|_1$ denotes the $\ell_1$ norm over vectors, and $\|\cdot\|_F$ the Frobenius norm over matrices.
Corollary 1 ((needell2009cosamp))
Let $c$ and $r$ be positive integers. Then, $\delta_{cr} \leq c \cdot \delta_{2r}$.
We extend the results in the previous section and prove that, under appropriate conditions, the set of solutions is a singleton. To generalize, in our theoretical developments we will consider the case where $\mathcal{A}$ is generated through a Gaussian process.
Sensing mappings comprised of Wishart matrices and their properties. In particular, consider the sensing map $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ with $(\mathcal{A}(X))_i = \langle W_i, X \rangle$, where the $W_i$ are nonsingular Wishart matrices for $i = 1, \dots, m$, and $m$ is the total number of measurements.
Definition 2 (Wishart Distribution, muirhead2009aspects)
Suppose that $G = [g_1, \dots, g_d] \in \mathbb{R}^{n \times d}$, where each column $g_i \sim \mathcal{N}_n(0, \Sigma)$ (multivariate normal with zero mean). Define the matrix $W \in \mathbb{R}^{n \times n}$ by $W = GG^\top$. We say that $W$ follows a Wishart distribution with $d$ degrees of freedom and covariance matrix $\Sigma$, which we denote by $W \sim \mathcal{W}_n(d, \Sigma)$.
Wishart matrices are commonly used to estimate covariance in high-dimensional statistics (vershynin2012close; chen2015exact), and they come with nice properties that we will exploit in our theory. To generate $\mathcal{A}$, we generate Wishart matrices $W_i \sim \mathcal{W}_n(d, \Sigma)$ as defined above, for $i = 1, \dots, m$, where $\Sigma = I_n$ is the identity matrix.
The parameter $d$ is user-defined and set to $d \geq n$. By the assumption $d \geq n$, all $W_i$'s are nonsingular Wishart matrices (eaton). This ensures that, for each $i$, our theory holds by the properties of nonsingular Wishart matrices: the density function of $W_i$ exists, $W_i^{-1}$ exists, and $W_i$ is positive definite.
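A minimal sketch of this construction, with illustrative sizes: each sensing matrix is $W_i = G_i G_i^\top$ for $G_i \in \mathbb{R}^{n \times d}$ with i.i.d. standard normal entries, so that $W_i \sim \mathcal{W}_n(d, I_n)$, and $d \geq n$ makes each $W_i$ positive definite almost surely.

```python
import numpy as np

# Generating the Wishart sensing matrices described in the text.
rng = np.random.default_rng(3)
n, d, m = 6, 8, 5                             # d >= n for nonsingularity

Ws = []
for _ in range(m):
    G = rng.standard_normal((n, d))
    Ws.append(G @ G.T)                        # W_i = G_i G_i^T ~ W_n(d, I_n)

min_eigs = [np.linalg.eigvalsh(W)[0] for W in Ws]   # eigvalsh sorts ascending
```

All smallest eigenvalues come out strictly positive, confirming nonsingularity of the sampled $W_i$'s.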
Since the $W_i$ are positive definite matrices, this results in the following observation:
$\exists \lambda \in \mathbb{R}^m \ \text{such that} \ \mathcal{A}^*(\lambda) = \sum_{i=1}^m \lambda_i W_i \succ 0. \qquad (5)$
I.e., there exists at least one vector $\lambda$ such that the weighted sum of the $W_i$'s is a positive definite matrix; this can be easily derived from the fact that, by construction, all $W_i$'s are positive definite (e.g., the all-ones vector suffices); this also relates to Farkas' lemma for semidefinite programs (lovasz2003semidefinite).
To proceed, we require the following properties of Wishart matrices:


By (muirhead2009aspects), we know that $W := \sum_{i=1}^m \lambda_i W_i$, for $\lambda > 0$, is a nonsingular Wishart matrix. I.e., the weighted sum of Wishart matrices follows a Wishart distribution.

As a nonsingular matrix, $W^{-1}$ exists and follows an inverse Wishart distribution; see the supplementary material (supp) for details.
Further, since $W \succ 0$, there exists a unique lower-triangular matrix $L$ with positive diagonal entries such that $W = LL^\top$.


Regarding the decomposition $W = LL^\top$, we can extract information about the entries of $L$ by Bartlett's decomposition (eaton). In particular, the matrix $L$ is a lower-triangular matrix, where the random entries are mutually independent: for $i > j$, the off-diagonal entries follow a normal distribution, $L_{ij} \sim \mathcal{N}(0, 1)$, and the squared diagonal elements of $L$ follow a chi-squared distribution, $L_{ii}^2 \sim \chi^2_{\nu - i + 1}$, for all $i$, where $\nu$ denotes the degrees of freedom of $W$.
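The algebraic part of the properties above is easy to verify numerically. The sketch below (illustrative sizes and weights) forms a strictly positively weighted sum of Wishart matrices and checks that it admits the unique lower-triangular factorization $W = LL^\top$ with positive diagonal; Bartlett's decomposition describes the *distribution* of such an $L$, which we do not test here.

```python
import numpy as np

# Weighted sum of positive definite Wishart matrices and its Cholesky factor.
rng = np.random.default_rng(4)
n, d, m = 6, 8, 5

Ws = []
for _ in range(m):
    G = rng.standard_normal((n, d))
    Ws.append(G @ G.T)
lam = rng.uniform(0.5, 1.5, size=m)           # strictly positive weights
W = sum(l * Wi for l, Wi in zip(lam, Ws))     # positive definite by construction

L = np.linalg.cholesky(W)                     # lower triangular, positive diagonal
```

The factor $L$ returned by `np.linalg.cholesky` is exactly the unique $L$ invoked in the text.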
Equivalent reformulation of matrix sensing with fixed trace. Given the above setup, we will make the following connections, starting with a change of variables. Given the full-rank $L$ such that $\mathcal{A}^*(\lambda) = LL^\top$, and for each $W_i$, we define a new mapping $\mathcal{B} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$, such that: $(\mathcal{B}(Z))_i = \langle B_i, Z \rangle$, where $B_i = L^{-1} W_i L^{-\top}$.
Given $X$ and $L$, define the auxiliary variable $Z = L^\top X L$; observe that, for full-rank $L$, $\mathrm{rank}(Z) = \mathrm{rank}(X)$. Then, for any $i$, we have:
$\langle B_i, Z \rangle = \mathrm{Tr}\big(L^{-1} W_i L^{-\top} \cdot L^\top X L\big) = \mathrm{Tr}(W_i X) = \langle W_i, X \rangle,$
where the last equality is due to the definitions of $B_i$ and $Z$. For the rest of the discussion, we assume that $y = \mathcal{A}(X^\star)$, for $X^\star$ of rank $r^\star$.
The above indicates the one-to-one correspondence between the original feasibility set and the corresponding set after the change of variables:
$\{X \succeq 0 : \mathcal{A}(X) = y\} \ \longleftrightarrow \ \{Z \succeq 0 : \mathcal{B}(Z) = y\}. \qquad (6)$
Further, the ranks of the solutions, $X$ and $Z$, are the same. After the change of variables to $Z$, for $X$ and $Z$ that belong to the above sets, we observe:
$\mathrm{Tr}(Z) \overset{(i)}{=} \mathrm{Tr}(L^\top X L) = \langle LL^\top, X \rangle \overset{(ii)}{=} \langle \mathcal{A}^*(\lambda), X \rangle = \lambda^\top \mathcal{A}(X) \overset{(iii)}{=} \lambda^\top y.$
Here, $(i)$ is due to the definition of $Z$, $(ii)$ is due to the assumption that $\mathcal{A}^*(\lambda) = LL^\top \succ 0$, according to (5), and $(iii)$ is due to $\mathcal{A}(X) = y$, for $X$ being in the feasibility set. This dictates that the trace of matrices in the set $\{Z \succeq 0 : \mathcal{B}(Z) = y\}$ is constant and does not depend on $Z$ directly; it only depends on the measurement vector $y$ and the vector $\lambda$ defined above.
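The chain of equalities above can be checked end to end. The sketch below (illustrative sizes; a rank-1 generator chosen for convenience) builds $W = \mathcal{A}^*(\lambda) = LL^\top$, the transformed matrices $B_i = L^{-1} W_i L^{-\top}$, and $Z = L^\top X L$, then verifies both $\langle B_i, Z \rangle = \langle W_i, X \rangle$ and $\mathrm{Tr}(Z) = \lambda^\top y$.

```python
import numpy as np

# Change of variables behind the fixed-trace reformulation.
rng = np.random.default_rng(5)
n, d, m = 6, 8, 5

Ws = []
for _ in range(m):
    G = rng.standard_normal((n, d))
    Ws.append(G @ G.T)                        # positive definite Wishart samples
lam = rng.uniform(0.5, 1.5, size=m)           # lambda > 0, so A*(lambda) is PD
L = np.linalg.cholesky(sum(l * Wi for l, Wi in zip(lam, Ws)))
Linv = np.linalg.inv(L)

v = rng.standard_normal(n)
X = np.outer(v, v)                            # a feasible PSD X (the generator)
y = np.array([np.sum(Wi * X) for Wi in Ws])   # y = A(X)

Bs = [Linv @ Wi @ Linv.T for Wi in Ws]        # B_i = L^{-1} W_i L^{-T}
Z = L.T @ X @ L                               # Z = L^T X L

y_check = np.array([np.sum(Bi * Z) for Bi in Bs])   # should reproduce y
trace_Z = np.trace(Z)
lam_y = lam @ y                               # the fixed trace lambda^T y
```

The numerical agreement of `trace_Z` with `lam_y` is exactly the "constant trace over the feasible set" property that drives the uniqueness argument.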
Let us focus on the set $\{Z \succeq 0 : \mathcal{B}(Z) = y\}$. By definition, $y = \mathcal{B}(Z^\star)$, where $Z^\star = L^\top X^\star L$ is rank-$r^\star$ and relates to $X^\star$ per (6). Assume that $\mathcal{B}$ is a linear map that satisfies the RIP-$\ell_2/\ell_1$ in Definition 1. Consider the convex optimization criterion with estimate $\widehat{Z}$:
$\widehat{Z} = \arg\min_{Z \in \mathbb{R}^{n \times n}} \ \|Z\|_* \quad \text{subject to} \quad \mathcal{B}(Z) = y. \qquad (7)$
The following result is from (chen2015exact).
Theorem 1 (Theorem 1 in (chen2015exact), informal)
Assume $\mathcal{B}$ satisfies the RIP-$\ell_2/\ell_1$ with sufficiently small constant $\delta_{cr}$, for some integer $c$. Then, (7), in the absence of noise, allows perfect recovery of the unique rank-$r$ matrix that satisfies the measurements, with exponentially high probability, provided we have enough samples $m$.
Let us interpret and use this theorem. Assume that $Z^\star = L^\top X^\star L \succeq 0$, with $\mathrm{rank}(Z^\star) = r^\star$. Under these assumptions, the minimizer $\widehat{Z}$ of (7) is identical to the unique rank-$r^\star$ matrix that satisfies the set of observations $y$. Taking into account the PSD nature of $Z^\star$, we have $\|Z^\star\|_* = \mathrm{Tr}(Z^\star) = \lambda^\top y$. Note that we do not include the constraint $\mathrm{Tr}(Z) = \lambda^\top y$, because any feasible solution should satisfy this condition (see above). Also, we do not include the PSD constraint; the problem (7) is sufficient to guarantee uniqueness.
By the above theorem, any other PSD solution $Z$ that satisfies the measurements must have a nuclear norm larger than $\|Z^\star\|_*$. Being PSD, this also means $\mathrm{Tr}(Z) = \|Z\|_* > \lambda^\top y$, which implies that any other PSD solution is not in the feasible set $\{Z \succeq 0 : \mathcal{B}(Z) = y\}$, whose elements all have trace exactly $\lambda^\top y$. Hence, this set contains only one element, by contradiction.
Due to the one-to-one correspondence between the sets in (6), we infer that the first set is also a singleton. This further implies that the inclusion of any metric that favors low-rankness in (2)-(3), or restricting $U$ to be a tall matrix with wisely chosen $r$ in (4), makes no difference, as there is only one matrix that fits the measurements $y$.
RIP for the new sensing mapping $\mathcal{B}$. A key assumption is that the RIP holds for the transformed sensing map $\mathcal{B}$ –and not the original sensing map $\mathcal{A}$. Thus, in general, it is required to find such a transformation between $\mathcal{A}$ and $\mathcal{B}$.
We know that $\mathcal{B}$ is comprised of the matrices $B_i = L^{-1} W_i L^{-\top}$, where $W_i \sim \mathcal{W}_n(d, I_n)$. By construction through the Wishart distribution, we know that $B_i$, by Theorem 3.2.4 from (muirhead2009aspects), satisfies:
$B_i = L^{-1} W_i L^{-\top} \sim \mathcal{W}_n(d, \widetilde{\Sigma}), \quad \text{where } \widetilde{\Sigma} = L^{-1} L^{-\top} = \big(\mathcal{A}^*(\lambda)\big)^{-1}.$
Using the following property of sums of sub-exponential random variables, we show Lemma 1.
Definition 3 (Sum of sub-exponential random variables, (sumse))
Suppose that $X_1, \dots, X_N$ are independent sub-exponential random variables. Then, the sum $\sum_{i=1}^N X_i$ is sub-exponential.
Lemma 1
Nonsingular Wishart matrices are sub-exponential matrices.
Proof. Let $W$ be a nonsingular Wishart random matrix. According to our constructions so far, $W$ may be characterized by $W = GG^\top$, with $G \in \mathbb{R}^{n \times d}$ and $d \geq n$, where each column of $G$ is generated from a multivariate normal distribution with zero mean. It is well known that each such column is a sub-Gaussian random vector (vershynin2017four), and that all elements of these vectors are sub-Gaussian random variables. From (arast), we know that both the square of a sub-Gaussian random variable, and the product of independent sub-Gaussian random variables, are sub-exponential; since each entry of $W$ is a sum of such products, the claim follows by Definition 3; see also Lemma 7 in (ahmed2014compressive).
We can then use the following result from (foucart2019iterative):
Theorem 2 ((foucart2019iterative))
Given sub-exponential sensing matrices for the matrix sensing setting over rank-$r$ matrices, the RIP requirement in Definition 1 is satisfied with probability at least $1 - C \exp(-c m)$, provided the number of measurements $m$ is sufficiently large, for constants $C, c > 0$ related to the sub-exponential distribution.
This completes the proof. In plain words, using Gaussian-generated sensing matrices, we can generate Wishart sensing maps $\mathcal{A}$ such that the set of solutions that satisfies the observations $y$ is a singleton.
Remark 1
The above shows that the specific RIP assumption on $\mathcal{B}$ is a sufficient, but not a necessary, condition to guarantee that the feasibility set is a singleton. It remains an open question to find necessary conditions and possibly different sufficient conditions –such as the incoherence condition in (bruckstein2008uniqueness) for matrices, or the RIP-$\ell_2/\ell_2$– that also lead to a singleton set. In (baldwin2015informational), the authors find particular instances of sensing maps that, while not satisfying the RIP condition, lead to a singleton set. It is interesting to study the sensing construction in the current paper for RIP-$\ell_2/\ell_2$ conditions using:
Theorem 3 ((chen2015exact), Theorem 5)
Let $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ be a sensing map with individual matrices $A_i$. Suppose that, for all $i$, the $A_i$ satisfy boundedness and moment conditions for some quantity $K$. For any small constant $\delta > 0$, if $m$ is sufficiently large, then with high probability $\mathcal{A}$ satisfies the RIP-$\ell_2/\ell_2$ w.r.t. all matrices of rank at most $r$, with constant at most $\delta$.
4 Experiments
The aim of the following experiments is twofold: in Section 4.1, to show that the theory applies in practice; in Section 4.2, to show that a sensing map beyond the Wishart distribution can be sufficient to lead to a good approximation of $X^\star$, without the use of explicit regularization for low-rankness.
4.1 Using Wishart matrices in simulated matrix sensing problems
The following example shows some preliminary results, depicted in Figure 1. Here, we compare least squares with no constraints (CVX – second-order method), least squares with PSD constraints (CVX – second-order method), and least squares with PSD constraints (using projected gradient descent). The plot assumes a small value of $n$ for proof of concept, due to the computational restrictions the second-order method poses; the same behavior is observed for any value we tested. While, as expected, the unconstrained criterion finds a good solution only after observing sufficiently many samples, the PSD-constrained criteria find a relatively good solution well before that, which implies that the PSD constraint alone is sufficient to find the solution from a limited set of random measurements.
We generate $X^\star$ as a rank-1 PSD matrix, $X^\star = \alpha vv^\top$, for a random scalar $\alpha > 0$ and vector $v$. The measurement matrices are generated from a Gaussian distribution: $a_i \sim \mathcal{N}(0, I_n)$ and $A_i = a_i a_i^\top$. Therefore, the $A_i$ are symmetric and PSD, according to the construction described in the main text. The measurements are generated by $y_i = \langle A_i, X^\star \rangle$, for $i = 1, \dots, m$.
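A Section 4.1-style run can be sketched as follows, with our own illustrative sizes (which need not match those used for Figure 1): rank-1 PSD ground truth, rank-one PSD sensing matrices $A_i = a_i a_i^\top$, and the PSD-constrained least-squares criterion solved by projected gradient descent, with no rank constraint or low-rankness penalty anywhere.

```python
import numpy as np

# PSD-constrained least squares via projected gradient descent;
# the projection onto the PSD cone clips negative eigenvalues.
rng = np.random.default_rng(6)
n, m = 8, 48                                   # m well below n^2 = 64

v = rng.standard_normal(n)
v /= np.linalg.norm(v)
X_star = np.outer(v, v)                        # alpha = 1 for simplicity

a_vecs = rng.standard_normal((m, n))
M = np.array([np.outer(a, a).ravel() for a in a_vecs])   # stacked vec(A_i)
y = M @ X_star.ravel()

def proj_psd(X):
    w, V = np.linalg.eigh((X + X.T) / 2)       # symmetrize, then clip eigenvalues
    return (V * np.maximum(w, 0.0)) @ V.T

eta = m / np.linalg.norm(M, 2) ** 2            # 1/L for f(X) = ||M vec(X) - y||^2 / (2m)
X = np.zeros((n, n))
losses = []
for _ in range(5000):
    res = M @ X.ravel() - y
    losses.append(0.5 * np.mean(res ** 2))
    grad = (M.T @ res).reshape(n, n) / m
    X = proj_psd(X - eta * grad)

rel_err = np.linalg.norm(X - X_star) / np.linalg.norm(X_star)
```

In runs like this, the PSD constraint alone drives the fit to the data, and the final iterate approximates the rank-1 ground truth well before $m$ reaches $n^2$, consistent with the figure's message.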
4.2 Beyond Wishart matrices: quantum state tomography
In this subsection, we consider the setting of quantum state tomography (QST): we generate measurements according to $y_i = \langle M_i, X^\star \rangle$, where each Pauli observable $M_i$ is randomly generated. In all settings, for simplicity, we assume $X^\star$ is rank-1, PSD, and normalized, $\mathrm{Tr}(X^\star) = 1$, to satisfy the QST setting. Given $y$ and the $M_i$'s, we consider:
$\min_{X} \|X\|_* \ \text{s.t.} \ \mathcal{A}(X) = y; \qquad \min_{X} \|X\|_F \ \text{s.t.} \ \mathcal{A}(X) = y; \qquad \min_{X \succeq 0} \|\mathcal{A}(X) - y\|_2^2. \qquad (8)$
I.e., the left criterion is the nuclear-norm minimization problem, with explicit regularization towards low-rank solutions (recht2010guaranteed); the middle criterion is the minimum-norm solution problem, where the objective regularizes towards the solution with the minimum Frobenius norm; the right criterion is the PSD-constrained, least-squares problem, where the task is to fit the data subject to PSD constraints. In the two latter settings, there is no explicit regularization towards low-rank solutions.
We use the CVX Matlab implementation, in its low-precision setting, to solve all problems in (8) (gb08; cvx). The results are presented in Table 1: we report the entrywise distance between the estimate $\widehat{X}$ and $X^\star$. Since the estimates in all criteria in (8) are only approximately low-rank^2, we also report the entrywise distance between $X^\star$ and the best rank-1 approximation of $\widehat{X}$. We consider four different settings of parameters; our experiments are restricted to small values of $n$, due to the high computational complexity of the CVX solvers (by default, we use the SDPT3 solver (toh1999sdpt3)). Note that this is a second-order algorithm.

^2 Due to numerical precision limits, none of the solutions is exactly rank-1 in the strict sense.
Table 1 supports our claim: all three criteria, in all cases, lead to the same solution, while they all use different "regularization" in optimization. Any small differences can be attributed to numerical precision and non-identical initial conditions. We observe consistently that, using the nuclear-norm bias, we obtain a better approximation of $X^\star$. Thus, using explicit regularization helps.
4.3 Behavior of first-order, nonconvex solvers on the factorized parameterization
In view of the previous results, here we study the behavior of first-order, nonconvex solvers that utilize the reparameterization of $X$ as $UU^\top$. We borrow the iteration in (bhojanapalli2016dropping; park2016finding): $U_{t+1} = U_t - \eta \, \mathcal{A}^*\big(\mathcal{A}(U_t U_t^\top) - y\big) U_t$, for $U_t \in \mathbb{R}^{n \times r}$. We consider two cases: $(i)$ $r = r^\star$, where $r^\star$ is the rank of $X^\star$ and is assumed known a priori; this is the case in (bhojanapalli2016dropping; park2016finding) and has explicit regularization, as the algorithm operates only on the space of rank-$r^\star$ matrices; $(ii)$ $r = n$, where we operate over the whole space $\mathbb{R}^{n \times n}$; this is the case studied in (li2017algorithmic).
In both cases, the initialization and step size follow the prescriptions in (park2016finding), and they are computed using the same procedures for both cases. Table 2 reports our findings. To ease comparison, we repeat the results of the least-squares objective in (8). We observe that all algorithms converge close to $X^\star$: obviously, using the a priori information that $X^\star$ is rank-1 biases towards a low-rank estimate, where faster convergence rates are observed. On the contrary, using $r = n$ shows slower convergence towards the vicinity of $X^\star$; nevertheless, the reported results suggest that one can still achieve a small distance to $X^\star$. Finally, while the $r = n$ estimate could be full-rank, most of its energy is contained in a small number of principal components, indicating that all algorithms favor (approximately) low-rank solutions.
5 Conclusion
In this manuscript, we provide theoretical and practical evidence that in PSD, low-rank matrix sensing, the solution set is a singleton, under RIP assumptions and appropriate transformations of the sensing map $\mathcal{A}$. In these cases, the PSD constraint itself provides guarantees for unique matrix recovery. The question of whether the above can be generalized to less restrictive linear sensing mappings remains open. Note that RIP is a sufficient, but not a necessary, condition; we believe that generalizing our work to broader settings and assumptions is an interesting research direction.