Implicit regularization and solution uniqueness in over-parameterized matrix sensing
We consider whether algorithmic choices in over-parameterized linear matrix factorization introduce implicit regularization. We focus on noiseless matrix sensing over rank- positive semi-definite (PSD) matrices in , with a sensing mechanism that satisfies restricted isometry properties (RIP). The algorithm we study is factored gradient descent, where we model the low-rankness and PSD constraints with the factorization , for . Surprisingly, recent work argues that the choice of is not pivotal: even setting is sufficient for factored gradient descent to find the rank- solution, which suggests that operating over the factors leads to an implicit regularization. In this contribution, we provide a different perspective to the problem of implicit regularization. We show that under certain conditions, the PSD constraint by itself is sufficient to lead to a unique rank- matrix recovery, without implicit or explicit low-rank regularization. I.e., under assumptions, the set of PSD matrices, that are consistent with the observed data, is a singleton, regardless of the algorithm used.
In this work, we study how over-parameterization relates to regularization (zhang2016understanding). By over-parameterization, we mean that the number of parameters to estimate is larger than the available data, thus leading to an under-determined system.111It helps picturing over-parameterization via a simple linear system of equations: when the number of parameters is more than the number of equations, then there is an infinite number of solutions, and which is the one we choose depends on additional regularization bias. E.g., deep neural networks are usually designed over-parameterized, with ever growing number of layers, and, eventually, a larger number of parameters (telgarsky2016benefits). What is surprising though is the lack of overfitting in such networks: while there could be many different parameter realizations that lead to zero training error, the algorithms select models that also generalize well to unseen data, despite over-parameterization (keskar2016large; poggio2017theory; soltanolkotabi2017theoretical; dinh2017sharp; cooper2018loss).
The authors of (li2017algorithmic) show that the success of over-parameterization can be theoretically fleshed out in the context of shallow, linear neural networks. They consider the case of low-rank and positive semi-definite (PSD) factorization in matrix sensing (recht2010guaranteed): given measurements —where has rank and is PSD, and satisfies the restricted isometry property— they prove that a square and full-rank factorized gradient descent algorithm over , where , converges to . I.e., whereas the algorithm has the expressive power to find any matrix that is consistent with the noiseless data (and due to over-parametrization there are infinitely many such ’s), in contrast, it automatically converges to the minimum rank solution. This argument was previously conjectured in (gunasekar2017implicit).
This could be seen as a first step towards understanding over-parameterization in general non-linear models, whose objectives are more involved and complex. Such network simplifications have been followed in other recent works in machine learning and theoretical computer science, such as in convolutional neural networks (du2017convolutional), and landscape characterization of generic objectives (baldi1989neural; boob2017theoretical; safran2017spurious).
In this work, we provide a different perspective on the interpretation of over-parameterization in matrix sensing. We show that, in the noiseless case, the PSD constraint by itself could be sufficient to lead to a unique matrix recovery from observations, without the use of implicit or explicit low-rankness. In other words, the set of PSD matrices that satisfy the measurements is a singleton, irrespective of the algorithm used.
Notation. Vectors are denoted with plain lower case letters; matrices are denoted with capital letters; and mappings, from one Euclidean space to another, are denoted with capital calligraphic letters. Given , its -norm is defined as , where denotes its -th entry; similarly, we define the -norm as . The -pseudonorm, , is defined as the number non-zero entries in . Given , is the diagonal matrix with diagonal entries the vector . For two matrices with appropriate dimensions, we define their inner product as , where is the trace operator. Given , the nuclear norm is defined as , where is the -th singular value. The spectral norm is denoted as , where is the maximum singular value.
1.1 Related work
Implicit regularization in matrix sensing. This area was initiated by the conjecture in (gunasekar2017implicit): The authors suggest that non-convex gradient descent on a full-dimensional factorization , where , converges to the minimum nuclear norm solution. (li2017algorithmic) sheds light on this conjecture: they theoretically explain the regularization inserted by algorithms, even beyond learning matrix factorization models, such as one-hidden-layer neural nets with quadratic activation; see also (du2018power).
Implicit regularization beyond matrix sensing. For the general linear regression setting, (wilson2017marginal) shows that, under specific assumptions, adaptive gradient methods, like AdaGrad and Adam, converge to a different solution than the simple (stochastic) gradient descent (SGD); see also (gunasekar2018characterizing). SGD has been shown to converge to the so-called minimum norm solution; see also (soudry2017implicit) for the case of logistic regression. This behavior is also demonstrated using DNNs in (wilson2017marginal), where simple gradient descent generalizes as well as the adaptive methods.
No spurious local minima. There is a recent line of work, focusing on non-convex problems, that state conditions under which problem formulations actually have no-spurious local minima, when we transform the problem from its convex formulation to a non-convex one. Characteristic examples include that factored gradient descent does not introduce spurious local minima in matrix completion ge2016matrix and matrix sensing bhojanapalli2016global; park2017non, and all local minima are global in some tensor decompositions ge2015escaping and dictionary learning sun2016complete; see ge2017no; sun2015nonconvex for a complete overview of these results. Further, there is literature that characterizes the landscape of factorization problems, using the strict saddle property, to indicate that we can escape easily any saddle point ge2015escaping; zhu2018global; in this work, we take a different path showing that by construction the set of solutions that satisfy our observations is a singleton, demystifying the behavior of factored gradient descent in over-parameterized matrix sensing.
2 Nonnegativity and sparsity: the vector analog of PSD and low rankness
We briefly describe the work of (bruckstein2008uniqueness), as we borrow ideas from that paper. Consider the problem of finding a non-negative, sparse solution to an over-parameterized linear system of equations: . Here, the sensing matrix , where , the unknown satisfies (entrywise) and is sufficiently sparse , and the measurements are .
This scenario suggests the following optimization problem as a solution:
Here, is a function metric that measures the quality of the candidate solutions. Examples are (i.e., the minimum norm solution that satisfies the constraints), (i.e., the solution that has small -norm, and promotes sparsity), and (i.e., the solution with the smallest number of non-zeros). These tasks have been encountered in statistics, computer vision and signal processing applications (zass2007nonnegative; shashua2005non; hazan2005sparse), and they are popular in the compressed sensing literature (donoho2006compressed; foucart2013mathematical), when is assumed sparse.
Let us disregard for the moment the positivity constraints on . By definition, an over-parameterized linear inverse problem has infinite number of solutions. Unless we use the information that is sparse, its reconstruction using only and is an ill-posed problem, and there is no hope in finding the true vector without ambiguity.
Therefore, to reconstruct in an over-parametrized setting, prior knowledge should be exploited by the optimization solver. Compressed sensing is an example where additional constraints restrict the feasible set to a singleton: under proper assumptions on the sensing matrix –such as the restricted isometry property (candes2008restricted), or the coherence property (bruckstein2008uniqueness)– and assuming sufficient number of measurements , one can show that the feasible set contains only one element, for sufficiently small .
Re-inserting the positivity constraints in our discussion, (bruckstein2008uniqueness) show that, when a sufficiently sparse solution generates , and assuming the row-span of intersects with the positive orthant, then the non-negative constraint by itself is sufficient to identify the sparse , and reduce the cardinality of the feasible solutions to singleton. In other words, the inclusion of a sparsity inducing in (1) is not needed, even if we know a priori that is sparse; non-negativity is sufficient to find a unique solution to the feasibility problem:
that matches . This way, we can still use convex optimization solvers –linear programming in this particular case– and avoid hard non-convex problem instances.
3 The matrix sensing problem for PSD matrices
Let us now describe the matrix sensing problem, draw the connections with the vector case, and study the over-parametrization , for . Following (li2017algorithmic), we consider the PSD-constrained case, where the optimum solution is both low-rank and PSD.
A rough description is as a problem of linear system of equations over matrices. It is derived by the generative model , where is the low-rank, PSD ground truth. Let the true rank of be . The mapping is such that the -th entry of is given by , for independently drawn symmetric measurement matrices.
We study the PSD-constrained formulation, where we aim to find via:
again represents a function metric that promotes low-rankness; standard choices include the nuclear norm (which imposes “sparsity" on the set of singular values and hence low-rankness), and the non-convex metric.
Practical methods for this scenario include the PSD-constrained basis pursuit algorithm for matrices (chen2001atomic; goldstein2010high) that solve (2) for using interior-point methods (liu2009interior); and projected gradient descent algorithms, that solve an equivalent form of (2) for wisely chosen :
for , and (kyrillidis2011recipes; kyrillidis2014matrix; khanna2017iht). In the latter, the objective appears in the constraint set as or .
Recently, we have witnessed a series of works (zhao2015nonconvex; park2016provable; park2016non; sun2016guaranteed; bhojanapalli2016global; park2016finding; kyrillidis2017provable; ge2017no; hsieh2017non), that operate directly on the factorization , and do not include any PSD and rank constraints. This is based on the observation that, for any rank- and PSD , the factorization , for , guarantees that is at the same time PSD and at most rank-. This re-parameterizes (2) as:
and (3) as:
Observe that in both cases, there are no metrics that explicitly favor low-rankness or any PSD constraints; these are implicitly encoded by the factorization . Algorithmic solutions for the above criteria include the factorized gradient descent (bhojanapalli2016dropping; park2016finding) that obeys the following recursion:
Current theory (bhojanapalli2016dropping; park2016finding) assumes that is known a priori, in order to set the dimensions of the factor , accordingly. The only work that deviates from this perspective is the recent work in (li2017algorithmic), where the authors prove that even square in (4) still converges to the low-rank ground truth , with proper initialization and step size selection. The result relies on restricted isometry assumptions of . In a manner, this suggests that operating on the factorized space, the algorithm implicitly favors low-rank solutions, even if there is expressive power to select a full rank- as a solution. The following subsection provides a different perspective on the matter: the implicit PSD constraint in could be sufficient to reduce the feasibility set to singleton, no matter what algorithm is used for solution.
When positivity constraints are sufficient for unique recovery under RIP. We note that the Restricted Isometry Property (RIP) assumption is made in (zhao2015nonconvex; park2016provable; park2016non; sun2016guaranteed; bhojanapalli2016global; park2016finding; kyrillidis2017provable; ge2017no; hsieh2017non; li2017algorithmic). There are various versions of RIP, with the most well-known being the RIP- (chen2015exact). Due to the construction of our sensing matrices as outer products of Gaussians—and thus the connection with rank-one measurements in matrix sensing—for our theory, we will also use a variant of RIP, RIP- (chen2015exact). The equivalence or superiority of one RIP definition over the other is not known, to the best of our knowledge. The RIP of linear maps on low rank matrices is key in our disucssion (candes2011tight; liu2011universal):
Definition 1 (RIP in chen2015exact)
A linear map satisfies the -RIP- with constant , if , is satisfied for all matrices such that . Here, denotes the -norm over matrices.
Corollary 1 ((needell2009cosamp))
Let and be positive integers. Then, .
We extend the results in the previous section, and prove that, under appropriate conditions, the set of solutions is a singleton. To generalize, in our theoretical developments we will consider the case where is generated through a Gaussian process.
Sensing mappings comprised of Wishart matrices and their properties. In particular, consider the sensing map , where are non-singular Wishart matrices for , and is the total number of measurements.
Definition 2 (Wishart Distribution, muirhead2009aspects)
Suppose that where each column (multivariate normal with zero mean). Define matrix by . We say that follows a Wishart distribution with degrees of freedom and covariance matrix , which we denote by .
Wishart matrices are commonly used to estimate covariance in high dimensional statistics (vershynin2012close; chen2015exact), and they come with nice properties that we will exploit in our theory. To generate , we generate Wishart matrices as defined above, for , and is the identity matrix.
The parameter is user-defined and set to . By assumption of , all ’s are non-singular Wishart matrices (eaton). This ensures that, , our theory holds by the properties of non-singular Wishart matrices: the density function of exists, exists, and is positive definite.
By definition of as positive definite matrices, this results to the following observation:
I.e., there exists at least one vector such that the weighted sum of ’s is a positive definite matrix; this can be easily derived from the fact that by construction all ’s are positive definite; this also relates to the Farkas’ Lemma for semidefinite programs (lovasz2003semidefinite).
To proceed, we require the following definitions of Wishart matrices:
By (muirhead2009aspects), we know that is a non-singular wishart matrix satisfying: . I.e., the weight sum of Wishart matrices satisfies the Wishart distribution.
As a non-singular matrix, exists and follows an inverse Wishart distribution. In particular, (supp).
Further, since , there exists a unique such that .
Regarding the decomposition , we can extract information about ’s by Bartlett’s Decomposition (eaton). In particular, the matrix is a lower triangular matrix, where the random variables are mutually independent: for , follow a normal distribution as , and diagonal elements of follow a chi-squared distribution as , for all .
Equivalent reformulation of matrix sensing with fixed trace. Given the above set up, we will make the following connections, starting with the following change of variables. Given the full rank such that , and for each , we define a new mapping , such that:
Given and , define the auxiliary variable ; observe that, for full rank , . Then, for any , we have:
where the last equality is due to the definitions of and . For the rest of the discussion, we assume that , for rank- .
The above indicates the one-to-one correspondence between the original feasibility set and the corresponding set after the change of variables:
Further, the rank of the solutions, and , are the same. After the change of variables to , for and that belong to the above sets, we observe:
Here, is due to the definition of , is due to the assumption that the span of is strictly positive and equals , according to (5), and is due to , for being in the feasibility set. This dictates that the trace of matrices in the set is constant and does not depend on directly; it only depends on the measurement vector and the vector defined above.
Let us focus on the set . By definition, , where is rank- and relates to in . Assume that , is a linear map that satisfies the RIP in Definition 1. Consider the convex optimization criterion with estimate :
The following result is from (chen2015exact).
Theorem 1 (Theorem 1 in (chen2015exact) (Informal))
Assume satisfies the RIP- for some , and for some integer . Then, (7), in the absence of noise, allows perfect recovery of the unique that satisfies the measurements, with exponentially high probability, provided we have enough samples .
Let us interpret and use this theorem. Assume that , and . Under these assumptions, the minimizer of (7) is identical to the unique, rank- matrix that satisfies the the set of observations . Taking into account the PSD nature of , we have . Note that we do not include the constraint , because any feasible solution should satisfy this condition (see above). Also, we do not include the PSD constraint; the problem (7) is sufficient to guarantee uniqueness.
By the above theorem, any other PSD solution, , that satisfies the measurements must have a nuclear norm larger than . Being PSD, this also means , which implies that any other PSD solution is not in the feasible set . Hence, this set contains only one element, by contradiction.
Due to the one-to-one correspondence between the sets in (6) then, we infer that the first set is also a singleton. This further implies that the inclusion of any metric that favors low-rankness in (2)-(3) or restricting to be a tall matrix with wisely chosen in (4) makes no difference, as there is only one matrix that fits measurements .
RIP- for the new sensing mapping . Key assumption is that the RIP- holds for the transformed sensing map —and not the original sensing map . Thus, in general, it is required to find such transformation between and .
We know that , where . By construction through the Wishart distribution, we know that , by Theorem 3.2.4 from (muirhead2009aspects), satisfies:
Using the following definition of sub-exponential random variables, we show Lemma 1.
Definition 3 (Sum of Sub-Exponential Random Variables, (sum-se))
Suppose that are independent sub-Exponential random variables. Then, the sum is sub-Exponential, where .
Non-singular Wishart matrices are sub-exponential matrices.
Proof. Let be a non-singular Wishart random matrix. According to our constructions so far, may be characterized by , , and . Recall that , where , where each row of is generated from a multivariate normal distribution with zero mean. It is well known that each row , , is a sub-Gaussian random vector (vershynin2017four), and that all elements of these vectors are sub-Gaussian random variables. From ar-ast, we know that both the square of a sub-Gaussian random variable, and the product of independent sub-Gaussian random variables, are sub-Exponential; see also Lemma 7 in ahmed2014compressive.
We then can use the following result from foucart2019iterative:
Theorem 2 ((foucart2019iterative))
Given sub-exponential sensing matrices for the matrix sensing setting over rank- matrices, the RIP- requirement in Definition 1 is satisfied with probability:
for constant related to the subexponential distribution, and , for constant .
This completes the proof: In plain words, using Gaussian-generated sensing matrices, we can generate Wishart sensing maps , such that the solution cardinality that satisfies the observations is a singleton.
The above show that the specific RIP- assumption on is a sufficient, but not a necessary, condition to guarantee that the feasibility set is a singleton, . It remains an open question to find necessary conditions and possibly different sufficient conditions –such as the incoherence condition in (bruckstein2008uniqueness) for matrices, or the RIP-– that also lead to a singleton set. In (baldwin2015informational) the authors find particular instances of sensing maps that, while not satisfy RIP condition, lead to a singleton set. It is interesting to study the sensing construction in the current paper for RIP- conditions using:
Theorem 3 ((chen2015exact), Theorem 5)
Let be a sensing map, with individual matrices . Suppose that for all , , hold for some quantity . For any small constant , if , then with probability at least , one has satisfies RIP- w.r.t. all matrices of rank at most and obeys .
The aim of the following experiments is twofold: in Section 4.1, to show that the theory applies in practice; in Section 4.2, to show that a sensing map beyond the Wishart distribution can be sufficient to lead to a good approximation of , without the use of explicit regularization for low-rankness.
4.1 Using Wishart matrices in simulated matrix sensing problems
The following example shows some preliminary results, shown in Figure 1. Here, we compare least squares in with no constraints (CVX - second order method), least squares in with PSD constraints (CVX - second order method), and least squares in with PSD constraints (using projected gradient descent). The plot assumes small for proof of concept, where , due to the computational restrictions the second order method poses; same behavior is observed for any value we tested. The degrees of freedom are . While, criterion finds a good solution after observing samples, as expected, criteria find a relatively good solution well before that, which implies that the PSD constraint alone is sufficient to find the solution, only from a limited set of random measurements.
We generate as a rank- PSD matrix, where for random scalar and vector . The measurement matrices are generated from a Gaussian distribution: and . Therefore the are symmetric and PSD, according to the construction described in the main text. The measurements are generated by for .
4.2 Beyond Wishart matrices: quantum state tomography
In this subsection, we consider the setting of quantum state tomography (QST): We generate measurements according to , where and the Pauli observable is randomly generated. In all settings, for simplicity, we assume is rank-1, PSD and normalized , to satisfy the QST setting. Given and , we consider:
I.e., the left criterion is the nuclear-norm minimization problem, with explicit regularization towards low-rank solutions (recht2010guaranteed); the middle criterion is the minimum-norm solution problem, where the objective regularizes towards with the minimum Frobenius norm; the right criterion is the PSD constrained, least-squares problem, where the task is to fit the data subject to PSD constraints. In the two latter settings, there is no explicit regularization towards low-rank solutions.
We use the CVX Matlab implementation, in its low-precision setting, to solve all problems in (8) (gb08; cvx). The results are presented in Table 1: denotes the entrywise distance . Since the estimates in all criteria in (8) are only approximately low-rank222Due to numerical precision limits, non of the solutions are nor rank-1 in the strict sense., we also report the entrywise distance between and the best rank-1 approximation of , denoted as . We consider four different settings for parameters; our experiments are restricted to small values of in , due to the high computational complexity of the CVX solvers (by default we use the SDPT3 solver (toh1999sdpt3)). Note that this is a second-order algorithm.
Table 1 support our claim: All three criteria, and for all cases, lead to the same solution, while they all use different “regularization” in optimization. Any small differences can be assumed due to numerical precision, and not equivalent initial conditions. We observe consistently that, using the nuclear-norm bias, we obtain a better approximation of . Thus, using explicit regularization helps.
4.3 Behavior of first-order, non-convex solvers on parameterization
In view of the previous results, here we study the behavior of first-order, non-convex solvers, that utilize the re-parameterization of as . We borrow the iteration in (bhojanapalli2016dropping; park2016finding), where: , for . We consider two cases: where is the rank of , and is assumed known a priori; this is the case in (bhojanapalli2016dropping; park2016finding) and has explicit regularization, as the algorithm operates only on the space of rank- matrices. where we can operate over the whole space ; this is the case studied in (li2017algorithmic).
In both cases, the initialization and step size follow the prescriptions in (park2016finding), and they are computed using the same procedures for both cases. Table 2 reports our findings. To ease comparison, we repeat the results of the least-squares objective in (8). We observe that all algorithms converge close to : obviously, using the a priori information that is rank-1 biases towards a low-rank estimate, where faster convergence rates are observed. In the contrary, using shows slower convergence towards the vicinity of ; nevertheless, the reported results suggests that still one can achieve a small distance to (). Finally, while could be full-rank, most of the energy is contained in a small number of principal components, indicating that all algorithms favor (approximately) low-rank solutions.
In this manuscript we provide theoretical and practical evidence that in PSD, low-rank matrix sensing, the solution set is a singleton, under RIP assumptions and appropriate transformations on the sensing map . In these cases, the PSD constraint itself provides guarantees for unique matrix recovery. The question whether the above can be generalized to less restrictive linear sensing mappings remains open. Note that RIP is a sufficient but not a necessary condition; we believe that generalizing our work to more broad settings and assumptions is an interesting research direction.