
# Implicit regularization and solution uniqueness in over-parameterized matrix sensing

Anastasios Kyrillidis
IBM & Rice University
anastasios@rice.edu
&Amir Kalev
University of Maryland
amirk@umd.edu
###### Abstract

We consider whether algorithmic choices in over-parameterized linear matrix factorization introduce implicit regularization. We focus on noiseless matrix sensing over rank-$r$ positive semi-definite (PSD) matrices in $\mathbb{R}^{n \times n}$, with a sensing mechanism that satisfies the restricted isometry property (RIP). The algorithm we study is factored gradient descent, where we model the low-rankness and PSD constraints with the factorization $X = UU^\top$, for $U \in \mathbb{R}^{n \times p}$. Surprisingly, recent work argues that the choice of $p$ is not pivotal: even setting $p = n$ is sufficient for factored gradient descent to find the rank-$r$ solution, which suggests that operating over the factors leads to an implicit regularization.

In this note, we provide a different perspective. We show that, in the noiseless case, under certain conditions, the PSD constraint by itself is sufficient to lead to a unique rank-$r$ matrix recovery, without implicit or explicit low-rank regularization. I.e., under assumptions, the set of PSD matrices that are consistent with the observed data is a singleton, irrespective of the algorithm used.

Preprint. Work in progress.

## 1 Introduction

Deep neural networks (DNNs) are hard to train in theory [blum1989training]. Nonetheless, they have led to the recent success of machine learning and artificial intelligence in real-life applications [krizhevsky2012imagenet, simonyan2014very, girshick2015fast, long2015fully, goodfellow2016deep]. This antithesis has sparked the interest of the algorithmic research community towards better understanding how training algorithms generate models that generalize well on unseen data [livni2014computational, cohen2016expressive, raghu2016expressive, daniely2016toward, poole2016exponential]. A characteristic example is the interpretation that (stochastic) gradient descent injects regularization in optimization, and asymptotically converges to the minimum norm solution (under assumptions) [zhang2016understanding, poggio2017theory]. The latter relates to the maximum margin solution that guarantees good classification error [poggio2017theory]; see [neyshabur2014search, hardt2016train, neyshabur2017exploring].

In this work, we study how over-parameterization relates to regularization [zhang2016understanding]. By over-parameterization, we mean that the number of parameters to estimate is larger than the available data, thus leading to an under-determined system. (It helps to picture over-parameterization via a simple linear system of equations: when the number of parameters exceeds the number of equations, there is an infinite number of solutions, and which one we choose depends on an additional regularization bias.) DNNs are usually designed over-parameterized, with an ever growing number of layers and, eventually, a larger number of parameters [telgarsky2016benefits]. What is surprising, though, is the lack of overfitting in such networks: while there could be many different parameter realizations that lead to zero training error, the algorithms select models that also generalize well to unseen data, despite over-parameterization [keskar2016large, poggio2017theory, soltanolkotabi2017theoretical, dinh2017sharp, cooper2018loss].

The authors of [li2017algorithmic] show that the success of over-parameterization can be theoretically fleshed out in the context of shallow, linear neural networks. They consider the case of low-rank and positive semi-definite (PSD) factorization in matrix sensing [recht2010guaranteed]: given measurements $b = \mathcal{A}(X^\star)$, where $X^\star$ has rank $r$ and is PSD, and $\mathcal{A}$ satisfies the restricted isometry property, they prove that a square and full-rank factored gradient descent algorithm over $U \in \mathbb{R}^{n \times n}$, where $X = UU^\top$, converges to $X^\star$. I.e., whereas the algorithm has the expressive power to find any matrix $X$ that is consistent with the noiseless data (and, due to over-parameterization, there are infinitely many such $X$'s), it automatically converges to the minimum-rank solution. This argument was previously conjectured in [gunasekar2017implicit].

This could be seen as a first step towards understanding over-parameterization in general non-linear models, whose objectives are more involved and complex. Such network simplifications have been followed in other recent works in machine learning and theoretical computer science, such as in convolutional neural networks [du2017convolutional], and landscape characterization of generic objectives [baldi1989neural, boob2017theoretical, safran2017spurious].

In this short note, we provide a different perspective on the interpretation of over-parameterization in matrix sensing. We show that, in the noiseless case, the PSD constraint by itself could be sufficient to lead to a unique matrix recovery from observations, without the use of implicit or explicit low-rank regularization. In other words, the set of PSD matrices that satisfy the measurements is a singleton, irrespective of the algorithm used.

Section 2 describes the closely related problem of finding the sparsest non-negative vector that satisfies a set of linear equations. Based on Section 2, Section 3 non-trivially extends these ideas to the case of recovering a low-rank and PSD matrix from linear measurements, as defined in matrix sensing. This section contains the main theory of the paper: under RIP assumptions and some proper and allowable transformations of the sensing map, there is only one PSD matrix that could generate the data; given that the observations are generated from a low-rank matrix, this is the only feasible solution. In Section 4, we show some experimental results that justify our main arguments.

#### Notation and background.

Vectors are denoted with plain lower-case letters; matrices are denoted with capital letters; and mappings, from one Euclidean space to another, are denoted with capital calligraphic letters. Given $x \in \mathbb{R}^n$, its $\ell_2$-norm is defined as $\|x\|_2 = \left(\sum_i x_i^2\right)^{1/2}$, where $x_i$ denotes its $i$-th entry; similarly, we define the $\ell_1$-norm as $\|x\|_1 = \sum_i |x_i|$. The $\ell_0$-pseudonorm, $\|x\|_0$, is defined as the number of non-zero entries in $x$. Given $x \in \mathbb{R}^n$, $\text{diag}(x)$ is the diagonal matrix with the entries of $x$ on its diagonal. For two matrices $A, B$ with appropriate dimensions, we define their inner product as $\langle A, B \rangle = \text{Tr}(A^\top B)$, where $\text{Tr}(\cdot)$ is the trace operator. Given $X \in \mathbb{R}^{n \times n}$, the nuclear norm is defined as $\|X\|_* = \sum_i \sigma_i(X)$, where $\sigma_i(X)$ is the $i$-th singular value.

## 2 Nonnegativity and sparsity: the vector analog of PSD and low rankness

We briefly describe the work of [bruckstein2008uniqueness], as we borrow ideas from that paper. Consider the problem of finding a non-negative, sparse solution to an over-parameterized linear system of equations $b = Ax$. Here, the "sensing" matrix $A$ lives in $\mathbb{R}^{m \times n}$, where $m < n$; the unknown $x^\star$ satisfies $x^\star \geq 0$ (entrywise) and is sufficiently sparse, $\|x^\star\|_0 \leq k$; and the measurements are $b = Ax^\star \in \mathbb{R}^m$.

This scenario suggests the following optimization problem as a solution:

$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{subject to} \quad b = Ax \;\text{ and }\; x \geq 0. \tag{1}$$

Here, $f$ is a function metric that measures the quality of the candidate solutions. Examples are $f(x) = \|x\|_2$ (i.e., the minimum-norm solution that satisfies the constraints), $f(x) = \|x\|_1$ (i.e., the solution that has small $\ell_1$-norm, which promotes sparsity), and $f(x) = \|x\|_0$ (i.e., the solution with the smallest number of non-zeros). These tasks have been encountered in statistics, computer vision and signal processing applications [zass2007nonnegative, shashua2005non, hazan2005sparse], and they are strongly correlated with the compressed sensing literature [donoho2006compressed, foucart2013mathematical], when $x^\star$ is assumed sparse.

Let us disregard for the moment the positivity constraints on $x$. By definition, an over-parameterized linear inverse problem has an infinite number of solutions. Unless we use the information that $x^\star$ is sparse, its reconstruction using only $A$ and $b$ is an ill-posed problem, and there is no hope of finding the true vector without ambiguity. I.e., the cardinality of the feasibility set, $\{x : Ax = b\}$, is infinite.

Therefore, to reconstruct $x^\star$ in an over-parameterized setting, prior knowledge should be exploited by the optimization solver. Compressed sensing is an example where additional constraints restrict the feasible set to a singleton: under proper assumptions on the sensing matrix (such as the restricted isometry property [candes2008restricted], or the coherence property [bruckstein2008uniqueness]), and assuming a sufficient number of measurements $m$, one can show that the feasible set $\{x : Ax = b, \, \|x\|_0 \leq k\}$ contains only one element, for sufficiently small $k$.

Re-inserting the positivity constraints into our discussion, [bruckstein2008uniqueness] show that, when a sufficiently sparse $x^\star \geq 0$ generates $b$, and assuming the row span of $A$ intersects the positive orthant, the non-negativity constraint by itself is sufficient to identify the sparse $x^\star$, and reduce the cardinality of the feasible set to a singleton.

###### Theorem 1 ([bruckstein2008uniqueness])

Assume there exists $h \in \mathbb{R}^m$ such that $w := A^\top h > 0$ entrywise; this also implies that $h^\top b = c$, for some positive constant $c$. Define $W := \text{diag}(w)$ and $D := AW^{-1}$. Then, there is a one-to-one correspondence between the feasibility sets:

$$\{x : Ax = b \;\text{ and }\; x \geq 0\} \quad \text{and} \quad \{z : Dz = b, \; \|z\|_1 = c \;\text{ and }\; z \geq 0\},$$

where $z = Wx$. Further, focusing on the latter set, assume $D$ satisfies standard regularity conditions, such as incoherence [bruckstein2008uniqueness] or the restricted isometry property [recht2010guaranteed]. Then, if $z^\star$ is a solution to the linear system $Dz = b$ with sufficiently small sparsity, $z^\star$ is the unique solution, i.e., the feasible set is a singleton. Hence, the same holds for the original set $\{x : Ax = b \text{ and } x \geq 0\}$.

In other words, the inclusion of a sparsity-inducing $f$ in (1) is not needed, even if we know a priori that $x^\star$ is sparse; non-negativity is sufficient to find a unique solution to the feasibility problem:

$$\text{find } x \quad \text{such that} \quad b = Ax \;\text{ and }\; x \geq 0,$$

that matches $x^\star$. This way, we can still use convex optimization solvers (linear programming in this particular case) and avoid hard non-convex problem instances.
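The phenomenon above can be checked numerically. The sketch below uses illustrative, hypothetical parameters: a Gaussian sensing matrix whose first row is forced to be all-ones, so that $h = e_1$ certifies the positive row-span condition. It then solves the pure feasibility problem with a zero objective; with no sparsity-promoting term at all, the linear program still recovers the sparse non-negative $x^\star$:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 15, 30, 2                       # under-determined: m < n

A = rng.standard_normal((m, n))
A[0] = 1.0                                # h = e_1 gives h^T A = 1^T > 0: positive row span
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.uniform(1, 2, size=k)
b = A @ x_true                            # noiseless measurements of a sparse x* >= 0

# Pure feasibility: zero objective, only b = Ax and x >= 0 (no sparsity term).
res = linprog(c=np.zeros(n), A_eq=A, b_eq=b, bounds=(0, None), method="highs")
x_hat = res.x

print(np.max(np.abs(x_hat - x_true)))     # tiny: the feasible set is (empirically) a singleton
```

The solver is free to return any feasible point; recovering $x^\star$ exactly is consistent with the feasible set being a singleton for these dimensions.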

## 3 The matrix sensing problem for PSD matrices

Let us now describe the matrix sensing problem, draw the connections with the vector case, and study the over-parameterization $X = UU^\top$, for $U \in \mathbb{R}^{n \times p}$ with $p \geq r$. Following [li2017algorithmic], we consider the PSD-constrained case, where the optimum solution is both low-rank and PSD.

Roughly, matrix sensing is a linear system of equations over matrices. It is derived by the generative model $b = \mathcal{A}(X^\star) \in \mathbb{R}^m$, where $X^\star \in \mathbb{R}^{n \times n}$ is the low-rank, PSD ground truth. Let the true rank of $X^\star$ be $r$. The mapping $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ is such that the $i$-th entry of $\mathcal{A}(X)$ is given by $(\mathcal{A}(X))_i = \langle A_i, X \rangle$, for independently drawn symmetric measurement matrices $A_i \in \mathbb{R}^{n \times n}$; more details on these matrices in the sections to follow.

We study the PSD-constrained formulation, where we aim to find $X^\star$ via:

$$\min_{X \in \mathbb{R}^{n \times n}} \; f(X) \quad \text{subject to} \quad b = \mathcal{A}(X), \; X \succeq 0. \tag{2}$$

Here, $f$ again represents a function metric that promotes low-rankness; standard choices include the nuclear norm $\|X\|_*$ (which imposes "sparsity" on the set of singular values, and hence low-rankness), and the non-convex $\text{rank}(X)$ metric.

Practical methods for this scenario include the PSD-constrained basis pursuit algorithm for matrices [chen2001atomic, goldstein2010high], which solves (2) for $f(X) = \|X\|_*$ using interior-point methods [liu2009interior]; and projected gradient descent algorithms, which solve an equivalent form of (2) for wisely chosen $r$:

$$\min_{X \in \mathbb{R}^{n \times n}} \; g(X) := \tfrac{1}{2}\|b - \mathcal{A}(X)\|_2^2 \quad \text{subject to} \quad X \in \mathcal{C}, \tag{3}$$

for a constraint set $\mathcal{C}$ that encodes low-rankness and the PSD structure [kyrillidis2011recipes, kyrillidis2014matrix, khanna2017iht]. In the latter, the objective that favors low-rankness appears in the constraint set, as $\text{rank}(X) \leq r$ or $\|X\|_* \leq \lambda$.

Recently, we have witnessed a series of works [zhao2015nonconvex, park2016provable, park2016non, sun2016guaranteed, bhojanapalli2016global, park2016finding, kyrillidis2017provable, ge2017no, hsieh2017non] that operate directly on the factorization $X = UU^\top$, and do not include any PSD or rank constraints. This is based on the observation that, for any rank-$r$ and PSD $X$, the factorization $X = UU^\top$, for $U \in \mathbb{R}^{n \times r}$, guarantees that $X$ is at the same time PSD and at most rank-$r$. This re-parameterizes (2) as:

$$\text{find } U \in \mathbb{R}^{n \times r} \quad \text{subject to} \quad b = \mathcal{A}(UU^\top),$$

and (3) as:

$$\min_{U \in \mathbb{R}^{n \times r}} \; g(UU^\top) := \tfrac{1}{2}\|b - \mathcal{A}(UU^\top)\|_2^2.$$

Observe that, in both cases, there are no metrics that explicitly favor low-rankness, nor any PSD constraints; these are implicitly encoded by the factorization $UU^\top$. Algorithmic solutions for the above criteria include factored gradient descent [bhojanapalli2016dropping, park2016finding], which obeys the following recursion:

$$U_{i+1} = U_i - \eta \nabla g(U_i U_i^\top) \cdot U_i. \tag{4}$$

Current theory [bhojanapalli2016dropping, park2016finding] assumes that $r$ is known a priori, in order to set the dimensions of the factor $U \in \mathbb{R}^{n \times r}$ accordingly. The only work that deviates from this perspective is the recent work in [li2017algorithmic], where the authors prove that even a square $U \in \mathbb{R}^{n \times n}$ in (4) still converges to the low-rank ground truth $X^\star$, with proper initialization and step size selection. The result relies on restricted isometry assumptions on $\mathcal{A}$; see below for a formal definition. In a sense, this suggests that, operating on the factorized space, the algorithm implicitly favors low-rank solutions, even if it has the expressive power to select a full rank-$n$ matrix as a solution. The following subsection provides a different perspective on the matter: the implicit PSD constraint in $UU^\top$ could be sufficient to reduce the feasibility set to a singleton, irrespective of the algorithm used.
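A minimal numerical sketch of recursion (4) with a square factor, in the spirit of [li2017algorithmic], is given below. The sensing ensemble, step size, initialization scale and iteration count are illustrative choices, not the exact prescriptions of that work:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 10, 1, 40                        # m << n^2: over-parameterized sensing

# Symmetric Gaussian sensing matrices, scaled for approximate isometry.
As = np.array([(G + G.T) / (2 * np.sqrt(m)) for G in rng.standard_normal((m, n, n))])
A = lambda X: np.einsum('ijk,jk->i', As, X)          # (A(X))_i = <A_i, X>
A_adj = lambda v: np.einsum('i,ijk->jk', v, As)      # adjoint: sum_i v_i A_i

W = rng.standard_normal((n, r))
X_true = W @ W.T / np.linalg.norm(W @ W.T)           # rank-r, PSD, ||X*||_F = 1
b = A(X_true)

# Recursion (4) with a SQUARE factor U (p = n) and small initialization.
U = 0.01 * np.eye(n)
eta = 0.05
for _ in range(10000):
    grad_X = A_adj(A(U @ U.T) - b)                   # gradient of g at X = U U^T
    U = U - eta * grad_X @ U

X_hat = U @ U.T
print(np.linalg.norm(X_hat - X_true))                # small, despite p = n
```

Despite having the expressive power to output any full-rank PSD matrix, the iterates land near the rank-$r$ ground truth, which is consistent with both the implicit-regularization reading and the singleton-set reading discussed in this note.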

### 3.1 When positivity constraints are sufficient for unique recovery under RIP

Key to our discussion is the restricted isometry property of linear maps on low-rank matrices [candes2011tight, liu2011universal]:

###### Definition 1 (Restricted Isometry Property (RIP))

A linear map $\mathcal{F} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ satisfies the $r$-RIP with constant $\delta_r \in (0, 1)$, if

$$(1 - \delta_r)\|X\|_F^2 \leq \|\mathcal{F}(X)\|_2^2 \leq (1 + \delta_r)\|X\|_F^2$$

is satisfied for all matrices $X \in \mathbb{R}^{n \times n}$ such that $\text{rank}(X) \leq r$.
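Definition 1 can be probed numerically. The sketch below draws random rank-$r$ test matrices and checks that $\|\mathcal{F}(X)\|_2^2 / \|X\|_F^2$ concentrates around 1 for a symmetric Gaussian ensemble (with an illustrative normalization); note that this probes typical-case concentration only, not the uniform guarantee over all low-rank matrices that the RIP requires:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m = 12, 2, 300                       # generous m, so concentration is visible

# Symmetric Gaussian measurement matrices, scaled so E ||F(X)||_2^2 = ||X||_F^2.
As = np.array([(G + G.T) / (2 * np.sqrt(m)) for G in rng.standard_normal((m, n, n))])
F = lambda X: np.einsum('ijk,jk->i', As, X)          # (F(X))_i = <A_i, X>

ratios = []
for _ in range(100):
    W = rng.standard_normal((n, r))
    X = W @ W.T                                       # random rank-r PSD test matrix
    ratios.append(np.linalg.norm(F(X))**2 / np.linalg.norm(X, 'fro')**2)

print(min(ratios), max(ratios))                       # both close to 1
```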

###### Corollary 1 ([needell2009cosamp])

Let $r \leq r'$ be positive integers. Then, $\delta_r \leq \delta_{r'}$.

We note that the RIP assumption is made in [zhao2015nonconvex, park2016provable, park2016non, sun2016guaranteed, bhojanapalli2016global, park2016finding, kyrillidis2017provable, ge2017no, hsieh2017non] and [li2017algorithmic].

Here, we extend the results in the previous section, and prove that, under appropriate conditions, the set of solutions is a singleton.

Consider the sensing map $\mathcal{A}$ with matrices $A_i \in \mathbb{R}^{n \times n}$, drawn from some probability distribution, for $i = 1, \dots, m$, where $m$ is the total number of measurements. Further, suppose that the span of $\{A_i\}_{i=1}^m$ is strictly positive; that is:

$$\exists\, \varphi = [\varphi_1, \varphi_2, \dots, \varphi_m]^\top \quad \text{such that} \quad \sum_{i=1}^m \varphi_i A_i = B \in \mathbb{R}^{n \times n} \quad \text{and} \quad B \succ 0. \tag{5}$$

Since $B \succ 0$, there exists a full-rank $V \in \mathbb{R}^{n \times n}$ such that $B = VV^\top$.

Next, we make the following change of variables. Given full-rank $V$, and for each $A_i$, we define the mapping $\mathcal{M} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$, such that:

$$(\mathcal{M}(X))_i = \langle M_i, X \rangle = \langle V^{-1} A_i (V^{-1})^\top, X \rangle, \quad \text{for all } i,$$

where $M_i := V^{-1} A_i (V^{-1})^\top$. Given $X$ and $V$, define the auxiliary variable $Y := V^\top X V$; observe that, for full-rank $V$, $Y \succeq 0$ if and only if $X \succeq 0$. Then, for any $X$, we have:

$$b_i = (\mathcal{A}(X))_i = \langle A_i, X \rangle = \langle A_i, XVV^{-1} \rangle = \langle A_i (V^{-1})^\top, XV \rangle = \langle A_i (V^{-1})^\top, (V^\top)^{-1} V^\top X V \rangle = \langle V^{-1} A_i (V^{-1})^\top, V^\top X V \rangle = \langle M_i, Y \rangle = (\mathcal{M}(Y))_i,$$

where the last equality is due to the definitions of $M_i$ and $Y$. For the rest of the discussion, we assume that $b = \mathcal{A}(X^\star)$, for a rank-$r$, PSD $X^\star$.

The above indicates the one-to-one correspondence between the original feasibility set and the corresponding set after the change of variables:

$$\{X \in \mathbb{R}^{n \times n} : b = \mathcal{A}(X), \, X \succeq 0\} \quad \text{and} \quad \{Y \in \mathbb{R}^{n \times n} : b = \mathcal{M}(Y), \, Y \succeq 0\}. \tag{6}$$

Further, the ranks of the corresponding solutions, $X$ and $Y = V^\top X V$, are the same. After the change of variables, for $X$ and $Y$ that belong to the above sets, we observe:

$$\text{Tr}(Y) \overset{(i)}{=} \text{Tr}(V^\top X V) = \text{Tr}(X V V^\top) = \text{Tr}(XB) \overset{(ii)}{=} \text{Tr}\Big(X \sum_{i=1}^m \varphi_i A_i\Big) = \sum_{i=1}^m \varphi_i \cdot \langle A_i, X \rangle \overset{(iii)}{=} \sum_{i=1}^m \varphi_i \cdot b_i := c, \quad \text{for constant } c.$$

Here, $(i)$ is due to the definition of $Y$, $(ii)$ is due to the assumption (5) that the span of $\{A_i\}$ is strictly positive with $B = VV^\top$, and $(iii)$ is due to $\langle A_i, X \rangle = b_i$, for $X$ in the feasibility set. This dictates that the trace of the matrices in the set $\{Y : b = \mathcal{M}(Y), \, Y \succeq 0\}$ is constant and does not depend on $Y$ directly; it only depends on the measurement vector $b$ and the vector $\varphi$ defined above.
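The trace identity above is exact linear algebra and easy to verify numerically. In the sketch below, the last matrix $A_m$ is constructed so that condition (5) holds by design (an illustrative construction, not a recipe for practical sensing maps):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 10

# Choose phi and a full-rank V first, then force sum_i phi_i A_i = B = V V^T > 0
# by solving for the last matrix A_m (purely for illustration).
phi = rng.uniform(0.5, 1.5, m)
V = rng.standard_normal((n, n)) + 3 * np.eye(n)       # full rank w.h.p.
B = V @ V.T                                           # B is positive definite
As = [(G + G.T) / 2 for G in rng.standard_normal((m - 1, n, n))]
As.append((B - sum(p * A for p, A in zip(phi[:-1], As))) / phi[-1])

# Any feasible PSD X and its measurements b_i = <A_i, X>.
W = rng.standard_normal((n, 2))
X = W @ W.T
b = np.array([np.trace(A @ X) for A in As])

# Change of variables Y = V^T X V: Tr(Y) = phi^T b, regardless of which feasible X was picked.
Y = V.T @ X @ V
print(np.trace(Y), phi @ b)                           # the two values coincide
```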

Let us focus on the set $\{Y \in \mathbb{R}^{n \times n} : b = \mathcal{M}(Y), \, Y \succeq 0\}$. By definition, $b = \mathcal{M}(Y^\star)$, where $Y^\star = V^\top X^\star V$ is rank-$r$ and relates one-to-one to $X^\star$. Assume that $\mathcal{M}$ is a linear map that satisfies the RIP in Definition 1, with constant $\delta_{5r} < 1/10$. Consider the convex optimization criterion with estimate $\hat{Y}$:

$$\hat{Y} = \underset{Y \in \mathbb{R}^{n \times n}}{\arg\min} \; \|Y\|_* \quad \text{subject to} \quad b = \mathcal{M}(Y). \tag{7}$$

The following result is from [recht2010guaranteed].

###### Theorem 2 (Theorems 3.2 and 3.3 in [recht2010guaranteed])

Assume $\mathcal{M}$ satisfies the RIP with $\delta_{2r} < 1$ for some integer $r$. Then, $Y^\star$ is the only matrix of rank at most $r$ that satisfies the set of observations $b = \mathcal{M}(Y^\star)$. Moreover, if $\mathcal{M}$ satisfies $\delta_{5r} < 1/10$, which is a stricter condition according to Corollary 1, then $\hat{Y} = Y^\star$.

Let us interpret and use this theorem. Assume that $b = \mathcal{M}(Y^\star)$ for a rank-$r$, PSD $Y^\star$, and that $\delta_{5r} < 1/10$. Under these assumptions, the minimizer of (7) is identical to the unique rank-$r$ matrix that satisfies the set of observations $b$. Taking into account the PSD nature of $Y^\star$, we have $\|Y^\star\|_* = \text{Tr}(Y^\star) = c$. Note that we do not include the constraint $\text{Tr}(Y) = c$ in (7), because any feasible solution should satisfy this condition (see above). Also, we do not include the PSD constraint; the problem (7) is sufficient to guarantee uniqueness.

By the above theorem, any other PSD solution $Y$ that satisfies the measurements must have a nuclear norm larger than $\|Y^\star\|_* = c$. Being PSD, this also means $\text{Tr}(Y) > c$, which implies that any other PSD solution is not in the feasible set $\{Y : b = \mathcal{M}(Y), \, Y \succeq 0\}$, whose elements all have trace $c$. Hence, this set contains only one element, by contradiction.

Due to the one-to-one correspondence between the sets in (6), we infer that the first set is also a singleton. This further implies that the inclusion of any metric that favors low-rankness in (2)-(3), or the restriction of $U$ to be a tall matrix with wisely chosen $r$ in (4), makes no difference, as there is only one PSD matrix that fits the measurements $b$.

A key assumption is that the RIP holds for the transformed sensing map $\mathcal{M}$ (and not the original sensing map $\mathcal{A}$) with constant $\delta_{5r} < 1/10$. Thus, in general, it is required to find such a transformation between $\mathcal{A}$ and $\mathcal{M}$.

###### Remark 1

The above show that the RIP assumption on $\mathcal{M}$ is a sufficient, but not a necessary, condition to guarantee that the feasibility set is a singleton. It remains an open question to find necessary conditions, and possibly different sufficient conditions (such as the incoherence condition in [bruckstein2008uniqueness] for matrices, combined with different compressed sensing results), that also lead to a singleton set. In [baldwin2015informational], the authors find particular instances of sensing maps that, while not satisfying the RIP condition, lead to a singleton set.

Given Remark 1, in the next subsection we describe a sensing map used in quantum information theory for quantum state tomography.

### 3.2 An example of a sensing map in quantum state tomography

While the assumptions and theory above are not directly verified for the sensing maps $\mathcal{A}$ used in practice, here we describe a sensing map that satisfies (5), and is based on a set of sensing matrices that satisfy the restricted isometry property. Particularly, we will define a positive-operator valued measure (POVM), used in quantum state tomography (QST). One can think of a POVM [helstrom1969quantum, aaronson2007learnability] as a sensing map that contains a set of matrices that form a resolution of the identity matrix. I.e.,

$$\mathcal{A} \;\text{ contains matrices in some set }\; \Big\{A_i \in \mathbb{C}^{n \times n} : \sum_i A_i = I\Big\}.$$

Let us describe QST in more detail: we are interested in the recovery of a low-rank $q$-qubit state $\rho \in \mathbb{C}^{2^q \times 2^q}$, where $\rho$ is PSD and normalized, $\text{Tr}(\rho) = 1$, from measuring expectation values of $q$-qubit POVM elements $\{A_i\}$. This translates into a measurement vector $b$, whose elements represent the probabilities of the possible outcomes of the measurement. The probability of measuring the outcome associated with $A_i$ is given by the inner product $b_i = \langle A_i, \rho \rangle$. We denote the corresponding sensing map by $\mathcal{A}$, such that $(\mathcal{A}(\rho))_i = \langle A_i, \rho \rangle$, for $i = 1, \dots, m$.

A possible realization of the $A_i$'s is based on $q$-qubit Pauli observables. In particular, we define $P_i = \sigma_{i_1} \otimes \sigma_{i_2} \otimes \cdots \otimes \sigma_{i_q}$, where $\otimes$ denotes the Kronecker product [aaronson2007learnability, rocchetto2017stabiliser]. Each $\sigma_{i_j}$ is a $2 \times 2$ matrix from the set:

$$\sigma_I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \sigma_x = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \sigma_y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, \quad \sigma_z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix},$$

where $i$ denotes the imaginary unit.
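The construction of the $q$-qubit Pauli observables is a direct Kronecker product. The sketch below builds all $4^q$ of them and checks two standard properties, hermiticity and trace-orthogonality $\langle P_i, P_j \rangle = 2^q \delta_{ij}$:

```python
import numpy as np
from functools import reduce
from itertools import product

# The four 2x2 Pauli matrices.
s_I = np.eye(2, dtype=complex)
s_x = np.array([[0, 1], [1, 0]], dtype=complex)
s_y = np.array([[0, -1j], [1j, 0]], dtype=complex)
s_z = np.array([[1, 0], [0, -1]], dtype=complex)

q = 3                                       # number of qubits; dimension 2^q = 8
paulis = [reduce(np.kron, combo) for combo in product([s_I, s_x, s_y, s_z], repeat=q)]

print(len(paulis))                          # 4^q = 64 observables in total

P, Q = paulis[5], paulis[9]                 # two distinct Pauli strings
print(np.allclose(P, P.conj().T))           # Hermitian: True
print(abs(np.trace(P.conj().T @ Q)))        # 0 for distinct strings
print(np.trace(P.conj().T @ P).real)        # 2^q = 8 for identical strings
```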

There are $4^q$ possible combinations of such Kronecker products in total. In general, one needs the expectation values of all $4^q$ observables to uniquely reconstruct $\rho$; this is full quantum state tomography. However, since here we look for pure states (i.e., rank-1 quantum states), we can apply the compressed sensing result [gross2010quantum, liu2011universal], which guarantees the RIP for a collection of randomly chosen Pauli observables, with high probability:

###### Lemma 1 (Informal - RIP for Pauli measurements [liu2011universal])

With high probability over the choice of $m = O(r \cdot 2^q \cdot \text{poly}(q))$ Pauli observables $P_i$, the collection of them satisfies the $r$-RIP with a small constant $\delta_r$.

The above indicates that the collection of Pauli observables $P_i$ satisfies the RIP; not the POVM matrices $A_i$ that are used in practice. We conjecture that the $A_i$'s still have nice properties that lead to unique recovery of $\rho$ from $b$ using only PSD constraints, as we show empirically in Section 4; see also Remark 1.

In terms of the span of $\{A_i\}$ being strictly positive, we identify that, if we select $\varphi = [1, 1, \dots, 1]^\top$, i.e., the all-ones vector, then

$$B = \sum_{i=1}^m \varphi_i A_i = \sum_{i=1}^m A_i = I,$$

by the definition of the POVM. This further means that $B = VV^\top$ for $V = I$. Thus, $M_i = A_i$ for all $i$, and $Y = X$. The above indicate that $\mathcal{M} \equiv \mathcal{A}$.

## 4 Experiments

In our experiments, our aim is to show that the sensing map in Subsection 3.2 is sufficient to lead to a good approximation of $\rho^\star$, without the use of explicit regularization for low-rankness. Moreover, the experiments show that different objectives and algorithms lead to the same solution, suggesting the uniqueness of the solution.

### 4.1 Different criteria for QST; the same estimated solution

In this subsection, we consider the following setting: we generate measurements according to $b_i = \langle A_i, \rho^\star \rangle$, where the $A_i$'s are based on randomly generated Pauli observables. In all settings, for simplicity, we assume $\rho^\star$ is rank-1, PSD and normalized, $\text{Tr}(\rho^\star) = 1$, to satisfy the QST setting.

Given $b$ and $\mathcal{A}$, we consider the following optimization criteria:

$$\hat{X} \in \underset{X}{\arg\min} \; \|X\|_* \;\;\text{s.t.}\;\; b = \mathcal{A}(X); \qquad \hat{X} \in \underset{X}{\arg\min} \; \|X\|_F \;\;\text{s.t.}\;\; b = \mathcal{A}(X); \qquad \hat{X} \in \underset{X \succeq 0}{\arg\min} \; \tfrac{1}{2}\|b - \mathcal{A}(X)\|_2^2. \tag{8}$$

I.e., the left criterion is the nuclear-norm minimization problem, with explicit regularization towards low-rank solutions [recht2010guaranteed]; the middle criterion is the minimum-norm solution problem, where the objective regularizes towards the solution with the minimum Frobenius norm; the right criterion is the PSD-constrained least-squares problem, where the task is to fit the data subject to PSD constraints. Observe that, in the two latter settings, there is no explicit regularization towards low-rank solutions.

We use the CVX Matlab implementation, in its low-precision setting, to solve all problems in (8) [gb08, cvx]. The results are presented in Table 1, which reports the entrywise distance between the estimate $\hat{X}$ and $\rho^\star$. Since the estimates for all criteria in (8) are only approximately low-rank (due to numerical precision limits, none of the solutions is rank-1 in the strict sense), we also report the entrywise distance between $\rho^\star$ and the best rank-1 approximation of $\hat{X}$. We consider four different parameter settings; our experiments are restricted to small problem dimensions, due to the high computational complexity of the CVX solvers (by default, we use the SDPT3 solver [toh1999sdpt3]). Note that this is a second-order algorithm; we consider first-order methods later in the text.

The results in Table 1 support our claim: all three criteria, in all cases, lead to the same solution, while they all use different "regularization" in optimization. Any small differences can be attributed to numerical precision and to non-identical initial conditions for each problem execution. We also observe consistently that, using the nuclear-norm bias, we obtain a slightly better approximation of $\rho^\star$; thus, explicit regularization still helps numerically. In summary, there are cases of over-parameterized matrix sensing where the minimum nuclear-norm solution, the minimum Frobenius-norm solution and the least-squares solution coincide, suggesting that the feasibility set is a singleton.
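The right-most criterion in (8) can also be solved with a simple first-order scheme: projected gradient descent on the least-squares objective over the PSD cone. The sketch below uses a symmetric Gaussian ensemble as an illustrative stand-in for the Pauli-based map, with hypothetical dimensions, step size and iteration count; with no rank regularization anywhere, the iterates still approach the rank-1 ground truth, consistent with the singleton claim:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, m = 12, 1, 60                        # m < n(n+1)/2 = 78: under-determined

As = np.array([(G + G.T) / (2 * np.sqrt(m)) for G in rng.standard_normal((m, n, n))])
A = lambda X: np.einsum('ijk,jk->i', As, X)          # (A(X))_i = <A_i, X>
A_adj = lambda v: np.einsum('i,ijk->jk', v, As)      # adjoint: sum_i v_i A_i

W = rng.standard_normal((n, r))
X_true = W @ W.T / np.linalg.norm(W @ W.T)           # rank-1, PSD, ||X*||_F = 1
b = A(X_true)

def proj_psd(X):
    # Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues.
    w, V = np.linalg.eigh(X)
    return (V * np.maximum(w, 0)) @ V.T

# Projected gradient descent on 1/2 ||b - A(X)||_2^2 over {X >= 0}; no rank constraint.
X = np.zeros((n, n))
for _ in range(10000):
    X = proj_psd(X - 0.2 * A_adj(A(X) - b))

print(np.linalg.norm(X - X_true))                    # small: the PSD constraint alone pins down X*
```

This is a convex problem, so any convergent solver finds a minimizer; that the minimizer matches the rank-1 ground truth is the empirical signature of the singleton feasibility set.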

### 4.2 Behavior of first-order, non-convex solvers on UU⊤ parameterization

In view of the previous results, here we study the behavior of first-order, non-convex solvers that utilize the re-parameterization of $X$ as $UU^\top$. We borrow the iteration in [bhojanapalli2016dropping, park2016finding]:

$$U_{i+1} = U_i - \eta \nabla g(U_i U_i^\top) \cdot U_i, \tag{9}$$

for $U_i \in \mathbb{R}^{n \times p}$. We consider two cases: $(i)$ $p = r$, where $r$ is the rank of $X^\star$ and is assumed known a priori; this is the case in [bhojanapalli2016dropping, park2016finding] and has explicit regularization, as the algorithm operates only on the space of (at most) rank-$r$ matrices; $(ii)$ $p = n$, where (9) has the freedom to operate over the whole space $\mathbb{R}^{n \times n}$; this is the case studied in [li2017algorithmic].

In both cases, the initialization and step size in (9) follow the prescriptions in [park2016finding], and they are computed using the same procedures for both cases. Table 2 reports our findings. To ease comparison, we repeat the results of the least-squares objective in (8). We observe that all algorithms converge close to $X^\star$: as expected, using the a priori information that $X^\star$ is rank-1 biases the iterates towards a low-rank estimate, and faster convergence rates are observed. On the contrary, using $p = n$ shows slower convergence towards the vicinity of $X^\star$; nevertheless, the reported results suggest that one can still achieve a small distance to $X^\star$.

Finally, while the estimate $\hat{U}\hat{U}^\top$ could even be full-rank, most of its energy is contained in a small number of principal components, indicating that all algorithms favor (approximately) low-rank solutions.

## 5 Related work

#### Implicit regularization in matrix sensing.

This area was initiated by the conjecture in [gunasekar2017implicit]: the authors suggest that non-convex gradient descent on a full-dimensional factorization $X = UU^\top$, where $U \in \mathbb{R}^{n \times n}$, converges to the minimum nuclear-norm solution. The work in [li2017algorithmic] sheds light on this conjecture: the authors theoretically explain the regularization inserted by algorithms, even beyond learning matrix factorization models, such as one-hidden-layer neural nets with quadratic activations; see also [du2018power].

#### Implicit regularization beyond matrix sensing.

For the general linear regression setting, [wilson2017marginal] shows that, under specific assumptions, adaptive gradient methods, like AdaGrad and Adam, converge to a different solution than the simple (stochastic) gradient descent (SGD); see also [gunasekar2018characterizing]. SGD has been shown to converge to the so-called minimum norm solution; see also [soudry2017implicit] for the case of logistic regression. This behavior is also demonstrated using DNNs in [wilson2017marginal], where simple gradient descent generalizes at least as well as the adaptive methods.

[keskar2016large, hoffer2017train] study the impact of mini-batch size in stochastic gradient descent w.r.t. the generalization of the trained classifier. To the best of our knowledge, only the work in [brutzkus2017sgd] provably shows that SGD converges to favorable minima of a shallow two-layer over-parameterized neural network.

## 6 Conclusions

In this manuscript, we provide theoretical and practical evidence that, in PSD low-rank matrix sensing, the solution set is a singleton, under RIP assumptions and appropriate transformations of the sensing map $\mathcal{A}$. In these cases, the PSD constraint by itself provides guarantees for unique matrix recovery. The question of whether the above can be generalized to less restrictive linear sensing maps remains open: note that the RIP is a sufficient but not a necessary condition; we believe that generalizing our work to broader settings and assumptions is a direction worth pursuing. Finally, finding a specific $\mathcal{A}$ that simultaneously satisfies all the conditions required for our theory to hold is an interesting research direction.

## References
