Stationary Points of Shallow Neural Networks with Quadratic Activation Function
We consider the problem of learning shallow noiseless neural networks with quadratic activation function and planted weights $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d$ is the dimension of the data, which consists of centered i.i.d. coordinates with second moment $\mu_2$ and fourth moment $\mu_4$. We provide an analytical formula for the population risk $\mathcal{L}(W)$ of any $W\in\mathbb{R}^{m\times d}$ in terms of $\mu_2$, $\mu_4$, and the distance of $W$ from $W^*$. We establish that the landscape of the population risk $\mathcal{L}(W)$ admits an energy barrier separating rank-deficient solutions: if $W\in\mathbb{R}^{m\times d}$ with $\mathrm{rank}(W)<d$, then $\mathcal{L}(W)$ is bounded away from zero by an amount we quantify. We then establish that all full-rank stationary points of $\mathcal{L}(\cdot)$ are necessarily global optima. These two results suggest a simple explanation for the success of gradient descent in training such networks: when properly initialized, the gradient descent algorithm finds a global optimum due to the absence of spurious stationary points within the set of full-rank matrices.
We then show that if the planted weight matrix $W^*\in\mathbb{R}^{m\times d}$ has centered i.i.d. entries with unit variance and finite fourth moment (while the data still has centered i.i.d. coordinates as above), and the network is sufficiently wide, that is $m>Cd^2$ for large enough $C$, then it is easy to construct a full-rank matrix $W$ with population risk below the aforementioned energy barrier, starting from which gradient descent is guaranteed to converge to a global optimum.
Our final focus is on sample complexity: we identify a simple necessary and sufficient geometric condition on the training data under which any minimizer of the empirical loss necessarily has zero generalization error. We show that as soon as the number of training samples $N$ satisfies $N\geqslant N^*=d(d+1)/2$, randomly generated data enjoys this geometric condition almost surely. At the same time we show that if $N<N^*$, then when the data has centered i.i.d. coordinates, there always exists a matrix $W$ with empirical risk equal to zero, but with population risk bounded away from zero by the same amount as rank-deficient matrices.
Our results on sample complexity further shed light on an interesting phenomenon observed empirically about neural networks: we show that overparametrization does not hurt generalization, once the data is interpolated, for the networks with quadratic activations.
- 1 Introduction
- 2 Main Results
  - 2.1 Landscape of Population Risk: Band-Gap and Optimality of Full-Rank Stationary Points
  - 2.2 On Initialization: Randomly Generated Planted Weights
  - 2.3 Critical Number of Training Samples
- 3 Proofs
1 Introduction
The main focus of many machine learning and statistics tasks is to extract information from a given collection of data, often by solving the following “canonical problem”: Let be a data set, with and . The components are often called the labels. The problem, then, is to find a function , which:
Explains the unknown input-output relationship over the data set as accurately as possible, that is, if and , then it ensures that is small, where is some cost function. The resulting is known as the “training error”.
Has good prediction performance on the unseen data, that is, it generalizes well. In mathematical terms, if the training data is generated from a distribution on , then one demands that is small, where is some (perhaps different) cost function. Here the expectation is with respect to a fresh sample . The resulting is known as the “generalization error” or the “test error”.
One popular class of such functions are neural networks, which attempt to explain the (unknown) input-output relationship with the following form:
Here, is the ’depth’ of the neural network, are the non-linear functions called activations which act coordinate-wise, are the weight matrices of appropriate sizes; and is a vector, carrying the output weights. The process of finding parameters and which can interpolate the data set is called training.
Although simple to state, neural networks have proven extremely powerful in tasks such as natural language processing [CW08], image recognition [HZRS16], image classification [KSH12], speech recognition [MDH11], and game playing [SSS17], and have started to become popular in other areas, such as applied mathematics [CRBD18, WHJ17] and clinical diagnosis [DFLRP18].
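In this paper, we consider a shallow neural network model with one hidden layer, consisting of $m$ neurons (that is, the “width” of the network is $m$), a (planted) weight matrix $W^*\in\mathbb{R}^{m\times d}$ (where for each $j\in[m]$, the $j$-th row $W_j^*\in\mathbb{R}^d$ of this matrix is the weight vector associated to the $j$-th neuron), and quadratic activation function $\sigma(x)=x^2$. This object, for every input vector $X\in\mathbb{R}^d$, computes the function:
\[ f(W^*;X)=\sum_{j=1}^{m}\langle W_j^*,X\rangle^2=\|W^*X\|_2^2. \]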
We note that, albeit being rarely used in practice, quadratic activations help us develop an understanding of networks with higher-degree polynomial activations or sigmoid activations [LSSS14, SH18, VBB18].
The associated empirical risk minimization problem is then cast as follows: Let be a sequence of input data. Generate the output labels using a shallow neural network (the so-called teacher network) with the planted weight matrix , that is, . This model is also known as the realizable model, as the labels are generated according to a teacher network with planted weight matrix . Assume the learner has access to the training data . The empirical risk minimization problem (ERM) is the optimization problem
of finding a weight matrix that explains the input-output relationship for the training data set in the best possible way. The focus is on tractable algorithms to solve this minimization problem and understanding its generalization ability, quantified by generalization error. The generalization error, also known as population risk associated with any solution candidate (whether it is optimal with respect to (2) or not), is
where the expectation is with respect to a “fresh” sample $X$, which has the same distribution as the training inputs $X_i$, but is independent from the sample. This object provides information about how well $W$ generalizes, that is, how much error on average it makes on prediction for the yet unseen data. Observe that due to the quadratic nature of the activation function, the ground truth matrix $W^*$, which is also an optimal solution to the empirical risk optimization problem (with value zero) and population risk optimization problem (also with value zero), is invariant under rotation. Namely, for every orthonormal $Q\in\mathbb{R}^{m\times m}$, $QW^*$ is also an optimal solution to both optimization problems.
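The rotation invariance noted above can be checked numerically. The following is a minimal sketch, assuming the network output is $f(W;X)=\|WX\|_2^2$ (one hidden layer, quadratic activation, unit output weights). The function names `network_output` and `empirical_risk`, the sizes, and the $1/N$ normalization of the empirical risk are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def network_output(W, X):
    """Quadratic-activation network: sum_j <W_j, x>^2 = ||W x||^2 for each row x of X."""
    return np.sum((X @ W.T) ** 2, axis=1)

def empirical_risk(W, X, Y):
    """Mean squared error over the training set (the 1/N normalization is an assumption)."""
    return np.mean((Y - network_output(W, X)) ** 2)

rng = np.random.default_rng(0)
m, d, N = 6, 3, 20                        # illustrative sizes
W_star = rng.standard_normal((m, d))      # planted weights
X = rng.standard_normal((N, d))           # training inputs, one per row
Y = network_output(W_star, X)             # noiseless labels from the teacher network

# Random m x m orthonormal matrix Q from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
risk_rotated = empirical_risk(Q @ W_star, X, Y)
print(risk_rotated)  # zero up to floating-point error
```

Since $\|QW^*X\|_2=\|W^*X\|_2$ for any orthonormal $Q$, the rotated weights reproduce every label.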
The landscape of the loss function as a function of is highly non-convex, rendering the underlying optimization problem potentially difficult. Nevertheless, it has been empirically observed, and rigorously proven under certain assumptions (references below), that the gradient descent algorithm, despite being a simple first order procedure, is rather successful in training such neural networks: it appears to find a with . Our paper suggests a novel explanation for this, when the network consists of quadratic activation functions.
1.2 Summary of the Main Results and Comparison with the Prior Work
We first study the landscape of the population risk function under the assumption that the data consists of centered i.i.d. coordinates with second moment and (finite) fourth moment . We provide an analytical expression for the population risk in terms of the moments and of the data; and the trace of the matrix which measures how far is from . This analytical expression then yields a simple yet useful lower and upper bound for . These results are the subject of Theorem 2.1.
By utilizing the lower bound on obtained in Theorem 2.1, we then establish the following result: If is full-rank, then for any rank-deficient matrix , is bounded away from zero, by an explicit constant controlled by the smallest singular value of the planted weight matrix , and the data moments . In particular, this result highlights the existence of an “energy barrier” separating full-rank points with ’small cost’ from rank-deficient points, and shows that near the ground state one can only find full-rank points . Moreover, we show that this bound is tight up to a multiplicative constant by explicitly constructing a rank-deficient and studying its associated loss by utilizing the upper bound obtained in Theorem 2.1. These results are the subject of Theorem 2.2.
We next study the full-rank stationary points of . We establish in Theorem 2.4 that again for data with centered i.i.d. coordinates and finite fourth moment, when is full rank, all full-rank stationary points of are necessarily global optimum, and in particular, for any such , there exists an orthonormal matrix such that . This is done in an analogous manner as in Theorem 2.1: we first establish an explicit analytical formula for the population gradient which yields a stationarity equation; and we then study the implications of this equation when . Combining Theorem 2.4 with Theorem 2.2 discussed above, as a corollary we obtain the following interesting conclusion: if the gradient descent algorithm is initialized at a point which has a sufficiently small population risk, in particular if it is lower than smallest risk value achieved by rank deficient matrices , then the gradient descent converges to a global optimum of the population risk optimization problem . This is the subject of Theorem 2.5.
We now briefly pause to make a comparison with some of the related prior work. Among these, the most relevant to us is the work of Soltanolkotabi, Javanmard and Lee [SJL18]. There, in particular in [SJL18, Theorem 2.2], the authors study the empirical loss landscape of a slightly more general version of our model: , with the same quadratic activation function , assuming , and assuming all non-zero entries of have the same sign. Thus our model is the special case where all entries of equal unity. The authors establish that as long as for some small fixed constant , every local minimum of the empirical risk function is also a global minimum (namely, there exist no spurious local minima), and furthermore, every saddle point has a direction of negative curvature. As a result they show that gradient descent with an arbitrary initialization converges to a globally optimal solution of the empirical risk minimization problem. In particular, their result does not require the initialization point to be below some risk value, as in our case. Our result, though, shows that one need not worry about saddle points below some risk value, as none exist per our Theorem 2.2, when the population risk minimization problem is considered instead. We note also that the proof technique of [SJL18], more concretely, the proof of [SJL18, Theorem 2.1], which can be found in Section 6.2.1 therein, also appears to be a path to prove an empirical risk version of Theorem 2.4 with very minor modifications. Importantly, though, the regime for small is below the provable sample complexity value , as per our results discussed below. In particular, as we discuss below and prove in the main body of our paper, when , the empirical risk minimization problem admits globally optimal solutions with zero empirical risk value, but with generalization error bounded away from zero. Thus, the regime does not correspond to the regime where learning with guaranteed generalization error is possible.
It is also worth noting that, although it is not our focus in the present paper, the aforementioned paper by Soltanolkotabi, Javanmard and Lee, more concretely [SJL18, Theorem 2.1], also studies the landscape of the empirical loss when a quadratic network model is used for interpolating arbitrary input/label pairs $(X_i,Y_i)$, $i\in[N]$, that is, without making an assumption that the labels are generated according to a network with planted weights. They establish similar landscape results, namely the absence of spurious local minima and the fact that every saddle point has a direction of negative curvature, as long as the output weight vector has at least $d$ positive and $d$ negative entries (consequently, $m$ has to be at least $2d$). While this result does not assume any rank condition on $W^*$ as in our case, it bypasses this technicality at the cost of assuming that the output weights contain at least $d$ positive and $d$ negative weights, and consequently, by assuming $m$ is at least $2d$, namely when the network is sufficiently wide.
Another work which is partially relevant to the present paper is the work by Du and Lee [DL18], who establish the following result: as long as , and is any smooth and convex loss, the landscape of the regularized loss function admits favorable geometric characteristics, that is, all local minima are global, and all saddle points are strict.
Next, we study the question of proper initialization required by our analysis: is it possible to guarantee an initialization scheme under which one starts below the minimum loss value achieved by rank deficient matrices? We study this problem in the context of randomly generated weight matrices , and establish the following result: as long as the network is wide enough, specifically for some sufficiently large constant , the spectrum of the associated Wishart matrix is tightly concentrated, consequently, it is possible to initialize so that with high probability the population loss of is below the required threshold. This is the subject of our Theorem 2.6. This theorem relies on several results from the theory of random matrices regarding the concentration of the spectrum, both for the matrices with standard normal entries; as well as for matrices having i.i.d. centered entries with unit second and finite fourth moment, outside the proportional regime, that is, we focus on the regime where , but .
Our next focus is on the following question: in light of the fact that (and any of its orthonormal equivalents) achieve zero generalization error for both the empirical and population risk, can we expect that an optimal solution to the problem of minimizing empirical loss function achieves the same? The answer turns out to be positive and we give necessary and sufficient conditions on the sample to achieve this. We show that, if is the space of all -dimensional real-valued symmetric matrices, then any global minimum of the empirical loss is necessarily a global optimizer of the population loss, and thus, has zero generalization error. Note that, this condition is not retrospective in manner: this geometric condition can be checked ahead of the optimization task by computing . Conversely, we show that if the span condition above is not met then there exists a global minimum of the empirical risk function which induces a strictly positive generalization error. This is established in Theorem 2.7.
A common paradigm in statistical learning theory is that, overparametrized models, that is, models with more parameters than necessary, while being capable of interpolating the training data, tend to generalize poorly. Yet, it has been observed empirically that neural networks tend to not suffer from this complication [ZBH16]: despite being overparametrized, they seem to have good generalization performance, provided the interpolation barrier is exceeded. The result of Theorem 2.7 will shed light on this phenomenon for the case of shallow neural networks with quadratic activations. More concretely, we establish the following: suppose that the data enjoys the aforementioned geometric condition. Then, one still retains good generalization, even when the interpolation is satisfied using a neural network with potentially larger number of internal nodes, namely by using a weight matrix where is possibly larger than . Hence, our result explains the aforementioned phenomenon in case of networks with quadratic activations: interpolation implies good generalization, even if the prediction is made using a larger network, provided the interpolated data satisfies a certain geometric condition, which essentially holds for almost any data.
To complement our analysis, we then ask the following question: what is the critical number of the training samples, under which the (random) data enjoys the aforementioned span condition? We prove the number to be , under a very mild assumption that the coordinates of are jointly continuous. This is shown in Theorem 2.9. Finally, in Theorem 2.10 we show that when not only there exists with zero empirical risk and strictly positive generalization error, but we bound this error from below by the same amount as the bound for all rank deficient matrices discussed in our first Theorem 2.2.
The remainder of the paper is organized as follows. In Section 2.1 we present our main results on the landscape of the population risk. In particular, we provide an analytical expression for the population risk , state our energy barrier result for rank deficient matrices, our result about the absence of full-rank stationary points of a population risk function except the globally optimum points; and our result on the convergence of gradient descent. In Section 2.2, we present our results regarding randomly generated weight matrices and sufficient conditions for good initializations. In Section 2.3, we study the critical number of training samples guaranteeing good generalization property. The proofs of all of our results are found in Section 3.
We provide a list of notational conventions that we follow throughout. The set of real numbers is denoted by $\mathbb{R}$, and the set of positive real numbers is denoted by $\mathbb{R}_+$. The set $\{1,2,\dots,k\}$ is denoted by $[k]$. Given any matrix $A$, $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ respectively denote the smallest and largest singular values of $A$; $\sigma(A)$ denotes the spectrum of $A$, that is, the set of all eigenvalues of $A$; and $\mathrm{trace}(A)$ denotes the sum of the diagonal entries of $A$. Given two matrices $A,B\in\mathbb{R}^{k\times\ell}$, denote their Hadamard product by $A\circ B$, which is a $k\times\ell$ matrix where for every $i\in[k]$ and $j\in[\ell]$, $(A\circ B)_{ij}=A_{ij}B_{ij}$. Moreover, $\|A\|_2$ denotes the spectral norm of $A$, that is, the square root of the largest eigenvalue of $A^TA$. Denote by $I_n$ the $n\times n$ identity matrix. Objects with an asterisk denote planted objects; for instance, $W^*$ denotes the planted weight matrix of the network. Given any vector $v\in\mathbb{R}^n$, $\|v\|_2$ denotes its Euclidean norm, that is, $\|v\|_2=\sqrt{\sum_{i=1}^n v_i^2}$. Given two vectors $x,y\in\mathbb{R}^n$, their Euclidean inner product is denoted by $\langle x,y\rangle$. Given a collection $Z_1,\dots,Z_k$ of objects of the same kind (in particular, vectors or matrices), $\mathrm{span}(Z_i:i\in[k])$ is the set $\{\sum_{j=1}^k\alpha_jZ_j:\alpha_j\in\mathbb{R}\}$. Standard (asymptotic) order notations are used for comparing the growth of two sequences. Finally, the population risk is denoted by $\mathcal{L}(\cdot)$, and its empirical variant is denoted by $\widehat{\mathcal{L}}(\cdot)$.
2 Main Results
2.1 Landscape of Population Risk: Band-Gap and Optimality of Full-Rank Stationary Points
In this section, we present our results pertaining to the landscape of the population risk, including an analytical expression for the population risk, the aforementioned band-gap, the global optimality of full-rank stationary points, and a discussion of the success of gradient descent on such networks. All of our results hold under a mild assumption on the data distribution: the data $X\in\mathbb{R}^d$ has centered i.i.d. coordinates, that is, $X_1,\dots,X_d$ are i.i.d. with $\mathbb{E}[X_i]=0$.
An Analytical Expression for the Population Risk
Our first result provides an analytical expression for the population risk of any in terms of how close it is to the planted weight matrix .
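We recall that a random vector $X$ in $\mathbb{R}^d$ is defined to have a jointly continuous distribution if there exists a measurable function $p:\mathbb{R}^d\to[0,\infty)$ such that for any Borel set $B\subseteq\mathbb{R}^d$,
\[ \mathbb{P}(X\in B)=\int_B p(x)\,d\lambda(x), \]
where $\lambda$ is the Lebesgue measure on $\mathbb{R}^d$.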
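Theorem 2.1. Let $\mathcal{L}(W)=\mathbb{E}\big[(f(W;X)-f(W^*;X))^2\big]$, where the expectation is with respect to the distribution of $X\in\mathbb{R}^d$.
(a) Suppose the distribution of $X$ is jointly continuous. Then $\mathcal{L}(W)=0$, that is, $f(W;X)=f(W^*;X)$ almost surely with respect to $X$, if and only if $W=QW^*$ for some orthonormal matrix $Q\in\mathbb{R}^{m\times m}$.
Suppose now that the coordinates of $X$ are i.i.d. with $\mathbb{E}[X_i]=0$, $\mathbb{E}[X_i^2]=\mu_2$, and $\mathbb{E}[X_i^4]=\mu_4$.
(b) It holds that:
\[ \mathcal{L}(W)=\mu_2^2\,\mathrm{trace}(A)^2+2\mu_2^2\,\mathrm{trace}(A^2)+(\mu_4-3\mu_2^2)\,\mathrm{trace}(A\circ A), \]
where $A=W^TW-(W^*)^TW^*\in\mathbb{R}^{d\times d}$, and $A\circ A$ is the Hadamard product of $A$ with itself. In particular, if $X$ has i.i.d. standard normal coordinates, we obtain $\mathcal{L}(W)=\mathrm{trace}(A)^2+2\,\mathrm{trace}(A^2)$.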
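(c) The following bounds hold, with $A$ as above:
\[ \mu_2^2\,\mathrm{trace}(A)^2+\min\{\mu_4-\mu_2^2,\,2\mu_2^2\}\,\mathrm{trace}(A^2)\le\mathcal{L}(W)\le\mu_2^2\,\mathrm{trace}(A)^2+\max\{\mu_4-\mu_2^2,\,2\mu_2^2\}\,\mathrm{trace}(A^2). \]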
In a nutshell, Theorem 2.1 states that the population risk $\mathcal{L}(W)$ of any $W\in\mathbb{R}^{m\times d}$ is completely determined by how close it is to the planted weights $W^*$, as measured by the matrix $A=W^TW-(W^*)^TW^*$, and by the second and fourth moments of the data. This is not surprising: $\mathcal{L}(W)$ is essentially a function of the first four moments of the data and of the difference of the quadratic forms generated by $W$ and $W^*$, which is precisely encapsulated by the matrix $A$. Note also that the characterization of the “optimal orbit” per part (a) is not surprising either: any matrix $W$ with the property $W=QW^*$, where $Q\in\mathbb{R}^{m\times m}$ is an orthonormal matrix, that is, $Q^TQ=I_m$, has the property that $f(W;X)=f(W^*;X)$ for any data $X\in\mathbb{R}^d$. Part (a) then says that the reverse is true as well, provided that the distribution of $X$ is jointly continuous. Note also that for $X$ with centered i.i.d. entries the thesis of part (a) follows also from part (c): $\mathcal{L}(W)=0$ implies that $\mathrm{trace}(A^2)=0$, which, together with the fact that $A$ is symmetric, then yields $A=0$, that is, $W^TW=(W^*)^TW^*$.
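The dependence of the risk on $A$ and on the two moments can be verified exactly on a small instance. The sketch below assumes the closed form $\mu_2^2\,\mathrm{trace}(A)^2+2\mu_2^2\,\mathrm{trace}(A^2)+(\mu_4-3\mu_2^2)\,\mathrm{trace}(A\circ A)$, which is what a direct expansion of $\mathbb{E}[(X^TAX)^2]$ gives for centered i.i.d. coordinates. It uses Rademacher ($\pm1$) coordinates, a choice made here only because the expectation then reduces to a finite average over $2^d$ sign vectors. The helper name `population_risk` is hypothetical.

```python
import itertools
import numpy as np

def population_risk(W, W_star, mu2, mu4):
    """Closed form for E[(x^T A x)^2], A = W^T W - W*^T W*, i.i.d. centered coordinates."""
    A = W.T @ W - W_star.T @ W_star
    return (mu2**2 * np.trace(A) ** 2
            + 2 * mu2**2 * np.trace(A @ A)
            + (mu4 - 3 * mu2**2) * np.sum(np.diag(A) ** 2))  # trace(A∘A) = sum_i A_ii^2

rng = np.random.default_rng(1)
m, d = 5, 4
W_star = rng.standard_normal((m, d))
W = rng.standard_normal((m, d))

# Rademacher coordinates: mu2 = mu4 = 1, and the expectation is an exact
# average over all 2^d sign vectors.
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
diff = np.sum((signs @ W.T) ** 2, axis=1) - np.sum((signs @ W_star.T) ** 2, axis=1)
exact = np.mean(diff ** 2)
closed_form = population_risk(W, W_star, mu2=1.0, mu4=1.0)
print(exact, closed_form)  # the two values agree up to rounding
```

For standard normal data one would instead pass `mu2=1.0, mu4=3.0`, in which case the Hadamard term vanishes.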
Existence of a Band-Gap
Our next result, which is essentially a consequence of Theorem 2.1, shows the appearance of a band-gap structure in the landscape of the population risk $\mathcal{L}(W)$, below which any rank-deficient $W$ ceases to exist.
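Theorem 2.2. Suppose that $X\in\mathbb{R}^d$ has i.i.d. centered coordinates with variance $\mu_2$, (finite) fourth moment $\mu_4$, and $\mathrm{rank}(W^*)=d$.
(a) It holds that
\[ \min_{W\in\mathbb{R}^{m\times d}:\,\mathrm{rank}(W)<d}\mathcal{L}(W)\ge\min\{\mu_4-\mu_2^2,\,2\mu_2^2\}\,\sigma_{\min}(W^*)^4. \]
(b) There exists a matrix $W\in\mathbb{R}^{m\times d}$ such that $\mathrm{rank}(W)\le d-1$ and
\[ \mathcal{L}(W)\le\max\{\mu_4,\,3\mu_2^2\}\,\sigma_{\min}(W^*)^4. \]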
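The energy barrier can be illustrated numerically. This is a minimal sketch under stated assumptions: the closed-form risk from Theorem 2.1 is used, and the barrier constants are taken to be $\min\{\mu_4-\mu_2^2,2\mu_2^2\}\,\sigma_{\min}(W^*)^4$ (lower) and $\max\{\mu_4,3\mu_2^2\}\,\sigma_{\min}(W^*)^4$ (upper), which follow from that closed form and should be checked against the theorem's exact statement. The rank-deficient candidate `W_def`, obtained by deleting the singular direction of $W^*$ with the smallest singular value, is one natural construction and not necessarily the one used in the proof.

```python
import numpy as np

def population_risk(W, W_star, mu2, mu4):
    A = W.T @ W - W_star.T @ W_star
    return (mu2**2 * np.trace(A) ** 2 + 2 * mu2**2 * np.trace(A @ A)
            + (mu4 - 3 * mu2**2) * np.sum(np.diag(A) ** 2))

rng = np.random.default_rng(2)
m, d = 6, 4
mu2, mu4 = 1.0, 1.8                      # moments of the uniform law on [-sqrt(3), sqrt(3)]
W_star = rng.standard_normal((m, d))

# Smallest singular value of W* and the matching right singular vector v.
_, s, Vt = np.linalg.svd(W_star, full_matrices=False)
sigma_min, v = s[-1], Vt[-1]
lower = min(mu4 - mu2**2, 2 * mu2**2) * sigma_min**4
upper = max(mu4, 3 * mu2**2) * sigma_min**4

# Rank-deficient candidate: remove the direction v, so that A = -sigma_min^2 v v^T.
W_def = W_star @ (np.eye(d) - np.outer(v, v))
risk_def = population_risk(W_def, W_star, mu2, mu4)

# Random rank-(d-1) matrices should also stay above the lower bound.
risks = []
for _ in range(100):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    W_rand = rng.standard_normal((m, d)) @ (np.eye(d) - np.outer(u, u))
    risks.append(population_risk(W_rand, W_star, mu2, mu4))
print(lower, risk_def, upper, min(risks))
```

The candidate `W_def` lands between the two bounds, which is the sense in which the barrier is tight up to a constant depending only on the data moments.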
This theorem indicates that there is a threshold , such that for any with , it is the case that is full rank. This value of is the aforementioned energy value below which any rank-deficient ceases to exist. Part of Theorem 2.2 implies that our lower bound on the value is tight up to a multiplicative constant determined by the moments of the data. As a simple corollary to Theorem 2.2, we obtain that the landscape of the population risk still admits an energy barrier, even if we consider the same network architecture with planted weight matrix , and quadratic activation function having lower order terms, that is, the activation , with . This barrier is quantified by , in addition to and the corresponding moments of the data.
For any , define , where with arbitrary. Let
where has centered i.i.d. coordinates. Then,
The proof of this corollary is deferred to Section 3.4.
Global Optimality of Full-Rank Stationary Points and Convergence of Gradient Descent
Our next result establishes that any full-rank stationary point (of the population risk) is necessarily a global minimum.
Suppose with . Suppose has centered i.i.d. coordinates with , ; and . Let be a stationary point of the population risk with full-rank, that is, , and . Then, for some orthogonal matrix , and that, .
We now combine Theorems 2.2 and 2.4 to obtain the following potentially interesting conclusion on running the gradient descent for the population loss. If the gradient descent algorithm is initialized at a point which has a sufficiently small population risk, in particular lower than the smallest risk value achieved by rank deficient matrices, then the gradient descent converges to a global optimum of the population risk optimization problem , which is zero.
Let be a matrix of weights, with the property that
where by we denote the spectral norm of the matrix . Then, and the gradient descent algorithm with initialization and a step size of generates a trajectory of weights such that .
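The convergence statement can be illustrated on a small instance. The sketch below assumes standard normal data, for which the population risk is $\mathrm{trace}(A)^2+2\,\mathrm{trace}(A^2)$ with $A=W^TW-(W^*)^TW^*$ and its gradient is $4\,\mathrm{trace}(A)\,W+8\,WA$. The planted matrix, the initialization, the step size `0.01`, and the iteration count are hand-picked for this example and are not the quantities prescribed by the theorem.

```python
import numpy as np

def risk_and_grad(W, M_star):
    """Population risk for N(0, I) data and its gradient with respect to W."""
    A = W.T @ W - M_star
    risk = np.trace(A) ** 2 + 2.0 * np.trace(A @ A)
    grad = 4.0 * np.trace(A) * W + 8.0 * W @ A
    return risk, grad

# A small, well-conditioned planted matrix (illustrative values).
W_star = np.array([[1.0, 0.1, 0.0],
                   [0.0, 1.1, 0.0],
                   [0.05, 0.0, 0.9],
                   [0.1, 0.0, 0.0],
                   [0.0, 0.1, 0.1]])
m, d = W_star.shape
M_star = W_star.T @ W_star

# Energy barrier for Gaussian moments: 2 * sigma_min(W*)^4.
barrier = 2.0 * np.linalg.eigvalsh(M_star)[0] ** 2

# Full-rank initialization with W^T W = I_d, chosen to start below the barrier.
W = np.vstack([np.eye(d), np.zeros((m - d, d))])
initial_risk, _ = risk_and_grad(W, M_star)

step_size = 0.01                                  # assumed small enough for this instance
for _ in range(3000):
    _, grad = risk_and_grad(W, M_star)
    W = W - step_size * grad

final_risk, _ = risk_and_grad(W, M_star)
print(initial_risk, barrier, final_risk)
```

The iterate ends with $W^TW\approx(W^*)^TW^*$, that is, at a point of the optimal orbit rather than at $W^*$ itself.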
In particular, the gradient descent, when initialized properly, finds global optimum due to absence of spurious stationary points within the set of full-rank matrices. In the next section, we address the question of proper initialization when the (planted) weights are generated randomly, and complement Theorem 2.5 by providing a deterministic initialization guarantee, which with high probability beats the aforementioned energy barrier.
2.2 On Initialization: Randomly Generated Planted Weights
As noted in the previous section, our Theorems 2.2 and 2.4 offer an alternative conceptual explanation, from the landscape perspective, for the success of gradient descent in training the aforementioned neural network architectures, provided that the algorithm is initialized properly.
In this section, we provide a way to properly initialize such networks under the assumption that the data has centered i.i.d. coordinates with finite fourth moment and the (planted) weight matrix has i.i.d. centered entries with unit variance and finite fourth moment. Our result is valid provided that the network is sufficiently overparametrized: for some large constant . Note that this implies is a tall matrix sending into .
The rationale behind this approach relies on the following observations:
Suppose that the spectrum of is tightly concentrated around a value , that is, for some sufficiently small , . Now, assume the initialization is such that , and thus, . Then, the population loss per equation above can be made small enough (in particular, smaller than the energy barrier stated in the previous section), provided is small enough, namely, provided that the concentration is tight enough.
The spectrum of a tall random matrix is essentially concentrated (see [Ver10, Corollary 5.35] and references therein). Namely, sufficiently tall random matrices are approximately isometric embeddings of $\mathbb{R}^d$ into $\mathbb{R}^m$.
Equipped with these observations, we are now in a position to state our result, a high probability guarantee for the cost of a particular choice of initialization.
Suppose that the data consists of i.i.d. centered coordinates with and . Recall that
where the expectation is taken with respect to the randomness in a fresh sample .
Suppose that the planted weight matrix has i.i.d. standard normal entries. Let the initial weight matrix be defined by for , and otherwise (hence, with ). Then, provided for a sufficiently large absolute constant ,
with probability at least , where the probability is with respect to the draw of .
Suppose the planted weight matrix has centered i.i.d. entries with unit variance and finite fourth moment. Let the initial weight matrix be defined by for , and otherwise (hence, ). Then, provided for a sufficiently large absolute constant ,
with high probability, as , where the probability is with respect to the draw of .
The proof of this theorem is provided in Section 3.7.
Note that part (a) of Theorem 2.6 gives an explicit rate for the probability in the case when the i.i.d. entries of the planted weight matrix are standard normal, and is based on a non-asymptotic concentration result for the spectrum of such matrices. The extension in part (b) is based on a result of Bai and Yin [BY88].
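The initialization idea can be sketched as follows. The example assumes standard normal data and standard normal planted weights, takes the diagonal entries of the initial matrix to be $\sqrt{m}$ so that $W_0^TW_0=mI_d$, and compares the resulting risk with the barrier $2\,\sigma_{\min}(W^*)^4$. The scaling $\sqrt{m}$, the sizes, and the barrier constant are assumptions of this sketch and may differ from the constants in the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 5, 4000                                   # m much larger than d^2
W_star = rng.standard_normal((m, d))             # planted weights with i.i.d. N(0, 1) entries

# Diagonal initialization: sqrt(m) on the first d diagonal entries, zero elsewhere,
# so that W0^T W0 = m * I_d (the sqrt(m) scaling is an assumption of this sketch).
W0 = np.zeros((m, d))
W0[np.arange(d), np.arange(d)] = np.sqrt(m)

A = W0.T @ W0 - W_star.T @ W_star
risk0 = np.trace(A) ** 2 + 2 * np.trace(A @ A)   # Gaussian data: mu2 = 1, mu4 = 3

sigma_min = np.linalg.svd(W_star, compute_uv=False)[-1]
barrier = 2 * sigma_min ** 4                     # min{mu4 - mu2^2, 2 mu2^2} * sigma_min^4
print(risk0 / barrier)                           # well below 1 when m >> d^2
```

Heuristically, the risk of this initialization grows like $md^2$ while the barrier grows like $m^2$, which is why the width is compared with $d^2$.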
With this, we now turn our attention to the number of training samples required to learn such models.
2.3 Critical Number of Training Samples
Our next focus is on the number of training samples required for controlling the generalization error. In particular we establish the following results: we identify a necessary and sufficient condition on the training data under which any minimizer of the empirical loss (which, in the case of planted weights, necessarily interpolates the data) has zero generalization error. Furthermore we identify the smallest number of training samples, such that (randomly generated) training data satisfies the aforementioned condition, so long as .
A Necessary and Sufficient Geometric Condition on the Training Data
Below is a necessary and sufficient (geometric) condition on the training data under which any minimizer of the empirical loss (which, in the case of planted weights, necessarily interpolates the data) has zero generalization error.
Let be a set of data.
where is the set of all symmetric real-valued matrices. Let be arbitrary. Then for any interpolating the data, that is for every , it holds that . In particular, if , then for some matrix with orthonormal columns, , and if , then for some matrix with orthonormal columns, .
is a strict subset of . Then, for any with and any positive integer , there exists a such that , while interpolates the data, that is, for all . In particular, for this , , where is defined with respect to any jointly continuous distribution on .
We now make the following remarks. The condition stated in Theorem 2.7 is not retrospective in manner: it can be checked ahead of the optimization process. Next, there are no randomness assumptions in the setting of Theorem 2.7, and it provides a purely geometric necessary and sufficient condition: as long as is the space of all symmetric matrices (in ) we have that any (global) minimizer of the empirical risk has zero generalization error. Conversely, in the absence of this geometric condition, there are optimizers of the empirical risk such that while , the generalization error of is bounded away from zero, that is, . Soon in Theorem 2.10, we give a more refined version of this result, with a concrete lower bound on , in the more realistic setting, where the training data is generated randomly.
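Because the condition depends only on the inputs, it can be tested before any optimization is run. A minimal sketch, assuming the condition is that the rank-one matrices $X_iX_i^T$ span all $d\times d$ symmetric matrices; the function name `span_condition_holds` is hypothetical.

```python
import numpy as np

def span_condition_holds(X):
    """True if {x_i x_i^T} spans all d x d symmetric matrices (rows of X are the x_i)."""
    N, d = X.shape
    # Flatten each rank-one matrix x_i x_i^T into a row of length d^2.
    outer = np.einsum("ia,ib->iab", X, X).reshape(N, d * d)
    # The span lies inside the symmetric matrices, whose dimension is d(d+1)/2.
    return np.linalg.matrix_rank(outer) == d * (d + 1) // 2

rng = np.random.default_rng(4)
d = 4
n_star = d * (d + 1) // 2                        # 10 for d = 4
enough = rng.standard_normal((n_star, d))        # exactly d(d+1)/2 random samples
too_few = rng.standard_normal((n_star - 1, d))   # one sample short
print(span_condition_holds(enough), span_condition_holds(too_few))
```

The rank computation uses a numerical tolerance, so for nearly degenerate data the outcome should be read as indicative only.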
We further highlight the presence of the parameter $\widehat{m}$. In particular, part (a) of Theorem 2.7 states that provided the span condition is satisfied, any neural network with $\widehat{m}$ internal nodes interpolating the data necessarily has zero generalization error, regardless of whether $\widehat{m}$ is equal to $m$, in particular, even when $\widehat{m}>m$. This, in fact, is an instance of an interesting phenomenon empirically observed about neural networks, which somewhat challenges one of the main paradigms in statistical learning theory: overparametrization does not hurt the generalization performance of neural networks once the data is interpolated. Namely, beyond the interpolation threshold, one retains good generalization.
We note that Theorem 2.7 still remains valid under a slightly more general setup, where each node has an associated positive but otherwise arbitrary output weight .
Let , , and be the function computed by the neural network with input , quadratic activation function, planted weights , and output weights , that is, . Let be a set of data.
Then for any and interpolating the data, that is for every , it holds that for every (here, for all ). In particular, achieves zero generalization error.
is a strict subset of . Then, for any , and every , there is a pair , such that while interpolates the data, that is, for every , has strictly positive generalization error, with respect to any jointly continuous distribution on .
The proof of this corollary is deferred to Section 3.9.
Randomized Data Enjoys the Geometric Condition
We now identify the smallest number of training samples, such that (randomly generated) training data satisfies the aforementioned geometric condition almost surely; as soon as .
Let , and be i.i.d. random vectors with jointly continuous distribution. Then,
If , then .
If , then for arbitrary , .
The critical number is $N^*=d(d+1)/2$, since the space of $d\times d$ real symmetric matrices has dimension $d(d+1)/2$. Note also that, with this observation, part (b) of Theorem 2.9 is trivial, since fewer than $N^*$ matrices cannot span this space.
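The dimension count behind this threshold can be written out as a short worked computation. The symbols $\mathcal{S}_d$ (the space of $d\times d$ real symmetric matrices) and $E_{ij}$ (the matrix with a single $1$ in position $(i,j)$) are notation chosen here for illustration.

```latex
\[
\mathcal{S}_d
= \operatorname{span}\{E_{ii} : 1 \le i \le d\}
\oplus \operatorname{span}\{E_{ij} + E_{ji} : 1 \le i < j \le d\},
\qquad
\dim \mathcal{S}_d = d + \binom{d}{2} = \frac{d(d+1)}{2}.
\]
\[
\dim \operatorname{span}\{X_i X_i^T : 1 \le i \le N\} \le N,
\quad\text{so } N < \frac{d(d+1)}{2} \text{ rules out the span condition.}
\]
```

For instance, $d=3$ gives $\dim\mathcal{S}_3=6$, so at least six samples are needed before the condition can hold.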
Sample Complexity Bound for the Planted Network Model
Let be i.i.d. with a jointly continuous distribution on . Let the corresponding outputs be generated via , with with .
Suppose , and . Then with probability one over the training data , if is such that for every , then for every .
Suppose are i.i.d. random vectors with i.i.d. centered coordinates having variance and finite fourth moment . Suppose that . Then there exists a such that for every , yet the generalization error satisfies
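The mechanism behind the failure below the threshold can be sketched directly. With fewer than $d(d+1)/2$ samples there is a non-zero symmetric $B$ with $X_i^TBX_i=0$ for every training input, and perturbing $(W^*)^TW^*$ along $B$ yields weights that interpolate the data yet differ from the planted orbit. The construction below is an illustration under the assumption of Gaussian data and full-rank $W^*$. It only shows that the population risk is strictly positive and does not reproduce the quantitative lower bound of the theorem.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 3, 6
n_star = d * (d + 1) // 2
N = n_star - 1                                   # one sample short of the threshold
W_star = rng.standard_normal((m, d))
X = rng.standard_normal((N, d))

# Coordinates of a symmetric B: diagonal entries, then entries above the diagonal.
iu, ju = np.triu_indices(d, k=1)
# Row i is built so that Phi[i] @ b equals x_i^T B x_i.
Phi = np.hstack([X ** 2, 2 * X[:, iu] * X[:, ju]])

# Phi has more columns than rows, so its last right singular vector is a null vector.
b = np.linalg.svd(Phi)[2][-1]
B = np.diag(b[:d])
B[iu, ju] = b[d:]
B[ju, iu] = b[d:]

# Perturb the planted Gram matrix along B while keeping it positive definite.
M_star = W_star.T @ W_star
t = 0.5 * np.linalg.eigvalsh(M_star)[0] / np.linalg.norm(B, 2)
G = M_star + t * B
W = np.zeros((m, d))
W[:d] = np.linalg.cholesky(G).T                  # ensures W^T W = G

train_gap = np.max(np.abs(np.sum((X @ W.T) ** 2, axis=1)
                          - np.sum((X @ W_star.T) ** 2, axis=1)))
A = W.T @ W - M_star
risk = np.trace(A) ** 2 + 2 * np.trace(A @ A)    # population risk for N(0, I) data
print(train_gap, risk)                           # interpolation error ~ 0, risk > 0
```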
We highlight that the lower bound arising in Theorem 2.10 is the same bound obtained in Theorem 2.2 as the energy band regarding rank-deficient matrices. Note also that the interpolating network in part (a) can potentially be larger than the original network generating the data: any large network, despite being overparametrized, still generalizes well, provided it interpolates on a training set enjoying the aforementioned geometric condition.
3 Proofs
In this section, we present the proofs of the main results of this paper.
3.1 Auxiliary Results
Some of our results use the following auxiliary results:
([CT05]) Let $\ell$ be an arbitrary positive integer, and let $P:\mathbb{R}^\ell\to\mathbb{R}$ be a polynomial. Then, either $P$ is identically $0$, or $\{x\in\mathbb{R}^\ell:P(x)=0\}$ has zero Lebesgue measure, namely, $P(x)$ is non-zero almost everywhere.
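([HJ12, Theorem 7.3.11]) For two matrices $A\in\mathbb{R}^{p\times n}$ and $B\in\mathbb{R}^{q\times n}$ where $q\le p$; $A^TA=B^TB$ holds if and only if $A=QB$ for some matrix $Q\in\mathbb{R}^{p\times q}$ with orthonormal columns.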
Our results regarding the initialization guarantees use several auxiliary results from random matrix theory. The spectrum of tall random matrices is essentially concentrated:
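([Ver10, Corollary 5.35]) Let $A$ be an $N\times n$ matrix with independent standard normal entries. For every $t\ge0$, with probability at least $1-2\exp(-t^2/2)$, we have:
\[ \sqrt{N}-\sqrt{n}-t\le\sigma_{\min}(A)\le\sigma_{\max}(A)\le\sqrt{N}+\sqrt{n}+t. \]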
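As a quick numerical illustration of this concentration, the sketch below assumes the inequality has the standard form of [Ver10, Corollary 5.35], namely $\sqrt{N}-\sqrt{n}-t\le\sigma_{\min}(A)\le\sigma_{\max}(A)\le\sqrt{N}+\sqrt{n}+t$ with probability at least $1-2\exp(-t^2/2)$. The sizes and the value of $t$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
N, n, t = 2000, 20, 5.0
A = rng.standard_normal((N, n))                  # tall Gaussian matrix
s = np.linalg.svd(A, compute_uv=False)           # singular values in decreasing order

lo = np.sqrt(N) - np.sqrt(n) - t
hi = np.sqrt(N) + np.sqrt(n) + t
failure_prob = 2 * np.exp(-t ** 2 / 2)           # the bound can fail with at most this probability
print(lo, s[-1], s[0], hi, failure_prob)
```

After rescaling by $1/\sqrt{N}$, all singular values lie close to $1$, which is the approximate isometry property used for the initialization guarantee.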
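Let $A$ be an $N\times n$ random matrix whose entries are independent copies of a random variable with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions $N$ and $n$ grow to infinity while the aspect ratio $n/N$ converges to a constant in $[0,1]$. Then
\[ \sigma_{\min}(A)=\sqrt{N}-\sqrt{n}+o(\sqrt{n}),\qquad\sigma_{\max}(A)=\sqrt{N}+\sqrt{n}+o(\sqrt{n})\quad\text{almost surely.} \]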
3.2 Proof of Theorem 2.1
First, we have
where is a symmetric matrix. Note also that,
where is equal to