Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently
Abstract
It has been observed [zhang2016understanding] that deep neural networks can memorize: they achieve 100% accuracy on training data. Recent theoretical results explained such behavior in highly overparametrized regimes, where the number of neurons in each layer is larger than the number of training samples. In this paper, we show that neural networks can be trained to memorize training data perfectly in a mildly overparametrized regime, where the number of parameters is just a constant factor more than the number of training samples, and the number of neurons is much smaller.
1 Introduction
In deep learning, highly nonconvex objectives are optimized by simple algorithms such as stochastic gradient descent. However, it was observed that neural networks are able to fit the training data perfectly, even when the data/labels are randomly corrupted [zhang2016understanding]. Recently, a series of works (du2019gradient, allen2019convergence, chizat2018global, jacot2018neural; see more references in Section 1.2) developed a theory of neural tangent kernels (NTK) that explains the success of training neural networks through overparametrization. Several results showed that if the number of neurons at each layer is much larger than the number of training samples, networks of different architectures (multilayer/recurrent) can all fit the training data perfectly.
However, if one considers the number of parameters required for the current theoretical analysis, these networks are highly overparametrized. Consider fully connected networks for example. If a two-layer network has a hidden layer with $m$ neurons, the number of parameters is at least $md$, where $d$ is the dimension of the input. For deeper networks, if a network has two consecutive hidden layers of size $m$, then the number of parameters is at least $m^2$. All of the existing works require the number of neurons $m$ per layer to be at least the number of training samples $n$ (in fact, most of them require $m$ to be a polynomial of $n$). In these cases, the number of parameters is at least $nd$, or even $n^2$ for deeper networks, which is much larger than the number of training samples $n$. Therefore, a natural question is whether neural networks can fit the training data in the mildly overparametrized regime, where the number of (trainable) parameters is only a constant factor larger than the number of training samples. To achieve this, one would want to use a small number of neurons in each layer: on the order of $n/d$ for a two-layer network and $\sqrt{n}$ for a three-layer network. yun2018small showed such networks have enough capacity to memorize any training data. In this paper we show that with polynomial activation functions, simple optimization algorithms are guaranteed to find a solution that memorizes training data.
1.1 Our Results
In this paper, we give network architectures (with polynomial activations) such that every hidden layer has size much smaller than the number of training samples $n$, the total number of parameters is linear in $n$, and simple optimization algorithms on such neural networks can fit any training data.
We first give a warm-up result that works when the number of training samples $n$ is roughly $d^2$ (where $d$ is the input dimension).
Theorem 1 (Informal).
Suppose there are $n = O(d^2)$ training samples in general position. Then there exists a two-layer neural network with quadratic activations, such that the number of neurons in the hidden layer is $O(d)$, the total number of parameters is $O(d^2) = O(n)$, and perturbed gradient descent can fit the network to any output.
Here “in general position” will be formalized later as a deterministic condition that is true with probability 1 for random inputs, see Theorem 4 for details.
In this case, the number of hidden neurons is only roughly the square root of the number of training samples, so the weights for these neurons need to be trained carefully in order to fit the data. Our analysis relies on an analysis of the optimization landscape: we show that every local minimum for such a neural network must also be globally optimal (and has 0 training error). As a result, the algorithm can converge from an arbitrary initialization.
Of course, the result above is limited as the number of training samples cannot be much larger than $d^2$. We can extend the result to handle a larger number of training samples:
Theorem 2 (Informal).
Suppose the number of training samples is $n \le d^{\ell}$ for some constant $\ell$. If the training samples are in general position, there exists a three-layer neural network with polynomial activations, such that the number of neurons in each layer is $O(\sqrt{n})$, and perturbed gradient descent on the middle layer can fit the network to any output.
Here $O(\cdot)$ considers $\ell$ as a constant and hides constant factors that only depend on $\ell$. We consider “in general position” in the smoothed analysis framework [spielman2004smoothed]: given arbitrary inputs $\bar{x}_1, \dots, \bar{x}_n$, fix a perturbation radius $\nu$; the actual inputs are $x_i = \bar{x}_i + \tilde{x}_i$, where $\tilde{x}_i$ is a random Gaussian perturbation with standard deviation proportional to $\nu$. The guarantee of the training algorithm will depend inverse polynomially on the perturbation radius $\nu$ (note that the architecture, in particular the number of neurons, is independent of $\nu$). The formal result is given in Theorem 5 in Section 4. Later we also give a deterministic condition for the inputs, and prove a slightly weaker result (see Theorem 6).
1.2 Related Works
Neural Tangent Kernel
Many results in the framework of neural tangent kernels show that networks with different architectures can all memorize the training data, including two-layer [du2019gradient], multilayer [du2018gradient2, allen2019convergence, zou2019improved], and recurrent neural networks [allen2018convergence]. However, all of these works require the number of neurons in each layer to be at least quadratic in the number of training samples. oymak2019towards improved the number of neurons required for two-layer networks, but their bound is still larger than the number of training samples. There are also more works for NTK on generalization guarantees (e.g., allen2018learning), fine-grained analysis for specific inputs [arora2019fine], and empirical performance [arora2019exact], but they are not directly related to our results.
Representation Power of Neural Networks
For standard neural networks with ReLU activations, yun2018small showed that networks of a similar size to those in Theorem 2 can memorize any training data. Their construction is delicate, and it is not clear whether gradient descent can find such a memorizing network efficiently.
Matrix Factorizations
Since the activation function for our two-layer net is quadratic, training the network is very similar to a matrix factorization problem. Many existing works analyzed the optimization landscape and implicit bias for problems related to matrix factorization in various settings [bhojanapalli2016global, ge2016matrix, ge2017no, park2016non, gunasekar2017implicit, li2018algorithmic, arora2019implicit]. In this line of work, du2018power is the most similar to our two-layer result: they showed how gradient descent can learn a two-layer neural network that represents any positive semidefinite matrix. However, positive semidefinite matrices cannot be used to memorize arbitrary data, and our two-layer network can represent an arbitrary symmetric matrix.
Interpolating Methods
Of course, simply memorizing the data may not be useful in machine learning. However, recently several works [belkin2018overfitting, belkin2019does, liang2019risk, mei2019generalization] showed that learning regimes that interpolate/memorize the data can also have generalization guarantees. Proving generalization for our architectures is an interesting open problem.
2 Preliminaries
In this section, we introduce notations, the two neural network architectures used for Theorems 1 and 2, and the perturbed gradient descent algorithm.
2.1 Notations
We use $[n]$ to denote the set $\{1, 2, \dots, n\}$. For a vector $v$, we use $\|v\|_2$ to denote its $\ell_2$ norm, and sometimes $\|v\|$ as a shorthand. For a matrix $M$, we use $\|M\|_F$ to denote its Frobenius norm and $\|M\|$ to denote its spectral norm. We will also use $\lambda_i(M)$ and $\sigma_i(M)$ to denote the $i$-th largest eigenvalue and singular value of a matrix $M$, and $\lambda_{\min}(M)$, $\sigma_{\min}(M)$ to denote the smallest eigenvalue/singular value.
For the results on three-layer networks, the first-layer activation is going to be $\sigma(z) = z^{\ell}$, where $\ell$ is considered as a small constant. We use $O_{\ell}(\cdot)$, $\Omega_{\ell}(\cdot)$ to hide factors that only depend on $\ell$.
For vectors $u, v$, the tensor product is denoted by $u \otimes v$. We use $v^{\otimes p}$ as a shorthand for the $p$-th power of $v$ in terms of the tensor product. For two matrices $A, B$, we use $A \otimes B$ to denote the Kronecker product of the two matrices.
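These notations can be checked concretely in numpy; the sketch below (variable names are ours) illustrates the tensor product, the tensor power, and the Kronecker product:

```python
import numpy as np

u = np.array([1.0, 2.0])           # u in R^2
v = np.array([3.0, 4.0, 5.0])      # v in R^3

# Tensor product of u and v: a 2 x 3 array with entries u_i * v_j.
T = np.tensordot(u, v, axes=0)

# The tensor power of v (order 2), flattened, equals the Kronecker
# product of v with itself: a length-9 vector with entries v_i * v_j.
v2 = np.kron(v, v)

# Kronecker product of an (m x n) and a (p x q) matrix is (mp) x (nq).
A = np.eye(2)
B = np.ones((2, 2))
K = np.kron(A, B)

print(T.shape, v2.shape, K.shape)
```

For vectors, the order-2 tensor product coincides with the outer product, which is the identity the two-layer analysis later exploits.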
2.2 Network Architectures
In this section, we introduce the neural net architectures we use. As we discussed, Theorem 1 uses a twolayer network (see Figure 1 (a)) and Theorem 2 uses a threelayer network (see Figure 1 (b)).
Twolayer Neural Network
For the two-layer neural network, suppose the input samples are in $\mathbb{R}^d$ and the hidden layer has $k$ hidden neurons (for simplicity, we assume $k$ is even; in Theorem 4 we will show that $k = O(d)$ is enough). The activation function of the hidden layer is $\sigma(z) = z^2$.
We use $w_i \in \mathbb{R}^d$ to denote the input weight of hidden neuron $i$. These weight vectors are collected as a weight matrix $W = [w_1, w_2, \dots, w_k]$. The output layer has only 1 neuron, and we use $a_i$ to denote its input weight from hidden neuron $i$. There is no nonlinearity for the output layer. For simplicity, we fix the parameters in the way that $a_i = 1$ for all $i \le k/2$ and $a_i = -1$ for all $i > k/2$. Given $x$ as the input, the output of the neural network is
$$\sum_{i=1}^{k} a_i\,(w_i^\top x)^2.$$
If the training samples are $(x_1, y_1), \dots, (x_n, y_n)$, we define the empirical risk of the neural network with parameters $W$ to be
$$f(W) = \frac{1}{2n}\sum_{j=1}^{n}\Big(\sum_{i=1}^{k} a_i\,(w_i^\top x_j)^2 - y_j\Big)^2.$$
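As a concrete sanity check, the two-layer architecture and its empirical risk can be sketched in a few lines of numpy (the function names and the $\frac{1}{2n}$ squared-loss convention here are ours):

```python
import numpy as np

def network_output(W, x):
    """Two-layer quadratic net: sum_i a_i (w_i^T x)^2 with fixed top
    weights a_i = +1 for the first k/2 neurons and -1 for the rest."""
    k = W.shape[1]                                   # columns of W are the w_i
    a = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])
    return float(np.sum(a * (W.T @ x) ** 2))

def empirical_risk(W, X, y):
    """f(W) = (1/2n) sum_j (output(x_j) - y_j)^2; columns of X are samples."""
    n = X.shape[1]
    preds = np.array([network_output(W, X[:, j]) for j in range(n)])
    return 0.5 / n * float(np.sum((preds - y) ** 2))

rng = np.random.default_rng(0)
d, k, n = 6, 4, 10
W = rng.standard_normal((d, k))
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)                       # unit-norm samples
y = rng.standard_normal(n)
print(empirical_risk(W, X, y))
```

Note that the output equals the quadratic form $x^\top\big(\sum_i a_i w_i w_i^\top\big)x$, a difference of two positive semidefinite matrices, so the network can represent an arbitrary symmetric quadratic form.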
Threelayer neural network
For Theorem 2, we use a more complicated, three-layer neural network. In this network, the first layer has a polynomial activation, and the next two layers are the same as in the two-layer network.
We use $R = [r_1, \dots, r_m]^\top$ to denote the weight parameter of the first layer. The first hidden layer has $m$ neurons with activation $\sigma_1(z) = z^{\ell}$, where $\ell$ is the parameter in Theorem 2. Given input $x$, the output of the first hidden layer is denoted as $z$, and satisfies $z_i = (r_i^\top x)^{\ell}$. The second hidden layer has $k$ neurons (again we will later show $k = O(\sqrt{n})$ is enough). The weight matrix for the second layer is denoted as $W = [w_1, \dots, w_k]$, where each $w_i$ is the weight for a neuron in the second hidden layer. The activation for the second hidden layer is $\sigma_2(z) = z^2$. The third layer has weight $a$ and is initialized the same way as before: $a_i = 1$ for $i \le k/2$ and $a_i = -1$ for $i > k/2$. The final output can be computed as
$$\sum_{i=1}^{k} a_i\,(w_i^\top z)^2.$$
Given inputs $x_1, \dots, x_n$, suppose $z_j$ is the output of the first hidden layer for $x_j$; the empirical loss is defined as:
$$f(W) = \frac{1}{2n}\sum_{j=1}^{n}\Big(\sum_{i=1}^{k} a_i\,(w_i^\top z_j)^2 - y_j\Big)^2.$$
Note that only the second-layer weight $W$ is trainable. The first layer with weights $R$ acts like a random feature layer that maps the $x_j$'s into a new representation, the $z_j$'s.
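A minimal numpy sketch of this three-layer forward pass, assuming the first-layer activation is the power function $z_i = (r_i^\top x)^{\ell}$ as described above (the names are ours; only `W` would be trained):

```python
import numpy as np

def first_layer(R, x, ell):
    """Fixed random first layer: z_i = (r_i^T x)^ell, rows of R are the r_i."""
    return (R @ x) ** ell

def three_layer_output(W, R, x, ell):
    """Layers two and three are exactly the two-layer quadratic net,
    applied to the random features z instead of the raw input x."""
    z = first_layer(R, x, ell)
    k = W.shape[1]
    a = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])
    return float(np.sum(a * (W.T @ z) ** 2))

rng = np.random.default_rng(1)
d, m, k, ell = 8, 5, 4, 3
R = rng.standard_normal((m, d))    # fixed, acts as a random feature map
W = rng.standard_normal((m, k))    # the only trainable parameters
x = rng.standard_normal(d)
print(three_layer_output(W, R, x, ell))
```

Since `R` is frozen, optimizing this model reduces to optimizing the two-layer quadratic network on the transformed inputs, which is exactly how the analysis proceeds.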
2.3 Second order stationary points and perturbed gradient descent
Gradient descent converges to a global optimum of a convex function. However, for nonconvex objectives, gradient descent is only guaranteed to converge to a first-order stationary point: a point with zero gradient, which can be a local/global optimum or a saddle point. Our result works with any algorithm that can find a second-order stationary point: a point with zero gradient and positive semidefinite Hessian. Many algorithms are known to achieve such a guarantee [ge2015escaping, sun2015nonconvex, carmon2018accelerated, agarwal2017finding, jin2017escape, jin2017accelerated]. As we require some additional properties of the algorithm (see Section 3), we will adapt Perturbed Gradient Descent (PGD) [jin2017escape]. See Section B for a detailed description of the algorithm. Here we give the guarantee of PGD that we need. The PGD algorithm requires the function and its gradient to be Lipschitz:
Definition 1 (Smoothness and Hessian Lipschitz).
A differentiable function $f$ is $L$-smooth if:
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y.$$
A twice-differentiable function $f$ is $\rho$-Hessian Lipschitz if:
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho\,\|x - y\| \quad \text{for all } x, y.$$
Under these assumptions, we will consider the following approximation of an exact second-order stationary point:
Definition 2 ($\epsilon$-second-order stationary point).
For a $\rho$-Hessian Lipschitz function $f$, we say that $x$ is an $\epsilon$-second-order stationary point if:
$$\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}\big(\nabla^2 f(x)\big) \ge -\sqrt{\rho\epsilon}.$$
jin2017escape showed that PGD converges to an $\epsilon$-second-order stationary point efficiently:
Theorem 3 (Convergence of PGD [jin2017escape]).
Assume that $f$ is $L$-smooth and $\rho$-Hessian Lipschitz. Then there exists an absolute constant $c_{\max}$ such that, for any $\delta > 0$, $\epsilon \le \frac{L^2}{\rho}$, $\Delta_f \ge f(x_0) - f^{*}$, and constant $c \le c_{\max}$, $\mathrm{PGD}(x_0, L, \rho, \epsilon, c, \delta, \Delta_f)$ will output an $\epsilon$-second-order stationary point with probability $1 - \delta$, and terminate in the following number of iterations:
$$O\!\left(\frac{L \Delta_f}{\epsilon^2}\,\log^4\!\Big(\frac{d L \Delta_f}{\epsilon^2 \delta}\Big)\right).$$
3 Warmup: Twolayer Net for Fitting Small Training Set
In this section, we show how the two-layer neural net in Section 2.2, trained with perturbed gradient descent, can fit any small training set (Theorem 1). Our result is based on a characterization of the optimization landscape: for small enough $\epsilon$, every $\epsilon$-second-order stationary point achieves near-zero training error. We then combine such a result with PGD to show that simple algorithms can always memorize the training data. Detailed proofs are deferred to Section D in the Appendix.
3.1 Optimization landscape of twolayer neural network
Recall that the two-layer network we consider has $k$ hidden units with bottom-layer weights $w_1, \dots, w_k$, and the weights of the top layer are set to $a_i = 1$ for $i \le k/2$ and $a_i = -1$ for $i > k/2$. For a set of input data $\{(x_j, y_j)\}_{j=1}^{n}$, the objective function is defined as
$$f(W) = \frac{1}{2n}\sum_{j=1}^{n}\Big(\sum_{i=1}^{k} a_i\,(w_i^\top x_j)^2 - y_j\Big)^2.$$
With these definitions, we will show that when a point is an approximate second-order stationary point (in fact, we just need it to have an almost positive semidefinite Hessian), it must also have low loss:
Lemma 1 (Optimization Landscape).
Given training data $\{(x_j, y_j)\}_{j=1}^{n}$, suppose the matrix $X = [x_1^{\otimes 2}, x_2^{\otimes 2}, \dots, x_n^{\otimes 2}]$ has full column rank and its smallest singular value is at least $\sigma_{\min}$. Also suppose that the number of hidden neurons satisfies $k \ge 2d$. Then if $\lambda_{\min}(\nabla^2 f(W)) \ge -\epsilon$, the function value is bounded by $f(W) \le \frac{nd\epsilon^2}{8\sigma_{\min}^2}$.
For simplicity, we will use $r_j$ to denote the residual for the $j$-th data point: the difference between the output of the neural network and the label, $r_j = \sum_{i=1}^{k} a_i (w_i^\top x_j)^2 - y_j$. We will also combine these residuals into a matrix $M = \frac{1}{n}\sum_{j=1}^{n} r_j\, x_j x_j^\top$. Intuitively, we first show that when $\|M\|$ is large, the smallest eigenvalue of the Hessian is very negative.
Lemma 2.
When the number of the hidden neurons satisfies $k \ge 2d$, we have
$$\lambda_{\min}\big(\nabla^2 f(W)\big) \le -2\max_i |\lambda_i(M)| = -2\|M\|,$$
where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a matrix and $\lambda_i(\cdot)$ denotes the $i$-th eigenvalue of a matrix.
Then we complete the proof by showing that if the objective function value is large, $\|M\|$ is large.
Lemma 3.
Suppose the matrix $X = [x_1^{\otimes 2}, \dots, x_n^{\otimes 2}]$ has full column rank and its smallest singular value is at least $\sigma_{\min}$. Then if the spectral norm of the matrix $M$ is upper bounded by $\epsilon$, the function value is bounded by
$$f(W) \le \frac{nd\epsilon^2}{2\sigma_{\min}^2}.$$
Combining the two lemmas, we know $f(W)$ is bounded when the point has an almost positive semidefinite Hessian; therefore every second-order stationary point must be near-optimal.
3.2 Optimizing the twolayer neural net
In this section, we show how to use PGD to train our twolayer neural network.
Given the property of the optimization landscape for $f$, it is natural to directly apply PGD to find a second-order stationary point. However, this is not enough, since the function $f$ does not have bounded smoothness and Hessian Lipschitz constants (its Lipschitz parameters depend on the norm of $W$), and without further constraints, PGD is not guaranteed to converge in polynomial time. In order to control the Lipschitz parameters, we note that these parameters are bounded when the norm of $W$ is bounded (see Lemma 5 in the appendix). Therefore we add a small regularizer term to control the norm of $W$. More concretely, we optimize the following objective
$$g(W) = f(W) + \frac{\lambda}{2}\|W\|_F^2.$$
We want to use this regularizer term to show that: 1. the optimization landscape is preserved: for appropriate $\lambda$, any second-order stationary point of $g$ will still give a small $f(W)$; and 2. during the training process of the two-layer neural network, the norm of $W$ is bounded, and therefore the smoothness and Hessian Lipschitz parameters are bounded. Then, the proof of Theorem 1 just follows from combining Theorem 3 for PGD with this geometric property.
The first step is simple, as the regularizer only introduces a term $\lambda I$ to the Hessian, which increases all the eigenvalues by $\lambda$. Therefore any second-order stationary point of $g$ still has an almost positive semidefinite Hessian for $f$, and hence $f(W)$ is small by Lemma 1.
For the second step, note that in order to show the training process using PGD will not escape from the region where $\|W\|_F$ is bounded, it suffices to bound the function value $g(W)$, which in turn bounds $\|W\|_F$ through the regularizer. To bound the function value we use properties of PGD: for a gradient descent step, since the function is smooth in this region, the function value always decreases; for a perturbation step, the function value can increase, but not by too much. Using mathematical induction, we can show that the function value of $g$ stays below some fixed value (related to the random initialization but not to the iteration count $t$), so the iterates do not escape the bounded region for appropriate $\lambda$.
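The norm bound in the second step is a one-line consequence of the regularizer; as a sketch (here $B$ denotes any upper bound maintained on the regularized objective during training, assuming $g(W) = f(W) + \frac{\lambda}{2}\|W\|_F^2$):

```latex
\frac{\lambda}{2}\|W\|_F^2 \;\le\; f(W) + \frac{\lambda}{2}\|W\|_F^2 \;=\; g(W) \;\le\; B
\quad\Longrightarrow\quad
\|W\|_F \;\le\; \sqrt{2B/\lambda},
```

using only the fact that $f(W) \ge 0$.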
Using PGD on the function $g$, we have the following main theorem for the two-layer neural network.
Theorem 4 (Main theorem for 2layer NN).
Suppose the matrix $X = [x_1^{\otimes 2}, \dots, x_n^{\otimes 2}]$ has full column rank and its smallest singular value is at least $\sigma_{\min}$. Also assume that $\|x_j\| \le 1$ and $|y_j| \le 1$ for all $j$. We choose the width of the neural network to be $k \ge 2d$, and we choose the regularization parameter $\lambda$ and the PGD parameters appropriately as polynomials in $n$, $d$, $1/\sigma_{\min}$, and the target accuracy $\epsilon$. Then there exists an absolute constant $c_{\max}$ such that, for any $\delta > 0$ and constant $c \le c_{\max}$, PGD on $g(W)$ will output a parameter $W$ such that $f(W) \le \epsilon$ with probability $1 - \delta$, and the algorithm terminates in a number of iterations that is polynomial in $n$, $d$, $1/\sigma_{\min}$, $1/\epsilon$, and $\log(1/\delta)$.
4 ThreeLayer Net for Fitting Larger Training Set
In this section, we show how a three-layer neural net can fit a larger training set (Theorem 2). The main limitation of the two-layer architecture in the previous section is that the activation functions are quadratic. No matter how many neurons the hidden layer has, the whole network only computes a quadratic function of the input, and cannot fit an arbitrary training set of size much larger than $d^2$. On the other hand, if one replaces the quadratic activation with other functions, it is known that even two-layer neural networks can have bad local minima [safran2018spurious].
To address this problem, the three-layer neural net in this section uses the first layer as a random mapping of the input. The first layer maps the inputs $x_j$ into vectors $z_j$ of dimension $m$ (where $m = O(\sqrt{n})$). If the $z_j$'s satisfy the requirements of Theorem 4, then we can use the same arguments as in the previous section to show that perturbed gradient descent can fit the training data.
We prove our main result in the smoothed analysis setting, which is a popular approach for going beyond the worst case. Given any worst-case inputs $\bar{x}_1, \dots, \bar{x}_n$, in the smoothed analysis framework these inputs are first slightly perturbed before being given to the algorithm. More specifically, let $x_i = \bar{x}_i + \tilde{x}_i$, where $\tilde{x}_i$ is a random Gaussian vector following the distribution $\mathcal{N}(0, \nu^2 I)$. Here the amount of perturbation is controlled by the variance $\nu^2$. The final running time of our algorithm will depend inverse polynomially on $\nu$. Note that, on the other hand, the network architecture and the number of neurons/parameters in each layer do not depend on $\nu$.
Let $z_i$ denote the output of the first layer on input $x_i$; we first show that the $z_i$'s satisfy the requirement of Theorem 4:
Lemma 4.
Suppose $n \le O_{\ell}(d^{\ell})$, and let $x_i = \bar{x}_i + \tilde{x}_i$ be the perturbed input in the smoothed analysis setting, where $\tilde{x}_i \sim \mathcal{N}(0, \nu^2 I)$. Let $z_i$ be the output of the first layer on the perturbed input, and let $Z$ be the matrix whose $i$-th column is equal to $z_i^{\otimes 2}$. Then with high probability, the smallest singular value of $Z$ is at least an inverse polynomial in $n$ and $1/\nu$.
This lemma shows that the output of the first layer (the $z_i$'s) satisfies the requirements of Theorem 4. With this lemma, we can prove the main theorem of this section:
Theorem 5 (Main theorem for 3layer NN).
Suppose the original inputs satisfy $\|\bar{x}_i\| \le 1$ and the inputs are perturbed by Gaussian noise of variance $\nu^2$. Then with high probability over the random initialization and the perturbation, for any $\epsilon > 0$, perturbed gradient descent on the second-layer weights achieves a loss of at most $\epsilon$ in a number of iterations that is polynomial in $n$, $1/\nu$, and $1/\epsilon$.
Using different tools, we can also prove a similar result without the smoothed analysis setting:
Theorem 6.
Suppose the matrix $[\bar{x}_1^{\otimes \ell}, \dots, \bar{x}_n^{\otimes \ell}]$ has full column rank, and smallest singular value at least $\sigma_{\min}$. Then, choosing the first-layer weights at random, with high probability perturbed gradient descent on the second-layer weights achieves a loss of at most $\epsilon$ in a number of iterations that is polynomial in $n$, $1/\sigma_{\min}$, and $1/\epsilon$.
5 Experiments
In this section, we validate our theory using experiments. Detailed parameters of the experiments, as well as more results, are deferred to Section A in the Appendix.
Small Synthetic Example
We first run gradient descent on a small synthetic dataset, which fits into the setting of Theorem 4. Our training set, including the samples and the labels, is generated from a fixed normalized uniform distribution (random samples from a hypercube, then normalized to have norm 1). As shown in Figure 2, simple gradient descent can already memorize the training set.
MNIST Experiment
We also show how our architectures (both two-layer and three-layer) can be used to memorize MNIST. For MNIST, we use a squared loss between the network’s prediction and the true label (which is an integer in $\{0, 1, \dots, 9\}$). For the two-layer experiment, we use the original MNIST dataset, with a small Gaussian perturbation added to the data to make sure the condition in Theorem 4 is satisfied. For the three-layer experiment, we use PCA to project MNIST images to 100 dimensions (so that the two-layer architecture will no longer be able to memorize the training set). See Figure 3 for the results. In this part, we use ADAM as the optimizer to improve convergence speed, but as we discussed earlier, our main result is on the optimization landscape, and the choice of algorithm is flexible.
MNIST with random label
We further test our results on MNIST with random labels to verify that our result does not use any potential structure in the MNIST datasets. The setting is exactly the same as before. As shown in Figure 4, the training loss can also converge.
6 Conclusion
In this paper, we showed that even a mildly overparametrized neural network can be trained to memorize the training set efficiently. The numbers of neurons and parameters in our results are tight (up to constant factors) and match the bounds in yun2018small. There are several immediate open problems, including generalizing our result to more standard activation functions and providing generalization guarantees. More importantly, we believe that the mildly overparametrized regime is more realistic and interesting compared to the highly overparametrized regime. We hope this work will serve as a first step towards understanding the mildly overparametrized regime for deep learning.
References
Appendix A More Experiments and Detailed Experiment Setup
A.1 Experiment setup
In this section, we introduce the experiment setup in detail.
Small Synthetic Example
We generate the dataset in the following way: we first set up a random matrix $\bar{X} \in \mathbb{R}^{n \times d}$ (samples), where $n$ is the number of samples and $d$ is the input dimension, and a random vector $y \in \mathbb{R}^{n}$ (labels). Each entry in $\bar{X}$ or $y$ follows a uniform distribution, and each entry is independent from the others. Then we normalize the dataset such that each row in $\bar{X}$ has norm 1; denote the normalized dataset as $X$. We then compute the smallest singular value of the matrix $[x_1^{\otimes 2}, \dots, x_n^{\otimes 2}]$, and we feed the normalized dataset into the two-layer network (Section 2.2) with $k$ hidden neurons. We select all the parameters as shown in Theorem 4, and plot the function value $f(W)$.
In our experiment for the small artificial random dataset, we choose , and .
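The generation procedure above can be sketched as follows (the sizes here are illustrative stand-ins, not the omitted values from our runs):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 10                             # illustrative sizes only

# Entry-wise uniform samples and labels.
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = rng.uniform(-1.0, 1.0, size=n)

# Normalize each sample (row) to unit norm.
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Smallest singular value of the matrix whose j-th column is x_j tensor x_j.
X2 = np.stack([np.kron(X[j], X[j]) for j in range(n)], axis=1)   # d^2 x n
sigma_min = np.linalg.svd(X2, compute_uv=False)[-1]
print(sigma_min)
```

For generic random data with $n < d^2$, this matrix has full column rank, so the smallest singular value is strictly positive, as required by Theorem 4.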
MNIST experiments
For MNIST, we use a squared loss between the network’s prediction and the true label (which is an integer in $\{0, 1, \dots, 9\}$).
For the two-layer network structure, we first normalize the samples in the MNIST dataset to have norm 1. Then we set up a two-layer network with quadratic activation and $k$ hidden neurons (note that although our theory suggests choosing $k = O(d)$, having a larger $k$ increases the number of decreasing directions and helps optimization algorithms in practice). For these experiments, we use the Adam optimizer [kingma2014adam] with batch size 128, initial learning rate 0.003, and decay the learning rate by a factor of 0.3 every 15 epochs (we find that the learning-rate decay is crucial for getting high accuracy).
We run the twolayer network in two settings, one for the original MNIST data, and one for the MNIST data with a small Gaussian noise (0.01 standard deviation per coordinate). The perturbation is added in order for the conditions in Theorem 4 to hold.
For the three-layer network structure, we first normalize the samples in the MNIST dataset to have norm 1. Then we use PCA to project them into a 100-dimensional subspace. We use $\bar{X}$ to denote this dataset after PCA. Note that the original two-layer network may not apply to this setting, since now the matrix $[x_1^{\otimes 2}, \dots, x_n^{\otimes 2}]$ does not have full column rank ($n$ exceeds $d^2$ with $d = 100$). We then add a small Gaussian perturbation to the sample matrix and denote the perturbed matrix by $X$. We then randomly select a matrix $R$ and compute the random features $Z = (XR^\top)^{\circ 2}$, where $(\cdot)^{\circ 2}$ denotes the element-wise square. Then we feed these samples into the two-layer neural network with $k$ hidden neurons. Note that this is equivalent to our three-layer network structure in Section 2.2.
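The PCA-plus-random-features pipeline can be sketched as follows (using random stand-in data instead of MNIST, and stand-in sizes; MNIST itself uses a 100-dimensional projection):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_raw, d_pca, m = 50, 30, 10, 8      # stand-in sizes (MNIST uses d_pca = 100)

X = rng.standard_normal((n, d_raw))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm samples

# PCA: project onto the top d_pca right singular vectors.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
X_pca = X @ Vt[:d_pca].T

# Small Gaussian perturbation, as in the smoothed-analysis setting.
X_pert = X_pca + 0.01 * rng.standard_normal(X_pca.shape)

# Random features: element-wise square of a random linear map; these are
# then fed to the two-layer quadratic network.
R = rng.standard_normal((m, d_pca))
Z = (X_pert @ R.T) ** 2
print(Z.shape)
```

The element-wise square corresponds to the polynomial first-layer activation, so training the two-layer quadratic net on `Z` is equivalent to training the three-layer architecture with frozen first-layer weights `R`.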
MNIST with random labels
These experiments have exactly the same setup as the original MNIST experiments, except that the labels are replaced by random numbers in $\{0, 1, 2, \dots, 9\}$.
A.2 Experiment Results
In this section, we give detailed experiment results with bigger plots. For all the training-loss graphs, we record the training loss every 5 iterations. Then, for the $i$-th recorded loss, we average the recorded losses in a trailing window ending at the $i$-th record and set it as the average loss at that iteration. Then we take the logarithm of the loss and generate the training-loss graphs.
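The averaging used for the plots can be written as a simple trailing-window smoother (the window length below is our stand-in; the exact window used in the runs is omitted above):

```python
import numpy as np

def smoothed_log_loss(losses, window=10):
    """For the i-th recorded loss, average the records in a trailing
    window ending at i, then take the logarithm (as in the loss plots)."""
    losses = np.asarray(losses, dtype=float)
    out = np.empty_like(losses)
    for i in range(len(losses)):
        out[i] = losses[max(0, i - window + 1): i + 1].mean()
    return np.log(out)

print(smoothed_log_loss([1.0, 0.5, 0.25, 0.125], window=2))
```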
Small Synthetic Example
As we can see in Figure 5, the loss converges to 0 quickly.
MNIST experiments with original labels
First we compare Figure 6 and Figure 7. In Figure 6, we optimize the two-layer architecture with the original input/labels. Here the loss decreases to a small value, but the decrease becomes slower afterwards. This is likely because, for the matrix defined in Theorem 4, some of the directions have very small singular values, which makes it much harder to correctly optimize in those directions. In Figure 7, after adding the perturbation, the smallest singular value of the matrix becomes better, and as we can see, the loss decreases geometrically to a very small value.
A surprising phenomenon is that even though we offer no generalization guarantees, the network trained as in Figure 6 has an MSE of 1.21 when evaluated on the test set, which is much better than a random guess (recall the range of labels is 0 to 9). This is likely due to some implicit regularization effect [gunasekar2017implicit, li2018algorithmic].
For three-layer networks, in Figure 8 we can see that even though we are using only the top 100 PCA directions, the three-layer architecture can still drive the training error to a very low level.
MNIST with random label
When we try to fit random labels, the original MNIST input does not work well. We believe this is again because there are many small singular values for the matrix in Theorem 4, so the data does not have enough effective dimensions to fit random labels. The reason that it was still able to fit the original labels to some extent (as in Figure 6) is likely that the original labels are correlated with some features of the input, so the original labels are less likely to fall into the subspace with smaller singular values. A similar phenomenon was found in arora2019fine.
Appendix B Detailed Description of Perturbed Gradient Descent
In this section we give the pseudocode of the Perturbed Gradient Descent algorithm from jin2017escape; see Algorithm 1. The algorithm is quite simple: it just runs standard gradient descent, except that if the loss has not decreased for a long enough time, it adds a perturbation. The perturbation allows the algorithm to escape saddle points. Note that we only use the PGD algorithm to find a second-order stationary point. Many other algorithms, including stochastic gradient descent and accelerated gradient descent, are also known to find a second-order stationary point efficiently. All these algorithms can be used for our analysis.
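A compact sketch of this mechanism (simplified from jin2017escape; the step size, thresholds, and the test function below are our illustrative choices, not the tuned constants of the analysis):

```python
import numpy as np

def pgd(grad, x0, eta=0.05, radius=1e-2, g_thres=1e-3, t_thres=10,
        t_max=2000, seed=0):
    """Perturbed gradient descent sketch: plain gradient steps, plus
    uniform-direction noise from a small ball whenever the gradient is
    small and no perturbation happened recently; the noise lets the
    iterate escape saddle points."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    t_noise = -t_thres - 1
    for t in range(t_max):
        if np.linalg.norm(grad(x)) <= g_thres and t - t_noise > t_thres:
            xi = rng.standard_normal(x.shape)
            x = x + radius * xi / np.linalg.norm(xi)     # perturbation step
            t_noise = t
        x = x - eta * grad(x)                            # gradient step
    return x

# f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2 has a strict saddle at the origin
# and minima at (+1, 0) and (-1, 0); started exactly at the saddle, plain
# gradient descent would stay there, while PGD escapes.
def grad_f(p):
    x, y = p
    return np.array([x**3 - x, y])

print(pgd(grad_f, [0.0, 0.0]))
```

Started at the exact saddle point, the gradient is zero, so only the perturbation step moves the iterate; afterwards gradient descent carries it to one of the two minima.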
Appendix C Gradient and Hessian of the Cost Function
Before we prove any of our main theorems, we first compute the gradient and Hessian of the functions $f$ and $g$. In our training process, we need to compute the gradient of the function $g$, and in the analysis of the smoothness and Hessian Lipschitz constants, we need both the gradient and the Hessian.
Recall that given the samples $\{x_j\}_{j=1}^{n}$ and their corresponding labels $\{y_j\}_{j=1}^{n}$, we define the cost function of the neural network with parameters $W$ as
$$f(W) = \frac{1}{2n}\sum_{j=1}^{n}\Big(\sum_{i=1}^{k} a_i\,(w_i^\top x_j)^2 - y_j\Big)^2.$$
Given the above form of the cost function, we can write out the gradient and the Hessian with respect to $W$. Writing $r_j = \sum_{i=1}^{k} a_i (w_i^\top x_j)^2 - y_j$ for the residuals, we have the following gradient,
$$\frac{\partial f}{\partial w_i} = \frac{2 a_i}{n}\sum_{j=1}^{n} r_j\,(w_i^\top x_j)\, x_j,$$
and
$$\frac{\partial^2 f}{\partial w_i\,\partial w_{i'}} = \frac{4 a_i a_{i'}}{n}\sum_{j=1}^{n} (w_i^\top x_j)(w_{i'}^\top x_j)\, x_j x_j^\top \;+\; \mathbb{1}[i = i']\,\frac{2 a_i}{n}\sum_{j=1}^{n} r_j\, x_j x_j^\top.$$
In the above computation, $\frac{\partial f}{\partial w_i}$ is a column vector and $\frac{\partial^2 f}{\partial w_i \partial w_{i'}}$ is a square matrix whose rows correspond to derivatives with respect to the elements of $w_i$ and whose columns correspond to derivatives with respect to the elements of $w_{i'}$. Then, given the above formulas, we can write out the quadratic form of the Hessian with respect to a direction $U = [u_1, \dots, u_k]$,
$$\big\langle \nabla^2 f(W)[U], U\big\rangle = \frac{1}{n}\sum_{j=1}^{n}\Big(\sum_{i=1}^{k} 2 a_i\,(w_i^\top x_j)(u_i^\top x_j)\Big)^2 + \frac{2}{n}\sum_{j=1}^{n} r_j \sum_{i=1}^{k} a_i\,(u_i^\top x_j)^2.$$
In order to train this neural network in polynomial time, we need to add a small regularizer to the original cost function $f(W)$. Let
$$g(W) = f(W) + \frac{\lambda}{2}\|W\|_F^2,$$
where $\lambda$ is a constant. Then we can directly get the gradient and the Hessian of $g$ from those of $f$. We have
$$\nabla g(W) = \nabla f(W) + \lambda W, \qquad \nabla^2 g(W) = \nabla^2 f(W) + \lambda I.$$
For simplicity, we can use $A$ to denote $\operatorname{diag}(a_1, \dots, a_k)$, a diagonal matrix with $A_{ii} = a_i$. Then, for any direction $U = [u_1, \dots, u_k]$, we have
$$\big\langle \nabla^2 g(W)[U], U\big\rangle = \big\langle \nabla^2 f(W)[U], U\big\rangle + \lambda\,\|U\|_F^2.$$
Appendix D Omitted Proofs for Section 3
In this section, we give a formal proof of Theorem 4, following the proof sketch in Section 3. First, in Section D.1, we prove Lemma 1, which gives the optimization landscape of the two-layer neural network with large enough width; then in Section D.2 we show that the training process on the function $g$ with regularization ends in polynomial time.
D.1 Optimization landscape of two-layer neural net
In this part we prove the optimization landscape result (Lemma 1) for the two-layer neural network. First we recall Lemma 1.
Lemma 1 (restated).
For simplicity, we will use $r_j = \sum_{i=1}^{k} a_i (w_i^\top x_j)^2 - y_j$ to denote the error between the output of the neural network and the label. Consider the matrix $M = \frac{1}{n}\sum_{j=1}^{n} r_j\, x_j x_j^\top$. To show that every second-order stationary point of $f$ will have a small function value, we need the following two lemmas.
Generally speaking, the first lemma shows that, when the network is large enough, any point with an almost positive semidefinite Hessian will lead to a small spectral norm of the matrix $M$.
Lemma 2 (restated).
Proof.
First note that the inequality
$$\lambda_{\min}\big(\nabla^2 f(W)\big) \le -2\max_i |\lambda_i(M)|$$
is equivalent to
$$\min_{\|U\|_F = 1} \big\langle \nabla^2 f(W)[U], U\big\rangle \le -2\max_i |\lambda_i(M)|,$$
and we will give a proof of the equivalent form.
First, we show that
$$\big\langle \nabla^2 f(W)[U], U\big\rangle \ge 2\sum_{i=1}^{k} a_i\, u_i^\top M u_i.$$
Intuitively, this is because $\langle \nabla^2 f(W)[U], U\rangle$ is the sum of two terms, one of which is always nonnegative, while the other term is a weighted combination of the quadratic form of the matrix $M$ applied to the different columns of $U$.
Then we have
$$\big\langle \nabla^2 f(W)[U], U\big\rangle \ge -2\max_i |\lambda_i(M)|\;\|U\|_F^2.$$
For the other side, we show that
$$\min_{\|U\|_F = 1}\big\langle \nabla^2 f(W)[U], U\big\rangle \le -2\max_i |\lambda_i(M)|$$
by showing that there exists $U$ such that $\langle \nabla^2 f(W)[U], U\rangle \le -2\max_i |\lambda_i(M)|\,\|U\|_F^2$.
First, let $v$ be a unit eigenvector of $M$ corresponding to the eigenvalue of largest absolute value. Recall that for simplicity, we assume that $k$ is an even number and that $a_i = 1$ for all $i \le k/2$ and $a_i = -1$ for all $i > k/2$. Since $k \ge 2d$, there exists a vector $u$ that satisfies a unit-norm constraint (constraint 1) together with two families of homogeneous linear constraints (constraints 2 and 3, which force the nonnegative term of the quadratic form to vanish). Indeed, constraints 2 and 3 form a homogeneous linear system, and the total number of variables exceeds the total number of equations since we assume that $k \ge 2d$; hence there must exist a nonzero $u$ that satisfies constraints 2 and 3. We then normalize that $u$ to have unit norm.