Quadratic number of nodes is sufficient to learn a dataset via gradient descent.
We prove that if an activation function satisfies some mild conditions and number of neurons in a two-layered fully connected neural network with this activation function is beyond a certain threshold, then gradient descent on quadratic loss function finds the optimal weights of input layer for global minima in linear time. This threshold value is an improvement over previously obtained values in [1, 2].
During the last decade deep learning has achieved remarkable success in several fields of practical applications, despite the fact that the reason for its success is still largely unclear. In particular, it is not yet well understood why neural nets used in practice generalize to data points not used in its training procedure. Alongside with this unexplained phenomenon, there is another issue, which is relatively simpler to state: why do neural nets learn the training data? In other words, it is not clear why our training algorithm (say gradient descent or stochastic gradient descent) manages to find a solution with zero training loss.
There are theoretical results based on approximation theory  which establish that a large enough neural network with suitable weights can approximate any function. However, such theoretical considerations do not touch upon the issues raised above, for example, it is not guaranteed that the corresponding optimal weights associated with the network can be found by performing say, a stochastic gradient descent (SGD) (one of the most commonly used optimization procedure).
In practice however, we see that for networks of practical sizes SGD always succeeds in finding a point of the global minima of the associated loss function. One step towards understanding this phenomenon at a theoretical level was made by Du et al. , who proved that with a probability of , gradient descent finds global minimum of the loss function corresponding to an regression problem solved by a two-layer fully-connected network with ReLU activation, given that the number of hidden neurons is at least , where is the number of training examples and is the smallest eigenvalue of the gramian matrix made with the training data. Recently, Song & Yang  improved this result to .
In a similar vein, our current work further improves this result to under some mild assumptions on activation function that are satisfied by many ones used in practice. We hypothesise that this result is no longer improvable using the same technique.
This work was supported by National Technology Initiative and PAO Sberbank project ID 0000000007417F630002.
2. A new activation function and statement of the main result
2.1. A necessary and sufficient condition for a function to be an activation function
In [3, Theorem 1] the author gives a necessary and sufficient condition for a function to be an activation function for a neural network. We state the result below:
[3, Theorem 1] Suppose is a function which is in and the closure of the set of discontinuities of are of Lebesgue measure zero. Define
Then is dense in in the compact open topology if and only if is not an algebraic polynomial (a.e.).
It is worthwhile to note that all the standard activation functions used in neural networks satisfy the hypothesis of the above proposition. We will prove that under some mild conditions on an activation function, a two-layered neural network with number of hidden neurons beyond a certain improved threshold, can always be trained by gradient descent in linear time. Our new improved threshold is an improvement over previous threshold values obtained in [1, 2].
2.2. Our set-up
Suppose is a function which satisfies the conditions of Proposition 2.1, so that the single hidden layered fully connected neural network with this activation function given by
where are the input weights, are the input biases and are the output weights, can approximate any element in in the compact open topology with the right choice of . To facilitate easy computation and avoid unecessary complications with notations, we will consider the above mentioned neural network without the “bias” term, i.e. our working model will be
Assumptions on the activation function
The only assumptions we make on the activation function are the following:
We assume that there exist constants such that and for all .
Let . We then assume that for some constant .
An important example of such an activation function is given by the softplus activation function defined as . Clearly, it satisfies Assumption (a). Using the inequality for all , it can be shown that also satisfies Assumption (b).
Suppose we are given a training data , where for each and corresponding responses , where for each . Denoting the collection of input weights of the neural network (1) by and output weights as , we define the quadratic loss function as
We will give a new lower bound on , the number of neurons in the hidden layer, and will prove that performing continuous-time gradient descent on the loss function with that many neurons, leads us to zero-loss solution, with the convergence rate being exponential. This new lower bound is better than the bounds available so far in the literature ([1, 2]). More precisely, we prove the following result:
Consider an activation function satisfying the assumptions mentioned in Subsection 2.2. Suppose we are given a training data with , , such that , and responses with and for some number . Let be the minimum eigenvalue of the matrix , whose entries are given by
Then we have the following:
Fix a such that , where . Let us select ( are the constants appearing in the assumption in Subsection 2.2) and consider the network with a single hidden layer:
Then with random initializations: and for all and the condition that we do not train the output layer i.e. ’s are kept fixed upon initialization, with probability at least , gradient descent with small enough step size converges to ’s at an exponential rate.
Define . There exists a constant such that implies . Then, as a corollary of point (b) we have that with random initialization as in the previous point, with probability at least , gradient descent with small enough step size converges to ’s at an exponential rate.
2.3. Why this new bound on is better than the available bounds
To the best of our knowledge, the first mathematical proof that with high probability, gradient descent indeed finds the optimal input weights while training a two-layered neural network with ReLU activation function, provided the number of neurons in the hidden layer is above a certain threshold, appeared in the work . The threshold was proven to be where is the failure probability (see [1, Theorem 3.2]). Mathematically this result is sound however, it is too loose: indeed, as we observe in practice gradient descent easily finds a zero-loss solution with much fewer number of hidden nodes [4, Section 6], a phenomenon whose mathematical explanation is still lacking to the best of our knowledge.
A subsequent improvement of the lower bounds appeared in the work [2, Theorem 1.4] namely, the bound was brought down to , which is still too loose. Although in [2, Theorem 1.6], a threshold which looks like has been proposed, this threshold introduces two quantities: and , which depend on training dataset. In the worst case and . Hence [2, Theorem 1.6] does not improve over quartic lower bound of [2, Theorem 1.4] in the general case. In contrast, our result holds without any additional assumptions on training data.
A recent work  has considered the problem of convergence of gradient descent in light of deep neural networks with analytic activation functions. It is worthwhile to say a few words about this work in the present context. In particular, [5, Theorem 1] applied to a single layered fully connected network apparently yields a threshold . However, we would like to point out that the consideration in  is fundamentally different from ours as well as that in [1, 2], within the context of a network with single hidden layer. A closer look at the proof of [5, Theorem 1] after equation (19) in [5, pp. 5] reveals that, the proof proceeds by enforcing zero learning rate for all but the last (output) layer. To put it differently, if we train our single layered network given by Equation (1) by the methodology proposed in the proof of [5, Theorem 1] upon making some random initialization, the only weights which are being updated in the sequel are the output weights namely . This means that we are essentially training a single layered linear network with gradient descent on randomly extracted features.
Let us show that in this case if the given number of nodes is at least , and all of the training points are different, gradient descent finds the global minima almost surely with respect to initialization. We give a brief proof of this. Recall the quadratic loss from Equation (2). Denoting it by , it follows that gradient of with respect to the output weights only has components given by
Denoting the gradient by , we see that , where is the matrix given by for and and is the vector whose components are . At a critical point we must have that which means . Multiplying the last equation from the left by , this means . If is invertible, this would mean that the only critical point is the one for which i.e. for all , which in that case would be a point of global minima. will be invertible if and only if . Now we may observe that is an analytic function of the input weights and biases. By [6, Lemma 1.2] it follows that either the set of zeroes of the analytic function are of Lebesgue measure zero, or is identically zero. Since the number of nodes is not less than the number of training points , and all of the training points are different, there exist such for which is not zero. Hence the first case holds. This means if we randomly initialize the input weights and biases by sampling from any distribution which has a density with respect to the Lebesgue measure, almost surely we will have the matrix invertible, so that the loss function will have a unique critical point, which would be a point of global minima. It follows now that gradient descent in this case will always converge to a global minimum with zero loss. In practice, however, networks are usually trained as a whole. Training the last layer only corresponds to so-called “kernel” or “lazy training” regime , which is not our consideration. The above-mentioned work argues that it is unlikely that such a lazy training regime is behind many succeses of deep learning.
Throughout the section and henceforth, will denote the tensor product of matrices and will denote the direct sum of matrices. We will not define these notions, as they could be found in any standard text book on linear algebra.
3.1. Khatri-Rao product of matrices
Suppose and be matrices. The Khatri-Rao product of and [8, Lemma 13] denoted is a matrix, whose column is obtained by taking tensor product of the columns of and . For example, let and . Then would be the matrix obtained as: , where we have the convention that .
3.2. Concentration inequalities for Lipschitz functions of Gaussian random variables
The following result is a very powerful concentration inequality, a nice proof of which could be found in the online lecture notes:
https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf where it appears as Theorem 2.4.
Let be a vector of iid standard Gaussian variables, and let be -Lipschitz with respect to the Euclidean norm. Then the variable
is sub-Gaussian with parameter at most , and hence
3.3. Minimal eigenvalues of perturbed matrices
For a real matrix , let denotes the -norm (i.e. ) and denotes the least eigenvalue. For two positive semidefinite matrices , we have that .
Since and are positive semidefinite matrices, by spectral theorem it implies that and . Now
so that we have
for all , so that taking the minimum of the left hand side yields , as desired. ∎
Suppose and are two positive semidefinite matrices such that for each we have . Then .
For a matrix , let denotes the Frobenius norm. Then it is a well known fact that . Now
Applying Lemma 3.2, we have the result. ∎
4. Dynamics of the Gramian matrix associated with the gradient of the loss function and related results
Setting , we define
. is a differentiable function with respect to the input weights . We make a random assignment for so that for each , . Let us explicitly compute .
Recall each is a vector in , so that we can write . This means that is a function with variables in . Now
We compute .
Let be the vector whose components are given by for . Let us define a matrix whose entries are given by for and , as well as a matrix whose entries are given by for and . It then follows that
or in other words we have that
where is the Khatri-Rao product of matrices as described in Subsection 3.1 and denotes the direct sum of matrices. Let us look at a typical column of the matrix . The j-column () looks like:
Let us consider the matrix . The element of this matrix is given by
where we have used the fact that for all . Let us consider iid normal variates , where for each and . We will prove a result similar to [1, Theorem 3.1]. Consider the matrix given by
Let denotes the measure space with the measure induced by the random variable and denotes the Hilbert space of square integrable -valued functions, so that for , we have that
For , let us define by . Clearly, for all . To prove the hypothesis of the theorem, we may note that the matrix is the Gramian matrix given by , so that the problem boils down to proving that the vectors are linearly independent in . So let for some scalars . This means for almost all , we have that . Now is a topological space (carries the topology of ) and the measure on induced by is a Radon measure, which assigns positive mass to open subsets of . is a continuous -valued function which is almost everywhere (with respect to this measure on ) zero. Thus it must be zero everywhere, i.e. we have for all i.e. for all , so that in particular for all and . Let us select such that for each the numbers are different from each other i.e. for (See Section 5 for a proof of this). Also, let us assume without loss of generality that . With this we have that
It follows by a direct computation that for all . Differentiating equation (4) with respect to -times and evaluating at zero, we see that
Recalling that for all we must have , the above equation implies that
Now letting , noting that for all so that , we have that which implies that as . Considering the above sum from onwards, we can prove similarly that , which proves the linear independence, which in turn implies the hypothesis of the theorem. ∎
Given any with , for (see Subsection 2.2 for the definitions of ), with probability at least (where the probability is the product probability on induced by iid random variables ), the matrix described by
For each , let us define a random variable
and the quantity given by
It follows that . can be thought of as a function in variables (a function of the input weights). Let us write in a convenient form. Consider the vectors
Then it follows that . Recalling that for all , it follows that
We have that and by the assumption on activation function, as mentioned in Subsection 2.2. This implies that
and similarly . This implies that . By virtue of mean value theorem it now follows that can be regarded as a Lipschitz function with parameter . Thus by the inequality stated in Subsection 3.2 we have that for any
Putting , we get
We now have
Let us relate the elements and matrices and . We may note that
Let us also recall that for all . Thus we have that
Thus we have that for any , . This in turn means
Since , this means that the event has a probability of at least . Now would imply that , which means with probability at least , , as required. ∎
4.1. Convergence of gradient descent
In our discourse, we will not train the output weights, we will only train the input weights. Let us make a random initialization of the output weights by selecting them uniformly from the set . Then the loss function becomes a function of the input weights only, so that we can regard it as a function in variables (recall that each of the input weights and there are many of them). The gradient descent then proceeds by updating the input weights as:
where is the step size and is some random initialization and by we are denoting the vector consisting of input weights i.e. which is a vector in . We would like to select a very small step size, and moreover, let us note that our loss function is differentiable as is a differentiable function. So we can rephrase the above equation in terms of a differential equation:
a solution of which will be a continuous curve in given by . At any given time point , let us consider the matrix described by
We now closely follow the technique outlined in [1, p. 5–6]. Let for . Then for , so that letting and , we can write in the vectorial form as . The following lemma can be proven exactly as in [1, Lemma 3.3], so we skip the proof.
Suppose for we have that . Then
For all , we have .
Let us observe the elements of the matrix more closely, namely elements of the form