On the convergence of gradient descent for two layer neural networks

Lei Li leili2010@sjtu.edu.cn School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China.
Abstract

It has been shown that gradient descent can yield zero training loss in the over-parametrized regime (the width of the neural network is much larger than the number of data points). In this work, combining ideas from some existing works, we investigate the gradient descent method for training two-layer neural networks to approximate target continuous functions. By making use of the generic chaining technique from probability theory, we show that gradient descent can yield an exponential convergence rate, while the required width of the neural network is independent of the size of the training data. The result also implies a strong approximation ability of two-layer neural networks without the curse of dimensionality.

1 Introduction

The universal approximation theorem tells us that a two-layer neural network can approximate a broad class of functions, provided that the width is sufficiently large, i.e., in the over-parametrized regime [1, 2]. Recently, it has been shown that gradient descent can find the optimal parameters for wide two-layer neural networks, both using optimal transport theory [3, 4, 5] and using direct estimation on particular neural network structures [6, 7, 8]. All these results are in the over-parametrized regime. We are interested in whether the loss function can converge to zero in the under-parametrized regime, i.e., when the number of data points is larger than the width of the two-layer neural network.

Following [4, 6], we consider the loss function given by the quadratic function

$$L = \frac{1}{2}\int_{\Omega} |f_m(x) - f^*(x)|^2\, \mu(dx), \tag{1.1}$$

where $f_m$ is the approximation given by the two-layer neural network with width $m$, $\Omega \subset \mathbb{R}^d$ is a compact set and $f^* \in C(\Omega)$ is the target function, $d$ is the dimension, and $\mu$ is some probability measure on $\Omega$. We aim to show that, under a suitable setup, with high probability the convergence rate is exponentially fast, and we want the required width of the neural network to be independent of the training data, by using a suitable approximation $f_m$. Note that this does not contradict the result of [9], which shows that under-parametrized networks cannot approximate the function well. The reason is that the loss function here only measures the approximation at the given data points, instead of the approximation of the whole function.

Here, we mention the two main works that motivate this work. In [4], the following form of approximation, inspired by the weak approximation of measures, was considered:

$$f_m(x) = \frac{1}{m}\sum_{j=1}^{m} \phi(x; \theta_j), \tag{1.2}$$

where $\phi(\cdot\,; \theta)$ represents a unit function with parameter $\theta$. We now also regard $f_m$ as a function of the parameters $\theta_j$, since we are considering the regime where the $\theta_j$'s vary. For example, $\phi$ could be a Gaussian, a one-layer neural network, or a deep neural network. They were able to study the dynamics as an interacting particle system, so that the decay of the loss function can be understood as a gradient flow in the Wasserstein-2 space using optimal transport theory. However, the decay can only be shown in the limiting regime $m \to \infty$, and the decay rate is unknown. The mechanism is that the parameters converge to some minimizing regions.

In the work [6] by S. Du et al., two-layer neural networks of the form

$$f_m(x; a, W) = \sum_{j=1}^{m} a_j\, \sigma(w_j \cdot x)$$

were studied. Here, $\sigma$ is the so-called activation function. In their work, the following form of approximation is used:

$$f_m(x; a, W) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j\, \sigma(w_j \cdot x). \tag{1.3}$$

Though this seems to be only a small change, the dynamics is changed significantly compared with (1.2). They are actually able to show that the weights $w_j$ will not converge under the gradient descent method: they stay close to the initialization. They show that the loss function converges to zero exponentially in the over-parametrized regime. They obtain this result by considering the dynamics of the prediction. In [6], the convergence is largely due to their Assumption 3.1 and Theorem 3.1. The analysis of their matrix needs to exclude the case that two data vectors are nearly parallel, which is undesirable when $N$, the number of training data points, is large. We note that if we rely instead on the dynamics of the $a_j$'s, the Gram matrix can be positive definite simply due to the positivity of $\sigma$, and this is the strategy we adopt here.
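To make the parametrization (1.3) and the gradient descent training concrete, here is a minimal numerical sketch. It is our illustration rather than code from [6]; the data, target, step size, and the choice of mean-zero initialization $a_j(0) = \pm 1$ are all assumptions.

```python
# Minimal sketch of training (1.3) by gradient descent on the quadratic loss
# with the empirical measure mu = (1/N) sum_i delta_{x_i}.
# All concrete choices (data, target, step size) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 5, 2000, 100                 # dimension, width, number of data points

def sigma(z):                          # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # bounded data (Assumption 1)
y = np.sin(X @ rng.normal(size=d))              # target values f^*(x_i)

W = rng.normal(size=(m, d))            # w_j(0) ~ N(0, I_d), i.i.d.
a = rng.choice([-1.0, 1.0], size=m)    # mean-zero i.i.d. a_j(0) (assumed here)

def f(X, a, W):                        # f_m(x) = (1/sqrt(m)) sum_j a_j sigma(w_j . x)
    return sigma(X @ W.T) @ a / np.sqrt(m)

lr = 1.0
for step in range(1000):
    r = f(X, a, W) - y                 # residuals at the data points
    S = sigma(X @ W.T)                 # (N, m)
    grad_a = S.T @ r / (np.sqrt(m) * N)
    grad_W = (a[:, None] * (S * (1.0 - S) * r[:, None]).T) @ X / (np.sqrt(m) * N)
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 200 == 0:
        print(step, 0.5 * np.mean(r ** 2))
```

With the $1/\sqrt{m}$ scaling, one can observe that the entries of $W$ barely move during training, in line with the key observation discussed above.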

Below, we aim to combine and improve the ideas from these two works to show a convergence rate for gradient descent independent of the sample size $N$. We use the approximation from [6], i.e., (1.3), and study the dynamics of the training loss directly, as done in [4, 8]. We make use of the key observation from [10, 6] that the weights stay close to their initial values, but instead rely on the dynamics of the $a_j$'s to obtain the positive definiteness of the Gram matrix. Another important technique is the generic chaining [11] from probability theory, which guarantees that the Gram matrix is close to some positive definite matrix uniformly in the data $x$. In section 2, we set up the mathematical formulation and state the main results. The proof is then performed in section 3. Lastly, we make some discussions in section 4.

2 Mathematical setup and the main results

We focus on the following special setup and consider two-layer neural networks as in (1.3). The generalization to other cases will be addressed in subsequent works.

Regarding the data set and the measure $\mu$, we assume the following:

Assumption 1.

The domain $\Omega \subset \mathbb{R}^d$ is a compact set such that there is a constant $C_\Omega$ independent of $d$ satisfying $|x| \le C_\Omega$ for all $x \in \Omega$. The measure $\mu$ is a probability measure supported on a countable subset of $\Omega$.

Below, we will denote the support of $\mu$ by $\mathcal{X}$:

$$\mathcal{X} := \operatorname{supp}(\mu) \subset \Omega. \tag{2.1}$$

One common example of $\mu$ is the empirical measure

$$\mu = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i}, \tag{2.2}$$

where the $x_i$'s are the training data, sampled i.i.d. from some underlying distribution. Of course, in our discussion in this work, the $x_i$'s are fixed.

Moreover, we assume the following about the initial conditions.

Assumption 2.

For the weights, we assume

$$w_j(0) \sim \mathcal{N}(0, I_d), \quad \text{i.i.d.},$$

and the $a_j(0)$'s are i.i.d. such that they have mean zero and are bounded.

The mean-zero condition is important because it guarantees that the approximation (1.3) is small with high probability at $t = 0$, which is needed to close the estimates later (Lemma 4). The distribution of the $w_j$'s is assumed to be Gaussian only for technical convenience. The argument can be extended to other distributions with finite second moments.

Moreover, we assume

Assumption 3.

The activation function $\sigma$ satisfies $\|\sigma\|_\infty < \infty$, $\|\sigma'\|_\infty < \infty$ and $\|\sigma''\|_\infty < \infty$. Moreover, $\sigma > 0$, and thus $\mathbb{E}_{w \sim \mathcal{N}(0, I_d)}[\sigma(w \cdot x)\, \sigma(w \cdot x')] > 0$.

The sigmoid function satisfies these conditions. The conditions may be stronger than necessary; we impose them only for simplicity. The generalization to other activation functions should be doable.
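For instance, for the sigmoid $\sigma(z) = 1/(1 + e^{-z})$, the bounds can be checked explicitly (this computation is ours, added for concreteness):

$$\sigma' = \sigma(1 - \sigma), \qquad \sigma'' = \sigma(1 - \sigma)(1 - 2\sigma),$$

so that $0 < \sigma < 1$, $\|\sigma'\|_\infty = 1/4$, and $\|\sigma''\|_\infty = \frac{1}{6\sqrt{3}}$; in particular, $\sigma > 0$ everywhere.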

Under the gradient descent (flow) with the loss function (1.1), the parameters $a_j$ satisfy

$$\dot{a}_j = -\partial_{a_j} L = -\frac{1}{\sqrt{m}}\int_{\Omega} (f_m(x) - f^*(x))\, \sigma(w_j \cdot x)\, \mu(dx), \tag{2.3}$$

where the dot means the time derivative. Similarly, the parameters $w_j$ satisfy

$$\dot{w}_j = -\nabla_{w_j} L = -\frac{a_j}{\sqrt{m}}\int_{\Omega} (f_m(x) - f^*(x))\, \sigma'(w_j \cdot x)\, x\, \mu(dx). \tag{2.4}$$

Instead of studying the prediction as in [6], we study the dynamics of the loss function directly, as motivated by [4, 8]:

$$\frac{d}{dt} L = -\int_{\Omega}\int_{\Omega} (f_m(x) - f^*(x))\, G(x, x')\, (f_m(x') - f^*(x'))\, \mu(dx)\, \mu(dx'). \tag{2.5}$$

The matrix

$$G(x, x') = \frac{1}{m}\sum_{j=1}^{m} \left[ \sigma(w_j \cdot x)\, \sigma(w_j \cdot x') + a_j^2\, \sigma'(w_j \cdot x)\, \sigma'(w_j \cdot x')\, x \cdot x' \right] \tag{2.6}$$

is called the Gram matrix in the terminology of [6]. We aim to use the positive definiteness of this matrix to show the exponential convergence. The second part is positive semi-definite (in fact, in [6] it was argued that this part is also positive definite when the data points are not parallel), so we can easily obtain the control

$$\frac{d}{dt} L \le -\int_{\Omega}\int_{\Omega} (f_m(x) - f^*(x))\, \frac{1}{m}\sum_{j=1}^{m}\sigma(w_j \cdot x)\, \sigma(w_j \cdot x')\, (f_m(x') - f^*(x'))\, \mu(dx)\, \mu(dx') \le 0. \tag{2.7}$$

One key observation from [6] is that the weights do not change much during the dynamics. The choice (1.3) ensures that the first part of (2.6) takes the form $\frac{1}{m}\sum_{j=1}^{m}\sigma(w_j \cdot x)\sigma(w_j \cdot x')$, so that one has a kind of law-of-large-numbers convergence here. Moreover, we note that the pairs $(w_j \cdot x, w_j \cdot x')$ are i.i.d. 2D Gaussian variables, regardless of the dimension $d$ of $w_j$ itself. Using the generic chaining approach [11], this allows us to obtain a convergence rate of the loss function independent of $N$ and $d$ when $m$ is large enough (with $m$ free of exponential dependence on $d$).
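As a quick illustration of this law-of-large-numbers effect (our sketch; the data and the sizes are arbitrary), one can check numerically that the first part of the Gram matrix concentrates around its expectation uniformly over all data pairs as $m$ grows:

```python
# Sketch: the first part of the Gram matrix (2.6),
#   G1[i,k] = (1/m) sum_j sigma(w_j . x_i) sigma(w_j . x_k),
# concentrates around K[i,k] = E[sigma(w . x_i) sigma(w . x_k)] as m grows,
# uniformly over all N^2 data pairs.
import numpy as np

rng = np.random.default_rng(1)
d, N = 10, 50

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

W_ref = rng.normal(size=(200000, d))        # large Monte Carlo reference for K
S_ref = sigma(X @ W_ref.T)
K = S_ref @ S_ref.T / W_ref.shape[0]
print("lambda_min(K) ~", np.linalg.eigvalsh(K).min())

for m in (100, 1000, 10000):
    W = rng.normal(size=(m, d))             # w_j ~ N(0, I_d), i.i.d.
    S = sigma(X @ W.T)                      # (N, m)
    G1 = S @ S.T / m
    print(m, "max deviation:", np.abs(G1 - K).max())
```

In this instance, the maximum entrywise deviation shrinks at roughly the rate $m^{-1/2}$ while the smallest eigenvalue of the limiting kernel stays positive; the generic chaining argument below quantifies this kind of uniform concentration.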

Theorem 1.

Under Assumptions 1-3, for any small $\delta > 0$, when $m$ is large enough (with a threshold independent of $N$), with probability $1 - \delta$, it holds that

$$L(t) \le L(0)\, e^{-\lambda t}, \quad t \ge 0, \tag{2.8}$$

where $\lambda > 0$ and the threshold for $m$ are independent of $N$ and $t$.

Remark 1.

The dependence on $d$ in the lower bound of $m$ comes from a constant associated with the geometry of $\Omega$, as in Proposition 1. In our case, the domain is like a ball whose volume goes to zero as $d \to \infty$. If we consider other domains, like $[-1, 1]^d$, the dependence on $d$ will change: on the one hand, the bound on $|x|$ will change, and the constant in Proposition 1 will change as well.

We remark that $L \to 0$ as $t \to \infty$ does not mean that $f_m$ approximates $f^*$ well. It only approximates the function values on the support $\mathcal{X}$. However, there is another implication of the result. In the case of the empirical measures (2.2), the width and the convergence rate are independent of the data set $\{x_i\}$. This somehow says that any bounded function can be approximated by two-layer neural networks with high probability. In fact, it seems that the following approximation property of two-layer neural networks holds.

Corollary 1.

Suppose $\Omega$ satisfying Assumption 1 is the union of closures of some domains (open connected sets), so that the Lebesgue measure can be defined. Then, two-layer neural networks satisfying Assumption 3 with width $m$ large enough are able to approximate any function $f^* \in C(\Omega)$ under the $L^2(\Omega)$ norm.

Here, $L^2(\Omega)$ means the $L^2$ norm with respect to the Lebesgue measure. As commented in section 4, it seems that such good approximations are available “everywhere” in the parameter space. In Theorem 1, the lower bound of $m$ depends on $f^*$, but in this corollary, there is no such dependence in the lower bound of $m$. The reason is that one can approximate a normalized version of $f^*$ first and then rescale the $a_j$'s. The dependence on $f^*$ originates from the gradient descent dynamics (2.3), where the derivatives depend on $f^*$.

3 The proof of the main results

Below, we use $C$ to indicate a constant independent of $m$, $N$ and $d$, whose concrete meaning can change from line to line.

Lemma 1.

The derivatives of the parameters satisfy

$$|\dot{a}_j| \le \frac{C}{\sqrt{m}}\, \sqrt{L} \tag{3.1}$$

and

$$|\dot{w}_j| \le \frac{C\, |a_j|}{\sqrt{m}}\, \sqrt{L}, \tag{3.2}$$

where $C$ is independent of $m$, $N$ and $d$.

The proof is straightforward and we omit it. Let us just remark that, considering $\dot{a}_j$ for example, using (2.3) and Hölder's inequality it is straightforward to obtain a bound of the form (3.1).
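For concreteness, under the notation reconstructed above, the Hölder computation reads (the constant is ours):

$$|\dot{a}_j| \le \frac{\|\sigma\|_\infty}{\sqrt{m}} \int_\Omega |f_m(x) - f^*(x)|\, \mu(dx) \le \frac{\|\sigma\|_\infty}{\sqrt{m}} \left( \int_\Omega |f_m - f^*|^2\, d\mu \right)^{1/2} = \frac{\sqrt{2}\, \|\sigma\|_\infty}{\sqrt{m}}\, \sqrt{L},$$

using that $\mu$ is a probability measure and $L = \frac{1}{2}\int_\Omega |f_m - f^*|^2\, d\mu$.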

Using Lemma 1, we are able to establish:

(3.3)

where we have introduced

(3.4)

and

(3.5)

We now estimate these two terms separately.

Lemma 2.

The term defined in (3.5) satisfies

where

(3.6)
Proof.

First of all, one finds

Using (3.2), one has

Hence,

Consequently, one has

(3.7)

where we have used the fact that

The claim then follows. ∎

We are now in a position to take $m$ large in the term (3.4). In particular, we need to estimate

$$\frac{1}{m}\sum_{j=1}^{m} \sigma(w_j(0) \cdot x)\, \sigma(w_j(0) \cdot x'). \tag{3.8}$$

For fixed $(x, x')$, it is clear that this converges to

$$\mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[\sigma(w \cdot x)\, \sigma(w \cdot x')\right] \tag{3.9}$$

and the rate is independent of $d$. However, obtaining a convergence rate uniform in $(x, x')$ is challenging. The usual technique is the union bound, which gives a quantity depending on the number of data points, and this further requires the over-parametrized regime. One may also use the convergence of empirical measures in the so-called Wasserstein spaces [12], which yields a quantity independent of the training data but suffers from the curse of dimensionality. In our problem here, we are essentially considering projections of high-dimensional Gaussians. Hence, we expect that the convergence rate will not suffer from the curse of dimensionality. We find the generic chaining approach in [11, 13] useful.
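For orientation, the classical sub-Gaussian form of Talagrand's generic chaining bound reads as follows (the appendix uses a Bernstein-type, mixed-tail variant following [13]). If a process $(X_t)_{t \in T}$ satisfies $\mathbb{P}(|X_s - X_t| \ge u) \le 2\exp\left(-u^2 / (2\, \varrho(s, t)^2)\right)$ for a metric $\varrho$ on $T$, then

$$\mathbb{E} \sup_{t \in T} |X_t - X_{t_0}| \le C\, \gamma_2(T, \varrho), \qquad \gamma_2(T, \varrho) := \inf_{(T_n)}\, \sup_{t \in T}\, \sum_{n \ge 0} 2^{n/2}\, \varrho(t, T_n),$$

where the infimum runs over sequences of subsets $T_n \subset T$ with $|T_0| = 1$ and $|T_n| \le 2^{2^n}$.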

We introduce the following result, essentially derived from [13, Proposition 2.2] and the proof of [13, Theorem 1.1]. The proof, using the generic chaining approach, is deferred to Appendix A.

Proposition 1.

Let $\Omega$ be as in Assumption 1. Consider

$$X(x) := \frac{1}{m}\sum_{j=1}^{m} Y_j(x), \qquad x \in \Omega.$$

Suppose that for any given $x$, the $Y_j(x)$'s are i.i.d. in $j$, and that for any $x, x'$, $Y_j(x)$ and $Y_{j'}(x')$ are independent for $j \ne j'$ but may have different distributions. Moreover, the $Y_j$'s satisfy the following.

  1. They are uniformly bounded and have mean zero.

  2. There is a constant $b$ such that $\mathbb{E}\,|Y_j(x) - Y_j(x')|^2 \le b^2\, |x - x'|^2$ for all $x, x' \in \Omega$.

Then there is a parameter depending only on the geometry of $\Omega$, such that for any small $\delta > 0$ and any countable subset $\mathcal{X} \subset \Omega$, when $m$ is large enough, with probability $1 - \delta$,

(3.10)

We now move to the estimate of the term defined in (3.4). We have the following.

Lemma 3.

For any small $\delta > 0$, when $m$ is large enough, with probability $1 - \delta$, it holds that

(3.11)
Proof.

We only need to verify the conditions in Proposition 1. Here, the domain we work on is $\Omega \times \Omega$, while $\mathcal{X} \times \mathcal{X}$ is the countable subset we consider. The distance is the one induced by the Euclidean distance on pairs.

Clearly, we need to identify

$$Y_j(x, x') = \sigma(w_j(0) \cdot x)\, \sigma(w_j(0) \cdot x') - \mathbb{E}\left[\sigma(w \cdot x)\, \sigma(w \cdot x')\right].$$

Note that even if $x = x'$, the $Y_j$'s are still i.i.d.

Noting that $\sigma$ and $\sigma'$ are bounded,

one has

$$|Y_j(x, x') - Y_j(y, y')| \le C\left(|w_j(0) \cdot (x - y)| + |w_j(0) \cdot (x' - y')|\right) + C\left(|x - y| + |x' - y'|\right).$$

Taking expectation, one finds the moment condition in Proposition 1 with a universal constant $b$. The claim follows.

Note that in this case, the Gaussian variables involved, $(w_j(0) \cdot x, w_j(0) \cdot x')$, are two-dimensional regardless of the dimension $d$. ∎

Note that the boundedness of $\sigma$ is also used here for the Bernstein inequality in Appendix A. Consequently, regarding the term defined in (3.4), when (3.11) holds, one has

(3.12)

where we have introduced

(3.13)

Now, we move on to the initial value of the loss. If we identify the corresponding quantities and use Proposition 1, we can find that when $m$ is large enough, with probability $1 - \delta$, it holds that

(3.14)

Here, the constant depends on $d$. However, according to [8, Lemma 1], one has the following.

Lemma 4.

For any small $\delta > 0$, with probability $1 - \delta$, it holds that

(3.15)

The proof is done via the so-called Rademacher complexity, and we refer the reader to [8].

Consequently, we find that for any small $\delta > 0$, with probability $1 - \delta$, it holds that

(3.16)

Now, we estimate the Gram matrix.

Lemma 5.

There exists $\kappa > 0$ independent of $d$, $x$ and $x'$ such that

$$\mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\left[\sigma(w \cdot x)\, \sigma(w \cdot x')\right] \ge \kappa. \tag{3.17}$$
Proof.

If either $x$ or $x'$ is zero, the conclusion is clearly correct.

Now, assume both are nonzero. Then, $(w \cdot x, w \cdot x')$ is a 2D Gaussian with covariance matrix

$$\begin{pmatrix} |x|^2 & x \cdot x' \\ x \cdot x' & |x'|^2 \end{pmatrix}.$$

Consider a bounded interval $I$ on which $\sigma$ is bounded below by a positive constant; the event that both $w \cdot x$ and $w \cdot x'$ fall in $I$ has positive probability.

Now, we should estimate $\mathbb{E}[\sigma(w \cdot x)\, \sigma(w \cdot x')]$. The expectation can be rewritten as an integral in terms of two independent standard normal variables $\xi$ and $\eta$.

Note that $|x|$ and $|x'|$ are bounded; hence, we only need $\xi$ and $\eta$ to lie in a bounded set chosen such that $w \cdot x$ and $w \cdot x'$ are in $I$ for all admissible $x, x'$. Clearly, this set can be made universal, independent of $d$. The probability hence has a lower bound independent of $d$. Then, setting $\kappa$ to be this probability times the minimum of $\sigma^2$ on $I$ will suffice. ∎
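A quick Monte Carlo sanity check of Lemma 5 is below (our illustration; the printed value is an empirical minimum over sampled pairs, not the paper's $\kappa$). The antipodal pair $x' = -x$ is included since it is a natural candidate for the worst case:

```python
# Estimate min over pairs (x, x') of E_{w ~ N(0, I_d)}[sigma(w.x) sigma(w.x')]
# for unit vectors, across several dimensions d. Lemma 5 predicts a positive
# lower bound independent of d.
import numpy as np

rng = np.random.default_rng(2)

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

n_mc = 20000                               # Monte Carlo samples per pair
for d in (2, 10, 100):
    W = rng.normal(size=(n_mc, d))         # w ~ N(0, I_d)
    x0 = np.zeros(d)
    x0[0] = 1.0
    pairs = [(x0, -x0)]                    # antipodal pair: likely worst case
    for _ in range(20):                    # plus random pairs on the sphere
        x, xp = rng.normal(size=(2, d))
        pairs.append((x / np.linalg.norm(x), xp / np.linalg.norm(xp)))
    worst = min(np.mean(sigma(W @ x) * sigma(W @ xp)) for x, xp in pairs)
    print(d, worst)
```

The empirical minimum stays around the same positive value across dimensions, consistent with the claim that $\kappa$ is independent of $d$.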

We are now able to prove the main result.

Proof of Theorem 1.

By the previous lemmas, for any small $\delta > 0$, when $m$ is large enough, we may pick an event with probability at most $\delta$ such that, outside this event, the estimates (3.11) and (3.16) hold simultaneously.

Then, outside this event, the Gram term is uniformly close to its positive definite limit.

We first of all note that $L$ is nonincreasing by (2.7). We now consider a fixed realization in this good set. Then, all the quantities below are essentially discussed for such a realization, and we omit this dependence for convenience.

Define $t^*$ to be the supremum of times $t \ge 0$ such that the weights stay within a prescribed distance $R$ of their initial values on $[0, t]$.

Since $L$ is nonincreasing, the estimates above apply on $[0, t^*]$: the Gram term stays uniformly close to (3.9) there, so for $t \le t^*$ it holds that the loss decays exponentially.

Then, integrating the bounds of Lemma 1 in time,

we find that for all $t \le t^*$ the weights move at most a distance of order $m^{-1/2}$.

Hence, if $m$ is chosen large enough such that this total movement is strictly smaller than $R$, then we must have $t^* = \infty$. In other words, for some threshold depending only on the constants above, taking $m$ above this threshold will imply $t^* = \infty$.

Otherwise, suppose $t^* < \infty$. Then, the strict inequality on the movement of the weights still holds at $t^*$. Then, by continuity, there exists a short time interval beyond $t^*$ on which it still holds. This then contradicts the definition of $t^*$.

Now that $t^* = \infty$, the exponential decay holds for all $t \ge 0$. Lastly, we note that $L(0)$ is controlled by Lemma 4, and the claim then follows. ∎
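Schematically, with our reconstructed constants ($R$, $\lambda$ and the hidden prefactors are placeholders), the continuity argument runs as follows:

$$t^* := \sup\Big\{ t \ge 0 : \max_j \sup_{s \le t} |w_j(s) - w_j(0)| \le R \Big\}.$$

On $[0, t^*]$, the Gram term stays close to (3.9), so $L(t) \le L(0)\, e^{-\lambda t}$ there; then, by Lemma 1,

$$|w_j(t) - w_j(0)| \le \int_0^t |\dot{w}_j|\, ds \lesssim \frac{1}{\sqrt{m}} \int_0^\infty \sqrt{L(0)}\, e^{-\lambda s / 2}\, ds = \frac{2\sqrt{L(0)}}{\lambda \sqrt{m}},$$

which is strictly less than $R$ once $m$ is large enough; by continuity, this forces $t^* = \infty$.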

We now move to the corollary regarding approximation.

Proof of Corollary 1.

If we can approximate functions with $\|f^*\|_\infty \le 1$, we can multiply by suitable constants to approximate arbitrary bounded functions. Hence, we assume $\|f^*\|_\infty \le 1$ below. Moreover, one can use Lipschitz functions to approximate $f^*$ with small error under the $L^2$ norm. Hence, we also assume that $f^*$ is Lipschitz below.

Fix $\epsilon > 0$. One first of all has the following estimate:

(3.18)

where

$$\mu_N = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i}$$

is the empirical measure for the data $\{x_i\}$.

By the standard result on the convergence of empirical measures [12], there exists a data set $\{x_i\}_{i=1}^{N}$ with $N$ large enough such that

$$W_2(\mu_N, \mathcal{L}) \le \epsilon,$$

where $W_2$ is the Wasserstein distance and $\mathcal{L}$ is the normalized Lebesgue measure on $\Omega$.

Then, we set $\mu = \mu_N$ in the loss (1.1), and we take the event such that the claims in Theorem 1 hold for this $\mu$. Hence, we are able to find the parameters, for $m$ large enough, such that the training error at the data points is as small as desired.

The important consequence of Theorem 1 is that the required $m$ is independent of $N$ (and thus of the data set).

Now, we fix such parameters and estimate the Lipschitz constant of $f_m$. The gradient of this function is

$$\nabla_x f_m = \frac{1}{\sqrt{m}}\sum_{j=1}^{m} a_j\, \sigma'(w_j \cdot x)\, w_j.$$

Hence, we need to estimate the $|a_j|$'s and the $|w_j|$'s.

By Lemma 1, the $a_j$'s stay bounded along the dynamics.

Similarly, the $w_j$'s stay close to their Gaussian initial values.

Hence, both are controlled with high probability.

In other words, the Lipschitz constant of $f_m$ is controlled by a polynomial of these bounds and the Lipschitz constant of $\sigma$ (but independent of $N$), since $m$ is independent of $N$.

The whole error is thus controlled by a quantity proportional to $\epsilon$. The constant is independent of $N$ and of the concrete expression of $f^*$; hence, it is a fixed number once $\Omega$, $\sigma$ and the Lipschitz bound are given. As $\epsilon$ is arbitrary, the claim follows. ∎

4 Conclusion and discussion

Existing works show the convergence of the loss function in the over-parametrized regime for neural networks. By considering the loss function directly, we obtain the decay of the loss function at a rate independent of the number of training data points, where the approximation is given specifically by (1.3). The key observation, made by S. Du et al., is that the weights stay close to the initialization. In this sense, the loss function has global minimizers “everywhere” in the parameter space. As we have seen, the proof works for a large number of data points $N$, and as Corollary 1 indicates, the parameters that achieve a good approximation seem to be available “everywhere” in the parameter space, which we strongly believe leads to the good performance of stochastic gradient descent. If we instead use the approximation of the form (1.2), inspired by the mean-field limit, then one must wait for the weights to converge to some specific regions to attain a global minimum. This probably means that the approximation (1.3) has larger capacity, and it may be used to explain why SGD behaves well.

The loss function used here is quadratic. However, we argue that quadratic losses are general enough, since near a minimum, all smooth losses look approximately quadratic.

There are many natural subsequent works; for example, one may explore whether the ideas and techniques here can be applied to improve the generalization error estimates in the literature, and to better explain why stochastic gradient descent works well. These should be challenging but exciting problems.

Acknowledgement

L. Li would like to thank Haizhao Yang for helpful comments. The work of L. Li was partially sponsored by NSFC 11901389 and 11971314, and Shanghai Sailing Program 19YF1421300.

Appendix A Proof of Proposition 1

The proof divides into several steps. Below, $C$ again denotes some generic constant, independent of $m$, $N$ and $d$, whose concrete meaning can change from line to line.


Step 1. Basic concentration inequalities.

Define

(A.1)

and define

(A.2)

Hence, using the assumptions of Proposition 1, one has

Note that

where the summands are still i.i.d. with mean zero. By the Bernstein inequality [14], one has, for any $u > 0$,

(A.3)

where the constants depend only on the uniform bound and $b$.

Consequently,

Hence, one defines the metric (note that the maximum of two metrics is still a metric)

(A.4)

then one has

(A.5)

while, in the other regime,

(A.6)

Step 2. Generic chaining.

Now consider the domain $\Omega$ equipped with the metric $\varrho$ in (A.4). Define the number

$$\gamma := \inf_{(T_n)}\, \sup_{x \in \Omega}\, \sum_{n \ge 0} 2^{n/2}\, \varrho(x, T_n), \tag{A.7}$$

where $T_n \subset \Omega$ is a set containing at most $2^{2^n}$ points (with $|T_0| = 1$). The infimum is taken over the collections of such sets $(T_n)$. From $n$ to $n + 1$, the number of points in $T_n$ is squared. Hence, the distance $\varrho(x, T_n)$ decays double exponentially if the sequence is chosen suitably, and $\gamma$ is well-defined. For a given collection $(T_n)$ and every $x$, there must be a point in $T_n$ attaining $\varrho(x, T_n)$,

and we define

$$\pi_n(x) \in \operatorname*{argmin}_{y \in T_n} \varrho(x, y). \tag{A.8}$$

Let $d_E$ be the usual Euclidean distance; we find that, on a suitable sequence of subsets, the metric $\varrho$ is controlled by $d_E$. Hence, the number $\gamma$ is comparable to the analogous quantity defined with the Euclidean distance.

Below, we focus only on the Euclidean distance. The corresponding entropy numbers for the unit ball behave like $2^{-2^n/d}$, as shown in [11, sec. 2.2], where $d$ is the dimension.

By the definition of $\gamma$, we are able to choose the collection of sets $(T_n)$ such that the sum in (A.7) nearly attains the infimum,

which holds for all $x$.


Step 3. Controlling the supremum.

Let $n_0$ be an appropriately chosen starting scale. Consider any $x$. By Step 1, for any fixed pair of chain points, one has

where $|T_n|$ is the number of elements in $T_n$.

Define