# Convergence Analysis of the Dynamics of a Special Kind of Two-Layered Neural Networks with ℓ1 and ℓ2 Regularization

Zhifeng Kong
###### Abstract

In this paper, we made an extension to the convergence analysis of the dynamics of two-layered bias-free networks with one ReLU output. We took into consideration two popular regularization terms, the $\ell_1$ and $\ell_2$ norms of the parameter vector $w$, and added each of them to the square loss function with coefficient $\lambda/2$. We proved that when $\lambda$ is small, the weight vector $w$ converges to the optimal solution $\hat{w}$ (with respect to the new loss function) with probability $\geq (1-\varepsilon)(1-A_d)/2$ under random initializations in a sphere centered at the origin, where $\varepsilon$ is a small value and $A_d$ is a constant. Numerical experiments, including phase diagrams and repeated simulations, verified our theory.

## 1 Introduction

A substantial issue in deep learning is the theoretical analysis of complex systems. Unlike multi-layer perceptrons, deep neural networks have various structures [3], which mainly come from intuitions, and they sometimes yield good results. On the other hand, the optimization problem usually turns out to be non-convex, thus it is difficult to analyze whether the system will converge to the optimal solution with simple methods such as stochastic gradient descent.

In Theorem 4 in [2], convergence for a system with square loss with regularization is analyzed. However, Assumption 6 in [2] requires the activation function $\sigma$ to be three times differentiable with $\sigma'(x) > 0$ on its domain. Thus the analysis cannot be applied to some popular activation functions such as ReLU [4] and the rectifier variant of [5], where $\mathrm{ReLU}(x) = \max(x, 0)$.

Theorem 3.3 in [1] provides another point of view to analyze the situation by using the Lyapunov method [6]. The conclusion is weaker: convergence is only guaranteed with a probability less than $1$. However, this method successfully deals with the ReLU activation function. In this paper, we take into consideration $\ell_1$ and $\ell_2$ regularization and analyze the convergence of these two systems with an analogous method. A similar conclusion is drawn in the end.

The squares of the $\ell_1$ and $\ell_2$ norms of a vector $v$ are

$$\|v\|_1^2 = \Big(\sum_{i=1}^{n} |v_i|\Big)^2, \qquad \|v\|_2^2 = \sum_{i=1}^{n} v_i^2 = v^\top v. \tag{1}$$

These two regularization terms are popular because they control the scale of $w$. There is an important difference between $\ell_1$ and $\ell_2$ regularization: it is usually possible to obtain an explicit solution of a system with $\ell_2$ regularization, but hard to do so for a system with $\ell_1$ regularization. We therefore need different tools to deal with the two problems.

## 2 Preliminary

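In this paper a two-layered neural network with one output is considered. Let $X$, an $N \times d$ matrix, be the input data, with rows $x_i^\top$. Assume that the columns of $X^\top$ are independent, identically distributed Gaussian random $d$-dimensional vectors: the $x_i$'s are i.i.d. $\mathcal{N}(0, I_d)$. Let $w$, a vector with length $d$, be the vector of weights (parameters) to be learned by the model. Let $w^*$ be the optimal weight with respect to $X$. Let $\sigma(x) = \max(x, 0)$ be the activation function. Then, the output with input vector $x$ and weight $w$ is $g(x, w) = \sigma(x^\top w)$. For convenience, define an $N \times 1$ vector $g(X, w)$ with elements $g(X, w)_i = g(x_i, w)$. Now, we are able to write down the loss function with the regularization term $R(w)$:

$$E(w) = \frac{1}{2N}\,\|g(X, w^*) - g(X, w)\|^2 + \frac{\lambda}{2} R(w), \tag{2}$$

where $\lambda$ is a parameter. When $\lambda = 0$, there is no regularization. In this paper, we focus on the situation where $R(w) = \|w\|_1^2$ or $R(w) = \|w\|_2^2$ and $\lambda$ is very small.

There is an easy way to represent $g(X, w)$ by introducing a new matrix function $D(w)$ given by $D(w) = \mathrm{diag}(d_1, \dots, d_N)$, where $d_i = 1$ if $x_i^\top w > 0$ and $d_i = 0$ if $x_i^\top w \le 0$. Then, $g(X, w)$ can be written in matrix form:

$$g(X, w) = D(w) X w. \tag{3}$$

Additionally, let $D^* = D(w^*)$ for convenience.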
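The matrix form in equation (3) can be checked against a direct ReLU evaluation. The following is a minimal sketch, not code from the paper; the function names `gate_matrix` and `network_output`, the toy sizes, and the random seed are assumptions made for illustration.

```python
import numpy as np

def gate_matrix(X, w):
    """D(w): diagonal 0/1 matrix with a one wherever x_i^T w > 0."""
    return np.diag((X @ w > 0).astype(float))

def network_output(X, w):
    """g(X, w) = D(w) X w, the matrix form of equation (3)."""
    return gate_matrix(X, w) @ X @ w

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))   # toy sizes: N = 6 samples, d = 3 features
w = rng.standard_normal(3)

relu_output = np.maximum(X @ w, 0.0)    # ReLU applied entrywise to X w
matrix_output = network_output(X, w)    # same vector through D(w) X w
```

Both vectors agree entry by entry, because multiplying by $D(w)$ zeroes exactly the rows on which $x_i^\top w \le 0$.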

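Now we introduce the gradient descent algorithm for the model. The iteration has the form

$$w^{t+1} = w^t + \eta\,\Delta w^t, \tag{4}$$

where $\eta$ is the learning rate (usually small) and $\Delta w^t$ is the negative gradient of the loss function. According to [1], $\Delta w$ has the closed form

$$\Delta w = \frac{1}{N} X^\top D(w)\big(D^* X w^* - D(w) X w\big) + \frac{\lambda}{2}\frac{\partial R}{\partial w}. \tag{5}$$

Its expectation (corresponding to $N \to \infty$) is given explicitly by

$$\mathbb{E}\,\Delta w = \frac{1}{2}(w^* - w) + \frac{1}{2\pi}\big((\alpha \sin\theta)\,w - \theta\,w^*\big) + \frac{\lambda}{2}\frac{\partial R}{\partial w}, \tag{6}$$

where $\alpha = \|w^*\| / \|w\|$ and $\theta$ is the angle between $w$ and $w^*$.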
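The iteration (4) with the update direction (5) can be sketched as follows. This is an illustrative sketch rather than the code used for the experiments: the names `delta_w` and `run_dynamics`, the sizes, the seed, and the starting point near $w^*$ are assumptions, the $\ell_2$ case $\partial R/\partial w = 2w$ is used, and the regularization term carries the sign printed in equation (5).

```python
import numpy as np

def gate(X, w):
    """Diagonal of D(w) as a 0/1 vector."""
    return (X @ w > 0).astype(float)

def delta_w(X, w, w_star, lam=0.0):
    """Update direction of equation (5) with R(w) = ||w||_2^2, i.e. dR/dw = 2w.
    The regularization term uses the sign printed in equation (5)."""
    N = X.shape[0]
    target = gate(X, w_star) * (X @ w_star)   # D* X w*
    output = gate(X, w) * (X @ w)             # D(w) X w
    data_term = X.T @ (gate(X, w) * (target - output)) / N
    return data_term + 0.5 * lam * 2.0 * w

def run_dynamics(X, w0, w_star, eta=0.05, lam=0.0, steps=500):
    """Iterate equation (4): w <- w + eta * delta_w."""
    w = w0.copy()
    for _ in range(steps):
        w = w + eta * delta_w(X, w, w_star, lam)
    return w

rng = np.random.default_rng(0)
N, d = 2000, 5                                # assumed toy sizes
X = rng.standard_normal((N, d))
w_star = rng.standard_normal(d)
w0 = w_star + 0.1 * rng.standard_normal(d)    # assumed start close to w*

w_final = run_dynamics(X, w0, w_star)         # lam = 0: no regularization
initial_error = np.linalg.norm(w0 - w_star)
final_error = np.linalg.norm(w_final - w_star)
```

With `lam=0.0` the run targets $w^*$ itself; passing a small positive `lam` moves the fixed point away from $w^*$, which is the situation analyzed in Section 3.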

## 3 Theoretical Analysis

Usually, $w$ does not converge to $w^*$ because of the regularization term. Let $\hat{w}$ be the optimal weight vector that minimizes $E(w)$, i.e. $\hat{w} = \arg\min_w E(w)$. First, we solve for $\hat{w}$ when $\lambda$ is small, and then we prove with the Lyapunov method [6] that $w$ converges to $\hat{w}$ inside a certain region from which one line is excluded.

We first provide three lemmas that help the analysis in Sections 3.2 and 3.3. The lemmas show that extreme situations happen with small probability, and provide some mathematical tricks that are useful in the theoretical analysis.

### 3.1 Preparation

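Lemma 1: Let $A_k = \mathrm{Prob}\{\mathrm{rank}(D^*) \le k\}$. Then $A_k = 2^{-N}\sum_{i=0}^{k}\binom{N}{i}$ for $0 \le k \le N$.

Proof: Since each $x_j$ is a centered Gaussian vector, $\mathrm{Prob}\{x_j^\top w^* > 0\} = \mathrm{Prob}\{x_j^\top w^* \le 0\} = 1/2$. Thus,

$$\mathrm{Prob}\{\mathrm{rank}(D^*) = i\} = \binom{N}{i}\prod_{j=1}^{i}\mathrm{Prob}\{x_j^\top w^* > 0\}\cdot\prod_{j=i+1}^{N}\mathrm{Prob}\{x_j^\top w^* \le 0\} = 2^{-N}\binom{N}{i}. \tag{7}$$

Finally,

$$A_k = \sum_{i=0}^{k}\mathrm{Prob}\{\mathrm{rank}(D^*) = i\} = 2^{-N}\sum_{i=0}^{k}\binom{N}{i}. \tag{8}$$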

When , is a small value bounded by .
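The quantity $A_k$ in equation (8) is a binomial tail and is easy to evaluate exactly. The sketch below uses only the Python standard library; the function name `A` and the sample values of $N$ and $k$ are assumptions chosen for illustration.

```python
from math import comb

def A(k, N):
    """A_k = 2^{-N} * sum_{i=0}^{k} C(N, i), as in equation (8)."""
    return sum(comb(N, i) for i in range(k + 1)) / 2 ** N

small_case = A(2, 4)      # (1 + 4 + 6) / 16
full_sum = A(4, 4)        # all ranks included, so the probabilities sum to 1
large_N = A(5, 100)       # k much smaller than N
```

For $k$ much smaller than $N$ the value is tiny, which is the sense in which the extreme situations of Lemma 1 are rare.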

Lemma 2: .

Proof: First we show when , . Since ’s are , any rows of are linearly independent with probability 1. This implies that with probability 1 doesn’t contain more that ’s . However, has more than ’s, so Then since , the probability that is positive definite also equals to this amount.

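Lemma 3: For a positive definite matrix $B$ and a small value $\varepsilon$,

$$(B - \varepsilon I)^{-1} = \big(I + \varepsilon B^{-1} + o(\varepsilon)\big) B^{-1}, \tag{9}$$

where $o(\varepsilon)$ refers to a matrix with every element $o(\varepsilon)$.

Proof: Since $B$ is positive definite, $B^{-1}$ exists. Then,

$$(B - \varepsilon I)^{-1} = \big((I - \varepsilon B^{-1}) B\big)^{-1} = B^{-1}\big(I - \varepsilon B^{-1}\big)^{-1} = \big(I + \varepsilon B^{-1} + o(\varepsilon)\big) B^{-1}. \tag{10}$$

This shows that $(B - \varepsilon I)^{-1}$ and $B^{-1}$ are close to each other.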
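The first-order expansion in Lemma 3 can be illustrated numerically. The sketch below is an assumption-laden example: the $2 \times 2$ matrix `B` and the values of $\varepsilon$ are arbitrary choices, and the helper name `approximation_error` is hypothetical.

```python
import numpy as np

B = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # an arbitrary positive definite matrix
I = np.eye(2)
B_inv = np.linalg.inv(B)

def approximation_error(eps):
    """Largest entrywise gap between (B - eps I)^{-1} and (I + eps B^{-1}) B^{-1}."""
    exact = np.linalg.inv(B - eps * I)
    first_order = (I + eps * B_inv) @ B_inv
    return np.abs(exact - first_order).max()

errors = {eps: approximation_error(eps) for eps in (1e-1, 1e-2, 1e-3)}
```

The gap shrinks roughly like $\varepsilon^2$, which is consistent with a remainder of order $o(\varepsilon)$.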

### 3.2 Convergence Area for the ℓ2 Regularization Case

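In this case, we have $R(w) = \|w\|_2^2$, and $\partial R / \partial w = 2w$. Then, the loss function is given in the following equation:

$$E(w) = \frac{1}{2N}\,\|g(X, w^*) - g(X, w)\|^2 + \frac{\lambda}{2}\,\|w\|_2^2. \tag{11}$$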

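Theorem 1: When $\lambda$ is small, $\hat{w}$ can be solved explicitly with the probability given in Lemma 2.

Proof: Let $\Delta w = 0$ at $w = \hat{w}$. According to equation (5), we have

$$X^\top D(\hat{w})\big(D^* X w^* - D(\hat{w}) X \hat{w}\big) + \lambda N \hat{w} = 0. \tag{12}$$

Let's first assume that $D(\hat{w}) = D^*$. Then, since $D^* D^* = D^*$, the equation can be simplified as

$$X^\top D^* X w^* = \big(X^\top D^* X - \lambda N I_d\big)\hat{w}. \tag{13}$$

Thus, we have

$$\hat{w} = \big(X^\top D^* X - \lambda N I_d\big)^{-1} X^\top D^* X w^*. \tag{14}$$

The inverse exists with the probability given in Lemma 2, according to Lemmas 2 and 3.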

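We now show that when $\lambda$ is small enough, this ensures that $D(\hat{w}) = D^*$. According to Lemmas 2 and 3, we have

$$\hat{w} = w^* + \lambda N \big(X^\top D^* X\big)^{-1} w^* + o_{d \times 1}(\lambda). \tag{15}$$

It is sufficient to show that $X\hat{w}$ and $Xw^*$, two vectors in $\mathbb{R}^N$, share the same sign in every position with probability 1. These two vectors are related by the equation

$$X\hat{w} = Xw^* + \lambda N X \big(X^\top D^* X\big)^{-1} w^* + o_{N \times 1}(\lambda). \tag{16}$$

Since $Xw^*$ doesn't contain 0 with probability 1, we can exclude these cases. Then, all terms after $Xw^*$ above don't influence the sign of $X\hat{w}$ when

$$\lambda \le \frac{1}{2N}\min_{1 \le i \le N}\frac{|(Xw^*)_i|}{\big|\big(X (X^\top D^* X)^{-1} w^*\big)_i\big|}. \tag{17}$$

The "2" in the denominator is used for eliminating the effects of $o_{N \times 1}(\lambda)$.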
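The explicit solution (14), its first-order form (15), and the bound (17) can be evaluated on synthetic data. The sketch below is not from the paper: the sizes, the seed, and the choice of $\lambda$ at one hundredth of the right-hand side of (17) are assumptions made so that the remainder terms are clearly negligible.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 5                                  # assumed toy sizes
X = rng.standard_normal((N, d))
w_star = rng.standard_normal(d)

active = (X @ w_star > 0).astype(float)        # diagonal of D*
B = X.T @ (active[:, None] * X)                # X^T D* X

# Right-hand side of the bound (17), then a lambda well inside it.
a = X @ w_star
b = X @ np.linalg.solve(B, w_star)
lam_bound = np.min(np.abs(a) / np.abs(b)) / (2 * N)
lam = 0.01 * lam_bound

# Equation (14): explicit solution under the assumption D(w_hat) = D*.
w_hat = np.linalg.solve(B - lam * N * np.eye(d), B @ w_star)
# Equation (15) without the o(lambda) remainder.
w_first_order = w_star + lam * N * np.linalg.solve(B, w_star)

# The assumption D(w_hat) = D* holds if the sign patterns coincide.
same_gates = np.array_equal(X @ w_hat > 0, X @ w_star > 0)
```

When `same_gates` is true, the assumption used to pass from equation (12) to equation (13) is satisfied by the computed $\hat{w}$, so it is a genuine solution of (12).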

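Now, we have shown that $\hat{w}$ is close to $w^*$ when $\lambda$ is small. The next step is to show that $w$ converges to $\hat{w}$ in a certain area, a task the Lyapunov method [6] is well suited to. In order to apply the Lyapunov method, we regard $t$ as a continuous index.

Theorem 2: With the probability given in Lemma 2, the following statement holds. When $N$ is large and $\lambda$ is small, consider the Lyapunov function $V(w) = \frac{1}{2}\|w - \hat{w}\|^2$. We have $\dot{V} < 0$ in the region introduced at the beginning of Section 3, and thus the system is asymptotically stable. That is, $w^t \to \hat{w}$ as $t \to \infty$.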

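Proof: We can write $\dot{V}$ as:

$$\dot{V} = (w - \hat{w})^\top\,\mathbb{E}\,\Delta w. \tag{18}$$

In order to simplify, let $\hat{w} = w^* + \lambda T$, where $T$ is given by

$$T = N\big(X^\top D^* X\big)^{-1} w^* + o_{d \times 1}(1) \in \mathbb{R}^d. \tag{19}$$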

Note ; can be written as where

$$K = M + \lambda P. \tag{20}$$

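According to Lemma 7.3 in [1], $M$ is given by the following:

$$M = \frac{1}{4\pi}\begin{pmatrix} 2\pi & -(2\pi - \theta)\cos\theta - \sin\theta \\ -(2\pi - \theta)\cos\theta - \sin\theta & \sin 2\theta + 2\pi - 2\theta \end{pmatrix}. \tag{21}$$

$P$ can also be divided into two parts: $P = P_1 + P_2$, where

$$P_1 = -\begin{pmatrix} 1 & -\frac{\cos\theta}{2} \\ -\frac{\cos\theta}{2} & 0 \end{pmatrix} \tag{22}$$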

and satisfies that . From this, we see that is bounded. Since is positive definite for according to Lemma7.3 [1], when is small, is also positive definite for . As a result, , which leads to the result that the system is asymptotically stable in .
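The positive definiteness of the matrix $M$ from equation (21), which the proof takes from Lemma 7.3 in [1], can be spot-checked numerically. The sketch below is illustrative only: the grid of angles on $(0, \pi/2]$ starting at $0.01$ is an assumption, and a grid check is not a proof.

```python
import numpy as np

def M(theta):
    """The 2x2 matrix of equation (21)."""
    off = -(2 * np.pi - theta) * np.cos(theta) - np.sin(theta)
    lower = np.sin(2 * theta) + 2 * np.pi - 2 * theta
    return np.array([[2 * np.pi, off],
                     [off, lower]]) / (4 * np.pi)

# Smallest eigenvalue of M on an assumed grid of angles in (0, pi/2].
thetas = np.linspace(0.01, np.pi / 2, 200)
min_eigs = np.array([np.linalg.eigvalsh(M(t)).min() for t in thetas])

# At theta = 0 the matrix is singular, so the angle 0 has to be excluded.
det_at_zero = np.linalg.det(M(0.0))
```

The smallest eigenvalue stays positive on the grid but approaches zero as $\theta \to 0$, so the perturbation $\lambda P$ in equation (20) has to be small relative to it for $K$ to remain positive definite.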

### 3.3 Convergence Area for the ℓ1 Regularization Case

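In this case, we have $R(w) = \|w\|_1^2$, and $\partial R / \partial w = 2\|w\|_1\,\mathrm{sign}(w)$, where $\mathrm{sign}(w)$ is the vector of signs of the elements in $w$. Then, the loss function is given in the following equation:

$$E(w) = \frac{1}{2N}\,\|g(X, w^*) - g(X, w)\|^2 + \frac{\lambda}{2}\,\|w\|_1^2. \tag{23}$$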
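The gradient of the squared $\ell_1$ norm, $2\|w\|_1\,\mathrm{sign}(w)$, is the factor that later appears as $\lambda N \|\hat{w}\|_1\,\mathrm{sign}(\hat{w})$ in the fixed-point equation. A quick finite-difference comparison is sketched below; the test vector, the step size, and the function names are assumptions, and the formula is only valid where no entry of $w$ is zero.

```python
import numpy as np

def l1_squared(w):
    """R(w) = ||w||_1^2."""
    return np.sum(np.abs(w)) ** 2

def l1_squared_grad(w):
    """dR/dw = 2 ||w||_1 sign(w), valid when no entry of w is zero."""
    return 2.0 * np.sum(np.abs(w)) * np.sign(w)

w = np.array([0.7, -1.2, 0.4])     # arbitrary vector with nonzero entries
h = 1e-6                           # finite-difference step

# Central differences along each coordinate direction.
numeric = np.array([
    (l1_squared(w + h * e) - l1_squared(w - h * e)) / (2 * h)
    for e in np.eye(len(w))
])
analytic = l1_squared_grad(w)
```

Unlike the $\ell_2$ case, this gradient depends on $w$ through both $\|w\|_1$ and the sign pattern, which is why the fixed-point equation has no closed-form solution.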

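Theorem 3: When $\lambda$ is small, $\hat{w}$ can be solved (not explicitly) with the probability given in Lemma 2.

Proof: Let $\Delta w = 0$ at $w = \hat{w}$. According to equation (5), we have

$$X^\top D(\hat{w})\big(D^* X w^* - D(\hat{w}) X \hat{w}\big) + \lambda N \|\hat{w}\|_1\,\mathrm{sign}(\hat{w}) = 0. \tag{24}$$

We still assume that $D(\hat{w}) = D^*$ to simplify the problem. Then, the equation becomes

$$f(\lambda, w^*, \hat{w}) = X^\top D^* X w^* - X^\top D^* X \hat{w} + \lambda N \|\hat{w}\|_1\,\mathrm{sign}(\hat{w}) = 0. \tag{25}$$

This problem is hard to solve, so we use the Implicit Function Theorem [7] here. The key is to examine whether the Jacobian matrix $J$ is invertible, where $J(i, j) = \partial f_i / \partial \hat{w}_j$. The result is

$$J(i, j) = -\big(X^\top D^* X\big)_{ij} + \lambda N\,\mathrm{sign}(\hat{w}_i)\,\mathrm{sign}(\hat{w}_j) = -\sum_{k=1}^{N} I(x_k^\top w^* > 0)\,x_{ki} x_{kj} + \lambda N\,\mathrm{sign}(\hat{w}_i)\,\mathrm{sign}(\hat{w}_j). \tag{26}$$

Since $X^\top D^* X$ is positive definite with the probability given in Lemma 2, and the second term has negligible influence when $\lambda$ is small, $J$ is then invertible. Thus, there exists a unique continuously differentiable function of $\lambda$ whose value $\hat{w}$ is the solution. Notice that when $\lambda = 0$, $\hat{w} = w^*$ is the solution. As a result, $\hat{w}$ can be expanded as $\hat{w} = w^* + \lambda u + o(\lambda)$ for some vector $u$. Additionally, $u$ might be very large because there is an $N$ after $\lambda$ in equations (24)-(26).

Then, we show that for small $\lambda$, we have $D(\hat{w}) = D^*$. The analysis is quite similar to Theorem 1. When

$$\lambda \le \frac{1}{2}\min_{(Xw^*)_i > 0}\frac{(Xw^*)_i}{|(Xu)_i|}, \tag{27}$$

we have that $D(\hat{w}) = D^*$.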

Remark: In Theorem 3 the bound on $\lambda$ is given in equation (27), where there is an unknown vector $u$ in the denominator. In fact, we are able to estimate its value from known quantities. When we apply the expansion of $\hat{w}$ to equation (25), we have

$$X^\top D^* X\big(-\lambda u + o(\lambda)\big) + \lambda N \|\hat{w}\|_1\,\mathrm{sign}(\hat{w}) = 0, \tag{28}$$

which is equivalent to the following equation:

$$X^\top D^* X u = N \|\hat{w}\|_1\,\mathrm{sign}(\hat{w}) + o(1). \tag{29}$$

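As assumed in Theorem 3, $D(\hat{w}) = D^*$, and assume without loss of generality that $(Xw^*)_i > 0$ exactly for $1 \le i \le \delta$. Let $X_\delta$ be the matrix consisting of the first $\delta$ rows of $X$. Then, $X^\top D^* X = X_\delta^\top X_\delta$. Thus, we have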

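$$u = N \|\hat{w}\|_1 \big(X_\delta^\top X_\delta\big)^{-1}\mathrm{sign}(\hat{w}) + o(1). \tag{30}$$

Then we have

$$Xu = N \|\hat{w}\|_1\,X \big(X_\delta^\top X_\delta\big)^{-1}\mathrm{sign}(\hat{w}) + o(1), \tag{31}$$

which indicates that

$$\max_{(Xw^*)_i > 0}(Xu)_i \le 2N \|\hat{w}\|_1\,\big\|X \big(X_\delta^\top X_\delta\big)^{-1}\big\|_\infty \tag{32}$$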

for small such that with small value that eliminates the effect of in equation (31). Finally, we are able to modify the bound in equation (27) by using the upper bound of in equation (32) to substitute this amount. The explicit bound is then given by the following equation:

$$\lambda \le \frac{\min_{(Xw^*)_i > 0}(Xw^*)_i}{4N \|w^*\|_1\,\big\|X \big(X_\delta^\top X_\delta\big)^{-1}\big\|_\infty}. \tag{33}$$

Although the explicit solution of $\hat{w}$ can't be found, we still draw the conclusion that $\hat{w}$ is close to $w^*$ for small $\lambda$. This is enough for the Lyapunov method, because we are able to control $P$ in equation (20) in a similar way.

Theorem 4: The statement in Theorem 2 still holds for $\ell_1$ regularization.

Proof: Similar to the analysis in Theorem 2, we still have equation (20) in this case with a different $P$. Thus, when $\lambda$ is small enough, $K$ is positive definite, and the conclusion is still correct here.

### 3.4 The Final Result

Since it's hard to draw samples directly from the convergence region of Theorems 2 and 4, we consider a small sphere centered at the origin. The analysis follows Theorem 7.4 (proof of Theorem 3.3) in [1].

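Theorem 5: For both $\ell_1$ and $\ell_2$ regularization, if the initial weight vector $w^0$ is sampled uniformly in a sufficiently small sphere centered at the origin, $w^t$ converges to $\hat{w}$ with probability $\geq (1-\varepsilon)(1-A_d)/2$, where the small value $\varepsilon$ bounds the radius of the sphere as in Theorem 3.3 of [1].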

Proof: The proof is almost exactly the same as the proof of Theorem 7.4 in [1]. The only thing to notice is that we exclude the line because we need . However, the line has measure zero and thus doesn’t change the conclusion.

Now, we have proved that Theorem 3.3 in [1] still applies for $\ell_1$ and $\ell_2$ regularization with small $\lambda$. This result is consistent with the argument that initial weights should be small rather than large [8].

## 4 Experiment Results and Analysis

First, in Figure 1 we demonstrate all possibilities: the dynamics converge/do not converge with / regularization. The parameters are: and . In Figure 2 we show two phase diagrams (or vector fields, after normalized) of the dynamics with and regularization with randomly selected and parameters , and . The big black point is , the small black points are the grid points uniformly selected in the plane, and green lines refer to the orientations of (from the end with a black point to the end without any point). Especially, when equals to in the case the dynamic is meaningless because does not exist.

Then, in order to examine the prediction given by Theorem 5, we made the following simulation. Under different values of , and , we simulated the dynamics for 500 times and compared the experiment ratio of convergence to the theoretical ratio (that is, the probability) of convergence in Theorem 5. Specifically, for both and situation was selected in , was selected in , and was selected in . The learning rate was set to be 0.05 and was set to be 0.1. Each time was sampled according to normal distribution and was sampled uniformly in . The results for the and regularization case are demonstrated in Table 1.

According to Table 1, we are able to make the following discussion. As shown in the table, there are four bold numbers, all of which lie in the regularization case when , indicating that for the situation is beyond the upper bound of for Theorem 1 or Theorem 2. In most situations, the experiment ratio of convergence decreases as increases, and the gap between and is much larger than the gap between and , which implies that also plays an important role in the convergence probability in Theorem 5. In most cases the experiment ratio is much larger than the theoretical ratio. This indicates that outside the sphere in Theorem 2 and Theorem 4 there is still much area in which the initial weights will converge to . Under the same parameters, the experiment ratio of convergence in the case is always greater than that in the case. This shows that the regularization makes the dynamic easier to converge than the regularization does.

## 5 Conclusion and Future Work

In this paper, we presented our convergence analysis of the dynamics of two-layered bias-free networks with one output, where the loss function includes the square error loss and $\ell_1$ or $\ell_2$ regularization on the weight vector. This is an extension of Theorem 3.3 in [1]. We first solved for the optimal weight vector with a small regularization coefficient in both cases, and then used the Lyapunov method [6] to show that the system is asymptotically stable in a certain area. In the final step, we showed that Theorem 3.3 in [1] is still correct in these two situations. We also verified our theory through numerical experiments, including phase diagrams and computer simulations.

Our work provides a theoretical justification of convergence for two popular models. We started from the intuition that small regularization doesn't change the system too much, and our conclusion is compatible with this intuition. In the future, we plan to analyze the system with larger regularization, since in real situations $\lambda$ is fixed to be, for example, 0.5, which may be larger than the bounds in equations (17) and (27). This is more difficult since we can no longer expect $D(\hat{w}) = D^*$, and other advanced techniques may be needed. We also plan to consider other popular regularization terms, and to provide a more general theory on this topic.

## References

• [1] Yuandong Tian, Symmetry-Breaking Convergence Analysis of Certain Two-Layered Neural Networks with ReLU Nonlinearity, 2016.
https://openreview.net/pdf?id=Hk85q85ee
• [2] Song Mei, Yu Bai, and Andrea Montanari, The Landscape of Empirical Risk for Non-convex Losses, 2017.
https://arxiv.org/abs/1607.06534v3
• [3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, Deep learning, in NATURE, vol. 521, pp. 436–444, 2015.
• [4] Vinod Nair, and Geoffrey Hinton, Rectified linear units improve restricted Boltzmann machines, in ICML, pp. 807–814, 2010.
• [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, 2015.
https://arxiv.org/pdf/1502.01852.pdf
• [6] LaSalle, J. P. and Lefschetz, S. Stability by Lyapunov's Second Method with Applications. New York: Academic Press, 1961.
• [7]
• [8] Geoffrey Hinton, Coursera: Neural Networks for Machine Learning.
https://www.coursera.org/learn/neural-networks