Step Size Matters in Deep Learning

# Step Size Matters in Deep Learning

Kamil Nar   S. Shankar Sastry
Electrical Engineering and Computer Sciences
University of California, Berkeley
###### Abstract

Training a neural network with the gradient descent algorithm gives rise to a discrete-time nonlinear dynamical system. Consequently, behaviors that are typically observed in these systems emerge during training, such as convergence to an orbit but not to a fixed point or dependence of convergence on the initialization. Step size of the algorithm plays a critical role in these behaviors: it determines the subset of the local optima that the algorithm can converge to, and it specifies the magnitude of the oscillations if the algorithm converges to an orbit. To elucidate the effects of the step size on training of neural networks, we study the gradient descent algorithm as a discrete-time dynamical system, and by analyzing the Lyapunov stability of different solutions, we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. The results provide an explanation for several phenomena observed in practice, including the deterioration in the training error with increased depth, the hardness of estimating linear mappings with large singular values, and the distinct performance of deep residual networks.


Preprint. Work in progress.

## 1 Introduction

The depth of a neural network determines the size of the class of functions that it can represent. As the depth is increased, this class of functions expands provided that the new layers are able to express the identity mapping. Therefore, the minimum training error that can be achieved by a network diminishes as its depth is increased. However, the training error of most neural networks degrades in practice once the number of layers exceeds a certain value; and deeper networks start to perform worse than their shallower counterparts, as shown in Figure 1 (He et al., 2016). This deterioration in the training error with increased depth indicates a problem with the method used for training the neural network; namely, a problem with the convergence of the gradient descent algorithm.

When gradient descent is used to minimize a function, say $f$, it leads to a discrete-time dynamical system:

$$x[k+1] = x[k] - \delta \nabla f(x[k]), \tag{1}$$

where $x$ is the state of the system, which consists of the parameters updated by the algorithm, and $\delta$ is the step size of the algorithm. Every fixed point of the system (1) is called an equilibrium of the system, and they correspond to the critical points of the function $f$.

Unless $f$ is a quadratic function of the parameters, the system described by (1) is either a nonlinear system or a hybrid system that switches from one dynamics to another over time. Consequently, the system (1) can exhibit behaviors that are typically observed in nonlinear and hybrid systems, such as convergence to an orbit but not to a fixed point, or dependence of the convergence on the initialization. The step size of the gradient descent algorithm has a critical effect on these behaviors, as shown in the following two examples.

Example 1. Convergence to a periodic orbit: Consider the continuously differentiable and convex function $f(x) = \tfrac{2}{3}|x|^{3/2}$, which has a unique local minimum at the origin. The gradient descent algorithm on this function yields

$$x[k+1] = \begin{cases} x[k] - \delta\sqrt{x[k]}, & x[k] \ge 0, \\ x[k] + \delta\sqrt{-x[k]}, & x[k] < 0. \end{cases}$$

As expected, the origin is the only equilibrium of this system. Interestingly, however, $x[k]$ converges to the origin only when the initial state $x[0]$ belongs to a countable set $S$:

$$S = \{0,\ \delta^2,\ -\delta^2,\ \dots\}.$$

For all other initializations, $x[k]$ converges to an oscillation between $\delta^2/4$ and $-\delta^2/4$. This implies that, if the initial state $x[0]$ is randomly drawn from a continuous distribution, then almost surely, $x[k]$ does not converge to the origin, yet $f(x[k])$ converges to $\delta^3/12$. In other words, with probability 1, the state $x[k]$ does not converge to a fixed point, such as a local optimum or a saddle point, even though the estimation error converges to a finite non-optimal value.
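The orbit in Example 1 is easy to reproduce numerically. The following Python sketch assumes the cost $f(x) = \tfrac{2}{3}|x|^{3/2}$, whose derivative $\mathrm{sign}(x)\sqrt{|x|}$ matches the update rule above; the function names, the initial state 0.7, and the step size 0.1 are illustrative choices and not taken from the paper.

```python
import math

def loss(x):
    """Assumed cost f(x) = (2/3)|x|^(3/2), whose derivative is sign(x) * sqrt(|x|)."""
    return (2.0 / 3.0) * abs(x) ** 1.5

def gd_step(x, delta):
    """One gradient descent step, written exactly as the piecewise update of Example 1."""
    if x >= 0:
        return x - delta * math.sqrt(x)
    return x + delta * math.sqrt(-x)

def run(x0, delta, num_steps):
    """Iterate the update num_steps times starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = gd_step(x, delta)
    return x

delta = 0.1
x_final = run(x0=0.7, delta=delta, num_steps=200)
x_next = gd_step(x_final, delta)

# The state keeps flipping sign with magnitude delta^2 / 4,
# while the cost settles at delta^3 / 12 instead of 0.
print(x_final, x_next, delta ** 2 / 4)
print(loss(x_final), delta ** 3 / 12)
```

The period-two orbit follows from solving $x - \delta\sqrt{x} = -x$ for $x > 0$, which gives $x = \delta^2/4$.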

Example 2. Dependence of convergence on the initialization: Consider the function $f(x) = x^L$ where $L$ is an even number larger than 2. The gradient descent results in the system

$$x[k+1] = x[k] - \delta L\, x[k]^{L-1}.$$

The state $x[k]$ converges to the origin if the initial state satisfies $x[0]^{L-2} < 2/(L\delta)$ and diverges if $x[0]^{L-2} > 2/(L\delta)$.
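The threshold in Example 2 can be checked with a short simulation. This is a minimal Python sketch assuming the cost $x^L$ with $L = 4$ and $\delta = 0.1$; these values, the helper name `run_power_gd`, and the divergence cutoff `1e6` are hypothetical choices made only for illustration.

```python
def run_power_gd(x0, delta, L, num_steps, blowup=1e6):
    """Gradient descent on x**L; returns inf once |x| exceeds the blowup cutoff."""
    x = x0
    for _ in range(num_steps):
        x = x - delta * L * x ** (L - 1)
        if abs(x) > blowup:
            return float("inf")
    return x

L = 4
delta = 0.1
# Largest |x[0]| for which the first step still contracts: (2 / (L * delta))^(1 / (L - 2)).
threshold = (2.0 / (L * delta)) ** (1.0 / (L - 2))

inside = run_power_gd(0.9 * threshold, delta, L, num_steps=10000)
outside = run_power_gd(1.1 * threshold, delta, L, num_steps=10000)
print(threshold, inside, outside)
```

Inside the threshold the iterates shrink toward the origin, although slowly, since the curvature of $x^L$ vanishes there; outside the threshold they grow without bound.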

These two examples demonstrate:

1. the convergence of training error does not imply the convergence of the gradient descent algorithm to a local optimum or a saddle point,

2. the step size determines the magnitude of the oscillations if the algorithm converges to an orbit but not to a fixed point,

3. the step size influences the convergence of the algorithm differently for each initialization.

Note that these are consequences of the nonlinear dynamics of the algorithm and not of the (non)convexity of the function to be minimized. While both of the functions used in the examples are convex, the identical behaviors are observed during the minimization of nonconvex cost functions of neural networks as well. In fact, only these behaviors can provide a satisfactory explanation for a phenomenon observed in Figure 1: right after the step size of the algorithm is decreased, the training error plummets. It cannot be the case that the parameters are slowly converging to an equilibrium right before the step size is changed, nor can they be stuck near a local optimum or a saddle point, because if either were the case, decreasing the step size would have further slowed down the convergence. These sharp falls can only be explained by the fact that the initial step size is too large for some regions in the parameter space, and the parameters are oscillating around a local optimum right before the step size is changed. Once the step size is decreased, the radius of the oscillations around the equilibrium point diminishes, the distance to the equilibrium point in the parameter space falls sharply, and consequently, so does the training error.
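The sharp fall after a step size decrease can be reproduced on the scalar system of Example 1. The sketch below is in Python and assumes the cost $f(x) = \tfrac{2}{3}|x|^{3/2}$, whose gradient gives the update of Example 1; the schedule (200 steps at $\delta = 0.1$, then 200 steps at $\delta = 0.01$) and the names `train` and `gd_step` are illustrative and unrelated to the experiment in Figure 1.

```python
import math

def loss(x):
    """Assumed cost f(x) = (2/3)|x|^(3/2)."""
    return (2.0 / 3.0) * abs(x) ** 1.5

def gd_step(x, delta):
    """Gradient step; the derivative of the assumed cost is sign(x) * sqrt(|x|)."""
    return x - delta * math.copysign(math.sqrt(abs(x)), x)

def train(x0, schedule):
    """schedule is a list of (delta, num_steps) phases; returns the loss after each phase."""
    x, losses = x0, []
    for delta, num_steps in schedule:
        for _ in range(num_steps):
            x = gd_step(x, delta)
        losses.append(loss(x))
    return losses

loss_large, loss_small = train(0.7, [(0.1, 200), (0.01, 200)])
# With the large step the loss stalls at delta^3 / 12; after the step size
# is divided by 10, the stalled loss falls by a factor of about 1000.
print(loss_large, loss_small)
```

In both phases the state is oscillating rather than resting at the minimum, so the stalled loss is set by the step size alone.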

While training a deep neural network, the dynamical system created by the gradient descent will usually have multiple equilibria, which coincide with the critical points of the training cost function. Convergence to these equilibria is in general affected unequally by the step size. For example, for a given step size, the algorithm might be able to converge to a subset of the local optima but not to the others independent of the initializations. Therefore, the step size also plays a critical role in understanding why some solutions are more likely to be obtained instead of the others when the gradient descent algorithm is used.

### 1.1 Our contributions

In this paper, we study the gradient descent algorithm as a discrete-time dynamical system during training deep neural networks, and we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. In particular, we achieve the following:

1. We highlight that one of the reasons the training error stops decreasing during training is the fact that the step size of the algorithm becomes larger than it should be for certain regions in the parameter space, and the parameters keep oscillating around a local optimum rather than slowly converging to it or being stuck near a saddle point. In particular, for every fixed step size, there is a significant possibility that the algorithm converges to an orbit instead of a critical point of the training cost function.

2. We analyze the Lyapunov stability of the gradient descent algorithm on deep linear networks and find different upper bounds on the step size to enable convergence to each solution. We show that for every step size, the algorithm can converge to only a subset of the local optima, and there are always some local optima that the algorithm cannot converge to independent of the initialization.

3. We show that symmetric positive definite matrices can be estimated with a deep linear network by initializing the weight matrices as the identity, and this initialization allows the use of the largest step size. Conversely, the gradient descent is most likely to converge for an arbitrarily chosen step size if the weight matrices are initialized as the identity.

4. We show that symmetric matrices with negative eigenvalues cannot be estimated with the identity initialization, and the gradient descent converges to the closest positive semidefinite matrix in the Frobenius norm.

5. For 2-layer neural networks with ReLU activations, we obtain an explicit relationship between the step size of the gradient descent and the output of the solution that the algorithm can converge to.

### 1.2 Related work

It is a well-known problem that the gradient of the training cost function can become disproportionate for different parameters when training a neural network. Several works in the literature tried to address this problem. For example, changing the geometry of optimization was proposed in (Neyshabur et al., 2017) and a regularized descent algorithm was proposed to prevent the gradients from exploding and vanishing during training.

Deep residual networks, which are a specific class of neural networks, yielded exceptional results in practice with their peculiar structure (He et al., 2016). By keeping each layer of the network close to the identity function, these networks were able to attain lower training and test errors as the depth of the network was increased. To explain their distinct behavior, the training cost function of their linear versions was shown to possess some crucial properties (Hardt & Ma, 2016). Later, equivalent results were also derived for nonlinear residual networks under some conditions (Bartlett et al., 2018a).

The effect of the step size on training neural networks was empirically investigated in (Daniel et al., 2016). A step size adaptation scheme was proposed in (Rolinek & Martius, 2018) for the stochastic gradient method and shown to outperform the training with a constant step size. Similarly, some heuristic methods with variable step size were introduced and tested empirically in (Magoulas et al., 1997) and (Jacobs, 1988).

Two-layer linear networks were first studied in (Baldi & Hornik, 1989). The analysis was extended to deep linear networks in (Kawaguchi, 2016), and it was shown that all local optima of these networks were also the global optima. It was discovered in (Hardt & Ma, 2016) that the only critical points of these networks were actually the global optima as long as all layers remained close to the identity function during training. The dynamics of training these networks were also analyzed in (Saxe et al., 2013) and (Gunasekar et al., 2017) by assuming an infinitesimal step size and using a continuous-time approximation to the dynamics.

Lyapunov analysis from the dynamical system theory (Khalil, 2002; Sastry, 1999), which is the main tool for our results in this work, was used in the past to understand and improve the training of neural networks – especially that of the recurrent neural networks (Michel et al., 1988; Matsuoka, 1992; Barabanov & Prokhorov, 2002). State-of-the-art feedforward networks, however, have not been analyzed from this perspective.

We summarize the major differences between our contributions and the previous works as follows:

1. We relate the vanishing and exploding gradients that arise during training feedforward networks to the Lyapunov stability of the gradient descent algorithm.

2. Unlike the continuous-time analyses given in (Saxe et al., 2013) and (Gunasekar et al., 2017), we study the discrete-time dynamics of the gradient descent with an emphasis on the step size. By doing so, we obtain upper bounds on the step size to be used, and we show that the step size restricts the set of local optima that the algorithm can converge to. Note that these results cannot be obtained with a continuous-time approximation.

3. For deep linear networks with residual structure, (Hardt & Ma, 2016) shows that the gradient of the cost function cannot vanish away from a global optimum. This is not enough, however, to suggest the fast convergence of the algorithm. Given a fixed step size, the algorithm may also converge to an oscillation around a local optimum, as in the case of Example 1. We rule out this possibility and provide a step size so that the algorithm converges to a global optimum with a linear rate.

4. We recently found out that the convergence of the gradient descent algorithm was also studied in (Bartlett et al., 2018b) for symmetric positive definite matrices independently of and concurrently with our preliminary work (Nar & Sastry, 2018). However, unlike (Bartlett et al., 2018b), we explicitly give a step size value for the algorithm to converge with a linear rate, and we emphasize the fact that the identity initialization allows convergence with the largest step size.

## 2 Upper bounds on the step size for training deep linear networks

Deep linear networks are a special class of neural networks that do not contain nonlinear activations. They represent a linear mapping and can be described by a multiplication of a set of matrices:

$$W_L W_{L-1} \cdots W_2 W_1,$$

where $W_i$ is a real matrix of compatible dimensions for each $i \in [L]$. Even though the functions they represent are linear, their training cost is not a quadratic function of the parameters, and therefore, the dynamics of the gradient descent is always nonlinear during training of these networks. For this reason, they provide a simple model to study some of the nonlinear behaviors observed during training.

The training cost functions of the deep linear networks always have multiple optima. In fact, given a cost function $\ell(W_L \cdots W_1)$, if the point $\{\hat W_i\}_{i \in [L]}$ is a local minimum, then $\{\alpha_i \hat W_i\}_{i \in [L]}$ is also a local minimum for every set of scalars $\{\alpha_i\}_{i \in [L]}$ satisfying $\alpha_1 \alpha_2 \cdots \alpha_L = 1$. Consequently, independent of the function $\ell$, none of the local optima is isolated in the parameter space, and the cost function is not strongly convex at any point.

Although multiple local optima attain the same training cost for deep linear networks, the dynamics of the gradient descent algorithm exhibits distinct behaviors around these points. In particular, the step size required to render each of these local optima stable in the sense of Lyapunov is very different. Since the Lyapunov stability of a point is a necessary condition for the convergence of the algorithm to that point, the step size that allows convergence to each solution is also different, which is formalized in Theorem 1.

###### Theorem 1.

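Given a nonzero matrix $R \in \mathbb{R}^{m \times n}$ and a set of points $\{x_i\}_{i \in [N]}$ in $\mathbb{R}^n$ that satisfy $\frac{1}{N}\sum_{i=1}^N x_i x_i^\top = I$, assume that $R$ is estimated as a multiplication of the matrices $\{W_j\}_{j \in [L]}$ by minimizing the squared error loss

$$\frac{1}{2N}\sum_{i=1}^{N} \left\| R x_i - W_L W_{L-1} \cdots W_2 W_1 x_i \right\|_2^2 \tag{2}$$

where $W_j$ is a real matrix of compatible dimensions for all $j \in [L]$. Then the gradient descent algorithm can converge to a solution $\{\hat W_j\}_{j \in [L]}$ only if the step size $\delta$ satisfies

$$\delta \le \frac{2}{\sum_{j=1}^{L} p_{j-1}^2\, q_{j+1}^2} \tag{3}$$

where

$$p_j = \left\| \hat W_j \cdots \hat W_2 \hat W_1 v \right\|, \qquad q_j = \left\| u^\top \hat W_L \hat W_{L-1} \cdots \hat W_j \right\| \qquad \forall j \in [L],$$

and $u$ and $v$ are the left and right singular vectors of $\hat R = \hat W_L \cdots \hat W_1$ corresponding to its largest singular value.

Considering all the solutions $\{\hat W_j\}_{j \in [L]}$ that satisfy $\hat W_L \cdots \hat W_1 = \hat R$, the bound in (3) can be arbitrarily small for some of the local optima. Therefore, given a fixed step size $\delta$, the gradient descent can converge to only a subset of the local optima, and there are always some solutions that the gradient descent cannot converge to independent of the initialization.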
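To see how the bound (3) varies across solutions that represent the same mapping, consider the scalar special case in which every weight matrix is $1 \times 1$. The following Python sketch rests on that assumption: with scalars, $u = v = 1$ up to sign, $p_{j-1} = |w_{j-1} \cdots w_1|$ and $q_{j+1} = |w_L \cdots w_{j+1}|$, with empty products taken as 1. The function name and the example weights are hypothetical.

```python
def step_size_bound(weights):
    """Bound (3) for scalar weights [w_1, ..., w_L] (1x1 special case)."""
    L = len(weights)
    total = 0.0
    for j in range(1, L + 1):
        p = 1.0
        for w in weights[: j - 1]:  # w_1 ... w_{j-1}
            p *= w
        q = 1.0
        for w in weights[j:]:  # w_{j+1} ... w_L
            q *= w
        total += (p * q) ** 2
    return 2.0 / total

# Two solutions with the same product w_3 * w_2 * w_1 = 8.
balanced = step_size_bound([2.0, 2.0, 2.0])
unbalanced = step_size_bound([8.0, 4.0, 0.25])
print(balanced, unbalanced)
```

Both weight sets realize the same scalar map, yet the unbalanced one tolerates a step size roughly twenty times smaller, which illustrates why a fixed step size excludes some of the optima.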

Remark 1. Theorem 1 provides a necessary condition for convergence to a specific solution. It does not state that given a step size $\delta$, the algorithm converges to a solution which satisfies (3). It only rules out the possibility of converging to a large subset of the local optima. It might be the case, for example, that the algorithm converges to an oscillation around a local optimum which violates (3) even though there are some other local optima which satisfy (3).

As a necessary condition for the convergence to a global optimum, we can also find an upper bound on the step size independent of the weight matrices of the solution, which is given next.

###### Corollary 1.

For the minimization problem in Theorem 1, the gradient descent with step size $\delta$ cannot converge to a global optimum unless the step size satisfies

$$\delta \le \frac{2}{L\,\rho(R)^{2(L-1)/L}} \tag{4}$$

where $\rho(R)$ is the largest singular value of $R$.

Remark 2. Corollary 1 shows that, unlike the optimization of the ordinary least squares problem, the step size required for the convergence of the algorithm depends on the parameter to be estimated, $R$. Consequently, estimating linear mappings with larger singular values requires the use of a smaller step size. Moreover, the step size used during training gives information about the solution obtained if the algorithm converges, which is stated next.

###### Corollary 2.

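Assume that the gradient descent with step size $\delta$ has converged to a local optimum $\{\hat W_j\}_{j \in [L]}$ for the minimization problem in Theorem 1. Then the largest singular value of $\hat R = \hat W_L \cdots \hat W_1$ satisfies

$$\rho(\hat R) \le \left( \frac{2}{L\delta} \right)^{L/(2L-2)}.$$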
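The two corollaries are inverses of each other, which a short calculation makes concrete. The Python sketch below simply evaluates the right-hand sides of (4) and of Corollary 2; the function names and the sample values $\rho = 8$, $L = 3$ are illustrative assumptions.

```python
def max_step_size(rho, L):
    """Right-hand side of bound (4): 2 / (L * rho^(2(L-1)/L))."""
    return 2.0 / (L * rho ** (2.0 * (L - 1) / L))

def max_singular_value(delta, L):
    """Right-hand side of Corollary 2: (2 / (L * delta))^(L / (2L - 2)); needs L >= 2."""
    return (2.0 / (L * delta)) ** (L / (2.0 * L - 2.0))

delta_max = max_step_size(rho=8.0, L=3)         # 2 / (3 * 16) = 1/24
rho_back = max_singular_value(delta_max, L=3)   # recovers 8
print(delta_max, rho_back)
```

For a fixed depth, a tenfold increase in the largest singular value shrinks the admissible step size by a factor of $10^{2(L-1)/L}$, which approaches 100 as the depth grows.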

## 3 Identity initialization allows the largest step size for estimating symmetric positive definite matrices

Corollary 1 provides only a necessary condition for the convergence of the algorithm, and the bound (4) is not tight for every estimation problem. However, if the matrix to be estimated is symmetric and positive definite, the algorithm can converge to a solution with step sizes close to (4), which requires a specific initialization of the weight parameters.

###### Theorem 2.

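Assume that $R \in \mathbb{R}^{n \times n}$ is a symmetric positive semidefinite matrix, and given a set of points $\{x_i\}_{i \in [N]}$ which satisfy $\frac{1}{N}\sum_{i=1}^N x_i x_i^\top = I$, the matrix $R$ is estimated as a multiplication of the square matrices $\{W_j\}_{j \in [L]}$ by minimizing

$$\frac{1}{2N}\sum_{i=1}^{N} \left\| R x_i - W_L \cdots W_1 x_i \right\|_2^2.$$

If the weight parameters are initialized as $W_j[0] = I$ for all $j \in [L]$ and the step size satisfies

$$\delta \le \min\left\{ \frac{1}{L},\ \frac{1}{L\,\rho(R)^{2(L-1)/L}} \right\},$$

then each $W_j$ converges to $R^{1/L}$ with a linear rate.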

Remark 3. Theorem 2 shows that the algorithm converges to a global optimum despite the nonconvexity of the optimization, and it provides a case where the bound (4) is tight. The tightness of the bound implies that for the same step size, most of the other global optima are unstable in the sense of Lyapunov, and therefore, the algorithm cannot converge to them independent of the initialization. Consequently, using identity initialization allows convergence to a solution which is most likely to be stable in the sense of Lyapunov for an arbitrarily chosen step size.

Remark 4. The fact that the identity initialization allows convergence to a global optimum even for large step sizes is remarkable. Given that the identity initialization on deep linear networks is equivalent to the zero initialization of linear residual networks (Hardt & Ma, 2016), Theorem 2 provides an explanation for the exceptional performance of residual networks as well (He et al., 2016).
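Theorem 2 can be illustrated with a small simulation. The Python sketch below assumes that $R$ is diagonal, so that under the identity initialization every weight matrix stays diagonal and identical across layers, and each eigenvalue evolves by an independent scalar recursion; the eigenvalues, depth, and iteration count are hypothetical choices.

```python
def train_identity_init(eigenvalues, L, num_steps):
    """Gradient descent from W_j = I for a diagonal R, tracked per eigenvalue."""
    rho = max(eigenvalues)
    # Largest step size allowed by Theorem 2.
    delta = min(1.0 / L, 1.0 / (L * rho ** (2.0 * (L - 1) / L)))
    w = [1.0] * len(eigenvalues)  # shared diagonal of every W_j
    for _ in range(num_steps):
        w = [wi - delta * wi ** (L - 1) * (wi ** L - lam)
             for wi, lam in zip(w, eigenvalues)]
    return delta, w

eigenvalues = [4.0, 1.0, 0.25]
delta, w = train_identity_init(eigenvalues, L=3, num_steps=2000)
product = [wi ** 3 for wi in w]  # diagonal of W_3 W_2 W_1
print(delta, product)
```

Each diagonal entry of the weights approaches the corresponding $\lambda^{1/L}$, so the product recovers $R$ even though the step size sits at the edge of what Theorem 2 allows.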

When the matrix to be estimated is symmetric but not positive semidefinite, the bound (4) is still tight for some of the global optima. In this case, however, the eigenvalues of the estimate cannot attain negative values if the weight matrices are initialized with the identity.

###### Theorem 3.

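Let $R$ in Theorem 2 be a symmetric matrix such that the minimum eigenvalue of $R$, $\lambda_{\min}(R)$, is negative. If the weight parameters are initialized as $W_j[0] = I$ for all $j \in [L]$ and the step size satisfies

$$\delta \le \min\left\{ \frac{1}{1 - \lambda_{\min}(R)},\ \frac{1}{L},\ \frac{1}{L\,\rho(R)^{2(L-1)/L}} \right\},$$

then the estimate $\hat R = \hat W_L \cdots \hat W_1$ converges to the closest positive semidefinite matrix to $R$ in Frobenius norm.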
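The behavior described in Theorem 3 can be observed on a diagonal example. This Python sketch assumes $R = \mathrm{diag}(2, -1)$ and $L = 3$, so that the identity initialization keeps all weights diagonal and the dynamics decouples per eigenvalue; the function names and the iteration count are illustrative.

```python
def theorem3_step_size(eigenvalues, L):
    """Largest step size allowed by Theorem 3 for a symmetric R with these eigenvalues."""
    lam_min = min(eigenvalues)
    rho = max(abs(lam) for lam in eigenvalues)  # largest singular value of R
    return min(1.0 / (1.0 - lam_min), 1.0 / L,
               1.0 / (L * rho ** (2.0 * (L - 1) / L)))

def train_identity_init(eigenvalues, L, delta, num_steps):
    """Returns the eigenvalues of the estimate W_L ... W_1 after training from W_j = I."""
    w = [1.0] * len(eigenvalues)
    for _ in range(num_steps):
        w = [wi - delta * wi ** (L - 1) * (wi ** L - lam)
             for wi, lam in zip(w, eigenvalues)]
    return [wi ** L for wi in w]

eigenvalues = [2.0, -1.0]
delta = theorem3_step_size(eigenvalues, L=3)
estimate = train_identity_init(eigenvalues, L=3, delta=delta, num_steps=5000)
projection = [max(lam, 0.0) for lam in eigenvalues]  # closest PSD matrix for diagonal R
print(estimate, projection)
```

The positive eigenvalue is recovered, while the weight tied to the negative eigenvalue shrinks toward zero without changing sign, so the estimate approaches $\mathrm{diag}(2, 0)$.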

## 4 Effect of step size on training two-layer networks with ReLU activations

In Section 2, we analyzed the relationship between the step size of the gradient descent algorithm and the solutions that can be obtained by training deep linear networks. A similar relationship exists for nonlinear networks as well. The following theorem, for example, provides an upper bound on the step size for the convergence of the algorithm when the network has two layers and ReLU activations.

###### Theorem 4.

Given a set of points $\{x_i\}_{i \in [N]}$ in $\mathbb{R}^n$, let a function $f : \mathbb{R}^n \to \mathbb{R}^m$ be estimated by a two-layer neural network with ReLU activations by minimizing the squared error loss:

$$\min_{W, V}\ \frac{1}{2}\sum_{i=1}^{N} \left\| W g(V x_i - b) - f(x_i) \right\|_2^2,$$

where $g(\cdot)$ is the ReLU function, $b$ is the fixed bias vector, and the optimization is only over the weight parameters $W$ and $V$. If the gradient descent algorithm with step size $\delta$ converges to a solution $(\hat W, \hat V)$, then the estimate $\hat f(x) = \hat W g(\hat V x - b)$ satisfies

$$\max_{i \in [N]}\ \|x_i\|_2\, \|\hat f(x_i)\|_2 \le \frac{1}{\delta}.$$

Theorem 4 shows that if the algorithm is able to converge with a large step size, then the estimate $\hat f(x)$ must have a small magnitude for large values of $\|x\|$.
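The condition in Theorem 4 is simple to evaluate for a given candidate solution. The following Python sketch computes $1 / \max_i \|x_i\|_2 \|\hat f(x_i)\|_2$ for a two-layer ReLU network stored as nested lists; the helper names and the toy network, which realizes $\hat f(x) = x$ on scalar inputs, are assumptions made for illustration.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def norm(v):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(vi * vi for vi in v))

def estimate(W, V, b, x):
    """Network output W g(Vx - b) with g the ReLU function."""
    hidden = [max(0.0, zi - bi) for zi, bi in zip(matvec(V, x), b)]
    return matvec(W, hidden)

def step_size_upper_bound(W, V, b, points):
    """Largest step size compatible with Theorem 4 for the solution (W, V)."""
    worst = max(norm(x) * norm(estimate(W, V, b, x)) for x in points)
    return float("inf") if worst == 0.0 else 1.0 / worst

# Toy solution: relu(x) - relu(-x) = x on scalar inputs.
W = [[1.0, -1.0]]
V = [[1.0], [-1.0]]
b = [0.0, 0.0]
points = [[1.0], [-2.0], [3.0]]
bound = step_size_upper_bound(W, V, b, points)
print(bound)
```

Since the condition is only necessary, a step size below this value does not guarantee convergence to the given solution; a step size above it rules that solution out.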

Similar to Corollary 1, the bound given by Theorem 4 is not necessarily tight. Nevertheless, it highlights the effect of the step size on the convergence of the algorithm. To demonstrate that small changes in the step size could lead to significantly different solutions, we generated a piecewise continuous function $f$ and estimated it with a two-layer network by minimizing the squared error loss with two different step sizes. The initial values of $W$ and $V$ and the constant vector $b$ were all drawn from independent standard normal distributions; and the vector $b$ was kept the same for both of the step sizes used. As shown in Figure 2, training with one of the step sizes converged to a fixed solution, which provided an estimate close to the original function $f$. In contrast, training with the other step size converged to an oscillation and not to a fixed point. That is, after sufficient training, the estimate kept switching between two different functions at each iteration of the gradient descent.

## 5 Discussion

When gradient descent is used to minimize a function, typically only three possibilities are considered: convergence to a local optimum, to a global optimum, or to a saddle point. In this work, we considered a fourth possibility: the gradient descent may not converge at all, even in the deterministic setting. The training error may not reflect the oscillations in the dynamics, or when a stochastic optimization method is used, the oscillations in the training error might be wrongly attributed to the stochasticity of the algorithm. We underlined that, if the training error of an algorithm converges to a non-optimal value, that does not imply the algorithm is stuck near a bad local optimum or a saddle point; it might simply be the case that the algorithm has not converged at all.

We showed that the step size of the gradient descent influences the dynamics of the algorithm substantially. It renders some of the local optima unstable in the sense of Lyapunov, and the algorithm cannot converge to these points independent of the initialization. It also determines the magnitude of the oscillations if the algorithm converges to an orbit around an equilibrium point in the parameter space.

In Corollary 2 and Theorem 4, we showed that the step size required for convergence to a specific solution depends on the solution itself. This reveals that some solutions, such as linear functions with large singular values, are harder to converge to. Given that there exists a relationship between the Lipschitz constants of the estimated functions and their generalization error (Bartlett et al., 2017), this result could provide a better understanding of the generalization of deep neural networks.

## References

• [1] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, Vol. 2, pp. 53–58, 1989.
• [2] N. E. Barabanov and D. V. Prokhorov. Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, Vol. 13, No. 2, pp. 292–303, 2002.
• [3] P. L. Bartlett, S. Evans, and P. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv:1804.05012 [cs.LG], 2018a.
• [4] P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
• [5] P. L. Bartlett, D. P. Helmbold, and P. Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. arXiv:1802.06093 [cs.LG], 2018b.
• [6] C. Daniel, J. Taylor, and S. Nowozin. Learning step size controllers for robust neural network training. In AAAI Conference on Artificial Intelligence, 2016.
• [7] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6152–6160, 2017.
• [8] M. Hardt and T. Ma. Identity matters in deep learning. arXiv:1611.04231 [cs.LG], 2016.
• [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
• [10] R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, Vol. 1, pp. 295–307, 1988.
• [11] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
• [12] H. K. Khalil. Nonlinear Systems, 3rd Edition. Prentice Hall, 2002.
• [13] G. D. Magoulas, M. N. Vrahatis and G. S. Androulakis. Effective backpropagation training with variable stepsize. Neural Networks, Vol. 10, No. 1, pp. 69–82, 1997.
• [14] K. Matsuoka. Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, Vol. 5, No. 3, pp. 495–500, 1992.
• [15] A. N. Michel, J. A. Farrell, and W. Porod. Stability results for neural networks. In Neural Information Processing Systems, pp. 554–563, 1988.
• [16] K. Nar and S. Shankar Sastry. Residual networks: Lyapunov stability and convex decomposition. arXiv:1803.08203 [cs.LG], 2018.
• [17] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv:1705.03071 [cs.LG], 2017.
• [18] S. Sastry. Nonlinear Systems: Analysis, Stability, and Control. Springer: New York, NY, 1999.
• [19] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 [cs.NE], 2013.
• [20] M. Rolinek and G. Martius. L4: Practical loss-based stepsize adaptation for deep learning. arXiv:1802.05074 [cs.LG], 2018.

## Appendix A Proof of Theorem 1 and Corollary 1

###### Lemma 1.

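Let $A, B \in \mathbb{R}^{n \times n}$ be symmetric and positive semidefinite. Then, $\langle A, B \rangle \ge 0$.

###### Proof.

We can write $B$ as $B = \sum_{i=1}^{n} \lambda_i u_i u_i^\top$, where $\lambda_i \ge 0$ for all $i \in [n]$ and $u_i^\top u_j = 0$ if $i \ne j$. Then,

$$\langle A, B \rangle = \mathrm{trace}\{AB\} = \mathrm{trace}\Big\{ A \sum_{i=1}^{n} \lambda_i u_i u_i^\top \Big\} = \sum_{i=1}^{n} \lambda_i u_i^\top A u_i \ge 0.$$

∎

###### Lemma 2.

Let $f : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ be a linear map defined as $f(X) = \sum_{i=1}^{L} A_i X B_i$ where $A_i \in \mathbb{R}^{m \times m}$ and $B_i \in \mathbb{R}^{n \times n}$ are symmetric positive semidefinite matrices for all $i \in [L]$. Then, for every nonzero $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$, the largest eigenvalue of $f$ satisfies

$$\lambda_{\max}(f) \ge \frac{1}{\|u\|_2^2 \|v\|_2^2} \sum_{i=1}^{L} (u^\top A_i u)(v^\top B_i v).$$

###### Proof.

First, we show that $f$ is symmetric and positive semidefinite. Given two matrices $X, Y \in \mathbb{R}^{m \times n}$, we can write

$$\langle X, f(Y) \rangle = \mathrm{trace}\Big\{ \sum_i X^\top A_i Y B_i \Big\} = \mathrm{trace}\Big\{ \sum_i B_i Y^\top A_i X \Big\} = \langle Y, f(X) \rangle,$$

$$\langle X, f(X) \rangle = \mathrm{trace}\Big\{ \sum_i X^\top A_i X B_i \Big\} = \sum_i \langle X^\top A_i X, B_i \rangle \ge 0,$$

where the last inequality follows from Lemma 1. This shows that $f$ is symmetric and positive semidefinite. Then, for every nonzero $X \in \mathbb{R}^{m \times n}$, we have

$$\lambda_{\max}(f) \ge \frac{1}{\langle X, X \rangle} \langle X, f(X) \rangle.$$

In particular, given two nonzero vectors $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$,

$$\lambda_{\max}(f) \ge \frac{1}{\langle uv^\top, uv^\top \rangle} \langle uv^\top, f(uv^\top) \rangle = \frac{1}{\|u\|_2^2 \|v\|_2^2} \sum_{i=1}^{L} (u^\top A_i u)(v^\top B_i v).$$

∎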

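Proof of Theorem 1. The cost function (2) in Theorem 1 can be written as

$$\frac{1}{2}\,\mathrm{trace}\left\{ (W_L \cdots W_1 - R)^\top (W_L \cdots W_1 - R) \right\}.$$

Let $E$ denote the error in the estimate, i.e. $E = W_L \cdots W_1 - R$. The gradient descent yields

$$W_i[k+1] = W_i[k] - \delta\, W_{i+1}^\top[k] \cdots W_L^\top[k]\, E[k]\, W_1^\top[k] \cdots W_{i-1}^\top[k] \qquad \forall i \in [L]. \tag{5}$$

By multiplying the update equations of $W_i[k]$ and subtracting $R$, we can obtain the dynamics of $E$ as

$$E[k+1] = E[k] - \delta \sum_{i=1}^{L} A_i[k]\, E[k]\, B_i[k] + o(E[k]), \tag{6}$$

where $o(\cdot)$ denotes the higher order terms, and

$$A_i = W_L W_{L-1} \cdots W_{i+1} W_{i+1}^\top \cdots W_{L-1}^\top W_L^\top \qquad \forall i \in [L],$$

$$B_i = W_1^\top W_2^\top \cdots W_{i-1}^\top W_{i-1} \cdots W_2 W_1 \qquad \forall i \in [L].$$

Lyapunov’s indirect method of stability (Khalil, 2002; Sastry, 1999) states that given a dynamical system $x[k+1] = x[k] + F(x[k])$, its equilibrium $x^*$ is stable in the sense of Lyapunov only if the linearization of the system around $x^*$

$$(x[k+1] - x^*) = (x[k] - x^*) + \frac{\partial F}{\partial x}\bigg|_{x = x^*} (x[k] - x^*)$$

does not have any eigenvalue larger than 1 in magnitude. By using this fact for the system defined by (5)-(6), we can observe that an equilibrium $\{\hat W_j\}_{j \in [L]}$ with $\hat W_L \cdots \hat W_1 = \hat R$ is stable in the sense of Lyapunov only if the system

$$(E[k+1] - \hat R + R) = (E[k] - \hat R + R) - \delta \sum_{i=1}^{L} A_i\big|_{\{\hat W_j\}} (E[k] - \hat R + R)\, B_i\big|_{\{\hat W_j\}}$$

does not have any eigenvalue larger than 1 in magnitude, which requires that the mapping

$$f(\tilde E) = \sum_{i=1}^{L} A_i\big|_{\{\hat W_j\}}\, \tilde E\, B_i\big|_{\{\hat W_j\}} \tag{7}$$

does not have any real eigenvalue larger than $2/\delta$. Let $u$ and $v$ be the left and right singular vectors of $\hat R$ corresponding to its largest singular value, and let $p_j$ and $q_j$ be defined as in the statement of Theorem 1. Then, by Lemma 2, the mapping $f$ in (7) does not have an eigenvalue larger than $2/\delta$ only if

$$\sum_{i=1}^{L} p_{i-1}^2\, q_{i+1}^2 \le \frac{2}{\delta},$$

which completes the proof.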

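Proof of Corollary 1. Note that

$$q_{i+1}\, p_i = \left\| u^\top W_L W_{L-1} \cdots W_{i+1} \right\|_2 \left\| W_i \cdots W_2 W_1 v \right\|_2 \ge \left\| u^\top W_L \cdots W_1 v \right\|_2 = \rho(R).$$

As long as $\rho(R) \neq 0$, we have $p_i \neq 0$ for all $i \in [L]$, and therefore,

$$p_{i-1}^2\, q_{i+1}^2 \ge \frac{p_{i-1}^2}{p_i^2}\, \rho(R)^2. \tag{8}$$

Using inequality (8), the bound in Theorem 1 can be relaxed as

$$\delta \le 2 \left( \sum_{i=1}^{L} \frac{p_{i-1}^2}{p_i^2}\, \rho(R)^2 \right)^{-1}. \tag{9}$$

Since $\prod_{i=1}^{L} (p_i / p_{i-1}) = \rho(R)$, we also have, by the inequality of arithmetic and geometric means,

$$\sum_{i=1}^{L} \frac{p_{i-1}^2}{p_i^2}\, \rho(R)^2 \ge \sum_{i=1}^{L} \frac{\rho(R)^2}{\left( \rho(R)^{1/L} \right)^2} = L\, \rho(R)^{2(L-1)/L},$$

and the bound in (9) can be simplified as

$$\delta \le \frac{2}{L\, \rho(R)^{2(L-1)/L}}.$$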

## Appendix B Proof of Theorem 2

###### Lemma 3.

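Let $\lambda > 0$ be estimated as a multiplication of the scalar parameters $\{w_i\}_{i \in [L]}$ by minimizing $\frac{1}{2}(w_L \cdots w_2 w_1 - \lambda)^2$ via gradient descent. Assume that $w_i[0] = 1$ for all $i \in [L]$. If the step size $\delta$ is chosen to be less than or equal to

$$\delta_c = \begin{cases} L^{-1} \lambda^{-2(L-1)/L} & \text{if } \lambda \in [1, \infty), \\ (1 - \lambda)^{-1} (1 - \lambda^{1/L}) & \text{if } \lambda \in (0, 1), \end{cases}$$

then $|w_i[k] - \lambda^{1/L}| \le \beta(\delta)^k\, |1 - \lambda^{1/L}|$ for all $i \in [L]$, where

$$\beta(\delta) = \begin{cases} 1 - \delta (\lambda - 1)(\lambda^{1/L} - 1)^{-1} & \text{if } \lambda \in (1, \infty), \\ 1 - \delta L \lambda^{2(L-1)/L} & \text{if } \lambda \in (0, 1]. \end{cases}$$

###### Proof.

Due to symmetry, $w_i[k] = w_j[k]$ for all $k \in \mathbb{N}$ for all $i, j \in [L]$. Denoting any of them by $w[k]$, we have

$$w[k+1] = w[k] - \delta\, w^{L-1}[k] \left( w^L[k] - \lambda \right).$$

To show that $w[k]$ converges to $\lambda^{1/L}$, we can write

$$w[k+1] - \lambda^{1/L} = \mu(w[k]) \left( w[k] - \lambda^{1/L} \right),$$

where

$$\mu(w) = 1 - \delta\, w^{L-1} \sum_{j=0}^{L-1} w^j \lambda^{(L-1-j)/L}.$$

If there exists some $\beta \in [0, 1)$ such that

$$0 \le \mu(w[k]) \le \beta \quad \text{for all } k \in \mathbb{N}, \tag{9}$$

then $w[k]$ is always larger or always smaller than $\lambda^{1/L}$, and its distance to $\lambda^{1/L}$ decreases by a factor of $\beta$ at each step. Since $\mu(w)$ is a monotonic function in $w$, the condition (9) holds for all $k$ if it holds only for $w = 1$ and $w = \lambda^{1/L}$, which gives us $\delta_c$ and $\beta(\delta)$. ∎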
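The contraction argument of Lemma 3 can be checked numerically. The Python sketch below records the factor $\mu(w[k])$ along the scalar trajectory started at $w[0] = 1$ and compares it with $\beta(\delta)$; the function names, the choice $\delta = \delta_c / 2$, and the values $\lambda = 4$, $L = 3$ are illustrative assumptions.

```python
def delta_c(lam, L):
    """Critical step size of Lemma 3 for lam > 0."""
    if lam >= 1.0:
        return 1.0 / (L * lam ** (2.0 * (L - 1) / L))
    return (1.0 - lam ** (1.0 / L)) / (1.0 - lam)

def beta(delta, lam, L):
    """Contraction factor of Lemma 3."""
    if lam > 1.0:
        return 1.0 - delta * (lam - 1.0) / (lam ** (1.0 / L) - 1.0)
    return 1.0 - delta * L * lam ** (2.0 * (L - 1) / L)

def mu(w, delta, lam, L):
    """Per-step factor in w[k+1] - lam^(1/L) = mu(w[k]) * (w[k] - lam^(1/L))."""
    s = sum(w ** j * lam ** ((L - 1 - j) / L) for j in range(L))
    return 1.0 - delta * w ** (L - 1) * s

def contraction_factors(lam, L, delta, num_steps):
    """Runs the scalar recursion from w[0] = 1 and records mu(w[k]) at every step."""
    w, factors = 1.0, []
    for _ in range(num_steps):
        factors.append(mu(w, delta, lam, L))
        w = w - delta * w ** (L - 1) * (w ** L - lam)
    return w, factors

lam, L = 4.0, 3
delta = 0.5 * delta_c(lam, L)
w_final, factors = contraction_factors(lam, L, delta, num_steps=300)
print(w_final, lam ** (1.0 / L), max(factors), beta(delta, lam, L))
```

Along the whole trajectory the recorded factors stay between 0 and $\beta(\delta)$, which is the condition the proof requires for convergence with a linear rate.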

Proof of Theorem 2. There exists a common invertible matrix $U$ that can diagonalize all the matrices in the system created by the gradient descent: $R = U \Lambda_R U^{-1}$, $W_i[k] = U \Lambda_{W_i}[k]\, U^{-1}$ for all $i \in [L]$ and $k \in \mathbb{N}$. Then the dynamical system turns into independent update rules for the diagonal elements of $\Lambda_R$ and $\Lambda_{W_i}$. Lemma 3 can be applied to each of the systems involving the diagonal elements. Since $\delta_c$ in Lemma 3 is monotonically decreasing in $\lambda$, the bound for the maximum eigenvalue of $R$ guarantees linear convergence.

## Appendix C Proof of Theorem 3

###### Lemma 4.

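Assume that $\lambda < 0$ and $w_i[0] = 1$ is used for all $i \in [L]$ to initialize the gradient descent algorithm to solve

$$\min_{(w_1, \dots, w_L) \in \mathbb{R}^L}\ \frac{1}{2} (w_L \cdots w_2 w_1 - \lambda)^2.$$

Then, each $w_i$ converges to 0 unless $\delta > (1 - \lambda)^{-1}$.

###### Proof.

We can write the update rule for any weight $w$ as

$$w[k+1] = w[k] \left( 1 - \delta \sigma\, w^{L-2}[k] \left( w^L[k] - \lambda \right) \right)$$

which has one equilibrium at $w = 0$ and another at $w^L = \lambda$. If $\lambda < 0$ and $\delta \le (1 - \lambda)^{-1}$, it can be shown by induction that

$$0 \le 1 - \delta \sigma\, w^{L-2}[k] \left( w^L[k] - \lambda \right) < 1$$

for all $k \in \mathbb{N}$. As a result, $w[k]$ converges to 0. ∎

Proof of Theorem 3. Similar to the proof of Theorem 2, the system created by the gradient descent can be decomposed into independent systems of the diagonal elements of the matrices $\Lambda_R$ and $\Lambda_{W_i}$, the diagonalized forms of $R$ and $W_i$. Then, Lemma 3 and Lemma 4 can be applied to the systems with positive and negative eigenvalues of $R$, respectively.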

## Appendix D Proof of Theorem 4

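To find a necessary condition for the convergence of the gradient descent algorithm to $(\hat W, \hat V)$, we analyze the local stability of that solution in the sense of Lyapunov. Since the analysis is local and the function $g$ is fixed, for each point $x_i$ we can use a matrix $G_i$ that satisfies $G_i (V x_i - b) = g(V x_i - b)$. Note that $G_i$ is a diagonal matrix and all of its diagonal elements are either 0 or 1. Then, we can write the cost function around an equilibrium as

$$\frac{1}{2} \sum_{i=1}^{N} \mathrm{trace}\left\{ \left[ W G_i (V x_i - b) - f(x_i) \right]^\top \left[ W G_i (V x_i - b) - f(x_i) \right] \right\}.$$

Denoting the error $W G_i (V x_i - b) - f(x_i)$ by $e_i$, the gradient descent gives

$$W[k+1] = W[k] - \delta \sum_{i=1}^{N} e_i[k] (V[k] x_i - b)^\top G_i^\top,$$

$$V[k+1] = V[k] - \delta \sum_{i=1}^{N} G_i^\top W[k]^\top e_i[k]\, x_i^\top.$$

Let $e$ denote the vector $(e_1^\top \ \cdots \ e_N^\top)^\top$. Then we can write the update equation of $e_j$ as

$$e_j[k+1] = e_j[k] - \delta\, W[k] G_j \sum_i G_i^\top W[k]^\top e_i[k]\, x_i^\top x_j - \delta \sum_i e_i[k] (V[k] x_i - b)^\top G_i^\top G_j (V[k] x_j - b) + o(e[k]).$$

Similar to the proof of Theorem 1, the equilibrium can be stable in the sense of Lyapunov only if the system

$$e_j[k+1] = e_j[k] - \delta \sum_i \hat W G_j G_i^\top \hat W^\top e_i[k]\, x_i^\top x_j - \delta \sum_i e_i[k] (\hat V x_i - b)^\top G_i^\top G_j (\hat V x_j - b) \tag{10}$$

does not have any eigenvalue larger than 1 in magnitude. Note that the linear system in (10) can be described by a symmetric matrix, whose largest eigenvalue in magnitude cannot be smaller than that of any of its sub-blocks on the diagonal. Stability of (10) therefore requires stability of each diagonal sub-block, in particular that of the system

$$e_j[k+1] = e_j[k] - \delta\, \hat W G_j G_j^\top \hat W^\top e_j[k]\, x_j^\top x_j - \delta\, e_j[k] (\hat V x_j - b)^\top G_j^\top G_j (\hat V x_j - b). \tag{11}$$

The eigenvalues of the system (11) are less than 1 in magnitude only if the eigenvalues of the system

$$h(u) = \hat W G_j G_j^\top \hat W^\top u\, x_j^\top x_j + u\, (\hat V x_j - b)^\top G_j^\top G_j (\hat V x_j - b)$$

are less than $2/\delta$. This requires that for all $j \in [N]$ for which $\hat f(x_j) \neq 0$,

$$\begin{aligned}
\frac{2}{\delta} &\ge \frac{\langle \hat f(x_j), h(\hat f(x_j)) \rangle}{\langle \hat f(x_j), \hat f(x_j) \rangle} \\
&= \frac{1}{\|\hat f(x_j)\|^2} \left( \|G_j^\top \hat W^\top \hat f(x_j)\|^2 \|x_j\|^2 + \|\hat f(x_j)\|^2 \|G_j (\hat V x_j - b)\|^2 \right) \\
&\ge \frac{1}{\|\hat f(x_j)\|^2}\, \frac{\|(\hat V x_j - b)^\top G_j^\top G_j^\top \hat W^\top \hat f(x_j)\|^2}{\|(\hat V x_j - b)^\top G_j^\top\|^2}\, \|x_j\|^2 + \|G_j (\hat V x_j - b)\|^2 \\
&= \frac{1}{\|G_j (\hat V x_j - b)\|^2}\, \|\hat f(x_j)\|^2 \|x_j\|^2 + \|G_j (\hat V x_j - b)\|^2 \\
&\ge 2\, \|\hat f(x_j)\|\, \|x_j\|.
\end{aligned}$$

As a result, Lyapunov stability of the solution $(\hat W, \hat V)$ requires

$$\frac{1}{\delta} \ge \max_i\ \|\hat f(x_i)\|\, \|x_i\|.$$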