# An Improved Analysis of Training Over-parameterized Deep Neural Networks

Difan Zou and Quanquan Gu

Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: knowzou@cs.ucla.edu, qgu@cs.ucla.edu

###### Abstract

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network required to ensure global convergence is very stringent: it is often a high-degree polynomial in the training sample size n (e.g., O(n^{24})). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. When specialized to two-layer (i.e., one-hidden-layer) neural networks, our result also provides a milder over-parameterization condition than the best-known result in prior work.

## 1 Introduction

A recent study (Zhang et al., 2016) has revealed that deep neural networks trained by gradient-based algorithms can fit training data with random labels and achieve zero training error. Since the loss landscape of training deep neural networks is highly nonconvex or even nonsmooth, conventional optimization theory cannot explain why gradient descent (GD) and stochastic gradient descent (SGD) can find the global minimum of the loss function (i.e., achieve zero training error). To better understand the training of neural networks, a line of research (Tian, 2017; Brutzkus and Globerson, 2017; Du et al., 2017; Li and Yuan, 2017; Zhong et al., 2017; Du and Lee, 2018; Zhang et al., 2018) studies two-layer (i.e., one-hidden-layer) neural networks, assuming that there exists a teacher network (i.e., an underlying ground-truth network) generating the output given the input, and casting neural network learning as weight matrix recovery for the teacher network. However, these studies not only make strong assumptions on the training data, but also need special initialization methods that are very different from the initialization method commonly used in practice (He et al., 2015). Li and Liang (2018) and Du et al. (2018b) advanced this line of research by proving that, under much milder assumptions on the training data, (stochastic) gradient descent attains global convergence for training over-parameterized (i.e., sufficiently wide) two-layer ReLU networks with the widely used random initialization method (He et al., 2015). More recently, Allen-Zhu et al. (2018b), Du et al. (2018a), and Zou et al. (2018) generalized the global convergence results from two-layer networks to deep neural networks.

However, there is a huge gap between theory and practice, since all these works (Li and Liang, 2018; Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Zou et al., 2018) require unrealistic over-parameterization conditions on the width of neural networks, especially for deep networks. Specifically, in order to establish the global convergence for training two-layer ReLU networks, Du et al. (2018b) requires the network width, i.e., the number of hidden nodes, to be at least \Omega(n^{6}/\lambda_{0}^{4}), where n is the training sample size and \lambda_{0} is the smallest eigenvalue of the so-called Gram matrix defined in Du et al. (2018b), which is essentially the neural tangent kernel (Jacot et al., 2018; Chizat and Bach, 2018) on the training data. Under the same assumption on the training data, Wu et al. (2019) improved the iteration complexity of GD in Du et al. (2018b) from O\big{(}n^{2}\log(1/\epsilon)/\lambda_{0}^{2}\big{)} to O\big{(}n\log(1/\epsilon)/\lambda_{0}\big{)}, and Oymak and Soltanolkotabi (2019) improved the over-parameterization condition to \Omega(n\|\mathbf{X}\|_{2}^{6}/\lambda_{0}^{4}), where \epsilon is the target error and \mathbf{X}\in\mathbb{R}^{n\times d} is the input data matrix. For deep ReLU networks, the best known result was established in Allen-Zhu et al. (2018b), which requires the network width to be at least \widetilde{\Omega}(kn^{24}L^{12}\phi^{-8}) to ensure the global convergence of GD and SGD, where L is the number of hidden layers, \phi is the minimum data separation distance, and k is the output dimension. (Here \widetilde{\Omega}(\cdot) hides constants and the logarithmic dependencies on problem-dependent parameters except \epsilon.)

This paper continues this line of research, and improves the over-parameterization condition and the global convergence rate of (stochastic) gradient descent for training deep neural networks. Specifically, under the same setting as in Allen-Zhu et al. (2018b), we prove faster global convergence rates for both GD and SGD under a significantly milder condition on the neural network width. Furthermore, when specialized to two-layer ReLU networks, our result also outperforms the best-known result proved in Oymak and Soltanolkotabi (2019). The improvement in our result is due to the following two proof techniques: (a) a tighter gradient lower bound, which leads to a faster rate of convergence for GD/SGD; and (b) a sharper characterization of the trajectory length for GD/SGD until convergence.

We highlight our main contributions as follows:

• We show that, with Gaussian random initialization (He et al., 2015) on each layer, when the number of hidden nodes per layer is \widetilde{\Omega}\big{(}kn^{8}L^{12}\phi^{-4}\big{)}, GD can achieve \epsilon training loss within \widetilde{O}\big{(}n^{2}L^{2}\log(1/\epsilon)\phi^{-1}\big{)} iterations, where L is the number of hidden layers, \phi is the minimum data separation distance, n is the number of training examples, and k is the output dimension. Compared with the state-of-the-art result (Allen-Zhu et al., 2018b), our over-parameterization condition is milder by a factor of \widetilde{\Omega}(n^{16}\phi^{-4}), and our iteration complexity is better by a factor of \widetilde{O}(n^{4}\phi^{-1}).

• We also prove a similar convergence result for SGD. We show that with Gaussian random initialization (He et al., 2015) on each layer, when the number of hidden nodes per layer is \widetilde{\Omega}\big{(}kn^{17}L^{12}B^{-4}\phi^{-8}\big{)}, SGD can achieve \epsilon expected training loss within \widetilde{O}\big{(}n^{5}\log(1/\epsilon)B^{-1}\phi^{-2}\big{)} iterations, where B is the minibatch size of SGD. Compared with the corresponding results in Allen-Zhu et al. (2018b), our results are strictly better by a factor of \widetilde{\Omega}(n^{7}B^{5}) and \widetilde{O}(n^{2}) respectively regarding over-parameterization condition and iteration complexity.

• When specialized to two-layer ReLU networks, our results for training deep ReLU networks with GD also outperform the corresponding results (Du et al., 2018b; Wu et al., 2019; Oymak and Soltanolkotabi, 2019). In addition, for training two-layer ReLU networks with SGD, we are able to show a much better result than for training deep ReLU networks with SGD.

For ease of comparison, we summarize the best-known results (Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Wu et al., 2019; Oymak and Soltanolkotabi, 2019) of training over-parameterized neural networks with GD and compare our results with them in terms of over-parameterization condition and iteration complexity in Table 1. We will show in Section 3 that, under the assumption that all training data points have unit \ell_{2} norm, which is the common assumption made in all these works (Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Wu et al., 2019; Oymak and Soltanolkotabi, 2019), \lambda_{0}>0 is equivalent to the fact that all training data are separated by some distance \phi, and we have \lambda_{0}=\Omega(n^{-2}\phi) (Oymak and Soltanolkotabi, 2019). Substituting \lambda_{0}=\Omega(n^{-2}\phi) into Table 1, it is evident that our result outperforms all the other results under the same assumptions.

Notation. For scalars, vectors and matrices, we use lower case, lower case bold face, and upper case bold face letters, respectively. For a positive integer k, we denote by [k] the set \{1,\dots,k\}. For a vector \mathbf{x}=(x_{1},\dots,x_{d})^{\top} and a positive integer p, we denote by \|\mathbf{x}\|_{p}=\big{(}\sum_{i=1}^{d}|x_{i}|^{p}\big{)}^{1/p} the \ell_{p} norm of \mathbf{x}. In addition, we denote by \|\mathbf{x}\|_{\infty}=\max_{i=1,\dots,d}|x_{i}| the \ell_{\infty} norm of \mathbf{x}, and by \|\mathbf{x}\|_{0}=|\{x_{i}:x_{i}\neq 0,i=1,\dots,d\}| the \ell_{0} norm of \mathbf{x}. For a matrix \mathbf{A}\in\mathbb{R}^{m\times n}, we denote by \|\mathbf{A}\|_{F} the Frobenius norm of \mathbf{A}, \|\mathbf{A}\|_{2} the spectral norm (maximum singular value), \lambda_{\min}(\mathbf{A}) the smallest singular value, \|\mathbf{A}\|_{0} the number of nonzero entries, and \|\mathbf{A}\|_{2,\infty} the maximum \ell_{2} norm over all row vectors, i.e., \|\mathbf{A}\|_{2,\infty}=\max_{i=1,\dots,m}\|\mathbf{A}_{i*}\|_{2}. For a collection of matrices \mathbf{W}=\{\mathbf{W}_{1},\dots,\mathbf{W}_{L}\}, we denote \|\mathbf{W}\|_{F}=\sqrt{\sum_{l=1}^{L}\|\mathbf{W}_{l}\|_{F}^{2}}, \|\mathbf{W}\|_{2}=\max_{l\in[L]}\|\mathbf{W}_{l}\|_{2} and \|\mathbf{W}\|_{2,\infty}=\max_{l\in[L]}\|\mathbf{W}_{l}\|_{2,\infty}. Given two collections of matrices \widetilde{\mathbf{W}}=\{\widetilde{\mathbf{W}}_{1},\dots,\widetilde{\mathbf{W}}_{L}\} and \widehat{\mathbf{W}}=\{\widehat{\mathbf{W}}_{1},\dots,\widehat{\mathbf{W}}_{L}\}, we define their inner product as \langle\widetilde{\mathbf{W}},\widehat{\mathbf{W}}\rangle=\sum_{l=1}^{L}\langle\widetilde{\mathbf{W}}_{l},\widehat{\mathbf{W}}_{l}\rangle. For two sequences \{a_{n}\} and \{b_{n}\}, we use a_{n}=O(b_{n}) to denote that a_{n}\leq C_{1}b_{n} for some absolute constant C_{1}>0, and use a_{n}=\Omega(b_{n}) to denote that a_{n}\geq C_{2}b_{n} for some absolute constant C_{2}>0. In addition, we use \widetilde{O}(\cdot) and \widetilde{\Omega}(\cdot) to hide logarithmic factors.

## 2 Problem setup and algorithms

In this section, we introduce the problem setup and the training algorithms.

Following Allen-Zhu et al. (2018b), we consider the training of an L-hidden-layer fully connected neural network, which takes \mathbf{x}\in\mathbb{R}^{d} as input, and outputs \mathbf{y}\in\mathbb{R}^{k}. Specifically, the neural network is a vector-valued function \mathbf{f}_{\mathbf{W}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{k}, which is defined as

 \displaystyle\mathbf{f}_{\mathbf{W}}(\mathbf{x})=\mathbf{V}\sigma(\mathbf{W}_{L}\sigma(\mathbf{W}_{L-1}\cdots\sigma(\mathbf{W}_{1}\mathbf{x})\cdots)),

where \mathbf{W}_{1}\in\mathbb{R}^{m\times d} and \mathbf{W}_{2},\dots,\mathbf{W}_{L}\in\mathbb{R}^{m\times m} denote the weight matrices for the hidden layers, \mathbf{V}\in\mathbb{R}^{k\times m} denotes the weight matrix in the output layer, and \sigma(x)=\max\{0,x\} is the entry-wise ReLU activation function. In addition, we denote by \sigma^{\prime}(x)=\mathds{1}(x\geq 0) the derivative of the ReLU activation function and by \mathbf{w}_{l,j} the weight vector of the j-th node in the l-th layer.
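To make the definition of \mathbf{f}_{\mathbf{W}} concrete, the following NumPy sketch evaluates the network on a single input. The helper names `init_network` and `forward` are hypothetical, and the initialization variances (2/m for the hidden layers, 1/k for the output layer) are an assumption in the spirit of He et al. (2015), not a statement of the exact scheme analyzed in this paper.

```python
import numpy as np

def init_network(d, m, k, L, rng):
    """Gaussian initialization of an L-hidden-layer network of width m.

    Assumption for illustration: hidden weights ~ N(0, 2/m), output weights ~ N(0, 1/k).
    """
    W = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    W += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 1)]
    V = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, m))
    return W, V

def forward(W, V, x):
    """Compute f_W(x) = V sigma(W_L sigma(... sigma(W_1 x) ...))."""
    h = x
    for W_l in W:
        h = np.maximum(0.0, W_l @ h)  # entry-wise ReLU
    return V @ h

rng = np.random.default_rng(0)
d, m, k, L = 5, 64, 3, 4
W, V = init_network(d, m, k, L, rng)
x = rng.normal(size=d)
x /= np.linalg.norm(x)          # unit-norm input
y_hat = forward(W, V, x)        # network output in R^k
```

The output dimension is determined by \mathbf{V} alone, which is why the network maps \mathbb{R}^{d} to \mathbb{R}^{k} regardless of the width m.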

Given a training set \{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1,\dots,n} where \mathbf{x}_{i}\in\mathbb{R}^{d} and \mathbf{y}_{i}\in\mathbb{R}^{k}, the empirical loss function for training the neural network is defined as

 \displaystyle L(\mathbf{W}):=\frac{1}{n}\sum_{i=1}^{n}\ell(\widehat{\mathbf{y}}_{i},\mathbf{y}_{i}), (2.1)

where \ell(\cdot,\cdot) is the loss function, and \widehat{\mathbf{y}}_{i}=\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i}). In this paper, for the ease of exposition, we follow Allen-Zhu et al. (2018b); Du et al. (2018b, a); Oymak and Soltanolkotabi (2019) and consider square loss as follows

 \displaystyle\ell(\widehat{\mathbf{y}}_{i},\mathbf{y}_{i})=\frac{1}{2}\|\mathbf{y}_{i}-\widehat{\mathbf{y}}_{i}\|_{2}^{2},

where \widehat{\mathbf{y}}_{i}=\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i})\in\mathbb{R}^{k} denotes the output of the neural network given input \mathbf{x}_{i}. It is worth noting that our result can be easily extended to other loss functions such as cross entropy loss (Zou et al., 2018) as well.

We study both gradient descent and stochastic gradient descent as training algorithms, which are displayed in Algorithm 1. For gradient descent, we update the weight matrix \mathbf{W}_{l}^{(t)} using the full partial gradient \nabla_{\mathbf{W}_{l}}L(\mathbf{W}^{(t)}). For stochastic gradient descent, we update the weight matrix \mathbf{W}_{l}^{(t)} using the stochastic partial gradient (1/B)\sum_{s\in\mathcal{B}^{(t)}}\nabla_{\mathbf{W}_{l}}\ell\big{(}\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{s}),\mathbf{y}_{s}\big{)}, where \mathcal{B}^{(t)} with |\mathcal{B}^{(t)}|=B denotes the minibatch of training examples at the t-th iteration. Both algorithms are initialized in the same way as in Allen-Zhu et al. (2018b), which is essentially the initialization method (He et al., 2015) widely used in practice. In the remainder of this paper, we denote by

 \displaystyle\nabla L(\mathbf{W}^{(t)})=\{\nabla_{\mathbf{W}_{l}}L(\mathbf{W}^{(t)})\}_{l\in[L]}\quad\mbox{and}\quad\nabla\ell\big{(}\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{i}),\mathbf{y}_{i}\big{)}=\{\nabla_{\mathbf{W}_{l}}\ell\big{(}\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{i}),\mathbf{y}_{i}\big{)}\}_{l\in[L]}

the collections of all partial gradients of L(\mathbf{W}^{(t)}) and \ell\big{(}\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{i}),\mathbf{y}_{i}\big{)}.
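The two update rules described above can be sketched in NumPy as follows. This is a minimal illustration rather than a reproduction of Algorithm 1: the function names are hypothetical, the output matrix \mathbf{V} is held fixed (an assumption, consistent with only \mathbf{W}_{l} being updated in the text), the initialization variances are assumed, and the ReLU derivative is evaluated as an indicator of strictly positive pre-activation, which differs from any other convention only at exactly zero.

```python
import numpy as np

def forward_with_cache(W, V, x):
    """Return f_W(x) and the list of layer inputs hs (hs[l] feeds layer l+1)."""
    hs = [x]
    for W_l in W:
        hs.append(np.maximum(0.0, W_l @ hs[-1]))
    return V @ hs[-1], hs

def grad_single(W, V, x, y):
    """Partial gradients of 0.5*||y - f_W(x)||^2 w.r.t. every W_l (V held fixed)."""
    y_hat, hs = forward_with_cache(W, V, x)
    delta = V.T @ (y_hat - y)              # gradient w.r.t. last hidden output
    grads = [None] * len(W)
    for l in reversed(range(len(W))):
        delta = delta * (hs[l + 1] > 0)    # multiply by the ReLU derivative
        grads[l] = np.outer(delta, hs[l])  # gradient w.r.t. W_{l+1}
        delta = W[l].T @ delta             # back-propagate to the previous layer
    return grads

def loss(W, V, X, Y):
    """Empirical square loss L(W) = (1/n) sum_i 0.5*||y_i - f_W(x_i)||^2."""
    return float(np.mean([0.5 * np.sum((y - forward_with_cache(W, V, x)[0]) ** 2)
                          for x, y in zip(X, Y)]))

def step(W, V, X, Y, eta, batch=None):
    """One GD step (batch=None) or one minibatch SGD step (batch = index array)."""
    idx = range(len(X)) if batch is None else batch
    avg = [np.zeros_like(W_l) for W_l in W]
    for i in idx:
        for a, g in zip(avg, grad_single(W, V, X[i], Y[i])):
            a += g / len(idx)              # in-place average over the (mini)batch
    return [W_l - eta * a for W_l, a in zip(W, avg)]

rng = np.random.default_rng(0)
n, d, m, k, L = 4, 3, 32, 1, 2
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(n, k))
W = [rng.normal(0, np.sqrt(2 / m), size=(m, d)),
     rng.normal(0, np.sqrt(2 / m), size=(m, m))]
V = rng.normal(0, 1.0, size=(k, m))
eta = 0.1 * k / (L ** 2 * m)               # step size of order k/(L^2 m)

loss_init = loss(W, V, X, Y)
W_gd = W
for _ in range(50):
    W_gd = step(W_gd, V, X, Y, eta)        # full-gradient update
loss_gd = loss(W_gd, V, X, Y)

B = 2
W_sgd = step(W, V, X, Y, eta, batch=rng.choice(n, size=B, replace=False))
```

Passing `batch=None` recovers gradient descent, while passing B sampled indices gives the minibatch estimate (1/B)\sum_{s\in\mathcal{B}^{(t)}}\nabla_{\mathbf{W}_{l}}\ell.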

## 3 Main theory

In this section, we present our main theoretical results. We make the following assumptions on the training data.

###### Assumption 3.1.

For any \mathbf{x}_{i}, it holds that \|\mathbf{x}_{i}\|_{2}=1 and (\mathbf{x}_{i})_{d}=\mu, where \mu is a positive constant.

The same assumption has been made in all previous work along this line (Du et al., 2018a; Allen-Zhu et al., 2018b; Zou et al., 2018; Oymak and Soltanolkotabi, 2019). Note that requiring the norm of all training examples to be 1 is not essential, and this assumption can be relaxed to requiring that \|\mathbf{x}_{i}\|_{2} be lower and upper bounded by some constants.

###### Assumption 3.2.

For any two different training data points \mathbf{x}_{i} and \mathbf{x}_{j}, there exists a positive constant \phi>0 such that \|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}\geq\phi.

This assumption has also been made in Allen-Zhu et al. (2018c, b), and it is essential to guarantee zero training error for deep neural networks. It is quite a mild assumption for the regression problem studied in this paper. Note that Du et al. (2018a) made a different assumption on the training data, which requires the Gram matrix \mathbf{K}^{(L)} (see their paper for details) defined on the L-hidden-layer networks to be positive definite. However, their assumption is not easy to verify for neural networks with more than two layers.
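Both assumptions can be checked directly on a data set. The sketch below is illustrative only: `check_assumptions` and `preprocess` are hypothetical helper names, and the particular preprocessing (normalize, rescale, append a constant coordinate \mu=1/\sqrt{2}) is one assumed way to satisfy Assumption 3.1, not a procedure prescribed in this paper.

```python
import numpy as np

def check_assumptions(X, tol=1e-8):
    """Return (unit_norm_ok, last_coord_ok, phi) for data rows X[i] = x_i."""
    norms = np.linalg.norm(X, axis=1)
    unit_norm_ok = bool(np.allclose(norms, 1.0, atol=tol))
    # Assumption 3.1: the last coordinate equals the same positive constant mu.
    last_coord_ok = bool(np.allclose(X[:, -1], X[0, -1], atol=tol) and X[0, -1] > 0)
    # Assumption 3.2: phi is the minimum pairwise Euclidean distance.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    phi = float(dist.min())
    return unit_norm_ok, last_coord_ok, phi

def preprocess(X_raw, mu=1.0 / np.sqrt(2.0)):
    """Assumed preprocessing: scale rows to norm sqrt(1 - mu^2), then append mu."""
    Z = X_raw / np.linalg.norm(X_raw, axis=1, keepdims=True)
    Z = np.sqrt(1.0 - mu ** 2) * Z
    return np.hstack([Z, np.full((len(Z), 1), mu)])

rng = np.random.default_rng(0)
X = preprocess(rng.normal(size=(6, 4)))
unit_norm_ok, last_coord_ok, phi = check_assumptions(X)
```

A strictly positive `phi` means that no two training inputs coincide, which is what Assumption 3.2 requires.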

Based on Assumptions 3.1 and 3.2, we are able to establish the global convergence rates of GD and SGD for training deep ReLU networks. We start with the result of GD for L-hidden-layer networks.

### 3.1 Training L-hidden-layer ReLU networks with GD

The global convergence of GD for training deep neural networks is stated in the following theorem.

###### Theorem 3.3.

Suppose that Assumptions 3.1 and 3.2 hold and that the number of hidden nodes per layer satisfies

 \displaystyle m=\Omega\big{(}kn^{8}L^{12}\log^{3}(m)/\phi^{4}\big{)}. (3.1)

Then, if we set the step size \eta=O\big{(}k/(L^{2}m)\big{)}, with probability at least 1-O(n^{-1}), gradient descent is able to find a point that achieves \epsilon training loss within

 \displaystyle T=O\big{(}n^{2}L^{2}\log(1/\epsilon)/\phi\big{)}

iterations.

###### Remark 3.4.

The state-of-the-art results for training deep ReLU networks are provided by Allen-Zhu et al. (2018b), where the authors showed that GD can achieve \epsilon-training loss within O\big{(}n^{6}L^{2}\log(1/\epsilon)/\phi^{2}\big{)} iterations if the neural network width satisfies m=\widetilde{\Omega}\big{(}kn^{24}L^{12}/\phi^{8}\big{)}. In comparison, our result on the iteration complexity is better than theirs by a factor of O(n^{4}/\phi), and our over-parameterization condition is milder than theirs by a factor of \widetilde{\Omega}(n^{16}/\phi^{4}). Du et al. (2018a) also proved the global convergence of GD for training deep neural networks with smooth activation functions. As shown in Table 1, the over-parameterization condition and iteration complexity in Du et al. (2018a) have an exponential dependency on L, which is much worse than the polynomial dependency on L in Allen-Zhu et al. (2018b) and in our result.
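The improvement factors quoted in this remark follow from dividing the two sets of bounds. The short Python sketch below does this arithmetic with all constants and logarithmic factors in m dropped, which is an assumption of the illustration; the function names are hypothetical, and the returned numbers are scalings rather than usable widths or iteration counts.

```python
import math

def width_ours(n, L, phi, k=1):
    """Scaling of the width condition (3.1), constants and log factors dropped."""
    return k * n**8 * L**12 / phi**4

def width_prior(n, L, phi, k=1):
    """Scaling of the width condition quoted from Allen-Zhu et al. (2018b)."""
    return k * n**24 * L**12 / phi**8

def iters_ours(n, L, phi, eps):
    """Scaling of the GD iteration complexity in Theorem 3.3."""
    return n**2 * L**2 * math.log(1 / eps) / phi

def iters_prior(n, L, phi, eps):
    """Scaling of the GD iteration complexity quoted from Allen-Zhu et al. (2018b)."""
    return n**6 * L**2 * math.log(1 / eps) / phi**2

n, L, phi, eps = 1000, 3, 0.1, 1e-3
width_ratio = width_prior(n, L, phi) / width_ours(n, L, phi)          # n^16 / phi^4
iter_ratio = iters_prior(n, L, phi, eps) / iters_ours(n, L, phi, eps)  # n^4 / phi
```

The dependence on L, k and \log(1/\epsilon) cancels in both ratios, so the gain is entirely in the powers of n and \phi.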

We now specialize our results in Theorem 3.3 to two-layer networks by removing the dependency on the number of hidden layers, i.e., L. We state this result in the following corollary.

###### Corollary 3.5.

Under the same assumptions made in Theorem 3.3, for training two-layer ReLU networks, if we set the number of hidden nodes m=\Omega\big{(}kn^{8}\log^{3}(m)/\phi^{4}\big{)} and the step size \eta=O(k/m), then with probability at least 1-O(n^{-1}), GD is able to find a point that achieves \epsilon-training loss within T=O\big{(}n^{2}\log(1/\epsilon)/\phi\big{)} iterations.

For training two-layer ReLU networks, Du et al. (2018b) made a different assumption on the training data to establish the global convergence of GD. Specifically, Du et al. (2018b) defined a Gram matrix, which is also known as neural tangent kernel (Jacot et al., 2018), based on the training data \{\mathbf{x}_{i}\}_{i=1,\dots,n} and assumed that the smallest eigenvalue of such Gram matrix is strictly positive. In fact, for two-layer neural networks, their assumption is equivalent to Assumption 3.2, as shown in the following proposition.

###### Proposition 3.6.

Under Assumption 3.1, define the Gram matrix \mathbf{H}\in\mathbb{R}^{n\times n} as follows

 \displaystyle\mathbf{H}_{ij}=\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(0,\mathbf{I})}[\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\sigma^{\prime}(\mathbf{w}^{\top}\mathbf{x}_{i})\sigma^{\prime}(\mathbf{w}^{\top}\mathbf{x}_{j})],

then the assumption \lambda_{0}=\lambda_{\min}(\mathbf{H})>0 is equivalent to Assumption 3.2. In addition, there exists a sufficiently small constant C such that \lambda_{0}\geq C\phi n^{-2}.
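The Gram matrix in Proposition 3.6 can be evaluated numerically. The sketch below uses the standard closed form for unit-norm inputs, \mathbf{H}_{ij}=\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\,(\pi-\arccos(\mathbf{x}_{i}^{\top}\mathbf{x}_{j}))/(2\pi), which follows from the fact that a Gaussian vector has nonnegative inner product with two unit vectors at angle \theta with probability (\pi-\theta)/(2\pi). The function names and the Monte Carlo check are illustrative additions, not part of the original analysis.

```python
import numpy as np

def gram_matrix(X):
    """Closed-form H for unit-norm rows of X (n x d)."""
    G = np.clip(X @ X.T, -1.0, 1.0)                  # pairwise inner products
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def gram_matrix_mc(X, num_samples, rng):
    """Monte Carlo estimate of H, sampling w ~ N(0, I)."""
    w = rng.normal(size=(num_samples, X.shape[1]))
    act = (w @ X.T >= 0).astype(float)               # sigma'(w^T x_i) per sample
    return (X @ X.T) * (act.T @ act) / num_samples

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

H = gram_matrix(X)
H_mc = gram_matrix_mc(X, 200_000, rng)
lambda_0 = float(np.linalg.eigvalsh(H).min())        # smallest eigenvalue of H
```

For distinct inputs `lambda_0` is strictly positive, in line with the equivalence stated in the proposition; the constant C in the lower bound \lambda_{0}\geq C\phi n^{-2} is not specified here and is therefore not checked.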

###### Remark 3.7.

According to Proposition 3.6, we can make a direct comparison between our convergence results for two-layer ReLU networks in Corollary 3.5 and those in Du et al. (2018b) and Oymak and Soltanolkotabi (2019). Specifically, as shown in Table 1, the iteration complexity and over-parameterization condition proved in Du et al. (2018b) can be translated to O(n^{6}\log(1/\epsilon)/\phi^{2}) and \Omega(n^{14}/\phi^{4}) respectively under Assumption 3.2. Oymak and Soltanolkotabi (2019) improved the result in Du et al. (2018b), and the improved iteration complexity and over-parameterization condition can be translated to O\big{(}n^{2}\|\mathbf{X}\|_{2}^{2}\log(1/\epsilon)/\phi\big{)} and \Omega\big{(}n^{9}\|\mathbf{X}\|_{2}^{6}/\phi^{4}\big{)} respectively, where \mathbf{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{n}]^{\top}\in\mathbb{R}^{n\times d} is the input data matrix. (It is worth noting that \|\mathbf{X}\|_{2}^{2}=O(1) if d\lesssim n, \|\mathbf{X}\|_{2}^{2}=O(n/d) if \mathbf{X} is randomly generated, and \|\mathbf{X}\|_{2}^{2}=O(n) in the worst case.) Our iteration complexity for two-layer ReLU networks is better than that in Oymak and Soltanolkotabi (2019) by a factor of O(\|\mathbf{X}\|_{2}^{2}), and the over-parameterization condition is also strictly milder than that in Oymak and Soltanolkotabi (2019) by a factor of O(n\|\mathbf{X}\|_{2}^{6}). (Here we set k=1 in order to match the problem setting in Du et al. (2018b); Oymak and Soltanolkotabi (2019).)
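The translated bounds in this remark can be obtained by substituting the lower bound \lambda_{0}\geq C\phi n^{-2} from Proposition 3.6 into the \lambda_{0}-dependent bounds quoted in the introduction. The following worked substitution is a sketch in which C is the unspecified constant of Proposition 3.6 and is absorbed into the O(\cdot) and \Omega(\cdot) notation.

```latex
% Uses 1/\lambda_0 \le n^2/(C\phi); C is absorbed into the asymptotic notation.
\begin{align*}
\text{Du et al. (2018b), iterations:}\quad
  \frac{n^{2}\log(1/\epsilon)}{\lambda_{0}^{2}}
  &\le \frac{n^{2}\log(1/\epsilon)}{(C\phi n^{-2})^{2}}
   = O\big(n^{6}\log(1/\epsilon)/\phi^{2}\big),\\
\text{Du et al. (2018b), width:}\quad
  \frac{n^{6}}{\lambda_{0}^{4}}
  &\le \frac{n^{6}}{(C\phi n^{-2})^{4}}
   = O\big(n^{14}/\phi^{4}\big),\\
\text{Oymak and Soltanolkotabi (2019), width:}\quad
  \frac{n\|\mathbf{X}\|_{2}^{6}}{\lambda_{0}^{4}}
  &\le \frac{n\|\mathbf{X}\|_{2}^{6}}{(C\phi n^{-2})^{4}}
   = O\big(n^{9}\|\mathbf{X}\|_{2}^{6}/\phi^{4}\big).
\end{align*}
```

Dividing the last bound by the width n^{8}/\phi^{4} of Corollary 3.5 (with k=1 and logarithmic factors dropped) gives the factor n\|\mathbf{X}\|_{2}^{6} stated above.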

### 3.2 Extension to training L-hidden-layer ReLU networks with SGD

We next extend the convergence results of GD to SGD in the following theorem.

###### Theorem 3.8.

Suppose that Assumptions 3.1 and 3.2 hold and that the number of hidden nodes per layer satisfies

 \displaystyle m=\Omega\big{(}kn^{17}L^{12}\log^{3}(m)/(B^{4}\phi^{8})\big{)}. (3.2)

Then, if we set the step size as \eta=O\big{(}kB\phi/(n^{3}m\log(m))\big{)}, with probability at least 1-O(n^{-1}), SGD is able to achieve \epsilon expected training loss within

 \displaystyle T=O\big{(}n^{5}\log(m)\log^{2}{(1/\epsilon)}/(B\phi^{2})\big{)}

iterations.

###### Remark 3.9.

We first compare our result with the state-of-the-art proved in Allen-Zhu et al. (2018b), where they showed that SGD can converge to a point with \epsilon-training loss within \widetilde{O}\big{(}n^{7}\log(1/\epsilon)/(B\phi^{2})\big{)} iterations if m=\widetilde{\Omega}\big{(}n^{24}L^{12}Bk/\phi^{8}\big{)}. In stark contrast, our result on the over-parameterization condition is strictly better than it by a factor of \widetilde{\Omega}(n^{7}B^{5}), and our result on the iteration complexity is also faster by a factor of O(n^{2}).

Moreover, we also characterize the convergence rate and over-parameterization condition of SGD for training two-layer networks. Unlike gradient descent, which has the same convergence rate and over-parameterization condition for training both deep and two-layer networks in terms of the training data size n, we find that the over-parameterization condition of SGD can be further improved for training two-layer neural networks. We state this improved result in the following theorem.

###### Theorem 3.10.

Under the same assumptions made in Theorem 3.8, for two-layer ReLU networks, if we set the number of hidden nodes and the step size as

 \displaystyle m=\Omega\big{(}k^{5/2}n^{11}\log^{3}(m)/(\phi^{5}B)\big{)},\quad\eta=O\big{(}kB\phi/(n^{3}m\log(m))\big{)},

then with probability at least 1-O(n^{-1}), stochastic gradient descent is able to achieve \epsilon training loss within T=O\big{(}n^{5}\log(m)\log(1/\epsilon)/(B\phi^{2})\big{)} iterations.

###### Remark 3.11.

From Theorem 3.8, we can also obtain the convergence results of SGD for two-layer ReLU networks by choosing L=1. However, the resulting over-parameterization condition is m=\Omega\big{(}kn^{17}\log^{3}(m)B^{-4}\phi^{-8}\big{)}, which is much worse than that in Theorem 3.10. This is because for two-layer networks, the training loss enjoys nicer local properties around the initialization, which can be leveraged to improve the convergence of SGD. Due to the space limit, we defer more details to Appendix A.
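To quantify the gap described in this remark, one can divide the width condition of Theorem 3.8 at L=1 by that of Theorem 3.10. The calculation below is a sketch that ignores absolute constants and treats the common \log^{3}(m) factor as cancelling.

```latex
\begin{align*}
\frac{kn^{17}\log^{3}(m)\,B^{-4}\phi^{-8}}{k^{5/2}n^{11}\log^{3}(m)\,\phi^{-5}B^{-1}}
  = k^{1-5/2}\,n^{17-11}\,B^{-4+1}\,\phi^{-8+5}
  = \frac{n^{6}}{k^{3/2}B^{3}\phi^{3}}.
\end{align*}
```

Under this reading, the dedicated two-layer analysis is milder whenever n^{6}\geq k^{3/2}B^{3}\phi^{3}; for instance, with k=1 and B\leq n the ratio is at least n^{3}/\phi^{3}.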

## 4 Proof sketch of the main theory

In this section, we provide the proof sketch for Theorem 3.3, and highlight our technical contributions and proof techniques.

### 4.1 Overview of the technical contributions

The improvements in our result are mainly attributed to the following two aspects: (1) a tighter gradient lower bound leading to faster convergence; and (2) a sharper characterization of the trajectory length of the algorithm.

We first define the following perturbation region based on the initialization,

 \displaystyle\mathcal{B}(\mathbf{W}^{(0)},\tau)=\{\mathbf{W}:\|\mathbf{W}_{l}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\tau\mbox{ for all }l\in[L]\},

where \tau>0 is the preset perturbation radius for each weight matrix \mathbf{W}_{l}.

Tighter gradient lower bound. By the definition of \nabla L(\mathbf{W}), we have \|\nabla L(\mathbf{W})\|_{F}^{2}=\sum_{l=1}^{L}\|\nabla_{\mathbf{W}_{l}}L(\mathbf{W})\|_{F}^{2}\geq\|\nabla_{\mathbf{W}_{L}}L(\mathbf{W})\|_{F}^{2}. Therefore, we can focus on the partial gradient of L(\mathbf{W}) with respect to the weight matrix at the last hidden layer. Note that we further have \|\nabla_{\mathbf{W}_{L}}L(\mathbf{W})\|_{F}^{2}=\sum_{j=1}^{m}\|\nabla_{\mathbf{w}_{L,j}}L(\mathbf{W})\|_{2}^{2}, where

 \displaystyle\nabla_{\mathbf{w}_{L,j}}L(\mathbf{W})=\frac{1}{n}\sum_{i=1}^{n}\langle\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i})-\mathbf{y}_{i},\mathbf{v}_{j}\rangle\sigma^{\prime}\big{(}\langle\mathbf{w}_{L,j},\mathbf{x}_{L-1,i}\rangle\big{)}\mathbf{x}_{L-1,i},

and \mathbf{x}_{L-1,i} denotes the output of the (L-1)-th hidden layer with input \mathbf{x}_{i}. In order to prove the gradient lower bound, for each \mathbf{x}_{L-1,i}, we introduce a region called the “gradient region”, denoted by \mathcal{W}_{i}, which is almost orthogonal to \mathbf{x}_{L-1,i}. Then we prove two major properties of these n regions \{\mathcal{W}_{1},\dots,\mathcal{W}_{n}\}: (1) \mathcal{W}_{i}\cap\mathcal{W}_{j}=\emptyset if i\neq j, and (2) if \mathbf{w}_{L,j}\in\mathcal{W}_{i} for any i, then with probability at least 1/2, \|\nabla_{\mathbf{w}_{L,j}}L(\mathbf{W})\|_{2} is sufficiently large. We visualize these “gradient regions” in Figure 1(a). Since \{\mathbf{w}_{L,j}\}_{j\in[m]} are randomly generated at the initialization, in order to get a larger lower bound on \|\nabla_{\mathbf{W}_{L}}L(\mathbf{W})\|_{F}^{2}, we want the size of these “gradient regions” to be as large as possible. We therefore take the union of the “gradient regions” for all training data, i.e., \cup_{i=1}^{n}\mathcal{W}_{i}, which is shown in Figure 1(a). As a comparison, Allen-Zhu et al. (2018b); Zou et al. (2018) only leveraged the “gradient region” for one training data point to establish the gradient lower bound, which is shown in Figure 1(b). Roughly speaking, the size of the “gradient regions” utilized in our proof is n times larger than that used in Allen-Zhu et al. (2018b); Zou et al. (2018), which consequently leads to an O(n) improvement on the gradient lower bound. The improved gradient lower bound is formally stated in the following lemma.

###### Lemma 4.1 (Gradient lower bound).

Let \tau=O\big{(}\phi^{3/2}n^{-3}L^{-6}\log^{-3/2}(m)\big{)}, then for all \mathbf{W}\in\mathcal{B}(\mathbf{W}^{(0)},\tau), with probability at least 1-\exp\big{(}-O(m\phi/(dn))\big{)}, it holds that

 \displaystyle\|\nabla L(\mathbf{W})\|_{F}^{2}\geq\Omega\big{(}m\phi L(\mathbf{W})/(kn^{2})\big{)}.
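The per-neuron gradient formula for the last hidden layer used in this argument can be checked numerically. In the NumPy sketch below, the outputs \mathbf{x}_{L-1,i} of the previous layer are treated as given features (the rows of `H_prev`), \mathbf{v}_{j} is taken to be the j-th column of \mathbf{V}, and the function names are hypothetical; the analytic gradient is compared against a central finite difference of the loss.

```python
import numpy as np

def last_layer_neuron_grads(W_L, V, H_prev, Y):
    """Row j is grad_{w_{L,j}} L(W), computed from the per-neuron formula.

    H_prev: (n, p) outputs of layer L-1, W_L: (m, p), V: (k, m), Y: (n, k).
    """
    n = H_prev.shape[0]
    pre = H_prev @ W_L.T                    # (n, m) pre-activations <w_{L,j}, x_{L-1,i}>
    F = np.maximum(0.0, pre) @ V.T          # (n, k) network outputs f_W(x_i)
    A = (F - Y) @ V                         # (n, m) entries <f_W(x_i) - y_i, v_j>
    mask = (pre > 0).astype(float)          # ReLU derivative
    return (A * mask).T @ H_prev / n        # (m, p)

def loss(W_L, V, H_prev, Y):
    """L(W) = (1/n) sum_i 0.5 * ||y_i - f_W(x_i)||^2 as a function of W_L only."""
    F = np.maximum(0.0, H_prev @ W_L.T) @ V.T
    return 0.5 * float(np.mean(np.sum((F - Y) ** 2, axis=1)))

rng = np.random.default_rng(0)
n, p, m, k = 6, 5, 8, 2
H_prev = rng.normal(size=(n, p))
W_L = rng.normal(size=(m, p))
V = rng.normal(size=(k, m))
Y = rng.normal(size=(n, k))

G = last_layer_neuron_grads(W_L, V, H_prev, Y)
grad_sq_norm = float(np.sum(np.linalg.norm(G, axis=1) ** 2))  # sum_j ||grad_{w_{L,j}}||^2

# Central finite difference for one coordinate of W_L.
j, c, h = 2, 1, 1e-6
W_plus, W_minus = W_L.copy(), W_L.copy()
W_plus[j, c] += h
W_minus[j, c] -= h
fd = (loss(W_plus, V, H_prev, Y) - loss(W_minus, V, H_prev, Y)) / (2 * h)
```

The quantity `grad_sq_norm` is the squared Frobenius norm \|\nabla_{\mathbf{W}_{L}}L(\mathbf{W})\|_{F}^{2} that the lemma lower-bounds.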

Sharper characterization of the trajectory length. The improved analysis of the trajectory length is motivated by the following observation: at the t-th iteration, the decrease of the training loss after one step of gradient descent is proportional to the squared gradient norm, i.e., L(\mathbf{W}^{(t)})-L(\mathbf{W}^{(t+1)})\propto\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}. In addition, the gradient norm \|\nabla L(\mathbf{W}^{(t)})\|_{F} determines the trajectory length in the t-th iteration. Putting them together, we can obtain

 \displaystyle\|\mathbf{W}_{l}^{(t+1)}-\mathbf{W}_{l}^{(t)}\|_{2}=\eta\|\nabla_{\mathbf{W}_{l}}L(\mathbf{W}^{(t)})\|_{2}\leq\sqrt{Ckn^{2}/(m\phi)}\cdot\Big{(}\sqrt{L(\mathbf{W}^{(t)})}-\sqrt{L(\mathbf{W}^{(t+1)})}\Big{)}, (4.1)

where C is an absolute constant. Inequality (4.1) enables the use of a telescoping sum, which yields \|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\sqrt{Ckn^{2}L(\mathbf{W}^{(0)})/(m\phi)}. In stark contrast, Allen-Zhu et al. (2018b) bounds the trajectory length as

 \displaystyle\|\mathbf{W}_{l}^{(t+1)}-\mathbf{W}_{l}^{(t)}\|_{2}=\eta\|\nabla_{\mathbf{W}_{l}}L(\mathbf{W}^{(t)})\|_{2}\leq\eta\sqrt{C^{\prime}mL(\mathbf{W}^{(t)})/k},

and further proves that \|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\sqrt{C^{\prime}kn^{6}L^{2}(\mathbf{W}^{(0)})/(m\phi^{2})} by taking the summation over t, where C^{\prime} is an absolute constant. Our sharper characterization of the trajectory length is formally summarized in the following lemma.

###### Lemma 4.2.

Assume that all iterates stay inside the region \mathcal{B}(\mathbf{W}^{(0)},\tau) with \tau=O\big{(}\phi^{3/2}n^{-3}L^{-6}\log^{-3/2}(m)\big{)}. If we set the step size \eta=O\big{(}k/(L^{2}m)\big{)}, then with probability at least 1-O(n^{-1}), the following holds for all t\geq 0 and l\in[L],

 \displaystyle\|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq O\big{(}\sqrt{kn^{2}\log(n)/(m\phi)}\big{)}.
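The telescoping step behind this bound can be written out explicitly. The derivation below is a sketch that takes the per-step inequality (4.1) as given and uses only the triangle inequality and the nonnegativity of the loss; C is the absolute constant appearing in (4.1).

```latex
\begin{align*}
\|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}
  &\le \sum_{s=0}^{t-1}\|\mathbf{W}_{l}^{(s+1)}-\mathbf{W}_{l}^{(s)}\|_{2}
     && \text{(triangle inequality)}\\
  &\le \sqrt{\frac{Ckn^{2}}{m\phi}}\sum_{s=0}^{t-1}
       \Big(\sqrt{L(\mathbf{W}^{(s)})}-\sqrt{L(\mathbf{W}^{(s+1)})}\Big)
     && \text{(by (4.1))}\\
  &= \sqrt{\frac{Ckn^{2}}{m\phi}}
       \Big(\sqrt{L(\mathbf{W}^{(0)})}-\sqrt{L(\mathbf{W}^{(t)})}\Big)
     && \text{(telescoping)}\\
  &\le \sqrt{\frac{Ckn^{2}L(\mathbf{W}^{(0)})}{m\phi}}
     && \text{(since } L(\mathbf{W}^{(t)})\ge 0\text{)}.
\end{align*}
```

The right-hand side does not depend on t, so the bound holds uniformly along the trajectory; combined with a bound on L(\mathbf{W}^{(0)}) that is logarithmic in n, this is consistent with the form of the lemma.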

### 4.2 Proof of Theorem 3.3

Our proof road map can be organized in three steps: (i) prove that the training loss enjoys good curvature properties within the perturbation region \mathcal{B}(\mathbf{W}^{(0)},\tau); (ii) show that gradient descent is able to converge to global minima based on such good curvature properties; and (iii) ensure all iterates stay inside the perturbation region until convergence.

Step (i) Training loss properties. We first show some key properties of the training loss within \mathcal{B}(\mathbf{W}^{(0)},\tau), which are essential to establish the convergence guarantees of gradient descent.

###### Lemma 4.3.

If m\geq O(L\log(nL)), with probability at least 1-O(n^{-1}) it holds that L(\mathbf{W}^{(0)})\leq\widetilde{O}(1).

Lemma 4.3 suggests that the training loss L(\mathbf{W}) at the initial point does not depend on the number of hidden nodes per layer, i.e., m.

Moreover, the training loss L(\mathbf{W}) is nonsmooth due to the non-differentiable ReLU activation function. Generally speaking, smoothness is essential to achieve a linear rate of convergence for gradient-based algorithms. Fortunately, Allen-Zhu et al. (2018b) showed that the training loss satisfies a local semi-smoothness property, which is summarized in the following lemma.

###### Lemma 4.4 (Semi-smoothness (Allen-Zhu et al., 2018b)).

Let

 \displaystyle\tau\in\big{[}\Omega\big{(}1/(k^{3/2}m^{3/2}L^{3/2}\log^{3/2}(m))\big{)},O\big{(}1/(L^{4.5}\log^{3/2}(m))\big{)}\big{]}.

Then for any two collections \widehat{\mathbf{W}}=\{\widehat{\mathbf{W}}_{l}\}_{l\in[L]} and \widetilde{\mathbf{W}}=\{\widetilde{\mathbf{W}}_{l}\}_{l\in[L]} satisfying \widehat{\mathbf{W}},\widetilde{\mathbf{W}}\in\mathcal{B}(\mathbf{W}^{(0)},\tau), with probability at least 1-\exp(-\Omega(m\tau^{3/2}L)), there exist two constants C^{\prime} and C^{\prime\prime} such that

 \displaystyle L(\widetilde{\mathbf{W}})\leq L(\widehat{\mathbf{W}})+\langle\nabla L(\widehat{\mathbf{W}}),\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\rangle+C^{\prime}\sqrt{L(\widehat{\mathbf{W}})}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}+\frac{C^{\prime\prime}L^{2}m}{k}\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}^{2}. (4.2)

Lemma 4.4 is a rescaled version of Theorem 4 in Allen-Zhu et al. (2018b), since the training loss L(\mathbf{W}) in (2.1) is divided by the training sample size n, as opposed to the training loss in Allen-Zhu et al. (2018b). This lemma suggests that if the perturbation region is small, i.e., \tau\ll 1, the non-smooth term (the third term on the R.H.S. of (4.2)) is small and dominated by the gradient term (the second term on the R.H.S. of (4.2)). Therefore, the training loss behaves like a smooth function in the perturbation region and the linear rate of convergence can be proved.

Step (ii) Convergence rate of GD. Now we are going to establish the convergence rate for gradient descent under the assumption that all iterates stay inside the region \mathcal{B}(\mathbf{W}^{(0)},\tau), where \tau will be specified later.

###### Lemma 4.5.

Assume all iterates stay inside the region \mathcal{B}(\mathbf{W}^{(0)},\tau), where \tau=O\big{(}\phi^{3/2}n^{-3}L^{-6}\log^{-3/2}(m)\big{)}. Then under Assumptions 3.1 and 3.2, if we set the step size \eta=O\big{(}k/(L^{2}m)\big{)}, with probability at least 1-\exp\big{(}-O(m\tau^{3/2}L)\big{)}, it holds that

 \displaystyle L(\mathbf{W}^{(t)})\leq\bigg{(}1-O\bigg{(}\frac{m\phi\eta}{kn^{2}}\bigg{)}\bigg{)}^{t}L(\mathbf{W}^{(0)}).

Lemma 4.5 suggests that gradient descent is able to decrease the training loss to zero at a linear rate.

Step (iii) Verifying all iterates of GD stay inside the perturbation region. Next, we ensure that all iterates of GD stay inside the required region \mathcal{B}(\mathbf{W}^{(0)},\tau). Note that we have bounded the distance \|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2} in Lemma 4.2. Therefore, it suffices to verify that this distance is smaller than the preset value \tau, and we can complete the proof of Theorem 3.3 by verifying the conditions based on our choice of m. Recall that the required value of m is set in (3.1). Plugging (3.1) into the result of Lemma 4.2, we have with probability at least 1-O(n^{-1}), the following holds for all t\leq T and l\in[L]

 \displaystyle\|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq O\big(\phi^{3/2}n^{-3}L^{-6}\log^{-3/2}(m)\big),

which is of the same order as \tau in Lemma 4.5. Therefore, our choice of m guarantees that all iterates are inside the required perturbation region. In addition, by Lemma 4.5, in order to achieve \epsilon accuracy, we require

 \displaystyle T\eta=O\big(kn^{2}\log(1/\epsilon)m^{-1}\phi^{-1}\big). (4.3)

Then substituting our choice of step size \eta=O\big{(}k/(L^{2}m)\big{)} into (4.3) and applying Lemma 4.3, we can get the desired result for T.
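As a sanity check on this last step, the substitution can be written out explicitly. The display below is only a sketch of the arithmetic: it takes \eta=ck/(L^{2}m) for an absolute constant c>0 (a constant introduced here for illustration), and it ignores the dependence on the initial loss L(\mathbf{W}^{(0)}), which Lemma 4.3 controls.

```latex
T=\frac{1}{\eta}\cdot T\eta
 =\frac{L^{2}m}{ck}\cdot O\bigg(\frac{kn^{2}\log(1/\epsilon)}{m\phi}\bigg)
 =O\bigg(\frac{n^{2}L^{2}\log(1/\epsilon)}{\phi}\bigg).
```

Note that the width m and the output dimension k cancel, so in this sketch the iteration count depends only on n, L, \phi and \epsilon.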

## 5 Conclusions and future work

In this paper, we studied the global convergence of (stochastic) gradient descent for training over-parameterized ReLU networks, and improved the state-of-the-art results. Our proof technique can also be applied to prove similar results for other loss functions such as cross-entropy loss and other neural network architectures such as convolutional neural networks (CNN) (Allen-Zhu et al., 2018b; Du et al., 2018b) and ResNet (Allen-Zhu et al., 2018b; Du et al., 2018b; Zhang et al., 2019). One important direction for future work is to investigate whether the over-parameterization condition and the convergence rate can be further improved. Another interesting future direction is to explore the use of our proof technique to improve the generalization analysis of over-parameterized neural networks trained by gradient-based algorithms (Allen-Zhu et al., 2018a; Cao and Gu, 2019; Arora et al., 2019).

## Appendix A Proof of the Main Theory

### A.1 Proof of Proposition 3.6

We prove this proposition in two steps: (1) we prove that if there are no duplicate training data points, it must hold that \lambda_{\min}(\mathbf{H})>0; (2) we prove that if there exists at least one duplicate training data point, we have \lambda_{\min}(\mathbf{H})=0.

The first step can be done by applying Theorem 3 in Du et al. (2018b), where the authors showed that if \mathbf{x}_{i}\nparallel\mathbf{x}_{j} for any i\neq j, then it holds that \lambda_{\min}(\mathbf{H})>0. Under Assumption 3.1, we have \|\mathbf{x}_{i}\|_{2}=\|\mathbf{x}_{j}\|_{2}, so it can be shown that \mathbf{x}_{i}\neq\mathbf{x}_{j} for all i\neq j is a sufficient condition for \lambda_{\min}(\mathbf{H})>0.

Then we conduct the second step. Clearly, if we have two training data points with \mathbf{x}_{i}=\mathbf{x}_{j}, it can be shown that \mathbf{H}_{ik}=\mathbf{H}_{jk} for all k=1,\dots,n. This immediately implies that there exist two identical rows in \mathbf{H}, which further implies that \lambda_{\min}(\mathbf{H})=0.
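The duplicate-row argument can be illustrated numerically. The sketch below builds a 3 x 3 Gram matrix from unit vectors and compares its determinant with and without a duplicated input. The closed-form kernel in `kernel_entry` is an assumption made for this example (the argument in the text only needs \mathbf{H}_{ik} to be a function of \mathbf{x}_{i} and \mathbf{x}_{k}), and the data points are arbitrary.

```python
import math


def kernel_entry(x, y):
    """Illustrative kernel: <x, y> * (pi - arccos(<x, y>)) / (2 * pi) for unit vectors.

    This closed form is an assumption for the example, not the paper's definition of H.
    """
    dot = sum(a * b for a, b in zip(x, y))
    dot = max(-1.0, min(1.0, dot))  # clip rounding error before arccos
    return dot * (math.pi - math.acos(dot)) / (2.0 * math.pi)


def gram(points):
    """Gram matrix H with H[i][k] = kernel_entry(x_i, x_k)."""
    return [[kernel_entry(x, y) for y in points] for x in points]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


distinct = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
duplicated = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]  # x_3 equals x_1

H_distinct = gram(distinct)
H_duplicated = gram(duplicated)
print(det3(H_distinct), det3(H_duplicated))
```

With a duplicated input, two rows of the Gram matrix coincide, so the matrix is singular; since it is also positive semi-definite, its smallest eigenvalue is zero.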

The last argument can be directly proved by Lemma I.1 in Oymak and Soltanolkotabi (2019), where the authors showed that \lambda_{0}=\lambda_{\min}(\mathbf{H})\geq\phi/(100n^{2}).

By combining the above discussions, we are able to complete the proof.

### A.2 Proof of Theorem 3.8

Now we sketch the proof of Theorem 3.8. Following the same idea of proving Theorem 3.3, we split the whole proof into three steps.

Step (i) Initialization and perturbation region characterization. Unlike the proof for GD, in addition to the crucial gradient lower bound specified in Lemma 4.1, we also require the gradient upper bound, which is stated in the following lemma.

###### Lemma A.1 (Gradient upper bounds (Allen-Zhu et al., 2018b)).

Let \tau=O\big(\phi^{3/2}n^{-3}L^{-6}\log^{-3/2}(m)\big). Then for all \mathbf{W}\in\mathcal{B}(\mathbf{W}^{(0)},\tau), with probability at least 1-\exp\big(-O(m\phi/(kn))\big), it holds that

 \displaystyle\|\nabla L(\mathbf{W})\|_{F}^{2}\leq O\bigg(\frac{mL(\mathbf{W})}{k}\bigg),\quad\|\nabla\ell(\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i})\|_{F}^{2}\leq O\bigg(\frac{m\ell(\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i})}{k}\bigg).

In the later analysis, the gradient upper bound is exploited to bound the distance between the iterates of SGD and the initialization. Besides, since Lemmas 4.3 and 4.4 hold for both GD and SGD, we do not state them again in this part.

Step (ii) Convergence rate of SGD. Analogous to the proof for GD, the following lemma shows that SGD is able to converge to the global minima at a linear rate.

###### Lemma A.2.

Assume all iterates stay inside the region \mathcal{B}(\mathbf{W}^{(0)},\tau), where \tau=O\big(\phi^{3}B^{3/2}n^{-6}L^{-6}\log^{-3/2}(m)\big). Then under Assumptions 3.1 and 3.2, if we set the step size \eta=O\big(B\phi/(L^{2}mn^{2})\big), with probability at least 1-\exp\big(-O(m\tau^{3/2}L)\big), it holds that

 \displaystyle\mathbb{E}[L(\mathbf{W}^{(t)})]\leq\bigg(1-O\bigg(\frac{m\phi\eta}{kn^{2}}\bigg)\bigg)^{t}L(\mathbf{W}^{(0)}).

Step (iii) Verifying all iterates of SGD stay inside the perturbation region. Similar to the proof for GD, the following lemma characterizes the distance from each iterate to the initial point for SGD.

###### Lemma A.3.

Under the same assumptions made in Lemma A.2, if we set the step size \eta=O\big(kB\phi/(n^{3}m\log(m))\big) and suppose m\geq O(T\cdot n), then with probability at least 1-O(n^{-1}), the following holds for all t\leq T and l\in[L],

 \displaystyle\|\mathbf{W}^{(t)}_{l}-\mathbf{W}_{l}^{(0)}\|_{2}\leq O\big(k^{1/2}n^{5/2}B^{-1/2}m^{-1/2}\phi^{-1}\big).

###### Proof of Theorem 3.8.

Compared with Lemma 4.2, the trajectory length of SGD is much larger than that of GD. In addition, we require a much smaller step size to guarantee that the iterates do not move too far away from the initial point. This makes the over-parameterization condition for SGD worse than that for GD.

We complete the proof of Theorem 3.8 by verifying our choice of m in (3.2). By substituting (3.2) into Lemma A.3, we have with probability at least 1-O(n^{-1}), the following holds for all t\leq T and l\in[L]

 \displaystyle\|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}=O\big(\phi^{3}B^{3/2}n^{-6}L^{-6}\log^{-3/2}(m)\big),

which is of the same order as \tau in Lemma A.2. Then by Lemma A.2, we know that in order to achieve \epsilon expected training loss, it suffices to set

 \displaystyle T\eta=O\big{(}kn^{2}m^{-1}\phi^{-1}\log(1/\epsilon)\big{)}.

Then applying our choice of step size, i.e., \eta=O\big{(}kB\phi/(n^{3}m\log(m))\big{)}, we can get the desired result for T. This completes the proof. ∎

### A.3 Proof of Theorem 3.10

Before proving Theorem 3.10, we first deliver the following two lemmas. The first lemma states the upper bound of stochastic gradient in \|\cdot\|_{2,\infty} norm.

###### Lemma A.4.

With probability at least 1-O(m^{-1}), it holds that

 \displaystyle\|\nabla\ell(\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i})\|_{2,\infty}^{2}\leq O\big(\ell(\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i})\cdot\log(m)\big)

for all \mathbf{W}\in\mathbb{R}^{m\times d} and i\in[n].

The following lemma gives a different version of semi-smoothness for two-layer ReLU network.

###### Lemma A.5 (Semi-smoothness for two-layer ReLU network).

For any two collections \widehat{\mathbf{W}}=\{\widehat{\mathbf{W}}_{l}\}_{l\in[L]} and \widetilde{\mathbf{W}}=\{\widetilde{\mathbf{W}}_{l}\}_{l\in[L]} satisfying \widehat{\mathbf{W}},\widetilde{\mathbf{W}}\in\mathcal{B}(\mathbf{W}^{(0)},\tau), with probability at least 1-\exp(-O(m\tau^{2/3})), there exist two constants C^{\prime} and C^{\prime\prime} such that

 \begin{aligned}
 L(\widetilde{\mathbf{W}}) &\leq L(\widehat{\mathbf{W}})+\langle\nabla L(\widehat{\mathbf{W}}),\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\rangle \\
 &\quad+C^{\prime}\sqrt{L(\widehat{\mathbf{W}})}\cdot\frac{\tau^{2/3}m\sqrt{\log(m)}}{\sqrt{k}}\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2,\infty}+\frac{C^{\prime\prime}m}{k}\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}^{2}.
 \end{aligned}

It is worth noting that Lemma 4.4 can also imply a \|\cdot\|_{2,\infty} norm based semi-smoothness result by applying the inequality \|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}\leq\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{F}\leq\sqrt{m}\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2,\infty}. However, this operation preserves the dependency on \tau from Lemma 4.4, i.e., \tau^{1/3}, which is worse than that in Lemma A.5 (i.e., \tau^{2/3}) since typically we have \tau\ll 1. Therefore, Lemma A.5 is crucial for establishing a better convergence guarantee for SGD in training two-layer ReLU networks.

###### Proof of Theorem 3.10.

To simplify the proof, we use the following short-hand notation to define the mini-batch stochastic gradient at the t-th iteration

 \displaystyle\mathbf{G}^{(t)}=\frac{1}{|\mathcal{B}^{(t)}|}\sum_{s\in\mathcal{B}^{(t)}}\nabla\ell\big(\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{s}),\mathbf{y}_{s}\big),

where \mathcal{B}^{(t)} is the minibatch of data indices with |\mathcal{B}^{(t)}|=B. Then we bound its variance as follows,

 \begin{aligned}
 \mathbb{E}[\|\mathbf{G}^{(t)}-\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}] &\leq\frac{1}{B}\mathbb{E}_{s}[\|\nabla\ell\big(\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{s}),\mathbf{y}_{s}\big)-\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}] \\
 &\leq\frac{2}{B}\big[\mathbb{E}_{s}[\|\nabla\ell\big(\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{s}),\mathbf{y}_{s}\big)\|_{F}^{2}]+\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}\big] \\
 &\leq\frac{4mL(\mathbf{W}^{(t)})}{Bk},
 \end{aligned}

where the expectation is taken over the random choice of training data, the second inequality follows from Young’s inequality, and the last inequality is by Lemma A.1. Moreover, we can further bound the expectation \mathbb{E}[\|\mathbf{G}^{(t)}\|_{F}^{2}] as follows,

 \displaystyle\mathbb{E}[\|\mathbf{G}^{(t)}\|_{F}^{2}]\leq 2\mathbb{E}[\|\mathbf{G}^{(t)}-\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}]+2\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}\leq\frac{8mL(\mathbf{W}^{(t)})}{Bk}+2\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}. (A.1)
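The 1/B scaling of the mini-batch variance and the Young-type bound can be checked exactly on a toy example. The sketch below enumerates every mini-batch of size B over a handful of made-up per-example gradient vectors. It assumes uniform sampling with replacement, in which case the first inequality holds with equality; the vectors and helper names are hypothetical and chosen only for illustration.

```python
import itertools


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in v)


def mean(vectors):
    """Coordinate-wise mean of a sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def minibatch_variance(grads, B):
    """Exact E||G - gbar||^2 over all size-B mini-batches drawn uniformly with replacement."""
    gbar = mean(grads)
    total, count = 0.0, 0
    for batch in itertools.product(grads, repeat=B):
        G = mean(batch)
        total += sq_norm([a - b for a, b in zip(G, gbar)])
        count += 1
    return total / count


# Made-up per-example gradients (one vector per training example).
grads = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [3.0, -1.0]]
B = 2

gbar = mean(grads)
per_sample_var = sum(sq_norm([a - b for a, b in zip(g, gbar)]) for g in grads) / len(grads)
second_moment = sum(sq_norm(g) for g in grads) / len(grads)

var_B = minibatch_variance(grads, B)
young_bound = 2.0 / B * (second_moment + sq_norm(gbar))
print(var_B, per_sample_var / B, young_bound)
```

In the proof, the two terms inside `young_bound` are then each controlled by the gradient upper bounds of Lemma A.1.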

By Lemma A.5, we have the following for one-step stochastic gradient descent

 \begin{aligned}
 L(\mathbf{W}^{(t+1)}) &\leq L(\mathbf{W}^{(t)})-\eta\langle\nabla L(\mathbf{W}^{(t)}),\mathbf{G}^{(t)}\rangle \\
 &\quad+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{2/3}m\sqrt{\log(m)}}{\sqrt{k}}\cdot\|\mathbf{G}^{(t)}\|_{2,\infty}+\frac{C^{\prime\prime}m\eta^{2}}{k}\cdot\|\mathbf{G}^{(t)}\|_{2}^{2}.
 \end{aligned}

Taking expectation conditioned on \mathbf{W}^{(t)} and using \mathbb{E}[\mathbf{G}^{(t)}|\mathbf{W}^{(t)}]=\nabla L(\mathbf{W}^{(t)}), we obtain

 \begin{aligned}
 \mathbb{E}[L(\mathbf{W}^{(t+1)})|\mathbf{W}^{(t)}] &\leq L(\mathbf{W}^{(t)})-\eta\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2} \\
 &\quad+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{2/3}m\sqrt{\log(m)}}{\sqrt{k}}\cdot\mathbb{E}[\|\mathbf{G}^{(t)}\|_{2,\infty}|\mathbf{W}^{(t)}] \\
 &\quad+\frac{C^{\prime\prime}m\eta^{2}}{k}\cdot\mathbb{E}[\|\mathbf{G}^{(t)}\|_{2}^{2}|\mathbf{W}^{(t)}].
 \end{aligned} (A.2)

By Lemma A.4, with probability at least 1-O(m^{-1}) we have the following upper bound on the quantity \mathbb{E}[\|\mathbf{G}^{(t)}\|_{2,\infty}|\mathbf{W}^{(t)}] for all t=1,\dots,T,

 \displaystyle\mathbb{E}[\|\mathbf{G}^{(t)}\|_{2,\infty}|\mathbf{W}^{(t)}]\leq\mathbb{E}[\|\nabla\ell(\mathbf{f}_{\mathbf{W}^{(t)}}(\mathbf{x}_{i}),\mathbf{y}_{i})\|_{2,\infty}|\mathbf{W}^{(t)}]\leq O\big(\sqrt{L(\mathbf{W}^{(t)})\log(m)}\big).

Then, based on Lemma B.2, we plug (A.1) and the above inequality into (A.2) and set

 \displaystyle\eta=O\bigg(\frac{k}{mn^{2}}\bigg)\quad\mbox{and}\quad\tau=O\bigg(\frac{\phi^{3}}{n^{3}k^{3/4}\log^{3/2}(m)}\bigg).

Then with proper adjustment of constants we can obtain

 \displaystyle\mathbb{E}[L(\mathbf{W}^{(t+1)})|\mathbf{W}^{(t)}]\leq L(\mathbf{W}^{(t)})-\frac{\eta}{2}\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}\leq\bigg(1-\frac{m\phi\eta}{2kn^{2}}\bigg)L(\mathbf{W}^{(t)}),

where the last inequality follows from Lemma 4.1. Then taking expectation over \mathbf{W}^{(t)}, we have with probability at least 1-O(m^{-1}),

 \displaystyle\mathbb{E}[L(\mathbf{W}^{(t+1)})]\leq\bigg(1-\frac{m\phi\eta}{2kn^{2}}\bigg)\mathbb{E}[L(\mathbf{W}^{(t)})]\leq\bigg(1-\frac{m\phi\eta}{2kn^{2}}\bigg)^{t+1}\mathbb{E}[L(\mathbf{W}^{(0)})] (A.3)

holds for all t>0. Then by Lemma A.3, we know that if we set \eta=O\big(kB\phi/(n^{3}m\log(m))\big), with probability at least 1-O(n^{-1}), it holds that

 \displaystyle\|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq O\bigg(\frac{k^{1/2}n^{5/2}}{B^{1/2}m^{1/2}\phi}\bigg),

for all t\leq T. Then by our choice of m, it is easy to verify that with probability at least 1-O(n^{-1})-O(m^{-1})=1-O(n^{-1}),

 \displaystyle\|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2}\leq O\bigg(\frac{k^{1/2}n^{5/2}}{B^{1/2}\phi}\cdot\frac{\phi^{4}B^{1/2}}{k^{5/4}n^{11/2}\log^{3/2}(m)}\bigg)=\tau.

Moreover, note that in Lemma A.3 we set the step size as \eta=O\big(kB\phi/(n^{3}m\log(m))\big), and (A.3) suggests that we need

 \displaystyle T\eta=O\bigg(\frac{kn^{2}\log(1/\epsilon)}{m\phi}\bigg)

to achieve \epsilon expected training loss. Therefore, we can derive the number of iterations as

 \displaystyle T=O\bigg{(}\frac{n^{5}\log(m)\log(1/\epsilon)}{B\phi^{2}}\bigg{)}.

This completes the proof. ∎
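For readers who wish to check the final iteration count, the arithmetic can be spelled out as follows. This is a sketch that ignores absolute constants: it takes \eta=ckB\phi/(n^{3}m\log(m)) for an absolute constant c>0 (introduced here only for illustration) and uses the fact that the contraction in (A.3) requires T\eta of order kn^{2}\log(1/\epsilon)/(m\phi).

```latex
T=\frac{1}{\eta}\cdot T\eta
 =\frac{n^{3}m\log(m)}{ckB\phi}\cdot O\bigg(\frac{kn^{2}\log(1/\epsilon)}{m\phi}\bigg)
 =O\bigg(\frac{n^{5}\log(m)\log(1/\epsilon)}{B\phi^{2}}\bigg).
```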

## Appendix B Proof of Lemmas in Section 4 and Appendix A

### B.1 Proof of Lemma 4.1

We first provide the following useful lemmas before starting the proof of Lemma 4.1.

The following lemma states that with high probability the norm of the output of each hidden layer is bounded by constants.

###### Lemma B.1 ((Zou et al., 2018)).

If m\geq O(L\log(nL)), with probability at least 1-\exp(-O(m/L)), it holds that 1/2\leq\|\mathbf{x}_{l,i}\|_{2}\leq 2 and \big\|\mathbf{x}_{l,i}/\|\mathbf{x}_{l,i}\|_{2}-\mathbf{x}_{l,j}/\|\mathbf{x}_{l,j}\|_{2}\big\|_{2}\geq\phi/2 for all i,j\in[n] with i\neq j and all l\in[L], where \mathbf{x}_{l,i} denotes the output of the l-th hidden layer given the input \mathbf{x}_{i}.

###### Lemma B.2.

Assume m\geq\widetilde{O}(n^{2}k^{2}\phi^{-1}). Then there exists an absolute constant C>0 such that with probability at least 1-\exp\big(-O(m\phi/(kn))\big), it holds that

 \displaystyle\sum_{j=1}^{m}\bigg\|\frac{1}{n}\sum_{i=1}^{n}\langle\mathbf{u}_{i},\mathbf{v}_{j}\rangle\sigma^{\prime}\big(\langle\mathbf{w}_{L,j}^{(0)},\mathbf{x}_{L-1,i}\rangle\big)\mathbf{x}_{L-1,i}\bigg\|_{2}^{2}\geq\frac{C\phi m\sum_{i=1}^{n}\|\mathbf{u}_{i}\|_{2}^{2}}{kn^{3}}.

If we set \mathbf{u}_{i}=\mathbf{f}_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})-\mathbf{y}_{i}, Lemma B.2 corresponds to the gradient lower bound at the initialization. Then the next step is to prove the bounds for all \mathbf{W} in the required perturbation region. Before proceeding to our final proof, we present the following lemma that provides useful results regarding the neural network within the perturbation region.

###### Lemma B.3 ((Allen-Zhu et al., 2018b)).

Consider a collection of weight matrices \widetilde{\mathbf{W}}=\{\widetilde{\mathbf{W}}_{l}\}_{l=1,\dots,L} such that \widetilde{\mathbf{W}}\in\mathcal{B}(\mathbf{W}^{(0)},\tau). Then with probability at least 1-\exp(-O(m\tau^{2/3}L)), there exist constants C^{\prime}, C^{\prime\prime} and C^{\prime\prime\prime} such that

• \|\widetilde{\bm{\Sigma}}_{L,i}-\bm{\Sigma}_{L,i}\|_{0}\leq C^{\prime}\tau^{2/3}L

• \|\mathbf{V}(\widetilde{\bm{\Sigma}}_{L,i}-\bm{\Sigma}_{L,i})\|_{2}\leq C^{\prime\prime}\tau^{1/3}L^{2}\sqrt{m\log(m)}/\sqrt{k}

• \|\widetilde{\mathbf{x}}_{L-1,i}-\mathbf{x}_{L-1,i}\|_{2}\leq C^{\prime\prime\prime}\tau L^{5/2}\sqrt{\log(m)},

for all i=1,\dots,n, where \mathbf{x}_{L-1,i} and \widetilde{\mathbf{x}}_{L-1,i} denote the outputs of the (L-1)-th layer of the neural network with weight matrices \mathbf{W}^{(0)} and \widetilde{\mathbf{W}}, and \bm{\Sigma}_{L,i} and \widetilde{\bm{\Sigma}}_{L,i} are diagonal matrices with (\bm{\Sigma}_{L,i})_{jj}=\sigma^{\prime}(\langle\mathbf{w}_{L,j}^{(0)},\mathbf{x}_{L-1,i}\rangle) and (\widetilde{\bm{\Sigma}}_{L,i})_{jj}=\sigma^{\prime}(\langle\widetilde{\mathbf{w}}_{L,j},\widetilde{\mathbf{x}}_{L-1,i}\rangle), respectively.

Now we are ready to prove the lower and upper bounds of the Frobenius norm of the gradient.

###### Proof of Lemma 4.1.

The upper bound of the gradient norm can be proved according to Theorem 3 in Allen-Zhu et al. (2018b). We slightly modify their result since we consider average loss over all training examples while Allen-Zhu et al. (2018b) considers summation.

Then we focus on proving the lower bound. Note that the gradient \nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}}) takes the form

 \displaystyle\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})=\frac{1}{n}\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}\widetilde{\bm{\Sigma}}_{L,i}\bigg)^{\top}\widetilde{\mathbf{x}}_{L-1,i}^{\top},

where \widetilde{\bm{\Sigma}}_{L,i} is a diagonal matrix with (\widetilde{\bm{\Sigma}}_{L,i})_{jj}=\sigma^{\prime}(\langle\widetilde{\mathbf{w}}_{L,j},\widetilde{\mathbf{x}}_{L-1,i}\rangle) and \widetilde{\mathbf{x}}_{l,i} denotes the output of the l-th hidden layer with input \mathbf{x}_{i} and model weight matrices \widetilde{\mathbf{W}}. Let \mathbf{v}_{j}^{\top} denote the j-th row of matrix \mathbf{V}, and define

 \displaystyle\widetilde{\mathbf{G}}=\frac{1}{n}\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}\bm{\Sigma}_{L,i}\bigg)^{\top}\mathbf{x}_{L-1,i}^{\top},

where \bm{\Sigma}_{L,i} is a diagonal matrix with (\bm{\Sigma}_{L,i})_{jj}=\sigma^{\prime}(\langle\mathbf{w}^{(0)}_{L,j},\mathbf{x}_{L-1,i}\rangle). Then by Lemma B.2, we have with probability at least 1-\exp\big(-O(m\phi/(kn))\big), the following holds for any \widetilde{\mathbf{W}},

 \begin{aligned}
 \|\widetilde{\mathbf{G}}\|_{F}^{2} &=\frac{1}{n^{2}}\sum_{j=1}^{m}\bigg\|\sum_{i=1}^{n}\langle\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i},\mathbf{v}_{j}\rangle\sigma^{\prime}\big(\langle\mathbf{w}_{L,j}^{(0)},\mathbf{x}_{L-1,i}\rangle\big)\mathbf{x}_{L-1,i}\bigg\|_{2}^{2} \\
 &\geq\frac{C_{0}\phi m\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}^{2}}{kn^{3}},
 \end{aligned}

where C_{0} is an absolute constant. Then we have

 \begin{aligned}
 \big\|\widetilde{\mathbf{G}}-\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})\big\|_{F} &=\frac{1}{n}\bigg\|\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}\bm{\Sigma}_{L,i}\bigg)^{\top}\mathbf{x}_{L-1,i}^{\top}-\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}\widetilde{\bm{\Sigma}}_{L,i}\bigg)^{\top}\widetilde{\mathbf{x}}_{L-1,i}^{\top}\bigg\|_{F} \\
 &\leq\frac{1}{n}\bigg[\bigg\|\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}(\bm{\Sigma}_{L,i}-\widetilde{\bm{\Sigma}}_{L,i})\bigg)^{\top}\mathbf{x}_{L-1,i}^{\top}\bigg\|_{F} \\
 &\quad+\bigg\|\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}\widetilde{\bm{\Sigma}}_{L,i}\bigg)^{\top}\big(\mathbf{x}_{L-1,i}-\widetilde{\mathbf{x}}_{L-1,i}\big)^{\top}\bigg\|_{F}\bigg].
 \end{aligned}

By Lemmas B.1 and B.3, we have

 \begin{aligned}
 \bigg\|\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}(\bm{\Sigma}_{L,i}-\widetilde{\bm{\Sigma}}_{L,i})\bigg)^{\top}\mathbf{x}_{L-1,i}^{\top}\bigg\|_{F} &\leq\sum_{i=1}^{n}\big\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\big\|_{2}\big\|\mathbf{V}(\bm{\Sigma}_{L,i}-\widetilde{\bm{\Sigma}}_{L,i})\big\|_{2}\|\mathbf{x}_{L-1,i}\|_{2} \\
 &\leq\frac{C_{1}\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\sum_{i=1}^{n}\big\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\big\|_{2},
 \end{aligned}

where the second inequality follows from Lemma B.3 and C_{1} is an absolute constant. In addition, we also have

 \begin{aligned}
 \bigg\|\sum_{i=1}^{n}\bigg((\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i})^{\top}\mathbf{V}\widetilde{\bm{\Sigma}}_{L,i}\bigg)^{\top}\big(\mathbf{x}_{L-1,i}-\widetilde{\mathbf{x}}_{L-1,i}\big)^{\top}\bigg\|_{F} &\leq\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}\|\mathbf{V}\|_{2}\|\mathbf{x}_{L-1,i}-\widetilde{\mathbf{x}}_{L-1,i}\|_{2} \\
 &\leq\frac{C_{2}\tau L^{5/2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2},
 \end{aligned}

where the second inequality follows from Lemma B.3 and C_{2} is an absolute constant. Combining the above bounds we have

 \begin{aligned}
 \big\|\widetilde{\mathbf{G}}-\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})\big\|_{F} &\leq\frac{\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}}{n}\cdot\bigg(\frac{C_{1}\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}+\frac{C_{2}\tau L^{5/2}\sqrt{m\log(m)}}{\sqrt{k}}\bigg) \\
 &\leq\frac{\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}}{n}\cdot\frac{C_{3}\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}},
 \end{aligned}

where the second inequality follows from the fact that \tau\leq O(L^{-4/3}). Then by triangle inequality, we have the following lower bound of \|\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})\|_{F}

 \begin{aligned}
 \|\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})\|_{F} &\geq\|\widetilde{\mathbf{G}}\|_{F}-\|\widetilde{\mathbf{G}}-\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})\|_{F} \\
 &\geq\frac{C_{0}\phi^{1/2}m^{1/2}\sqrt{n\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}^{2}}}{\sqrt{k}n^{2}} \\
 &\quad-\frac{\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}}{n}\cdot\frac{C_{3}\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}.
 \end{aligned}

By Jensen’s inequality we know that n\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}^{2}\geq\big(\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}\big)^{2}. Then we set

 \displaystyle\tau=\frac{C_{3}\phi^{3/2}}{2C_{0}n^{3}L^{6}\log^{3/2}(m)}=O\bigg(\frac{\phi^{3/2}}{n^{3}L^{6}\log^{3/2}(m)}\bigg),

and obtain

 \displaystyle\|\nabla_{\mathbf{W}_{L}}L(\widetilde{\mathbf{W}})\|_{F}\geq\frac{C_{0}\phi^{1/2}m^{1/2}\sqrt{n\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}^{2}}}{2\sqrt{k}n^{2}}.

Then plugging in the fact that \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}^{2}=L(\widetilde{\mathbf{W}}), we are able to complete the proof. ∎
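The order of \tau used in this proof can be motivated by a short calculation. The display below is a sketch up to absolute constants: the shorthand \mathbf{r}_{i}=\mathbf{f}_{\widetilde{\mathbf{W}}}(\mathbf{x}_{i})-\mathbf{y}_{i} and the constant c are introduced only for this illustration, and the exact value of c is immaterial for the final order. By Jensen's inequality, the leading term in the lower bound is at least C_{0}\phi^{1/2}m^{1/2}\sum_{i}\|\mathbf{r}_{i}\|_{2}/(\sqrt{k}n^{2}), so it suffices to make the subtracted term at most half of this quantity:

```latex
\begin{aligned}
&\frac{\sum_{i=1}^{n}\|\mathbf{r}_{i}\|_{2}}{n}\cdot\frac{C_{3}\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}
 \leq\frac{1}{2}\cdot\frac{C_{0}\phi^{1/2}m^{1/2}\sum_{i=1}^{n}\|\mathbf{r}_{i}\|_{2}}{\sqrt{k}n^{2}} \\
&\quad\Longleftrightarrow\quad
 \tau^{1/3}\leq\frac{C_{0}\phi^{1/2}}{2C_{3}nL^{2}\log^{1/2}(m)}
 \quad\Longleftrightarrow\quad
 \tau\leq c\cdot\frac{\phi^{3/2}}{n^{3}L^{6}\log^{3/2}(m)}.
\end{aligned}
```

This recovers the order \tau=O\big(\phi^{3/2}n^{-3}L^{-6}\log^{-3/2}(m)\big) stated in the lemma.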

### B.2 Proof of Lemma 4.2

###### Proof of Lemma 4.2.

Note that we assume that all iterates stay inside the region \mathcal{B}\big(\mathbf{W}^{(0)},\tau\big). Then by Lemma 4.4, with probability at least 1-\exp(-O(m\tau^{2/3}L)), we have the following after one step of gradient descent

 \begin{aligned}
 L\big(\mathbf{W}^{(t+1)}\big) &\leq L\big(\mathbf{W}^{(t)}\big)-\eta\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}^{2} \\
 &\quad+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{2}+\frac{C^{\prime\prime}L^{2}m\eta^{2}}{k}\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{2}^{2}.
 \end{aligned} (B.1)

We first choose the step size

 \displaystyle\eta=\frac{k}{4C^{\prime\prime}L^{2}m}=O\bigg(\frac{k}{L^{2}m}\bigg),

then (B.1) yields

 \begin{aligned}
 L\big(\mathbf{W}^{(t+1)}\big) &\leq L\big(\mathbf{W}^{(t)}\big)-\frac{3\eta}{4}\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}^{2}+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{2} \\
 &\leq L\big(\mathbf{W}^{(t)}\big)-\eta\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}\bigg(\frac{\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}}{2}-C^{\prime}\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\bigg),
 \end{aligned}

where we use the fact that \|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{2}\leq\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}. Then by Lemma 4.1, we know that with probability at least 1-\exp\big(-O(m\phi/(kn))\big)

 \displaystyle\|\nabla_{\mathbf{W}}L(\mathbf{W}^{(t)})\|_{F}^{2}\geq\|\nabla_{\mathbf{W}_{L}}L(\mathbf{W}^{(t)})\|_{F}^{2}\geq\frac{Cm\phi}{kn^{2}}L\big(\mathbf{W}^{(t)}\big), (B.2)

where C is an absolute constant. Thus, we can choose the radius \tau as

 \displaystyle\tau=\frac{C^{3/2}\phi^{3/2}}{64n^{3}C^{\prime 3}L^{6}\log^{3/2}(m)}=O\bigg(\frac{\phi^{3/2}}{n^{3}L^{6}\log^{3/2}(m)}\bigg), (B.3)

and thus the following holds with probability at least 1-\exp(-O(m\tau^{2/3}L))-\exp\big(-O(m\phi/(kn))\big)=1-\exp(-O(m\tau^{2/3}L)),

 \displaystyle L\big(\mathbf{W}^{(t+1)}\big)\leq L\big(\mathbf{W}^{(t)}\big)-\frac{\eta}{2}\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}^{2}, (B.4)

where the inequality follows from (B.2) and the choice of \tau in (B.3). By the triangle inequality, we have

 \displaystyle\|\mathbf{W}^{(t)}_{l}-\mathbf{W}^{(0)}_{l}\|_{2}\leq\sum_{s=0}^{t-1}\eta\|\nabla_{\mathbf{W}_{l}}L\big(\mathbf{W}^{(s)}\big)\|_{2}\leq\sum_{s=0}^{t-1}\eta\|\nabla L\big(\mathbf{W}^{(s)}\big)\|_{F}. (B.5)

Moreover, we have

 \begin{aligned}
 \sqrt{L(\mathbf{W}^{(s)})}-\sqrt{L(\mathbf{W}^{(s+1)})} &=\frac{L(\mathbf{W}^{(s)})-L(\mathbf{W}^{(s+1)})}{\sqrt{L(\mathbf{W}^{(s)})}+\sqrt{L(\mathbf{W}^{(s+1)})}} \\
 &\geq\frac{\eta\|\nabla L(\mathbf{W}^{(s)})\|_{F}^{2}}{4\sqrt{L(\mathbf{W}^{(s)})}} \\
 &\geq\sqrt{\frac{Cm\phi}{kn^{2}}}\cdot\frac{\eta\|\nabla L(\mathbf{W}^{(s)})\|_{F}}{4},
 \end{aligned}

where the second inequality is by (B.4) and the fact that L(\mathbf{W}^{(s+1)})\leq L(\mathbf{W}^{(s)}), and the last inequality follows from (B.2). Plugging the above result into (B.5), we have with probability at least 1-\exp(-O(m\tau^{2/3}L)),

 \begin{aligned}
 \|\mathbf{W}_{l}^{(t)}-\mathbf{W}_{l}^{(0)}\|_{2} &\leq\sum_{s=0}^{t-1}\eta\|\nabla L(\mathbf{W}^{(s)})\|_{F} \\
 &\leq 4\sqrt{\frac{kn^{2}}{Cm\phi}}\sum_{s=0}^{t-1}\Big[\sqrt{L(\mathbf{W}^{(s)})}-\sqrt{L(\mathbf{W}^{(s+1)})}\Big] \\
 &\leq 4\sqrt{\frac{kn^{2}}{Cm\phi}}\cdot\sqrt{L(\mathbf{W}^{(0)})}.
 \end{aligned} (B.6)

Note that (B.6) holds for all l and t. Then, applying Lemma 4.3, we are able to complete the proof.
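The telescoping argument behind (B.6) can be sanity-checked on a toy problem. The sketch below runs gradient descent on a small quadratic, which is only a stand-in for the network loss: it satisfies a descent inequality of the form (B.4) and a gradient lower bound of the form (B.2) with \mu in place of Cm\phi/(kn^{2}). All names and numerical values are assumptions chosen for illustration.

```python
import math


def loss(a, w):
    """Toy objective L(w) = 0.5 * sum_i a_i * w_i**2 (a stand-in, not the network loss)."""
    return 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))


def grad(a, w):
    """Gradient of the toy objective."""
    return [ai * wi for ai, wi in zip(a, w)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def gd_path_length(a, w0, eta, steps):
    """Run gradient descent; return (sum_s eta * ||grad L(w_s)||, final iterate)."""
    w = list(w0)
    length = 0.0
    for _ in range(steps):
        g = grad(a, w)
        length += eta * norm(g)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return length, w


a = [1.0, 4.0]        # curvatures of the toy quadratic
w0 = [2.0, -1.0]
eta = 1.0 / max(a)    # ensures L(w_{s+1}) <= L(w_s) - (eta / 2) * ||grad||^2
mu = 2.0 * min(a)     # ||grad L(w)||^2 >= mu * L(w) for this quadratic

path_length, w_final = gd_path_length(a, w0, eta, steps=200)
bound = 4.0 * math.sqrt(loss(a, w0) / mu)  # analogue of the right-hand side of (B.6)
displacement = norm([x - y for x, y in zip(w_final, w0)])
print(displacement, path_length, bound)
```

The distance travelled from the initialization is bounded by the path length, which in turn is bounded by a multiple of the square root of the initial loss, independently of the number of steps.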

### B.3 Proof of Lemma 4.3

###### Proof of Lemma 4.3.

Note that the output of the neural network can be formulated as

 \displaystyle\mathbf{f}_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})=\mathbf{V}\mathbf{x}_{L,i},

where \mathbf{x}_{L,i} denotes the output of the last hidden layer with input \mathbf{x}_{i}. Note that each entry in \mathbf{V} is generated i.i.d. from the Gaussian distribution \mathcal{N}(0,1/k). Thus, we know that with probability at least 1-\delta, it holds that \|\mathbf{V}\mathbf{x}_{L,i}\|_{2}\leq\sqrt{\log(1/\delta)}\cdot\|\mathbf{x}_{L,i}\|_{2}. Then by Lemma B.1 and the union bound, we have \|\mathbf{V}\mathbf{x}_{L,i}\|_{2}\leq 2\sqrt{\log(1/\delta)} for all i\in[n] with probability at least 1-\exp(-O(m/L))-n\delta. Setting \delta=O(n^{-2}) and using the fact that m\geq O(L\log(nL)), we have

 \displaystyle\|\mathbf{f}_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})\|_{2}^{2}=\|\mathbf{V}\mathbf{x}_{L,i}\|_{2}^{2}\leq O(\log(n))

for all i\in[n] with probability at least 1-O(n^{-1}). Then by our definition of training loss, it follows that

 \begin{aligned}
 L(\mathbf{W}^{(0)}) &=\frac{1}{2n}\sum_{i=1}^{n}\|\mathbf{f}_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}^{2} \\
 &\leq\frac{1}{n}\sum_{i=1}^{n}\big[\|\mathbf{f}_{\mathbf{W}^{(0)}}(\mathbf{x}_{i})\|_{2}^{2}+\|\mathbf{y}_{i}\|_{2}^{2}\big] \\
 &\leq O(\log(n))
 \end{aligned}

with probability at least 1-O(n^{-1}), where the first inequality is by Young’s inequality and we assume that \|\mathbf{y}_{i}\|_{2}=O(1) for all i\in[n] in the second inequality. This completes the proof. ∎

### B.4 Proof of Lemma 4.5

###### Proof of Lemma 4.5.

By (B.4), we have

 \begin{aligned}
 L\big(\mathbf{W}^{(t+1)}\big) &\leq L\big(\mathbf{W}^{(t)}\big)-\frac{\eta}{2}\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}^{2} \\
 &\leq\bigg(1-\frac{Cm\phi\eta}{2kn^{2}}\bigg)L\big(\mathbf{W}^{(t)}\big) \\
 &\leq\bigg(1-\frac{Cm\phi\eta}{2kn^{2}}\bigg)^{t+1}L\big(\mathbf{W}^{(0)}\big),
 \end{aligned} (B.7)

where the second inequality follows from (B.2). This completes the proof. ∎

### B.5 Proof of Lemma A.2

###### Proof of Lemma A.2.

Let \mathbf{G}^{(t)} denote the stochastic gradient leveraged in the t-th iteration, where the corresponding minibatch is defined as \mathcal{B}^{(t)}. By Lemma 4.4, we have the following inequality regarding one-step stochastic gradient descent

 \begin{aligned}
 L(\mathbf{W}^{(t+1)}) &\leq L(\mathbf{W}^{(t)})-\eta\langle\nabla L(\mathbf{W}^{(t)}),\mathbf{G}^{(t)}\rangle \\
 &\quad+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\|\mathbf{G}^{(t)}\|_{2}+\frac{C^{\prime\prime}L^{2}m\eta^{2}}{k}\cdot\|\mathbf{G}^{(t)}\|_{2}^{2}.
 \end{aligned}

Then taking expectation on both sides conditioned on \mathbf{W}^{(t)} gives

 \begin{aligned}
 \mathbb{E}\big[L(\mathbf{W}^{(t+1)})\big|\mathbf{W}^{(t)}\big] &\leq L(\mathbf{W}^{(t)})-\eta\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\mathbb{E}\big[\|\mathbf{G}^{(t)}\|_{2}\big|\mathbf{W}^{(t)}\big] \\
 &\quad+\frac{C^{\prime\prime}L^{2}m\eta^{2}}{k}\cdot\mathbb{E}\big[\|\mathbf{G}^{(t)}\|_{2}^{2}\big|\mathbf{W}^{(t)}\big].
 \end{aligned} (B.8)

Note that given \mathbf{W}^{(t)}, the expectations on \|\mathbf{G}^{(t)}\|_{2} and \|\mathbf{G}^{(t)}\|_{2}^{2} are only taken over the random minibatch \mathcal{B}^{(t)}. Then by (A.1), we have

 \displaystyle\mathbb{E}\big[\|\mathbf{G}^{(t)}\|_{2}\big|\mathbf{W}^{(t)}\big]^{2}\leq\mathbb{E}\big[\|\mathbf{G}^{(t)}\|_{F}^{2}\big|\mathbf{W}^{(t)}\big]\leq\frac{8mL(\mathbf{W}^{(t)})}{Bk}+2\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}.

By (B.2), we know that there is a constant C such that \|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}\geq Cm\phi L(\mathbf{W}^{(t)})/(kn^{2}). Then we set the step size \eta and radius \tau as follows

 \begin{aligned}
 \eta &=\frac{Ck}{64C^{\prime\prime}L^{2}mn^{2}}=O\bigg(\frac{k}{L^{2}mn^{2}}\bigg), \\
 \tau &=\frac{C^{3}\phi^{3}B^{3/2}}{64^{2}n^{6}C^{\prime 3}L^{6}\log^{3/2}(m)}=O\bigg(\frac{\phi^{3}B^{3/2}}{n^{6}L^{6}\log^{3/2}(m)}\bigg).
 \end{aligned}

Then (B.8) yields that

 \begin{aligned}
 \mathbb{E}\big[L(\mathbf{W}^{(t+1)})\big|\mathbf{W}^{(t)}\big] &\leq L\big(\mathbf{W}^{(t)}\big)-\eta\|\nabla L\big(\mathbf{W}^{(t)}\big)\|_{F}^{2}+\frac{C^{\prime\prime}L^{2}m\eta^{2}}{k}\bigg(\frac{8n^{2}}{C\phi B}+2\bigg)\cdot\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2} \\
 &\quad+C^{\prime}\eta\sqrt{L\big(\mathbf{W}^{(t)}\big)}\cdot\frac{\tau^{1/3}L^{2}\sqrt{m\log(m)}}{\sqrt{k}}\cdot\sqrt{\frac{8n^{2}}{C\phi B}+2}\cdot\|\nabla L(\mathbf{W}^{(t)})\|_{F} \\
 &\leq L(\mathbf{W}^{(t)})-\frac{\eta}{2}\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}.
 \end{aligned} (B.9)

Then applying (B.2) again and taking expectation over \mathbf{W}^{(t)} on both sides of (B.9), we obtain

 \displaystyle\mathbb{E}\big[L(\mathbf{W}^{(t+1)})\big]\leq\bigg(1-\frac{Cm\phi\eta}{2kn^{2}}\bigg)\mathbb{E}[L(\mathbf{W}^{(t)})]\leq\bigg(1-\frac{Cm\phi\eta}{2kn^{2}}\bigg)^{t+1}L(\mathbf{W}^{(0)}).

This completes the proof.
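The recursion above is a geometric contraction, so the number of iterations needed to reach a target loss can be read off directly. The following is a minimal sketch under stated assumptions: the function name `iterations_to_reach` and the numerical values are hypothetical, and `rho` stands in for the quantity Cm\phi\eta/(2kn^{2}) appearing in the bound.

```python
import math

def iterations_to_reach(rho: float, initial_loss: float, target_loss: float) -> int:
    """Smallest T with (1 - rho)**T * initial_loss <= target_loss.

    rho plays the role of C*m*phi*eta / (2*k*n**2); the values used below are
    hypothetical and only illustrate the geometric decay of the bound.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    if target_loss >= initial_loss:
        return 0
    return math.ceil(math.log(target_loss / initial_loss) / math.log(1.0 - rho))

T = iterations_to_reach(rho=0.1, initial_loss=1.0, target_loss=1e-3)
print(T, (1.0 - 0.1) ** T)
```

Since -\log(1-\rho)\geq\rho, the returned count is at most \log(L(\mathbf{W}^{(0)})/\epsilon)/\rho rounded up, which matches the usual O(\rho^{-1}\log(1/\epsilon)) iteration complexity.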

### B.6 Proof of Lemma A.3

###### Proof of Lemma A.3.

We prove this by a standard martingale inequality. By Lemma 4.4 and our choice of \eta and \tau, we have

 \displaystyle L(\mathbf{W}^{(t+1)})\leq L(\mathbf{W}^{(t)})+2\eta\|\nabla L(% \mathbf{W}^{(t)})\|_{F}\cdot\|\mathbf{G}^{(t)}\|_{2}+\eta^{2}\|\mathbf{G}^{(t)% }\|_{2}^{2}. (B.10)

By Lemma A.1, we know that there exists an absolute constant C such that

 \displaystyle\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}\leq\frac{CmL(\mathbf{W}^{(% t)})}{k}\mbox{ and }\|\mathbf{G}^{(t)}\|_{F}^{2}\leq\frac{CmnL(\mathbf{W}^{(t)% })}{Bk},

where B denotes the minibatch size. Then, since \eta\leq O(B/n), (B.10) yields

 \displaystyle L(\mathbf{W}^{(t+1)})\leq\bigg(1+\frac{C^{\prime}mn^{1/2}\eta}{B^{1/2}k}\bigg)L(\mathbf{W}^{(t)}),

where C^{\prime} is an absolute constant. Taking logarithms on both sides further leads to

 \displaystyle\log\big(L(\mathbf{W}^{(t+1)})\big)\leq\log\big(L(\mathbf{W}^{(t)})\big)+\frac{C^{\prime}mn^{1/2}\eta}{B^{1/2}k},

where we use the inequality \log(1+x)\leq x. By (B.2) and (B.5), we know that

 \displaystyle\mathbb{E}[L(\mathbf{W}^{(t+1)})|\mathbf{W}^{(t)}]\leq L(\mathbf{% W}^{(t)})-\frac{\eta}{4}\|\nabla L(\mathbf{W}^{(t)})\|_{F}^{2}\leq\bigg{(}1-% \frac{C^{\prime\prime}m\phi\eta}{2kn^{2}}\bigg{)}L(\mathbf{W}^{(t)}).

Then by Jensen’s inequality and the inequality \log(1+x)\leq x, we have

 \displaystyle\mathbb{E}\big{[}\log\big{(}L(\mathbf{W}^{(t+1)})\big{)}|\mathbf{% W}^{(t)}\big{]}\leq\log\big{(}\mathbb{E}[L(\mathbf{W}^{(t+1)})|\mathbf{W}^{(t)% }]\big{)}\leq\log\big{(}L(\mathbf{W}^{(t)})\big{)}-\frac{C^{\prime\prime}m\phi% \eta}{2kn^{2}},

which further yields the following by taking expectation over \mathbf{W}^{(t)} and telescoping the sum over t,

 \displaystyle\mathbb{E}\big[\log\big(L(\mathbf{W}^{(t)})\big)\big]\leq\log\big(L(\mathbf{W}^{(0)})\big)-\frac{tC^{\prime\prime}m\phi\eta}{2kn^{2}}. (B.11)

Therefore \{L(\mathbf{W}^{(t)})\}_{t=0,1,\dots} is a supermartingale. By the one-sided Azuma's inequality, we know that with probability at least 1-\delta, the following holds for any t

 \begin{aligned}
 \log\big(L(\mathbf{W}^{(t)})\big) &\leq\mathbb{E}\big[\log\big(L(\mathbf{W}^{(t)})\big)\big]+\frac{C^{\prime}mn^{1/2}\eta}{B^{1/2}k}\sqrt{2t\log(1/\delta)} \\
 &\leq\log\big(L(\mathbf{W}^{(0)})\big)-\frac{tC^{\prime\prime}m\phi\eta}{2kn^{2}}+\frac{C^{\prime}mn^{1/2}\eta}{B^{1/2}k}\sqrt{2t\log(1/\delta)} \\
 &\leq\log\big(L(\mathbf{W}^{(0)})\big)-\frac{tC^{\prime\prime}m\phi\eta}{4kn^{2}}+\frac{C^{\prime 2}mn^{3}\log(1/\delta)\eta}{C^{\prime\prime}kB\phi},
 \end{aligned} (B.12)

where the second inequality is by (B.11) and we use the fact that -at+b\sqrt{t}\leq b^{2}/(4a) in the last inequality. Then we choose \delta=O(m^{-1}) and

 \displaystyle\eta=\frac{\log(2)C^{\prime\prime}kB\phi}{C^{\prime 2}mn^{3}\log(% 1/\delta)}=O\bigg{(}\frac{kB\phi}{n^{3}m\log(m)}\bigg{)}.

Plugging these into (B.12) gives

 \displaystyle\log\big{(}L(\mathbf{W}^{(t)})\big{)}\leq\log\big{(}2L(\mathbf{W}% ^{(0)})\big{)}-\frac{tC^{\prime\prime}m\phi\eta}{4kn^{2}},

which implies that

 \displaystyle L(\mathbf{W}^{(t)})\leq 2L(\mathbf{W}^{(0)})\cdot\exp\bigg{(}-% \frac{tC^{\prime\prime}m\phi\eta}{4kn^{2}}\bigg{)}. (B.13)

By Lemma A.1, we have

 \displaystyle\|\mathbf{G}^{(t)}\|_{2}\leq\|\mathbf{G}^{(t)}\|_{F}\leq O\bigg{(% }\frac{m^{1/2}n^{1/2}\sqrt{L(\mathbf{W}^{(t)})}}{B^{1/2}k^{1/2}}\bigg{)} (B.14)

for all t\leq T. Therefore, plugging (B.13) into (B.14), taking a union bound over all t\leq T, and applying the result in Lemma 4.3, the following holds for all t\leq T with probability at least 1-O(T\cdot m^{-1})-O(n^{-1})=1-O(n^{-1}),

 \displaystyle\|\mathbf{W}^{(t)}_{l}-\mathbf{W}_{l}^{(0)}\|_{2}\leq\sum_{s=0}^{t-1}\eta\|\mathbf{G}^{(s)}\|_{2}\leq O\bigg(\frac{m^{1/2}n^{1/2}}{B^{1/2}k^{1/2}}\bigg)\cdot\sum_{s=0}^{t-1}\eta\sqrt{L(\mathbf{W}^{(s)})}\leq\widetilde{O}\bigg(\frac{k^{1/2}n^{5/2}}{B^{1/2}m^{1/2}\phi}\bigg),

where the first inequality is by triangle inequality, the second inequality follows from (B.14) and the last inequality is by (B.13) and Lemma 4.3. This completes the proof.
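The elementary fact -at+b\sqrt{t}\leq b^{2}/(4a) used in the last step of (B.12) follows from maximizing over \sqrt{t}, with the maximum attained at \sqrt{t}=b/(2a). A small numerical sanity check is sketched below; the helper name `worst_case_gap` and the coefficient values are assumptions chosen only for illustration.

```python
import math

def worst_case_gap(a: float, b: float, t_max: int) -> float:
    """Largest value of -a*t + b*sqrt(t) over integer t in [0, t_max]."""
    return max(-a * t + b * math.sqrt(t) for t in range(t_max + 1))

a, b = 0.02, 0.5           # hypothetical drift and deviation coefficients
bound = b * b / (4.0 * a)  # closed-form maximum over real t >= 0
print(worst_case_gap(a, b, 10_000), bound)
```

In the proof, a corresponds to half of the drift term and b to the Azuma deviation coefficient, so the deviation is absorbed at the cost of an additive constant that the choice of \eta turns into \log(2).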

### B.7 Proof of Lemma A.4

###### Proof of Lemma A.4.

We first write the formula of \nabla\ell\big{(}f_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i}\big{)} as follows

 \displaystyle\nabla\ell\big(f_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i}\big)=\big(\big(f_{\mathbf{W}}(\mathbf{x}_{i})-\mathbf{y}_{i}\big)^{\top}\mathbf{V}\bm{\Sigma}_{i}\big)^{\top}\mathbf{x}_{i}^{\top}.

Here \bm{\Sigma}_{i} is a diagonal matrix with \big(\bm{\Sigma}_{i}\big)_{jj}=\sigma^{\prime}(\langle\mathbf{w}_{j},\mathbf{x}_{i}\rangle). Since \sigma^{\prime}(\cdot)\in\{0,1\}, it holds that

 \displaystyle\|\nabla\ell(f_{\mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i})\|_{2,\infty}\leq\max_{j\in[m]}\big|\langle f_{\mathbf{W}}(\mathbf{x}_{i})-\mathbf{y}_{i},\mathbf{v}_{j}\rangle\big|\cdot\|\mathbf{x}_{i}\|_{2}\leq\max_{j\in[m]}\|f_{\mathbf{W}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}\|\mathbf{v}_{j}\|_{2}, (B.15)

where \mathbf{v}_{j} denotes the j-th column of \mathbf{V} and we use the fact that \|\mathbf{x}_{i}\|_{2}=1. Since \mathbf{v}_{j}\sim\mathcal{N}(0,\mathbf{I}/k), for a sufficiently large absolute constant hidden in O(\log(m)) we have

 \displaystyle\mathbb{P}\big(\|\mathbf{v}_{j}\|_{2}^{2}\geq O\big(\log(m)\big)\big)\leq O(m^{-2}).

Applying union bound over \mathbf{v}_{1},\dots,\mathbf{v}_{m}, we have with probability at least 1-O(m^{-1}),

 \displaystyle\max_{j\in[m]}\|\mathbf{v}_{j}\|_{2}\leq O\big{(}\log^{1/2}(m)% \big{)}.

Plugging this into (B.15) and applying the fact that \|f_{\mathbf{W}}(\mathbf{x}_{i})-\mathbf{y}_{i}\|_{2}=\sqrt{\ell(\mathbf{f}_{% \mathbf{W}}(\mathbf{x}_{i}),\mathbf{y}_{i})}, we are able to complete the proof. ∎
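The union-bound step can be illustrated numerically. The sketch below draws m columns \mathbf{v}_{j}\sim\mathcal{N}(0,\mathbf{I}/k) and compares the largest squared norm with \log(m). It is only a sanity check: the function name `max_column_norm_sq`, the seed, and the sizes m and k are assumptions, not values from the analysis.

```python
import math
import random

def max_column_norm_sq(m: int, k: int, seed: int = 0) -> float:
    """Largest squared norm among m vectors drawn from N(0, I/k) in R^k."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return max(
        sum((scale * rng.gauss(0.0, 1.0)) ** 2 for _ in range(k))
        for _ in range(m)
    )

m, k = 2000, 10  # hypothetical width and output dimension
print(max_column_norm_sq(m, k), math.log(m))
```

Each squared norm has mean 1, so the maximum over m columns grows only through the tail of a scaled chi-square variable, which is what the O(\log(m)) bound captures.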

### B.8 Proof of Lemma A.5

Recall that the output of two-layer ReLU network can be formulated as

 \displaystyle\mathbf{f}_{\mathbf{W}}(\mathbf{x}_{i})=\mathbf{V}\bm{\Sigma}_{i}% \mathbf{W}\mathbf{x}_{i},

where \bm{\Sigma}_{i} is a diagonal matrix with diagonal entries (\bm{\Sigma}_{i})_{jj}=\sigma^{\prime}(\mathbf{w}_{j}^{\top}\mathbf{x}_{i}). Then based on the definition of L(\mathbf{W}), we have

 \begin{aligned}
 L(\widetilde{\mathbf{W}})-L(\widehat{\mathbf{W}}) &=\frac{1}{2n}\sum_{i=1}^{n}\|\mathbf{V}\widetilde{\bm{\Sigma}}_{i}\widetilde{\mathbf{W}}\mathbf{x}_{i}-\mathbf{y}_{i}\|_{2}^{2}-\frac{1}{2n}\sum_{i=1}^{n}\|\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}-\mathbf{y}_{i}\|_{2}^{2} \\
 &=\underbrace{\frac{1}{2n}\sum_{i=1}^{n}\big\langle\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}-\mathbf{y}_{i},\mathbf{V}\widetilde{\bm{\Sigma}}_{i}\widetilde{\mathbf{W}}\mathbf{x}_{i}-\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}\big\rangle}_{I_{1}}+\underbrace{\frac{1}{2n}\sum_{i=1}^{n}\big\|\mathbf{V}\widetilde{\bm{\Sigma}}_{i}\widetilde{\mathbf{W}}\mathbf{x}_{i}-\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}\big\|_{2}^{2}}_{I_{2}}.
 \end{aligned}

Then we tackle the two terms on the R.H.S. of the above equation separately. Regarding the first term, i.e., I_{1}, we have

 \begin{aligned}
 I_{1} &=\frac{1}{2n}\sum_{i=1}^{n}\big\langle\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}-\mathbf{y}_{i},\mathbf{V}\widehat{\bm{\Sigma}}_{i}(\widetilde{\mathbf{W}}-\widehat{\mathbf{W}})\mathbf{x}_{i}\big\rangle \\
 &\quad+\frac{1}{2n}\sum_{i=1}^{n}\big\langle\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}-\mathbf{y}_{i},\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})\widetilde{\mathbf{W}}\mathbf{x}_{i}\big\rangle \\
 &\leq\langle\nabla L(\widehat{\mathbf{W}}),\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\rangle+\frac{1}{2n}\sum_{i=1}^{n}\sqrt{\ell(f_{\widehat{\mathbf{W}}}(\mathbf{x}_{i}),\mathbf{y}_{i})}\cdot\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})\widetilde{\mathbf{W}}\mathbf{x}_{i}\|_{2} \\
 &\leq\langle\nabla L(\widehat{\mathbf{W}}),\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\rangle+\frac{\sqrt{L(\widehat{\mathbf{W}})}}{2}\cdot\max_{i\in[n]}\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})\widetilde{\mathbf{W}}\mathbf{x}_{i}\|_{2},
 \end{aligned}

where the last inequality follows from Jensen’s inequality. Note that the non-zero entries in \widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i} represent the nodes, say j, satisfying \text{sign}(\widetilde{\mathbf{w}}_{j}^{\top}\mathbf{x}_{i})\neq\text{sign}(% \widehat{\mathbf{w}}_{j}^{\top}\mathbf{x}_{i}), which implies \big{|}\widetilde{\mathbf{w}}_{j}^{\top}\mathbf{x}_{i}\big{|}\leq\big{|}(% \widetilde{\mathbf{w}}_{j}-\widehat{\mathbf{w}}_{j})^{\top}\mathbf{x}_{i}\big{|}. Therefore, we have

 \displaystyle\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i% })\widetilde{\mathbf{W}}\mathbf{x}_{i}\|_{2}^{2}\leq\|\mathbf{V}(\widetilde{% \bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})(\widetilde{\mathbf{W}}-\widehat{% \mathbf{W}})\mathbf{x}_{i}\|_{2}^{2}.

By Lemma B.3, we have \|\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i}\|_{0}\leq\|\widetilde{% \bm{\Sigma}}_{i}-\bm{\Sigma}_{i}\|_{0}+\|\widehat{\bm{\Sigma}}_{i}-\bm{\Sigma}% _{i}\|_{0}=O(m\tau^{2/3}). Then we define \bar{\bm{\Sigma}}_{i} as

 \displaystyle\big{(}\bar{\bm{\Sigma}}_{i}\big{)}_{jk}=|\big{(}\widetilde{\bm{% \Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i}\big{)}_{jk}|\quad\mbox{ for all $j,k$}.

Then we have

 \displaystyle\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i% })\widetilde{\mathbf{W}}\mathbf{x}_{i}\|_{2} \displaystyle\leq\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}% }_{i})\bar{\bm{\Sigma}}_{i}(\widetilde{\mathbf{W}}-\widehat{\mathbf{W}})% \mathbf{x}_{i}\|_{2} \displaystyle\leq\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}% }_{i})\|_{2}\cdot\|\bar{\bm{\Sigma}}_{i}(\widetilde{\mathbf{W}}-\widehat{% \mathbf{W}})\|_{F} \displaystyle\leq\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}% }_{i})\|_{2}\cdot\|\bar{\bm{\Sigma}}_{i}\|_{0}^{1/2}\cdot\|\widetilde{\mathbf{% W}}-\widehat{\mathbf{W}}\|_{2,\infty}.

By Lemma B.3, we have with probability 1-O(m\tau^{2/3})

 \displaystyle\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})\widetilde{\mathbf{W}}\mathbf{x}_{i}\|_{2}\leq O(m\sqrt{\log(m)}\tau^{2/3}k^{-1/2})\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2,\infty}.

In what follows we are going to tackle the term I_{2}. Note that for each i, we have

 \begin{aligned}
 \|\mathbf{V}\widetilde{\bm{\Sigma}}_{i}\widetilde{\mathbf{W}}\mathbf{x}_{i}-\mathbf{V}\widehat{\bm{\Sigma}}_{i}\widehat{\mathbf{W}}\mathbf{x}_{i}\|_{2} &\leq\|\mathbf{V}\widehat{\bm{\Sigma}}_{i}(\widetilde{\mathbf{W}}-\widehat{\mathbf{W}})\mathbf{x}_{i}\|_{2}+\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})\widetilde{\mathbf{W}}\mathbf{x}_{i}\|_{2} \\
 &\leq\|\mathbf{V}\|_{2}\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}+\|\mathbf{V}(\widetilde{\bm{\Sigma}}_{i}-\widehat{\bm{\Sigma}}_{i})\|_{2}\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2} \\
 &\leq O(m^{1/2}/k^{1/2})\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2},
 \end{aligned}

where the last inequality holds due to the fact that \|\mathbf{V}\|_{2}=O(m^{1/2}/k^{1/2}) with probability at least 1-\exp(-O(m)). This leads to I_{2}\leq O(m/k)\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}^{2}. Now we can put everything together, and obtain

 \displaystyle L(\widetilde{\mathbf{W}})-L(\widehat{\mathbf{W}}) \displaystyle=I_{1}+I_{2} \displaystyle\leq\langle\nabla L(\widehat{\mathbf{W}}),\widetilde{\mathbf{W}}-% \widehat{\mathbf{W}}\rangle+O(m\sqrt{\log(m)}\tau^{2/3}k^{-1/2})\cdot\sqrt{L(% \widehat{\mathbf{W}})}\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2,\infty} \displaystyle+O(m/k)\cdot\|\widetilde{\mathbf{W}}-\widehat{\mathbf{W}}\|_{2}^{% 2}.

Then applying union bound on the inequality for I_{1} and I_{2}, we are able to complete the proof.

## Appendix C Proof of Technical Lemmas in Appendix B

### C.1 Proof of Lemma B.2

Let \mathbf{z}_{1},\ldots,\mathbf{z}_{n}\in\mathbb{R}^{d} be n vectors with 1/2\leq\min_{i}\{\|\mathbf{z}_{i}\|_{2}\}\leq\max_{i}\{\|\mathbf{z}_{i}\|_{2}\}\leq 2. Let \bar{\mathbf{z}}_{i}=\mathbf{z}_{i}/\|\mathbf{z}_{i}\|_{2} and assume \min_{i\neq j}\|\bar{\mathbf{z}}_{i}-\bar{\mathbf{z}}_{j}\|_{2}\geq\widetilde{\phi}. Then for each \mathbf{z}_{i}, we construct an orthonormal matrix \mathbf{Q}_{i}=[\bar{\mathbf{z}}_{i},\mathbf{Q}_{i}^{\prime}]\in\mathbb{R}^{d\times d}. For a random vector \mathbf{w}\sim\mathcal{N}(0,\mathbf{I}), it follows that \mathbf{u}_{i}:=\mathbf{Q}^{\top}_{i}\mathbf{w}\sim\mathcal{N}(0,\mathbf{I}). Then we can decompose \mathbf{w} as

 \displaystyle\mathbf{w}=\mathbf{Q}_{i}\mathbf{u}_{i}=\mathbf{u}_{i}^{(1)}\bar{% \mathbf{z}}_{i}+\mathbf{Q}^{\prime}_{i}\mathbf{u}_{i}^{\prime}, (C.1)

where \mathbf{u}_{i}^{(1)} denotes the first coordinate of \mathbf{u}_{i} and \mathbf{u}_{i}^{\prime}:=(\mathbf{u}_{i}^{(2)},\dots,\mathbf{u}_{i}^{(d)}). Then let \gamma=\sqrt{\pi}\widetilde{\phi}/(8n), we define the following set of \mathbf{w} based on \mathbf{z}_{i},

 \displaystyle\mathcal{W}_{i}=\big\{\mathbf{w}:|\mathbf{u}_{i}^{(1)}|\leq\gamma,\ |\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\bar{\mathbf{z}}_{j}\rangle|\geq 2\gamma\text{ for all }\bar{\mathbf{z}}_{j}\mbox{ such that }j\neq i\big\}.

Regarding the class of sets \{\mathcal{W}_{1},\dots,\mathcal{W}_{n}\}, we have the following lemma.

###### Lemma C.1.

For each \mathcal{W}_{i} and \mathcal{W}_{j} with i\neq j, we have

 \displaystyle\mathbb{P}(\mathbf{w}\in\mathcal{W}_{i})\geq\frac{\widetilde{\phi}}{n\sqrt{128e}}\quad\mbox{and}\quad\mathcal{W}_{i}\cap\mathcal{W}_{j}=\emptyset.

Then we present the following lemma, which is useful for establishing the required lower bound.

###### Lemma C.2.

For any \mathbf{a}=(a_{1},\ldots,a_{n})^{\top}\in\mathbb{R}^{n}, let \mathbf{h}(\mathbf{w})=\sum_{i=1}^{n}a_{i}\sigma^{\prime}(\langle\mathbf{w},% \mathbf{z}_{i}\rangle)\mathbf{z}_{i} where \mathbf{w}\sim N(\mathbf{0},\mathbf{I}) is a Gaussian random vector. Then it holds that

 \displaystyle\mathbb{P}\big{[}\|\mathbf{h}(\mathbf{w})\|_{2}\geq|a_{i}|/4\big{% |}\mathbf{w}\in\mathcal{W}_{i}\big{]}\geq 1/2.

Now we are able to prove Lemma B.2.

###### Proof of Lemma B.2.

We first prove the result for any fixed \mathbf{u}_{1},\dots,\mathbf{u}_{n}. Then we define a_{i}(\mathbf{v}_{j})=\langle\mathbf{u}_{i},\mathbf{v}_{j}\rangle, \mathbf{w}_{j}=\sqrt{m/2}\mathbf{w}_{L,j} and

 \displaystyle\mathbf{h}(\mathbf{v}_{j},\mathbf{w}_{j})=\sum_{i=1}^{n}a_{i}(% \mathbf{v}_{j})\sigma^{\prime}(\langle\mathbf{w}_{j},\mathbf{x}_{L-1,i}\rangle% )\mathbf{x}_{L-1,i}.

Then we define the event

 \displaystyle\mathcal{E}_{i}=\big{\{}j\in[m]:\mathbf{w}_{j}^{\prime}\in% \mathcal{W}_{i},\|\mathbf{h}(\mathbf{v}_{j},\mathbf{w}_{j})\|_{2}\geq|a_{i}(% \mathbf{v}_{j})|/4,|a_{i}(\mathbf{v}_{j})|\geq\|\mathbf{u}_{i}\|_{2}/\sqrt{k}% \big{\}}.

By Lemma B.1, we know that with high probability 1/2\leq\|\mathbf{x}_{L-1,i}\|_{2}\leq 2 for all i and \big{\|}\mathbf{x}_{L-1,i}/\|\mathbf{x}_{L-1,i}\|_{2}-\mathbf{x}_{L-1,j}/\|% \mathbf{x}_{L-1,j}\|_{2}\big{\|}\geq\phi/2 for all i\neq j. Then by Lemma C.1 we know that \mathcal{E}_{i}\cap\mathcal{E}_{j}=\emptyset if i\neq j and

 \displaystyle\mathbb{P}(j\in\mathcal{E}_{i}) \displaystyle=\mathbb{P}\big{[}\|\mathbf{h}(\mathbf{v}_{j},\mathbf{w}_{j})\|_{% 2}\geq|a_{i}(\mathbf{v}_{j})|/4|\mathbf{w}_{j}^{\prime}\in\mathcal{W}_{i}\big{% ]}\cdot\mathbb{P}\big{[}\mathbf{w}_{j}^{\prime}\in\mathcal{W}_{i}\big{]}\cdot% \mathbb{P}\big{[}|a_{i}(\mathbf{v}_{j})|\geq\|\mathbf{u}_{i}\|_{2}/\sqrt{k}% \big{]} \displaystyle\geq\frac{\phi}{64\sqrt{2}en}, (C.2)

where the equality holds because \mathbf{w}_{j} and \mathbf{v}_{j} are independent, and the inequality follows from Lemmas C.1, C.2 and the fact that \mathbb{P}(|a_{i}(\mathbf{v}_{j})|\geq\|\mathbf{u}_{i}\|_{2}/\sqrt{k})\geq 1/2. Then we have

 \displaystyle\|\nabla_{\mathbf{W}_{L}}L(\mathbf{W})\|_{F}^{2} \displaystyle=\frac{1}{n^{2}}\sum_{j=1}^{m}\|\mathbf{h}(\mathbf{v}_{j},\mathbf% {w}_{j})\|_{2}^{2} \displaystyle\geq\frac{1}{n^{2}}\sum_{j=1}^{m}\|\mathbf{h}(\mathbf{v}_{j},% \mathbf{w}_{j})\|_{2}^{2}\sum_{s=1}^{n}\operatorname*{\mathds{1}}\big{(}j\in% \mathcal{E}_{s}\big{)} \displaystyle\geq\frac{1}{n^{2}}\sum_{j=1}^{m}\sum_{s=1}^{n}\frac{\|\mathbf{u}% _{s}\|_{2}^{2}}{16k}\operatorname*{\mathds{1}}\big{(}j\in\mathcal{E}_{s}\big{)},

where the second inequality holds due to the fact that

 \displaystyle\|\mathbf{h}(\mathbf{v}_{j},\mathbf{w}_{j})\|_{2}^{2}% \operatorname*{\mathds{1}}\big{(}j\in\mathcal{E}_{s}\big{)} \displaystyle\geq\frac{a_{s}^{2}(\mathbf{v}_{j})}{16}\operatorname*{\mathds{1}% }(|a_{s}(\mathbf{v}_{j})|\geq\|\mathbf{u}_{s}\|_{2}/\sqrt{k})\cdot% \operatorname*{\mathds{1}}(j\in\mathcal{E}_{s}) \displaystyle\geq\frac{\|\mathbf{u}_{s}\|_{2}^{2}}{16k}\operatorname*{\mathds{% 1}}(j\in\mathcal{E}_{s}),

where the first inequality follows from the definition of \mathcal{E}_{s}. Then we further define

 \displaystyle Z_{j}=\sum_{s=1}^{n}\frac{\|\mathbf{u}_{s}\|_{2}^{2}}{16k}% \operatorname*{\mathds{1}}\big{(}j\in\mathcal{E}_{s}\big{)},

and provide the following results for \mathbb{E}[Z_{j}] and \text{var}[Z_{j}]

 \displaystyle\mathbb{E}[Z_{j}]=\sum_{s=1}^{n}\frac{\|\mathbf{u}_{s}\|_{2}^{2}}{16k}\mathbb{P}\big(j\in\mathcal{E}_{s}\big),\qquad\text{var}[Z_{j}]=\sum_{s=1}^{n}\frac{\|\mathbf{u}_{s}\|_{2}^{4}}{256k^{2}}\mathbb{P}\big(j\in\mathcal{E}_{s}\big)\big[1-\mathbb{P}\big(j\in\mathcal{E}_{s}\big)\big].

Then by Bernstein's inequality, with probability at least 1-\exp\big(-O\big(m\mathbb{E}[Z_{j}]/\max_{i\in[n]}\|\mathbf{u}_{i}\|_{2}^{2}\big)\big), it holds that

 \displaystyle\sum_{j=1}^{m}Z_{j}\geq\frac{m}{2}\mathbb{E}[Z_{j}]\geq\sum_{i=1}^{n}\frac{\|\mathbf{u}_{i}\|_{2}^{2}}{32k}\cdot\frac{m\phi}{64\sqrt{2}en}=\frac{C\phi m\sum_{i=1}^{n}\|\mathbf{u}_{i}\|_{2}^{2}}{kn},

where the second inequality follows from (C.2) and C=1/(2048\sqrt{2}e) is an absolute constant. Therefore, with probability at least 1-\exp\big(-O(m\phi/(kn))\big) we have

 \displaystyle\sum_{j=1}^{m}\bigg\|\frac{1}{n}\sum_{i=1}^{n}\langle\mathbf{u}_{i},\mathbf{v}_{j}\rangle\sigma^{\prime}\big(\langle\mathbf{w}_{L,j},\mathbf{x}_{L-1,i}\rangle\big)\mathbf{x}_{L-1,i}\bigg\|_{2}^{2}\geq\frac{1}{n^{2}}\sum_{j=1}^{m}Z_{j}\geq\frac{C\phi m\sum_{i=1}^{n}\|\mathbf{u}_{i}\|_{2}^{2}}{kn^{3}}.

So far we have completed the proof for one particular vector collection \{\mathbf{u}_{i}\}_{i=1,\dots,n}. Next we prove that the above inequality holds for arbitrary \{\mathbf{u}_{i}\}_{i=1,\dots,n} with high probability. Taking an \epsilon-net over all possible vectors \{\mathbf{u}_{1},\dots,\mathbf{u}_{n}\}\in(\mathbb{R}^{k})^{n} and applying a union bound, the above inequality holds with probability at least 1-\exp\big(-O(m\phi/(kn))+nk\log(nk)\big). Since we have m\geq\widetilde{O}\big(\phi^{-1}n^{2}k^{2}\big), the desired result holds for all choices of \{\mathbf{u}_{1},\dots,\mathbf{u}_{n}\}. ∎

## Appendix D Proof of Auxiliary Lemmas in Appendix C

###### Proof of Lemma C.1.

We first prove that any two sets \mathcal{W}_{i} and \mathcal{W}_{j} with i\neq j do not overlap. Consider a vector \mathbf{w}\in\mathcal{W}_{i} with the decomposition

 \displaystyle\mathbf{w}=\mathbf{u}_{i}^{(1)}\bar{\mathbf{z}}_{i}+\mathbf{Q}_{i% }^{\prime}\mathbf{u}_{i}^{\prime}.

Then based on the definition of \mathcal{W}_{i} we have,

 \displaystyle\langle\mathbf{w},\bar{\mathbf{z}}_{j}\rangle=\langle\mathbf{u}^{% (1)}_{i}\bar{\mathbf{z}}_{i}+\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},% \bar{\mathbf{z}}_{j}\rangle=\mathbf{u}_{i}^{(1)}\langle\bar{\mathbf{z}}_{i},% \bar{\mathbf{z}}_{j}\rangle+\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{% \prime},\bar{\mathbf{z}}_{j}\rangle.

Since \mathbf{w}\in\mathcal{W}_{i}, we have |\mathbf{u}_{i}^{(1)}|\leq\gamma and |\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\bar{\mathbf{z}}_{j}\rangle|\geq 2\gamma. Therefore, noting that |\langle\bar{\mathbf{z}}_{i},\bar{\mathbf{z}}_{j}\rangle|\leq 1, it holds that

 \displaystyle|\langle\mathbf{w},\bar{\mathbf{z}}_{j}\rangle|\geq\big{|}|% \langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\bar{\mathbf{z}}_{j}% \rangle|-|\mathbf{u}_{i}^{(1)}|\big{|}>\gamma. (D.1)

Note that the set \mathcal{W}_{j} requires |\mathbf{u}_{j}^{(1)}|=|\langle\mathbf{w},\bar{\mathbf{z}}_{j}\rangle|\leq\gamma, which conflicts with (D.1). This immediately implies that \mathcal{W}_{i}\cap\mathcal{W}_{j}=\emptyset.

Then we are going to compute the probability \mathbb{P}(\mathbf{w}\in\mathcal{W}_{i}). Based on the parameter \gamma, we define the following two events

 \displaystyle\mathcal{E}_{1}(\gamma)=\big{\{}|\mathbf{u}_{i}^{(1)}|\leq\gamma% \big{\}},~{}\mathcal{E}_{2}(\gamma)=\big{\{}|\langle\mathbf{Q}_{i}^{\prime}% \mathbf{u}_{i}^{\prime},\bar{\mathbf{z}}_{j}\rangle|\geq 2\gamma\text{ for all% }\bar{\mathbf{z}}_{j},j\neq i\big{\}}.

Evidently, we have \mathbb{P}(\mathbf{w}\in\mathcal{W}_{i})=\mathbb{P}(\mathcal{E}_{1})\mathbb{P}% (\mathcal{E}_{2}). Since \mathbf{u}_{i}^{(1)} is a standard Gaussian random variable, we have

 \displaystyle\mathbb{P}(\mathcal{E}_{1})=\frac{1}{\sqrt{2\pi}}\int_{-\gamma}^{% \gamma}\exp\bigg{(}-\frac{1}{2}x^{2}\bigg{)}\mathrm{d}x\geq\sqrt{\frac{2}{\pi e% }}\gamma.

Moreover, by definition, for any j=1,\ldots,n we have

 \displaystyle\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\bar{% \mathbf{z}}_{j}\rangle\sim N\big{[}0,1-(\bar{\mathbf{z}}_{i}^{\top}\bar{% \mathbf{z}}_{j})^{2}\big{]}.

Note that for any j\neq i we have \|\bar{\mathbf{z}}_{i}-\bar{\mathbf{z}}_{j}\|_{2}\geq\widetilde{\phi}, then it follows that

 |\langle\bar{\mathbf{z}}_{i},\bar{\mathbf{z}}_{j}\rangle|\leq 1-\widetilde{\phi}^{2}/2,

and if \widetilde{\phi}^{2}\leq 2, then

 \displaystyle 1-(\bar{\mathbf{z}}_{i}^{\top}\bar{\mathbf{z}}_{j})^{2}\geq% \widetilde{\phi}^{2}-\widetilde{\phi}^{4}/4\geq\widetilde{\phi}^{2}/2.

Therefore for any j\neq i,

 \displaystyle\mathbb{P}[|\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime}% ,\bar{\mathbf{z}}_{j}\rangle|<2\gamma]=\frac{1}{\sqrt{2\pi}}\int_{-2[1-(\bar{% \mathbf{z}}_{i}^{\top}\bar{\mathbf{z}}_{j})^{2}]^{-1/2}\gamma}^{2[1-(\bar{% \mathbf{z}}_{i}^{\top}\bar{\mathbf{z}}_{j})^{2}]^{-1/2}\gamma}\exp\bigg{(}-% \frac{1}{2}x^{2}\bigg{)}\mathrm{d}x\leq\sqrt{\frac{8}{\pi}}\frac{\gamma}{[1-(% \bar{\mathbf{z}}_{i}^{\top}\bar{\mathbf{z}}_{j})^{2}]^{1/2}}\leq\frac{4}{\sqrt% {\pi}}\gamma\widetilde{\phi}^{-1}.

By a union bound over all j\in[n] with j\neq i, we have

 \displaystyle\mathbb{P}(\mathcal{E}_{2})=\mathbb{P}\big[|\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\bar{\mathbf{z}}_{j}\rangle|\geq 2\gamma\text{ for all }j\neq i\big]\geq 1-\frac{4}{\sqrt{\pi}}n\gamma\widetilde{\phi}^{-1}.

Therefore we have

 \displaystyle\mathbb{P}(\mathbf{w}\in\mathcal{W}_{i})\geq\sqrt{\frac{2}{\pi e}% }\gamma\cdot\bigg{(}1-\frac{4}{\sqrt{\pi}}n\gamma\widetilde{\phi}^{-1}\bigg{)}.

Plugging in \gamma=\sqrt{\pi}\widetilde{\phi}/(8n), it holds that \mathbb{P}(\mathbf{w}\in\mathcal{W}_{i})\geq\widetilde{\phi}/(\sqrt{128e}n). This completes the proof. ∎
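The arithmetic in the last step can be checked numerically. The sketch below evaluates the product bound at \gamma=\sqrt{\pi}\widetilde{\phi}/(8n) and compares it with \widetilde{\phi}/(n\sqrt{128e}); it also checks the Gaussian mass bound \mathbb{P}(|g|\leq\gamma)\geq\sqrt{2/(\pi e)}\gamma for a few values of \gamma\leq 1. The function names and the sample values of \widetilde{\phi} and n are assumptions made for illustration.

```python
import math

def prob_lower_bound(phi: float, n: int) -> float:
    """sqrt(2/(pi*e)) * gamma * (1 - 4*n*gamma/(sqrt(pi)*phi)) at gamma = sqrt(pi)*phi/(8n)."""
    gamma = math.sqrt(math.pi) * phi / (8.0 * n)
    return math.sqrt(2.0 / (math.pi * math.e)) * gamma * (
        1.0 - 4.0 * n * gamma / (math.sqrt(math.pi) * phi)
    )

def gaussian_mass(gamma: float) -> float:
    """P(|g| <= gamma) for a standard Gaussian g, via the error function."""
    return math.erf(gamma / math.sqrt(2.0))

phi, n = 0.3, 50  # hypothetical separation and sample size
print(prob_lower_bound(phi, n), phi / (n * math.sqrt(128.0 * math.e)))
```

With this choice of \gamma the second factor equals exactly 1/2, which is why the two printed numbers agree up to floating-point error.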

###### Proof of Lemma C.2.

Recall the decomposition of \mathbf{w} in (C.1),

 \displaystyle\mathbf{w}=\mathbf{u}_{i}^{(1)}\bar{\mathbf{z}}_{i}+\mathbf{Q}_{i% }^{\prime}\mathbf{u}_{i}^{\prime}.

Define the event \mathcal{E}_{i}:=\{\mathbf{w}\in\mathcal{W}_{i}\}. Then conditioning on \mathcal{E}_{i}, we have

 \begin{aligned}
 \mathbf{h}(\mathbf{w}) &=\sum_{j=1}^{n}a_{j}\sigma^{\prime}(\langle\mathbf{w},\mathbf{z}_{j}\rangle)\mathbf{z}_{j} \\
 &=a_{i}\sigma^{\prime}(\mathbf{u}_{i}^{(1)})\mathbf{z}_{i}+\sum_{j\neq i}a_{j}\sigma^{\prime}\big(\mathbf{u}^{(1)}_{i}\langle\bar{\mathbf{z}}_{i},\mathbf{z}_{j}\rangle+\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\mathbf{z}_{j}\rangle\big)\mathbf{z}_{j} \\
 &=a_{i}\sigma^{\prime}(\mathbf{u}_{i}^{(1)})\mathbf{z}_{i}+\sum_{j\neq i}a_{j}\sigma^{\prime}\big(\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\mathbf{z}_{j}\rangle\big)\mathbf{z}_{j},
 \end{aligned} (D.2)

where the last equality follows from the fact that conditioning on event \mathcal{E}_{i}, for all j\neq i, it holds that |\langle\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime},\mathbf{z}_{j}\rangle|\geq 2\gamma\|\mathbf{z}_{j}\|_{2}\geq|\mathbf{u}_{i}^{(1)}|\|\mathbf{z}_{j}\|_{2}\geq|\mathbf{u}_{i}^{(1)}\langle\bar{\mathbf{z}}_{i},\mathbf{z}_{j}\rangle|. We then consider two cases: \mathbf{u}_{i}^{(1)}>0 and \mathbf{u}_{i}^{(1)}<0, which are equally likely conditioning on the event \mathcal{E}_{i}. Letting u_{1}>0 and u_{2}<0 denote \mathbf{u}_{i}^{(1)} in these two cases, we have

 \displaystyle\mathbb{P}\bigg{[}\|\mathbf{h}(\mathbf{w})\|_{2}\geq\inf_{{u_{1}>% 0,u_{2}<0}}\max\big{\{}\big{\|}\mathbf{h}(u_{1}\mathbf{z}_{i}+\mathbf{Q}_{i}^{% \prime}\mathbf{u}_{i}^{\prime})\big{\|}_{2},\big{\|}\mathbf{h}(u_{2}\mathbf{z}% _{i}+\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime})\big{\|}_{2}\big{\}}\bigg{% |}\mathcal{E}_{i}\bigg{]}\geq 1/2.

By the inequality \max\{\|\mathbf{a}\|_{2},\|\mathbf{b}\|_{2}\}\geq\|\mathbf{a}-\mathbf{b}\|_{2}/2, we have

 \displaystyle\mathbb{P}\bigg[\|\mathbf{h}(\mathbf{w})\|_{2}\geq\inf_{u_{1}>0,u_{2}<0}\big\|\mathbf{h}(u_{1}\mathbf{z}_{i}+\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime})-\mathbf{h}(u_{2}\mathbf{z}_{i}+\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime})\big\|_{2}/2\bigg|\mathcal{E}_{i}\bigg]\geq 1/2. (D.3)

For any u_{1}>0 and u_{2}<0, denote \mathbf{w}_{1}=u_{1}\mathbf{z}_{i}+\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime}, \mathbf{w}_{2}=u_{2}\mathbf{z}_{i}+\mathbf{Q}_{i}^{\prime}\mathbf{u}_{i}^{\prime}. We now proceed to give a lower bound for \|\mathbf{h}(\mathbf{w}_{1})-\mathbf{h}(\mathbf{w}_{2})\|_{2}. By (D.2), we have

 \displaystyle\|\mathbf{h}(\mathbf{w}_{1})-\mathbf{h}(\mathbf{w}_{2})\|_{2}=\|a_{i}\mathbf{z}_{i}\|_{2}\geq|a_{i}|/2, (D.4)

where we use the fact that \|\mathbf{z}_{i}\|_{2}\geq 1/2. Plugging this back into (D.3), we have

 \displaystyle\mathbb{P}\big{[}\|\mathbf{h}(\mathbf{w})\|_{2}\geq|a_{i}|/4\big{% |}\mathcal{E}_{i}\big{]}\geq 1/2.

This completes the proof. ∎
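The key step, that flipping the sign of \mathbf{u}_{i}^{(1)} changes \mathbf{h} by exactly a_{i}\mathbf{z}_{i} when \mathbf{w}\in\mathcal{W}_{i}, can be seen on a tiny example. The sketch below uses three unit vectors in \mathbb{R}^{3} and takes i=1; all data, the helper names, and the implicit threshold \gamma=0.1 are hypothetical choices made so that both test points lie in \mathcal{W}_{1}.

```python
import math

def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken to be 0 at the origin)."""
    return 1.0 if x > 0 else 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def h(w, a, zs):
    """h(w) = sum_i a_i * sigma'(<w, z_i>) * z_i."""
    out = [0.0] * len(w)
    for a_i, z in zip(a, zs):
        g = relu_grad(dot(w, z))
        out = [o + a_i * g * z_c for o, z_c in zip(out, z)]
    return out

def norm(v):
    return math.sqrt(dot(v, v))

# Hypothetical data: three unit vectors, coefficients a, and i = 1 (index 0).
zs = [(1.0, 0.0, 0.0), (0.6, 0.8, 0.0), (0.0, 0.0, 1.0)]
a = (2.0, -1.0, 3.0)
q = (0.0, 1.0, -1.0)            # component orthogonal to z_1
w_pos = (0.05, q[1], q[2])      # u_i^(1) = +0.05
w_neg = (-0.05, q[1], q[2])     # u_i^(1) = -0.05

diff = [x - y for x, y in zip(h(w_pos, a, zs), h(w_neg, a, zs))]
print(diff, max(norm(h(w_pos, a, zs)), norm(h(w_neg, a, zs))))
```

Here the activation patterns of \mathbf{z}_{2} and \mathbf{z}_{3} are unaffected by the sign flip, so only the term a_{1}\mathbf{z}_{1} survives in the difference, and the larger of the two norms is at least |a_{1}|\|\mathbf{z}_{1}\|_{2}/2.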

## References

• Allen-Zhu et al. (2018a) Allen-Zhu, Z., Li, Y. and Liang, Y. (2018a). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
• Allen-Zhu et al. (2018b) Allen-Zhu, Z., Li, Y. and Song, Z. (2018b). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.
• Allen-Zhu et al. (2018c) Allen-Zhu, Z., Li, Y. and Song, Z. (2018c). On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065.
• Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
• Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org.
• Cao and Gu (2019) Cao, Y. and Gu, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384.
• Chizat and Bach (2018) Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
• Du and Lee (2018) Du, S. S. and Lee, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.
• Du et al. (2018a) Du, S. S., Lee, J. D., Li, H., Wang, L. and Zhai, X. (2018a). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.
• Du et al. (2017) Du, S. S., Lee, J. D. and Tian, Y. (2017). When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129.
• Du et al. (2018b) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2018b). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
• He et al. (2015) He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision.
• Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
• Li and Liang (2018) Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204.
• Li and Yuan (2017) Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886.
• Oymak and Soltanolkotabi (2019) Oymak, S. and Soltanolkotabi, M. (2019). Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674.
• Tian (2017) Tian, Y. (2017). An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560.
• Wu et al. (2019) Wu, X., Du, S. S. and Ward, R. (2019). Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111.
• Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
• Zhang et al. (2019) Zhang, H., Yu, D., Chen, W. and Liu, T.-Y. (2019). Training over-parameterized deep resnet is almost as easy as training a two-layer network. arXiv preprint arXiv:1903.07120.
• Zhang et al. (2018) Zhang, X., Yu, Y., Wang, L. and Gu, Q. (2018). Learning one-hidden-layer ReLU networks via gradient descent. arXiv preprint arXiv:1806.07808.
• Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L. and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.
• Zou et al. (2018) Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888.