Stochastic Gradient Descent Optimizes
Over-parameterized Deep ReLU Networks
We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning and pave the way for studying the optimization dynamics of training modern deep neural networks.
Deep neural networks have achieved great success in many applications such as image processing (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012) and the game of Go (Silver et al., 2016). However, the reason why deep networks work well in these fields has long remained a mystery. Different lines of research try to understand the mechanism of deep neural networks from different aspects. For example, a series of works tries to understand how the expressive power of deep neural networks is related to their architecture, including the width of each layer and the depth of the network (Telgarsky, 2015, 2016; Lu et al., 2017; Liang and Srikant, 2016; Yarotsky, 2017, 2018; Hanin, 2017; Hanin and Sellke, 2017). These works show that multi-layer networks with wide layers can approximate arbitrary continuous functions.
In this paper, we mainly focus on the optimization perspective of deep neural networks. It is well known that without any additional assumption, even training a shallow neural network is an NP-hard problem (Blum and Rivest, 1989). Researchers have made various assumptions to get a better theoretical understanding of training neural networks, such as the Gaussian input assumption (Brutzkus et al., 2017; Du et al., 2017a; Zhong et al., 2017) and the independent activation assumption (Choromanska et al., 2015; Kawaguchi, 2016). A recent line of work tries to understand the optimization process of training deep neural networks from two aspects: over-parameterization and random weight initialization. It has been observed that over-parameterization and proper random initialization can help the optimization in training neural networks, and various theoretical results have been established (Safran and Shamir, 2017; Du and Lee, 2018; Arora et al., 2018a; Allen-Zhu et al., 2018c; Du et al., 2018b; Li and Liang, 2018). More specifically, Safran and Shamir (2017) showed that over-parameterization can help reduce the spurious local minima in one-hidden-layer neural networks with the Rectified Linear Unit (ReLU) activation function. Du and Lee (2018) showed that with over-parameterization, all local minima in one-hidden-layer networks with the quadratic activation function are global minima. Arora et al. (2018b) showed that over-parameterization introduced by depth can accelerate the training process using gradient descent (GD). Allen-Zhu et al. (2018c) showed that with over-parameterization and random weight initialization, both gradient descent and stochastic gradient descent (SGD) can find the global minima of recurrent neural networks.
The works most related to ours are Li and Liang (2018) and Du et al. (2018b). Li and Liang (2018) showed that for a one-hidden-layer network with ReLU activation function using over-parameterization and random initialization, GD and SGD can find near-globally-optimal solutions in polynomial time with respect to the accuracy parameter , training sample size and the data separation parameter (footnote 1: more precisely, Li and Liang (2018) assumed that each data point is generated from distributions , and is defined as ). Du et al. (2018b) showed that under the assumption that the population Gram matrix is not degenerate (footnote 2: more precisely, Du et al. (2018b) assumed that the minimal singular value of is greater than a constant, where is defined as and are data points), randomly initialized GD converges to a globally optimal solution of a one-hidden-layer network with ReLU activation function and quadratic loss function. However, both Li and Liang (2018) and Du et al. (2018b) only characterized the behavior of gradient-based methods on one-hidden-layer shallow neural networks rather than on the deep neural networks that are widely used in practice.
In this paper, we aim to advance this line of research by studying the optimization properties of gradient-based methods for deep ReLU neural networks. Specifically, we consider a -hidden-layer fully-connected neural network with ReLU activation function. Similar to the one-hidden-layer case studied in Li and Liang (2018) and Du et al. (2018b), we study the binary classification problem and show that both GD and SGD can achieve global minima of the training loss for any , with the aid of over-parameterization and random initialization. The core of our analysis is to show that Gaussian random initialization followed by (stochastic) gradient descent generates a sequence of iterates within a small perturbation region centered around the initial weights. In addition, we will show that the empirical loss function of deep ReLU networks has very good local curvature properties inside the perturbation region, which guarantees the global convergence of (stochastic) gradient descent. More specifically, our main contributions are summarized as follows:
We show that with Gaussian random initialization on each layer, when the number of hidden nodes per layer is at least , GD can achieve zero training error within iterations, where is the data separation distance, is the number of training examples, and is the number of hidden layers. Our result can be applied to a broad family of loss functions, as opposed to cross entropy loss studied in Li and Liang (2018) and quadratic loss considered in Du et al. (2018b).
We also prove a similar convergence result for SGD. We show that with Gaussian random initialization on each layer, when the number of hidden nodes per layer is at least , SGD can also achieve zero training error within iterations.
Our proof makes only the so-called data separation assumption on the input data, which is more general than the assumption made in Du et al. (2018b).
When we were preparing this manuscript, we were informed that two concurrent works, Allen-Zhu et al. (2018b) and Du et al. (2018a), had appeared online very recently. Compared with Allen-Zhu et al. (2018b), our work bears a similar proof idea at a high level, by extending the results for two-layer ReLU networks in Li and Liang (2018) to deep ReLU networks. Compared with Du et al. (2018a), our work is based on a different assumption on the training data and is able to deal with the nonsmooth ReLU activation function.
The remainder of this paper is organized as follows: In Section 2, we discuss the literature that is most related to our work. In Section 3, we introduce the problem setup and preliminaries of our work. In Sections 4 and 5, we present our main theoretical results and their proofs respectively. We conclude our work and discuss some future work in Section 6.
2 Related Work
Due to the huge amount of literature on deep learning theory, we are not able to cover all papers in this area here. Instead, we review the following three major lines of research, which are most related to our work.
One-hidden-layer neural networks with ground truth parameters Recently a series of works (Tian, 2017; Brutzkus and Globerson, 2017; Li and Yuan, 2017; Du et al., 2017a, b; Zhang et al., 2018) studied a specific class of shallow two-layer (one-hidden-layer) neural networks, whose training data are generated by a ground truth network called the “teacher network”. This series of works aims to provide recovery guarantees for gradient-based methods to learn the teacher networks based on either the population or empirical loss functions. More specifically, Tian (2017) proved that for two-layer networks with ReLU activation function and only one hidden neuron, GD with random initialization on the population loss function is able to recover the hidden teacher network. Brutzkus and Globerson (2017) proved that GD can learn the true parameters of a two-layer network with a convolution filter. Li and Yuan (2017) proved that SGD can recover the underlying parameters of a two-layer residual network in polynomial time. Du et al. (2017a) showed that both GD and SGD can recover the true parameters of a two-layer neural network with a convolution filter. Du et al. (2017b) proved that GD can recover the teacher network of a two-layer CNN with ReLU activation function. Zhang et al. (2018) showed that GD on an empirical loss function can recover the ground truth parameters of one-hidden-layer ReLU networks with one neuron at a linear rate.
Deep linear networks Beyond the shallow one-hidden-layer neural networks, a series of recent works (Hardt and Ma, 2016; Kawaguchi, 2016; Bartlett et al., 2018; Arora et al., 2018a, b) focuses on the optimization landscape of deep linear networks. More specifically, Hardt and Ma (2016) showed that deep linear residual networks have no spurious local minima. Kawaguchi (2016) proved that all local minima are global minima in deep linear networks. Arora et al. (2018b) showed that depth can accelerate the optimization of deep linear networks. Bartlett et al. (2018) proved that with identity initialization and a proper regularizer, GD can converge to the least squares solution on a residual linear network with quadratic loss function, while Arora et al. (2018a) proved the same properties for general deep linear networks.
Generalization bounds for deep neural networks The phenomenon that deep neural networks generalize better than shallow neural networks has been observed in practice for a long time (Langford and Caruana, 2002). Besides classical VC-dimension based results (Vapnik, 2013; Anthony and Bartlett, 2009), a vast literature has recently studied the connection between the generalization performance of deep neural networks and their architecture (Neyshabur et al., 2015, 2017a, 2017b; Bartlett et al., 2017; Golowich et al., 2017; Arora et al., 2018c; Allen-Zhu et al., 2018a). More specifically, Neyshabur et al. (2015) derived Rademacher complexity for a class of norm-constrained feed-forward neural networks with ReLU activation function. Bartlett et al. (2017) derived margin bounds for deep ReLU networks based on Rademacher complexity and covering number. Neyshabur et al. (2017a, b) also derived similar spectrally-normalized margin bounds for deep neural networks with ReLU activation function using a PAC-Bayes approach. Golowich et al. (2017) studied size-independent sample complexity of deep neural networks and showed that the sample complexity can be independent of both depth and width under additional assumptions. Arora et al. (2018c) proved generalization bounds via a compression-based framework. Allen-Zhu et al. (2018a) showed that an over-parameterized one-hidden-layer neural network can learn a one-hidden-layer neural network with fewer parameters using SGD up to a small generalization error, while similar results also hold for over-parameterized two-hidden-layer neural networks.
3 Problem Setup and Preliminaries
We use lower case, lower case bold face, and upper case bold face letters to denote scalars, vectors and matrices respectively. For a positive integer , we denote . For a vector , we denote by the norm of , the norm of , and the norm of . We use to denote a square diagonal matrix with the elements of vector on the main diagonal. For a matrix , we use to denote the Frobenius norm of , to denote the spectral norm (maximum singular value), and to denote the number of nonzero entries. We denote by the unit sphere in .
For two sequences and , we use to denote that for some absolute constant , and use to denote that for some absolute constant . In addition, we also use and to hide some logarithmic terms in Big-O and Big-Omega notations. We also use the following matrix product notation. For indices and a collection of matrices , we denote
3.2 Problem Setup
Let be a set of training examples. Let . We consider -hidden-layer neural networks as follows:
where is the entry-wise ReLU activation function, , are the weight matrices, and is the fixed output layer weight vector with half and half entries. Let be the collection of matrices , we consider solving the following empirical risk minimization problem:
where . Regarding the loss function , we make the following assumptions.
The loss function is continuous, and satisfies , as well as .
Assumption 3.1 has been widely made in the studies of training binary classifiers (Soudry et al., 2017; Nacson et al., 2018; Ji and Telgarsky, 2018). In addition, we require the following assumption which provides an upper bound on the derivative of .
There exists a constant , such that for any we have
where , and are positive constants. Note that can be .
This assumption holds for most of the popular loss functions, such as hinge loss , squared hinge loss , exponential loss , and cross entropy loss .
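The four losses named above can be written down concretely. The sketch below uses the standard textbook forms of these losses as functions of the margin; these forms are an assumption, since the exact expressions are not legible in this copy of the text. It checks on a grid that each loss is non-increasing in the margin.

```python
import math

# Standard textbook forms, written as functions of the margin x = y * f(x_i).
# These are assumed forms, not expressions taken from the text.

def hinge(x):
    return max(0.0, 1.0 - x)

def squared_hinge(x):
    return max(0.0, 1.0 - x) ** 2

def exponential(x):
    return math.exp(-x)

def cross_entropy(x):
    # log(1 + exp(-x)), split by sign so exp never receives a large positive argument.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

LOSSES = {
    "hinge": hinge,
    "squared_hinge": squared_hinge,
    "exponential": exponential,
    "cross_entropy": cross_entropy,
}

def is_nonincreasing(loss, grid):
    """Check loss(grid[k]) >= loss(grid[k + 1]) for an increasing grid."""
    values = [loss(x) for x in grid]
    return all(a >= b for a, b in zip(values, values[1:]))

grid = [i / 10.0 for i in range(-50, 51)]
for name, loss in LOSSES.items():
    print(name, loss(0.0), is_nonincreasing(loss, grid))
```

A grid check is only a sanity test, not a proof of monotonicity.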
The loss function is -Lipschitz, i.e., for all .
The loss function is -smooth, i.e., for all .
The above two assumptions are standard in analyzing the convergence properties of (stochastic) gradient descent, and have also been made in Li and Liang (2018).
In addition, we make the following assumptions on the training data.
for all .
for all , for some .
Assumption 3.6 was inspired by the assumption made in Li and Liang (2018), where they assumed that the training data in each class are supported on a mixture of components, and any two components are separated by a certain distance. In this assumption, we treat each training example as a component and define by the minimum distance between any two examples. The same assumption was also made in Allen-Zhu et al. (2018b).
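As a small illustration of the separation quantity described above, the following sketch computes the minimum pairwise Euclidean distance over a training set. Rescaling the inputs to unit norm first is an assumption made here for the example, and the function names are hypothetical.

```python
import math

def normalize(x):
    """Rescale a vector to unit Euclidean norm (an assumption of this sketch)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def min_pairwise_distance(points):
    """Minimum Euclidean distance between any two distinct examples."""
    best = float("inf")
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            best = min(best, dist)
    return best

raw = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
data = [normalize(x) for x in raw]
separation = min_pairwise_distance(data)
print(separation)
```

A separation of zero would mean two examples coincide, which is the case the assumption rules out.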
Define , . We assume that .
3.3 Optimization Algorithms
In this paper, we consider training a deep neural network with Gaussian initialization followed by gradient descent/stochastic gradient descent.
Gaussian initialization. We say that the weight matrices are generated from Gaussian initialization if each column of is generated independently from the Gaussian distribution for all .
Gradient descent. We consider solving the empirical risk minimization problem (3.2) with gradient descent initialized with Gaussian initialization: let be weight matrices generated from Gaussian initialization, we consider the following gradient descent update rule:
where is the step size (a.k.a., learning rate).
Stochastic gradient descent. We also consider solving (3.2) using stochastic gradient descent with Gaussian initialization. Again, let be generated from Gaussian initialization. At the -th iteration, a minibatch of training examples with batch size is sampled from the training set, and the stochastic gradient is calculated as follows:
The update rule for stochastic gradient descent is then defined as follows:
where is the step size.
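The two update rules differ only in which examples contribute to the gradient. The sketch below is a minimal illustration on a hypothetical linear model with logistic loss rather than the deep network of this paper. It assumes the stochastic gradient is the average over the minibatch and that the minibatch is sampled without replacement; neither detail is legible in this copy of the text.

```python
import math
import random

def example_grad(w, x, y):
    """Gradient of log(1 + exp(-y * <w, x>)) with respect to w (toy linear model)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * xi for xi in x]

def minibatch_grad(w, data, indices):
    """Average of per-example gradients over the given indices (assumed normalization)."""
    grad = [0.0] * len(w)
    for i in indices:
        x, y = data[i]
        grad = [a + b for a, b in zip(grad, example_grad(w, x, y))]
    return [g / len(indices) for g in grad]

def gd_step(w, data, eta):
    """One gradient descent step using every training example."""
    g = minibatch_grad(w, data, range(len(data)))
    return [wi - eta * gi for wi, gi in zip(w, g)]

def sgd_step(w, data, eta, batch_size, rng):
    """One SGD step on a minibatch sampled without replacement (assumed scheme)."""
    batch = rng.sample(range(len(data)), batch_size)
    g = minibatch_grad(w, data, batch)
    return [wi - eta * gi for wi, gi in zip(w, g)]

rng = random.Random(0)
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([-1.0, 0.0], -1), ([0.0, -1.0], 1)]
w0 = [0.1, -0.2]
w_gd = gd_step(w0, data, eta=0.5)
w_sgd_full = sgd_step(w0, data, eta=0.5, batch_size=len(data), rng=rng)
w_sgd = sgd_step(w0, data, eta=0.5, batch_size=2, rng=rng)
print(w_gd, w_sgd)
```

With the batch size equal to the training set size and sampling without replacement, the SGD step coincides with the GD step up to floating-point summation order.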
Here we briefly introduce some useful notations and provide some basic calculations regarding the neural network under our setting.
Output after the -th layer: Given an input , the output of the neural network after the -th layer is
where (footnote 3: here we slightly abuse the notation and denote for a vector ), and for .
Output of the neural network: The output of the neural network with input is as follows:
where we define and the last equality holds for any .
Gradient of the neural network: The partial gradient of the training loss with respect to is as follows:
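The forward pass and the partial gradients described in this subsection can be sketched in plain Python. The block below is an illustration under stated assumptions: the initialization variance 2/m (He-style), the +1/-1 split of the fixed output vector, the cross-entropy loss, and the 1/n averaging of the training loss are choices made for the example, since the corresponding expressions are not legible in this copy. All function names are hypothetical. A central finite difference checks one gradient entry.

```python
import math
import random

def init_weights(dims, rng):
    """dims = [d, m_1, ..., m_L]; layer l is stored as m_l columns of length m_{l-1}.
    Entries are drawn from N(0, 2/m_l), an assumed He-style variance."""
    return [[[rng.gauss(0.0, math.sqrt(2.0 / dims[l + 1])) for _ in range(dims[l])]
             for _ in range(dims[l + 1])]
            for l in range(len(dims) - 1)]

def forward(weights, v, x):
    """Return the network output and the activations [x_0, ..., x_L]."""
    acts = [x]
    for layer in weights:
        prev = acts[-1]
        acts.append([max(0.0, sum(w * a for w, a in zip(col, prev))) for col in layer])
    return sum(vj * aj for vj, aj in zip(v, acts[-1])), acts

def loss(z):
    return math.log1p(math.exp(-z))  # cross-entropy on the margin (assumed choice)

def loss_prime(z):
    return -1.0 / (1.0 + math.exp(z))

def training_loss(weights, v, data):
    return sum(loss(y * forward(weights, v, x)[0]) for x, y in data) / len(data)

def training_grad(weights, v, data):
    """Partial gradients of the training loss w.r.t. every weight, via backpropagation."""
    grads = [[[0.0] * len(col) for col in layer] for layer in weights]
    for x, y in data:
        f, acts = forward(weights, v, x)
        scale = loss_prime(y * f) * y / len(data)
        back = list(v)  # derivative of the output w.r.t. the last hidden layer
        for l in range(len(weights) - 1, -1, -1):
            layer, prev, out = weights[l], acts[l], acts[l + 1]
            # ReLU passes gradient only through active units (output > 0).
            delta = [b if o > 0.0 else 0.0 for b, o in zip(back, out)]
            for j, dj in enumerate(delta):
                for i, a in enumerate(prev):
                    grads[l][j][i] += scale * dj * a
            back = [sum(layer[j][i] * delta[j] for j in range(len(layer)))
                    for i in range(len(prev))]
    return grads

rng = random.Random(0)
weights = init_weights([3, 8, 8], rng)
v = [1.0] * 4 + [-1.0] * 4  # fixed output layer: assumed half +1, half -1
data = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 0.0], -1)]
grads = training_grad(weights, v, data)

# Central finite-difference check on a single weight entry.
l, j, i, eps = 0, 2, 1, 1e-6
weights[l][j][i] += eps
up = training_loss(weights, v, data)
weights[l][j][i] -= 2 * eps
down = training_loss(weights, v, data)
weights[l][j][i] += eps
numeric = (up - down) / (2 * eps)
print(numeric, grads[l][j][i])
```

The finite-difference comparison is meaningful away from the ReLU kink, where the network is locally linear in each weight.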
4 Main Theory
In this section, we show that with random Gaussian initialization, over-parameterization helps gradient-based algorithms, including gradient descent and stochastic gradient descent, converge to the global minimum, i.e., find some points with arbitrarily small training loss. In particular, we provide the following theorem which characterizes the required number of hidden nodes and iterations such that gradient descent can attain the global minimum of the training loss function.
Theorem 4.1 suggests that the required number of hidden nodes and the number of iterations are both polynomial in the number of training examples , and the separation parameter . This is consistent with the recent work on the global convergence in training neural networks (Li and Yuan, 2017; Du et al., 2018b; Allen-Zhu et al., 2018c; Du et al., 2018a; Allen-Zhu et al., 2018b). Regarding different loss functions (depending on and according to Assumption 3.2), the dependence in ranges from to . It is worth noting that both and have an exponential dependence on the number of hidden layers . The same dependence on has also appeared in Du et al. (2018a) for training deep neural networks with smooth activation functions. For deep ReLU networks, Allen-Zhu et al. (2018b) proved that the number of hidden nodes and the number of iterations that guarantee global convergence only have a polynomial dependence on . We will improve the dependence of our results on in the future.
It is worth noting that squared hinge loss and exponential loss do not satisfy the Lipschitz continuity assumption. However, gradient descent can still be guaranteed to find the global minimum, as long as the derivative of the loss function is bounded along the optimization trajectory.
Based on the results in Theorem 4.1, we characterize in the following corollary the number of hidden nodes per layer required for gradient descent to find a point with zero training error.
Under the same assumptions as in Theorem 4.1, gradient descent can find a point with zero training error if the number of hidden nodes per layer is at least .
In the following theorem, we characterize the required numbers of hidden nodes and iterations for stochastic gradient descent.
Since it is not reflected in Theorems 4.1 and 4.5, we remark here that compared with gradient descent, stochastic gradient descent requires more hidden nodes and iterations to find the global minimum. Specifically, depending on different and , the required number of hidden nodes of stochastic gradient descent is worse by a factor ranging from to than that of gradient descent. The iteration number of stochastic gradient descent is also worse by a factor ranging from to . The detailed comparison can be found in the proofs of Theorems 4.1 and 4.5.
Similar to gradient descent, the following corollary characterizes the number of hidden nodes per layer required for stochastic gradient descent to achieve zero training error.
Under the same assumptions as in Theorem 4.5, stochastic gradient descent can find a point with zero training error if the number of hidden nodes per layer is at least .
5 Proof of the Main Theory
We prove the basic properties for Gaussian random matrices in Theorem 5.1, which constitute the basic structure of the neural network after Gaussian random initialization.
We show that as long as the product of iteration number and step size is smaller than some quantity , (stochastic) gradient descent with iterations remains in the perturbation region centered around the Gaussian initialization , which justifies the application of Theorem 5.3 to the iterates of (stochastic) gradient descent.
Based on the assumption that all iterates are within the perturbation region centered at with radius , we establish the convergence results in Lemma 5.6, and derive conditions on the product of iteration number and step size that guarantee convergence.
We finalize the proof by ensuring that (stochastic) gradient descent converges before exceeds , by setting the number of hidden nodes in each layer to be large enough.
The following theorem summarizes some high probability results of neural networks with Gaussian random initialization, which are pivotal in establishing the subsequent theoretical analyses.
and , with probability at least , all the following results hold:
for all and .
for all and .
for all .
for all and .
for all and .
for all and .
For any , there exists at least nodes satisfying
Theorem 5.1 summarizes all the properties we need for Gaussian initialization. In the sequel, we always assume that results 1-7 hold for the Gaussian initialization. The parameter in Theorem 5.1 is chosen in later proofs as . With , it is easy to check that the condition (5.1) is satisfied under the conditions of Theorem 4.1 and Theorem 4.5.
We perform -perturbation on the collection of random matrices with perturbation level , which formulates a perturbation region centering at with radius . For the perturbed matrices within such perturbation region, let , , and . We summarize their properties in the following theorem.
for all .
for all and .
for all and .
If , the squared Frobenius norm of the partial gradient with respect to the weight matrix in the last hidden layer is lower bounded by
For all , the Frobenius norm of the partial gradient of with respect to the weight matrix in the -th hidden layer is upper bounded by
Moreover, the Frobenius norm of the stochastic partial gradient is upper bounded by
where denotes the mini-batch of indices queried for computing stochastic gradient, and denotes the minibatch size.
Theorem 5.3 suggests that the quantities and grow exponentially in . In order to make them sufficiently small, we need to set to be exponentially small in . Moreover, the gradient lower bound provided in result 5 implies that within the perturbation region the empirical loss function of the deep neural network enjoys good local curvature properties, which are essential to prove the global convergence of (stochastic) gradient descent, while the gradient upper bound in result 6 will be utilized to quantify how much the weight matrices of the neural network would change during the training process.
5.1 Proof of Theorem 4.1
We first show that within a number of iterations, the iterates generated by gradient descent are in the perturbation region centering at the initial point with radius .
Suppose that are generated via Gaussian initialization, and all results 1-7 in Theorem 5.1 hold for . Let be the iterate obtained at the -th iteration of gradient descent starting from , and . Then under the same assumptions made in Theorem 5.3, for any iteration number and step size satisfying , for all .
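The statement above can be monitored numerically by tracking how far each layer's weights move from their initial values. The sketch below is a hypothetical helper, not part of the proof. It measures distance in Frobenius norm, which upper-bounds the spectral norm, so passing this check is a conservative sufficient test if the region is defined through the spectral norm. The radius name `tau` is an assumed placeholder.

```python
import math

def frobenius_distance(a, b):
    """Frobenius norm of a - b for two matrices given as lists of rows."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def max_layer_distance(weights, init_weights):
    """Largest per-layer distance between the current and the initial weights."""
    return max(frobenius_distance(w, w0) for w, w0 in zip(weights, init_weights))

def stays_in_region(trajectory, init_weights, tau):
    """True if every iterate in the trajectory is within radius tau of the initialization."""
    return all(max_layer_distance(w, init_weights) <= tau for w in trajectory)

# Hand-built example: one 2x2 layer starting at zero and drifting over two iterates.
w_init = [[[0.0, 0.0], [0.0, 0.0]]]
iterate_1 = [[[1.0, 0.0], [0.0, 1.0]]]
iterate_2 = [[[3.0, 0.0], [0.0, 4.0]]]
trajectory = [w_init, iterate_1, iterate_2]
print(max_layer_distance(iterate_2, w_init))
print(stays_in_region(trajectory, w_init, tau=5.0))
```

In an actual training loop, one would append a copy of the weights after each update and call `stays_in_region` on the recorded trajectory.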
The global convergence rate of gradient descent for training deep ReLU networks is characterized by the following lemma.
Suppose that are generated from Gaussian initialization, and for , is the iterate obtained at the -th iteration of gradient descent starting from . Under the same assumptions as in Theorem 4.1, if we set the step size , perturbation level , and assume all iterates are within the perturbation region, then the last iterate of gradient descent, denoted as , satisfies if the following holds:
when , and
Proof of Theorem 4.1.
The convergence result of gradient descent in terms of can be directly derived based on the results of and we obtained in Lemma 5.6.
In the following, we prove the result on the required number of hidden nodes per layer. Note that the convergence rate in Lemma 5.6 is derived by assuming that all iterates are within the perturbation region. Therefore, we need to find the requirement on such that this assumption holds. Note that by Lemma 5.5, we know that if , all iterates generated by gradient descent must be within the perturbation region. In addition, Lemma 5.6 requires us to set . Then, by solving , we can derive the required number of hidden nodes per layer as follows:
Proof of Corollary 4.4.
Note that if , is correctly classified. Therefore, we can achieve zero training error if the loss function satisfies
where the inequality follows from the facts that is strictly decreasing in and there is no training example satisfying . Then, we are able to replace the accuracy in Theorem 4.1 with , and obtain the result presented in this corollary. ∎
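The argument in this proof can be checked numerically. The sketch below assumes one reading of it: the training loss is an average of nonnegative, decreasing per-example losses, so if it falls below the loss at margin zero divided by the number of examples, no example can have a nonpositive margin. The threshold and the cross-entropy loss are assumptions of this sketch, because the displayed inequality is not legible in this copy.

```python
import math

def cross_entropy(z):
    return math.log1p(math.exp(-z))

def average_loss(margins):
    """Average per-example loss, where each margin is the label times the network output."""
    return sum(cross_entropy(z) for z in margins) / len(margins)

def training_error(margins):
    """Fraction of examples with a nonpositive margin (misclassified or on the boundary)."""
    return sum(1 for z in margins if z <= 0) / len(margins)

def certifies_zero_error(margins):
    """Sufficient test: average loss below loss(0) / n (assumed threshold)."""
    return average_loss(margins) < cross_entropy(0.0) / len(margins)

good = [5.0, 6.0, 7.0]
bad = [5.0, 6.0, -0.1]
small = [0.1, 0.1, 0.1]
for margins in (good, bad, small):
    print(certifies_zero_error(margins), training_error(margins))
```

The test is sufficient but not necessary: the `small` margins are all classified correctly, yet their average loss stays above the threshold.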
5.2 Proof of Theorem 4.5
Similar to the proof of Theorem 4.1, the following lemma shows that within a number of iterations, the iterates generated by stochastic gradient descent are within the perturbation region centering at the initial point with radius .
Suppose that are generated via Gaussian initialization, and all results 1-7 in Theorem 5.1 hold for . Let be the iterate obtained at the -th iteration of stochastic gradient descent starting from , and . Then under the same assumptions made in Theorem 5.3, for any iteration number and step size satisfying , for all .
Now we characterize the global convergence rate of stochastic gradient descent in the following lemma.
Lemma 5.8 (Convergence rate).
when , and
when , then stochastic gradient descent can find a point along the iteration sequence such that with constant probability the training loss .
Proof of Theorem 4.5.
The proof for is straightforward. Utilizing the results of and provided in Lemma 5.8 directly yields the bound on the iteration number .
6 Conclusions and Future Work
In this paper, we studied training deep neural networks by gradient descent and stochastic gradient descent. We proved that both gradient descent and stochastic gradient descent can achieve global minima of over-parameterized deep ReLU networks with random initialization, for a general class of loss functions, with only mild assumptions on the training data. Our theory sheds light on why stochastic gradient descent can train deep neural networks very well in practice, and paves the way for studying the optimization dynamics of training more sophisticated deep neural networks.
In the future, we will improve the dependence of our results on the number of layers . We will also sharpen the polynomial dependence of our results on other problem-dependent parameters.
Appendix A Proof of Theorem 5.1
In this section we provide the proof of Theorem 5.1. The bound for given by result 1 in Theorem 5.1 follows from standard results for Gaussian random matrices with independent entries (see Corollary 5.35 in Vershynin (2010)). We split the remaining results into several lemmas and prove them separately.
We first give the bound for the norms of the outputs of each layer. Intuitively, since the columns of are sampled independently from , given the output of the previous layer , the expectation of is . Moreover, the ReLU activation function truncates roughly half of the entries of to zero, and therefore should be approximately equal to . This leads to Lemma A.1 and Corollary A.2.
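The intuition above can be checked with a quick simulation. The sketch below assumes that each weight column is drawn from a Gaussian with variance 2/m per entry, where m is the layer width; this value is an assumption, since the variance is not legible in this copy. Under that choice, the squared norm of a wide ReLU layer's output should be close to the squared norm of its input.

```python
import math
import random

def relu_layer_output(x, width, rng):
    """Apply one random ReLU layer with `width` units to the input x.
    Each weight column has i.i.d. N(0, 2/width) entries (assumed variance)."""
    std = math.sqrt(2.0 / width)
    out = []
    for _ in range(width):
        col = [rng.gauss(0.0, std) for _ in x]
        out.append(max(0.0, sum(w * a for w, a in zip(col, x))))
    return out

def squared_norm(x):
    return sum(v * v for v in x)

rng = random.Random(0)
x = [1.0 / math.sqrt(5.0)] * 5  # unit-norm input
h = relu_layer_output(x, 2000, rng)
active_fraction = sum(1 for v in h if v > 0.0) / len(h)
print(squared_norm(x), squared_norm(h), active_fraction)
```

The printed active fraction should be near one half, matching the remark that ReLU zeroes out roughly half of the entries.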
Denote by the -th column of . Suppose that for any , are generated independently from . Then there exists an absolute constant such that for any , as long as , with probability at least ,
for all and .