Adding One Neuron Can Eliminate All Bad Local Minima
Abstract
One of the main difficulties in analyzing neural networks is the nonconvexity of the loss function which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
1 Introduction
Deep neural networks have recently achieved huge success in various machine learning tasks (see, for example, [1, 2, 3]). However, a theoretical understanding of neural networks is largely lacking. One of the difficulties in analyzing neural networks is the nonconvexity of the loss function, which allows the existence of many local minima with large losses. This was long considered a bottleneck of neural networks, and one of the reasons why convex formulations such as support vector machines [4] were previously preferred. Given the recent empirical success of deep neural networks, an interesting question is whether the nonconvexity of the neural network is really an issue.
It has been widely conjectured that all local minima of the empirical loss lead to similar training performance [5, 6]. For example, prior works empirically showed that neural networks with identical architectures but different initialization points can converge to local minima with similar classification performance [1, 7, 8]. On the theoretical side, there have been many recent attempts to analyze the landscape of the neural network loss functions. A few works have studied deep networks, but they either require linear activation functions [9, 10, 11, 12, 13], or require assumptions such as independence of ReLU activations [6] and significant over-parameterization [14, 15, 16]. There is a large body of work that studies single-hidden-layer neural networks and provides various conditions under which a local search algorithm can find a global minimum [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Note that even for single-layer networks, strong assumptions such as over-parameterization, very special neuron activation functions, fixed second-layer parameters and/or Gaussian data distribution are often needed in the existing works. The presence of various strong assumptions reflects the difficulty of the problem: even for the single-hidden-layer nonlinear neural network, it seems hard to analyze the landscape, so it is reasonable to make various assumptions.
In addition to strong assumptions, the conclusions in many existing works do not apply to all local minima. One typical conclusion is about the local geometry, i.e., in a small neighborhood of the global minima no bad local minima exist [29, 28, 30]. Another typical conclusion is that a subset of local minima are global minima [32, 22, 33, 14, 15]. [34] has shown that a subset of second-order local minima can perform nearly as well as linear predictors. The presence of various conclusions also reflects the difficulty of the problem: while analyzing the global landscape seems hard, we may step back and analyze the local landscape or a “majority” of the landscape. Based on the above discussions, an ideal theoretical result would state that with mild assumptions on the dataset, neural architectures and loss functions, every local minimum is a global minimum; existing results often make more than one strong assumption and/or prove weaker conclusions on the landscape.
1.1 Our Contributions
Given this context, our main result is quite surprising: for binary classification, with a small modification of the neural architecture, every local minimum is a global minimum of the loss function. Our result requires no assumption on the network size, the specific type of the original neural network, etc., yet our result applies to every local minimum. The major trick is adding one special neuron (with a skip connection) and an associated regularizer of this neuron. Our major result and its implications are as follows:

We focus on the binary classification problem with a smooth hinge loss function. We prove the following result: for any neural network, by adding a special neuron (e.g., exponential neuron) to the network and adding a quadratic regularizer of this neuron, the new loss function has no bad local minimum. In addition, every local minimum achieves the minimum misclassification error.

In the main result, the augmented neuron can be viewed as a skip connection from the input to the output layer. However, this skip connection is not critical, as the same result also holds if we add one special neuron to each layer of a fully-connected feedforward neural network.

To our knowledge, this is the first result that no spurious local minimum exists for a wide class of deep nonlinear networks. Our result indicates that the class of “good neural networks” (neural networks such that there is an associated loss function with no spurious local minima) contains any network with one special neuron, thus this class is rather “dense” in the class of all neural networks: the distance between any neural network and a good neural network is just a neuron away.
The outline of the paper is as follows. In Section 2, we introduce the notation. In Section 3, we present the main result, and in Section 4 we present several extensions of it. We present the proof idea of the main result in Section 5 and conclude the paper in Section 6. All proofs are presented in the Appendix.
2 Preliminaries
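Feedforward networks. Given an input vector of dimension $d$, we consider a neural network with $L$ layers of neurons for binary classification. We denote by $M_l$ the number of neurons in the $l$-th layer (note that $M_0 = d$). We denote the neural activation function by $\sigma$. Let $W_l \in \mathbb{R}^{M_{l-1} \times M_l}$ denote the weight matrix connecting the $(l-1)$-th and $l$-th layer and $b_l \in \mathbb{R}^{M_l}$ denote the bias vector for neurons in the $l$-th layer. Let $W_{L+1} \in \mathbb{R}^{M_L}$ and $b_{L+1} \in \mathbb{R}$ denote the weight vector and bias scalar in the output layer, respectively. Therefore, the output of the network $f : \mathbb{R}^d \to \mathbb{R}$ can be expressed by
$$f(x;\theta) = W_{L+1}^\top \sigma\big(W_L^\top \sigma\big(\cdots \sigma\big(W_1^\top x + b_1\big) \cdots\big) + b_L\big) + b_{L+1}. \tag{1}$$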
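Loss and error. We use $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ to denote a dataset containing $n$ samples, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ denote the feature vector and the label of the $i$-th sample, respectively. Given a neural network $f(x;\theta)$ parameterized by $\theta$ and a loss function $\ell : \mathbb{R} \to \mathbb{R}$, in binary classification tasks, we define the empirical loss $L_n(\theta)$ as the average loss of the network $f$ over the samples in the dataset $\mathcal{D}$ and define the training error (also called the misclassification error) $R_n(\theta; f)$ as the misclassification rate of the network $f$ on the dataset $\mathcal{D}$, i.e.,
$$L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell\big(-y_i f(x_i;\theta)\big) \quad\text{and}\quad R_n(\theta; f) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{y_i \neq \mathrm{sgn}(f(x_i;\theta))\}, \tag{2}$$
where $\mathbb{I}\{\cdot\}$ is the indicator function.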
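As a concrete illustration of Equation (2), the following minimal Python sketch computes the empirical loss and the training error on a toy problem. The model `linear_model`, the cubic hinge loss, the four-sample dataset, and the convention that the sign of zero is +1 are illustrative assumptions and are not taken from the paper.

```python
def sign(z):
    # Assumed convention for this sketch: sgn(0) = +1.
    return 1 if z >= 0 else -1

def empirical_loss(f, loss, xs, ys):
    """Average of loss(-y_i * f(x_i)) over the dataset."""
    return sum(loss(-y * f(x)) for x, y in zip(xs, ys)) / len(xs)

def training_error(f, xs, ys):
    """Fraction of samples whose label differs from sgn(f(x_i))."""
    return sum(1 for x, y in zip(xs, ys) if sign(f(x)) != y) / len(xs)

def linear_model(x):
    # Hypothetical stand-in for the network f(x; theta).
    return 2.0 * x[0] - 1.0 * x[1]

def cubic_hinge(z):
    # Polynomial hinge loss with exponent 3.
    return max(z + 1.0, 0.0) ** 3

xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]
ys = [1, -1, -1, -1]

loss_value = empirical_loss(linear_model, cubic_hinge, xs, ys)
error_rate = training_error(linear_model, xs, ys)
print(loss_value, error_rate)  # 2.0 0.25
```

Only the third sample is misclassified here, and it is also the only sample that contributes to the loss.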
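Tensor products. We use $a \otimes b$ to denote the tensor product of vectors $a$ and $b$ and use $a^{\otimes k}$ to denote the tensor product $a \otimes \cdots \otimes a$ where $a$ appears $k$ times. For an $N$-th order tensor $T \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_N}$ and $N$ vectors $u_1 \in \mathbb{R}^{d_1}, \ldots, u_N \in \mathbb{R}^{d_N}$, we define
$$T \otimes u_1 \otimes \cdots \otimes u_N = \sum_{i_1 \in [d_1], \ldots, i_N \in [d_N]} T(i_1, \ldots, i_N)\, u_1(i_1) \cdots u_N(i_N),$$
where we use $T(i_1, \ldots, i_N)$ to denote the $(i_1, \ldots, i_N)$-th component of the tensor $T$, $u_k(i_k)$ to denote the $i_k$-th component of the vector $u_k$, $k = 1, \ldots, N$, and $[d_k]$ to denote the set $\{1, \ldots, d_k\}$.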
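The contraction defined above can be checked numerically. The sketch below builds the $k$-fold tensor power of a vector and contracts it with $k$ copies of another vector; the result should equal the $k$-th power of their inner product. The dictionary-based tensor representation and the function names are assumptions made for this illustration.

```python
from itertools import product

def tensor_power(a, k):
    """Return the k-fold tensor product of a with itself as {index tuple: entry}."""
    tensor = {}
    for idx in product(range(len(a)), repeat=k):
        entry = 1.0
        for i in idx:
            entry *= a[i]
        tensor[idx] = entry
    return tensor

def contract(tensor, vectors):
    """Sum of T(i_1, ..., i_N) * u_1(i_1) * ... * u_N(i_N) over all indices."""
    total = 0.0
    for idx, value in tensor.items():
        term = value
        for vec, i in zip(vectors, idx):
            term *= vec[i]
        total += term
    return total

a = [1.0, 2.0]
u = [0.5, -1.0]
k = 3

lhs = contract(tensor_power(a, k), [u] * k)
rhs = sum(ai * ui for ai, ui in zip(a, u)) ** k
print(lhs, rhs)
```

This identity is what links sums of the form $\sum_i c_i (u^\top x_i)^k$ to the symmetric tensor $\sum_i c_i x_i^{\otimes k}$.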
3 Main Result
In this section, we first present several important conditions on the loss function and the dataset in order to derive the main results. After that, we will present the main results.
3.1 Assumptions
In this subsection, we introduce two assumptions on the loss function and the dataset.
Assumption 1 (Loss function)
Assume that the loss function $\ell : \mathbb{R} \to \mathbb{R}$ is monotonically non-decreasing and twice differentiable, i.e., $\ell \in C^2$. Assume that every critical point of the loss function $\ell(z)$ is also a global minimum and every global minimum $z$ satisfies $z < 0$.
A simple example of the loss function satisfying Assumption 1 is the polynomial hinge loss, i.e., $\ell(z) = [\max\{z + 1, 0\}]^p$, $p \ge 3$. It is always zero for $z \le -1$ and behaves like a polynomial function in the region $z > -1$. Note that the condition that every global minimum of the loss function $\ell(z)$ is negative is not needed to prove the result that every local minimum of the empirical loss is globally minimal, but is necessary to prove that the global minimizer of the empirical loss is also the minimizer of the misclassification rate.
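The properties required by Assumption 1 can be inspected directly for the polynomial hinge loss. The following sketch, with exponent 3 chosen as an example, evaluates the loss and its derivative on a grid; the grid and the function names are assumptions made for illustration.

```python
def poly_hinge(z, p=3):
    """Polynomial hinge loss: max(z + 1, 0) raised to the power p."""
    return max(z + 1.0, 0.0) ** p

def poly_hinge_grad(z, p=3):
    """Derivative of the polynomial hinge loss with respect to z."""
    return p * max(z + 1.0, 0.0) ** (p - 1)

grid = [-3.0 + 0.25 * i for i in range(25)]  # points from -3.0 to 3.0

# Monotonically non-decreasing: the derivative is never negative.
is_monotone = all(poly_hinge_grad(z) >= 0.0 for z in grid)

# Critical points on the grid are exactly the points with z <= -1, where the
# loss attains its global minimum value 0; all of them are negative.
critical = [z for z in grid if poly_hinge_grad(z) == 0.0]
critical_are_minima = all(poly_hinge(z) == 0.0 and z < 0.0 for z in critical)

print(is_monotone, critical_are_minima, max(critical))
```

A grid check is of course not a proof, but it shows what each clause of the assumption means for this loss.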
Assumption 2 (Realizability)
Assume that there exists a set of parameters $\theta$ such that the neural network $f(\cdot;\theta)$ is able to correctly classify all samples in the dataset $\mathcal{D}$.
By Assumption 2, we assume that the dataset $\mathcal{D}$ is realizable by the neural architecture $f$. We note that this assumption is consistent with previous empirical observations [35, 1, 7] showing that at the end of the training process, neural networks usually achieve zero misclassification rates on the training sets. However, as we will show later, if the loss function $\ell$ is convex, then we can prove the main result even without Assumption 2.
3.2 Main Result
In this subsection, we first introduce several notations and next present the main result of the paper.
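Given a neural architecture $f(\cdot;\theta) : \mathbb{R}^d \to \mathbb{R}$ defined on a $d$-dimensional Euclidean space and parameterized by a set of parameters $\theta$, we define a new architecture $\tilde{f}$ by adding the output of an exponential neuron to the output of the network $f$, i.e.,
$$\tilde{f}(x;\tilde{\theta}) = f(x;\theta) + a \exp\big(w^\top x + b\big), \tag{3}$$
where the vector $\tilde{\theta} = (\theta, a, w, b)$ denotes the parametrization of the network $\tilde{f}$. For this designed model, we define the empirical loss function as follows,
$$\tilde{L}_n(\tilde{\theta}) = \frac{1}{n}\sum_{i=1}^n \ell\big(-y_i \tilde{f}(x_i;\tilde{\theta})\big) + \frac{\lambda a^2}{2}, \tag{4}$$
where the scalar $\lambda$ is a positive real number, i.e., $\lambda > 0$. Different from the empirical loss function $L_n$, the loss $\tilde{L}_n$ has an additional regularizer on the parameter $a$, since we aim to eliminate the impact of the exponential neuron on the output of the network $\tilde{f}$ at every local minimum of $\tilde{L}_n$. As we will show later, the exponential neuron is inactive at every local minimum of the empirical loss $\tilde{L}_n$. Now we present the following theorem to show that every local minimum of the loss function $\tilde{L}_n$ is also a global minimum.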
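Remark: Instead of viewing the exponential term in Equation (3) as a neuron, one can also equivalently think of modifying the loss function to be
$$\tilde{L}_n(\theta, a, w, b) = \frac{1}{n}\sum_{i=1}^n \ell\Big(-y_i\big(f(x_i;\theta) + a \exp(w^\top x_i + b)\big)\Big) + \frac{\lambda a^2}{2}.$$
Then, one can interpret Equations (3) and (4) as maintaining the original neural architecture and slightly modifying the loss function.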
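The augmented model and its regularized loss can be sketched in a few lines of Python. This is a minimal illustration, assuming the exponential neuron has an output weight `a`, input weights `w`, a bias `b`, and a regularization coefficient `lam`; the toy model, loss, and dataset are hypothetical placeholders for the original network and data.

```python
import math

def exp_neuron(x, w, b):
    """Output of the exponential neuron before scaling by a."""
    return math.exp(sum(wj * xj for wj, xj in zip(w, x)) + b)

def augmented_output(f, x, a, w, b):
    """Original network output plus the scaled exponential neuron."""
    return f(x) + a * exp_neuron(x, w, b)

def augmented_loss(f, loss, xs, ys, a, w, b, lam):
    """Average loss of the augmented model plus a quadratic penalty on a."""
    data_term = sum(
        loss(-y * augmented_output(f, x, a, w, b)) for x, y in zip(xs, ys)
    ) / len(xs)
    return data_term + 0.5 * lam * a ** 2

def toy_model(x):
    # Hypothetical stand-in for the original network f(x; theta).
    return 2.0 * x[0] - 1.0 * x[1]

def cubic_hinge(z):
    return max(z + 1.0, 0.0) ** 3

xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]
ys = [1, -1, -1, -1]

# With a = 0 the neuron is inactive and the penalty vanishes, so the augmented
# loss coincides with the empirical loss of the original model.
inactive = augmented_loss(toy_model, cubic_hinge, xs, ys, 0.0, (0.3, -0.1), 0.2, 0.5)
active = augmented_loss(toy_model, cubic_hinge, xs, ys, 0.1, (0.3, -0.1), 0.2, 0.5)
print(inactive, active)
```

Because the extra term enters only through the model output and the penalty, the same code can be read either as a modified architecture or as a modified loss on the original architecture.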
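Theorem 1
Suppose that Assumptions 1 and 2 hold. Assume that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then $\tilde{\theta}^*$ is a global minimum of $\tilde{L}_n(\tilde{\theta})$. Furthermore, $\theta^*$ achieves the minimum loss value and the minimum misclassification rate on the dataset $\mathcal{D}$, i.e., $\theta^* \in \arg\min_\theta L_n(\theta)$ and $\theta^* \in \arg\min_\theta R_n(\theta; f)$.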
Remarks: (i) Theorem 1 shows that every local minimum $\tilde{\theta}^*$ of the empirical loss $\tilde{L}_n$ is also a global minimum and shows that $\theta^*$ achieves the minimum training error and the minimum loss value on the original loss function $L_n$ at the same time. (ii) Since we do not require the explicit form of the neural architecture $f$, Theorem 1 applies to the neural architectures widely used in practice such as convolutional neural networks [1], deep residual networks [7], etc. This further indicates that the result holds for any real neural activation functions such as rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), etc. (iii) As we will show in the following corollary, at every local minimum $\tilde{\theta}^*$, the exponential neuron is inactive. Therefore, at every local minimum $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$, the neural network $\tilde{f}$ with an augmented exponential neuron is equivalent to the original neural network $f$.
Corollary 1
Under the conditions of Theorem 1, if $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then the two neural networks $\tilde{f}(\cdot;\tilde{\theta}^*)$ and $f(\cdot;\theta^*)$ are equivalent, i.e., $\tilde{f}(x;\tilde{\theta}^*) = f(x;\theta^*)$, $\forall x \in \mathbb{R}^d$.
Corollary 1 shows that at every local minimum, the exponential neuron does not contribute to the output of the neural network $\tilde{f}$. However, this does not imply that the exponential neuron is unnecessary, since several previous results [36, 31] have already shown that the loss surface of pure ReLU neural networks is guaranteed to have bad local minima. Furthermore, to prove the main result under any dataset, the regularizer is also necessary, since [31] has already shown that even with an augmented exponential neuron, the empirical loss without the regularizer still has bad local minima under some datasets.
4 Extensions
4.1 Eliminating the Skip Connection
As noted in the previous section, the exponential term in Equation (3) can be viewed as a skip connection or a modification to the loss function. Our analysis also works under other architectures. When the exponential term is viewed as a skip connection, the network architecture is as shown in Fig. 1(a). This architecture is different from the canonical feedforward neural architectures as there is a direct path from the input layer to the output layer. In this subsection, we will show that the main result still holds if the model is defined as a feedforward neural network shown in Fig. 1(b), where each layer of the network is augmented by an additional exponential neuron. This is a standard fully connected neural network except for one special neuron at each layer.
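Notations. Given a fully-connected feedforward neural network $f(\cdot;\theta)$ defined by Equation (1), we define a new fully connected feedforward neural network $\tilde{f}$ by adding an additional exponential neuron to each layer of the network $f$. We use the vector $\tilde{\theta} = (\theta, \theta_{\exp})$ to denote the parameterization of the network $\tilde{f}$, where $\theta_{\exp}$ denotes the vector consisting of all augmented weights and biases. Let $\tilde{W}_l$ and $\tilde{b}_l$ denote the weight matrix and the bias vector in the $l$-th layer of the network $\tilde{f}$, respectively. Let $\tilde{W}_{L+1}$ and $\tilde{b}_{L+1}$ denote the weight vector and the bias scalar in the output layer of the network $\tilde{f}$, respectively. Without loss of generality, we assume that the $(M_l+1)$-th neuron in the $l$-th layer is the augmented exponential neuron. Thus, the output of the network $\tilde{f}$ is expressed by
$$\tilde{f}(x;\tilde{\theta}) = \tilde{W}_{L+1}^\top \tilde{\sigma}_L\big(\tilde{W}_L^\top \tilde{\sigma}_{L-1}\big(\cdots \tilde{\sigma}_1\big(\tilde{W}_1^\top x + \tilde{b}_1\big) \cdots\big) + \tilde{b}_L\big) + \tilde{b}_{L+1}, \tag{5}$$
where $\tilde{\sigma}_l : \mathbb{R}^{M_l+1} \to \mathbb{R}^{M_l+1}$ is a vector-valued activation function with the first $M_l$ components being the activation functions $\sigma$ in the network $f$ and with the last component being the exponential function, i.e., $\tilde{\sigma}_l(z) = (\sigma(z_1), \ldots, \sigma(z_{M_l}), \exp(z_{M_l+1}))$. Furthermore, we use $\tilde{w}_l$ to denote the vector in the $(M_{l-1}+1)$-th row of the matrix $\tilde{W}_l$. In other words, the components of the vector $\tilde{w}_l$ are the weights on the edges connecting the exponential neuron in the $(l-1)$-th layer and the neurons in the $l$-th layer. For this feedforward network, we define an empirical loss function as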
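$$\tilde{L}_n(\tilde{\theta}) = \frac{1}{n}\sum_{i=1}^n \ell\big(-y_i \tilde{f}(x_i;\tilde{\theta})\big) + \frac{\lambda}{2}\sum_{l=2}^{L+1} \|\tilde{w}_l\|_{2L}^{2L}, \tag{6}$$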
where $\|a\|_p$ denotes the $p$-norm of a vector $a$ and $\lambda$ is a positive real number, i.e., $\lambda > 0$. Similar to the empirical loss discussed in the previous section, we add a regularizer to eliminate the impacts of all exponential neurons on the output of the network. Similarly, we can prove that at every local minimum of $\tilde{L}_n$, all exponential neurons are inactive. Now we present the following theorem to show that if the set of parameters $\tilde{\theta}^* = (\theta^*, \theta^*_{\exp})$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then $\tilde{\theta}^*$ is a global minimum and $\theta^*$ is a global minimum of both minimization problems $\min_\theta L_n(\theta)$ and $\min_\theta R_n(\theta; f)$. This means that the neural network $f(\cdot;\theta^*)$ simultaneously achieves the globally minimal loss value and misclassification rate on the dataset $\mathcal{D}$.
Theorem 2
Suppose that Assumptions 1 and 2 hold. Suppose that the activation function $\sigma$ is differentiable. Assume that $\tilde{\theta}^* = (\theta^*, \theta^*_{\exp})$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then $\tilde{\theta}^*$ is a global minimum of $\tilde{L}_n(\tilde{\theta})$. Furthermore, $\theta^*$ achieves the minimum loss value and the minimum misclassification rate on the dataset $\mathcal{D}$, i.e., $\theta^* \in \arg\min_\theta L_n(\theta)$ and $\theta^* \in \arg\min_\theta R_n(\theta; f)$.
Remarks: (i) This theorem is not a direct corollary of the result in the previous section, but the proof ideas are similar. (ii) Due to the assumption on the differentiability of the activation function $\sigma$, Theorem 2 does not apply to the neural networks consisting of non-smooth neurons such as ReLUs, Leaky ReLUs, etc. (iii) Similar to Corollary 1, we will present the following corollary to show that at every local minimum $\tilde{\theta}^* = (\theta^*, \theta^*_{\exp})$, the neural network $\tilde{f}$ with augmented exponential neurons is equivalent to the original neural network $f$.
Corollary 2
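Under the conditions in Theorem 2, if $\tilde{\theta}^* = (\theta^*, \theta^*_{\exp})$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then the two neural networks $\tilde{f}(\cdot;\tilde{\theta}^*)$ and $f(\cdot;\theta^*)$ are equivalent, i.e., $\tilde{f}(x;\tilde{\theta}^*) = f(x;\theta^*)$, $\forall x \in \mathbb{R}^d$.
Corollary 2 further shows that even if we add an exponential neuron to each layer of the original network $f$, at every local minimum of the empirical loss, all exponential neurons are inactive.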
4.2 Neurons
In this subsection, we will show that even if the exponential neuron is replaced by a monomial neuron, the main result still holds under additional assumptions. Similar to the case where exponential neurons are used, given a neural network $f(x;\theta)$, we define a new neural network $\tilde{f}$ by adding the output of a monomial neuron of degree $p$ to the output of the original model $f$, i.e.,
$$\tilde{f}(x;\tilde{\theta}) = f(x;\theta) + a\big(w^\top x + b\big)^p. \tag{7}$$
In addition, the empirical loss function $\tilde{L}_n$ is exactly the same as the loss function defined by Equation (4). Next, we will present the following proposition to show that if all samples in the dataset $\mathcal{D}$ can be correctly classified by a polynomial of degree $t$ and the degree of the augmented monomial is not smaller than $t$ (i.e., $p \ge t$), then every local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$ is also a global minimum. We note that the degree of a monomial is the sum of powers of all variables in this monomial and the degree of a polynomial is the maximum degree of its monomials.
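To make the role of the degree concrete, the following sketch adds a degree-2 monomial neuron to a trivial model on an XOR-like dataset, which is not linearly separable but is separable by a degree-2 polynomial. The constant model, the dataset, and all parameter values are hypothetical choices made for this illustration.

```python
def monomial_neuron(x, w, b, p):
    """Monomial neuron: the affine map w.x + b raised to the power p."""
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) ** p

def augmented_output(f, x, a, w, b, p):
    """Original model output plus the scaled monomial neuron."""
    return f(x) + a * monomial_neuron(x, w, b, p)

def constant_model(x):
    # Hypothetical original network that outputs a constant.
    return -2.0

# XOR-like data: the label is the sign of x1 * x2.
xs = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
ys = [1, 1, -1, -1]

a, w, b, p = 1.0, (1.0, 1.0), 0.0, 2
outputs = [augmented_output(constant_model, x, a, w, b, p) for x in xs]
margins = [y * out for y, out in zip(ys, outputs)]
print(outputs)  # [2.0, 2.0, -2.0, -2.0]
print(margins)  # [2.0, 2.0, 2.0, 2.0]
```

Here the augmented output is the polynomial $(x_1 + x_2)^2 - 2$, which has degree 2 and classifies all four points correctly.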
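Proposition 1
Suppose that Assumptions 1 and 2 hold. Assume that all samples in the dataset $\mathcal{D}$ can be correctly classified by a polynomial of degree $t$ and $p \ge t$. Assume that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then $\tilde{\theta}^*$ is a global minimum of $\tilde{L}_n(\tilde{\theta})$. Furthermore, $\theta^*$ achieves the minimum loss value and the minimum misclassification rate on the dataset $\mathcal{D}$, i.e., $\theta^* \in \arg\min_\theta L_n(\theta)$ and $\theta^* \in \arg\min_\theta R_n(\theta; f)$.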
Remarks: (i) We note that, similar to Theorem 1, Proposition 1 applies to all neural architectures and all neural activation functions defined on $\mathbb{R}$, as we do not require the explicit form of the neural network $f$. (ii) It follows from the Lagrangian interpolating polynomial and Assumption 2 that for a dataset consisting of $n$ different samples, there always exists a polynomial $P$ of degree smaller than $n$ such that the polynomial $P$ can correctly classify all points in the dataset. This indicates that Proposition 1 always holds if $p \ge n$. (iii) Similar to Corollaries 1 and 2, we can show that at every local minimum $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$, the neural network $\tilde{f}$ with an augmented monomial neuron is equivalent to the original neural network $f$.
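The interpolation argument in remark (ii) is easiest to see in one dimension. The sketch below builds the Lagrange interpolating polynomial through labeled points on the real line; it reproduces each label exactly, so it classifies every point with margin one and has degree at most one less than the number of points. The restriction to scalar inputs and the sample values are simplifying assumptions; the paper works with feature vectors.

```python
def lagrange_classifier(points, labels):
    """Return P with P(points[i]) == labels[i], of degree at most len(points) - 1."""
    def P(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(points, labels)):
            basis = 1.0
            for j, xj in enumerate(points):
                if j != i:
                    # Basis factor: equals 1 at xi and 0 at every other node.
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return P

points = [-2.0, -1.0, 0.5, 1.0, 3.0]  # distinct scalar features (assumed)
labels = [1, -1, -1, 1, -1]

P = lagrange_classifier(points, labels)
margins = [y * P(x) for x, y in zip(points, labels)]
print(margins)
```

Distinctness of the sample points is essential: two identical features with different labels would make the denominators vanish, which is the situation treated separately under random labels.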
4.3 Allowing Random Labels
In previous subsections, we assume the realizability of the dataset by the neural network which implies that the label of a given feature vector is unique. It does not cover the case where the dataset contains two samples with the same feature vector but with different labels (for example, the same image can be labeled differently by two different people). Clearly, in this case, no model can correctly classify all samples in this dataset. Another simple example of this case is the mixture of two Gaussians where the data samples are drawn from each of the two Gaussian distributions with certain probability.
In this subsection, we will show that under this broader setting, in which one feature vector may correspond to two different labels, the same result still holds under a slightly stronger assumption, namely convexity of the loss $\ell$. The formal statement is given by the following proposition.
Proposition 2
Suppose that Assumption 1 holds and the loss function $\ell$ is convex. Assume that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde{L}_n(\tilde{\theta})$, then $\tilde{\theta}^*$ is a global minimum of $\tilde{L}_n(\tilde{\theta})$. Furthermore, $\theta^*$ achieves the minimum loss value and the minimum misclassification rate on the dataset $\mathcal{D}$, i.e., $\theta^* \in \arg\min_\theta L_n(\theta)$ and $\theta^* \in \arg\min_\theta R_n(\theta; f)$.
Remark: The differences between Proposition 2 and Theorem 1 can be understood in the following ways. First, as stated previously, Proposition 2 allows a feature vector to have two different labels, but Theorem 1 does not. Second, the minimum misclassification rate under the conditions in Theorem 1 must be zero, while in Proposition 2, the minimum misclassification rate can be nonzero.
4.4 High-order Stationary Points
In this subsection, we characterize the high-order stationary points of the empirical loss $\tilde{L}_n(\tilde{\theta})$ shown in Section 3.2. We first introduce the definition of the high-order stationary point and next show that every stationary point of the loss $\tilde{L}_n$ with a sufficiently high order is also a global minimum.
Definition 1 ($k$-th order stationary point)
A critical point $\theta_0$ of a function $L(\theta)$ is a $k$-th order stationary point, if there exist positive constants $C, \varepsilon > 0$ such that for every $\theta$ with $\|\theta - \theta_0\|_2 \le \varepsilon$, $L(\theta) \ge L(\theta_0) - C\|\theta - \theta_0\|_2^{k+1}$.
Next, we will show that if a polynomial of degree $p$ can correctly classify all points in the dataset, then every stationary point of order at least $2p$ is a global minimum and the set of parameters corresponding to this stationary point achieves the minimum training error.
Proposition 3
Suppose that Assumptions 1 and 2 hold. Assume that all samples in the dataset $\mathcal{D}$ can be correctly classified by a polynomial of degree $p$. Assume that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a $k$-th order stationary point of the empirical loss function $\tilde{L}_n(\tilde{\theta})$ and $k \ge 2p$, then $\tilde{\theta}^*$ is a global minimum of $\tilde{L}_n(\tilde{\theta})$. Furthermore, the neural network $f(\cdot;\theta^*)$ achieves the minimum misclassification rate on the dataset $\mathcal{D}$, i.e., $\theta^* \in \arg\min_\theta R_n(\theta; f)$.
One implication of Proposition 3 is that if a dataset is linearly separable, then every second order stationary point of the empirical loss function is a global minimum and, at this stationary point, the neural network achieves zero training error. When the dataset is not linearly separable, our result only covers fourth or higher order stationary points of the empirical loss.
5 Proof Idea
In this section, we provide an overview of the proof of Theorem 1.
5.1 Important Lemmas
In this subsection, we present two important lemmas on which the proof of Theorem 1 is based.
Lemma 1
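Under Assumption 1 and $\lambda > 0$, if $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of $\tilde{L}_n$, then (i) $a^* = 0$, (ii) for any integer $p \ge 0$, the following equation holds for every unit vector $u$ with $\|u\|_2 = 1$,
$$\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\, e^{w^{*\top} x_i + b^*} \big(u^\top x_i\big)^p = 0. \tag{8}$$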
Lemma 2
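For any integer $k \ge 0$ and any sequence $\{c_i\}_{i=1}^n$, if $\sum_{i=1}^n c_i (u^\top x_i)^k = 0$ holds for every unit vector $u$ with $\|u\|_2 = 1$, then the $k$-th order tensor $T_k = \sum_{i=1}^n c_i x_i^{\otimes k}$ is a $k$-th order zero tensor.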
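For the special case of order 2, the content of Lemma 2 can be illustrated by polarization: the entries of the symmetric matrix $\sum_i c_i x_i x_i^\top$ are recoverable from values of the quadratic form $u \mapsto \sum_i c_i (u^\top x_i)^2$, so a form that vanishes everywhere forces the matrix to be zero. The sketch below is only an illustration of this special case with hypothetical data, not the general argument from [37]. It evaluates the form at non-unit vectors, which is harmless because the form is homogeneous in $u$.

```python
def form(cs, xs, u, k=2):
    """Evaluate sum_i c_i * (u . x_i)^k."""
    return sum(
        c * sum(uj * xj for uj, xj in zip(u, x)) ** k for c, x in zip(cs, xs)
    )

def tensor_entry(cs, xs, j, m):
    """Entry (j, m) of the second-order tensor sum_i c_i * x_i x_i^T."""
    return sum(c * x[j] * x[m] for c, x in zip(cs, xs))

def polarized_entry(cs, xs, j, m, d):
    """Recover entry (j, m) from form values via the polarization identity."""
    e_j = [1.0 if t == j else 0.0 for t in range(d)]
    e_m = [1.0 if t == m else 0.0 for t in range(d)]
    e_sum = [p + q for p, q in zip(e_j, e_m)]
    return 0.5 * (form(cs, xs, e_sum) - form(cs, xs, e_j) - form(cs, xs, e_m))

cs = [1.0, -2.0, 0.5]                        # hypothetical coefficients c_i
xs = [(1.0, 2.0), (0.0, 1.0), (3.0, -1.0)]   # hypothetical samples x_i
d = 2

direct = [[tensor_entry(cs, xs, j, m) for m in range(d)] for j in range(d)]
recovered = [[polarized_entry(cs, xs, j, m, d) for m in range(d)] for j in range(d)]
print(direct)
print(recovered)
```

Since every entry is a linear combination of form values, a form that is identically zero yields a zero matrix.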
5.2 Proof Sketch of Lemma 1
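Proof sketch of Lemma 1(i): To prove $a^* = 0$, we only need to check the first order conditions of local minima. By the assumption that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of $\tilde{L}_n$, the derivatives of $\tilde{L}_n$ with respect to $a$ and $b$ at the point $\tilde{\theta}^*$ are both zero, i.e.,
$$\nabla_a \tilde{L}_n(\tilde{\theta})\big|_{\tilde{\theta}=\tilde{\theta}^*} = -\frac{1}{n}\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*) - y_i a^* e^{w^{*\top} x_i + b^*}\big)\, y_i\, e^{w^{*\top} x_i + b^*} + \lambda a^* = 0,$$
$$\nabla_b \tilde{L}_n(\tilde{\theta})\big|_{\tilde{\theta}=\tilde{\theta}^*} = -\frac{a^*}{n}\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*) - y_i a^* e^{w^{*\top} x_i + b^*}\big)\, y_i\, e^{w^{*\top} x_i + b^*} = 0.$$
Multiplying the first equation by $a^*$ and subtracting the second, it is not difficult to see that $a^*$ satisfies $\lambda (a^*)^2 = 0$ or, equivalently, $a^* = 0$.
We note that the main observation we are using here is that the derivative of the exponential neuron is itself. Therefore, it is not difficult to see that the same proof holds for all neuron activation functions $\sigma$ satisfying $\sigma(z) = c\,\sigma'(z)$ for some constant $c$. In fact, with a small modification of the proof, we can show that the same proof works for all neuron activation functions satisfying $\sigma(z) = (c_1 z + c_0)\,\sigma'(z)$ for some constants $c_0$ and $c_1$. This further indicates that the same proof holds for the monomial neurons and thus the proof of Proposition 1 follows directly from the proof of Theorem 1.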
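The first-order argument relies on an identity that holds at every parameter value, not only at local minima: the output weight times the derivative of the regularized loss with respect to the output weight, minus its derivative with respect to the bias, equals the regularization coefficient times the squared output weight. The sketch below checks this with central finite differences. The names `a`, `b`, `w`, `lam`, the fixed outputs `fs` standing in for the original network, the normalization by the sample count, and the data are all assumptions made for this illustration.

```python
import math

def cubic_hinge(z):
    return max(z + 1.0, 0.0) ** 3

def augmented_loss(a, b, w, lam, fs, xs, ys):
    """Average loss of f + a * exp(w.x + b) plus the penalty lam * a^2 / 2."""
    total = 0.0
    for f_i, x, y in zip(fs, xs, ys):
        neuron = math.exp(sum(wj * xj for wj, xj in zip(w, x)) + b)
        total += cubic_hinge(-y * (f_i + a * neuron))
    return total / len(xs) + 0.5 * lam * a * a

def central_diff(g, t, h=1e-5):
    """Central finite-difference approximation of g'(t)."""
    return (g(t + h) - g(t - h)) / (2.0 * h)

fs = [0.4, -0.3, 0.2, -0.1]  # hypothetical outputs of the original network
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]
ys = [1, -1, -1, 1]
a, b, w, lam = 0.3, 0.1, (0.5, -0.2), 0.7

dL_da = central_diff(lambda t: augmented_loss(t, b, w, lam, fs, xs, ys), a)
dL_db = central_diff(lambda t: augmented_loss(a, t, w, lam, fs, xs, ys), b)

# At a critical point both derivatives vanish, which forces lam * a^2 = 0.
residual = a * dL_da - dL_db - lam * a * a
print(residual)
```

The residual is zero up to finite-difference error because the exponential neuron depends on the bias only through a factor that is its own derivative.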
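Proof sketch of Lemma 1(ii): The main idea of the proof is to use the high order information of the local minimum to derive Equation (8). Due to the assumption that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde{L}_n$, there exists a bounded local region such that the parameters $\tilde{\theta}^*$ achieve the minimum loss value in this region, i.e., $\exists \delta \in (0, 1)$ such that $\tilde{L}_n(\tilde{\theta}^* + \Delta) \ge \tilde{L}_n(\tilde{\theta}^*)$ for $\forall \Delta : \|\Delta\|_2 \le \delta$.
Now, we use $\delta_a$, $\delta_w$ to denote the perturbations on the parameters $a$ and $w$, respectively. Next, we consider the loss value at the point $\tilde{\theta}^* + \Delta = (\theta^*, a^* + \delta_a, w^* + \delta_w, b^*)$, where we set $|\delta_a| = e^{-1/\varepsilon}$ and $\delta_w = \varepsilon u$ for an arbitrary unit vector $u$ with $\|u\|_2 = 1$. Therefore, as $\varepsilon$ goes to zero, the perturbation magnitude $\|\Delta\|_2$ also goes to zero and this indicates that there exists an $\varepsilon_0 \in (0, 1)$ such that $\tilde{L}_n(\tilde{\theta}^* + \Delta) \ge \tilde{L}_n(\tilde{\theta}^*)$ for $\forall \varepsilon \in (0, \varepsilon_0)$. By the result $a^* = 0$, shown in Lemma 1(i), the output of the model $\tilde{f}$ under parameters $\tilde{\theta}^* + \Delta$ can be expressed by
$$\tilde{f}(x;\tilde{\theta}^* + \Delta) = f(x;\theta^*) + \delta_a \exp\big(\delta_w^\top x\big) \exp\big(w^{*\top} x + b^*\big).$$
For simplicity of notation, let $g(x;\tilde{\theta}^*, \delta_w) = \exp(\delta_w^\top x)\exp(w^{*\top} x + b^*)$. From the second order Taylor expansion with Lagrangian remainder and the assumption that $\ell$ is twice differentiable, it follows that there exists a constant $C(\tilde{\theta}^*, \mathcal{D})$ depending only on the local minimizer $\tilde{\theta}^*$ and the dataset $\mathcal{D}$ such that the following inequality holds for every sample in the dataset and every $\varepsilon \in (0, \varepsilon_0)$,
$$\ell\big(-y_i \tilde{f}(x_i;\tilde{\theta}^* + \Delta)\big) \le \ell\big(-y_i f(x_i;\theta^*)\big) - \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\, \delta_a\, g(x_i;\tilde{\theta}^*, \delta_w) + C(\tilde{\theta}^*, \mathcal{D})\, \delta_a^2.$$
Averaging the above inequality over all samples in the dataset, adding the regularizer, dividing by $|\delta_a| = e^{-1/\varepsilon}$, and recalling that $\tilde{L}_n(\tilde{\theta}^* + \Delta) \ge \tilde{L}_n(\tilde{\theta}^*)$ holds for all $\varepsilon \in (0, \varepsilon_0)$, we obtain
$$-\mathrm{sgn}(\delta_a)\, \frac{1}{n}\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\, \exp\big(\varepsilon u^\top x_i\big) \exp\big(w^{*\top} x_i + b^*\big) + \Big[C(\tilde{\theta}^*, \mathcal{D}) + \frac{\lambda}{2}\Big] \exp(-1/\varepsilon) \ge 0.$$
Finally, we complete the proof by induction. Specifically, for the base case $p = 0$, we can take the limit on both sides of the above inequality as $\varepsilon \to 0$ and, using the property that $\delta_a$ can be either positive or negative, establish Equation (8) for $p = 0$. For the higher order case, we can first assume that Equation (8) holds for $p = 0, \ldots, k$ and then subtract these equations from the above inequality. After dividing by $\varepsilon^{k+1}$ and taking the limit on both sides of the inequality as $\varepsilon \to 0$, we can prove that Equation (8) holds for $p = k + 1$. Therefore, by induction, we can prove that Equation (8) holds for any nonnegative integer $p$.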
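The induction step can be made explicit by expanding the exponential in a power series. The following is a sketch under assumed notation: $\varepsilon$ is the perturbation scale, $u$ the unit direction of the weight perturbation, $\delta_a$ the perturbation of the output weight with $|\delta_a| = e^{-1/\varepsilon}$, and $S_p(u)$ and the positive constant $C'$ are shorthand introduced only for this illustration.

```latex
% Shorthand for the left-hand side of Equation (8):
S_p(u) := \sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\,
          e^{w^{*\top} x_i + b^*} \big(u^\top x_i\big)^p .

% Power series of the perturbed exponential neuron:
\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\,
  e^{\varepsilon u^\top x_i}\, e^{w^{*\top} x_i + b^*}
  = \sum_{p \ge 0} \frac{\varepsilon^p}{p!}\, S_p(u).

% If S_0(u) = \dots = S_k(u) = 0, local optimality gives, for both signs of \delta_a,
-\operatorname{sgn}(\delta_a)
  \Big[ \frac{\varepsilon^{k+1}}{(k+1)!}\, S_{k+1}(u) + O(\varepsilon^{k+2}) \Big]
  + C' e^{-1/\varepsilon} \;\ge\; 0 .

% Divide by \varepsilon^{k+1} and let \varepsilon \to 0^+;
% since e^{-1/\varepsilon} / \varepsilon^{k+1} \to 0,
\pm\, S_{k+1}(u) \ge 0 \quad\Longrightarrow\quad S_{k+1}(u) = 0 .
```

The choice of an exponentially small perturbation of the output weight is what makes the second-order remainder negligible against every power of $\varepsilon$.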
5.3 Proof Sketch of Lemma 2
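The proof of Lemma 2 follows directly from the results in reference [37]. It is easy to check that, for every sequence $\{c_i\}_{i=1}^n$ and every nonnegative integer $k \ge 0$, the $k$-th order tensor $T_k = \sum_{i=1}^n c_i x_i^{\otimes k}$ is a symmetric tensor. From Theorem 1 in [37], it directly follows that
$$\max_{u_1, \ldots, u_k : \|u_1\|_2 = \cdots = \|u_k\|_2 = 1} \big|T_k \otimes u_1 \otimes \cdots \otimes u_k\big| = \max_{u : \|u\|_2 = 1} \big|T_k \otimes u \otimes \cdots \otimes u\big|.$$
Furthermore, by the assumption that $T_k \otimes u \otimes \cdots \otimes u = \sum_{i=1}^n c_i (u^\top x_i)^k = 0$ holds for all $\|u\|_2 = 1$, then
$$\max_{u_1, \ldots, u_k : \|u_1\|_2 = \cdots = \|u_k\|_2 = 1} \big|T_k \otimes u_1 \otimes \cdots \otimes u_k\big| = 0,$$
and this is equivalent to $T_k = \mathbf{0}_d^{\otimes k}$, where $\mathbf{0}_d$ is the zero vector in the $d$-dimensional space.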
5.4 Proof Sketch of Theorem 1
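For every dataset $\mathcal{D}$ satisfying Assumption 2, by the Lagrangian interpolating polynomial, there always exists a polynomial $P(x) = \sum_j c_j \pi_j(x)$ defined on $\mathbb{R}^d$ such that it can correctly classify all samples in the dataset with margin at least one, i.e., $y_i P(x_i) \ge 1$, $\forall i \in [n]$, where $\pi_j$ denotes the $j$-th monomial in the polynomial $P(x)$. Therefore, from Lemmas 1 and 2, it follows that
$$\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*)\big)\, e^{w^{*\top} x_i + b^*}\, y_i P(x_i) = \sum_j c_j \sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\, e^{w^{*\top} x_i + b^*}\, \pi_j(x_i) = 0.$$
Since $y_i P(x_i) \ge 1$ and $e^{w^{*\top} x_i + b^*} > 0$ hold for $\forall i \in [n]$ and the loss function $\ell$ is a non-decreasing function, i.e., $\ell'(z) \ge 0$, $\forall z \in \mathbb{R}$, then $\ell'(-y_i f(x_i;\theta^*)) = 0$ holds for all $i \in [n]$. In addition, from the assumption that every critical point of the loss function $\ell$ is a global minimum, it follows that $z_i = -y_i f(x_i;\theta^*)$ achieves the global minimum of the loss function $\ell$ and this further indicates that $\theta^*$ is a global minimum of the empirical loss $L_n(\theta)$. Furthermore, since at every local minimum, the exponential neuron is inactive, $a^* = 0$, then the set of parameters $\tilde{\theta}^*$ is a global minimum of the loss function $\tilde{L}_n(\tilde{\theta})$. Finally, since every critical point of the loss function $\ell(z)$ satisfies $z < 0$, then for every sample, $\ell'(-y_i f(x_i;\theta^*)) = 0$ indicates that $y_i f(x_i;\theta^*) > 0$, or, equivalently, $y_i = \mathrm{sgn}(f(x_i;\theta^*))$. Therefore, the set of parameters $\theta^*$ also minimizes the training error. In summary, the set of parameters $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ minimizes the loss function $\tilde{L}_n(\tilde{\theta})$ and the set of parameters $\theta^*$ simultaneously minimizes the empirical loss function $L_n(\theta)$ and the training error $R_n(\theta; f)$.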
6 Conclusions
One of the difficulties in analyzing neural networks is the nonconvexity of the loss functions which allows the existence of many spurious minima with large loss values. In this paper, we prove that for any neural network, by adding a special neuron and an associated regularizer, the new loss function has no spurious local minimum. In addition, we prove that, at every local minimum of this new loss function, the exponential neuron is inactive and this means that the augmented neuron and regularizer improve the landscape of the loss surface without affecting the representing power of the original neural network. We also extend the main result in a few ways. First, while adding a special neuron makes the network different from a classical neural network architecture, the same result also holds for a standard fully connected network with one special neuron added to each layer. Second, the same result holds if we change the exponential neuron to a polynomial neuron with a degree dependent on the data. Third, the same result holds even if one feature vector corresponds to both labels.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [2] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
 [3] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, pages 1058–1066, 2013.
 [4] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 1995.
 [5] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [6] A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [8] G. Huang and Z. Liu. Densely connected convolutional networks. In CVPR, 2017.
 [9] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
 [10] K. Kawaguchi. Deep learning without poor local minima. In NIPS, pages 586–594, 2016.
 [11] C. D. Freeman and J. Bruna. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
 [12] M. Hardt and T. Ma. Identity matters in deep learning. ICLR, 2017.
 [13] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.
 [14] Q. Nguyen and M. Hein. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017.
 [15] Q. Nguyen and M. Hein. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017.
 [16] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In NIPS, 2014.
 [17] S. S. Du and J. D. Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
 [18] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. ICLR, 2018.
 [19] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In ICML, 2014.
 [20] H. Sedghi and A. Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
 [21] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
 [22] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
 [23] A. Gautier, Q. N. Nguyen, and M. Hein. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In NIPS, pages 1687–1695, 2016.
 [24] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
 [25] M. Soltanolkotabi. Learning relus via gradient descent. In NIPS, pages 2004–2014, 2017.
 [26] D. Soudry and E. Hoffer. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
 [27] S. Goel and A. Klivans. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017.
 [28] S. S. Du, J. D. Lee, and Y. Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.
 [29] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.
 [30] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In NIPS, pages 597–607, 2017.
 [31] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. 2018.
 [32] B. Haeffele, E. Young, and R. Vidal. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In ICML, 2014.
 [33] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
 [34] O. Shamir. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.
 [35] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2016.
 [36] I. Safran and O. Shamir. Spurious local minima are common in two-layer relu neural networks. ICML, 2018.
 [37] X. Zhang, C. Ling, and L. Qi. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 2012.
Appendix A Proof of Lemma 1
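A.1 Proof of Lemma 1(i)
Proof.
To prove $a^* = 0$, we only need to check the first order conditions of local minima. By the assumption that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of $\tilde{L}_n$, the derivatives of $\tilde{L}_n$ with respect to $a$ and $b$ at the point $\tilde{\theta}^*$ are both zero, i.e.,
$$\nabla_a \tilde{L}_n(\tilde{\theta})\big|_{\tilde{\theta}=\tilde{\theta}^*} = -\frac{1}{n}\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*) - y_i a^* e^{w^{*\top} x_i + b^*}\big)\, y_i\, e^{w^{*\top} x_i + b^*} + \lambda a^* = 0,$$
$$\nabla_b \tilde{L}_n(\tilde{\theta})\big|_{\tilde{\theta}=\tilde{\theta}^*} = -\frac{a^*}{n}\sum_{i=1}^n \ell'\big(-y_i f(x_i;\theta^*) - y_i a^* e^{w^{*\top} x_i + b^*}\big)\, y_i\, e^{w^{*\top} x_i + b^*} = 0.$$
From the above two equations, it is not difficult to see that $a^*$ satisfies $\lambda (a^*)^2 = 0$ or, equivalently, $a^* = 0$.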
∎
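A.2 Proof of Lemma 1(ii)
Proof.
The main idea of the proof is to use the high order information of the local minimum to prove the Lemma. Due to the assumption that $\tilde{\theta}^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde{L}_n$, there exists a bounded local region such that the parameters $\tilde{\theta}^*$ achieve the minimum loss value in this region, i.e., $\exists \delta \in (0, 1)$ such that $\tilde{L}_n(\tilde{\theta}^* + \Delta) \ge \tilde{L}_n(\tilde{\theta}^*)$ for $\forall \Delta : \|\Delta\|_2 \le \delta$.
Now, we use $\delta_a$, $\delta_w$ to denote the perturbations on the parameters $a$ and $w$, respectively. Next, we consider the loss value at the point $\tilde{\theta}^* + \Delta = (\theta^*, a^* + \delta_a, w^* + \delta_w, b^*)$, where we set $|\delta_a| = e^{-1/\varepsilon}$ and $\delta_w = \varepsilon u$ for an arbitrary unit vector $u$ with $\|u\|_2 = 1$. Therefore, as $\varepsilon$ goes to zero, the perturbation magnitude $\|\Delta\|_2$ also goes to zero and this indicates that there exists an $\varepsilon_0 \in (0, 1)$ such that $\tilde{L}_n(\tilde{\theta}^* + \Delta) \ge \tilde{L}_n(\tilde{\theta}^*)$ for $\forall \varepsilon \in (0, \varepsilon_0)$. By $a^* = 0$, the output of the model $\tilde{f}$ under parameters $\tilde{\theta}^* + \Delta$ can be expressed by
$$\tilde{f}(x;\tilde{\theta}^* + \Delta) = f(x;\theta^*) + \delta_a \exp\big(\delta_w^\top x\big) \exp\big(w^{*\top} x + b^*\big).$$
Let $g(x;\tilde{\theta}^*, \delta_w) = \exp(\delta_w^\top x)\exp(w^{*\top} x + b^*)$. For each sample $(x_i, y_i)$ in the dataset, by the second order Taylor expansion with Lagrangian remainder, there exists a scalar $\xi_i \in [0, 1]$ depending on $\delta_a$ and $\delta_w$ such that the following equation holds,
$$\ell\big(-y_i f(x_i;\theta^*) - y_i \delta_a g(x_i;\tilde{\theta}^*, \delta_w)\big) = \ell\big(-y_i f(x_i;\theta^*)\big) - \ell'\big(-y_i f(x_i;\theta^*)\big)\, y_i\, \delta_a\, g(x_i;\tilde{\theta}^*, \delta_w) + \frac{1}{2}\, \ell''\big(-y_i f(x_i;\theta^*) - \xi_i y_i \delta_a g(x_i;\tilde{\theta}^*, \delta_w)\big)\, \delta_a^2\, g^2(x_i;\tilde{\theta}^*, \delta_w).$$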
Let vector denote an arbitrary unit vector. Let and . Clearly, for all , and , we have
Since , then for each , there exists a constant depend on such that
holds for all .
Since is a local minimum, then there exists such that the inequality