Depth with Nonlinearity Creates No Bad Local Minima in ResNets
Abstract
In this paper, we prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets studied in previous work, in the sense that the values of all local minima are no worse than the global minimum values of corresponding shallow linear predictors with arbitrary fixed features, and are guaranteed to further improve via residual representations. As a result, this paper provides an affirmative answer to an open question stated in a paper at the Conference on Neural Information Processing Systems (NIPS) 2018. We note that even though our paper advances the theoretical foundation of deep learning and nonconvex optimization, there is still a gap between theory and many practical deep learning applications.
1 Introduction
Deep learning has seen practical success with a significant impact on the fields of computer vision, machine learning and artificial intelligence. In addition to its practical success, deep learning has been theoretically studied and shown to have strong expressive power. For example, neural networks with one hidden layer can approximate any continuous function (Leshno et al., 1993; Barron, 1993), and deeper neural networks can approximate functions of certain classes more compactly (Montufar et al., 2014; Livni et al., 2014; Telgarsky, 2016).
However, one of the major concerns in both theory and practice is that training a deep learning model requires us to deal with highly nonconvex and high-dimensional optimization. Optimization problems with a general nonconvex function and with a certain nonconvex function induced by some specific neural networks are both known to be NP-hard (Murty and Kabadi, 1987; Blum and Rivest, 1992), which would pose no serious challenge if only the problems were not high-dimensional (Kawaguchi et al., 2015, 2016). Therefore, a hope is that nonconvex high-dimensional optimization in deep learning allows some additional structure or assumption to make the optimization tractable. Under several simplifying assumptions, recent studies have proven the existence of novel loss landscape structures that may play a role in making the optimization tractable in deep learning (Dauphin et al., 2014; Choromanska et al., 2015; Kawaguchi, 2016). More recently, Shamir (2018) has shown that a specific type of neural network, namely a residual network (ResNet) with a single output unit (a scalar-valued output), has no local minimum with a value higher than the global minimum value of scalar-valued linear predictors (or equivalently, one-layer networks with a single output unit). However, Shamir (2018) remarks that while it is natural to ask whether this result can be extended to networks with multiple output units (vector-valued outputs) as they are commonly used in practice, it is currently unclear how to prove such a result and the question is left to future research.
As a step towards establishing the optimization theory in deep learning, this paper presents theoretical results that provide an answer to the open question posed in (Shamir, 2018). Moreover, this paper proves a quantitative upper bound on the local minimum value, which shows not only that the local minimum values of deep ResNets are always no worse than the global minimum value of vector-valued linear predictors (or one-layer networks with multiple output units), but also that further improvements in the quality of local minima are guaranteed via nonnegligible residual representations.
2 Preliminaries
The Residual Network (ResNet) is a class of neural networks that is commonly used in practice with state-of-the-art performance in many applications (He et al., 2016a, b; Kim et al., 2016; Xie et al., 2017; Xiong et al., 2018). When compared to standard feedforward neural networks, ResNets introduce skip connections, which add the output of some previous layer directly to the output of some following layer. A main idea of ResNet is that these skip connections allow each layer to focus on fitting the residual of the target output that is not covered by the previous layer’s output. Accordingly, we may expect that a trained ResNet is no worse than a trained shallower network consisting of fewer layers only up to the previous layer. However, because of the nonconvexity, it is unclear whether ResNets exhibit this behavior, instead of getting stuck around some arbitrarily poor local minimum.
2.1 Model
To study the nonconvex optimization problems of ResNets, both the previous study (Shamir, 2018) and this paper consider a type of arbitrarily deep ResNets, for which the pre-activation output $h(x, W, V, \theta) \in \mathbb{R}^{d_y}$ of the last layer can be written as
(1) $h(x, W, V, \theta) = W\bigl(x + V z(x, \theta)\bigr).$
Here, $W \in \mathbb{R}^{d_y \times d_x}$, $V \in \mathbb{R}^{d_x \times d_z}$, and $\theta$ consist of trainable parameters, $x \in \mathbb{R}^{d_x}$ is the input vector in any fixed feature space embedded in $\mathbb{R}^{d_x}$, and $z(x, \theta) \in \mathbb{R}^{d_z}$ represents the outputs of arbitrarily deep residual functions parameterized by $\theta$. Also, $d_y$ is the number of output units, $d_x$ is the number of input units, and $d_z$ represents the dimension of the outputs of the residual functions.
There is no assumption on the structure of $z(x, \theta)$, and $z(x, \theta)$ is allowed to represent some possibly complicated deep residual functions that arise in ResNets. For example, the model in Equation (1) can represent arbitrarily deep pre-activation ResNets (He et al., 2016b), which are widely used in practice. To facilitate and simplify theoretical study, Shamir (2018) assumed that every entry of the matrix $V$ is unconstrained and fully trainable (e.g., instead of $V$ representing convolutions). This paper adopts this assumption, following the previous study.
(On arbitrary handcrafted features) All of our results hold true with in any fixed feature space embedded in . Indeed, an input to neural networks represents an input in any such feature space (instead of only in a raw input space); e.g., given raw input and any feature map (including identity as ), we write with .
(On bias terms) All of our results hold true for the model with or without bias terms; i.e., given original and , we can always set and to account for bias terms if desired.
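As a minimal sketch of the model in this subsection, the following NumPy snippet assumes the form $h = W(x + V z(x, \theta))$. The function `residual_features` is a hypothetical single tanh layer standing in for $z(x, \theta)$, on which the paper places no structural assumption; all names and sizes are illustrative choices, not the authors' implementation.

```python
import numpy as np

def residual_features(x, theta):
    # Hypothetical stand-in for z(x, theta): a single tanh layer.
    # Any (possibly very deep) residual function could be used instead.
    return np.tanh(theta @ x)

def predictor(x, W, V, theta):
    # Pre-activation output of the last layer, assuming h = W (x + V z(x, theta)).
    return W @ (x + V @ residual_features(x, theta))

rng = np.random.default_rng(0)
d_x, d_z, d_y = 5, 4, 3              # illustrative sizes only
W = rng.normal(size=(d_y, d_x))      # unconstrained, fully trainable
V = rng.normal(size=(d_x, d_z))      # unconstrained, fully trainable
theta = rng.normal(size=(d_z, d_x))  # parameters of the residual function
x = rng.normal(size=d_x)

h = predictor(x, W, V, theta)
# Setting V = 0 removes the residual branch and leaves the linear predictor W x.
h_linear = predictor(x, W, np.zeros_like(V), theta)
```

With $V = 0$ the sketch collapses to a one-layer linear model, which is the baseline that the local minima of the full model are compared against.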
2.2 Optimization problem
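The previous study (Shamir, 2018) and this paper consider the following optimization problem:
(2) $\min_{W, V, \theta} \; L(W, V, \theta) := \mathbb{E}_{(x, y) \sim \mu}\bigl[\ell(h(x, W, V, \theta), y)\bigr],$
where $W, V, \theta$ are unconstrained, $\ell$ is some loss function to be specified, and $y \in \mathbb{R}^{d_y}$ is the target vector. Here, $\mu$ is an arbitrary probability measure on the space of the pair $(x, y)$ such that whenever the partial derivative $\partial_{(W,V)} \ell(h(x, W, V, \theta), y)$ exists, the identity,
(3) $\partial_{(W,V)} L(W, V, \theta) = \mathbb{E}_{(x, y) \sim \mu}\bigl[\partial_{(W,V)} \ell(h(x, W, V, \theta), y)\bigr],$
holds at every local minimum $(W, V, \theta)$ of $L$. In particular, the empirical measure of a finite training dataset satisfies this condition, since Equation (3) then only exchanges differentiation with a finite sum.
Therefore, all the results in this paper always hold true for the standard training error objective,
(4) $L(W, V, \theta) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i, W, V, \theta), y_i),$
because $L(W, V, \theta) = \mathbb{E}_{(x, y) \sim \mu}[\ell(h(x, W, V, \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i, W, V, \theta), y_i)$, where the last equality used the empirical measure $\mu = \frac{1}{m} \sum_{i=1}^{m} \delta_{(x_i, y_i)}$ with the Dirac measures $\delta_{(x_i, y_i)}$. In general, the objective function $L$ in Equations (2) and (4) is nonconvex even in $(W, V)$ with a convex map $h \mapsto \ell(h, y)$.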
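The nonconvexity claim can be illustrated with a tiny example. The sketch below assumes the model form $h = W(x + V z(x, \theta))$ and the squared loss, and uses a hypothetical one-sample, one-dimensional dataset chosen only so that the prediction reduces to the product of the two scalar parameters; it is an illustration, not part of the paper's analysis.

```python
import numpy as np

def squared_loss(h, y):
    # One convex choice of the loss; other convex losses could be swapped in.
    return float(np.sum((h - y) ** 2))

def empirical_objective(W, V, theta, xs, ys, z, loss=squared_loss):
    # Average loss over the m training pairs (x_i, y_i),
    # assuming the model form h = W (x + V z(x, theta)).
    values = [loss(W @ (x + V @ z(x, theta)), y) for x, y in zip(xs, ys)]
    return float(np.mean(values))

# Hypothetical data: x = 0, z = 1, y = 1, so the prediction is the product W * V.
xs, ys = [np.array([0.0])], [np.array([1.0])]
z_const = lambda x, theta: np.array([1.0])

def objective_at(w, v):
    return empirical_objective(np.array([[w]]), np.array([[v]]), None, xs, ys, z_const)

loss_a = objective_at(1.0, 1.0)      # (1 * 1 - 1)^2
loss_b = objective_at(-1.0, -1.0)    # ((-1) * (-1) - 1)^2
loss_mid = objective_at(0.0, 0.0)    # midpoint of the two points above
```

Both endpoints attain zero loss while their midpoint does not, so the objective violates the defining inequality of convexity in the pair of parameters even though the loss itself is convex in the prediction.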
This paper analyzes the quality of the local minima in Equation (2) in terms of the global minimum value of the linear predictors with an arbitrary fixed basis (e.g., with some feature map ) that is defined as
Similarly, define to be the global minimum values of the linear predictors with the basis as
2.3 Background
Given any fixed , let be a function of . The main additional assumptions in the previous study (Shamir, 2018) are the following:

The output dimension is one as .

For any , the map is convex and twice differentiable.

On any bounded subset of the domain of , the function , its gradient , and its Hessian are all Lipschitz continuous in .
The previous work (Shamir, 2018) also implicitly requires Equation (3) to hold at all relevant points for optimization, including every local minimum (see the proof in the previous paper for more detail), which is not required in this paper. Under these assumptions, along with an analysis for a simpler decoupled model (), the previous study (Shamir, 2018) provided a quantitative analysis of approximate stationary points, and proved the following main result for the optimization problem in Equation (2). Proposition 2.3 (Shamir, 2018) If assumptions 1, 2 and 3 hold, every local minimum of satisfies
3 Main results
Our main results are presented in Section 3.1 for a general case with arbitrary loss and arbitrary measure, and in Section 3.2 for a concrete case with the squared loss and the empirical measure.
3.1 Result for arbitrary loss and arbitrary measure
This paper discards the above assumptions from the previous literature, and adopts the following assumptions instead:

The output dimension satisfies .

For any , the map is convex and differentiable.
Assumptions 1 and 2 can be easily satisfied in many practical applications in deep learning. For example, we usually have that in multiclass classification with MNIST, CIFAR-10 and SVHN, which satisfies Assumption 1. Assumption 2 is usually satisfied in practice as well, because it is satisfied by simply using a common such as squared loss, cross-entropy loss, logistic loss and smoothed hinge loss among others.
Using these mild assumptions, we now state our main result in Theorem 3.1 for arbitrary loss and arbitrary measure (including the empirical measure).
Theorem 3.1. If Assumptions 1 and 2 hold, every local minimum of satisfies
(5) 
From Theorem 3.1, one can see that if Assumptions 1 and 2 hold, the objective function has the following properties:

Every local minimum value is at most the global minimum value of linear predictors with the arbitrary fixed basis as .

If is nonnegligible such that , every local minimum value is strictly less than the global minimum value of linear predictors as .
By allowing multiple output units, Theorem 3.1 provides an affirmative answer to the open question posed in (Shamir, 2018) (see Section 2.3). Here, our set of assumptions is strictly weaker than the set of assumptions used to prove Proposition 2.3 in the previous work (Shamir, 2018) (including all assumptions implicitly made in the description of the model, optimization problem, and probability measure), in that the latter implies the former but not vice versa. For example, one can compare Assumptions 1 and 2 against the previous paper’s assumptions 1, 2 and 3 in Section 2.3. We note that along with Proposition 2.3, the previous work (Shamir, 2018) also provided an analysis of approximate stationary points, for which some additional continuity assumption such as the previous paper’s assumptions 2 and 3 would be indispensable (i.e., one can consider the properties around a point based on those at the point via some continuity).
In addition to responding to the open question, Theorem 3.1 further states that the guarantee on the local minimum value of ResNets can be much better than the global minimum value of linear predictors, depending on the quality of the residual representation . In Theorem 3.1, we always have that . This is because a linear predictor with the basis achieves by restricting the coefficients of to be zero and minimizing only the rest. Accordingly, if is nonnegligible (, the local minimum value of ResNet is guaranteed to be strictly better than the global minimum value of linear predictors, the degree of which is abstractly quantified in Theorem 3.1 and concretely quantified in the next subsection.
3.2 Result for squared loss and empirical measure
To provide a concrete example of Theorem 3.1, this subsection sets to be the squared loss and to be the empirical measure. That is, this subsection discards Assumption 2 and uses the following assumptions instead:

The map represents the squared loss as .

The is the empirical measure as .
Assumptions 1 and 2 imply that . Let us define the matrix notation of relevant terms as , , and . Let be the orthogonal projection matrix onto the column space (or range space) of a matrix . Let be the orthogonal projection matrix onto the null space (or kernel space) of a matrix . Let be the Frobenius norm.
We now state a concrete example of Theorem 3.1 for the case of the squared loss and the empirical measure.
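Theorem 3.2. If Assumption 1 from Section 3.1 and the two assumptions of this subsection (squared loss and empirical measure) hold, every local minimum of satisfies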
(6) 
As in Theorem 3.1, one can see in Theorem 3.2 that every local minimum value is at most the global minimum value of linear predictors. When compared with Theorem 3.1, each term in the upper bound in Theorem 3.2 is more concrete. The global minimum value of linear predictors is , which is the (averaged) norm of the target data matrix projected onto the null space of . The further improvement term via the residual representation is
This is the (averaged) norm of the residual projected onto the column space of . Therefore, a local minimum can obtain this further improvement if the residual is captured in the residual representation that differs from , as intended in the residual architecture. More concretely, as the column space of differs more from the column space of , the further improvement term becomes closer to , which gets larger as the residual gets more captured by the column space of .
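The projection picture above can be checked numerically for ordinary least squares. The sketch below is an illustration under assumed conventions that may differ from the paper's matrix notation: samples are stored as rows, `X` holds the inputs, `Z` holds a hypothetical residual representation, and `Y` holds the targets. It compares the least-squares value with basis `X` against the value with the joint basis `[X, Z]`, and expresses the gap as the squared norm of `Y` projected onto the part of `Z` that is orthogonal to the column space of `X`.

```python
import numpy as np

def proj_col(A):
    # Orthogonal projector onto the column space of A.
    return A @ np.linalg.pinv(A)

def lstsq_value(B, Y):
    # Global minimum of (1/m) * ||B R - Y||_F^2 over the coefficient matrix R.
    R, *_ = np.linalg.lstsq(B, Y, rcond=None)
    return np.sum((B @ R - Y) ** 2) / len(Y)

rng = np.random.default_rng(0)
m, d_x, d_z, d_y = 40, 5, 4, 3                # illustrative sizes only
X = rng.normal(size=(m, d_x))                 # inputs, one sample per row
Z = np.tanh(X @ rng.normal(size=(d_x, d_z)))  # hypothetical residual representation
Y = rng.normal(size=(m, d_y))                 # targets

# Projector onto the orthogonal complement of the column space of X.
P_perp = np.eye(m) - proj_col(X)

linear_value = np.sum((P_perp @ Y) ** 2) / m                # best value with basis X
improvement = np.sum((proj_col(P_perp @ Z) @ Y) ** 2) / m   # gain from adding Z
joint_value = lstsq_value(np.hstack([X, Z]), Y)             # best value with basis [X, Z]
```

In this sketch `joint_value` equals `linear_value - improvement` up to floating-point error, and `improvement` is never negative, which mirrors the structure of the bound: a baseline term for the linear predictor minus a further improvement term contributed by the residual representation.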
4 Proof idea and additional results
This section provides overviews of the proofs of the theoretical results. The complete proofs are provided in the Appendix at the end of this paper. In contrast to the previous work (Shamir, 2018), this paper proves the quality of the local minima with the additional further improvement term and without assuming the scalar output (1), twice differentiability (2) and Lipschitz continuity (3). Accordingly, our proofs largely differ from those of the previous study (Shamir, 2018).
Along with the proofs of the main results, this paper proves the following lemmas. For a matrix , represents the standard vectorization of the matrix . Let be the Kronecker product of matrices and . Let be the identity matrix of size by . Lemma (derivatives of predictor) The function is differentiable with respect to and the partial derivatives have the following forms:
and
Lemma (necessary condition of local minimum) If is a local minimum of ,
where
4.1 Proof overview of lemmas
The first lemma (derivatives of predictor) follows a standard observation and a common derivation. The second lemma (necessary condition of local minimum) is proven with a case analysis separately for the case of and the case of .
In the case of , the statement of the second lemma follows from the first order necessary condition of local minimum, , along with the observation that the derivative of with respect to exists.
In the case of , instead of solely relying on the first order conditions, our proof directly utilizes the definition of local minimum as follows. We first consider a family of sufficiently small perturbations of such that , and observe that if is a local minimum, then must be a local minimum via the definition of local minimum and the triangle inequality. Then, by checking the first order necessary conditions of local minimum for both and , we obtain the statement of Lemma 4.
4.2 Proof overview of theorems
Theorem 3.1 is proven by showing that from Lemma 4, every local minimum induces a globally optimal linear predictor of the form, , in terms of the , where and . This yields that . In the proof of Theorem 3.2, we derive the specific forms of for the case of the squared loss and the empirical measure, obtaining the statement of Theorem 3.2.
5 Conclusion
In this paper, we partially addressed an open problem on a type of deep ResNets by showing that instead of having arbitrarily poor local minima, all local minimum values are no worse than the global minimum value of linear predictors, and are guaranteed to further improve via the residual representation. This paper considered the same optimization problem of ResNets as in the previous literature, in a more general setting. However, the optimization problem in this paper and the literature does not yet directly apply to many practical applications, because the parameters in the matrix are considered to be unconstrained. To improve the applicability, future work should consider the problem with constrained .
Mathematically, we can consider a map that takes a classical machine learning model with linear predictors (with arbitrary fixed features) as an input and outputs a deep version of the classical model. We can then ask what structure this “deepening” map preserves. In terms of this context, this paper proved that in a type of ResNets, depth with nonlinearity (a certain “deepening” map) does not create local minima with loss values worse than the global minimum value of the original model.
Acknowledgements
We would like to thank Professor Ohad Shamir for his inspiring talk on his great paper (Shamir, 2018) and a subsequent conversation. We gratefully acknowledge support from NSF grants 1420316, 1523767 and 1723381, from AFOSR grant FA95501710165, from Honda Research and Draper Laboratory, as well as support from NSERC, CIFAR and Canada Research Chairs.
Appendix
This appendix presents complete proofs.
Appendix A Proofs of lemmas
A.1 Proof of Lemma 4 (derivatives of predictor)
Proof.
The differentiability follows the fact that is linear in and affine in given other variables being fixed; i.e., with (where is linear in and is linear in ), since (by the linearity of in and the linearity of in ), we have that where as .
For the forms of partial derivatives, because (for matrices and of appropriate sizes), and because , we have that
and
Taking derivatives of in these forms with respect to and respectively yields the desired statement.
∎
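The derivation above relies on identities that relate vectorization to the Kronecker product. As the exact identities are not reproduced here, the following sketch only checks the standard ones, assuming column-stacking vectorization: $\mathrm{vec}(AB) = (I \otimes A)\,\mathrm{vec}(B) = (B^\top \otimes I)\,\mathrm{vec}(A)$. The matrices `A` and `B` are arbitrary illustrative inputs.

```python
import numpy as np

def vec(M):
    # Column-stacking (column-major) vectorization of a matrix.
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

lhs = vec(A @ B)
# vec(A B) written as a linear map acting on vec(B).
via_B = np.kron(np.eye(2), A) @ vec(B)
# vec(A B) written as a linear map acting on vec(A).
via_A = np.kron(B.T, np.eye(3)) @ vec(A)
```

Writing a matrix product as a fixed matrix times a vectorized factor is what makes the partial derivatives with respect to each factor readable directly from the Kronecker-product coefficient.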
A.2 Proof of Lemma 4 (necessary condition of local minimum)
Proof.
This proof considers two cases in terms of , and proves that the desired statement holds in both cases. Note that from Lemma 4 and Assumption 2, is differentiable with respect to , because a composition of differentiable functions is differentiable. From the condition on , this implies that is differentiable with respect to at every local minimum . Also, note that since a (or a ) in our analysis is either an arbitrary point or a point depending on the (as well as and ), we can write where is some function of and with the possible dependence being explicit (the same statement holds for ). Let and for notational simplicity.
Case of : From the first order condition of local minimum with respect to ,
where the second line follows Lemma 4. This implies that , which in turn implies that
since .
Similarly, from the first order condition of local minimum with respect to ,
where the second line follows Lemma 4. This implies that
where the last equality follows from that .
Therefore, if is a local minimum and if , we have that and .
Case of : Since and , we have that and there exists a vector such that and . Let be such a vector, and define
where . Since , we have that for any ,
and
If is a local minimum, must be a local minimum with respect to (given the fixed ). If is a local minimum with respect to (given the fixed ), by the definition of a local minimum, there exists such that for all , where is an open ball of radius with the center at . For any sufficiently small such that , if is a local minimum, every is also a local minimum, because there exists such that
for all (the inclusion follows the triangle inequality), which satisfies the definition of local minimum for .
Thus, for any such sufficiently small , we have that
since otherwise, does not satisfy the first order necessary condition of local minima (i.e., can be moved along the direction of the nonzero partial derivative by a sufficiently small amount so as to decrease the loss value, which contradicts being a local minimum). Hence, for any such sufficiently small ,
which implies that
where the last line follows from the fact that and hence . Since , by multiplying both sides from the left, we have that for any sufficiently small such that ,
which implies that
Then, from ,
where the last equality follows from that .
In summary, if is a local minimum, in both cases of and , we have that and . ∎
Appendix B Proofs of theorems
B.1 Proof of Theorem 3.1
Proof.
Let for notational simplicity. Define and . Then, we have that
and
Since the map is convex and an expectation of convex functions is convex, is convex in . Since a composition of a convex function with an affine function is convex, is convex in . Therefore, from the convexity, if
then is a global minimum of .
We now show that if is a local minimum, then , and hence is a global minimum of . On the one hand, with the same calculations as in the proofs of Lemmas 4 and 4, we have that
On the other hand, Lemma 4 states that if is a local minimum of , we have that and , yielding
and hence
This implies that if is a local minimum, is a global minimum of . Since is the objective function with the linear predictors with the basis , we have that
∎
B.2 Proof of Theorem 3.2
Proof.
From Theorem 3.1, we have that . In this proof, we derive the specific forms of for the case of the squared loss and the empirical measure. Let for notational simplicity. Since the map is assumed to represent the squared loss in this theorem, the global minimum value of linear predictors is the global minimum value of
where . From convexity and differentiability of , is a global minimum if and only if . Since
solving for all solutions of yields that
and hence
Also, the same proof step obtains the fact that is the global minimum value of , which is the objective function with linear predictors .
On the other hand, since the span of the columns of is the same as the span of the columns of , we have that , and
which yields
By plugging this into , with denoting the trace of a matrix ,
where the last line follows from the fact that since .
∎
Footnotes
 A simple sufficient condition to satisfy Equation (3) is for to be bounded in the neighborhood of every local minimum of . Different sufficient conditions to satisfy Equation (3) can be easily obtained by applying various convergence theorems (e.g., the dominated convergence theorem) to the limit (in the definition of derivative) and the integral (in the definition of expectation).
References
 Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
 Avrim L Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
 Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 192–204, 2015.
 Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b.
 Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.
 Kenji Kawaguchi, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Bayesian optimization with exponential convergence. In Advances in Neural Information Processing Systems (NIPS), 2015.
 Kenji Kawaguchi, Yu Maruyama, and Xiaoyu Zheng. Global continuous optimization with error bound and fast convergence. Journal of Artificial Intelligence Research, 56:153–195, 2016.
 Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
 Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
 Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
 Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
 Katta G Murty and Santosh N Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical programming, 39(2):117–129, 1987.
 Ohad Shamir. Are resnets provably better than linear predictors? In Advances in Neural Information Processing Systems, to appear, 2018.
 Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.
 Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas Stolcke. The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5934–5938. IEEE, 2018.