How degenerate is the parametrization of neural networks with the ReLU activation function?


Julius Berner
Faculty of Mathematics, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
julius.berner@univie.ac.at
Dennis Elbrächter
Faculty of Mathematics, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
dennis.elbraechter@univie.ac.at
Philipp Grohs
Faculty of Mathematics and Research Platform DataScience@UniVienna, University of Vienna
Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
philipp.grohs@univie.ac.at
Abstract

Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the network's parameters, the loss function generally only depends on the realization of a neural network, i.e. the function it computes. Studying the functional optimization problem over the space of realizations can open up completely new ways to understand neural network training. In particular, usual loss functions like the mean squared error are convex on sets of neural network realizations, which themselves are non-convex. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space need not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. In this paper we present pathologies which prevent inverse stability in general, and proceed to establish a restricted set of parametrizations on which we have inverse stability w.r.t. a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets. While most of this paper focuses on shallow networks, none of the methods used are, in principle, limited to shallow networks, and it should be possible to extend them to deep neural networks.


1 Introduction and Motivation

In recent years, much effort has been invested in explaining and understanding the overwhelming success of deep learning based methods. On the theoretical side, impressive approximation capabilities of neural networks have been established Bölcskei et al. [2017], Burger and Neubauer [2001], Funahashi [1989], Gühring et al. [2019], Perekrestenko et al. [2018], Petersen and Voigtlaender [2017], Shaham et al. [2018], Yarotsky [2017]. No less important are recent results on the generalization of neural networks, which deal with the question of how well networks, trained on limited samples, perform on unseen data Anthony and Bartlett [2009], Arora et al. [2018], Bartlett et al. [2017a, b], Berner et al. [2018], Golowich et al. [2017], Neyshabur et al. [2017]. Last but not least, the optimization error, which quantifies how well a neural network can be trained by solving the optimization problem with stochastic gradient descent, has been analyzed in different scenarios Allen-Zhu et al. [2018], Choromanska et al. [2015], Du et al. [2018], Kawaguchi [2016], Li and Liang [2018], Li and Yuan [2017], Mei et al. [2018], Shamir and Zhang [2013]. While there are many interesting approaches to the latter question, they tend to require very strong assumptions (e.g. (almost) linearity, convexity, or extreme over-parametrization). Thus, a satisfying explanation for the success of stochastic gradient descent for a non-smooth, non-convex problem remains elusive.
In the present paper we intend to pave the way for a functional perspective on the optimization problem, which will allow for new mathematical approaches towards understanding the training of neural networks, see Section 1.2 and Corollary 1.3. To this end we examine degenerate parametrizations with undesirable properties in Section 2. These can be roughly classified as

  1. unbalanced magnitudes of the parameters

  2. weight vectors with the same direction

  3. weight vectors with directly opposite directions.

Subject to these, Theorem 3.1 establishes inverse stability for shallow ReLU networks. This is accomplished by a refined analysis of the behavior of neural networks with ReLU activation function near a discontinuity of their derivative and requires endowing the function space with a Sobolev norm. Inverse stability connects the loss surface of the parametrized minimization problem to the loss surface of the functional problem, see Proposition 1.2. Note that this functional approach of analyzing the loss surface is conceptually different from previous approaches as in Choromanska et al. [2015], Goodfellow et al. [2014], Li et al. [2018], Nguyen and Hein [2017], Pennington and Bahri [2017], Safran and Shamir [2016].

1.1 Inverse Stability of Neural Networks

We will focus on neural networks with the ReLU activation function, and adapt the mathematically convenient notation from Petersen and Voigtlaender [2017], which distinguishes between the parametrization of a neural network and its realization function. Let be a network architecture specifying the number of neurons in each of the layers. We then define the set of parametrizations with architecture as

(1)

and the realization map

(2)

where and is applied component-wise.
We refer to and as the weights and biases in the -th layer. Note that a parametrization uniquely induces a realization function , while in general there can be multiple non-trivially different parametrizations with the same realization. To put it in mathematical terms, the realization map is not injective. Consider the basic counterexample

(3)

from Petersen et al. [2018] where regardless of and both realization functions coincide with . However, it is well-known that the realization map is Lipschitz continuous, meaning that close parametrizations induce realization functions which are close in the uniform norm on compact sets (on the finite-dimensional parameter space all norms are equivalent and we take, w.l.o.g., the maximum norm, i.e. the maximum of the absolute values of the entries of the weights and biases); see, e.g., [Anthony and Bartlett, 2009, Lemma 14.6], [Berner et al., 2018, Theorem 4.2], and [Petersen et al., 2018, Proposition 5.1].
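To make the distinction between parametrization and realization concrete, the following NumPy sketch (our own illustration; the function and variable names are not from the paper) evaluates a shallow ReLU realization and uses the positive homogeneity of the ReLU to produce two parametrizations that are far apart yet induce exactly the same realization.

```python
import numpy as np

def realization(W1, b1, W2, b2, x):
    """Evaluate x -> W2 @ relu(W1 @ x + b1) + b2 for a batch of inputs x (rows of x)."""
    hidden = np.maximum(x @ W1.T + b1, 0.0)   # ReLU applied component-wise
    return hidden @ W2.T + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.uniform(-1.0, 1.0, size=(1000, 2))

# Positive homogeneity of the ReLU: scaling the first layer by c > 0 and the
# second-layer weights by 1/c leaves the realization unchanged, so the two
# parametrizations below are far apart while their realizations coincide.
c = 100.0
y1 = realization(W1, b1, W2, b2, x)
y2 = realization(c * W1, c * b1, W2 / c, b2, x)
print(np.max(np.abs(y1 - y2)))  # zero up to floating-point error
```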
We will shed light upon the inverse question. Given realizations and that are close (in some norm on , which we will specify later), do the parametrizations and have to be close? In view of the above counterexample this cannot be true in general and, at least, we need to allow for a re-parametrization of one of the networks, i.e. we get the following question.

Given and that are close, does there exist a re-parametrization with such that and are close?

As we will see in Section 2, this question is fundamentally connected to understanding the redundancies and degeneracies of ReLU network parametrizations. By suitable regularization, i.e. considering a subset of parametrizations, we can avoid these pathologies and establish a positive answer to the question above. The only other research in this direction that we are aware of coined the term inverse stability of neural networks for this property Petersen et al. [2018].

Definition 1.1 (Inverse stability).

Let , , and let be a norm on . We say that the realization map is inverse stable on w.r.t. , if for every and there exists with

(4)

In Section 2 we will see why inverse stability w.r.t. the uniform norm (on some open domain ) fails. Therefore, we consider a norm which takes into account not only the maximum error of the function values but also that of the gradients (component-wise). In mathematical terms, we make use of the Sobolev norm defined for every Lipschitz continuous function by with the Sobolev semi-norm given by

(5)

See Evans and Gariepy [2015] for further information on Sobolev norms, and Berner et al. [2019] for further information on the derivative of ReLU networks.
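For intuition, the following sketch (our own; the helper below is hypothetical and not from the paper) estimates the two quantities entering the Sobolev norm on [0,1] by finite differences and shows a pair of functions that are close in the uniform norm but far apart in the Sobolev semi-norm.

```python
import numpy as np

def sup_norm_and_seminorm(f, n_grid=10001, eps=1e-6):
    """Estimate sup_[0,1] |f| and sup_[0,1] |f'| on a fine grid via central differences."""
    xs = np.linspace(0.0, 1.0, n_grid)
    values = f(xs)
    derivs = (f(xs + eps) - f(xs - eps)) / (2.0 * eps)
    return np.max(np.abs(values)), np.max(np.abs(derivs))

# Two functions that are close in the uniform norm but far apart in the Sobolev
# semi-norm: the zero function and a steep ReLU "kink" placed near the boundary.
zero = lambda x: np.zeros_like(x)
spike = lambda x: 10.0 * np.maximum(x - 0.99, 0.0)

print(sup_norm_and_seminorm(zero))   # (0.0, 0.0)
print(sup_norm_and_seminorm(spike))  # (approx. 0.1, approx. 10.0)
```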

1.2 Implications of inverse stability for neural network optimization

We continue by showing how inverse stability opens up new perspectives on the optimization problem for neural networks. Specifically, consider a loss function on the space of continuous functions. For illustration, we take the commonly used mean squared error (MSE) which, for training data , is given by

(6)

Typically, the optimization problem is over some subset of parametrizations , i.e.

(7)

From an abstract point of view, by writing this is equivalent to the corresponding optimization problem over the space of realizations , i.e.

(8)

However, the loss landscape of the former problem is only properly connected to the landscape of the latter if the realization map is inverse stable on . Otherwise a realization can be arbitrarily close to a global or local minimum in the realization space but every parametrization with is far away from the corresponding minimum in the parameter space (see Example A.1). The next proposition shows that, if we have inverse stability, local minima of (7) in the parameter space are local minima of (8) in the realization space.

Proposition 1.2 (Parameter minimum implies realization minimum).

Let and a norm on such that the realization map is inverse stable on w.r.t. . Let be a local minimum of on with radius , i.e. for all with it holds that

(9)

Then is a local minimum of on with radius , i.e. for all with it holds that

(10)

See Appendix A.2 for a proof. Using Proposition 1.2, it is now possible to transfer results, which are gained by analyzing the optimization problem over the realization space, to the parametrized setting. Note that on the functional side we consider a problem with convex loss function but non-convex feasible set, see [Petersen et al., 2018, Section 3.2]. This opens up completely new avenues of investigation using tools from functional analysis and, e.g., utilizing recent results Gribonval et al. [2019], Petersen et al. [2018] exploring the topological properties of neural network realization spaces.
For ease of presentation, we restrict ourselves to two-layer networks, where we present a proof for inverse stability w.r.t. the Sobolev semi-norm on a suitably regularized set of parametrizations. Both the regularizations and the stronger norm (compared to the uniform norm) will be shown to be necessary in Section 2. We now present, in an informal way, a collection of our main results. A short proof making the connection to the formal results can be found in Appendix A.2.

Corollary 1.3 (Inverse stability and implications - colloquial).

Suppose we are given data and want to solve a typical minimization problem for ReLU networks with shallow architecture , i.e.

(11)

First we augment the architecture to , while omitting the biases, and augment the samples to . Furthermore, we assume that the parametrizations

(12)

are regularized such that

  1. the network is balanced, i.e.

  2. no non-zero weights in the first layer are redundant, i.e.

  3. the last two coordinates of each weight vector are strictly positive.

Then for the new minimization problem

(13)

it holds that

  1. every local minimum in the parametrization space with radius is a local minimum in the realization space with radius w.r.t. .

  2. the global minimum is at least as good as the global minimum of (11), i.e.

    (14)

The omission of bias weights is standard in neural network optimization literature Choromanska et al. [2015], Du et al. [2018], Kawaguchi [2016], Li and Liang [2018]. While this severely limits the functions that can be realized with a given architecture, it is sufficient to augment the problem by one dimension in order to recover the full range of functions that can be learned Allen-Zhu et al. [2018]. Here we augment by two dimensions, so that the third regularization condition can be fulfilled without losing range. This argument is not limited to the MSE loss function but works for any loss function based on evaluating the realization. Moreover, note that, for simplicity, the regularization assumptions stated above are stricter than necessary and possible relaxations are discussed in Section 3.
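The following sketch is our own illustration of the standard augmentation trick referenced above (the paper augments by two dimensions; we show the one-dimensional version): appending a constant coordinate to the samples lets a bias-free shallow network reproduce any shallow network with first-layer biases.

```python
import numpy as np

def with_bias(W1, b1, w2, x):
    """Realization sum_i (w2)_i * relu(<(W1)_i, x> + (b1)_i) for a batch x of shape (n, d)."""
    return np.maximum(x @ W1.T + b1, 0.0) @ w2

def bias_free(V1, w2, z):
    """Same architecture without biases, acting on augmented inputs z of shape (n, d + 1)."""
    return np.maximum(z @ V1.T, 0.0) @ w2

rng = np.random.default_rng(0)
d, width, n = 3, 5, 100
W1, b1, w2 = rng.normal(size=(width, d)), rng.normal(size=width), rng.normal(size=width)
x = rng.normal(size=(n, d))

# Augment every sample by a constant coordinate and fold the biases into the weights.
z = np.concatenate([x, np.ones((n, 1))], axis=1)
V1 = np.concatenate([W1, b1[:, None]], axis=1)

print(np.max(np.abs(with_bias(W1, b1, w2, x) - bias_free(V1, w2, z))))  # 0.0 up to rounding
```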

2 Obstacles to inverse stability - degeneracies of ReLU parametrizations

In the remainder of this paper we will focus on shallow networks without biases and with output dimension one. We define the set of parametrizations of two-layer networks without biases and with architecture by . The realization map (this is a slight abuse of notation, justified by the fact that acts the same on with zero biases and weights and ) is, for every , given by

(15)

Note that each function represents a so-called ridge function which is zero on the halfspace and linear with gradient on the other halfspace. Thus, we refer to the weight vectors also as the directions of . Moreover, for it holds that and, as long as the domain of interest contains the origin, the Sobolev norm is equivalent to its semi-norm, since

(16)

see also inequalities of Poincaré-Friedrichs type [Evans, 2010, Subsection 5.8.1]. Therefore, in the rest of the paper we will only consider the Sobolev semi-norm. For we abbreviate

(17)

In (17) one can see that in our setting is independent of (as long as is open and contains the origin) and will thus be abbreviated by .
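As a concrete reference point, the sketch below (our own notation, not the paper's; `weights` is a list of pairs (w_i, a_i)) evaluates such a bias-free two-layer realization and its almost-everywhere gradient, which is constant on each linear region.

```python
import numpy as np

def realization(weights, x):
    """weights: list of pairs (w_i, a_i) with w_i in R^d and a_i in R; x a point in R^d."""
    return sum(a * max(np.dot(w, x), 0.0) for w, a in weights)

def gradient(weights, x):
    """A.e. gradient: sum of a_i * w_i over active neurons (valid if no <w_i, x> is exactly 0)."""
    g = np.zeros_like(np.asarray(weights[0][0], dtype=float))
    for w, a in weights:
        if np.dot(w, x) > 0.0:      # neuron i is active at x
            g += a * np.asarray(w, dtype=float)
    return g

theta = [(np.array([1.0, 2.0]), 0.5), (np.array([-1.0, 0.0]), 2.0)]
x = np.array([0.3, 0.2])
print(realization(theta, x), gradient(theta, x))  # 0.35 and [0.5, 1.0]
```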

2.1 Failure of inverse stability w.r.t. uniform norm

All proofs for this section can be found in Appendix A.3. We start by showing that inverse stability fails w.r.t. the uniform norm. This example is adapted from [Petersen et al., 2018, Theorem 5.2] and represents, to the best of our knowledge, the only degeneracy which has already been observed before.

Example 2.1 (Failure due to exploding gradient).

Let and be given by (see Figure 1)

(18)

Then for every sequence with it holds that

(19)

Figure 1: The figure shows for .

In particular, note that inverse stability fails here even for a non-degenerate parametrization of the zero function, and this counterexample could be lifted to any larger network which has a single weight pair. However, for this type of counterexample the magnitude of the gradient of needs to go to infinity, which is our motivation for looking at inverse stability w.r.t. .
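The following numerical sketch is in the spirit of Example 2.1; the concrete family is our own choice and need not coincide with (18). It exhibits realizations that converge to zero uniformly while their gradients, and hence the weights of any parametrization with a single neuron, blow up.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 100001)
for n in [10, 100, 1000]:
    # f_n(x) = n * relu(x - 1 + 1/n^2): a single-neuron network with first-layer weight n.
    f_n = n * np.maximum(xs - 1.0 + 1.0 / n**2, 0.0)
    deriv = n * (xs > 1.0 - 1.0 / n**2)          # a.e. derivative, equal to n past the kink
    print(n, np.max(np.abs(f_n)), np.max(np.abs(deriv)))
    # sup-norm = 1/n -> 0, but the Sobolev semi-norm = n -> infinity; for this
    # one-neuron architecture any parametrization of f_n contains a weight of size >= sqrt(n).
```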

2.2 Failure of inverse stability w.r.t. Sobolev norm

In this section we present four degenerate cases where inverse stability fails w.r.t. . This collection of counterexamples is complete in the sense that we can establish inverse stability under assumptions which are designed to exclude these four pathologies.

Example 2.2 (Failure due to complete unbalancedness).

Let , and be given by (see Figure 3)

(20)

The only way to parametrize is with , and we have

(21)

This is a very simple example of a degenerate parametrization of the zero function, since regardless of the choice of . The issue here is that we can have a weight pair, i.e. , where the product is independent of the value of one of the parameters. Note that one gets a slightly more subtle version of this pathology by considering instead (see Example A.2). In this case one could still get an inverse stability estimate for each fixed ; the rate of inverse stability would, however, get worse with increasing .

Example 2.3 (Failure due to redundant directions).

Let

(22)

and be given by (see Figure 3)

(23)

We have for every and with that

(24)

Figure 2: Shows () and .

Figure 3: Shows and .

This example illustrates that redundant directions prevent inverse stability.
The next example shows that not only redundant weight vectors can cause issues, but also weight vectors of opposite direction, as they would allow for a (balanced) degenerate parametrization of the zero function.

Example 2.4 (Failure due to opposite weight vectors 1).

Let , , be pairwise linearly independent with and . We define

(25)

and note that . Now let with be linearly independent of each , , and let be given by (see Figure 5)

(26)

Then there exists a constant such that for every and every with it holds that

(27)

Thus we will need an assumption which prevents each individual from having weight vectors with opposite directions. This will, however, not be enough, as is demonstrated by the next example, which is similar but more subtle.

Example 2.5 (Failure due to opposite weight vectors 2).

We define the weight vectors

(28)

and consider the parametrizations (see Figure 5)

(29)

Then it holds for every and every with that

(30)

Figure 4: Shows and (, , , ).

Figure 5: Shows the weight vectors of (grey) and (black).

Note that and need to have multiple exactly opposite weight vectors which add to something small (compared to the size of the individual vectors), but not zero, since otherwise reparametrization would be possible (see Lemma A.3).

3 Inverse stability for two-layer neural networks

We now establish an inverse stability result using assumptions which are designed to exclude the pathologies from the previous section. First we present a rather technical theorem which considers a parametrization in the unrestricted parametrization space and a function in the corresponding function space . The aim is to use assumptions which are as weak as possible, while allowing us to find a parametrization of whose distance to can be bounded relative to . We then continue by defining a restricted parameter space , for which we get uniform inverse stability (meaning that we get the same estimate for every ).

Theorem 3.1 (Inverse stability at ).

Let , , , let , , and let .
Assume that the following conditions are satisfied:

  1. It holds for all with that .

  2. It holds for all with that .

  3. There exists a parametrization such that and

    1. it holds for all with that and for all with that ,

    2. it holds for all , that

    where .

Then there exists a parametrization with

(31)

The proof can be found in Appendix A.1. Note that each of the conditions in the theorem above corresponds directly to one of the pathologies in Section 2.2. Somewhat curiously, Condition 1, which deals with unbalancedness, only imposes a restriction on the weight pairs whose product is small compared to the distance of and . As can be gleaned from Example 2.2 and seen in the proof of Theorem 3.1, such a balancedness assumption is in fact only needed to deal with degenerate cases, where and have parts with mismatching directions of negligible magnitude. Otherwise a matching reparametrization is always possible.
Condition 2 requires to not have any redundant directions, the necessity of which is demonstrated by Example 2.3. Note that the first two conditions are mild from a theoretical perspective, in the sense that they only restrict the parameter space and not the corresponding space of realizations. Specifically we can define the restricted parameter space

(32)

for which we have

(33)

To see that perfect balancedness (i.e. ) does not restrict the set of realizations, observe that the ReLU is positively homogeneous (i.e. for all , ). Further note that for a perfectly balanced , Condition 1 is satisfied with

(34)

in (31). It is also possible to relax the balancedness assumption by only requiring and to be close to , which would still give a similar estimate but with a worse exponent. In particular this suggests that, in practice, it should be reasonable to enforce balancedness by a regularizer on the weights.
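As a sketch of how balancedness can be enforced exactly (our own code, exploiting only positive homogeneity; not an implementation from the paper), each weight pair is rescaled so that the maximum norm of the first-layer vector equals the absolute value of the output weight, without changing the realization.

```python
import numpy as np

def balance(weights):
    """Rescale each pair (w_i, a_i) so that |w_i|_inf = |a_i|, keeping the realization fixed."""
    balanced = []
    for w, a in weights:
        w, a = np.asarray(w, dtype=float), float(a)
        norm_w = np.max(np.abs(w))
        if norm_w == 0.0 or a == 0.0:
            balanced.append((np.zeros_like(w), 0.0))   # zero-contribution neuron
            continue
        c = np.sqrt(np.abs(a) / norm_w)
        balanced.append((c * w, a / c))                # a * relu(<w, x>) = (a/c) * relu(<c*w, x>)
    return balanced

theta = [(np.array([100.0, -50.0]), 0.01), (np.array([0.001, 0.002]), 30.0)]
for (w, a), (v, b) in zip(theta, balance(theta)):
    print(np.max(np.abs(w)), abs(a), "->", np.max(np.abs(v)), abs(b))
```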
To see that for any there is a with , further note that

(35)

if is a positive multiple of (i.e. ). This makes Condition 2 unproblematic from a theoretical perspective. From a practical point of view, enforcing this condition could be achieved by a regularization term using a barrier function. Alternatively on could employ a non-standard approach of combining such redundant neurons by changing one of them according to (35) and either setting the other one to zero or removing it entirely444This could be of interest in the design of dynamic network architectures Liu et al. [2018], Miikkulainen et al. [2019], Zoph et al. [2018] and is also closely related to the co-adaption of neurons, to counteract which, dropout was invented Hinton et al. [2012].. In order to satisfy Conditions 3a and 3b we need to restrict the parameter space in a way which also restricts the corresponding space of realizations. One possibility to do so is the following approach, which also incorporates the previous restrictions as well as the transition to networks without biases.

Definition 3.2 (Restricted parameter space).

Let . We define

(36)

Here we no longer have . Note, however, that for every there exists such that for all it holds (see Lemma A.4) that

(37)

In particular, this means that for any optimization problem over an unrestricted parameter space , there is a corresponding optimization problem over the parameter space whose solution is at least as good (see also Corollary 1.3). Our main result now states that for such a restricted parameter space we have uniform inverse stability (see Appendix A.4 for a proof).
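Since the formal definition (36) is not reproduced above, the following membership test is only a hedged sketch based on the colloquial conditions of Corollary 1.3: balancedness, no redundant directions, and strictly positive last two coordinates of each weight vector.

```python
import numpy as np

def in_restricted_space(weights, tol=1e-8):
    """Check the three colloquial conditions of Corollary 1.3 for a list of pairs (w_i, a_i)."""
    ws = [np.asarray(w, dtype=float) for w, _ in weights]
    outs = [float(a) for _, a in weights]
    # 1. balancedness: |w_i|_inf equals |a_i| for every neuron
    if any(abs(np.max(np.abs(w)) - abs(a)) > tol for w, a in zip(ws, outs)):
        return False
    # 2. no redundant directions: no non-zero w_i is a positive multiple of another w_j
    for i in range(len(ws)):
        for j in range(i + 1, len(ws)):
            ni, nj = np.linalg.norm(ws[i]), np.linalg.norm(ws[j])
            if ni > 0.0 and nj > 0.0 and np.linalg.norm(ws[i] / ni - ws[j] / nj) <= tol:
                return False
    # 3. the last two coordinates of every weight vector are strictly positive
    if any(w[-2] <= 0.0 or w[-1] <= 0.0 for w in ws):
        return False
    return True

print(in_restricted_space([(np.array([0.5, 1.0, 2.0]), 2.0), (np.array([1.0, 3.0, 1.0]), 3.0)]))  # True
```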

Corollary 3.3 (Inverse stability on ).

Let and . For every and there exists a parametrization with

(38)

4 Outlook

Understanding the pathologies which prevent inverse stability is an important first step towards understanding the connection of the parametrized and the functional optimization problem for training deep neural networks. Although our positive inverse stability result, so far, only covers shallow networks, it is based on conditions which, in principle, are not limited to shallow networks. While technically challenging, it should be very much possible to employ the methods used here in order to produce inverse stability results for deep neural networks.
Another interesting direction would be to use this inverse stability result in order to obtain an inverse stability result under a bounded angle condition (see Appendix A.5).
Motivated by the necessity of Conditions 1-3, we are highly encouraged to implement corresponding regularizers (penalizing unbalancedness and redundancy in terms of parallel vectors) in state-of-the-art networks and hope to observe positive impacts on the optimization behavior. Furthermore, we want to point out that there are already approaches, called Sobolev training, which report better generalization and data efficiency by employing the Sobolev norm as a loss Czarnecki et al. [2017]. In general, our results enable the transfer of novel insights from the study of realization spaces of neural networks to the study of neural network training.
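As a speculative sketch of such regularizers (our own; not an implementation from the paper and not validated experimentally), one term penalizes unbalanced weight pairs and another penalizes nearly parallel first-layer weight vectors.

```python
import numpy as np

def unbalancedness_penalty(W1, w2):
    """Sum over neurons of (|w_i|_inf - |a_i|)^2; zero exactly for perfectly balanced pairs."""
    return float(np.sum((np.max(np.abs(W1), axis=1) - np.abs(w2)) ** 2))

def redundancy_penalty(W1, eps=1e-12):
    """Penalize (nearly) parallel first-layer weight vectors via their positive cosine similarity."""
    normed = W1 / (np.linalg.norm(W1, axis=1, keepdims=True) + eps)
    cos = normed @ normed.T
    upper = np.triu(cos, k=1)                    # each unordered pair of neurons once
    return float(np.sum(np.clip(upper, 0.0, None) ** 2))

W1 = np.array([[1.0, 2.0], [2.0, 4.0], [-1.0, 0.5]])
w2 = np.array([2.0, 1.0, 0.3])
print(unbalancedness_penalty(W1, w2), redundancy_penalty(W1))  # 9.49 and 1.0
```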

Acknowledgment

The research of JB and DE was supported by the Austrian Science Fund (FWF) under grants I3403-N32 and P 30148. The authors would like to thank Pavol Harár for helpful comments.

References

  • Allen-Zhu et al. [2018] Z. Allen-Zhu, Y. Li, and Z. Song. A Convergence Theory for Deep Learning via Over-Parameterization. arXiv:1811.03962, 2018.
  • Anthony and Bartlett [2009] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
  • Arora et al. [2018] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263, 2018.
  • Bartlett et al. [2017a] P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv:1706.08498, 2017a.
  • Bartlett et al. [2017b] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv:1703.02930, 2017b.
  • Berner et al. [2018] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.03062, 2018.
  • Berner et al. [2019] J. Berner, D. Elbrächter, P. Grohs, and A. Jentzen. Towards a regularity theory for relu networks–chain rule and global error estimates. arXiv:1905.04992, 2019.
  • Bölcskei et al. [2017] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv:1705.01714, 2017.
  • Burger and Neubauer [2001] M. Burger and A. Neubauer. Error Bounds for Approximation with Neural Networks. Journal of Approximation Theory, 112(2):235–250, 2001.
  • Choromanska et al. [2015] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
  • Czarnecki et al. [2017] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In Advances in Neural Information Processing Systems, pages 4278–4287, 2017.
  • Du et al. [2018] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804, 2018.
  • Evans [2010] L. C. Evans. Partial Differential Equations (second edition). Graduate studies in mathematics. American Mathematical Society, 2010.
  • Evans and Gariepy [2015] L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions, Revised Edition. Textbooks in Mathematics. CRC Press, 2015.
  • Funahashi [1989] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
  • Golowich et al. [2017] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. arXiv:1712.06541, 2017.
  • Goodfellow et al. [2014] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014.
  • Gribonval et al. [2019] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. arXiv: 1905.01208, 2019.
  • Gühring et al. [2019] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural networks in norms. arXiv:1902.07896, 2019.
  • Hinton et al. [2012] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • Kawaguchi [2016] K. Kawaguchi. Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594, 2016.
  • Li et al. [2018] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
  • Li and Liang [2018] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
  • Li and Yuan [2017] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
  • Liu et al. [2018] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv:1806.09055, 2018.
  • Mei et al. [2018] S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  • Miikkulainen et al. [2019] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Chapter 15 - evolving deep neural networks. In R. Kozma, C. Alippi, Y. Choe, and F. C. Morabito, editors, Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293 – 312. Academic Press, 2019.
  • Neyshabur et al. [2017] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
  • Nguyen and Hein [2017] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 2603–2612. JMLR.org, 2017.
  • Pennington and Bahri [2017] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 2798–2806. JMLR.org, 2017.
  • Perekrestenko et al. [2018] D. Perekrestenko, P. Grohs, D. Elbrächter, and H. Bölcskei. The universal approximation power of finite-width deep relu networks. arXiv:1806.01528, 2018.
  • Petersen and Voigtlaender [2017] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. arXiv:1709.05289, 2017.
  • Petersen et al. [2018] P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of functions generated by neural networks of fixed size. arXiv:1806.08459, 2018.
  • Safran and Shamir [2016] I. Safran and O. Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.
  • Shaham et al. [2018] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537 – 557, 2018.
  • Shamir and Zhang [2013] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
  • Yarotsky [2017] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
  • Zoph et al. [2018] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.

Appendix A Appendix - Proofs and Additional Material

A.1 Proof of the Main Theorem

Proof of Theorem 3.1.

Without loss of generality, we can assume that implies for all . We now need to show that there always exists a way to reparametrize such that architecture and realization remain the same and it also fulfills (31). For simplicity of notation we will write throughout the proof. Let resp. be the part that is contributed by the -th neuron, i.e.

(39)
(40)

Further let

(41)

By conditions 2 and 3a we have for all that

(42)

Further note that we can reparametrize such that the same holds there. To this end observe that

(43)

given that is a positive multiple of . Specifically, let be a partition of (i.e. , and if ), such that for all it holds that

(44)

We denote by the smallest element in and make the following replacements, for all , without changing the realization of :

(45)
(46)

Note that we also update the set accordingly. Let now

(47)

By construction, we have for all that

(48)

Next, for , let

(49)

and

(50)

The , , and , , are the different linear regions of and , respectively. Note that they have non-empty interior, since they are non-empty intersections of open sets. Next observe that the derivatives of are (a.e.) given by

(51)

Note that for every , we have

(52)

Next we use that for , we have if , and compare adjacent linear regions of . Let now and consider the following cases:
Case 1: We have for all . This means that the , , and the , , are the same on both sides near the hyperplane , while the value of is on one side and on the other. Specifically, there exist and such that