# Towards a regularity theory for ReLU networks – chain rule and global error estimates

Julius Berner¹, Dennis Elbrächter¹, Philipp Grohs³, Arnulf Jentzen⁴

¹Faculty of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria

³Faculty of Mathematics and Research Platform DataScience@UniVienna, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria

⁴Department of Mathematics, ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland

###### Abstract

Although for neural networks with locally Lipschitz continuous activation functions the classical derivative exists almost everywhere, the standard chain rule is in general not applicable. We will consider a way of introducing a derivative for neural networks that admits a chain rule, which is both rigorous and easy to work with. In addition we will present a method of converting approximation results on bounded domains to global (pointwise) estimates. This can be used to extend known neural network approximation theory to include the study of regularity properties. Of particular interest is the application to neural networks with ReLU activation function, where it contributes to the understanding of the success of deep learning methods for high-dimensional partial differential equations.

## I Introduction

It has been observed that deep neural networks exhibit the remarkable capability of overcoming the curse of dimensionality in a number of different scenarios. In particular, for certain types of high-dimensional partial differential equations (PDEs) there are promising empirical observations [1, 2, 3, 4, 5, 6, 7] backed by theoretical results for both the approximation error [8, 9, 10, 11] and the generalization error [12]. In this context it becomes relevant not only to show how well a given function of interest can be approximated by neural networks but also to extend the study to the derivative of this function. A number of recent publications [13, 14, 15] have investigated the size of a network that is sufficient to approximate certain interesting (classes of) functions within a given accuracy. This is achieved, first, by considering the approximation of basic functions by very simple networks and, subsequently, by combining those networks in order to approximate more difficult structures. To extend this approach to include the regularity of the approximation, one requires some kind of chain rule for the composition of neural networks. For neural networks with differentiable activation function the standard chain rule is sufficient. It fails, however, for neural networks with an activation function that is not everywhere differentiable. Although locally Lipschitz continuous functions are differentiable almost everywhere (a.e.) w.r.t. the Lebesgue measure, the standard chain rule is not applicable, as, in general, it does not hold even in an 'almost everywhere' sense. We will introduce derivatives of neural networks in a way that admits a chain rule which is both rigorous and easy to work with. Chain rules for functions which are not everywhere differentiable have been considered in a more general setting in e.g. [16, 17]. We employ the specific structure of neural networks to get stronger results using simpler arguments.
In particular, it allows for a stability result, i.e. Lemma III.3, the application of which will be discussed in Section V. We would also like to mention a very recent work [18] about approximation in Sobolev norms, whose authors deal with the issue by using a general bound for the Sobolev norm of the composition of Sobolev functions. Note, however, that this approach leads to a certain factor depending on the dimensions of the domains of the functions, which can be avoided with our method. For ease of exposition, we formulate our results for neural networks with the ReLU activation function. We, however, consider in Section IV how such a chain rule can be obtained for any activation function which is locally Lipschitz continuous (with at most countably many points at which it is not differentiable). In Section V we briefly sketch how the results from Section III can be utilized to get approximation results for certain classes of functions. Subsequently, in Section VI, we present a general method of deriving global error estimates from such approximation results, which are naturally obtained for bounded domains. Ultimately, we discuss how our results can be used to extend known theory, enabling the further study of the approximation of PDE solutions by neural networks.

## II Setting

As in [14], we consider a neural network $\Phi$ to be a finite sequence of matrix-vector pairs, i.e.

$$\Phi = ((A_k, b_k))_{k=1}^{L}, \tag{1}$$

where $A_k \in \mathbb{R}^{N_k \times N_{k-1}}$ and $b_k \in \mathbb{R}^{N_k}$ for some depth $L \in \mathbb{N}$ and layer dimensions $N_0, N_1, \dots, N_L \in \mathbb{N}$. The realization of the neural network $\Phi$ is the function $\mathcal{R}\Phi\colon \mathbb{R}^{N_0} \to \mathbb{R}^{N_L}$ given by

$$\mathcal{R}\Phi = W_L \circ \mathrm{ReLU} \circ W_{L-1} \circ \dots \circ \mathrm{ReLU} \circ W_1, \tag{2}$$

where $W_k(x) := A_k x + b_k$ for every $x \in \mathbb{R}^{N_{k-1}}$, $k \in \{1, \dots, L\}$, and where

$$\mathrm{ReLU}(x) := (\max\{0, x_1\}, \dots, \max\{0, x_N\}) \tag{3}$$

for every $x \in \mathbb{R}^N$. We distinguish between a neural network and its realization, since $\Phi$ uniquely induces $\mathcal{R}\Phi$, while in general there can be multiple non-trivially different neural networks with the same realization. The representation of a neural network as a structured set of weights as in (1) allows the introduction of notions of network size. While there are slight differences between various publications, commonly considered quantities are the depth (i.e. number of affine transformations), the connectivity (i.e. number of non-zero entries of the $A_k$ and $b_k$), and the weight bound (i.e. maximum of the absolute values of the entries of the $A_k$ and $b_k$). In [15] it has been shown that these three quantities determine the length of a bit string which is sufficient to encode the network with a prescribed quantization error. In the following let

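$$\Phi = ((A_k, b_k))_{k=1}^{L}, \qquad \Psi = ((\tilde{A}_k, \tilde{b}_k))_{k=1}^{\tilde{L}} \tag{4}$$

be neural networks with matching dimensions in the sense that $\mathcal{R}\Phi\colon \mathbb{R}^{N_0} \to \mathbb{R}^{N_L}$ and $\mathcal{R}\Psi\colon \mathbb{R}^{N_L} \to \mathbb{R}^{\tilde{N}_{\tilde{L}}}$. We then define their composition as

$$\Psi \odot \Phi := \Big( ((A_k, b_k))_{k=1}^{L-1},\ (\tilde{A}_1 A_L,\ \tilde{A}_1 b_L + \tilde{b}_1),\ ((\tilde{A}_k, \tilde{b}_k))_{k=2}^{\tilde{L}} \Big). \tag{5}$$

Direct computation shows

$$\mathcal{R}(\Psi \odot \Phi) = \mathcal{R}\Psi \circ \mathcal{R}\Phi. \tag{6}$$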
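The definitions (1) to (6) translate almost literally into code. The following is a minimal sketch in plain Python, assuming a network is stored as a list of `(A, b)` pairs with `A` a list of rows. The helper names and the three example networks are hypothetical and chosen only for illustration; they do not appear in the paper.

```python
def affine(A, b, x):
    """Return A x + b for a matrix A (list of rows) and vectors b, x (lists)."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def relu(x):
    """Component-wise ReLU as in (3)."""
    return [max(0.0, xi) for xi in x]

def realize(phi, x):
    """Evaluate the realization (2): ReLU follows every affine map except the last."""
    for k, (A, b) in enumerate(phi):
        x = affine(A, b, x)
        if k < len(phi) - 1:
            x = relu(x)
    return x

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * c for a, c in zip(row, col)) for col in zip(*B)] for row in A]

def compose(psi, phi):
    """Psi composed with Phi as in (5): last layer of Phi merged with first layer of Psi."""
    (A_L, b_L), (At_1, bt_1) = phi[-1], psi[0]
    merged = (matmul(At_1, A_L), affine(At_1, bt_1, b_L))  # (At_1 A_L, At_1 b_L + bt_1)
    return phi[:-1] + [merged] + psi[1:]

def depth(phi):
    return len(phi)

def connectivity(phi):
    """Number of non-zero entries of all A_k and b_k."""
    return sum(1 for A, b in phi for v in [e for row in A for e in row] + b if v != 0)

def weight_bound(phi):
    """Maximum absolute value of all entries of the A_k and b_k."""
    return max(abs(v) for A, b in phi for v in [e for row in A for e in row] + b)

# Hypothetical example networks (not from the paper).
phi = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]         # x -> |x|
phi_scaled = [([[2.0], [-2.0]], [0.0, 0.0]), ([[0.5, 0.5]], [0.0])]  # also x -> |x|
psi = [([[1.0]], [-1.0]), ([[2.0]], [0.5])]                          # y -> 2 ReLU(y - 1) + 0.5
```

Here `phi` and `phi_scaled` are different networks with the same realization, and `compose(psi, phi)` has depth $L + \tilde{L} - 1 = 3$ while realizing the same function as evaluating `phi` and then `psi`.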

Note that the realization of a neural network is continuous piecewise linear (CPL) as a composition of CPL functions. Consequently, it is Lipschitz continuous and the realization is almost everywhere differentiable by Rademacher's theorem. In particular, all three functions in (6) are a.e. differentiable. This, however, is not sufficient to get the derivative of $\mathcal{R}(\Psi \odot \Phi)$ from the derivatives of $\mathcal{R}\Psi$ and $\mathcal{R}\Phi$ by use of the classical chain rule. Consider the very simple counterexample of $u(x) := \mathrm{ReLU}(x)$ and $v(x) := 0$ and formally apply the chain rule, i.e.

$$(D(u \circ v))(x) = (Du)(v(x)) \cdot (Dv)(x). \tag{7}$$

Even though $(D(u \circ v))(x) = 0$ is well-defined for every $x \in \mathbb{R}$, the expression $(Du)(v(x))$ is defined for no $x \in \mathbb{R}$. In general, this problem occurs when the inner function maps a set of positive measure into a set where the derivative of the outer function does not exist. Now in this case, one can directly see that setting $(Du)(0)$ to any arbitrary value would cause (7) to provide the correct result since $(Dv)(x) = 0$.

## III ReLU network derivative

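We proceed by defining the derivative of an arbitrary neural network in a way such that it not only coincides a.e. with the derivative of the realization, but also admits a chain rule. To this end let $H\colon \mathbb{R}^N \to \mathbb{R}^{N \times N}$ be the function given by

$$H(x) := \mathrm{diag}\big(\mathbf{1}_{(0,\infty)}(x_1), \dots, \mathbf{1}_{(0,\infty)}(x_N)\big) \tag{8}$$

for every $x \in \mathbb{R}^N$ and let $\mathcal{R}_k\Phi := W_k \circ \mathrm{ReLU} \circ W_{k-1} \circ \dots \circ \mathrm{ReLU} \circ W_1$ for $k \in \{1, \dots, L-1\}$. We then define the neural network derivative of $\Phi$ as the function $D\Phi\colon \mathbb{R}^{N_0} \to \mathbb{R}^{N_L \times N_0}$ given by

$$D\Phi := A_L \cdot H(\mathcal{R}_{L-1}\Phi) \cdot A_{L-1} \cdot \ldots \cdot H(\mathcal{R}_1\Phi) \cdot A_1. \tag{9}$$

Note that this definition is motivated by formally applying the chain rule with the convention that the derivative of $\mathrm{ReLU}$ is zero at the origin. Now we need to verify that this is justified.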
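Definition (9) can be evaluated in a single forward pass that tracks the pre-activations. The sketch below uses plain Python and assumes that the matrices $H$ in (9) are evaluated at the outputs of the first $k$ affine layers (before ReLU is applied), which is how we read the definition. The list-of-pairs network format and the example network are hypothetical.

```python
def affine(A, b, x):
    """Return A x + b for a matrix A (list of rows) and vectors b, x (lists)."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * c for a, c in zip(row, col)) for col in zip(*B)] for row in A]

def network_derivative(phi, x):
    """Evaluate (9): A_L H(pre_{L-1}) A_{L-1} ... H(pre_1) A_1 at the point x."""
    A, b = phi[0]
    jac = [list(row) for row in A]  # starts as A_1
    pre = affine(A, b, x)           # output of the first affine layer
    for A, b in phi[1:]:
        # Diagonal of H(pre): 1 where the pre-activation is strictly positive,
        # 0 otherwise (in particular 0 at the origin, the convention in the text).
        h = [1.0 if p > 0 else 0.0 for p in pre]
        jac = [[hi * v for v in row] for hi, row in zip(h, jac)]  # H(pre) * jac
        jac = matmul(A, jac)                                      # A_k * H(pre) * jac
        pre = affine(A, b, relu(pre))
    return jac

# Hypothetical example: a depth-2 network whose realization is x -> |x|.
abs_net = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
```

For `abs_net` the function returns the classical slopes of $|x|$ away from the origin, and the value zero at the origin because both pre-activations vanish there.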

###### Theorem III.1.

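It holds for almost every $x \in \mathbb{R}^{N_0}$ that

$$(D\Phi)(x) = (D(\mathcal{R}\Phi))(x). \tag{10}$$

###### Proof.

Let $v\colon \mathbb{R}^d \to \mathbb{R}^N$ be a locally Lipschitz continuous function, define $w := \mathrm{ReLU} \circ v$, and

$$L_i := \{x \in \mathbb{R}^d \colon w_i(x) = 0\} = \{x \in \mathbb{R}^d \colon v_i(x) \le 0\}. \tag{11}$$

We now use an observation about differentiability on level sets (see e.g. [19, Thm 3.3(i)]), which states that

$$(Dw_i)(x) = 0 \quad \text{for almost every } x \in L_i. \tag{12}$$

As $w_i(x) = v_i(x)$ for every $x \in \mathbb{R}^d \setminus L_i$, we get a.e.

$$Dw_i = \mathbf{1}_{\mathbb{R}^d \setminus L_i} \cdot Dv_i = \mathbf{1}_{(0,\infty)}(v_i) \cdot Dv_i \tag{13}$$

and consequently

$$D(\mathrm{ReLU} \circ v) = H(v) \cdot Dv. \tag{14}$$

The claim follows by induction over the layers of $\Phi$, using (14) with $v = \mathcal{R}_k\Phi$ for the induction step. ∎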

Note that even for convex $\mathcal{R}\Phi$ the values of $D\Phi$ on the nullset do not necessarily lie in the respective subdifferentials of $\mathcal{R}\Phi$, as can be seen in Figure 1. Although Theorem III.1 holds regardless of which value is chosen for the derivative of $\mathrm{ReLU}$ at the origin, no choice will guarantee that all values of $D\Phi$ lie in the respective subdifferentials of $\mathcal{R}\Phi$. Here we have set the derivative at the origin to zero, following the convention of software implementations for deep learning applications, e.g. TensorFlow and PyTorch. Using (5) and (9) one can verify by direct computation that $D\Phi$ obeys the chain rule.
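To make the remark on subdifferentials concrete, here is a small worked example. The network is a hypothetical one chosen for illustration and is not claimed to be the one depicted in Figure 1. Its realization is the identity, which is convex with subdifferential $\{1\}$ at the origin, yet the network derivative (9) with the zero convention vanishes there.

```latex
\begin{align*}
\Phi &= \big((A_1, b_1), (A_2, b_2)\big), \quad
A_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad b_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
A_2 = \begin{pmatrix} 1 & -1 \end{pmatrix}, \quad b_2 = 0, \\
(\mathcal{R}\Phi)(x) &= \max\{0, x\} - \max\{0, -x\} = x, \qquad
\partial(\mathcal{R}\Phi)(0) = \{1\}, \\
(D\Phi)(0) &= A_2 \cdot H(A_1 \cdot 0 + b_1) \cdot A_1
= A_2 \cdot \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \cdot A_1 = 0 \notin \{1\}.
\end{align*}
```

For every $x \neq 0$ exactly one of the two hidden units is active and $(D\Phi)(x) = 1$, so the discrepancy is confined to the single point $x = 0$, in line with Theorem III.1.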

###### Corollary III.2.

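It holds for every $x \in \mathbb{R}^{N_0}$ that

$$(D(\Psi \odot \Phi))(x) = (D\Psi)(\mathcal{R}\Phi(x)) \cdot (D\Phi)(x). \tag{15}$$

Note that (15) is well-defined as $D\Psi$ exists everywhere, although it only coincides with $D(\mathcal{R}\Psi)$ almost everywhere. Theorem III.1, however, guarantees that we still have a.e.

$$D(\Psi \odot \Phi) = D(\mathcal{R}(\Psi \odot \Phi)) = D(\mathcal{R}\Psi \circ \mathcal{R}\Phi). \tag{16}$$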
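The chain rule (15) can be checked numerically, including at points where the inner realization lands exactly on a kink of the outer one. The sketch below is plain Python under the same assumptions as before: networks are lists of `(A, b)` pairs, and the two example networks are hypothetical. The inner network realizes $\mathrm{ReLU}(x)$, so it maps the whole half-line $(-\infty, 0]$ to the point $0$ where the outer realization $|y|$ is not differentiable.

```python
def affine(A, b, x):
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def matmul(A, B):
    return [[sum(a * c for a, c in zip(row, col)) for col in zip(*B)] for row in A]

def realize(phi, x):
    """Realization (2) of a network given as a list of (A, b) pairs."""
    for k, (A, b) in enumerate(phi):
        x = affine(A, b, x)
        if k < len(phi) - 1:
            x = relu(x)
    return x

def compose(psi, phi):
    """Composition (5)."""
    (A_L, b_L), (At_1, bt_1) = phi[-1], psi[0]
    return phi[:-1] + [(matmul(At_1, A_L), affine(At_1, bt_1, b_L))] + psi[1:]

def network_derivative(phi, x):
    """Network derivative (9) with the zero convention at the origin."""
    A, b = phi[0]
    jac, pre = [list(row) for row in A], affine(A, b, x)
    for A, b in phi[1:]:
        h = [1.0 if p > 0 else 0.0 for p in pre]
        jac = matmul(A, [[hi * v for v in row] for hi, row in zip(h, jac)])
        pre = affine(A, b, relu(pre))
    return jac

def chain_rule_gap(psi, phi, x):
    """Largest entry-wise difference between both sides of (15) at x."""
    lhs = network_derivative(compose(psi, phi), x)
    rhs = matmul(network_derivative(psi, realize(phi, x)), network_derivative(phi, x))
    return max(abs(l - r) for lr, rr in zip(lhs, rhs) for l, r in zip(lr, rr))

# Hypothetical example networks: R(phi)(x) = ReLU(x), R(psi)(y) = |y|.
phi = [([[1.0]], [0.0]), ([[1.0]], [0.0])]
psi = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
```

Evaluating `chain_rule_gap(psi, phi, [x])` for negative `x` exercises exactly the situation that breaks the classical chain rule, since there the outer derivative is evaluated at its kink.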

Next we provide a technical result dealing with the stability of our chain rule, which will prove to be useful in Section V.

###### Lemma III.3.

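It holds for almost every $x \in \mathbb{R}^{N_0}$ that

$$\lim_{y \to \mathcal{R}\Phi(x)} \big[(D\Psi)(y) - (D\Psi)(\mathcal{R}\Phi(x))\big] \cdot (D\Phi)(x) = 0. \tag{17}$$

###### Proof.

We first show for every locally Lipschitz continuous function $u\colon \mathbb{R}^{N_L} \to \mathbb{R}^N$ and for almost every $x \in \mathbb{R}^{N_0}$ that

$$\lim_{y \to \mathcal{R}\Phi(x)} \big[H(u(y)) - H(u(\mathcal{R}\Phi(x)))\big] \cdot (D(u \circ \mathcal{R}\Phi))(x) = 0. \tag{18}$$

If $u_i(\mathcal{R}\Phi(x)) \neq 0$ we have

$$\lim_{y \to \mathcal{R}\Phi(x)} \mathbf{1}_{(0,\infty)}(u_i(y)) = \mathbf{1}_{(0,\infty)}(u_i(\mathcal{R}\Phi(x))) \tag{19}$$

as $u_i$ is continuous and $\mathbf{1}_{(0,\infty)}$ is continuous on $\mathbb{R} \setminus \{0\}$. Furthermore, [19, Thm 3.3(i)] implies that

$$(D(u_i \circ \mathcal{R}\Phi))(x) = 0 \tag{20}$$

for almost every $x \in \mathbb{R}^{N_0}$ with $u_i(\mathcal{R}\Phi(x)) = 0$. Since a finite union of nullsets is again a nullset, this proves the claim (18). The lemma follows by induction over the layers of $\Psi$ and applying (18) with $u = \mathcal{R}_k\Psi$. ∎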

## IV General Activation Functions

As mentioned in the introduction, it is possible to replace the ReLU activation function in (2) by some locally Lipschitz continuous, component-wise applied function $\varrho\colon \mathbb{R} \to \mathbb{R}$ with an at most countable set $S \subseteq \mathbb{R}$ of points where $\varrho$ is not differentiable. Specifically, one can define the neural network derivative (with activation function $\varrho$) as in (9) with $\mathbf{1}_{(0,\infty)}$ in (8) replaced by

$$(\bar{D}\varrho)(x_i) := \begin{cases} 0, & x_i \in S, \\ (D\varrho)(x_i), & \text{else.} \end{cases} \tag{21}$$

The chain rule can, again, be checked by direct computation and it is straightforward to adapt Theorem III.1 to this more general setting by considering the level sets

$$\{x \in \mathbb{R}^d \colon w_i(x) = s\}, \quad s \in S. \tag{22}$$

If additionally $\bar{D}\varrho$ is continuous on $\mathbb{R} \setminus S$, the proof of Lemma III.3 translates without any modifications.
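The convention (21) is easy to express as a wrapper around the classical derivative. The following plain-Python sketch is illustrative only: the leaky ReLU with slope 0.1 is a hypothetical choice of activation function (with exceptional set $\{0\}$), and the helper names are not taken from the paper.

```python
def make_conventional_derivative(classical_derivative, exceptional_points):
    """Return the modified derivative (21): zero on S, the classical derivative elsewhere."""
    def d_bar(t):
        return 0.0 if t in exceptional_points else classical_derivative(t)
    return d_bar

def leaky_relu_derivative(t):
    """Classical derivative of t -> max(t, 0.1 t); only meaningful for t != 0."""
    return 1.0 if t > 0 else 0.1

# Hypothetical activation: leaky ReLU, not differentiable only at the origin.
d_bar = make_conventional_derivative(leaky_relu_derivative, {0.0})

def activation_jacobian(x):
    """Diagonal matrix that takes the place of H(x) from (8) for this activation."""
    n = len(x)
    return [[d_bar(x[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
```

Replacing the indicator in (8) by `d_bar` and leaving (9) otherwise unchanged gives the network derivative for this activation function.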

## V Utilization in Approximation Theory

These results can now be employed to bound the $L^\infty$-norm of $D(u \circ v) - D(\Psi \odot \Phi)$, given corresponding estimates for the approximation of $u$ and $v$ by $\mathcal{R}\Psi$ and $\mathcal{R}\Phi$, respectively. Here, one has to take some care when bounding the term

$$\big\| [D\Psi \circ \mathcal{R}\Phi - Du \circ \mathcal{R}\Phi] \, D\Phi \big\|_{L^\infty} \tag{23}$$

by

$$\| D\Psi - Du \|_{L^\infty} \, \| D\Phi \|_{L^\infty}. \tag{24}$$

Again it can happen that $\mathcal{R}\Phi$ maps a set of positive measure into a nullset where the estimate for the approximation of $Du$ by $D\Psi$ in the essential supremum norm is not valid. However, using the stability result in Lemma III.3 one can, for almost every $x$, shift to a sufficiently close point where the estimate holds. In [13] Yarotsky explicitly constructs networks whose realization is a linear interpolation of the squaring function (see Fig. 1 for illustration), which directly gives an estimate on the approximation rate for the derivatives. (The interpolation points are uniformly distributed over the domain of approximation and their number grows exponentially with the size of the networks.) These simple networks can then be combined to get networks approximating multiplication, polynomials and eventually, by means of e.g. local Taylor approximation, functions whose first (weak) derivatives are bounded. This leads to estimates of the form

$$\| f - \mathcal{R}\Phi_{\varepsilon,B} \|_{L^\infty(I_B)} \le \varepsilon, \tag{25}$$

with $I_B := [-B, B]^d$, including estimates for the scaling of the size of the network w.r.t. $\varepsilon$ and $B$. As these constructions are based on composing simpler functions with known estimates, one can now employ Theorem III.1 and Corollary III.2 to show that the derivatives of those networks also approximate the derivative of the function, i.e.

$$\| Df - D\Phi_{\varepsilon,B} \|_{L^\infty(I_B)} \le c \, \varepsilon^{r}. \tag{26}$$

Such constructive approaches can further be found in [8], in [14] for $\beta$-cartoon-like functions, in [20] for $(b,\varepsilon)$-holomorphic maps, and in [15] for high-frequency sinusoidal functions.
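The interpolation of the squaring function mentioned above can be sketched with the well-known sawtooth construction on $[0,1]$, which we assume is the one meant here: with the hat function $g$ and its $s$-fold composition $g_s$, the function $f_m(x) = x - \sum_{s=1}^{m} 4^{-s} g_s(x)$ interpolates $x^2$ linearly at the points $k 2^{-m}$. The plain-Python code below evaluates $f_m$ together with its derivative, the latter via the chain rule and the zero convention from (8). All function names are hypothetical.

```python
def relu(t):
    return max(0.0, t)

def step(t):
    """Derivative convention for ReLU: 1 for t > 0, 0 otherwise (including t = 0)."""
    return 1.0 if t > 0 else 0.0

def hat(t):
    """Hat function on [0, 1] written with three ReLU units."""
    return 2.0 * relu(t) - 4.0 * relu(t - 0.5) + 2.0 * relu(t - 1.0)

def hat_derivative(t):
    """Derivative of the hat function under the ReLU convention above."""
    return 2.0 * step(t) - 4.0 * step(t - 0.5) + 2.0 * step(t - 1.0)

def square_net(x, m):
    """Return (f_m(x), derivative of f_m at x) for x in [0, 1]."""
    value, deriv = x, 1.0
    g, dg = x, 1.0  # g_0 is the identity
    for s in range(1, m + 1):
        dg = hat_derivative(g) * dg  # chain rule, evaluated at the previous g
        g = hat(g)
        value -= g / 4.0 ** s
        deriv -= dg / 4.0 ** s
    return value, deriv
```

Away from the interpolation points, the value error is at most $2^{-2m-2}$ and the derivative error is at most $2^{-m}$, so the derivative estimate comes for free from the interpolation property, which is the observation made in the text.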

## VI Global Error Estimates

The error estimates above are usually only sensible for bounded domains, as the realization of a neural network is always CPL with a finite number of pieces. We briefly discuss a general way of transforming them into global pointwise error estimates, which can be useful in the context of PDEs (see e.g. [9, 10]). In the following assume that we have a function $f\colon \mathbb{R}^d \to \mathbb{R}$ with an at most polynomially growing derivative, i.e.

$$\| (Df)(x) \|_2 \le c \, (1 + \|x\|_2^{\kappa}). \tag{27}$$

Denote by $\Phi^{\mathrm{char}}_B$ a neural network which represents the $d$-dimensional approximate characteristic function of $I_B$, i.e.

$$\mathcal{R}\Phi^{\mathrm{char}}_B(x) = 1, \ x \in I_B, \qquad \mathcal{R}\Phi^{\mathrm{char}}_B(x) = 0, \ x \notin I_{B+1}. \tag{28}$$

See [15, Proof of Thm. VIII.3] for such a construction. Further consider a neural network approximating the multiplication function on $[-b, b]^2$ with error $\varepsilon$ (see e.g. [20, Prop. 3.1]).
Now we define the global approximation networks $\Phi_\varepsilon$ as the composition of this multiplication network with the parallelization of $\Phi_{\varepsilon,B_\varepsilon}$ and $\Phi^{\mathrm{char}}_{B_\varepsilon}$ for suitable

$$B_\varepsilon \in \mathcal{O}(\varepsilon^{-1}) \quad \text{and} \quad b_\varepsilon \in \mathcal{O}(\varepsilon^{-\kappa-1}). \tag{29}$$

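See Figure 2 for an illustration and e.g. [14, Def. 2.7] for a formal definition of parallelization. Considering the errors on $I_{B_\varepsilon}$, $I_{B_\varepsilon+1} \setminus I_{B_\varepsilon}$, and $\mathbb{R}^d \setminus I_{B_\varepsilon+1}$ leads to global estimates, i.e. for every $x \in \mathbb{R}^d$

$$| f(x) - \mathcal{R}\Phi_\varepsilon(x) | \le \varepsilon \, (1 + \|x\|_2^{\kappa+2}) \tag{30}$$

and, by use of the chain rule in Corollary III.2, for almost every $x \in \mathbb{R}^d$

$$\| (Df)(x) - (D\Phi_\varepsilon)(x) \|_2 \le C \varepsilon^{r} (1 + \|x\|_2^{\kappa+2}). \tag{31}$$

Due to the logarithmic size scaling of the multiplication network, the size of $\Phi_\varepsilon$ can be bounded by the size of $\Phi_{\varepsilon,B_\varepsilon}$ plus an additional term governed by this logarithmic scaling.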
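For intuition about the cutoff property (28), a one-dimensional approximate characteristic function can be written with four ReLU units. This is only a sketch for $d = 1$ and is not the $d$-dimensional construction from [15]. The function name and the explicit weights are assumptions made for illustration.

```python
def relu(t):
    return max(0.0, t)

def char_1d(x, B):
    """Trapezoid: equals 1 on [-B, B], 0 outside [-B-1, B+1], linear in between."""
    return relu(x + B + 1.0) - relu(x + B) - relu(x - B) + relu(x - B - 1.0)

def char_1d_network(B):
    """The same function as a depth-2 network in the (A, b) format of (1)."""
    A1 = [[1.0], [1.0], [1.0], [1.0]]
    b1 = [B + 1.0, B, -B, -B - 1.0]
    A2 = [[1.0, -1.0, -1.0, 1.0]]
    b2 = [0.0]
    return [(A1, b1), (A2, b2)]
```

Multiplying a local approximation by such a cutoff forces the product to vanish outside $I_{B+1}$; in the construction described above this product is itself realized only approximately by a multiplication network.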

## VII Application to PDEs

Analyzing the regularity properties of neural networks was motivated by the recent successful application of deep learning methods to PDEs [2, 3, 4, 5, 6, 7, 11]. Initiated by empirical experiments [1], it has been proven that neural networks are capable of overcoming the curse of dimensionality for solving so-called Kolmogorov PDEs [12]. More precisely, the solution to the empirical risk minimization problem over a class of neural networks approximates the solution of the PDE up to error $\varepsilon$ with high probability, with the size of the networks and the number of samples scaling only polynomially in the dimension $d$ and $\varepsilon^{-1}$. The above requires a suitable learning problem and a sufficiently good approximation of the solution function by neural networks. For Kolmogorov PDEs, this boils down to calculating global Lipschitz coefficients and error estimates for neural networks approximating the initial condition and coefficient functions (see e.g. [9, 10]). Employing estimates of the form (26) one can bound the derivative on $I_B$, i.e.

$$L_B := \| D\Phi_{\varepsilon,B} \|_{L^\infty(I_B)} \le \| Df \|_{L^\infty(I_B)} + c \, \varepsilon^{r}. \tag{32}$$

Using mollification and the mean value theorem we can establish local Lipschitz estimates, i.e. for all $x, y \in I_B$ that

$$| \mathcal{R}\Phi_{\varepsilon,B}(x) - \mathcal{R}\Phi_{\varepsilon,B}(y) | \le L_B \| x - y \|_2, \tag{33}$$

and corresponding linear growth bounds

$$| \mathcal{R}\Phi_{\varepsilon,B}(x) | \le ( | \mathcal{R}\Phi_{\varepsilon,B}(0) | + L_B ) (1 + \|x\|_2). \tag{34}$$
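The step from the Lipschitz estimate (33) to the growth bound (34) is a short triangle-inequality argument. A sketch, assuming $x$ and the origin both lie in the domain on which (33) holds:

```latex
\begin{align*}
| \mathcal{R}\Phi_{\varepsilon,B}(x) |
&\le | \mathcal{R}\Phi_{\varepsilon,B}(0) |
   + | \mathcal{R}\Phi_{\varepsilon,B}(x) - \mathcal{R}\Phi_{\varepsilon,B}(0) | \\
&\le | \mathcal{R}\Phi_{\varepsilon,B}(0) | + L_B \| x \|_2 \\
&\le \big( | \mathcal{R}\Phi_{\varepsilon,B}(0) | + L_B \big) \big( 1 + \| x \|_2 \big).
\end{align*}
```

The last inequality only adds the non-negative terms $| \mathcal{R}\Phi_{\varepsilon,B}(0) | \, \| x \|_2$ and $L_B$ to the right-hand side.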

Similarly, one can use (31) to obtain estimates of the form

$$| \mathcal{R}\Phi_\varepsilon(x) - \mathcal{R}\Phi_\varepsilon(y) | \le C \, (1 + \|x\|_2^{\kappa+2} + \|y\|_2^{\kappa+2}) \, \| x - y \|_2 \tag{35}$$

for all $x, y \in \mathbb{R}^d$ (which are demanded in [10, Theorem 1.1]). Moreover, note that the capability to produce approximation results which include error estimates for the derivative is of significant independent interest. Various numerical methods (for instance Galerkin methods) rely on bounding the error in some Sobolev norm, which requires estimates of the derivative differences. We believe that the possibility to obtain regularity estimates significantly contributes to the mathematical theory of neural networks and allows for further advances in the numerical approximation of high-dimensional partial differential equations.

## VIII Relation to backpropagation in training

The approach discussed here could further be applied to the training of neural networks by (stochastic) gradient descent. Note, however, that this is a slightly different setting. From the approximation theory perspective we were interested in the derivative of $x \mapsto \mathcal{R}\Phi(x)$, while in training one requires the derivative of $\Phi \mapsto \mathcal{R}\Phi(x)$ for some fixed sample $x$. In particular, this function is no longer CPL but rather continuous piecewise polynomial. While this would necessitate some technical modifications, we believe that it should be possible to employ the method used here in order to show that the gradient of $\Phi \mapsto \mathcal{R}\Phi(x)$ coincides a.e. with what is computed by backpropagation using the convention of setting the derivative of $\mathrm{ReLU}$ to zero at the origin (as well as similar conventions for e.g. max-pooling).

## Acknowledgment

The research of JB and DE was supported by the Austrian Science Fund (FWF) under grants I3403-N32 and P 30148.

## References

• [1] C. Beck, S. Becker, P. Grohs, N. Jaafari, and A. Jentzen, “Solving stochastic differential equations and Kolmogorov equations by means of deep learning,” arXiv:1806.00421, 2018.
• [2] W. E, J. Han, and A. Jentzen, “Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations,” Communications in Mathematics and Statistics, vol. 5, no. 4, pp. 349–380, 2017.
• [3] J. Han, A. Jentzen, and W. E, “Solving high-dimensional partial differential equations using deep learning,” arXiv:1707.02568, 2017.
• [4] J. Sirignano and K. Spiliopoulos, “DGM: A deep learning algorithm for solving partial differential equations,” arXiv:1708.07469, 2017.
• [5] M. Fujii, A. Takahashi, and M. Takahashi, “Asymptotic Expansion as Prior Knowledge in Deep Learning Method for high dimensional BSDEs,” arXiv:1710.07030, 2017.
• [6] Y. Khoo, J. Lu, and L. Ying, “Solving parametric PDE problems with artificial neural networks,” arXiv:1707.03351, 2017.
• [7] W. E and B. Yu, “The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems,” arXiv:1710.00211, 2017.
• [8] D. Elbrächter, P. Grohs, A. Jentzen, and C. Schwab, “DNN Expression Rate Analysis of high-dimensional PDEs: Application to Option Pricing,” arXiv:1809.07669, 2018.
• [9] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger, “A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations,” arXiv:1809.02362, 2018.
• [10] A. Jentzen, D. Salimova, and T. Welti, “A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients,” arXiv:1809.07321, 2018.
• [11] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen, “A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations,” arXiv:1901.10854, 2019.
• [12] J. Berner, P. Grohs, and A. Jentzen, “Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations,” arXiv:1809.03062, 2018.
• [13] D. Yarotsky, “Optimal approximation of continuous functions by very deep ReLU networks,” arXiv:1802.03620, 2018.
• [14] P. Petersen and F. Voigtlaender, “Optimal approximation of piecewise smooth functions using deep ReLU neural networks,” arXiv:1709.05289, 2017.
• [15] P. Grohs, D. Perekrestenko, D. Elbrächter, and H. Bölcskei, “Deep Neural Network Approximation Theory,” arXiv:1901.02220, 2019.
• [16] F. Murat and C. Trombetti, “A chain rule formula for the composition of a vector-valued function by a piecewise smooth function,” Bollettino dell’Unione Matematica Italiana, vol. 6, no. 3, pp. 581–595, 2003.
• [17] L. Ambrosio and G. Dal Maso, “A general chain rule for distributional derivatives,” Proceedings of the American Mathematical Society, vol. 108, no. 3, pp. 691–702, 1990.
• [18] I. Gühring, G. Kutyniok, and P. Petersen, “Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms,” arXiv:1902.07896, 2019.
• [19] L. C. Evans and R. F. Gariepy, Measure Theory and Fine Properties of Functions, Revised Edition, ser. Textbooks in Mathematics.   CRC Press, 2015.
• [20] C. Schwab and J. Zech, “Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ,” Analysis and Applications, Singapore, vol. 17, no. 1, pp. 19–55, 2019.