Approximation Rates for Neural Networks with General Activation Functions

Abstract

We prove some new results concerning the approximation rate of neural networks with general activation functions. Our first result concerns the rate of approximation of a two layer neural network with a polynomially-decaying non-sigmoidal activation function. We extend the dimension independent approximation rates previously obtained to this new class of activation functions. Our second result gives a weaker, but still dimension independent, approximation rate for a larger class of activation functions, removing the polynomial decay assumption. This result applies to any bounded, integrable activation function. Finally, we show that a stratified sampling approach can be used to improve the approximation rate for polynomially decaying activation functions under mild additional assumptions.

1 Introduction

Deep neural networks have recently revolutionized a variety of areas of machine learning, including computer vision and speech recognition [10]. A deep neural network with $L$ layers is a statistical model which takes the following form

(1)

where each layer applies an affine linear function, $\sigma$ is a fixed activation function which is applied pointwise, and the coefficients of the affine maps are the parameters of the model.
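
As a concrete illustration, here is a minimal sketch of this composition in code (the layer sizes, the ReLU activation, and all names below are illustrative assumptions, not taken from (1)):

```python
import numpy as np

# Minimal sketch of the composition in (1): alternate affine maps with a fixed
# pointwise activation function.  The sizes, the ReLU choice, and all names
# here are illustrative assumptions only.
rng = np.random.default_rng(0)

def relu(t):
    return np.maximum(t, 0.0)

def deep_net(x, weights, biases, activation=relu):
    """Apply affine maps followed by a pointwise activation, layer by layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(W @ h + b)
    # the final layer is affine
    return weights[-1] @ h + biases[-1]

sizes = [4, 8, 8, 1]  # input dimension 4, two hidden layers, scalar output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

print(deep_net(rng.normal(size=4), weights, biases))
```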

The approximation properties of neural networks have received a lot of attention, with many positive results. For example, in [11, 3] it is shown that neural networks can approximate any function on a compact set as long as the activation function is not a polynomial, i.e. that the set

(2)

is dense in $C(\Omega)$ for any compact set $\Omega \subset \mathbb{R}^d$. An earlier result of this form can be found in [6], and [5] shows that derivatives can be approximated arbitrarily accurately as well. An elementary and constructive proof can be found in [1].

In addition, quantitative estimates on the order of approximation are obtained for sigmoidal activation functions in [2] and for periodic activation functions in [17] and [15]. Results for general activation functions can be found in [7]. A remarkable feature of these results is that the approximation rate is $O(n^{-\frac{1}{2}})$, where $n$ is the number of hidden neurons, which shows that neural networks can overcome the curse of dimensionality. Results concerning the approximation properties of generalized translation networks (a generalization of two-layer neural networks) for smooth and analytic functions are obtained in [16]. Approximation estimates for multilayer convolutional neural networks are considered in [25] and for multilayer networks with rectified linear activation functions in [24]. A comparison of the effect of depth vs. width on the expressive power of neural networks is presented in [12].

An optimal approximation rate in terms of highly smooth Sobolev norms is given in [19]. This work differs from previous work and the current work in that it considers the approximation of highly smooth functions, for which proof techniques based on the Hilbert space structure of Sobolev spaces can be used. In contrast, the line of reasoning initially pursued in [2] and continued in this work makes significantly weaker assumptions on the function to be approximated.

A review of a variety of known results, especially for networks with one hidden layer, can be found in [20]. More recently, these results have been improved by an additional factor in [9] using the idea of stratified sampling, based in part on the techniques in [14].

Our work, like much of the previous work, focuses on the case of two-layer neural networks. A two layer neural network can be written in the following particularly simple way

(3)

where the coefficients appearing in the sum are the parameters and $n$ is the number of hidden neurons in the model.
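
For instance, a direct transcription of this form (the Gaussian random parameters and the choice $\sigma = \tanh$ below are illustrative assumptions):

```python
import numpy as np

# Sketch of the two-layer form (3): a sum of n neurons, each a shifted and
# dilated copy of a fixed activation function.  The random parameters and the
# choice sigma = tanh are illustrative only.
rng = np.random.default_rng(1)
d, n = 3, 50                        # input dimension and number of hidden neurons

a = rng.normal(size=n)              # outer coefficients
omega = rng.normal(size=(n, d))     # inner weights
b = rng.normal(size=n)              # biases

def two_layer(x, sigma=np.tanh):
    # sum over i of a_i * sigma(omega_i . x + b_i)
    return np.sum(a * sigma(omega @ x + b))

print(two_layer(rng.normal(size=d)))
```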

In this work, we study how the approximation properties of two-layer neural networks depend on the number of hidden neurons. In particular, we consider the class of functions where the number of hidden neurons is bounded,

(4)

and prove Theorem 2, concerning the order of approximation as $n \to \infty$ for activation functions with polynomial decay, and Theorem 3, which applies to neural networks with periodic activation functions. Our results make the assumption that the function to be approximated, $f$, has bounded Barron norm

(5)

and we consider the problem of approximating $f$ on a bounded domain $\Omega$. This is a significantly weaker assumption than the strong smoothness assumption made in [19, 8]. Similar results appear in [2, 7], but we have improved their bound by a logarithmic factor for exponentially decaying activation functions and generalized these results to polynomially decaying activation functions. We also leverage this result to obtain a somewhat worse, though still dimension independent, approximation rate without the polynomial decay condition. This result ultimately applies to every activation function of bounded variation. Finally, we extend the stratified sampling argument in [9] to polynomially decaying activation functions in Theorem 5 and to periodic activation functions in Theorem 6. This gives an improvement on the asymptotic rate of convergence under mild additional assumptions.
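
For orientation, one common form of the Barron norm (5) (the precise weight is an assumption here; [2] and subsequent works use several closely related variants) is

\[
\|f\|_{\mathcal{B}} \;=\; \int_{\mathbb{R}^d} (1 + |\omega|)\, |\hat f(\omega)|\, d\omega,
\]

with higher-order versions obtained by raising the weight $(1+|\omega|)$ to a power matching the order of the Sobolev norm in which the approximation error is measured.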

The paper is organized as follows. In the next section, we discuss some basic results concerning the Fourier transform. We use these results to provide a simplified Fourier-analytic proof of the density result in [11] under the mild additional assumption of polynomial growth on the activation function. Then, in the third section, we study the order of approximation and prove Theorems 2 and 3, extending the results in [2, 7] to polynomially decaying and periodic activation functions, respectively, and removing a logarithmic factor in the rate of approximation. In the fourth section, we provide a new argument using an approximate integral representation to obtain dimension independent results without the polynomial decay condition in Theorem 4. In the fifth section, we use a stratified sampling argument to prove Theorems 5 and 6, which improve upon the convergence rates in Theorems 2 and 3 under mild additional assumptions. This generalizes the results in [9] to more general activation functions. Finally, we give concluding remarks and further research directions in the conclusion.

2 Preliminaries

Our arguments will make use of the theory of tempered distributions (see [23, 21] for an introduction), and we begin by collecting some results of independent interest, which will also be important later. We begin by noting that an activation function $\sigma$ which satisfies a polynomial growth condition $|\sigma(x)| \leq C(1+|x|)^{k}$ for some constants $C$ and $k$ is a tempered distribution. As a result, we make this assumption on our activation functions in the following theorems. We briefly note that this condition is sufficient, but not necessary (for instance, an integrable function need not satisfy a pointwise polynomial growth bound), for $\sigma$ to represent a tempered distribution.

We begin by studying the convolution of $\sigma$ with a Gaussian mollifier. Let $\eta$ be a Gaussian mollifier

(6)

Set $\eta_\epsilon(x) = \epsilon^{-1}\eta(\epsilon^{-1}x)$. Then consider

(7)

for a given activation function $\sigma$.

It is clear that . Moreover, by considering the Fourier transform (as a tempered distribution) we see that

(8)
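
For concreteness, with one standard choice of normalization (the symbols $\eta$, $\eta_\epsilon$, $\sigma_\epsilon$ and the Fourier convention below are fixed here only for illustration), one may take

\[
\eta(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad
\eta_\epsilon(x) = \frac{1}{\epsilon}\, \eta\!\left(\frac{x}{\epsilon}\right), \qquad
\sigma_\epsilon = \sigma * \eta_\epsilon,
\]

so that, with the convention $\hat u(\xi) = \frac{1}{\sqrt{2\pi}} \int u(x)\, e^{-i x \xi}\, dx$,

\[
\widehat{\sigma_\epsilon}(\xi) \;=\; \sqrt{2\pi}\, \hat\sigma(\xi)\, \hat\eta_\epsilon(\xi) \;=\; \hat\sigma(\xi)\, e^{-\epsilon^2 \xi^2 / 2},
\]

and the Gaussian factor is smooth and nowhere vanishing.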

We begin by stating a lemma which characterizes the set of polynomials in terms of their Fourier transform.

Lemma 1.

Given a tempered distribution $\sigma$, the following statements are equivalent:

  1. $\sigma$ is a polynomial,

  2. the mollified function $\sigma_\epsilon$ given by (7) is a polynomial for any $\epsilon > 0$,

  3. the Fourier transform $\hat\sigma$ is supported at the origin.

Proof.

We begin by proving that (3) and (1) are equivalent. This follows from a characterization of distributions supported at a single point (see [23], section 6.3). In particular, a distribution supported at a single point must be a finite linear combination of Dirac masses and their derivatives at that point. Thus, if $\hat\sigma$ is supported at the origin, then

(9)

Taking the inverse Fourier transform and noting that the inverse Fourier transform of a derivative of the Dirac mass at the origin is a polynomial, we see that $\sigma$ is a polynomial. This shows that (3) implies (1); for the converse we simply take the Fourier transform of a polynomial and note that it is a finite linear combination of Dirac masses and their derivatives.

Finally, we prove the equivalence of (2) and (3). For this it suffices to show that $\hat\sigma$ is supported at the origin if and only if $\widehat{\sigma_\epsilon}$ is supported at the origin. This follows from equation (8) and the fact that the Fourier transform of the Gaussian is nowhere vanishing. ∎
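
As a concrete instance of this correspondence (stated with the normalization $\hat u(\xi) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} u(x)\, e^{-i x \cdot \xi}\, dx$, one of several common conventions), a monomial transforms to a derivative of the Dirac mass at the origin:

\[
\widehat{x^{\alpha}} \;=\; (2\pi)^{d/2}\, i^{|\alpha|}\, \partial^{\alpha} \delta_{0},
\]

so that, on the Fourier side, polynomials are exactly the finite linear combinations of the Dirac mass at the origin and its derivatives.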

As an application of Lemma 1, let us give a simple proof of the following result. The first proof of this result can be found in [11] and is summarized in [20]. Extending this result to the case of non-smooth activation functions was first done in several steps in [11]. Our contribution is to provide a much simpler argument based on Fourier analysis.

Theorem 1.

Assume that $\sigma$ is a Riemann integrable function which satisfies a polynomial growth condition, i.e.

(10)

holds for some constants $C$ and $k$. Then if $\sigma$ is not a polynomial, the set defined in (2) is dense in $C(\Omega)$ for any compact $\Omega$.

Proof.

Let us first prove the theorem in the special case where $\sigma \in C^\infty$. Since $\sigma$ is smooth, it follows that for every

(11)

for all .

By the same argument, for

for all , , and .

Now

where . Since is not a polynomial there exists a such that . Taking and , we thus see that . Thus, all polynomials of the form are in .

This implies that the closure of the set (2) contains all polynomials. By Weierstrass's theorem [22], it follows that this closure contains $C(\Omega)$ for each compact $\Omega$; that is, the set (2) is dense in $C(\Omega)$.

By the preceding lemma, $\sigma * \eta_\epsilon$ is not a polynomial for some $\epsilon > 0$; since it is also smooth, the argument above shows that the corresponding set is dense. It therefore suffices to show that $\sigma * \eta_\epsilon$ lies in the closure of the set (2). This follows by using the Riemann integrability of $\sigma$ and approximating the integral

(12)

by a sequence of Riemann sums, each of which is clearly an element of the set (2). ∎
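
This last step can be checked numerically: a Gaussian mollification of the activation function is the limit of Riemann sums of shifted copies of the activation function, each of which is a finite neural network. The ReLU activation, the Gaussian width, and the grids below are toy choices.

```python
import numpy as np

# Toy check of the final step of Theorem 1: (sigma * eta_eps)(x) is approximated
# by Riemann sums of shifted copies of sigma, each of which is a finite sum of
# neurons.  The ReLU, the width eps, and the grids are illustrative choices.
relu = lambda t: np.maximum(t, 0.0)
eps = 0.5
eta = lambda t: np.exp(-t**2 / (2 * eps**2)) / (np.sqrt(2 * np.pi) * eps)

xs = np.linspace(-3, 3, 400)

# reference value of the convolution on a very fine grid
fine = np.linspace(-10, 10, 4001)
dx = fine[1] - fine[0]
ref = (relu(xs[:, None] - fine[None, :]) * eta(fine)[None, :]).sum(axis=1) * dx

for n in [5, 20, 80, 320]:
    nodes = np.linspace(-10, 10, n)
    h = nodes[1] - nodes[0]
    approx = (relu(xs[:, None] - nodes[None, :]) * eta(nodes)[None, :]).sum(axis=1) * h
    print(f"{n:4d} shifted neurons: max error {np.abs(approx - ref).max():.5f}")
```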

3 Convergence Rates in Sobolev Norms

In this section, we study the order of approximation for two-layer neural networks as the number of neurons increases. In particular, we consider the space of functions represented by a two-layer neural network with $n$ neurons and activation function $\sigma$ given in (4), and ask the following question: Given a function $f$ on a bounded domain, how many neurons do we need to approximate $f$ to a given accuracy?

Specifically, we will consider the problem of approximating a function with bounded Barron norm (5) in the Sobolev space $H^m(\Omega)$. Our first step will be to prove a lemma showing that the Sobolev norm is bounded by the Barron norm.

Lemma 2.

Let $m \geq 0$ be an integer and $\Omega$ a bounded domain. Then for any Schwartz function $f$, we have

(13)
Proof.

Let $\phi$ be a Schwartz function satisfying $\phi(x) = 1$ for $x \in \Omega$. Such a function exists because $\Omega$ is bounded. Let $\alpha$ be any multi-index with $|\alpha| \leq m$. Then we have

(14)

Now we use Young’s inequality to obtain

(15)

Combining this over all multi-indices $\alpha$ with $|\alpha| \leq m$, we get

(16)

as desired.
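
A slightly cruder route to a bound of the same type, which may clarify the role of the Barron norm (stated here with the weight $(1+|\omega|)^m$; the exact weight in (5) may differ), uses Fourier inversion directly: for any multi-index $\alpha$ with $|\alpha| \leq m$,

\[
|D^{\alpha} f(x)| \;=\; \left| (2\pi)^{-d/2} \int_{\mathbb{R}^d} (i\omega)^{\alpha}\, \hat f(\omega)\, e^{i \omega \cdot x}\, d\omega \right|
\;\leq\; (2\pi)^{-d/2} \int_{\mathbb{R}^d} (1 + |\omega|)^{m}\, |\hat f(\omega)|\, d\omega,
\]

so that $\|D^{\alpha} f\|_{L^2(\Omega)} \leq |\Omega|^{1/2}\, \|D^{\alpha} f\|_{L^{\infty}(\Omega)}$ is controlled by the Barron norm, with a constant depending only on $|\Omega|$ and $m$.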

If we let $\mathcal{B}$ denote the Banach space of functions with bounded Barron norm, then this lemma implies that $\mathcal{B} \subset H^m(\Omega)$ for any bounded set $\Omega$, since the Schwartz functions are clearly dense in $\mathcal{B}$ (as the Barron norm is a polynomially weighted $L^1$ norm in Fourier space).

We now prove the following result, which shows that functions with bounded Barron norm can be efficiently approximated by neural networks in the Sobolev norm when the activation function decays polynomially. A similar result can be found in [2], but our result applies to non-sigmoidal activation functions. The class of activation functions we consider neither contains nor is contained in the class of sigmoidal functions. Compared with the result on exponentially decaying activation functions in [7], we extend the results to polynomially decaying activation functions and improve the rate of approximation by a logarithmic factor.

Theorem 2.

Let $\Omega$ be a bounded domain. If the activation function $\sigma$ is non-zero and satisfies the polynomial decay condition

(17)

for and some , we have

(18)

for any .

Before we proceed to the proof, we discuss how this bound depends on the dimension $d$. We first note that the bound may in a sense depend on the dimension, as the measure $|\Omega|$ may be exponentially large in high dimensions. However, bounding the error over a larger set is also proportionally stronger. This can be seen by noting that dividing by the factor $|\Omega|$ transforms the left-hand side from the total squared error to the average squared error.

The dimension dependence of this result is a consequence of how the Barron norm behaves in high dimensions. This issue is discussed in [2], where the norm is analyzed for a number of different function classes. A particularly representative result found there shows that sufficiently smooth functions have bounded Barron norm, where the required number of derivatives depends upon the dimension. It is known that approximating functions with such a dimension dependent level of smoothness can be done efficiently [19, 8]. However, the Barron space is significantly larger than this class of smooth functions; in fact, Lemma 2 only provides a one-sided containment in a Sobolev space. The precise properties of the Barron norm in high dimensions are an interesting research direction which would help explain exactly how shallow neural networks help alleviate the curse of dimensionality.

Proof.

Note first that Lemma 2 implies that $f \in H^m(\Omega)$. We will later use that $f$ is an element of this Hilbert space.

Note that the decay condition on $\sigma$ implies that $\sigma \in L^1(\mathbb{R})$, and thus the Fourier transform of $\sigma$ is well-defined and continuous. Since $\sigma$ is non-zero, this implies that $\hat\sigma(a) \neq 0$ for some $a \neq 0$. Via a change of variables, this means that for all $\omega$ and $x$, we have

(19)

and so

(20)

Likewise, since the decay condition also provides the required integrability, we can differentiate the above expression under the integral.

This allows us to write the Fourier mode $e^{i\omega \cdot x}$ as an integral of neuron output functions. We substitute this into the Fourier representation of $f$ (note that our assumption on $f$ implies that $\hat f \in L^1$, so this is rigorously justified for a.e. $x$) to get

(21)

Given this integral representation, we follow a similar line of reasoning as in [2]. Our argument differs from previous arguments in how we write $f$ as a convex combination of shifts and dilations of $\sigma$. This is what allows us to relax our assumptions on $\sigma$. In order to do this, we must first find a way to normalize the above integral.

The above integral is over an unbounded domain, but the decay assumption on the Fourier transform of $f$ allows us to normalize the integral in the $\omega$ direction. To normalize the integral in the shift direction, we must use the assumption that $\Omega$ is bounded and that $\sigma$ decays polynomially. Consider the inner part of the above integral representation,

(22)

Note that by the triangle inequality and the boundedness of $\Omega$, we can obtain a lower bound on the argument appearing above, uniformly in $x \in \Omega$. Specifically, we have

(23)

where the constant is the maximum norm of an element of $\Omega$. Note that without loss of generality, we can translate $\Omega$ so that it contains the origin.

Combining this with the polynomial decay (17) of $\sigma$ implies that

(24)

Thus the function defined by

(25)

provides (up to a constant) an upper bound on this quantity uniformly in $x \in \Omega$. Its decay rate is fast enough to make it integrable in the shift variable. Moreover, its integral in the shift variable grows at most linearly with $|\omega|$. Namely, we calculate

(26)

Combining (26) with our assumption on the Fourier transform of $f$, we get

(27)

We use this to introduce a probability measure on the space of parameters, given by

(28)

which allows us to write (using the integral representation (21))

(29)

where

and

(30)

Since $f$ is real, we can replace the complex number appearing in the integrand by its real part to get

(31)

This writes $f$ as an infinite convex combination of functions of the form

(32)

We now use Lemma 1 from [2] to conclude that for each $n$ there exists an $f_n$, which is a convex combination of at most $n$ distinct functions of this form, and thus an element of the class (4), such that

(33)

where the constant on the right-hand side depends only on a uniform bound on the norms of the individual terms in the convex combination.
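
The content of this sampling lemma can be illustrated numerically: if a function is an average of uniformly bounded neuron-like functions over a parameter distribution, then the empirical mean of $n$ i.i.d. samples already achieves an $L^2$ error of order $n^{-1/2}$. The mixing distribution, the tanh neuron, and all names below are toy choices, not the construction used in the proof.

```python
import numpy as np

# Monte Carlo illustration of the Maurey-type sampling bound: an expectation of
# uniformly bounded neurons is approximated by an n-term empirical average with
# L2 error of order n^{-1/2}.  Everything here is a toy choice.
rng = np.random.default_rng(0)

def neuron(x, w, b):
    return np.tanh(w * x + b)               # toy one-dimensional neuron

def sample_params(n):                        # toy mixing distribution over (w, b)
    return rng.normal(0.0, 2.0, size=n), rng.uniform(-1.0, 1.0, size=n)

xs = np.linspace(-1.0, 1.0, 100)

# reference target: the expectation, estimated with a large reference sample
W, B = sample_params(50_000)
target = neuron(xs[:, None], W[None, :], B[None, :]).mean(axis=1)

for n in [10, 100, 1_000, 10_000]:
    w, b = sample_params(n)
    fn = neuron(xs[:, None], w[None, :], b[None, :]).mean(axis=1)
    err = np.sqrt(np.mean((fn - target) ** 2))
    print(f"n = {n:6d}   L2 error ~ {err:.4f}")
# the printed errors shrink roughly like n^{-1/2}, independently of the dimension
```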

We proceed to estimate this quantity. Since $\Omega$ is bounded, it has finite measure, and the Cauchy–Schwarz inequality implies that

(34)

and so it suffices to bound each of the terms on the right-hand side.

Expanding the definition (32) of , recalling that , and recalling the definition (30) of , we get

(35)

Since and , we obtain

and so, plugging this into (35), we get

(36)

Utilizing the fact (24), which provides an upper bound uniformly in $x \in \Omega$, we obtain

(37)

Plugging this into (36), summing over , and using (34), we obtain

(38)

Finally, utilizing the bound (27) and equation (33), we obtain

(39)

as desired. ∎

Finally, we note that the approximation rate in this theorem holds as long as the decay condition (17) holds for some finite linear combination of shifts and dilations of $\sigma$, i.e. the condition (17) need not hold for $\sigma$ itself. We state this as a corollary below.

Corollary 1.

Let $\sigma$ be an activation function and suppose that there exists a finite linear combination of shifts and dilations of $\sigma$ which satisfies the polynomial decay condition (17) in Theorem 2. Then for any $f$ satisfying the assumptions of Theorem 2, we have

(40)
Proof.

The result follows immediately from Theorem 2, together with the observation that a network whose activation function is such a finite linear combination can be rewritten as a network with activation function $\sigma$ and at most a constant factor more neurons, which implies that

(41)

This includes many popular activation functions, such as the rectified linear units [18] and logistic sigmoid activation functions. Below we provide a table listing some well-known activation functions to which this theorem applies.

Activation Function        Maximal
Sigmoidal (Logistic)
Arctan
Hyperbolic Tangent
SoftPlus [4]
ReLU [18]
Leaky ReLU [13]
$k$-th power of ReLU

Next, we consider the case of periodic activation functions. We show that neural networks with periodic activation functions achieve the same rate of approximation as in Theorem 2. The argument makes use of a modified integral representation and allows us to relax the smoothness condition on $\sigma$, which now only has to be integrable over one period.

Theorem 3.

Let $\Omega$ be a bounded domain. If the activation function $\sigma$ is a non-constant periodic function, we have

(42)

for any .

Proof.

By dilating if necessary, we may assume without loss of generality that $\sigma$ is $2\pi$-periodic. Consider the Fourier series of $\sigma$,

(43)

with coefficients

(44)

The assumption that $\sigma$ is non-constant means that there exists some $k_0 \neq 0$ such that the corresponding Fourier coefficient $c_{k_0}$ is non-zero. Note that we do not need the Fourier series to converge pointwise to $\sigma$; all we need is for some such coefficient to be non-zero and for the integrals in (44) to converge (which they do since $\sigma$ is integrable over one period). Notice that shifting $\sigma$ by $t$, i.e. replacing $\sigma(\cdot)$ by $\sigma(\cdot + t)$, multiplies the coefficient $c_{k_0}$ by $e^{i k_0 t}$. Setting the shift appropriately, we get

(45)
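
In the notation above, and writing $c_{k_0} \neq 0$, $k_0 \neq 0$, for a non-vanishing Fourier coefficient of the (after dilation, $2\pi$-periodic) activation function, the shift computation amounts to the identity

\[
\frac{1}{2\pi} \int_{0}^{2\pi} \sigma(\theta + t)\, e^{-i k_0 t}\, dt \;=\; c_{k_0}\, e^{i k_0 \theta},
\]

so that, setting $\theta = \frac{\omega \cdot x}{k_0}$,

\[
e^{i \omega \cdot x} \;=\; \frac{1}{2\pi\, c_{k_0}} \int_{0}^{2\pi} \sigma\!\left( \frac{\omega \cdot x}{k_0} + t \right) e^{-i k_0 t}\, dt,
\]

which expresses each Fourier mode as an integral of shifted and dilated copies of $\sigma$; the symbols $k_0$, $c_{k_0}$, and $\theta$ are chosen here only for illustration.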

Plugging this into the Fourier representation of $f$, we see that

(46)

Since $f$ is real, we can add this to its conjugate to obtain the representation

(47)

where .

Using the integrability condition on $\hat f$, we define a probability distribution on the parameters by

(48)

Then equation (47) becomes

(49)

We have now written $f$ as a convex combination of functions of this form. As in the proof of Theorem 2, we now utilize Lemma 1 in [2] and proceed to bound the approximation error using much the same argument. ∎

4 Activation Functions without Decay

In this section, we show that one can remove the decay condition on $\sigma$, at the cost of a slightly worse (though still dimension independent) approximation rate. The main new tool here is the use of an approximate integral representation, followed by an optimization over the accuracy of the representation. Finally, as a corollary we are able to obtain a dimension independent approximation rate for all bounded, integrable activation functions and all activation functions of bounded variation.

Theorem 4.

Let $\Omega$ be a bounded domain. Suppose that $\sigma$ is bounded and that there exists an open interval $I$ on which $\hat\sigma$ (as a tempered distribution) is a non-zero bounded function. Then we have

(50)

for any .

Proof.

Let $\xi_0 \in I$ be a Lebesgue point of $\hat\sigma$ for which $\hat\sigma(\xi_0) \neq 0$. Note that $\hat\sigma$ is bounded on $I$ and thus locally integrable there. Such a point must exist since the set of Lebesgue points has full measure and $\hat\sigma$ is non-zero on $I$. Let $\phi$ be a Schwartz function whose Fourier transform is supported in $I$ and suitably normalized. For a fixed small parameter, we note the identity

(51)

Here the equality comes from the assumption that $\hat\sigma$ is a bounded function on $I$ (i.e. that Plancherel's theorem holds using bona fide integrals for Schwartz functions whose Fourier transforms are supported in $I$) and the observation (or calculation) that

and

Define

(52)

Since $\xi_0$ is a Lebesgue point of $\hat\sigma$, we have

(53)

Thus, for sufficiently small , is bounded away from . In addition, because , we see that, since ,

(54)

Dividing by the quantity which is bounded away from zero (for small enough values of the parameter) and multiplying by the bounded factor, we get (using the identity (51))

(55)

where the implied constant depends upon the fixed data above, but not upon the small parameter, as long as the latter is sufficiently small.

(56)

Given that $\sigma$ and $\Omega$ are bounded, we obtain, by allowing the implied constant in the notation to depend on $\Omega$ (specifically, since we can translate $\Omega$ without loss of generality so that it contains the origin),

(57)

We now use this to construct the following approximate integral representation of .

(58)

Introducing the probability distribution

(59)

we can rewrite this as

(60)

The error term in this representation is bounded by the accuracy of the approximate representation. The first term is a convex combination of bounded functions and can be analyzed in the same way as in the proof of Theorem 2. In particular, we use Lemma 1 in [2], combined with the assumption that $f$ has bounded Barron norm, to see that

(61)

We now use the fact that , that , and that is bounded away from , to obtain

(62)

Optimizing over the accuracy parameter, i.e. choosing it to be an appropriate power of $n$, we obtain

(63)

as desired. Specifically, the notation used throughout the proof hides a constant which depends on $\sigma$, $\Omega$, and the fixed auxiliary Schwartz function introduced above. ∎
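
The optimization step at the end of the proof is an instance of a generic balancing argument: if an error bound has the form (the exponents $a, b > 0$ here are placeholders rather than the specific values appearing in the proof)

\[
E(n, \epsilon) \;\lesssim\; \epsilon^{-a}\, n^{-1/2} \;+\; \epsilon^{b},
\]

then choosing $\epsilon \sim n^{-\frac{1}{2(a+b)}}$ balances the two terms and yields

\[
E(n) \;\lesssim\; n^{-\frac{b}{2(a+b)}},
\]

a rate which is slower than $n^{-1/2}$ but still independent of the dimension.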

The assumption made on the activation function in Theorem 4 is quite technical. However, we note that it also appears to be extremely weak. In particular, the only bounded functions we have been able to find which don’t satisfy the assumption are quasi-periodic functions of the form

(64)

for a sequence of frequencies