Approximation capabilities of neural networks on unbounded domains


Abstract

We prove that if the activation function is a monotone sigmoid, relu, elu, softplus or leaky relu, then the shallow neural network is a universal approximator in . This generalizes classical universal approximation theorems on bounded domains.

We also prove that if the activation function is a sigmoid, relu, elu, softplus or leaky relu, then the shallow neural network expresses no non-zero functions in . Consequently a shallow relu network expresses no non-zero functions in . Some authors, on the other hand, have shown that deep relu networks are universal approximators in . Together we obtain a qualitative viewpoint which justifies the benefit of depth in the context of relu networks.

Index terms— Universal approximation theorem, unbounded domain, neural networks, sigmoid, relu, elu, softplus, leaky relu, tail risk, benefit of depth


1 Introduction

The universal approximation theorem in the mathematical theory of artificial neural networks was established by Cybenko [4], with various versions and different proofs contributed by Hornik-Stinchcombe-White [6] and Funahashi [5]. This theory treats the approximation of continuous functions as well as integrable functions.

Classical universal approximation theorems are mainly focused on bounded domains. We list here some articles which have also explored the theory on certain unbounded domains. Some authors, such as [8] and [2], studied the approximation capabilities of neural networks in the space of continuous functions vanishing at infinity. Hornik [7] studied the approximation capabilities of neural networks in , with respect to a finite input-space environment measure . Regarding approximation on unbounded domains with respect to the Lebesgue measure, Lu et al. [10] showed that the deep narrow relu network is a universal approximator in ; Clevert et al. [1] proved that the deep relu network with at most hidden layers is a universal approximator in ; Qu and Wang [17] proved that an artificial neural network with a single hidden layer and logistic activation is a universal approximator in .

Qu and Wang [17] pointed out the following connection between the universal approximation theory on and the management of tail risks in option trading: experiments in [17] demonstrated that in the design of an option price learning model, a decision function that fits into the approximation capability of networks in yields faster learning and better generalization performance. This suggests that a further study of the approximation capabilities of neural networks on unbounded domains is not only of theoretical value but also of practical use.

The main results of this paper are Theorem 4.7 and Theorem 5.7. We prove that if the activation function is a monotone sigmoid, relu, elu, softplus or leaky relu, then the shallow neural network is a universal approximator in . We also prove that if the activation function is a sigmoid, relu, elu, softplus or leaky relu, then the shallow neural network expresses no non-zero function in . As a corollary, a shallow relu network expresses no non-zero function in . In contrast, [10] and [1] showed that deep neural networks (even narrow ones) are universal approximators in . Together we obtain a qualitative viewpoint which justifies the benefit of depth for relu neural networks.

The organization of this article is as follows. In section 2 we prove Theorem 2.1, which demonstrates that the universal approximation capacity in is equivalent to the universal approximation capacity in . In section 3 we prove a result which will be used to show that typical shallow networks are not integrable functions on , as well as (as a corollary) on . In section 4 we discuss shallow networks with bounded, eventually monotone activation functions. In section 5 we discuss shallow networks with relu and other popular unbounded activation functions.

2 Approximation capabilities of networks on

Given any measurable function and we define

These are called ridge functions. The space of shallow networks on with activation function is denoted by and consists of
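As a concrete numerical illustration (a minimal sketch, not part of the paper's formal development; the function names and the choice of the logistic sigmoid are ours), a shallow network is a finite linear combination of ridge functions, each depending on the input only through a single scalar projection:

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-t))

def shallow_network(x, weights, biases, coeffs, activation=sigmoid):
    """Evaluate the shallow network sum_k c_k * sigma(w_k . x + b_k).

    x       : (m, n) array of m evaluation points in R^n
    weights : (K, n) array of directions w_k
    biases  : (K,)   array of shifts b_k
    coeffs  : (K,)   array of outer coefficients c_k

    Each summand is a ridge function: it depends on x only through
    the scalar w_k . x, so it is constant on the hyperplanes
    w_k . x = const.
    """
    pre = x @ weights.T + biases      # (m, K) pre-activations
    return activation(pre) @ coeffs   # weighted sum over hidden units
```

A single-unit network makes the ridge structure visible: all points on a hyperplane `w . x = const` receive the same value.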

The closure of a subset of a metric space is denoted by . The canonical Lebesgue measure on will be denoted by . The indicator function of a set will be denoted by . Before proving Theorem 2.1 we recall some facts from harmonic analysis. If then its Fourier transform is a bounded continuous function with the integral form

(1)
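For reference, in the convention of [13] the integral form reads as follows (the normalization factor is an assumption on our part; other texts omit the constant or place it differently):

```latex
\hat{f}(\xi) \;=\; (2\pi)^{-n/2} \int_{\mathbb{R}^n} e^{-i\,\xi\cdot x}\, f(x)\,dx,
\qquad f \in L^1(\mathbb{R}^n).
```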

Let be the Schwartz space on ([13]), which is a Fréchet space. Let be the space of tempered distributions, which is the dual of . The Fourier transform of is defined by

(2)
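The usual duality definition takes the following shape (a standard convention, sketched here for the reader's convenience; the paper's precise formula may differ in notation):

```latex
\widehat{T}(\varphi) \;=\; T(\widehat{\varphi}\,),
\qquad \varphi \in \mathscr{S}(\mathbb{R}^n),\; T \in \mathscr{S}'(\mathbb{R}^n).
```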

For there is a natural embedding ([13][p.135]). If then its Fourier transform (2) agrees with (1). To make Cybenko's strategy work in our case we need to show that a certain is zero. Because the Fourier transform is an isomorphism of ([13][Theorem IX.2]), it suffices to show that

Theorem 2.1.

Let , and . If a measurable function satisfies then

Proof.

Write . Suppose and that our theorem is not true. By the Hahn–Banach theorem there exist a nonzero and a nonzero real-valued such that if then

(3)

and we have .

If satisfies and if , then we show . Indeed by change of variables we have

Moreover by definition one can check . Put together, if satisfies and if , then , and then by (3), . Next we prove a stronger fact: if satisfies and , then

(4)

Because , for any there exists such that . From (4) we have and furthermore

The above inequality holds for all and proves (4). For write . We identify as an element of by assigning , and we define elements of by setting . With respect to the topology of we have and therefore

By setting we have and

The Fourier transform of integrable is represented by the integral form

(5)

As is represented by , is also represented by . Hence if then

Let be any compact set in and let . There is a constant such that is bounded from above by . Let . If , and , then and . Therefore . Similarly, if , and , then . In particular we have proved that if then

(6)

For any , let and , respectively, consist of all points and , respectively, such that there exists satisfying:

Figure 1: The contribution of to is zero; therefore the estimation of involves only . If is sufficiently large, stays away from . This fact implies that is stable in the sense of (8).
Figure 2: For the purpose of a rigorous limit argument, instead of a single , our estimation relies on , where is a compact set in . If is sufficiently large is stable in the sense of (9).

Because are connected, (6) implies that if and then

(7)

Let and be integers greater than . We want to show that

(8)

Let ; we have and . It is obvious that and , which implies . By (7), we have and therefore . We have proved that . The same argument also proves , and consequently . Similar arguments prove the other part of (8). By (7) and by (8), the number

is finite and is independent of the choice of . Moreover (8) leads to

(9)

which implies that moves to infinity as goes to infinity. Write

We want to prove

(10)

By definition the part of (10) is true. Now assume that and . By assumption there exists such that . Suppose ; then the line intersects the boundary of , as

Let be the intersection point. We have . This together with the definition of gives , contradicting . Therefore is not true, and instead . Suppose ; then there exists such that , which in the same way contradicts . Therefore is not true, and instead . Putting these together, we have proved that if satisfies the left-hand side of then . Finally we proved , which leads to

(11)

For , we define . Because is bounded and has bounded support, . By and (11), for ,

Therefore for any we have

Because , by (4) we have and therefore

Take then by (5) and the above equality,

(12)

Let with where is a compact subset of . By (9), for any there exists such that if then

and for all such and , using (2) and the fact that ,

For all we have

Consequently for all we have

By similar arguments for for all we have

This implies that is supported on the hyperplane . Because , and . As is represented by some function in , and therefore , which contradicts the fact that is non-zero. ∎

3 Inexpressivity of sums of ridge functions

It is well known that “the graph of a ridge function is a ruled surface, and it is this ruled nature that makes ridge functions difficult to use in any standard way in approximation theory” ([9]). For instance, [9][Proposition 1.1] shows that the space contains no ridge function except 0. In this section we shall prove a stronger result.
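The obstruction to integrability can be sketched in coordinates (a standard computation, stated here under the assumption of a nonzero direction $a \in \mathbb{R}^n$): after an orthogonal change of variables sending $a/|a|$ to the first coordinate axis, a ridge function $x \mapsto g(a\cdot x)$ is constant in the remaining $n-1$ directions, so by Fubini's theorem

```latex
\int_{\mathbb{R}^n} \bigl|g(a\cdot x)\bigr|\,dx
\;=\; \frac{1}{|a|}\int_{\mathbb{R}} |g(t)|\,dt \;\cdot\; \int_{\mathbb{R}^{n-1}} 1\,dy
\;=\; \infty
\quad\text{unless } g = 0 \text{ a.e.}
```

Thus no nonzero ridge function on $\mathbb{R}^n$, $n \ge 2$, can be integrable; the proposition below strengthens this to sums of ridge functions.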

Proposition 3.1.

Let and . Suppose real-valued functions defined on and satisfy

  1. For all and , there exist real numbers and such that

  2. is uniformly continuous.

  3. for some .

Then we have

Proof.

Throughout the proof, refers to the point in . Because

without loss of generality we can assume

(13)

where are different numbers in . Condition 1 in the assumption implies that this condition holds for as well, i.e. for all and , there exist real numbers and such that

We claim that are all linear functions. If this is not true then pick for which is not linear. By nonlinearity, there exists such that

(14)

The function satisfies:

  1. and is uniformly continuous.

  2. There exist functions such that and that for all and there are real numbers satisfying

  3. .

Fact follows from . Fact follows from

where and the following calculation

Fact follows from the following and (14):

Next we prove that for all

(15)

By , the behavior of at infinity is the same as that of a linear function in . Therefore exists in . If this limit is not zero then there exist and such that if then . As is uniformly continuous, there exists such that if then