On Approximation Capabilities of ReLU Activation and Softmax Output Layer in Neural Networks

Abstract

In this paper, we have extended the well-established universal approximation theory to neural networks that use the unbounded ReLU activation function and a nonlinear softmax output layer. We have proved that a sufficiently large neural network using the ReLU activation function can approximate any function in $L^1$ up to any arbitrary precision. Moreover, our theoretical results have shown that a large enough neural network using a nonlinear softmax output layer can also approximate any indicator function in $L^1$, which is equivalent to mutually-exclusive class labels in any realistic multiple-class pattern classification problem. To the best of our knowledge, this work is the first theoretical justification for using softmax output layers in neural networks for pattern classification.

1 Introduction

In recent years, neural networks have been revived in the field of machine learning as a dominant method for supervised learning. In practice, large-scale neural networks are trained to classify their inputs based on a huge amount of training samples consisting of input-output pairs. Many different types of network structures can be used to specify the architecture of a neural network, such as fully-connected neural networks, recurrent neural networks, and convolutional neural networks. Fundamentally speaking, all neural networks attempt to approximate an unknown target function from their input space to their output space. Recent experimental results in many application domains have empirically demonstrated the superb power of various types of neural networks in terms of approximating an unknown function. On the other hand, the universal approximation theory was well established for neural networks several decades ago Cybenko (1989); Hornik (1991). However, these theoretical results were proved for the early neural network configurations that were popular at that time, such as sigmoidal activation functions and fully-connected linear output layers. Since then, many new configurations have been adopted for neural networks. For example, a new type of nonlinear activation function, called the rectified linear unit (ReLU) activation, was originally proposed in Jarrett et al. (2009); Nair and Hinton (2010); Glorot et al. (2011), and it was quickly and widely adopted in neural networks due to its superior performance in training deep neural networks. When the ReLU activation was initially proposed, it came as quite a surprise to many people because the ReLU function is actually unbounded. In addition, it is common practice to use a nonlinear softmax output layer when neural networks are used for a pattern classification task Bridle (1990b, a). The softmax output layer significantly changes the range of the outputs of a neural network due to its nonlinearity. Obviously, the theoretical results in Cybenko (1989); Hornik (1991) are not directly applicable to the unbounded ReLU activation function and the nonlinear softmax output layers in neural networks.

In this paper, we study the approximation power of neural networks that use the ReLU activation function and a softmax output layer, and we extend the universal approximation theory in Cybenko (1989); Hornik (1991) to address these two cases. As in Cybenko (1989); Hornik (1991), our theoretical results are established for a simple feed-forward neural network with a single fully-connected hidden layer. Our results can be further extended to more complicated structures, since many of those structures can be viewed as special cases of fully-connected layers. The major results of this work are: i) a sufficiently large neural network using the ReLU activation function can approximate any function in $L^1$ up to any arbitrary precision; ii) a sufficiently large neural network using a softmax output layer can approximate well any indicator function in $L^1$, which is equivalent to mutually-exclusive class labels in a realistic multiple-class pattern classification problem.

2 Related Works

Under the notion of universal approximators, the approximation capabilities of some types of neural networks have been studied by other authors. For example, K. Hornik in Hornik (1991) has shown the power of fully-connected neural networks using a bounded and non-constant activation function to approximate any function in $L^p(I_n)$, where $1 \le p < \infty$ and $I_n$ stands for the unit hypercube in an $n$-dimensional space, and $L^p(I_n)$ denotes the space of all functions $f$ defined on $I_n$ such that $\int_{I_n} |f(\mathbf{x})|^p \, d\mathbf{x} < \infty$. He has proved that if the activation function is bounded and non-constant, a large enough single-hidden-layer feed-forward neural network can approximate any function in $L^p(I_n)$. He has also proved that the same network using a continuous, bounded and non-constant activation function can approximate any function in $C(I_n)$, where $C(I_n)$ denotes the space of all continuous functions defined on the unit hypercube $I_n$. Moreover, G. Cybenko in Cybenko (1989) has proved that any target function in $C(I_n)$ can be approximated by a single-hidden-layer feed-forward neural network using any continuous sigmoidal activation function. He has also demonstrated (in Theorem 4) that the same network with a bounded measurable sigmoidal activation function can approximate any function in $L^1(I_n)$, where a sigmoidal activation function is defined as a monotonically-increasing function whose limit is 1 when approaching $+\infty$ and 0 when approaching $-\infty$. In this paper, we will extend this theorem to neural networks using the unbounded ReLU activation function.

There are also other works related to the approximation power of neural networks using the ReLU activation function, but we do not use them in our proofs. For example, in Leshno et al. (1993), it has been proven that an activation function ensures that a neural network is a universal approximator if and only if this function is not a polynomial almost everywhere. The result of that paper could be applied to the ReLU activation function, but their target functions lie in the space of continuous functions. In Arora et al. (2018), the authors have shown that every function in $L^1(\mathbb{R}^n)$ can be approximated in the $L^1$ norm by a ReLU deep neural network whose number of hidden layers grows only logarithmically with the input dimension $n$. Moreover, for $n = 1$, any such function can be arbitrarily well approximated by a 2-layer DNN, with some tight approximation bounds on the size of the neural networks.

On the other hand, we have not found much theoretical work on the approximation capability of a softmax neural network. A softmax output layer is simply taken for granted in pattern classification problems.

3 Notations

We represent the input of a neural network as $\mathbf{x} \in I_n$, where $I_n$ denotes the $n$-dimensional unit hypercube, i.e. $I_n = [0,1]^n$. As in Figure 1, the output of a single-hidden-layer feed-forward neural network using a linear output layer can be represented as a superposition of the activation functions $\sigma(\cdot)$:

$$G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \, \sigma(\mathbf{w}_j \cdot \mathbf{x} + b_j)$$

where $N$ is the number of hidden units, $\mathbf{w}_j \in \mathbb{R}^n$ denotes the input weights of the $j$-th hidden unit, $\alpha_j \in \mathbb{R}$ stands for the output weight of the $j$-th hidden unit, and $b_j \in \mathbb{R}$ is the bias of the $j$-th hidden unit.

Figure 1: An illustration of a single-output feedforward neural network without any softmax layer.

If a neural network is used for a multiple-class pattern classification problem, as shown in Figure 2, we use the notation $G_k(\mathbf{x})$ for $k = 1, \cdots, K$ to represent the $k$-th output of the neural network prior to the softmax layer. Similarly, we have $G_k(\mathbf{x}) = \sum_{j=1}^{N} \alpha_{kj} \, \sigma(\mathbf{w}_{kj} \cdot \mathbf{x} + b_{kj})$. Also, we define a vector output $\mathbf{G}(\mathbf{x}) = \big[G_1(\mathbf{x}), \cdots, G_K(\mathbf{x})\big]$, where each $G_k(\mathbf{x})$ is one of the $K$ outputs of the network before adding a softmax layer. Furthermore, after adding a softmax layer to the above $K$ outputs, the $k$-th output of the network after the softmax layer may be represented as:

$$\hat{G}_k(\mathbf{x}) = \frac{e^{G_k(\mathbf{x})}}{\sum_{l=1}^{K} e^{G_l(\mathbf{x})}} \qquad (k = 1, \cdots, K).$$

Figure 2: An illustration of a multi-output feedforward neural network with a softmax layer.
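To make the above notation concrete, the following short numerical sketch (in Python with NumPy; the network sizes and weight values are arbitrary placeholders, not quantities from this paper) computes the pre-softmax outputs $G_k(\mathbf{x})$ and the softmax outputs $\hat{G}_k(\mathbf{x})$ of a single-hidden-layer network:

import numpy as np

def relu_hidden(x, W, b):
    # W: (N, n) input weights, b: (N,) biases; N hidden units with ReLU activation
    return np.maximum(0.0, W @ x + b)

def pre_softmax_outputs(x, W, b, A):
    # A: (K, N) output weights; G_k(x) = sum_j A[k, j] * ReLU(w_kj . x + b_kj)
    return A @ relu_hidden(x, W, b)

def softmax(G):
    # subtract the maximum for numerical stability; the result is positive and sums to 1
    e = np.exp(G - np.max(G))
    return e / e.sum()

rng = np.random.default_rng(0)
n, N, K = 3, 8, 4                     # input dimension, hidden units, number of classes
W = rng.normal(size=(N, n))
b = rng.normal(size=N)
A = rng.normal(size=(K, N))
x = rng.uniform(size=n)               # an input point in the unit hypercube I_n
G = pre_softmax_outputs(x, W, b, A)   # outputs before the softmax layer
G_hat = softmax(G)                    # outputs after the softmax layer
print(G, G_hat, G_hat.sum())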

4 Main Results

If we use the ReLU activation function for all hidden nodes of the neural network in Figure 1, we can prove that the network still maintains the universal approximation property as long as the number of hidden nodes is large enough.

Theorem 1 (Universal Approximation with ReLU)

When ReLU is used as the activation function, the output of a single-hidden-layer neural network

$$G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \, \mathrm{ReLU}(\mathbf{w}_j \cdot \mathbf{x} + b_j) \qquad (1)$$

is dense in $L^1(I_n)$ if $N$ is large enough. In other words, given any function $f \in L^1(I_n)$ and any arbitrarily small positive number $\epsilon > 0$, there exists a sum $G(\mathbf{x})$ of the above form, for which:

$$\int_{I_n} \big| G(\mathbf{x}) - f(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon. \qquad (2)$$
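Before turning to the proof, a small numerical illustration of the statement may be helpful: on the unit interval, a continuous piecewise-linear interpolant of a target function can be written exactly as a sum of the form in eq.(1), and its $L^1$ error shrinks as the number of hidden units grows. The sketch below is only an illustration under these assumptions; the target $\sin(2\pi x)$ and the grid size are arbitrary choices, not part of the theorem.

import numpy as np

relu = lambda z: np.maximum(0.0, z)

# target function on the unit interval and a grid of knots
f = lambda x: np.sin(2 * np.pi * x)
m = 50
knots = np.linspace(0.0, 1.0, m + 1)
vals = f(knots)

# continuous piecewise-linear interpolant written as a sum of ReLU units:
# G(x) = vals[0] * ReLU(0*x + 1) + sum_i c_i * ReLU(x - knots[i])
slopes = np.diff(vals) / np.diff(knots)
c = np.concatenate([[slopes[0]], np.diff(slopes)])     # slope changes at the knots

def G(x):
    out = vals[0] * relu(0.0 * x + 1.0)                # constant term as a degenerate ReLU unit
    out += np.sum(c[:, None] * relu(x[None, :] - knots[:m, None]), axis=0)
    return out

xs = np.linspace(0.0, 1.0, 2001)
err = np.mean(np.abs(G(xs) - f(xs)))   # approximates the L1 error over the unit interval
print(err)                             # shrinks as the number of knots (hidden units) grows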

Proof:

First of all, we construct a special activation function $\sigma(x)$ using the ReLU function as:

$$\sigma(x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(x-1).$$

Obviously, $\sigma(x)$ is a piece-wise linear function given as follows:

$$\sigma(x) = \begin{cases} 0 & x \le 0 \\ x & 0 < x < 1 \\ 1 & x \ge 1 \end{cases} \qquad (3)$$

The above function is also plotted in Figure 3.

Figure 3: A special activation function as constructed in eq.(3).
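The identity between the difference of ReLUs and the piecewise-linear form in eq.(3) can be spot-checked numerically; the sketch below assumes the particular construction $\sigma(x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(x-1)$ used above.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigma(x):
    # the constructed activation: a difference of two shifted ReLUs
    return relu(x) - relu(x - 1.0)

def sigma_piecewise(x):
    # the piecewise-linear form in eq.(3): 0 for x <= 0, x on (0, 1), 1 for x >= 1
    return np.clip(x, 0.0, 1.0)

xs = np.linspace(-3, 3, 601)
assert np.allclose(sigma(xs), sigma_piecewise(xs))   # the two forms agree
print(sigma(np.array([-2.0, 0.5, 2.0])))             # -> [0.  0.5 1. ], a bounded sigmoidal shape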

Since $\sigma(x)$ is a bounded measurable sigmoidal function, based on Theorem 4 in Cybenko (1989), the sum

$$S(\mathbf{x}) = \sum_{j=1}^{M} \alpha_j \, \sigma(\mathbf{w}_j \cdot \mathbf{x} + b_j)$$

is dense in $L^1(I_n)$. In other words, given any function $f \in L^1(I_n)$ and $\epsilon > 0$, there exists a sum $S(\mathbf{x})$ as above, for which:

$$\int_{I_n} \big| S(\mathbf{x}) - f(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon.$$

Substituting $\sigma(x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(x-1)$ into the above sum, we have:

$$S(\mathbf{x}) = \sum_{j=1}^{M} \alpha_j \Big[ \mathrm{ReLU}(\mathbf{w}_j \cdot \mathbf{x} + b_j) - \mathrm{ReLU}(\mathbf{w}_j \cdot \mathbf{x} + b_j - 1) \Big].$$

We can further re-arrange it into:

$$G(\mathbf{x}) = \sum_{k=1}^{2M} \alpha'_k \, \mathrm{ReLU}(\mathbf{w}'_k \cdot \mathbf{x} + b'_k) \qquad (4)$$

where, for $j = 1, \cdots, M$:

$$\alpha'_j = \alpha_j, \quad \mathbf{w}'_j = \mathbf{w}_j, \quad b'_j = b_j, \qquad \alpha'_{M+j} = -\alpha_j, \quad \mathbf{w}'_{M+j} = \mathbf{w}_j, \quad b'_{M+j} = b_j - 1.$$

Given any function $f \in L^1(I_n)$, we have thus found a sum $G(\mathbf{x})$ of the form in eq.(1) with $N = 2M$, as shown in eq.(4), that satisfies $\int_{I_n} |G(\mathbf{x}) - f(\mathbf{x})| \, d\mathbf{x} < \epsilon$ for any small $\epsilon > 0$; therefore the proof is completed.
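The re-arrangement in eq.(4) can also be verified numerically: a sum of $M$ units with the constructed activation $\sigma$ equals a sum of $2M$ plain ReLU units whose weights are copied and shifted as above. A minimal sketch (random placeholder weights):

import numpy as np

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(1)
n, M = 3, 5
alpha = rng.normal(size=M)
Wm = rng.normal(size=(M, n))
bm = rng.normal(size=M)
x = rng.uniform(size=n)

# sum of M units using sigma(z) = ReLU(z) - ReLU(z - 1)
S = np.sum(alpha * (relu(Wm @ x + bm) - relu(Wm @ x + bm - 1.0)))

# equivalent sum of 2M plain ReLU units, as in eq.(4)
alpha2 = np.concatenate([alpha, -alpha])          # alpha'_{M+j} = -alpha_j
W2 = np.concatenate([Wm, Wm])                     # w'_{M+j} = w_j
b2 = np.concatenate([bm, bm - 1.0])               # b'_{M+j} = b_j - 1
G = np.sum(alpha2 * relu(W2 @ x + b2))

assert np.isclose(S, G)                           # the two sums are identical
print(S, G)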

In the following, let us extend Theorem 1 to a ReLU neural network with multiple outputs without using any softmax output layer.

Corollary 1

Given any vector-valued function $\mathbf{f}(\mathbf{x}) = \big[f_1(\mathbf{x}), \cdots, f_K(\mathbf{x})\big]$, where each component function $f_k \in L^1(I_n)$, there exists a single-hidden-layer neural network yielding a vector output $\mathbf{G}(\mathbf{x}) = \big[G_1(\mathbf{x}), \cdots, G_K(\mathbf{x})\big]$, each component of which is a sum of the form:

$$G_k(\mathbf{x}) = \sum_{j=1}^{N} \alpha_{kj} \, \mathrm{ReLU}(\mathbf{w}_{kj} \cdot \mathbf{x} + b_{kj}) \qquad (5)$$

to satisfy

$$\int_{I_n} \big| G_k(\mathbf{x}) - f_k(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon \qquad (k = 1, \cdots, K)$$

for any arbitrarily small positive number $\epsilon$, which is equivalent to

$$\max_{1 \le k \le K} \int_{I_n} \big| G_k(\mathbf{x}) - f_k(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon.$$

Proof:

We use mathematical induction on $K$ to prove this corollary.

For $K = 1$, the problem reduces to Theorem 1. Next, we assume that the corollary is correct for $K - 1$ ($K \ge 2$), and we will prove that it also holds for $K$. Based on this assumption, given $f_1, \cdots, f_{K-1}$ and $\epsilon > 0$, there exist some sums, i.e. $G_1(\mathbf{x}), \cdots, G_{K-1}(\mathbf{x})$, of the form in eq.(5), for which:

$$\int_{I_n} \big| G_k(\mathbf{x}) - f_k(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon \qquad (k = 1, \cdots, K-1).$$

Now we need to approximate $f_K$. Based on Theorem 1, for the given $f_K \in L^1(I_n)$ and $\epsilon > 0$, there exists a sum of the form in eq.(1), denoted $G_K(\mathbf{x})$, which satisfies:

$$\int_{I_n} \big| G_K(\mathbf{x}) - f_K(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon.$$

Now we can construct a new neural network with $K$ outputs $\big[G_1(\mathbf{x}), \cdots, G_K(\mathbf{x})\big]$ by stacking the hidden units of the above $K$ sub-networks into a single hidden layer and defining the output weights so that the $k$-th output connects to the hidden units of the $k$-th sub-network with its original weights $\alpha_{kj}$ and to all remaining hidden units with weight $0$.

With this construction, each output of the new network coincides with the corresponding sum above, so we have:

$$\int_{I_n} \big| G_k(\mathbf{x}) - f_k(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon \qquad (k = 1, \cdots, K).$$

Therefore, the proof is completed.
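A minimal sketch of the stacking construction used in this proof (placeholder sizes and random weights; the $k$-th output reads only its own hidden units through block-structured output weights):

import numpy as np

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(2)
n, K = 3, 4
sizes = [5, 7, 6, 4]                                   # hidden units of the K separate networks
nets = [(rng.normal(size=(N, n)), rng.normal(size=N), rng.normal(size=N)) for N in sizes]

# merged network: concatenate all hidden units, zero-pad the output weights
W = np.concatenate([Wk for Wk, bk, ak in nets])        # (sum(sizes), n) input weights
b = np.concatenate([bk for Wk, bk, ak in nets])        # (sum(sizes),) biases
A = np.zeros((K, sum(sizes)))
offset = 0
for k, (Wk, bk, ak) in enumerate(nets):
    A[k, offset:offset + len(ak)] = ak                 # k-th output uses only its own units
    offset += len(ak)

x = rng.uniform(size=n)
separate = np.array([ak @ relu(Wk @ x + bk) for Wk, bk, ak in nets])
merged = A @ relu(W @ x + b)
assert np.allclose(separate, merged)                   # the merged network reproduces all K outputs
print(separate)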

In the following, we will investigate how the softmax output layer may affect the approximation capabilities of neural networks.

Lemma 1

Given any vector-valued function $\mathbf{f}(\mathbf{x}) = \big[f_1(\mathbf{x}), \cdots, f_K(\mathbf{x})\big]$, where each component function $f_k \in L^1(I_n)$, there exists a vector-valued function $\mathbf{G}(\mathbf{x}) = \big[G_1(\mathbf{x}), \cdots, G_K(\mathbf{x})\big]$, each component of which is a sum of the form in eq.(5), so that their softmax outputs satisfy:

$$\int_{I_n} \Big| \hat{G}_k(\mathbf{x}) - \hat{f}_k(\mathbf{x}) \Big| \, d\mathbf{x} < \epsilon \qquad (k = 1, \cdots, K) \qquad (6)$$

for any small positive value $\epsilon$, where

$$\hat{G}_k(\mathbf{x}) = \frac{e^{G_k(\mathbf{x})}}{\sum_{l=1}^{K} e^{G_l(\mathbf{x})}}, \qquad \hat{f}_k(\mathbf{x}) = \frac{e^{f_k(\mathbf{x})}}{\sum_{l=1}^{K} e^{f_l(\mathbf{x})}}.$$

Proof:

First, the softmax function is Lipschitz continuous everywhere in its domain.1 Specifically, writing $\hat{s}_k(\cdot)$ for the $k$-th softmax output, for any two vectors $\mathbf{z}, \mathbf{z}' \in \mathbb{R}^K$ and any $k$ we have $\big| \hat{s}_k(\mathbf{z}) - \hat{s}_k(\mathbf{z}') \big| \le \frac{1}{2} \sum_{l=1}^{K} |z_l - z'_l|$. Therefore, for every $\mathbf{x} \in I_n$,

$$\Big| \hat{G}_k(\mathbf{x}) - \hat{f}_k(\mathbf{x}) \Big| \le \frac{1}{2} \sum_{l=1}^{K} \big| G_l(\mathbf{x}) - f_l(\mathbf{x}) \big|.$$

Second, according to Corollary 1, for any $\mathbf{f}(\mathbf{x})$ and $\delta = 2\epsilon/K > 0$, there exists a function $\mathbf{G}(\mathbf{x})$, each component of which is a sum of the form in eq.(5), such that:

$$\int_{I_n} \big| G_l(\mathbf{x}) - f_l(\mathbf{x}) \big| \, d\mathbf{x} < \delta \qquad (l = 1, \cdots, K).$$

Putting these two results together, for every $k$ we have

$$\int_{I_n} \Big| \hat{G}_k(\mathbf{x}) - \hat{f}_k(\mathbf{x}) \Big| \, d\mathbf{x} \le \frac{1}{2} \sum_{l=1}^{K} \int_{I_n} \big| G_l(\mathbf{x}) - f_l(\mathbf{x}) \big| \, d\mathbf{x} < \frac{K\delta}{2} = \epsilon,$$

which completes the proof.
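The Lipschitz bound on the softmax used in the first step can be spot-checked numerically; the constant $1/2$ with respect to the $\ell_1$ distance between the inputs follows from the softmax Jacobian (see footnote 1). A short sketch over random input pairs:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(3)
K = 5
worst_ratio = 0.0
for _ in range(10000):
    z, zp = rng.normal(size=K) * 3, rng.normal(size=K) * 3
    lhs = np.max(np.abs(softmax(z) - softmax(zp)))     # max_k |s_k(z) - s_k(z')|
    rhs = 0.5 * np.sum(np.abs(z - zp))                 # (1/2) * ||z - z'||_1
    worst_ratio = max(worst_ratio, lhs / rhs)
assert worst_ratio <= 1.0                              # the bound is never violated
print(worst_ratio)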

Definition 1 (Indicator Function)

We define a vector-valued function $\mathbf{q}(\mathbf{x}) = \big[q_1(\mathbf{x}), \cdots, q_K(\mathbf{x})\big]$ as an indicator function if it simultaneously satisfies the following two conditions:

  1. $q_k(\mathbf{x}) \in \{0, 1\}$ for all $\mathbf{x} \in I_n$ and all $k = 1, \cdots, K$;
  2. $\sum_{k=1}^{K} q_k(\mathbf{x}) = 1$ for all $\mathbf{x} \in I_n$.

In other words, for any input $\mathbf{x}$, an indicator function will yield one but only one '1' in a component and '0' for all remaining components. Obviously, an indicator function can represent mutually-exclusive class labels in a multiple-class pattern classification problem. Conversely, the class labels from any such pattern classification problem may be represented by an indicator function.
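As a concrete example, the following sketch builds an indicator (one-hot) function for a $K$-class problem and checks the two defining conditions; the labelling rule based on the first coordinate is purely hypothetical and only serves as an illustration.

import numpy as np

def indicator(x, K=3):
    # a hypothetical labelling rule on I_n: the class is decided by the first coordinate
    label = min(int(x[0] * K), K - 1)
    q = np.zeros(K)
    q[label] = 1.0                      # exactly one component equals 1, the rest are 0
    return q

rng = np.random.default_rng(4)
for _ in range(5):
    q = indicator(rng.uniform(size=2))
    assert set(q).issubset({0.0, 1.0}) and q.sum() == 1.0   # the two defining conditions
    print(q)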

Theorem 2 (Approximation Capability of Softmax)

Given any indicator function $\mathbf{q}(\mathbf{x}) = \big[q_1(\mathbf{x}), \cdots, q_K(\mathbf{x})\big]$, assume each component function $q_k \in L^1(I_n)$ ($k = 1, \cdots, K$). Then there exist some sums of the form in eq.(5), i.e. $\mathbf{G}(\mathbf{x}) = \big[G_1(\mathbf{x}), \cdots, G_K(\mathbf{x})\big]$, such that their corresponding softmax outputs satisfy:

$$\int_{I_n} \Big| \hat{G}_k(\mathbf{x}) - q_k(\mathbf{x}) \Big| \, d\mathbf{x} < \epsilon \qquad (k = 1, \cdots, K) \qquad (7)$$

for any small $\epsilon > 0$, where $\hat{G}_k(\mathbf{x}) = e^{G_k(\mathbf{x})} \big/ \sum_{l=1}^{K} e^{G_l(\mathbf{x})}$.

Proof:

First, we define a new vector-valued function $\mathbf{f}(\mathbf{x}) = \big[f_1(\mathbf{x}), \cdots, f_K(\mathbf{x})\big]$, where each component function is defined as $f_k(\mathbf{x}) = \lambda \, q_k(\mathbf{x})$ for all $k$, with a scaling constant $\lambda > 0$ to be chosen later. Note that $f_k \in L^1(I_n)$ due to the assumption $q_k \in L^1(I_n)$. According to the triangular inequality, for any $k$, we have:

$$\int_{I_n} \Big| \hat{G}_k(\mathbf{x}) - q_k(\mathbf{x}) \Big| \, d\mathbf{x} \le \int_{I_n} \Big| \hat{G}_k(\mathbf{x}) - \hat{f}_k(\mathbf{x}) \Big| \, d\mathbf{x} + \int_{I_n} \Big| \hat{f}_k(\mathbf{x}) - q_k(\mathbf{x}) \Big| \, d\mathbf{x}.$$

Second, based on Lemma 1, for the function $\mathbf{f}(\mathbf{x})$ defined above and $\epsilon/2$, there exists a function $\mathbf{G}(\mathbf{x})$, each component of which is in the form of eq.(5), to ensure $\int_{I_n} |\hat{G}_k(\mathbf{x}) - \hat{f}_k(\mathbf{x})| \, d\mathbf{x} < \epsilon/2$ for all $k$.

Next, we just need to show that, for any indicator function $\mathbf{q}(\mathbf{x})$ and $f_k(\mathbf{x}) = \lambda \, q_k(\mathbf{x})$ as above, the bound $\int_{I_n} |\hat{f}_k(\mathbf{x}) - q_k(\mathbf{x})| \, d\mathbf{x} \le \epsilon/2$ holds for any $k$ when $\lambda$ is chosen large enough. In order to prove this, we first have:

$$\hat{f}_k(\mathbf{x}) = \frac{e^{\lambda q_k(\mathbf{x})}}{\sum_{l=1}^{K} e^{\lambda q_l(\mathbf{x})}}.$$

Based on the properties of the indicator function $\mathbf{q}(\mathbf{x})$, exactly one component of $\mathbf{q}(\mathbf{x})$ equals $1$ and all others equal $0$ at every $\mathbf{x}$, so we further derive:

$$\Big| \hat{f}_k(\mathbf{x}) - q_k(\mathbf{x}) \Big| \le \frac{K-1}{e^{\lambda} + K - 1} \le (K-1)\, e^{-\lambda} \qquad \text{for all } \mathbf{x} \in I_n \text{ and all } k,$$

where the last step at above follows from the inequality $e^{\lambda} + K - 1 \ge e^{\lambda}$ for $K \ge 1$.

Finally, for any $\epsilon > 0$, we may choose $\lambda \ge \ln\!\big(2(K-1)/\epsilon\big)$, so that (since the volume of $I_n$ is $1$)

$$\int_{I_n} \Big| \hat{f}_k(\mathbf{x}) - q_k(\mathbf{x}) \Big| \, d\mathbf{x} \le (K-1)\, e^{-\lambda} \le \epsilon/2,$$

and hence $\int_{I_n} |\hat{G}_k(\mathbf{x}) - q_k(\mathbf{x})| \, d\mathbf{x} < \epsilon/2 + \epsilon/2 = \epsilon$ for all $k$. Therefore, the proof is completed.
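The key estimate in this proof, namely that the softmax of the scaled indicator $\lambda\,\mathbf{q}(\mathbf{x})$ is within $(K-1)e^{-\lambda}$ of $\mathbf{q}(\mathbf{x})$ in every component, can be checked numerically; $K$ and the values of $\lambda$ below are arbitrary example choices.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

K = 4
q = np.eye(K)[1]                              # an indicator value with the '1' in component 2
for lam in [1.0, 5.0, 10.0]:
    s = softmax(lam * q)
    err = np.max(np.abs(s - q))               # worst component-wise error
    bound = (K - 1) * np.exp(-lam)            # the bound used in the proof
    print(lam, err, bound, err <= bound)      # the error shrinks like exp(-lambda)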

Obviously, Theorem 2 indicates that a sufficiently large single-hidden-layer feed-forward neural network with a softmax layer can approximate well any indicator function whose component functions belong to $L^1(I_n)$.

5 Conclusions

In this work, we have studied the approximation capabilities of the popular ReLU activation function and softmax output layers in neural networks for pattern classification. We have first shown in Theorem 1 that a large enough neural network using the ReLU activation function is a universal approximator in $L^1(I_n)$. Furthermore, we have extended this result to multi-output neural networks and have proved that they can approximate any vector-valued target function whose components are in $L^1(I_n)$. Next, we have proved in Theorem 2 that a sufficiently large neural network with a softmax output layer can approximate well any indicator target function in $L^1(I_n)$, which is equivalent to mutually-exclusive class labels in any realistic multiple-class pattern classification task. Lastly, we also want to note that the result in Theorem 2 is applicable to many other activation functions considered in Cybenko (1989); Hornik (1991), since no specific property of the ReLU activation function is used in its proof.

Footnotes

  1. This follows from the explicit form of the softmax Jacobian, $\partial \hat{s}_k / \partial z_l = \hat{s}_k (\delta_{kl} - \hat{s}_l)$, whose entries for each $k$ sum in absolute value to $2\hat{s}_k(1-\hat{s}_k) \le 1/2$; the remaining details are straightforward and are omitted.

References

  1. R. Arora, A. Basu, P. Mianjy, and A. Mukherjee (2018). Understanding deep neural networks with rectified linear units. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §2.
  2. J. S. Bridle (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, F. F. Soulié and J. Hérault (Eds.), Springer, Berlin, Heidelberg, pp. 227–236. Cited by: §1.
  3. J. S. Bridle (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems (NIPS), Vol. 2, pp. 211–217. Cited by: §1.
  4. G. Cybenko (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS) 2 (4), pp. 303–314. Cited by: §1, §2, §4, §5.
  5. X. Glorot, A. Bordes, and Y. Bengio (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 315–323. Cited by: §1.
  6. K. Hornik (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257. Cited by: §1, §2, §5.
  7. K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153. Cited by: §1.
  8. M. Leshno, V. Ya. Lin, A. Pinkus, and S. Schocken (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6 (6), pp. 861–867. Cited by: §2.
  9. V. Nair and G. E. Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814. Cited by: §1.