On Approximation Capabilities of ReLU Activation and Softmax Output Layer in Neural Networks
Abstract
In this paper, we have extended the well-established universal approximator theory to neural networks that use the unbounded ReLU activation function and a nonlinear softmax output layer. We have proved that a sufficiently large neural network using the ReLU activation function can approximate any function in $L^1(I_n)$ up to any arbitrary precision. Moreover, our theoretical results have shown that a large enough neural network using a nonlinear softmax output layer can also approximate any indicator function in $L^1(I_n)$, which is equivalent to mutually-exclusive class labels in any realistic multiple-class pattern classification problem. To the best of our knowledge, this work gives the first theoretical justification for using softmax output layers in neural networks for pattern classification.
1 Introduction
In recent years, neural networks have been revived in the field of machine learning as a dominant method for supervised learning. In practice, large-scale neural networks are trained to classify their inputs based on a huge amount of training samples consisting of input-output pairs. Many different types of network structures can be used to specify the architecture of a neural network, such as fully-connected neural networks, recurrent neural networks, and convolutional neural networks. Fundamentally speaking, all neural networks attempt to approximate an unknown target function from its input space to its output space. Recent experimental results in many application domains have empirically demonstrated the superb power of various types of neural networks in terms of approximating an unknown function. On the other hand, the universal approximator theory was well established for neural networks several decades ago Cybenko (1989); Hornik (1991). However, these theoretical works were proved based on some early neural network configurations that were popular at the time, such as sigmoidal activation functions and fully-connected linear output layers. Since then, many new configurations have been adopted for neural networks. For example, a new type of nonlinear activation function, called the rectified linear unit (ReLU) activation, was originally proposed in Jarrett et al. (2009); Nair and Hinton (2010); Glorot et al. (2011), and it was very quickly and widely adopted in neural networks due to its superior performance in training deep neural networks. When the ReLU activation was initially proposed, it came as quite a surprise to many people because the ReLU function is actually unbounded. In addition, as we know, it is common practice to use a nonlinear softmax output layer when neural networks are used for a pattern classification task Bridle (1990b, a).
The softmax output layer significantly changes the range of the outputs of a neural network due to its nonlinearity. Obviously, the theoretical results in Cybenko (1989); Hornik (1991) are not directly applicable to the unbounded ReLU activation function or the nonlinear softmax output layer in neural networks.
In this paper, we study the approximation power of neural networks that use the ReLU activation function and a softmax output layer, and we extend the universal approximator theory in Cybenko (1989); Hornik (1991) to address these two cases. Similar to Cybenko (1989); Hornik (1991), our theoretical results are established based on a simple feedforward neural network using a single fully-connected hidden layer. Of course, our results can be further extended to more complicated structures, since many complicated structures can be viewed as special cases of fully-connected layers. The major results from this work include: i) A sufficiently large neural network using the ReLU activation function can approximate any function in $L^1(I_n)$ up to any arbitrary precision; ii) A sufficiently large neural network using the softmax output layer can approximate well any indicator function in $L^1(I_n)$, which is equivalent to mutually-exclusive class labels in a realistic multiple-class pattern classification problem.
2 Related Works
Under the notion of universal approximators, the approximation capabilities of some types of neural networks have been studied by other authors. For example, K. Hornik in Hornik (1991) has shown the power of fully-connected neural networks using a bounded and nonconstant activation function to approximate any function in $L^p(I_n)$, where $1 \le p < \infty$ and $I_n$ stands for the unit hypercube in an $n$-dimensional space, and $L^p(I_n)$ denotes the space of all functions $f$ defined on $I_n$ such that $\int_{I_n} |f(\mathbf{x})|^p \, d\mathbf{x} < \infty$. He has proved that if the activation function is a bounded and nonconstant function, a large enough single-hidden-layer feedforward neural network can approximate any function in $L^p(I_n)$. He has also proved that the same network using a continuous, bounded and nonconstant activation function can approximate any function in $C(I_n)$, where $C(I_n)$ denotes the space of all continuous functions defined on the unit hypercube $I_n$. Moreover, G. Cybenko in Cybenko (1989) has proved that any target function in $C(I_n)$ can be approximated by a single-hidden-layer feedforward neural network using any continuous sigmoidal activation function. He has also demonstrated (in Theorem 4) that the same network with a bounded measurable sigmoidal activation function can approximate any function in $L^1(I_n)$, where a sigmoidal activation function is defined as a monotonically-increasing function whose limit is 1 when approaching $+\infty$ and 0 at $-\infty$. In this paper, we will extend this theorem to neural networks using the unbounded ReLU activation function.
There are also other works related to the approximation power of neural networks using the ReLU activation function, but we do not use them in our proofs. For example, in Leshno et al. (1993), it has been proven that an activation function will ensure a neural network to be a universal approximator if and only if this function is not a polynomial almost everywhere. The results of this paper could be applied to the ReLU activation function, but their target function is a continuous function, i.e. in a $C(\mathbb{R}^n)$ space. In Arora et al. (2018), the authors have shown that every function in $L^1(\mathbb{R}^n)$ can be approximated in the $L^1$ norm by a ReLU deep neural network using at most $\lceil \log_2(n+1) \rceil$ hidden layers. Moreover, for $n = 1$, any such function can be arbitrarily well approximated by a 2-layer DNN, with some tight approximation bounds on the size of neural networks.
On the other hand, we have not found much theoretical work on the approximation capability of a softmax neural network. A softmax output layer is simply taken for granted in pattern classification problems.
3 Notations
We represent the input of a neural network as $\mathbf{x} \in I_n$, where $I_n$ denotes the $n$-dimensional unit hypercube, i.e. $I_n = [0,1]^n$. As in Figure 1, the output of a single-hidden-layer feedforward neural network using a linear output layer can be represented as a superposition of the activation functions $\sigma(\cdot)$:
$$G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \, \sigma(\mathbf{w}_j^\top \mathbf{x} + b_j)$$
where $N$ is the number of hidden units, $\mathbf{w}_j \in \mathbb{R}^n$ denotes the input weights of the $j$-th hidden unit, $\alpha_j \in \mathbb{R}$ stands for the output weight of the $j$-th hidden unit, and $b_j \in \mathbb{R}$ is the bias of the $j$-th hidden unit.
If a neural network is used for a multiple-class pattern classification problem with $K$ classes, as shown in Figure 2, we use the notation $G_k(\mathbf{x})$ for $k = 1, 2, \dots, K$ to represent the $k$-th output of the neural network prior to the softmax layer. Similarly, we have $G_k(\mathbf{x}) = \sum_{j=1}^{N} \alpha_{kj} \, \sigma(\mathbf{w}_{kj}^\top \mathbf{x} + b_{kj})$. Also, we define a vector output $\mathbf{G}(\mathbf{x}) = \big[ G_1(\mathbf{x}), G_2(\mathbf{x}), \dots, G_K(\mathbf{x}) \big]$, where each $G_k(\mathbf{x})$ is one of the $K$ outputs of the network before adding a softmax layer. Furthermore, after adding a softmax layer to the above outputs, the $k$-th output of the network after the softmax layer may be represented as:
$$S_k(\mathbf{x}) = \frac{e^{G_k(\mathbf{x})}}{\sum_{l=1}^{K} e^{G_l(\mathbf{x})}}, \qquad k = 1, 2, \dots, K.$$
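To make the notation concrete, here is a minimal NumPy sketch (the network sizes and weight values below are arbitrary placeholders, not taken from the paper) that computes the pre-softmax outputs $G_k(\mathbf{x})$, using ReLU as the activation for concreteness, and the softmax outputs $S_k(\mathbf{x})$:

```python
import numpy as np

def hidden_output(x, W, b, alpha):
    """Single-hidden-layer output G(x) = sum_j alpha_j * sigma(w_j . x + b_j),
    here with the ReLU activation sigma(t) = max(t, 0)."""
    return alpha @ np.maximum(W @ x + b, 0.0)

def softmax_outputs(x, Ws, bs, alphas):
    """Softmax outputs S_k(x) = exp(G_k(x)) / sum_l exp(G_l(x))."""
    G = np.array([hidden_output(x, W, b, a) for W, b, a in zip(Ws, bs, alphas)])
    e = np.exp(G - G.max())   # subtract the max for numerical stability
    return e / e.sum()
```

By construction the softmax outputs are positive and sum to one, which is why they are commonly read as class posteriors.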
4 Main Results
If we use the ReLU activation function for all hidden nodes of the neural network in Figure 1, we can prove that it still maintains the universal approximation property as long as the number of hidden nodes is large enough.
Theorem 1 (Universal Approximation with ReLU)
When ReLU is used as the activation function, the output of a single-hidden-layer neural network
$$G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \, \mathrm{ReLU}(\mathbf{w}_j^\top \mathbf{x} + b_j) \qquad (1)$$
is dense in $L^1(I_n)$ if $N$ is large enough. In other words, given any function $f \in L^1(I_n)$ and any arbitrarily small positive number $\epsilon > 0$, there exists a sum of the above form $G(\mathbf{x})$, for which:
$$\| G - f \|_{L^1} = \int_{I_n} \big| G(\mathbf{x}) - f(\mathbf{x}) \big| \, d\mathbf{x} < \epsilon. \qquad (2)$$
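The $L^1$ distance in eq.(2) can be estimated numerically; the sketch below is an illustration only, with a hypothetical midpoint-rule estimator for the one-dimensional case $n = 1$:

```python
import numpy as np

def l1_distance(f, g, n_grid=1000):
    """Estimate the L^1 distance  int_0^1 |f(x) - g(x)| dx  on I_1 = [0, 1]
    by a composite midpoint rule; I_1 has unit volume, so the mean suffices."""
    x = (np.arange(n_grid) + 0.5) / n_grid   # midpoints of a uniform grid
    return np.abs(f(x) - g(x)).mean()
```

For instance, $\int_0^1 |x - x^2|\,dx = 1/6$, while the ReLU sum $\mathrm{ReLU}(x - 0.5) + \mathrm{ReLU}(-x + 0.5)$ reproduces $|x - 0.5|$ exactly on $[0,1]$, giving an $L^1$ distance of zero.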
Proof:
First of all, we construct a special activation function $\sigma_0(\cdot)$ using the ReLU function as:
$$\sigma_0(x) = \mathrm{ReLU}(x + 1) - \mathrm{ReLU}(x).$$
Obviously, $\sigma_0(x)$ is a piecewise linear function given as follows:
$$\sigma_0(x) = \begin{cases} 0 & \text{if } x \le -1 \\ x + 1 & \text{if } -1 < x < 0 \\ 1 & \text{if } x \ge 0 \end{cases} \qquad (3)$$
The above function is also plotted in Figure 3.
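As an independent sanity check, assuming the construction $\sigma_0(x) = \mathrm{ReLU}(x+1) - \mathrm{ReLU}(x)$, the sketch below compares the two-ReLU form against its piecewise-linear description in eq.(3):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigma0(x):
    """Bounded sigmoidal function built from two ReLUs."""
    return relu(x + 1.0) - relu(x)

def sigma0_piecewise(x):
    """Piecewise form of eq.(3): 0 for x <= -1, x + 1 on (-1, 0), 1 for x >= 0."""
    return np.where(x <= -1.0, 0.0, np.where(x >= 0.0, 1.0, x + 1.0))
```

The two forms agree everywhere, and $\sigma_0$ has the limits 0 at $-\infty$ and 1 at $+\infty$ required of a sigmoidal function.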
Since $\sigma_0(x)$ is a bounded measurable sigmoidal function, based on Theorem 4 in Cybenko (1989), the sum
$$\bar{G}(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \, \sigma_0(\mathbf{w}_j^\top \mathbf{x} + b_j)$$
is dense in $L^1(I_n)$. In other words, given any function $f \in L^1(I_n)$ and $\epsilon > 0$, there exists a sum $\bar{G}(\mathbf{x})$ as above, for which:
$$\| \bar{G} - f \|_{L^1} < \epsilon.$$
Substituting $\sigma_0(x) = \mathrm{ReLU}(x + 1) - \mathrm{ReLU}(x)$ into the above equation, we have:
$$\bar{G}(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \Big[ \mathrm{ReLU}(\mathbf{w}_j^\top \mathbf{x} + b_j + 1) - \mathrm{ReLU}(\mathbf{w}_j^\top \mathbf{x} + b_j) \Big].$$
We can further rearrange it into a sum of $2N$ ReLU terms:
$$\bar{G}(\mathbf{x}) = \sum_{j=1}^{2N} \alpha'_j \, \mathrm{ReLU}(\mathbf{w}_j'^\top \mathbf{x} + b'_j) \qquad (4)$$
where:
$$\alpha'_j = \alpha_j, \quad \mathbf{w}'_j = \mathbf{w}_j, \quad b'_j = b_j + 1 \qquad (1 \le j \le N)$$
$$\alpha'_j = -\alpha_{j-N}, \quad \mathbf{w}'_j = \mathbf{w}_{j-N}, \quad b'_j = b_{j-N} \qquad (N + 1 \le j \le 2N).$$
Given any function $f \in L^1(I_n)$, we have found a sum $\bar{G}(\mathbf{x})$ of $2N$ ReLU terms as shown in eq.(4) that satisfies $\| \bar{G} - f \|_{L^1} < \epsilon$ for any small $\epsilon > 0$; therefore, the proof is completed.
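The rearrangement into eq.(4) doubles the number of hidden units but leaves the network output unchanged; the sketch below, with arbitrary random weights and purely as a sanity check, verifies the identity numerically:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def g_sigma0(x, W, b, alpha):
    """N-term network using sigma0(t) = ReLU(t + 1) - ReLU(t) as activation."""
    t = W @ x + b
    return alpha @ (relu(t + 1.0) - relu(t))

def g_relu(x, W, b, alpha):
    """Equivalent 2N-term ReLU network of eq.(4): duplicate each hidden unit,
    shift the first copy's bias by +1, and negate the second copy's output weight."""
    W2 = np.vstack([W, W])
    b2 = np.concatenate([b + 1.0, b])
    a2 = np.concatenate([alpha, -alpha])
    return a2 @ relu(W2 @ x + b2)
```

Both functions return identical values for every input, so the $\sigma_0$-network and the $2N$-unit ReLU network are the same function.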
In the following, let us extend Theorem 1 to a ReLU neural network with multiple outputs without using any softmax output layer.
Corollary 1
Given any vector-valued function $\mathbf{f}(\mathbf{x}) = \big[ f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_K(\mathbf{x}) \big]$, where each component function $f_k \in L^1(I_n)$, there exists a single-hidden-layer neural network yielding a vector output $\mathbf{G}(\mathbf{x}) = \big[ G_1(\mathbf{x}), G_2(\mathbf{x}), \dots, G_K(\mathbf{x}) \big]$, each of which is a sum of the form:
$$G_k(\mathbf{x}) = \sum_{j=1}^{N} \alpha_{kj} \, \mathrm{ReLU}(\mathbf{w}_{kj}^\top \mathbf{x} + b_{kj}) \qquad (5)$$
to satisfy
$$\| G_k - f_k \|_{L^1} < \epsilon \qquad (\forall k = 1, 2, \dots, K)$$
for any arbitrarily small positive number $\epsilon > 0$, which is equivalent to
$$\| \mathbf{G} - \mathbf{f} \|_{L^1} \triangleq \sum_{k=1}^{K} \| G_k - f_k \|_{L^1} < K \epsilon.$$
Proof:
We use mathematical induction on $K$ to prove this corollary.

For $K = 1$, the problem reduces to Theorem 1. Next, we assume that the corollary holds for $K - 1$; we will prove that it also holds for $K$. Based on our assumption, given $f_1, f_2, \dots, f_{K-1} \in L^1(I_n)$ and $\epsilon > 0$, there exist some sums, i.e. $G_1(\mathbf{x}), G_2(\mathbf{x}), \dots, G_{K-1}(\mathbf{x})$, as eq.(5) for which:
$$\| G_k - f_k \|_{L^1} < \epsilon \qquad (k = 1, 2, \dots, K - 1).$$
Now we need to approximate $f_K$. Based on Theorem 1, there exists a sum $G_K(\mathbf{x})$ of the form in eq.(1) for which it satisfies, for any given $f_K \in L^1(I_n)$ and $\epsilon > 0$:
$$\| G_K - f_K \|_{L^1} < \epsilon.$$
Now we can construct a new neural network with $K$ outputs by pooling all hidden units of the above networks into a single hidden layer and connecting each output $G_k(\mathbf{x})$ only to its own hidden units, with all cross-connection weights set to zero. Each output of the new network therefore reproduces the corresponding sum exactly, and we can show:
$$\| G_k - f_k \|_{L^1} < \epsilon \qquad (\forall k = 1, 2, \dots, K).$$
Therefore, the proof is completed.
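The construction used in the induction step, pooling the hidden units of two networks and zeroing all cross-connection weights, can be sketched as follows (a NumPy illustration with hypothetical helper names, not code from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def stack_networks(W1, b1, A1, W2, b2, a2):
    """Merge a (K-1)-output network (hidden weights W1, biases b1, output
    weights A1) with a 1-output network (W2, b2, a2) into one K-output
    network sharing a single hidden layer.  Cross-connection weights are
    zero, so every original output is reproduced exactly."""
    W = np.vstack([W1, W2])                 # pool all hidden units
    b = np.concatenate([b1, b2])
    k_minus_1, n1 = A1.shape
    n2 = len(b2)
    A = np.zeros((k_minus_1 + 1, n1 + n2))
    A[:k_minus_1, :n1] = A1                 # old outputs ignore new units
    A[k_minus_1, n1:] = a2                  # new output ignores old units
    return W, b, A

def forward(x, W, b, A):
    return A @ relu(W @ x + b)
```

Because the cross-connection weights are zero, each output of the stacked network equals the output of the network it came from, which is exactly what the induction step requires.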
In the following, we will investigate how the softmax output layer may affect the approximation capabilities of neural networks.
Lemma 1
Given any vector-valued function $\mathbf{f}(\mathbf{x}) = \big[ f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_K(\mathbf{x}) \big]$, where each component function $f_k \in L^1(I_n)$, there exists a vector-valued function $\mathbf{G}(\mathbf{x}) = \big[ G_1(\mathbf{x}), G_2(\mathbf{x}), \dots, G_K(\mathbf{x}) \big]$, each $G_k(\mathbf{x})$ being a sum of eq.(5), so that their softmax outputs satisfy:
$$\big\| \mathrm{softmax}_k(\mathbf{G}) - \mathrm{softmax}_k(\mathbf{f}) \big\|_{L^1} < \epsilon \qquad (\forall k = 1, 2, \dots, K) \qquad (6)$$
for any small positive value $\epsilon > 0$, where
$$\mathrm{softmax}_k(\mathbf{f}) = \frac{e^{f_k(\mathbf{x})}}{\sum_{l=1}^{K} e^{f_l(\mathbf{x})}}.$$
Proof:
First, the softmax function is a continuous function everywhere in its domain.
Second, according to Corollary 1, for any $\mathbf{f}(\mathbf{x})$ and $\epsilon > 0$, there exists a function $\mathbf{G}(\mathbf{x})$ such that:
$$\| G_k - f_k \|_{L^1} < \epsilon \qquad (\forall k = 1, 2, \dots, K).$$
Putting these two results together completes the proof.
Definition 1 (Indicator Function)
We define a vector-valued function $\mathbf{h}(\mathbf{x}) = \big[ h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_K(\mathbf{x}) \big]$ as an indicator function if it simultaneously satisfies the following two conditions:

$$h_k(\mathbf{x}) \in \{0, 1\} \qquad (\forall k = 1, 2, \dots, K)$$

and $\forall \mathbf{x} \in I_n$:

$$\sum_{k=1}^{K} h_k(\mathbf{x}) = 1.$$
In other words, for any input $\mathbf{x}$, an indicator function will yield one and only one '1' among its components and '0' for all remaining components. Obviously, an indicator function can represent mutually-exclusive class labels in a multiple-class pattern classification problem. Conversely, the class labels from any such pattern classification problem may be represented by an indicator function.
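In machine-learning practice, an indicator function corresponds to one-hot class labels; a minimal sketch (the helper names are illustrative, not from the paper):

```python
import numpy as np

def indicator_from_label(label, K):
    """One-hot encoding: the indicator-function value for class `label` out of K."""
    h = np.zeros(K)
    h[label] = 1.0
    return h

def is_indicator(h):
    """Check Definition 1: every component is 0 or 1, and the components sum to 1."""
    return all(v in (0.0, 1.0) for v in h) and h.sum() == 1.0
```

Any one-hot label vector passes both conditions of Definition 1, while soft label vectors (e.g. `[0.5, 0.5]`) do not.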
Theorem 2 (Approximation Capability of Softmax)
Given any indicator function $\mathbf{h}(\mathbf{x}) = \big[ h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_K(\mathbf{x}) \big]$, assume each component function $h_k \in L^1(I_n)$ ($k = 1, 2, \dots, K$); there exist some sums as eq.(5), i.e. $G_1(\mathbf{x}), G_2(\mathbf{x}), \dots, G_K(\mathbf{x})$, such that their corresponding softmax outputs satisfy:
$$\sum_{k=1}^{K} \| S_k - h_k \|_{L^1} < \epsilon \qquad (7)$$
for any small $\epsilon > 0$, where $S_k(\mathbf{x}) = \mathrm{softmax}_k(\mathbf{G})$.
Proof:
First, we choose a large constant $t > \ln \frac{4(K-1)}{\epsilon}$ and define a new vector-valued function $\bar{\mathbf{h}}(\mathbf{x}) = \big[ \bar{h}_1(\mathbf{x}), \dots, \bar{h}_K(\mathbf{x}) \big]$, where each component function is defined as $\bar{h}_k(\mathbf{x}) = t \, h_k(\mathbf{x})$ for all $k = 1, 2, \dots, K$. Note that $\bar{h}_k \in L^1(I_n)$ due to the assumption $h_k \in L^1(I_n)$. According to the triangular inequality, for any $k$, we have:
$$\| S_k - h_k \|_{L^1} \le \big\| S_k - \mathrm{softmax}_k(\bar{\mathbf{h}}) \big\|_{L^1} + \big\| \mathrm{softmax}_k(\bar{\mathbf{h}}) - h_k \big\|_{L^1}.$$
Second, based on Lemma 1, for our given $\epsilon > 0$, there exists a function $\mathbf{G}(\mathbf{x})$, each component of which is in the form of eq.(5), to ensure
$$\sum_{k=1}^{K} \big\| S_k - \mathrm{softmax}_k(\bar{\mathbf{h}}) \big\|_{L^1} < \frac{\epsilon}{2}.$$
Next, we just need to show that for any indicator function $\mathbf{h}(\mathbf{x})$, where each $h_k \in L^1(I_n)$, we have
$$\sum_{k=1}^{K} \big\| \mathrm{softmax}_k(\bar{\mathbf{h}}) - h_k \big\|_{L^1} < \frac{\epsilon}{2}.$$
In order to prove this, we first note that, for every $\mathbf{x} \in I_n$, exactly one component of $\mathbf{h}(\mathbf{x})$ equals 1 and all others equal 0, so that $\bar{\mathbf{h}}(\mathbf{x})$ has one component equal to $t$ and all others equal to 0. Based on the properties of the indicator function $\mathbf{h}(\mathbf{x})$, we further derive:
$$\sum_{k=1}^{K} \big| \mathrm{softmax}_k(\bar{\mathbf{h}}) - h_k \big| = \Big( 1 - \frac{e^t}{e^t + K - 1} \Big) + \frac{K - 1}{e^t + K - 1} = \frac{2(K-1)}{e^t + K - 1} \le 2(K-1) e^{-t}$$
where the last step at above follows the inequality $\frac{1}{e^t + K - 1} \le e^{-t}$ for $K \ge 1$. Finally, since $I_n$ has unit volume, for any $t > \ln \frac{4(K-1)}{\epsilon}$, we have
$$\sum_{k=1}^{K} \big\| \mathrm{softmax}_k(\bar{\mathbf{h}}) - h_k \big\|_{L^1} \le 2(K-1) e^{-t} < \frac{\epsilon}{2}.$$
Combining the above results yields $\sum_{k=1}^{K} \| S_k - h_k \|_{L^1} < \epsilon$. Therefore, the proof is completed.
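The heart of the proof is that the softmax of a scaled indicator $t \, \mathbf{h}$ approaches $\mathbf{h}$ at the rate $2(K-1)e^{-t}$; a small numerical check (an illustration, not part of the proof):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def softmax_gap(t, K):
    """sum_k |softmax_k(t * h) - h_k| for a one-hot h with K classes; by the
    derivation above this equals 2(K-1) / (e^t + K - 1) <= 2(K-1) e^{-t}."""
    h = np.zeros(K)
    h[0] = 1.0
    return np.abs(softmax(t * h) - h).sum()
```

The gap shrinks exponentially as $t$ grows and always stays below the bound $2(K-1)e^{-t}$ used in the proof.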
Obviously, Theorem 2 indicates that a sufficiently-large single-hidden-layer feedforward neural network with a softmax output layer can approximate well any indicator function $\mathbf{h}(\mathbf{x})$ whose components belong to $L^1(I_n)$.
5 Conclusions
In this work, we have studied the approximation capabilities of the popular ReLU activation function and softmax output layers in neural networks for pattern classification. We have first shown in Theorem 1 that a large enough neural network using the ReLU activation function is a universal approximator in $L^1(I_n)$. Furthermore, we have extended this result to multi-output neural networks and have proved that they can approximate any vector-valued target function whose components are in $L^1(I_n)$. Next, we have proved in Theorem 2 that a sufficiently large neural network with a softmax output layer can approximate well any indicator target function in $L^1(I_n)$, which is equivalent to mutually-exclusive class labels in any realistic multiple-class pattern classification task. Finally, we also note that the result in Theorem 2 is applicable to many other activation functions considered in Cybenko (1989); Hornik (1991), since we do not use any specific property of the ReLU activation function in its proof.
Footnotes
 According to the continuity of the exponential function and the positivity of the denominator of the softmax function, the proof of this claim is straightforward and is omitted.
References
 R. Arora, A. Basu, P. Mianjy, and A. Mukherjee (2018) Understanding deep neural networks with rectified linear units. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §2.
 J. S. Bridle (1990) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, F. F. Soulié and J. Hérault (Eds.), Berlin, Heidelberg, pp. 227–236. Cited by: §1.
 J. S. Bridle (1990) Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems (NIPS), Vol. 2, pp. 211–217. Cited by: §1.
 G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS) 2 (4), pp. 303–314. Cited by: §1, §2, §4, §5.
 X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 315–323. Cited by: §1.
 K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257. Cited by: §1, §2, §5.
 K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009) What is the best multi-stage architecture for object recognition? In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153. Cited by: §1.
 M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6 (6), pp. 861–867. Cited by: §2.
 V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814. Cited by: §1.