EIS – a family of activation functions combining Exponential, ISRU, and Softplus
Abstract
Activation functions play a pivotal role in learning functions with neural networks: the nonlinearity of the learned function is achieved by repeated application of the activation function. Over the years, numerous activation functions have been proposed to improve accuracy on several tasks. Basic functions like ReLU, Exponential, Tanh, or Softplus have been favorites in the deep learning community because of their simplicity. In recent years, several novel activation functions arising from these basic functions have been proposed, which have improved accuracy on some challenging datasets with complicated models. We propose a family of activation functions with five hyperparameters, namely EIS, defined as

F(x; α, β, γ, δ, θ) = x (ln(1 + e^x))^α / (√(βx² + γ) + δe^(-θx)).
We show examples of activation functions from the EIS family that outperform widely used activation functions on some well-known datasets and models. For example, one member of the family beats ReLU by 0.89% with DenseNet-169 and by 0.24% with Inception V3 on the CIFAR100 dataset, and by 1.13% with Inception V3, 0.13% with DenseNet-169, and 0.94% with SimpleNet on the CIFAR10 dataset. Another member beats ReLU by 1.68% with DenseNet-169 and by 0.30% with Inception V3 on CIFAR100, and by 1.0% with Inception V3, 0.15% with DenseNet-169, and 1.13% with SimpleNet on CIFAR10.
Keywords: Activation Function, Neural Networks, Deep Learning
1 Introduction
Multilayered neural networks are widely used to learn nonlinear functions from complex data. An activation function is an integral part of a neural network and provides the essential nonlinearity. A universal activation function may not be suitable for all datasets, and it is important to select an appropriate activation function for the task at hand. Nevertheless, a piecewise activation function, the Rectified Linear Unit (ReLU) [7, 13, 18], defined as f(x) = max(0, x), is widely used due to its simplicity, convergence speed, and lower training time.
Despite its simplicity and better convergence rate than Sigmoid and Tanh, ReLU has drawbacks such as nonzero mean, negative missing, unbounded output, and dying ReLU (see [25]), to name a few. Various activation functions have been proposed to overcome the drawbacks of ReLU and improve performance over it. Some of the variants of ReLU are Leaky ReLU [17], Randomized Leaky Rectified Linear Units (RReLU) [24], Exponential Linear Unit (ELU) [4], Inverse Square Root Linear Units (ISRLUs) [3], Parametric Rectified Linear Unit (PReLU) [10], and PTELU [6]. None of the above-mentioned activation functions, however, has come close to ReLU in terms of usage. Most recently, Swish [21] has managed to gain attention from the deep learning community. Swish is a one-parameter family of activation functions defined as f(x) = x σ(βx), where σ(x) = 1/(1 + e^(-x)) is the sigmoid function. Some other hyperparametrized families of activation functions include SoftRootSign [25] and TanhSoft [2]; in fact, many functions from the TanhSoft family have managed to outperform both ReLU and Swish.
The most prominent drawback of ReLU is the dying-ReLU problem, that is, ReLU outputs zero for every negative input. Many novel activation functions overcome this by defining a piecewise function that resembles ReLU for positive inputs and takes nonzero values for negative inputs. Swish differs from such piecewise activation functions in that it is a product of two smooth functions, and it manages to remain close to ReLU for positive inputs while taking small negative values for negative inputs. Recently, a family of activation functions with four hyperparameters, TanhSoft [2], has been proposed; many functions from TanhSoft show a similar closeness to ReLU as Swish and perform better than both ReLU and Swish.
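This contrast between ReLU and a smooth alternative can be made concrete with their gradients. The sketch below (illustrative, not from the paper) uses the standard derivative of Swish, σ(x) + x σ(x)(1 − σ(x)):

```python
import math

def relu_grad(x: float) -> float:
    # Derivative of max(0, x): exactly zero for every negative input,
    # which is what makes a "dead" neuron stop learning.
    return 1.0 if x > 0.0 else 0.0

def swish_grad(x: float) -> float:
    # Derivative of x * sigmoid(x): sigma(x) + x * sigma(x) * (1 - sigma(x)).
    s = 1.0 / (1.0 + math.exp(-x))
    return s + x * s * (1.0 - s)

print(relu_grad(-3.0))   # 0.0
print(swish_grad(-3.0))  # small but nonzero (negative) gradient
```

A neuron stuck in the negative regime receives no gradient at all under ReLU, but still a small gradient under Swish, which is one reason smooth non-monotonic functions can keep training.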
In this work, we have proposed a family of activation functions with five hyperparameters, known as EIS and defined as

F(x; α, β, γ, δ, θ) = x (ln(1 + e^x))^α / (√(βx² + γ) + δe^(-θx)),   (1)

where α, β, γ, δ, and θ are hyperparameters.
In the next sections, we describe this family, extract three subfamilies from (1), and show that for particular values of the hyperparameters they outperform widely used activation functions, including ReLU and Swish. To validate the performance of the subfamilies, we report a comprehensive search over different values of the hyperparameters.
2 Related Work
Several activation functions have been proposed as substitutes for ReLU that can overcome its drawbacks. Because of the dying-ReLU problem, a large fraction of neurons can become inactive, always producing zero output. Another issue activation functions face is that, during the flow of gradients through the network, the gradient can become zero or diverge to infinity, commonly known as the vanishing and exploding gradient problems. Leaky ReLU [17] was introduced with a small negative linear component to solve the dying-ReLU problem and has shown improvement over ReLU. A hyperparametric component is incorporated in PReLU [10] to find the best slope for the negative linear component. Many other improvements have been proposed over the years, such as Randomized Leaky Rectified Linear Units (RReLU) [24], Exponential Linear Unit (ELU) [4], and Inverse Square Root Linear Units (ISRLUs) [3], to name a few. Swish [21] was proposed by a team of researchers from Google Brain using exhaustive search [19] and reinforcement learning techniques [1].
3 EIS Activation Function Family
In this work, we propose a hyperparametric family of functions, defined as

F(x; α, β, γ, δ, θ) = x (ln(1 + e^x))^α / (√(βx² + γ) + δe^(-θx)).   (2)
Since the family in (2) is built from three basic activation functions (Exponential, ISRU, and Softplus), we name it EIS. The family makes sense for nonnegative hyperparameters, and γ and δ cannot simultaneously be equal to zero, as this would make the function undefined at the origin. For experimental purposes, we work only with small ranges of the hyperparameters. The relationship between the hyperparameters plays an important role in the EIS family and controls the slope of the function on both the negative and positive axes. The parameter α switches the Softplus factor in the numerator on and off. For the square root, we consider the positive branch. Note that particular hyperparameter choices formally recover the constant function 1 as well as the identity function x.
For specific values of the hyperparameters, the EIS family recovers some known activation functions, including Softplus [8], Swish [21], and ISRU [3].
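These special cases can be checked numerically. The sketch below assumes a reconstructed functional form for the family, F(x; α, β, γ, δ, θ) = x (ln(1 + e^x))^α / (√(βx² + γ) + δe^(-θx)); treat this form and the parameter settings as assumptions rather than the paper's exact definition:

```python
import math

def eis(x, alpha, beta, gamma, delta, theta):
    # Reconstructed EIS form (assumption): Softplus^alpha in the numerator,
    # an ISRU-style root plus a decaying exponential in the denominator.
    num = x * math.log1p(math.exp(x)) ** alpha
    den = math.sqrt(beta * x * x + gamma) + delta * math.exp(-theta * x)
    return num / den

def swish(x, b=1.0):
    return x / (1.0 + math.exp(-b * x))

def isru(x, b=1.0):
    return x / math.sqrt(1.0 + b * x * x)

x = 0.7
# alpha = beta = 0, gamma = delta = 1: denominator 1 + e^(-theta*x), i.e. Swish.
assert abs(eis(x, 0, 0, 1, 1, 1.0) - swish(x)) < 1e-12
# alpha = delta = 0, gamma = 1: x / sqrt(1 + beta*x^2), i.e. ISRU.
assert abs(eis(x, 0, 2.0, 1, 0, 1.0) - isru(x, 2.0)) < 1e-12
# alpha = beta = delta = 0, gamma = 1: the identity function.
assert eis(x, 0, 0, 1, 0, 1.0) == x
```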
The derivative of the EIS activation family follows from the quotient rule:

F′(x) = [((ln(1+e^x))^α + αx (ln(1+e^x))^(α-1) · e^x/(1+e^x)) D(x) − x (ln(1+e^x))^α D′(x)] / D(x)²,

where D(x) = √(βx² + γ) + δe^(-θx) and D′(x) = βx/√(βx² + γ) − δθe^(-θx).
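Any closed-form derivative can be sanity-checked against a numerical approximation. The helper below is a generic central-difference check (illustrative, not specific to EIS), validated here on the standard Swish derivative, which equals 1/2 at the origin:

```python
import math

def num_grad(f, x, h=1e-6):
    # Central finite difference: O(h^2)-accurate approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Swish(x) = x * sigmoid(x); its exact derivative at 0 is 1/2.
swish = lambda x: x / (1.0 + math.exp(-x))
print(num_grad(swish, 0.0))  # approximately 0.5
```

The same check applies to any candidate member of the EIS family: compare the analytic derivative against `num_grad` at a handful of points before using it in backpropagation code.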
4 Search Findings
We have performed an exhaustive search over the EIS family by considering different hyperparameter values. All functions were trained and tested on the CIFAR10 [15] and CIFAR100 [15] datasets with the DenseNet-121 (DN-121) [12], MobileNet (MN) [11], and SimpleNet (SN) [9] models, and the best twelve functions are reported in Table 1. All of these functions either outperform or perform on par with ReLU and Swish. From Table 1, note that three functional forms consistently outperform ReLU and Swish; more detailed results for these three forms are reported in the next section.
We also want to mention that one of the searched functions performs remarkably well in our searches, but due to its complicated functional form and higher training time, we have not reported results for it with complex models.
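The search itself amounts to enumerating hyperparameter combinations and discarding the invalid ones where γ and δ are both zero. The grid values below are hypothetical placeholders, not the ranges used in the paper:

```python
from itertools import product

# Hypothetical discrete grids for (alpha, beta, gamma, delta, theta);
# these values are placeholders for illustration only.
alphas = [0.0, 0.5, 1.0]
betas = [0.0, 1.0]
gammas = [0.0, 1.0]
deltas = [0.0, 1.0]
thetas = [0.5, 1.0, 2.0]

candidates = [
    (a, b, g, d, t)
    for a, b, g, d, t in product(alphas, betas, gammas, deltas, thetas)
    # gamma and delta must not be zero simultaneously (function undefined)
    if not (g == 0.0 and d == 0.0)
]
print(len(candidates))  # 54 candidate settings to train and rank
```

Each surviving tuple is then trained on the benchmark models and ranked by test accuracy, which is how a table like Table 1 is produced.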
Table 1: Top-1 test accuracy (%) for ReLU, Swish, and the best twelve functions found in the EIS search, on CIFAR100 and CIFAR10 with DenseNet-121 (DN-121), MobileNet (MN), and SimpleNet (SN).

Activation Function  CIFAR100 DN-121  CIFAR100 MN  CIFAR100 SN  CIFAR10 DN-121  CIFAR10 MN  CIFAR10 SN
ReLU  56.90  66.20  62.77  85.49  90.69  91.07
Swish  56.25  66.91  64.85  85.55  90.80  91.70
EIS-1  57.24  67.45  64.99  86.63  90.99  92.01
EIS-2  57.60  67.50  65.15  86.32  91.05  92.20
EIS-3  57.46  67.42  65.09  86.00  91.12  92.35
EIS candidate 4  57.85  67.48  65.12  86.08  91.10  92.40
EIS candidate 5  56.57  66.22  64.52  85.70  90.89  91.70
EIS candidate 6  56.03  66.03  64.12  85.36  90.50  91.93
EIS candidate 7  56.72  66.85  64.78  85.25  90.40  91.70
EIS candidate 8  57.30  67.15  65.03  85.78  90.60  91.77
EIS candidate 9  57.05  67.06  64.99  86.24  90.79  92.11
EIS candidate 10  56.69  66.25  64.77  85.36  90.69  91.93
EIS candidate 11  57.12  67.01  65.01  85.25  90.62  91.79
EIS candidate 12  56.75  66.77  64.82  85.61  90.42  91.65
It is always difficult to claim that an activation function will give the best results on challenging real-world datasets, but the above searches suggest that the proposed family can generalize and provide better results than widely used activation functions. More detailed results with complex models on more datasets are given in the experiments section.
5 EIS-1, EIS-2, EIS-3
We have extracted three subfamilies of activation functions from the EIS family based on our searches. We call them EIS-1, EIS-2, and EIS-3. They are defined as follows:
(4)  
(5)  
(6) 
The derivatives of the above subfamilies are:
(7)  
(8)  
(9) 
Graphs of the EIS-1, EIS-2, and EIS-3 activation functions for different hyperparameter values are shown in the figures, together with their first-order derivatives and a comparison against Swish. Functions from these three subfamilies are smooth and non-monotonic like Swish.
6 Experiments with EIS-1, EIS-2, and EIS-3
From the activation search, we observe that EIS-1, EIS-2, and EIS-3 consistently outperform the other candidate functions, and we report results for these three activation functions in the following sections. We have tested them against some widely used activation functions on challenging datasets and noticed that they beat the baseline functions in most cases. The next few subsections describe the datasets, baseline functions, and experimental results in detail. Table 2 provides a detailed comparison of EIS-1, EIS-2, and EIS-3 with the baseline activations, counting the cases in which they beat, match, or underperform each baseline across models such as DenseNet (DN) [12], MobileNet (MN) [11], Inception V3 (IN-v3) [22], SimpleNet (SN) [9], and WideResNet (WRN).
Table 2: Number of model/dataset combinations (out of 14) in which EIS-1, EIS-2, and EIS-3 beat (>), match (=), or trail (<) each baseline.

Baselines  ReLU  Leaky ReLU  ELU  Swish  Softplus
EIS-1 > Baseline  14  13  14  14  14
EIS-1 = Baseline  0  0  0  0  0
EIS-1 < Baseline  0  1  0  0  0
EIS-2 > Baseline  14  14  14  14  14
EIS-2 = Baseline  0  0  0  0  0
EIS-2 < Baseline  0  0  0  0  0
EIS-3 > Baseline  14  13  14  14  14
EIS-3 = Baseline  0  0  0  0  0
EIS-3 < Baseline  0  1  0  0  0
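The win/tie/loss tallies of the kind shown in Table 2 can be computed directly from paired accuracy lists; a small sketch with made-up accuracy values:

```python
def tally(ours, baseline):
    # Count model/dataset combinations where our accuracy beats, ties,
    # or trails the baseline accuracy.
    wins = sum(a > b for a, b in zip(ours, baseline))
    ties = sum(a == b for a, b in zip(ours, baseline))
    losses = sum(a < b for a, b in zip(ours, baseline))
    return wins, ties, losses

# Hypothetical accuracies over four model/dataset combinations.
print(tally([91.0, 92.1, 86.6, 92.0], [90.7, 91.0, 85.5, 91.1]))  # (4, 0, 0)
```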
To estimate the performance of EIS-1, EIS-2, and EIS-3, we have tested them on different datasets and models and compared them with five widely used activation functions, described below.
Rectified Linear Unit: ReLU [7, 13, 18] is the most widely used activation function, valued for its simplicity and fast convergence. ReLU is defined as
(10) f(x) = max(0, x)
Leaky Rectified Linear Unit: To overcome the drawbacks of ReLU, Leaky ReLU was introduced by Maas et al. in 2013 [17], and it shows promising results on different real-world datasets. Leaky ReLU is defined as
(11) f(x) = x for x ≥ 0 and f(x) = αx for x < 0, where α is a small positive constant (typically 0.01)
Exponential Linear Units: ELU is designed to overcome the vanishing-gradient problem of ReLU. ELU was introduced by Clevert et al. in 2015 [4] and is defined as
(12) f(x) = x for x > 0 and f(x) = α(e^x − 1) for x ≤ 0, where α is a hyperparameter
Softplus: Softplus is a smooth approximation of ReLU, proposed in [5] and further studied in [8]. Softplus is defined as
(13) f(x) = ln(1 + e^x)
Swish: Swish is a non-monotonic, smooth function that is bounded below and unbounded above. It was proposed by a team of researchers from Google Brain in 2017 [21]. Swish is defined as
(14) f(x) = x σ(βx), where σ is the sigmoid function and β is a hyperparameter
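For reference, all five baselines admit short direct implementations; a minimal sketch using their standard definitions, with the Leaky ReLU slope and the ELU α exposed as parameters:

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small linear slope for negative inputs instead of a hard zero.
    return x if x > 0.0 else alpha * x

def elu(x, alpha=1.0):
    # Smoothly saturates to -alpha for very negative inputs.
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)

def softplus(x):
    # Smooth approximation of ReLU: log(1 + e^x).
    return math.log1p(math.exp(x))

def swish(x, beta=1.0):
    # x * sigmoid(beta * x).
    return x / (1.0 + math.exp(-beta * x))
```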
We have reported results on five benchmark databases: MNIST, Fashion-MNIST, The Street View House Numbers (SVHN), CIFAR10, and CIFAR100. A brief description of each database follows.

MNIST: MNIST [16] is a well-established standard database consisting of greyscale images of handwritten digits from 0 to 9. It is widely used to establish the efficacy of various deep learning models. The dataset consists of 60k training images and 10k testing images.

Fashion-MNIST: Fashion-MNIST [23] is a database of 28×28-pixel greyscale images of Zalando's articles from ten fashion classes, such as T-shirt, Trouser, Coat, and Bag. It consists of 60k training examples and 10k testing examples. Fashion-MNIST provides a more challenging classification problem than MNIST and seeks to replace it.

The Street View House Numbers (SVHN) Database: SVHN [20] is a popular computer-vision database consisting of RGB images of real-world house numbers. The database has 73257 training images, 26032 testing images, and a total of 10 classes.

CIFAR: The CIFAR [15] (Canadian Institute for Advanced Research) datasets are standard, well-established computer-vision datasets generally used to establish the efficacy of deep learning models. Each contains 60k color images of size 32×32, of which 50k are training images and 10k are testing images. The two versions, CIFAR10 and CIFAR100, contain 10 and 100 target classes, respectively.
Implementation and Evaluation of EIS
The three proposed activation subfamilies, with the specific hyperparameter values found in our searches, have been evaluated on different CNN architectures. For MNIST, Fashion-MNIST, and SVHN, we implemented a small seven-layer custom model, trained with a uniform 0.001 learning rate and the Adam [14] optimizer. For CIFAR10 and CIFAR100, DenseNet, MobileNet, Inception V3, and WideResNet were trained for 100 epochs with the same learning rate and optimizer, and SimpleNet was trained for 200 epochs with the same settings. All three activation functions from the EIS family are compared with the five baseline activations. Test accuracies on the five benchmark databases are reported in the tables below. Accuracy and loss plots on the WRN 28-10 model with CIFAR100 for EIS-1, EIS-2, EIS-3, ReLU, and Swish are shown in the accompanying figures.
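All models were trained with Adam at a 0.001 learning rate. For reference, a single Adam update for one scalar parameter, following the standard rule from [14] (this code is illustrative, not from the paper), can be sketched as:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a single scalar parameter.
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (t >= 1)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step, the bias-corrected update has magnitude approximately equal to the learning rate, regardless of the gradient's scale, which is part of what makes a single 0.001 setting workable across all these models.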
Test accuracy (%) with the seven-layer custom network.

Activation Function  Accuracy (%)
EIS-1  95.38
EIS-2  95.30
EIS-3  95.41
ReLU  95.20
Swish  95.23
Leaky ReLU (α = 0.01)  95.22
ELU  95.20
Softplus  95.10
CIFAR10: top-1 / top-5 test accuracy (%) with MobileNet (MN), DenseNet-169 (DN-169), Inception V3 (IN-v3), and DenseNet-121 (DN-121), and top-1 accuracy with SimpleNet (SN).

Activation Function  MN  DN-169  IN-v3  DN-121  SN
EIS-1  90.99 / 98.70  90.89 / 98.87  92.14 / 98.99  86.63 / 97.55  92.01
EIS-2  91.05 / 98.75  90.91 / 98.75  92.01 / 99.00  86.32 / 97.50  92.20
EIS-3  91.12 / 98.82  90.96 / 98.79  92.12 / 99.05  86.00 / 97.36  92.35
ReLU  90.69 / 98.71  90.76 / 98.82  91.01 / 98.85  85.49 / 97.10  91.07
Leaky ReLU  90.72 / 98.72  90.70 / 98.71  91.62 / 98.80  85.56 / 97.20  91.32
ELU  90.31 / 98.41  90.55 / 98.70  91.09 / 98.69  85.59 / 97.24  91.01
Swish  90.80 / 98.85  90.70 / 98.75  91.50 / 98.81  85.55 / 97.15  91.70
Softplus  90.55 / 98.75  90.31 / 98.42  91.59 / 98.72  85.54 / 97.10  91.01
CIFAR100: top-1 / top-5 test accuracy (%) with DenseNet-121 (DN-121), Inception V3 (IN-v3), MobileNet (MN), and DenseNet-169 (DN-169), and top-1 accuracy with SimpleNet (SN) and WideResNet 28-10 (WRN).

Activation Function  DN-121  IN-v3  MN  DN-169  SN  WRN
EIS-1  57.24 / 76.37  69.25 / 85.62  67.45 / 83.90  64.99 / 82.22  64.99  69.01
EIS-2  57.60 / 76.70  69.31 / 85.65  67.50 / 83.95  65.78 / 82.80  65.15  69.08
EIS-3  57.46 / 76.20  69.18 / 85.42  67.42 / 83.87  65.15 / 82.51  65.09  69.20
ReLU  56.90 / 76.20  69.01 / 85.33  66.20 / 83.01  64.10 / 81.46  62.77  66.60
Leaky ReLU  57.54 / 76.62  69.07 / 85.21  66.99 / 83.25  63.32 / 81.12  62.51  68.97
ELU  56.99 / 75.95  68.55 / 85.14  66.62 / 83.49  64.32 / 81.54  63.60  64.56
Swish  56.25 / 75.50  68.12 / 84.85  66.91 / 83.70  64.80 / 82.12  64.85  68.52
Softplus  56.95 / 76.05  69.02 / 84.99  66.20 / 83.41  64.82 / 82.20  62.80  61.70

7 Conclusion
The activation function is an integral part of a neural network and helps the network generalize. In this paper, we have proposed a hyperparametric family of activation functions, EIS. The proposed activation functions have been evaluated on deep neural network architectures using benchmark databases, and the results showcase the effectiveness of the proposed functions.
References
 (2016) Designing neural network architectures using reinforcement learning. External Links: 1611.02167 Cited by: §2.
 (2020) TanhSoft – a family of activation functions combining tanh and softplus. External Links: 2009.03863 Cited by: §1, §1.
 (2017) Improving deep learning by inverse square root linear units (isrlus). External Links: 1710.09967 Cited by: §1, §2, §3.
 (2015) Fast and accurate deep network learning by exponential linear units (elus). External Links: 1511.07289 Cited by: §1, §2, 3rd item.
 (2000) Incorporating secondorder functional knowledge for better option pricing. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, Cambridge, MA, USA, pp. 451–457. Cited by: 4th item.
 (2017) Ptelu: parametric tan hyperbolic linear unit activation for deep neural networks. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Vol. , pp. 974–978. Cited by: §1.
 (2000) Digital selection and analogue amplification coexist in a cortexinspired silicon circuit. Nature 405, pp. 947–51. External Links: Document Cited by: §1, 1st item.
 (2015) Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), Vol. , pp. 1–4. Cited by: §3, 4th item.
 (2016) Lets keep it simple, using simple architectures to outperform deeper and more complex architectures. External Links: 1608.06037 Cited by: §4, §6.
 (2015) Delving deep into rectifiers: surpassing humanlevel performance on imagenet classification. External Links: 1502.01852 Cited by: §1, §2.
 (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. External Links: 1704.04861 Cited by: §4, §6.
 (2016) Densely connected convolutional networks. External Links: 1608.06993 Cited by: §4, §6.
 (2009) What is the best multistage architecture for object recognition?. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27  October 4, 2009, pp. 2146–2153. External Links: Link, Document Cited by: §1, 1st item.
 (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Implementation and Evaluation of EIS.
 (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §4, 4th item.
 (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2. Cited by: 1st item.
 (2013) Rectifier nonlinearities improve neural network acoustic models. In in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Cited by: §1, §2, 2nd item.
 (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML10), June 2124, 2010, Haifa, Israel, J. Fürnkranz and T. Joachims (Eds.), pp. 807–814. External Links: Link Cited by: §1, 1st item.
 (2017) DeepArchitect: automatically designing and training deep architectures. External Links: 1704.08792 Cited by: §2.
 (2011) Reading digits in natural images with unsupervised feature learning. Cited by: 3rd item.
 (2017) Searching for activation functions. External Links: 1710.05941 Cited by: §1, §2, §3, 5th item.
 (2015) Rethinking the inception architecture for computer vision. External Links: 1512.00567 Cited by: §6.
 (2017) Fashionmnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: 2nd item.
 (2015) Empirical evaluation of rectified activations in convolutional network. External Links: 1505.00853 Cited by: §1, §2.
 (2020) Softrootsign activation function. External Links: 2003.00547 Cited by: §1.