EIS - a family of activation functions combining Exponential, ISRU, and Softplus

Abstract

Activation functions play a pivotal role in learning functions with neural networks. The non-linearity in the learned function is achieved by repeated use of the activation function. Over the years, numerous activation functions have been proposed to improve accuracy in several tasks. Basic functions like ReLU, Exponential, Tanh, or Softplus have been favorites in the deep learning community because of their simplicity. In recent years, several novel activation functions arising from these basic functions have been proposed, which have improved accuracy on some challenging datasets with complicated models. We propose a family of activation functions with five hyper-parameters, namely EIS, defined as,

We show examples of activation functions from the EIS family that outperform widely used activation functions on some well-known datasets and models. For example, one member of the family beats ReLU by 0.89% on DenseNet-169 and 0.24% on Inception V3 on the CIFAR100 dataset, and by 1.13% on Inception V3, 0.13% on DenseNet-169, and 0.94% on SimpleNet on the CIFAR10 dataset. Another member beats ReLU by 1.68% on DenseNet-169 and 0.30% on Inception V3 on the CIFAR100 dataset, and by 1.0% on Inception V3, 0.15% on DenseNet-169, and 1.13% on SimpleNet on the CIFAR10 dataset.

\keywords

Activation Function, Neural Networks, Deep Learning

1 Introduction

Multi-layered neural networks are widely used to learn nonlinear functions from complex data. An activation function is an integral part of a neural network and provides the essential non-linearity. A universal activation function may not be suitable for all datasets, and it is important to select an appropriate activation function for the task at hand. Nevertheless, a piecewise activation function, the Rectified Linear Unit (ReLU) [7, 13, 18], defined as $f(x) = \max(0, x)$, is widely used due to its simplicity, convergence speed, and lower training time.

Despite its simplicity and better convergence rate than Sigmoid and Tanh, ReLU has drawbacks such as non-zero mean, negative missing, unbounded output, and dying ReLU (see [25]), to name a few. Various activation functions have been proposed to overcome the drawbacks of ReLU and improve performance over it. Some of the variants of ReLU are Leaky ReLU [17], Randomized Leaky Rectified Linear Units (RReLU) [24], Exponential Linear Unit (ELU) [4], Inverse Square Root Linear Units (ISRLUs) [3], Parametric Rectified Linear Unit (PReLU) [10], and P-TELU [6]. But none of the above-mentioned activation functions have come close to ReLU in terms of usage. Most recently, Swish [21] has managed to gain attention from the deep learning community. Swish is a one-parameter family of activation functions defined as $f(x) = x \cdot \sigma(\beta x)$, where $\sigma$ is the sigmoid function. Some other hyper-parametrized families of activation functions include Soft-Root-Sign [25] and TanhSoft [2]. In fact, many functions from the TanhSoft family have managed to outperform ReLU and Swish.

The most prominent drawback of ReLU is the dying ReLU problem, that is, ReLU outputs zero for all negative inputs. Many novel activation functions have been built to overcome this problem, typically by defining a piecewise function that resembles ReLU for positive inputs and takes non-zero values for negative inputs. Swish differs from such piecewise activation functions in that it is a product of two smooth functions, and it manages to remain close to ReLU for positive inputs while taking small negative values for negative inputs. Recently, a family of activation functions with four hyper-parameters, TanhSoft [2], has been proposed, and many functions from TanhSoft show a similar closeness to ReLU as Swish and perform better than both ReLU and Swish.

In this work, we propose a family of activation functions with five hyper-parameters, known as EIS and defined as

(1)

where the five parameters are hyper-parameters whose admissible ranges are discussed in Section 3.

In the following sections, we describe this family, extract three subfamilies from (1), and show that for particular values of the hyper-parameters they outperform widely used activation functions, including ReLU and Swish. To validate the performance of the subfamilies, we present a comprehensive search over different values of the hyper-parameters.

2 Related Work

Several activation functions have been proposed as substitutes for ReLU to overcome its drawbacks. Because of the dying ReLU problem, a large fraction of neurons can become inactive and output zero. Another issue activation functions face is that, as the gradient flows through the network, it can become zero or diverge to infinity, which is commonly known as the vanishing or exploding gradient problem. Leaky ReLU [17] was introduced with a small negative linear component to solve the dying ReLU problem and has shown improvement over ReLU. A hyper-parametric component is incorporated in PReLU [10] to learn the best value of the negative linear component. Many other improvements have been proposed over the years, such as Randomized Leaky Rectified Linear Units (RReLU) [24], Exponential Linear Unit (ELU) [4], and Inverse Square Root Linear Units (ISRLUs) [3]. Swish [21] was proposed by a team of researchers from Google Brain using exhaustive search [19] and reinforcement learning techniques [1].

3 EIS Activation Function Family

In this work, we propose a hyper-parametric family of functions, defined as

(2)

Since the family in (2) is built from three basic activation functions, Exponential, ISRLU, and Softplus, we name it EIS. The family is well defined for suitable ranges of the hyper-parameters, and two of the parameters cannot simultaneously be equal to zero, as this would make the function undefined. For experimental purposes, we only work with small ranges of the hyper-parameters. The relationship between the hyper-parameters plays an important role in the EIS family and controls the slope of the function on both the negative and positive axes. One parameter is added to switch the Softplus factor in the numerator on and off. For the square root, we consider the positive branch. Note that certain choices of the hyper-parameters recover the constant function 1, while others recover the identity function $x$. Moreover,

(3)

For specific values of the hyper-parameters, the EIS family recovers some known activation functions, for example Softplus [8], Swish [21], and ISRLU [3].
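For reference, the standard forms of these recovered functions, as given in the cited works, are

\[
\text{Softplus}(x) = \ln(1 + e^{x}), \qquad
\text{Swish}(x) = x\,\sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}, \qquad
\text{ISRLU}(x) =
\begin{cases}
\dfrac{x}{\sqrt{1 + \alpha x^{2}}}, & x < 0,\\
x, & x \geq 0,
\end{cases}
\]

where $\beta$ and $\alpha$ are the parameters used in [21] and [3] respectively, not the EIS hyper-parameters.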

The derivative of the EIS activation family can be obtained by direct differentiation of (2).

4 Search Findings

We have performed an exhaustive search over the EIS family by considering different hyper-parameter values. All candidate functions have been trained and tested on the CIFAR10 [15] and CIFAR100 [15] datasets with the DenseNet-121 (DN-121) [12], MobileNet (MN) [11], and SimpleNet (SN) [9] models, and the best twelve functions are reported in Table 1. All of these functions either outperform or perform on par with ReLU and Swish. From Table 1, note that three functional forms consistently outperform ReLU and Swish. More detailed results for these three functional forms are reported in the next section.

We also note that one further function performed remarkably well in our searches, but due to its complicated functional form and higher training time, we have not reported results for it with complex models.
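No code accompanies the paper; the Python sketch below is only meant to convey the shape of such a grid search. The grid values and the make_candidate factory are hypothetical, and its body is a Swish-like stand-in rather than the actual EIS expression from (2); the real search iterates over all five hyper-parameters and trains the models listed above for each candidate.

    import itertools
    import numpy as np

    # Hypothetical grids over two of the five hyper-parameters; the paper
    # searches small ranges for all of them.
    alphas = [0.5, 1.0, 1.5]
    betas = [0.5, 1.0, 2.0]

    def make_candidate(alpha, beta):
        """Return one activation callable for a given hyper-parameter setting.
        The body below is a placeholder; substitute the EIS form from Eq. (2)."""
        def act(x):
            return alpha * x / (1.0 + np.exp(-beta * x))
        return act

    results = {}
    for alpha, beta in itertools.product(alphas, betas):
        act = make_candidate(alpha, beta)
        # Train DN-121, MN, and SN on CIFAR10/CIFAR100 with `act` as the
        # activation (see Section 6 for the training setup), then record
        # the Top-1 test accuracy here.
        results[(alpha, beta)] = None

    # The twelve best-performing candidates are reported in Table 1.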

Figure 1: A few novel activation functions from the searches of the EIS family.
Activation Function     CIFAR100 (Top-1 accuracy)       CIFAR10 (Top-1 accuracy)
                        MN      DN-121   SN             MN      DN-121   SN
ReLU                    56.90   66.20    62.77          85.49   90.69    91.07
Swish                   56.25   66.91    64.85          85.55   90.80    91.70
EIS candidate 1         57.24   67.45    64.99          86.63   90.99    92.01
EIS candidate 2         57.60   67.50    65.15          86.32   91.05    92.20
EIS candidate 3         57.46   67.42    65.09          86.00   91.12    92.35
EIS candidate 4         57.85   67.48    65.12          86.08   91.10    92.40
EIS candidate 5         56.57   66.22    64.52          85.70   90.89    91.70
EIS candidate 6         56.03   66.03    64.12          85.36   90.50    91.93
EIS candidate 7         56.72   66.85    64.78          85.25   90.40    91.70
EIS candidate 8         57.30   67.15    65.03          85.78   90.60    91.77
EIS candidate 9         57.05   67.06    64.99          86.24   90.79    92.11
EIS candidate 10        56.69   66.25    64.77          85.36   90.69    91.93
EIS candidate 11        57.12   67.01    65.01          85.25   90.62    91.79
EIS candidate 12        56.75   66.77    64.82          85.61   90.42    91.65
Table 1: Top-1 accuracy (%) from the searches on CIFAR100 and CIFAR10

It is always difficult to claim that an activation function will give the best results on challenging real-world datasets, but the above searches suggest that the proposed family can generalize and provide better results than widely used activation functions. More detailed results with complex models on more datasets are given in the experiments section.

5 EIS-1, EIS-2, EIS-3

We have extracted three subfamilies of activation functions from the EIS family based on our searches. We call them EIS-1, EIS-2, and EIS-3. They are defined as follows:

(4)
(5)
(6)

The derivatives of the above subfamilies are:

(7)
(8)
(9)

Graphs of the EIS-1, EIS-2, and EIS-3 activation functions for different hyper-parameter values are given in Figures 2, 3, and 4. Figures 5, 6, and 7 show the first-order derivatives of EIS-1, EIS-2, and EIS-3 for different hyper-parameter values. Additionally, a comparison between Swish, EIS-1, EIS-2, and EIS-3 is given in Figures 8 and 9. Functions of these three subfamilies are smooth and non-monotonic like Swish.

Figure 2: Plots of EIS-1 for different hyper-parameter values.
Figure 3: Plots of EIS-2 for different hyper-parameter values.
Figure 4: Plots of EIS-3 for different hyper-parameter values.
Figure 5: Plots of the first derivative of EIS-1 for different hyper-parameter values.
Figure 6: Plots of the first derivative of EIS-2 for different hyper-parameter values.
Figure 7: Plots of the first derivative of EIS-3 for different hyper-parameter values.
Figure 8: Swish, EIS-1, EIS-2, and EIS-3.
Figure 9: First-order derivatives of Swish, EIS-1, EIS-2, and EIS-3.

6 Experiment with EIS-1, EIS-2, and EIS-3

From the activation search, it is observed that three functions, specific members of EIS-1, EIS-2, and EIS-3, consistently outperform the other candidate functions, and we report results for these three activation functions in the following sections. We have tested them against some widely used activation functions on challenging datasets and observed that they beat the baseline functions in most cases. In the next few subsections, we describe the datasets, baseline functions, and experimental results in detail. Table 2 provides a detailed comparison of the three functions with the baseline activations, counting the cases where they beat, match, or under-perform the baselines across different models such as DenseNet (DN) [12], MobileNet (MN) [11], Inception V3 (IN-V3) [22], SimpleNet (SN) [9], and WideResNet (WRN).

Baselines               ReLU   Leaky ReLU   ELU   Swish   Softplus
EIS-1   > Baseline       14        13        14     14       14
        = Baseline        0         0         0      0        0
        < Baseline        0         1         0      0        0
EIS-2   > Baseline       14        14        14     14       14
        = Baseline        0         0         0      0        0
        < Baseline        0         0         0      0        0
EIS-3   > Baseline       14        13        14     14       14
        = Baseline        0         0         0      0        0
        < Baseline        0         1         0      0        0
Table 2: Number of dataset-model combinations in which EIS-1, EIS-2, and EIS-3 beat (>), match (=), or lose to (<) each baseline activation on Top-1 accuracy

To estimate the performance of EIS-1, EIS-2, and EIS-3, we have tested them on different datasets and models and compared them with five widely used activation functions, which are described below (a minimal reference implementation sketch is given after Figure 10).

  • Rectified Linear Unit (ReLU): ReLU was proposed by Nair and Hinton ([7], [13], [18]), and it is currently one of the most frequently used activation functions in deep learning. ReLU is defined as

    $f(x) = \max(0, x)$.   (10)
  • Leaky Rectified Linear Unit: To overcome the drawbacks of ReLU, Leaky ReLU was introduced by Maas et al. in 2013 [17], and it shows promising results on different real-world datasets. Leaky ReLU is defined as

    $f(x) = x$ if $x \geq 0$ and $f(x) = \alpha x$ if $x < 0$, with a small slope such as $\alpha = 0.01$.   (11)
  • Exponential Linear Unit (ELU): ELU is defined in such a way that it overcomes the vanishing gradient problem of ReLU. It was introduced by Clevert et al. in 2015 [4]. ELU is defined as

    $f(x) = x$ if $x > 0$ and $f(x) = \alpha (e^{x} - 1)$ if $x \leq 0$,   (12)

    where $\alpha$ is a hyper-parameter.

  • Softplus: Softplus [8, 5] is a smooth approximation of ReLU, defined as

    $f(x) = \ln(1 + e^{x})$.   (13)
  • Swish: Swish is a smooth, non-monotonic function that is bounded below and unbounded above. It was proposed by a team of researchers from Google Brain in 2017 [21]. Swish is defined as

    $f(x) = x \cdot \sigma(\beta x) = \dfrac{x}{1 + e^{-\beta x}}$.   (14)
Figure 10: Plots of a few widely used activation functions.
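The paper provides no reference code; purely as an illustration, the following NumPy sketch implements the five baseline activations using the standard definitions above. The default parameter values (alpha = 0.01 for Leaky ReLU, alpha = 1.0 for ELU, beta = 1.0 for Swish) are common choices assumed here, not values prescribed by this paper.

    import numpy as np

    def relu(x):
        # Eq. (10): max(0, x)
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # Eq. (11): identity for x >= 0, small linear slope for x < 0
        return np.where(x >= 0, x, alpha * x)

    def elu(x, alpha=1.0):
        # Eq. (12): identity for x > 0, alpha * (exp(x) - 1) for x <= 0
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def softplus(x):
        # Eq. (13): ln(1 + exp(x)), computed in a numerically stable way
        return np.logaddexp(0.0, x)

    def swish(x, beta=1.0):
        # Eq. (14): x * sigmoid(beta * x)
        return x / (1.0 + np.exp(-beta * x))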

We report results on five benchmark datasets: MNIST, Fashion MNIST, The Street View House Numbers (SVHN), CIFAR10, and CIFAR100. A brief description of each dataset follows (a loading sketch is given after the list).

  • MNIST: MNIST [16] is a well-established standard dataset consisting of grey-scale images of handwritten digits from 0 to 9. It is widely used to establish the efficacy of deep learning models. The dataset consists of 60k training images and 10k testing images.

  • Fashion-MNIST: Fashion-MNIST [23] is a dataset of 28×28 pixel grey-scale images of Zalando's fashion items from ten classes such as T-shirt, Trouser, Coat, and Bag. It consists of 60k training examples and 10k testing examples. Fashion-MNIST provides a more challenging classification problem than MNIST and seeks to replace it.

  • The Street View House Numbers (SVHN) Dataset: SVHN [20] is a popular computer vision dataset consisting of RGB images of real-world house numbers. It has 73257 training images and 26032 testing images, spread over a total of 10 classes.

  • CIFAR: CIFAR [15] (Canadian Institute for Advanced Research) is another standard, well-established computer vision dataset that is generally used to establish the efficacy of deep learning models. It contains 60k colour images of size 32×32, of which 50k are training images and 10k are testing images. It has two versions, CIFAR10 and CIFAR100, which contain 10 and 100 target classes, respectively.
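As an aside, four of the five datasets are available directly through tf.keras.datasets; the sketch below shows one way to load them, assuming TensorFlow is installed. SVHN is not bundled with Keras and is typically downloaded separately (for example as the official .mat files read with scipy.io.loadmat); that step is only indicated in a comment here.

    import tensorflow as tf

    # Grey-scale digit / fashion datasets (28x28 images, 10 classes each)
    (mnist_train, mnist_test) = tf.keras.datasets.mnist.load_data()
    (fmnist_train, fmnist_test) = tf.keras.datasets.fashion_mnist.load_data()

    # Colour datasets (32x32 images, 10 and 100 classes)
    (c10_train, c10_test) = tf.keras.datasets.cifar10.load_data()
    (c100_train, c100_test) = tf.keras.datasets.cifar100.load_data()

    # SVHN is not part of tf.keras.datasets; load it separately, e.g. from
    # the official train_32x32.mat / test_32x32.mat files via scipy.io.loadmat.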

Implementation and Evaluation of EIS

The three proposed activation subfamilies, with the specific hyper-parameter values found in our searches, have been evaluated on different CNN architectures. We implemented a small seven-layer custom model to evaluate performance on the MNIST, Fashion MNIST, and SVHN datasets, trained with a uniform 0.001 learning rate and the Adam [14] optimizer. To evaluate performance on the CIFAR10 and CIFAR100 datasets, DenseNet, MobileNet, Inception V3, and WideResNet are trained for 100 epochs with the same 0.001 learning rate and Adam optimizer, and SimpleNet is trained for 200 epochs with the same settings. All three activation functions from the EIS family are compared with the five state-of-the-art baseline activations. The test accuracies on the five benchmark datasets are reported in Tables 3, 4, 5, 6, and 7. Accuracy and loss plots on the WRN 28-10 model with the CIFAR100 dataset for EIS-1, EIS-2, EIS-3, ReLU, and Swish are given in Figures 11 and 12.
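The paper does not state which framework was used; as a hedged illustration of the reported setup (Adam, learning rate 0.001, 100 epochs), the sketch below wires a custom activation into a toy tf.keras CNN rather than into the actual DenseNet, MobileNet, Inception V3, WideResNet, or SimpleNet models. The eis_candidate body is a stand-in; the real experiments use the EIS members from (4)-(6) with the searched hyper-parameter values.

    import tensorflow as tf

    def eis_candidate(x):
        # Stand-in only: replace with an EIS member from Eqs. (4)-(6).
        return x * tf.sigmoid(x)

    def build_toy_cnn(num_classes, activation=eis_candidate):
        # Small illustrative CNN; the paper's experiments use the standard
        # architectures listed above instead of this toy network.
        return tf.keras.Sequential([
            tf.keras.layers.Input(shape=(32, 32, 3)),
            tf.keras.layers.Conv2D(32, 3, padding="same", activation=activation),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, padding="same", activation=activation),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(256, activation=activation),
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ])

    model = build_toy_cnn(num_classes=100)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test))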

Activation Function        5-fold mean accuracy on MNIST
EIS-1                      99.30
EIS-2                      99.32
EIS-3                      99.40
ReLU                       99.17
Swish                      99.21
Leaky ReLU (α = 0.01)      99.18
ELU                        99.15
Softplus                   99.02
Table 3: Experimental Results with MNIST Dataset
Activation Function        5-fold mean accuracy on Fashion MNIST
EIS-1                      93.35
EIS-2                      93.15
EIS-3                      93.20
ReLU                       92.85
Swish                      92.97
Leaky ReLU (α = 0.01)      92.91
ELU                        92.80
Softplus                   92.30
Table 4: Experimental Results with Fashion MNIST Dataset
Activation Function        5-fold mean accuracy on SVHN
EIS-1                      95.38
EIS-2                      95.30
EIS-3                      95.41
ReLU                       95.20
Swish                      95.23
Leaky ReLU (α = 0.01)      95.22
ELU                        95.20
Softplus                   95.10
Table 5: Experimental Results with SVHN Dataset
Activation Function      DN-121          DN-169          IN-V3           MN              SN
                         Top-1   Top-3   Top-1   Top-3   Top-1   Top-3   Top-1   Top-3   Top-1
EIS-1                    90.99   98.70   90.89   98.87   92.14   98.99   86.63   97.55   92.01
EIS-2                    91.05   98.75   90.91   98.75   92.01   99.00   86.32   97.50   92.20
EIS-3                    91.12   98.82   90.96   98.79   92.12   99.05   86.00   97.36   92.35
ReLU                     90.69   98.71   90.76   98.82   91.01   98.85   85.49   97.10   91.07
Leaky ReLU (α = 0.01)    90.72   98.72   90.70   98.71   91.62   98.80   85.56   97.20   91.32
ELU                      90.31   98.41   90.55   98.70   91.09   98.69   85.59   97.24   91.01
Swish                    90.80   98.85   90.70   98.75   91.50   98.81   85.55   97.15   91.70
Softplus                 90.55   98.75   90.31   98.42   91.59   98.72   85.54   97.10   91.01

Table 6: Top-1 and Top-3 accuracy (%) on the CIFAR10 dataset with different models.
Activation Function      MN              IN-V3           DN-121          DN-169          SN      WRN 28-10
                         Top-1   Top-3   Top-1   Top-3   Top-1   Top-3   Top-1   Top-3   Top-1   Top-1
EIS-1                    57.24   76.37   69.25   85.62   67.45   83.90   64.99   82.22   64.99   69.01
EIS-2                    57.60   76.70   69.31   85.65   67.50   83.95   65.78   82.80   65.15   69.08
EIS-3                    57.46   76.20   69.18   85.42   67.42   83.87   65.15   82.51   65.09   69.20
ReLU                     56.90   76.20   69.01   85.33   66.20   83.01   64.10   81.46   62.77   66.60
Leaky ReLU (α = 0.01)    57.54   76.62   69.07   85.21   66.99   83.25   63.32   81.12   62.51   68.97
ELU                      56.99   75.95   68.55   85.14   66.62   83.49   64.32   81.54   63.60   64.56
Swish                    56.25   75.50   68.12   84.85   66.91   83.70   64.80   82.12   64.85   68.52
Softplus                 56.95   76.05   69.02   84.99   66.20   83.41   64.82   82.20   62.80   61.70

Table 7: Top-1 and Top-3 accuracy (%) on the CIFAR100 dataset with different models.
Figure 11: Training and test accuracy on the CIFAR100 dataset with the WideResNet 28-10 model.
Figure 12: Training and test loss on the CIFAR100 dataset with the WideResNet 28-10 model.

7 Conclusion

The activation function is an integral part of a neural network and helps the network generalize. In this paper, we have proposed a hyper-parametric family of activation functions. The proposed activation functions have been evaluated on deep neural network architectures using benchmark datasets, and the results showcase the effectiveness of the proposed functions.

References

  1. B. Baker, O. Gupta, N. Naik and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv:1611.02167.
  2. K. Biswas, S. Kumar, S. Banerjee and A. K. Pandey (2020) TanhSoft – a family of activation functions combining tanh and softplus. arXiv:2009.03863.
  3. B. Carlile, G. Delamarter, P. Kinney, A. Marti and B. Whitney (2017) Improving deep learning by inverse square root linear units (ISRLUs). arXiv:1710.09967.
  4. D. Clevert, T. Unterthiner and S. Hochreiter (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289.
  5. C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau and R. Garcia (2000) Incorporating second-order functional knowledge for better option pricing. In Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS'00), Cambridge, MA, USA, pp. 451–457.
  6. A. Gupta and R. Duggal (2017) P-TELU: parametric tan hyperbolic linear unit activation for deep neural networks. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 974–978.
  7. R. Hahnloser, R. Sarpeshkar, M. Mahowald, R. Douglas and H. Seung (2000) Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, pp. 947–951.
  8. H. Zheng, Z. Yang, W. Liu, J. Liang and Y. Li (2015) Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–4.
  9. S. H. Hasanpour, M. Rouhani, M. Fayyaz and M. Sabokrou (2016) Lets keep it simple, using simple architectures to outperform deeper and more complex architectures. arXiv:1608.06037.
  10. K. He, X. Zhang, S. Ren and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv:1502.01852.
  11. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
  12. G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger (2016) Densely connected convolutional networks. arXiv:1608.06993.
  13. K. Jarrett, K. Kavukcuoglu, M. Ranzato and Y. LeCun (2009) What is the best multi-stage architecture for object recognition? In IEEE 12th International Conference on Computer Vision (ICCV 2009), Kyoto, Japan, pp. 2146–2153.
  14. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
  15. A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
  16. Y. LeCun, C. Cortes and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist
  17. A. L. Maas, A. Y. Hannun and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
  18. V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pp. 807–814.
  19. R. Negrinho and G. Gordon (2017) DeepArchitect: automatically designing and training deep architectures. arXiv:1704.08792.
  20. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning.
  21. P. Ramachandran, B. Zoph and Q. V. Le (2017) Searching for activation functions. arXiv:1710.05941.
  22. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2015) Rethinking the inception architecture for computer vision. arXiv:1512.00567.
  23. H. Xiao, K. Rasul and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
  24. B. Xu, N. Wang, T. Chen and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853.
  25. Y. Zhou, D. Li, S. Huo and S. Kung (2020) Soft-root-sign activation function. arXiv:2003.00547.