LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks
Abstract
The activation function is one of the key components of a neural network: it introduces the nonlinearity that makes deep training possible. However, because of hard zero rectification, some of the existing activation functions, such as ReLU and Swish, fail to fully utilize the negative input values and may suffer from the dying gradient problem. It is therefore important to look for a better activation function which is free from such problems. As a remedy, this paper proposes a new non-parametric function, called Linearly Scaled Hyperbolic Tangent (LiSHT), for Neural Networks (NNs). The proposed LiSHT activation function scales the nonlinear Hyperbolic Tangent (Tanh) function by a linear function and tackles the dying gradient problem. Training and classification experiments are performed over the benchmark Car Evaluation, Iris, MNIST, CIFAR10, CIFAR100 and Twitter140 datasets to show that the proposed activation achieves faster convergence and higher performance. A very promising performance improvement is observed on three different types of neural networks, including the Multilayer Perceptron (MLP), the Convolutional Neural Network (CNN), and a Recurrent Neural Network, namely the Long Short-Term Memory (LSTM). The advantages of the proposed activation function are also visualized in terms of feature activation maps, weight distributions and the loss landscape.
I Introduction
Deep learning is one of the breakthroughs which replaced hand-tuned features in many problems, including computer vision, speech processing, natural language processing, robotics, and many more [1]. In recent times, deep Artificial Neural Networks (ANNs) have shown tremendous performance improvements due to the availability of large-scale datasets as well as high-end computational resources [2]. Various types of ANN have been proposed for different types of problems: Multilayer Perceptrons (MLP) [3] deal with real-valued vector data [4], [5], Convolutional Neural Networks (CNN) deal with images and videos [6], [7], and Recurrent Neural Networks (RNN) such as the Long Short-Term Memory (LSTM) are used for speech and text classification [8], sentiment analysis [9], etc.
Recently, a multi-path ensemble CNN (ME-CNN) model was proposed which directly concatenates low-level with high-level output features [10]. In ME-CNN, backpropagation has a shorter path, which is suitable for ultra-deep network training. Ensemble Neural Networks (ENN) were introduced as an efficient stochastic gradient-free optimization method that relies on covariance matrices [11]. Wen et al. proposed a Memristive Fully Convolutional Network (MFCN) for accelerating a hardware image-segmentor, where the memristive circuitry performs the computation [12]. A light gated recurrent unit based neural network was proposed recently by Ravanelli et al. for speech recognition [13]. Recently, deep learning has been used in various tasks, such as a deep CNN for audio-visual speech enhancement [14], RCCNet (a CNN) for histological routine colon cancer nuclei classification [15], DFUNet (a CNN) for diabetic foot ulcer classification [16], combining a CNN and an RNN for video dynamics detection [17], dual CNN models for unsupervised monocular depth estimation [18], CNNs for face anti-spoofing [19], and a non-symmetric deep autoencoder (NDAE) based unsupervised deep learning approach for network intrusion detection [20].
The main aim of any type of neural network is to transform the input data into another feature space, where it becomes linearly separable into classes. In order to achieve this, all neural networks rely on an essential unit called the activation function [21]. The job of the activation function is to introduce nonlinearity into the network to facilitate the automatic learning of connection weights for a specific problem. Thus, the activation function is the backbone of any neural network.
The Sigmoid activation function was commonly used in the early days of neural networks. It is a special case of the logistic function. The function squashes real-valued numbers into the range [0, 1]: large negative numbers become 0 and large positive numbers become 1. The hyperbolic tangent (Tanh) is another popular activation function; its output values lie between -1 and 1 depending on the input real values. The vanishing gradient in both the positive and the negative direction is one of the major problems with both the Sigmoid and the Tanh activation functions. The Rectified Linear Unit (ReLU) activation function was proposed in the recent past for training deep networks [6]. ReLU is a breakthrough against the vanishing gradient: it is the identity function for non-negative inputs and the zero function (i.e., the output is zero) for negative inputs. ReLU has become very popular due to its simplicity and is used in various types of neural networks. The major drawback of ReLU is the diminished gradient for negative input values. Due to this diminishing gradient, ReLU units can be fragile during training and the gradient may die.
Several researchers have proposed improvements on ReLU, such as the Leaky ReLU (LReLU) [22], Parametric ReLU (PReLU) [23], Softplus [24], Exponential Linear Unit (ELU) [25], Scaled Exponential Linear Unit (SELU) [26], Gaussian Error Linear Unit (GELU) [27], Average Biased ReLU (ABReLU) [28], etc. LReLU is similar to ReLU, except that it allows a small constant gradient (such as 0.01) in the negative input regime in order to reduce the dying neuron problem [22]. PReLU extends LReLU by turning the coefficient of leakage into a trainable parameter instead of fixing it to a constant value [23]; the leakage parameter in PReLU is learned along with the other network parameters. Thus, both LReLU and PReLU lie between the linear function and ReLU, which is a drawback in terms of the decreased amount of nonlinearity. The Softplus activation tries to make the ReLU transition (i.e., at x = 0) smooth by fitting a log function [24]; otherwise, it is very similar to ReLU. The ELU is very similar to ReLU in being the identity function for non-negative inputs [25]. Unlike ReLU, which bends sharply at zero, the ELU becomes smooth slowly until its output equals a constant negative value. For positive inputs, the ELU [25] can blow up the activation, which can lead to the exploding gradient problem. SELU adds a fixed scaling parameter to the ELU, which better sustains bad weight initialization [26]. GELU randomly applies the identity or zero map to a neuron's input in a Gaussian fashion [27]; its shape lies between that of ReLU and ELU. ABReLU was proposed recently and allows prominent negative values after converting them into positive values by biasing with the average of all activations [28]. ABReLU also cannot utilize all the negative values due to the trimming of values at zero, similar to ReLU. Most of these existing activations are thus not always able to take advantage of negative values, which is solved in the proposed activation.
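The differing treatments of negative inputs surveyed above can be made concrete with a minimal NumPy sketch (the function names and the slope value 0.01 follow the common defaults mentioned in the text; this is an illustration, not the paper's implementation):

```python
import numpy as np

def relu(x):
    # Identity for non-negative inputs, zero for negative inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Keeps a small constant slope alpha in the negative regime.
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smoothly saturates towards -alpha for large negative inputs.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.0])
print(relu(x))        # negative inputs are zeroed out
print(leaky_relu(x))  # negative inputs keep a small gradient
print(elu(x))         # negative inputs decay smoothly towards -alpha
```

Evaluating the three functions on the same inputs shows directly why ReLU discards all negative information while LReLU and ELU retain some of it.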
Recently, Xu et al. performed an empirical study of rectified activations in CNNs [29]. Very recently, Ramachandran et al. proposed a sigmoid-weighted linear unit activation function [30]. They tested many candidate functions and found f(x) = x * sigmoid(x), i.e., Swish, to be a promising activation function. The Swish activation function interpolates between the linear function and ReLU by learning a parameter. However, the gradient diminishing problem is still present in the case of Swish. Recently, an attempt was also made to design complex, non-parametric activation functions for complex-valued neural networks (CVNNs) [31]. It relies on a kernel expansion with a fixed dictionary which can be implemented on vectorized hardware.
The existing activation functions have so far been too inconsistent in their gains to replace ReLU. In this paper, a linearly scaled hyperbolic tangent activation function (LiSHT) is proposed to introduce nonlinearity into neural networks. LiSHT scales the Tanh function linearly to tackle its gradient diminishing problem. Classification experiments are performed to show the improvement over a variety of datasets on three types of neural networks.
The contributions of this paper are as follows,

A new activation function, the non-parametric Linearly Scaled Hyperbolic Tangent (LiSHT), is proposed by linearly scaling the Tanh activation function.

The increased amount of nonlinearity of the proposed LiSHT activation function is visualized from its first- and second-order derivative curves (Fig. 1).

The proposed activation function is tested with different types of neural networks, including a classical Artificial Neural Network (ANN) in the form of a Multilayer Perceptron (MLP), a Convolutional Neural Network (CNN) in the form of a Residual Network (ResNet), and a Recurrent Neural Network (RNN) in the form of a Long Short-Term Memory (LSTM) network.

Rigorous experiments and analysis are performed for some classification problems.

Three different types of experimental data are used: 1) one-dimensional data, including the Car, Iris and MNIST (converted from images) datasets; 2) 2-D image data, including the MNIST, CIFAR10 and CIFAR100 datasets; and 3) sentiment analysis data, namely the Twitter140 dataset.

The impact of different nonlinearity functions on activation feature maps and weight distributions is analyzed.
This paper is organized as follows: Section 2 outlines the proposed activation; Section 3 describes the experimental setup; Section 4 presents the results and analysis; and Section 5 contains the concluding remarks.
II Proposed LiSHT Activation Function
A feedforward Deep Neural Network (DNN) comprises one or more hidden nonlinear layers. Let the input vector be x; each hidden layer transforms its input vector by applying a linear affine transform followed by a nonlinear mapping from the (l-1)-th layer to the l-th layer as follows:
y^{l} = \phi(W^{l} y^{l-1} + b^{l})  (1)
Here, y^{l}, W^{l}, and b^{l} represent the output vector, the weight matrix, and the bias vector of the l-th hidden layer, respectively, and \phi is a nonlinear activation mapping. Looking for an efficient and powerful activation function for DNNs remains in demand because of the saturation properties of many existing activation functions. An activation function is said to saturate [32] if its derivative tends to zero in both directions (i.e., x -> +infinity and x -> -infinity). The training of a deep neural network is almost impossible with the Sigmoid and Tanh activation functions due to the gradient diminishing problem when the input is either too small or too large [2]. The Rectified Linear Unit (ReLU) (i.e., max(0, x)) was the first activation to become very popular for training DNNs [6]. However, ReLU also suffers from the gradient diminishing problem for negative inputs, which leads to the dying neuron problem.
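The per-layer transform of Eq. (1) can be sketched in a few lines of NumPy (the layer sizes below are hypothetical toy values, not from the paper):

```python
import numpy as np

def dense_layer(x, W, b, phi):
    # One hidden layer as in Eq. (1): affine transform, then nonlinearity.
    return phi(W @ x + b)

# Toy example: 3 inputs mapped to 2 hidden units with a Tanh nonlinearity.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = np.zeros(2)
x = np.array([1.0, -1.0, 0.5])
h = dense_layer(x, W, b, np.tanh)
print(h.shape)  # (2,)
```

Any activation discussed in this paper can be passed as `phi`, which is exactly the design point the rest of this section explores.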
Hence, we propose a non-parametric linearly scaled hyperbolic tangent activation function, called LiSHT. Like ReLU [6] and Swish [30, 33], LiSHT shares the same unbounded upper limit on the right-hand side of the activation curve. Moreover, because of the symmetry-preserving property of LiSHT, the left-hand side of the activation is also unbounded in the upward direction, and hence the function is non-monotonic (see Fig. 1(a)). In contrast to the existing literature [25, 30], and to the best of our knowledge for the first time, LiSHT utilizes the benefits of a positive-valued activation without identically propagating all the inputs, which mitigates gradient vanishing during backpropagation and yields faster training of deep neural networks. The proposed LiSHT activation function is computed by multiplying the Tanh function by its input and is defined as
\phi(x) = x \cdot g(x)  (2)
where g(x) is the hyperbolic tangent function, defined as
g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}  (3)
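Eqs. (2) and (3) translate directly into code. A minimal NumPy sketch of the proposed activation:

```python
import numpy as np

def lisht(x):
    # LiSHT (Eqs. 2-3): scale the Tanh output linearly by the input.
    return x * np.tanh(x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(lisht(x))  # symmetric: lisht(-x) == lisht(x), output is non-negative
```

Evaluating at a few points makes the symmetry visible: negative inputs produce the same (positive) outputs as their positive counterparts, which is the property discussed next.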
For large positive inputs, the behavior of LiSHT is close to that of ReLU and Swish, i.e., the output is close to the input, as depicted in Fig. 1(a). However, unlike ReLU and other commonly used activation functions, the output of LiSHT for negative inputs is symmetric to its output for positive inputs, as illustrated in Fig. 1(a). The first-order derivative \phi'(x) of LiSHT is given as follows:
\phi'(x) = \tanh(x) + x \left(1 - \tanh^{2}(x)\right)  (4)
Similarly, the second-order derivative \phi''(x) of LiSHT is given as follows:
\phi''(x) = 2 \left(1 - \tanh^{2}(x)\right) \left(1 - x \tanh(x)\right)  (5)
The first- and second-order derivatives of the proposed LiSHT are plotted in Fig. 1(b) and Fig. 1(c), respectively. An attractive characteristic of LiSHT is its self-stability property: the magnitude of the derivative is less than 1 for inputs of small magnitude. It can be observed from the derivatives of LiSHT in Fig. 1 that the amount of nonlinearity is very high near zero as compared to the existing activations, which can boost the learning of a complex model. Another major advantage of the proposed activation is its nearly linear nature for very small and very large inputs, which resolves the gradient diminishing problem of existing methods. It can be seen in Fig. 1(b) that the gradient of the proposed activation function takes both positive and negative values, which tackles the exploding gradient problem. As shown in Fig. 1(c), the second-order derivative of the proposed activation function resembles the negative of the Laplacian of Gaussian (i.e., the second-order derivative of the Gaussian) operator, which is useful for maximizing a function. Thus, due to this opposite-of-Gaussian nature, the proposed activation function boosts the training of the neural network for the minimization of the loss function.
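The closed forms in Eqs. (4) and (5) can be checked numerically against central finite differences, a quick sanity test for the derivative curves in Fig. 1:

```python
import numpy as np

def lisht(x):
    return x * np.tanh(x)

def lisht_grad(x):
    # First derivative (Eq. 4): tanh(x) + x * (1 - tanh(x)^2)
    t = np.tanh(x)
    return t + x * (1.0 - t * t)

def lisht_grad2(x):
    # Second derivative (Eq. 5): 2 * (1 - tanh(x)^2) * (1 - x * tanh(x))
    t = np.tanh(x)
    return 2.0 * (1.0 - t * t) * (1.0 - x * t)

# Central-difference check of both derivatives on a grid of points.
x = np.linspace(-4.0, 4.0, 9)
eps = 1e-5
num_g1 = (lisht(x + eps) - lisht(x - eps)) / (2 * eps)
num_g2 = (lisht_grad(x + eps) - lisht_grad(x - eps)) / (2 * eps)
print(np.max(np.abs(num_g1 - lisht_grad(x))))   # should be tiny
print(np.max(np.abs(num_g2 - lisht_grad2(x))))  # should be tiny
```

Agreement to numerical precision confirms that the two analytic expressions are consistent with the definition in Eq. (2).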
Theoretically, it is difficult to prove why an activation function works well; its success may be due to many puzzling factors that directly affect DNN training. We believe that being unbounded in both the positive and negative directions, smooth, and non-monotonic are the advantages of the proposed activation. The completely unbounded property makes LiSHT different from all the traditional activation functions. Moreover, it makes strong use of the positive feature space, which is suitable for assessing the "information content" of features in complex classification tasks. Unlike ReLU, LiSHT is a smooth and non-monotonic function. In addition, LiSHT is a symmetric function and introduces a greater amount of nonlinearity into the training process than Swish.
III Experimental Setup
This section is devoted to the experimental setup used in this paper. First, six datasets are described in detail, then the three types of networks are summarized, and finally the training settings are stated in detail.
III-A Datasets Used
The effectiveness of the proposed LiSHT activation function is evaluated on six benchmark databases: Car Evaluation, Iris, MNIST, CIFAR10, CIFAR100 and Twitter140. The Car Evaluation dataset is used in this paper to test the performance of activation functions with the multilayer perceptron. It contains 4 classes with a total of 1,728 samples having 6-dimensional features [34]. Fisher's Iris Flower dataset [34], from the UCI Repository of Machine Learning Databases, is a classic and one of the best-known multivariate datasets; it contains a total of 150 samples from 3 different species of Iris: Iris setosa, Iris virginica and Iris versicolor. In the Iris dataset, every sample is a vector of length four ("sepal length", "sepal width", "petal length", and "petal width"). The MNIST dataset is widely used for recognizing English digits from images. It contains a total of 60,000 training samples and 10,000 test samples of dimension 28 x 28 [35]. In this dataset, each image contains a single digit from 0 to 9. Sample images of the MNIST dataset are shown in Fig. 2. The CIFAR10 dataset consists of 60,000 color images of 32 x 32 resolution from 10 classes, with 6,000 images per class [36]. The 50,000 images with 5,000 images per class are used for training and the 10,000 images with 1,000 images per class are used for testing. The classes of the CIFAR10 dataset are 'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', and 'truck'. See Fig. 2 for sample images of the CIFAR10 dataset. The CIFAR100 dataset contains the same number of images as CIFAR10 (i.e., 50,000 for training and 10,000 for testing), but the training and testing images are categorized into 100 classes. The Twitter140 dataset [37] is used to perform the classification of sentiments of Twitter messages as either positive, negative or neutral with respect to a query.
In this dataset, we have considered 1,600,000 examples, of which 85% are used as the training set and the remaining 15% as the validation set.
III-B Tested Neural Networks
Three different types of neural networks were tested to show the performance of the activation functions: a Multilayer Perceptron (MLP) with one hidden layer, a widely used deep convolutional pre-activation Residual Network (ResNet-PreAct) [38], and a Long Short-Term Memory (LSTM) network. These architectures are explained in this section. The MLP with one hidden layer is used in this paper for the classification of vector data; its typical structure is illustrated in Fig. 3. The input and output layer sizes follow the datasets: 6 input and 4 output nodes for the Car Evaluation dataset, and 4 input and 3 output nodes for the Iris Flower dataset. The MNIST dataset images are stretched into one-dimensional vectors when used with the MLP; thus, for MNIST, the MLP uses 784 input and 10 output nodes. The Residual Network (ResNet) is the state of the art for the image classification task. Here, the improved version of ResNet (i.e., ResNet with pre-activation [38]) is used for image classification over the MNIST, CIFAR10, and CIFAR100 datasets. The ResNet-PreAct is used with 164 layers (i.e., a very deep network) for the CIFAR10 and CIFAR100 datasets, whereas it is used with 20 layers for the MNIST dataset. The pre-activation residual block is illustrated in Fig. 4. Channel pixel mean subtraction is used for preprocessing the image datasets with this network, as per the standard practice followed by most image classification neural networks. Finally, the Long Short-Term Memory (LSTM) network, which belongs to the Recurrent Neural Network (RNN) family, is used as the third type of neural network. A single-layer LSTM is used for sentiment analysis over the Twitter140 dataset; it is fed with word vectors trained with FastText embeddings.
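The MLP variant above can be sketched as a forward pass in NumPy; the shapes below follow an Iris-like setup (4 features, 3 classes), while the hidden size of 8 is a placeholder, since the paper's hidden-layer width is not recoverable here:

```python
import numpy as np

def lisht(x):
    return x * np.tanh(x)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    # One hidden layer with the LiSHT nonlinearity, softmax class output.
    h = lisht(x @ W1 + b1)
    return softmax(h @ W2 + b2)

# Hypothetical shapes: 4 input features, 8 hidden units, 3 classes.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)) * 0.1, np.zeros(3)
x = rng.standard_normal((5, 4))  # a batch of 5 samples
p = mlp_forward(x, W1, b1, W2, b2)
print(p.shape)        # (5, 3)
print(p.sum(axis=1))  # each row sums to 1
```

Swapping `lisht` for any other activation in `mlp_forward` reproduces the comparison setup used in the experiments.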
III-C Training Settings
The Keras deep learning Python library with TensorFlow at the backend is used for the implementation of the LiSHT activation function. Different computer systems, including different GPUs (such as an NVIDIA Titan X Pascal 12GB GPU and an NVIDIA Titan V 12GB GPU), are used at different stages of the experiments. The Adam optimizer [39] is used for the experiments in this paper. A fixed batch size is used for the training of the networks. The learning rate is initialized to a fixed value and reduced by a constant factor at predefined epochs during training. For the LSTM, the learning rate is dropped by a constant factor after a fixed number of mini-batch iterations.
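The stepped learning-rate schedule described above can be sketched as a plain Python function; the base rate, drop factor, and milestone epochs below are illustrative placeholders, as the paper's exact values are not recoverable from this text:

```python
def step_decay(epoch, base_lr=1e-3, drop=0.1, milestones=(80, 120, 160)):
    # Reduce the learning rate by a constant factor at predefined epochs.
    # All numeric values here are placeholders, not the paper's settings.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= drop
    return lr

for e in (0, 80, 120, 160):
    print(e, step_decay(e))
```

A function of this shape can be passed to a scheduler callback (e.g. Keras's `LearningRateScheduler`) so the rate drops automatically at the chosen epochs.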
TABLE I: Training and validation loss and accuracy (%) of different activation functions using the MLP over the Car Evaluation, Iris and MNIST datasets.

Dataset     Activation   Train Loss   Train Acc.   Val. Loss   Val. Acc.
Car Eval.   Tanh         0.0341       98.84        0.0989      96.40
Car Eval.   Sigmoid      0.0253       98.77        0.1110      96.24
Car Eval.   ReLU         0.0285       99.10        0.0769      97.40
Car Eval.   Swish        0.0270       99.13        0.0790      97.11
Car Eval.   LiSHT        0.0250       99.28        0.0663      97.98
Iris        Tanh         0.0937       97.46        0.0898      96.26
Iris        Sigmoid      0.0951       97.83        0.0913      96.23
Iris        ReLU         0.0983       98.33        0.0886      96.41
Iris        Swish        0.0953       98.50        0.0994      96.34
Iris        LiSHT        0.0926       98.67        0.0862      97.33
MNIST       Tanh         0.0138       99.56        0.0987      98.26
MNIST       Sigmoid      0.0064       99.60        0.0928      98.43
MNIST       ReLU         0.0192       99.51        0.1040      98.48
MNIST       Swish        0.0159       99.58        0.1048      98.45
MNIST       LiSHT        0.0127       99.68        0.0915      98.60
TABLE II: Validation accuracy (%) of different activation functions using the pre-activation ResNet over the MNIST, CIFAR10 and CIFAR100 datasets.

Dataset    Depth   Tanh    ReLU    Swish   LiSHT
MNIST      20      99.48   99.56   99.53   99.59
CIFAR10    164     89.74   91.15   91.60   92.92
CIFAR100   164     68.80   72.84   74.45   75.32
TABLE III: Validation accuracy (%) of different activation functions using the LSTM over the Twitter140 dataset.

Dataset      Tanh    ReLU    Swish   LiSHT
Twitter140   82.27   82.47   82.22   82.47
IV Results and Analysis
We investigate the performance and effectiveness of the proposed LiSHT activation and compare it with state-of-the-art activation functions such as Sigmoid, Tanh, ReLU and Swish. In this section, first the experimental results using the MLP, ResNet and LSTM are presented; then the results are analyzed in terms of accuracy and loss vs. epochs; then the feature activation maps and weight distributions are analyzed for the different activation functions; and finally the effect of the activation functions is analyzed using the loss landscape.
IV-A Data Classification Results using MLP
The classification performance using the MLP over the Car Evaluation, Iris and MNIST datasets is reported in Table I in terms of the loss and accuracy over the training and validation sets. The training is performed using the categorical cross-entropy loss. In order to run the training smoothly on each dataset, a random subset of samples is chosen for training and the remainder is used for validation. The proposed LiSHT activation achieves the minimum training and validation loss over all three datasets. It also achieves the best validation accuracies of 97.98%, 97.33% and 98.60% over the Car Evaluation, Iris and MNIST datasets, respectively.
IV-B Image Classification Results using ResNet
Table II summarizes the validation accuracies of the different activations over the MNIST, CIFAR10 and CIFAR100 datasets with the pre-activation ResNet. The depth of the ResNet is 20 for MNIST and 164 for the CIFAR datasets. The training is performed using the categorical cross-entropy loss. It is observed that the proposed LiSHT activation function outperforms the other activation functions, with accuracies of 99.59%, 92.92% and 75.32% over the MNIST, CIFAR10 and CIFAR100 datasets, respectively. Moreover, a significant improvement is shown by LiSHT on the CIFAR datasets. The unbounded, symmetric and more nonlinear properties of the proposed activation function facilitate better and more efficient training compared to the other activation functions such as Tanh, ReLU and Swish. The unbounded and symmetric nature of LiSHT leads to more exploration of the weights and to both positive and negative gradients, which tackle the gradient diminishing and exploding problems.
IV-C Sentiment Classification Results using LSTM
The sentiment classification performance in terms of the validation accuracy is reported in Table III over the Twitter140 dataset with the LSTM for the different activations. It is observed that the performance of the proposed LiSHT activation function is better than Tanh and Swish, and the same as ReLU. This points to one important observation: treating negative inputs as negative values, as Swish does, degrades the performance here, because it pushes the activation closer to the linear function as compared to the Tanh activation.
IV-D Result Analysis
The training and testing accuracy over the epochs are plotted in Fig. 5(a)-(f) for the MNIST, CIFAR10 and CIFAR100 datasets using ResNet. The convergence curve of the losses is also used as a metric to measure the learning ability of the ResNet model with different activation functions. The training and testing loss over the epochs are plotted in Fig. 6(a)-(f) for the MNIST, CIFAR10 and CIFAR100 datasets using ResNet. It can be clearly seen that the proposed LiSHT activation improves the convergence speed compared to the other activations. Fig. 7(a)-(b) shows the convergence of the validation loss and accuracy for the LSTM network using different activations over the Twitter140 sentiment classification dataset. It is clearly observed that the proposed LiSHT boosts the convergence speed. It is also observed that LiSHT outperforms the existing nonlinearities across several classification tasks with the MLP, ResNet and LSTM networks. The training and validation losses converge faster using the LiSHT activation.
[Fig. 8 and Fig. 9: activation feature maps for the ReLU, Swish and LiSHT activations at two layers of the pre-activation ResNet.]
IV-E Analysis of Activation Feature Maps
In deep learning, it is common practice to visualize the activations of different layers of a network. In order to understand the effect of the activation functions on the learning of important features at different depths, we show the activation feature maps for the different nonlinearities at two layers of the pre-activation ResNet for an MNIST digit in Fig. 8 and Fig. 9, respectively. It can be seen from Fig. 8 and 9 that the feature maps appearing deep blue are due to the dying neuron problem, caused by the non-learnable behavior that arises from the improper handling of negative values by the activation functions. The proposed LiSHT activation consistently outperforms the other activations: it generates fewer non-learnable filters due to its unbounded nature in both the positive and negative directions, which helps it overcome the dying gradient problem. It is also observed that some image patches contain noise in the form of yellow color. The patches corresponding to LiSHT contain less noise; moreover, the noise is more uniformly distributed over all the patches when LiSHT is used, compared to the other activation functions. This may also be one of the factors why the proposed LiSHT outperforms the other activations.
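The dying-neuron effect visible in the feature maps can be quantified with a small simulation: count the fraction of exactly zero outputs under ReLU versus LiSHT on zero-mean inputs (the feature-map shape is a hypothetical stand-in, not the network's actual dimensions):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def lisht(x):
    return x * np.tanh(x)

# Simulated pre-activation values: hypothetical 64 maps of size 8x8,
# drawn from a zero-mean Gaussian.
rng = np.random.default_rng(42)
pre = rng.standard_normal((64, 8, 8))

dead_relu = np.mean(relu(pre) == 0.0)    # ReLU zeroes every negative input
dead_lisht = np.mean(lisht(pre) == 0.0)  # LiSHT maps only exact zeros to zero
print(dead_relu)   # roughly half the responses vanish under ReLU
print(dead_lisht)  # essentially none vanish under LiSHT
```

Under zero-mean inputs, ReLU silences about half of the responses while LiSHT keeps all of them active, mirroring the deep blue (dead) patches seen with ReLU in Fig. 8 and 9.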
IV-F Analysis of Final Weight Distribution
The weights of the layers are useful to visualize because they give an idea of the learning pattern of the network in terms of 1) the positive and negative biasedness and 2) the exploration of the weights caused by the activation functions. It is also expected that well-trained networks usually display well-normalized and smooth filters without noisy patterns. These characteristics are usually most interpretable in the first layer, which looks directly at the raw pixel data, but it is also possible to inspect the filter weights deeper in the network to gain intuition about the abstract-level learning of the network. The learning of strong filter weights is highly dependent upon the nonlinearity used for training [40]. We portray the weight distribution of the final layer in Fig. 10 for the pre-activation ResNet over the MNIST dataset using the Tanh, ReLU, Swish and LiSHT activations. The weight distribution for Tanh is limited to a narrow range (see Fig. 10(a)) due to its bounded nature in both the negative and positive regions. Interestingly, as depicted in Fig. 10(b), the weight distribution for ReLU is biased towards the positive region, because ReLU converts all negative values to zero, which restricts the learning of weights in the negative direction. This leads to the problems of dying gradients as well as gradient explosion. Swish tries to overcome the problems of ReLU, but is unable to fully succeed due to its bounded nature in the negative region (see Fig. 10(c)). The above-mentioned problems are removed by LiSHT, as suggested by its weight distribution shown in Fig. 10(d). The LiSHT activation leads to a symmetric and smoother weight distribution; moreover, it also allows the exploration of the weights over a wider range (as in the example of Fig. 10).
IV-G Analysis of Loss Landscape
Training deep neural networks requires minimizing a non-convex, high-dimensional loss function, which is a hard task in theory but sometimes easier in practice. Simple gradient descent methods [41] are often able to find global minima or saddle points where the configuration reaches a training loss of zero or near zero, even when the data and labels are randomized before training. However, this behavior, while desirable, is not universal. The trainability of a DNN is directly and indirectly influenced by factors such as the network architecture, the choice of optimizer, variable initialization, and, most importantly, the kind of nonlinearity used in the architecture. In order to understand the effects of the nonlinearity on non-convexity, we trained the ResNet using the ReLU, Swish and proposed LiSHT activations and explored the structure of the neural network loss landscape. The 2D and 3D visualizations of the loss landscapes are illustrated in Fig. 11, following the visualization technique proposed by Li et al. [42].
As depicted in the loss landscape visualizations in Fig. 11(a)-(c), LiSHT makes the network produce smoother loss landscapes with smaller convergence steps, populated by narrow, convex regions, which directly impacts the loss landscape. ReLU and Swish also produce smooth loss landscapes, but with large convergence steps; unlike LiSHT, both cover a wider search area, which leads to poorer training behavior. From the visualization in Fig. 11(d)-(f), it can be observed that the slope of the LiSHT loss landscape is higher than that of ReLU and Swish, which enables LiSHT to train deep networks efficiently. Therefore, we can say that LiSHT decreases the non-convex nature of the overall loss minimization landscape as compared to the ReLU and Swish activation functions.
V Conclusion
In this paper, a novel non-parametric linearly scaled hyperbolic tangent activation function (LiSHT) is proposed for training neural networks. The proposed activation function introduces more nonlinearity into the network. It is completely unbounded and solves the diminishing gradient problem. Other properties of LiSHT are symmetry, smoothness and non-monotonicity, which play important roles in training. The classification results are compared with state-of-the-art activation functions using MLP, ResNet and LSTM models over benchmark datasets. The performance is tested on data classification, image classification and sentiment classification problems. The experimental results confirm the effectiveness of the unbounded, symmetric and highly nonlinear nature of the proposed activation function. The importance of unbounded and symmetric nonlinearity in both the positive and negative regions is analyzed in terms of the activation maps and the weight distribution of the learned network. A positive correlation is observed between network learning and the proposed unbounded and symmetric activation. The visualization of the loss landscape confirms the effectiveness of the proposed activation in making the training smoother with faster convergence.
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan X Pascal GPU used partially in this research.
References
 [1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
 [2] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cambridge, 2016, vol. 1.
 [3] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
 [4] J. Tang, C. Deng, and G.B. Huang, “Extreme learning machine for multilayer perceptron,” IEEE transactions on neural networks and learning systems, vol. 27, no. 4, pp. 809–821, 2016.
 [5] L.W. Kim, “Deepx: Deep learning accelerator for restricted boltzmann machine artificial neural networks,” IEEE transactions on neural networks and learning systems, vol. 29, no. 5, pp. 1441–1453, 2018.
 [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [8] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [9] Y. Wang, M. Huang, L. Zhao et al., “Attentionbased lstm for aspectlevel sentiment classification,” in Proceedings of the 2016 conference on empirical methods in natural language processing, 2016, pp. 606–615.
 [10] X. Wang, A. Bao, Y. Cheng, and Q. Yu, “Multipath ensemble convolutional neural network,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
 [11] Y. Chen, H. Chang, M. Jin, and D. Zhang, “Ensemble neural networks (ENN): A gradient-free stochastic method,” Neural Networks, 2018.
 [12] S. Wen, H. Wei, Z. Zeng, and T. Huang, “Memristive fully convolutional network: An accurate hardware image-segmentor in deep learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 5, pp. 324–334, 2018.
 [13] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Light gated recurrent units for speech recognition,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, 2018.
 [14] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual speech enhancement using multimodal deep convolutional neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
 [15] S. Basha, S. Ghosh, K. K. Babu, S. R. Dubey, V. Pulabaigari, S. Mukherjee et al., “RCCNet: An efficient convolutional neural network for histological routine colon cancer nuclei classification,” arXiv preprint arXiv:1810.02797, 2018.
 [16] M. Goyal, N. D. Reeves, A. K. Davison, S. Rajbhandari, J. Spragg, and M. H. Yap, “DFUNet: Convolutional neural networks for diabetic foot ulcer classification,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
 [17] K. Zheng, W. Q. Yan, and P. Nand, “Video dynamics detection using deep neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 3, pp. 224–234, 2018.
 [18] V. K. Repala and S. R. Dubey, “Dual CNN models for unsupervised monocular depth estimation,” arXiv preprint arXiv:1804.06324, 2018.
 [19] C. Nagpal and S. R. Dubey, “A performance evaluation of convolutional neural networks for face anti-spoofing,” arXiv preprint arXiv:1805.04176, 2018.
 [20] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A deep learning approach to network intrusion detection,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
 [21] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.
 [22] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
 [23] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
 [24] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
 [25] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
 [26] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 971–980.
 [27] D. Hendrycks and K. Gimpel, “Bridging nonlinearities and stochastic regularizers with Gaussian error linear units,” arXiv preprint arXiv:1606.08415, 2016.
 [28] S. R. Dubey and S. Chakraborty, “Average biased ReLU based CNN descriptor for improved face retrieval,” arXiv preprint arXiv:1804.02051, 2018.
 [29] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
 [30] P. Ramachandran, B. Zoph, and Q. V. Le, “Swish: A self-gated activation function,” arXiv preprint arXiv:1710.05941, 2017.
 [31] S. Scardapane, S. Van Vaerenbergh, A. Hussain, and A. Uncini, “Complex-valued neural networks with nonparametric activation functions,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
 [32] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 315–323.
 [33] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, 2018.
 [34] L. Zhang and P. N. Suganthan, “Random forests with ensemble of feature spaces,” Pattern Recognition, vol. 47, no. 10, pp. 3429–3437, 2014.
 [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [36] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
 [37] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification using distant supervision,” Stanford, Tech. Rep., 2009.
 [38] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
 [39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [40] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.
 [41] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
 [42] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems, 2018, pp. 6389–6399.