LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks

Swalpa Kumar Roy, Suvojit Manna, Shiv Ram Dubey, and Bidyut B. Chaudhuri. S. K. Roy and B. B. Chaudhuri are with the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata 700108, India (email: swalpa@cse.jgec.ac.in, bbc@isical.ac.in). S. Manna is a Data Scientist with CureSkin, Bengaluru, Karnataka 560102, India (email: suvojit@heallo.ai). S. R. Dubey is with the Computer Vision Group, Indian Institute of Information Technology, Sri City, Andhra Pradesh 517646, India (email: srdubey@iiits.in). Corresponding author: S. K. Roy (email: swalpa@students.iiests.ac.in). This paper is submitted to IEEE for possible publication. Copyright may be transferred to IEEE.
Abstract

The activation function is one of the important components of a neural network: it introduces the non-linearity that makes deep training possible. However, because of their hard zero rectification, some of the existing activation functions, such as ReLU and Swish, fail to utilize the negative input values and may suffer from the dying gradient problem. Thus, it is important to look for a better activation function which is free from such problems. As a remedy, this paper proposes a new non-parametric activation function, called the Linearly Scaled Hyperbolic Tangent (LiSHT), for Neural Networks (NNs). The proposed LiSHT activation function scales the non-linear Hyperbolic Tangent (Tanh) function by a linear function to tackle the dying gradient problem. Training and classification experiments are performed over the benchmark Car Evaluation, Iris, MNIST, CIFAR-10, CIFAR-100 and twitter140 datasets to show that the proposed activation achieves faster convergence and higher performance. A very promising performance improvement is observed on three different types of neural networks, namely the Multi-layer Perceptron (MLP), the Convolutional Neural Network (CNN), and a Recurrent Neural Network in the form of the Long Short-Term Memory (LSTM). The advantages of the proposed activation function are also visualized in terms of feature activation maps, weight distributions and loss landscapes.

Activation, Convolutional Neural Networks, Non-Linearity, Tanh function, Image Classification.

I Introduction

The deep learning method is one of the breakthroughs which replaced hand-tuned approaches in many problems, including computer vision, speech processing, natural language processing, robotics, and many more [1]. In recent times, deep Artificial Neural Networks (ANNs) have shown a tremendous performance improvement due to the availability of large-scale datasets as well as high-end computational resources [2]. Various types of ANN have been proposed for different types of problems: Multilayer Perceptrons (MLPs) [3] deal with real-valued vector data [4], [5], Convolutional Neural Networks (CNNs) deal with images and videos [6], [7], and Recurrent Neural Networks (RNNs) such as the Long Short-Term Memory (LSTM) are used for speech and text classification [8] and sentiment analysis [9], among others.

Recently, a multipath ensemble CNN (ME-CNN) model was proposed which directly concatenates low-level features with high-level output features [10]. In the ME-CNN, back-propagation has a shorter path, which is suitable for training ultra-deep networks. Ensemble Neural Networks (ENNs) were introduced as an efficient stochastic gradient-free optimization method that relies on covariance matrices [11]. Wen et al. proposed a Memristive Fully Convolutional Network (MFCN), an accurate hardware image segmentor in which the memristive hardware carries out the computation [12]. A light gated recurrent unit (LGRU) based neural network was recently proposed by Ravanelli et al. for speech recognition [13]. Recently, deep learning has also been used in various other tasks, such as deep CNNs for audio-visual speech enhancement [14], RCCNet (a CNN) for histological routine colon cancer nuclei classification [15], DFUNet (a CNN) for diabetic foot ulcer classification [16], combining a CNN and an RNN for video dynamics detection [17], dual CNN models for unsupervised monocular depth estimation [18], CNNs for face anti-spoofing [19], and a nonsymmetric deep autoencoder (NDAE) based unsupervised deep learning approach for network intrusion detection [20].

The main aim of any type of neural network is to transform the input data into another feature space, where it becomes linearly separable into classes. In order to achieve this, all neural networks rely on a compulsory unit called the activation function [21]. The job of the activation function is to introduce non-linearity into the network and thus to facilitate automatic learning of the connection weights for a specific problem. Thus, the activation function is the backbone of any neural network.

The Sigmoid activation function was commonly used in the early days of neural networks. It is a special case of the logistic function and squashes real-valued inputs into the range [0, 1]: large negative numbers are mapped close to 0 and large positive numbers close to 1. The hyperbolic tangent (Tanh) function is another popular activation function, whose output values lie between -1 and 1 depending on the input. The vanishing gradient in both the positive and negative directions is one of the major problems with both the Sigmoid and Tanh activation functions. The Rectified Linear Unit (ReLU) activation function was proposed in the recent past for training deep networks [6]. ReLU was a breakthrough against the vanishing gradient: it is the identity function for non-negative inputs and the zero function (i.e., the output is zero) for negative inputs. ReLU has become very popular due to its simplicity and is used in various types of neural networks. The major drawback of ReLU is the diminishing gradient for negative input values; because of it, the units can become fragile during training and the gradient may die.

Several researchers have proposed improvements to ReLU, such as the Leaky ReLU (LReLU) [22], Parametric ReLU (PReLU) [23], Softplus [24], Exponential Linear Unit (ELU) [25], Scaled Exponential Linear Unit (SELU) [26], Gaussian Error Linear Unit (GELU) [27], and Average Biased ReLU (ABReLU) [28]. The LReLU is similar to ReLU, except that it allows a small, constant, non-zero gradient (such as 0.01) for the negative input regime in order to reduce the dying neuron problem [22]. The PReLU extends LReLU by turning the leakage coefficient into a trainable parameter instead of fixing it to a constant value [23]; the leakage parameter of PReLU is learned along with the other network parameters. Thus, both LReLU and PReLU lie between the linear function and ReLU, which is a drawback in terms of the decreased amount of non-linearity. The Softplus activation tries to make the transition of ReLU at zero smooth by fitting a log function [24]; otherwise, it is very similar to ReLU. The ELU is very similar to ReLU in terms of the identity mapping for non-negative inputs [25]. Unlike ReLU, whose output changes sharply at zero, the ELU saturates smoothly for negative inputs until its output equals a constant negative value. For positive inputs, the ELU [25] can blow up the activation, which can lead to the gradient exploding problem. The SELU adds a fixed scaling parameter to ELU, which better sustains bad weight initializations [26]. The GELU randomly applies the identity or zero map to a neuron's input in a Gaussian fashion [27]; its shape lies between ReLU and ELU. The recently proposed ABReLU allows prominent negative values to contribute after converting them into positive values by biasing them with the average of all activations [28]; however, ABReLU still cannot utilize all negative values, due to the trimming of values at zero similar to ReLU. Most of these existing activations are therefore unable to take full advantage of negative values, a limitation addressed by the proposed activation.
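
For reference, the rectifier-family activations discussed above can be written in a few lines of NumPy. This is only an illustrative sketch using commonly quoted default constants (e.g., a 0.01 leakage slope), not code from the cited papers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                               # identity for x >= 0, zero otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)                   # small constant slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))   # smooth saturation towards -alpha

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))                    # x * sigmoid(beta * x)
```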

Recently, Xu et al. performed an empirical study of rectified activations in CNNs [29]. Very recently, Ramachandran et al. proposed a sigmoid-weighted linear unit activation function [30]: after testing many candidate functions, they found $f(x) = x \cdot \mathrm{Sigmoid}(\beta x)$, i.e., Swish, to be a promising activation function. The Swish activation interpolates between the linear function and ReLU by learning the parameter $\beta$. However, the gradient diminishing problem is still present in the case of Swish. Recently, an attempt has also been made to design complex-valued and non-parametric activation functions for complex-valued neural networks (CVNNs) [31]; it relies on a kernel expansion with a fixed dictionary, which can be implemented on vectorized hardware.

The existing activation functions are still too weak to replace ReLU consistently, due to their inconsistent gains. In this paper, a Linearly Scaled Hyperbolic Tangent activation function (LiSHT) is proposed to introduce non-linearity into neural networks. The LiSHT scales the Tanh function linearly to tackle its gradient diminishing problem. Classification experiments are performed over a variety of datasets with three types of neural networks to show the resulting improvement.

Fig. 1: (a) The characteristics of the proposed LiSHT and of commonly used activation functions such as Sigmoid, Tanh, ReLU and Swish. Note that the proposed LiSHT introduces a larger amount of non-linearity than the other activations. Moreover, the proposed LiSHT activation is symmetric w.r.t. the y-axis and converts negative values into positive values; this property helps to avoid the gradient diminishing problem of other activation functions. (b) The 1st-order derivative of the proposed LiSHT activation. It is positive for positive inputs and negative for negative inputs; this property helps to avoid the gradient exploding problem of other activation functions. (c) The 2nd-order derivative of the proposed LiSHT activation. Very importantly, it resembles the negative of the Laplacian operator (i.e., the 2nd-order derivative of a Gaussian), which is used to find maxima. Thus, it makes sense to use an activation function like the proposed LiSHT, whose 2nd-order derivative is similar to the opposite of a Gaussian, for better training of neural networks in terms of minimizing the loss function.

The contributions of this paper are as follows,

  • A new non-parametric activation function, named the Linearly Scaled Hyperbolic Tangent (LiSHT), is proposed by linearly scaling the Tanh activation function.

  • The increased amount of non-linearity of the proposed LiSHT activation function is visualized through its first- and second-order derivative curves (Fig. 1).

  • The proposed LiSHT activation function is tested with different types of neural networks, including a classical Artificial Neural Network (ANN) in the form of a Multilayer Perceptron (MLP), a Convolutional Neural Network (CNN) in the form of a Residual Neural Network (ResNet), and a Recurrent Neural Network (RNN) in the form of a Long Short-Term Memory (LSTM) network.

  • Rigorous experiments and analysis are performed for some classification problems.

  • Three different types of experimental data are used: 1) one-dimensional data, including the Car, Iris and MNIST (converted from images) datasets; 2) 2-D image data, including the MNIST, CIFAR-10 and CIFAR-100 datasets; and 3) sentiment analysis data, namely the twitter140 dataset.

  • The impact of different non-linearity functions on the activation feature maps and weight distributions is analyzed.

This paper is organized as follows: Section 2 outlines the proposed activation; Section 3 describes the experimental setup; Section 4 presents the results and analysis; and Section 5 contains the concluding remarks.

II Proposed LiSHT Activation Function

A feed-forward Deep Neural Network (DNN) comprises more than one hidden non-linear layer. Let the input vector be $\mathbf{x}$. Each hidden layer transforms its input vector by applying a linear affine transform followed by a non-linear mapping from the $l^{th}$ layer to the $(l+1)^{th}$ layer as follows:

$\mathbf{y}^{l+1} = \phi\left(\mathbf{W}^{l}\mathbf{y}^{l} + \mathbf{b}^{l}\right)$   (1)

Here, $\mathbf{y}^{l}$, $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ represent the output vector, the weight matrix and the bias vector of the $l^{th}$ hidden layer (each layer having its own number of units), respectively, and $\phi(\cdot)$ is a non-linear activation mapping. Looking for an efficient and powerful activation function for DNNs remains in demand because of the saturation properties of many existing activation functions. An activation function is said to saturate [32] if its derivative tends to zero in both directions (i.e., $x \to +\infty$ and $x \to -\infty$). Training a deep neural network is almost impossible with the Sigmoid and Tanh activation functions due to the gradient diminishing problem when the input is either too small or too large [2]. The Rectified Linear Unit (ReLU), i.e., $\max(0, x)$, was the first activation to become very popular for training DNNs [6]. However, ReLU also suffers from the gradient diminishing problem for negative inputs, which leads to the dying neuron problem.
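
As an illustration of Eq. (1), a single hidden-layer transform can be sketched as follows; the shapes, the random initialization and the choice of Tanh as $\phi$ are illustrative only.

```python
import numpy as np

def hidden_layer(y_prev, W, b, phi):
    """One layer of Eq. (1): y^{l+1} = phi(W^l y^l + b^l)."""
    return phi(W @ y_prev + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # input vector y^0
W = 0.1 * rng.normal(size=(8, 4))      # weight matrix of an 8-unit hidden layer
b = np.zeros(8)                        # bias vector
h = hidden_layer(x, W, b, np.tanh)     # forward pass through the hidden layer
```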

Hence, we propose a non-parametric linearly scaled hyperbolic tangent activation function, called LiSHT. Like ReLU [6] and Swish [30, 33], LiSHT has an unbounded upper limit on the right-hand side of the activation curve. However, because of its symmetry-preserving property, the left-hand side of the LiSHT activation curve is also unbounded in the upward direction, and hence LiSHT is non-monotonic (see Fig. 1(a)). In contrast to the earlier literature [25, 30] and to the best of our knowledge, LiSHT is the first activation function that utilizes the benefits of a purely positive-valued activation without identically propagating all the inputs, which mitigates gradient vanishing during back-propagation and allows faster training of deep neural networks. The proposed LiSHT activation function is computed by multiplying the Tanh function with its input and is defined as

$\phi(x) = x \cdot \tanh(x)$   (2)

where $\tanh(\cdot)$ is the hyperbolic tangent function, defined as

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$   (3)
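
Equations (2)-(3) amount to a one-line function; the NumPy sketch below is illustrative and not the authors' released implementation.

```python
import numpy as np

def lisht(x):
    """LiSHT activation of Eq. (2): phi(x) = x * tanh(x)."""
    return x * np.tanh(x)

# Negative inputs map to positive outputs, e.g. lisht(-2.0) == lisht(2.0) ~= 1.928,
# which is the symmetry visible in Fig. 1(a).
```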

For large positive inputs, the behavior of LiSHT is close to that of ReLU and Swish, i.e., the output is close to the input, as depicted in Fig. 1(a). Unlike ReLU, Swish and other commonly used activation functions, the output of LiSHT for negative inputs is symmetric to its output for positive inputs, as illustrated in Fig. 1(a). The 1st-order derivative of LiSHT is given as

$\phi'(x) = \tanh(x) + x\left(1 - \tanh^{2}(x)\right)$   (4)

Similarly, the 2nd-order derivative of LiSHT is given as

$\phi''(x) = 2\left(1 - \tanh^{2}(x)\right)\left(1 - x\tanh(x)\right)$   (5)

The 1st- and 2nd-order derivatives of the proposed LiSHT are plotted in Fig. 1(b) and Fig. 1(c), respectively. An attractive characteristic of LiSHT is its self-stability property: the magnitude of its derivative is less than 1 for small inputs. It can be observed from the derivatives of LiSHT in Fig. 1 that the amount of non-linearity is very high near zero compared to existing activations, which can boost the learning of a complex model. Another major advantage of the proposed activation is its nearly linear nature for very small and very large inputs, which resolves the gradient diminishing problem of existing methods. It can be seen in Fig. 1(b) that the gradients of the proposed activation function take both positive and negative values, which tackles the exploding gradient problem. As described in Fig. 1(c), the 2nd-order derivative of the proposed activation function resembles the negative of the Laplacian operator (i.e., the 2nd-order derivative of a Gaussian), which is useful for finding a maximum. Thus, due to this opposite-of-Gaussian nature, the proposed activation function boosts the training of the neural network in terms of minimizing the loss function.
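
The closed forms in Eqs. (4)-(5) can be checked numerically; the sketch below compares them against central finite differences (an illustrative check, not part of the reported experiments).

```python
import numpy as np

def lisht_grad(x):
    t = np.tanh(x)
    return t + x * (1.0 - t ** 2)                   # Eq. (4)

def lisht_grad2(x):
    t = np.tanh(x)
    return 2.0 * (1.0 - t ** 2) * (1.0 - x * t)     # Eq. (5)

x = np.linspace(-4.0, 4.0, 81)
eps = 1e-5
numeric1 = ((x + eps) * np.tanh(x + eps) - (x - eps) * np.tanh(x - eps)) / (2.0 * eps)
numeric2 = (lisht_grad(x + eps) - lisht_grad(x - eps)) / (2.0 * eps)
assert np.allclose(lisht_grad(x), numeric1, atol=1e-6)    # Eq. (4) matches the finite difference
assert np.allclose(lisht_grad2(x), numeric2, atol=1e-6)   # Eq. (5) matches the finite difference
```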

Fig. 2: Random sample images taken from each category of MNIST (left) and CIFAR-10 (right) dataset. CIFAR-10 images are taken from https://www.cs.toronto.edu/kriz/cifar.html.

Theoretically, it is difficult to prove why an activation function works well; its effectiveness may be due to many intertwined factors that directly affect DNN training. We believe that being unbounded in both the positive and negative directions, smooth, and non-monotonic are the advantages of the proposed activation. The completely unbounded property makes LiSHT different from all the traditional activation functions. Moreover, LiSHT makes strong use of the positive feature space, which is suitable for assessing the “information content” of features in complex classification tasks. Unlike ReLU, LiSHT is a smooth and non-monotonic function. In addition, LiSHT is a symmetric function and introduces a larger amount of non-linearity into the training process than the commonly used activations.

III Experimental Setup

This section is devoted to the experimental setup used in this paper. First, six datasets are described in detail, then the three types of networks are summarized, and finally the training settings are stated in detail.

III-A Datasets Used

The effectiveness of the proposed LiSHT activation function is evaluated on six benchmark databases: Car Evaluation, Iris, MNIST, CIFAR-10, CIFAR-100 and twitter140. The Car Evaluation dataset is used to test the performance of the activation functions with the multilayer perceptron; it contains 4 classes with a total of 1,728 samples having 6-dimensional features [34]. Fisher's Iris Flower dataset [34] (C. Blake and C. Merz, UCI Repository of Machine Learning Databases) is a classic and one of the best known multivariate datasets; it contains a total of 150 samples from three species of Iris (Iris setosa, Iris virginica and Iris versicolor). In the Iris dataset, every sample is a vector of length four (“sepal length”, “sepal width”, “petal length” and “petal width”). The MNIST dataset is widely used for recognizing English digits from images; it contains a total of 60,000 training samples and 10,000 test samples of dimension 28×28 [35]. Each image contains only one digit, from 0 to 9. Sample images of the MNIST dataset are shown in Fig. 2. The CIFAR-10 dataset consists of 32×32-resolution color images from 10 classes, with 6,000 images per class [36]; 50,000 images (5,000 per class) are used for training and 10,000 images (1,000 per class) are used for testing. The classes of the CIFAR-10 dataset are ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’ and ‘truck’. See Fig. 2 for sample images of the CIFAR-10 dataset. The CIFAR-100 dataset is similar to CIFAR-10 (i.e., 50,000 training and 10,000 test images), except that the images are categorized into 100 classes. The twitter140 dataset [37] is used to classify the sentiment of Twitter messages as either positive, negative or neutral with respect to a query. From this dataset, we consider 1,600,000 examples, where 85% are used as the training set and the remaining 15% as the validation set.

Fig. 3: A multi-layer perceptron (MLP) with input units, output units and hidden units, where superscript represents a single hidden layer.
Fig. 4: The pre-activation residual module: BN, LiSHT and Weight represent the batch normalization, activation and convolutional (Conv) layers, respectively. The residual module adds its input volume to the output volume of the second Conv layer to produce the final output volume.
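
A Keras sketch of the pre-activation residual module of Fig. 4 is given below; the filter count and kernel size are placeholders, and the block assumes the input and output volumes have the same shape (no projection shortcut).

```python
import tensorflow as tf
from tensorflow.keras import layers

def lisht(x):
    return x * tf.math.tanh(x)

def preact_residual_block(x_in, filters=16):
    """BN -> LiSHT -> Conv, twice, then add the identity shortcut (Fig. 4)."""
    x = layers.BatchNormalization()(x_in)
    x = layers.Activation(lisht)(x)
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation(lisht)(x)
    x = layers.Conv2D(filters, 3, padding='same')(x)
    return layers.Add()([x_in, x])      # input volume + output of the second Conv layer
```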

III-B Tested Neural Networks

Three different types of neural networks (i.e., a Multi-layer Perceptron (MLP) with one hidden layer, the widely used deep convolutional pre-activation Residual Network (ResNet-PreAct) [38], and a Long Short-Term Memory (LSTM) network) are tested to evaluate the performance of the activation functions. These architectures are explained in this section. The MLP with one hidden layer is used for data classification; its typical structure is illustrated in Fig. 3, and a minimal sketch of it follows at the end of this subsection. The sizes of its input and output layers are set according to the number of features and classes of each dataset; the MNIST images are stretched into one-dimensional vectors of length 784 when used with the MLP. The Residual Neural Network (ResNet) is the state of the art for image classification. Here, the improved version of ResNet (i.e., ResNet with pre-activation [38]) is used for image classification over the MNIST, CIFAR-10 and CIFAR-100 datasets. The ResNet-PreAct is used with 164 layers (i.e., a very deep network) for the CIFAR-10 and CIFAR-100 datasets, and with 20 layers for the MNIST dataset. The pre-activation residual block is illustrated in Fig. 4. Per-channel pixel mean subtraction is used to preprocess the image datasets, as per the standard practice followed by most image classification networks. Finally, the Long Short-Term Memory (LSTM) network, which belongs to the Recurrent Neural Network (RNN) family, is used as the third type of neural network. A single-layered LSTM is used for sentiment analysis over the twitter140 dataset and is fed with word vectors trained with FastText embeddings.
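
A minimal Keras sketch of the single-hidden-layer MLP with LiSHT is shown below; the hidden-layer width is a placeholder, and the input/output sizes shown correspond to the Iris dataset (4 features, 3 classes).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def lisht(x):
    return x * tf.math.tanh(x)

inputs = layers.Input(shape=(4,))                       # 4 Iris features
hidden = layers.Dense(64, activation=lisht)(inputs)     # single hidden layer (width assumed)
outputs = layers.Dense(3, activation='softmax')(hidden) # 3 Iris classes
mlp = models.Model(inputs, outputs)
mlp.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```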

III-C Training Settings

The Keras deep learning Python library with a TensorFlow back-end is used to implement the LiSHT activation function. Different computer systems, with different GPUs (an NVIDIA Titan X Pascal 12GB GPU and an NVIDIA Titan V 12GB GPU), are used at different stages of the experiments. The Adam optimizer [39] is used for all the experiments in this paper, with a fixed batch size for training the networks. The learning rate is reduced by a constant factor at several pre-set epochs during training. For the LSTM, the learning rate is likewise dropped by a constant factor after a fixed number of mini-batch iterations.
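
A sketch of this training setup is shown below; the initial learning rate, drop factor, boundary epochs and batch size are placeholders rather than the exact values used in the experiments.

```python
import tensorflow as tf

LR_INIT, DROP, BOUNDARIES = 1e-3, 0.1, (80, 120, 160, 180)   # placeholder schedule values

def lr_schedule(epoch, lr):
    """Step-wise decay: multiply the initial rate by DROP after each boundary epoch."""
    new_lr = LR_INIT
    for boundary in BOUNDARIES:
        if epoch >= boundary:
            new_lr *= DROP
    return new_lr

optimizer = tf.keras.optimizers.Adam(learning_rate=LR_INIT)
callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
# model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=128, epochs=200, callbacks=callbacks)
```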

Dataset   | Activation | Train Loss | Train Acc. (%) | Val. Loss | Val. Acc. (%)
Car Eval. | Tanh       | 0.0341     | 98.84          | 0.0989    | 96.40
Car Eval. | Sigmoid    | 0.0253     | 98.77          | 0.1110    | 96.24
Car Eval. | ReLU       | 0.0285     | 99.10          | 0.0769    | 97.40
Car Eval. | Swish      | 0.0270     | 99.13          | 0.0790    | 97.11
Car Eval. | LiSHT      | 0.0250     | 99.28          | 0.0663    | 97.98
Iris      | Tanh       | 0.0937     | 97.46          | 0.0898    | 96.26
Iris      | Sigmoid    | 0.0951     | 97.83          | 0.0913    | 96.23
Iris      | ReLU       | 0.0983     | 98.33          | 0.0886    | 96.41
Iris      | Swish      | 0.0953     | 98.50          | 0.0994    | 96.34
Iris      | LiSHT      | 0.0926     | 98.67          | 0.0862    | 97.33
MNIST     | Tanh       | 0.0138     | 99.56          | 0.0987    | 98.26
MNIST     | Sigmoid    | 0.0064     | 99.60          | 0.0928    | 98.43
MNIST     | ReLU       | 0.0192     | 99.51          | 0.1040    | 98.48
MNIST     | Swish      | 0.0159     | 99.58          | 0.1048    | 98.45
MNIST     | LiSHT      | 0.0127     | 99.68          | 0.0915    | 98.60
TABLE I: The classification performance of the MLP for different activations over the Car Evaluation, Iris and MNIST datasets.
Dataset   | ResNet Depth | Tanh  | ReLU  | Swish | LiSHT
MNIST     | 20           | 99.48 | 99.56 | 99.53 | 99.59
CIFAR-10  | 164          | 89.74 | 91.15 | 91.60 | 92.92
CIFAR-100 | 164          | 68.80 | 72.84 | 74.45 | 75.32
TABLE II: The classification performance of ResNet for different activations over MNIST and CIFAR-10/100 datasets.
Dataset    | Tanh  | ReLU  | Swish | LiSHT
Twitter140 | 82.27 | 82.47 | 82.22 | 82.47
TABLE III: The classification performance of LSTM for different activations over the twitter140 dataset.

IV Results and Analysis

We investigate the performance and effectiveness of the proposed LiSHT activation and compare it with state-of-the-art activation functions such as Tanh, ReLU and Swish (and Sigmoid for the MLP experiments). In this section, first the experimental results using the MLP, ResNet and LSTM are presented; then the results are analyzed in terms of accuracy and loss versus epochs; next, the feature activation maps and weight distributions are analyzed for the different activation functions; and finally the effect of the activation functions is analyzed using the loss landscape.

IV-A Data Classification Results using MLP

The classification performance using the MLP over the Car Evaluation, Iris and MNIST datasets is reported in Table I in terms of the loss and accuracy over the training and validation sets. The training is performed with the categorical cross-entropy loss. In order to run the training smoothly, a portion of the samples of each dataset is randomly chosen for training and the remaining samples are used for validation. The proposed LiSHT activation achieves the minimum training and validation loss over all three datasets. It also achieves the best validation accuracies of 97.98%, 97.33% and 98.60% over the Car Evaluation, Iris and MNIST datasets, respectively.

Fig. 5: The classification accuracy on the training and validation sets using the proposed LiSHT and the state-of-the-art Tanh, ReLU and Swish activations with the ResNet model over the (a)-(b) MNIST, (c)-(d) CIFAR-10 and (e)-(f) CIFAR-100 datasets.
Fig. 6: The convergence curves in terms of loss on the training and validation sets using the proposed LiSHT and the state-of-the-art Tanh, ReLU and Swish activations with the ResNet model over the (a)-(b) MNIST, (c)-(d) CIFAR-10 and (e)-(f) CIFAR-100 datasets.
Fig. 7: (a)-(b) The training loss and training accuracy, and (c)-(d) the validation loss and validation accuracy, using the LSTM over the twitter140 dataset for different activations.
Fig. 8: Visualization of the activation feature maps of an MNIST digit at an intermediate layer of a fully trained pre-activation ResNet model, without feature scale clipping, using the (a) Tanh, (b) ReLU, (c) Swish and (d) LiSHT activations, respectively.
Fig. 9: Visualization of the activation feature maps of an MNIST digit at a second intermediate layer of a fully trained pre-activation ResNet model, without feature scale clipping, using the (a) Tanh, (b) ReLU, (c) Swish and (d) LiSHT activations, respectively.
Fig. 10: Visualization of the distribution of the weights of the final layer of the pre-activation ResNet trained over the MNIST dataset for the (a) Tanh, (b) ReLU, (c) Swish and (d) LiSHT activations, respectively.

IV-B Image Classification Results using ResNet

Table II summarizes the validation accuracies of the different activations over the MNIST, CIFAR-10 and CIFAR-100 datasets with the pre-activation ResNet. The depth of the ResNet is 20 for MNIST and 164 for the CIFAR datasets. The training is performed with the categorical cross-entropy loss. It is observed that the proposed LiSHT activation function outperforms the other activation functions, with accuracies of 99.59%, 92.92% and 75.32% over the MNIST, CIFAR-10 and CIFAR-100 datasets, respectively. Moreover, a significant improvement is shown by LiSHT on the CIFAR datasets. The unbounded, symmetric and more non-linear properties of the proposed LiSHT activation function facilitate better and more efficient training compared to the other activation functions such as Tanh, ReLU and Swish. The unbounded and symmetric nature of LiSHT leads to a wider exploration of weights and to both positive and negative gradients, which tackles the gradient diminishing and exploding problems.

IV-C Sentiment Classification Results using LSTM

The sentiment classification performance in terms of validation accuracy over the twitter140 dataset with the LSTM is reported in Table III for different activations. It is observed that the performance of the proposed LiSHT activation function is better than that of Tanh and Swish, and the same as that of ReLU. This points to one important observation: treating the negative values as negative, as Swish does, degrades the performance here, because it pushes the activation more towards the linear function as compared to the ReLU activation.

IV-D Result Analysis

The training and validation accuracies over the epochs are plotted in Fig. 5(a)-(f) for the MNIST, CIFAR-10 and CIFAR-100 datasets using ResNet. The convergence curves of the losses are also used as a metric to measure the learning ability of the ResNet model with different activation functions. The training and validation losses over the epochs are plotted in Fig. 6(a)-(f) for the MNIST, CIFAR-10 and CIFAR-100 datasets using ResNet. It can be clearly seen that the proposed LiSHT activation improves the convergence speed compared to the other activations. Fig. 7 shows the convergence of the training and validation loss and accuracy for the LSTM network using different activations over the twitter140 sentiment classification dataset. It is clearly observed that the proposed LiSHT boosts the convergence speed. It is also observed that LiSHT outperforms the existing non-linearities across several classification tasks with the MLP, ResNet and LSTM networks. The training and validation losses converge faster using the LiSHT activation.

Fig. 11: Visualization of the 2D (top row: (a) ReLU, (b) Swish, (c) LiSHT) and 3D (bottom row: (d) ReLU, (e) Swish, (f) LiSHT) loss landscape plots on CIFAR-10.

IV-E Analysis of Activation Feature Maps

In deep learning, it is common practice to visualize the activations of the different layers of the network. In order to understand the effect of the activation functions on the learning of important features at different layers, we show the activation feature maps of an MNIST digit for the different non-linearities at two layers of the pre-activation ResNet in Fig. 8 and Fig. 9, respectively. It can be seen from Fig. 8 and Fig. 9 that the feature maps appearing deep blue are caused by the dying neuron problem, i.e., the non-learnable behavior arising from improper handling of negative values by the activation functions. The proposed LiSHT activation consistently outperforms the other activations: it generates a smaller number of non-learnable filters due to its unbounded nature in both the positive and negative regions, which helps it to overcome the dying gradient problem. It is also observed that some feature-map patches contain noise, visible as yellow regions. The patches corresponding to LiSHT contain less noise, and the noise is more uniformly distributed over all the patches when LiSHT is used compared to the other activation functions. This may be another factor behind the proposed LiSHT outperforming the other activations.
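
The feature maps in Figs. 8 and 9 were extracted from intermediate layers of the trained model; a generic Keras sketch of this kind of probe is given below, where the layer name and the grid width are placeholders.

```python
import tensorflow as tf
import matplotlib.pyplot as plt

def show_feature_maps(model, image, layer_name, cols=8):
    """Plot the activation feature maps of one layer for a single input image."""
    probe = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    fmaps = probe(image[None, ...]).numpy()[0]        # shape (H, W, channels)
    n = fmaps.shape[-1]
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        if i < n:
            ax.imshow(fmaps[..., i], cmap='viridis')  # near-zero (dead) maps appear uniformly dark
        ax.axis('off')
    plt.show()
```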

IV-F Analysis of Final Weight Distribution

The weights of the layers are useful to visualize because they give an idea about the learning pattern of the network in terms of 1) positive or negative biasedness and 2) the exploration of weights induced by the activation functions. It is also expected that well-trained networks usually display well-normalized and smooth filters without many noisy patterns. These characteristics are usually most interpretable in the first layer, which looks directly at the raw pixel data, but it is also possible to inspect the filter weights deeper in the network to gain intuition about its abstract-level learning. The learning of strong filter weights is highly dependent upon the non-linearity used for training [40]. We show the weight distribution of the final layer of the pre-activation ResNet trained on the MNIST dataset in Fig. 10 for the Tanh, ReLU, Swish and LiSHT activations. The weight distribution for Tanh is limited to a narrow range (see Fig. 10(a)) due to its bounded nature in both the negative and positive regions. Interestingly, as depicted in Fig. 10(b), the weight distribution for ReLU is biased towards the positive region because ReLU converts all negative values to zero, which restricts the learning of weights in the negative direction; this leads to the problems of dying gradients as well as gradient explosion. Swish tries to overcome the problems of ReLU but does not fully succeed, due to its bounded nature in the negative region (see Fig. 10(c)). The above-mentioned problems are removed by LiSHT, as suggested by its weight distribution shown in Fig. 10(d). The LiSHT activation leads to a symmetric and smoother weight distribution; moreover, it also allows the exploration of weights over a wider range (see the example of Fig. 10).
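
The histograms of Fig. 10 can be reproduced for any trained Keras model with a few lines of Matplotlib; the sketch below is illustrative and simply pulls the kernel of the last layer.

```python
import matplotlib.pyplot as plt

def plot_final_weight_distribution(model, bins=100):
    """Histogram of the weight values of the final (classification) layer."""
    kernel = model.layers[-1].get_weights()[0]    # weight matrix of the last layer
    plt.hist(kernel.ravel(), bins=bins)
    plt.xlabel('weight value')
    plt.ylabel('count')
    plt.title('Final-layer weight distribution')
    plt.show()
```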

IV-G Analysis of Loss Landscape

Training a deep neural network requires minimizing a non-convex, high-dimensional loss function, which is a hard task in theory but sometimes easier in practice. Simple gradient descent methods [41] often find global minima or saddle points where the training loss reaches zero or close to zero, even when the data and labels are randomized before training. This behavior is desirable, but not universal. The trainability of a DNN is directly and indirectly influenced by factors such as the network architecture, the choice of optimizer, the variable initialization and, most importantly, the kind of non-linearity used in the architecture. In order to understand these effects on non-convexity, we trained the ResNet-152 using the ReLU, Swish and proposed LiSHT activations and explored the structure of the resulting loss landscapes. The 2D and 3D visualizations of the loss landscapes are illustrated in Fig. 11, following the visualization technique proposed by Li et al. [42].

As depicted in the 2D loss landscape visualizations in Fig. 11(a)-(c), LiSHT makes the network produce a smoother loss landscape with smaller convergence steps, populated by narrow, convex regions, which directly impacts the loss landscape. While ReLU and Swish also produce smooth loss landscapes with large convergence steps, unlike LiSHT they cover a wider search area, which leads to poorer training behavior. In the 3D landscape visualizations in Fig. 11(d)-(f), it can be observed that the slope of the LiSHT loss landscape is steeper than that of ReLU and Swish, which enables efficient training of the deep network. Therefore, we can say that LiSHT decreases the non-convex nature of the overall loss minimization landscape compared to the ReLU and Swish activation functions.

V Conclusion

In this paper, a novel non-parametric Linearly Scaled Hyperbolic Tangent activation function (LiSHT) is proposed for training neural networks. The proposed LiSHT activation function introduces more non-linearity into the network. It is completely unbounded and addresses the diminishing gradient problem. Other properties of LiSHT are its symmetry, smoothness and non-monotonicity, which play important roles in training. The classification results are compared with those of state-of-the-art activation functions using MLP, ResNet and LSTM models over benchmark datasets. The performance is tested for data classification, image classification and sentiment classification problems. The experimental results confirm the effectiveness of the unbounded, symmetric and highly non-linear nature of the proposed activation function. The importance of an unbounded and symmetric non-linearity in both the positive and negative regions is analyzed in terms of the activation maps and the weight distribution of the learned network. A positive correlation is observed between the network learning and the proposed unbounded and symmetric activation function. The visualization of the loss landscape confirms the effectiveness of the proposed activation in making the training smoother with faster convergence.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan X Pascal GPU used partially in this research.

References

  • [1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
  • [2] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.    MIT press Cambridge, 2016, vol. 1.
  • [3] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [4] J. Tang, C. Deng, and G.-B. Huang, “Extreme learning machine for multilayer perceptron,” IEEE transactions on neural networks and learning systems, vol. 27, no. 4, pp. 809–821, 2016.
  • [5] L.-W. Kim, “Deepx: Deep learning accelerator for restricted boltzmann machine artificial neural networks,” IEEE transactions on neural networks and learning systems, vol. 29, no. 5, pp. 1441–1453, 2018.
  • [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [8] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • [9] Y. Wang, M. Huang, L. Zhao et al., “Attention-based lstm for aspect-level sentiment classification,” in Proceedings of the 2016 conference on empirical methods in natural language processing, 2016, pp. 606–615.
  • [10] X. Wang, A. Bao, Y. Cheng, and Q. Yu, “Multipath ensemble convolutional neural network,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
  • [11] Y. Chen, H. Chang, M. Jin, and D. Zhang, “Ensemble neural networks (enn): A gradient-free stochastic method,” Neural Networks, 2018.
  • [12] S. Wen, H. Wei, Z. Zeng, and T. Huang, “Memristive fully convolutional network: An accurate hardware image-segmentor in deep learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 5, pp. 324–334, 2018.
  • [13] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Light gated recurrent units for speech recognition,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, 2018.
  • [14] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual speech enhancement using multimodal deep convolutional neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
  • [15] S. Basha, S. Ghosh, K. K. Babu, S. R. Dubey, V. Pulabaigari, S. Mukherjee et al., “Rccnet: An efficient convolutional neural network for histological routine colon cancer nuclei classification,” arXiv preprint arXiv:1810.02797, 2018.
  • [16] M. Goyal, N. D. Reeves, A. K. Davison, S. Rajbhandari, J. Spragg, and M. H. Yap, “Dfunet: Convolutional neural networks for diabetic foot ulcer classification,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
  • [17] K. Zheng, W. Q. Yan, and P. Nand, “Video dynamics detection using deep neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 3, pp. 224–234, 2018.
  • [18] V. K. Repala and S. R. Dubey, “Dual cnn models for unsupervised monocular depth estimation,” arXiv preprint arXiv:1804.06324, 2018.
  • [19] C. Nagpal and S. R. Dubey, “A performance evaluation of convolutional neural networks for face anti spoofing,” arXiv preprint arXiv:1805.04176, 2018.
  • [20] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A deep learning approach to network intrusion detection,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
  • [21] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.
  • [22] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [24] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  • [25] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
  • [26] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 971–980.
  • [27] D. Hendrycks and K. Gimpel, “Bridging nonlinearities and stochastic regularizers with gaussian error linear units,” arXiv preprint arXiv:1606.08415, 2016.
  • [28] S. R. Dubey and S. Chakraborty, “Average biased relu based cnn descriptor for improved face retrieval,” arXiv preprint arXiv:1804.02051, 2018.
  • [29] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
  • [30] P. Ramachandran, B. Zoph, and Q. V. Le, “Swish: a self-gated activation function,” arXiv preprint arXiv:1710.05941, 2017.
  • [31] S. Scardapane, S. Van Vaerenbergh, A. Hussain, and A. Uncini, “Complex-valued neural networks with non-parametric activation functions,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
  • [32] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 315–323.
  • [33] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, 2018.
  • [34] L. Zhang and P. N. Suganthan, “Random forests with ensemble of feature spaces,” Pattern Recognition, vol. 47, no. 10, pp. 3429–3437, 2014.
  • [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [36] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [37] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification using distant supervision,” Technical report, Stanford, 2009.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision.    Springer, 2016, pp. 630–645.
  • [39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [40] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision.    Springer, 2014, pp. 818–833.
  • [41] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
  • [42] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems, 2018, pp. 6389–6399.