Locally adaptive activation functions with slope recovery term for deep and physics-informed neural networks
Abstract
We propose two approaches of locally adaptive activation functions, namely layerwise and neuronwise locally adaptive activation functions, which improve the performance of deep and physics-informed neural networks. The local adaptation of the activation function is achieved by introducing a scalable hyperparameter in each layer (layerwise) or for every neuron separately (neuronwise), and then optimizing it using a stochastic gradient descent algorithm. The neuronwise activation function acts like a vector activation function, as opposed to the traditional scalar activation functions given by fixed, global, and layerwise activations. To further increase the training speed, a slope recovery term based on the activation slope is added to the loss function, which accelerates convergence and thereby reduces the training cost. In the numerical experiments, a nonlinear discontinuous function is approximated using a deep neural network with layerwise and neuronwise locally adaptive activation functions, with and without the slope recovery term, and compared with its global counterpart. Moreover, the solution of the nonlinear Burgers equation, which exhibits steep gradients, is also obtained using the proposed methods. On the theoretical side, we prove that in the proposed method the gradient descent algorithms are not attracted to suboptimal critical points or local minima under practical conditions on the initialization and learning rate. Furthermore, the proposed adaptive activation functions with the slope recovery term are shown to accelerate the training process in standard deep learning benchmarks using the CIFAR-10, CIFAR-100, SVHN, MNIST, KMNIST, Fashion-MNIST, and Semeion data sets, with and without data augmentation.
keywords:
Machine learning, bad minima, stochastic gradients, accelerated training, PINN, deep learning benchmarks.
1 Introduction
In recent years, research on neural networks (NNs) has intensified around the world due to their successful applications in many diverse fields such as speech recognition SpeechRecog , computer vision CompVision , natural language translation NLT , etc. A NN is trained on data sets before being used in actual applications. Various data sets are available for applications like image classification, which is a subset of computer vision. MNIST LeCun and its variants, Fashion-MNIST FMNIST and KMNIST KMNIST , are data sets of handwritten digits, images of clothing and accessories, and Japanese letters, respectively. Apart from MNIST, Semeion Semeion is a handwritten digit data set that contains 1593 digits collected from 80 persons. SVHN SVHN is another data set for street view house numbers, obtained from house numbers in Google Street View images. CIFAR CIFAR is a popular data set of color images commonly used to train machine learning algorithms. In particular, the CIFAR-10 data set contains 50000 training and 10000 testing images in 10 classes with an image resolution of 32x32. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes with 600 images in each class, which makes it more challenging than the CIFAR-10 data set.
Recently, NNs have also been used to solve partial differential equations (PDEs) due to their ability to effectively approximate complex functions arising in diverse scientific disciplines (cf. Physics-informed Neural Networks (PINNs) by Raissi et al. RK1 ). PINNs can accurately solve both forward problems, where the approximate solutions of governing equations are obtained, as well as inverse problems, where parameters involved in the governing equation are inferred from the training data. In the PINN algorithm, along with the contribution from the neural network, the loss function is enriched by the addition of the residual term from the governing equation, which acts as a penalizing term constraining the space of admissible solutions.
Highly efficient and adaptable algorithms are important to design the most effective NN, which not only increases the accuracy of the solution but also reduces the training cost. Various NN architectures, like Dropout NN DropoutNN , have been proposed in the literature and can improve the efficiency of the algorithm for specific applications. The activation function plays an important role in the training process of a NN. In this work, we focus in particular on adaptive activation functions, which adapt automatically so that the network can be trained faster. Various methods have been proposed in the literature for adaptive activation functions, like the adaptive sigmoidal activation function proposed by Yu et al. Yu for multilayer feedforward NNs, while Qian et al. Qian focus on learning activation functions in convolutional NNs by combining basic activation functions in a data-driven way. Multiple activation functions per neuron are proposed by Dushkoff and Ptucha MiRa , where individual neurons select between a multitude of activation functions. Li et al. LLR proposed a tunable activation function, where only a single hidden layer is used and the activation function is tuned. Shen et al. SWCC used a similar idea of a tunable activation function but with multiple outputs. Recently, Kunc and Klíma proposed transformative adaptive activation functions for gene expression inference, see KK . One such adaptive activation function was proposed by Jagtap and Karniadakis JK by introducing a scalable hyperparameter in the activation function, which can be optimized. Mathematically, it changes the slope of the activation function, thereby accelerating the learning process, especially during the initial training period. Due to the single scalar hyperparameter, we call such adaptive activation functions globally adaptive activations, meaning that a single optimized slope is used for the entire network.
One can instead perform such optimization at the local level, where scalable hyperparameters are introduced for each hidden layer, or even for each neuron in the network. Such local adaptation can further improve the performance of the network.
Figure 1 shows a sketch of a neuronwise locally adaptive activation function based physics-informed neural network (LAAF-PINN), where both the NN part and the physics-informed part can be seen. In this architecture, along with the output of the NN and the residual term from the governing equation, the activation slope of every neuron also contributes to the loss function in the form of the slope recovery term. In every neuron, a scalable hyperparameter is introduced, which can be optimized to yield a quick and efficient training process.
The rest of the paper is organized as follows. Section 2 presents the methodology of the proposed layerwise and neuronwise locally adaptive activations in detail. This also includes a discussion of the slope recovery term, the expansion of the parametric space due to the layerwise and neuronwise introduction of hyperparameters, its effect on the overall training cost, and a theoretical result for gradient descent algorithms. In section 3, we perform various numerical experiments, where we approximate a nonlinear discontinuous function using a deep NN with the proposed approaches. We also solve the Burgers equation using the proposed algorithm and present various comparisons with fixed and global activation functions. Section 4 presents numerical results with various standard deep learning benchmarks using the CIFAR-10, CIFAR-100, SVHN, MNIST, KMNIST, Fashion-MNIST, and Semeion data sets. Finally, in section 5, we summarize the conclusions of our work.
2 Methodology
We use a NN of depth $D$, corresponding to a network with an input layer, $D-1$ hidden layers, and an output layer. The $k$-th hidden layer contains $N_k$ neurons. Each hidden layer of the network receives the output $z^{k-1} \in \mathbb{R}^{N_{k-1}}$ from the previous layer, to which an affine transformation of the form
$$\mathcal{L}_k(z^{k-1}) \triangleq w^k z^{k-1} + b^k \qquad (1)$$
is performed. The network weights $w^k \in \mathbb{R}^{N_k \times N_{k-1}}$ and bias term $b^k \in \mathbb{R}^{N_k}$ associated with the $k$-th layer are chosen from independent and identically distributed sampling. The nonlinear activation function $\sigma(\cdot)$ is applied to each component of the transformed vector before sending it as an input to the next layer; the activation of the output layer is the identity. Thus, the final neural network representation is given by the composition
$$u_{\Theta}(z) = (\mathcal{L}_D \circ \sigma \circ \mathcal{L}_{D-1} \circ \cdots \circ \sigma \circ \mathcal{L}_1)(z),$$
where $\circ$ is the composition operator, $\Theta = \{w^k, b^k\}_{k=1}^{D}$ represents the trainable parameters in the network, $u_{\Theta}$ is the output, and $z$ is the input.
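As a minimal sketch, the composition above can be written in a few lines of NumPy; here tanh is chosen as the activation $\sigma$, and the random initialization scheme is illustrative rather than the one used in the paper:

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """One (w^k, b^k) pair per affine map L_k, sampled i.i.d."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, z):
    """Composition (L_D o sigma o L_{D-1} o ... o sigma o L_1)(z) with
    sigma = tanh; the output layer uses the identity activation."""
    for w, b in params[:-1]:
        z = np.tanh(w @ z + b)
    w, b = params[-1]
    return w @ z + b

params = init_params([1, 50, 50, 1])   # two hidden layers of 50 neurons
y = forward(params, np.array([0.5]))
```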
In supervised learning of solutions of PDEs, training data are essential for training the neural network. They can be obtained from the exact solution (if available), from a high-resolution numerical solution given by an efficient numerical scheme, or even from carefully performed experiments, which may yield both high- and low-fidelity data sets.
We aim to find the optimal parameters for which a suitably defined loss function is minimized. In PINN the loss function is defined as
$$J(\Theta) = \text{MSE}_F + \text{MSE}_u, \qquad (2)$$
where the mean squared error (MSE) terms are given by
$$\text{MSE}_F = \frac{1}{N_f}\sum_{i=1}^{N_f} \big|F(x_f^i)\big|^2, \qquad \text{MSE}_u = \frac{1}{N_u}\sum_{i=1}^{N_u} \big|u(x_u^i) - u^i\big|^2.$$
Here $\{x_f^i\}_{i=1}^{N_f}$ represents the residual training points in the space-time domain, while $\{x_u^i, u^i\}_{i=1}^{N_u}$ represents the boundary/initial training data. The neural network solution must satisfy the governing equation at randomly chosen points in the domain, which constitutes the physics-informed part of the neural network given by the first term, whereas the second term includes the known boundary/initial conditions, which must be satisfied by the neural network solution. The resulting optimization problem leads to finding the minimum of the loss function by optimizing the parameters, i.e., the weights and biases: we seek to find $\Theta^* = \arg\min_{\Theta} J(\Theta)$. One can approximate the solution of this minimization problem iteratively by one of the forms of the gradient descent algorithm. The stochastic gradient descent (SGD) algorithm is widely used in the machine learning community; see Rud for a complete survey. In SGD the parameters are updated as $\Theta^{m+1} = \Theta^m - \eta_m \nabla_{\Theta} J(\Theta^m)$, where $\eta_m > 0$ is the learning rate. SGD methods can be initialized with some starting value $\Theta^0$. In this work, the ADAM optimizer ADAM , which is a variant of the SGD method, is used.
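The two loss terms and the plain SGD update can be sketched as follows; this is a toy illustration in which the quadratic $J(\theta) = \theta^2$ stands in for the PINN loss, whose gradient would in practice come from automatic differentiation:

```python
import numpy as np

def mse_f(residuals):
    """Physics-informed term: mean of |F(x_f^i)|^2 over residual points."""
    return np.mean(np.abs(residuals) ** 2)

def mse_u(u_pred, u_data):
    """Data-mismatch term: mean of |u(x_u^i) - u^i|^2 over the data."""
    return np.mean(np.abs(u_pred - u_data) ** 2)

def sgd_step(theta, grad, lr):
    """Plain SGD update: theta_{m+1} = theta_m - eta_m * grad J(theta_m)."""
    return theta - lr * grad

theta = 1.0                                           # starting value
for _ in range(100):
    theta = sgd_step(theta, grad=2 * theta, lr=0.1)   # grad of theta^2
# theta has been driven close to the minimizer theta* = 0
```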
A deep network is required to solve complex problems, but such a network is, on the other hand, difficult to train. In most cases, a suitable architecture is selected based on the researcher's experience. One can also think of tuning the network to get the best performance out of it. In this regard, we propose the following two approaches to optimize the adaptive activation function.
1. Layerwise locally adaptive activation functions (LLAAF)
Instead of defining the hyperparameter of the adaptive activation function globally, let us define it for each hidden layer as
$$\sigma\big(a^k\, \mathcal{L}_k(z^{k-1})\big), \qquad k = 1, \dots, D-1.$$
This gives $D-1$ additional hyperparameters to be optimized along with the weights and biases. Here, every hidden layer has its own slope for the activation function.
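A sketch of the layerwise variant: each hidden layer $k$ carries one trainable scalar slope $a^k$ applied inside the activation. The NumPy code below uses tanh and random placeholder parameters for illustration:

```python
import numpy as np

def llaaf_forward(params, slopes, z, n=1.0):
    """Hidden layer k applies sigma(n * a^k * (w^k z + b^k)) with a single
    trainable scalar slope a^k per layer; the output layer stays linear."""
    for (w, b), a in zip(params[:-1], slopes):
        z = np.tanh(n * a * (w @ z + b))
    w, b = params[-1]
    return w @ z + b

rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 1)), np.zeros(4)),   # one hidden layer
          (rng.standard_normal((1, 4)), np.zeros(1))]   # linear output
slopes = np.ones(1)        # D - 1 = 1 extra hyperparameter, n * a^k = 1
y = llaaf_forward(params, slopes, np.array([0.3]))
```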
2. Neuronwise locally adaptive activation functions (NLAAF)
One can also define such an activation function at the neuron level as
$$\sigma\big(a_i^k\, (\mathcal{L}_k(z^{k-1}))_i\big), \qquad k = 1, \dots, D-1, \quad i = 1, \dots, N_k.$$
This gives $\sum_{k=1}^{D-1} N_k$ additional hyperparameters to be optimized. The neuronwise activation function acts as a vector activation function, as opposed to the scalar activation functions given by the LLAAF and global adaptive activation function (GAAF) approaches: every neuron has its own slope for the activation function.
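The neuronwise variant differs only in that the slope is a vector, one entry $a_i^k$ per neuron, so the activation acts as a vector activation function. Again a NumPy sketch with placeholder parameters:

```python
import numpy as np

def nlaaf_forward(params, slopes, z, n=1.0):
    """Neuron i of hidden layer k applies sigma(n * a_i^k * (w^k z + b^k)_i);
    each slopes[k] is a vector the size of layer k, broadcast element-wise."""
    for (w, b), a in zip(params[:-1], slopes):
        z = np.tanh(n * a * (w @ z + b))
    w, b = params[-1]
    return w @ z + b

rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 1)), np.zeros(4)),
          (rng.standard_normal((1, 4)), np.zeros(1))]
slopes = [np.ones(4)]      # sum_k N_k = 4 extra hyperparameters
y = nlaaf_forward(params, slopes, np.array([0.3]))
```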
The resulting optimization problem leads to finding the minimum of the loss function by optimizing the slopes along with the weights and biases, i.e., we seek to find $(a^*, w^*, b^*) = \arg\min J(a, w, b)$. The training of the NN can be further accelerated by multiplying the slope with a fixed scaling factor $n \geq 1$, so that the final form of the activation function is $\sigma\big(n\, a_i^k\, (\mathcal{L}_k(z^{k-1}))_i\big)$. It is important to note that the introduction of the scalable hyperparameters does not change the structure of the loss function defined previously. The final adaptive activation function based neural network representation of the solution is then obtained by applying these activations in the composition given above.
In this case, the set of trainable parameters $\tilde{\Theta}$ consists of $\{w^k, b^k\}_{k=1}^{D}$ and the slopes $a$. In all the proposed methods, the initialization of the scalable hyperparameters is done such that $n\, a_i^k = 1$. Compared to the single additional parameter of the global adaptive activation function, the locally adaptive activation function based PINN has several additional scalable hyperparameters to train. Thus, it is important to consider the additional computational cost incurred.
2.1 Loss function with slope recovery term
The main motivation for the adaptive activation function is to increase the slope of the activation function, resulting in non-vanishing gradients and fast training of the network. Ideally, the slope of the activation should increase quickly in order to improve the performance of the NN. Thus, instead of relying only on the optimization method, another way to achieve this is to include in the loss function a slope recovery term based on the activation slope:
$$\tilde{J}(\tilde{\Theta}) = J(\tilde{\Theta}) + \mathcal{S}(a), \qquad (3)$$
where the slope recovery term is given by
$$\mathcal{S}(a) = \frac{1}{\frac{1}{D-1}\sum_{k=1}^{D-1} \mathcal{E}(a^k)},$$
where $\mathcal{E}$ is a linear/nonlinear operator (applied to the mean neuron slope of the $k$-th layer in the NLAAF case). Although there can be several choices for this operator, including the linear identity operator, in this work we use the exponential operator. The main reason is that such a term contributes to the gradient of the loss function without vanishing. The overall effect of including this term is that it forces the network to increase the value of the activation slope quickly, thereby increasing the training speed. The following algorithm summarizes the LAAF-PINN algorithm with the slope recovery term.
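A concrete sketch of the slope recovery term with the exponential operator; the layer-mean convention used here for NLAAF is an assumption of this illustration. Note that $\mathcal{S}(a)$ shrinks as the slopes grow, so minimizing the total loss pushes the slopes up:

```python
import numpy as np

def slope_recovery_llaaf(a):
    """S(a) = 1 / ((1/(D-1)) sum_k exp(a^k)) for layerwise slopes a^1..a^{D-1}."""
    return 1.0 / np.mean(np.exp(np.asarray(a, dtype=float)))

def slope_recovery_nlaaf(a_layers):
    """Same form, with the mean neuron slope of each layer in the exponential."""
    layer_means = [np.mean(a) for a in a_layers]
    return 1.0 / np.mean(np.exp(layer_means))

# Larger slopes -> smaller penalty, so its gradient never vanishes:
s_small = slope_recovery_llaaf([1.0, 1.0, 1.0])   # mathematically exp(-1)
s_large = slope_recovery_llaaf([2.0, 2.0, 2.0])   # exp(-2) < exp(-1)
```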
2.2 Expansion of parametric space and its effect on the computational cost
The increase of the parametric space leads to a high-dimensional optimization problem whose solution can be difficult to obtain. Of the two approaches discussed previously, i.e., LLAAF and NLAAF, NLAAF introduces the larger number of hyperparameters for optimization. Let $W$ and $B$ be the total numbers of weights and biases in the NN. Then, the ratio of the size of the parametric space of NLAAF to that of the fixed activation is
$$\mathcal{R} = \frac{W + B + \sum_{k=1}^{D-1} N_k}{W + B},$$
where $N_k$ is the number of neurons in the $k$-th hidden layer.
Consider a fully connected NN with a single hidden layer involving a single neuron, with just one input and one output. In this case $W + B = 3$, which gives $\mathcal{R} = 4/3$. So, there is at most a 33% increase in the number of hyperparameters. However, in practice, NNs with a large number of hidden layers involving a large number of neurons are used. As an example, consider a fully connected NN with 3 hidden layers involving 20 neurons in each layer, which gives $W + B = 900$ and $\sum_k N_k = 60$. Thus, $\mathcal{R} \approx 1.067$, i.e., a 6.67% increase in the number of hyperparameters. This increment decreases further with an increase in the number of layers and neurons per layer, eventually resulting in a negligible increase in the number of hyperparameters. In such cases, the computational costs of fixed activation functions and of neuronwise locally adaptive activations are comparable.
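The parameter counts above can be checked with a few lines of Python; biases are counted for the hidden layers only, which is the counting convention that reproduces the 33% and 6.67% figures in the text:

```python
def nlaaf_overhead(layers):
    """Fractional increase in hyperparameters for NLAAF over a fixed
    activation, for a fully connected net with widths `layers`
    (input first, output last); one extra slope a_i^k per hidden neuron."""
    weights = sum(n * m for n, m in zip(layers[:-1], layers[1:]))
    hidden_neurons = sum(layers[1:-1])   # biases counted for hidden layers only
    return hidden_neurons / (weights + hidden_neurons)

pct_tiny = 100 * nlaaf_overhead([1, 1, 1])           # single hidden neuron
pct_deep = 100 * nlaaf_overhead([1, 20, 20, 20, 1])  # 3 hidden layers of 20
```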
We now provide a theoretical result regarding the proposed methods. The following theorem states that a gradient descent algorithm minimizing our objective function in (3) does not converge to a suboptimal critical point or a suboptimal local minimum, for either LLAAF or NLAAF, given an appropriate initialization and learning rates. In the following theorem, we treat the set of trainable parameters $\tilde{\Theta}$ as a real-valued vector. Let $\tilde{J}_c$ denote the value of the objective function attained by the constant network $u_{\tilde{\Theta}}(z) = c$ for all $z$, where $c$ is a constant.
Theorem 2.1.
Let $(\tilde{\Theta}_m)_{m \geq 0}$ be a sequence generated by a gradient descent algorithm as $\tilde{\Theta}_{m+1} = \tilde{\Theta}_m - \eta_m \nabla \tilde{J}(\tilde{\Theta}_m)$. Assume that $\tilde{J}(\tilde{\Theta}_0) < \tilde{J}_c + \sup_a \mathcal{S}(a)$, that $\tilde{J}$ is differentiable, and that for each $i$, there exist a differentiable function $\varphi_i$ and an input $x_i$ such that the corresponding training output satisfies $u^i = \varphi_i(x_i)$. Assume that at least one of the following three conditions holds.
 (i)
(constant learning rate) $\nabla \tilde{J}$ is Lipschitz continuous with Lipschitz constant $L$ (i.e., $\|\nabla \tilde{J}(\tilde{\Theta}) - \nabla \tilde{J}(\tilde{\Theta}')\| \leq L\, \|\tilde{\Theta} - \tilde{\Theta}'\|$ for all $\tilde{\Theta}, \tilde{\Theta}'$ in its domain), and $\epsilon \leq \eta_m \leq (2 - \epsilon)/L$, where $\epsilon$ is a fixed positive number.
 (ii)
(diminishing learning rate) $\nabla \tilde{J}$ is Lipschitz continuous, $\eta_m \to 0$, and $\sum_m \eta_m = \infty$.
 (iii)
(adaptive learning rate) the learning rate $\eta_m$ is chosen by the minimization rule, the limited minimization rule, the Armijo rule, or the Goldstein rule (bertsekas1999nonlinear, ).
Then, for both LLAAF and NLAAF, no limit point of $(\tilde{\Theta}_m)_{m \geq 0}$ is a suboptimal critical point or a suboptimal local minimum.
The initial condition means that the initial value of the objective function needs to be less than that of a constant network plus the highest value of the slope recovery term. The proof of Theorem 2.1 is included in the appendix.
3 Numerical experiments
In this section we solve a regression problem of nonlinear function approximation using a deep neural network, and the Burgers equation, which can admit solutions with steep gradients, using PINN. For the assessment of the proposed methods, various comparisons are made with the standard fixed as well as global adaptive activation functions.
3.1 Neural network approximation of nonlinear discontinuous function
In this test case a standard neural network (without the physics-informed part) is used to approximate a discontinuous function. In this case, the loss function consists of the data mismatch and the slope recovery term:
$$\tilde{J} = \frac{1}{N_u}\sum_{i=1}^{N_u} \big|u_{\tilde{\Theta}}(x_u^i) - u^i\big|^2 + \mathcal{S}(a).$$
The following discontinuous function, with a discontinuity at $x = 0$, is approximated by a deep neural network:
$$u(x) = \begin{cases} 0.2\sin(6x), & x \leq 0,\\ 1 + 0.1\,x\cos(18x), & x > 0. \end{cases} \qquad (4)$$
Here, the domain is $[-3, 3]$ and the number of training points used is 300. The activation function is tanh, the learning rate is 2.0e-4, and the network has four hidden layers with 50 neurons in each layer. In this case, the ratio is $\mathcal{R} = 8000/7800 \approx 1.0256$, i.e., just a 2.56% increase in the number of hyperparameters.
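The training setup can be sketched as follows; the piecewise target used here ($0.2\sin(6x)$ for $x \leq 0$, $1 + 0.1x\cos(18x)$ otherwise) is assumed to be the discontinuous function of equation (4), so treat the exact form as illustrative:

```python
import numpy as np

def target(x):
    """Discontinuous target with a unit jump at x = 0 (assumed form)."""
    return np.where(x <= 0.0, 0.2 * np.sin(6.0 * x),
                    1.0 + 0.1 * x * np.cos(18.0 * x))

rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=300)   # 300 training points on [-3, 3]
u_train = target(x_train)

def data_loss(u_pred, u_true):
    """Data-mismatch part of the loss; the slope recovery term is added on top."""
    return np.mean((u_pred - u_true) ** 2)
```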
Figure 2 shows the solution (first column), the solution in the frequency domain (second column), and the pointwise absolute error in log scale (third column). The solution by the standard fixed activation function is given in the first row, the GAAF solution in the second row, whereas the third row shows the solution given by LLAAF without, and the fourth row with, the slope recovery term. The solution given by NLAAF without the slope recovery term is shown in the fifth row, and with the slope recovery term in the sixth row. We see that the NN training speed increases for the locally adaptive activation functions compared to the fixed and globally adaptive activations. Moreover, both LLAAF and NLAAF with the slope recovery term accelerate training and yield the smallest error compared to the other methods. Figure 3 (top) shows the variation of the slope for GAAF, whereas the second and third rows show the layerwise variation of the slopes for LLAAF without and with the slope recovery term. The fourth and fifth rows show the variation of the scalable hyperparameters for NLAAF without and with the slope recovery term, where the mean value of the slopes along with its standard deviation (Std) is plotted for each hidden layer. We see that the slope values are quite large with the slope recovery term, which shows the rapid increase of the activation slopes. Finally, the comparison of the loss function is shown in figure 4 for the standard fixed activation, GAAF, LLAAF, and NLAAF without the slope recovery term (left), and for LLAAF and NLAAF with the slope recovery term (right), using a scaling factor of 10. The loss function for both LLAAF and NLAAF without the slope recovery term decreases faster, especially during the initial training period, compared to the fixed and global activation function based algorithms.
3.2 Burgers equation
The Burgers equation is one of the fundamental partial differential equations arising in various fields such as gas dynamics, acoustics, and fluid mechanics; see Whitham WH for more details. It was first introduced by H. Bateman Bat and later studied by J.M. Burgers Bur in the context of the theory of turbulence. Here, we consider the viscous Burgers equation given by equation (5) along with its initial and boundary conditions. The nonlinearity in the convection term develops a very steep gradient due to the small value of the diffusion coefficient. We consider the Burgers equation given by
$$u_t + u\,u_x = \nu\, u_{xx}, \qquad x \in [-1, 1],\ t \in [0, 1],\ \nu = 0.01/\pi, \qquad (5)$$
with initial condition $u(x, 0) = -\sin(\pi x)$ and boundary conditions $u(-1, t) = u(1, t) = 0$. The analytical solution can be obtained using the Hopf-Cole transformation; see Basdevant et al. Bes for more details. The number of boundary and initial training points is 400, whereas the number of residual training points is 10000. The activation function is tanh, the learning rate is 0.0008, and the network has 6 hidden layers with 20 neurons in each layer. In this case the ratio is $\mathcal{R} = 2300/2180 \approx 1.055$, i.e., just a 5.5% increase in the number of hyperparameters.
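The training-point setup can be sketched as follows; the diffusion coefficient $\nu = 0.01/\pi$, the domain $[-1,1]\times[0,1]$, and the initial condition $-\sin(\pi x)$ are the standard PINN benchmark choices and are assumptions of this sketch (in a real PINN the derivatives in the residual come from automatic differentiation):

```python
import numpy as np

nu = 0.01 / np.pi                         # assumed diffusion coefficient
rng = np.random.default_rng(0)

# 10000 residual points (x, t) inside [-1, 1] x [0, 1]:
x_f = rng.uniform(-1.0, 1.0, size=10000)
t_f = rng.uniform(0.0, 1.0, size=10000)

# 400 boundary/initial points: half on t = 0, half on x = +/-1:
x0 = rng.uniform(-1.0, 1.0, size=200)
u0 = -np.sin(np.pi * x0)                  # initial condition u(x, 0)
tb = rng.uniform(0.0, 1.0, size=200)
ub = np.zeros(200)                        # homogeneous Dirichlet values

def burgers_residual(u, u_t, u_x, u_xx):
    """PDE residual F = u_t + u u_x - nu u_xx, given the derivative fields."""
    return u_t + u * u_x - nu * u_xx
```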
Figure 5 shows the evolution of the frequency plots of the solution at three different times using the standard fixed activation function (first row), the global adaptive activation function (second row), LLAAF without (third row) and with (fourth row) the slope recovery term, and NLAAF without (fifth row) and with (sixth row) the slope recovery term, using scaling factor 10. Columns from left to right represent the frequency plots of the solution at three successive times, the last being $t = 0.75$. As discussed in JK , one of the earlier times needs a large number of iterations for convergence, whereas the remaining cases converge faster. Again, for both LLAAF and NLAAF the frequencies converge faster towards the exact solution (shown by the black line), with and without the slope recovery term, compared to the fixed and global activation function based algorithms.
Figure 6 shows the comparison of the solutions of the Burgers equation using the standard fixed activation (first row), the global adaptive activation function (second row), LLAAF with slope regularization (third row), and NLAAF with slope regularization (fourth row), using scaling factor 10. Both LLAAF and NLAAF give a more accurate solution compared to GAAF. Figure 7 shows the loss function for the standard fixed activation, GAAF, LLAAF, and NLAAF without the slope recovery term (top left) and with the slope recovery term (top right). The loss function decreases faster for all adaptive activations, in particular GAAF. Even though it is difficult to see from the actual solution plots given in figure 6, one can see from table 1 that both LLAAF and NLAAF give a smaller relative error when using the slope recovery term.
Table 1: Relative error in the solution of the Burgers equation for all methods (SRT = slope recovery term).

Method          | Standard     | GAAF         | LLAAF w/o SRT | NLAAF w/o SRT | LLAAF with SRT | NLAAF with SRT
Relative error  | 1.913973e-01 | 9.517134e-02 | 8.866356e-02  | 9.129446e-02  | 7.693435e-02   | 8.378424e-02
Figure 8 shows the comparison of the layerwise variation of the slopes for LLAAF with and without the slope recovery term. It can be seen that the presence of the slope recovery term further increases the slope of the activation function, thereby increasing the training speed. Similarly, figure 9 shows the mean and standard deviation of the slopes for NLAAF with and without the slope recovery term, which again confirms that with the slope recovery term the network trains faster.
4 Numerical results for deep learning problems
The previous sections demonstrated the advantages of adaptive activation functions with PINN for physics related problems. One of the remaining questions is whether or not the advantage of adaptive activations remains with standard deep neural networks for other types of deep learning applications. To explore the question, this section presents numerical results with various standard benchmark problems in deep learning.
Figures 10 and 11 show the mean values and the uncertainty intervals of the training losses for the fixed activation function (standard), GAAF, LLAAF, and NLAAF on the standard deep learning benchmarks. The solid and dashed lines are the mean values over three random trials with different random seeds. The shaded regions represent the intervals of the sample standard deviations for each method. Figures 10 and 11 consistently show that adaptive activation accelerates the minimization of the training loss. Here, GAAF, LLAAF, and NLAAF all use the slope recovery term, which improved upon the methods without the recovery term. Accordingly, the results of GAAF are also new contributions of this paper. In general, LLAAF improved upon GAAF, as expected.
The standard cross-entropy loss was used for training and plots. We used pre-activation ResNet with 18 layers (he2016identity, ) for the CIFAR-10, CIFAR-100, and SVHN data sets, whereas we used a standard variant of LeNet (LeCun, ) with ReLU for the other data sets; i.e., the architecture of the variant of LeNet consists of the following five layers (with three hidden layers): (1) input layer, (2) convolutional layer with 64 filters, followed by max pooling of size 2 by 2 and ReLU, (3) convolutional layer with 64 filters, followed by max pooling of size 2 by 2 and ReLU, (4) fully connected layer with 1014 output units, followed by ReLU, and (5) fully connected layer with the number of output units equal to the number of target classes. All hyperparameters were fixed a priori across all the different data sets and models. We fixed the mini-batch size to be 64, fixed the initial learning rate and momentum coefficient, and used two values of the scaling factor $n$ (the larger being 2). The learning rate was decayed at the beginning of the 10th epoch for all experiments (with and without data augmentation), and again at the 100th epoch for those with data augmentation.
5 Conclusions
In this paper, we extended our previous work on the global adaptive activation function for deep and physics-informed neural networks by introducing locally adaptive activations. In particular, we present two versions of locally adaptive activation functions, namely layerwise and neuronwise locally adaptive activation functions. Such local activation functions further improve the training speed of the neural network compared to their global predecessor. To further accelerate the training process, a slope recovery term based on the activation slope is added to the loss function for both layerwise and neuronwise activation functions, which is shown to enhance the performance of the neural network. To verify our claims, two test cases, namely nonlinear discontinuous function approximation and the viscous Burgers equation, are solved using deep and physics-informed neural networks, respectively. From these test cases, we conclude that the locally adaptive activations outperform fixed as well as global adaptive activations in terms of training speed. Moreover, while the proposed formulations increase the number of hyperparameters compared to the fixed activation function, the overall computational cost is comparable. The proposed adaptive activation functions with the slope recovery term are also shown to accelerate the training process in standard deep learning benchmark problems. Finally, we theoretically prove that no suboptimal critical point or local minimum attracts gradient descent algorithms in the proposed methods (LLAAF and NLAAF) with the slope recovery term under only mild assumptions.
Acknowledgement
This work was supported by the Department of Energy PhILMs grant DE-SC0019453, and by the DARPA AIRA grant HR00111990025.
Appendix A Proof of Theorem 2.1
We first prove the statement by contradiction for LLAAF. Suppose that the parameter vector $\tilde{\Theta}^*$, consisting of $(a^*, w^*, b^*)$, is a limit point of $(\tilde{\Theta}_m)_{m \geq 0}$ and a suboptimal critical point or a suboptimal local minimum.
Let $z^k$ and $z^{*k}$ be the outputs of the $k$-th layer for the parameters $\tilde{\Theta}$ and the limit point $\tilde{\Theta}^*$, respectively, and define the corresponding layerwise auxiliary quantities for all $k = 1, \dots, D-1$.
Following the proofs in (bertsekas1999nonlinear, , Propositions 1.2.1-1.2.4), we have that the sequence of objective values converges and the gradient vanishes at the limit point, for all three cases of the conditions corresponding to the different rules of the learning rate. Therefore, we have that for all $k$,
(6)  
Furthermore, we have that for all $k$ and all $i$,
(7)  
and
(8) 
By combining (6)-(8), we obtain, for all $k$,
Therefore,
which leads to a contradiction with the assumption that $\tilde{\Theta}^*$ is a suboptimal critical point or a suboptimal local minimum. This proves the desired statement for LLAAF.
For NLAAF, we again prove the statement by contradiction. Suppose that the parameter vector $\tilde{\Theta}^*$, consisting of $(a^*, w^*, b^*)$, is a limit point of $(\tilde{\Theta}_m)_{m \geq 0}$ and a suboptimal critical point or a suboptimal local minimum. Redefine the auxiliary quantities neuronwise for all $k = 1, \dots, D-1$ and $i = 1, \dots, N_k$. Then, by the same proof steps, the sequence of objective values converges and the gradient vanishes at the limit point, for all three cases of the conditions corresponding to the different rules of the learning rate. Therefore, we have that for all $k$ and all $i$,
(9)  
By combining (7)-(9), we obtain, for all $k$ and all $i$,
Therefore,
which leads to a contradiction with the assumption that $\tilde{\Theta}^*$ is a suboptimal critical point or a suboptimal local minimum. This proves the desired statement for NLAAF.
∎
References
 (1) D. P. Kingma, J. L. Ba, ADAM: A method for stochastic optimization, arXiv:1412.6980v9, 2017.
 (2) C. Basdevant, et al., Spectral and finite difference solutions of the Burgers equation, Comput. Fluids, 14 (1986) 23-41.
 (3) H. Bateman, Some recent researches on the motion of fluids, Monthly Weather Review, 43(4), 163-170, 1915.
 (4) D.P. Bertsekas, Nonlinear programming, Athena scientific Belmont, 1999.
 (5) J.M. Burgers, A mathematical model illustrating the theory of turbulence, in Advances in Applied Mechanics, Vol. 1, pp. 171-199, 1948.
 (6) A.D. Jagtap, G.E. Karniadakis, Adaptive activation functions accelerate convergence in deep and physics-informed neural networks, arXiv:1906.01170v1, 2019.
 (7) M. Dushkoff, R. Ptucha, Adaptive Activation Functions for Deep Networks, Electronic Imaging, Computational Imaging XIV, pp. 15(5).
 (8) A.G. Baydin, B.A. Pearlmutter, A.A. Radul, J.M. Siskind, Automatic differentiation in machine learning: a survey, Journal of Machine Learning Research, 18 (2018) 143.
 (9) G.B. Whitham, Linear and nonlinear waves, Vol. 42, JohnWiley & Sons, 2011.
 (10) K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, European Conference on Computer Vision, pp. 630-645, 2016.
 (11) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86(11): 2278-2324, 1998.
 (12) B. Li, Y. Li and X. Rong, The extreme learning machine learning algorithm with tunable activation function, Neural Comput. & Applic. (2013) 22: 531-539.
 (13) G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
 (14) V. Kunc and J. Klíma, On transformative adaptive activation functions in neural networks for gene expression inference, bioRxiv, doi:http://dx.doi.org/10.1101/587287, 2019.
 (15) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 (16) A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 (17) M. Raissi, P. Perdikaris, G.E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, J. Comput. Phys., 378, 686-707, 2019.
 (18) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 15(Jun): 1929-1958, 2014.
 (19) J.-X. Wang, et al., A comprehensive physics-informed machine learning framework for predictive turbulence modeling, arXiv:1701.07102.
 (20) S. Ruder, An overview of gradient descent optimization algorithms, arXiv:1609.04747v2, 2017.
 (21) S. Qian, et al., Adaptive activation functions in convolutional neural networks, Neurocomputing, Vol. 272, 2018, pp. 204-212.
 (22) Y. Shen, B. Wang, F. Chen and L. Cheng, A new multi-output neural model with tunable activation function and its applications, Neural Processing Letters, 20: 85-104, 2004.
 (23) Tactile Srl, Brescia, Italy (1994). Semeion Handwritten Digit Data Set. Semeion Research Center of Sciences of Communication, Rome, Italy
 (24) T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, D. Ha, Deep learning for classical Japanese literature, arXiv:1812.01718.
 (25) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 (26) H. Xiao, K. Rasul, and R. Vollgraf. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 (27) C. Yu, et al., An adaptive activation function for multilayer feedforward neural networks, Proceedings of the 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering (TENCON '02).
 (28) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A.Y. Ng, Reading digits in natural images with unsupervised feature learning, NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.