Locally adaptive activation functions with slope recovery term for deep and physics-informed neural networks
We propose two approaches to locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activation functions, which improve the performance of deep and physics-informed neural networks. The local adaptation of the activation function is achieved by introducing a scalable hyper-parameter in each layer (layer-wise) or for every neuron separately (neuron-wise), and then optimizing it using the stochastic gradient descent algorithm. The neuron-wise activation function acts like a vector activation function, as opposed to the traditional scalar activation function given by fixed, global and layer-wise activations. To further increase the training speed, a slope recovery term based on the activation slope is added to the loss function, which further accelerates convergence and thereby reduces the training cost. For numerical experiments, a nonlinear discontinuous function is approximated using a deep neural network with layer-wise and neuron-wise locally adaptive activation functions, with and without the slope recovery term, and compared with its global counterpart. Moreover, the solution of the nonlinear Burgers equation, which exhibits steep gradients, is also obtained using the proposed methods. On the theoretical side, we prove that in the proposed method the gradient descent algorithms are not attracted to sub-optimal critical points or local minima under practical conditions on the initialization and learning rate. Furthermore, the proposed adaptive activation functions with the slope recovery term are shown to accelerate the training process in standard deep learning benchmarks using the CIFAR-10, CIFAR-100, SVHN, MNIST, KMNIST, Fashion-MNIST, and Semeion data sets, with and without data augmentation.
Keywords: Machine learning, bad minima, stochastic gradients, accelerated training, PINN, deep learning benchmarks.
In recent years, research on neural networks (NNs) has intensified around the world due to their successful applications in diverse fields such as speech recognition [13], computer vision [16], and natural language translation [25]. An NN is trained on data sets before it is used in actual applications. Various data sets are available for applications like image classification, which is a subset of computer vision. MNIST [11] and its variants Fashion-MNIST [26] and KMNIST [24] are data sets of handwritten digits, images of clothing and accessories, and Japanese letters, respectively. Apart from MNIST, Semeion [23] is a handwritten digit data set that contains 1593 digits collected from 80 persons. SVHN [28] is another data set of street view house numbers obtained from Google Street View images. CIFAR [15] is a popular data set of color images commonly used to train machine learning algorithms. In particular, the CIFAR-10 data set contains 50000 training and 10000 testing images in 10 classes with an image resolution of 32x32. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes with 600 images in each class, which makes it more challenging than the CIFAR-10 data set.
Recently, NNs have also been used to solve partial differential equations (PDEs) due to their ability to effectively approximate complex functions arising in diverse scientific disciplines (cf. physics-informed neural networks (PINNs) by Raissi et al. [17]). PINNs can accurately solve both forward problems, where the approximate solutions of governing equations are obtained, and inverse problems, where parameters involved in the governing equation are inferred from the training data. In the PINN algorithm, along with the contribution from the neural network, the loss function is enriched by the addition of the residual term from the governing equation, which acts as a penalizing term constraining the space of admissible solutions.
Highly efficient and adaptable algorithms are important for designing the most effective NN, which not only increases the accuracy of the solution but also reduces the training cost. Various NN architectures, like Dropout NN [18], have been proposed in the literature, which can improve the efficiency of the algorithm for specific applications. The activation function plays an important role in the training process of an NN. In this work, we focus on adaptive activation functions, which adapt automatically such that the network can be trained faster. Various methods have been proposed in the literature for adaptive activation functions, like the adaptive sigmoidal activation function proposed by Yu et al. [27] for multilayer feedforward NNs, while Qian et al. [21] focus on learning activation functions in convolutional NNs by combining basic activation functions in a data-driven way. Multiple activation functions per neuron are proposed by Dushkoff and Ptucha [7], where individual neurons select between a multitude of activation functions. Li et al. [12] proposed a tunable activation function, where only a single hidden layer is used and the activation function is tuned. Shen et al. [22] used a similar idea of a tunable activation function but with multiple outputs. Recently, Kunc and Kléma [14] proposed transformative adaptive activation functions for gene expression inference. One such adaptive activation function was proposed by Jagtap and Karniadakis [6], who introduced a scalable hyper-parameter in the activation function that can be optimized. Mathematically, it changes the slope of the activation function, thereby accelerating the learning process, especially during the initial training period. Because a single scalar hyper-parameter is used, we call such adaptive activation functions globally adaptive activations, meaning that they give an optimized slope for the entire network.
One can think of doing such optimization at the local level, where the scalable hyper-parameters are introduced hidden layer-wise or even for each neuron in the network. Such local adaptation can further improve the performance of the network.
Figure 1 shows a sketch of a neuron-wise locally adaptive activation function based physics-informed neural network (LAAF-PINN), where both the NN part and the physics-informed part can be seen. In this architecture, along with the output of the NN and the residual term from the governing equation, the activation slopes from every neuron also contribute to the loss function in the form of the slope recovery term. In every neuron, a scalable hyper-parameter is introduced, which can be optimized to obtain a quick and efficient training process.
The rest of the paper is organized as follows. Section 2 presents the methodology of the proposed layer-wise and neuron-wise locally adaptive activations in detail. This also includes a discussion on the slope recovery term, the expansion of the parametric space due to the layer-wise and neuron-wise introduction of hyper-parameters, its effect on the overall training cost, and a theoretical result for gradient descent algorithms. In section 3, we perform various numerical experiments, where we approximate a nonlinear discontinuous function using a deep NN by the proposed approaches. We also solve the Burgers equation using the proposed algorithm and present various comparisons with fixed and global activation functions. Section 4 presents numerical results with various standard deep learning benchmarks using the CIFAR-10, CIFAR-100, SVHN, MNIST, KMNIST, Fashion-MNIST, and Semeion data sets. Finally, in section 5, we summarize the conclusions of our work.
We use a NN of depth $D$ corresponding to a network with an input layer, $D-1$ hidden layers and an output layer. The $k$-th hidden layer contains $N_k$ neurons. Each hidden layer of the network receives an output $z^{k-1} \in \mathbb{R}^{N_{k-1}}$ from the previous layer, where an affine transformation of the form
$$\mathcal{L}_k(z^{k-1}) := w^k z^{k-1} + b^k$$
is performed. The network weights $w^k \in \mathbb{R}^{N_k \times N_{k-1}}$ and bias term $b^k \in \mathbb{R}^{N_k}$ associated with the $k$-th layer are chosen from independent and identically distributed sampling. The nonlinear activation function $\sigma(\cdot)$ is applied to each component of the transformed vector before sending it as an input to the next layer. The activation function is an identity function after the output layer. Thus, the final neural network representation is given by the composition
$$u_\Theta(z) = (\mathcal{L}_D \circ \sigma \circ \mathcal{L}_{D-1} \circ \cdots \circ \sigma \circ \mathcal{L}_1)(z),$$
where the operator $\circ$ is the composition operator, $\Theta = \{w^k, b^k\}_{k=1}^{D}$ represents the trainable parameters in the network, $u$ is the output and $z^0 = z$ is the input.
In supervised learning of the solution of PDEs, training data is needed to train the neural network. It can be obtained from the exact solution (if available), from a high-resolution numerical solution given by efficient numerical schemes, or even from carefully performed experiments, which may yield both high- and low-fidelity data sets.
We aim to find the optimal weights for which the suitably defined loss function is minimized. In PINN the loss function is defined as
$$J(\Theta) = MSE_{\mathcal{F}} + MSE_u,$$
where the mean squared error (MSE) is given by
$$MSE_{\mathcal{F}} = \frac{1}{N_f} \sum_{i=1}^{N_f} |\mathcal{F}(x_f^i, t_f^i)|^2, \qquad MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} |u^i - u_\Theta(x_u^i, t_u^i)|^2.$$
Here $\mathcal{F}$ denotes the residual of the governing equation, $\{x_f^i, t_f^i\}_{i=1}^{N_f}$ represents the residual training points in the space-time domain, while $\{x_u^i, t_u^i, u^i\}_{i=1}^{N_u}$ represents the boundary/initial training data. The neural network solution must satisfy the governing equation at randomly chosen points in the domain, which constitutes the physics-informed part of the neural network given by the first term, whereas the second term includes the known boundary/initial conditions, which must be satisfied by the neural network solution. The resulting optimization problem leads to finding the minimum of a loss function by optimizing the parameters, i.e., the weights and biases: we seek to find $\Theta^* = \arg\min_{\Theta} J(\Theta)$. One can approximate the solutions to this minimization problem iteratively by one of the forms of the gradient descent algorithm. The stochastic gradient descent (SGD) algorithm is widely used in the machine learning community; see [20] for a complete survey. In SGD the weights are updated as $w^{m+1} = w^m - \eta_l \nabla_{w^m} J^m(w)$, where $\eta_l > 0$ is the learning rate. SGD methods can be initialized with some starting value $w^0$. In this work, the ADAM optimizer [1], which is a variant of the SGD method, is used.
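As a rough illustration of how the two loss contributions and the gradient-descent update fit together, the sketch below assembles the loss from residual values and boundary/initial predictions that are assumed to be already computed. The function names (`mse`, `pinn_loss`, `sgd_step`) and the sample numbers are hypothetical and serve only to show the arithmetic; a real PINN would obtain residuals and gradients by automatic differentiation.

```python
def mse(values):
    """Mean of the squared entries of a list."""
    return sum(v * v for v in values) / len(values)


def pinn_loss(residuals, u_pred, u_data):
    """Residual term plus data-mismatch term for precomputed values."""
    mse_f = mse(residuals)                                     # physics-informed part
    mse_u = mse([ud - up for ud, up in zip(u_data, u_pred)])   # boundary/initial part
    return mse_f + mse_u


def sgd_step(params, grads, lr):
    """One plain gradient-descent update: w <- w - lr * grad."""
    return [w - lr * g for w, g in zip(params, grads)]


# Illustrative numbers only.
loss = pinn_loss(residuals=[0.1, -0.2], u_pred=[1.0, 0.5], u_data=[1.0, 0.0])
new_params = sgd_step([1.0, -1.0], grads=[0.5, -0.5], lr=0.1)
```

ADAM replaces the plain update in `sgd_step` with moment-based step sizes, but the loss assembly is unchanged.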
A deep network is required to solve complex problems, which on the other hand is difficult to train. In most cases, a suitable architecture is selected based on the researcher’s experience. One can also think of tuning the network to get the best performance out of it. In this regard, we propose the following two approaches to optimize the adaptive activation function.
1. Layer-wise locally adaptive activation functions (L-LAAF)
Instead of globally defining the hyper-parameter $a$ for the adaptive activation function, let us define this parameter hidden layer-wise as
$$\sigma(a^k \mathcal{L}_k(z^{k-1})), \quad k = 1, \dots, D-1.$$
This gives $D-1$ additional hyper-parameters to be optimized along with weights and biases. Here, every hidden layer has its own slope for the activation function.
2. Neuron-wise locally adaptive activation functions (N-LAAF)
One can also define such an activation function at the neuron level as
$$\sigma(a_i^k (\mathcal{L}_k(z^{k-1}))_i), \quad k = 1, \dots, D-1, \quad i = 1, \dots, N_k.$$
This gives $\sum_{k=1}^{D-1} N_k$ additional hyper-parameters to be optimized. Here, every neuron has its own slope for the activation function, so the neuron-wise activation function acts as a vector activation function, as opposed to the scalar activation function given by the L-LAAF and global adaptive activation function (GAAF) approaches.
The resulting optimization problem leads to finding the minimum of a loss function by optimizing $a_i^k$ along with weights and biases, i.e., we seek to find $(a_i^k)^* = \arg\min_{a_i^k \in \mathbb{R}^+ \setminus \{0\}} J(a_i^k)$. The process of training the NN can be further accelerated by multiplying $a$ with a scaling factor $n \geq 1$. The final form of the activation function is given by $\sigma(n a_i^k (\mathcal{L}_k(z^{k-1}))_i)$. It is important to note that the introduction of the scalable hyper-parameter does not change the structure of the loss function defined previously. Then, the final adaptive activation function based neural network representation of the solution is given by
$$u_{\hat{\Theta}}(z) = (\mathcal{L}_D \circ \sigma \circ n a_i^{D-1} (\mathcal{L}_{D-1}) \circ \cdots \circ \sigma \circ n a_i^{1} (\mathcal{L}_1))(z).$$
In this case, the set of trainable parameters $\hat{\Theta}$ consists of $\{w^k, b^k\}_{k=1}^{D}$ and $\{a_i^k\}_{k=1}^{D-1}$, $i = 1, \dots, N_k$. In all the proposed methods, the initialization of scalable hyper-parameters is done such that $n a_i^k = 1$ for every $n$. Compared to the single additional parameter of the global adaptive activation function, the locally adaptive activation function based PINN has several additional scalable hyper-parameters to train. Thus, it is important to consider the additional computational cost required.
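The forward pass with locally adaptive slopes can be sketched as follows. This is a minimal illustration, assuming a tanh activation, a fully connected network stored as plain Python lists, and slopes stored as one list per hidden layer; the names `affine`, `forward`, and the tiny 1-2-1 network are hypothetical. A layer-wise slope is represented by repeating the same value for every neuron of that layer.

```python
import math


def affine(weights, bias, z):
    """Affine map w z + b for a weight matrix given as a list of rows."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(weights, bias)]


def forward(z, layers, slopes, n=1.0):
    """Forward pass with one slope per hidden neuron.

    layers: list of (weights, bias) pairs, the last one being the output layer.
    slopes: one list of slopes per hidden layer.
    n:      scaling factor multiplying every slope.
    """
    for (weights, bias), a_k in zip(layers[:-1], slopes):
        pre = affine(weights, bias, z)
        z = [math.tanh(n * a * p) for a, p in zip(a_k, pre)]
    weights, bias = layers[-1]
    return affine(weights, bias, z)  # identity activation at the output


# Hypothetical 1-2-1 network used only for illustration.
layers = [([[0.5], [-1.0]], [0.1, 0.2]), ([[1.0, 2.0]], [0.0])]
n = 5.0
fixed = forward([0.3], layers, [[1.0, 1.0]], n=1.0)           # fixed activation
adaptive = forward([0.3], layers, [[1.0 / n, 1.0 / n]], n=n)   # n * a = 1 at start
varied = forward([0.3], layers, [[0.2, 0.4]], n=n)             # neuron-wise slopes
```

With the initialization that makes the product of scaling factor and slope equal to one, the adaptive network starts from exactly the fixed-activation network; the slopes only change the output once training moves them.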
2.1 Loss function with slope recovery term
The main motivation of the adaptive activation function is to increase the slope of the activation function, resulting in non-vanishing gradients and fast training of the network. It is clear that one should quickly increase the slope of the activation in order to improve the performance of the NN. Thus, instead of only depending on the optimization methods, another way to achieve this is to include the slope recovery term based on the activation slope in the loss function as
$$J(\hat{\Theta}) = MSE_{\mathcal{F}} + MSE_u + \mathcal{S}(a), \qquad (3)$$
where the slope recovery term $\mathcal{S}(a)$ is given by
$$\mathcal{S}(a) := \begin{cases} \dfrac{1}{\frac{1}{D-1} \sum_{k=1}^{D-1} \mathcal{N}(a^k)} & \text{for L-LAAF}, \\[2ex] \dfrac{1}{\frac{1}{D-1} \sum_{k=1}^{D-1} \mathcal{N}\left( \frac{\sum_{i=1}^{N_k} a_i^k}{N_k} \right)} & \text{for N-LAAF}, \end{cases}$$
where $\mathcal{N}$ is a linear/nonlinear operator. Although there can be several choices of this operator, including the linear identity operator, in this work we use the exponential operator. The main reason behind this is that such a term contributes to the gradient of the loss function without vanishing. The overall effect of including this term is that it forces the network to increase the value of the activation slope quickly, thereby increasing the training speed. The following algorithm summarizes the LAAF-PINN algorithm with slope recovery term.
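A minimal sketch of the slope recovery term with the exponential operator is given below. It assumes the term is the reciprocal of the average, over hidden layers, of the exponential of each layer's mean slope, so that L-LAAF is the special case of one slope per layer; the function names `slope_recovery` and `total_loss` are hypothetical.

```python
import math


def slope_recovery(slopes):
    """Reciprocal of the layer-averaged exp(mean slope of the layer).

    slopes: one list of slopes per hidden layer. For L-LAAF each inner
    list holds the single slope of that layer.
    """
    layer_terms = [math.exp(sum(a_k) / len(a_k)) for a_k in slopes]
    return len(layer_terms) / sum(layer_terms)


def total_loss(mse_f, mse_u, slopes):
    """Residual term + data term + slope recovery term."""
    return mse_f + mse_u + slope_recovery(slopes)


s_zero = slope_recovery([[0.0], [0.0], [0.0]])       # all slopes zero
s_one = slope_recovery([[1.0], [1.0], [1.0]])        # all slopes one
s_nlaaf = slope_recovery([[0.5, 1.5], [1.0, 1.0]])   # neuron-wise, mean slope one
```

Because the term shrinks as the slopes grow, minimizing the total loss pushes the slopes upward, which is the intended acceleration effect.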
2.2 Expansion of parametric space and its effect on the computational cost
The increase of the parametric space leads to a high-dimensional optimization problem whose solution can be difficult to obtain. Between the two approaches discussed previously, i.e., L-LAAF and N-LAAF, N-LAAF introduces the highest number of hyper-parameters for optimization. Let $\omega$ and $\beta$ be the total number of weights and biases in the NN. Then, the ratio $\varrho$, which is the size of the parametric space of N-LAAF to that of fixed activation, is
$$\varrho = \frac{\omega + 2\beta}{\omega + \beta}.$$
Consider a fully connected NN with a single hidden layer involving a single neuron with just one input and one output. In this case, $\omega = 2$ and $\beta = 1$, which gives $\varrho \approx 1.33$. So, there is at most a 33% increase in the number of hyper-parameters. However, in general, NNs with a large number of hidden layers involving a large number of neurons are used in practice. As an example, let us consider a fully connected NN with 3 hidden layers involving 20 neurons in each layer, which gives the values $\omega = 840$ and $\beta = 60$. Thus, $\varrho \approx 1.0667$, i.e., a 6.67% increase in the number of hyper-parameters. This increment goes down further with an increase in the number of layers and neurons in each layer, which eventually results in a negligible increase in the number of hyper-parameters. In such cases, the computational costs for the fixed activation function and for the neuron-wise locally adaptive activations are comparable.
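The percentages quoted above can be checked with a short calculation. The sketch below is an illustration under one inferred convention: only hidden-layer biases are counted, since that is the counting that reproduces the 33% and 6.67% figures. The helper names `count_parameters` and `nlaaf_ratio` are hypothetical.

```python
def count_parameters(layer_sizes):
    """Return (weights, hidden-layer biases) for a fully connected network.

    layer_sizes lists the widths from input to output, e.g. [1, 20, 20, 20, 1].
    Counting only hidden-layer biases is an inferred convention.
    """
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    hidden_biases = sum(layer_sizes[1:-1])
    return weights, hidden_biases


def nlaaf_ratio(layer_sizes):
    """Size of the N-LAAF parameter space relative to a fixed activation."""
    weights, biases = count_parameters(layer_sizes)
    # N-LAAF adds one slope per hidden neuron, i.e. as many as hidden biases.
    return (weights + 2 * biases) / (weights + biases)


single_neuron = nlaaf_ratio([1, 1, 1])              # 4/3
three_by_twenty = nlaaf_ratio([1, 20, 20, 20, 1])   # 960/900
```

The same function gives about 1.0256 for four hidden layers of 50 neurons with one input, and about 1.055 for six hidden layers of 20 neurons with two inputs, matching the ratios reported for the numerical experiments.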
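We now provide a theoretical result regarding the proposed methods. The following theorem states that a gradient descent algorithm minimizing our objective function $J(\hat{\Theta})$ in (3) does not converge to a sub-optimal critical point or a sub-optimal local minimum, for either L-LAAF or N-LAAF, given appropriate initialization and learning rates. In the following theorem, we treat $\hat{\Theta}$ as a real-valued vector. Let $J_c(0) = MSE_{\mathcal{F}} + MSE_u$ with the constant network $u_\Theta(z) = u_\Theta(z') = c \in \mathbb{R}^{N_D}$ for all $z, z'$, where $c$ is a constant.
Theorem 2.1. Let $(\hat{\Theta}_m)_{m \in \mathbb{N}}$ be a sequence generated by a gradient descent algorithm as $\hat{\Theta}_{m+1} = \hat{\Theta}_m - \eta_m \nabla J(\hat{\Theta}_m)$. Assume that $J(\hat{\Theta}_0) < J_c(0) + \mathcal{S}(0)$ for any $c \in \mathbb{R}^{N_D}$, $J$ is differentiable, and that for each $i \in \{1, \dots, N_f\}$, there exist a differentiable function $\varphi^i$ and an input $\rho^i$ such that $|\mathcal{F}(x_f^i, t_f^i)|^2 = \varphi^i(u_{\hat{\Theta}}(\rho^i))$. Assume that at least one of the following three conditions holds.
(i) (constant learning rate) $\nabla J$ is Lipschitz continuous with Lipschitz constant $C$ (i.e., $\|\nabla J(\hat{\Theta}) - \nabla J(\hat{\Theta}')\|_2 \leq C \|\hat{\Theta} - \hat{\Theta}'\|_2$ for all $\hat{\Theta}, \hat{\Theta}'$ in its domain), and $\epsilon \leq \eta_m \leq (2 - \epsilon)/C$, where $\epsilon$ is a fixed positive number.
(ii) (diminishing learning rate) $\nabla J$ is Lipschitz continuous, $\eta_m \to 0$ and $\sum_{m=0}^{\infty} \eta_m = \infty$.
(iii) (adaptive learning rate) the learning rate $\eta_m$ is chosen by the minimization rule, the limited minimization rule, the Armijo rule, or the Goldstein rule [4].
Then, for both L-LAAF and N-LAAF, no limit point of $(\hat{\Theta}_m)_{m \in \mathbb{N}}$ is a sub-optimal critical point or a sub-optimal local minimum.
The initial condition $J(\hat{\Theta}_0) < J_c(0) + \mathcal{S}(0)$ means that the initial value $J(\hat{\Theta}_0)$ needs to be less than that of a constant network plus the highest value of the slope recovery term. Here, note that $\mathcal{S}(1) < \mathcal{S}(0)$. The proof of Theorem 2.1 is included in Appendix A.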
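To see why the slope recovery term takes its highest value at zero slopes and always pulls the slopes upward, the following short computation may help. It assumes the exponential form of the term for L-LAAF, written here as $\mathcal{S}(a)$ with one slope $a^k$ per hidden layer and $D-1$ hidden layers; $\mathcal{S}(0)$ and $\mathcal{S}(1)$ denote the values with all slopes equal to 0 and 1, respectively.

```latex
\mathcal{S}(a) = \frac{D-1}{\sum_{k=1}^{D-1} \exp(a^k)}, \qquad
\frac{\partial \mathcal{S}}{\partial a^k}
  = -\frac{(D-1)\exp(a^k)}{\left(\sum_{j=1}^{D-1} \exp(a^j)\right)^{2}} < 0,
\\[1ex]
\mathcal{S}(0) = \frac{D-1}{D-1} = 1, \qquad
\mathcal{S}(1) = \frac{D-1}{(D-1)\,e} = e^{-1} \approx 0.368 < \mathcal{S}(0).
```

The derivative is strictly negative for every finite slope, so this contribution to the gradient never vanishes and a gradient descent step always increases each $a^k$ through this term.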
3 Numerical experiments
In this section we solve a regression problem of nonlinear function approximation using a deep neural network, and we solve the Burgers equation, which can admit solutions with steep gradients, using PINN. To assess the proposed methods, various comparisons are made with the standard fixed as well as global adaptive activation functions.
3.1 Neural network approximation of nonlinear discontinuous function
In this test case a standard neural network (without the physics-informed part) is used to approximate a discontinuous function. In this case, the loss function consists of the data mismatch and the slope recovery term as
$$J(\hat{\Theta}) = \frac{1}{N_u} \sum_{i=1}^{N_u} |u^i - u_{\hat{\Theta}}(x_u^i)|^2 + \mathcal{S}(a).$$
The following discontinuous function with discontinuity at location $x = 0$ is approximated by a deep neural network:
$$u(x) = \begin{cases} 0.2 \sin(6x) & \text{for } x \leq 0, \\ 1 + 0.1\, x \cos(18x) & \text{otherwise}. \end{cases}$$
Here, the domain is $[-3, 3]$ and the number of training points used is 300. The activation function is tanh, the learning rate is 2.0e-4 and there are four hidden layers with 50 neurons in each layer. In this case, the ratio $\varrho = 1.0256$, which shows just a 2.56% increase in the number of hyper-parameters.
Figure 2 shows the solution (first column), the solution in the frequency domain (second column) and the point-wise absolute error in log scale (third column). The solution by the standard fixed activation function is given in the first row and the GAAF solution in the second row, whereas the third and fourth rows show the solution given by L-LAAF without and with the slope recovery term, respectively. The solution given by N-LAAF without the slope recovery term is shown in the fifth row and with the slope recovery term in the sixth row. We see that the NN training speed increases for the locally adaptive activation functions compared to fixed and globally adaptive activations. Moreover, both L-LAAF and N-LAAF with the slope recovery term accelerate training and yield the least error compared to the other methods. Figure 3 (top) shows the variation of $na$ for GAAF, whereas the second and third rows show the layer-wise variation of $na^k$ for L-LAAF without and with the slope recovery term. The fourth and fifth rows show the variation of the scalable hyper-parameters for N-LAAF without and with the slope recovery term, where the mean value of $na_i^k$ along with its standard deviation (Std) is plotted for each hidden layer. We see that the value of $na$ is quite large with the slope recovery term, which shows the rapid increase in the activation slopes. Finally, the comparison of the loss function is shown in figure 4 for the standard fixed activation, GAAF, L-LAAF and N-LAAF without the slope recovery term (left) and for L-LAAF and N-LAAF with the slope recovery term (right), using a scaling factor of 10. The loss function for both L-LAAF and N-LAAF without the slope recovery term decreases faster, especially during the initial training period, compared to the fixed and global activation function based algorithms.
3.2 Burgers equation
The Burgers equation is one of the fundamental partial differential equations arising in various fields such as gas dynamics, acoustics, and fluid mechanics; see Whitham [9] for more details. The Burgers equation was first introduced by H. Bateman [3] and later studied by J.M. Burgers [5] in the context of the theory of turbulence. Here, we consider the viscous Burgers equation given by equation (5) along with its initial and boundary conditions. The non-linearity in the convection term develops a very steep gradient due to the small value of the diffusion coefficient. We consider the Burgers equation given by
$$u_t + u u_x = \epsilon u_{xx}, \quad x \in [-1, 1], \quad t > 0, \qquad (5)$$
with initial condition $u(x, 0) = -\sin(\pi x)$, boundary conditions $u(-1, t) = u(1, t) = 0$ and $\epsilon = 0.01/\pi$. The analytical solution can be obtained using the Hopf-Cole transformation; see Basdevant et al. [2] for more details. The number of boundary and initial training points is 400, whereas the number of residual training points is 10000. The activation function is tanh, the learning rate is 0.0008 and there are 6 hidden layers with 20 neurons in each layer. In this case the ratio $\varrho = 1.055$, which shows just a 5.5% increase in the number of hyper-parameters.
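The residual that the physics-informed part penalizes can be illustrated with a small sketch. It assumes the viscous Burgers equation in the form u_t + u u_x = (diffusion) u_xx, and it swaps in central finite differences for the automatic differentiation that a PINN would use. The names `burgers_residual`, `exact`, and `wrong`, the sample points, and the diffusion value 0.01/pi are illustrative assumptions. The test function x / (1 + t) satisfies the equation for any diffusion coefficient, so its residual should be close to zero.

```python
import math


def burgers_residual(u, x, t, diffusion, h=1e-3):
    """Residual u_t + u*u_x - diffusion*u_xx at (x, t), by central differences."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
    return u_t + u(x, t) * u_x - diffusion * u_xx


def exact(x, t):
    """u = x / (1 + t) solves the Burgers equation for any diffusion value."""
    return x / (1.0 + t)


def wrong(x, t):
    """A function that does not satisfy the equation."""
    return x + t


diffusion = 0.01 / math.pi   # assumed small diffusion coefficient
points = [(-0.5, 0.25), (0.0, 0.5), (0.7, 0.75)]
mse_exact = sum(burgers_residual(exact, x, t, diffusion) ** 2
                for x, t in points) / len(points)
mse_wrong = sum(burgers_residual(wrong, x, t, diffusion) ** 2
                for x, t in points) / len(points)
```

A network that satisfies the governing equation drives this mean squared residual toward zero, while a function that violates it is penalized.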
Figure 5 shows the evolution of frequency plots of the solution at three different times using the standard fixed activation function (first row), the global adaptive activation function (second row), L-LAAF without (third row) and with (fourth row) the slope recovery term, and N-LAAF without (fifth row) and with (sixth row) the slope recovery term, using a scaling factor of 10. Columns from left to right represent the frequency plots of the solution at $t = 0.25$, 0.5 and 0.75, respectively. As discussed in [6], the $t = 0.25$ case needs a large number of iterations for convergence, whereas both the $t = 0.5$ and 0.75 cases converge faster. Again, for both L-LAAF and N-LAAF the frequencies converge faster towards the exact solution (shown by the black line), with and without the slope recovery term, compared to the fixed and global activation function based algorithms.
Figure 6 shows the comparison of the solution of the Burgers equation using the standard fixed activation (first row), the global adaptive activation function (second row), L-LAAF with slope regularization (third row), and N-LAAF with slope regularization (fourth row), using a scaling factor of 10. Both L-LAAF and N-LAAF give a more accurate solution compared to GAAF. Figure 7 shows the loss function for the standard fixed activation, GAAF, L-LAAF and N-LAAF without the slope recovery term (top left) and with the slope recovery term (top right). The loss function decreases faster for all adaptive activations, in particular GAAF. Even though it is difficult to see from the actual solution plots given in figure 6, one can see from table 1 that both L-LAAF and N-LAAF give a smaller relative error when using the slope recovery term.
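Table 1: Relative error of the Burgers solution without (w/o) and with the slope recovery term (SRT).
| w/o SRT | w/o SRT | with SRT | with SRT |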
Figure 8 shows the comparison of the layer-wise variation of $na^k$ for L-LAAF with and without the slope recovery term. It can be seen that the presence of the slope recovery term further increases the slope of the activation function, thereby increasing the training speed. Similarly, figure 9 shows the mean and standard deviation of $na_i^k$ for N-LAAF with and without the slope recovery term, which again confirms that the network training speed increases with the slope recovery term.
4 Numerical results for deep learning problems
The previous sections demonstrated the advantages of adaptive activation functions with PINN for physics related problems. One of the remaining questions is whether or not the advantage of adaptive activations remains with standard deep neural networks for other types of deep learning applications. To explore the question, this section presents numerical results with various standard benchmark problems in deep learning.
Figures 10 and 11 show the mean values and the uncertainty intervals of the training losses for the fixed activation function (standard), GAAF, L-LAAF, and N-LAAF on the standard deep learning benchmarks. The solid and dashed lines are the mean values over three random trials with random seeds. The shaded regions represent the intervals of $2 \times$ (the sample standard deviations) for each method. Figures 10 and 11 consistently show that adaptive activation accelerates the minimization of the training loss. Here, GAAF, L-LAAF and N-LAAF all use the slope recovery term, which improved on the methods without the recovery term. Accordingly, the results of GAAF are also new contributions of this paper. In general, L-LAAF improved on GAAF, as expected.
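The mean curve and shaded band described above can be computed along the following lines. This is a sketch only: the function name `mean_and_band` is hypothetical, the band half-width is exposed as a parameter (defaulting to two sample standard deviations as an assumption), and the loss values are made-up numbers, not results from the experiments.

```python
import statistics


def mean_and_band(trials, width=2.0):
    """Per-epoch mean and mean +/- width * sample standard deviation."""
    means, lower, upper = [], [], []
    for values in zip(*trials):          # values of all trials at one epoch
        m = statistics.mean(values)
        s = statistics.stdev(values)     # sample standard deviation (n - 1)
        means.append(m)
        lower.append(m - width * s)
        upper.append(m + width * s)
    return means, lower, upper


# Made-up training losses for three random seeds (illustrative only).
trials = [[2.0, 1.0, 0.5], [2.2, 1.2, 0.7], [1.8, 0.8, 0.3]]
means, lower, upper = mean_and_band(trials)
```

With only three trials the sample standard deviation is itself noisy, so such bands are best read as a rough indication of seed-to-seed variability.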
The standard cross entropy loss was used for training and plots. We used pre-activation ResNet with 18 layers (he2016identity, ) for CIFAR-10, CIFAR-100, and SVHN data sets, whereas we used a standard variant of LeNet (LeCun, ) with ReLU for other data sets; i.e., the architecture of the variant of LeNet consists of the following five layers (with the three hidden layers): (1) input layer, (2) convolutional layer with 64 filters, followed by max pooling of size of 2 by 2 and ReLU, (3) convolutional layer with 64 filters, followed by max pooling of size of 2 by 2 and ReLU, (4) fully connected layer with 1014 output units, followed by ReLU, and (5) Fully connected layer with the number of output units being equal to the number of target classes. All hyper-parameters were fixed a priori across all different data sets and models. We fixed the mini-batch size to be 64, the initial learning rate to be , the momentum coefficient to be and we use scaling factor and 2. The learning rate was divided by at the beginning of 10th epoch for all experiments (with and without data augmentation), and of 100th epoch for those with data augmentation.
In this paper, we extended our previous work on the global adaptive activation function for deep and physics-informed neural networks by introducing locally adaptive activations. In particular, we present two versions of locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activation functions. Such local activation functions further improve the training speed of the neural network compared to its global predecessor. To further accelerate the training process, a slope recovery term based on the activation slope is added to the loss function for both layer-wise and neuron-wise activation functions, which is shown to enhance the performance of the neural network. To verify our claim, two test cases, namely nonlinear discontinuous function approximation and the viscous Burgers equation, are solved using deep and physics-informed neural networks, respectively. From these test cases, we conclude that the locally adaptive activations outperform fixed as well as global adaptive activations in terms of training speed. Moreover, while the proposed formulations increase the number of hyper-parameters compared to the fixed activation function, the overall computational cost is comparable. The proposed adaptive activation function with the slope recovery term is also shown to accelerate the training process in standard deep learning benchmark problems. Moreover, we theoretically prove that no sub-optimal critical point or local minimum attracts gradient descent algorithms in the proposed methods (L-LAAF and N-LAAF) with the slope recovery term under only mild assumptions.
This work was supported by the Department of Energy PhILMs grant DE-SC0019453, and by the DARPA-AIRA grant HR00111990025.
Appendix A Proof of Theorem 2.1
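We first prove the statement by contradiction for L-LAAF. Suppose that the parameter vector $\hat{\Theta}$ consisting of $\{w^k, b^k\}_{k=1}^{D}$ and $\{a^k\}_{k=1}^{D-1}$ is a limit point of $(\hat{\Theta}_m)_{m \in \mathbb{N}}$ and a sub-optimal critical point or a sub-optimal local minimum.
Let $\ell_f^i := \varphi^i(u_{\hat{\Theta}}(\rho^i))$ and $\ell_u^i := |u^i - u_{\hat{\Theta}}(x_u^i, t_u^i)|^2$. Let $z_f^{i,k}$ and $z_u^{i,k}$ be the outputs of the $k$-th layer for $\rho^i$ and $(x_u^i, t_u^i)$, respectively. Define
$$h_f^{i,k,j} := n a^k (w^{k,j} z_f^{i,k-1} + b^{k,j}) \in \mathbb{R}, \qquad h_u^{i,k,j} := n a^k (w^{k,j} z_u^{i,k-1} + b^{k,j}) \in \mathbb{R},$$
for all $j \in \{1, \dots, N_k\}$, where $w^{k,j} \in \mathbb{R}^{1 \times N_{k-1}}$ and $b^{k,j} \in \mathbb{R}$.
Following the proofs in [4, Propositions 1.2.1-1.2.4], we have that $\nabla J(\hat{\Theta}) = 0$ and $J(\hat{\Theta}) < J_c(0) + \mathcal{S}(0)$, for all three cases of the conditions corresponding to the different rules of the learning rate. Therefore, we have that for all $k \in \{1, \dots, D-1\}$,
$$\frac{\partial J(\hat{\Theta})}{\partial a^k} = \frac{n}{N_f} \sum_{i=1}^{N_f} \sum_{j=1}^{N_k} \frac{\partial \ell_f^i}{\partial h_f^{i,k,j}} (w^{k,j} z_f^{i,k-1} + b^{k,j}) + \frac{n}{N_u} \sum_{i=1}^{N_u} \sum_{j=1}^{N_k} \frac{\partial \ell_u^i}{\partial h_u^{i,k,j}} (w^{k,j} z_u^{i,k-1} + b^{k,j}) + \frac{\partial \mathcal{S}(a)}{\partial a^k} = 0.$$
Furthermore, we have that for all $k \in \{1, \dots, D-1\}$ and all $j \in \{1, \dots, N_k\}$,
$$\frac{\partial J(\hat{\Theta})}{\partial w^{k,j}} = \frac{n a^k}{N_f} \sum_{i=1}^{N_f} \frac{\partial \ell_f^i}{\partial h_f^{i,k,j}} (z_f^{i,k-1})^\top + \frac{n a^k}{N_u} \sum_{i=1}^{N_u} \frac{\partial \ell_u^i}{\partial h_u^{i,k,j}} (z_u^{i,k-1})^\top = 0, \qquad \frac{\partial J(\hat{\Theta})}{\partial b^{k,j}} = \frac{n a^k}{N_f} \sum_{i=1}^{N_f} \frac{\partial \ell_f^i}{\partial h_f^{i,k,j}} + \frac{n a^k}{N_u} \sum_{i=1}^{N_u} \frac{\partial \ell_u^i}{\partial h_u^{i,k,j}} = 0.$$
Combining these identities gives, for all $k \in \{1, \dots, D-1\}$,
$$0 = a^k \frac{\partial J(\hat{\Theta})}{\partial a^k} = \sum_{j=1}^{N_k} \left( w^{k,j} \left( \frac{\partial J(\hat{\Theta})}{\partial w^{k,j}} \right)^{\top} + b^{k,j} \frac{\partial J(\hat{\Theta})}{\partial b^{k,j}} \right) + a^k \frac{\partial \mathcal{S}(a)}{\partial a^k} = a^k \frac{\partial \mathcal{S}(a)}{\partial a^k} = -\frac{(D-1)\, a^k \exp(a^k)}{\left( \sum_{k'=1}^{D-1} \exp(a^{k'}) \right)^2},$$
which implies that $a^k = 0$ for all $k$, since $(D-1) \exp(a^k) \left( \sum_{k'=1}^{D-1} \exp(a^{k'}) \right)^{-2} \neq 0$. This implies that $J(\hat{\Theta}) = J_c(0) + \mathcal{S}(0)$, which contradicts $J(\hat{\Theta}) < J_c(0) + \mathcal{S}(0)$. This proves the desired statement for L-LAAF.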
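For N-LAAF, we prove the statement by contradiction. Suppose that the parameter vector $\hat{\Theta}$ consisting of $\{w^k, b^k\}_{k=1}^{D}$ and $\{a_j^k\}_{k=1}^{D-1}$, $j = 1, \dots, N_k$, is a limit point of $(\hat{\Theta}_m)_{m \in \mathbb{N}}$ and a sub-optimal critical point or a sub-optimal local minimum. Redefine
$$h_f^{i,k,j} := n a_j^k (w^{k,j} z_f^{i,k-1} + b^{k,j}) \in \mathbb{R}, \qquad h_u^{i,k,j} := n a_j^k (w^{k,j} z_u^{i,k-1} + b^{k,j}) \in \mathbb{R},$$
for all $j \in \{1, \dots, N_k\}$, where $w^{k,j} \in \mathbb{R}^{1 \times N_{k-1}}$ and $b^{k,j} \in \mathbb{R}$. Then, by the same proof steps, we have that $\nabla J(\hat{\Theta}) = 0$ and $J(\hat{\Theta}) < J_c(0) + \mathcal{S}(0)$, for all three cases of the conditions corresponding to the different rules of the learning rate. Therefore, we have that for all $k \in \{1, \dots, D-1\}$ and all $j \in \{1, \dots, N_k\}$,
$$0 = a_j^k \frac{\partial J(\hat{\Theta})}{\partial a_j^k} = w^{k,j} \left( \frac{\partial J(\hat{\Theta})}{\partial w^{k,j}} \right)^{\top} + b^{k,j} \frac{\partial J(\hat{\Theta})}{\partial b^{k,j}} + a_j^k \frac{\partial \mathcal{S}(a)}{\partial a_j^k} = a_j^k \frac{\partial \mathcal{S}(a)}{\partial a_j^k} = -\frac{(D-1)\, a_j^k \exp\left( \frac{1}{N_k} \sum_{i=1}^{N_k} a_i^k \right)}{N_k \left( \sum_{k'=1}^{D-1} \exp\left( \frac{1}{N_{k'}} \sum_{i=1}^{N_{k'}} a_i^{k'} \right) \right)^2},$$
which implies that $a_j^k = 0$ for all $k$ and $j$, since the factor multiplying $a_j^k$ is nonzero. This implies that $J(\hat{\Theta}) = J_c(0) + \mathcal{S}(0)$, which contradicts $J(\hat{\Theta}) < J_c(0) + \mathcal{S}(0)$. This proves the desired statement for N-LAAF.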
- (1) D. P. Kingma, J. L. Ba, ADAM: A method for stochastic optimization, arXiv:1412.6980v9, 2017.
- (2) C. Basdevant, et al., Spectral and finite difference solution of the Burgers equation, Comput. Fluids, 14 (1986) 23-41.
- (3) H. Bateman, Some recent researches on the motion of fluids, Monthly Weather Review, 43(4), 163-170, 1915.
- (4) D.P. Bertsekas, Nonlinear programming, Athena scientific Belmont, 1999.
- (5) J.M. Burgers, A mathematical model illustrating the theory of turbulence, in: Advances in Applied Mechanics, Vol. 1, pp. 171-199, 1948.
- (6) A.D. Jagtap, G.E. Karniadakis, Adaptive activation functions increases convergence of deep and physics-informed neural networks, arXiv:1906.01170v1, 4th June 2019.
- (7) M. Dushkoff, R. Ptucha, Adaptive Activation Functions for Deep Networks, Electronic Imaging, Computational Imaging XIV, pp. 1-5(5).
- (8) A.G. Baydin, B.A. Pearlmutter, A.A. Radul, J.M. Siskind, Automatic differentiation in machine learning: a survey, Journal of Machine Learning Research, 18 (2018) 1-43.
- (9) G.B. Whitham, Linear and nonlinear waves, Vol. 42, John-Wiley & Sons, 2011.
- (10) K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, European conference on computer vision, pp. 630-645, 2016.
- (11) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324, 1998.
- (12) B. Li, Y. Li and X. Rong, The extreme learning machine learning algorithm with tunable activation function, Neural Computing and Applications, 22 (2013) 531-539.
- (13) G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
- (14) V. Kunc and J. Kléma, On transformative adaptive activation functions in neural networks for gene expression inference, bioRxiv, doi:http://dx.doi.org/10.1101/587287, 2019.
- (15) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- (16) A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- (17) M. Raissi, P. Perdikaris, G.E. Karniadakis, Physics-informed neural network: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378, 686-707, 2019.
- (18) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 15(Jun):1929-1958, 2014.
- (19) J.-X Wang, et al., A comprehensive physics-informed machine learning framework for predictive turbulence modeling, arXiv:1701.07102.
- (20) S. Ruder, An overview of gradient descent optimization algorithms, arXiv:1609.04747v2, 2017.
- (21) S. Qian, et al, Adaptive activation functions in convolutional neural networks, Neurocomputing Volume 272, 10 January 2018, Pages 204-212.
- (22) Y. Shen, B. Wang, F. Chen and L. Cheng, A new multi-output neural model with tunable activation function and its applications, Neural Processing Letters, 20: 85-104, 2004.
- (23) Tactile Srl, Brescia, Italy (1994). Semeion Handwritten Digit Data Set. Semeion Research Center of Sciences of Communication, Rome, Italy
- (24) T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, D. Ha, Deep learning for classical Japanese literature, arXiv:1812.01718.
- (25) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- (26) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- (27) C. Yu, et al., An adaptive activation function for multilayer feedforward neural networks, 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering, TENCON '02, Proceedings.
- (28) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A.Y. Ng, Reading digits in natural images with unsupervised feature learning, NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.