Adaptive Blending Units: Trainable Activation Functions for Deep Neural Networks
Abstract
The most widely used activation functions in current deep feed-forward neural networks are rectified linear units (ReLU), and many alternatives have been successfully applied, as well. However, none of the alternatives have managed to consistently outperform the rest and there is no unified theory connecting properties of the task and network with properties of activation functions for most efficient training. A possible solution is to have the network learn its preferred activation functions. In this work, we introduce Adaptive Blending Units (ABUs), a trainable linear combination of a set of activation functions. Since ABUs learn the shape, as well as the overall scaling of the activation function, we also analyze the effects of adaptive scaling in common activation functions. We experimentally demonstrate advantages of both adaptive scaling and ABUs over common activation functions across a set of systematically varied network specifications. We further show that adaptive scaling works by mitigating covariate shifts during training, and that the observed advantages in performance of ABUs likewise rely largely on the activation function’s ability to adapt over the course of training.
Leon René Sütfeld, Flemming Brieger, Holger Finger, Sonja Füllhase & Gordon Pipa
Institute for Cognitive Science
Osnabrück University
Wachsbleiche 27, 49090 Osnabrück, Germany
{lsuetfel,fbrieger,hofinger,sfuellhase,gpipa}@uos.de
1 Introduction
Deep neural networks are structured around layers, each of which performs a linear transformation of its input before feeding the signal through a scalar non-linear activation function. Chaining larger numbers of non-linear functions then allows the networks to find and extract complex features in the input. Activation functions thus play a key role in deep neural networks: without intermittent non-linearities, these networks could only perform linear operations on the input. But despite the large number of activation functions that have proven successful in the literature, it remains unclear which properties of an activation function are most desirable for a particular task and network configuration. Ideally, the network would sort this issue out by itself, but most common activation functions are fixed during training, i.e., their shape and scaling are treated as hyperparameters. We suggest changing this practice by making an activation function’s shape and scaling trainable parameters of the network. Our main contribution in this work is the Adaptive Blending Unit (ABU), a linear combination of a set of basic activation functions that allows the shape and scaling of the resulting activation function to be learned during training. In an effort to separate and understand the effects of the activation function’s shape and its scaling, we also examine the effect of adaptive scaling on common activation functions without adaptation of the shape, as well as normalizing the blending weights in ABUs, thus learning their shape without learning any scaling. Throughout this work, we apply one scaling weight, or one set of blending weights (i.e., one ABU), per layer of the network. This way, the network is free to learn the activation function and/or scaling that best suits the computations performed in any given layer, while the number of parameters in the network is kept low enough not to require regularization. The remainder of this work is structured as follows.
In section 2, we review related work, before comparing and analyzing common activation functions, their adaptively scaled counterparts, and ABUs on CIFAR image classification tasks in section 3. In section 4, we examine multiple ways of normalizing ABUs, to provide an account of adaptive shape without adaptive scaling. Finally, in section 5, we examine pre-training of the scaling and blending weights, to assess the role of adaptiveness over the course of training.
2 Related Work
The most prevalent activation function in modern neural networks is the Rectified Linear Unit (ReLU) (Hahnloser et al., 2000; Nair & Hinton, 2010), a piecewise-linear function returning the identity for all positive inputs and zero otherwise. Its constant derivative of 1 on the positive part helps alleviate the vanishing gradient problem (Glorot et al., 2011), making it the first activation function allowing for a large number of stacked layers to be trained efficiently. With this, ReLU was partly responsible for the breakthrough of deep neural networks around 2012, marked by AlexNet’s victory in the annual ILSVRC challenge (Krizhevsky et al., 2012). Leaky ReLU (LReLU) (Maas et al., 2013), Parametric ReLU (PReLU) (He et al., 2015), and Randomized Leaky ReLU (RReLU) (Xu et al., 2015) are all based on ReLU, but replace the zero output for negative values by a linear function. In PReLU, the slope of the negative part of the function is controlled by a trainable parameter. Exponential Linear Units (ELU) (Clevert et al., 2015), like ReLU, return the identity for positive values, but $\alpha (e^{x} - 1)$ for negative values, with $\alpha$ typically set to 1. Scaled Exponential Linear Units (SELU) (Klambauer et al., 2017) are identical to ELU, except for an additional scaling parameter $\lambda$ acting upon the function as a whole. The values for $\alpha$ and $\lambda$ in SELUs are analytically derived to ensure convergence of activations towards zero mean and unit variance across layers. In a more empirical approach, Ramachandran et al. (2017) performed a large reinforcement learning-based search for successful activation functions, and found multiple novel and well-performing functions. The most successful, given by $f(x) = x \cdot \sigma(\beta x)$ and named Swish, uses the trainable parameter $\beta$ to control the overall shape of the function. E-Swish (Alcaide, 2018) drops this parameter (setting it to 1), and instead scales the whole function by a manually determined parameter between 1 and 2.
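As a compact reference for the functions surveyed above, the following is a minimal sketch of their standard scalar definitions in plain Python. The leaky slope of 0.01 and the E-Swish scale of 1.5 are placeholder values chosen for illustration, not values taken from this work; the SELU constants are the ones commonly quoted for Klambauer et al. (2017).

```python
import math

# Commonly quoted SELU constants (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    """Identity for positive inputs, zero otherwise."""
    return x if x > 0.0 else 0.0


def leaky_relu(x, slope=0.01):
    """ReLU with a linear negative part; the slope value is an assumption."""
    return x if x > 0.0 else slope * x


def elu(x, alpha=1.0):
    """Identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    """ELU with fixed alpha, scaled as a whole by a fixed factor."""
    return SELU_SCALE * elu(x, SELU_ALPHA)


def sigmoid(x):
    """Logistic sigmoid (not guarded against overflow for very negative x)."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x, beta=1.0):
    """x * sigmoid(beta * x); beta is a trainable parameter in Swish."""
    return x * sigmoid(beta * x)


def e_swish(x, scale=1.5):
    """Swish with beta fixed to 1 and a hand-picked overall scale (assumed value)."""
    return scale * x * sigmoid(x)
```

PReLU corresponds to `leaky_relu` with `slope` treated as a trainable parameter rather than a constant.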
In addition to these, there are numerous approaches in which the activation function’s overall shape is learned, often using multiple parameters: Adaptive Piecewise Linear Units (APL) learn the slope of all piecewise linear elements and the position of the hinges independently for each neuron via backpropagation, while the number of linear pieces is a hyperparameter that is set manually (Agostinelli et al., 2014). Similarly, Maxout activations (Goodfellow et al., 2013) learn a convex piecewise-linear function by returning the maximum of a fixed set of neurons, while the regular network weights determine the shape of the resulting function. Eisenach et al. (2016) use Fourier series basis expansion to approximate non-linear parameterized basis functions, and train one activation function per feature map in a convolutional network. An approach suggested by Godfrey & Gashler (2015), called the soft exponential function, can switch between a large number of different mathematical operations, such as addition, multiplication and exponentiation, by adjusting a trainable parameter. However, to our knowledge, no empirical validation of the approach has been offered so far. In an approach similar to ours, Dushkoff & Ptucha (2016) suggested blending a set of activation functions on a per-neuron basis, and constraining the blending weights to values between 0 and 1 by gating them with exponential sigmoid functions. Blending activation functions on a per-neuron basis, however, required downscaling of gradient updates as a form of regularization. Most similar to our approach, Manessi & Rozza (2018) suggested a learned blending of multiple common activation functions per layer, where the blending weights are constrained to sum up to 1, and showed this approach to be successful over a range of tasks and network configurations. We will provide further details with respect to the similarities between their approach and ours in the appropriate sections.
3 Adaptive scaling and Adaptive Blending Units
In this section, we will introduce ABUs as an extension to the idea of an adaptive scaling of common activation functions, and analyze both ABUs and adaptive scaling with respect to task performance and the mechanics they introduce to the network. The activation functions we used as a baseline throughout this work are the hyperbolic tangent (tanh), ReLU, ELU, SELU, the identity (I) and Swish. We will reference the adaptively scaled versions of these by adding “α” to the function’s name, e.g., “αReLU”.
3.1 Methods
Given a deep neural network of $n$ layers, and an activation function $f(x)$, the adaptively scaled version of the activation function is given by $\alpha_i \cdot f(x)$, with $i \in \{1, \dots, n\}$. The scaling weights $\alpha_i$ start from a fixed default initialization and are trained via backpropagation alongside all other network parameters. Swish’s $\beta$ is initialized as a trainable parameter per layer (i.e., $\beta_i$) and likewise trained via backpropagation in all cases. ABUs can be viewed as an extension to this approach, in which the shape of the activation function is determined by a blending of multiple common activation functions within the unit. Given a deep neural network of $n$ layers, and a set of $m$ activation functions $f_1, \dots, f_m$ per layer, the ABU for the $i$-th layer is defined as $\mathrm{ABU}_i(x) = \sum_{j=1}^{m} \alpha_{ij} \cdot f_j(x)$, with $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$. The blending weights $\alpha_{ij}$ likewise start from a fixed default initialization, and are also trained via backpropagation alongside all other network parameters. With respect to the set of activation functions used in ABUs, we chose tanh, ELU, ReLU, the identity, and Swish in order to allow for high flexibility of the resulting function. However, we did not conduct an exhaustive search over possible sets of activation functions, so other sets may outperform the chosen configuration.
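The two definitions in this subsection can be illustrated with a minimal forward-pass sketch in plain Python. It only evaluates the functions for one layer and one scalar input; training the weights via backpropagation is not shown. The helper names and the initial values used here (a scaling weight of 1.0 and uniform blending weights of 1/m) are assumptions made for the example.

```python
import math


def relu(x):
    return x if x > 0.0 else 0.0


def elu(x, alpha=1.0):
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)


def identity(x):
    return x


def swish(x, beta=1.0):
    return x / (1.0 + math.exp(-beta * x))


# The set of base functions named in the text: tanh, ELU, ReLU, identity, Swish.
BASE_FUNCTIONS = [math.tanh, elu, relu, identity, swish]


def scaled_activation(f, x, alpha=1.0):
    """Adaptively scaled activation: one scaling weight alpha per layer."""
    return alpha * f(x)


def default_blend_weights(m):
    """Uniform initial blending weights (an assumed default, 1/m each)."""
    return [1.0 / m] * m


def abu(x, blend_weights, functions=BASE_FUNCTIONS):
    """Adaptive Blending Unit: weighted sum of the base activation functions."""
    return sum(a * f(x) for a, f in zip(blend_weights, functions))
```

For instance, `abu(x, [0.0, 0.0, 1.0, 0.0, 0.0])` reduces to plain ReLU, while weights of mixed sign, such as `[0.0, 1.0, -1.0, 0.0, 0.0]`, subtract ReLU from ELU and thereby flatten the positive part of the function.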
The CIFAR 10 and CIFAR 100 datasets (Krizhevsky & Hinton, 2009) served as benchmarks to assess the performance of our approaches. Per-image z-transformation was applied as pre-processing to all images, and of the training set was used as a validation set during training. To evaluate model performance, we applied post-hoc early stopping: The model was saved once every epochs and the validation accuracy was estimated frequently over the course of training. All networks were trained for steps, after which we smoothed the validation accuracy curve and selected the model save point for which said curve indicated the highest performance. For each network and task specification, we report the mean of 30 runs, as well as the standard error. Training, validation and testing were all performed using mini-batches.
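The post-hoc early stopping procedure described above can be sketched as follows. The smoothing method is not specified in the text, so the centered moving average, its window size, and the function names are assumptions for illustration only.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def select_save_point(val_steps, val_acc, save_steps, window=3):
    """Pick the save step with the highest smoothed validation accuracy.

    val_steps / val_acc: steps at which validation accuracy was estimated.
    save_steps: steps at which the model was saved.
    """
    smoothed = moving_average(val_acc, window)

    def smoothed_at(step):
        # Use the smoothed accuracy at the evaluation step nearest to the save.
        nearest = min(range(len(val_steps)),
                      key=lambda i: abs(val_steps[i] - step))
        return smoothed[nearest]

    return max(save_steps, key=smoothed_at)
```

Smoothing before selection makes the choice less sensitive to a single lucky validation estimate: a save point next to an isolated spike in the raw curve is no longer preferred over one in a consistently good region.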
For the networks used in our tests, we created a set of small to mid-sized convolutional networks, called Simple Modular Convolutional Networks (SMCN). In different variations of these, features were added or subtracted to test the robustness of our approaches across different network design choices. The vanilla SMCN consists of four convolutional layers, followed by two dense layers. Max pooling (, stride ) is performed after the second and fourth convolutional layer, and dropout (Srivastava et al., 2014) with a rate of is applied after the first and third convolutional layer, and after the first dense layer. The convolutional layers use zero-padding and stride . They feature filters of size , , , and , and the dense layers consist of 384 and 192 neurons, respectively (see Figure 1). The network contains no residual connections, and no batch normalization (Ioffe & Szegedy, 2015) is performed by default. Initial weights are randomly sampled using He initialization (He et al., 2015). Bias units were initialized at , except for the first convolutional layer (). The network is trained for steps on mini-batches of size , using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of . The total number of trainable parameters in the vanilla SMCN is roughly M. In addition to this, we used the following variations in our tests: SMCN_{10}, a mid-sized network ( layers, roughly M parameters) identical to SMCN, with the exception that all layers between the two max pooling operations are repeated three times. SMCN_{S}, a simplified architecture where max pooling was replaced with average pooling, and all dropout layers were removed from the network (thus, the activation functions constitute the only non-linearities in this network). SMCN_{BN}, in which batch normalization is performed before applying the activation functions. We decided not to use dropout in this architecture, as batch normalization in conjunction with dropout can be problematic (Li et al., 2018). 
Note that since batch normalization negates the effect of any preceding scaling, adaptive scaling should not make a difference here. Finally, we also tested the vanilla SMCN with a Stochastic Gradient Descent optimizer with Momentum, instead of the Adam optimizer. Here, the networks were again trained for steps, with the momentum parameter set to , an initial learning rate of , and a learning schedule linearly decreasing the learning rate per weight update, reaching at the end of training.
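The remark that batch normalization negates any preceding scaling can be checked numerically. The sketch below standardizes a small batch of values with and without a preceding positive scaling factor; it leaves out the learned gain and offset of batch normalization as well as the small epsilon used in practice, and the function name is hypothetical.

```python
import math


def standardize(values):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var) for v in values]


batch = [0.5, -1.0, 2.0, 0.0]
alpha = 0.3  # any positive scaling weight

plain = standardize(batch)
scaled = standardize([alpha * v for v in batch])

# Both results agree up to floating-point error.
max_difference = max(abs(p - q) for p, q in zip(plain, scaled))
```

Because the mean and the standard deviation both scale with `alpha`, the factor cancels, which is why adaptive scaling is not expected to matter in SMCN_{BN}.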
3.2 Performance
Network | SMCN | SMCN_{10} | SMCN_{S} | SMCN_{BN} | SMCN | SMCN | |
---|---|---|---|---|---|---|---|
Optimizer | Adam | Adam | Adam | Adam | Momentum | Adam | Mean |
Task | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR100 | Rank |
I | |||||||
αI | |||||||
tanh | |||||||
αtanh | |||||||
ReLU | |||||||
αReLU | |||||||
ELU | |||||||
αELU | |||||||
SELU | |||||||
αSELU | |||||||
Swish | |||||||
αSwish | |||||||
ABU (ours) |
For our performance tests, we chose a vanilla SMCN with Adam optimizer and CIFAR10 as the default setup to compare the various activation functions. All other tested setups are systematically varied versions of this, and differ in only one aspect each, i.e., network architecture, optimizer, or task. On average, adding adaptive scaling yielded improved performance for all activation functions, as evidenced by higher mean ranks of all adaptively scaled activation functions, compared to their fixed counterparts (see Table 1). However, as expected beforehand, batch-normalized networks (SMCN_{BN}) were found to be indifferent to adaptive scaling. Interestingly, also in networks trained with the Momentum optimizer, adaptive scaling yielded little to no improvement over the fixed activation functions. Adaptive Blending Units, on the other hand, outperformed all other activation functions in four out of six setups (including the Momentum setup), showing remarkable robustness across architectural choices, and consequently scoring the highest mean rank of all tested activation functions. Since the ability to perform adaptive scaling is an integral part of Adaptive Blending Units, any improvements over adaptively scaled activation functions can likely be attributed to their adaptive shape.
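The mean rank used for this comparison can be computed as sketched below: within each setup, the activation functions are ranked by accuracy, and the ranks are then averaged over setups. How ties were handled is not stated in the text, so sharing the average rank among tied entries is an assumption, as are the function names. In this sketch rank 1 is the best, so a better mean rank is a smaller number.

```python
def rank_descending(scores):
    """Rank scores from best (1) to worst; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score is tied with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def mean_ranks(table):
    """table maps a method name to its list of accuracies, one per setup."""
    names = list(table)
    n_setups = len(next(iter(table.values())))
    totals = {name: 0.0 for name in names}
    for s in range(n_setups):
        setup_ranks = rank_descending([table[name][s] for name in names])
        for name, rank in zip(names, setup_ranks):
            totals[name] += rank
    return {name: totals[name] / n_setups for name in names}
```

Called with a dictionary of per-setup accuracies, `mean_ranks` returns one averaged rank per method, which is the quantity reported in the last column of the table.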
3.3 Analysis
But what exactly changes in the networks, when we introduce adaptive scaling or ABUs? In order to provide some insight into the mechanisms introduced by the two approaches, we carried out further analyses based on the default setup, i.e., a vanilla SMCN with Adam optimizer trained on CIFAR10.
Let us first examine how scaling weights behave during training. In our tests, the scaling weights almost unanimously decreased to values far below (see Figure 2A). This behavior was highly consistent over repeated runs with random initializations and mini-batch sampling: The mean standard deviation (over repeated runs) of the final scaling weights reached at the end of training is . With respect to how this influences the activations in the network, it is sensible to consider the succeeding layer’s pre-activation statistics, i.e., the distribution of values going into its activation function: The distribution of pre-activation is approximately Gaussian for large layers due to the Central Limit Theorem, and is thus easier to compare between networks with different activation functions. For many activation functions, the pre-activations are also crucial with respect to the magnitude of the gradients, in that they determine the fraction of inputs reaching saturated regions of the activation function. Our analyses show that the decreasing scaling weights rather precisely counteract an increase in the variance of the following weight matrix over the course of training. This stabilizes the distributions of pre-activation states in the following layers in both mean and variance, thus drastically reducing any covariate shift. We illustrate this by comparing the pre-activation variance of the last layer in SMCN networks, using tanh and tanh, in Figure 2B. Without adaptive scaling, the variance of pre-activations increased throughout training for all layers and all tested activation functions. With adaptive scaling, the standard deviations typically converged to a value between and early on in training, and remained stable at this value for the remainder of the training procedure. At the same time, pre-activation means were kept stable at less than a standard deviation from zero.
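The compensation between scaling weights and weight variance described above can be made explicit with a short idealized calculation. It assumes that the weights of the following layer are i.i.d. with zero mean and independent of the incoming activations, which only holds approximately for trained networks, and it ignores the bias. The symbols $n$, $w_{kj}$, $\sigma_w$ and $z_k$ are introduced here for illustration.

```latex
% Scaled outputs of one layer feed the pre-activation z_k of the next layer.
z_k = \sum_{j=1}^{n} w_{kj}\, \alpha f(x_j)

% With zero-mean i.i.d. weights of variance \sigma_w^2, independent of the inputs:
\mathbb{E}[z_k] = 0, \qquad
\operatorname{Var}(z_k) = n\, \sigma_w^2\, \alpha^2\, \mathbb{E}\!\left[f(x)^2\right]

% If training grows \sigma_w by a factor c > 1, shrinking \alpha by the same
% factor leaves the pre-activation variance unchanged:
n\,(c\,\sigma_w)^2 \left(\frac{\alpha}{c}\right)^{2} \mathbb{E}\!\left[f(x)^2\right]
  = n\, \sigma_w^2\, \alpha^2\, \mathbb{E}\!\left[f(x)^2\right]
```

Under these assumptions, a scaling weight that decreases in proportion to the growth of the following layer's weight standard deviation keeps the pre-activation variance constant.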
We take from this that adaptive scaling acts as a normalization technique, similar to batch normalization (Ioffe & Szegedy, 2015) or layer normalization (Ba et al., 2016). In contrast to these, however, adaptive scaling does not require any explicit calculation of variance or other statistics, or keeping track of running averages for inference, and does not depend on batch or layer size. In principle, it also allows the network to optimize the statistics of the neurons’ input distributions, instead of enforcing zero mean and unit variance across the layer or batch. Depending on the type of calculations that are predominantly performed by a given layer, it appears plausible that some layers would prefer stronger saturation, while others may benefit from less saturation, provided the activation functions feature increasingly saturating regions. That being said, our analysis does not allow us to infer whether or not the realized distributions actually constitute an optimum for the required computation in a given layer. If an activation function is a homogeneous function of degree 1 (scale-invariant; e.g., ReLU), the network performance would likely not be influenced by the variance of the pre-activation’s distribution, but may still be affected by shifts of the mean, which are also mitigated by adaptive scaling. We consider an in-depth analysis of such self-organizing processes, as well as further exploration of this principle for deep networks, highly desirable, but out of scope for this work.
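The homogeneity property mentioned above can be checked directly: for a positive factor c, ReLU satisfies f(c·x) = c·f(x), whereas a saturating function such as tanh does not. The following is a small numerical sketch; the helper name and the test points are arbitrary choices.

```python
import math


def relu(x):
    return x if x > 0.0 else 0.0


def is_homogeneous_degree_one(f, xs, factors, tol=1e-12):
    """Check f(c * x) == c * f(x) on the given points for positive factors c."""
    return all(abs(f(c * x) - c * f(x)) <= tol for x in xs for c in factors)


points = [-2.0, -0.5, 0.0, 0.5, 2.0]
factors = [0.1, 2.0, 10.0]

relu_is_homogeneous = is_homogeneous_degree_one(relu, points, factors)
tanh_is_homogeneous = is_homogeneous_degree_one(math.tanh, points, factors)
```

For ReLU, rescaling the pre-activations only rescales the output, so the fraction of inputs in each linear regime is unchanged. For tanh, the same rescaling moves inputs into or out of the saturated regions.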
Turning to ABUs, we observe the same normalizing effect on the pre-activation statistics of succeeding layers. As illustrated in Figure 2C, ABUs realize a layer’s overall downscaling in multiple ways. In the first convolutional layer, for instance, the weights unanimously decrease and mostly converge towards values close to zero. At the end of training, the weights of the identity and ReLU have arrived at effectively zero, while the final activation function is mostly a mixture of ELU and tanh. By contrast, the first dense layer achieves the overall downscaling of positive inputs by subtracting ReLU from a mixture of ELU and Swish. In both cases, the resulting function is rather flat, pushing the activations (layer output) closer to zero. These different compositions of blending weights translate into substantially different shapes of the resulting ABU (see Figure 2D). But while the variation of the ABUs’ shape across layers is substantial, their shape within each layer is remarkably consistent over repeated runs, as indicated by the low mean standard deviation per layer and blending weight. This consistency, in conjunction with the good performance figures achieved by ABUs, leads to the conclusion that the learned shapes are meaningful with respect to the computations performed in the network.
In summary, while adaptive scaling stabilizes the pre-activation statistics of succeeding layers, the learned shapes of the resulting functions are meaningful, as well. Moreover, both adaptive scaling and an adaptive shape were found to yield improvements in performance for image classification tasks with convolutional networks.
4 Normalized Blending Weights
Network | SMCN | SMCN_{10} | SMCN_{S} | SMCN_{BN} | SMCN | SMCN | |
---|---|---|---|---|---|---|---|
Optimizer | Adam | Adam | Adam | Adam | Momentum | Adam | Mean |
Task | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR100 | Rank |
ABU | |||||||
ABU_{nrm} | |||||||
ABU_{abs} | |||||||
ABU_{pos} | |||||||
ABU_{soft} |
So far, we focused on adaptive scaling as an integral part of ABUs. In order to better understand the effects of shape in ABUs, we conducted an additional experiment, in which we normalized the blending weights of the ABUs in four different ways, taking away their ability to scale the layer output by overall increases or decreases of the blending weights.
4.1 Methods
The following four methods of normalization for ABUs were used, where $\alpha_{ij}$ denotes the $j$-th blending weight of layer $i$: ABU_{nrm} denotes the case where a layer’s blending weights are normalized to sum up to 1 ($\sum_j \alpha_{ij} = 1$). The normalization was implemented as part of the graph, dividing the blending weights by their sum before applying them in the respective ABU. Similarly, in ABU_{abs}, we divided the raw blending weights by the sum of their absolute values, thus keeping the sum of the absolute values of the blending weights at 1 ($\sum_j |\alpha_{ij}| = 1$). Note that under this constraint, scaling is still possible, albeit not independent of the resulting shape: By having similar activation functions cancel each other out with blending weights on either side of zero, functions can be constructed that return only a fraction of the input, or even zero, for all positive values. We decided to include this form of normalization in the test to provide a more complete account of possible normalizations. In ABU_{pos}, any negative values are clipped before normalization, such that all blending weights are non-negative ($\alpha_{ij} \geq 0$; $\sum_j \alpha_{ij} = 1$). Finally, in ABU_{soft}, we realized the same constraint (all-positive and normalized) by applying softmax normalization to the blending weights. With the exception of ABU_{abs}, none of the normalized versions of ABUs can realize an overall scaling of the resulting functions for positive input values. For the experiments, we used the same network and task configurations as in the previous section.
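The four normalization schemes can be written down in a few lines. This is a minimal sketch operating on a plain list of raw blending weights for one layer; the function names are hypothetical, and the sketch does not guard against a zero denominator, which an actual implementation would need to handle.

```python
import math


def normalize_sum(weights):
    """ABU_nrm: divide by the sum, so the weights sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def normalize_abs(weights):
    """ABU_abs: divide by the sum of absolute values."""
    total = sum(abs(w) for w in weights)
    return [w / total for w in weights]


def normalize_pos(weights):
    """ABU_pos: clip negative weights to zero, then divide by the sum."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def normalize_softmax(weights):
    """ABU_soft: softmax, giving strictly positive weights that sum to 1."""
    peak = max(weights)  # subtract the maximum for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]
```

The special role of ABU_{abs} is visible in a two-function example: `normalize_abs([1.0, -1.0])` yields `[0.5, -0.5]`, so two similar activation functions largely cancel and the blended output is scaled down, even though the absolute weights still sum to 1.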
4.2 Performance & Analysis
The results of our performance tests are reported in Table 2. All five versions of ABUs showed remarkably similar performance throughout the tested task and network configurations, with only a small average gap between the best and weakest performing ABU in a given setup. In terms of mean rank, ABU and ABU_{abs} lead the field, and are thus the two most robust configurations. However, none of the other three versions fell far behind. We again used the default setup (vanilla SMCN, Adam, CIFAR10) for an analysis of the blending weights and their effects on the succeeding pre-activations. We found ABU_{abs} to behave much like unconstrained ABUs, implementing adaptive scaling and keeping the layer statistics constant over the course of training, thus mitigating covariate shift. Despite the fact that the scaling imposes constraints on the shape of the resulting function (as outlined above), ABU_{abs} performed very similarly to unconstrained ABUs in most settings. By contrast, but very much as expected, ABU_{nrm}, ABU_{pos}, and ABU_{soft} were unable to keep the layer statistics at constant levels, and a considerable covariate shift akin to that in fixed activation functions was observed. Interestingly, this appears to have only a minor impact on performance, and they were able to keep up with, or even outperform, unconstrained ABUs in some of the tested settings. The good performance of normalized ABUs in our tests is in line with Manessi & Rozza (2018), who found very similar or identical units to outperform common activation functions in MNIST, CIFAR and ImageNet tasks, using widely used network architectures, such as AlexNet and ResNet-56. (With respect to the constraints, Manessi & Rozza (2018)’s affine() units are equivalent to ABU_{nrm}, and their convex() units are equivalent to ABU_{pos}. Unfortunately, the authors did not provide details with respect to their implementation, so we cannot say whether or not the implementations are equivalent.)
In conclusion, while ABUs generally apply adaptive scaling when possible, the ability to learn the function’s shape by itself already helps to improve network performance beyond the level of the established activation functions they are comprised of.
5 Pre-Training Scaling and Blending weights
Network | SMCN | SMCN | SMCN | SMCN | SMCN |
Optimizer | Adam | Adam | Adam | Adam | Adam |
Task | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 | CIFAR10 |
init | pre-trained | pre-trained | pre-trained (norm.) | pre-trained (norm.) | |
trainable | ✓ | - | ✓ | - | ✓ |
tanh | - | - | |||
ReLU | - | - | |||
ELU | - | - | |||
ABU | |||||
ABU_{nrm} | - | - | |||
ABU_{abs} | - | - | |||
ABU_{pos} | - | - | |||
ABU_{soft} | - | - |
Finally, we investigated for both adaptive scaling and ABUs, whether or not the adaptiveness of scaling and blending weights by itself is an important factor for the overall performance of the network, and if the performance could possibly be further improved by using pre-trained weights. To this end, we set up an experiment with two main conditions. In both of them, we initialized the networks with the final scaling or blending weights of a preceding run. We then fixed these values after initialization in one condition, while keeping them adaptive in another. In case of ABUs, it is conceivable that the shape of the function at the end of training would be ideal, while the scaling may be too low at the beginning of a new run. Therefore, we added two more conditions akin to the main condition and only for unconstrained ABUs, in which we normalized the pre-trained blending weights after initialization, thus keeping the learned shape, while resetting the learned scaling. All tests were based on the default setup (vanilla SMCN, Adam, CIFAR10).
The results are shown in Table 3. For tanh and ReLU, initializing the scaling weights at the predominantly low final values of a preceding run helped to improve performance. In both cases, runs with fixed pre-trained already surpassed the performance of the preceding run, but keeping them adaptive over the course of training led to further improvements. The fact that fixed pre-trained scaling weights yielded an increase in performance suggests that the initial variance of weights in the weight matrices (derived using He initialization), may not have been ideal as initial conditions, despite resulting in pre-activation variances of about . ELU, by contrast, substantially lost performance after initialization with pre-trained scaling weights, irrespective of whether or not they were fixed or adaptive throughout the run. With the exception of ABU_{soft}, the same was the case in all versions of ABUs, where fixed blending weights, in particular, led to sizable drops in performance of up to four percent. ABU_{soft}, being the exception to this rule, improved slightly for fixed pre-trained blending weights, but lost performance with adaptive pre-trained blending weights. Normalizing the unrestricted ABU blending weights after initialization with pre-trained values led to improvements over non-normalized pre-trained blending weights, but the default setup with initialization at still performed best. Overall, we found all but one of the tested activation functions to perform best, when the blending weights were kept adaptive, as opposed to fixed after initialization. These results suggest that in both adaptive scaling and ABUs, much of the gained performance is won by keeping the scaling and / or shape adaptive. Moreover, the fact that this applies also to most normalized versions of ABUs indicates that there may not be any single optimal shape for an activation function in a given layer.
6 Conclusion & Outlook
In summary, we introduced Adaptive Blending Units (ABUs), and analyzed adaptive scaling for common activation functions. We found robust performance advantages of both approaches over established activation functions in a range of tasks and network architectures. In adaptive scaling, the performance advantages could be traced back to stabilized pre-activation statistics during training, mitigating covariate shift. The same behavior was found for unconstrained ABUs, while normalized ABUs reached similar levels of performance without the ability to significantly scale the layer output. Our results suggest that the adaptiveness of the shape over the course of training may play a major role in this, as well, as opposed to simply converging to some ideal shape.
With respect to adaptive scaling, a logical next step would be to introduce a shifting parameter per layer, to allow the network to further optimize the input distributions to the activation functions, and to move this learned normalization in front of the activation: $f(\alpha_i \cdot x + b_i)$, where $\alpha_i$ is the scaling weight and $b_i$ the shift of layer $i$. This form of self-organized normalization could be explicitly combined with ABUs, thus detaching the handling of layer statistics from the shape of the activation function. Recurrent networks may be a particularly interesting field of application for self-optimization of layer statistics, as it should, in principle, mitigate some of the issues associated with explicit normalization techniques. Interestingly, adaptive scaling has been discussed for neural populations outside of the field of deep learning and proven helpful in maintaining stable output distributions (Turrigiano & Nelson, 2004; Leugering & Pipa, 2018). Beyond this, an increase in the number of distinct ABUs within a layer may yield further improvements in performance, as may a systematic search for high-performing sets of activation functions in ABUs.
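The scale-and-shift variant suggested at the start of this outlook can be sketched as below. It was not implemented or evaluated in this work; the function names, the parameter names `scale` and `shift`, and the combination with a blended activation are illustrative assumptions.

```python
import math


def pre_normalized_activation(f, x, scale=1.0, shift=0.0):
    """Apply a learned per-layer scale and shift before the activation f."""
    return f(scale * x + shift)


def pre_normalized_blend(x, blend_weights, functions, scale=1.0, shift=0.0):
    """Combine the learned input normalization with a blended activation.

    The scale and shift handle the input statistics, while the blending
    weights only determine the shape of the resulting function.
    """
    z = scale * x + shift
    return sum(a * f(z) for a, f in zip(blend_weights, functions))
```

With `scale=1.0` and `shift=0.0` both functions reduce to the unmodified activation, so the variant could start from a standard network and learn the input normalization during training.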
Acknowledgments
We would like to thank the Nvidia Corporation for their kind donation of a Titan X Pascal graphics card, as well as Tom Hatton for his assistance in this project.
References
- Agostinelli et al. (2014) Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
- Alcaide (2018) Eric Alcaide. E-swish: Adjusting activations to different network depths. arXiv preprint arXiv:1801.07145, 2018.
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
- Dushkoff & Ptucha (2016) Michael Dushkoff and Raymond Ptucha. Adaptive activation functions for deep networks. Electronic Imaging, 2016(19):1–5, 2016.
- Eisenach et al. (2016) Carson Eisenach, Zhaoran Wang, and Han Liu. Nonparametrically learning activation functions in deep neural nets. 2016.
- Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
- Godfrey & Gashler (2015) Luke B Godfrey and Michael S Gashler. A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks. In Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), 2015 7th International Joint Conference on, volume 1, pp. 481–486. IEEE, 2015.
- Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
- Hahnloser et al. (2000) Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
- Klambauer et al. (2017) Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 972–981, 2017.
- Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Leugering & Pipa (2018) Johannes Leugering and Gordon Pipa. A unifying framework of synaptic and intrinsic plasticity in neural populations. Neural computation, (Early Access):1–42, 2018.
- Li et al. (2018) Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. arXiv preprint arXiv:1801.05134, 2018.
- Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, pp. 3, 2013.
- Manessi & Rozza (2018) Franco Manessi and Alessandro Rozza. Learning combinations of activation functions. arXiv preprint arXiv:1801.09403, 2018.
- Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
- Ramachandran et al. (2017) Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017. URL http://arxiv.org/abs/1710.05941.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
- Turrigiano & Nelson (2004) Gina G Turrigiano and Sacha B Nelson. Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5(2):97, 2004.
- Xu et al. (2015) Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.