On the Selection of Initialization and Activation Function for Deep Neural Networks
Abstract
The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the learning procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during backpropagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by Schoenholz et al. [13] who showed that for deep feedforward neural networks only a specific choice of hyperparameters known as the ‘edge of chaos’ can lead to good performance. We complete these recent results by providing quantitative results showing that, for a class of ReLU-like activation functions, the information indeed propagates deeper when the network is initialized on the edge of chaos. By extending our analysis to a larger class of functions, we then identify an activation function, $\phi(x) = x \cdot \mathrm{sigmoid}(x)$, which improves the information propagation over ReLU-like functions and does not suffer from the vanishing gradient problem. We demonstrate empirically that this activation function combined with a random initialization on the edge of chaos outperforms standard approaches. This complements recent independent work by Ramachandran et al. [12] who observed empirically in extensive simulations that this activation function performs better than many alternatives.
Soufiane Hayou, Arnaud Doucet, Judith Rousseau Department of Statistics University of Oxford {soufiane.hayou, arnaud.doucet, judith.rousseau}@stats.ox.ac.uk
Preprint. Work in progress.
1 Introduction
Deep neural networks have become extremely popular in machine learning as they achieve state-of-the-art performance on a variety of important applications including language processing and computer vision; see, e.g., [4]. The success of these models has motivated the use of increasingly deep networks and stimulated a large body of work to understand their theoretical properties. It is impossible to provide here a comprehensive summary of the large number of contributions within this field. To cite a few results relevant to our contributions, Montufar et al. [8] have shown that neural networks have exponential expressive power with respect to the depth while Poole et al. [11] obtained similar results using a topological measure of expressiveness.
We follow here the approach of Poole et al. [11] and Schoenholz et al. [13] by investigating the behaviour of random networks in the infinite-width and finite-variance i.i.d. weights context where they can be approximated by a Gaussian process as established by Matthews et al. [7] and Lee et al. [5]. This Gaussian process approximation generalizes the work of Neal [9] from one to multiple layers.
In this paper, our contribution is threefold. First, we give a theoretical framework clarifying the results of Poole et al. [11] and Schoenholz et al. [13] and show that initializing a network on the edge of chaos is linked to a deeper propagation of the information through the network. In particular, we establish that for a class of ReLU-like activation functions, the exponential depth scale introduced in [13] is replaced by a polynomial depth scale. This implies that the information can propagate deeper when the network is initialized on the edge of chaos. Second, we outline the limitations of ReLU-like activation functions by showing that, even on the edge of chaos, the limiting Gaussian process admits a degenerate kernel as the number of layers goes to infinity. We derive from first principles a novel activation function better suited for deep neural networks by requiring the latter to satisfy the following three conditions: 1) $\phi$ is non-polynomial (otherwise the output is a polynomial function of the input), 2) $\phi$ does not suffer from the exploding/vanishing gradient problem, and 3) $\phi$ allows a good ‘information flow’ through the network. We show that $\phi(x) = x \cdot \mathrm{sigmoid}(x)$ enjoys all of these properties. Third, we demonstrate empirically the good performance of this function and of the initialization procedure. In recent independent work, Ramachandran et al. [12] used automated search techniques to discover new activation functions and found in extensive simulations that functions of the form $x \cdot \mathrm{sigmoid}(\beta x)$ appear to perform indeed better than many alternatives, including ReLU, for deep networks. Our paper provides a theoretical grounding for these results.
2 On Gaussian process approximations of neural networks and their stability
2.1 Setup and notations
Consider a fully connected random neural network of depth $L$, widths $(N_l)_{1 \le l \le L}$, weights $W^l_{ij} \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_w^2 / N_{l-1})$ and bias $B^l_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_b^2)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution of mean $\mu$ and variance $\sigma^2$. For some input $x \in \mathbb{R}^d$, the propagation of this input through the network is given for an activation function $\phi : \mathbb{R} \to \mathbb{R}$ by
$$y^1_i(x) = \sum_{j=1}^{d} W^1_{ij}\, x_j + B^1_i, \qquad y^l_i(x) = \sum_{j=1}^{N_{l-1}} W^l_{ij}\, \phi\big(y^{l-1}_j(x)\big) + B^l_i \quad \text{for } l \ge 2. \qquad (1)$$
Throughout the paper we assume that, for all $l \ge 2$, the processes $y^l_i(\cdot)$ are independent (across $i$) centred Gaussian processes with covariance kernel $\kappa^l$ and write accordingly $y^l_i \stackrel{ind}{\sim} \mathcal{GP}(0, \kappa^l)$. This is an idealized version of the true processes corresponding to choosing $N_{l-1} = \infty$ (which implies, using the Central Limit Theorem, that $y^l_i(x)$ is a Gaussian variable for any input $x$). The approximation of $(y^l_i)_i$ as Gaussian processes was proposed in the single layer case by Neal [9] and has been extended to the multiple layer case by Lee et al. [5] and Matthews et al. [7].
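To make this concrete, here is a minimal simulation of the forward propagation (1) (a sketch of our own, not the authors' code; the ReLU activation, widths, and function names are arbitrary illustrative choices). Each draw samples a fresh random network and records one final-layer pre-activation; across draws this pre-activation behaves approximately like a centred Gaussian whose variance is set by $(\sigma_b, \sigma_w)$ and the input norm.

```python
import numpy as np

def sample_preactivation(x, widths, sigma_w, sigma_b, phi, rng):
    """One draw of a final-layer pre-activation y_1^L(x) for the random network (1)."""
    a = np.asarray(x, dtype=float)          # layer input (the data for l = 1)
    for n_out in widths:
        n_in = a.shape[0]
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        y = W @ a + b                        # pre-activations of the current layer
        a = phi(y)                           # post-activations feed the next layer
    return y[0]

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)
x = np.ones(10) / np.sqrt(10.0)              # input with ||x||^2 = 1

# 1000 independent draws of the same pre-activation
draws = np.array([
    sample_preactivation(x, [200, 200], np.sqrt(2.0), 0.0, relu, rng)
    for _ in range(1000)
])
# Empirical mean is close to 0 and the empirical variance close to
# sigma_w^2 ||x||^2 / d = 0.2, preserved across layers for these parameters
```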
Notice that $y^l_i(x) \sim \mathcal{N}(0, \kappa^l(x,x))$ for any input $x$, so that for any inputs $x, x' \in \mathbb{R}^d$
$$\kappa^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(y^{l-1}_1(x))\, \phi(y^{l-1}_1(x'))\big] = \sigma_b^2 + \sigma_w^2\, F_\phi\big(\kappa^{l-1}(x,x), \kappa^{l-1}(x,x'), \kappa^{l-1}(x',x')\big)$$
where $F_\phi$ is a function that depends only on $\phi$. This gives a recursion to calculate the kernel $\kappa^l$; see, e.g., [5] for more details. We can also express the kernel in terms of the correlation in the previous layer (we will use this expression in the rest of the paper)
$$\kappa^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\Big[\phi\big(\sqrt{q^{l-1}_x}\, Z_1\big)\, \phi\Big(\sqrt{q^{l-1}_{x'}}\, \big(c^{l-1} Z_1 + \sqrt{1 - (c^{l-1})^2}\, Z_2\big)\Big)\Big]$$
where $q^{l-1}_x := \kappa^{l-1}(x,x)$, resp. $c^{l-1}$, is the variance, resp. correlation, in the $(l-1)$th layer and $Z_1, Z_2$ are independent standard Gaussian random variables. Since $q^1_x = \sigma_b^2 + \sigma_w^2\, \|x\|^2/d$, the variance can be interpreted as (an affine function of) the square of the length of the input vector. This length is updated through the layers by the recursive formula $q^l_x = F(q^{l-1}_x)$, where $F$ is the ‘variance function’ given by
$$F(v) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{v}\, Z)^2\big]. \qquad (2)$$
Throughout the paper, $Z$, $Z_1$ and $Z_2$ will always denote independent standard Gaussian variables.
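The variance recursion above is easy to evaluate numerically. The following sketch (ours; it assumes the variance function of equation (2) and uses standard Gauss–Hermite quadrature for the Gaussian expectation — all function names are illustrative) iterates $F$ for Tanh and checks that two different initial variances reach the same fixed point.

```python
import numpy as np

# Probabilists' Gauss-Hermite nodes/weights, normalized so that
# sum(weights * g(nodes)) approximates E[g(Z)] for Z ~ N(0, 1)
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
weights = weights / weights.sum()

def F(v, sigma_w2, sigma_b2, phi):
    """Variance function (2): F(v) = sigma_b^2 + sigma_w^2 E[phi(sqrt(v) Z)^2]."""
    return sigma_b2 + sigma_w2 * np.sum(weights * phi(np.sqrt(v) * nodes) ** 2)

def limiting_variance(v0, sigma_w2, sigma_b2, phi, depth=200):
    """Iterate q^l = F(q^{l-1}) for `depth` layers starting from q^1 = v0."""
    v = v0
    for _ in range(depth):
        v = F(v, sigma_w2, sigma_b2, phi)
    return v

# Two different inputs (initial variances) converge to the same fixed point q
q_a = limiting_variance(0.5, 2.0, 0.1, np.tanh)
q_b = limiting_variance(3.0, 2.0, 0.1, np.tanh)
```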
2.2 Limiting behaviour of the variance and covariance operators
We analyze here the limiting behaviour of $q^l_x$ and of the correlation $c^l_{x,x'}$ as the network depth goes to infinity, under the assumption that $\phi$ has a second derivative at least in the distribution sense (ReLU admits a Dirac mass in 0 as second derivative and is thus covered by our developments). From now onward, we will also assume without loss of generality that $\sigma_b > 0$ (similar results can be obtained straightforwardly when $\sigma_b = 0$). We first need to define the domains of convergence associated with an activation function $\phi$.
Definition 1.
Let $\phi$ be an activation function and $(\sigma_b, \sigma_w) \in (\mathbb{R}_+)^2$.
(i) $(\sigma_b, \sigma_w)$ is in $D_{\phi,\mathrm{var}}$ (domain of convergence for the variance) if there exist $K > 0$ and $q \ge 0$ such that for any input $x$ with $\|x\| \le K$, $\lim_{l \to \infty} q^l_x = q$. We denote by $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w)$ the maximal $K$ satisfying this condition.
(ii) $(\sigma_b, \sigma_w)$ is in $D_{\phi,\mathrm{corr}}$ (domain of convergence for the correlation) if there exists $K > 0$ such that for any two inputs $x, x'$ with $\|x\|, \|x'\| \le K$, $\lim_{l \to \infty} c^l_{x,x'} = 1$. We denote by $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w)$ the maximal $K$ satisfying this condition.
Remark: Typically, $q$ in Definition 1 is a fixed point of the variance function (2). Therefore, it is easy to see that for any $(\sigma_b, \sigma_w)$ such that $F$ is increasing and admits at least one fixed point, we have $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) = \infty$ by taking $q$ to be the minimal fixed point, i.e. $q := \min\{v \ge 0 : F(v) = v\}$. This means that if we rescale the input data (assuming the data is in some compact set of $\mathbb{R}^d$) to have $q^1_x \le q$, the variance converges to $q$. We can also rescale the variance of the first layer (only) to have $q^1_x = q$ for all inputs $x$.
The next result gives sufficient conditions on $(\sigma_b, \sigma_w)$ for it to be in the domains of convergence of $\phi$.
Proposition 1.
Let $M_\phi := \sup_{v \ge 0} \big|\mathbb{E}\big[\phi'(\sqrt{v}\, Z)^2 + \phi(\sqrt{v}\, Z)\, \phi''(\sqrt{v}\, Z)\big]\big|$. Assume $M_\phi < \infty$; then for $\sigma_w^2 < 1/M_\phi$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) = \infty$.
Moreover, let $C_{\phi,\delta} := \sup_{v \ge \delta,\ x \in [-1,1]} \mathbb{E}\big[\phi'(\sqrt{v}\, Z_1)\, \phi'(\sqrt{v}\, U_2(x))\big]$ with $U_2(x) := x Z_1 + \sqrt{1-x^2}\, Z_2$. Assume $C_{\phi,\delta} < \infty$ for some positive $\delta$; then for $\sigma_w^2 < 1/C_{\phi,\delta}$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{corr}}$ and $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w) = \infty$.
Example: In the case of the ReLU activation function, we have $M_\phi = C_{\phi,\delta} = 1/2$, so that $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}} \cap D_{\phi,\mathrm{corr}}$ for any $\sigma_w^2 < 2$ and any $\sigma_b$.
In the domain of convergence $D_{\phi,\mathrm{corr}}$, $\lim_{l\to\infty} c^l_{x,x'} = 1$ for all $x, x'$, so that the outputs of the network become constant functions. Figure 1 illustrates this behaviour in dimension 2 for ReLU and Tanh for a deep network with a large number of neurons per layer. The draws of outputs of these networks are indeed almost constant.
To refine the analysis and improve our understanding of $D_{\phi,\mathrm{corr}}$, we study the rates of convergence of $q^l_x$ and $c^l_{x,x'}$. For two inputs $x$ and $x'$, we write $q^l := q^l_x$ and $c^l := c^l_{x,x'}$. Schoenholz et al. [13] established the existence of $\xi_q$ and $\xi_c$ such that $q^l - q \sim e^{-l/\xi_q}$ and $c^l - 1 \sim e^{-l/\xi_c}$ when the fixed points exist. The quantities $\xi_q$ and $\xi_c$ are called ‘depth scales’ since they represent the depth to which the information (variance and correlation respectively) can propagate without being exponentially small. More precisely, if we write $\chi_1 := \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{q}\, Z)^2\big]$ and $\alpha := \chi_1 + \sigma_w^2\, \mathbb{E}\big[\phi''(\sqrt{q}\, Z)\, \phi(\sqrt{q}\, Z)\big]$, then the depth scales are given by $\xi_q = -1/\log \alpha$ and $\xi_c = -1/\log \chi_1$. The equation $\chi_1 = 1$ is linked to an infinite depth scale of the correlation. It is called the edge of chaos as it separates two states: an ordered phase where the correlation converges to 1 if $\chi_1 < 1$, and a chaotic phase where $\chi_1 > 1$ and the correlation does not converge to 1. In particular, it is observed in [13] that the correlations converge to some limiting value $c < 1$ if $\chi_1 > 1$ when $x \ne x'$.
Definition 2.
For $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$, let $q$ be the limiting variance (the limiting variance is a function of $(\sigma_b, \sigma_w)$ but we do not emphasize it notationally). The Edge of Chaos, hereafter EOC, is the set of values of $(\sigma_b, \sigma_w)$ satisfying $\chi_1 = \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{q}\, Z)^2\big] = 1$.
To further study the EOC regime, the next lemma introduces a function $f$ called the ‘correlation function’ which simplifies the analysis of the correlations. It states that the correlation $c^l$ has the same asymptotic behaviour as the time-homogeneous dynamical system $c^{l+1} = f(c^l)$.
Lemma 1.
Let $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ such that the limiting variance $q > 0$, and $\phi$ an activation function such that $\sup_{v \in K} \mathbb{E}\big[\phi(\sqrt{v}\, Z)^2\big] < \infty$ for all compact sets $K \subset \mathbb{R}_+$. Define $f$ by $f(x) := \big(\sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{q}\, Z_1)\, \phi(\sqrt{q}\, U_2(x))\big]\big)/q$ with $U_2(x) := x Z_1 + \sqrt{1-x^2}\, Z_2$, and $(\gamma^l)_{l \ge 1}$ by $\gamma^1 = c^1$ and $\gamma^{l+1} = f(\gamma^l)$. Then $\lim_{l \to \infty} |c^l - \gamma^l| = 0$.
The condition on $\phi$ is violated only by activation functions with exponential growth (which are not used in practice), so from now onward, we use this approximation in our analysis. Note that being on the EOC is equivalent to $(\sigma_b, \sigma_w)$ satisfying $f'(1) = 1$. In the next section, we analyze this phase transition carefully for a large class of activation functions.
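The EOC condition $\chi_1 = 1$ can be located numerically. The sketch below (our own; it assumes $\chi_1 = \sigma_w^2\, \mathbb{E}[\phi'(\sqrt q\, Z)^2]$ with $q$ the limiting variance, and all solver names are ours) finds the critical $\sigma_w^2$ for Tanh by bisection; for $\sigma_b = 0$ the theoretical answer is $\sigma_w = 1$ since $q \to 0$ and $\chi_1 \to \sigma_w^2\, \phi'(0)^2$.

```python
import numpy as np

nodes, weights = np.polynomial.hermite_e.hermegauss(100)
weights = weights / weights.sum()

def E(g):
    """E[g(Z)] for Z ~ N(0, 1), by Gauss-Hermite quadrature."""
    return float(np.sum(weights * g(nodes)))

def limiting_q(sigma_w2, sigma_b2, phi, iters=2000):
    """Fixed point of the variance function (2), by direct iteration."""
    v = 1.0
    for _ in range(iters):
        v = sigma_b2 + sigma_w2 * E(lambda z: phi(np.sqrt(v) * z) ** 2)
    return v

def chi1(sigma_w2, sigma_b2, phi, dphi):
    """chi_1 = sigma_w^2 E[phi'(sqrt(q) Z)^2]; the EOC is chi_1 = 1."""
    q = limiting_q(sigma_w2, sigma_b2, phi)
    return sigma_w2 * E(lambda z: dphi(np.sqrt(q) * z) ** 2)

def eoc_sigma_w2(sigma_b2, phi, dphi, lo=0.5, hi=5.0):
    """Bisect chi_1(sigma_w^2) = 1; chi_1 increases with sigma_w^2."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if chi1(mid, sigma_b2, phi, dphi) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

dtanh = lambda z: 1.0 / np.cosh(z) ** 2
sw2 = eoc_sigma_w2(0.0, np.tanh, dtanh)   # expected near 1 for tanh, sigma_b = 0
```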
3 Edge of Chaos
To illustrate the effect of the initialization on the edge of chaos, we plot in Figure 2 the output of a ReLU neural network with 20 layers and 100 neurons per layer with parameters $(\sigma_b, \sigma_w) = (0, \sqrt{2})$ (as we will see later, this is the EOC in this scenario). Unlike the output in Figure 1, this output displays much more variability. However, we will prove here that the correlation still converges to 1 even in the EOC regime, albeit at a slower rate.
3.1 Homogeneous activation functions
We consider a positively homogeneous activation function $\phi$ (i.e., $\phi(\lambda x) = \lambda\, \phi(x)$ for $\lambda > 0$). It is easy to see that $\phi$ necessarily has the form $\phi(x) = \lambda x$ if $x \ge 0$ and $\phi(x) = \beta x$ if $x < 0$. ReLU corresponds to $\lambda = 1$ and $\beta = 0$. In this special case of homogeneous activation functions, we will see (Proposition 2) that the variance is unchanged ($q^l_x = q^1_x$) on the EOC, and $K_{\phi,\mathrm{var}}$ does not formally exist in the sense that the limit of $q^l_x$ depends on $x$. This does not however modify the analysis of the correlations, and it can be interpreted as an extension of the definition of the domains of convergence.
Proposition 2.
Let $\phi$ be a positively homogeneous function and $e := \mathbb{E}[\phi(Z)^2] = (\lambda^2 + \beta^2)/2$. Then for any $\sigma_w^2 < 1/e$ and any $\sigma_b$, we have $\lim_{l\to\infty} q^l_x = q$ with $q = \sigma_b^2/(1 - \sigma_w^2 e)$. Moreover $\mathrm{EOC} = \{(0, \sqrt{2/(\lambda^2+\beta^2)})\}$ and, on the EOC, $q^l_x = q^1_x$ for any $l$ and $x$.
This class of activation functions has the interesting property of preserving the variance across layers on the EOC since $q^l_x = q^1_x$. However, we show in Proposition 3 below that, even in this EOC regime, the correlations converge to 1, only at a slower rate. We only present the result for ReLU but the generalization to the class of homogeneous functions is straightforward.
Example: ReLU: The EOC is reduced to the singleton $(\sigma_b, \sigma_w^2) = (0, 2)$. From Proposition 2, the correlation does not depend on the variance. In [1], the authors derive a general formula for the kernel of activations of the form $x \mapsto \max(0, x)^n$. In the next result, we use a different method to derive the correlation function when the network is initialized on the EOC, and we further show that the correlations converge to 1 at a polynomial rate of $1/l^2$.
Proposition 3 (ReLU kernel).
Consider a ReLU network with parameters $(\sigma_b, \sigma_w^2) = (0, 2)$ on the EOC. We have
(i) for $x \in [-1, 1]$, $f(x) = \frac{1}{\pi}\big(x \arcsin(x) + \sqrt{1-x^2}\big) + \frac{x}{2}$;
(ii) for any $x \ne x'$, $\lim_{l\to\infty} c^l_{x,x'} = 1$ and $1 - c^l \sim \frac{9\pi^2}{2 l^2}$ as $l \to \infty$.
Figure 3 displays the correlation function $f$ with two different sets of parameters $(\sigma_b, \sigma_w)$. The red graph corresponds to the edge of chaos $(\sigma_b, \sigma_w) = (0, \sqrt{2})$, and the blue one corresponds to an ordered phase ($\chi_1 < 1$). Even on the EOC, the correlation eventually converges to 1 as $l \to \infty$. Although the convergence rate is no longer exponential (it is polynomial, of order $1/l^2$), convergence still occurs, and we observe it numerically. As the variance is preserved by the network ($q^l_x = q^1_x$) and the correlations converge to 1 as $l$ increases, the output function is of the form $z \cdot \|x\|$ for a random constant $z$ (notice that even in Figure 2 with depth 20, we start to observe this effect).
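The polynomial rate can be checked directly by iterating the EOC correlation map of ReLU (a sketch of our own; `f_relu` below implements the closed-form correlation function stated in Proposition 3, which coincides with the degree-1 arc-cosine kernel of [1] up to the EOC scaling):

```python
import numpy as np

def f_relu(c):
    """ReLU correlation function on the EOC: (c asin(c) + sqrt(1-c^2))/pi + c/2."""
    return (c * np.arcsin(c) + np.sqrt(1.0 - c * c)) / np.pi + 0.5 * c

gaps = []
c = 0.1                       # initial correlation between two inputs
for _ in range(10000):
    c = f_relu(c)
    gaps.append(1.0 - c)      # 1 - c^l after l iterations

# If 1 - c^l ~ const / l^2, then l^2 * (1 - c^l) flattens to a constant
plateau_5k = 5000 ** 2 * gaps[4999]
plateau_10k = 10000 ** 2 * gaps[9999]
```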
3.2 A better class of activation functions
We now introduce a large class of activation functions which appear more suitable for deep networks. For this class of functions, $\sigma_b$ can indeed be tuned on the EOC to slow down the convergence of the correlations to 1, by making the correlation function $f$ sufficiently close to the identity function.
Proposition 4 (Main Result).
Let $\phi$ be an activation function. Recall the definition of the variance function $F$ in (2). Assume that
(i) $\phi(0) = 0$, $\phi'(0) \neq 0$, and $\phi''$ is bounded;
(ii) for any $\sigma_b \ge 0$, there exists $\sigma_w^{\max}(\sigma_b) > 0$ such that, for all $\sigma_w \le \sigma_w^{\max}(\sigma_b)$, the function $F$ is increasing and has a fixed point; we denote by $q(\sigma_b, \sigma_w)$ the minimal fixed point of $F$;
(iii) for $(\sigma_b, \sigma_w)$ on the EOC and $q = q(\sigma_b, \sigma_w)$, the correlation function $f$ introduced in Lemma 1 is convex.
Then, for any $(\sigma_b, \sigma_w)$ on the EOC with $\sigma_w \le \sigma_w^{\max}(\sigma_b)$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{var}} = \infty$. We also have $\lim_{\sigma_b \to 0} q(\sigma_b, \sigma_w) = 0$. Moreover, as $\sigma_b \to 0$ on the EOC, $\sup_{x \in [0,1]} |f(x) - x| \to 0$.
ReLU does not satisfy the condition on $\phi'(0)$ since this quantity is not defined, while this condition is crucial for the result to be true. In the proof of Proposition 4, we show that on the EOC, $\sigma_w \to 1/|\phi'(0)|$ and $q \to 0$ as $\sigma_b \to 0$. Hence the condition on $\phi'(0)$ guarantees that the limits on the edge of chaos when $\sigma_b \to 0$ are well defined. The result of Proposition 4 states that we can make $f$ close to the identity function by considering $\sigma_b$ small; however, this also implies a small limiting variance $q$, therefore we cannot take $\sigma_b$ too small.
The next proposition gives sufficient conditions for bounded activation functions to satisfy all the conditions of Proposition 4.
Proposition 5.
Let $\phi$ be a bounded function such that $\phi(0) = 0$, $\phi'(0) > 0$, $\phi'(x) \ge 0$ for all $x$, $\phi(-x) = -\phi(x)$, and $x\, \phi''(x) \le 0$ for all $x$. Then $\phi$ satisfies all the conditions of Proposition 4.
The conditions in Proposition 5 are easy to verify and are, for example, satisfied by Tanh and Arctan. Figure 4 (a) displays the correlation function $f$ of Tanh for different values of $\sigma_b$ (for each $\sigma_b$, we find the corresponding $\sigma_w$ on the EOC by solving $\chi_1 = 1$ numerically). We see that $f$ approaches the identity function when $\sigma_b$ is small, preventing the correlation from converging quickly to 1. The correlation becomes approximately stationary after a moderate number of layers, therefore the kernel of the equivalent Gaussian process is essentially defined by the first layers. To see the practical implications of this result, Figure 4 (b) displays an example of the output of a neural network of depth 50 and width 100 with Tanh activation, initialized on the EOC with a small $\sigma_b$. The output displays much more variability than that of the ReLU network with the same architecture.
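This mechanism can be reproduced numerically. The sketch below (ours; it recomputes an EOC point for Tanh by bisection on $\chi_1 = 1$ and evaluates the correlation function $f$ of Lemma 1 by two-dimensional Gauss–Hermite quadrature; all names are illustrative) checks that $f$ is closer to the identity for a smaller $\sigma_b$:

```python
import numpy as np

nodes, w1 = np.polynomial.hermite_e.hermegauss(80)
w1 = w1 / w1.sum()
W2 = np.outer(w1, w1)                       # product weights for E over (Z1, Z2)
Z1, Z2 = nodes[:, None], nodes[None, :]

def E1(g):
    return float(np.sum(w1 * g(nodes)))

def limiting_q(sw2, sb2, phi, iters=2000):
    v = 1.0
    for _ in range(iters):
        v = sb2 + sw2 * E1(lambda z: phi(np.sqrt(v) * z) ** 2)
    return v

def eoc(sb2, phi, dphi, lo=0.5, hi=5.0):
    """Bisect sigma_w^2 so that chi_1 = sw2 * E[phi'(sqrt(q) Z)^2] = 1."""
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        q = limiting_q(mid, sb2, phi)
        if mid * E1(lambda z: dphi(np.sqrt(q) * z) ** 2) < 1.0:
            lo = mid
        else:
            hi = mid
    sw2 = 0.5 * (lo + hi)
    return sw2, limiting_q(sw2, sb2, phi)

def corr_f(x, q, sw2, sb2, phi):
    """f(x) = (sb^2 + sw^2 E[phi(sqrt(q) Z1) phi(sqrt(q) U2(x))]) / q."""
    U2 = x * Z1 + np.sqrt(max(1.0 - x * x, 0.0)) * Z2
    return (sb2 + sw2 * float(np.sum(W2 * phi(np.sqrt(q) * Z1) * phi(np.sqrt(q) * U2)))) / q

dtanh = lambda z: 1.0 / np.cosh(z) ** 2
xs = np.linspace(0.0, 1.0, 21)

sw2_s, q_s = eoc(0.01, np.tanh, dtanh)      # small sigma_b^2
sw2_l, q_l = eoc(0.3, np.tanh, dtanh)       # larger sigma_b^2
dev_small = max(abs(corr_f(x, q_s, sw2_s, 0.01, np.tanh) - x) for x in xs)
dev_large = max(abs(corr_f(x, q_l, sw2_l, 0.3, np.tanh) - x) for x in xs)
```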
Tanh-like activation functions provide a better information flow in deep neural networks compared to ReLU-like functions. However, these functions suffer from the vanishing gradient problem during backpropagation; see Pascanu et al. [10] and Kolen and Kremer [6]. This is the main reason why ReLU has been widely adopted: its gradient is equal to 1 for positive inputs. Thus, an activation function that satisfies the conditions of Proposition 4 (in order to have a good flow of information) and that does not have the vanishing gradient issue is expected to perform better than ReLU.
Proposition 6.
The activation function $\phi(x) = x \cdot \mathrm{sigmoid}(x)$ (Swish) satisfies all the conditions of Proposition 4.
It is clear that $x \cdot \mathrm{sigmoid}(x)$ does not suffer from the vanishing gradient problem as, like ReLU, it has a gradient close to 1 for large positive inputs. We present in Table 3 some values of $(\sigma_b, \sigma_w)$ on the EOC as well as the corresponding limiting variance $q$ for this activation function. As mentioned in the discussion of Proposition 4, the limiting variance decreases as $\sigma_b$ decreases.
4 Experimental Results
We demonstrate empirically our results on the MNIST dataset. In all the figures below, we compare the learning speed (test accuracy with respect to the number of epochs) for different activation functions and initialization parameters. We use the Adam optimizer in TensorFlow. We also present the final test accuracies after training for ReLU and Swish.
Initialization on the Edge of Chaos We initialize the deep network randomly by sampling $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2/N_{l-1})$ and $B^l_i \sim \mathcal{N}(0, \sigma_b^2)$. In Figure 5, we compare the learning speed of a ReLU network for different choices of random initialization. For deep neural networks (Figure 5 (b) with depth 60), any initialization other than on the edge of chaos results in the optimization algorithm being stuck at a very poor test accuracy of ${\approx}10\%$ (equivalent to selecting the output uniformly at random). To understand what is happening in this case, let us recall how the optimization algorithm works. Let $(x_i, y_i)_{1 \le i \le N}$ be the MNIST dataset. The loss we want to minimize is of the form $L(w, b) = \frac{1}{N} \sum_{i=1}^N \ell(z(x_i), y_i)$ where $z(x_i)$ is the output of the deep network (forward propagation) with parameters $w$ (weights) and $b$ (biases). For any $(\sigma_b, \sigma_w)$ not on the EOC, we know that the output converges exponentially fast in the depth to a fixed value (the same value for all $x_i$), thus a small change in $w$ and $b$ will not (considerably) change the value of the loss function; therefore the numerical gradient is approximately zero and the gradient descent algorithm will be stuck around the initial value. We avoid this problem of convergence to a fixed point when the parameters of the network are sampled at initialization using $(\sigma_b, \sigma_w)$ on the EOC.
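This collapse can be reproduced in a few lines (our own sketch with arbitrary depth/width choices, not the paper's experimental code): off the EOC the output variance of a random ReLU network shrinks geometrically with depth, while on the EOC $(\sigma_b, \sigma_w) = (0, \sqrt{2})$ it is preserved.

```python
import numpy as np

def output_std(depth, width, sigma_w2, sigma_b2, x, rng, n_draws=200):
    """Empirical std of y_1^L(x) over independent draws of a random ReLU network."""
    outs = []
    for _ in range(n_draws):
        a = x
        for _ in range(depth):
            n_in = a.shape[0]
            W = rng.normal(0.0, np.sqrt(sigma_w2 / n_in), (width, n_in))
            b = rng.normal(0.0, np.sqrt(sigma_b2), width)
            y = W @ a + b
            a = np.maximum(y, 0.0)           # ReLU
        outs.append(y[0])
    return float(np.std(outs))

rng = np.random.default_rng(1)
x = rng.normal(size=50)
x /= np.linalg.norm(x)

s_eoc = output_std(30, 100, 2.0, 0.0, x, rng)   # on the ReLU EOC: variance preserved
s_ord = output_std(30, 100, 1.0, 0.0, x, rng)   # ordered phase: variance ~ 2^-depth
```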
ReLU versus Tanh We proved in Section 3.2 that the Tanh activation guarantees better information propagation through the network when initialized on the EOC. However, Tanh networks suffer from the vanishing gradient problem. Consequently, we expect Tanh to perform better than ReLU (in terms of learning speed) for shallow networks, where the vanishing gradient problem is not encountered, as opposed to deep networks. Numerical results confirm this fact. Figure 6 shows that for depths 5 and 20, the learning algorithm converges faster for Tanh than for ReLU. However, for deeper networks, Tanh is stuck at a very low test accuracy; this is because many parameters remain essentially unchanged as the gradient is very small.
ReLU versus Swish As established in Section 3.2, $x \cdot \mathrm{sigmoid}(x)$, like Tanh, propagates the information better than ReLU and, contrary to Tanh, it does not suffer from the vanishing gradient problem. Hence our results suggest that it should perform better than ReLU, especially for deep architectures. Numerical results confirm this fact. Figure 7 shows a comparison between Swish and ReLU networks (both initialized on the edge of chaos) for different depths (5, 20, 40). A comparative study of final accuracy is shown in Table 2. We observe a clear advantage for Swish, especially for large depths. Additional simulation results on diverse datasets demonstrating the better performance of this activation over many others can be found in [12] (notice that these authors have already implemented it in TensorFlow under the name ‘swish’).
(depth, width)   (10, 5)   (20, 10)   (40, 30)   (60, 40)
ReLU             94.01     96.01      96.51      91.45
Swish            94.46     96.34      97.09      97.14
5 Conclusion and Discussion
We have shown that initializing the network on the edge of chaos provides a better propagation of information across layers. In particular, networks with ReLU-like (positively homogeneous) activation functions have an interesting property on the EOC: the variance is preserved across layers. However, the correlations still converge to 1 for this class of activation functions. To solve this problem, we have introduced a set of sufficient conditions that activation functions must satisfy to ensure better propagation of the information when the parameters are on the EOC and $\sigma_b$ is small. We have identified in particular the function $x \cdot \mathrm{sigmoid}(x)$ which additionally does not suffer from the vanishing gradient problem. Alternative architectures that enjoy good information propagation, such as Residual Neural Networks [2], are an interesting alternative to address these problems, as recently established in [16].
Our results have interesting implications for Bayesian neural networks which have received renewed attention lately; see, e.g., [3] and [5]. They indeed allow us to better understand the properties of the functional spaces described by assigning independent Gaussian prior distributions to the weights and biases. Our results indicate that, even on the EOC, ReLU-like activation functions will induce poor functional spaces if the number of layers and neurons per layer are large, the resulting prior distribution being concentrated on close-to-constant functions. To obtain much richer priors, our results indicate that we need to select not only parameters on the EOC but also an activation function satisfying the conditions of Proposition 4.
Another interesting research direction is to design prior distributions on the weights which prevent this Gaussian process behaviour, as originally suggested in [9]. One possible way would be to introduce heavy-tailed non-Gaussian distributions and potential correlations between the weights to ensure the Central Limit Theorem does not hold as the number of neurons increases.
References
[1] Cho, Y. and Saul, L.K. (2009) Kernel Methods for Deep Learning. Advances in Neural Information Processing Systems 22, pp. 342–350.
[2] He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition.
[3] Hernandez-Lobato, J.M. and Adams, R.P. (2015) Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. ICML.
[4] LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning. Nature, 521:436–444.
[5] Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J. and Sohl-Dickstein, J. (2018) Deep Neural Networks as Gaussian Processes. 6th International Conference on Learning Representations.
[6] Kolen, J.F. and Kremer, S.C. (2001) Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. In A Field Guide to Dynamical Recurrent Networks, Wiley-IEEE Press, 2001.
[7] Matthews, A.G., Hron, J., Rowland, M., Turner, R.E. and Ghahramani, Z. (2018) Gaussian Process Behaviour in Wide Deep Neural Networks. 6th International Conference on Learning Representations.
[8] Montufar, G.F., Pascanu, R., Cho, K. and Bengio, Y. (2014) On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.). Advances in Neural Information Processing Systems 27, pp. 2924–2932. Curran Associates, Inc., 2014.
[9] Neal, R.M. (1995) Bayesian learning for Neural Networks, volume 118. Springer Science & Business Media, 1995.
[10] Pascanu, R., Mikolov, T. and Bengio, Y. (2013) On the Difficulty of Training Recurrent Neural Networks. Proceedings of the 30th International Conference on Machine Learning, Volume 28, pp. III-1310–III-1318.
[11] Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J. and Ganguli, S. (2016) Exponential expressivity in deep neural networks through transient chaos. 30th Conference on Neural Information Processing Systems.
[12] Ramachandran, P., Zoph, B. and Le, Q.V. (2017) Searching for Activation Functions. arXiv e-print 1710.05941.
[13] Schoenholz, S.S., Gilmer, J., Ganguli, S. and Sohl-Dickstein, J. (2017) Deep Information Propagation. 5th International Conference on Learning Representations.
[14] Sussillo, D. and Abbott, L.F. (2015) Random walks: Training very deep nonlinear feedforward networks with smart initialization. ICLR 2015.
[15] Sutskever, I., Martens, J., Dahl, G. & Hinton, G. (2013) On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning.
[16] Yang, G. and Schoenholz, S. (2017) Mean Field Residual Networks: On the Edge of Chaos. Advances in Neural Information Processing Systems 30.
Supplementary material
We provide in the supplementary material the proofs of the propositions presented in the main document, and we give additional theoretical and experimental results.
Appendix A Proofs
For the sake of clarity we recall the propositions before giving their proofs.
A.1 Convergence to the fixed point: Proposition 1
Proposition 1.
Let $M_\phi := \sup_{v \ge 0} \big|\mathbb{E}\big[\phi'(\sqrt{v}\, Z)^2 + \phi(\sqrt{v}\, Z)\, \phi''(\sqrt{v}\, Z)\big]\big|$. Suppose $M_\phi < \infty$; then for $\sigma_w^2 < 1/M_\phi$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ and $K_{\phi,\mathrm{var}}(\sigma_b, \sigma_w) = \infty$.
Moreover, let $C_{\phi,\delta} := \sup_{v \ge \delta,\ x \in [-1,1]} \mathbb{E}\big[\phi'(\sqrt{v}\, Z_1)\, \phi'(\sqrt{v}\, U_2(x))\big]$ with $U_2(x) := x Z_1 + \sqrt{1-x^2}\, Z_2$. Suppose $C_{\phi,\delta} < \infty$ for some positive $\delta$; then for $\sigma_w^2 < 1/C_{\phi,\delta}$ and any $\sigma_b$, we have $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{corr}}$ and $K_{\phi,\mathrm{corr}}(\sigma_b, \sigma_w) = \infty$.
Proof.
To abbreviate the notation, we write $q^l := q^l_x$ for some fixed input $x$.
Convergence of the variances: We first consider the asymptotic behaviour of $q^l = F(q^{l-1})$. Recall that $F(v) = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{v}\, Z)^2\big]$ where $Z \sim \mathcal{N}(0, 1)$.
The first derivative of this function is given by
$$F'(v) = \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{v}\, Z)^2 + \phi(\sqrt{v}\, Z)\, \phi''(\sqrt{v}\, Z)\big], \qquad (3)$$
where we used the Gaussian integration by parts $\mathbb{E}[Z\, g(Z)] = \mathbb{E}[g'(Z)]$, valid for any sufficiently regular function $g$.
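This integration-by-parts identity (Stein's identity) is easy to check numerically, e.g. for $g = \tanh$ (a sketch of ours using deterministic quadrature rather than sampling):

```python
import numpy as np

# Gauss-Hermite quadrature for expectations under Z ~ N(0, 1)
nodes, weights = np.polynomial.hermite_e.hermegauss(100)
weights = weights / weights.sum()

g = np.tanh
dg = lambda z: 1.0 / np.cosh(z) ** 2      # derivative of tanh

lhs = float(np.sum(weights * nodes * g(nodes)))   # E[Z g(Z)]
rhs = float(np.sum(weights * dg(nodes)))          # E[g'(Z)]
```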
Using the condition on $M_\phi$, we see that for $\sigma_w^2 < 1/M_\phi$ the function $F$ is a contraction mapping, and the Banach fixed-point theorem guarantees the existence of a unique fixed point $q$ of $F$, with $\lim_{l\to\infty} q^l = q$. Note that this fixed point depends only on $F$; therefore, this is true for any input $x$, and $K_{\phi,\mathrm{var}} = \infty$.
Convergence of the covariances: Since $\lim_{l\to\infty} q^l_x = q$, for all $\epsilon > 0$ there exists $l_0$ such that, for all $l \ge l_0$, $|q^l_x - q| < \epsilon$. We cannot use the Banach fixed-point theorem directly here because the integrated function depends on $l$ through $q^l_x$ and $q^l_{x'}$. For ease of notation, we write $c^l := c^l_{x,x'}$. For $\sigma_w^2 < 1/C_{\phi,\delta}$, $(c^l)$ is a Cauchy sequence; therefore, it converges to a limit $c \in [-1, 1]$. At the limit,
$$c = f(c), \quad \text{where} \quad f(x) := \frac{\sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{q}\, Z_1)\, \phi(\sqrt{q}\, U_2(x))\big]}{q} \quad \text{and} \quad U_2(x) := x Z_1 + \sqrt{1-x^2}\, Z_2.$$
The derivative of this function is given by
$$f'(x) = \sigma_w^2\, \mathbb{E}\big[\phi'(\sqrt{q}\, Z_1)\, \phi'(\sqrt{q}\, U_2(x))\big].$$
By assumption on $C_{\phi,\delta}$ and the choice of $\sigma_w$, we have $\sup_x |f'(x)| < 1$, so that $f$ is a contraction and has a unique fixed point. Since $f(1) = F(q)/q = 1$, the fixed point is 1, i.e. $c = 1$. The above result is true for any inputs $x, x'$; therefore, $K_{\phi,\mathrm{corr}} = \infty$. ∎
As an illustration, we plot in Figure A.1 the variance $q^l_x$ for three different inputs as a function of the layer $l$. In this example, the convergence for the Tanh network is faster than that of the ReLU network.
Lemma 1.
Let $(\sigma_b, \sigma_w) \in D_{\phi,\mathrm{var}}$ such that the limiting variance $q > 0$, and $\phi$ an activation function such that $\sup_{v \in K} \mathbb{E}\big[\phi(\sqrt{v}\, Z)^2\big] < \infty$ for all compact sets $K \subset \mathbb{R}_+$. Define $f$ by $f(x) := \big(\sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi(\sqrt{q}\, Z_1)\, \phi(\sqrt{q}\, U_2(x))\big]\big)/q$ with $U_2(x) := x Z_1 + \sqrt{1-x^2}\, Z_2$, and $(\gamma^l)_{l \ge 1}$ by $\gamma^1 = c^1$ and $\gamma^{l+1} = f(\gamma^l)$. Then $\lim_{l \to \infty} |c^l - \gamma^l| = 0$.
Proof.
For $l \ge 1$, we have
The first term goes to zero uniformly using the condition on $\phi$ and the Cauchy–Schwarz inequality. As for the second term, it can be written as
Again, using Cauchy–Schwarz and the condition on $\phi$, both terms can be controlled uniformly in $l$ by an integrable upper bound. We conclude using the dominated convergence theorem. ∎
A.2 Results for ReLU-like activation functions: proof of Propositions 2 and 3
Proposition 2.
Let $\phi$ be a positively homogeneous activation function, with $\phi(x) = \lambda x$ for $x \ge 0$ and $\phi(x) = \beta x$ for $x < 0$. Then
(i) for any $\sigma_w^2 < 2/(\lambda^2 + \beta^2)$ and any $\sigma_b$, we have $\lim_{l\to\infty} q^l_x = q$ with $q = \sigma_b^2/(1 - \sigma_w^2 (\lambda^2+\beta^2)/2)$;
(ii) $\mathrm{EOC} = \{(0, \sqrt{2/(\lambda^2+\beta^2)})\}$ and, on the EOC, $q^l_x = q^1_x$ for any $l$ and $x$.
Proof.
We write $e := \mathbb{E}[\phi(Z)^2] = (\lambda^2 + \beta^2)/2$ throughout the proof. Note first that, by positive homogeneity of $\phi$, $\mathbb{E}[\phi(\sqrt{v}\, Z)^2] = v\, e$, so the variance satisfies the recursion:
$$q^l_x = \sigma_b^2 + \sigma_w^2\, e\, q^{l-1}_x. \qquad (4)$$
For $\sigma_w^2 e < 1$, $q := \sigma_b^2/(1 - \sigma_w^2 e)$ is an attracting fixed point of this linear recursion. This is true for any input, therefore $K_{\phi,\mathrm{var}} = \infty$ and (i) is proved.
Now, since $\mathbb{E}[\phi'(Z)^2] = (\lambda^2 + \beta^2)/2 = e$, the EOC equation is given by $\chi_1 = \sigma_w^2\, e = 1$. Therefore, $\sigma_w^2 = 1/e = 2/(\lambda^2+\beta^2)$. Replacing $\sigma_w^2$ by its critical value in (4) yields $q^l_x = \sigma_b^2 + q^{l-1}_x$.
Thus $q^l_x$ converges if and only if $\sigma_b = 0$ (in which case $q^l_x = q^1_x$ for all $l$), otherwise it diverges to infinity. So the frontier is reduced to the single point $(\sigma_b, \sigma_w) = (0, \sqrt{2/(\lambda^2+\beta^2)})$, and on the EOC the variance does not depend on $l$.
∎
Proposition 3 (ReLU kernel).
Consider a ReLU network with parameters $(\sigma_b, \sigma_w^2) = (0, 2)$ on the EOC. We have
(i) for $x \in [-1, 1]$, $f(x) = \frac{1}{\pi}\big(x \arcsin(x) + \sqrt{1-x^2}\big) + \frac{x}{2}$;
(ii) for any $x \ne x'$, $\lim_{l\to\infty} c^l_{x,x'} = 1$ and $1 - c^l \sim \frac{9\pi^2}{2 l^2}$ as $l \to \infty$.
Proof.
(i) On the EOC, $(\sigma_b, \sigma_w^2) = (0, 2)$ and the correlation function is given by $f(x) = 2\, \mathbb{E}\big[\mathrm{ReLU}(Z_1)\, \mathrm{ReLU}(x Z_1 + \sqrt{1-x^2}\, Z_2)\big]$.
Let $h(x) := \mathbb{E}\big[\mathrm{ReLU}(Z_1)\, \mathrm{ReLU}(x Z_1 + \sqrt{1-x^2}\, Z_2)\big]$. Note that $h$ is differentiable and satisfies
$$h'(x) = \mathbb{P}\big(Z_1 > 0,\ x Z_1 + \sqrt{1-x^2}\, Z_2 > 0\big) = \frac{1}{2} - \frac{\arccos(x)}{2\pi},$$
which is also differentiable. Integrating, and using $h(0) = \mathbb{E}[\mathrm{ReLU}(Z_1)]^2 = \frac{1}{2\pi}$, simple algebra leads to
$$h(x) = \frac{x}{2} - \frac{x \arccos(x)}{2\pi} + \frac{\sqrt{1-x^2}}{2\pi}.$$
Since $f = 2h$ and $\arccos(x) = \frac{\pi}{2} - \arcsin(x)$, we conclude that $f(x) = \frac{1}{\pi}\big(x \arcsin(x) + \sqrt{1-x^2}\big) + \frac{x}{2}$.

(ii) We first derive a Taylor expansion of $f$ near 1. Consider the change of variable $x = 1 - \epsilon$ with $\epsilon$ close to 0, then
$$\arcsin(1-\epsilon) = \frac{\pi}{2} - \sqrt{2\epsilon}\,\Big(1 + \frac{\epsilon}{12} + O(\epsilon^2)\Big), \qquad \sqrt{1-(1-\epsilon)^2} = \sqrt{2\epsilon}\,\Big(1 - \frac{\epsilon}{4} + O(\epsilon^2)\Big),$$
so that
$$f(1-\epsilon) = 1 - \epsilon + \frac{2\sqrt{2}}{3\pi}\, \epsilon^{3/2} + O(\epsilon^{5/2})$$
and, writing $\epsilon_l := 1 - c^l$,
$$\epsilon_{l+1} = \epsilon_l - \frac{2\sqrt{2}}{3\pi}\, \epsilon_l^{3/2} + O(\epsilon_l^{5/2}).$$
Since
$$\epsilon_{l+1}^{-1/2} - \epsilon_l^{-1/2} \longrightarrow \frac{\sqrt{2}}{3\pi},$$
we obtain that $\epsilon_l^{-1/2} \sim \frac{\sqrt{2}}{3\pi}\, l$, i.e. $1 - c^l \sim \frac{9\pi^2}{2 l^2}$ as $l \to \infty$. ∎