On weight initialization in deep neural networks
A proper initialization of the weights in a neural network is critical to its convergence. Current insights into weight initialization come primarily from linear activation functions. In this paper, I develop a theory for weight initialization with non-linear activations. First, I derive a general weight initialization strategy for any neural network using activation functions differentiable at 0. Next, I derive the weight initialization strategy for the Rectified Linear Unit (RELU), and provide theoretical insights into why the Xavier initialization is a poor choice with RELU activations. My analysis provides a clear demonstration of the role of non-linearities in determining the proper weight initialization.
1 Introduction
In recent years, there have been rapid advances in our understanding of deep neural networks. These advances have resulted in breakthroughs in several fields, ranging from image recognition (Russakovsky et al., 2015; Szegedy et al., 2013; Toshev and Szegedy, 2014) to speech recognition (Graves et al., 2013; Maas et al., 2013) to natural language processing (Collobert and Weston, 2008; Kim, 2014; Socher et al., 2011). These successes have been achieved despite the notorious difficulty in training these deep models.
Part of the difficulty in training these models lies in determining the proper initialization strategy for the parameters in the model. It is well known that arbitrary initializations can slow down or even completely stall the convergence process. The slowdown arises because arbitrary initializations can result in the deeper layers receiving inputs with small variances, which in turn slows down back propagation and retards the overall convergence process. Weight initialization is an area of active research, and numerous methods (Mishkin and Matas, 2015; Saxe et al., 2013; Sussillo and Abbott, 2014, to state a few) have been proposed to deal with the problem of the shrinking variance in the deeper layers.
In this paper, I revisit the oldest and most widely used approach to the problem, with the goal of resolving some of the unanswered theoretical questions which remain in the literature. The problem can be stated as follows: if the weights in a neural network are initialized using samples from a normal distribution $\mathcal{N}(0, \sigma^2)$, how should $\sigma^2$ be chosen to ensure that the variance of the outputs from the different layers is approximately the same?
The first systematic analysis of this problem was conducted by Glorot and Bengio, who showed that for a linear activation function, the optimal value of $\sigma^2$ is $1/N$, where $N$ is the number of nodes feeding into the layer. Although the paper makes several assumptions about the inputs to the model, the initialization works extremely well in many cases and is widely used in the initialization of neural networks to date; this initialization scheme is commonly referred to as the Xavier initialization.
In an important follow up paper, He and colleagues argue that the Xavier initialization does not work well with the RELU activation function, and instead propose an initialization of $\sigma^2 = 2/N$ (commonly referred to as the He initialization). In support of their initialization, they provide an example of a 30 layer neural network which converges with the He initialization, but not under the Xavier initialization. To the best of my knowledge, the precise reason for the convergence of one method and the non-convergence of the other is not fully understood.
My main contributions in this paper are (a) to generalize the results of Glorot and Bengio to the case of non-linear activation functions, and (b) to provide a continuum between the results of Glorot and Bengio and those of He et al. For the class of activation functions differentiable at 0, I provide a general initialization strategy. For the class of activation functions not differentiable at 0, I focus on the Rectified Linear Unit (RELU) and provide a rigorous proof of the He initialization. I also provide theoretical insights into why the 30 layer neural network converges with the He initialization but not with the Xavier initialization. As a small bonus, I resolve an unanswered question posed in Glorot and Bengio regarding the distributions of the activations under the hyperbolic tangent activation.
2 The setup
Consider a deep neural network with $L$ layers. The relationship between the inputs to the $(t+1)^{th}$ layer ($z^{(t+1)}$) and the $t^{th}$ layer ($z^{(t)}$) is described by the recursions
$$y^{(t)} = W^{(t)} z^{(t)}, \qquad (1)$$
$$z^{(t+1)} = f(y^{(t)}). \qquad (2)$$
Here $z^{(t)} \in \mathbb{R}^N$, $W^{(t)}$ is an $N \times N$ matrix of weights for the $t^{th}$ layer, $f$ is the non-linear activation function, and $N$ is the number of nodes in each of the hidden layers. The weights $W^{(t)}_{ij}$ are assumed to be independent, identically distributed normal random variables with mean 0 and variance $\sigma^2$. Consistent with the assumptions in Glorot and Bengio and He et al., I assume that the inputs to the first layer are independent and identically distributed random variables with mean 0 and variance 1. For convenience, I use $\mu_t$ and $\nu_t$ to denote the mean and variance of $z^{(t)}_i$ respectively.
Due to the symmetry in the problem, all inputs to the $t^{th}$ layer will have the same means and variances during the first forward pass (i.e., $E[z^{(t)}_i] = \mu_t$ and $Var(z^{(t)}_i) = \nu_t$ for all $i$); the covariances between the inputs to the $t^{th}$ layer need not be 0.
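As an illustration, one forward step of the recursions (1) and (2) can be sketched in a few lines of NumPy; the width $N$, the weight variance, and the choice $f = \tanh$ below are illustrative choices, not prescriptions from the analysis.

```python
import numpy as np

# Minimal sketch of recursions (1)-(2): y = W z, z' = f(y).
# Width N, weight variance sigma2, and f = tanh are illustrative choices.
rng = np.random.default_rng(0)
N, sigma2 = 200, 1.0 / 200

z = rng.normal(0.0, 1.0, size=N)                   # first-layer inputs: mean 0, variance 1
W = rng.normal(0.0, np.sqrt(sigma2), size=(N, N))  # i.i.d. N(0, sigma2) weights
y = W @ z                                          # equation (1)
z_next = np.tanh(y)                                # equation (2)
print(z_next.shape)
```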
The goal is to find the value of $\sigma^2$ which ensures that $\nu_t \approx 1$ for every layer $t$ during the first forward pass. To accomplish this, I need to express the central moments of $z^{(t+1)}$ in terms of the central moments of $z^{(t)}$ for an arbitrary value of $t$. I begin by analyzing properties of the neural network that are independent of the activation function considered in the analysis.
Proposition 1. During the first iteration, $W^{(t)}_{ij}$ is independent of $z^{(t)}_k$ for all values of $i$, $j$ and $k$.
Proof. Using the recursions in (1) and (2), $z^{(t)}$ can be expressed as some non-linear function of the weights in the first $t-1$ layers and the inputs to the first layer. Since the weights in the $t^{th}$ layer are independent of the inputs to the first layer and of the weights in all other layers, the weights in the $t^{th}$ layer will also be independent of any non-linear function of these quantities. Therefore, $W^{(t)}_{ij}$ is independent of $z^{(t)}_k$ for all values of $i$, $j$ and $k$. Furthermore, since $W^{(t)}_{ij}$ is independent of $z^{(t)}_j$ and $z^{(t)}_k$, it will also be independent of the product $z^{(t)}_j z^{(t)}_k$. ∎
Taking expectations in (1) and using Proposition 1, along with the fact that $E[W^{(t)}_{ij}] = 0$, yields
$$E[y^{(t)}_i] = \sum_{j=1}^{N} E[W^{(t)}_{ij}] \, E[z^{(t)}_j] = 0. \qquad (3)$$
Therefore, $E[y^{(t)}_i] = 0$. Using (1),
$$Var(y^{(t)}_i) = E[(y^{(t)}_i)^2] = \sum_{j=1}^{N} E[(W^{(t)}_{ij})^2 (z^{(t)}_j)^2] + \sum_{j \neq k} E[W^{(t)}_{ij} W^{(t)}_{ik} z^{(t)}_j z^{(t)}_k]. \qquad (4)$$
From Proposition 1, $W^{(t)}_{ij}$ and $W^{(t)}_{ik}$ ($j \neq k$) will (a) be independent of each other, and (b) be independent of $z^{(t)}_j z^{(t)}_k$. Using these results, along with the fact that $E[W^{(t)}_{ij}] = 0$, gives
$$E[W^{(t)}_{ij} W^{(t)}_{ik} z^{(t)}_j z^{(t)}_k] = E[W^{(t)}_{ij}] \, E[W^{(t)}_{ik}] \, E[z^{(t)}_j z^{(t)}_k] = 0, \quad j \neq k. \qquad (5)$$
Plugging (5) into (4) gives
$$Var(y^{(t)}_i) = N \sigma^2 E[(z^{(t)}_j)^2] = N \sigma^2 (\mu_t^2 + \nu_t) \qquad (6)$$
for all $i$. Interestingly, this result holds for any arbitrary covariance structure of the inputs to the $t^{th}$ layer.
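The variance identity just derived, $Var(y_i) = N\sigma^2(\mu_t^2 + \nu_t)$, including its insensitivity to input correlations, is easy to check by simulation. In the sketch below, the width, input moments, and shared-component construction (which makes the inputs correlated) are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of Var(y_i) = N * sigma2 * (mu^2 + nu), with
# deliberately correlated inputs. N, mu, nu are illustrative choices.
rng = np.random.default_rng(0)
N, sigma2, mu, nu = 500, 1.0 / 500, 0.3, 1.0
trials = 20000

# Shared component c induces nonzero covariance between the z_j,
# but each z_j still has mean mu and variance nu.
c = rng.normal(0.0, 1.0, size=(trials, 1))
e = rng.normal(0.0, 1.0, size=(trials, N))
z = mu + np.sqrt(nu) * (np.sqrt(0.5) * c + np.sqrt(0.5) * e)

W = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
y = np.sum(W * z, axis=1)              # one pre-activation per trial

predicted = N * sigma2 * (mu**2 + nu)
print(np.var(y), predicted)            # the two should be close
```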
Equations (3) and (6) provide insights into the central moments of $y^{(t)}_i$, but can we derive insights into the distribution of $y^{(t)}_i$? To answer this question, I make the additional assumption that the number of nodes in the hidden layer ($N$) is ‘large’; this assumption is reasonable given that most neural networks have several hundred nodes in the hidden layers. Under this assumption, we have the following result.
Proposition 2. $y^{(t)}_i$ will be approximately normally distributed for all values of $i$ and $t$.
Proof. For the first iteration, note that $E[W^{(t)}_{ij} z^{(t)}_j] = E[W^{(t)}_{ij}] \, E[z^{(t)}_j] = 0$. Furthermore, for $j \neq k$,
$$E[W^{(t)}_{ij} z^{(t)}_j \, W^{(t)}_{ik} z^{(t)}_k] = E[W^{(t)}_{ij}] \, E[W^{(t)}_{ik}] \, E[z^{(t)}_j z^{(t)}_k] = 0, \qquad (7)$$
where the last equality follows from (5). This implies that $W^{(t)}_{i1} z^{(t)}_1, \ldots, W^{(t)}_{iN} z^{(t)}_N$ are independent and identically distributed random variables. Therefore, by the Central Limit Theorem, we expect $y^{(t)}_i = \sum_{j=1}^{N} W^{(t)}_{ij} z^{(t)}_j$ to converge to a normal distribution when $N$ is large (Lehmann, 2004). ∎
Even when $W^{(t)}_{i1} z^{(t)}_1, \ldots, W^{(t)}_{iN} z^{(t)}_N$ are dependent and not identically distributed, the conditions required to ensure that $y^{(t)}_i$ converges to a normal distribution are weak (for a list of all the conditions, see Theorem 2.8.2 in Lehmann, 2004). Thus, $y^{(t)}_i$ is expected to be approximately normally distributed during most iterations.
The analysis thus far has focused on providing general insights into the distribution of $y^{(t)}_i$ resulting from equation (1). In order to analyze the role of the non-linearity induced by (2), assumptions need to be made about the nature of $f$. In particular, my analysis critically hinges on the differentiability of $f$ at 0. Accordingly, I split the analysis into two cases. The first case deals with the general class of activation functions differentiable at 0. In the second case, instead of considering all possible non-differentiable functions, I focus on the Rectified Linear Unit (RELU), which is commonly used in the analysis of neural networks.
3 Activation functions differentiable at 0
When $f$ is differentiable at 0, we can perform a Taylor expansion in (2) about $y^{(t)}_i = 0$. Assuming that the higher order terms can be ignored,
$$z^{(t+1)}_i = f(y^{(t)}_i) \approx f(0) + f'(0) \, y^{(t)}_i. \qquad (8)$$
Taking the expectation in (8) and using (3) gives
$$\mu_{t+1} \approx f(0) + f'(0) \, E[y^{(t)}_i] = f(0). \qquad (9)$$
This equation suggests that the expected value of the inputs to the $(t+1)^{th}$ layer has little dependence on the moments of the inputs to the $t^{th}$ layer. Using this result recursively suggests that for all layers (barring the first),
$$\mu_t \approx f(0). \qquad (10)$$
Similarly, taking the variance in (8) and using (6) gives
$$\nu_{t+1} \approx f'(0)^2 \, Var(y^{(t)}_i) = f'(0)^2 N \sigma^2 (\mu_t^2 + \nu_t). \qquad (11)$$
Setting $\nu_{t+1} = \nu_t = 1$ and $\mu_t = f(0)$ in (11) yields
$$\sigma^2 = \frac{1}{N f'(0)^2 \left(1 + f(0)^2\right)}. \qquad (12)$$
Equation (12) provides a general weight initialization strategy for any activation function differentiable at 0. I use the results developed in this section to analyze the optimal value of $\sigma^2$ for two commonly used differentiable activation functions - the hyperbolic tangent and the sigmoid.[1]

[1] In their calculations, Glorot and Bengio and He et al. impose an additional set of constraints to ensure that the variance is maintained during the backward pass as well. I believe that this is not required, since the requirement that the variance of the inputs at each layer be the same ensures that the gradient flows through in the backward pass.
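A two-line helper makes the rule concrete. Below I assume the general rule takes the form $\sigma^2 = 1/(N f'(0)^2 (1 + f(0)^2))$, with $f(0)$ and $f'(0)$ the value and slope of the activation at 0; the width $N = 100$ is an illustrative choice.

```python
import numpy as np

# Sketch of the general rule sigma^2 = 1 / (N * f'(0)^2 * (1 + f(0)^2)).
def init_variance(f0, fprime0, n_in):
    """Weight variance that (approximately) preserves unit input variance."""
    return 1.0 / (n_in * fprime0**2 * (1.0 + f0**2))

N = 100
print(init_variance(0.0, 1.0, N))    # tanh: f(0)=0, f'(0)=1   -> 1/N
print(init_variance(0.5, 0.25, N))   # sigmoid: f(0)=1/2, f'(0)=1/4 -> 12.8/N
```

The two calls recover the special cases analyzed in the next two subsections.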
3.1 Hyperbolic tangent activation
For the hyperbolic tangent,
$$f(y) = \tanh(y) = \frac{e^{y} - e^{-y}}{e^{y} + e^{-y}},$$
we have $f(0) = 0$ and $f'(0) = 1$. Plugging these results in (12) yields
$$\sigma^2 = \frac{1}{N},$$
which is precisely the Xavier initialization.
Sequential saturation with the hyperbolic tangent
In their analysis of a neural network with hyperbolic tangent activations, Glorot and Bengio find that the deeper layers in the neural network have a greater proportion of unsaturated nodes than the shallower layers. As is stated in their paper, ‘why this is happening remains to be understood’.
To explain their finding, I begin by noting that in Glorot and Bengio, the authors initialize the weights using samples from a uniform distribution on $[-1/\sqrt{N}, 1/\sqrt{N}]$, having a variance of $1/(3N)$. Therefore, from (10) and (11), with $f(0) = 0$ and $f'(0) = 1$, we have $\mu_t \approx 0$ and
$$\nu_{t+1} \approx \frac{\nu_t}{3}$$
respectively.
respectively. From (LABEL:eq:in204), this implies that for all (i.e., is a decreasing function of ). Furthermore from proposition 2, . Therefore, will be the tanh transformation of a normal random variable. Using results from , will have a probability density function (pdf) given by
Plots of this pdf for different values of $s_t$ (provided in Figure 1) produce trends similar to those observed in the simulation studies of Glorot and Bengio (figure 4 in their paper). A comparison of Figure 1 and figure 4 of Glorot and Bengio suggests that $s_t$ is a decreasing function of $t$, as is expected.
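As a sanity check, the tanh-transform density should integrate to 1 over (-1, 1). A minimal numerical check, with an arbitrary illustrative value of the pre-activation standard deviation s:

```python
import numpy as np

# Density of tanh(Y) for Y ~ N(0, s^2), by change of variables:
# p(x) = exp(-arctanh(x)^2 / (2 s^2)) / (s * sqrt(2*pi) * (1 - x^2)).
def tanh_normal_pdf(x, s):
    return np.exp(-np.arctanh(x) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi) * (1 - x**2))

s = 0.5                                # illustrative pre-activation standard deviation
x = np.linspace(-0.999, 0.999, 200001)
p = tanh_normal_pdf(x, s)
mass = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x))   # trapezoid rule
print(mass)                            # close to 1: a valid density on (-1, 1)
```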
From the results in Figure 1, we expect the activations to be (a) approximately normally distributed when $s_t$ is close to 0, and (b) bimodally distributed with local maxima near -1 and +1 when $s_t$ is close to 1. Accordingly, since $s_t$ is a decreasing function of $t$, we expect the activations from the shallower layers to be more saturated (i.e., more concentrated near -1 and +1), and the saturation in the activations to reduce as we go to the deeper layers in the network.
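This shrinking pre-activation variance can be reproduced directly. The sketch below pushes unit-variance inputs through tanh layers initialized from U[-1/sqrt(N), 1/sqrt(N)], as in the Glorot and Bengio experiments; the width, depth, and batch size are arbitrary choices.

```python
import numpy as np

# Sketch: pre-activation variance through tanh layers initialized with
# U[-1/sqrt(N), 1/sqrt(N)] (variance 1/(3N)). The variance shrinks with
# depth, so deeper layers receive smaller, less saturating inputs.
rng = np.random.default_rng(1)
N, depth, batch = 300, 6, 4000

z = rng.normal(0.0, 1.0, size=(batch, N))
pre_act_var = []
for _ in range(depth):
    W = rng.uniform(-1.0 / np.sqrt(N), 1.0 / np.sqrt(N), size=(N, N))
    y = z @ W
    pre_act_var.append(y.var())
    z = np.tanh(y)

print(pre_act_var)   # decreasing layer by layer
```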
3.2 Sigmoid activation
For the sigmoid activation, defined as
$$f(y) = \frac{1}{1 + e^{-y}},$$
we have $f(0) = 1/2$ and $f'(0) = 1/4$. Plugging these values in (12) yields
$$\sigma^2 = \frac{16}{N \left(1 + 1/4\right)} = \frac{12.8}{N}. \qquad (19)$$
To compare the initialization described in (19) with the Xavier initialization, I use a simple 10 layer neural network whose architecture is described in Figure 2. For my experiments, I use the CIFAR 10 dataset (Krizhevsky and Hinton, 2009), comprising 60,000 32×32 color images evenly split over 10 classes. The dataset comprises 50,000 training examples (which form the training dataset in my analyses) and 10,000 test examples (which form the validation dataset in my analyses).
First, I train the neural network with the Xavier initialization for 10 iterations and compute the top 5 accuracy on the validation dataset for each iteration. Next, I repeat the process using the initialization stated in (19). A comparison of the validation accuracies for the 2 cases is provided in Figure 3, which shows that convergence appears to stall with the Xavier initialization, but proceeds rapidly with the initialization proposed in (19).[2]

[2] Python code (using the package Keras (Chollet, 2015)) to replicate Figure 3 can be downloaded from https://github.com/sidkk86/weight_initialization
4 Activation functions not differentiable at 0
When is not differentiable at 0, the analysis seems more difficult than in the previous section. Instead of attempting to provide a general solution, I focus on the most important non-differentiable activation function used in the analysis of neural networks - the Rectified Linear Unit (RELU).
4.1 RELU activation
Since the RELU activation, $f(y) = \max(0, y)$, is not differentiable at 0, the results in (8)–(11) cannot be used to compute the optimal value of $\sigma^2$. To proceed, I use Proposition 2 and (3), which state that for the first iteration, $y^{(t)}_i$ is approximately normally distributed with mean 0. We are interested in the mean and variance of $z^{(t+1)}_i = \max(0, y^{(t)}_i)$. The mean will be given by
$$\mu_{t+1} = E[\max(0, y^{(t)}_i)] = \sqrt{\frac{Var(y^{(t)}_i)}{2\pi}},$$
and the second moment by
$$E[(z^{(t+1)}_i)^2] = \frac{1}{2} Var(y^{(t)}_i),$$
so that, from (6), $Var(y^{(t+1)}_i) = N \sigma^2 E[(z^{(t+1)}_i)^2] = \frac{N \sigma^2}{2} Var(y^{(t)}_i)$.
For the variance to be maintained at each iteration, we require $Var(y^{(t+1)}_i) = Var(y^{(t)}_i)$, which yields
$$\sigma^2 = \frac{2}{N},$$
which is consistent with the initialization obtained by He et al.
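The two rectified-normal moments used in this derivation are standard facts about the half-normal distribution and are easy to verify by simulation; in the sketch below, the standard deviation s is an arbitrary illustrative choice.

```python
import numpy as np

# Moments of a rectified normal variable, as used in the derivation:
# if Y ~ N(0, s^2), then E[max(0, Y)] = s / sqrt(2*pi) and
# E[max(0, Y)^2] = s^2 / 2.
rng = np.random.default_rng(2)
s = 1.7                                  # illustrative pre-activation standard deviation
y = rng.normal(0.0, s, size=2_000_000)
z = np.maximum(0.0, y)

print(z.mean(), s / np.sqrt(2 * np.pi))  # the two should agree
print(np.mean(z**2), s**2 / 2)           # and so should these
```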
To converge or not to converge, that is the question.
In the He et al. paper, the authors provide an example of a 22 layer neural network using RELU activations which converges with the Xavier initialization, and a 30 layer neural network which does not converge with the same initialization and activation functions.
To understand why this happens, I compute the central moments of $y^{(t)}_i$ in terms of the central moments of $y^{(1)}_i$ when $\sigma^2 = 1/N$. From (6) and the moments of the rectified normal derived above, we have
$$\mu_{t+1} = \sqrt{\frac{Var(y^{(1)}_i)}{2^{t} \pi}}, \qquad Var(y^{(t)}_i) = \left(\frac{1}{2}\right)^{t-1} Var(y^{(1)}_i) \qquad (30)$$
for all $t$. These approximations are remarkably accurate, as is demonstrated by comparisons with the simulation experiments described in Figure 4.
Equation (30) shows that the variance of the inputs to the deeper layers is exponentially smaller than the variance of the inputs to the shallower layers. Therefore, the deeper the neural network, the worse the performance of the Xavier initialization will be. From (30), $Var(y^{(22)}_i) = 2^{-21} Var(y^{(1)}_i)$ and $Var(y^{(30)}_i) = 2^{-29} Var(y^{(1)}_i)$. Thus the variance of the input to the $30^{th}$ layer will be $2^{8} = 256$ times smaller than the variance of the input to the $22^{nd}$ layer, which explains the possible reason why the 22 layer neural network described in He et al. converges, but the 30 layer neural network does not.[3]

[3] It is surprising that the 22 layer neural network converges!
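The exponential collapse under the Xavier initialization, and its absence under the He initialization, can be reproduced in a few lines; the width, depth, and batch size below are arbitrary choices.

```python
import numpy as np

# Sketch: forward-pass variance through deep RELU layers.
# With sigma^2 = 1/N (Xavier) the variance roughly halves per layer;
# with sigma^2 = 2/N (He) it stays of order one.
rng = np.random.default_rng(3)
N, depth, batch = 400, 30, 500

def final_variance(sigma2):
    z = rng.normal(0.0, 1.0, size=(batch, N))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma2), size=(N, N))
        z = np.maximum(0.0, z @ W)
    return z.var()

print(final_variance(1.0 / N))   # tiny: on the order of 2**-30
print(final_variance(2.0 / N))   # of order one
```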
5 Conclusion
In this paper, I have provided a general framework for weight initialization with non-linear activation functions. First, I provide a general formula for the ideal weight initialization for all activation functions differentiable at 0, and show how the weight initializations change for the hyperbolic tangent and sigmoid activation functions. Second, from the class of functions that are not differentiable at 0, I focus on the Rectified Linear Unit (RELU) and provide a rigorous proof of the He initialization. Finally, I show why the Xavier initialization fails to work with the RELU activation function. Given the sharp increase in the use of non-differentiable activation functions over the years, a more general version of my (largely incomplete) analysis of non-differentiable functions is warranted. My analysis repeatedly illustrates the drastic difference in dynamics which can result from introducing non-linearities into the system.
-  François Chollet. Keras. https://github.com/fchollet/keras, 2015.
-  Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
-  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pages 249–256, 2010.
-  Michael D Godfrey. The tanh transformation. Information Systems Laboratory, Stanford University, 2009.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  Andrej Karpathy, Justin Johnson, and Fei-Fei Li. CS 231n: Convolutional neural networks for visual recognition, lecture 5, slide 61. http://cs231n.stanford.edu/slides/2016/winter1516_lecture5.pdf, 2016.
-  Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Erich Leo Lehmann. Elements of large-sample theory. Springer Science & Business Media, 2004.
-  Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
-  Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
-  Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 129–136, 2011.
-  David Sussillo and LF Abbott. Random walk initialization for training very deep feedforward networks. arXiv preprint arXiv:1412.6558, 2014.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
-  Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553–2561, 2013.
-  Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.