Products of Many Large Random Matrices and Gradients in Deep Neural Networks
We study products of random matrices in the regime where the number of terms and the size of the matrices simultaneously tend to infinity. Our main theorem is that the logarithm of the norm of such a product applied to any fixed vector is asymptotically Gaussian. The fluctuations we find can be thought of as a finite temperature correction to the limit in which first the size and then the number of matrices tend to infinity. Depending on the scaling limit considered, the mean and variance of the limiting Gaussian depend only on either the first two or the first four moments of the measure from which matrix entries are drawn. We also obtain explicit error bounds on the moments of the norm and the Kolmogorov-Smirnov distance to a Gaussian. Finally, we apply our result to obtain precise information about the stability of gradients in randomly initialized deep neural networks with ReLU activations. This provides a quantitative measure of the extent to which the exploding and vanishing gradient problem occurs in a fully connected neural network with ReLU activations and a given architecture.
Products of independent random matrices are a classical topic in probability and mathematical physics with applications to a variety of fields, from wireless communication networks [tulino2004random] to the physics of black holes [cotler2017black], random dynamical systems [pollicott2010maximal], and recently to the numerical stability of randomly initialized neural networks [hanin2018neural, pennington2017resurrecting]. In the context of neural networks, products of random matrices are related to the numerical stability of gradients at initialization and therefore give precise information about the exploding and vanishing gradient problem (see Section 1.5, Proposition 2, and Corollary 3). The purpose of this article is to prove several new results about such products in the regime where the number of terms and the sizes of matrices grow simultaneously. This regime has attracted attention [dsl_2, joint_limit] but remains poorly understood. We find new phenomena not present when the number of terms and the size of the matrices are sent to infinity sequentially rather than simultaneously (see Section 1.1 for more on this point).
To explain our results, let $N$ be a positive integer and let $n_0, n_1, \ldots, n_N$ be a list of positive integers. We are concerned with the non-asymptotic analysis of products of $N$ independent rectangular random matrices of sizes $n_j \times n_{j-1}$ with real entries:
The specific matrix ensembles we study depend on a parameter $p$ and a distribution $\mu$ on $\mathbb{R}$. We define:
where $D_j$ are diagonal matrices
and $W_j$ are independent random matrices whose entries are drawn i.i.d. from a fixed distribution $\mu$ on $\mathbb{R}$ satisfying the following four conditions:
When $p = 1$, the matrices $D_j$ are the identity. In contrast, when $p = 1/2$, the matrix product naturally arises in connection with the input-output Jacobian matrix for neural nets with ReLU nonlinearity and $N$ layers of widths $n_0, \ldots, n_N$, initialized with random weights drawn from $\mu$. In particular, the following equality in distribution holds when $p = 1/2$:
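For readers who want to experiment, the ensemble just described can be sketched in a few lines of Python. The Gaussian choice of $\mu$ and the $1/(p\,n)$ variance normalization below are illustrative assumptions (the paper allows any $\mu$ satisfying conditions (i)-(iv)), as are the function and variable names:

```python
import numpy as np

def sample_product(widths, p=1.0, rng=None):
    """Sample X = W_N D_N ... W_1 D_1 for layer widths [n_0, ..., n_N].

    W_j is n_j x n_{j-1} with i.i.d. N(0, 1/(p * n_{j-1})) entries (an
    illustrative Gaussian choice of mu and of normalization), and D_j is
    diagonal with i.i.d. Bernoulli(p) entries; p = 1 makes D_j the identity.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.eye(widths[0])
    for j in range(1, len(widths)):
        n_out, n_in = widths[j], widths[j - 1]
        D = np.diag(rng.binomial(1, p, size=n_in).astype(float))
        W = rng.normal(0.0, 1.0 / np.sqrt(p * n_in), size=(n_out, n_in))
        X = W @ D @ X
    return X
```

With `p=1/2` this produces samples from the ensemble that, by Proposition 2 below, matches the input-output Jacobian of a random ReLU net.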
so that, when $p = 1/2$, the singular values of the input-output Jacobian are equal in distribution to those of the matrix product. This is a consequence of Proposition 2 below, which opens the door to a rigorous study of the so-called exploding and vanishing gradient problem for ReLU nets at finite depth and width. This refines the approach of the first author in [hanin2018neural], and we refer the reader to Section 1.5 for precise definitions and an extended discussion of this point. Our main result concerns the distribution of
As explained in Section 1.4 below, this quantity can be thought of as a line-to-line partition function in a disordered medium given by the computation graph underlying the matrix product defining it, with the fixed vector playing the role of a kind of initial condition. The diagonal matrices $D_j$ then correspond to $\{0,1\}$-valued spins on the vertices of this graph, restricting the allowed paths of the directed random polymer. With this interpretation, our main result, Theorem 1, shows that the analogue of the free energy, namely
is Gaussian up to an error that tends to zero when tend to infinity.
Fix , and a distribution satisfying (i)-(iv) above. Let be some fixed unit vector, and for any choice of , , set
Let be as in (2). Then, the norm of the vector is approximately log-normally distributed:
This approximation holds both in distribution and in the sense of moments. More precisely, with denoting the Kolmogorov-Smirnov distance,
where the implicit constant is uniform for in a compact subset of , and in a compact set bounded away from . Moreover, for every satisfying we have
where the implicit constant depends on and the moments of but not on
In the proof of equation (5), we actually show that is bounded above by
for any choice of and where the constants depend only on the moments of and . By taking and restricting the in a compact set, the result claimed in Theorem 1 holds. Moreover, if we take , then we will actually prove instead the sharper result that
The two conclusions of Theorem 1, equations (6) and (7), have independent proofs. We prove equation (6) by a path-counting argument in Section 3. The argument for equation (7) in Section 4, in contrast, uses a central limit theorem for martingales.
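The log-normality asserted by Theorem 1 is easy to probe numerically. The sketch below draws Monte Carlo samples of $\log \|W_N \cdots W_1 v\|$ under illustrative assumptions (Gaussian entries with variance $1/n$, the $p = 1$ case, square matrices, $v = e_1$); in this Gaussian square case the chi-squared representation (9) predicts mean and variance close to $-N/2n$ and $N/2n$.

```python
import numpy as np

def log_norm_samples(n, N, trials, rng):
    """Monte Carlo samples of log ||W_N ... W_1 v|| with v = e_1 and
    i.i.d. N(0, 1/n) entries (the p = 1 Gaussian case, square matrices)."""
    out = np.empty(trials)
    for t in range(trials):
        v = np.zeros(n)
        v[0] = 1.0
        for _ in range(N):
            v = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n)) @ v
        out[t] = np.log(np.linalg.norm(v))
    return out
```

For example, with `n = 50` and `N = 30` the sample mean and standard deviation come out near $-N/2n = -0.3$ and $\sqrt{N/2n} \approx 0.55$, and a histogram of the samples looks Gaussian.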
1.1 Joint scaling limits
achieved for instance when the $n_i$ are equal and proportional to $N$. This asymptotic normality cannot be seen by taking the limits $N \to \infty$ and $n \to \infty$ one after the other. Indeed, consider the case when $p = 1$ and $\mu$ is the standard Gaussian measure. A simple computation using the rotational invariance of i.i.d. Gaussian matrices shows the equality in distribution
where is a chi-squared random variable with degrees of freedom and the terms in the product are independent. In the limit where is fixed and we have and so
On the other hand, if the are uniformly bounded, then we have
In fact, for fixed, converges only with an additional scaling:
In particular, we have
making (8) an interesting regime for . The non-commutativity of the limits is well known [gaussian_RMT_6, DeiftOpenProblems, joint_limit] and is related to the fact that the local statistics of the singular values of are sensitive to the order in which the limits above are taken. Remaining in the simple case of $p = 1$ and $\mu$ Gaussian, a simple application of the central limit theorem shows that when all the $n_i$ are equal and related to $N$ as in (8), the exact chi-squared representation of equation (9) gives the convergence in distribution:
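The chi-squared representation used here can be checked directly: for square matrices with standard Gaussian entries, $\|W_N \cdots W_1 e_1\|^2$ has the same law as a product of $N$ independent $\chi^2_n$ variables, so the two samples below should have matching log-moments (the sizes and sample counts are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, T = 20, 10, 3000

# Direct samples of log ||W_N ... W_1 e_1||^2 with standard Gaussian entries.
direct = np.empty(T)
for t in range(T):
    v = np.zeros(n)
    v[0] = 1.0
    for _ in range(N):
        v = rng.normal(size=(n, n)) @ v
    direct[t] = np.log(v @ v)

# The chi-squared representation: a sum of N independent log chi^2_n terms.
chisq = np.log(rng.chisquare(n, size=(T, N))).sum(axis=1)
```

Both empirical means and standard deviations agree up to Monte Carlo error, reflecting the equality in distribution; the mechanism is that $W_1 e_1$ is a standard Gaussian vector whose squared norm is $\chi^2_n$ and whose direction is an independent uniform unit vector, so the argument iterates.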
which is of course consistent with Theorem 1. Part of the content of Theorem 1 is therefore that this result is essentially independent of the parameter and the measure according to which the entries of the matrices are distributed. See Section 1.3 for more discussion on the novel aspects of Theorem 1.
1.2 Connection to previous work in Random Matrix Theory
The literature on products of random matrices is vast. Much of the previous work concerns products of i.i.d. random matrices, each of size . Such ensembles have been well studied in two distinct regimes: (a) when is fixed and and (b) when is fixed and . Case (a) is related to multiplicative ergodic theory and the study of Lyapunov exponents. The seminal articles in this regime are those of Furstenberg and Kesten [furstenberg1960], which give general conditions for the existence of the top Lyapunov exponent
and the multiplicative ergodic theorem of Oseledets [oseledets], which gives conditions under which all the Lyapunov exponents take almost sure (deterministic) values. Many more recent works characterize the Lyapunov exponents under more specific assumptions, most notably for matrices which are rotationally invariant or which have real or complex Gaussian entries; see e.g. [gaussian_RMT_6, gaussian_RMT_1, gaussian_RMT_2, gaussian_RMT_3, gaussian_RMT_4, gaussian_RMT_5] as well as the survey [RMT_product_survey] and references therein.
Case (b), where is fixed and , falls into the setting of free probability. Indeed, one of the great successes of free probability is the idea of “asymptotic freeness”: in the limit , a collection of independent random matrices behaves like a collection of freely independent random variables on a non-commutative probability space (see e.g. [AGZ] Chapter 5 or [mingo_speicher] Chapters 1 and 4). Therefore, case (b) is closely related to a product of freely independent random variables; precise results are obtained in [d_fixed_1]. Earlier results [d_fixed_3, d_fixed_2] examine case (b) without explicit use of free probability. The problem of first taking and afterwards taking can also be handled with the tools of free probability in the case of Gaussian matrices; see [tucci].
As explained in the Introduction and in Section 1.1, the regimes (a), (b) are asymptotically incompatible in the sense that the limits , and do not generally commute on the level of the local behavior of the singular value distribution. Indeed, the problem of understanding what happens when both are scaled simultaneously is mentioned as an open problem in [gaussian_RMT_6]. To explain this further, we note that the work of Newman [gaussian_RMT_7, Thm. 1] in regime (a) shows that when and is fixed, the density of the Lyapunov exponents of converges in the limit when first and then to the triangular density
The work of Tucci [tucci, Thm. 3.2, Ex. 3.4] shows that for Gaussian ensembles related to , one obtains the same global limit in regime (b), when first and then . However, as explained in [gaussian_RMT_6] Section 5, while the global density of all the Lyapunov exponents is the triangular law in both cases, the local behavior (e.g. the fluctuations of the top Lyapunov exponent) is observed to differ depending on the order of the limits, even in the exactly solvable special case of products of complex Ginibre matrices.
From this spectral point of view, Theorem 1 gives information about certain averages of the Lyapunov exponents. To see this, fix and let and tend to infinity in accordance with (8). Note that we specifically do not take to infinity. Denote by the non-zero singular values of , and by the corresponding left-singular vectors. Then we have
In many situations of interest we can expect that the inner products satisfy . This happens for example if the vector is chosen uniformly at random on the -sphere or when is a fixed vector and the matrix is invariant under right multiplication by an orthogonal matrix. In this setting
Hence, Theorem 1 can be interpreted as the statement that the logarithm of the average of the non-zero singular values for is a Gaussian with mean and variance in the limit (8). These non-trivial corrections in can be seen as a finite temperature correction to the maximal entropy regime of Tucci [tucci] in which first and then . For more on this point of view, we refer the reader to [gaussian_RMT_6, Section 3.2].
Finally, in the specific case where the random matrices are complex Ginibre matrices (i.e. the matrix entries are i.i.d. complex Gaussians), very recent work [dsl_2, joint_limit] studies the limiting spectrum in the joint scaling limit , where the ratio is fixed or tends to . This work analyzes exact determinantal formulas for the joint distribution of the singular values, available in the case of complex Ginibre matrices. The analogous formulas for real Gaussian matrices, given in [gaussian_RMT_5], are significantly more complicated, and such an explicit analysis appears to be much more difficult.
1.3 Contribution of the Present Work
In the context of these previous random matrix results, let us point out four novel aspects of Theorem 1. First, it deals with the joint limit for a large class of non-Gaussian matrices with real entries. There is no integrable structure to our matrix ensembles, and we rely instead on a sum-over-paths approach to analyze the moments (6) and a martingale CLT approach for obtaining the KS distance estimates (7).
Second, the ensembles in Theorem 1 include the somewhat unusual diagonal matrices $D_j$ as part of the model. Our original motivation for including these is the connection to neural networks explained in Section 1.5. In essence, the matrices $D_j$ can be interpreted as adding i.i.d. $\{0,1\}$-valued spins to the usual sum-over-paths approach to moments of products of matrices. Only “open” paths, those with spin $1$ at every vertex, contribute to the sum, causing the open paths to be correlated. Previously, Forrester [gaussian_RMT_2] and Tucci [tucci] considered the case when the diagonal matrices were deterministic positive definite matrices.
An additional novelty of Theorem 1 is that it proves the distribution is (mostly) universal: it does not depend on the moments of the distribution beyond the mean and variance, with the exception of the fourth moment, which appears in the term . In the regime and , this term is a lower-order correction.
The fourth and final novelty of Theorem 1 we would like to emphasize is that our results are non-asymptotic, i.e. we obtain an explicit error term of the form . This is particularly useful when using Theorem 1 for studying gradients in randomly initialized neural networks (see Section 1.5).
Finally, we remark that Theorem 1 only studies for a fixed vector , and therefore leaves several questions open: for instance the joint law of for a list of vectors and more generally the limiting spectral distribution of the matrices . We plan to address these questions in forthcoming work.
1.4 Connection to Random Polymers
The matrix ensembles studied in this article are, in the case , related to directed random polymers on the complete graph of size . This model was recently explored in detail (see e.g. [random_polymers]). A key object for these polymers is the line-to-line partition function
where is the temperature of the model, and are i.i.d. mean-zero random variables that make up the underlying disordered environment. When the sum over paths is written via products of matrices of size , the disordered environment can be viewed as a multipartite graph made of vertex clusters of size , with (directed) edges from all vertices in one cluster to all vertices in the next. The edges of this graph are then decorated with the corresponding matrix entries, which are strictly positive, making the partition function a sum over paths from the input to the output of this graph. Each path is weighted by its energy, given by the product of the weights along the path.
The fact that the weights are positive makes the analysis of the partition function of this traditional random polymer model different from the analysis of the matrix product defined in (1). In particular, no cancellation is possible between the terms in the definition of the partition function above, causing it to be exponential in . The fixed and limit of the partition function in the case of these positive weights is the object of study in [random_polymers]. As explained in Section 1.3, Theorem 1 studies a different regime where both tend to infinity at the same time. The fact that our weights are mean zero gives rise to significant cancellation in the terms of from (4), so that the partition function in our mean-zero model does not grow exponentially with provided grows with as in (8). Additionally, if in our model, the effect of the diagonal Bernoulli matrices is to close every vertex independently with probability . The sum over paths in our partition function then becomes a sum only over those paths that pass through open vertices.
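The identification of the matrix product with a sum over directed paths can be made exact on a tiny example. In the sketch below the sizes, the Gaussian weights (without the spin matrices, i.e. the $p = 1$ case), and the all-ones boundary vectors are illustrative choices; the point is that the explicit path sum reproduces the matrix product entry for entry.

```python
import numpy as np
from itertools import product as paths

rng = np.random.default_rng(2)
n, N = 4, 3  # small enough to enumerate all n**(N + 1) directed paths

Ws = [rng.normal(size=(n, n)) for _ in range(N)]

# Line-to-line partition function computed via the matrix product.
v = np.ones(n)
for W in Ws:
    v = W @ v
Z_matrix = np.ones(n) @ v

# The same number as an explicit sum over directed paths
# gamma = (gamma_0, ..., gamma_N), each path weighted by the
# product of the edge weights it traverses.
Z_paths = 0.0
for gamma in paths(range(n), repeat=N + 1):
    w = 1.0
    for l in range(N):
        w *= Ws[l][gamma[l + 1], gamma[l]]
    Z_paths += w
```

With mean-zero weights, `Z_paths` accumulates terms of both signs, which is exactly the cancellation mechanism distinguishing our model from the positive-weight polymer.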
1.5 The Case $p = 1/2$ as Gradients in Random Neural Nets
One of our motivations for studying these ensembles is that, as we prove in Proposition 2 below, the case $p = 1/2$ corresponds exactly to the input-output Jacobian of randomly initialized neural networks with ReLU activations. To explain this connection, fix . A neural network with ReLU activations, depth , and layer widths is any function of the form
where are affine
and for and any vector we write
The matrices and vectors are called, respectively, the weights and biases of at layer while collectively define the architecture of . We will write for an input to and will define
to be the vectors of activities before and after applying at the neurons at layer .
In practice, the weights and biases in a neural network are first randomly initialized and then optimized by (stochastic) gradient descent on a task-specific loss that depends only on the outputs of the network. A single gradient descent update for a trainable parameter (i.e. an entry in one of the weight matrices ) is
where is the learning rate. An important practical impediment to gradient based learning is the exploding and vanishing gradient problem (EVGP), which occurs when gradients are numerically unstable:
making the parameter update (10) too small to be meaningful or too large to be precise. An important intuition is that the EVGP will be most pronounced at the start of training, when the weights and biases of are random and the implicit structure of the data being processed has not yet regularized the function computed by .
As explained below, the EVGP for a depth net with hidden layer widths is essentially equivalent to having large fluctuations for the entries (or, in the worst case, for the singular values) of the Jacobians of the transformations between various layers:
The next result shows that the singular value distribution of is the same as that of since
Proposition 2 also shows that, for any collection of vectors we have the following equality in distribution when :
Let be a net with depth and layer widths . Fix . Suppose the weights of are drawn i.i.d. from the measure as in the original definition (2). Then, writing for the dimensional -Bernoulli random vector, whose entries are independent and take the values with probability , we have
where are diagonal -Bernoulli matrices as in (2) with parameter and denotes equality in distribution.
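A quick numerical sanity check of the mechanism behind Proposition 2 can be run as follows. This is a sketch under illustrative assumptions (Gaussian weights with He-style variance $2/n$, small Gaussian biases, square layers, one fixed input reused across many independently initialized ReLU nets); the sign-symmetry argument of Section 2 predicts that the fraction of “open” neurons in every layer concentrates near $1/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, L, T = 30, 4, 2000  # width, depth, number of independent random nets
x = rng.normal(size=n)  # one fixed nonzero input, reused for every net

open_frac = np.zeros(L)  # running mean of the open-neuron fraction per layer
for _ in range(T):
    z = x
    for l in range(L):
        W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
        b = rng.normal(0.0, 0.1, size=n)
        z = W @ z + b              # pre-activation at layer l
        open_frac[l] += (z > 0).mean()
        z = np.maximum(z, 0.0)     # ReLU
open_frac /= T
```

Because each pre-activation is symmetric about zero conditionally on the previous layer, every entry of `open_frac` ends up close to 0.5, matching the Bernoulli(1/2) diagonal matrices in Proposition 2.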
Before proving Proposition 2, let us explain why the functions that we study in Theorem 1 are related to the EVGP. Due to the compositional nature of the function computed by we may use the chain rule to write, for the weight connecting neuron to neuron in layer
where is the column of . Therefore, fluctuations of the gradient descent update are captured precisely by fluctuations of bilinear functionals of various layer-to-layer Jacobians in . We study these quantities in this article, and in Theorem 1 we obtain precise distributional and moment estimates for them. For instance, Theorem 1 combined with Proposition 2 immediately yields the following
Let be a fully connected depth net with hidden layer widths and randomly initialized weights drawn i.i.d. from the measure , scaled to have variance as in (2). Suppose also that the biases of are drawn i.i.d. from any measure satisfying the same assumptions as the measure . Fix any with , and write for the input-output Jacobian of . We have
and the implicit constant is uniform when ranges over a compact subset of .
For more information about the EVGP, the statistics of gradients in random nets, and the distribution of the singular values of the input-output Jacobian, we refer the interested reader to [hanin2018neural, pennington2017resurrecting, PenningtonSG18, pennington2017nonlinear].
2 Proof of Proposition 2
2.1 Idea behind the proof
The essential idea behind Proposition 2 is to notice that the derivative of the function is , so when doing the chain rule to compute , we find the following -valued diagonal matrices naturally appearing:
Since the random weights and biases are symmetrically distributed around (i.e. ) and have no atoms, it is easily verified that each entry in is equally likely to be positive or negative regardless of the value of . Hence the matrix in equation (12) is equal in distribution to the Bernoulli matrix when . This informally explains the connection between and .
It remains to see that these diagonal matrices are independent of each other (the outputs of each layer are fed into subsequent layers, so the matrices are not a priori independent). This again is a consequence of the fact that the underlying random variables are symmetrically distributed, and will be formally verified by conjugating the weights and biases of the network by random sign variables. This does not change the distribution of the network, but allows us to see the independence between layers in a more concrete way.
2.2 Proof of Proposition 2
Proof of Proposition 2.
Fix a neural net as in the statement of Proposition 2 and denote its weights and biases at layer by and . For each let
be an i.i.d. collection of random variables, each taking the values $\pm 1$ with probability $1/2$. We will also define
Consider the neural net with weights and biases defined by changing the signs of the weights and biases of as follows:
We will denote by the activations and input-output Jacobian for both computed at the same fixed input
Note that since we’ve assumed that have distributions that are symmetric around , we have
Hence, since the weights of the two networks are identically distributed,
On the other hand, the chain rule yields the following recursion for
and we have used that diagonal matrices commute. Note that a priori the matrices depend on the weights and biases for since . However, we will now verify the following claim about the collection of matrices and variables :
and that moreover the collection is independent, and that each is distributed like a diagonal matrix with independent diagonal entries taking the values with probability . Once we have proven this, since , equation (13) shows that and that is a diagonal -Bernoulli random variable independent of everything else. This completes the proof of the present proposition, since this is exactly the recurrence for the matrices .