Mean Field Residual Networks: On the Edge of Chaos

Greg Yang
Microsoft Research AI
gregyang@microsoft.com
&
Samuel S. Schoenholz
Google Brain
schsam@google.com
Work done while at Harvard University
Abstract

We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the “edge of chaos” hypothesis, these subexponential and polynomial laws allow residual networks to “hover over the boundary between stability and chaos,” thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.


1 Introduction

Previous works Poole et al. (2016); Daniely et al. (2016); Schoenholz et al. (2017) have shown that randomly initialized neural networks exhibit a spectrum of behavior with depth, from stable to chaotic, which depends on the variance of the initializations: the cosine distance of two input vectors converges exponentially fast with depth to a fixed point in [0, 1]; if this fixed point is 1, then the behavior is stable; if this fixed point is 0, then the behavior is chaotic. It has been argued in many prior works Bertschinger and Natschläger (2004); Poole et al. (2016) that effective computation can only be supported by a dynamical behavior that is on the edge of chaos. Too much stability prevents the neural network from telling apart two different inputs. While some chaotic behavior can increase the expressivity of a network, too much chaos makes the neural network think two similar inputs are very different. At the same time, the same initialization variances also control how far gradient information can be propagated through the network; the networks with chaotic forward dynamics will tend to suffer from exploding gradients, while networks with stable forward dynamics will tend to suffer from vanishing gradients.

These works have focused on vanilla (fully connected) feedforward networks. Here we consider residual networks He et al. (2016a, b) (with fully-connected layers and without batchnorm), which are a family of recently proposed neural network architectures that has achieved state-of-the-art performance on image recognition tasks, beating all other approaches by a large margin. The main innovation of this family of architectures is the addition of a passthrough (identity) connection from the previous layer to the next, such that the usual nonlinearity computes the “residual” between the next-layer activation and the previous-layer activation.

In this work, we seek to characterize randomly initialized residual networks. One of our main results is that random residual networks for many nonlinearities such as tanh live on the edge of chaos, in that the cosine distance of two input vectors will converge to a fixed point at a polynomial rate, rather than an exponential rate, as with vanilla tanh networks. Thus a typical residual network will slowly cross the stable-chaotic boundary with depth, hovering around this boundary for many layers. In addition, for most of the nonlinearities considered here, the mean field estimate of the gradient grows subexponentially with depth. In fact, for α-ReLU, the αth power of ReLU, with α < 1, the gradient grows only polynomially. These theoretical results provide some theoretical justification for why residual networks work so well in practice. In our experiments, we are also able to predict surprisingly well the relative performances of trained residual networks based only on their initialization hyperparameters, in a variety of settings. In particular, we find that the quality of initialization for tanh resnets is determined by trainability (how much gradient explosion there is on average) while that for (α-)ReLU resnets is determined by expressivity (how far two different input vectors can be pulled apart) (see Section 6). To the best of our knowledge, this is the first time that a quantity other than gradient explosion/vanishing has been found to control the quality of initialization. We establish theoretically and empirically that the best initialization variances for residual networks depend on the depth of the network (contrary to the feedforward case Schoenholz et al. (2017)), so that common initialization schemes like Xavier Glorot and Bengio (2010) or He He et al. (2015) cannot be optimal. In fact, even the rationale of He initialization is incorrect for ReLU residual networks, because it tries to control gradient dynamics rather than expressivity. However, we want to emphasize that we study a simplified model of residual networks in this work, with no batchnorm or convolutional layers, so these results are not necessarily indicative of the MSRA residual network used in practice He et al. (2016a).

In the body of this paper, we give an account of the general intuition and/or proof strategy where appropriate for our theoretical results, but we relegate all formal statements and proofs to the appendix.

2 Background

Consider a vanilla feedforward neural network of L layers, with each layer l having N_l neurons; here layer 0 is the input layer. For ease of presentation we assume all hidden layer widths are the same, N_l = N for all l ≥ 1. Let x^0 be the input vector to the network, and let x^l for l ≥ 1 be the activation of layer l. Then a neural network is given by the equations

    h^l = W^l x^{l-1} + b^l,    x^l = φ(h^l),

where
(i) h^l is the pre-activation at layer l,
(ii) W^l is the weight matrix,
(iii) b^l is the bias vector, and
(iv) φ is a nonlinearity, for example tanh or ReLU, which is applied coordinatewise to its input.

To lighten up notation, we suppress the explicit layer numbers and write

    h = Wx + b,    x' = φ(h),

where x implicitly denotes x^{l-1}, h denotes h^l, and x' denotes x^l (and analogously for other layer-indexed quantities).

A series of papers Poole et al. (2016); Raghu et al. (2016); Schoenholz et al. (2017) investigated the “average behavior” of random neural networks sampled via W^l_{ij} ~ N(0, σ_w²/N) and b^l_i ~ N(0, σ_b²), for fixed parameters σ_w and σ_b independent of l. Consider the expectation of |x^l|²/N, the normalized squared length of x^l, over the sampling of the weights and biases. Poole et al. (2016) showed that this quantity converges to a fixed point exponentially fast for sigmoid nonlinearities. Now suppose we propagate two different vectors x^0 and y^0 through the network. Poole et al. (2016) also showed that the expectation of the normalized dot product ⟨x^l, y^l⟩/N converges exponentially fast to a fixed point. The ratio of the normalized dot product to the normalized squared length is the cosine distance between x^l and y^l. Thus these two exponential convergence results show that the cosine distance converges exponentially fast to a fixed point as well. Intuitively, this means that a vanilla feedforward network “forgets” the geometry of the input space “very quickly,” after only a few layers.
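To make this background concrete, here is a minimal Monte Carlo sketch of these forward statistics (the width, depth, and variance values are arbitrary illustrative choices, not the settings used in Poole et al. (2016)):

import numpy as np

def vanilla_tanh_cosine(depth=50, width=1000, sigma_w=2.0, sigma_b=0.1, seed=0):
    """Propagate two inputs through one random tanh feedforward net and
    record the cosine similarity of their activations at every layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    y = rng.standard_normal(width)
    cosines = []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        b = rng.normal(0.0, sigma_b, size=width)
        x, y = np.tanh(W @ x + b), np.tanh(W @ y + b)
        cosines.append(float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))))
    return cosines

# The per-layer cosine similarity settles near its fixed point within a few layers.
print(np.round(vanilla_tanh_cosine()[:10], 3))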

In addition, Schoenholz et al. (2017), under certain independence assumptions, showed that the expected normalized squared norm of the gradient also vanishes or explodes in an exponential fashion with depth, with the “half-life” controlled by σ_w and σ_b. They verified that this theoretical “half-life” correlates in practice with the maximal number of layers admissible for good performance.

At the same time, Daniely et al. (2016) published work of a similar nature, but phrased in the language of reproducing kernel Hilbert spaces, and provided high probability estimates that are meaningful when the width is finite and the depth is logarithmic in the width. However, they essentially fixed the variance parameters σ_w and σ_b, and furthermore, their framework (for example the notion of a “skeleton”) does not immediately generalize to the residual network case.

In this work, we show that residual networks have very different dynamics from vanilla feedforward networks. In most cases, the cosine distance convergence rate and the gradient growth rate are subexponential in a residual network, and often these rates are in fact polynomial.

3 Preliminaries

Residual networks were first introduced by He et al. (2016a) and later refined by He et al. (2016b), and they are now commonplace among deployed neural systems. The key innovation there is the addition of a shortcut connection from the previous layer to the next. We define the following idealized architectures for ease of analysis. Note that we only consider fully-connected affine layers instead of convolutional layers. A reduced residual network (RRN) has the recurrence

    h^l = W^l x^{l-1} + b^l,    x^l = x^{l-1} + φ(h^l).

A (full) residual network (FRN) in addition has an affine connection, given by weights V^l and biases a^l, from the nonlinearity to the next layer:

    h^l = W^l x^{l-1} + b^l,    x^l = x^{l-1} + V^l φ(h^l) + a^l.

We are interested in the “average behavior” of these networks when the weights and biases W^l, b^l, V^l, and a^l are sampled i.i.d. from Gaussian distributions with standard deviation parameters σ_w, σ_b, σ_v, and σ_a respectively, independent of l. Here we take the variance of each entry of W^l to be σ_w²/N so that the variance of each coordinate of W^l x^{l-1} is σ_w² |x^{l-1}|²/N, assuming each x^{l-1}_j is fixed (similarly for V^l). Such an initialization scheme is standard in practice.
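For concreteness, here is a minimal sketch of a single random forward pass through these two idealized architectures under the sampling convention just described (the width and variance values are arbitrary illustrative choices):

import numpy as np

def rrn_forward(x0, depth, phi=np.tanh, sigma_w=1.0, sigma_b=1.0, seed=0):
    """Reduced residual network: x^l = x^{l-1} + phi(W^l x^{l-1} + b^l)."""
    rng = np.random.default_rng(seed)
    N = len(x0)
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N)
        x = x + phi(W @ x + b)
    return x

def frn_forward(x0, depth, phi=np.tanh, sigma_w=1.0, sigma_b=1.0,
                sigma_v=1.0, sigma_a=1.0, seed=0):
    """Full residual network: x^l = x^{l-1} + V^l phi(W^l x^{l-1} + b^l) + a^l."""
    rng = np.random.default_rng(seed)
    N = len(x0)
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N)
        V = rng.normal(0.0, sigma_v / np.sqrt(N), size=(N, N))
        a = rng.normal(0.0, sigma_a, size=N)
        x = x + V @ phi(W @ x + b) + a
    return x

x0 = np.random.default_rng(1).standard_normal(500)
# For tanh, the mean squared activation grows roughly linearly with depth (Section 5).
print(np.mean(rrn_forward(x0, 100) ** 2), np.mean(frn_forward(x0, 100) ** 2))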

We make several key “physical assumptions” to make theoretical computations tractable:

Axiom 1 (Symmetry of activations and gradients).

(a) We assume that the activations x^l_i and the hidden units h^l_i are distributed symmetrically around 0, for any l and i. (b) We also assume that the gradient of the loss function with respect to the activations, ∂E/∂x^l_i, is distributed symmetrically around 0, for any l and i.

One can see that Axiom 1(a) is satisfied if the input x^0 is itself symmetrically distributed, and Axiom 1(b) is satisfied if Axiom 2 below is true and the gradient at the last layer is symmetrically distributed. But in general the axiom is justified, both empirically and theoretically, as an approximation: any asymmetry stays about constant with l, while the typical sizes of the activations and hidden units grow rather quickly at the same pace with l (as will be seen later in calculations), so that the asymmetry becomes negligible in comparison; similarly for the gradient quantities.

Axiom 2 (Gradient independence).

(a) We assume that we use a different set of weights for backpropagation than those used to compute the network outputs, but sampled i.i.d. from the same distributions. (b) For any loss function E, we assume that the gradient at layer l, ∂E/∂x^l, is independent from all activations and hidden vectors from the previous layers.

Axiom 2(a) was first made in Schoenholz et al. (2017) for computing the mean field theory of gradients for feedforward tanh networks. This is similar to the practice of feedback alignment Lillicrap et al. (2016). Even though we are the first to explicitly formulate Axiom 2(b), it was in fact already applied implicitly in the gradient calculations of Schoenholz et al. (2017). Note that a priori Axiom 2(b) is not true, as the gradient at layer l is computed by backpropagating through the weights of every later layer, which were also used in the forward pass to compute the activations of those layers, which in turn depend on the activations at layer l. Nevertheless, in practice both subassumptions hold very well.

Now we define the central quantities studied in this paper. Inevitably, our paper involves a large amount of notation that may be confusing for the first-time reader. We have included a glossary of symbols (Table A.1) to ameliorate notation confusion.

Definition. Fix an input x^0. Define the length quantities p^l := E[(x^l_i)²] for l ≥ 0 and q^l := E[(h^l_i)²] for l ≥ 1. Here the expectations are taken over all random initializations of weights and biases for all layers, as N → ∞ (large width limit). Note that in our definition, the index i does not matter, by the axioms of Section 3.

Definition. Fix two inputs x^0 and y^0. For a quantity defined with respect to the input x^0, we write the corresponding quantity for the input y^0 with the argument (y), e.g. x^l(y) and p^l(y). Then define the correlation quantities γ^l := E[x^l_i x^l_i(y)] for l ≥ 0 and λ^l := E[h^l_i h^l_i(y)] for l ≥ 1, where the expectations are taken over all random initializations of weights and biases for all layers, as N → ∞ (large width limit). Again, here the index i does not matter, by the axioms of Section 3. By metric expressivity, we mean s^l := E[(x^l_i − x^l_i(y))²] = p^l + p^l(y) − 2γ^l. Additionally, define the cosine distance quantities e^l := γ^l / √(p^l p^l(y)) and c^l := λ^l / √(q^l q^l(y)), and we will also call e^l angular expressivity. In this paper, for the ease of presentation, we assume p^0 = p^0(y). Then, as we will see, p^l = p^l(y) for all l, and as a result, e^l = γ^l / p^l and s^l = 2p^l(1 − e^l).

Definition. Fix an input x^0 and a gradient vector ∂E/∂x^L of some loss function E with respect to the last layer x^L. Then define the gradient quantities χ^l := E[(∂E/∂x^l_i)²] for l ≤ L, and the analogous mean normalized squared gradient norms with respect to the trainable parameters of each layer. Here the expectations are taken with Axiom 2 in mind, over both random initialization of forward and backward weights and biases, as N → ∞ (large width limit). Again, the index i (or the parameter index) does not matter, by the axioms of Section 3.

Asymptotic notations.

The expressions have their typical meanings, and iff . We take to mean for some (this is slightly different from the standard usage of ), and We introduce a new notation: if and , as , for any . All asymptotic notations are sign-less, i.e. can indicate either positive or negative quantities, unless stated otherwise.

4 Overview

The primary reason we may say anything about the average behavior of any of the above quantities is the central limit theorem: every time the activations of the previous layer pass through an affine layer whose weights are sampled i.i.d., each output coordinate is a sum of a large number of random variables and thus follows an approximately Gaussian distribution. The mean and variance of this distribution can be computed by keeping track of the means and variances of the activations in the previous layer.
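As a sketch of this heuristic in the notation of Section 2, conditioning on the previous layer's activations and using the σ²/N weight-variance convention of Section 3:

    h^l_i = Σ_j W^l_{ij} x^{l-1}_j + b^l_i  ≈  N(0, σ_w² |x^{l-1}|²/N + σ_b²)   given x^{l-1},

so that, averaging over initializations, q^l ≈ σ_w² p^{l-1} + σ_b²; analogous one-step computations yield the other recurrences below.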

In what follows, we use this technique to derive recurrence equations governing p^l, q^l, γ^l, and λ^l for different architectures and different activation functions. We use these equations to investigate the dynamics of p^l and e^l, the key quantities in the forward pass, and the dynamics of χ^l, the key quantity in the backward pass.

The cosine distance e^l in some sense measures the angular geometry of two vectors. If e^l = 1, then the vectors are parallel; if e^l = 0, then they are orthogonal. Just as in Poole et al. (2016) and Schoenholz et al. (2017), we will show that in all of the architectures and activations we consider in this paper, e^l converges to a fixed point e* as l → ∞ (Endnote 1: Under simplified conditions, Daniely et al. (2016) showed that there exists a fixed point for any “well-behaved” activation function in a feedforward net. However, this result does not apply to architectures with residual connections.). Thus, on the average, as vectors propagate through the network, the geometry of the original input space, for example linear separability, is “forgotten” by residual networks as well as by vanilla networks. But we will prove and verify experimentally that, while Poole et al. (2016) and Schoenholz et al. (2017) showed that the convergence rate to e* is exponential in a vanilla network, the convergence rate is only polynomial in residual networks, for tanh and α-ReLU (Section 5) nonlinearities; see Section B.1.1, Section B.1.2, Section B.2.1, and Section B.2.1. This slow convergence preserves geometric information in the input space, and allows a typical residual network to “hover over the edge of chaos”: even when the cosine distance converges to 0, corresponding to “chaos” (resp. 1, corresponding to “stability”), for the number of layers usually seen in practice, e^l will reside well away from 0 (resp. 1).

Similarly, the quantity s^l measures the metric geometry of two vectors. The evolution of s^l with l tells us the ability of the average network to separate two input points in terms of Euclidean distance. Again, for tanh and α-ReLU (α < 1) nonlinearities, s^l varies only polynomially with l.

On the other hand, χ^l measures the size of the gradient at layer l, and through it we track the dynamics of gradient backpropagation, be it explosion or vanishing. In contrast to vanilla tanh networks, which can experience either of these two phenomena depending on the initialization variances, typical residual networks cannot have vanishing gradients, in the sense of χ^l vanishing as the depth goes to infinity; see Section B.1.1 and Section B.1.2. Furthermore, while vanilla tanh networks exhibit exponentially vanishing or exploding gradients, all of the activation/architecture pairings considered here, except the full residual network with ReLU, have subexponential gradient dynamics. While tanh residual networks (reduced or full) have gradients growing like exp(Θ(√l)) (Section B.1.2), α-ReLU residual networks for α < 1 have gradients growing only polynomially in l (Section B.2.1). Instead of χ^l, we may also consider the sizes of the gradients with respect to the actual trainable parameters. For tanh and α-ReLU with α < 1, they are still subexponential and polynomial, respectively (Section B.2.1). On the other hand, while χ^l is exponential for a ReLU resnet, its weight gradients have sizes independent of layer (Section B.2.1)! This is the only instance in this paper of gradient norm being completely preserved across layers.

The above overviews the theoretical portion of this paper. Through experiments, we discover that we can very accurately predict whether one random initialization leads to better performance than another on the test set, after training, by leveraging the theory we build. Residual networks of different nonlinearities have different controlling quantities: for resnets with tanh, the optimal initialization is obtained by controlling the gradient explosion ratio χ^0/χ^L; whereas for ReLU and α-ReLU, the optimal initialization is obtained by maximizing the metric expressivity s without running into numerical issues (with floating point computation). See Section 6 for details.

Over the course of our investigation of α-ReLU, we derived several new identities involving the associated kernel functions, first defined in Cho and Saul (2009), which relate them to the zeroth Bessel functions (Sections C.7.1, C.7.1, C.7.1 and C.7.1).

5 Theoretical Results

In what follows in the main text, we assume nonzero variance parameters; in the appendix, the formal statement of each main theorem will contain results for the other cases. We are interested in the two major categories of nonlinearities used today: tanh-like and rectified units. We make the following formal definitions as a foundation for further consideration.

Definition. We say a function φ : ℝ → ℝ is tanh-like if φ is antisymmetric (φ(−x) = −φ(x)), |φ(x)| ≤ 1 for all x, and φ(x) monotonically increases to 1 as x → ∞.

Definition. Define the α-ReLU by ψ_α(x) := x^α if x > 0 and 0 otherwise. (Endnote 2: Note that in practice, to avoid the diverging gradient of ψ_α as x → 0+ when α < 1, we can use a tempered version of α-ReLU that agrees with ψ_α away from 0 but has a bounded derivative near 0; the conclusions of this paper on ψ_α should hold similarly for such a tempered version as well.)

Antisymmetric φ / RRN: Theorems B.1.1, B.1.1, B.1.1
Any φ / FRN: Theorems B.1.2, B.1.2, B.1.2
Table 1: Main Recurrences

By applying the central limit theorem as described in the last section, we derive a set of recurrences for different activation/architecture pairs, shown in Table 1 (see appendix for proofs). They leverage certain integral transforms (Endnote 3: Daniely et al. (2016) called the version of W_φ with the variance argument fixed the “dual function” of φ.), as in the following

Definition. Define the transforms V_φ and W_φ by V_φ(q) := E[φ(z)²] for z ~ N(0, q), and W_φ(q, λ) := E[φ(z_1)φ(z_2)] for (z_1, z_2) centered jointly Gaussian with Var(z_1) = Var(z_2) = q and Cov(z_1, z_2) = λ.
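A minimal numerical sketch of these transforms (Gauss-Hermite quadrature for V_φ and a Monte Carlo estimate for W_φ; the argument convention — variance q, covariance λ — follows the definition above):

import numpy as np

def V_transform(phi, q, n_nodes=80):
    """V_phi(q) = E[phi(z)^2] for z ~ N(0, q), via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)  # weight exp(-z^2/2)
    return float(np.sum(weights * phi(np.sqrt(q) * nodes) ** 2) / np.sqrt(2 * np.pi))

def W_transform(phi, q, lam, n_samples=200_000, seed=0):
    """W_phi(q, lam) = E[phi(z1) phi(z2)] for centered jointly Gaussian (z1, z2)
    with Var = q and Cov = lam, via Monte Carlo."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal([0.0, 0.0], [[q, lam], [lam, q]], size=n_samples)
    return float(np.mean(phi(z[:, 0]) * phi(z[:, 1])))

print(V_transform(np.tanh, 4.0))      # V_tanh saturates toward 1 as q grows
print(W_transform(np.tanh, 4.0, 2.0))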

These recurrences are able to track the corresponding quantities in practice very well. For example, Fig. 1 compares theory vs. experiments for the tanh/FRN pair. The agreement is very good for tanh/RRN (not shown, but similar to the case of tanh/FRN with σ_v = 1 and σ_a = 0) and for α-ReLU/FRN as well (see Fig. A.1).

As mentioned in previous sections, we seek to characterize the long term/high depth behavior of all of the quantities defined in Section 3. To do so, we solve for the asymptotics of the recurrences in Table 1, where φ is instantiated with tanh or α-ReLU. Our main dynamics results are summarized in Table 2.

Tanh/RRN | Tanh/FRN | ReLU/FRN | α-ReLU/FRN, α < 1
B.1.1 | B.1.2 | B.2.1 | B.2.1
B.1.1 | B.1.2 | B.2.1 | B.2.1
B.1.1 | B.1.2 | B.2.1 | B.2.1
B.1.1 | B.1.2 | B.2.1 | B.2.1
Table 2: Summary of Main Dynamics Results. The asymptotic rates are established in the referenced theorems. Note that while χ^l is exponential for ReLU/FRN, the gradients with respect to the weight parameters W and V have norms constant in l (Section B.2.1). Also, the α-ReLU column is for α < 1 only.

5.1 Tanh

Figure 1: Our equations predict the relevant quantities very well in practice. These plots make the comparison between prediction and measurements for the full resnet with tanh activation, for a fixed setting of the variance parameters σ_w, σ_b, σ_v, σ_a. Left-to-right: (a) p^l and q^l against layer l for 200 layers. (b) e^l against l for 200 layers. Both (a) and (b) trace out curves for different initial conditions. (c) Different gradient quantities against l for 50 layers. From left to right the layer number decreases, following the direction of backpropagation. Notice that the gradient increases in norm as l decreases. All three figures exhibit smooth curves, which are theoretical estimates, and irregular curves with shades around them, which indicate empirical means and standard deviations (both taken in regular scale, not log scale). (a) and (b) are made with 20 runs of resnets of width 1000. (c) is made with 25 runs of resnets of width 250.
Forward dynamics.

In both the RRN and the FRN, p^l and q^l increase as Θ(l) (Section B.1.1), as one might expect by observing that tanh(h)² → 1 as |h| → ∞, so that, for example in the RRN case, the recurrence becomes p^l ≈ p^{l-1} + 1. This is confirmed graphically by the black lines of the leftmost chart of Fig. 1. We carefully verify that this intuition is correct in its proof in the appendix, and find that in fact p^l ≈ l in the RRN case and p^l ≈ (σ_v² + σ_a²) l in the FRN case.
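A quick numerical check of this linear growth, assuming the FRN length recurrence takes the form suggested by the central limit heuristic of Section 4 (q^l = σ_w² p^{l-1} + σ_b², p^l = p^{l-1} + σ_v² V_tanh(q^l) + σ_a²; the variance values below are arbitrary):

import numpy as np

def V_tanh(q, n_nodes=80):
    """V_tanh(q) = E[tanh(z)^2] for z ~ N(0, q), via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    return float(np.sum(weights * np.tanh(np.sqrt(q) * nodes) ** 2) / np.sqrt(2 * np.pi))

def tanh_frn_lengths(depth=500, p0=1.0, sw2=1.0, sb2=0.25, sv2=1.0, sa2=0.25):
    """Iterate the assumed length recurrence for a tanh FRN."""
    p = [p0]
    for _ in range(depth):
        p.append(p[-1] + sv2 * V_tanh(sw2 * p[-1] + sb2) + sa2)
    return np.array(p)

p = tanh_frn_lengths()
# Late-layer slope approaches sv2 + sa2 = 1.25 as V_tanh saturates to 1.
print((p[-1] - p[-101]) / 100)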

What about e^l? The middle chart of Fig. 1 shows that over time, e^l contracts toward the center of the interval [0, 1], but from the looks of it, it is not clear whether there is a stable fixed point of e^l or not. We prove that, in fact, all trajectories of e^l not starting at 1 do converge to a single fixed point e*, but only at a polynomial rate, in both the RRN and FRN cases (Section B.1.1 and Section B.1.2); we can even explicitly compute the fixed point and the rate of convergence: for the FRN, there is a unique stable fixed point e* < 1, determined by a fixed-point equation that involves the variances only through the ratio σ_a²/σ_v² (Section B.1.2), and e* − e^l decreases polynomially in l, with an exponent that likewise depends only on this ratio (Section B.1.2).
The case of RRN can be viewed as a special case of the above, setting σ_v = 1 and σ_a = 0, which yields the corresponding fixed point and exponent. We observe that both e* and the convergence exponent depend only on the ratio σ_a²/σ_v², so in Fig. 2 we graph these two quantities as a function of this ratio. Both increase with the ratio and asymptotically approach their limiting values (1 in the case of e*) from below. Thus the rate of convergence for tanh/FRN is at its slowest when the network asymptotically tends toward a chaotic regime with small e*, corresponding to a large weight variance and a small bias variance; it is at its fastest when the network asymptotically tends toward a stable regime with e* near 1, corresponding to a large bias variance and a small weight variance. We verify the predicted rate by comparing e* − e^l to the predicted power of l in log-log scale: if the prediction is correct, then e* − e^l should obtain the same slope as the predicted power law as l → ∞. The middle figure of Fig. 2 ascertains that this is indeed the case, starting around layer number 400.
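As a heuristic sketch of where such a fixed-point equation comes from (a reconstruction from the large-q limit, not the formal statement of Section B.1.2): since tanh(z) → sgn(z) as the variance of z grows, W_tanh(q, cq) → E[sgn(z_1) sgn(z_2)] = (2/π) arcsin(c), so comparing the per-layer increments of γ^l and p^l (taking the increment of γ^l to be σ_v² W_tanh(q^l, λ^l) + σ_a² and that of p^l to be σ_v² V_tanh(q^l) + σ_a²) suggests

    e* ≈ (σ_v² (2/π) arcsin(e*) + σ_a²) / (σ_v² + σ_a²),

which indeed depends on the variances only through the ratio σ_a²/σ_v².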

Figure 2: Left-to-right: (a) Plots of the fixed point e* and the convergence exponent against the ratio σ_a²/σ_v². (b) In log-log scale: the dashed line is the predicted power law, and the colored lines are e* − e^l for different initial conditions e^0. That they become parallel to the dashed line at about layer 400 verifies the predicted rate of convergence. (Endnote 4: A more natural visualization would be to graph the empirically simulated e* − e^l versus l, but because of floating point precision, this quantity doesn't converge to 0, only to a small number close to 0, so that the log-log plot wouldn't look like what is expected.) (c) In log-log scale: the dashed line is the asymptotic approximation of the gradient dynamics (given in Section B.1.2), and the colored lines are the corresponding exact dynamics. That they all converge together for large l indicates that the approximation in Section B.1.2 is very good for large l.
Backward dynamics.

Finally, we show that the gradient is approximated by

    χ^{l_0} ≈ χ^L exp(C(√L − √l_0)),    (1)

where the constant C depends on σ_w in the RRN case and on σ_w, σ_v, and σ_a in the FRN case (Section B.1.1 and Section B.1.2). The rightmost plot of Fig. 2 verifies that indeed, for large l, this is a very good approximation. This demonstrates that the mean field assumption of independent backpropagation weights is very practical and convenient even for residual networks.
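A heuristic sketch of where the square-root growth in Eq. (1) comes from, assuming (under the mean field axioms) that each backward step multiplies χ^l by 1 + σ_w² σ_v² V_{tanh'}(q^l) in the FRN case, where V_{tanh'} is the V transform applied to the derivative tanh':

    log(χ^{l_0} / χ^L) ≈ Σ_{l = l_0}^{L} log(1 + σ_w² σ_v² V_{tanh'}(q^l)) ≈ Σ_{l = l_0}^{L} C' / √l = Θ(√L − √l_0),

since V_{tanh'}(q) = Θ(1/√q) for large q and q^l = Θ(l) by the forward dynamics above.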

Note that in the FRN case, the constant C can be decomposed into a factor involving only σ_w and σ_v and a factor depending only on the ratio σ_a²/σ_v². Consider this ratio. If it is large, then e* ≈ 1 (Fig. C.17), meaning that the typical network essentially computes a constant function and is thus unexpressive; at the same time, a large ratio makes C small, thus ameliorating the gradient explosion problem and making the network more trainable. On the other hand, if the ratio is small, then e* ≈ 0 (Fig. C.17): the typical network can tease out the finest differences between any two input vectors, and a final linear layer on top of such a network should be able to express a wide variety of functions Poole et al. (2016); at the same time, a small ratio increases C, worsening the gradient explosion problem and making the network less trainable. This is the same expressivity-trainability tradeoff discussed in Schoenholz et al. (2017).

5.2 α-ReLU

Forward dynamics.

As with the tanh case, to deduce the asymptotic behavior of random α-ReLU resnets, we need to understand the transforms V_{ψ_α} and W_{ψ_α}. Fortunately, V_{ψ_α} has a closed form, and W_{ψ_α} has been studied before Cho and Saul (2009). In particular, if φ = ψ_α, then V_φ(q) = c_α q^α, where c_α is a constant with a closed form given by Section B.2. In addition, by Cho and Saul (2009), we know W_{ψ_α} in terms of the kernel function J_α given in Section C.7.1. Fig. C.17 shows a comparison of J_α for different αs along with the identity function.

Substituting V_{ψ_α} in for V_φ, we get a difference equation governing the evolution of p^l. This should be reminiscent of the differential equation dp/dl = c p^α, which has solution p = Θ(l^{1/(1−α)}) for α < 1, and p = Θ(exp(cl)) when α = 1. And indeed, the solutions to these difference equations behave asymptotically exactly like so (Section B.2.1). Thus ReLU behaves very explosively compared to α-ReLU with α < 1. In fact, in our simulations, ReLU resnets overflow into infs after around 100 layers, while there is no such problem for any other kind of network we consider.
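A minimal numerical sketch of this dichotomy, iterating a length recurrence of the form suggested by the central limit heuristic for the FRN (p^l = p^{l-1} + σ_v² c_α (σ_w² p^{l-1} + σ_b²)^α + σ_a², with c_α = E[max(z, 0)^{2α}] for z ~ N(0, 1) estimated by Monte Carlo; the variance values are arbitrary):

import numpy as np

def c_alpha(alpha, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of c_alpha = E[max(z, 0)^(2*alpha)] for z ~ N(0, 1)."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return float(np.mean(np.maximum(z, 0.0) ** (2 * alpha)))

def alpha_relu_lengths(alpha, depth=200, p0=1.0, sw2=1.0, sb2=1.0, sv2=1.0, sa2=1.0):
    """Iterate p^l = p^{l-1} + sv2 * c_alpha * (sw2 * p^{l-1} + sb2)^alpha + sa2."""
    c = c_alpha(alpha)
    p = [p0]
    for _ in range(depth):
        p.append(p[-1] + sv2 * c * (sw2 * p[-1] + sb2) ** alpha + sa2)
    return np.array(p)

print(alpha_relu_lengths(1.0, depth=50)[-1])   # alpha = 1 (ReLU): exponential growth
print(alpha_relu_lengths(0.5, depth=50)[-1])   # alpha = 1/2: roughly Theta(l^{1/(1-alpha)}) = Theta(l^2)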

Regardless, α-ReLU for every α massages e^l toward a fixed point e* that depends on α. When α = 1, the standard ReLU, e^l converges to 1 asymptotically at a polynomial rate, with an explicit constant depending only on the variance parameters (Section B.2.1). When α < 1, e^l converges to the nonunit fixed point of the kernel function J_α, at a polynomial rate whose exponent is independent of the variances (Section B.2.1). These rates are verified in Fig. A.2.

Backward dynamics.

Finally, we have also characterized the rate of gradient growth for any α > 1/2. (Endnote 5: Our derivations actually apply to all α > 1/2; at α = 1/2, the expected norm of the gradient diverges within our mean field formalism. However, for α close to 1/2, the variance of the gradient already diverges (Section B.2.1), so we cannot expect the empirical values to agree with our theoretical predictions. But in fact, empirically our theoretical predictions seem to form an upper bound on the gradient norms (see Fig. A.1).) In the case of α = 1, the dynamics of χ^l is exponential, the same as that of p^l. For α < 1, the dynamics is polynomial, but with a different exponent in general from that of the forward pass: the gradient grows polynomially in the depth, with an exponent and constants that do not depend on the variances. This exponent is minimized at an explicit value of α in the interior of the admissible interval (and at a different value if we restrict to α for which the gradient variance does not diverge); see Fig. B.8. These exponents are verified empirically in Fig. A.2.

Looking only at χ^l and the gradients with respect to the biases, it seems that ReLU suffers from a dramatic case of exploding gradients. But in fact, because χ^l gains a multiplicative factor with each step backwards while p^l loses the same factor, the gradient norm with respect to W^l (and similarly for V^l) is independent of how far, L − l, the gradient has been propagated (Section B.2.1) — this is certainly the best gradient preservation among all of the models considered in this paper. Thus, strangely, random ReLU FRN exhibits both the best (constant, for W and V) and the worst (exponential, for b and a) gradient dynamics. This begs the question, then: is this a better deal than other α-ReLU, for which for any learnable parameter we have at most a polynomial blowup with depth in its gradient? Our experiments (discussed below) show that α-ReLU is useful to the extent that smaller α avoids numerical issues with exponentiating forward and backward dynamics, but the best performance is given by the largest α that avoids them (Fig. 3(c, d)); in fact, the metric expressivity s, not gradient explosion, determines performance (see the α-ReLU experiments).
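A heuristic sketch of this cancellation under the gradient independence axiom, writing the weight gradient as a product of a backward factor and a forward factor:

    E[(∂E/∂W^l_{ij})²] = E[(∂E/∂h^l_i)²] · E[(x^{l-1}_j)²],

and for ReLU FRN both factors change by approximately the same multiplicative factor 1 + σ_w² σ_v² / 2 per layer (in opposite directions along the network, neglecting the bias variances), so their product stays essentially constant in l.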

6 Experimental Results

Figure 3: From left to right, top to bottom: (a) and (b): test set accuracy over a grid of σ_w² and depth L for tanh reduced (left) and full (right) resnets trained on MNIST. Color indicates performance, with lighter colors indicating higher accuracy on the test set. Other than the values on the axes, all variance parameters are fixed. The white dotted lines are level curves σ_w² L = const, with a different constant on the left and on the right. We see that both dotted lines accurately predict the largest optimal σ_w² for each depth L. (c) Varying the ratio σ_a²/σ_v² while fixing the leading constant of the log gradient ratio in Eq. (1). (d) In log-log scale: heatmap gives the test accuracies of ReLU FRN for varying weight variance and depth L. Curves give level sets of the corresponding log ratios. (e) Red heatmap shows the test accuracies of a grid of α-ReLU FRN with varying α and depth as shown, but with all σs fixed. The white dashed curve gives a typical contour line of the gradient ratio χ^0/χ^L. The yellow-to-blue curves form a set of level curves for the metric expressivity s^L, with yellow curves corresponding to higher levels.

Our experiments show a dichotomy of what matters in initialization: for tanh resnets, the quality of an initialization is determined by how much gradient explosion there is (measured by the gradient ratio χ^0/χ^L); for (α-)ReLU resnets, it is determined by how expressive the random network is (measured by the metric expressivity s^L). We hypothesize this is because in tanh resnets, the gradient dynamics is much more explosive than the expressivity dynamics (exponential in √L vs linear in L), whereas for ReLU it is somewhat the opposite (the weight gradients are constant in depth while the expressivity is exponential).

Tanh, vary σ_w².

We train a grid of reduced and full tanh resnets on MNIST, varying the weight variance σ_w² and the number of layers L (for FRN we fix σ_v, σ_a, and σ_b). The results are indicated in Fig. 3(a, b). We see that in either model, deeper resnets favor much smaller σ_w² than shallower ones. The white dotted lines in Fig. 3(a, b) confirm our theory: according to Eq. (1), to keep the same gradient ratio χ^0/χ^L, we want σ_w² L = const. Indeed, the white dotted lines in Fig. 3(a, b) trace out such a level curve, and they remarkably pinpoint the largest σ_w² that gives the optimal test set accuracy for each depth L. Why isn't the best initialization given by an even smaller σ_w²? We believe that when σ_w² and/or L is small, gradient dynamics no longer dominates the initialization quality because it has “less room to explode,” and expressivity issues start to dampen the test time performance.

Tanh, vary σ_a²/σ_v².

As suggested in the analysis of Eq. (1), the ratio σ_a²/σ_v² determines the fixed point e* and its convergence rate by itself, while it also contributes to the rate of gradient explosion in tanh FRN. We seek to isolate its effect on forward dynamics by varying σ_a² and σ_v² together such that the constant C in Eq. (1) is kept fixed, so that the leading term of the log gradient ratio is kept approximately equal for each depth L and each ratio. Fig. 3(c) shows the test accuracies of a grid of tanh FRN initialized with such an ensemble of σs. What stands out the most is that performance is maximized essentially around a fixed value of L regardless of the ratio, which shows that indeed gradient dynamics determines the initialization quality in tanh resnets. There is also a minor increase in performance with increasing σ_a²/σ_v² regardless of L; this is counterintuitive, as increasing σ_a²/σ_v² means “decreasing expressivity.” It is currently not clear what accounts for this effect.

ReLU, vary weight variance.

We train a grid of ReLU FRN on MNIST, varying the weight variance and the depth L while fixing the other variance parameters. The resulting test set accuracies are shown in Fig. 3(d). The dark upper region signifies failure of training caused by numerical issues with exploding activation and gradient norms: this corresponds to the region where p^L, which is a measure of the mean squared magnitude of a neuronal activation in layer L, becomes too big. We see that the best test accuracies are given by depths just below where these numerical issues occur. However, if we were to predict that the optimal init is the one minimizing gradient explosion, then we would be wrong — in fact it is exactly the opposite. In this case, the dynamics of p^l, s^l, and χ^l are approximately the same (all exponential with the same hidden constants), and optimal performance corresponds to the highest p^L, s^L, and gradient ratio achievable without running into infs.

α-ReLU, vary α.

We similarly trained a grid of α-ReLU FRN on MNIST, varying only α and the depth, fixing all σs. Fig. 3(e) shows their test accuracies. We see similar behavior to ReLU: when the net is too deep, numerical issues doom the training (black upper right corner), but the best performance is given by the configurations just below where this problem occurs. In this case, if we were to predict optimality based on minimizing gradient explosion, we would again be wrong, and furthermore, the contour plot of the gradient ratio (white dashed line) now gives no information at all about the test set accuracy. In contrast, the contours for the metric expressivity s^L succeed remarkably well at this prediction (yellow/green lines). (Endnote 6: The contours for p^L are similar, but their slopes are slightly off from the heatmap contours.) By interpolation, this suggests that indeed in the ReLU case, it is expressivity, not trainability, which determines performance at test time.

In all of our experiments, we did not find the dynamics of the angular expressivity e^l to be predictive of neural network performance.

7 Conclusion

In this paper, we have extended the mean field formalism developed by Poole et al. (2016); Raghu et al. (2016); Schoenholz et al. (2017) to residual networks, a class of models closer to practice than the classical feedforward neural networks investigated earlier. We proved and verified that in both the forward and backward passes, most of the residual networks discussed here do not collapse the input space geometry or the gradient information exponentially. We found our theory remarkably predictive of test time performance despite saying nothing about the dynamics of training. In addition, we overwhelmingly find, through theory and experiments, that an optimal initialization scheme must take into account the depth of the residual network. The reason that the Xavier Glorot and Bengio (2010) or He He et al. (2015) schemes are not the best for residual networks is in fact not that their statistical assumptions are fragile — theirs are similar to our mean field theoretic assumptions, and they hold up in experiments for large width — but rather that their structural assumptions on the network break very badly on residual nets.

Open Problems.

Our work has thus shown that the optimality of initialization schemes can be very unstable with respect to architecture. We hope this work will form a foundation toward a mathematically grounded initialization scheme for state-of-the-art architectures like the original He et al. residual network. To do so, there are still two major components left to study out of the following three: (1) residual/skip connections, (2) batchnorm, and (3) convolutional layers. Recurrent architectures and attention mechanisms are also still mostly unexplored in terms of mean field theory. Furthermore, many theoretical questions have yet to be resolved; the most important with regard to mean field theory is: why can we make Axioms 1 and 2 and still obtain accurate predictions? We hope to make progress on these problems in the future and encourage readers to take part in this effort.

Acknowledgments

Thanks to Jeffrey Ling for early exploration experiments and help with the initial draft. Thanks to Felix Wong for offering his wisdom and experience working in statistical physics.

References


Appendix A Additional Figures

In figures appearing in the appendix, some notation differs from that of the main text (due to legacy reasons).

Figure A.1: Empirical vs theoretical dynamics of p^l, e^l, and different gradient quantities for α-ReLU, with format similar to Fig. 1. We refer to each figure on each row from left to right as (a), (b), and (c). Note that in the α = 1 case, figure (a) (p^l and q^l for different initial values) has a log scale y-axis and (a) and (b) have x-axes ranging from 1 to 50, while for other α, (a) has a normal y-axis and (a) and (b) have x-axes ranging from 1 to 200. We do so because the norm of the activation vector in a typical ReLU resnet blows up into NaNs at around layer 90, while this is not a problem for smaller α. Our theoretical predictions track the average of empirical values closely for the forward quantities and for all α, though the empirical variance is extremely large at α = 1; they also predict the average gradient norm accurately for α closer to 1 (despite the fact that we should not expect this for smaller α due to exploding variance (Section B.2.1)), although the variance is large at earlier layers (i.e. later layers w.r.t. backpropagation). However, they consistently and significantly overestimate the average gradient norm for α closest to 1/2, where the variance is so large that one standard deviation below the mean results in negative values. All plots are made with the same variance parameters; only α is varied. All figures exhibit smooth curves, which are theoretical estimates, and irregular curves with shades around them, which indicate empirical means and standard deviations (both taken in regular scale, not log scale). For each α, figures (a) and (b) are made with 20 runs of resnets of width 1000. (c) is made with 25 runs of resnets of width 250.
Figure A.2: We verify the exponents of the forward and backward dynamics for α-ReLU FRN. For each row, the figures are labeled (a) and (b) from left to right. The format is the same as in Fig. C.17. All figures are in log-log scale. (a) We exhibit our theoretical dynamics of the cosine distance, based on the recurrences of Section B.1.2, for different initial conditions e^0, drawing e* − e^l for each of these dynamics in colored solid lines. We predict that each dynamic follows a power law in l, and the dashed line gives the predicted power law (Section B.2.1), shifted vertically to better compare the slope in log scale (i.e. the exponent of the polynomial dynamics). (See footnote 4 for why we plot the dynamics this way.) We see that our asymptotic prediction is very accurate for the sequence whose initial condition is closest to the fixed point for each α, while the other lines only slowly converge to the same exponent (which is the slope in the log-log plot). This is to be expected based on the proof of Section B.2.1. For the largest α shown, the line upticks at late layers and then turns into NaNs due to numerical instability. (b) Colored lines are the gradient quantities χ^l (we are not taking logs in addition to plotting in log-log scale like in Fig. C.15). The dashed lines are our asymptotic predictions for the dynamics with corresponding colors, based on Section B.2.1, again shifted appropriately to easily compare slopes visually. We see that for every α our asymptotic predictions are highly accurate. For both (a) and (b), we did not show the α = 1 case, as ReLU FRN runs into numerical issues quickly (i.e. even within 100 layers) because of the exponential explosions in p^l and χ^l predicted by Sections B.2.1 and B.2.1, so we cannot expect to empirically verify the precise predicted asymptotics. All plots are made with the same variance parameters; only α is varied.
Symbol | Meaning | Ref
σ_w, σ_b, σ_v, σ_a | standard deviations of the trainable parameters W, b, V, a |
x^l | activation vector (x^0 is the input vector) |
h^l | hidden vector |
N | width (same across all layers) |
p^l | m.n. squared length of activation vector | 3
q^l | m.n. squared length of hidden vector | 3
γ^l | m.n. dot product of the activation vectors of two inputs | 3
λ^l | m.n. dot product of the hidden vectors of two inputs | 3
s^l | m.n. squared distance | 3
e^l | cosine distance (activations) | 3
e* | limit value of e^l as l → ∞ |
c^l | cosine distance (hidden vectors) | 3
χ^l | m.n. gradient squared norm w.r.t. x^l | 3
 | m.n. gradient squared norm w.r.t. a trainable parameter | 3
φ | variable nonlinearity |
ψ_α | α-ReLU | 5
V_φ | variance integral transform | 5
W_φ | covariance integral transform | 5
 | exponent of the rate at which e^l converges in tanh FRN | B.1.2
 | leading coeff of e* − e^l in tanh FRN | B.1.2
c_α | constant such that V_{ψ_α}(q) = c_α q^α | B.2.1
J_α | kernel function of α-ReLU | C.7.1
Table A.1: Glossary of Symbols. “Mean normalized” is abbreviated “m.n.”

Appendix B A Listing of Main Theorems

B.1 Tanh

B.1.1 Reduced Residual Network

Lemma (pqrecurrence). Suppose φ is antisymmetric. Then in an RRN, p^l and q^l satisfy a recurrence expressible in terms of the transform V_φ.
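For orientation, a heuristic version of the kind of recurrence this lemma asserts (a sketch from the central limit argument of Section 4, not the verbatim statement): conditioning on x^{l-1}, each h^l_i is approximately N(0, σ_w² p^{l-1} + σ_b²), and the cross term E⟨x^{l-1}, φ(h^l)⟩ vanishes by the antisymmetry of φ (and the symmetry of the conditionally Gaussian h^l), so

    q^l ≈ σ_w² p^{l-1} + σ_b²,    p^l ≈ p^{l-1} + V_φ(q^l).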

Theorem (pqlinear). Suppose φ is tanh-like. Assume the RRN architecture.

  • If , then and .

  • If , and . If , then we can obtain more terms of the asymptotic expansions:

    as , where .

Theorem (lambdagammarecurrence). Suppose φ is antisymmetric. Then in an RRN, λ^l and γ^l satisfy a recurrence expressible in terms of the transform W_φ.

Theorem (edynamics). Suppose φ is a tanh-like nonlinearity in an RRN.

  • If , then and , so that and . As a result,

  • If , then