Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function


Wojciech Tarnowski wojciech.tarnowski@uj.edu.pl M. Smoluchowski Institute of Physics, Jagiellonian University, PL–30–348 Kraków, Poland    Piotr Warchoł piotr.warchol@uj.edu.pl M. Smoluchowski Institute of Physics, Jagiellonian University, PL–30–348 Kraków, Poland    Stanisław Jastrzębski stanislaw.jastrzebski@uj.edu.pl Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland    Jacek Tabor jcktbr@gmail.com Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland    Maciej A. Nowak maciej.a.nowak@uj.edu.pl M. Smoluchowski Institute of Physics and Mark Kac Complex Systems Research Center, Jagiellonian University, PL–30–348 Kraków, Poland
July 30, 2019
Abstract

We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the limit of large network width and depth. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions by analyzing the signal propagation in the artificial neural network. We corroborate our results with numerical simulations of both random matrices and ResNets applied to the CIFAR-10 classification problem. Moreover, we study the consequences of this universal behavior for the initial and late phases of the learning process. We conclude by drawing attention to the simple fact that initialization acts as a confounding factor between the choice of activation function and the rate of learning. We propose that in ResNets this can be resolved, based on our results, by ensuring the same level of dynamical isometry at initialization.

pacs:
02.10.Yn

Deep Learning has achieved unparalleled success in fields such as object detection and recognition, language translation and speech recognition LeCBH (). At the same time, models achieving these state-of-the-art results are increasingly deep and complex CCP (), which often leads to optimization challenges such as vanishing gradients. Many solutions to this problem have been proposed. In particular, Residual Neural Networks remedy this problem to some extent HZRS (); VWB () by using skip connections in the network architecture, which improve gradient flow. As a result, Residual Neural Networks outmatched the competing models in the 2015 ILSVRC and COCO competitions. Yet another approach to this problem is to tailor the network's weight initialization to facilitate training, for example by ensuring dynamical isometry PSG1 (). In this latter case, the insights are based on an analysis of the statistical properties of information propagation in the network and a study of the full singular spectrum of a particular matrix, namely the input-output Jacobian, via the techniques of Free Probability and Random Matrix Theories (FPT & RMT). This perspective has recently led to the successful training of a 10000-layer vanilla convolutional neural network XBSSP ().

RMT is a versatile tool that, since its inception, has seen a substantial share of applications, from the earliest in nuclear physics Wigner () to the latest in game theory DysonGame () (see Oxford () for some of the use cases discovered in the meantime). It is thus not surprising that it has found its way into the study of artificial neural networks, in particular of their loss surface CHMAL (); PB (), of the associated Gram matrix LLC (); PW () and, in the case of single-layer networks, of their learning dynamics LC (). Our main contribution is extending the theoretical analysis of PSG1 (); SGGS (); PSG2 () to residual networks. In particular, we find that, in contrast to feed-forward networks, residual networks can attain dynamical isometry for many different activation functions. These theoretical results are supported by an empirical investigation on the popular CIFAR-10 benchmark.

.1 Related work

For a short discussion of relevant results in pure RMT and FPT we refer the reader to subsection II.1. Here, we focus on particular developments related to our work on neural networks. The framework of dynamical mean-field theory, which we will apply to study signal propagation in neural networks, was first used in this context in PLRSG (). There, the authors showed the existence of an order-to-chaos expressivity transition for deep feed-forward neural networks with random initial weights, on the plane spanned by the variances of the network weights and biases. This in turn led to the insight of SGGS () that arbitrarily deep networks can be trained as long as they are close to the criticality associated with that transition. The techniques developed in these works, together with the methods of FPT and RMT, made it possible, for the first time, to analytically compute the singular value distribution of the input-output Jacobian of a deep feed-forward network with a nonlinear activation function at criticality PSG1 (). Finally, the authors of PSG2 () showed that for feed-forward neural networks, in the large depth limit and at a special point of the above-mentioned critical line, the singular spectrum of the Jacobian is given by a universal distribution depending on the form of the activation function used. In particular, they distinguish the Bernoulli and the smooth universality classes, corresponding to piecewise linear and some non-linear activation functions, respectively. In this paper, we take the approach of that last work and apply it to fully connected residual neural networks. We find a single universality class for this architecture.

Let us also mention some recent, important developments in the area of residual neural network initialization. One of the earlier developments is the introduction of layer-sequential unit-variance (LSUV) initialization in MM (). This two-step procedure involves normalizing the outputs of the neurons on the first forward run and showed promising results. In another, very relevant paper Taki1 (), Masato Taki, analyzing the signal propagation in a manner similar to that mentioned in the paragraph above, shows for ResNets with piecewise linear, symmetric as well as ReLU activation functions that the proper variance for network weight initialization is of order 1/(NL), where L is the number of layers and N the number of neurons in each layer. We corroborate this result with our analysis. Finally, we want to highlight HR (), which identifies two failure modes for successful early training, influenced by the initialization and the architecture of networks with ReLU activation functions. The authors suggest that appropriately rescaling the residual modules of the network allows for efficient training of deep networks. This conclusion is in agreement with an earlier, similar result in BFLLMM (). Notably, there, the authors demonstrate that the gradients of feed-forward rectifier networks become white-noise-like at large depth, and show that this problem is less prevalent in networks containing skip connections.

While finishing this manuscript, we learned of a recent paper tackling the same problem of ResNet initialization by studying the singular spectrum properties of the Jacobian with the tools of Free Probability. While the analysis in LQ () is related, it is crucial to note that the authors do not observe the universal character of the singular spectrum - the main result of our paper - and treat only piecewise linear activation functions. It is also worth mentioning that, similarly to us, they rediscover the importance of the rescaling of Balduzzi et al. BFLLMM () and Taki Taki1 ().

.2 Our results

Our contributions are the following. We show that the singular spectrum of the input-output Jacobian, in the network's large width and depth limit, is given by a universal formula, with the dependence on the type of activation function encapsulated in a single parameter. Furthermore, we calculate the layer-dependent statistical properties of the pre-activations for a variety of activation functions. Altogether, this gives the associated singular spectra of the Jacobian, which we compare with random matrix and artificial neural network numerical simulations corroborating our theoretical results. Even though the final results of the theoretical calculations are derived in the limit of large width and depth, the numerical experiments match them already for networks of moderate size. As a practical application of our work and the universality property it uncovers, we propose a framework for setting up weight initialization in experiments with residual neural networks.

.3 Structure of the paper

We follow this introductory section by defining the ResNet model we work with, together with a short note on the relevance of the input-output Jacobian. Then, in subsection II.1, we derive the equation governing the Green's function, and hence the spectrum of the Jacobian, which depends on a single parameter, the effective cumulant. We then analyze the propagation of information in the network via the probability density function describing the pre-activations across the layers at network initialization. This allows us to calculate the effective cumulant for many different activation functions in subsection II.3. We close the second section of the paper by presenting the random matrix experiments confirming our results. Subsection III.2 is devoted to the outcomes of the associated residual neural network numerical calculations. There, we showcase the resulting, experimental, universal spectrum of the Jacobian and the outcomes of the learning processes. This is followed by a short comment, in Section IV, on the influence of batch normalization on the presented setup. We close the paper with a short discussion section. Finally, in the appendices, we show the results of numerical experiments validating the signal propagation recurrence relations and some baseline simulations of the learning process (based on using the same weight matrix variances, irrespective of the choice of activation function).

I The model

In this paper, we consider a deep residual network of L layers, each of a constant width of N neurons. We follow the typical nomenclature of the literature: the real-valued synaptic (weight) matrix of the l-th layer is denoted by W^l, whereas the real-valued bias vectors are b^l. The information propagates in this network according to:

h^l = a\, h^{l-1} + W^l \phi\!\left(h^{l-1}\right) + b^l, \qquad x^l = \phi\!\left(h^l\right)     (1)

where h^l and x^l are pre- and post-activations respectively, and φ is the activation function itself, acting entry-wise on the vector of pre-activations. We have introduced the parameter a to track the influence of skip connections in the calculations; however, we do not study its influence on the Jacobian's spectrum or on learning in general. By h^0 we denote the input of the network and by h^L its output. Our primary interest will lie in exploring the singular value spectral properties of the input-output Jacobian:

J = \frac{\partial h^L}{\partial h^0}     (2)

known to be useful in studying initialization schemes of neural networks at least since the publication of GB (). In particular, it holds the information on the severity of the exploding gradients problem, as we explain at the beginning of the next section.
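
As a concrete, non-authoritative illustration, the short sketch below propagates a random input through such a residual stack. It assumes the recursion written in (1) and Gaussian weights of variance σ_w²/(NL); the helper names and parameter values are ours, not the paper's code.

```python
import numpy as np

def forward(h0, L=100, a=1.0, sigma_w=1.0, sigma_b=0.0, phi=np.tanh, seed=0):
    """Propagate an input through a toy fully connected residual stack.

    Assumed recursion (our reading of Eq. (1)):
        h^l = a * h^{l-1} + W^l phi(h^{l-1}) + b^l,
    with Var(W_ij) = sigma_w^2 / (N * L), i.e. including the extra 1/L
    rescaling advocated for residual networks.
    """
    rng = np.random.default_rng(seed)
    N = h0.size
    h = h0.copy()
    pre_activations = [h.copy()]
    for _ in range(L):
        W = rng.normal(0.0, sigma_w / np.sqrt(N * L), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N) if sigma_b > 0 else np.zeros(N)
        h = a * h + W @ phi(h) + b
        pre_activations.append(h.copy())
    return pre_activations

if __name__ == "__main__":
    h0 = np.random.default_rng(1).normal(size=400)
    acts = forward(h0)
    # With the 1/L rescaling the variance of the pre-activations grows mildly.
    print([round(float(np.var(x)), 3) for x in acts[::25]])
```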

i.1 Relevance of the input-output Jacobian

To understand why we are interested in the Jacobian, consider the neural network adjusting its weights during the learning process. In a simplified, example-by-example scenario, this happens according to

W^l_{ij} \;\to\; W^l_{ij} - \eta\, \frac{\partial E}{\partial W^l_{ij}}     (3)

where E is the error function depending on h^L - the output of the network - on the correct output value associated with that example and, implicitly through h^L, on the parameters of the model, namely the weights and biases; η is the learning rate. Applying the chain rule we can rewrite this as:

\frac{\partial E}{\partial W^l_{ij}} = \frac{\partial E}{\partial h^L}\, \frac{\partial h^L}{\partial h^l}\, \frac{\partial h^l}{\partial W^l_{ij}}     (4)

For the learning process to be stable, all three terms need to be bounded. Out of those, the middle one can become problematic if a poor choice of the initialization scheme is made. We can rewrite it as:

\frac{\partial h^L}{\partial h^l} = \prod_{k=l+1}^{L}\left( a\,\mathbf{1} + W^k D^{k-1} \right)     (5)

and see that the larger the difference between L and l, the more terms we have in the product, and (in general) the less control there is over its behavior. Here 1 is the identity matrix and D^l is a diagonal matrix such that D^l_{ii} = φ'(h^l_i). Indeed, it was first proposed in GB () that learning in deep feed-forward neural networks can be improved by keeping the mean singular value of the Jacobian associated with each layer (in our setup a 1 + W^l D^{l-1}) close to 1 for all l's. It is also important for the dynamics of learning to be driven by data, not by the random initialization of the network. The latter may take place if the Jacobian up to the l-th layer possesses very large singular values which dominate the learning, or very small singular values suppressing it. In the optimal case all singular values should be concentrated around 1, regardless of how deep the considered layer is. One therefore examines the case of l = 0, namely J, the input-output Jacobian, as the most extreme object of (5). The feature that in the limit of large depth all singular values of J concentrate around 1, irrespective of the depth of the network, was coined dynamical isometry in Saxe ().
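
To make the role of these products tangible, the sketch below accumulates the per-layer factors a·1 + W^l D^{l-1} along a random residual stack and inspects the singular values of the resulting layer-to-output Jacobian; the 1/(NL) weight scaling and the function names are our assumptions, not the paper's code.

```python
import numpy as np

def jacobian_singular_values(N=300, L=100, a=1.0, sigma_w=1.0, seed=0,
                             phi=np.tanh, dphi=lambda h: 1.0 - np.tanh(h) ** 2):
    """Singular values of J = prod_l (a*1 + W^l D^{l-1}) for a random stack.

    D^{l-1} is diagonal with the derivative of the activation evaluated at the
    previous pre-activations; biases are omitted for brevity."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=N)                 # random input
    J = np.eye(N)
    for _ in range(L):
        W = rng.normal(0.0, sigma_w / np.sqrt(N * L), size=(N, N))
        D = np.diag(dphi(h))
        J = (a * np.eye(N) + W @ D) @ J    # chain rule, layer by layer
        h = a * h + W @ phi(h)
    return np.linalg.svd(J, compute_uv=False)

s = jacobian_singular_values()
print(f"singular values in [{s.min():.3f}, {s.max():.3f}]")
# Replacing np.sqrt(N * L) by np.sqrt(N) above (i.e. dropping the 1/L factor)
# makes the spread of the spectrum grow quickly with the depth L.
```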

Note that the spectral problem for the full Jacobian

J = \prod_{l=1}^{L}\left( a\,\mathbf{1} + W^l D^{l-1} \right)     (6)

belongs to the class of matrix-valued diffusion processes GJJN (); JW (), leading to a complex spectrum. We note that one of us derived the large-N eigenvalue spectral density of a generic model of the form (6), for different symmetry classes of the weight matrices, already in GJJN (). Due to the non-normality of the Jacobian, its singular values cannot be easily related to its eigenvalues. Therefore we follow PSG1 (); PSG2 () and tackle the full singular spectrum of the Jacobian (or, equivalently, the eigenvalue spectrum of J J^T), extending these works to the case of the residual neural network model.

Ii Spectral properties of the Jacobian

ii.1 Spectral analysis

Free Probability Theory, or Free Random Variable (FRV) Theory VOICULESCU (), is a powerful tool for the spectral analysis of random matrices in the limit of their large size. It is a counterpart of classical Probability Theory for the case of non-commuting observables. For a pedagogical introduction to the subject, see SpeicherMingo (); here we only lay out the basics used in the derivations of this subsection. The fundamental objects of the theory are the Green's functions (a.k.a. Stieltjes transforms in the mathematical literature):

G_X(z) = \frac{1}{N}\left\langle \mathrm{Tr}\,\left( z\,\mathbf{1}_N - X \right)^{-1} \right\rangle     (7)

which generate spectral moments. The eigenvalue density can be recovered via the Sochocki-Plemelj formula

\rho_X(\lambda) = -\frac{1}{\pi}\,\lim_{\epsilon\to 0^{+}} \mathrm{Im}\, G_X(\lambda + i\epsilon)     (8)

The associated free cumulants are generated by the so-called R-transform, which plays the role of the logarithm of the characteristic function in classical probability. By this correspondence, the R-transform is additive, i.e. R_{X+Y}(z) = R_X(z) + R_Y(z) for mutually free, but non-commuting, random ensembles X and Y. Moreover, it is related to the Green's function via the functional equations

G_X\!\left( R_X(z) + \frac{1}{z} \right) = z, \qquad R_X\!\left( G_X(z) \right) + \frac{1}{G_X(z)} = z     (9)

On the other hand, the so-called S-transform facilitates calculations of the spectra of products of random matrices, as it satisfies S_{XY}(z) = S_X(z) S_Y(z) for free X and Y, provided X or Y is positive definite. If, additionally, the ensemble has a finite mean, the S-transform can be easily obtained from the Green's function, and vice versa, through a pair of mutually inverse maps, the moment generating function M_X and the N-transform N_X BJN (). Explicitly:

M_X(z) = z\,G_X(z) - 1, \qquad M_X\!\left( N_X(z) \right) = z, \qquad S_X(z) = \frac{1+z}{z\, N_X(z)}     (10)
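
As a self-contained illustration of these maps (our worked example, not part of the original text), consider a square Wishart matrix X = W W^T / N with iid Gaussian entries of W of variance σ². With the definitions above one finds

```latex
% Worked example (our addition): square Wishart matrix X = W W^T / N,
% entries of W iid Gaussian with variance \sigma^2.
N_X(z) = \sigma^2\,\frac{(1+z)^2}{z},
\qquad
S_X(z) = \frac{1+z}{z\,N_X(z)} = \frac{1}{\sigma^2\,(1+z)} .
```

Since the S-transform is multiplicative, a product of k free copies of such a matrix has S-transform 1/(σ²(1+z))^k, which is precisely the mechanism exploited below for the layered Jacobian.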

Denoting by J_l the Jacobian from the input up to layer l, one can write the recursion relation J_l = (a 1 + W^l D^{l-1}) J_{l-1}. The matrix J_l J_l^T is isospectral to (a 1 + W^l D^{l-1})^T (a 1 + W^l D^{l-1}) J_{l-1} J_{l-1}^T, which leads to the multiplicative equation for the S-transform, S_{J_l J_l^T}(z) = S_{(a 1 + W^l D^{l-1})^T (a 1 + W^l D^{l-1})}(z) S_{J_{l-1} J_{l-1}^T}(z). Proceeding inductively, we arrive at

S_{J J^{T}}(z) = \prod_{l=1}^{L} S_{\left(a\,\mathbf{1} + W^l D^{l-1}\right)^{T}\left(a\,\mathbf{1} + W^l D^{l-1}\right)}(z)     (11)

The proof can be found, for example, in BNS (). To find the S-transform of the single-layer Jacobian, we will first consider its Green's function

(12)

with the averaging performed over the ensemble of weight matrices W^l. To facilitate this study, in particular to cope with the non-Hermitian structure of the single-layer Jacobian, one linearizes the problem by introducing matrices of size 2N × 2N

(13)

with N × N blocks. Another crucial ingredient is the block trace operation (bTr), which is the trace applied to each block. The generalized Green's function is defined as the block trace of the generalized resolvent

(14)

Remarkably, the Green's function we are after can be read off from one of the entries of the generalized Green's function. This construction is a slight modification of the quaternionization approach to large non-Hermitian matrices developed in JNPZ (); QUAT (); we therefore adapt these concepts here for calculations in the large-width limit of the network. Furthermore, the generalized Green's function (14) is given implicitly by the solution of the Schwinger-Dyson equation

(15)

Here the kernel is the generalized R-transform of FRV theory. This construction is a generalization of the standard FRV tools to matrix-valued functions, dubbed freeness with amalgamation AMAL (). In particular, (15) is such a generalization of (9) to matrices.
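
A minimal numerical sketch of the block trace operation used above (our helper, assuming the 2N × 2N matrix is split into four N × N blocks):

```python
import numpy as np

def btr(A, N):
    """Block trace: apply the trace to each N x N block of a 2N x 2N matrix,
    returning the resulting 2 x 2 matrix."""
    return np.array([[np.trace(A[i * N:(i + 1) * N, j * N:(j + 1) * N])
                      for j in range(2)] for i in range(2)])

# Quick check on a block-diagonal example.
N = 3
A = np.block([[2.0 * np.eye(N), np.zeros((N, N))],
              [np.zeros((N, N)), 5.0 * np.eye(N)]])
print(btr(A, N))   # [[6. 0.] [0. 15.]]
```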

To study two common weight initializations, Gaussian and scaled orthogonal, on the same footing, we assume that W^l belongs to the class of biunitarily invariant random matrices, i.e. its probability density function is invariant under left and right multiplication by orthogonal matrices. In the large-N limit such matrices are known in free probability as R-diagonal operators SPEICHER (). A product of an R-diagonal operator with an arbitrary operator remains R-diagonal NS (), thus the matrix W^l D^{l-1} is R-diagonal too.

The generalized R-transform of R-diagonal operators takes a remarkably simple form NowTarn ()

(16)

Here, the determining sequence generates the relevant cumulants. For later use we mention a simple relation for the determining sequence of a rescaled matrix, which generalizes the familiar scaling of cumulants in the Hermitian case.

We derive the equation for the Green's function by substituting the generalized R-transform (16) into (15) and eliminating the irrelevant variables. It thus reads:

(17)

In the next step, a suitable change of variables together with (9) gives

(18)

Then, a further substitution and the use of (10) lead us to

(19)

This equation is exact. To incorporate in our considerations the additional scaling of the weight variances by 1/L, as proposed in Taki1 (); BFLLMM (), we rescale the weights accordingly and, since we are interested in deep networks, we keep only the leading term in 1/L (see also JW () for similar calculations in unitary matrix models). This simplifies (19) to a quadratic equation for the Green's function. Choosing the appropriate branch of the solution, we see that

(20)

Here

(21)

is the squared spectral radius of the matrix W^l D^{l-1}. In general, it can vary across the depth of the network due to the non-constant variance of the pre-activations. Assuming that this variability is bounded, we can consider the logarithm of (11) and write:

(22)

where in the last equality we defined the effective cumulant as the average of the single-layer contributions across the depth of the network. This allows us to deduce the form of the S-transform of J J^T, assuming that the effective cumulant does not scale with L

(23)

Substituting this form and using the relation (10), we thus obtain

(24)

Then, returning from the N-transform to the Green's function, we finally get

(25)

an equation for the Green's function characterizing the squared singular values of the Jacobian, which can be solved numerically. We do that for a range of different activation functions and present the results, together with corroborating numerical simulations, in subsection III.1. The solution of the above equation can be expressed in terms of the Lambert W function, whose properties are well documented LAMBERT ().

One of the measures of the spread of the spectrum, describing the departure from isometry, is the ratio between the largest and the smallest singular value of J. We shall not provide the detailed derivation here and just present the final, large-depth formula:

(26)

We close this section with a remark that the above analysis is not restricted only to the model (1), but analogous reasoning can be performed for networks in which skip connections bypass more than one fully connected block.

ii.2 Signal propagation

The formulas we have derived until now were given in terms of a single parameter, the effective cumulant, which involves the squared derivative of the activation function averaged within each layer and across the depth of the network. Thus, we now need to address the behavior of the pre-activations. In the following paragraphs, we closely follow a similar derivation done in SGGS () for fully connected feed-forward networks.

For the simplicity of our arguments, we consider here the entries of the weight matrices and of the bias vectors as independent identically distributed (iid) Gaussian random variables with zero mean and variances σ_w²/(NL) and σ_b², respectively. Here, σ_w² is of order one, and the additional 1/L scaling of the weight variance is meant to reflect the rescaling introduced in the previous paragraphs. At the end of this section we provide an argument that the same results hold for scaled orthogonal matrices.

In this subsection, we will denote the averaging over the weights and biases at a given layer by angular brackets, and we will additionally use the sample average of a quantity over the N neurons of a given layer. Note that the width N is independent of the layer number; however, the derivation can easily be generalized to the opposite case, when the architecture is more complicated. Unless explicitly stated otherwise, all integrals are calculated over the real line.

We are interested in the distribution of the pre-activations in our model, depending on the input vector and on the probability distributions of the weights and biases. If we assume the pre-activations are normal (as can be argued using the Central Limit Theorem), we just need their first two moments. It is clear that their first moment vanishes. Furthermore, we assume ergodicity, i.e. that averaging some quantity over a layer of neurons is equivalent to averaging this quantity for one neuron over an ensemble of neural networks with random initializations. We assume this is true for the pre-activations, the post-activations, and their squares. Thus, as we work in the limit of wide networks, the sample average over a layer can be replaced with an average over a normal distribution, whose variance at layer l we denote by q^l. This is the crux of the dynamical mean-field theory approach developed in PLRSG () for feed-forward neural networks. We have in particular:

(27)

where the average on the right-hand side is taken with respect to the Gaussian measure of variance q^l. Thus we obtain:

(28)

Therefore, to calculate the effective cumulant, we need to know how the variance of the distribution of the preactivations changes as the input information propagates across consecutive layers of the network.
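
The per-layer ingredient entering the effective cumulant is a Gaussian average of the squared derivative of the activation function. A small numerical sketch of such an average (the function name and the use of Gauss-Hermite quadrature are our choices) could look as follows:

```python
import numpy as np

def gauss_avg(f, q, n=80):
    """Average of f(h) over h ~ N(0, q), via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite_e.hermegauss(n)   # weight exp(-x^2/2)
    return np.sum(w * f(np.sqrt(q) * x)) / np.sqrt(2.0 * np.pi)

# Squared derivatives phi'(h)^2 for a few activations.
dphi2 = {
    "linear": lambda h: np.ones_like(h),
    "relu":   lambda h: (h > 0).astype(float),   # theta(h)^2 = theta(h)
    "tanh":   lambda h: (1.0 - np.tanh(h) ** 2) ** 2,
}

q = 1.5   # example pre-activation variance
for name, f in dphi2.items():
    print(name, round(float(gauss_avg(f, q)), 4))   # linear -> 1.0, relu -> 0.5
```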

With the scalings from subsection II.1 made explicit, the variance of each weight matrix entry equals σ_w²/(NL). Now, based on (1) and the above considerations, we have

(29)

and

(30)

Let us assume that the relevant averages factorize in the large-N limit. This is justified, as the input to any given neuron comes from the many neurons of the preceding layer. We can rewrite (30) as

(31)

Turning to the remaining average, based on (1), we have:

(32)

Iterating this recurrence yields

(33)

Thus, (31), with a shift of the layer index, turns into

(34)

Finally, we use (29) to obtain

(35)

which is a closed recursive equation for the variance q^l. We note that for a = 0, the known feed-forward network recursion relation is recovered. Furthermore, in the purely residual case, in contrast to the feed-forward architecture, the biases do not influence the statistical properties of the pre-activations. Moreover, for ResNets this recursive relation is iteratively additive, namely each q^{l+1} is the result of adding some terms to the previous q^l. In all the examples studied below, the first term is positive and the second term is non-negative. This in turn means that the variance of the pre-activations grows with the network's depth and there are no non-trivial fixed points of this recursion equation. Finally, here we can see the importance of the 1/L scaling of the weight variance, without which q^l would grow uncontrollably with the depth.
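
For illustration only, the loop below iterates an additive update of the simplified form q^{l+1} = q^l + (σ_w²/L)·⟨φ(h)²⟩ with h ~ N(0, q^l); this is our stand-in for (35), meant to reproduce the qualitative behavior just described (monotone growth, no non-trivial fixed point), not its exact form.

```python
import numpy as np

def iterate_variance(phi, q0=1.0, L=100, sigma_w=1.0, n=80):
    """Iterate a simplified additive variance recursion for a residual stack.

    Assumed update (a sketch, not the exact Eq. (35)):
        q^{l+1} = q^l + (sigma_w**2 / L) * E[phi(h)^2],   h ~ N(0, q^l).
    """
    x, w = np.polynomial.hermite_e.hermegauss(n)
    w = w / np.sqrt(2.0 * np.pi)                 # normalize to a N(0, 1) average
    qs = [q0]
    for _ in range(L):
        q = qs[-1]
        second_moment = np.sum(w * phi(np.sqrt(q) * x) ** 2)
        qs.append(q + sigma_w ** 2 / L * second_moment)
    return np.array(qs)

qs = iterate_variance(np.tanh)
print(qs[0], qs[len(qs) // 2], qs[-1])   # slow, roughly linear growth with depth
```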

We remark here that the above reasoning concerning signal propagation also holds when the weights are scaled orthogonal matrices. In such a case the first two moments of the matrix entries coincide with the Gaussian ones CollSn (), and the entries themselves can be approximated as independent Gaussians Chatterjee ().

ii.3 Results for various activation functions

We now investigate particular examples of activation functions. For simplicity, we consider purely residual networks (we set a = 1). The numerical verifications of the results presented here will follow in the next subsection.

  1. Linear

    In the case of the linear activation function the derivative is identically equal to 1 and there is no need to consider the way the pre-activations change across the network. Thus we can proceed directly to calculating the cumulant, for which the Gaussian average of the squared derivative is simply 1.

  2. Rectified Linear Unit

    The example of ReLU is only slightly more involved. Now we have φ'(h) = θ(h), where θ is the Heaviside step function, and thus

    \left\langle \phi'(h)^2 \right\rangle = \left\langle \theta(h) \right\rangle = \frac{1}{2}     (36)
  3. Leaky ReLU

    The activation function interpolating between the first two examples is φ(h) = max(γh, h), with the slope γ for negative arguments, 0 ≤ γ ≤ 1. In this case

    \left\langle \phi'(h)^2 \right\rangle = \frac{1+\gamma^2}{2}     (37)

Altogether, this leads to the following equation for the Green's function

(38)

where γ = 1 corresponds to the linear activation function and γ = 0 to ReLU. Equation (38) can be easily solved numerically for the spectral probability density of the Jacobian. For completeness, we write down the recursion relation (35) for these three cases:

(39)

For the linear activation function (γ = 1), its solution is readily available and reads

(40)

which explicitly shows the importance of the rescaling introduced earlier.

  1. Hard hyperbolic tangent

    The hard tanh activation function is defined by φ(h) = h for |h| ≤ 1 and φ(h) = sign(h) elsewhere. Thus:

    (41)

    The resulting recurrence equation for the variance of the pre-activations reads:

    (42)

    and can be easily solved; the relevant Gaussian average of the squared derivative admits the closed form sketched below.
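
For reference, assuming (our reading of subsection II.2) that the relevant per-layer factor is the Gaussian average of φ'(h)² over h ~ N(0, q^l), the hard tanh case admits a simple closed form:

```latex
% Hard tanh: phi'(h) = 1 for |h| < 1 and 0 otherwise, with h ~ N(0, q^l).
\left\langle \phi'(h)^2 \right\rangle_{q^l}
  = \mathbb{P}\bigl(|h| < 1\bigr)
  = \operatorname{erf}\!\left(\frac{1}{\sqrt{2\,q^{l}}}\right) .
```

This average tends to 1 for small q^l (the linear regime) and decays towards 0 as the pre-activation variance grows.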

In the preceding examples we dealt with piecewise linear activation functions. For other, genuinely nonlinear activation functions, obtaining the cumulants requires the recurrence relation describing the signal propagation in the network.

  1. Hyperbolic tangent

    When the activation function is the hyperbolic tangent, it is antisymmetric and the last term of (35) vanishes. Thus the recurrence takes the form:

    (43)

    In the large-L limit, we can write

    (44)

    where we assume the layer-to-layer increments to be small. Expanding this recursively for decreasing layer index and keeping only the leading terms, we obtain:

    (45)

    Therefore, the solution of the recursion is:

    (46)

    The variance grows linearly with the depth (this is verified in Appendix A, see Fig. 5). The cumulants, and thus the effective cumulant, are obtained by numerical integration.

In fact, in the above calculations we have only used the antisymmetry of the hyperbolic tangent and its behavior near the origin. Therefore, these results are valid for other antisymmetric activation functions with similar behavior near zero.

  1. Sigmoid

    The sigmoid activation function is the first example we encounter for which the first moment of the activation does not vanish, thus it deserves special attention. In particular, in this case one needs to additionally address the last term in (35). It turns out that

    (47)

    irrespective of the variance of the pre-activations. Therefore, the recurrence relation becomes:

    (48)

    Thus we can see that, due to the non-zero first moment of the activation function (47), the mean and the second moment of the post-activations grow with depth. Similarly, the variance of the pre-activations increases as the signal propagates, which causes quick saturation of the sigmoid nonlinearity. This in turn precludes training of deep networks GB ().

    Analogously to the previous case, one can derive an approximation to the solution of the recursion relation. In this case it becomes:

    (49)

    We verify this result in Fig. 5 in Appendix A.

  2. Scaled Exponential Linear Units

    Our final example is the SELU activation function, introduced recently in KUM () with the intent of bypassing the batch normalization procedure. In this case, we have φ(h) = λh for h > 0 and φ(h) = λα(e^h − 1) for h ≤ 0. Thus, it is not antisymmetric and it is nonlinear for negative arguments. It turns out that

    (50)

    Moreover:

    (51)

    and

    (52)

    These yield the recursion relation for q^l via (35). One can check that for λ = 1 and α = 0, the results for ReLU are recovered.

These theoretical predictions for the recursion relations are tested with numerical simulations using Mathematica. The results are relegated to Appendix A.

ii.4 Random matrix experiments with Mathematica

To thoroughly test the theoretical predictions of Section II, we run numerical simulations using Mathematica. The initial condition, an input vector of length N filled with iid Gaussian random variables of zero mean and unit variance, is propagated according to the recurrence (1) for various activation functions. The network weights and biases are generated from normal distributions of zero mean and variances σ_w²/(NL) and σ_b², respectively (see the plots for the values of the other parameters). The propagation of the variance of the pre-activations and post-activations, as well as of the second cumulant, across the network is presented in Appendix A. All numerical simulations corroborate our theoretical results. Here, for clarity and as a generic example, in Figure 1 (left) we show the distribution of singular values of the input-output Jacobian (defined in (6)) for the tanh nonlinearity and various network depths. In this example the Jacobian is not independent of the signal propagation, as is the case for piecewise linear activation functions. Similarly, in the right panel of Figure 1, we showcase the outcome of numerical experiments and the associated, matching theoretical results for the most popular ReLU activation function, for various initializations resulting in different values of the effective cumulant.

Figure 1: (Left) Density of singular values of the input-output Jacobian for the residual network with the tanh nonlinearity. Note that the asymptotic theoretical result describes remarkably well even networks that are not very deep. (Right) Asymptotic distribution of singular values for various values of the effective cumulant (dashed) juxtaposed with the numerical simulations for the ReLU nonlinearity (solid). Note that the histograms were calculated from a single random initialisation. The smaller the effective cumulant, the narrower the spectrum and the closer the Jacobian is to an isometry.
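
The claim that only the biunitarily invariant class of the weights matters, not whether they are Gaussian or scaled orthogonal, can also be probed directly. The sketch below (our Python code, with an assumed tanh stack and the 1/(NL) scaling; the Mathematica scripts used for the figures are not reproduced here) compares the two ensembles.

```python
import numpy as np

def singvals(N=200, L=100, sigma_w=1.0, orthogonal=False, seed=0):
    """Singular values of the input-output Jacobian of a random residual stack
    (tanh nonlinearity) with Gaussian or scaled orthogonal weights."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=N)
    J = np.eye(N)
    scale = sigma_w / np.sqrt(L)
    for _ in range(L):
        if orthogonal:
            Q, _ = np.linalg.qr(rng.normal(size=(N, N)))   # approximately Haar
            W = scale * Q
        else:
            W = rng.normal(0.0, scale / np.sqrt(N), size=(N, N))
        D = np.diag(1.0 - np.tanh(h) ** 2)
        J = (np.eye(N) + W @ D) @ J
        h = h + W @ np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

for flag in (False, True):
    s = singvals(orthogonal=flag)
    print("orthogonal" if flag else "gaussian  ",
          f"min={s.min():.3f}  max={s.max():.3f}")
```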

Iii Experiments on image classification

The goal of this section is to test our theoretical predictions on real data. To this end we use a fully connected residual network with a single hidden representation, see Fig. 2. This simplified version of the model of (HZRS, ) does not use (i) multiple stages with different dimensions of the hidden representation, or (ii) two layers within each residual block. Further, we restrict our focus to image classification via the popular CIFAR-10 benchmark CIFAR10 (). Finally, we focus on fully connected residual networks and leave the study of convolutional layers for future work.

Figure 2: Residual network architecture used in the paper.

iii.1 Achieving dynamical isometry for any activation function

Perhaps the most interesting prediction of our theory is that ResNets, in contrast to fully connected feed-forward networks, can achieve dynamical isometry for many different activation functions. We study this empirically by looking at the spectrum of J at initialization, for different activation functions and numbers of residual blocks. Please note that by J we refer to the Jacobian of the output of the last residual block with respect to the input of the first one.

We consider the following popular activation functions: ReLU NH (), Tanh, Hard Tanh, Sigmoid, SELU (KUM, ) and Leaky ReLU (MHN, ) (with a leaking constant between 0 and 1). For each activation function we consider several numbers of residual blocks. All weights of the network are initialized from a zero-centered normal distribution, whereas biases are initialized to zero. The weights of the residual blocks are initialized using a standard deviation scaled with the width and depth of the network; the other weights are initialized as in (Taki1, ). For a given activation function and number of blocks, we set the standard deviation of the residual-block weights so that the effective cumulant attains a fixed common value, which ensures the concentration of the singular values of the Jacobian around one, and hence dynamical isometry (see Sec. II.3 for more details and Fig. 1 for the shape of the singular value density).
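
A sketch of how such a matching could be implemented is given below. It assumes (our simplification) that the per-layer contribution to the effective cumulant is σ_w² times the Gaussian average of φ'(h)² at a reference pre-activation variance, and solves for σ_w; in the actual experiments the pre-activation variance evolves with depth, so the matching has to be done along the whole signal-propagation recursion. All names and the target value are placeholders.

```python
import numpy as np

def mean_sq_deriv(dphi, q, n=80):
    """Gaussian average of phi'(h)^2 for h ~ N(0, q), via Gauss-Hermite."""
    x, w = np.polynomial.hermite_e.hermegauss(n)
    return np.sum(w * dphi(np.sqrt(q) * x) ** 2) / np.sqrt(2.0 * np.pi)

def matched_sigma_w(dphi, target=1.0, q=1.0):
    """sigma_w such that sigma_w^2 * <phi'(h)^2> equals `target`
    (a simplified stand-in for fixing the effective cumulant)."""
    return float(np.sqrt(target / mean_sq_deriv(dphi, q)))

derivatives = {
    "relu":       lambda h: (h > 0).astype(float),
    "tanh":       lambda h: 1.0 - np.tanh(h) ** 2,
    "leaky_relu": lambda h: np.where(h > 0, 1.0, 0.3),  # leaking constant: placeholder
}
for name, dphi in derivatives.items():
    print(name, round(matched_sigma_w(dphi), 3))
```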

For each pair of activation function and number of blocks, we compute the empirical spectrum of J at initialization; the results are reported in Fig. 3. Indeed, we observe that upon scaling the initialization's standard deviation in such a way that the effective cumulant is kept constant, the empirical spectrum of J is independent of the number of residual blocks and of the choice of activation function.

Figure 3: Singular spectra of the input-output Jacobian obtained for various activation functions and depths. The network was fed with examples from the CIFAR-10 dataset.

iii.2 Learning dynamics are not the same under dynamical isometry

Figure 4: Training accuracy during the first 100 iterations (left) and first 100 epochs (right) of residual networks with various activation functions. The weight initialization was chosen for each activation function in such a way that the effective cumulant takes the same fixed value; the corresponding spectra of the Jacobians are presented in Fig. 3. The leaking constant for leaky ReLU is the same as in the other experiments.

Our next objective is to investigate whether networks which achieve dynamical isometry share similar learning dynamics. While this is outside of the scope of our theoretical investigation, it is inspired by studies such as (PSG1, ), which demonstrate the importance of achieving dynamical isometry at the initialization for the subsequent optimization.

We consider the same set of experiments as in the previous section, and follow a training protocol similar to that of (HZRS, ). We train for a fixed number of epochs and drop the learning rate by a constant factor at three scheduled points. We use mini-batch training with one of two relatively low starting learning rates, largely because we omit batch normalization layers in the architecture.

First, we look at the learning dynamics on the training set. Most of the activation functions exhibit a similar training accuracy evolution, see Fig. 4 (top). Using the sigmoid activation, however, led to significantly slower optimization. This is due to a faster growth of the variances of the post- and pre-activations (which can be observed in Fig. 5), which exacerbates the neuron saturation problem. The difference between the sigmoid and the rest of the activation functions is even more pronounced on the validation data set, see Fig. 4 (bottom left).

Overall, our results suggest that the probability density of the singular spectrum of J at initialization does not fully determine generalization and training performance irrespective of the activation function used. Nonetheless, setting the same effective cumulant for the experiments with different activation functions results in markedly coinciding behavior of networks using activation functions of similar characteristics. This is in contrast to a setup in which the variances of the weight matrix entries are set to be equal. To demonstrate this, we run another set of training experiments, this time with the same standard deviation of the weights for all activation functions. The plots depicting the full results are relegated to Fig. 6 in Appendix B. Here, in Fig. 4 (bottom right), we showcase the training accuracy during the first iterations for these two setups (excluding, for clarity, the networks with the sigmoid activation function). With different effective cumulants, the learning dynamics differs among experiments with different activation functions, especially at the beginning of learning.

This suggests that the spectrum of the input-output Jacobian at initialization can be treated as a confounding variable in such experiments. Ensuring that the level of dynamical isometry, and hence the value of the effective cumulant, is kept the same provides the possibility of a more meaningful comparison of the effect of the activation function on the learning dynamics.

Iv Comment on batch normalization

A crucial component of practically used ResNets is batch normalization IS (). When it is used on pre-activations, between each layer, the propagation of the information in the network is described by:

(53)

where

(54)

are the mean and the (regularized with some small ε) standard deviation of the given coefficient of the pre-activations in layer l, computed over the inputs of the mini-batch. The scale and shift parameters γ and β are optimized during the learning process. Here, the formula for the Jacobian reads:

(55)

where the diagonal matrix now also incorporates the batch-normalization rescaling of the derivatives of the activation function. Therefore, the only difference in the spectral statistics derivation from the previous section is that (21) becomes

(56)

Thus, the universal, large-depth equation for the Green's function of the Jacobian (25) and the formula for the condition number (26) hold also when batch normalization is included. Again, the batch statistics can be treated as random variables. Unfortunately, the evolution of their probability density functions across the layers is quite complicated and beyond the scope of this paper. We leave their analysis for future work.
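
For concreteness, a minimal sketch of batch normalization applied to the pre-activations of a single residual step is given below (standard batch normalization; the placement inside the residual recursion and all variable names are our assumptions).

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Standard batch normalization of pre-activations.

    H has shape (batch, N); each coefficient is shifted by its mini-batch mean
    and divided by its (eps-regularized) mini-batch standard deviation, then
    rescaled and shifted by the learnable parameters gamma and beta."""
    mu = H.mean(axis=0)
    std = np.sqrt(H.var(axis=0) + eps)
    return gamma * (H - mu) / std + beta

# One residual step with batch-normalized pre-activations (toy sketch).
rng = np.random.default_rng(0)
B, N, L = 32, 100, 50
H = rng.normal(size=(B, N))                             # mini-batch of pre-activations
W = rng.normal(0.0, 1.0 / np.sqrt(N * L), size=(N, N))
gamma, beta = np.ones(N), np.zeros(N)
H_next = H + np.tanh(batch_norm(H, gamma, beta)) @ W.T  # a = 1, biases omitted
print(H_next.shape)   # (32, 100)
```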

V Synopsis and discussion

The main focus of this paper was the singular spectrum of the input-output Jacobian of a simple model of residual neural networks. We have shown that in the large network depth limit it is described by a single, universal equation. This holds irrespective of the activation function used, for biunitarily invariant weight initialization matrices, a class covering the Gaussian and scaled orthogonal initialization schemes. The singular value density depends on a single parameter called the effective cumulant, which can be calculated by considering the propagation of information in the network via a dynamical mean-field theory approach. This parameter depends on the activation function used, on the variances of the biases and of the weight matrix entries, and, for some activation functions, also on the depth of the network. We demonstrated the validity of our theoretical results in numerical experiments, both by generating random matrices adhering to the assumptions of the model and by evaluating the Jacobians of residual networks (at initialization) on the CIFAR-10 dataset.

For a given activation function and/or network depth, it is always possible to set the variances of the weight matrix entries in such a way that the resulting singular spectra of the Jacobians not only fulfill the conditions for dynamical isometry, but are also exactly the same, irrespective of the activation function used. This observation allows us to eliminate the singular spectrum of the Jacobian as a confounding factor in experiments on the learning process of simple residual neural networks with different activation functions. As an example of how this approach can be applied, we examined how the accuracies of simple residual neural networks, employing a variety of activation functions, change during the learning process. When using the same variances of the weight matrix entries, the learning curves of similar activation functions differed from each other more than when the networks were initialized with the same input-output Jacobian spectra. This allows, in our opinion, for a more meaningful comparison between the effects of choosing the activation function. We hope this observation will help with research on deep neural networks.

Finally, we would like to note that, in hindsight, it is not surprising that Free Probability Theory, a generalization of classical Probability Theory to non-commuting variables, is useful in the study of deep neural networks. We expect to see more interesting results in this area, especially in view of the many new and emerging results on the eigenvalue spectra and (equally if not more importantly) the eigenvectors of non-symmetric matrices.

Vi Acknowledgments

PW would like to acknowledge the funding of the Polish National Science Centre, through the project SONATA number 2016/21/D/ST2/01142. WT appreciates the financial support from the Polish Ministry of Science and Higher Education through ”Diamond Grant” 0225/DIA/2015/44. SJ was supported by the Polish National Science Centre through ETIUDA stipend No. 2017/24/T/ST6/00487.

References

  • (1) V. Nair, and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning, ICML'10, 2010.
  • (2) Y. LeCun, Y. Bengio, and G. Hinton Deep Learning. Nature 521 (2015) 436.
  • (3) A. Canziani, A. Paszke, and E. Culurciello. An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678.
  • (4) K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • (5) A. Veit, M. Wilber, and S. Belongie. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NIPS’16 Proceedings of the 30th International Conference on Neural Information Processing Systems. (2016) 550.
  • (6) J. Pennington, S. Schoenholz, and S. Ganguli Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, Curran Associates Inc. (2017) 4788.
  • (7) K. He, X. Zhang, S. Ren, and J. Sun Identity Mappings in Deep Residual Networks ECCV, 2016
  • (8) L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. arXiv:1806.05393
  • (9) E. Wigner. Characteristic vectors of bordered matrices with infinite dimensions Annals of Mathematics 62 (1955) 548.
  • (10) R. Carmona, M. Cerenzia, and A. Zeff Palmer. The Dyson Game. arXiv:1808.02464.
  • (11) The Oxford Handbook of Random Matrix Theory. Edited by G. Akemann, J. Baik, and P. Di Francesco. Oxford University Press, 2011.
  • (12) A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surface of multilayer networks. JMLR, 38 (2015).
  • (13) J. Pennington, and Y. Bahri. Geometry of Neural Network Loss Surfaces via Random Matrix Theory. Proceedings of the 34’th International Conference on Machine Learning, PMLR 70, 2017.
  • (14) C. Louart, Z. Liao, and R. Couillet. A Random Matrix Approach to Neural Networks. The Annals of Applied Probability 28 (2018) 1190.
  • (15) J. Pennington, and P. Worah Nonlinear random matrix theory for deep learning. 31st Conference on Neural Information Processing Systems (NIPS 2017).
  • (16) Z. Liao, and R. Couillet. The Dynamics of Learning: A Random Matrix Approach. arXiv:1805.11917.
  • (17) S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. arXiv:1611.01232.
  • (18) J. Pennington, S. S. Schoenholz, and S. Ganguli. The emergence of spectral universality in deep neural networks. arXiv:1802.09979.
  • (19) Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models Proceedings of the 30th International Conference on Machine Learning, PMLR 66, 2013.
  • (20) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. arXiv:1606.05340.
  • (21) D. Mishkin, and J. Matas. All you need is a good init. ICLR 2016.
  • (22) M. Taki. Deep Residual Networks and Weight Initialization. arXiv:1709.02956
  • (23) B. Hanin, and D. Rolnick. How to Start Training: The Effect of Initialization and Architecture. arXiv:1803.01719.
  • (24) D. Balduzzi, M. Frean, L. Leary, JP Lewis, K. Wan-Duo Ma, and B. McWilliams. The shattered gradients problem: if resnets are the answer, then what is the question? Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 2017.
  • (25) Z. Ling, and R. C. Qiu. Spectrum concentration in deep residual learning: a free probability approach. arXiv:1807.11694v1.
  • (26) X. Glorot, and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010) 249.
  • (27) A.M. Saxe, J.L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 (2013).
  • (28) E. Gudowska-Nowak, R. A. Janik, J. Jurkiewicz, M. A. Nowak. Infinite products of large random matrices and matrix-valued diffusion. Nucl. Phys. B 670 (2003) 479.
  • (29) R.A. Janik, W. Wieczorek. Multiplying unitary random matrices - universality and spectral properties. J. Phys. A: Math. Gen. 37 (2004) 6521.
  • (30) D. V. Voiculescu, K. J. Dykema, and A. Nica. Free random variables. CRM Monograph Series Vol. 1, AMS, Providence (1992).
  • (31) J. A. Mingo, and R. Speicher. Free probability and random matrices. Vol. 4. New York, NY, USA: Springer, (2017).
  • (32) Z. Burda, R.A. Janik, and M.A. Nowak. Multiplication law and S transform for non-hermitian random matrices. Phys. Rev. E84 (2011) 061125.
  • (33) Z. Burda, M.A. Nowak, and A. Swiech. Spectral relations between products and powers of isotropic random matrices. Phys. Rev. E86 (2012) 061137.
  • (34) R.A. Janik, M.A. Nowak, G. Papp, and I. Zahed, NonHermitian random matrix models, Nucl. Phys. B501 (1997) 603. R.A. Janik, M.A. Nowak, G. Papp, J. Wambach, and I. Zahed, NonHermitian random matrix models: a free random variable approach, Phys. Rev. E55 (1997) 4100. For a closely related approach known as "hermitization" see J. Feinberg, and A. Zee, NonHermitian random matrix theory: Method of Hermitian reduction, Nucl. Phys. B504 (1997) 579.
  • (35) A. Jarosz, and M.A. Nowak. Random hermitian versus random non-hermitian operators - unexpected links. J. Phys. A39 (2006) 10107.
  • (36) D. Voiculescu, in Operator Algebras and their Connections with Topology and Ergodic Theory, Lecture Notes in Math. 1132 (1985), Springer Verlag, 556.
  • (37) A. Nica, and R. Speicher. R-diagonal pairs - a common approach to Haar unitaries and circular elements. Fields Inst. Commun. 12 (1997) 149.
  • (38) A. Nica, and R. Speicher. Lectures on the combinatorics of free probability (Vol. 13). Cambridge University Press, 2006.
  • (39) M. A. Nowak, and W. Tarnowski. Complete diagrammatics of the single-ring theorem. Physical Review E 96 (2017) 042149.
  • (40) R. M. Corless et al. On the Lambert W function. Adv. Comp. Math 5 (1996) 329.
  • (41) B. Collins, and P. Śniady. Integration with Respect to the Haar Measure on Unitary, Orthogonal and Symplectic Group. Communications in Mathematical Physics 264(3) (2006) 773.
  • (42) S. Chatterjee, and E. Meckes. Multivariate normal approximation using exchangeable pairs. arXiv math/0701464 (2007).
  • (43) A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, chapter 3. University of Toronto, 2009.
  • (44) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NIPS) (2017).
  • (45) S. Ioffe, and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML (2015).

Appendix A Numerical verification of the recurrence relations

To validate the assumptions made and corroborate the theoretical results obtained in subsections II.2 and II.3, we simulate signal propagation in the studied residual neural networks with different activation functions. The outcomes of these experiments are depicted in figure 5.

Figure 5: Verification of the numerical solution to the recurrence equations for post-activations (31) (left column) and pre-activations (35) (middle column). Based on this signal propagation, the effective cumulant for each layer was calculated (right column). The solid red lines represent the approximation (46) for the tanh and hard tanh nonlinearities and (49) for the sigmoid. Solutions of the recurrences (solid) are confronted with the numerical simulation (dots) of residual fully connected networks of constant width. Data points represent a single run of the simulations. Weights are independently sampled from a Gaussian distribution of zero mean and variance scaled as σ_w²/(NL). Biases and the network input are sampled from the standard normal distribution. The small variability of the per-layer cumulant across the network justifies the assumption made in the derivation of (22).

Appendix B Baseline

We advocate setting the same value of the effective cumulant (and hence keeping the same spectrum of the input-output Jacobian) when comparing the effects of using different activation functions on the learning process. Thus, here, in Fig. 6, for comparison, we showcase the learning accuracy when, instead of the effective cumulant, the variances of the weight matrix entries are kept the same across the networks.

Figure 6: Training accuracy during the first 100 iterations (left) and first 100 epochs (right) of residual networks with various activation functions. The weight initialization was Gaussian with zero mean and the same variance for all activation functions. The leaking constant for leaky ReLU (LReLU) was kept at the value used in the other experiments.