Characterizing Well-behaved vs. Pathological
Deep Neural Network Architectures
Abstract
We introduce a principled approach, requiring only mild assumptions, for the characterization of deep neural networks at initialization. Our approach applies both to fully-connected and convolutional networks and incorporates the commonly used techniques of batch normalization and skip-connections. Our key insight is to consider the evolution with depth of statistical moments of signal and sensitivity, thereby characterizing the well-behaved or pathological behaviour of input-output mappings encoded by different choices of architecture. We establish: (i) for feedforward networks, with and without batch normalization, depth multiplicativity inevitably leads to ill-behaved moments and distributional pathologies; (ii) for residual networks, on the other hand, the mechanism of identity skip-connections induces power-law rather than exponential behaviour, leading to well-behaved moments and no distributional pathology. (Footnote: code to reproduce all results will be made available upon publication.)
Preprint. Work in progress.
1 Introduction
Advances in the design of neural network architectures have more often come from the relentless race of practical applications than from principled approaches. Even after wide adoption, many common choices and rules of thumb still await theoretical validation. This is unfortunate, since the emergence of a theory able to fully characterize deep neural networks in terms of wanted and unwanted properties would make it possible to guide further improvements.
An important branch of research towards this theoretical characterization has focused on random networks at initialization, i.e. before any training has occurred. The characterization of random networks offers valuable insights into the hypothesis space of input-output mappings reachable during training. In a Bayesian view, it also unveils the prior on the input-output mapping encoded by the choice of architecture (Neal, 1996a; Williams, 1997; Lee et al., 2017). Experimentally, well-behaved input-output mappings at initialization were extensively found to be predictive of model trainability and even of post-training performance (Schoenholz et al., 2016; Balduzzi et al., 2017; Pennington et al., 2017; Yang & Schoenholz, 2017; Chen et al., 2018; Xiao et al., 2018).
Unfortunately, even this simplifying case of random networks remains challenging in various respects: (i) there is a complex interplay of different sources of randomness from input data and from model parameters; (ii) the finite number of units in each layer leads to additional complexity; (iii) convolutional neural networks – widely used in many applications – also lead to additional complexity. Difficulty (i) is typically circumvented by restricting input data to simplifying cases: a one-dimensional input manifold (Poole et al., 2016; Raghu et al., 2017), two input data points (Schoenholz et al., 2016; Balduzzi et al., 2017; Yang & Schoenholz, 2017; Chen et al., 2018; Xiao et al., 2018), or a batch of input data points (Anonymous, 2019). Difficulty (ii) is commonly circumvented by considering either the simplifying case of infinite width, with the convergence of neural networks to Gaussian processes (Neal, 1996a; Daniely et al., 2016; Lee et al., 2017; Matthews et al., 2018), or the simplifying case of typical activation patterns for ReLU networks (Balduzzi et al., 2017). Finally, difficulty (iii) is rarely tackled, as most analyses only apply to fully-connected networks. To the best of our knowledge, all attempts have thus far been limited in their scope or simplifying assumptions.
In this paper, we introduce a principled approach to characterize random deep neural networks at initialization. Our approach does not require any of the usual simplifications of infinite width, Gaussianity, typical activation patterns, or restricted input data. It further applies both to fully-connected and convolutional networks, and incorporates batch normalization and skip-connections. Our key insight is to consider statistical moments of signal and sensitivity with respect to input data as random variables which depend on model parameters. By studying the evolution of these moments with depth, we characterize the well-behaved or pathological behaviour of input-output mappings encoded by different choices of architecture. Our findings span the topics of one-dimensional degeneracy, pseudo-linearity, exploding sensitivity, and exponential versus power-law evolution with depth.
2 Propagation
We start by formulating the propagation for neural networks with neither batch normalization nor skip-connections, which we refer to as vanilla networks. The formulation will be slightly adapted in Section 6 for batch-normalized feedforward nets, and in Section 7 for batch-normalized resnets.
Clean propagation. We consider a random tensorial input , spatially dimensional with extent in all spatial dimensions and channels. This input is fed into a dimensional convolutional neural network with periodic boundary conditions and fixed spatial extent equal to . (Footnote: the assumptions of periodic boundary conditions and fixed spatial extent are made for simplicity of the analysis; possible relaxations are discussed in Section C.2.) For each layer , we denote the number of channels or width, the convolutional spatial extent, the weight tensors, the biases, the tensors of post-activations and pre-activations. We further denote the spatial position, c the channel, and the activation function. Adopting the convention , the propagation at each layer is given by
where is the tensor with repeated version of at each spatial position. From now on, we refer to the propagated tensor as the signal.
Noisy propagation. Next we suppose that the input signal is corrupted by a small white noise tensor with independent and identically distributed components such that , with the Kronecker delta. The noisy signal is propagated into the same neural network and we keep track of the noise corruption with the tensor defined as and , with the neural network mapping from layer to . The simultaneous propagation of the signal and the noise is given by
(1)  
(2) 
where denotes the elementwise tensor multiplication and Eq. (2) is obtained by taking the derivative in Eq. (1). For given , the mapping from to induced by Eq. (2) is linear. The noise thus stays centered with respect to during propagation such that : .
To get rid of the dependence on , the random sensitivity tensor is introduced as the rescaling and . By linearity of Eq. (2), the sensitivity tensor is the result of the simultaneous propagation of and in Eq. (1) and (2). We also have and : . The sensitivity tensor encodes derivative information while avoiding the burden of increased dimensionality (see calculations in Appendix D.3 and Appendix D.2). This will prove very useful.
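As a concrete illustration (a sketch, not the paper's code), the simultaneous propagation of signal and sensitivity in Eq. (1) and Eq. (2) can be written in a few lines for the fully-connected subcase with ReLU activation; the function name and hyperparameters here are hypothetical, and we use the convention phi'(0) = 0 introduced below.

```python
import numpy as np

def propagate_signal_and_sensitivity(x, dx, depth=10, width=512, seed=0):
    """Jointly propagate a signal x and a sensitivity tensor dx through a
    random fully-connected ReLU network (the kernel-size-1 subcase of
    Eq. (1) and Eq. (2)). The sensitivity passes through the same affine
    maps as the signal, element-wise multiplied by the ReLU activation
    pattern, with the convention phi'(0) = 0."""
    rng = np.random.default_rng(seed)
    y, dy = x, dx
    fan_in = x.shape[-1]
    for _ in range(depth):
        # He initialization: Var[W_ij] = 2 / fan_in, zero biases (He et al., 2015)
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        z, dz = y @ W, dy @ W            # pre-activations and their perturbation
        mask = (z > 0).astype(z.dtype)   # phi'(z) for ReLU
        y, dy = np.maximum(z, 0.0), dz * mask
        fan_in = width
    return y, dy
```

By linearity of Eq. (2) in `dx`, rescaling the input noise rescales `dy` identically, which is what makes the sensitivity tensor well-defined.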
Scope. We require two very mild assumptions: (i) is not trivially such that (Footnote: whenever and c are considered as random variables, they are supposed uniformly sampled among all spatial positions and all channels ); (ii) the width is bounded.
More importantly, we restrict to the – most commonly used – ReLU activation function . Even though is not differentiable at , we still define and as the result of the simultaneous propagation of Eq. (1) and Eq. (2) with the convention . Section 3 implicitly relies on the assumption that remains differentiable, implying , almost surely (a.s.) with respect to (details and justification of this assumption in Appendix D.1).
Note that fullyconnected neural networks are still included in our analysis as the subcase .
3 Input data randomness
We now turn our attention to the distributions with respect to input data of signal and sensitivity , . To highlight the importance of these distributions, we may express by layer composition the output of an layer neural network as , with and the upper neural network mapping from layer to . It means that and can be seen as the input signal and noise of this upper neural network. It also means that the distributions and , or equivalently , must remain well-behaved for this upper network to have a chance of accomplishing its task successfully. We will return to this argument in Section 3.2 by detailing the penalizing effects of ill-behaved , .
3.1 Characterizing distributions with respect to input data
First we need to introduce the following statistical quantities to characterize , :

The feature map vector and centered feature map vector associated with any random tensor ,
where is uniformly sampled in and the notations , serve as reminders of the randomness of both and .
In a similar vein as batch normalization and Xiao et al. (2018), feature maps and centered feature maps at different spatial positions and are implicitly seen as subsignals propagated together as part of tensorial meta-signals. Statistically, we work at the granularity of subsignals and we treat equally the randomness from and . (Footnote: in the context e.g. of , note that propagation from layer to solely depends on the randomness of , the randomness of being only involved in the selection of the final position at layer .) For simplicity, will still be referred to as input data.

The noncentral moment and central moment of order associated with any random tensor perchannel and averaged over channels,

The effective rank associated with any random tensor (Vershynin, 2010),
where denotes the covariance matrix, the spectral norm, and the eigenvalues of . The effective rank is always greater than , and it intuitively measures the number of effective directions which concentrate the variance of .

The normalized sensitivity – our key metric – derived from the moments of and ,
(3) To better understand the definition of , let us consider the classification task. Since the goal is to set apart different signals, the mean signal over is uninformative. Centered feature maps thus constitute the informative part of the signal, and , normalize by signal 'informative content' in Eq. (3). The normalized sensitivity exactly measures a root mean square sensitivity when neural network input and output signals and are rescaled to have unit signal 'informative content': and (proof in Appendix D.2). If we denote the rescaled input-output mapping as , then this property is simply expressed as in the fully-connected case with 1-dimensional rescaled input and output signals and (, ). We provide an illustration in Fig. 1.
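As a quick sanity check on the definitions above (a sketch, not the paper's code), the effective rank can be estimated directly from sample data via the covariance spectrum; the function name is hypothetical.

```python
import numpy as np

def effective_rank(X):
    """Effective rank erank(C) = tr(C) / ||C||_2 of the covariance matrix C
    of X (rows = samples, columns = features), following Vershynin (2010).
    It lies between 1 and rank(C), and counts the number of effective
    directions that concentrate the variance of the data."""
    C = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(C)       # covariance matrices are symmetric PSD
    return eigvals.sum() / eigvals.max()  # trace divided by spectral norm
```

For isotropic data in d dimensions the estimate approaches d, while data whose variance is concentrated along a single direction yields a value close to 1 (the one-dimensional pathology discussed below).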
Figure 1: Illustration of the normalized sensitivity for fully-connected networks of layers with 1-dimensional input and output rescaled signals and . The distribution of the input is a mixture of two Gaussians. We show the result of the propagation in three different cases: (a) layer with sigmoid activation; (b) layer with linear activation; (c) randomly initialized layers with channels and ReLU activation for and linear activation for . Top: full input-output mapping (blue) and randomly sampled input-output data points (red circles). Bottom: histograms of inputs and outputs. The normalized sensitivity is directly connected to the Jacobian norm in the fully-connected case and to sensitivity to signal perturbation in the general case, known in turn for being connected to generalization (Sokolic et al., 2017; Arora et al., 2018; Morcos et al., 2018; Novak et al., 2018; Philipp & Carbonell, 2018). (Footnote: the coefficient defined in Philipp & Carbonell (2018) is equivalent to the normalized sensitivity in the fully-connected case; Section D.3 provides details on this equivalence and the reasons for our change of terminology.) The notion of sharpness, which quantifies sensitivity to weight perturbation, was similarly shown to be connected to generalization (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016; Smith & Le, 2017; Neyshabur et al., 2017). Sensitivity to signal perturbation and sensitivity to weight perturbation are tightly connected, since the introduction of a noise on the weights amounts to the introduction of a noise and on the signal in Eq. (1) and Eq. (2).
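The 1-dimensional illustration of Fig. 1 suggests a simple Monte Carlo estimate of the normalized sensitivity (a sketch under our own simplifying assumptions, not the paper's code): rescale input and output to unit standard deviation over the data, which plays the role of unit 'informative content' in the 1-dimensional case, and take the root mean square derivative of the rescaled mapping.

```python
import numpy as np

def normalized_sensitivity_1d(f, x, eps=1e-5):
    """Monte Carlo estimate of the normalized sensitivity of a scalar
    mapping f over an input sample x (1-d array). Input and output are
    rescaled to zero mean and unit variance over the data; the normalized
    sensitivity is then the root mean square derivative of the rescaled
    mapping: chi^2 = E_x[(d f~ / d x~)^2]."""
    y = f(x)
    dydx = (f(x + eps) - f(x - eps)) / (2 * eps)  # finite-difference derivative
    # d f~ / d x~ = (sigma_x / sigma_y) * df/dx under the affine rescalings
    return np.sqrt(np.mean((dydx * x.std() / y.std()) ** 2))
```

Any non-degenerate affine mapping gives exactly 1, matching the intuition that the normalized sensitivity of a purely linear network is neutral.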
3.2 Distributional pathologies
Using these statistical tools, we are able to characterize the distributional pathologies – with ill-behaved , – that we will encounter:

Zero-dimensional signal: . To understand this pathology, let us consider the following mean vectors and rescaling of the signal:
Then and the pathology implies (proof in Appendix D.4). It follows that becomes point-like concentrated at the point of unit norm. In the limit of strict point-like concentration, the upper neural network from layer to is limited to random guessing, since it 'sees' all inputs the same and cannot distinguish between them.

One-dimensional signal: . This pathology implies that the variance of becomes concentrated in only one direction, and thus that becomes line-like concentrated. In the limit of strict line-like concentration, the upper neural network from layer to only 'sees' a single feature of .

Exploding sensitivity: . To understand this pathology, let us push further the view of noisy propagation with the signal-to-noise ratio and the noise factor :
(4) where Eq. (4) follows from and . In logarithmic decibel scale, , indicating that measures how the neural network from layer to degrades () or enhances () the input signal-to-noise ratio. The pathology implies and , meaning that the noisy signal becomes completely random for any level of input noise . (Footnote: in this case, Eq. (2) eventually does not hold, since the noise at layer becomes non-infinitesimal even for infinitesimal input noise .) The upper neural network from layer to is then limited to random guessing.
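The decibel view of the noise factor can be made concrete with a small helper (a sketch; the function name is hypothetical):

```python
import numpy as np

def noise_factor_db(snr_in, snr_out):
    """Noise factor of a network between two layers, expressed in
    logarithmic decibel scale: NF_dB = SNR_in,dB - SNR_out,dB.
    Positive values mean the network degrades the input signal-to-noise
    ratio; negative values mean it enhances it."""
    return 10.0 * np.log10(snr_in) - 10.0 * np.log10(snr_out)
```

For instance, a network taking an input SNR of 100 down to an output SNR of 10 has a noise factor of 10 dB; exploding sensitivity corresponds to the noise factor diverging with depth.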
4 Model parameters randomness
We now introduce model parameters as the second source of randomness. We consider random networks at initialization, which we suppose is standard: (i) weights and biases are initialized following He et al. (2015); (ii) when pre-activations are batch-normalized, the scale and shift batch normalization parameters are initialized with ones and zeros respectively.
Considering random networks at initialization is justified in two respects:

From a Bayesian perspective, the random distribution on the weights encodes a prior distribution on inputoutput mappings for a given choice of architecture (Neal, 1996a; Williams, 1997). This prior distribution can be combined with the likelihood of model parameters given training data in order to obtain a posterior distribution on inputoutput mappings, which in turn enables Bayesian predictions (Lee et al., 2017).

If distributional pathologies arise at initialization, it is likely that training will be difficult. Indeed, if the distributional pathology is severe, it becomes very difficult for the neural network to adjust its parameters and undo the pathology. As an illustration, consider the pathology of zero-dimensional signal: . In this case, the upper neural network from layer to must adjust its bias parameters very precisely in order to center the signal and distinguish between different inputs . In support of this argument, several works have shown that deep neural networks are trainable precisely when they are not subject to the pathology of different inputs being strictly correlated at initialization, i.e. nearly indistinguishable (Schoenholz et al., 2016; Xiao et al., 2018).
From now on, our methodology is to consider all momentrelated quantities, e.g. , , , , , , , as random variables depending on . We introduce the notation for the full set of parameters, and the notation for the conditional set of parameters when are considered as random and as given. We further denote the geometric increment .
Evolution with Depth. The evolution with depth of can be written as
(5) 
where Eq. (5) is obtained using and by expressing with telescoping terms. Denoting the multiplicatively centered increments, and using and , we obtain
(6)  
(7)  
(8) 
Discussion. First we note that and are random variables which depend on , while is a random variable which depends on . We also note that by log-concavity, and that is centered: .
Under standard initialization, each channel provides an independent contribution to . As a consequence, for large the relative increment has low expected deviation from , meaning with high probability that , , . In addition, is centered and uncorrelated at different , so its sum scales as , whereas the sums of and scale as (see Lemma 10 in Appendix E.1). The term is thus doubly negligible. In summary, the evolution with depth is dominated by when this term is non-vanishing and by otherwise. The same analysis can be applied to other positive moments such as , , . It can also be applied to the ratio and thus to .
Further notation. From now on, the geometric increment of any quantity is denoted with . The definitions of , and in Eq. (6), (7) and (8) are extended to other central and noncentral moments of signal and sensitivity, as well as with , , .
We introduce the notation when , with , with high probability, and the notation when , with , with high probability. From now on, we assume that the width is large, implying and . We stress that this assumption is milder than the commonly used mean-field assumption of infinite width: . Indeed, we still consider as random and as possibly non-Gaussian, whereas mean-field would consider as non-random and as Gaussian.
5 Vanilla Networks
We are fully equipped to characterize neural network architectures at initialization. We start by analyzing vanilla networks corresponding to the equations of propagation introduced in Section 2.
Theorem 1.
Moments of vanilla networks. (proof in Appendix E.2)
There exist positive constants , , , , random variables , and events with probability such that under : and are centered and
Discussion. First we discuss the conditionality on , which is necessary to exclude the collapse and (with undefined and ), occurring e.g. when all elements of are strictly negative (Lu et al., 2018). The complementary event has probability exponentially small in the width due to
implying . The conditionality on thus has a highly negligible effect.
Next we discuss the evolution of and under . The particularity of the initialization of He et al. (2015) is to keep and stable during propagation. To do so, it enforces and such that , vanish in Eq. (5) and , are subject to a slow diffusion with small negative drift terms: , , and small diffusion terms: , (see Section E.3 for details).
As evidenced by Fig. 1(a) and Fig. 1(b), the small negative drift and diffusion terms cause slowly decreasing negative expectation and increasing variance of , . The decreasing negative expectation and increasing variance of , can be seen as opposing forces which compensate in order to keep , stable during propagation. Indeed, Fig. 1(a) and Fig. 1(b) show that , are nearly Gaussian, implying that and are nearly lognormal. And the expectation of a lognormal variable with and is equal to . Note that the diffusion happens in log-space, since layer composition amounts to a multiplicative random effect in real space. Also note that this is a finite-width effect, since the terms also vanish in the limit of infinite width.
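The compensation between the negative drift and the diffusion can be checked numerically (a sketch under our own modeling assumption: layer-wise log-space increments are taken as i.i.d. Gaussians, consistent with the near-Gaussianity observed above; the function name is hypothetical).

```python
import numpy as np

def log_space_diffusion(depth=100, sigma=0.05, n=200000, seed=0):
    """Simulate the slow diffusion of a positive moment under multiplicative
    layer-wise increments. Each layer multiplies by exp(g) with Gaussian g
    of mean -sigma**2/2 and std sigma, so that E[exp(g)] = exp(mu + sigma**2/2) = 1:
    the negative drift of the log exactly compensates the diffusion, and the
    expectation of the moment stays stable while its log drifts down."""
    rng = np.random.default_rng(seed)
    g = rng.normal(-sigma**2 / 2, sigma, size=(depth, n))
    log_m = g.sum(axis=0)  # log of the moment after `depth` layers
    return log_m.mean(), np.exp(log_m).mean()
```

With depth 100 and sigma 0.05 the mean log drifts to about -0.125 while the expectation of the moment itself stays close to 1, mirroring the lognormal identity quoted in the text.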
Theorem 2.
Normalized Sensitivity increments of vanilla networks. (proof in Appendix E.4) Denote and . The dominating term under in the evolution of is
(9) 
Discussion. An immediate consequence is that , i.e. that normalized sensitivity always increases with depth for random ReLU vanilla networks.
Now let us show that Theorem 1 and Theorem 2 only allow two possibilities of evolution, which both converge to distributional pathology:

If sensitivity is exploding: , then by definition of . Given that and that both and are limited to the slow diffusion of Theorem 1, this qualitatively means that becomes . Rigorously, if has an exponential drift stronger than diffusion and if we assume that and are lognormally distributed, as supported by Fig. 2, then a.s. (proof in Appendix E.5). Consequently, this case leads to double pathologies: exploding sensitivity and zero-dimensional signal.

Otherwise, geometric increments are strongly limited and we may assume . If we further assume that the moments of the unit-variance rescaled signal are bounded, then Theorem 2 implies the convergence to the pathology of one-dimensional signal: (proof in Appendix E.6), as well as the convergence to neural network pseudo-linearity, where each layer becomes arbitrarily well approximated by a linear function (proof in Appendix E.7).
Experimental verification. The evolution with depth of vanilla networks is shown in Fig. 3. Of the possibilities (i) and (ii), it is case (ii) that occurs, with sub-exponential normalized sensitivity: , convergence to the pathology of one-dimensional signal: , and convergence to neural network pseudo-linearity. A typical symptom of neural network pseudo-linearity is that signals 'cross' the ReLU on the same side. Our analysis offers a novel insight into this coactivation phenomenon, previously observed experimentally (Balduzzi et al., 2017).
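The drift toward one-dimensional signal can be reproduced qualitatively in a few lines (a sketch under the fully-connected, He-initialized assumptions; widths, depths, and function names here are illustrative, not those of Fig. 3):

```python
import numpy as np

def erank(F):
    """Effective rank of the covariance of F (rows = inputs, cols = channels)."""
    w = np.linalg.eigvalsh(np.cov(F, rowvar=False))
    return w.sum() / w.max()

def erank_shallow_vs_deep(n_inputs=512, d_in=64, width=256, depth=100, seed=0):
    """Propagate a batch of isotropic Gaussian inputs through a random
    He-initialized fully-connected ReLU network and compare the effective
    rank of the centered signal at layer 1 and at the final layer."""
    rng = np.random.default_rng(seed)
    y, fan_in, out = rng.normal(size=(n_inputs, d_in)), d_in, {}
    for l in range(1, depth + 1):
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        y, fan_in = np.maximum(y @ W, 0.0), width
        if l in (1, depth):
            out[l] = erank(y - y.mean(axis=0))
    return out[1], out[depth]
```

The effective rank of the centered signal shrinks markedly between the shallow and the deep layer, consistent with the variance concentrating along fewer and fewer directions.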
6 Batch-normalized feedforward nets
Next we incorporate batch normalization (Ioffe & Szegedy, 2015). For simplicity, we only consider the test mode, which consists of subtracting and dividing by for each channel c in . The propagation is given by
(10)  
(11) 
where denotes batch normalization. Note that Eq. (10) and (11) explicitly formulate a finer-grained subdivision into three different steps between layers and in the simultaneous propagation of .
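A single batch-normalized layer in the fully-connected subcase can be sketched as follows (our own illustration, not the paper's code; test-mode population statistics are approximated here by the statistics of the propagated batch). The sensitivity passes through the same linear maps; the mean subtraction is a constant shift and drops out of the derivative, so normalization only rescales the sensitivity by one over the per-channel standard deviation.

```python
import numpy as np

def bn_relu_layer(Y, dY, W, eps=1e-8):
    """One layer of a batch-normalized feedforward net as in Eq. (10)-(11):
    affine step, per-channel normalization of the pre-activations, then ReLU.
    Scale and shift parameters are at their initialization values
    (gamma = 1, beta = 0)."""
    Z, dZ = Y @ W, dY @ W
    mu, sd = Z.mean(axis=0), Z.std(axis=0) + eps  # per-channel statistics
    Zn, dZn = (Z - mu) / sd, dZ / sd              # normalization step
    mask = (Zn > 0).astype(Z.dtype)               # ReLU activation pattern
    return np.maximum(Zn, 0.0), dZn * mask
```

After normalization the pre-activations have zero mean and unit variance per channel, so roughly half of them are clipped by the ReLU regardless of depth.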
Theorem 3.
Normalized sensitivity increments of batch-normalized feedforward nets. (proof in Appendix F.1)
The dominating term in the evolution of can be decomposed as the sum of a term due to batch normalization and a term due to the nonlinearity :
(12)  
(13)  
(14) 
Effect of batch normalization. The term of Eq. (12) corresponds to the evolution of from at layer to just after . To understand this term qualitatively, the preactivation tensor can be seen as random projections of , while batch normalization can be seen as an alteration of the magnitude for each projection. Given that batch normalization uses