# Characterizing Well-behaved vs. Pathological Deep Neural Network Architectures

Antoine Labatie
antoine.labatie@centraliens.net
###### Abstract

We introduce a principled approach, requiring only mild assumptions, for the characterization of deep neural networks at initialization. Our approach applies both to fully-connected and convolutional networks and incorporates the commonly used techniques of batch normalization and skip-connections. Our key insight is to consider the evolution with depth of statistical moments of signal and sensitivity, thereby characterizing the well-behaved or pathological behaviour of input-output mappings encoded by different choices of architecture. We establish: (i) for feedforward networks with and without batch normalization, depth multiplicativity inevitably leads to ill-behaved moments and distributional pathologies; (ii) for residual networks, on the other hand, the mechanism of identity skip-connection induces power-law rather than exponential behaviour, leading to well-behaved moments and no distributional pathology. (Footnote: code to reproduce all results will be made available upon publication.)

Preprint. Work in progress.

## 1 Introduction

Advances in the design of neural network architectures have more often come from the relentless race of practical applications than from principled approaches. Even after wide adoption, many common choices and rules of thumb still await theoretical validation. This is unfortunate, since a theory that fully characterizes deep neural networks in terms of wanted and unwanted properties would help guide further improvements.

An important branch of research towards this theoretical characterization has focused on random networks at initialization, i.e. before any training has occurred. The characterization of random networks offers valuable insights into the hypothesis space of input-output mappings reachable during training. In a Bayesian view, it also unveils the prior on the input-output mapping encoded by the choice of architecture (Neal, 1996a; Williams, 1997; Lee et al., 2017). Experimentally, well-behaved input-output mappings at initialization were extensively found to be predictive of model trainability and even post-training performance (Schoenholz et al., 2016; Balduzzi et al., 2017; Pennington et al., 2017; Yang & Schoenholz, 2017; Chen et al., 2018; Xiao et al., 2018).

Unfortunately, even this simplifying case of random networks is still challenging in various respects: (i) there is a complex interplay of different sources of randomness from input data and from model parameters; (ii) the finite number of units in each layer leads to additional complexity; (iii) convolutional neural networks – widely used in many applications – also lead to additional complexity. The difficulty (i) is typically circumvented by restricting input data to simplifying cases: a one-dimensional input manifold (Poole et al., 2016; Raghu et al., 2017), two input data points (Schoenholz et al., 2016; Balduzzi et al., 2017; Yang & Schoenholz, 2017; Chen et al., 2018; Xiao et al., 2018), or a batch of input data points (Anonymous, 2019). The difficulty (ii) is commonly circumvented by considering either the simplifying case of infinite width with the convergence of neural networks to Gaussian processes (Neal, 1996a; Daniely et al., 2016; Lee et al., 2017; Matthews et al., 2018) or the simplifying case of typical activation patterns for ReLU networks (Balduzzi et al., 2017). Finally, the difficulty (iii) is rarely tackled, as most analyses only apply to fully-connected networks. To the best of our knowledge, all attempts have thus far been limited in their scope or simplifying assumptions.

In this paper, we introduce a principled approach to characterize random deep neural networks at initialization. Our approach does not require any of the usual simplifications of infinite width, Gaussianity, typical activation patterns, or restricted input data. It further applies both to fully-connected and convolutional networks, and incorporates batch normalization and skip-connections. Our key insight is to consider statistical moments of signal and sensitivity with respect to input data as random variables which depend on model parameters. By studying the evolution of these moments with depth, we characterize the well-behaved or pathological behaviour of input-output mappings encoded by different choices of architectures. Our findings span the topics of one-dimensional degeneracy, pseudo-linearity, exploding sensitivity, and exponential versus power-law evolution with depth.

## 2 Propagation

We start by formulating the propagation for neural networks with neither batch normalization nor skip-connections, which we refer to as vanilla networks. The formulation will be slightly adapted in Section 6 for batch-normalized feedforward nets, and in Section 7 for batch-normalized resnets.

Clean propagation. We consider a random tensorial input , spatially -dimensional with extent in all spatial dimensions and channels. This input is fed into a -dimensional convolutional neural network with periodic boundary conditions and fixed spatial extent equal to . (Footnote: the assumptions of periodic boundary conditions and fixed spatial extent are made for simplicity of the analysis. Possible relaxations are discussed in Section C.2.) For each layer , we denote the number of channels or width, the convolutional spatial extent, the weight tensors, the biases, the tensors of post-activations and pre-activations. We further denote the spatial position, c the channel, and the activation function. Adopting the convention , the propagation at each layer is given by

$$y^l = \omega^l \ast x^{l-1} + \beta^l, \qquad x^l = \phi(y^l),$$

where $\beta^l$ is the tensor that repeats the biases at each spatial position. From now on, we refer to the propagated tensor $x^l$ as the signal.

Noisy propagation. Next we suppose that the input signal is corrupted by a small white noise tensor with independent and identically distributed components such that , with the Kronecker delta. The noisy signal is propagated into the same neural network and we keep track of the noise corruption with the tensor defined as and , with the neural network mapping from layer to . The simultaneous propagation of the signal and the noise is given by

$$y^l = \omega^l \ast x^{l-1} + \beta^l, \qquad x^l = \phi(y^l), \tag{1}$$

$$dy^l = \omega^l \ast dx^{l-1}, \qquad dx^l = \phi'(y^l) \odot dy^l, \tag{2}$$

where $\odot$ denotes the element-wise tensor multiplication and Eq. (2) is obtained by taking the derivative in Eq. (1). For given , the mapping from to induced by Eq. (2) is linear. The noise thus stays centered with respect to during propagation such that : .
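The simultaneous propagation of Eq. (1) and Eq. (2) can be sketched in a few lines for the fully-connected subcase. This is only an illustrative sketch under stated assumptions: the helper names (`init_layer`, `propagate`, and so on) are hypothetical, the convolution reduces to a matrix-vector product, weights are drawn with a He-style Gaussian of variance 2 / fan-in with zero biases, and the ReLU derivative is taken to be 0 at exactly 0.

```python
import math
import random

def relu(v):
    return [max(a, 0.0) for a in v]

def relu_grad(v):
    # Assumed convention: the derivative is 0 at exactly 0.
    return [1.0 if a > 0.0 else 0.0 for a in v]

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def init_layer(n_in, n_out, rng):
    # He-style Gaussian weights with variance 2 / n_in, zero biases (assumption).
    std = math.sqrt(2.0 / n_in)
    w = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b

def propagate(layers, x, dx):
    """Propagate the signal x (Eq. 1) and the noise dx (Eq. 2) together."""
    for w, b in layers:
        y = [a + bi for a, bi in zip(matvec(w, x), b)]   # y^l = w^l x^{l-1} + b^l
        dy = matvec(w, dx)                               # dy^l = w^l dx^{l-1}
        x = relu(y)                                      # x^l = phi(y^l)
        dx = [g * d for g, d in zip(relu_grad(y), dy)]   # dx^l = phi'(y^l) * dy^l
    return x, dx

rng = random.Random(0)
widths = [8, 16, 16, 16]
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(widths[:-1], widths[1:])]
x0 = [rng.gauss(0.0, 1.0) for _ in range(widths[0])]
dx0 = [1e-3 * rng.gauss(0.0, 1.0) for _ in range(widths[0])]
xl, dxl = propagate(layers, x0, dx0)
```

For a fixed input, the second output is linear in `dx0`: rescaling the input noise rescales the propagated noise by the same factor, which is the property used above to argue that the noise stays centered.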

To get rid of the dependence on , the random sensitivity tensor is introduced as the rescaling and . By linearity of Eq. (2), the sensitivity tensor is the result of the simultaneous propagation of and in Eq. (1) and (2). We also have and : . The sensitivity tensor encodes derivative information while avoiding the burden of increased dimensionality (see calculations in Appendix D.3 and Appendix D.2). This will prove very useful.

Scope. We require two very mild assumptions: (i) is not trivially such that (footnote: whenever and c are considered as random variables, they are assumed to be uniformly sampled among all spatial positions and all channels ); (ii) the width is bounded.

More importantly, we restrict ourselves to the most commonly used activation function, the ReLU $\phi(\cdot) = \max(\cdot, 0)$. Even though $\phi$ is not differentiable at $0$, we still define and as the result of the simultaneous propagation of Eq. (1) and Eq. (2) with the convention . Section 3 implicitly relies on the assumption that remains differentiable, implying , almost surely (a.s.) with respect to (details and justification of this assumption in Appendix D.1).

Note that fully-connected neural networks are still included in our analysis as the subcase .

## 3 Input data randomness

We now turn our attention to the distributions with respect to input data of signal and sensitivity , . To outline the importance of these distributions, we may express by layer composition the output of an -layer neural network as , with and the upper neural network mapping from layer to . It means that and can be seen as input signal and noise of this upper neural network. It also means that the distributions and , or equivalently , must remain well-behaved for this upper network to have a chance to accomplish its task successfully. We will return to this argument in Section 3.2 by detailing the penalizing effects of ill-behaved , .

### 3.1 Characterizing distributions with respect to input data

First we need to introduce the following statistical quantities to characterize , :

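• The feature map vector $\varphi(v,\alpha)$ and centered feature map vector $\hat{\varphi}(v,\alpha)$ associated with any random tensor $v$,

$$\varphi(v,\alpha) \equiv v_{\alpha,:}, \qquad \hat{\varphi}(v,\alpha) \equiv v_{\alpha,:} - \mathbb{E}_{v,\alpha}\big[v_{\alpha,:}\big],$$

where $\alpha$ is uniformly sampled among all spatial positions and the notations $\varphi(v,\alpha)$, $\hat{\varphi}(v,\alpha)$ recall the randomness of both $v$ and $\alpha$.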

In a similar vein to batch normalization and Xiao et al. (2018), feature maps and centered feature maps at different spatial positions and are implicitly seen as sub-signals propagated together as part of tensorial meta-signals. Statistically, we work at the granularity of sub-signals and we treat equally the randomness from and . (Footnote: in the context e.g. of , note that propagation from layer to solely depends on the randomness of , the randomness of being only involved in the selection of the final position at layer .) For simplicity, will still be referred to as input data.

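• The non-central moments $\nu_{p,c}(v)$, $\nu_p(v)$ and central moments $\mu_{p,c}(v)$, $\mu_p(v)$ of order $p$ associated with any random tensor $v$, per channel and averaged over channels,

$$\begin{aligned}
\nu_{p,c}(v) &\equiv \mathbb{E}_{v,\alpha}\big[\varphi(v,\alpha)_c^{\,p}\big], & \mu_{p,c}(v) &\equiv \mathbb{E}_{v,\alpha}\big[\hat{\varphi}(v,\alpha)_c^{\,p}\big], \\
\nu_{p}(v) &\equiv \mathbb{E}_{v,\alpha,c}\big[\varphi(v,\alpha)_c^{\,p}\big], & \mu_{p}(v) &\equiv \mathbb{E}_{v,\alpha,c}\big[\hat{\varphi}(v,\alpha)_c^{\,p}\big].
\end{aligned}$$

• The effective rank $r_{\mathrm{eff}}(v)$ associated with any random tensor $v$ (Vershynin, 2010),

$$r_{\mathrm{eff}}(v) \equiv \frac{\operatorname{Tr} C[\varphi(v,\alpha)]}{\big\|C[\varphi(v,\alpha)]\big\|} = \frac{\sum_i \lambda_i}{\max_i \lambda_i},$$

where $C[\varphi(v,\alpha)]$ denotes the covariance matrix, $\|\cdot\|$ the spectral norm, and $(\lambda_i)$ the eigenvalues of $C[\varphi(v,\alpha)]$. The effective rank is always at least $1$, and it intuitively measures the number of effective directions which concentrate the variance of $\varphi(v,\alpha)$.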

• The normalized sensitivity $\chi^l$ – our key metric – derived from the moments of $x^l$ and $s^l$,

$$\chi^l \equiv \left(\frac{\mu_2(s^l)\,\mu_2(x^0)}{\mu_2(x^l)}\right)^{1/2}. \tag{3}$$

To better understand the definition of $\chi^l$, let us consider the classification task. Since the goal is to set apart different signals, the mean signal over is uninformative. Centered feature maps thus constitute the informative part of the signal, and $\mu_2(x^0)$, $\mu_2(x^l)$ normalize by signal 'informative content' in Eq. (3). The normalized sensitivity exactly measures a root mean square sensitivity when neural network input and output signals and are rescaled to have unit signal 'informative content': and (proof in Appendix D.2). If we denote the rescaled input-output mapping as , then this property is simply expressed as in the fully-connected case with 1-dimensional rescaled input and output signals and (, ). We provide an illustration in Fig. 1.

The normalized sensitivity is directly connected to the Jacobian norm in the fully-connected case and to sensitivity to signal perturbation in the general case, known in turn for being connected to generalization (Sokolic et al., 2017; Arora et al., 2018; Morcos et al., 2018; Novak et al., 2018; Philipp & Carbonell, 2018). (Footnote: the coefficient defined in Philipp & Carbonell (2018) is equivalent to the normalized sensitivity in the fully-connected case. Section D.3 provides details on this equivalence and the reasons for our change of terminology.) The notion of sharpness, which quantifies sensitivity to weight perturbation, was similarly shown to be connected to generalization (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016; Smith & Le, 2017; Neyshabur et al., 2017). Sensitivity to signal perturbation and sensitivity to weight perturbation are tightly connected, since the introduction of a noise on the weights amounts to the introduction of a noise and on the signal in Eq. (1) and Eq. (2).
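The statistics of this section can be estimated from a finite sample of feature map vectors. The sketch below is a minimal illustration, not the paper's code: the function names are hypothetical, expectations are replaced by empirical averages over a list of feature map vectors, and the spectral norm of the covariance matrix is approximated by power iteration (an implementation choice made here to stay within the standard library).

```python
import math

def central_moment2(samples):
    """mu_2: variance per channel, averaged over channels.
    `samples` is a list of feature map vectors (one per input and position)."""
    n, width = len(samples), len(samples[0])
    means = [sum(s[c] for s in samples) / n for c in range(width)]
    total = sum((s[c] - means[c]) ** 2 for s in samples for c in range(width))
    return total / (n * width)

def covariance(samples):
    """Empirical covariance matrix of the feature map vectors."""
    n, width = len(samples), len(samples[0])
    means = [sum(s[c] for s in samples) / n for c in range(width)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / n
             for j in range(width)] for i in range(width)]

def effective_rank(cov, iters=200):
    """Trace over spectral norm; the top eigenvalue comes from power iteration."""
    width = len(cov)
    trace = sum(cov[i][i] for i in range(width))
    v = [1.0 / (i + 1) for i in range(width)]  # deterministic, generic start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(width)) for i in range(width)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            raise ValueError("zero covariance: effective rank is undefined")
        v = [a / norm for a in w]
    cv = [sum(cov[i][j] * v[j] for j in range(width)) for i in range(width)]
    top = sum(a * b for a, b in zip(v, cv))  # Rayleigh quotient of a unit vector
    return trace / top

def normalized_sensitivity(mu2_s, mu2_x0, mu2_xl):
    """Eq. (3): chi^l from second-order moments of sensitivity and signal."""
    return math.sqrt(mu2_s * mu2_x0 / mu2_xl)
```

Two sanity checks follow from the definitions: a signal whose variance lies along a single direction has effective rank 1, and a pure rescaling of the signal by a constant leaves the normalized sensitivity equal to 1.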

### 3.2 Distributional pathologies

Using these statistical tools, we are able to characterize the distributional pathologies – with ill-behaved , – that we will encounter:

1. Zero-dimensional signal: . To understand this pathology, let us consider the following mean vectors and rescaling of the signal:

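$$\nu^l \equiv \big(\nu_{1,c}(x^l)\big)_{1 \le c \le N_l}, \qquad \tilde{x}^l \equiv \frac{1}{\|\nu^l\|_2}\, x^l, \qquad \tilde{\nu}^l \equiv \big(\nu_{1,c}(\tilde{x}^l)\big)_{1 \le c \le N_l}.$$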

Then and the pathology implies (proof in Appendix D.4). It follows that becomes point-like concentrated at the point of unit norm. In the limit of strict point-like concentration, the upper neural network from layer to is limited to random guessing since it 'sees' all inputs the same and cannot distinguish between them.

2. One-dimensional signal: . This pathology implies that the variance of becomes concentrated in only one direction, and thus that becomes line-like concentrated. In the limit of strict line-like concentration, the upper neural network from layer to only 'sees' a single feature from .

3. Exploding sensitivity: . To understand this pathology, let us push further the view of noisy propagation with the signal-to-noise ratio and the noise factor :

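$$\mathrm{SNR}^l \equiv \frac{\mu_2(x^l)}{\mu_2(dx^l)}, \qquad F^l \equiv \frac{\mathrm{SNR}^0}{\mathrm{SNR}^l} = \frac{\mu_2(s^l)\,\mu_2(x^0)}{\mu_2(x^l)} = (\chi^l)^2, \tag{4}$$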

where Eq. (4) follows from and . In logarithmic decibel scale , indicating that measures how the neural network from layer to degrades () or enhances () the input signal-to-noise ratio. The pathology implies and , meaning that the noisy signal becomes completely random for any level of input noise . (Footnote: in this case, Eq. (2) eventually does not hold, since the noise at layer becomes non-infinitesimal even for infinitesimal input noise .) The upper neural network from layer to is then limited to random guessing.
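The chain of equalities in Eq. (4) can be spelled out as follows. This is a sketch under assumptions that are not fully legible in the text above: $\sigma^2$ is used here as an illustrative symbol for the common variance of the input noise components, the sensitivity is assumed to be the noise rescaled by $\sigma$, so that $\mu_2(dx^l) = \sigma^2 \mu_2(s^l)$ and $\mu_2(s^0) = 1$, and the decibel scale is taken to be the usual $10\log_{10}$ of a power ratio.

```latex
\begin{aligned}
F^l = \frac{\mathrm{SNR}^0}{\mathrm{SNR}^l}
  &= \frac{\mu_2(x^0)}{\mu_2(dx^0)} \cdot \frac{\mu_2(dx^l)}{\mu_2(x^l)}
   = \frac{\mu_2(x^0)}{\sigma^2\,\mu_2(s^0)} \cdot \frac{\sigma^2\,\mu_2(s^l)}{\mu_2(x^l)} \\
  &= \frac{\mu_2(s^l)\,\mu_2(x^0)}{\mu_2(x^l)} = (\chi^l)^2,
\qquad
10\log_{10} F^l = 10\log_{10}\mathrm{SNR}^0 - 10\log_{10}\mathrm{SNR}^l .
\end{aligned}
```

Under these assumptions the noise level $\sigma$ cancels, which is why the noise factor depends only on the network and on the input distribution.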

## 4 Model parameters randomness

We now introduce model parameters as the second source of randomness. We consider random networks at initialization, which we suppose is standard: (i) weights and biases are initialized following He et al. (2015); (ii) when pre-activations are batch-normalized, scale and shift batch normalization parameters are initialized with ones and zeros respectively.

Considering random networks at initialization is justified in two respects:

1. From a Bayesian perspective, the random distribution on the weights encodes a prior distribution on input-output mappings for a given choice of architecture (Neal, 1996a; Williams, 1997). This prior distribution can be combined with the likelihood of model parameters given training data in order to obtain a posterior distribution on input-output mappings, which in turn enables Bayesian predictions (Lee et al., 2017).

2. If distributional pathologies arise at initialization, it is likely that training will be difficult. Indeed if the distributional pathology is severe, it becomes very difficult for the neural network to adjust its parameters and undo the pathology. As an illustration, consider the pathology of zero-dimensional signal: . In this case, the upper neural network from layer to must adjust its bias parameters very precisely in order to center the signal and distinguish between different inputs . In support of this argument, several works have shown that deep neural networks are trainable precisely when they are not subject to the pathology of different inputs being strictly correlated at initialization, i.e. nearly indistinguishable (Schoenholz et al., 2016; Xiao et al., 2018).

From now on, our methodology is to consider all moment-related quantities, e.g. , , , , , , , as random variables depending on . We introduce the notation for the full set of parameters, and the notation for the conditional set of parameters when are considered as random and as given. We further denote the geometric increment .

Evolution with Depth. The evolution with depth of can be written as

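$$\begin{aligned}
\log \mu_2(x^l) - \log \mu_2(x^0)
 = {} & \sum_{k \le l} \underbrace{\log \mathbb{E}_{\theta^k}\big[\delta\mu_2(x^k)\big]}_{\overline{m}[\mu_2(x^k)]}
 + \sum_{k \le l} \underbrace{\mathbb{E}_{\theta^k}\big[\log \delta\mu_2(x^k)\big] - \log \mathbb{E}_{\theta^k}\big[\delta\mu_2(x^k)\big]}_{\underline{m}[\mu_2(x^k)]} \\
 & + \sum_{k \le l} \underbrace{\log \delta\mu_2(x^k) - \mathbb{E}_{\theta^k}\big[\log \delta\mu_2(x^k)\big]}_{\underline{s}[\mu_2(x^k)]},
\end{aligned} \tag{5}$$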

where Eq. (5) is obtained using and by expressing with telescoping terms. Denoting the multiplicatively centered increments, and using and , we obtain

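$$\overline{m}[\mu_2(x^k)] = \log \mathbb{E}_{\theta^k}\big[\delta\mu_2(x^k)\big], \tag{6}$$

$$\underline{m}[\mu_2(x^k)] = \mathbb{E}_{\theta^k}\big[\log \underline{\delta}\mu_2(x^k)\big], \tag{7}$$

$$\underline{s}[\mu_2(x^k)] = \log \underline{\delta}\mu_2(x^k) - \mathbb{E}_{\theta^k}\big[\log \underline{\delta}\mu_2(x^k)\big]. \tag{8}$$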

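Discussion. First we note that and are random variables which depend on , while is a random variable which depends on . We also note that $\underline{m}[\mu_2(x^k)] \le 0$ by log-concavity, and that $\underline{s}[\mu_2(x^k)]$ is centered: $\mathbb{E}_{\theta^k}\big[\underline{s}[\mu_2(x^k)]\big] = 0$.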

Under standard initialization, each channel provides an independent contribution to . As a consequence, for large the relative increment has low expected deviation to , meaning with high probability that , , . In addition, is centered and uncorrelated at different so its sum scales as $\sqrt{l}$, whereas the sums of and scale as $l$ (see Lemma 10 in Appendix E.1). The term is thus doubly negligible. In summary, the evolution with depth is dominated by when this term is non-vanishing and by otherwise. The same analysis can be applied to other positive moments such as , , . It can also be applied to the ratio and thus to .
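The three-term decomposition of Eq. (5) through Eq. (8) can be checked numerically on a toy model. The sketch below assumes, purely for illustration, that the geometric increments are independent lognormal variables exp(a + b z) with z standard Gaussian; the parameters `a`, `b` and the function `decompose` are hypothetical. In that case the three terms have closed forms, and their sums over layers reproduce the log of the product of increments.

```python
import math
import random

def decompose(a, b, z):
    """Split log(delta) for a lognormal increment delta = exp(a + b*z), z ~ N(0, 1).

    Returns (m_bar, m_under, s_under) in the sense of Eq. (6), (7), (8):
      m_bar   = log E[delta]                = a + b^2 / 2
      m_under = E[log delta] - log E[delta] = -b^2 / 2   (never positive)
      s_under = log delta - E[log delta]    = b * z      (centered)
    """
    return a + 0.5 * b * b, -0.5 * b * b, b * z

rng = random.Random(0)
depth, a, b = 50, 0.0, 0.1
total_log = 0.0                      # log of the product of realized increments
sum_m_bar = sum_m_under = sum_s_under = 0.0
for _ in range(depth):
    z = rng.gauss(0.0, 1.0)
    total_log += a + b * z
    m_bar, m_under, s_under = decompose(a, b, z)
    sum_m_bar += m_bar
    sum_m_under += m_under
    sum_s_under += s_under
```

In this toy model the deterministic terms grow linearly with depth, while the centered term is a sum of independent zero-mean variables and so grows only like the square root of depth, in line with the scaling argument above.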

Further notation. From now on, the geometric increment of any quantity is denoted with . The definitions of , and in Eq. (6), (7) and (8) are extended to other central and non-central moments of signal and sensitivity, as well as with , , .

We introduce the notation when , with , with high probability. And the notation when , with , with high probability. From now on, we assume that the width is large, implying and . We stress that this assumption is milder than the commonly used mean-field assumption of infinite width: . Indeed we still consider as random and as possibly non-Gaussian, whereas mean-field would consider as non-random and as Gaussian.

## 5 Vanilla Networks

We are fully equipped to characterize neural network architectures at initialization. We start by analyzing vanilla networks corresponding to the equations of propagation introduced in Section 2.

###### Theorem 1.

Moments of vanilla networks. (proof in Appendix E.2) There exist positive constants , , , , random variables , and events with probability such that under : and are centered and

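$$\begin{aligned}
\log \nu_2(x^l) &= -l\, m_l + \sqrt{l}\, s_l + \log \nu_2(x^0), & m_{\min} \le m_l &\le m_{\max}, \\
\log \mu_2(s^l) &= -l\, m'_l + \sqrt{l}\, s'_l, & m_{\min} \le m'_l &\le m_{\max},
\end{aligned}$$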

Discussion. First we discuss the conditionality on which is necessary to exclude the collapse and (with undefined and ), occurring e.g. when all elements of are strictly negative (Lu et al., 2018). The complementary event has probability exponentially small in the width due to

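$$-\mathbb{P}_{\Theta^l}\big[A_l^c\big] \simeq \log\big(1 - \mathbb{P}_{\Theta^l}[A_l^c]\big) = \log \mathbb{P}_{\Theta^l}[A_l] \ge \sum_{k=1}^{l} \log\big(1 - 2^{-N_k+1}\big) \simeq -\sum_{k=1}^{l} 2^{-N_k+1},$$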

implying . The conditionality on thus has highly negligible effect.
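To get a feeling for how small this probability is, the bound can be evaluated for concrete widths. The sketch below is illustrative only: the function names are hypothetical, and it reads the exponent in the bound as $-N_k + 1$, which is an interpretation of the displayed formula.

```python
import math

def log_prob_no_collapse_bound(widths):
    """Lower bound on log P[A_l]: sum over layers of log(1 - 2^(-N_k + 1)).
    Each width must be at least 2, otherwise the logarithm is undefined."""
    return sum(math.log1p(-2.0 ** (1 - n)) for n in widths)

def collapse_prob_bound(widths):
    """Upper bound on P[A_l^c] implied by the lower bound on log P[A_l]."""
    return -math.expm1(log_prob_no_collapse_bound(widths))

# Hypothetical example: 200 layers of width 128.
bound_deep_wide = collapse_prob_bound([128] * 200)
```

`math.log1p` and `math.expm1` are used so that the tiny probabilities involved are not lost to rounding when subtracted from 1.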

Next we discuss the evolution of and under . The particularity of the initialization of He et al. (2015) is to keep stable and during propagation. To do so, it enforces and such that , vanish in Eq. (5) and , are subject to a slow diffusion with small negative drift terms: , , and small diffusion terms: , (see Section E.3 for details).

As evidenced by Fig. 1(a) and Fig. 1(b), the small negative drift and diffusion terms cause slowly decreasing negative expectation and increasing variance of , . The decreasing negative expectation and increasing variance of , can be seen as opposing forces which compensate in order to keep stable , during propagation. Indeed Fig. 1(a) and Fig. 1(b) show that , are nearly Gaussian, implying that and are nearly lognormal. And the expectation of a lognormal variable with and is equal to . Note that the diffusion happens in log-space since layer composition amounts to a multiplicative random effect in real space. Also note that this is a finite-width effect since the terms also vanish in the limit of infinite width.
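The compensation between negative drift and growing variance can be illustrated with the standard lognormal identity: if X is Gaussian with mean m and variance v, then exp(X) has expectation exp(m + v / 2). The sketch below assumes a hypothetical per-layer log-variance `v` with per-layer drift `-v / 2`, values chosen for illustration rather than taken from the paper, and checks that the expectation in real space stays at 1 while the log-space distribution drifts and spreads.

```python
import math
import random

def lognormal_mean(mean_log, var_log):
    """Expectation of exp(X) for X ~ N(mean_log, var_log)."""
    return math.exp(mean_log + 0.5 * var_log)

# Hypothetical per-layer diffusion in log-space: variance v, drift -v / 2.
v = 0.01
means_by_depth = {depth: lognormal_mean(-0.5 * v * depth, v * depth)
                  for depth in (1, 10, 100)}

# Monte Carlo check at depth 50: sample the accumulated log-increment directly.
rng = random.Random(0)
depth, n_samples = 50, 20000
samples = [math.exp(rng.gauss(-0.5 * v * depth, math.sqrt(v * depth)))
           for _ in range(n_samples)]
mc_mean = sum(samples) / n_samples
```

The closed-form values in `means_by_depth` are all equal to 1 up to rounding, and the Monte Carlo estimate `mc_mean` is close to 1 up to sampling error.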

###### Theorem 2.

Normalized Sensitivity increments of vanilla networks. (proof in Appendix E.4) Denote and . The dominating term under in the evolution of is

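$$\delta\chi^l \simeq \exp\big(\overline{m}_{\mathrm{vanilla}}[\chi^l]\big) = \left(1 - \mathbb{E}_{c,\theta^l | A_{l-1}}\left[\frac{\nu_{1,c}(y^{l,+})\,\nu_{1,c}(y^{l,-})}{\mu_2(x^{l-1})}\right]\right)^{-1/2}. \tag{9}$$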

Discussion. An immediate consequence is that , i.e. that normalized sensitivity always increases with depth for random ReLU vanilla networks.

Now let us show that Theorem 1 and Theorem 2 only allow two possibilities of evolution, which both converge to distributional pathology:

1. If sensitivity is exploding: , then by definition of . Given that and that both and are limited to the slow diffusion of Theorem 1, this qualitatively means that becomes . Rigorously, if has an exponential drift stronger than diffusion and if we assume that and are lognormally-distributed as supported by Fig. 2, then a.s. (proof in Appendix E.5). Consequently, this case leads to double pathologies: exploding sensitivity and zero-dimensional signal.

2. Otherwise, geometric increments are strongly limited and we may assume . If we further assume that moments of the unit-variance rescaled signal are bounded, then Theorem 2 implies the convergence to the pathology of one-dimensional signal: (proof in Appendix E.6), as well as the convergence to neural network pseudo-linearity, where each layer becomes arbitrarily well approximated by a linear function (proof in Appendix E.7).

Experimental verification. The evolution with depth of vanilla networks is shown in Fig. 3. Of the possibilities (i) and (ii), it is case (ii) that occurs, with subexponential normalized sensitivity: , the convergence to the pathology of one-dimensional signal: , and to neural network pseudo-linearity. A typical symptom of neural network pseudo-linearity is that signals 'cross' the ReLU on the same side. Our analysis offers a novel insight into this coactivation phenomenon, previously observed experimentally (Balduzzi et al., 2017).

## 6 Batch-normalized feedforward nets

Next we incorporate batch normalization (Ioffe & Szegedy, 2015). For simplicity, we only consider the test mode which consists in subtracting and dividing by for each channel c in . The propagation is given by

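$$y^l = \omega^l \ast x^{l-1} + \beta^l, \qquad z^l = \mathrm{BN}(y^l), \qquad x^l = \phi(z^l), \tag{10}$$

$$t^l = \omega^l \ast s^{l-1}, \qquad u^l = \mathrm{BN}'(y^l) \odot t^l, \qquad s^l = \phi'(z^l) \odot u^l, \tag{11}$$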

where $\mathrm{BN}$ denotes batch normalization. Note that Eq. (10) and (11) explicitly formulate a finer-grained subdivision of three different steps between layers $l-1$ and $l$ in the simultaneous propagation of .
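The batch normalization and nonlinearity steps of Eq. (10) and Eq. (11) can be sketched as follows. This is a minimal illustration with hypothetical function names: it assumes that the per-channel statistics are the empirical mean and standard deviation over the provided feature map vectors, and that in test mode these statistics are treated as constants, so that the derivative of batch normalization is a division by the per-channel standard deviation.

```python
import math

def batch_norm_stats(y):
    """Per-channel mean and standard deviation of pre-activations.
    `y` is a list of feature map vectors (one per input and position)."""
    n, width = len(y), len(y[0])
    mean = [sum(v[c] for v in y) / n for c in range(width)]
    std = [math.sqrt(sum((v[c] - mean[c]) ** 2 for v in y) / n)
           for c in range(width)]
    return mean, std

def bn_step(y, t):
    """z^l = BN(y^l) and u^l = BN'(y^l) * t^l, with statistics held fixed.
    Assumes every channel has non-zero variance."""
    mean, std = batch_norm_stats(y)
    z = [[(v[c] - mean[c]) / std[c] for c in range(len(v))] for v in y]
    u = [[s[c] / std[c] for c in range(len(s))] for s in t]
    return z, u

def relu_step(z, u):
    """x^l = phi(z^l) and s^l = phi'(z^l) * u^l for the ReLU."""
    x = [[max(a, 0.0) for a in v] for v in z]
    s = [[b if a > 0.0 else 0.0 for a, b in zip(v, w)] for v, w in zip(z, u)]
    return x, s

# Tiny worked example: two feature map vectors with two channels each.
y = [[0.0, 1.0], [2.0, 5.0]]
t = [[1.0, 1.0], [1.0, 1.0]]
z, u = bn_step(y, t)
x, s = relu_step(z, u)
```

In the example, the channel with the larger spread has its sensitivity divided by a larger standard deviation, which is the per-projection alteration of magnitude that the batch normalization term accounts for.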

###### Theorem 3.

Normalized Sensitivity increments of batch-normalized feedforward nets. (proof in Appendix F.1) The dominating term in the evolution of $\chi^l$ can be decomposed as the sum of a term $\overline{m}_{\mathrm{BN}}[\chi^l]$ due to batch normalization and a term $\overline{m}_{\phi}[\chi^l]$ due to the nonlinearity $\phi$:

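$$\exp\big(\overline{m}_{\mathrm{BN}}[\chi^l]\big) = \left(\frac{\mu_2(s^{l-1})}{\mu_2(x^{l-1})}\right)^{-1/2} \mathbb{E}_{c,\theta^l}\left[\frac{\mu_{2,c}(t^l)}{\mu_{2,c}(y^l)}\right]^{1/2}, \tag{12}$$

$$\exp\big(\overline{m}_{\phi}[\chi^l]\big) = \Big(1 - 2\,\mathbb{E}_{c,\theta^l}\big[\nu_{1,c}(z^{l,+})\,\nu_{1,c}(z^{l,-})\big]\Big)^{-1/2}, \tag{13}$$

$$\delta\chi^l \simeq \exp\big(\overline{m}_{\mathrm{BN\text{-}FF}}[\chi^l]\big) = \exp\big(\overline{m}_{\mathrm{BN}}[\chi^l] + \overline{m}_{\phi}[\chi^l]\big). \tag{14}$$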

Effect of batch normalization. The term of Eq. (12) corresponds to the evolution of from at layer to just after . To understand this term qualitatively, the pre-activation tensor can be seen as random projections of , while batch normalization can be seen as an alteration of the magnitude for each projection. Given that batch normalization uses