Characterizing Well-Behaved vs. Pathological Deep Neural Networks
We introduce a novel approach, requiring only mild assumptions, for the characterization of deep neural networks at initialization. Our approach applies both to fully-connected and convolutional networks and easily incorporates batch normalization and skip-connections. Our key insight is to consider the evolution with depth of statistical moments of signal and noise, thereby characterizing the presence or absence of pathologies in the hypothesis space encoded by the choice of hyperparameters. We establish: (i) for feedforward networks, with and without batch normalization, the multiplicativity of layer composition inevitably leads to ill-behaved moments and pathologies; (ii) for residual networks with batch normalization, on the other hand, skip-connections induce power-law rather than exponential behaviour, leading to well-behaved moments and no pathology.
Machine Learning, ICML
The feverish pace of practical applications has led in recent years to many advances in neural network architectures, initialization and regularization. At the same time, theoretical research has not been able to keep up. In particular, there is still no mature theory able to validate the full set of hyperparameter choices leading to state-of-the-art performance. This is unfortunate since such a theory could also serve as a guide towards further improvement.
Amidst the research aimed at building this theory, an important branch has focused on networks at initialization. Due to the randomness of model parameters at initialization, characterizing networks at that time can be seen as characterizing the hypothesis space of input-output mappings that will be favored or reachable during training, i.e. the inductive bias encoded by the choice of hyperparameters. This view has received strong experimental support, with well-behaved input-output mappings at initialization extensively found to be predictive of trainability and post-training performance (Schoenholz et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018; Philipp & Carbonell, 2018; Yang et al., 2019).
Yet, even this simplifying case of networks at initialization is challenging as it notably involves dealing with: (i) the complex interplay of the randomness from input data and from model parameters; (ii) the broad spectrum of potential pathologies; (iii) the finite number of units in each layer; (iv) the difficulty of incorporating convolutional layers, batch normalization and skip-connections. Complexities (i), (ii) typically lead to restricting to specific cases of input data and pathologies, e.g. exploding complexity of data manifolds (Poole et al., 2016; Raghu et al., 2017), exponential correlation or decorrelation of two data points (Schoenholz et al., 2017; Balduzzi et al., 2017; Xiao et al., 2018), exploding and vanishing gradients (Yang & Schoenholz, 2017; Philipp et al., 2018; Hanin, 2018; Yang et al., 2019), exploding and vanishing activations (Hanin & Rolnick, 2018). Complexity (iii) commonly leads to making simplifying assumptions, e.g. convergence to Gaussian processes for infinite width (Neal, 1996; Roux & Bengio, 2007; Lee et al., 2018; Matthews et al., 2018; Borovykh, 2018; Garriga-Alonso et al., 2019; Novak et al., 2019; Yang, 2019), “typical” activation patterns (Balduzzi et al., 2017). Finally, complexity (iv) most often leads to limiting the number of hard-to-model elements incorporated at a time. To the best of our knowledge, all attempts have thus far been limited in either their scope or their simplifying assumptions.
As the first contribution of this paper, we introduce a novel approach for the characterization of deep neural networks at initialization. This approach: (i) offers a unifying treatment of the broad spectrum of pathologies without any restriction on the input data; (ii) requires only mild assumptions; (iii) easily incorporates convolutional layers, batch normalization and skip-connections.
As the second contribution, we use this approach to characterize deep neural networks with the most common choices of hyperparameters. We identify the multiplicativity of layer composition as the driving force towards pathologies in feedforward networks: either with the neural network having its signal shrunk into a single point or line; or with the neural network behaving as a noise amplifier with sensitivity exploding with depth. In contrast, we identify the combined action of batch normalization and skip-connections as responsible for bypassing this multiplicativity and relieving pathologies in batch-normalized residual networks.
Our results can be fully reproduced with the source code available at https://github.com/alabatie/moments-dnns.
We start by formulating the propagation for neural networks with neither batch normalization nor skip-connections, which we refer to as vanilla nets. We will slightly adapt this formulation in Section 6 for batch-normalized feedforward nets and in Section 7 for batch-normalized resnets.
Clean Propagation. We first consider a random tensorial input , spatially -dimensional with extent in all spatial dimensions and channels. This input is fed into a -dimensional convolutional neural network with periodic boundary conditions, fixed spatial extent , and activation function . [Footnote 1: It is possible to relax the assumptions of periodic boundary conditions and constant spatial extent [B.5]. These assumptions, as well as the assumption of constant width in Section 7, are only made for simplicity of the analysis.] At each layer , we denote the number of channels or width, the convolutional spatial extent, the post-activations and pre-activations, the weights, and the biases. Later in our analysis, the model parameters , will be considered as random, but for now they are considered as fixed. At each layer, the propagation is given by
with the convolution and the tensor with repeated version of at each spatial position. From now on, we refer to the propagated tensor as the signal.
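As an illustration of the clean propagation, the following is a minimal NumPy sketch restricted to the fully-connected subcase. It assumes a ReLU activation and He-style Gaussian weights with zero biases; the function name `propagate_clean` and the chosen widths are hypothetical and are not taken from the released source code.

```python
import numpy as np

def propagate_clean(x0, weights, biases):
    """Propagate a batch through a fully-connected vanilla net.

    x0 has shape (batch, n_in); weights[l] has shape (n_out, n_in).
    Returns the list of signals: the input followed by each post-activation.
    """
    signals = [x0]
    x = x0
    for w, b in zip(weights, biases):
        y = x @ w.T + b          # pre-activations
        x = np.maximum(y, 0.0)   # post-activations (ReLU is an assumption)
        signals.append(x)
    return signals

rng = np.random.default_rng(0)
widths = [8, 16, 16]  # hypothetical layer widths
weights = [
    rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
    for n_in, n_out in zip(widths[:-1], widths[1:])
]
biases = [np.zeros(n_out) for n_out in widths[1:]]
signals = propagate_clean(rng.normal(size=(4, 8)), weights, biases)
```

Each entry of `signals` plays the role of the propagated signal at one layer, with the batch axis standing in for the randomness of the input data.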
Noisy Propagation. To make our setup more realistic, we next suppose that the input signal is corrupted by an input noise having small iid components such that , with and the Kronecker delta for multidimensional indices . We denote , with the neural network mapping from layer to , and we consider the simultaneous propagation of the signal and the noise . At each layer, this simultaneous propagation is given at first order by
with the element-wise tensor multiplication. The tensor resulting from the simultaneous propagation of in Eq. (1) and Eq. (2) approximates arbitrarily well the noise as [C.1]. For simplicity, we will keep the terminology of noise when referring to .
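The first-order noisy propagation can be sketched in the same fully-connected setting. The sketch below assumes a ReLU activation whose derivative is taken to be zero at the origin (an assumed convention), and checks the first-order noise against a finite-difference perturbation of the input; `propagate_noisy` is a hypothetical helper name.

```python
import numpy as np

def propagate_noisy(x0, dx0, weights, biases):
    """Simultaneously propagate a signal and a first-order noise."""
    x, dx = x0, dx0
    for w, b in zip(weights, biases):
        y = x @ w.T + b
        dy = dx @ w.T            # the bias does not act on the noise
        x = np.maximum(y, 0.0)
        dx = (y > 0.0) * dy      # relu'(y) * dy, with relu'(0) = 0 assumed
    return x, dx

rng = np.random.default_rng(0)
widths = [8, 32, 32]  # hypothetical layer widths
weights = [
    rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
    for n_in, n_out in zip(widths[:-1], widths[1:])
]
biases = [np.zeros(n_out) for n_out in widths[1:]]
x0 = rng.normal(size=(4, 8))
dx0 = rng.normal(size=(4, 8))

x_out, dx_out = propagate_noisy(x0, dx0, weights, biases)

# Finite-difference check: perturb the input by a small multiple of dx0.
eps = 1e-6
x_pert, _ = propagate_noisy(x0 + eps * dx0, dx0, weights, biases)
finite_difference = (x_pert - x_out) / eps
```

Because a ReLU network is piecewise linear, the finite difference agrees with the first-order noise as long as the perturbation does not move any pre-activation across zero.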
From Eq. (1) and Eq. (2), we see that , only depend on the input signal , and that depends linearly on the input noise when is fixed. As a consequence, stays centered with respect to such that : , where from now on the spatial position is denoted as and the channel as c.
Scope. We require two mild assumptions: (i) is not trivially zero: ; (ii) the width is bounded. [Footnote 2: Whenever and c are considered as random variables, they are assumed to be uniformly sampled among all spatial positions and all channels .]
Some results of our analysis will apply for any choice of , but unless otherwise stated, we restrict to the most common choice: . Even though is not differentiable at , we still define as the result of the simultaneous propagation of in Eq. (1) and Eq. (2) with the convention [C.2].
Note that fully-connected networks are included in our analysis as the subcase .
3 Data Randomness
Now we may turn our attention to the data distributions of signal and noise: , . To outline the importance of these distributions, the output of an -layer neural network can be expressed by layer composition as , with the mapping of the signal and noise by the upper neural network from layer to layer . The upper neural network thus receives as input signal and as input noise, implying that it can only have a chance to do any better than random guessing when: (i) is meaningful; (ii) is under control. Namely, when , are not affected by pathologies. We will make this argument as well as the notion of pathology more precise in Section 3.2 after a few prerequisite definitions.
3.1 Characterizing Data Distributions
– The feature map vector and centered feature map vector,
with the vectorial slice of at spatial position . Note that , aggregate both the randomness from which determines the propagation up to , and the randomness from which determines the spatial position in . These random vectors will enable us to circumvent the tensorial structure of .
– The non-central moment and central moment of order for given channel c and averaged over channels,
In the particular case of the noise , centered with respect to , feature map vectors and centered feature map vectors coincide: , such that non-central moments and central moments also coincide: and .
– The effective rank (Vershynin, 2010),
with the covariance matrix and the spectral norm. If we further denote the eigenvalues of , then . Intuitively, measures the number of effective directions which concentrate the variance of .
– The normalized sensitivity – our key metric – derived from the moments of and ,
To grasp the definition of , we may consider the signal-to-noise ratio and the noise factor ,
We obtain in logarithmic decibel scale, i.e. that measures how the neural network from layer to degrades () or enhances () the input signal-to-noise ratio. Neural networks with are noise amplifiers, while neural networks with are noise reducers.
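The definitions above can be sketched numerically in the fully-connected case, where each row of an array is one feature map vector. The sketch rests on our reading of the definitions: the effective rank is the trace of the covariance matrix divided by its spectral norm, the normalized sensitivity is the square root of the ratio of input to output signal-to-noise ratios, and the noise factor in decibels is 20 log10 of the normalized sensitivity. All function names are hypothetical.

```python
import numpy as np

def effective_rank(features):
    """Trace of the covariance divided by its largest eigenvalue."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / features.shape[0]
    eigvals = np.linalg.eigvalsh(cov)
    return eigvals.sum() / eigvals.max()

def signal_second_moment(x):
    """Central second moment per channel, averaged over channels."""
    return x.var(axis=0).mean()

def noise_second_moment(dx):
    """Second moment of the noise, which is centered by construction."""
    return np.mean(dx ** 2)

def normalized_sensitivity(x0, dx0, xl, dxl):
    """Square root of the ratio of input SNR to output SNR."""
    snr_in = signal_second_moment(x0) / noise_second_moment(dx0)
    snr_out = signal_second_moment(xl) / noise_second_moment(dxl)
    return np.sqrt(snr_in / snr_out)

def noise_factor_db(chi):
    """Noise factor in decibels (assumed relation: 20 * log10(chi))."""
    return 20.0 * np.log10(chi)
```

Under these assumptions, a constant rescaling of both signal and noise yields a normalized sensitivity of one, while a signal confined to a line yields an effective rank of one.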
Now, to justify our choice of terminology, let us reason in the case where is the output signal at the final layer. Then: (i) the variance is typically constrained by the task (e.g. binary classification constrains to be roughly equal to ); (ii) the constant rescaling leads to the same constrained variance: . The normalized sensitivity exactly measures the excess root mean square sensitivity of the neural network mapping relative to the constant rescaling [C.3]. This property is illustrated in Fig. 1.
As outlined, measures the sensitivity to signal perturbation, which is known for being connected to generalization (Rifai et al., 2011; Arpit et al., 2017; Sokolic et al., 2017; Arora et al., 2018; Morcos et al., 2018; Novak et al., 2018; Philipp & Carbonell, 2018). A tightly connected notion is the sensitivity to weight perturbation, also known for being connected to generalization (Hochreiter & Schmidhuber, 1997; Langford & Caruana, 2002; Keskar et al., 2017; Chaudhari et al., 2017; Smith & Le, 2018; Dziugaite & Roy, 2017; Neyshabur et al., 2017, 2018; Li et al., 2018). The connection is seen by noting the equivalence between a noise on the weights and a noise and on the signal in Eq. (1) and Eq. (2).
3.2 Characterizing Pathologies
We are now able to characterize the pathologies, with ill-behaved data distributions, , , that we will encounter:
– Zero-dimensional signal: . To understand this pathology, let us consider the following mean vectors and rescaling of the signal:
The pathology implies , meaning that becomes point-like concentrated at the point of unit norm: [C.4]. In the limit of strict point-like concentration, the upper neural network from layer to is limited to random guessing since it “sees” all inputs the same and cannot distinguish between them.
– One-dimensional signal: . This pathology implies that the variance of becomes concentrated in a single direction, meaning that becomes line-like concentrated. In the limit of strict line-like concentration, the upper neural network from layer to only “sees” a single feature from .
– Exploding sensitivity: for some . Given the noise factor equivalence of Eq. (4), the pathology implies , meaning that the clean signal becomes drowned in the noise . In the limit of strictly zero signal-to-noise ratio, the upper neural network from layer to is limited to random guessing since it only “sees” noise.
4 Model Parameters Randomness
We now introduce model parameters as the second source of randomness. We consider networks at initialization, which we suppose is standard following He et al. (2015): (i) weights are initialized with , biases are initialized with zeros; (ii) when pre-activations are batch-normalized, scale and shift batch normalization parameters are initialized with ones and zeros respectively.
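As a quick numerical illustration of this initialization, the sketch below samples fully-connected weights from a centered Gaussian with variance 2 / fan_in, as in He et al. (2015), and checks that a ReLU layer approximately preserves the second moment of its input. The helper name `he_normal` and the layer sizes are hypothetical.

```python
import numpy as np

def he_normal(rng, n_out, fan_in):
    """Sample weights from N(0, 2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(n_out, fan_in))

rng = np.random.default_rng(0)
n_in, n_out = 512, 8192        # hypothetical layer sizes
x = rng.normal(size=n_in)      # a single fixed input
w = he_normal(rng, n_out, n_in)

y = w @ x                      # pre-activations (zero biases)
out = np.maximum(y, 0.0)       # ReLU post-activations

# Ratio of output to input second moments; close to 1 for wide layers.
ratio = np.mean(out ** 2) / np.mean(x ** 2)
```

The factor 2 in the weight variance compensates for the ReLU zeroing out, on average, half of the second moment of the symmetric pre-activations.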
Considering networks at initialization is justified in two respects. As the first justification, in the context of Bayesian neural networks, the distribution on model parameters at initialization induces a distribution on input-output mappings which can be seen as the prior encoded by the choice of hyperparameters (Neal, 1996; Williams, 1997).
As the second justification, even in the standard context of non-Bayesian neural networks, it is likely that pathologies at initialization penalize training by hindering optimization. Let us illustrate this argument in two cases:
– In the case of zero-dimensional signal, the upper neural network from layer to must adjust its bias parameters very precisely in order to center the signal and distinguish between different inputs. This case – further associated with vanishing gradients for bounded (Schoenholz et al., 2017) – is known as the “ordered phase” with unit correlation between different inputs, resulting in untrainability (Schoenholz et al., 2017; Xiao et al., 2018).
– In the case of exploding sensitivity, the upper neural network from layer to only “sees” noise and its backpropagated gradient is purely noise. Gradient descent then performs random steps and training loss is not decreased. This case – further associated with exploding gradients for batch-normalized or bounded (Schoenholz et al., 2017) – is known as the “chaotic phase” with decorrelation between different inputs, also resulting in untrainability (Schoenholz et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018; Philipp & Carbonell, 2018; Yang et al., 2019).
From now on, our methodology is to consider all moment-related quantities, e.g. , , , , , , as random variables which depend on model parameters. We denote the model parameters as and use as shorthand for . We further denote the geometric increments of as .
Evolution with Depth. The evolution with depth of can be written as
where we used and expressed with telescoping terms. Denoting the multiplicatively centered increments of , we get [C.5]
Discussion. We directly note that: (i) and are random variables which depend on , while is a random variable which depends on ; (ii) by log-concavity; (iii) is centered with and .
We further note that each channel provides an independent contribution to , implying for large that has low expected deviation to and that , , with high probability. The term is thus dominating as long as it is not vanishing. The same reasoning applies to other positive moments, e.g. , .
Further Notation. From now on, the geometric increment of any quantity is denoted with . The definitions of , and in Eq. (5), (6) and (7) are extended to other positive moments of signal and noise, as well as with
We introduce the notation when with , with high probability. And the notation when with , with high probability. From now on, we assume that the width is large, implying
We stress the layer-wise character of this approximation, whose validity only requires , independently of the depth . This contrasts with the aggregated character (up to layer ) of the mean field approximation of as a Gaussian process, whose validity requires not only but also – as we will see – that the depth remains sufficiently small with respect to .
5 Vanilla Nets
We are fully equipped to characterize deep neural networks at initialization. We start by analyzing vanilla nets which correspond to the propagation introduced in Section 2.
Theorem 1 (moments of vanilla nets).
[D.3] There exist small constants , random variables , and events , of probabilities equal to such that
Discussion. The conditionality on , is necessary to exclude the collapse: , , with undefined , , occurring e.g. when all elements of are strictly negative (Lu et al., 2018). In practice, this conditionality is highly negligible since the probabilities of the complementary events , decay exponentially in the width [D.4].
Now let us look at the evolution of , under , . The initialization of He et al. (2015) enforces and such that: (i) , are kept stable during propagation; (ii) , vanish and , are subject to a slow diffusion with small negative drift terms: , , and small diffusion terms: , [D.5]. [Footnote 4: Any deviation from He et al. (2015) leads, on the other hand, to pathologies orthogonal to the pathologies of Section 3.2, with either exploding or vanishing constant scalings of .] The diffusion happens in log-space since layer composition amounts to a multiplicative random effect in real space. It is a finite-width effect since the terms , , , also vanish for infinite width.
Fig. 2 illustrates the slowly decreasing negative expectation and slowly increasing variance of , , caused by the small negative drift and diffusion terms. Fig. 2 also indicates that , are nearly Gaussian, implying that , are nearly lognormal. Two important insights are then provided by the expressions of the expectation: and the kurtosis: of a lognormal variable with . Firstly, the decreasing negative expectation and increasing variance of , act as opposing forces in order to ensure the stabilization of , . Secondly, , are stabilized only in terms of expectation and they become fat-tailed distributed as .
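The two insights can be checked with the standard closed-form expressions for a lognormal variable exp(X), where X is Gaussian with mean mu and standard deviation sigma. The function names in this short sketch are hypothetical.

```python
import math

def lognormal_mean(mu, sigma):
    """Expectation of exp(X) for X ~ N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

def lognormal_kurtosis(sigma):
    """Kurtosis of exp(X) for X ~ N(mu, sigma^2); independent of mu."""
    s2 = sigma ** 2
    return math.exp(4 * s2) + 2 * math.exp(3 * s2) + 3 * math.exp(2 * s2) - 3

# A negative drift of -sigma^2 / 2 exactly offsets the variance growth,
# so the expectation stays at 1 while the tails grow fatter.
stabilized_means = [lognormal_mean(-0.5 * s ** 2, s) for s in (0.1, 0.5, 1.0)]
kurtoses = [lognormal_kurtosis(s) for s in (0.1, 0.5, 1.0)]
```

The expectation stays constant across the three values of sigma, whereas the kurtosis grows rapidly, which is the fat-tail effect described above.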
Theorem 2 (normalized sensitivity increments of vanilla nets).
[D.6] Denoting , the dominating term under in the evolution of is
Discussion. A first consequence is that always increases with depth. Another consequence is that only two evolutions are possible, and both lead to pathologies:
– If sensitivity is exploding: with exponential drift stronger than the slow diffusion of Theorem 1 and if , are lognormally distributed as supported by Fig. 2, then Theorem 1 implies the a.s. convergence to the pathology of zero-dimensional signal: [D.7].
– Otherwise, geometric increments are strongly limited. In the limit , if the moments of remain bounded, then Theorem 2 implies the convergence to the pathology of one-dimensional signal: [D.8] and the convergence to pseudo-linearity, with each additional layer becoming arbitrarily well approximated by a linear mapping [D.9].
Experimental Verification. The evolution with depth of vanilla nets is shown in Fig. 3. From the two possibilities, we observe the case with limited geometric increments: , the convergence to the pathology of one-dimensional signal: , and the convergence to pseudo-linearity.
The only way that the neural network can achieve pseudo-linearity is by having each one of its units either always active or always inactive, i.e. behaving either as zero or as the identity. Our analysis offers theoretical insight into this coactivation phenomenon, previously observed experimentally (Balduzzi et al., 2017; Philipp et al., 2018).
6 Batch-Normalized Feedforward Nets
Next we incorporate batch normalization (Ioffe & Szegedy, 2015), which we denote as . For simplicity, we only consider the test mode which consists in subtracting and dividing by for each channel c in . The propagation is given by
Theorem 3 (normalized sensitivity increments of batch-normalized feedforward nets).
[E.1] The dominating term in the evolution of can be decomposed as
Effect of Batch Normalization. The batch normalization term is such that , with defined as the increment of in the convolution and batch normalization steps of Eq. (9) and Eq. (10). The expression of holds for any choice of .
This term can be understood intuitively by seeing the different channels c in as random projections of and batch normalization as a modulation of the magnitude for each projection. Since batch normalization uses as normalization factor, directions of high signal variance are dampened, while directions of low signal variance are amplified. This preferential exploration of low signal directions naturally deteriorates the signal-to-noise ratio and amplifies owing to the noise factor equivalence of Eq. (4).
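This intuition can be illustrated with a toy example. The sketch below assumes two channels with very different signal variances and an iid noise of equal magnitude in both; per-channel normalization by the signal statistics then amplifies the noise in the low-variance channel and degrades the overall signal-to-noise ratio. The helper names and the numerical values are hypothetical.

```python
import numpy as np

def batch_norm_test_mode(y, dy, eps=1e-12):
    """Normalize each channel with the signal statistics.

    The noise is only rescaled, since subtracting a constant mean
    does not affect a first-order perturbation.
    """
    mean = y.mean(axis=0, keepdims=True)
    std = y.std(axis=0, keepdims=True)
    return (y - mean) / (std + eps), dy / (std + eps)

def snr(signal, noise):
    """Ratio of channel-averaged signal variance to noise second moment."""
    return signal.var(axis=0).mean() / np.mean(noise ** 2)

rng = np.random.default_rng(0)
# Channel 0 has high signal variance, channel 1 has low signal variance.
y = rng.normal(size=(10000, 2)) * np.array([10.0, 0.1])
dy = 1e-3 * rng.normal(size=(10000, 2))   # iid noise in both channels

z, dz = batch_norm_test_mode(y, dy)
snr_before, snr_after = snr(y, dy), snr(z, dz)
```

After normalization both channels carry unit signal variance, but the noise in the low-variance channel has been multiplied by roughly ten, so `snr_after` is much smaller than `snr_before`.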
Now let us look directly at in Theorem 3. If we define the event under which the vectorized weights in channel c have norm equal to : , then spherical symmetry implies that variance increments in channel c from to and from to have equal expectation under :
On the other hand, the variance of these increments depends on the fluctuation of signal and noise in the random direction generated by . This depends on the conditioning of signal and noise, i.e. on the magnitude of , . If we assume that is well-conditioned, then can be treated as a constant and by convexity of the function :
which in turn implies . The worse the conditioning of , i.e. the smaller