# Large Deviation Analysis of Function Sensitivity in Random Deep Neural Networks

## Abstract

Mean field theory has been successfully used to analyze deep neural networks (DNN) in the infinite size limit. Given the finite size of realistic DNN, we use large deviation theory and path integral analysis to study the deviation of functions represented by DNN from their typical mean field solutions. The parameter perturbations investigated include weight sparsification (dilution) and binarization, which are commonly used in model simplification, for both ReLU and sign activation functions. We find that random networks with ReLU activation are more robust to parameter perturbations than their counterparts with sign activation, which arguably is reflected in the simplicity of the functions they generate.

Keywords: large deviation theory, path integral, deep neural networks, function sensitivity

## 1 Introduction

Learning machines realized by deep neural networks (DNN) have achieved impressive success in various machine learning tasks, such as speech recognition, image classification and natural language processing [LeCun2015]. While DNN typically have numerous parameters and their training comes at a high computational cost, their applications have been extended to devices with limited memory or computational resources, such as mobile devices, thanks to compressed networks and reduced parameter precision [Cheng2018]. In most supervised learning scenarios, a DNN represents some input-output mapping that is inferred from input-output example patterns. DNN parameter estimation (training) aims at obtaining a network that approximates the underlying mapping well. Despite their profound engineering success, a comprehensive understanding of the intrinsic working mechanism [Zeiler2014, Yosinski2015] and the generalization ability [Chiyuan2017, Chaudhari2017, Neyshabur2017, Bartlett2017] of DNN is still lacking. The difficulty in analyzing DNN is due to the recursive nonlinear mapping between layers they implement and the coupling to data and learning dynamics.

A recent line of research uses mean field theory from statistical physics to investigate various DNN characteristics, such as expressive power [Poole2016], Gaussian process-like behaviors of wide DNN [Duvenaud2014, Daniely2016, Lee2018], dynamical stability in layer propagation and its impact on weight initialization [Schoenholz2017, Yang2017, Pretorius2018], and function similarity and entropy in the function space [BoLi2018]. By assuming large layer width and random weights, such techniques harness the specific type of nonlinearity used and the many degrees of freedom to provide valuable analytical insights. The Gaussian process perspective on infinitely wide DNN also facilitates the analysis of training dynamics and generalization by employing established kernel methods [Jacot2018, Arora2019].

To study the entropy of functions realized by DNN [BoLi2018], we adopted similar assumptions but employed generating functional analysis [Mozeika2009, Mozeika2010], which is more general and can be applied to sparse and weight-correlated networks. The analysis of the function error incurred by weight perturbations exhibits an exponential growth in error for DNN with sign activation functions, while networks with ReLU activation functions are more robust to perturbations. We have also found that ReLU activation induces correlations among variables in random convolutional networks [BoLi2018]. The robustness of random networks with ReLU activation is related to the simplicity of the functions they compute [Valle-Perez2018, Palma2019], which may converge to a constant function in the large depth and width limit [Pretorius2018], although, in principle, they admit high capacity with arbitrary weights. However, DNN used in practice are of finite size and finite depth; it is therefore essential to analyze the deviation of finite-size systems from the typical mean field behavior, and to characterize its rate of convergence with increasing size. An example of a recent study along these lines [Antognini2019] investigates the deviation in performance of finite size neural networks with a single hidden layer from the Gaussian process behavior.

In this work, we adopt the large deviation approach and the path integral formalism of [BoLi2018] to derive the deviation of function sensitivity of finite systems from their infinite system counterparts, which is applicable to a range of DNN structures. We analyze the effect of sparsifying (diluting) and binarizing DNN weights, which are commonly used for model simplification [LeCun1989, Hubara2016, Rastegari2016, Hou2017]. Although the dependence on data and training is not considered, the analysis of random DNN provides valuable insights and baseline comparisons. We will also investigate the sensitivity of functions to input perturbation [Poole2016, Schoenholz2017], which is related to function complexity and generalization [Franco2006, Novak2018, Valle-Perez2018, Palma2019]. The paper is organized as follows. In Sec. 2 and 3, we introduce the random DNN model and review the basic results of generating functional analysis, respectively. In Sec. 4 and 5, we derive the large deviation of function sensitivity to weight and input perturbations, respectively, based on the path integral formalism. Finally, in Sec. 6, we discuss the results and their implications.

## 2 The model

Following [BoLi2018], we consider two coupled fully-connected DNN. One of them serves as the reference function under consideration, and the other as its perturbed counterpart, either in the weights or in the input variables. As shown in Fig. 1, each network consists of layers indexed by $l = 0, 1, \dots, L$, where $l = 0$ is the input layer; layer $l$ has $N_l$ neurons, which can be layer dependent. The reference network is parameterized by the weight variables $\hat{w}$, while the perturbed network is parameterized by $w$. Similarly, all variables with a circumflex are associated with the reference network. In the following, $w^l$ represents the weight matrix at layer $l$, and $w^l_i$ represents the $N_{l-1}$-dimensional weight vector of the $i$th perceptron at layer $l$. Denoting the input dimension as $N$, we assume the sizes of all layers scale linearly with $N$ as $N_l = \alpha_l N$.

A deterministic feed-forward network is defined by the recursive mapping

$$
h^l_i = \frac{1}{\sqrt{N_{l-1}}} \sum_{j=1}^{N_{l-1}} w^l_{ij} s^{l-1}_j, \tag{1}
$$

$$
s^l_i = \phi^l(h^l_i), \tag{2}
$$

where $w^l_{ij}$ are the weights, $h^l_i$ and $s^l_i$ are the pre-activation field and post-activation variable, respectively, and $\phi^l$ is the activation/transfer function at layer $l$. The scaling factor $1/\sqrt{N_{l-1}}$ in Eq. (1) is introduced for normalization. We primarily focus on networks with either sign or ReLU activation functions in the hidden layers, and consider binary input and output variables by applying the sign activation function at the output layer for a fair comparison across architectures. The resulting feed-forward DNN implements a Boolean mapping $\{-1, 1\}^{N_0} \to \{-1, 1\}^{N_L}$, where each output node computes a Boolean function. In the following, we call the two architectures sign-DNN and relu-DNN respectively, keeping in mind that the sign activation function is always applied in the output layer.
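
As a minimal sketch of the deterministic mapping in Eqs. (1) and (2), the following NumPy code propagates a binary input through a random network. The function name `forward`, the convention that a zero field maps to $+1$, the layer width and the weight scale `sigma_w` are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def forward(weights, s0, hidden="relu"):
    """Propagate s0 through the deterministic network of Eqs. (1)-(2).

    weights : list of (N_l x N_{l-1}) matrices for layers l = 1..L
    hidden  : "relu" or "sign" activation for the hidden layers;
              the output layer always uses the sign activation.
    """
    s = np.asarray(s0, dtype=float)
    n_layers = len(weights)
    for l, w in enumerate(weights, start=1):
        h = w @ s / np.sqrt(s.size)              # Eq. (1), 1/sqrt(N_{l-1}) scaling
        if l == n_layers or hidden == "sign":
            s = np.where(h >= 0, 1.0, -1.0)      # sign activation (assumed: 0 -> +1)
        else:
            s = np.maximum(h, 0.0)               # ReLU activation
    return s

rng = np.random.default_rng(0)
N, L = 200, 4                                    # illustrative sizes
sigma_w = np.sqrt(2.0)                           # assumed weight scale
weights = [sigma_w * rng.standard_normal((N, N)) for _ in range(L)]
s0 = rng.choice([-1.0, 1.0], size=N)             # binary input

out_relu = forward(weights, s0, hidden="relu")   # relu-DNN output
out_sign = forward(weights, s0, hidden="sign")   # sign-DNN output
print(out_relu[:8], out_sign[:8])
```

Both calls return vectors with entries in $\{-1, 1\}$, since the sign activation is applied at the output layer in either architecture.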

To facilitate a path integral calculation, we consider stochastic dynamics between successive layers. For a layer with sign activation function, the activation is disturbed by thermal noise according to the probability

$$
P(s^l_i | h^l_i(w^l, s^{l-1})) = \frac{\exp\left(\beta s^l_i h^l_i(w^l, s^{l-1})\right)}{2\cosh\left(\beta h^l_i(w^l, s^{l-1})\right)}, \tag{3}
$$

while for the ReLU activation function, $s^l_i$ is disturbed by additive Gaussian noise

$$
P(s^l_i | h^l_i(w^l, s^{l-1})) = \sqrt{\frac{\beta}{2\pi}} \exp\left\{ -\frac{\beta}{2} \left[ s^l_i - \phi(h^l_i(w^l, s^{l-1})) \right]^2 \right\}. \tag{4}
$$

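In the limit $\beta \to \infty$, we recover the deterministic model. The evolution of the two systems follows the joint distribution

$$
P(\{\hat{s}^l_i, s^l_i\}) = P(\hat{s}^0, s^0) \prod_{l=1}^{L} \prod_{i=1}^{N_l} P(\hat{s}^l_i | \hat{h}^l_i(\hat{w}^l, \hat{s}^{l-1})) \, P(s^l_i | h^l_i(w^l, s^{l-1})). \tag{5}
$$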
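
The noisy activations of Eqs. (3) and (4) can be sampled directly. The sketch below is only an illustration: the helper names `sample_sign_unit` and `sample_relu_unit` are hypothetical. It uses the identity $e^{\beta h} / (2\cosh \beta h) = \tfrac{1}{2}(1 + \tanh \beta h)$ for the probability of $s = +1$, and the fact that Eq. (4) is a Gaussian with mean $\phi(h)$ and variance $1/\beta$.

```python
import numpy as np

def sample_sign_unit(h, beta, rng):
    """Draw s in {-1, +1} with P(s|h) = exp(beta*s*h) / (2*cosh(beta*h)), Eq. (3)."""
    h = np.asarray(h, dtype=float)
    p_plus = 0.5 * (1.0 + np.tanh(beta * h))     # P(s = +1 | h)
    return np.where(rng.random(h.shape) < p_plus, 1.0, -1.0)

def sample_relu_unit(h, beta, rng):
    """Draw s from a Gaussian with mean relu(h) and variance 1/beta, Eq. (4)."""
    h = np.asarray(h, dtype=float)
    return np.maximum(h, 0.0) + rng.standard_normal(h.shape) / np.sqrt(beta)

rng = np.random.default_rng(0)
h = np.array([-1.0, 0.5, 2.0])

# A very large beta approximates the deterministic limit.
s_sign = sample_sign_unit(h, beta=1e6, rng=rng)
s_relu = sample_relu_unit(h, beta=1e6, rng=rng)
print(s_sign, s_relu)
```

For large `beta` the samples coincide with `sign(h)` and approach `relu(h)`, which is the deterministic limit mentioned in the text.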

To probe the difference between the functions implemented by the two networks, we feed the same single input to the two systems such that $\hat{s}^0 = s^0$, and study the resulting output difference due to parameter perturbation. For continuous weight variables, one useful choice for the weight perturbation is

$$
w^l_{ij} = \sqrt{1 - (\eta^l)^2} \, \hat{w}^l_{ij} + \eta^l \delta w^l_{ij}, \tag{6}
$$

which ensures that $w^l_{ij}$ has the same variance as $\hat{w}^l_{ij}$ as long as $\delta w^l_{ij}$ follows the same distribution as $\hat{w}^l_{ij}$, and effectively rotates the high dimensional weight vector $\hat{w}^l_i$ by an angle $\theta^l = \sin^{-1} \eta^l$, as demonstrated schematically in Fig. 2.
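
A quick numerical check of Eq. (6), under the assumption of independent zero-mean Gaussian weights and perturbations: the perturbed vector keeps the variance of the reference vector, and the angle between the two is close to $\sin^{-1}\eta$ in high dimensions. The dimension, seed and parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma_w, eta = 20000, 1.0, 0.3                   # illustrative values

w_ref = sigma_w * rng.standard_normal(N)            # reference weight vector
delta = sigma_w * rng.standard_normal(N)            # independent perturbation, same law
w = np.sqrt(1.0 - eta**2) * w_ref + eta * delta     # Eq. (6)

cos_angle = w @ w_ref / (np.linalg.norm(w) * np.linalg.norm(w_ref))
angle = np.arccos(cos_angle)

print("variances:", np.var(w_ref), np.var(w))       # both close to sigma_w**2
print("angle:", angle, "arcsin(eta):", np.arcsin(eta))
```

The agreement improves with the dimension `N`, because the inner products concentrate around their means.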

In probing the sensitivity of a function to input perturbations, the weights of the two networks are kept the same and a fixed fraction of the input variables is flipped randomly. The resulting output difference of the two systems reflects the sensitivity and complexity of the underlying DNN.
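
The input perturbation can be sketched as follows. The helper `flip_inputs` is a hypothetical name, and flipping exactly `round(fraction * N)` randomly chosen entries is one possible reading of "a fixed fraction"; under that reading the input overlap equals $1 - 2 \times \text{fraction}$.

```python
import numpy as np

def flip_inputs(s0, fraction, rng):
    """Return a copy of s0 with a fixed fraction of entries sign-flipped at random."""
    s = np.array(s0, dtype=float, copy=True)
    n_flip = int(round(fraction * s.size))
    idx = rng.choice(s.size, size=n_flip, replace=False)   # distinct positions
    s[idx] *= -1.0
    return s

rng = np.random.default_rng(3)
s0 = rng.choice([-1.0, 1.0], size=1000)
fraction = 0.1
s0_flipped = flip_inputs(s0, fraction, rng)

q0 = np.mean(s0 * s0_flipped)        # overlap between the two inputs
print(q0, 1.0 - 2.0 * fraction)
```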

## 3 Generating functional analysis for typical behavior

Viewing the weights as quenched random variables, a generating functional analysis has been proposed [BoLi2018] to derive the typical behavior of DNN. It starts with computing the disorder-averaged generating functional

$$
\overline{\Gamma}(\hat{\psi}, \psi) = \mathbb{E}_{\hat{w}, w} \mathbb{E}_{\hat{s}, s} \exp\Big( -i \sum_{l,i} \big( \hat{\psi}^l_i \hat{s}^l_i + \psi^l_i s^l_i \big) \Big), \tag{7}
$$

where the average is taken with respect to the joint probability Eq. (5). Assume the layer widths are the same, $N_l = N$ for all $l$. Upon averaging over the disorder $\hat{w}, w$, the generating functional can be expressed through a set of macroscopic order parameters, such as the overlaps $q^l = \frac{1}{N} \sum_i \hat{s}^l_i s^l_i$ and magnetizations, as

$$
\overline{\Gamma} = \int \{dq \, dQ \dots\} \exp\left[ N \Psi(q, Q, \dots) \right], \tag{8}
$$

where $Q$ is the conjugate variable of the order parameter $q$. In the large system size limit $N \to \infty$, the generating functional is dominated by the saddle point of the potential function $\Psi$. It gives rise to typical overlaps that dominate in probability, which facilitates analytical studies of random DNN.

Assume the weight perturbation follows the form of Eq. (6), and that the weights and perturbations are independent of each other and follow a Gaussian distribution $\mathcal{N}(0, \sigma_w^2)$. It is found that for a layer with sign activation function in the limit $\beta \to \infty$, the overlap evolves as [BoLi2018]

$$
q^l = \frac{2}{\pi} \sin^{-1}\left( \sqrt{1 - (\eta^l)^2} \, q^{l-1} \right), \quad 1 \le l \le L. \tag{9}
$$

Similarly, for the ReLU activation function in the deterministic limit, if the weight standard deviation is chosen as $\sigma_w = \sqrt{2}$, the magnitude of the activations remains stable and the overlap evolves as

$$
q^l = \frac{1}{\pi} \left\{ \sqrt{1 - \left[ 1 - (\eta^l)^2 \right] (q^{l-1})^2} + \sqrt{1 - (\eta^l)^2} \, q^{l-1} \left[ \frac{\pi}{2} + \sin^{-1}\left( \sqrt{1 - (\eta^l)^2} \, q^{l-1} \right) \right] \right\}, \tag{10}
$$

while the output layer follows Eq. (9) due to the use of the sign activation function. The restriction $\hat{s}^0 = s^0$ leads to $q^0 = 1$ in both cases.
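
The layer-wise maps of Eqs. (9) and (10) are easy to iterate numerically. The sketch below assumes a layer-independent perturbation strength `eta` and starts from $q^0 = 1$; the function names are illustrative. The clipping only guards against floating-point round-off pushing the argument of `arcsin` slightly outside $[-1, 1]$.

```python
import numpy as np

def overlap_sign(q_prev, eta):
    """One layer of Eq. (9): sign activation in the deterministic limit."""
    x = np.clip(np.sqrt(1.0 - eta**2) * q_prev, -1.0, 1.0)
    return 2.0 / np.pi * np.arcsin(x)

def overlap_relu(q_prev, eta):
    """One layer of Eq. (10): ReLU activation in the deterministic limit."""
    x = np.clip(np.sqrt(1.0 - eta**2) * q_prev, -1.0, 1.0)
    return (np.sqrt(1.0 - x**2) + x * (np.pi / 2 + np.arcsin(x))) / np.pi

def typical_overlaps(L, eta, hidden="relu"):
    """Overlaps q^0..q^L; the output layer always follows the sign map, Eq. (9)."""
    q = [1.0]                                    # q^0 = 1 for a shared input
    for l in range(1, L + 1):
        step = overlap_sign if (hidden == "sign" or l == L) else overlap_relu
        q.append(float(step(q[-1], eta)))
    return q

q_sign = typical_overlaps(L=5, eta=0.1, hidden="sign")
q_relu = typical_overlaps(L=5, eta=0.1, hidden="relu")
print("sign-DNN:", np.round(q_sign, 4))
print("relu-DNN:", np.round(q_relu, 4))
```

With these settings the output overlap of the relu-DNN stays well above that of the sign-DNN, in line with the robustness statement made in the introduction.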

## 4 Large deviations in parameter sensitivity of functions

The generating functional analysis above gives the typical behavior of random DNN in the limit $N \to \infty$. However, practical DNN always have finite sizes. Therefore, it is worthwhile to understand the deviation from the most probable behavior at finite $N$. In the following, we adopt large deviation analysis to tackle this problem. An introduction to large deviation theory and its application to statistical mechanics can be found in [Touchette2009]. In essence, a continuous observable $O$ in a system of size $N$ (assumed to be large) is said to satisfy the large deviation principle if the probability of finding $O$ in the interval $[x, x + dx]$ follows

$$
\mathrm{Prob}_N(O \in [x, x + dx]) \simeq e^{-N I(x)} dx, \tag{11}
$$

where $I(x)$ is the rate function of the observable. It implies that the probability density of $O$ scales as $e^{-N I(x)}$, which is concentrated at the minimum of the rate function in large systems; the profile of $I(x)$ quantifies the fluctuation of the observable.
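
To make Eq. (11) concrete, consider a standard textbook case that is not taken from the paper: the empirical mean of $N$ independent $\pm 1$ variables with mean $m$. Cramér's theorem gives its rate function as the Legendre transform of the log moment generating function $\log(\cosh t + m \sinh t)$, which has a closed form. The sketch below compares the two; the function names and the grid parameters are illustrative assumptions.

```python
import numpy as np

def rate_closed_form(x, m):
    """Rate function of the mean of i.i.d. +/-1 variables with mean m (|x|, |m| < 1)."""
    return (0.5 * (1 + x) * np.log((1 + x) / (1 + m))
            + 0.5 * (1 - x) * np.log((1 - x) / (1 - m)))

def rate_legendre(x, m, t_max=10.0, n_grid=200001):
    """Same rate function from sup_t [t*x - log E exp(t*s)], maximized on a grid."""
    t = np.linspace(-t_max, t_max, n_grid)
    log_mgf = np.log(np.cosh(t) + m * np.sinh(t))
    return float(np.max(t * x - log_mgf))

m = 0.2
for x in (0.2, 0.5, -0.3):
    print(x, rate_closed_form(x, m), rate_legendre(x, m))
```

The rate function vanishes at the typical value $x = m$ and grows away from it, which is the behavior described above for a generic observable.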

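In this work, the overlap of the output layer is the focus of our study. The path integral techniques adopted in the generating functional framework [BoLi2018] can be adapted to tackle the large deviation analysis. We start with computing the probability density

$$
\begin{aligned}
P(q^L) &= \left\langle \delta\Big( \frac{1}{N_L} \sum_i \hat{s}^L_i s^L_i - q^L \Big) \right\rangle \\
&= \mathbb{E}_{\hat{w}, w} \mathrm{Tr}_{\hat{s}, s} \, P(\hat{s}^0) \prod_{i=1}^{N_0} \delta_{s^0_i, \hat{s}^0_i} \prod_{l=1}^{L} P(\hat{s}^l | \hat{w}^l, \hat{s}^{l-1}) P(s^l | w^l, s^{l-1}) \, \delta\Big( \frac{1}{N_L} \sum_i \hat{s}^L_i s^L_i - q^L \Big),
\end{aligned} \tag{12}
$$

where the operation $\mathrm{Tr}$ is understood as an integration or summation depending on the nature of the variables. The input distribution follows $P(\hat{s}^0) = \prod_{i=1}^{N_0} P(\hat{s}^0_i)$. To deal with the non-linearity of the pre-activation fields in the conditional probability, we introduce auxiliary fields through the integral representation of the delta function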

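$$
1 = \int_{-\infty}^{\infty} \frac{d\hat{h}^l_i \, d\hat{x}^l_i}{2\pi} e^{i \hat{x}^l_i \left( \hat{h}^l_i - \frac{1}{\sqrt{N_{l-1}}} \sum_j \hat{w}^l_{ij} \hat{s}^{l-1}_j \right)}, \qquad
1 = \int_{-\infty}^{\infty} \frac{dh^l_i \, dx^l_i}{2\pi} e^{i x^l_i \left( h^l_i - \frac{1}{\sqrt{N_{l-1}}} \sum_j w^l_{ij} s^{l-1}_j \right)}, \tag{13}
$$

which allows us to express the quenched random variables $\hat{w}$ and $w$ linearly in the exponents, leading to

$$
\begin{aligned}
P(q^L) = {} & \mathbb{E}_{\hat{w}, w} \mathrm{Tr}_{\hat{s}, s} \, \delta\Big( \frac{1}{N_L} \sum_i \hat{s}^L_i s^L_i - q^L \Big) \prod_{i=1}^{N_0} P(\hat{s}^0_i) \delta_{s^0_i, \hat{s}^0_i} \int \prod_{l=1}^{L} \prod_{i=1}^{N_l} \frac{d\hat{h}^l_i \, d\hat{x}^l_i}{2\pi} \frac{dh^l_i \, dx^l_i}{2\pi} \\
& \times \exp\left[ \sum_{l=1}^{L} \sum_{i=1}^{N_l} \left( \log P(\hat{s}^l_i | \hat{h}^l_i) + \log P(s^l_i | h^l_i) + i \hat{x}^l_i \hat{h}^l_i + i x^l_i h^l_i \right) \right] \\
& \times \exp\left[ -\sum_{l=1}^{L} \frac{i}{\sqrt{N_{l-1}}} \sum_{i=1}^{N_l} \sum_{j=1}^{N_{l-1}} \left( \hat{w}^l_{ij} \hat{x}^l_i \hat{s}^{l-1}_j + w^l_{ij} x^l_i s^{l-1}_j \right) \right].
\end{aligned} \tag{14}
$$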

Assuming self-averaging [Dominicis1978], we exchange the order of summation and integration, and first carry out the average over the disorder variables. Specifically, we consider the weights of the reference network to be independent and to follow a Gaussian distribution $\mathcal{N}(0, \sigma_w^2)$ as before, and we consider three types of perturbations:

1. rotation of the weight vector following Eq. (6);

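2. sparsification of the weight matrix by randomly dropping connections with probability $p^l$ and rescaling the remaining weights by $1/\sqrt{1 - p^l}$ to ensure the same weight strength

$$
w^l_{ij} = \begin{cases} 0, & \text{with probability } p^l, \\ \frac{1}{\sqrt{1 - p^l}} \hat{w}^l_{ij}, & \text{with probability } 1 - p^l, \end{cases} \tag{15}
$$

3. binarization of the weight elements

$$
w^l_{ij} = \mathrm{sgn}(\hat{w}^l_{ij}) \, \sigma_w, \tag{16}
$$

where $\sigma_w$ is introduced for keeping the variance of $w^l_{ij}$ the same as that of $\hat{w}^l_{ij}$.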

### 4.1 Macroscopic order parameters

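For perturbation of type (i), the disorder average of the third line of Eq. (14) yields

$$
\prod_{l,i} \exp\left\{ -\sigma_w^2 \left[ \frac{1}{2} (\hat{x}^l_i)^2 \frac{\sum_j (\hat{s}^{l-1}_j)^2}{N_{l-1}} + \frac{1}{2} (x^l_i)^2 \frac{\sum_j (s^{l-1}_j)^2}{N_{l-1}} + \sqrt{1 - (\eta^l)^2} \, \hat{x}^l_i x^l_i \frac{\sum_j \hat{s}^{l-1}_j s^{l-1}_j}{N_{l-1}} \right] \right\}. \tag{17}
$$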
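
The Gaussian average behind Eq. (17) can be spelled out for a single weight pair $(\hat{w}^l_{ij}, \delta w^l_{ij})$. The shorthands $a = \hat{x}^l_i \hat{s}^{l-1}_j / \sqrt{N_{l-1}}$, $b = x^l_i s^{l-1}_j / \sqrt{N_{l-1}}$ and $c = \sqrt{1 - (\eta^l)^2}$ are introduced here only for this illustration. Substituting Eq. (6) and using $\mathbb{E}\, e^{-i u z} = e^{-\sigma_w^2 u^2 / 2}$ for a zero-mean Gaussian $z$ of variance $\sigma_w^2$, one obtains

$$
\begin{aligned}
\mathbb{E}_{\hat{w}, \delta w} \exp\left\{ -i \left[ \hat{w}^l_{ij} (a + c\, b) + \eta^l \delta w^l_{ij}\, b \right] \right\}
&= \exp\left\{ -\frac{\sigma_w^2}{2} \left[ (a + c\, b)^2 + (\eta^l)^2 b^2 \right] \right\} \\
&= \exp\left\{ -\sigma_w^2 \left[ \frac{1}{2} a^2 + \frac{1}{2} b^2 + c\, a b \right] \right\},
\end{aligned}
$$

where the last step uses $c^2 + (\eta^l)^2 = 1$. Taking the product over $j$ reproduces the three sums in the exponent of Eq. (17).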

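To decouple Eqs. (14) and (17) over sites, we introduce three sets of order parameters by inserting the identities

$$
\begin{aligned}
1 &= \int \frac{d\hat{V}^l \, d\hat{v}^l}{2\pi / N_l} e^{i N_l \hat{V}^l \left[ \hat{v}^l - \frac{1}{N_l} \sum_j (\hat{s}^l_j)^2 \right]}, \qquad
1 = \int \frac{dV^l \, dv^l}{2\pi / N_l} e^{i N_l V^l \left[ v^l - \frac{1}{N_l} \sum_j (s^l_j)^2 \right]}, \\
1 &= \int \frac{dQ^l \, dq^l}{2\pi / N_l} e^{i N_l Q^l \left[ q^l - \frac{1}{N_l} \sum_j \hat{s}^l_j s^l_j \right]}, \qquad \forall l \ne L,
\end{aligned} \tag{18}
$$

and by expressing the output constraint as

$$
\delta\Big( \frac{1}{N_L} \sum_{i=1}^{N_L} \hat{s}^L_i s^L_i - q^L \Big) = \int \frac{dQ^L}{2\pi / N_L} e^{i N_L Q^L \left[ q^L - \frac{1}{N_L} \sum_j \hat{s}^L_j s^L_j \right]}. \tag{19}
$$

Upon introducing these macroscopic order parameters, Eq. (17) becomes $\prod_{l,i} \exp\big[ -\tfrac{1}{2} (\hat{x}^l_i, x^l_i) \, \Sigma_l \, (\hat{x}^l_i, x^l_i)^\top \big]$ with the covariance matrix

$$
\Sigma_l := \sigma_w^2 \begin{bmatrix} \hat{v}^{l-1} & \sqrt{1 - (\eta^l)^2} \, q^{l-1} \\ \sqrt{1 - (\eta^l)^2} \, q^{l-1} & v^{l-1} \end{bmatrix}. \tag{20}
$$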

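The probability density in Eq. (14) involves identical integrations and summations for all sites within each layer $l$, which can be performed individually [BoLi2018], yielding

$$
\begin{aligned}
P(q^L) = {} & \int \frac{dQ^L}{2\pi / N_L} \prod_{l=0}^{L-1} \frac{d\hat{V}^l \, d\hat{v}^l}{2\pi / N_l} \frac{dV^l \, dv^l}{2\pi / N_l} \frac{dQ^l \, dq^l}{2\pi / N_l} \\
& \times e^{\sum_{l=0}^{L-1} N_l \left( i \hat{V}^l \hat{v}^l + i V^l v^l + i Q^l q^l \right) + N_L i Q^L q^L} \, e^{-N_0 \left( i \hat{V}^0 + i V^0 + i Q^0 \right)} \\
& \times \prod_{l=1}^{L-1} \left[ \int dH^l \frac{e^{-\frac{1}{2} (H^l)^\top \Sigma_l^{-1} H^l}}{\sqrt{(2\pi)^2 |\Sigma_l|}} \mathrm{Tr}_{\hat{s}^l, s^l} P(\hat{s}^l | \hat{h}^l) P(s^l | h^l) \, e^{-i \hat{V}^l (\hat{s}^l)^2 - i V^l (s^l)^2 - i Q^l \hat{s}^l s^l} \right]^{N_l} \\
& \times \left[ \int dH^L \frac{e^{-\frac{1}{2} (H^L)^\top \Sigma_L^{-1} H^L}}{\sqrt{(2\pi)^2 |\Sigma_L|}} \mathrm{Tr}_{\hat{s}^L, s^L} P(\hat{s}^L | \hat{h}^L) P(s^L | h^L) \, e^{-i Q^L \hat{s}^L s^L} \right]^{N_L},
\end{aligned} \tag{21}
$$

where we have integrated out the auxiliary fields $\hat{x}^l_i, x^l_i$ and introduced the field doublet $H^l := (\hat{h}^l, h^l)^\top$. The hidden-layer factors with $1 \le l \le L-1$ are written out explicitly here so that Eq. (21) agrees with the potential function in Eq. (30). We further write $P(q^L)$ as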

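$$
P(q^L) = \int \frac{dQ^L}{2\pi / N_L} \prod_{l=0}^{L-1} \frac{d\hat{V}^l \, d\hat{v}^l}{2\pi / N_l} \frac{dV^l \, dv^l}{2\pi / N_l} \frac{dQ^l \, dq^l}{2\pi / N_l} \exp\left[ -N \Phi(Q, q, \hat{V}, \hat{v}, V, v | q^L) \right], \tag{22}
$$

where $-N\Phi$ is equal to the logarithm of the integrand in Eq. (21). Similar to the analysis in [BoLi2018], the probability density is dominated by the saddle point of the potential function $\Phi$ in the large $N$ limit ($N_l = \alpha_l N$ with $\alpha_l$ as a constant)

$$
P(q^L) \approx \exp\left[ -N \Phi(Q^*, q^*, \dots | q^L) \right], \tag{23}
$$

where $I(q^L) = \Phi(Q^*, q^*, \dots | q^L)$ is the desired rate function.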
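
The concentration of $P(q^L)$ expressed by Eq. (23) can be observed in a small Monte Carlo experiment. The following is only a sketch and is not the numerical experiment of the paper: it draws fresh Gaussian weights for a pair of sign-DNN, perturbs them with Eq. (6), and records the output overlap. The widths, depth, `eta`, sample count and helper names are illustrative assumptions.

```python
import numpy as np

def sample_output_overlap(N, L, eta, rng, sigma_w=1.0):
    """One draw of q^L for a pair of sign-DNN with fresh random weights."""
    s_ref = rng.choice([-1.0, 1.0], size=N)      # shared input, so q^0 = 1
    s_per = s_ref.copy()
    for _ in range(L):
        w_ref = sigma_w * rng.standard_normal((N, N))
        dw = sigma_w * rng.standard_normal((N, N))
        w_per = np.sqrt(1.0 - eta**2) * w_ref + eta * dw    # Eq. (6)
        # The 1/sqrt(N) factor of Eq. (1) does not affect the sign, so it is omitted.
        s_ref = np.where(w_ref @ s_ref >= 0, 1.0, -1.0)
        s_per = np.where(w_per @ s_per >= 0, 1.0, -1.0)
    return float(np.mean(s_ref * s_per))

def typical_overlap(L, eta):
    """Mean field value of q^L from iterating Eq. (9) with q^0 = 1."""
    q = 1.0
    for _ in range(L):
        q = 2.0 / np.pi * np.arcsin(np.sqrt(1.0 - eta**2) * q)
    return q

L, eta, n_samples = 2, 0.2, 200
rng = np.random.default_rng(0)
stats = {}
for N in (50, 200):
    draws = np.array([sample_output_overlap(N, L, eta, rng) for _ in range(n_samples)])
    stats[N] = (draws.mean(), draws.std())
    print(f"N={N}: mean={stats[N][0]:.3f}, std={stats[N][1]:.3f}, "
          f"typical={typical_overlap(L, eta):.3f}")
```

The spread of the sampled overlaps shrinks as `N` grows while the mean stays close to the mean field value, which is what a rate function with a single minimum at the typical overlap implies.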

While this set-up is based on computing the deviation in function similarity with a single input $s^0$, one may argue that it requires testing on more than one input for obtaining a robust estimation, e.g.,

$$
\tilde{q}^L := \frac{1}{N_L M} \sum_{\mu=1}^{M} \sum_{i=1}^{N_L} \hat{s}^{L,\mu}_i s^{L,\mu}_i, \tag{24}
$$

where $M$ is the number of independent patterns used. Assuming that the representations of different patterns are uncorrelated, we show in Appendix C that for small $M$, the rate function $I(\tilde{q}^L)$ is approximately related to the single input case through a simple scaling

$$
I(\tilde{q}^L) \approx M \Phi(Q^*, q^*, \dots | \tilde{q}^L). \tag{25}
$$

This assumption is valid for sign-DNN but not for relu-DNN. We also confirm this scaling relation by numerical experiments (see below and in Appendix C).

### 4.2 Unifying three types of weight perturbations

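The other two types of perturbations can be treated similarly. For network sparsification, Eq. (15), the disorder average of Eq. (14) has the following form in the large $N$ limit (see Appendix A for details)

$$
\prod_{l,i} \exp\left\{ -\sigma_w^2 \left[ \frac{1}{2} (\hat{x}^l_i)^2 \frac{\sum_j (\hat{s}^{l-1}_j)^2}{N_{l-1}} + \frac{1}{2} (x^l_i)^2 \frac{\sum_j (s^{l-1}_j)^2}{N_{l-1}} + \sqrt{1 - p^l} \, \hat{x}^l_i x^l_i \frac{\sum_j \hat{s}^{l-1}_j s^{l-1}_j}{N_{l-1}} \right] \right\}, \tag{26}
$$

which has the same form as Eq. (17) when $\eta^l$ is replaced by $\sqrt{p^l}$. Introducing the same order parameters, we obtain the covariance of the fields $\hat{h}^l$ and $h^l$ in the form of

$$
\Sigma^s_l := \sigma_w^2 \begin{bmatrix} \hat{v}^{l-1} & \sqrt{1 - p^l} \, q^{l-1} \\ \sqrt{1 - p^l} \, q^{l-1} & v^{l-1} \end{bmatrix}. \tag{27}
$$

Hence, diluting connections with probability $p^l$ at layer $l$ in a random DNN corresponds to rotating each of the weight vectors by an angle $\theta^l = \sin^{-1} \sqrt{p^l}$.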

Similarly, for network binarization in Eq. (16), the disorder average of Eq. (14) yields (see Appendix B for details)

$$
\prod_{l,i} \exp\left\{ -\sigma_w^2 \left[ \frac{1}{2} (\hat{x}^l_i)^2 \frac{\sum_j (\hat{s}^{l-1}_j)^2}{N_{l-1}} + \frac{1}{2} (x^l_i)^2 \frac{\sum_j (s^{l-1}_j)^2}{N_{l-1}} + \sqrt{\frac{2}{\pi}} \, \hat{x}^l_i x^l_i \frac{\sum_j \hat{s}^{l-1}_j s^{l-1}_j}{N_{l-1}} \right] \right\}, \tag{28}
$$

which corresponds to the covariance matrix of the fields $\hat{h}^l$ and $h^l$ taking the form

$$
\Sigma^b_l := \sigma_w^2 \begin{bmatrix} \hat{v}^{l-1} & \sqrt{\frac{2}{\pi}} \, q^{l-1} \\ \sqrt{\frac{2}{\pi}} \, q^{l-1} & v^{l-1} \end{bmatrix}. \tag{29}
$$

Comparing to the type (i) perturbation, one finds that binarizing the weight elements in a random DNN corresponds to rotating each of the weight vectors by a fixed angle $\theta = \cos^{-1} \sqrt{2/\pi}$. This phenomenon has been observed in [Anderson2018] and is linked to the practical success of binary DNN. It is argued [Anderson2018] that $\theta$ is a very small angle in high dimensional spaces, where two randomly sampled vectors are typically orthogonal to each other; therefore weight binarization approximately preserves the directions of the high dimensional weight vectors, which contributes to the success of binary DNN.

Therefore, we establish that the three types of perturbations on random DNN can be unified in the same framework developed in Sec. 4.1.
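
The correspondence between the three perturbations can be checked numerically on a single high dimensional Gaussian weight vector. The sketch below applies Eqs. (15) and (16) and compares the cosine of the angle to the reference vector with $\sqrt{1 - p}$ and $\sqrt{2/\pi}$, respectively; the dimension, seed and dilution probability are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
N, sigma_w, p = 50000, 1.0, 0.3                  # illustrative values
w_ref = sigma_w * rng.standard_normal(N)         # reference weight vector

# Eq. (15): drop each weight with probability p, rescale survivors by 1/sqrt(1-p).
keep = rng.random(N) >= p
w_sparse = np.where(keep, w_ref / np.sqrt(1.0 - p), 0.0)

# Eq. (16): keep only the sign of each weight, scaled by sigma_w.
w_binary = sigma_w * np.sign(w_ref)

cos_sparse = cosine(w_ref, w_sparse)
cos_binary = cosine(w_ref, w_binary)
print("dilution:    ", cos_sparse, "vs sqrt(1-p)  =", np.sqrt(1.0 - p))
print("binarization:", cos_binary, "vs sqrt(2/pi) =", np.sqrt(2.0 / np.pi))
print("binarization angle in degrees:", np.degrees(np.arccos(cos_binary)))
```

Both perturbations also preserve the weight variance up to sampling fluctuations, so they differ from the rotation of Eq. (6) only in how the effective angle is set.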

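For networks with a generic activation function, the large deviation potential function can be expressed as

$$
\begin{aligned}
\Phi = {} & -\alpha_0 \left[ i \hat{V}^0 (\hat{v}^0 - 1) + i V^0 (v^0 - 1) + i Q^0 (q^0 - 1) \right] - \sum_{l=1}^{L-1} \alpha_l \left( i \hat{V}^l \hat{v}^l + i V^l v^l + i Q^l q^l \right) \\
& - \alpha_L i Q^L q^L - \sum_{l=1}^{L} \alpha_l \log \int d\hat{h}^l \, dh^l \, \mathrm{Tr}_{\hat{s}^l, s^l} M_l(\hat{s}^l, s^l, \hat{h}^l, h^l),
\end{aligned} \tag{30}
$$

$$
M_l(\hat{s}^l, s^l, \hat{h}^l, h^l) := \frac{e^{-\frac{1}{2} (H^l)^\top \Sigma_l^{-1} H^l}}{\sqrt{(2\pi)^2 |\Sigma_l|}} P(\hat{s}^l | \hat{h}^l) P(s^l | h^l) \, e^{-i \hat{V}^l (\hat{s}^l)^2 - i V^l (s^l)^2 - i Q^l \hat{s}^l s^l}, \quad 1 \le l < L, \tag{31}
$$

$$
M_L(\hat{s}^L, s^L, \hat{h}^L, h^L) := \frac{e^{-\frac{1}{2} (H^L)^\top \Sigma_L^{-1} H^L}}{\sqrt{(2\pi)^2 |\Sigma_L|}} P(\hat{s}^L | \hat{h}^L) P(s^L | h^L) \, e^{-i Q^L \hat{s}^L s^L}, \tag{32}
$$

where $\alpha_0 = 1$ since $N_0 = N$. The output-layer term carries the factor $\alpha_L$ and the output-layer measure $M_L$ contains only the conjugate variable $Q^L$, in agreement with Eqs. (21) and (38).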

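Setting the derivatives with respect to the conjugate order parameters $\hat{V}^l$, $V^l$, $Q^l$ to zero yields the saddle point equations

$$
\hat{v}^0 = v^0 = 1, \quad q^0 = 1, \tag{33}
$$

$$
\hat{v}^l = \left\langle (\hat{s}^l)^2 \right\rangle_{M_l}, \quad v^l = \left\langle (s^l)^2 \right\rangle_{M_l}, \quad 1 \le l \le L - 1, \tag{34}
$$

$$
q^l = \frac{\int d\hat{h}^l \, dh^l \, \mathrm{Tr}_{\hat{s}^l, s^l} (\hat{s}^l s^l) M_l(\hat{s}^l, s^l, \hat{h}^l, h^l)}{\int d\hat{h}^l \, dh^l \, \mathrm{Tr}_{\hat{s}^l, s^l} M_l(\hat{s}^l, s^l, \hat{h}^l, h^l)} = \left\langle \hat{s}^l s^l \right\rangle_{M_l}, \quad 1 \le l \le L, \tag{35}
$$

in which $M_l$ bears the meaning of an effective measure [Coolen2001]. Notice that $q^L$ is an input parameter imposing a nonlinear end point constraint on $Q^L$, which differs from the generating functional analysis calculation of typical behaviors [BoLi2018], where $q^L$ is a dynamical variable and $Q^L = 0$ at the saddle point.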

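Setting $\partial \Phi / \partial q^{l-1}$ to zero yields the saddle point equations for the conjugate order parameters

$$
i Q^{l-1} = -\frac{\alpha_l}{\alpha_{l-1}} \frac{\int d\hat{h}^l \, dh^l \, \mathrm{Tr}_{\hat{s}^l, s^l} \frac{\partial}{\partial q^{l-1}} M_l(\hat{s}^l, s^l, \hat{h}^l, h^l)}{\int d\hat{h}^l \, dh^l \, \mathrm{Tr}_{\hat{s}^l, s^l} M_l(\hat{s}^l, s^l, \hat{h}^l, h^l)}, \quad 1 \le l \le L, \tag{36}
$$

where the overall minus sign follows from differentiating Eq. (30) with respect to $q^{l-1}$ and is consistent with the explicit sign-DNN result in Eq. (41).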

Similar relations hold for $i\hat{V}^{l-1}$ and $iV^{l-1}$. While the conjugate order parameters are defined on the real axis, they can be extended to the complex plane and evaluated on the imaginary axis in the saddle point approximation, in which case $i\hat{V}^l$, $iV^l$ and $iQ^l$ are real variables. Other observables can be computed by resorting to the effective measure once the saddle point is obtained, e.g., the mean activations are given by [Coolen2001]

$$
\hat{m}^l = \left\langle \hat{s}^l \right\rangle_{M_l}, \quad m^l = \left\langle s^l \right\rangle_{M_l}. \tag{37}
$$

Since the covariance matrix $\Sigma_l$ depends on the order parameters of layer $l-1$, the effective measure $M_l$ at layer $l$ depends on the order parameters of the previous layer, while it depends on the conjugate order parameters of the current layer. We then observe that the order parameters $\{\hat{v}^l, v^l, q^l\}$ propagate forward in layers, while the conjugate order parameters $\{i\hat{V}^l, iV^l, iQ^l\}$, which encode the randomness leading to the desired deviation, propagate backward; this resembles the structure of optimal control problems [Grafke2019]. Therefore, we solve the saddle point equations by forward-backward iteration until convergence. Another feature to notice in Eq. (36) is the dependence of the saddle point solution on the layer-shape parameters $\{\alpha_l\}$, which do not play a role in the mean field solutions where all the conjugate order parameters vanish [BoLi2018].

### 4.4 Explicit solutions for sign and ReLU activation functions

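For networks with sign activation function, the order parameters satisfy $\hat{v}^l = v^l = 1$, such that the only meaningful order parameters are $\{q^l, Q^l\}$. The potential function can be computed analytically, taking the form

$$
\begin{aligned}
\Phi(Q, q | q^L) = {} & -\alpha_0 i Q^0 (q^0 - 1) - \sum_{l=1}^{L} \alpha_l i Q^l q^l \\
& - \sum_{l=1}^{L} \alpha_l \log\left[ \cosh(i Q^l) - \sinh(i Q^l) \frac{2}{\pi} \sin^{-1}\left( \sqrt{1 - (\eta^l)^2} \, q^{l-1} \right) \right],
\end{aligned} \tag{38}
$$

while the saddle point equations become

$$
q^0 = 1, \tag{39}
$$

$$
q^l = \frac{-\sinh(i Q^l) + \cosh(i Q^l) \frac{2}{\pi} \sin^{-1}\left( \sqrt{1 - (\eta^l)^2} \, q^{l-1} \right)}{\cosh(i Q^l) - \sinh(i Q^l) \frac{2}{\pi} \sin^{-1}\left( \sqrt{1 - (\eta^l)^2} \, q^{l-1} \right)}, \quad \forall 1 \le l \le L, \tag{40}
$$

$$
i Q^{l-1} = \frac{\frac{2}{\pi} \sinh(i Q^l)}{\cosh(i Q^l) - \sinh(i Q^l) \frac{2}{\pi} \sin^{-1}\left( \sqrt{1 - (\eta^l)^2} \, q^{l-1} \right)} \times \frac{\alpha_l \sqrt{1 - (\eta^l)^2}}{\alpha_{l-1} \sqrt{1 - \left[ 1 - (\eta^l)^2 \right] (q^{l-1})^2}}, \quad \forall 1 \le l \le L. \tag{41}
$$

Note that $q^L$ in Eq. (40) is an input parameter.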
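
A forward-backward iteration for Eqs. (39) to (41) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' solver: it treats $t_l = iQ^l$ as real, rewrites Eq. (40) as $q^l = \tanh(\tanh^{-1} m_l - t_l)$ with $m_l = \frac{2}{\pi}\sin^{-1}(\sqrt{1-(\eta^l)^2}\, q^{l-1})$, fixes $t_L$ from the output constraint, and uses a damped update whose damping factor, iteration cap and tolerance are arbitrary choices. It requires $0 < \eta^l < 1$ and $|q^L| < 1$.

```python
import numpy as np

def solve_sign_saddle_point(q_out, eta, alpha, n_iter=500, damping=0.5, tol=1e-12):
    """Damped forward-backward iteration for the sign-DNN saddle point, Eqs. (39)-(41).

    q_out : target output overlap q^L, with |q_out| < 1
    eta   : perturbation strengths eta^1..eta^L, each in (0, 1)
    alpha : layer-shape parameters alpha_0..alpha_L
    Returns (q, t, phi): q[l] = q^l, t[l] = iQ^l (real), phi = potential of Eq. (38).
    """
    L = len(eta)
    c = np.sqrt(1.0 - np.asarray(eta, dtype=float) ** 2)   # c[l-1] = sqrt(1-(eta^l)^2)
    alpha = np.asarray(alpha, dtype=float)
    q = np.ones(L + 1)                                     # q^0 = 1, Eq. (39)
    q[L] = q_out
    t = np.zeros(L + 1)
    m = np.zeros(L + 1)
    for _ in range(n_iter):
        # Forward sweep: order parameters from the current conjugate variables.
        for l in range(1, L + 1):
            m[l] = 2.0 / np.pi * np.arcsin(c[l - 1] * q[l - 1])
            if l < L:
                q[l] = np.tanh(np.arctanh(m[l]) - t[l])    # Eq. (40) solved for q^l
        # Output constraint: Eq. (40) at l = L determines iQ^L.
        t_new = t.copy()
        t_new[L] = np.arctanh(m[L]) - np.arctanh(q_out)
        # Backward sweep: Eq. (41), written with tanh for numerical stability.
        for l in range(L, 0, -1):
            th = np.tanh(t_new[l])
            slope = 2.0 / np.pi * c[l - 1] / np.sqrt(1.0 - (c[l - 1] * q[l - 1]) ** 2)
            t_new[l - 1] = alpha[l] / alpha[l - 1] * slope * th / (1.0 - m[l] * th)
        change = np.max(np.abs(t_new - t))
        t = (1.0 - damping) * t + damping * t_new
        if change < tol:
            break
    # Potential of Eq. (38); the alpha_0 term vanishes because q^0 = 1.
    phi = -sum(alpha[l] * (t[l] * q[l] + np.log(np.cosh(t[l]) - np.sinh(t[l]) * m[l]))
               for l in range(1, L + 1))
    return q, t, float(phi)

eta, alpha = [0.2, 0.2, 0.2], [1.0, 1.0, 1.0, 1.0]
q_typ = 1.0
for e in eta:                                              # typical output overlap, Eq. (9)
    q_typ = 2.0 / np.pi * np.arcsin(np.sqrt(1.0 - e**2) * q_typ)

q_sol, t_sol, phi_sol = solve_sign_saddle_point(q_typ + 0.1, eta, alpha)
print("overlaps q^l:", np.round(q_sol, 4))
print("conjugates iQ^l:", np.round(t_sol, 4))
print("rate function estimate:", phi_sol)
```

At the typical output overlap all conjugate variables vanish and the potential is zero; away from it the potential is positive, as expected of a rate function. Convergence of this simple damped scheme is not guaranteed for large deviations or deep networks, where a smaller damping factor may be needed.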

For networks with ReLU activation function the potential function also admits an explicit expression

 −L−1∑l=1αl(i^Vl^vl+iVlvl+iQlql