A Correspondence Between Random Neural Networks and Statistical Field Theory

S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein
Google Brain
{schsam, jpennin, jaschasd}@google.com
Abstract

A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics. We argue that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model. For random linear networks and random rectified linear networks we show that the corresponding lattice models in the wide network limit may be systematically approximated by a Gaussian distribution with covariance between the layers of the network. In each case, the approximate distribution can be diagonalized by Fourier transformation. We show that this approximation accurately describes the results of numerical simulations of wide random neural networks. Finally, we demonstrate that in each case the large scale behavior of the random networks can be approximated by an effective field theory.

1 Introduction

Machine learning methods built on deep neural networks have had unparalleled success across a dizzying array of tasks ranging from image recognition (Krizhevsky et al., 2012) to translation (Wu et al., 2016) to speech recognition and synthesis (Hinton et al., 2012). Amidst the rapid progress of machine learning at large, a theoretical understanding of neural networks has proceeded more modestly. In part, this difficulty stems from the complexity of neural networks, which have come to be composed of millions or even billions of parameters (Shazeer et al., 2017) arranged in complicated topologies.

Recently, a number of promising theoretical results have made progress by considering neural networks that are, in some sense, random. For example, Choromanska et al. (2015) showed that random rectified linear neural networks could, with approximation, be mapped onto spin glasses; Saxe et al. (2014) explored the learning dynamics of randomly initialized networks; Daniely et al. (2016, 2017) studied an induced duality between neural networks with random pre-activations and compositions of kernels; Raghu et al. (2017) and Poole et al. (2016) studied the expressivity of deep random neural networks; and Schoenholz et al. (2017) studied information propagation through random networks. Work on random networks in the context of Bayesian neural networks has a longer history (Neal, 1996, 2012; Cho and Saul, 2009). Overall, it seems increasingly likely that statements about randomly initialized neural networks may be able to inform practical design questions.

In a seemingly unrelated context, the past century has witnessed significant advances in theoretical physics, many of which may be attributed to the development of statistical, classical, and quantum field theories. Field theory has been used to understand a remarkably diverse set of physical phenomena, ranging from the standard model of particle physics (Weinberg, 1967), which represents the sum of our collective knowledge about subatomic particles, to the codification of phase transitions using Landau theory and the renormalization group (Chaikin and Lubensky, 2000). Consequently, an extremely wide array of tools have been developed to understand and approximate field theories.

In this paper we elucidate an explicit connection between random neural networks and statistical field theory. We demonstrate how well-established techniques in statistical physics can be used to study random neural networks in a quantitative and robust way. We begin by constructing an ensemble of random neural networks that we believe has a number of appealing properties. In particular, one limit of this ensemble is equivalent to studying neural networks after random initialization while another limit corresponds to probing the statistics of minima in the loss landscape. We then introduce a change of variables which shows that the weights and biases may be integrated out analytically to give a distribution over the pre-activations of the neural network alone. This distribution is identical to that of a statistical lattice model and this mapping is exact. We examine an expansion of our results as the network width grows large, and obtain concise and interpretable results for deep linear networks and deep rectified linear networks. We show that there exist well-defined mean field theories whose fluctuations may be fully characterized, and that there exist corresponding continuum field theories that govern the long wavelength behavior of these random networks. We compare our theory to simulations of random neural networks and find exceptional agreement. Thus we show that the behavior of wide random networks can be very precisely characterized.

This work leaves open a wide array of avenues that may be pursued in the future. In particular, the ensemble that we develop allows for a loss to be incorporated in the randomness. Looking forward, it seems plausible that statements about the distribution of local optima could be obtained using these methods. Moreover early training dynamics could be investigated by treating the small loss limit as a perturbation. Finally, generalization to arbitrary neural network architectures and correlated weight matrices is possible.

2 Background

We now briefly discuss lattice models in statistical physics and their corresponding effective field theories, using the ubiquitous Ising model as an example. Many materials are composed of a lattice of atoms. Magnets are materials in which the spins of the electrons orbiting these atoms align with one another. To model this behavior, physicists introduced a very simple model that places “spins” on the vertices of a lattice. In our simple example we consider spins sitting on a one-dimensional chain of length L at sites indexed by l. The spins can be modeled in many ways, but in the simplest formulation of the problem we take z^{l}\in\{-1,+1\}, representing spins that are either aligned or anti-aligned. For ease of analysis we consider a periodic chain, defined so that the first site is connected to the last site and z^{L+1}=z^{0}.

The statistics of the spins in such a system are determined by the Boltzmann distribution, which assigns a configuration of spins the probability P(\{z^{l}\})=e^{-\beta H(\{z^{l}\})}/Q, where \beta is the reciprocal of the thermodynamic temperature. Here,

H(\{z^{l}\})=-\frac{J}{2}\sum_{l}z^{l}z^{l+1} (1)

is the “energy” of the system, where J is a coupling constant. This energy is minimized when all of the sites point in the same direction. The normalization constant Q is the partition function, and is given by

Q=\sum_{z^{0}\in\{-1,1\}}\cdots\sum_{z^{L}\in\{-1,1\}}e^{-\beta H}. (2)
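For concreteness, the energy of eq. (1) and the partition function of eq. (2) can be evaluated directly for small chains. The following is a minimal sketch in Python/NumPy; the chain length, coupling J, and inverse temperature \beta used here are illustrative choices rather than values taken from the text.

import numpy as np
from itertools import product

def ising_energy(spins, J=1.0):
    # Energy of eq. (1) for spins in {-1, +1} on a periodic chain:
    # np.roll supplies the neighbor z^{l+1} with periodic wrap-around.
    return -0.5 * J * np.sum(spins * np.roll(spins, -1))

def partition_function(L=10, J=1.0, beta=1.0):
    # Brute-force sum of eq. (2) over all 2^L configurations (small L only).
    return sum(np.exp(-beta * ising_energy(np.array(s), J))
               for s in product([-1, 1], repeat=L))

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=16)
print(ising_energy(spins), partition_function())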

Despite the relative simplicity of the Ising model it has had enormous success in predicting the qualitative behavior of an extremely wide array of materials. In particular, it can successfully explain transitions of physical systems from disordered to ordered states.

In general, lattice models are often unwieldy, and their successful predictions are sometimes surprising given how approximately they treat interactions present in real materials. The resolution to both of these concerns lies in the realization that we often are most concerned with the behavior of systems at very large distances relative to the atomic separations. For example, in the case of magnets we care much more about the behavior of the whole material rather than how any two spins in the material behave.

This led to the development of Effective Field Theory (EFT) where we compute a field u(x) by averaging over many z^{l} using, for example, a Gaussian window centered at x. The field u(x) is defined at every point in space and so x here corresponds to a continuous relaxation of l. It turns out that a number of features of the original model, such as symmetries and locality, survive the averaging process. In EFT we study a minimal energy that captures these essential (or long range) aspects of our original theory.

The energy describing the EFT is typically a complicated function of u and its derivatives, \beta H[u(x),\nabla u(x),\nabla^{2}u(x),\cdots]. However, it has been shown that the long wavelength behavior of the system can successfully be described by considering a low-order expansion of \beta H. For the Ising model, for example, the effective energy is,

\beta H=\frac{1}{2}\int dx\left[mu^{2}(x)+vu^{4}(x)+K(\nabla u(x))^{2}\right]. (3)

It can be shown, using the theory of irrelevant operators, that as long as v>0 higher order powers of u do not change qualitative aspects of the resulting theory. Only even powers are allowed because the Ising model has global z^{l}\to-z^{l} symmetry and the gradient term encodes the propensity of spins to align with one another. EFTs such as this one have been very successful at describing the large scale behavior of lattice models such as the Ising model. A great triumph of modern condensed matter physics was the realization that phases of matter could be characterized by their symmetries in this way.

A very large effort has been devoted to developing techniques to analyze lattice models and EFTs. Consequently, any theory that can be written as a lattice model or EFT has access to a wide array of approximate analytic and numerical techniques that can be leveraged to study it. We will employ several of these techniques, such as the saddle point approximation, here. This paper is therefore a way of opening the door to use this extensive toolset to study neural networks.

3 An Exact Correspondence

Consider a fully-connected feed-forward neural network, f:\mathbb{R}^{N_{0}}\to\mathbb{R}^{N_{L}}, with L layers of width N_{l} parametrized by weights W_{\alpha\beta}^{l}, biases b^{l}_{\alpha}, and nonlinearities \phi^{l}:\mathbb{R}\to\mathbb{R}. The network is defined by the equation,

f(x)=\phi^{L+1}(W^{L}\phi^{L}(\cdots\phi^{0}(W^{0}x+b^{0})\cdots)+b^{L}). (4)
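For concreteness, a minimal Python/NumPy sketch of the forward pass in eq. (4) follows; taking \phi=\tanh for every layer, including the output nonlinearity \phi^{L+1}, is an illustrative assumption, since the text allows a different nonlinearity in each layer.

import numpy as np

def forward(x, weights, biases, phi=np.tanh):
    # Evaluates eq. (4): alternately apply an affine map and a nonlinearity.
    # weights[l], biases[l] play the role of W^l, b^l; phi stands in for phi^l.
    z = x
    for W, b in zip(weights, biases):
        z = phi(W @ z + b)   # the final application of phi acts as phi^{L+1}
    return z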

We equip this network with a loss function, \ell(f(x),t), where x\in\mathbb{R}^{N_{0}} is the input to the network and t\in\mathbb{R}^{N_{L}} is a target that we would like to model. Given a dataset of inputs and targets, \{(x_{i},t_{i}):i\in\mathcal{M}\}, we can define a “data” piece of our loss,

\mathcal{L}_{D}(f)=\sum_{i\in\mathcal{M}}\ell(f(x_{i}),t_{i}). (5)

Throughout this text we will use Roman subscripts to specify an input to the network and Greek subscripts to denote individual neurons. We then combine this with an L^{2} regularization term on both the weights and the biases to give a total loss,

\mathcal{L}(f)=\frac{J_{D}}{2}\mathcal{L}_{D}(f)+\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right). (6)

Here we have introduced a parameter J_{D} that controls the relative influence of the data on the loss. It can also be understood as the reciprocal of the regularization parameter. In this equation and in what follows, we adopt the Einstein summation convention in which there is an implied summation over repeated Greek indices.

To construct a stochastic ensemble of networks we must place a measure on the space of networks. Since the objective we hope to minimize is the total loss, \mathcal{L}, we follow Jaynes (1957) and select the maximum entropy distribution over f subject to a constraint on the expected total loss,

\langle\mathcal{L}\rangle=\int dfP(f)\mathcal{L}(f). (7)

This gives the probability of finding a network f as P(f)=e^{-\mathcal{L}(f)}/Q, where Q is the partition function,

Q=\int[dW][db]e^{-\mathcal{L}}. (8)

Here we have introduced the notation [dW]=\prod_{l}\prod_{\alpha\beta}dW_{\alpha\beta}^{l} and [db]=\prod_{l}\prod_{\alpha}db^{l}_{\alpha} for simplicity.

While a choice of ensemble in this context will always be somewhat arbitrary, we argue that this particular ensemble has several interesting features that make it worthy of study. First, if we set J_{D}=0 then this amounts to studying the distribution of untrained, randomly initialized, neural networks with weights and biases distributed according to W_{\alpha\beta}^{l}\sim\mathcal{N}(0,\sigma_{w}^{2}N_{l}^{-1}) and b^{l}_{\alpha}\sim\mathcal{N}(0,\sigma_{b}^{2}) respectively. This situation also amounts to considering a Bayesian neural network with a Gaussian distributed prior on the weights and biases as in Neal (2012). When J_{D} is small, but nonzero, we may treat the loss as a perturbation about the random case. We speculate that the regime of small J_{D} should be tractable given the work presented here. Studying the case of large J_{D} will probably require methodology beyond what is introduced in this paper; however, if progress could be made in this regime, it would give insight into the distribution of minima in the loss landscape.
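As an illustration of the J_{D}=0 limit, the following Python/NumPy sketch draws one network from the ensemble, with W^{l}_{\alpha\beta}\sim\mathcal{N}(0,\sigma_{w}^{2}N_{l}^{-1}) and b^{l}_{\alpha}\sim\mathcal{N}(0,\sigma_{b}^{2}), and records the pre-activations of every layer for a single input. The depth, widths, variances, and the use of the fan-in for N_{l} are illustrative assumptions.

import numpy as np

def sample_preactivations(x, depth=100, width=500,
                          sigma_w2=1.5, sigma_b2=0.001,
                          phi=np.tanh, seed=0):
    # Draws one network from the J_D = 0 ensemble and returns the
    # pre-activations z^l of every layer for a single input x.
    rng = np.random.default_rng(seed)
    zs = []
    h, n_in = x, x.shape[0]
    for l in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / n_in), size=(width, n_in))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        z = W @ h + b          # pre-activation of layer l
        zs.append(z)
        h, n_in = phi(z), width  # activation fed to the next layer
    return zs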

Our main result is to rewrite eq. (8) in a form that is more amenable to analysis. In particular, the weights and biases may be integrated out analytically, resulting in a distribution that depends only on the pre-activations in each layer. This formulation elucidates the statistical structure of the network and allows for systematic approximation. By a change of variables we arrive at the following theorem (for a proof see appendix 8.1).

Main Result.

Through the change of variables, z^{l}_{\alpha;i}=W^{l}_{\alpha\beta}\phi^{l}(z^{l-1}_{\beta;i})+b_{\alpha}^{l}, the distribution over weights and biases defined by eq. (8) can be converted to a distribution over the pre-activations of the neural network. When N_{l}\gg|\mathcal{M}| the distribution over the pre-activations is described by a statistical lattice model defined by the partition function,

\displaystyle Q=\int[dz]\exp\left[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})-\frac{1}{2}\sum_{l=0}^{L}\left((\bm{z}^{l}_{\alpha})^{T}(\bm{\Sigma}^{l})^{-1}\bm{z}^{l}_{\alpha}+\ln|\bm{\Sigma}^{l}|\right)\right]. (9)

Here (\bm{z}^{l}_{\alpha})^{T}=(z^{l}_{\alpha;1},\cdots,z^{l}_{\alpha;|\mathcal{M}|}) is a vector whose components are the pre-activations corresponding to different inputs to the network and \bm{\Sigma}^{l}_{ij}=\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l-1}_{\alpha;i})\phi^{l}(z^{l-1}_{\alpha;j})+\sigma_{b}^{2} is the correlation matrix between activations of the network due to different inputs. The lattice is a one-dimensional chain indexed by layer l and the “spins” are z^{l}_{i}\in\mathbb{R}^{N_{l}}. We term this class of lattice model the Stochastic Neural Network.

Eq. (9) is the full joint-distribution for the pre-activations in a random network with arbitrary activation functions and layer widths with no reference to the weights or biases. We see that this lattice model features coupling between adjacent layers of the network as well as between different inputs to the network. Finally, we see that the loss now only features the pre-activation of the last layer of the network. The input to the network and the loss therefore act as boundary conditions on the lattice.
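For example, a small Python/NumPy sketch of how the correlation matrix \bm{\Sigma}^{l} of the main result could be computed from a batch of previous-layer activations; the array shapes and function name are our own illustrative choices.

import numpy as np

def layer_correlation(phi_prev, sigma_w2, sigma_b2):
    # phi_prev has shape (|M|, N_l): row i holds phi^l(z^{l-1}_i) for input i.
    # Returns Sigma^l_{ij} = sigma_w^2 N_l^{-1} phi_i . phi_j + sigma_b^2.
    n = phi_prev.shape[1]
    return sigma_w2 / n * phi_prev @ phi_prev.T + sigma_b2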

There is a qualitative as well as methodological similarity between this formalism and the use of replica theory to study spin glasses (Mézard et al., 1987). Qualitatively, the replicated partition function in spin glasses involves the overlap function, which measures the correlation between spins in different replicas, while eq. (9) is naturally written in terms of \Sigma^{l}, which measures the correlation between activations due to different inputs to the network. Methodologically, when using the replica trick to analyze spin glasses, one assumes that the interactions between spins are Gaussian distributed and shared between different replicas of the system; by integrating out the couplings analytically, the different replicas naturally become coupled. In our case the weights play a similar role to the interactions, and integrating them out couples the signals due to different inputs to the network.

Samples from a stochastic network with \phi(z)=\tanh(z) can be seen in fig. (1).

Figure 1: Samples of the norm of pre-activations, |z^{l}_{\alpha}|, from an L=100 layer stochastic neural network with \phi(z)=\tanh(z), J_{D}=0, N_{l}=500, and \sigma_{b}^{2}=0.001. The weight variance was changed from 0.1 (blue) to 1.5 (red). Dashed lines show the corresponding mean-field prediction.

In this framework we can see that the mean field approximation of Poole et al. (2016) amounts to the replacement of \phi^{l}(z_{\alpha}^{l})\phi^{l}(z_{\alpha}^{l}) by \langle\phi^{l}(z_{\alpha}^{l})\phi^{l}(z_{\alpha}^{l})\rangle where \langle\rangle denotes an expectation. This procedure decouples adjacent layers and replaces the complex joint distribution over pre-activations by a factorial Gaussian distribution. As a result, this approximation is unable to capture any cross-layer fluctuations that might be present in random neural networks. We can see this in fig. (1) where the black dashed lines denote the prediction of this particular mean field approximation. Note that while changes to the variance are correctly predicted, fluctuations are absent. Both Poole et al. (2016) and  Schoenholz et al. (2017) study this particular factorial approximation to the full joint distribution, eq. (9). Additionally, the composition kernels of Daniely et al. (2016, 2017) can be viewed as studying correlation functions in this mean-field formalism over a broader class of network topologies.
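For a single input, this factorial approximation reduces to iterating the scalar variance map q^{l}=\sigma_{w}^{2}\langle\phi(z)^{2}\rangle_{z\sim\mathcal{N}(0,q^{l-1})}+\sigma_{b}^{2}; the dashed curves in fig. (1) correspond to the resulting prediction for the norm, \sqrt{Nq^{l}}. A minimal Python/NumPy sketch of this recursion, with an illustrative Gauss-Hermite quadrature for the Gaussian expectation:

import numpy as np

def mean_field_variance(q0, depth, sigma_w2, sigma_b2, phi=np.tanh, n_grid=101):
    # Iterates q^l = sigma_w^2 <phi(z)^2>_{z ~ N(0, q^{l-1})} + sigma_b^2,
    # the layer-decoupled (factorial) approximation discussed above.
    ts, w = np.polynomial.hermite_e.hermegauss(n_grid)  # probabilists' Hermite
    w = w / np.sqrt(2 * np.pi)                          # weights for N(0, 1)
    qs = [q0]
    for _ in range(depth):
        z = np.sqrt(qs[-1]) * ts
        qs.append(sigma_w2 * np.sum(w * phi(z) ** 2) + sigma_b2)
    return np.array(qs)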

The mean field theory of Poole et al. (2016) is analytically tractable for arbitrary activation functions and so is interesting to study. However, the explicit independence assumption makes it an uncontrolled approximation, especially when generalizing to neural network topologies that are not fully-connected feed-forward networks. Additionally, there are many interesting questions that one might wish to ask about correlations between pre-activations in different layers of random neural networks. Finally, it is unclear how to move beyond a mean field analysis in this framework. To overcome these issues, we pursue a more principled solution to eq. (9) by considering a controlled expansion for large N_{l}.

To allow tractable progress, we limit the study in this paper to the case of a single input such that |\mathcal{M}|=1. With this restriction, eq. (9) can be written explicitly as (see appendix 8.1),

\displaystyle Q=\int[dz]\exp\Bigg[-\frac{1}{2}\Bigg\{J_{D}\ell(\phi^{L+1}(z^{L}),t)+\frac{z_{\alpha}^{0}z_{\alpha}^{0}}{\sigma_{w}^{2}N_{0}^{-1}x_{\beta}x_{\beta}+\sigma_{b}^{2}}+\sum_{l=1}^{L}\frac{z_{\alpha}^{l}z_{\alpha}^{l}}{\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l-1}_{\beta})\phi^{l}(z^{l-1}_{\beta})+\sigma_{b}^{2}}+\sum_{l=1}^{L}N_{l}\log(\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l}_{\alpha})\phi^{l}(z^{l}_{\alpha})+\sigma_{b}^{2})\Bigg\}\Bigg]. (10)

While results involving the distribution of pre-activations resulting from a single input are an interesting first step, we know from Poole et al. (2016), Schoenholz et al. (2017), and Daniely et al. (2016) that correlations between the pre-activations due to different inputs are important when analyzing notions of expressivity and trainability. We therefore believe that extending these results to nontrivial datasets will be fruitful. To this end, it might be useful to take inspiration from the spin-glass community and seek to rephrase eq. (9) in terms of an overlap and to look for replica-symmetry breaking.

4 The Stochastic Neural Network On A Ring

With the stochastic neural network defined in eq. (9), we consider a specific network topology that is unusual in machine learning but is commonplace in physics. In particular, as in the Ising model described above, we consider a stochastic network whose final layer feeds back into its first layer. Since this topology is incompatible with a loss defined in terms of network inputs and outputs, we set J_{D}=0 in this case.

Figure 2: A schematic showing the topology of the Stochastic Neural Network on a ring.

A schematic of this network can be seen in fig. (2). The substantial advantage of considering this periodic topology is that we can neglect the effect of boundary conditions and focus on the “bulk” behavior of the network. The boundary effects can be taken into account once a theory for the bulk has been established. This method of dealing with lattice models is extremely common. We additionally set N_{l}=N and \phi^{l}=\phi independent of layer.

The stochastic network on a ring is described by the energy,

\displaystyle\mathcal{L}(\{z^{l}\})=\frac{1}{2}\sum_{l=0}^{L}\Bigg\{\frac{z^{l}_{\alpha}z^{l}_{\alpha}}{\sigma_{w}^{2}N^{-1}\phi(z^{l-1}_{\beta})\phi(z^{l-1}_{\beta})+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}\phi(z^{l}_{\beta})\phi(z^{l}_{\beta})+\sigma_{b}^{2})\Bigg\} (11)

subject to the identification z^{L}_{\alpha}=z^{-1}_{\alpha}. We will call this lattice model the stochastic neural network on a ring. For the remainder of this paper we will consider systematic approximations to eq. (11).

5 Linear Stochastic Neural Networks

To gain intuition for the stochastic network on a ring we will begin by considering a linear network with \phi(z)=z. In this case it is clear that the energy in eq. (11) is isotropic. It is therefore possible to change variables into hyper-spherical coordinates and integrate out the angular part explicitly (which will give a constant factor that may be neglected). Consequently, the energy for the stochastic linear network is given by (see appendix 8.2),

\displaystyle\mathcal{L}(\{r^{l}\})=\frac{1}{2}\sum_{l=0}^{L}\Bigg\{\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1})^{2}+\sigma_{b}^{2}}-N\log\Bigg(\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l})^{2}+\sigma_{b}^{2}}\Bigg)\Bigg\}, (12)

where (r^{l})^{2}=z^{l}_{\alpha}z^{l}_{\alpha}.

A controlled approximation to eq. (12) as N\to\infty can be constructed using the Laplace approximation (sometimes called the saddle point approximation). The essence of the Laplace approximation is that integrals of the form I=\int dx\,e^{-Af(x)} can be approximated by I\approx e^{-Af(x^{*})}\int dx\,e^{-\frac{A}{2}(x-x^{*})^{T}f^{\prime\prime}(x^{*})(x-x^{*})} as A\to\infty, where x^{*} minimizes f(x). Consequently, we first seek a minimum of eq. (12) to expand around.
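As a sanity check of the approximation itself, a minimal one-dimensional Python/NumPy sketch comparing a brute-force integral to its Laplace estimate e^{-Af(x^{*})}\sqrt{2\pi/(Af^{\prime\prime}(x^{*}))}; the function f used here is a toy example, not one appearing in the text.

import numpy as np

# Toy example: f(x) = x**2/2 + x**4/4 has its minimum at x* = 0 with f''(x*) = 1.
f = lambda x: 0.5 * x**2 + 0.25 * x**4
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]

for A in [1.0, 10.0, 100.0]:
    exact = np.sum(np.exp(-A * f(x))) * dx                   # brute-force integral
    laplace = np.exp(-A * f(0.0)) * np.sqrt(2 * np.pi / A)   # Laplace estimate
    print(A, exact, laplace)  # the two agree increasingly well as A grows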

We make the ansatz that there is a uniform configuration, r^{l}=r^{*} independent of layer, that minimizes eq. (12). Under this assumption we find that for \sigma_{w}^{2}<1 there is an optimum when (see appendix 8.2),

r^{*}=\sqrt{\frac{N\sigma_{b}^{2}}{\Delta_{w}}}\,, (13)

where \Delta_{w}=1-\sigma_{w}^{2} measures the distance to criticality. This solution can be tested by generating many instantiations of stochastic linear networks and then computing the average norm of the pre-activations after the transient from the input has decayed.
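A minimal Python/NumPy sketch of this measurement follows; the burn-in of half the depth and the parameter values are illustrative choices.

import numpy as np

def empirical_r_star(sigma_w2, sigma_b2=0.01, N=200, L=1024, burn_in=512, seed=0):
    # Runs one deep linear stochastic network and averages the norm of the
    # pre-activations after the transient from the input has decayed.
    rng = np.random.default_rng(seed)
    z = rng.normal(size=N)
    norms = []
    for l in range(L):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / N), size=(N, N))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=N)
        z = W @ z + b
        if l >= burn_in:
            norms.append(np.linalg.norm(z))
    return np.mean(norms)

sigma_w2, sigma_b2, N = 0.5, 0.01, 200
theory = np.sqrt(N * sigma_b2 / (1.0 - sigma_w2))   # eq. (13)
print(empirical_r_star(sigma_w2), theory)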

Figure 3: The fixed point of the norm, r^{*}, for a stochastic linear network with \sigma_{b}^{2}=0.01, L=1024, N=200. Measurements from instantiations of the network at different \sigma_{w}^{2} are shown in red circles. The theoretical prediction is overlaid in black dashed lines.

In fig. (3) we plot the empirical norm measured in this way against the theoretical prediction. We see excellent agreement between the numerical result and the theory. (Note that while we are measuring the average norm of the linear stochastic network, we are predicting r^{*}, which is the mode of the distribution. However, these quantities are equal in the large N limit of the Laplace approximation.)

Nonuniform fluctuations around the minimum can now be computed. Let r^{l}=r^{*}+\epsilon^{l} and expand the energy to quadratic order in \epsilon^{l}. Writing U(\{\epsilon^{l}\})=\mathcal{L}(\{r^{*}+\epsilon^{l}\})-\mathcal{L}(\{r^{*}\}) we find that (see appendix 8.2),

U(\{\epsilon^{l}\})=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left[(1+\sigma_{w}^{4})(\epsilon^{l})^{2}-2\sigma_{w}^{2}\epsilon^{l}\epsilon^{l-1}\right]. (14)

As in the work of Poole et al. (2016), here we also approximate the behavior of the full joint distribution by a Gaussian. However, the Laplace approximation retains the coupling between layers and therefore is able to capture inter-layer fluctuations.

Together eq. (13) and eq. (14) fully characterize the behavior of the linear stochastic network as N_{l}\to\infty. By expanding to beyond quadratic order, corrections of order N_{l}^{-1} can be computed. One application of this would be to reprise the analysis of signal propagation in deep networks in Schoenholz et al. (2017), but for networks of finite rather than infinite width.

As our network is topologically equivalent to a ring, we can perform a coordinate transformation of eq. (14) to the Fourier basis by writing \epsilon^{l}=\sum_{q}\epsilon_{q}e^{-iql}. To respect the periodic boundary conditions of the ring, q will be summed from 0 to 2\pi in steps of 2\pi/L. It follows that (see appendix 8.2),

U(\{\epsilon_{q}\})=\frac{L\Delta_{w}}{\sigma_{b}^{2}}\sum_{q}\left\{(1+\sigma_{w}^{4})-2\sigma_{w}^{2}\cos q\right\}|\epsilon_{q}|^{2}. (15)

The Fourier transformation therefore diagonalizes eq. (14) and so we predict that the different Fourier modes ought to be distributed as independent Gaussians. Since the variance of each mode is positive for \sigma_{w}^{2}<1, the optimum that we identified in eq. (13) is indeed a minimum.

This calculation gives very precise predictions about the behavior of pre-activations in wide, deep stochastic networks.

Figure 4: The statistics of the Fourier transform of fluctuations in deep linear stochastic neural networks. This figure offers a comparison between the fluctuations sampled from stochastic neural networks (colored lines) and our theoretical predictions (white lines). The networks are of depth L=1024, width N=200, and \sigma_{b}^{2}=0.01. The colors denote different values of \sigma_{w}^{2} in the set 0.02 (blue), 0.18, 0.34, 0.5, 0.66, 0.82, 0.98 (red).

To test these predictions we generate M=200 samples from linear stochastic networks of width N=200 and depth L=1024. For each sample we take the norm of the pre-activations in the last 512 layers of the network and compute the fluctuation of this norm around r^{*} (eq. (13)). For each sample we then compute the FFT of these fluctuations. Finally, we compute the variance of each Fourier mode (for more details and plots see appendix 8.3). We plot the results of this calculation in fig. (4) for different values of \sigma_{w}^{2}. In each case we see strong agreement between our numerical experiments and the prediction of our theory. Note that the factorial Gaussian approximation discussed briefly above is unable to capture these fluctuations.
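A minimal Python/NumPy sketch of this measurement procedure follows. Because the overall normalization of the predicted spectrum depends on Fourier conventions, the theoretical curve is returned only up to a multiplicative constant, and the sample count and parameter values are illustrative choices.

import numpy as np

def fluctuation_spectrum(sigma_w2, sigma_b2=0.01, N=200, L=1024,
                         n_samples=200, keep=512, seed=0):
    # Measures the variance of each Fourier mode of epsilon^l = r^l - r*
    # over the last `keep` layers, averaged over independent networks.
    rng = np.random.default_rng(seed)
    r_star = np.sqrt(N * sigma_b2 / (1.0 - sigma_w2))          # eq. (13)
    spectra = []
    for _ in range(n_samples):
        z = rng.normal(size=N)
        norms = []
        for l in range(L):
            W = rng.normal(0.0, np.sqrt(sigma_w2 / N), size=(N, N))
            b = rng.normal(0.0, np.sqrt(sigma_b2), size=N)
            z = W @ z + b
            norms.append(np.linalg.norm(z))
        eps = np.array(norms[-keep:]) - r_star
        spectra.append(np.abs(np.fft.fft(eps)) ** 2)
    measured = np.mean(spectra, axis=0)
    q = 2 * np.pi * np.arange(keep) / keep
    # Shape of eq. (15), up to an overall normalization constant.
    theory_shape = 1.0 / ((1 + sigma_w2 ** 2) - 2 * sigma_w2 * np.cos(q))
    return q, measured, theory_shape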

The long wavelength behavior of fluctuations in the deep linear network is well described by an effective field theory. This effective field theory can be constructed by expanding eq. (15) to quadratic order in q, approximating sums by integrals and differences by derivatives. We find that the effective field theory is defined by the energy (see appendix 8.2),

U[\epsilon(x)]=\frac{\Delta_{w}}{\sigma_{b}^{2}}\int dx\left[\Delta_{w}^{2}(\epsilon(x))^{2}+\sigma_{w}^{2}\left(\frac{\partial\epsilon(x)}{\partial x}\right)^{2}\right]. (16)

We note that this field theory has an explicit \epsilon(x)\to-\epsilon(x) symmetry as well as an x\to-x symmetry. Perhaps unsurprisingly, the latter implies that information can travel equally well forwards and backwards through the network.

Both the effective field theory and the lattice model have long wavelength fluctuations that are given by the q\to 0 limit of eq. (15),

U(\{\epsilon_{q}\})\approx\frac{L\Delta_{w}\sigma_{w}^{2}}{\sigma_{b}^{2}}\sum_{q}\left\{\frac{\Delta_{w}^{2}}{\sigma_{w}^{2}}+q^{2}\right\}|\epsilon_{q}|^{2}. (17)

Given this equation we can read off the length-scale governing fluctuations to be \xi=\sigma_{w}/\Delta_{w}. We therefore see that stochastic linear networks feature a phase transition at \Delta_{w}=0 with an accompanying diverging depth-scale in the fluctuations.

6 Rectified Linear Stochastic Neural Networks

Having discussed the linear stochastic neural network we now move on to the more complicated case of the stochastic neural network on a ring with rectified linear activations, \phi(z)=\max(0,z). Again we seek to construct the Laplace approximation to eq. (11).

In this case we notice that the norm squared of any z^{l} decomposes into two terms, (z^{l})^{2}=(z^{l}_{+})^{2}+(z^{l}_{-})^{2}. Here, z^{l}_{+} and z^{l}_{-} are the vectors of positive and negative components of z^{l} respectively. With this decomposition, the energy for the rectified linear stochastic neural network can be written as,

\displaystyle\mathcal{L}(\{z^{l}\})=\frac{1}{2}\sum_{l=0}^{L}\Bigg\{\frac{(z^{l}_{+})^{2}+(z^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(z^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(z^{l}_{+})^{2}+\sigma_{b}^{2})\Bigg\}. (18)

The integral over each z^{l} can be decomposed as a sum of integrals over each of the 2^{N} different orthants. In each orthant, the set of positive and negative components of z^{l} is fixed; consequently, we may apply independent hyperspherical coordinate transformations to z^{l}_{+} and to z^{l}_{-} within each orthant.

With this in mind, let k_{l} be the number of positive components of z^{l} in a given orthant with the remaining N-k_{l} components being negative. It is clear that the number of orthants with k_{l} positive components will be {N\choose k_{l}}. The partition function for the rectified linear network can therefore be written as (see appendix 8.4),

\displaystyle Q=2\left(\frac{\sqrt{\pi}}{2}\right)^{N}\prod_{l}\sum_{k_{l}=0}^{N}{N\choose k_{l}}\frac{1}{\Gamma\left(\frac{N-k_{l}}{2}\right)\Gamma\left(\frac{k_{l}}{2}\right)}\int dr^{l}_{+}dr^{l}_{-}\,(r^{l}_{+})^{N-k_{l}-1}(r^{l}_{-})^{k_{l}-1}
\displaystyle\qquad\times\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})\right)\right]. (19)

Here r^{l}_{+} and r^{l}_{-} are the norms of the positive and negative components of the pre-activations, respectively. In the N\to\infty limit, the sum over orthants can be converted into an integral and the \Gamma functions can be approximated using Stirling’s formula. We therefore see that, unlike in the case of linear networks, the lattice model for rectified linear networks contains three interacting fields, r^{l}_{+}, r^{l}_{-}, and k_{l}.

As in the linear case, we can now construct the Laplace approximation for this network. We first make an ansatz of a constant solution, r^{l}_{+/-}=r^{*}_{+/-} and k^{l}=k^{*}, independent of the layer l. Solving for the minimum of the energy we arrive at the following saddle point conditions (see appendix 8.4),

\displaystyle r^{*}_{+/-}=\sqrt{\frac{N\sigma_{b}^{2}}{2(1-\sigma_{w}^{2}/2)}},\qquad k^{*}=\frac{N}{2}. (20)

Perhaps this result should not be surprising given the symmetry of the random weights. We expect that in the N\to\infty limit the network will settle into a state where half the pre-activations are negative and half the pre-activations are positive.

Figure 5: The fixed point of the positive (red) and negative (blue) components of the norm, r^{*}_{+/-}, for a stochastic rectified linear network with \sigma_{b}^{2}=0.01, L=1024, N=200. Measurements from instantiations of the network at different \sigma_{w}^{2} are shown. The theoretical prediction is overlaid in black dashed lines. The inset shows the measured values of k^{*} (black) compared with the theoretical prediction (dashed white).

We can test the results of this prediction in fig. (5) by sampling M=200 instances of 1024 layer deep rectified linear stochastic neural networks with \sigma_{w}^{2}\in(0,2). As in the case of the deep linear stochastic network we see excellent agreement between theory and numerical simulation.
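A minimal Python/NumPy sketch of the corresponding measurement for the rectified linear network; the burn-in and parameter values are illustrative choices.

import numpy as np

def relu_fixed_point(sigma_w2, sigma_b2=0.01, N=200, L=1024, burn_in=512, seed=0):
    # Measures r_+, r_- (norms of the positive/negative parts of z^l) and k
    # (number of positive components), averaged over the deep layers.
    rng = np.random.default_rng(seed)
    z = rng.normal(size=N)
    r_plus, r_minus, k = [], [], []
    for l in range(L):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / N), size=(N, N))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=N)
        z = W @ np.maximum(z, 0.0) + b
        if l >= burn_in:
            r_plus.append(np.linalg.norm(np.maximum(z, 0.0)))
            r_minus.append(np.linalg.norm(np.minimum(z, 0.0)))
            k.append(np.sum(z > 0))
    return np.mean(r_plus), np.mean(r_minus), np.mean(k)

sigma_w2, sigma_b2, N = 1.0, 0.01, 200
theory_r = np.sqrt(N * sigma_b2 / (2 * (1 - sigma_w2 / 2)))   # eq. (20)
print(relu_fixed_point(sigma_w2), theory_r, N / 2)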

Nonuniform fluctuations around the saddle point can once again be computed. To do this we write r^{l}_{+/-}=r^{*}_{+/-}+\epsilon^{l}_{+/-} and k^{l}=k^{*}+\epsilon_{k}^{l}. We now expand the energy and make the substitutions \tilde{\epsilon}^{l}_{+/-}=\sqrt{2(1-\sigma_{w}^{2}/2)/\sigma_{b}^{2}}\,\epsilon^{l}_{+/-} and \tilde{\epsilon}^{l}_{k}=\epsilon^{l}_{k}/\sqrt{N} to find an energy cost for fluctuations (see appendix 8.4),

\displaystyle U=\frac{1}{2}\sum_{l=0}^{L}\Bigg((1+\sigma_{w}^{4}/2)(\tilde{\epsilon}^{l}_{+})^{2}+(\tilde{\epsilon}^{l}_{-})^{2}+3(\tilde{\epsilon}^{l}_{k})^{2}+\tilde{\epsilon}^{l}_{k}(\tilde{\epsilon}^{l}_{+}-\tilde{\epsilon}^{l}_{-})-\sigma_{w}^{2}\tilde{\epsilon}_{+}^{l-1}(\tilde{\epsilon}_{+}^{l}+\tilde{\epsilon}_{-}^{l})\Bigg). (21)

We can understand some of these fluctuations intuitively; for example, fluctuations in the fraction of positive components and in the norm of the negative components are anti-correlated. But in general rectified linear networks have subtle and interesting fluctuations, and to our knowledge this work presents the first quantitative theoretical description of the statistics of random rectified linear networks. We note in passing that the fully factorial mean field theory would not be able to capture any of the anisotropy in the fluctuations identified here.

As in the linear case, the layer-layer coupling can be diagonalized by moving into Fourier space. In the rectified linear case, however, this transformation retains covariance between the different fluctuations. In particular, we can write the energy in Fourier space as U=\frac{1}{2}\sum_{q}\bm{\epsilon}^{\dagger}(q)\bm{\Sigma}^{-1}(q)\bm{\epsilon}(q) where \bm{\epsilon}^{\dagger}(q)=\begin{pmatrix}\epsilon_{+}^{-q}&\epsilon_{-}^{-q}&\epsilon_{k}^{-q}\end{pmatrix} is a vector of fluctuations and

\displaystyle\bm{\Sigma}^{-1}(q)=\begin{pmatrix}1+\sigma_{w}^{4}/2-\sigma_{w}^{2}\cos q&-\frac{1}{2}\sigma_{w}^{2}e^{-iq}&\frac{1}{2}\\ -\frac{1}{2}\sigma_{w}^{2}e^{iq}&1&-\frac{1}{2}\\ \frac{1}{2}&-\frac{1}{2}&3\end{pmatrix} (22)

is the Fourier space inverse covariance matrix between different fields (see appendix 8.4).
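For example, the predicted covariance between the fields at each wavevector can be obtained by constructing eq. (22) and inverting it numerically, up to the overall normalization set by the prefactor of the energy; a minimal Python/NumPy sketch:

import numpy as np

def inverse_covariance(q, sigma_w2):
    # The 3x3 inverse covariance of eq. (22) between epsilon_+, epsilon_-,
    # and epsilon_k fluctuations at wavevector q.
    return np.array([
        [1 + sigma_w2**2 / 2 - sigma_w2 * np.cos(q),
         -0.5 * sigma_w2 * np.exp(-1j * q), 0.5],
        [-0.5 * sigma_w2 * np.exp(1j * q), 1.0, -0.5],
        [0.5, -0.5, 3.0],
    ])

q = 2 * np.pi * np.arange(512) / 512
cov = np.stack([np.linalg.inv(inverse_covariance(qi, 0.98)) for qi in q])
# cov[:, 0, 0], cov[:, 0, 1], ... give the predicted (unnormalized) spectra
# for the different components plotted in fig. (6).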

We can compare our theoretical predictions for the covariance matrix against numerical results generated in an analogous manner to the linear case.

Figure 6: The statistics of the Fourier transform of fluctuations in deep rectified stochastic neural networks. This figure offers a comparison between the fluctuations sampled from stochastic neural networks (colored lines) and our theoretical predictions (white lines). The networks are of depth L=1024, width N=200, and \sigma_{b}^{2}=0.01. Different colored curves denote different values of \sigma_{w}^{2}. In particular we show \sigma_{w}^{2}=0.02 (blue), 0.34, 0.66, 0.98, 1.3, 1.62, 1.94 (red). Different components of the covariance matrix are shown in different subplots.

The results of this comparison can be seen in fig. (6) for different elements of the covariance matrix and different values of \sigma_{w}^{2}\in(0,2). As in the linear case we see excellent agreement between the theoretical predictions and the numerical simulations.

Once again we can identify an effective field theory that governs long wavelength fluctuations. We find that it is given by (see appendix 8.4),

\displaystyle U=\frac{1}{2}\int dx\Bigg[(1-\sigma_{w}^{2}+\sigma_{w}^{4}/2)(\epsilon_{+}(x))^{2}+(\epsilon_{-}(x))^{2}+3(\epsilon_{k}(x))^{2}+\sigma_{w}^{2}\epsilon_{+}(x)\epsilon_{-}(x)
\displaystyle\qquad+\epsilon_{k}(x)(\epsilon_{+}(x)-\epsilon_{-}(x))+\sigma_{w}^{2}\left(\frac{\partial\epsilon_{+}(x)}{\partial x}\right)^{2}+\sigma_{w}^{2}\frac{\partial\epsilon_{+}(x)}{\partial x}\epsilon_{-}(x)\Bigg]. (23)

Note that unlike in the case of the stochastic linear network, both the \epsilon\to-\epsilon and x\to-x symmetries are broken when acting on any given field. This symmetry breaking makes sense since the network treats the different fields quite asymmetrically and the forward and backward propagation dynamics are quite different. In physics, the symmetries of a system and the way they are broken have been shown to dictate its behavior over large regions of parameter space. Thus, as in Landau theory, many systems are classified based on the symmetries they possess. The difference in symmetry between linear networks and rectified linear networks suggests that such an approach might be fruitfully applied to neural networks. As with the deep linear network, the long-wavelength limit of the effective field theory and the lattice model agree.

7 Discussion

Here we have shown that there is an exact correspondence between random fully-connected feed-forward neural networks and lattice models in statistical physics. While we have not discussed it here, this correspondence actually holds for a very large set of neural network topologies. Lattice models can also be constructed for ensembles of random neural networks whose weights and biases have distributions more complicated than factorial Gaussians. In general, the effect of nontrivial network topology and correlations between weights will be to couple spins in the lattice model. Thus, the topology of the neural network will generically induce a topology of the corresponding lattice model. For example, convolutional networks will have corresponding lattice models that feature interactions between the set of all pre-activations in a given layer that share a filter.

As in physics, it seems likely that lattice models for complex neural networks will be fairly intractable compared to the relatively simple examples presented here. On the other hand, the success of effective field theories at describing the long wavelength fluctuations of random neural networks suggests that even complex networks may be tractable in this limit. Moreover, as neural networks get larger and more complex the behavior of long wavelength fluctuations will become increasingly relevant when thinking about the behavior of the neural network as a whole.

We believe it is likely that there exist universality classes of neural networks whose effective field theories contain the same set of relevant operators. Classifying neural networks in this way would allow us to make statements about the behavior of entire classes of networks. This would transition the paradigm of neural network design away from specific architectural decisions towards a more general discussion about which class of models is most suitable for a given problem.

Finally, we note that there has been significant effort to understand biological neural activity by leveraging similar analogies to lattice models and statistical field theory. Notably, Schneidman et al. (2006) noticed that Ising-like models can quantitatively capture the statistics of neural activity in the vertebrate retina; Buice and Chow (2013) developed field theoretic extensions to older mean-field theories of populations of neurons; far earlier, Ermentrout and Cowan (1979) used similar techniques to investigate how hallucination patterns might come about. By placing artificial neural networks into the context of field theory it may be possible to find subtle relationships with their biological counterparts.

8 Appendix

8.1 Proof of the Main Result

In this section we prove the main result of the paper. We do so in two steps. First we examine the partition function,

\displaystyle Q=\int[dW][db]\exp\Bigg[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(f(x_{i}),t_{i})-\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{N_{l}}{\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\Bigg)\Bigg] (24)

and introduce the pre-activations at the cost of adding \delta-function constraints. We use the Fourier representation of these constraints to bring them into the exponent. This requires introducing auxiliary variables that enforce the constraints. Once in this form it becomes apparent that the weights and biases are Gaussian distributed and may therefore be integrated out explicitly. Finally we integrate out the constraints that we introduced in the preceding step to convert the distribution into a distribution over the pre-activations alone.

Result 1.

The partition function for the maximum entropy distribution of a fully-connected feed-forward neural network can be written as,

\displaystyle Q=\int[d\Omega]\exp\Bigg[-\sum_{i\in\mathcal{M}}\Bigg\{\frac{J_{D}}{2}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})+i\lambda_{\alpha;i}^{0}(z^{0}_{\alpha;i}-W^{0}_{\alpha\beta}x_{\beta;i}-b^{0}_{\alpha})
\displaystyle\qquad+\sum_{l=1}^{L}i\lambda^{l}_{\alpha;i}(z^{l}_{\alpha;i}-W^{l}_{\alpha\beta}\phi^{l}(z_{\beta;i}^{l-1})-b^{l}_{\alpha})\Bigg\}-\sum_{l=0}^{L}\Bigg(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\Bigg)\Bigg] (25)

where we have let [d\Omega]=[dW][db][dz][d\lambda] for notational convenience.

Proof.

To demonstrate this result we begin with eq. (24) and iteratively use \delta-functions to change variables to the pre-activations. Explicitly writing the neural network out in eq. (24) gives,

\displaystyle Q=\int[dW][db]\exp\Bigg[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L}(\cdots\phi^{1}(W^{0}x_{i}+b^{0})\cdots),t_{i})-\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right)\Bigg] (26)
\displaystyle=\int[dW][db]\left(\prod_{i\in\mathcal{M}}\exp\left[-\frac{J_{D}}{2}\ell(\phi^{L}(\cdots\phi^{1}(W^{0}x_{i}+b^{0})\cdots),t_{i})\right]\right)\exp\left[-\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right)\right] (27)
\displaystyle=\int[dW][db]\Bigg(\prod_{i\in\mathcal{M}}\int[dz^{0}_{i}]\exp\left[-\frac{J_{D}}{2}\ell(\phi^{L}(\cdots\phi^{2}(W^{1}\phi^{1}(z^{0}_{i})+b^{1})\cdots),t_{i})\right]\delta(z^{0}_{i}-W^{0}x_{i}-b^{0})\Bigg)\exp\left[-\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right)\right]. (28)

We can repeat this process iteratively until all of the pre-activations have been introduced. We find,

\displaystyle Q=\int[dW][db]\Bigg(\prod_{i\in\mathcal{M}}\prod_{l=0}^{L}\int[dz_{i}^{l}]\exp\left[-\frac{J_{D}}{2}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})\right]\prod_{l=0}^{L}\delta(z^{l}_{i}-W^{l}\phi^{l}(z^{l-1}_{i})-b^{l})\Bigg)\exp\left[-\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right)\right] (29)

where we will use \phi^{0}(z^{-1})\equiv x interchangeably for notational simplicity. This procedure has essentially used a change of variables to introduce the pre-activations explicitly into the partition function.

Here, \delta-functions constrain the pre-activations to their correct values given the weights and biases. To complete the proof we leverage the Fourier representation of the \delta-function, \delta(x)=\int d\lambda\,e^{-ix\lambda} (up to a constant factor). In particular, we introduce an auxiliary Fourier variable \lambda^{l}_{\alpha;i} for each pre-activation constraint. We therefore find,

\displaystyle Q=\int[dW][db]\Bigg(\prod_{i\in\mathcal{M}}\prod_{l=0}^{L}\int[dz_{i}^{l}]\exp\left[-\frac{J_{D}}{2}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})\right]\prod_{l=0}^{L}\delta(z^{l}_{i}-W^{l}\phi^{l}(z^{l-1}_{i})-b^{l})\Bigg)\exp\left[-\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right)\right] (30)
\displaystyle=\int[dW][db]\Bigg(\prod_{i\in\mathcal{M}}\int[dz_{i}][d\lambda_{i}]\exp\left[-\frac{J_{D}}{2}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})\right]\prod_{l=0}^{L}\exp\left[-i\lambda_{\alpha;i}^{l}(z^{l}_{\alpha;i}-W^{l}_{\alpha\beta}\phi^{l}(z^{l-1}_{\beta;i})-b^{l}_{\alpha})\right]\Bigg)\exp\left[-\sum_{l=0}^{L}\left(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\right)\right] (31)
\displaystyle=\int[d\Omega]\exp\Bigg[-\sum_{i\in\mathcal{M}}\Bigg\{\frac{J_{D}}{2}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})+i\lambda_{\alpha;i}^{0}(z^{0}_{\alpha;i}-W^{0}_{\alpha\beta}x_{\beta;i}-b^{0}_{\alpha})+\sum_{l=1}^{L}i\lambda^{l}_{\alpha;i}(z^{l}_{\alpha;i}-W^{l}_{\alpha\beta}\phi^{l}(z_{\beta;i}^{l-1})-b^{l}_{\alpha})\Bigg\}-\sum_{l=0}^{L}\Bigg(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\Bigg)\Bigg] (32)

as required. ∎

Main Result.

Provided N_{l}\gg|\mathcal{M}|, the weights, biases, and fictitious fields can be integrated out of eq. (25) to give a stochastic process involving only the pre-activations as,

\displaystyle Q=\int[dz]\exp\left[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})-\frac{1}{2}\sum_{l=0}^{L}\left((\bm{z}^{l}_{\alpha})^{T}(\bm{\Sigma}^{l})^{-1}\bm{z}^{l}_{\alpha}+\ln|\bm{\Sigma}^{l}|\right)\right] (33)

where (\bm{z}^{l}_{\alpha})^{T}=(z^{l}_{\alpha;1},\cdots,z^{l}_{\alpha;|\mathcal{M}|}) is a vector of pre-activations corresponding to each input to the network, \bm{\Sigma}^{l}_{ij}=\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l-1}_{\alpha;i})\phi^{l}(z^{l-1}_{\alpha;j})+\sigma_{b}^{2} for l>0, and \bm{\Sigma}^{0}_{ij}=\sigma_{w}^{2}N_{0}^{-1}x_{\alpha;i}x_{\alpha;j}+\sigma_{b}^{2}; \bm{\Sigma}^{l} is the correlation matrix between activations of the network from different inputs.

Proof.

We proceed by directly completing the square and integrating out the Gaussian variables. For notational simplicity we temporarily let z^{-1}=x and take \phi^{0}(z)=z to be linear. We then integrate out the weights and biases by completing the square,

\displaystyle Q=\int[d\Omega]\exp\Bigg[-\sum_{i\in\mathcal{M}}\Bigg\{\frac{J_{D}}{2}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})+\sum_{l=0}^{L}i\lambda^{l}_{\alpha;i}(z^{l}_{\alpha;i}-W^{l}_{\alpha\beta}\phi^{l}(z_{\beta;i}^{l-1})-b^{l}_{\alpha})\Bigg\}-\sum_{l=0}^{L}\Bigg(\frac{N_{l}}{2\sigma_{w}^{2}}W^{l}_{\alpha\beta}W^{l}_{\alpha\beta}+\frac{1}{2\sigma_{b}^{2}}b^{l}_{\alpha}b^{l}_{\alpha}\Bigg)\Bigg] (34)
\displaystyle=\int[d\Omega]\exp\Bigg[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})-\sum_{l=0}^{L}\sum_{i\in\mathcal{M}}i\lambda^{l}_{\alpha;i}z^{l}_{\alpha;i}
\displaystyle\qquad-\sum_{l=0}^{L}\frac{N_{l}}{2\sigma_{w}^{2}}\left(W^{l}_{\alpha\beta}-\frac{i\sigma_{w}^{2}}{N_{l}}\sum_{i\in\mathcal{M}}\lambda^{l}_{\alpha;i}\phi^{l}(z^{l-1}_{\beta;i})\right)\left(W^{l}_{\alpha\beta}-\frac{i\sigma_{w}^{2}}{N_{l}}\sum_{i\in\mathcal{M}}\lambda^{l}_{\alpha;i}\phi^{l}(z^{l-1}_{\beta;i})\right)
\displaystyle\qquad-\sum_{l=0}^{L}\frac{1}{2\sigma_{b}^{2}}\left(b^{l}_{\alpha}-i\sigma_{b}^{2}\sum_{i\in\mathcal{M}}\lambda^{l}_{\alpha;i}\right)\left(b^{l}_{\alpha}-i\sigma_{b}^{2}\sum_{i\in\mathcal{M}}\lambda^{l}_{\alpha;i}\right)
\displaystyle\qquad-\frac{1}{2}\sum_{l=0}^{L}\sum_{i,j\in\mathcal{M}}\lambda_{\alpha;i}^{l}\lambda_{\alpha;j}^{l}(\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l-1}_{\beta;i})\phi^{l}(z^{l-1}_{\beta;j})+\sigma_{b}^{2})\Bigg] (35)
\displaystyle=\int[dz][d\lambda]\exp\Bigg[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})-\sum_{l=0}^{L}\sum_{i\in\mathcal{M}}i\lambda^{l}_{\alpha;i}z^{l}_{\alpha;i}-\frac{1}{2}\sum_{l=0}^{L}\sum_{i,j\in\mathcal{M}}\lambda_{\alpha;i}^{l}\lambda_{\alpha;j}^{l}(\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l-1}_{\beta;i})\phi^{l}(z^{l-1}_{\beta;j})+\sigma_{b}^{2})\Bigg]. (36)

Interestingly, we notice that upon integrating out the weights and biases, the pre-activations from different inputs become coupled. This is reminiscent of replica calculations in the spin glass literature.

We now rewrite the above expression to elucidate its structure. To do this we first let (\bm{\lambda}^{l}_{\alpha})^{T}=(\lambda^{l}_{\alpha;1},\lambda^{l}_{\alpha;2},\cdots,\lambda^{l}_{\alpha;|\mathcal{M}|}), (\bm{z}^{l}_{\alpha})^{T}=(z^{l}_{\alpha;1},z^{l}_{\alpha;2},\cdots,z^{l}_{\alpha;|\mathcal{M}|}), and (\bm{\phi}^{l}_{\alpha})^{T}=(\phi^{l}(z^{l-1}_{\alpha;1}),\phi^{l}(z^{l-1}_{\alpha;2}),\cdots,\phi^{l}(z^{l-1}_{\alpha;|\mathcal{M}|})). Finally we define the matrix \bm{\Sigma}^{l}=\sigma_{w}^{2}N_{l}^{-1}\bm{\phi}^{l}_{\alpha}(\bm{\phi}^{l}_{\alpha})^{T}+\bm{1}\sigma_{b}^{2} where \bm{1} is the |\mathcal{M}|\times|\mathcal{M}| matrix of ones. Using this notation we can rewrite eq. (36) as,

Q=\int[dz][d\lambda]\exp\Bigg[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})-\sum_{l=0}^{L}\left\{\frac{1}{2}(\bm{\lambda}^{l}_{\alpha})^{T}\bm{\Sigma}^{l}\bm{\lambda}^{l}_{\alpha}-i(\bm{\lambda}^{l}_{\alpha})^{T}\bm{z}^{l}_{\alpha}\right\}\Bigg]. (37)

Eq. (37) clearly has the structure of a multivariate Gaussian as a function of the \bm{\lambda}^{l}_{\alpha}. In principle it is therefore possible to integrate out the \bm{\lambda}^{l}_{\alpha}. We notice, however, that \bm{\Sigma}^{l} is an |\mathcal{M}|\times|\mathcal{M}| matrix constructed as a sum of N_{l}+1 terms each being the outer-product of a vector. It follows that the rank of \bm{\Sigma}^{l} is at most N_{l}+1. For this work we will be explicitly interested in the large N_{l} limit and so we may safely assume that \bm{\Sigma}^{l} is full-rank. However, more care must be taken when N_{l}+1\sim|\mathcal{M}|.
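For reference, the Gaussian integral used in the next step is the standard identity (up to constant factors, which we drop throughout),

\int d\bm{\lambda}\,\exp\left[-\frac{1}{2}\bm{\lambda}^{T}\bm{\Sigma}\bm{\lambda}+i\bm{\lambda}^{T}\bm{z}\right]\propto|\bm{\Sigma}|^{-1/2}\exp\left[-\frac{1}{2}\bm{z}^{T}\bm{\Sigma}^{-1}\bm{z}\right],

applied independently to each \bm{\lambda}^{l}_{\alpha}; the quadratic form in \bm{z}^{l}_{\alpha} and the determinant factors are what appear in the exponent below.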

Thus, in the case that N_{l}\gg|\mathcal{M}| we may integrate out the \bm{\lambda}^{l}_{\alpha} in the usual way to find,

Q=\int[dz]\exp\left[-\frac{J_{D}}{2}\sum_{i\in\mathcal{M}}\ell(\phi^{L+1}(z^{L}_{i}),t_{i})-\frac{1}{2}\sum_{l=0}^{L}\left((\bm{z}^{l}_{\alpha})^{T}(\bm{\Sigma}^{l})^{-1}\bm{z}^{l}_{\alpha}+\ln|\bm{\Sigma}^{l}|\right)\right] (38)

as required. ∎

Corollary 1.

In the event that the network has only a single input eq. (9) reduces to,

\displaystyle Q=\int[dz]\exp\Bigg[-\frac{1}{2}\Bigg\{J_{D}\ell(\phi^{L+1}(z^{L}),t)+\frac{z_{\alpha}^{0}z_{\alpha}^{0}}{\sigma_{w}^{2}N_{0}^{-1}x_{\beta}x_{\beta}+\sigma_{b}^{2}}+\sum_{l=1}^{L}\frac{z_{\alpha}^{l}z_{\alpha}^{l}}{\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l-1}_{\beta})\phi^{l}(z^{l-1}_{\beta})+\sigma_{b}^{2}}+\sum_{l=1}^{L}N_{l}\log(\sigma_{w}^{2}N_{l}^{-1}\phi^{l}(z^{l}_{\alpha})\phi^{l}(z^{l}_{\alpha})+\sigma_{b}^{2})\Bigg\}\Bigg]. (39)

Here we omit the sample index since it is unnecessary.

Proof.

This result follows directly from the previous result by plugging in for only a single input. ∎

8.2 Theoretical Results on Linear Stochastic Networks

Here we prove several results elucidating the behavior of the linear stochastic network on a ring. We will begin with the full partition function for the linear stochastic network,

Q=\int[dz]\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left\{\frac{z^{l}_{\alpha}z^{l}_{\alpha}}{\sigma_{w}^{2}N^{-1}z^{l-1}_{\beta}z^{l-1}_{\beta}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}z^{l}_{\alpha}z^{l}_{\alpha}+\sigma_{b}^{2})\right\}\right]. (40)

Our first result concerns the change of variables into hyperspherical coordinates. We will denote the radius to be r^{l}.

Result 2.

The energy for the stochastic linear network on a ring can be changed into hyperspherical coordinates. The resulting lattice model is described by the energy,

\mathcal{L}(\{r^{l}\})=\frac{1}{2}\sum_{l=0}^{L}\left\{\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1})^{2}+\sigma_{b}^{2}}-N\log\left(\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l})^{2}+\sigma_{b}^{2}}\right)\right\} (41)

where r^{l} is the norm of the pre-activation in layer l.

Proof.

We proceed by simply making the change of variables in eq. (40). Since the integrand is isotropic, we denote the integral over the angles in layer l by d\Omega^{l}. However, we note that the angular integrals change the partition function by at most a constant factor and so may be discarded.

\displaystyle Q=\int[dz]\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left\{\frac{z^{l}_{\alpha}z^{l}_{\alpha}}{\sigma_{w}^{2}N^{-1}z^{l-1}_{\beta}z^{l-1}_{\beta}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}z^{l}_{\alpha}z^{l}_{\alpha}+\sigma_{b}^{2})\right\}\right] (42)
\displaystyle=\prod_{l=0}^{L}\int[dz^{l}]\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left\{\frac{z^{l}_{\alpha}z^{l}_{\alpha}}{\sigma_{w}^{2}N^{-1}z^{l-1}_{\beta}z^{l-1}_{\beta}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}z^{l}_{\alpha}z^{l}_{\alpha}+\sigma_{b}^{2})\right\}\right] (43)
\displaystyle=\prod_{l=0}^{L}\int dr^{l}d\Omega^{l}\,(r^{l})^{N-1}\exp\Bigg[-\frac{1}{2}\sum_{l=0}^{L}\Bigg\{\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l})^{2}+\sigma_{b}^{2})\Bigg\}\Bigg] (44)
\displaystyle\approx\int[dr]\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left\{\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1})^{2}+\sigma_{b}^{2}}-N\log\left(\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l})^{2}+\sigma_{b}^{2}}\right)\right\}\right]. (45)

The definition of the energy follows immediately. Here we replace N-1 by N for convenience since typically N\gg 1. ∎

We now discuss the saddle point approximation to eq. (45). We begin by noting that when \sigma_{w}^{2}<1, eq. (41) is minimized by a uniform arrangement of spins, r^{l}=r^{*} independent of layer.

Result 3.

When \sigma_{w}^{2}<1, there exists a constant configuration of spins, with r^{l}=r^{*} independent of layer, that minimizes the energy for the stochastic neural network on a ring, given by eq (41). The constant solution is given by,

r^{*}=\sqrt{\frac{N\sigma_{b}^{2}}{\Delta_{w}}} (46)

where \Delta_{w}=1-\sigma_{w}^{2}.

Proof.

When r^{l}=r^{*} independent of layer, eq. (41) will be given by,

\mathcal{L}(r^{*})=\frac{L}{2}\left[\frac{(r^{*})^{2}}{\sigma_{w}^{2}N^{-1}(r^{*})^{2}+\sigma_{b}^{2}}-N\log\left(\frac{(r^{*})^{2}}{\sigma_{w}^{2}N^{-1}(r^{*})^{2}+\sigma_{b}^{2}}\right)\right]. (47)

Note that this equation has the form x-N\log x which has a minimum when x=N. It follows that eq. (47) will have a minimum precisely when

(r^{*})^{2}=\sigma_{w}^{2}(r^{*})^{2}+N\sigma_{b}^{2} (48)

as required. ∎

Next we can expand eq. (41) in small nonuniform fluctuations about r^{*}.

Result 4.

Small fluctuations about eq. (46), given by r^{l}=r^{*}+\epsilon^{l}, are governed by the energy,

\mathcal{L}(\{r^{*}+\epsilon^{l}\})=\mathcal{L}(r^{*})+\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left[(1+\sigma_{w}^{4})(\epsilon^{l})^{2}-2\sigma_{w}^{2}\epsilon^{l}\epsilon^{l-1}\right]+\mathcal{O}(\epsilon^{4}). (49)

Proof.

We consider the cost of small fluctuations about the constant solution so that r^{l}=r^{*}+\epsilon^{l} where \epsilon^{l}\ll r^{*}. For notational simplicity we write D=\sigma_{w}^{2}N^{-1}(r^{*})^{2}+\sigma_{b}^{2} and \alpha=\sigma_{w}^{2}N^{-1}. We then expand the perturbation to the energy to quadratic order about r^{*} to find,

\displaystyle\mathcal{L}(\{r^{*}+\epsilon^{l}\})=\sum_{l=0}^{L}\left(\frac{1}{2}\frac{(r^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1})^{2}+\sigma_{b}^{2}}-N\log r^{l}+\frac{N}{2}\log(\sigma_{w}^{2}N^{-1}(r^{l})^{2}+\sigma_{b}^{2})\right) (50)
\displaystyle=\sum_{l=0}^{L}\Bigg\{\frac{1}{2}\frac{(r^{*})^{2}+2r^{*}\epsilon^{l}+(\epsilon^{l})^{2}}{\sigma_{w}^{2}N^{-1}(r^{*})^{2}+\sigma_{b}^{2}+\sigma_{w}^{2}N^{-1}\epsilon^{l-1}(2r^{*}+\epsilon^{l-1})}-N\log r^{*}-N\log\left(1+\frac{\epsilon^{l}}{r^{*}}\right)+\frac{N}{2}\log\left(\sigma_{w}^{2}N^{-1}(r^{*})^{2}+\sigma_{b}^{2}\right)+\frac{N}{2}\log\left(1+\frac{\sigma_{w}^{2}N^{-1}\epsilon^{l}(2r^{*}+\epsilon^{l})}{\sigma_{w}^{2}N^{-1}(r^{*})^{2}+\sigma_{b}^{2}}\right)\Bigg\} (51)
\displaystyle=\mathcal{L}(r^{*})+\sum_{l=0}^{L}\Bigg\{-\frac{\alpha}{D^{2}}(r^{*})^{3}\epsilon^{l-1}+\left(\frac{r^{*}}{D}-\frac{N}{r^{*}}+N\frac{\alpha}{D}r^{*}\right)\epsilon^{l}+\frac{\alpha(r^{*})^{2}}{2D^{2}}\left(\frac{4\alpha}{D}(r^{*})^{2}-1\right)(\epsilon^{l-1})^{2}-2\frac{\alpha(r^{*})^{2}}{D^{2}}\epsilon^{l}\epsilon^{l-1}+\left(\frac{1}{2D}+\frac{N}{2(r^{*})^{2}}+\frac{N\alpha}{2D}-\frac{N\alpha^{2}}{D^{2}}(r^{*})^{2}\right)(\epsilon^{l})^{2}\Bigg\}. (52)

Next we substitute in for r^{*} noting that D=(r^{*})^{2}/N. It follows that,

\displaystyle\mathcal{L}(\{r^{*}+\epsilon^{l}\})-\mathcal{L}(r^{*}) \displaystyle=\sum_{l=0}^{L}\Bigg\{-\frac{\alpha N^{2}}{r^{*}}\epsilon^{l-1}+\frac{\alpha N^{2}}{r^{*}}\epsilon^{l}+\frac{\alpha N^{2}}{2(r^{*})^{2}}(4N\alpha-1)(\epsilon^{l-1})^{2}-2\frac{\alpha N^{2}}{(r^{*})^{2}}\epsilon^{l}\epsilon^{l-1}+\left(\frac{N}{(r^{*})^{2}}+\frac{N^{2}\alpha}{2(r^{*})^{2}}-\frac{N^{3}\alpha^{2}}{(r^{*})^{2}}\right)(\epsilon^{l})^{2}\Bigg\}. (53)

We note that each term in the sum appears twice, once from the l term and once from the l+1 term, with the exception of \epsilon^{l}\epsilon^{l-1}. We may therefore reorganize the sum to symmetrize the different pieces, and in doing so the terms linear in \epsilon^{l} cancel. Substituting in for r^{*} we find,

\displaystyle\mathcal{L}(\{r^{*}+\epsilon^{l}\})-\mathcal{L}(r^{*}) \displaystyle=\sum_{l=0}^{L}\Bigg\{\left(\frac{N}{(r^{*})^{2}}+\frac{N^{3}\alpha^{2}}{(r^{*})^{2}}\right)(\epsilon^{l})^{2}-2\frac{\alpha N^{2}}{(r^{*})^{2}}\epsilon^{l}\epsilon^{l-1}\Bigg\} (54)
\displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left[(1+\sigma_{w}^{4})(\epsilon^{l})^{2}-2\sigma_{w}^{2}\epsilon^{l}\epsilon^{l-1}\right] (55)

as required. ∎
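A direct numerical check of this quadratic form, not included in the original text, is sketched below: evaluate the exact energy of eq. (50) on a small ring for a random small perturbation about r^{*} and compare it with the prediction of eq. (49). Parameter values are arbitrary illustrative choices.

# Compare the exact energy difference, eq. (50), with the quadratic form of eq. (49).
import numpy as np

N, sigma_w2, sigma_b2 = 200, 0.9, 0.01
Lp = 64                                        # number of layers on the ring
r_star = np.sqrt(N * sigma_b2 / (1.0 - sigma_w2))

def energy(r):
    r_prev = np.roll(r, 1)                     # periodic (ring) coupling to the previous layer
    D_prev = sigma_w2 * r_prev ** 2 / N + sigma_b2
    D = sigma_w2 * r ** 2 / N + sigma_b2
    return np.sum(0.5 * r ** 2 / D_prev - N * np.log(r) + 0.5 * N * np.log(D))

rng = np.random.default_rng(0)
eps = 1e-3 * rng.normal(size=Lp)               # small fluctuations epsilon^l
exact = energy(r_star + eps) - energy(r_star * np.ones(Lp))
quad = (1.0 - sigma_w2) / sigma_b2 * np.sum(
    (1.0 + sigma_w2 ** 2) * eps ** 2 - 2.0 * sigma_w2 * eps * np.roll(eps, 1)
)
print(exact, quad)                             # should agree to leading (quadratic) order in eps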

Examining eq. (49) we note, among other things, that as \sigma_{w}^{2}\to 1 the cost of fluctuations goes to zero. We have thus constructed a linear field theory for small fluctuations in the stochastic linear network for \sigma_{w}^{2}<1. If desired, one could continue the expansion to higher order; this would give perturbative corrections to the linear theory that we expect to be \mathcal{O}(N^{-1}), and could be used to study the effect of finite width networks.

Next we show that eq. (49) can be diagonalized by switching to the Fourier basis. Because our network is topologically equivalent to a ring, we can always expand \epsilon^{l} in a Fourier series to get,

\epsilon^{l}=\sum_{q}\epsilon_{q}e^{-iql}. (56)

Since \epsilon^{L+1}=\epsilon^{0} it follows that q=2n\pi/(L+1) for n\in\mathbb{Z}. The depth of the network therefore sets the spacing of the allowed wavevectors, and hence the lowest nonzero frequency of fluctuation that we are able to observe.

Result 5.

Replacing \epsilon^{l} in eq. (49) by its Fourier series we get an energy,

\mathcal{L}(\{r^{*}+\epsilon_{q}\})-\mathcal{L}(r^{*})=\frac{L\Delta_{w}}{\sigma_{b}^{2}}\sum_{q}\left\{(1+\sigma_{w}^{4})-2\sigma_{w}^{2}\cos q\right\}|\epsilon_{q}|^{2}. (57)

The associated probability distribution is a factorial Gaussian. It follows that the different Fourier modes of fluctuations in a deep linear network behave as uncoupled Gaussian random variables.

Proof.

Substituting the Fourier series into eq. (49), we may rewrite it as,

\displaystyle\mathcal{L}(\{r^{*}+\epsilon^{l}\})-\mathcal{L}(r^{*}) \displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left((1+\sigma_{w}^{4})(\epsilon^{l})^{2}-2\sigma_{w}^{2}\epsilon^{l}\epsilon^{l-1}\right) (58)
\displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\sum_{qq^{\prime}}\left((1+\sigma_{w}^{4})\epsilon_{q}\epsilon_{q^{\prime}}e^{-i(q+q^{\prime})l}-2\sigma_{w}^{2}\epsilon_{q}\epsilon_{q^{\prime}}e^{-i(q+q^{\prime})l}e^{iq}\right) (59)
\displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{qq^{\prime}}\left((1+\sigma_{w}^{4})\epsilon_{q}\epsilon_{q^{\prime}}-2\sigma_{w}^{2}\epsilon_{q}\epsilon_{q^{\prime}}e^{iq}\right)\sum_{l=0}^{L}e^{-i(q+q^{\prime})l} (60)
\displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{qq^{\prime}}\left((1+\sigma_{w}^{4})\epsilon_{q}\epsilon_{q^{\prime}}-2\sigma_{w}^{2}\epsilon_{q}\epsilon_{q^{\prime}}e^{iq}\right)L\delta_{q,-q^{\prime}} (61)
\displaystyle=\frac{L\Delta_{w}}{\sigma_{b}^{2}}\sum_{q}\left((1+\sigma_{w}^{4})\epsilon_{q}\epsilon_{-q}-2\sigma_{w}^{2}\epsilon_{q}\epsilon_{-q}e^{iq}\right) (62)

where we have used the exponential representation of the \delta-function,

\sum_{l}e^{-iql}=L\delta_{q,0}. (63)

Finally note that since \epsilon^{l} is real it must be true that \epsilon_{-q}=\epsilon_{q}^{\dagger}. It follows that,

\displaystyle\mathcal{L}(\{r^{*}+\epsilon^{l}\})-\mathcal{L}(r^{*}) \displaystyle=\frac{L\Delta_{w}}{\sigma_{b}^{2}}\sum_{q}\left((1+\sigma_{w}^{4})\epsilon_{q}\epsilon_{-q}-2\sigma_{w}^{2}\epsilon_{q}\epsilon_{-q}e^{iq}\right) (64)
\displaystyle=\frac{L\Delta_{w}}{\sigma_{b}^{2}}\sum_{q}\left((1+\sigma_{w}^{4})-2\sigma_{w}^{2}e^{iq}\right)|\epsilon_{q}|^{2} (65)
\displaystyle=\frac{L\Delta_{w}}{\sigma_{b}^{2}}\sum_{q}\left((1+\sigma_{w}^{4})-2\sigma_{w}^{2}\cos q\right)|\epsilon_{q}|^{2} (66)

where in the last step we have rearranged the sum to pair each mode with its complex conjugate. ∎
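To make the statement concrete, the sketch below draws independent Gaussian Fourier modes whose variance follows the q-dependence of eq. (57), and inverse transforms them to obtain a real-space fluctuation profile. The overall amplitude is left as a free constant and is not meant to reproduce the exact prefactor of eq. (57).

# Sample a real-space fluctuation profile epsilon^l from uncoupled Fourier modes,
# using the q-dependence of eq. (57). `amp` is an arbitrary overall normalization.
import numpy as np

L, sigma_w2, amp = 512, 0.9, 1.0
q = 2.0 * np.pi * np.arange(L // 2 + 1) / L                  # non-negative wavevectors
stiffness = (1.0 + sigma_w2 ** 2) - 2.0 * sigma_w2 * np.cos(q)

rng = np.random.default_rng(0)
modes = rng.normal(size=q.shape) + 1j * rng.normal(size=q.shape)
modes *= np.sqrt(amp / stiffness)
modes[0] = 0.0                                               # drop the uniform (q = 0) shift of the ring
epsilon = np.fft.irfft(modes, n=L)                           # irfft enforces epsilon_{-q} = conj(epsilon_q)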

The final theoretical result for this section shows that the long-distance behavior of the linear stochastic network can be well described by an effective field theory.

Result 6.

Long range fluctuations of the stochastic linear network (i.e. fluctuations in which \epsilon^{l} varies slowly on the scale of one layer) are governed by the effective field theory defined by the energy,

U[\epsilon(x)]=\frac{\Delta_{w}}{\sigma_{b}^{2}}\int dx\left[\Delta_{w}^{2}(\epsilon(x))^{2}+\sigma_{w}^{2}\left(\frac{\partial\epsilon(x)}{\partial x}\right)^{2}\right]. (67)
Proof.

Note that we can rewrite eq. (49) as,

\displaystyle U(\{\epsilon^{l}\}) \displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left[(1+\sigma_{w}^{4})(\epsilon^{l})^{2}-2\sigma_{w}^{2}\epsilon^{l}\epsilon^{l-1}\right] (68)
\displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left[(1-2\sigma_{w}^{2}+\sigma_{w}^{4})(\epsilon^{l})^{2}+\sigma_{w}^{2}(\epsilon^{l}-\epsilon^{l-1})^{2}\right] (69)
\displaystyle=\frac{\Delta_{w}}{\sigma_{b}^{2}}\sum_{l=0}^{L}\left[\Delta_{w}^{2}(\epsilon^{l})^{2}+\sigma_{w}^{2}(\epsilon^{l}-\epsilon^{l-1})^{2}\right]. (70)

Let us now suggestively write \epsilon(l)=\epsilon^{l}. If \epsilon^{l} is varying slowly on the scale of individual layers and further if L\gg 1 then we can approximate,

\frac{\epsilon^{l}-\epsilon^{l-1}}{1}\approx\frac{\partial\epsilon(l)}{\partial l}. (71)

We can additionally interpret the sum over layers as a Riemann sum. This yields the effective field theory for long-wavelength fluctuations,

U[\epsilon(x)]\approx\frac{\Delta_{w}}{\sigma_{b}^{2}}\int dx\left[\Delta_{w}^{2}\epsilon^{2}(x)+\sigma_{w}^{2}\left(\frac{\partial\epsilon}{\partial x}\right)^{2}\right] (72)

with the replacement l\to x. ∎
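As a consistency check, which is not spelled out above, the same quadratic form can be read off from the Fourier-space energy: expanding \cos q\approx 1-q^{2}/2 in eq. (57) gives

(1+\sigma_{w}^{4})-2\sigma_{w}^{2}\cos q\approx(1-\sigma_{w}^{2})^{2}+\sigma_{w}^{2}q^{2}=\Delta_{w}^{2}+\sigma_{w}^{2}q^{2},

which matches the quadratic form appearing in eq. (72), up to the normalization conventions relating the discrete sum over layers to the continuum integral.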

8.3 Numerical Results on Linear Stochastic Networks

We now provide a more detailed description of the numerical methods discussed in the main text. We would like to sample the mean pre-activation and the fluctuations about that mean for deep and wide linear stochastic networks on a ring. In practice, however, it is easier to use a linear (non-ring) topology and to study the “bulk” mean and fluctuations of the pre-activations after any transient from the input to the network has decayed. Throughout we consider constant-width networks with N=200 and L=1024.

To generate a single sample of pre-activations from the ensemble of linear stochastic networks with J_{D}=0, we randomly initialize the weights and biases according to W_{ij}^{l}\sim\mathcal{N}(0,\sigma_{w}^{2}/N) and b^{l}_{i}\sim\mathcal{N}(0,\sigma_{b}^{2}). We then feed a random input into the network and record the norm of the pre-activations after each layer. By repeating this process M=200 times we obtain Monte Carlo samples from the ensemble of pre-activations of linear stochastic neural networks.
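A minimal sketch of this sampling procedure, using numpy, is given below. The scale of the random input vector is an arbitrary choice, and the loop is written for clarity rather than speed (reduce M or L for a quick test).

# Monte Carlo sampling of pre-activation norms for deep linear stochastic networks
# with J_D = 0, following the procedure described above.
import numpy as np

N, L, M = 200, 1024, 200              # width, depth, number of instantiations
sigma_w2, sigma_b2 = 0.9, 0.01

rng = np.random.default_rng(0)
norms = np.zeros((M, L))              # norms[m, l] = |z^l| for instantiation m

for m in range(M):
    z = rng.normal(size=N)            # a random input (its scale is arbitrary here)
    for l in range(L):
        W = rng.normal(scale=np.sqrt(sigma_w2 / N), size=(N, N))   # W_ij ~ N(0, sigma_w^2 / N)
        b = rng.normal(scale=np.sqrt(sigma_b2), size=N)            # b_i  ~ N(0, sigma_b^2)
        z = W @ z + b                 # linear network: no nonlinearity applied
        norms[m, l] = np.linalg.norm(z)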

Figure 7: Samples from an L=1024 layer stochastic linear neural network with J_{D}=0, N=200, \sigma_{b}^{2}=0.01, and \sigma_{w}^{2}=0.9. (a) Real space values of the norm as it travels through different instantiations of the network. We notice that early in the network there is a transient signal after which the pre-activations reach their “bulk” behaviour. We separate the signal propagation into two regions: a transient region and an analysis region. (b) The Fast Fourier Transform of the pre-activations in the analysis region of (a) for different instantiations of the network.

Fig. 7 (a) shows the norm of the pre-activations at different layers of a stochastic linear network for different random instantiations of the weights. This plot therefore shows different samples from the ensemble of stochastic linear networks. We notice that there is a transient effect of the input that lasts for around 100 layers. To perform our analysis and make a correspondence with the stochastic network on a ring, we would like to consider only the “bulk” behaviour of the network. To this end we divide the trajectory of the norm of the pre-activations into two halves and study only the half from layer 512 to layer 1024. We anticipate that for all values of \sigma_{w}^{2} studied the transient will have decayed by this point.

In Fig. 7 (b) we show the Fast Fourier Transform (FFT) of the fluctuations of the norm of the pre-activations about the theoretical mean, for each instantiation of the network. As with the real space fluctuations, the Fourier modes are themselves stochastic, and there is a clear change in the variance of the modes as a function of wavevector.
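The analysis of the bulk region can be sketched as follows, assuming the `norms` array and the parameters from the sampling sketch above. Fluctuations are measured about the theoretical mean r^{*} of eq. (46), and the per-mode variance is estimated by averaging |\epsilon_{q}|^{2} over instantiations.

# FFT analysis of the bulk fluctuations of the norm about r* (eq. (46)).
# Assumes `norms`, N, L, sigma_w2, sigma_b2 from the sampling sketch above.
import numpy as np

r_star = np.sqrt(N * sigma_b2 / (1.0 - sigma_w2))   # theoretical mean, eq. (46)
bulk = norms[:, L // 2:]                            # analysis region: layers 512 to 1024
eps = bulk - r_star                                 # real-space fluctuations epsilon^l

eps_q = np.fft.rfft(eps, axis=1)                    # Fourier modes per instantiation
q = 2.0 * np.pi * np.arange(eps_q.shape[1]) / bulk.shape[1]
mean_q = eps_q.mean(axis=0)                         # should be consistent with zero (fig. 9a)
var_q = np.mean(np.abs(eps_q) ** 2, axis=0)         # per-mode variance (fig. 9b)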

Figure 8: The distribution of small fluctuations, \epsilon_{q}. Different curves represent different values of q.

Fig. 8 shows histograms of \epsilon_{q} for different values of q.

Figure 9: Statistics of small fluctuations. (a) Shows the expected value of \epsilon_{q} and (b) shows the variance, \langle\epsilon_{q}^{\dagger}\epsilon_{q}\rangle.

Fig. 9 (a) shows the mean of the distribution of \epsilon_{q} as a function of wavevector. It is clear that \epsilon_{q} has mean zero everywhere but the uncertainty in the measurement of the mean increases as q\to 0. In fig. 9 (b) we finally plot the variance of small fluctuations. This is what is plotted against the theoretical prediction in the main text.

8.4 Theoretical Results on Rectified Linear Stochastic Networks

Next we discuss the case of a rectified linear network. For rectified linear units we begin by noting that any vector z^{l} can be decomposed into its positive and negative components as z^{l}=z^{l}_{+}+z^{l}_{-}. With this decomposition in mind we can write the squared norm of z^{l} as |z^{l}|^{2}=|z^{l}_{-}|^{2}+|z^{l}_{+}|^{2}. The partition function can therefore be written as

Q=\int[dz]\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{|z_{+}^{l}|^{2}+|z^{l}_{-}|^{2}}{\sigma_{w}^{2}N^{-1}|z^{l-1}_{+}|^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}|z^{l}_{+}|^{2}+\sigma_{b}^{2})\right)\right]. (73)
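A quick numerical illustration of this decomposition, which is not part of the derivation, is sketched below: for a Gaussian stand-in pre-activation vector the squared norm splits exactly into its positive and negative parts, and the number of negative components is binomially distributed, concentrating near N/2 for large N.

# Illustration of the decomposition z = z_+ + z_- used for rectified linear networks.
import numpy as np

N = 200
rng = np.random.default_rng(0)
z = rng.normal(size=N)                    # a stand-in Gaussian pre-activation vector

z_plus = np.where(z > 0, z, 0.0)          # positive components of z
z_minus = np.where(z < 0, z, 0.0)         # negative components of z
k = int(np.sum(z < 0))                    # number of negative components, ~ Binomial(N, 1/2)

assert np.allclose(z, z_plus + z_minus)
assert np.isclose(np.sum(z ** 2), np.sum(z_plus ** 2) + np.sum(z_minus ** 2))
print(k, N / 2)                           # k concentrates near N / 2 as N grows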

We wish to perform a change of variables into hyperspherical coordinates as before. Unlike in the linear case, here we must be more careful since the probability distribution is anisotropic. This leads us to our first result.

Result 7.

The partition function for rectified linear stochastic networks can be written as,

\displaystyle Q \displaystyle=2\left(\frac{\sqrt{\pi}}{2}\right)^{N}\prod_{l}\sum_{k_{l}=0}^{N}{N\choose k_{l}}\frac{1}{\Gamma\left(\frac{N-k_{l}}{2}\right)\Gamma\left(\frac{k_{l}}{2}\right)}\int dr^{l}_{+}dr^{l}_{-}(r^{l}_{+})^{N-k_{l}-1}(r^{l}_{-})^{k_{l}-1}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})\right)\right] (74)

by making a hyperspherical coordinate transformation in z^{l}_{+} and z^{l}_{-} separately. Here r_{+}^{l} is the norm of the positive components of the pre-activations in layer l, r_{-}^{l} is the norm of the negative components, and k_{l} is the number of components of the pre-activations that are negative (so that N-k_{l} components are positive).

Proof.

The partition function in eq (73) can be decomposed into a sum over orthants for each layer separately as,

\displaystyle Q \displaystyle=\int\prod_{l}[dz^{l}]\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{|z_{+}^{l}|^{2}+|z^{l}_{-}|^{2}}{\sigma_{w}^{2}N^{-1}|z^{l-1}_{+}|^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}|z^{l}_{+}|^{2}+\sigma_{b}^{2})\right)\right]
\displaystyle=\frac{1}{2^{N}}\prod_{l}\sum_{k_{l}=0}^{N}\int[dz^{l}_{+}][dz^{l}_{-}]{N\choose k_{l}}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{|z_{+}^{l}|^{2}+|z^{l}_{-}|^{2}}{\sigma_{w}^{2}N^{-1}|z^{l-1}_{+}|^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}|z^{l}_{+}|^{2}+\sigma_{b}^{2})\right)\right] (75)

where {N\choose k_{l}} counts the number of orthants with k_{l} negative components. In the above, z^{l}_{+} is integrated over \mathbb{R}^{N-k_{l}} and z^{l}_{-} is integrated over \mathbb{R}^{k_{l}}.

In each orthant the integrand is spherically symmetric in z^{l}_{+} and z^{l}_{-} separately. We may therefore make two changes of variables into spherical coordinates, one for z^{l}_{+} and one for z^{l}_{-}.

\displaystyle Q \displaystyle=\frac{1}{2^{N}}\prod_{l}\sum_{k_{l}=0}^{N}\int[dz^{l}_{+}][dz^{l}_{-}]{N\choose k_{l}}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{|z_{+}^{l}|^{2}+|z^{l}_{-}|^{2}}{\sigma_{w}^{2}N^{-1}|z^{l-1}_{+}|^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}|z^{l}_{+}|^{2}+\sigma_{b}^{2})\right)\right] (76)
\displaystyle=\frac{1}{2^{N}}\prod_{l}\sum_{k_{l}=0}^{N}{N\choose k_{l}}\int dr^{l}_{+}dr^{l}_{-}d\Omega_{+}d\Omega_{-}(r^{l}_{+})^{N-k_{l}-1}(r^{l}_{-})^{k_{l}-1}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{|z_{+}^{l}|^{2}+|z^{l}_{-}|^{2}}{\sigma_{w}^{2}N^{-1}|z^{l-1}_{+}|^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}|z^{l}_{+}|^{2}+\sigma_{b}^{2})\right)\right] (77)
\displaystyle=2\left(\frac{\sqrt{\pi}}{2}\right)^{N}\prod_{l}\sum_{k_{l}=0}^{N}{N\choose k_{l}}\frac{1}{\Gamma\left(\frac{N-k_{l}}{2}\right)\Gamma\left(\frac{k_{l}}{2}\right)}\int dr^{l}_{+}dr^{l}_{-}(r^{l}_{+})^{N-k_{l}-1}(r^{l}_{-})^{k_{l}-1}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})\right)\right] (78)

where d\Omega_{+} and d\Omega_{-} are angular integrals over the positive and negative components respectively. In the final step we have integrated over the angular piece explicitly to give a volume factor which, crucially, depends on k_{l}. ∎

We now consider the N\to\infty limit of eq. (74). Unlike in the linear case, we must take some care when applying the saddle point approximation here. In particular, we would like to first get the partition function into a form that is more amenable to analysis. Our first step will therefore be to construct a continuum approximation for k_{l}.

Result 8.

As N\to\infty the sum over orthants in eq. (74) can be converted to an integral and the \Gamma functions can be approximated to give,

\displaystyle Q \displaystyle=\int[dk_{l}][dr_{+}^{l}][dr_{-}^{l}]\exp\Bigg[-\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})+(N-k_{l})(3\log(N-k_{l})-2\log r_{+}^{l})+k_{l}(3\log k_{l}-2\log r_{-}^{l})\Bigg)\Bigg] (79)

where k_{l} is now a continuously varying field.

Proof.

In the large N limit we first note that the sum over orthants concentrates about k_{l}=N/2. Moreover, the product of binomial coefficients and \Gamma functions can be approximated using Stirling's approximation. We therefore aim to approximate the sum by an integral in the large N limit. To do this, first define \Delta k=(2/\sqrt{\pi})^{N}, noting that \Delta k\to 0 as N\to\infty. In the large N limit the sum becomes a Riemann sum and so we may write (taking liberties to add or subtract 1 where convenient),

\displaystyle Q \displaystyle\approx 2\prod_{l}\int dk_{l}\frac{\Gamma(N+1)}{\Gamma(N-k_{l}+1)\Gamma(k_{l}+1)\Gamma\left(\frac{N-k_{l}}{2}\right)\Gamma\left(\frac{k_{l}}{2}\right)}\int dr^{l}_{+}dr^{l}_{-}(r^{l}_{+})^{N-k_{l}-1}(r^{l}_{-})^{k_{l}-1}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})\right)\right] (80)
\displaystyle\approx\prod_{l}\int dk_{l}\frac{1}{\Gamma(N-k_{l}+1)\Gamma(k_{l}+1)\Gamma\left(\frac{N-k_{l}}{2}+1\right)\Gamma\left(\frac{k_{l}}{2}+1\right)}\int dr^{l}_{+}dr^{l}_{-}(r^{l}_{+})^{N-k_{l}}(r^{l}_{-})^{k_{l}}\exp\left[-\frac{1}{2}\sum_{l=0}^{L}\left(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})\right)\right] (81)
\displaystyle\approx\int[dk_{l}][dr_{+}^{l}][dr_{-}^{l}]\left(\frac{2^{1/3}e}{N-k_{l}}\right)^{\frac{3}{2}(N-k_{l})}\left(\frac{2^{1/3}e}{k_{l}}\right)^{\frac{3}{2}k_{l}}\exp\Bigg[-\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})-2(N-k_{l})\log r_{+}^{l}-2k_{l}\log r_{-}^{l}\Bigg)\Bigg] (82)
\displaystyle=\int[dk_{l}][dr_{+}^{l}][dr_{-}^{l}]\exp\Bigg[-\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})+(N-k_{l})(3\log(N-k_{l})-2\log r_{+}^{l})+k_{l}(3\log k_{l}-2\log r_{-}^{l})\Bigg)\Bigg]. (83)

Note that we have neglected the sub-leading constant factor in Stirling's approximation. We see that the anisotropy of the rectified linear unit leaves us with three fields that must be dealt with. ∎

We now compute the saddle point approximation to eq. (79). As in the case of the stochastic linear network we begin by assuming the existence of a uniform solution with r_{+}=r^{l}_{+}, r_{-}=r^{l}_{-}, and k=k_{l}. This leads us to our next result.

Result 9.

For \sigma_{w}^{2}<2 rectified linear stochastic networks have a uniform configuration of fields that minimizes the energy. This minimum configuration has,

r_{+/-}=\sqrt{\frac{N\sigma_{b}^{2}}{2(1-\sigma_{w}^{2}/2)}}\qquad k=\frac{N}{2}. (84)
Proof.

In the case of the rectified linear network we notice that the energy function will be,

\displaystyle\mathcal{L}=\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{(r_{+}^{l})^{2}+(r^{l}_{-})^{2}}{\sigma_{w}^{2}N^{-1}(r^{l-1}_{+})^{2}+\sigma_{b}^{2}}+N\log(\sigma_{w}^{2}N^{-1}(r^{l}_{+})^{2}+\sigma_{b}^{2})+(N-k_{l})(3\log(N-k_{l})-2\log r_{+}^{l})+k_{l}(3\log k_{l}-2\log r_{-}^{l})\Bigg). (85)

Given the ansatz of a constant solution, we seek a minimum satisfying the equations,

\displaystyle\frac{\partial\mathcal{L}}{\partial r_{+}}=\frac{(1+\sigma_{w}^{2})r_{+}}{\sigma_{w}^{2}N^{-1}r_{+}^{2}+\sigma_{b}^{2}}-\frac{r_{+}^{2}+r_{-}^{2}}{(\sigma_{w}^{2}N^{-1}r_{+}^{2}+\sigma_{b}^{2})^{2}}\sigma_{w}^{2}N^{-1}r_{+}-\frac{N-k}{r_{+}}=0 (86)
\displaystyle\frac{\partial\mathcal{L}}{\partial r_{-}}=\frac{r_{-}}{\sigma_{w}^{2}N^{-1}r_{+}^{2}+\sigma_{b}^{2}}-\frac{k}{r_{-}}=0 (87)
\displaystyle\frac{\partial\mathcal{L}}{\partial k}=\log r_{+}-\log r_{-}-\frac{3}{2}\log(N-k)+\frac{3}{2}\log k=0. (88)

These equations can be solved straightforwardly to give the required result. While the extremum of the saddle point approximation is qualitatively identical to that of the linear network, we expect fluctuations in this case to be quite different, first and foremost because we now have three fields instead of a single field. Fluctuations in these three directions interact in interesting and measurable ways. ∎
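The claimed solution can also be checked numerically by evaluating the three stationarity conditions, eqs. (86)-(88), at r_{+}=r_{-} given by eq. (84) and k=N/2. The sketch below uses arbitrary parameter values with \sigma_{w}^{2}<2.

# Numerical check that the uniform configuration of eq. (84) satisfies eqs. (86)-(88).
import numpy as np

N, sigma_w2, sigma_b2 = 200, 1.2, 0.01               # arbitrary values with sigma_w^2 < 2
r = np.sqrt(N * sigma_b2 / (2.0 * (1.0 - sigma_w2 / 2.0)))   # r_+ = r_- at the saddle, eq. (84)
k = N / 2.0
D = sigma_w2 * r ** 2 / N + sigma_b2                 # the denominator sigma_w^2 r_+^2 / N + sigma_b^2

d_r_plus = (1.0 + sigma_w2) * r / D - (2.0 * r ** 2) * sigma_w2 * r / (N * D ** 2) - (N - k) / r
d_r_minus = r / D - k / r
d_k = np.log(r) - np.log(r) - 1.5 * np.log(N - k) + 1.5 * np.log(k)   # vanishes since r_+ = r_- and k = N/2

print(d_r_plus, d_r_minus, d_k)                      # all three should be numerically zero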

Next we compute fluctuations about the saddle point solution. To do this, as before, we make the change of variables k_{l}=k+\epsilon_{k}^{l}, r_{+}^{l}=r+\epsilon_{+}^{l}, and r_{-}^{l}=r+\epsilon_{-}^{l}. This leads us to our main result for rectified linear networks.

Result 10.

Small fluctuations of r^{l}_{+/-} and k^{l} about the saddle point solution are described by the energy in excess of the energy of the constant solution

U=\mathcal{L}(\{r_{+}+\epsilon^{l}_{+}\},\{r_{-}+\epsilon^{l}_{-}\},\{k+\epsilon^{l}_{k}\})-\mathcal{L}(r_{+},r_{-},k), (89)

which may be expanded to quadratic order to give,

U=\frac{1}{2}\sum_{l=0}^{L}\Bigg((1+\sigma_{w}^{4}/2)(\tilde{\epsilon}_{+}^{l})^{2}+(\tilde{\epsilon}_{-}^{l})^{2}+3(\tilde{\epsilon}_{k}^{l})^{2}+\tilde{\epsilon}_{k}^{l}(\tilde{\epsilon}_{+}^{l}-\tilde{\epsilon}_{-}^{l})-\sigma_{w}^{2}\tilde{\epsilon}_{+}^{l-1}(\tilde{\epsilon}_{+}^{l}+\tilde{\epsilon}_{-}^{l})\Bigg) (90)

where \tilde{\epsilon}^{l}_{+/-}=\sqrt{2(1-\sigma_{w}^{2}/2)/\sigma_{b}^{2}}\,\epsilon^{l}_{+/-} and \tilde{\epsilon}_{k}^{l}=\epsilon^{l}_{k}/\sqrt{N}.

Proof.

Before we begin we define r=r_{+/-},

\sigma_{w}^{2}N^{-1}r^{2}+\sigma_{b}^{2}=\frac{1}{2}\frac{\sigma_{w}^{2}\sigma_{b}^{2}+2(1-\sigma_{w}^{2}/2)\sigma_{b}^{2}}{1-\sigma_{w}^{2}/2}=\frac{\sigma_{b}^{2}}{1-\sigma_{w}^{2}/2}=\eta (91)

and \alpha=\sigma_{w}^{2}N^{-1}. Expanding the energy directly we find that,

\displaystyle\mathcal{L} \displaystyle=\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{2r^{2}+2r(\epsilon_{+}^{l}+\epsilon_{-}^{l})+(\epsilon_{+}^{l})^{2}+(\epsilon_{-}^{l})^{2}}{\eta+\sigma_{w}^{2}N^{-1}\epsilon_{+}^{l-1}(2r+\epsilon_{+}^{l-1})}+N\log(\eta+\sigma_{w}^{2}N^{-1}\epsilon_{+}^{l}(2r+\epsilon_{+}^{l}))+(N/2-\epsilon_{k}^{l})(3\log(N/2-\epsilon_{k}^{l})-2\log(r+\epsilon_{+}^{l}))+(N/2+\epsilon_{k}^{l})(3\log(N/2+\epsilon_{k}^{l})-2\log(r+\epsilon_{-}^{l}))\Bigg) (92)
\displaystyle\approx\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{N}{r^{2}}(1+\alpha^{2}N^{2}/2)(\epsilon_{+}^{l})^{2}+\frac{N}{r^{2}}(\epsilon_{-}^{l})^{2}+\frac{6}{N}(\epsilon_{k}^{l})^{2}+\frac{2}{r}\epsilon_{k}^{l}(\epsilon_{+}^{l}-\epsilon_{-}^{l})-\frac{\alpha N^{2}}{r^{2}}\epsilon_{+}^{l-1}(\epsilon_{+}^{l}+\epsilon_{-}^{l})\Bigg). (93)

Substituting back in for \alpha and r we arrive at the equation,

\displaystyle=\frac{1}{2}\sum_{l=0}^{L}\Bigg(\frac{2(1-\sigma_{w}^{2}/2)(1+\sigma_{w}^{4}/2)}{\sigma_{b}^{2}}(\epsilon_{+}^{l})^{2}+\frac{2(1-\sigma_{w}^{2}/2)}{\sigma_{b}^{2}}(\epsilon_{-}^{l})^{2}+\frac{6}{N}(\epsilon_{k}^{l})^{2}+2\sqrt{\frac{2(1-\sigma_{w}^{2}/2)}{N\sigma_{b}^{2}}}\epsilon_{k}^{l}(\epsilon_{+}^{l}-\epsilon_{-}^{l})-\frac{2\sigma_{w}^{2}(1-\sigma_{w}^{2}/2)}{\sigma_{b}^{2}}\epsilon_{+}^{l-1}(\epsilon_{+}^{l}+\epsilon_{-}^{l})\Bigg). (94)

We will make the change of variables \epsilon^{l}_{k}\to\epsilon^{l}_{k}/\sqrt{N}, \epsilon^{l}_{+/-}\to\sqrt{2(1-\sigma_{w}^{2}/2)/\sigma_{b}^{2}}\,\epsilon^{l}_{+/-}, in which case we may rewrite the above equation in normal coordinates as,

\displaystyle Q \displaystyle=\int[d\epsilon_{k}^{l}][d\epsilon_{+}^{l}][d\epsilon_{-}^{l}]\exp\Bigg[-\frac{1}{2}\sum_{l=0}^{L}\Bigg((1+\sigma_{w}^{4}/2)(\epsilon_{+}^{l})^{2}+(\epsilon_{-}^{l})^{2}+3(\epsilon_{k}^{l})^{2}+\epsilon_{k}^{l}(\epsilon_{+}^{l}-\epsilon_{-}^{l})-\sigma_{w}^{2}\epsilon_{+}^{l-1}(\epsilon_{+}^{l}+\epsilon_{-}^{l})\Bigg)\Bigg] (95)

which completes the proof. ∎

Next we change variables to the Fourier basis and, as in the case of the linear network, we find that the Fourier modes are decoupled. Unlike in the linear case, however, the different fields remain coupled. Small fluctuations in the Fourier basis are therefore described by a Gaussian distribution with a nontrivial covariance matrix. In the main text we checked this covariance matrix against measurements from numerical experiments.

Result 11.

Changing variables to Fourier basis in eq. (90) yields the distribution over Fourier modes defined by the energy,

U=\frac{1}{2}\sum_{q}\begin{pmatrix}\epsilon_{+}^{-q}&\epsilon_{-}^{-q}&\epsilon_{k}^{-q}\end{pmatrix}\begin{pmatrix}1+\sigma_{w}^{4}/2-\sigma_{w}^{2}\cos q&-\frac{1}{2}\sigma_{w}^{2}e^{-iq}&\frac{1}{2}\\ -\frac{1}{2}\sigma_{w}^{2}e^{iq}&1&-\frac{1}{2}\\ \frac{1}{2}&-\frac{1}{2}&3\end{pmatrix}\begin{pmatrix}\epsilon_{+}^{q}\\ \epsilon_{-}^{q}\\ \epsilon_{k}^{q}\end{pmatrix} (96)

which is well defined as a Gaussian with inverse covariance matrix,

\Sigma^{-1}(q)=\begin{pmatrix}1+\sigma_{w}^{4}/2-\sigma_{w}^{2}\cos q&-\frac{1}{2}\sigma_{w}^{2}e^{-iq}&\frac{1}{2}\\ -\frac{1}{2}\sigma_{w}^{2}e^{iq}&1&-\frac{1}{2}\\ \frac{1}{2}&-\frac{1}{2}&3\end{pmatrix}. (97)
Proof.

We now change variables to the Fourier basis and write

\epsilon^{l}_{+}=\sum_{q}e^{-iql}\epsilon^{q}_{+}\qquad\epsilon^{l}_{-}=\sum_{q}e^{-iql}\epsilon^{q}_{-}\qquad\epsilon^{l}_{k}=\sum_{q}e^{-iql}\epsilon^{q}_{k} (98)

where each q is summed from 0 to 2\pi in steps of 2\pi/L. With this change of variables eq. (90) can be rewritten as,

\displaystyle Q \displaystyle=\int[d\epsilon_{k}^{q}][d\epsilon_{+}^{q}][d\epsilon_{-}^{q}]\exp\Bigg[-\frac{1}{2}\sum_{l=0}^{L}\sum_{q,q^{\prime}}\Bigg((1+\sigma_{w}^{4}/2)\epsilon_{+}^{q}\epsilon_{+}^{q^{\prime}}+\epsilon_{-}^{q}\epsilon_{-}^{q^{\prime}}+3\epsilon_{k}^{q}\epsilon_{k}^{q^{\prime}}+\epsilon_{k}^{q}(\epsilon_{+}^{q^{\prime}}-\epsilon_{-}^{q^{\prime}})-\sigma_{w}^{2}\epsilon_{+}^{q}(\epsilon_{+}^{q^{\prime}}+\epsilon_{-}^{q^{\prime}})e^{-iq}\Bigg)e^{i(q+q^{\prime})l}\Bigg] (99)
\displaystyle=\int[d\epsilon_{k}^{q}][d\epsilon_{+}^{q}][d\epsilon_{-}^{q}]\exp\Bigg[-\frac{1}{2}\sum_{q}\Bigg((1+\sigma_{w}^{4}/2-\sigma_{w}^{2}\cos q)|\epsilon_{+}^{q}|^{2}+|\epsilon_{-}^{q}|^{2}+3|\epsilon_{k}^{q}|^{2}+\frac{1}{2}\left(\epsilon_{k}^{q}(\epsilon_{+}^{q})^{*}-\epsilon_{k}^{q}(\epsilon_{-}^{q})^{*}-\sigma_{w}^{2}\epsilon_{+}^{q}(\epsilon_{-}^{q})^{*}e^{-iq}+\text{h.c.}\right)\Bigg)\Bigg] (100)

where h.c. refers to the Hermitian conjugate. This may be rewritten in matrix form as,

Q=\int[d\epsilon_{k}^{q}][d\epsilon_{+}^{q}][d\epsilon_{-}^{q}]\exp\left[-\frac{1}{2}\sum_{q}\begin{pmatrix}\epsilon_{+}^{-q}&\epsilon_{-}^{-q}&\epsilon_{k}^{-q}\end{pmatrix}\Sigma^{-1}(q)\begin{pmatrix}\epsilon_{+}^{q}\\ \epsilon_{-}^{q}\\ \epsilon_{k}^{q}\end{pmatrix}\right] (101)

which is well defined as a Gaussian with inverse covariance matrix,

\Sigma^{-1}(q)=\begin{pmatrix}1+\sigma_{w}^{4}/2-\sigma_{w}^{2}\cos q&-\frac{1}{2}\sigma_{w}^{2}e^{-iq}&\frac{1}{2}\\ -\frac{1}{2}\sigma_{w}^{2}e^{iq}&1&-\frac{1}{2}\\ \frac{1}{2}&-\frac{1}{2}&3\end{pmatrix}. (102)

As before this allows us to test the predictions of the theory. ∎
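A sketch of how the predicted covariance can be tabulated from eq. (102) is given below; the wavevectors follow the 2\pi/L spacing used above, and the overall normalization of the covariance is omitted since only the q-dependence is compared against the measurements.

# Predicted covariance of the Fourier modes (epsilon_+^q, epsilon_-^q, epsilon_k^q),
# obtained by inverting Sigma^{-1}(q) of eq. (102). Overall normalization omitted.
import numpy as np

L, sigma_w2 = 512, 1.2                               # arbitrary illustrative values
q_vals = 2.0 * np.pi * np.arange(1, L) / L           # nonzero wavevectors on the ring

def sigma_inv(q):
    return np.array([
        [1.0 + sigma_w2 ** 2 / 2.0 - sigma_w2 * np.cos(q), -0.5 * sigma_w2 * np.exp(-1j * q), 0.5],
        [-0.5 * sigma_w2 * np.exp(1j * q),                  1.0,                             -0.5],
        [0.5,                                              -0.5,                              3.0],
    ])

cov = np.array([np.linalg.inv(sigma_inv(q)) for q in q_vals])   # shape (L - 1, 3, 3)
var_plus = cov[:, 0, 0].real                          # predicted <|epsilon_+^q|^2> up to normalization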

Finally, we construct the effective field theory under the assumption that all of the fields are slowly varying with respect to a single layer.

Result 12.

The effective field theory for long wavelength fluctuations of the rectified linear network will be given by,

\displaystyle U=\frac{1}{2}\int dx\Bigg[(1-\sigma_{w}^{2}+\sigma_{w}^{4}/2)(\epsilon_{+}(x))^{2}+(\epsilon_{-}(x))^{2}+3(\epsilon_{k}(x))^{2}+\epsilon_{k}(x)(\epsilon_{+}(x)-\epsilon_{-}(x))-\sigma_{w}^{2}\epsilon_{+}(x)\epsilon_{-}(x)+\frac{\sigma_{w}^{2}}{2}\left(\frac{\partial\epsilon_{+}(x)}{\partial x}\right)^{2}+\sigma_{w}^{2}\frac{\partial\epsilon_{+}(x)}{\partial x}\epsilon_{-}(x)\Bigg]. (103)
Proof.

We begin by noting that eq. (90) can be rewritten as,

\displaystyle U \displaystyle=\frac{1}{2}\sum_{l=0}^{L}\Bigg((1+\sigma_{w}^{4}/2)(\tilde{\epsilon}_{+}^{l})^{2}+(\tilde{\epsilon}_{-}^{l})^{2}+3(\tilde{\epsilon}_{k}^{l})^{2}+\tilde{\epsilon}_{k}^{l}(\tilde{\epsilon}_{+}^{l}-\tilde{\epsilon}_{-}^{l})-\sigma_{w}^{2}\tilde{\epsilon}_{+}^{l-1}(\tilde{\epsilon}_{+}^{l}+\tilde{\epsilon}_{-}^{l})\Bigg) (104)
\displaystyle=\frac{1}{2}\sum_{l=0}^{L}\Bigg((1-\sigma_{w}^{2}+\sigma_{w}^{4}/2)(\tilde{\epsilon}_{+}^{l})^{2}+(\tilde{\epsilon}_{-}^{l})^{2}+3(\tilde{\epsilon}_{k}^{l})^{2}-\sigma_{w}^{2}\tilde{\epsilon}_{+}^{l}\tilde{\epsilon}_{-}^{l}+\tilde{\epsilon}_{k}^{l}(\tilde{\epsilon}_{+}^{l}-\tilde{\epsilon}_{-}^{l})+\frac{1}{2}\sigma_{w}^{2}(\tilde{\epsilon}_{+}^{l}-\tilde{\epsilon}_{+}^{l-1})^{2}+\sigma_{w}^{2}(\tilde{\epsilon}_{+}^{l}-\tilde{\epsilon}_{+}^{l-1})\tilde{\epsilon}_{-}^{l}\Bigg). (105)

As before, when fluctuations are slowly varying on the scale of a single layer of the network we can interpret \tilde{\epsilon}^{l}_{+}-\tilde{\epsilon}^{l-1}_{+}\approx\partial\epsilon_{+}/\partial x and we can approximate the sum by an integral. Together these approximations give,

\displaystyle U=\frac{1}{2}\int dx\Bigg[(1-\sigma_{w}^{2}+\sigma_{w}^{4}/2)(\epsilon_{+}(x))^{2}+(\epsilon_{-}(x))^{2}+3(\epsilon_{k}(x))^{2}+\epsilon_{k}(x)(\epsilon_{+}(x)-\epsilon_{-}(x))-\sigma_{w}^{2}\epsilon_{+}(x)\epsilon_{-}(x)+\frac{\sigma_{w}^{2}}{2}\left(\frac{\partial\epsilon_{+}(x)}{\partial x}\right)^{2}+\sigma_{w}^{2}\frac{\partial\epsilon_{+}(x)}{\partial x}\epsilon_{-}(x)\Bigg] (106)

as expected. ∎
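As in the linear case, one can check, although this is not part of the original argument, that expanding the (\epsilon_{+},\epsilon_{+}) entry of eq. (102) at small q,

1+\sigma_{w}^{4}/2-\sigma_{w}^{2}\cos q\approx(1-\sigma_{w}^{2}+\sigma_{w}^{4}/2)+\frac{\sigma_{w}^{2}}{2}q^{2},

reproduces the corresponding mass and gradient coefficients appearing in eq. (103), up to the normalization conventions relating the discrete sum over layers to the continuum integral.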

References

  • Buice and Chow (2013) Michael A Buice and Carson C Chow. Beyond mean field theory: statistical field theory for neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03003, 2013. URL http://stacks.iop.org/1742-5468/2013/i=03/a=P03003.
  • Chaikin and Lubensky (2000) Paul M Chaikin and Tom C Lubensky. Principles of condensed matter physics. Cambridge university press, 2000.
  • Cho and Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pages 342–350, 2009.
  • Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
  • Daniely et al. (2016) A. Daniely, R. Frostig, and Y. Singer. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity. arXiv:1602.05897, 2016.
  • Daniely et al. (2017) Amit Daniely, Roy Frostig, Vineet Gupta, and Yoram Singer. Random features for compositional kernels. CoRR, abs/1703.07872, 2017. URL http://arxiv.org/abs/1703.07872.
  • Ermentrout and Cowan (1979) G Bard Ermentrout and Jack D Cowan. A mathematical theory of visual hallucination patterns. Biological cybernetics, 34(3):137–150, 1979.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Jaynes (1957) Edwin T Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • Mézard et al. (1987) Marc Mézard, Giorgio Parisi, and Miguel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Co Inc, 1987.
  • Neal (1996) Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
  • Neal (2012) Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
  • Poole et al. (2016) B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. Neural Information Processing Systems, 2016.
  • Raghu et al. (2017) M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. International Conference on Machine Learning, 2017.
  • Saxe et al. (2014) A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.
  • Schneidman et al. (2006) Elad Schneidman, Michael J. Berry, Ronen Segev, and William Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, 04 2006. URL http://dx.doi.org/10.1038/nature04701.
  • Schoenholz et al. (2017) Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information Propagation. International Conference on Learning Representations, 2017.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. 2017. URL https://openreview.net/pdf?id=B1ckMDqlg.
  • Weinberg (1967) Steven Weinberg. A model of leptons. Phys. Rev. Lett., 19:1264–1266, Nov 1967. doi: 10.1103/PhysRevLett.19.1264. URL http://link.aps.org/doi/10.1103/PhysRevLett.19.1264.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.