Expectation propagation: a probabilistic view of Deep Feed Forward Networks

# Expectation propagation: a probabilistic view of Deep Feed Forward Networks

## Abstract

We present a statistical mechanics model of deep feed forward neural networks (FFN). Our energy-based approach naturally explains several known results and heuristics, providing a solid theoretical framework and new instruments for a systematic development of FFN. We infer that FFN can be understood as performing three basic steps: encoding, representation validation and propagation. We obtain a set of natural activations – such as sigmoid, and ReLu – together with a state-of-the-art one, recently obtained by Ramachandran et al. prajit () using an extensive search algorithm. We term this activation ESP (Expected Signal Propagation), explain its probabilistic meaning, and study the eigenvalue spectrum of the associated Hessian on classification tasks. We find that ESP allows for faster training and more consistent performances over a wide range of network architectures.

## 1 Introduction

Advances in modern computing hardware and availability of massive datasets have empowered multilayer artificial neural networks, or deep learning, with unprecedented capabilities for image and speech recognition tasks. Despite these empirical success, theoretical understanding of why and when multilayer neural networks perform well remains limited. Only recently, theoretical efforts in this direction have been intensively reported. For example, recent works shed light on how FFN attains its expressive power Poggio2017 (); LinTegmark2017 (); Raghu2017 (); Poole2016 (), what contributes to its generalizability Zhang2017 (); Dinh2017 (), and how myriad parameters in the network affect the geometry of the loss function  dauphin (); ch (); penn1 (); penn2 (). Taken together, these theoretical results have paved the way for a systematic design of robust and explainable FFNs.

Although FFN is one of the most studied and widely used neural network architectures, state-of-the-art models still largely inherit their building blocks from the decades-old multilayer perceptron (MLP). By augmenting MLP with modern optimization techniques such as dropout srivastava2014 (), non-linear activation functions that facilitate optimization protocols, and specialized complex network architectures Krizhevsky (), to name a few, the FFN can be efficiently trained on large-scale datasets such as ImageNet or CIFAR to achieve low training and generalization errors. While these engineering feats improve the performance of FFN, a clear design principle behind these developments is still lacking. This leads to an unsystematic growth of the FFN complexity, which obscures its original exposition as a simple physical system that can perform non-linear classification and regression tasks.

To assist future systematic studies and construction of FFNs, we propose a theoretical framework based on the toolset of statistical mechanics. It allows for the definition of an energy based model in which a hidden unit is regarded as a communication channel, first encoding and then transmitting the result of its computation through a gate with a specific transmission probability; as opposed to a hidden unit in multilayer perceptrons that transmits its sigmoidal firing probability, or in modern FFNs where a standard ad hoc activation function such as ReLu and its variants are used. The transmission probability is obtained via the maximum entropy principle zecchina (); jaynes () under the biologically inspired constraint that a neuron responds by firing (transmit its signals) or not with a certain probability. The maximum entropy principle provides the least biased probabilistic interpretation of a hidden neuron, and recasts a FFN as an energy-based model. By interpreting a hidden unit as a communication channel, its activation function takes the form of the expected (mean) signal transmission; remarkably, this activation function, which we term Expected Signal Propagation (ESP), agrees with the state-of-the-art activation (Swish), obtained in Ref. prajit () through an extensive search algorithm trained with reinforcement learning and shown to best perform on the CIFAR and ImageNet datasets among all candidate activations. Although some of these activations may perform better for the specific datasets they have been built for, they generally fail to generalize. Finally, the standard ReLu activation arises as a limiting case of ESP, corresponding to the noiseless limit of a communication channel. To the best of our knowledge, this provides the first formal derivation of the ReLu activation, typically introduced heuristically to facilitate optimization protocols. Despite restricting our analysis to pure FFNs, most of our conclusions carry on to Convolutional and Recurrent networks, which will be addressed in future works.

The paper is organized as follows: In section 2, we provide a plausible argument based on dimensional analysis to explain why a hidden unit should be regarded as a communication channel transmitting the result of its computation. We then discuss the formulation of a FFN as an energy-based model using the maximum entropy principle over the binary states of a hidden unit, and discuss the advantage of this formulation. In analogy with Refs. tishby1 (); tishby2 (), we show that each hidden unit acts first as an encoder and then as a filter determining the quality of its input by comparing it with a local bias potential.

In section 3, we explore the geometry of loss surfaces associated with training FFNs with ESP hidden units on simple classification tasks. The index , representing the fraction of negative eigenvalues of the Hessian (a measure of the number of descent directions), as well as the index , representing the fraction of zero eigenvalues (a measure of the number of flat directions), are investigated. We find that FFNs trained with ReLu exhibit while those trained with ESP often show , resulting in faster and more flexible training. These two indices seem to determine not only the speed of learning but also whether learning will be successful, although further studies are required to understand the detailed dynamics of gradient descent. Lastly, we conclude by commenting on how our model relates (and incorporates) other recently proposed frameworks such as the Renormalization Group mehta () and the information Bottleneck tishby1 (); tishby2 () in section 4. Codes and other numerical results, including training on the MNIST dataset, are provided in the supplemental materials (SM).

## 2 Motivation and Model

In this section, we introduce a statistical mechanics model of an FFN, and highlight its connection to an energy-based model as well as its information theoretic interpretation. To motivate the model, we begin by discussing a (physical) dimensional inconsistency in the standard formulation of forward propagation in FFNs, and how this can be reconciled within the proposed framework.

Consider a general FFN and denote an input vector, where denotes a feature component, and denotes an example. In analogy with the communication channel scheme in information theory mckay (); jaynes (), the input vector constitutes the information source entering the processing units (neurons) of the network, while the units constitute the encoders. Quite generally, the encoders can either build a lower (compression) or higher dimensional (redundant) representation of the input data by means of a properly defined transition function. In a FFN, the former corresponds to a compression layer (fewer units) while the latter to an expansion layer (more units). If the encoded information constitutes an informative representation of the output signal (given the input), it is passed over to the next layer for further processing until the output layer is reached. In the biological neuron, this last step is accomplished by the synaptic button, that releases information whether or not the input signal exceeds a local bias potential . Both in the brain and in electronic devices, the input information is often conveyed in the form of an electric signal, with the electron charge being the basic unit of information, the signal has dimension of in SI. For an image, the information is the brightness level of each pixel, physically proportional to the current passing through it; on a computer, this is encoded in bits 1. Therefore, it is convenient to measure a general information content in units of  jaynes (). Clearly, a linear combination of the input signals, together with the bias, has to preserve dimensions:

 hμi=n∑j=1wijXμj+bi, (1)

where indices the receiving units in the first layer. For noiseless systems, the input signal is transmitted i.f.f. the overall signal exceeds the bias . However, in the presence of noise, the signal can be transmitted with a certain probability even below the threshold. In so far, we just described the general functioning principle underlying FFNs. Let us first consider the sigmoid non linearity , where has inverse dimension of ; being a distribution function defined in it is intrinsically dimensionless. The parameter defines the spread of the distribution, corresponding to the noise level (the temperature in physics); typically, one sets or rescales it inside the weights and bias zecchina (), but here we keep it general for reasons that will be clear later. Defining as the input of the next layer, we can immediately see the dimensional mismatch: a linear combination of is dimensionless and when passed through the new non-linearity, necessarily becomes dimensionful. In the next section we show how this simple fact is at the root of the well known phenomena of vanishing gradients during back propagation. From a conceptual point of view, one is transmitting the expectation value of the transmission gate rather than the processed signal. This problem persists when using the activation (a translation of the sigmoid), but it is absent when using ReLu, that correctly transmits the information itself. In the next subsection we formalise this intuitive picture using the tools of statistical mechanics.

### 2.1 Statistical mechanics of Feed Forward networks

A prototypical statistical mechanics formulation of Neural Networks is the inverse Ising problem zecchina (), where one is interested in infering the values of the couplings and external fields of the Ising model, given a set of pairwise connected binary units ; in computer science, this problem is solved by the Boltzmann machine. While the Boltzmann machine is an energy-based model, standard FFN is not commonly regarded as such. Here, we propose an energy-based formulation of FFNs to bridge on these two formulations, solely based on the maximum entropy principle  zecchina (); roberto (); mckay (); jaynes () to obtain the least biased probabilistic interpretation of hidden neurons. The core task at hand is to determine the state of the channel (open/close). The absence of lateral connections in FFNs means that each unit is only influenced by its receptive signal, and the intralayer correlations are taken – a priori– to be zero. In other words, each unit within the same layer behaves independently of the others (i.e. no contextual information) 2.

Given a statistical ensemble of hidden binary units, the most likely (and the least biased) distribution can be obtained by maximising its entropy, subject to constraints imposed by the conserved quantities; in the grand canonical Gibbs ensemble these are the global average energy and particle number roberto (), while here they correspond to the first and second moments of the data zecchina (); mckay (). However, in the absence of lateral connections, the entropy functional of the hidden units reads

 F=−∑sP(s)logP(s)+η(∑sP(s)−1)+∑iλi(mi−∑ssiP(s)), (2)

where are Lagrange multipliers chosen to reproduce the first moment of the data, while enforces normalization. Functionally varying with respect to the probability and solving for the Lagrange multipliers, one easily obtains roberto ():

 P(s|h)=1Ze∑iβisihi,Z[h]=∏i∑si={0,1}eβisihi=∏i(1+eβihi). (3)

The parameters encode the noise, or statistical uncertainty, concerning our knowledge of (). In a physical system, is the inverse temperature in units of Boltzmann’s constant and in equilibrium it is the same for each ; however, here the network is only in a local equilibrium as the units are not allowed to exchange “energy” (information) among themselves due to the lack of pairwise interactions (the lateral connections) 3. We have also introduced the notation to denote the conditional probability of given the “external field” – and the partition function .

We discuss now the central results of this paper, based on the identification of as an effective, coarse grained field coupling to the  4. The feedforward nature of coarse graining (a directed graph), stems from the fact that it is not possible to entirely retrieve the original information after eliminating part of it; this endorses the forward pass with a semi-group structure. Considering the first layer, we need to evaluate the probability associated to the new, coarse grained variable ,

 P(h)=∫dXQ(h|X)P(X), (4)

where is the transition function modelling the encoder. In deep learning, the form of the transition function is fixed by the forward pass operation, while the input data are drawn from some unknown distribution . We can follow two different paths: fix the value of on the observed (empirical) sequence, or assume the form of the distribution from which has been sampled from. The former corresponds to an empirical dataset, whereas the latter is preferred for an analytical inspection and to determine general statistical properties about the system. In this work we consider the former, while the latter will be discussed in a forthcoming manuscript. Consider then the empirical estimator  zecchina (), where the Kronecker-delta can be understood as a binning function. Using the empirical estimator together with the explicit form of the transition function in the discrete version of Eq. (4) we find

 P(h)=∑Xδ(h−wTX−b)(1m∑μδX,Xμ)=1m∑μδ(h−wTXμ−b), (5)

where the Dirac delta represents the transition function. Eq. (5) is akin to the real space block-spin renormalization developed by Kadanoff, reformulated in the more general language of probability theory roberto (); ma (); cassandro (). The relation between Deep Learning and the renormalization group (RG) was previously observed in the context of restricted Boltzmann machine mehta (). We note in passing that, while the discussion here and in Ref.  mehta () suggests that feedforward propagation in the compression layer is analogous to the coarse graining step of RG, rescaling of the coarse grained variable is missing. Thus, the analogy is incomplete 5.

Finally, given the distribution of the coarse grained inputs and the conditional probability of the channels , one needs to evaluate the channel transmission probability

 P(s)=∫dhP(s|h)P(h)=∑μe−∑iβiHi[s,Xμ]Zm,Z =1mm∑μ=1n1∏i=1[1+eβihμi], (6)

where and, in analogy with the Ising model, we have identified the simple coarse grained Hamiltonian

 Hi=−∑jsiwijXμj−sibi. (7)

Eqs. (6) and (7) can be seen as the starting point of an energy based model, where it is the coarse grained probability that propagates through the network (see e.g. connie () ) as opposed to signals in a FFN. The expected value of the channel transmission, , can be easily obtained as hertz ()

 ⟨si⟩=1βi∂∂bilogZi=∏μ11+e−βihμi≡∏μσ(βihμi), (8)

that is the logistic function (or Fermi-Dirac distribution in physics), evaluated for each empirical realization . We stress that this quantity is not the coarse grained input signal transmitted by the channel; it solely determines the expectation of channel transmissions given the coarsed grained input signal . Note that would be the correct quantity for a single perceptron, or equivalently the output layer of an MLP performing a classification task. However, as discussed in the previous section, the hidden layers should not transmit , as this leads to dimensional mismatch. To ensure dimensional consistency across hidden layers, the output signal of each hidden unit must have the same dimension as its input signal. Therefore, the correct quantity to consider is the expectation value of the coarse grained field ; this can be easily obtained by summing over all gate states, for each unit and example, or by using the partition function of Eq. (6),

 aμi≡⟨hμi⟩=∂∂βilogZi,μ≡hμi⟨si⟩=hμiσ(βihμi), (9)

that agrees with the definition of the energy flux in statistical mechanics bellac (). Note that contrary to Eq. (8), the noise parameters cannot be rescaled away now. We call this activation ESP, “Expected Signal Propagation” to stress its meaning and physical analogy, even though here does not have to be strictly positive. This function was recently obtained in Ref. prajit () (there named Swish), through an extensive search algorithm trained with reinforcement learning. In their extensive search, the authors found that activations of the form better performed on several benchmark datasets. Our model naturally explains this observation by identifying ESP with the expectation value of the coarse grained input transmitted by each unit. In the next section we will better analyse this activation function and its meaning within backward propagation. Consider now the limit of a noiseless system, i.e. , in the above equation:

 limβi→∞⟨hμi⟩=hμiθ(hμi)≡max{hμi,0}≡ReLu, (10)

where is the Heaviside step function and in the last equality we have identified the ReLu activation. To the best of our knowledge, this is the first formal derivation of ReLu, usually obtained heuristically. ReLu should be understood as the noiseless limit of the mean transmitted information across the units, thus explaining the empirical observation that it often outperforms or in multilayer networks. Both ESP and ReLu pass on the expected value of their computation; the latter does that with probability one only if the coarse grained input signal exceeds a threshold, while the former can pass a signal lower than the threshold with a finite probability. In Fig. (1) we plot the three activation functions for . Clearly, ESP encompasses the statistical uncertainty conveyed by each processing unit in estimating the pre-activation value . A positive value of the activation tells the units in the next layer that a certain combination of the previous inputs should be strengthen while a negative value means that it should be weakened, i.e. unlearned; the latter option is absent when using the ReLu function, a feature known to generally slow down learning engel (). In the opposite extreme, the noisy limit , the ESP becomes linear. Therefore, we can consider linear networks as a noisy limit of non linear ones. This process has to be repeated for each and every layer, feeding the post activation distribution of the previous layer as the new empirical distribution, see Eqs. (4) and (9). Therefore, we introduce a layer index for the weights, biases and effective fields: .

We would like to conclude this section noticing that in spin glasses parisi2 (); giardina (), the energy landscape presents an increasing number of degenerate local minima as the temperature is lowered, a feature shared with the Hopfield model of associative memory amit (). The effect of noise in FFNs was recently considered in Ref. pratik (), where an improved optimization algorithm based on the maximum entropy principle was proposed.

### 2.2 Back-propagation: a probabilistic point of view

At the heart of any FFN model is the back propagation algorithm hertz (); bishop (). Forward propagation returns a first “guess” on the value of the learning parameters, that are subsequently adjusted layer by layer by minimizing a properly defined loss function, i.e. the energy of the system. Consider the output layer “L”, then the gradient of the weights is

 ∂L∂wL=1mm∑μ=1[eμg(hμL)]aL−1,g(hμl)=σ(βlhμl)[1+hμlβlσ(−βlhμl)], (11)

where is the residual error, depending on the difference between the network output and the ground truth vector . In the optimization phase we look for the stationary point , where is the set of learning parameters. For a linear network, this condition is strictly satisfied if . However, in a non linear network we can also have . In principle, there may be situations in which is far from zero but , in which case learning will not be effective. For a activation, , commonly known in physics as a phase space factor; it appears e.g. in the collision integral of the Boltzmann equation roberto (), which describes the transition probability from an occupied to an empty state. Ref. tishby1 (); tishby2 () show that there are two distinct phases of learning: empirical error minimization (the residual) and representation learning. Following the discussion of the previous section, we can identify the latter with the task of optimizing . When , i.e. when the signal greatly exceeds the noise, then 6 and the back propagated signal is small, being proportional to . We then have the paradoxical situation in which, although the lower dimensional representation of the information is considered to be relevant, learning is likely to be inefficient.

Consider now the same limiting behaviour for ESP. If we now have , i.e. the representation learning phase is completed and learning moves towards minimising the empirical error. In the opposite limit, the signal is much smaller than the noise and learning is impossible, as expected. Finally, in the noiseless case one obtains the ReLu solution , for respectively greater or smaller than zero. This corresponds to a purely “excitatory” network, in which representation unlearning is not possible. In other words, the fact that ESP can be negative for allows for greater flexibility and the possibility to avoid plateaus surrounding local minima and saddle-points. In Ref. dauphin () it was proposed that plateaus of zero (or small) curvatures in the loss are mostly responsible for slowing down or preventing convergence of gradient descent. The following section provides a numerical analysis to support this picture. Details of the the back-propagation algorithm with ESP for a general layer network can be found in the SM.

## 3 Numerical Analysis

A thorough performance analysis of ESP versus other popular choices of activations was carried out in Ref. prajit (), where it was shown that the former outperformed all other choices on image classification tasks using a convolutional structure. It was also noted that to take full advantage of ESP one should also reconsider the way convolution is performed; for the moment, we leave this interesting point open and rather focus on understanding the optimization dynamics in pure FFNs. We have considered both artificial and experimental datasets trained with ADAM gradient descent adam (). Given that both datasets lead to the same qualitative features, we focus our analysis on the former and discuss the latter, together with additional details in the SM.

In Fig. (2) we show the loss functions for two binary classification tasks: a linear and a non-linear one, each trained with one and two hidden layers. For the linear task with a single, 10 units layer (adding more units does not improve performance), all three activations attain full train/test accuracy but ESP is the faster converging. For the non-linear task, ReLu quickly converges to a suboptimal plateau. To obtain a better understanding, we have evaluated two different indices: the fraction of negative eigenvalues of the Hessian – – (Fig. (3)) and the fraction of zero eigenvalues – –. The former measures the ratio of descent to ascent directions on the energy landscape; when is large, gradient descent can quickly escape a critical point – a saddle point in this case – due to the existence of multiple unstable directions. However, when a critical point exhibits multiple near-zero eigenvalues, roughly captured by , the energy landscape in this neighbourhood consists of several near-flat (to second-order) directions; in this situation, gradient descent will slowly decrease the training loss. In Ref. dauphin () it was noted that energy plateaus are responsible for slowing down or preventing learning. This can be understood by considering the case in which learning completely fails: the loss does not decrease and all the eigenvalues of the Hessian are zero. In this case the analysis is not conclusive and higher order inspection is needed to reveal possible inflection points. In general, we find that for ReLu networks , while this is typically not the case for both ESP and -networks. Taking the two layer case as a representative example, we show that ReLu networks are sensitive to fine tuning of the model: choosing a or a configuration over the considered here, greatly improves learning. In stark contrast, ESP networks exhibit consistent performance over a wider choice of architecture/learning parameters. Although the performance impact might be fairly small for small networks, it certainly plays an important role for larger datasets, as discussed in Ref. prajit (). In Fig. (3) we show the index, usually smaller for the ReLu network, and a finite value of that slows down learning. We find for ReLu and for ESP and , both for the linear and non-linear task. We have also evaluated the fraction of residuals closer to zero and found, surprisingly, that ESP greatly outperforms the other activations in minimizing the empirical error, see Fig. (4).

In addition, we find that the eigenvalue distribution obtained with ReLu shrinks with increasing training epochs, giving rise to the singular distribution reported in Ref. penn1 (); levent (). In contrast, ESP trained networks show a significantly wider spread in the eigenvalue distribution at the end of training, see SM for details and discussions, together with results for the MNIST dataset ?.

## 4 Conclusions and perspectives

In this work we have introduced an energy-based model that allows systematic construction and analysis of FFNs. It provides a coherent interpretation of the computational process of each and every unit (neuron). Furthermore, it presents a method that overcomes the dimensional mismatch that arises from using heuristic activations. Enforcing dimensional consistency naturally leads to a class of activations with the prime focus of propagating the expectation value of processed signals, hence Expected Signal Propagation. Our results provide a theoretical justification of the activation found in Ref. prajit (). In addition, we demonstrate the superiority of this ESP activation through numerical experiments that reveal the geometry of its loss manifold. ReLu networks, unlike and , do not suffer from dimensional mismatch, yet their restricted phase space results in a finite fraction of null directions of gradient descent that can slow down or, in some cases, completely prevent learning. To overcome this problem, one needs a detailed fine tuning of the network’s topology that likely increases the complexity of the learning task. In stark contrast, ESP trained networks are less prone to fine tuning and outperform other activations on minimizing the empirical error. We hope this statistical mechanics view can facilitate future studies of FFNs, e.g. to explore the scaling relation between the ESP Hessian’s eigenvalue distributions and the parameters that describe its loss landscape, analogous to those studied in Ref. penn1 () for standard activations. As previously discussed, the Renormalization Group (RG) approach could hint at design principles of FFNs, as it delineates how complex behaviour emerges from the elementary constituents of a system. Although this framework has been considered in Ref. mehta (); maciej (), the complete mathematical formalism to correctly use it in FFNs is still missing. We believe this could answer some of the most pressing questions in the construction of FFNs, namely: given a dataset showing specific strength and extent of input/output correlations, what is the optimal network topology? Finally, extensions of the methods discussed here to Convolutional and Recurrent architectures could lead to substantial performance improvement and, potentially, faster training algorithms.

## Acknowledgements

M.M. would like to thank Ned Phillips, Bill Phillips, Sarah Chan and Roberto Raimondi for support and discussions. A special thanks to Fábio Hipólito and Aki Ranin for carefully reading and commenting the manuscript. T.C. would like thank Shaowei Lin for useful discussions and for financial support from the startup research grant SRES15111, and the SUTD-ZJU collaboration research grant ZJURP1600103. P.E.T would like to thank M. D. Costa for help with HPC. Some of the calculations were carried out at the HPC facilities of the NUS Centre for Advanced 2D materials. Finally, M.M. And T.C. would like to thank the AI Saturday meetup initiative organised by Nurture.ai, that sparkled some of the ideas presented in this work.

Supplemental Material

In this supplemental material we provide details on the back-propagation algorithm for ESP, the empirical analysis over the MNIST dataset and the evolution of eigenvalue distributions. The code and data used to perform the numerical experiments, together with the videos of the time evolution of the eigenvalue distribution, can be found in Ref. [36].

## Appendix A Back-propagation for ESP

Multi-layer Neural Networks operate – broadly speaking – in two steps: Forward and Backward propagation. The former performs operations on the input information, that then propagates like a signal through the network until it reaches the output layer. Here the processed signal is compared with the provided answer by means of a properly defined error (loss) function, measuring the alignment between the two. Let us introduce first the layer index where denotes the input layer. For algorithmic implementation it is convenient to work directly with the transpose of the weight matrices, i.e. ; then FW propagation is easily defined as:

In the above snippet, the entries are – the input data – of dimension (m= number of training samples, f = number of features ), the weights matrices , the biases and the noise terms . Given a loss function and a set of parameters , we need to find the optimal configuration via optimization . For the last layer one easily finds:

 ∂L∂wL (1) =1mm∑μ=1{eμ[σ(hμL)+hμLβLσ(hμL)(1−σ(hμL))]}⋅aTL−1=1mm∑μ=1[eμg(hμL)]⋅aTL−1.

We use to denote the scalar product, otherwise products are element-wise. We have also introduced the “residual” , whose explicit expression depends on the form of the Loss function. For the regression task, we use the minimum square error:

 L=12mm∑μ=1(^y[θ,Xμ]−yμ)2≡12mm∑μ=1(eμ)2. (2)

 L=−1mm∑μ=1{yμlog(^yμ[θ,Xμ])+(1−yμ)log(1−^yμ[θ,Xμ])}, (3)

where the output of the network is now the probability of the underlying binomial process. Note that the argument of the logarithm is correctly a dimensionless quantity. In this case the residue reads:

 eμ=−(yμ^yμ−1−yμ1−^yμ)=^yμ−yμ^yμ(1−^yμ) (4)

In equation (1) we have also defined

 g(hμl)=σ(βlhμl)+hμlβlσ(βlhμl)[1−σ(βlhμl)]≡σ(βlhμl)[1+hμlβlσ(−βlhμl)]. (5)

Likewise, for the other parameters we find:

 ∂L∂bL =1mm∑μ=1eμg(hμL) (6) ∂L∂βL =1mm∑μ=1eμf(hμL) (7) f(hμL) =[hμL]2σ(hμL)σ(−hμL). (8)

In general, one finds for the l-th hidden layer:

 ∂L∂wl =1mm∑μ=1{eμg(hμL)}⋅∂hL∂wl=1m[L∏i=l+1wi]⋅[el+1∏i=Lg(hi)]⋅al−1. (9) ∂L∂bl =1mm∑μ=1{eμg(hμL)}⋅∂hL∂bl=1m[L∏i=l+1wi]⋅[el+1∏i=Lg(hi)] ∂L∂βl =1mm∑μ=1{eμg(hμL)}⋅∂hL∂βl=1m[L∏i=l+1wi]⋅[el+1∏i=Lg(hi)]f(hμl).

Note that for each of the gradients above, the last equality is fully vectorized, i.e. the sum over is taken care of in the scalar product. The algorithmic implementation of the above equations can be found in the pure python code provided in Ref. [36].

## Appendix B Numerical analysis

In this section we present additional details on the numerical analysis performed in the main text. The two datasets used to study the binary classification problem are shown in Fig. (1) and provided in Ref. [36]. We also present the analysis of the multi-class classification task performed on the MNIST dataset. In Ref. [1] the authors considered the CIFAR-10 and CIFAR-100 datasets as benchmarks to test ESP against ReLu and many of the heuristic activations used in the literature. A detailed analysis using different Convolutional based architectures showed how ESP consistently outperformed all other activations. Here we carry out our analysis using a simple FFN, in which the original MNIST image is flattened to a one dimensional array and used as an input to a 25-units, single layer FFN. The problem with this handwaving method is that it destroys the two dimensional spatial correlations of the original image, hence the poor performance compared to convolutional networks. Within this approach, the difference between ESP and ReLu is less marked in terms of overall performance, but it is qualitatively similar to the binary classification task considered in the main text. Also in this case, training with ESP leads to a faster decay of the loss function, an higher value of and a lower (but here finite) of . Finally, the fraction of residues close to zero is, also in this case, larger for the ESP trained FFN.

A collection of results for the MNIST dataset is shown in Fig. (2).

For completeness, in Fig. (3) we show plots of the -index for the binary classification tasks considered in the main text.

### b.1 Empirical Eigenvalue Distributions

In this section, we report numerical experiments on the time evolution of the empirical eigenvalue distributions of the Hessian as we train the FFN with different activations on different classification tasks. We then compare and contrast the geometry of loss surfaces associated with ESP to that associated with ReLu and with Sigmoid activations.

Recall that, at a critical point, the eigenvector corresponding to a negative (positive) eigenvalue of the Hessian specifies the descent (ascent) direction of the loss surface near the critical point. A critical point is a local minima if all the eigenvalues are positive. On the other hand, the larger the relative fraction of the negative eigenvalues, measured by an index in the main text, the larger the ratio of the descent directions to the ascent and null directions. As a result, when is large, gradient descent can quickly escape a critical point – a saddle point in this case – due to the existence of multiple unstable directions. However, when a critical point exhibits multiple near-zero eigenvalues, roughly captured by the index in the main text, the geometry near the critical point consists of several near-flat (to second-order) directions; in this situation, gradient descent will slowly decrease the training loss, a typical situation when training is almost completed or when the training get stuck in a spurious local minima.

Defining the parameter vector for layer as , , and for a generic loss function , the Hessian matrix is defined as . For a single hidden layer ESP network, it reads:

 ^H=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝∂2L∂w[1]ij∂w[1]lm∂2L∂w[1]ij∂b[1]l∂2L∂w[1]ij∂β[1]l∂2L∂w[1]ij∂w[2]lm∂2L∂w[1]ij∂b[2]l∂2L∂w[1]ij∂β[2]l∂2L∂b[1]i∂w[1]lm∂2L∂b[1]i∂b[1]j∂2L∂b[1]i∂β[1]j∂2L∂b[1]i∂w[2]lm∂2L∂b[1]i∂b[2]j∂2L∂b[1]i∂β[2]j∂2L∂β[1]i∂w[1]lm∂2L∂β[1]i∂b[1]j∂2L∂β[1]i∂β[1]j∂2L∂β[1]i∂w[2]lm∂2L∂β[1]i∂b[2]j∂2L∂β[1]i∂β[2]j∂2L∂w[2]ij∂w[1]lm∂2L∂w[2]ij∂b[1]l∂2L∂w[2]ij∂β[1]l∂2L∂w[2]ij∂w[2]lm∂2L∂w[2]ij∂b[2]l∂2L∂w[2]ij∂β[2]l∂2L∂b[2]i∂w[1]lm∂2L∂b[2]i∂b[1]j∂2L∂b[2]i∂β[1]j∂2L∂b[2]i∂w[2]lm∂2L∂b[2]i∂b[2]j∂2L∂b[2]i∂β[2]j∂2L∂β[2]i∂w[1]lm∂2L∂β[2]i∂b[1]j∂2L∂β[2]i∂β[1]j∂2L∂β[2]i∂w[2]lm∂2L∂β[2]i∂b[2]j∂2L∂β[2]i∂β[2]j⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠=(H11H12H21H22). (10)

In the last equality we have introduced a block matrix representation for the Hessian. Clearly, in order to capture correlations between different layers, one needs to evaluate the full Hessian matrix.

In the following figures, we show our empirical observations of different classification tasks. The eigenvalue distributions from training a FFN consisting of a single hidden layer with 10 units and training a FFN consisting of two hidden layers with 8 and 2 hidden units for the non-linear binary classification task are shown in Fig. (4) and Fig. 5, respectively. The results for the same FFN architectures trained on the linear binary classification task are shown in Fig. (6) and Fig. (7), while those trained on MNIST with a single hidden layer of 25 units is shown in Fig. (8). In these figures, we zoom in near zero, where almost all of the eigenvalues are located, while a few significantly larger eigenvalues (by many order of magnitudes) are not displayed in the plot. The percentage of eigenvalues displayed within the visualization domain is reported in the label of each plot. Separate plots that show the magnitude of the top eigenvalues are shown separately in Figs. (9) and (10).

In all the cases, we observe that the time evolution of the distributions proceeds through two phases (see the videos in  [36] for detailed illustrations). In the first phase, the distribution tends to widely spread out around zero, with the index roughly equal to 0.5, signifying that gradient descent behaviour would descend quickly in the beginning of training due to the existence of multiple unstable directions in the loss surface. In the second phase, the width of the distribution begins to shrink and, at long-times, the negative eigenvalues tend to disappear (or relatively a few exist, which is typically the case when training reaches a very stable critical point) while the distribution of the positive eigenvalues clusters near zero with a large deviation tail, whose largest eigenvalue is shown in Figs. (9) and (10). These behaviours are qualitatively consistent with the predictions of Refs [13, 14] (albeit the results there require additional assumptions), and with observations discussed in Ref. [38]. Whether these two phases are related to the empirical error minimisation (ERM) and the representation compression phase of Ref. [4] is worth further investigation.

In linear and nonlinear binary classification tasks, the eigenvalue distributions of ReLu are mostly located near zero during training, suggesting that the ReLu loss landscape consists of critical points with many second-order flat directions; thus, gradient descent could get stuck in these nearly flat critical points. As seen in Fig.2 of the main text, the fact that training errors of ReLu plateaus out at high loss value supports this picture. For ESP and Sigmoid, the distributions spread further from 0 at the beginning, with may negative eigenvalues (large index), and gradient descent can conveniently move towards lower loss regime due to the existence of multiple unstable directions. When the training is stopped after many epochs have passed, the ESP eigenvalue distribution appears non-negative, whereas ReLu eigenvalue distribution still exhibits negative values; thus, training FFN with ESP on binary classification tasks allow gradient descent to find local minima faster than training with Sigmoid, and of course faster than ReLu. For training on MNIST datasets, ESP also reaches the non-negative eigenvalue distributions faster than the other two activation functions do, as shown in the last training epoch of Fig. (8). Again, this demonstrates the superiority of ESP in allowing gradient descent to reach a local minima in its loss landscape faster than the other two activation functions do.

### Footnotes

1. The input features have often different dimensions. For example, the Boston housing dataset boston () contains 14 features ranging from crime rate to property tax.
2. The contextual information is reintroduced in Convolutional and Recurrent networks – although heuristically– and it accounts for the superior performance of these architectures.
3. Technically, this corresponds to a generalized Gibbs ensemble, where conserved quantities are local rather than global, see e.g. calabrese ()
4. Here we refer to compression layers, but similar conclusions hold for expansion ones.
5. Alternatively, one can interpret as an effective cavity field representing the action of units on unit , see e.g. Ref. parisi1 () and opper () for connections to the belief propagation algorithm.
6. Here is the input of the unit, not necessarily the input data of the network.

### References

1. Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for Activation Functions. arXiv:1710.05941 , 2017.
2. Pankaj Mehta and David. J. Schwab. An exact mapping between the Variational Renormalization Group and Deep Learning. arXiv:1410.3831 , 2014.
3. Navtali Tishby and Noga Zaslavsky. Deep Learning and the Information Bottleneck Principle. Invited paper to ITW 2015; 2015 IEEE Information Theory Workshop, arXiv:1503.02406v1, 2015.
4. Ravid Schwartz-Ziv abd Navtali Tishby. Opening the black box of Deep Neural Networks via Information. arXiv:1703.00810, 2017.
5. Tomaso Pogio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review International Journal of Automation and Computing 14, pp. 503-519,2017.
6. Henry W. Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics 168, pp. 1223-1247, 2017
7. Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. Proceedings of the 34th International Conference on Machine Learning pp. 2847-2854, 2017.
8. Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in Neural Information Processing Systems 29 pp. 3360–3368, 2016.
9. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. Understanding deep learning requires rethinking generalization.
10. Laurent Dinh, Razvan Pascanu, Samy Bengio, and Joshua Bengio. Sharp minima can generalize for deep nets. Proceedings of the 34th International Conference on Machine Learning
11. Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems 27, pp. 2933-2941, 2014.
12. Anna Choromanska, Mikael Henaff, Michael Mathieu, Arous Michael, Ben Gerard , and Yann LeCun. The loss surface of multilayer networks. JMLR , 38, 2015.
13. Jeffrey Pennington, Yasaman Bahri. Geometry of Neural Network Loss Surfaces via Random Matrix Theory. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.
14. Jeffrey Pennington, Pratik Worah. Nonlinear random matrix theory for deep learning. 31st Conference on Neural Information Processing Systems (NIPS 2017)
15. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov Dropout: A Simple Way to Prevent Neural Networks from Overfitting Journal of Machine Learning Research 15 1929-1958, 2014.
16. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
17. H. Chau Nguyen, Riccardo Zecchina, and Johannes Berg. Inverse statistical problems: from the inverse Ising problem to data science. Advances in Physics, 66:3, pp. 197-261, 2017.
18. Carlo Di Castro and Roberto Raimondi. Statistical mechanics and applications in condensed matter. Cambridge University press, 2015.
19. David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University press, 2003.
20. E.T. Jaynes. Probability Theory, the logic of science. Cambridge University Press, 2003.
21. See for example the related Kaggle page for a description of the problem and the dataset https://www.kaggle.com/c/boston-housing.
22. Pasquale Calabrese and John Cardy. Quantum quenches in extended systems. J. Stat. Mech., P06008, 2007.
23. Shang-Keng Ma. Modern Theory of Critical Phenomena. Advanced Book Program, Westview press, 1976.
24. Marzio Cassandro and Giovanni Jona-Lasinio. Critical Point Behaviour and Probability Theory. Advances in Physics, 27:6, pp. 913-941, 1978.
25. Marc Mézard and Giorgio Parisi. The Bethe lattice spin glass revisited. European Physical Journal B 20, pp. 217-233, 2001.
26. Jack Raymond, Andre Manoel, Manfred Opper. Expectation Propagation. Lecture notes from the autumn school: ?Statistical Physics, Optimization, Inference, and Message-Passing Algorithms?, Les Houches, 2013. arXiv:1409.6179v1.
27. Connie Kou, Hwee Kuan Lee and Teck Khim Ng. Distribution Regression Network. arXiv:1804.04775, 2018.
28. John Hertz, Anders Krogh and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, 1991.
29. Michel Le Bellac, Fabrice Mortessagne and G. George Batrouni. Equilibrium and non-Equilibrium Statistical Thermodynamics. Cambridge University press, 2004.
30. A. Engel and C. Van den Broeck. Statistical mechanics of learning. Cambridge University press , 2004.
31. Marc Mezard, Giorgio Parisi and Miguel Angel Virasoro. Spin Glass Theory and Beyond. World Scientific Lecture Notes in Physics, Volume 9, 1987.
32. Cirano De Dominicis and Irene Giardina. Random Fields and Spin Glasses A Field Theory Approach. Cambridge University Press, 2016.
33. Daniel J. Amit and Hanoch Gutfreund. Spin Glass model of Neural Networks. Physical Review A, 32 , pp. 1007, 1985.
34. Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina. Entropy-SGD: Biasing Gradient descent into wide valleys. arXiv:1611.01838, 2017.
35. Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
36. All the code and data produced in this work can be found on https://github.com/WessZumino/expectation_propagation. This includes a pure Python and a Tensorflow implementation, together with the data used in the numerical experiments.
37. Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference for Learning Representations, 2015.
38. Levent Sagun, Leon Bottou and Yann LeCun. Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond. arXiv:1611.07476v2, 2017.
39. Maciej Koch-Janusz and Zohar Ringel. Mutual information, neural networks and the renormalization group. Nature Physics, doi: 10.1038/s41567-018-0081-4, 2018.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters