# Exploring the Function Space of Deep-Learning Machines

Bo Li, Department of Physics, The Hong Kong University of Science and Technology, Hong Kong

David Saad, Non-linearity and Complexity Research Group, Aston University, Birmingham B4 7ET, United Kingdom

###### Abstract

The function space of deep-learning machines is investigated by studying growth in the entropy of functions of a given error with respect to a reference function, realized by a deep-learning machine. Using physics-inspired methods we study both sparsely and densely-connected architectures to discover a layer-wise convergence of candidate functions, marked by a corresponding reduction in entropy when approaching the reference function, gain insight into the importance of having a large number of layers, and observe phase transitions as the error increases.

Deep-learning machines (DLM) have both fascinated and bewildered the scientific community and have given rise to an active and ongoing debate Elad (2017). They are carefully structured layered networks of non-linear elements, trained on data to perform complex tasks such as speech recognition, image classification, and natural language processing. While their phenomenal engineering successes LeCun et al. (2015) have been broadly recognized, their scientific foundations remain poorly understood, particularly their ability to generalize well from a number of examples that is small compared with the number of degrees of freedom Zhang et al. (2017); Chaudhari et al. (2017); Neyshabur et al. (2017) and the nature of the layer-wise internal representations Zeiler and Fergus (2014); Yosinski et al. (2015).

Supervised learning in DLM is based on the introduction of example pairs of input and output patterns, which serve as constraints on the space of candidate functions. As more examples are introduced, the function space monotonically decreases. Statistical physics methods have been successful in gaining insight into both pattern-storage Gardner (1988) and learning scenarios, mostly in single-layer machines Hertz et al. (1991) but also in simple two-layer scenarios Watkin et al. (1993); Saad and Solla (1995). However, extending these methods to DLM is difficult due to the recursive application of non-linear functions in successive layers and the undetermined degrees of freedom in intermediate layers. While training examples determine both input and output patterns, the constraints imposed on hidden-layer representations are difficult to pin down. These constitute the main obstacles to a better understanding of DLM.

In this Letter, we propose a general framework for analyzing DLM by mapping them onto a dynamical system and by employing the Generating Functional (GF) approach to analyze their typical behavior. More specifically, we investigate the landscape in function space around a reference function by perturbing its parameters (weights in the DLM setting), and quantifying the entropy of the corresponding function space for a given level of error with respect to the reference function. This provides a measure for the abundance of nearly-perfect solutions and hence an indication of the ability to obtain good approximations using DLM. The function error measure is defined as the expected difference (Hamming distance in the discrete case) between the perturbed and reference functions' outputs given the same input (additional explanation is provided in Li2 ()). This setup is reminiscent of the teacher-student scenario, commonly used in the neural networks literature Saad (1998), where the average error serves as a measure of distance between the perturbed and reference network in function space. For certain classes of reference networks, we obtain closed-form solutions of the error as a function of the perturbation on each layer, and consequently the weight-space volume for a given level of function error. By virtue of supervised learning and the constraints imposed by the examples provided, high-error functions will be ruled out faster than those with low errors, such that the candidate function space is reduced and the concentration of low-error functions increases. A somewhat similar approach, albeit based on recursive mean-field relations between each two consecutive layers separately, has been used to probe the expressivity of DLM Poole et al. (2016).

Through the GF framework and entropy maximization, we analyze the typical behavior of different classes of models including networks with continuous and binary parameters (weights) and different topologies, both fully and sparsely connected. We find that as one lowers the error level, typical functions gradually better match the reference network, starting from earlier layers and proceeding to later ones. More drastically, for fully connected binary networks, weights in earlier layers of the perturbed functions will perfectly match those of the reference function, implying a possible successive layer-by-layer learning behavior. Sparsely connected topologies exhibit phase transitions with respect to the number of layers as the magnitude of perturbation is varied, similar to the phase transitions in noisy Boolean computation Mozeika et al. (2009), which supports the need for deep networks to improve generalization.

Densely connected network models–The model considered here comprises two coupled feed-forward DLM as illustrated in Fig. 1, one of which serves as the reference function and the other is obtained by perturbing the reference network parameters. We first consider densely connected networks. Each network is composed of $L+1$ layers (indexed $l = 0, \dots, L$) of $N$ neurons each. The reference function is parameterized by $N^2 \times L$ weight variables $\hat{w}^l$, $l = 1, \dots, L$, and maps an $N$-dimensional input $\hat{s}^0 \in \{-1, 1\}^N$ to an $N$-dimensional output $\hat{s}^L \in \{-1, 1\}^N$, through intermediate-layer internal representations $\hat{s}^l$, $l = 1, \dots, L-1$, according to the stochastic rule

$$
P(\hat{s}^L \mid \hat{w}, \hat{s}^0) = \prod_{l=1}^{L} P(\hat{s}^l \mid \hat{w}^l, \hat{s}^{l-1}). \tag{1}
$$

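The $i$-th neuron in the $l$-th layer experiences a local field $\hat{h}_i^l(\hat{w}^l, \hat{s}^{l-1}) = \frac{1}{\sqrt{N}} \sum_j \hat{w}_{ij}^l \hat{s}_j^{l-1}$, and its state is determined by the conditional probability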

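$$
P(\hat{s}_i^l \mid \hat{w}^l, \hat{s}^{l-1}) = \frac{e^{\beta \hat{s}_i^l \hat{h}_i^l(\hat{w}^l, \hat{s}^{l-1})}}{2\cosh\left[\beta \hat{h}_i^l(\hat{w}^l, \hat{s}^{l-1})\right]}, \tag{2}
$$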

where the temperature $\beta^{-1}$ quantifies the strength of thermal noise. In the noiseless limit $\beta \to \infty$, node $i$ represents a perceptron $\hat{s}_i^l = \mathrm{sgn}(\hat{h}_i^l)$, and Eq. (1) corresponds to a deterministic neural network with a sign activation function. The perturbed network operates in the same manner, but its weights are obtained by applying an independent perturbation to each of the reference weights; the perturbed weights $w^l$ give rise to a function that is correlated with the reference function.

We focus on the similarity between the outputs of the reference and perturbed functions for randomly sampled input patterns $\hat{s}^0$, drawn from some distribution $P(\hat{s}^0)$. We consider the joint probability of the two systems

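$$
P[\{\hat{s}^l\}, \{s^l\}] = P(\hat{s}^0) \prod_{i=1}^{N} \delta_{s_i^0, \hat{s}_i^0} \prod_{l=1}^{L} P(\hat{s}^l \mid \hat{w}^l, \hat{s}^{l-1})\, P(s^l \mid w^l, s^{l-1}), \tag{3}
$$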

where the weight parameters $\hat{w}^l$ and $w^l$ are quenched disordered variables. We consider two cases, where the weights are continuous or discrete variables drawn from the Gaussian and Bernoulli distributions, respectively. The quantities of interest are the overlaps between the two functions at the different layers, $q^l = \frac{1}{N} \sum_i \langle \hat{s}_i^l s_i^l \rangle$, where angled brackets denote the average over the joint probability $P[\{\hat{s}^l\}, \{s^l\}]$. The outputs represent weakly coupled Boolean functions with the same form of disorder, and thus share the same average behavior.

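The form of the joint probability distribution of the two systems, Eq. (3), is analogous to the dynamical evolution of disordered Ising spin systems Hatchett et al. (2004) if the layers are viewed as discrete time steps of parallel dynamics. We therefore apply the GF formulation from statistical physics to these deep feed-forward functions, similarly to the approach used to investigate random Boolean formulae Mozeika et al. (2009). We compute the GF $\Gamma[\hat{\psi}, \psi] = \langle e^{-i \sum_{l,i} (\hat{\psi}_i^l \hat{s}_i^l + \psi_i^l s_i^l)} \rangle$, from which the moments can be calculated, e.g., $\langle \hat{s}_i^l s_i^l \rangle = -\lim_{\hat{\psi}, \psi \to 0} \partial^2 \Gamma / \partial \hat{\psi}_i^l \partial \psi_i^l$. Assuming the systems are self-averaging for $N \to \infty$ and computing the disorder average (denoted by the upper line) $\overline{\Gamma[\hat{\psi}, \psi]}$, the disorder-averaged overlaps $q^l = \frac{1}{N} \sum_i \overline{\langle \hat{s}_i^l s_i^l \rangle}$ can be obtained. For convenience, we introduce the field doublet $H^l \equiv [\hat{h}^l, h^l]^T$. Expressing the GF by macroscopic order parameters and averaging over the disorder yields the saddle-point integral $\overline{\Gamma} = \int \{dq\, dQ\}\, e^{N \Psi[q, Q]}$, where $\Psi$ is Li2 ()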

$$
\Psi = i \sum_{l=0}^{L} Q^l q^l + \log \int \prod_{l=1}^{L} d\hat{h}^l\, dh^l \sum_{\{\hat{s}^l, s^l\}} M[\hat{s}, s, \hat{h}, h], \tag{4}
$$

and the effective single site measure has the following form for both continuous and binary weights

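$$
M[\hat{s}, s, \hat{h}, h] = P(\hat{s}^0)\, \delta_{\hat{s}^0, s^0}\, e^{-i \sum_{l=0}^{L} Q^l \hat{s}^l s^l} \times \prod_{l=1}^{L} \left\{ \frac{e^{\beta \hat{s}^l \hat{h}^l}}{2\cosh \beta \hat{h}^l}\, \frac{e^{\beta s^l h^l}}{2\cosh \beta h^l}\, \frac{e^{-\frac{1}{2} (H^l)^T \cdot \Sigma_l^{-1} \cdot H^l}}{\sqrt{(2\pi)^2 |\Sigma_l(q^{l-1})|}} \right\}. \tag{5}
$$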

The Gaussian density of the local fields in (5) comes from summing a large number of random variables in $\hat{h}^l$ and $h^l$. The precision matrix $\Sigma_l^{-1}$, linking the effective fields $\hat{h}^l$ and $h^l$, measures the correlation between internal fields of the two systems and depends on the overlap $q^{l-1}$ of the previous layer. In the limit $N \to \infty$ the GF is dominated by the extremum of $\Psi$. Variation with respect to $Q^l$ gives rise to saddle-point equations of the order parameters $q^l = \langle \hat{s}^l s^l \rangle_M$, where the average is taken over the measure $M$ of (5). The conjugate order parameter $Q^l$, ensuring the normalization of the measure, vanishes identically. This leads to the evolution equation Li2 ()

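$$
q^l = \int d\hat{h}^l\, dh^l\, \tanh(\beta \hat{h}^l) \tanh(\beta h^l)\, \frac{e^{-\frac{1}{2} (H^l)^T \cdot \Sigma_l^{-1} \cdot H^l}}{\sqrt{(2\pi)^2 |\Sigma_l|}}. \tag{6}
$$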

The overlap evolution is somewhat similar to the dynamical mean-field relation in Poole et al. (2016), but the objects investigated and the remainder of the study are different. We focus on the function-space landscape rather than the sensitivity of the function to input perturbations.
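As a rough numerical illustration of Eq. (6), the sketch below estimates the overlap by Monte Carlo sampling of the two correlated Gaussian fields. It assumes fields of equal variance `sigma**2` and correlation coefficient `rho` (the normalized off-diagonal element of $\Sigma_l$); the function names and parameters are hypothetical, chosen only for this example.

```python
import numpy as np


def overlap_from_correlation(rho, beta, sigma=1.0, n_samples=400_000, seed=0):
    """Monte Carlo estimate of Eq. (6).

    Assumption: the reference and perturbed fields are jointly Gaussian with
    variance sigma^2 each and correlation coefficient rho.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_samples)
    z2 = rng.standard_normal(n_samples)
    h_ref = sigma * z1
    # Construct a second field with the requested correlation to h_ref.
    h_pert = sigma * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    return float(np.mean(np.tanh(beta * h_ref) * np.tanh(beta * h_pert)))


def zero_noise_overlap(rho):
    """Noiseless limit (beta -> infinity) of the same integral."""
    return 2.0 / np.pi * np.arcsin(rho)


if __name__ == "__main__":
    for beta in (0.5, 2.0, 10.0, 1e3):
        print(beta, overlap_from_correlation(0.5, beta), zero_noise_overlap(0.5))
```

For large `beta` the estimate approaches the arcsine form; at finite `beta` thermal noise reduces the overlap.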

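Densely connected continuous weights–In the first scenario, we assume the weight variables $\hat{w}_{ij}^l$ to be independently drawn from a Gaussian density $\mathcal{N}(0, \sigma^2)$ and the perturbed weights to have the form $w_{ij}^l = \sqrt{1 - (\eta^l)^2}\, \hat{w}_{ij}^l + \eta^l\, \delta w_{ij}^l$, where $\delta w_{ij}^l$ are drawn from $\mathcal{N}(0, \sigma^2)$ independently of $\hat{w}_{ij}^l$. This ensures that $w_{ij}^l$ has the same variance $\sigma^2$. The parameter $\eta^l$ quantifies the strength of the perturbation introduced in layer $l$. In this case the covariance matrix between the local fields $\hat{h}^l$ and $h^l$ takes the form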

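$$
\Sigma_l(\eta^l, q^{l-1}) = \sigma^2 \begin{bmatrix} 1 & \sqrt{1 - (\eta^l)^2}\, q^{l-1} \\ \sqrt{1 - (\eta^l)^2}\, q^{l-1} & 1 \end{bmatrix}, \tag{7}
$$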

leading to the closed-form solution of the overlap as $\beta \to \infty$,

$$
q^l = \frac{2}{\pi} \sin^{-1}\left( \sqrt{1 - (\eta^l)^2}\, q^{l-1} \right). \tag{8}
$$

Of particular interest is the final-layer overlap $q^L$, given the same input for the two systems ($q^0 = 1$), under specific perturbations $\{\eta^l\}$. The average error $\varepsilon = \frac{1}{2}(1 - q^L)$ measures the typical distance between the two mappings.
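The recursion (8) is simple to iterate, and it can be compared against a direct finite-size simulation of two sign-activation networks. The following is a minimal sketch under stated assumptions: the names `overlap_theory` and `overlap_simulated` are hypothetical, the simulation uses square $N \times N$ Gaussian weight matrices with the perturbation form described above, and the $1/\sqrt{N}$ field normalization is omitted because it does not affect the sign.

```python
import numpy as np


def overlap_theory(etas, q0=1.0):
    """Iterate Eq. (8) layer by layer, starting from overlap q0 at the input."""
    q = q0
    for eta in etas:
        arg = np.clip(np.sqrt(1.0 - eta**2) * q, -1.0, 1.0)  # guard rounding
        q = 2.0 / np.pi * np.arcsin(arg)
    return float(q)


def overlap_simulated(etas, n=1000, sigma=1.0, n_inputs=20, seed=0):
    """Empirical final-layer overlap of a reference and a perturbed network.

    Assumption: noiseless limit (sign activation), Gaussian reference weights,
    and perturbed weights sqrt(1 - eta^2) * w_ref + eta * dw.
    """
    rng = np.random.default_rng(seed)
    s_ref = rng.choice([-1.0, 1.0], size=(n, n_inputs))
    s_pert = s_ref.copy()  # both networks receive the same inputs
    for eta in etas:
        w_ref = sigma * rng.standard_normal((n, n))
        dw = sigma * rng.standard_normal((n, n))
        w_pert = np.sqrt(1.0 - eta**2) * w_ref + eta * dw
        s_ref = np.sign(w_ref @ s_ref)
        s_pert = np.sign(w_pert @ s_pert)
    return float(np.mean(s_ref * s_pert))


if __name__ == "__main__":
    etas = [0.3, 0.3, 0.3]
    q_th = overlap_theory(etas)
    print("theory q^L:", q_th, "error:", 0.5 * (1.0 - q_th))
    print("simulated q^L:", overlap_simulated(etas))
```

Agreement is only expected up to finite-size fluctuations, which shrink as `n` grows.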

The number of solutions at a given distance (error) $\varepsilon$ away from the reference function indicates how difficult it is to obtain this level of approximation in the vicinity of the exact function. Let the $N$-dimensional vectors $\hat{w}_i^l$ and $w_i^l$ denote the weights of the $i$-th perceptron of the reference and perturbed systems at layer $l$, respectively; the expected angle between them is $\theta^l = \sin^{-1} \eta^l$. Then the perceptron $w_i^l$ occupies on average an angular volume around $\hat{w}_i^l$ as $\Omega(\eta^l) \sim \sin^{N-2} \theta^l = (\eta^l)^{N-2}$ Seung et al. (1992); Engel and Broeck (2001). The total weight-space volume of the perturbed system is $\Omega_{\mathrm{tot}}(\{\eta^l\}) = \prod_{l=1}^{L} \prod_{i=1}^{N} \Omega(\eta^l)$, and the corresponding entropy density is

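$$
S_{\mathrm{con}}(\{\eta^l\}) = \frac{1}{L N^2} \log \Omega_{\mathrm{tot}}(\{\eta^l\}) \approx \frac{1}{L} \sum_{l=1}^{L} \log \eta^l. \tag{9}
$$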

In the thermodynamic limit $N \to \infty$, the set of perturbed functions at distance $\varepsilon$ away from the reference function is dominated by those with perturbation vector $\eta^* = \{\eta^{l*}\}$, which maximizes the entropy $S_{\mathrm{con}}$ subject to the constraint $q^L(\{\eta^l\}) = 1 - 2\varepsilon$. The result of $\eta^*(\varepsilon)$ for a four-layer network, shown in Fig. 2(a), reveals that the dominant perturbation to the reference network decays faster for smaller $\varepsilon$ values; this indicates that closer to the reference function, solutions are dominated by functions whose early-layer weights better match the reference network. Consequently, high-$\varepsilon$ functions are ruled out faster during training through the successful alignment of earlier layers, resulting in an increasing concentration of low-$\varepsilon$ functions and better generalization. We denote the maximal weight-space volume at distance $\varepsilon$ away from the reference function as .
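The constrained maximization can be illustrated numerically in a reduced setting. The sketch below is not the four-layer computation of Fig. 2(a); it assumes only two layers, so that the constraint on the final overlap fixes the second-layer perturbation once the first is chosen, and a one-dimensional grid search suffices. Function names and the grid resolution are hypothetical choices.

```python
import numpy as np


def final_overlap(eta1, eta2):
    """Two applications of Eq. (8) starting from q^0 = 1."""
    q1 = 2.0 / np.pi * np.arcsin(np.sqrt(1.0 - eta1**2))
    return 2.0 / np.pi * np.arcsin(np.sqrt(1.0 - eta2**2) * q1)


def dominant_perturbation_two_layers(eps, n_grid=200_001):
    """Maximize the entropy (9) for L = 2 at fixed error eps.

    The constraint q^2 = 1 - 2*eps is equivalent to
    sqrt(1 - eta2^2) * q^1 = cos(pi * eps), which determines eta2 from eta1.
    """
    target = np.cos(np.pi * eps)
    eta1 = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    q1 = 2.0 / np.pi * np.arcsin(np.sqrt(1.0 - eta1**2))
    c2 = target / q1                 # required value of sqrt(1 - eta2^2)
    feasible = c2 < 1.0              # otherwise no real eta2 > 0 exists
    eta1, c2 = eta1[feasible], c2[feasible]
    eta2 = np.sqrt(1.0 - c2**2)
    entropy = 0.5 * (np.log(eta1) + np.log(eta2))
    best = int(np.argmax(entropy))
    return float(eta1[best]), float(eta2[best]), float(entropy[best])


if __name__ == "__main__":
    for eps in (0.02, 0.05, 0.1):
        print(eps, dominant_perturbation_two_layers(eps))
```

In this toy case the entropy-maximizing first-layer perturbation is smaller than the second-layer one, and the ratio between them shrinks as the error is lowered, in line with the layer-wise trend described above.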

Supervised learning is based on the introduction of input-output example pairs. Introducing constraints, in the form of examples provided by the reference function, the weight-space volume at small distance $\varepsilon$ away from the reference function is re-shaped as in the annealed approximation Seung et al. (1992); Engel and Broeck (2001); details of the derivation can be found in Li2 (). The typical distance can be interpreted as the generalization error in the presence of examples, giving rise to the approximate generalization curves shown in Fig. 2(c). These are expected to be valid in the limit of small $\varepsilon$ (large number of examples), on which the perturbation analysis is based. It is observed that typically a large number of examples () are needed for good generalization. This may imply that DLMs trained on realistic data sets (usually ) occupy a small, highly-biased subspace, different from the typical function space analyzed here (e.g., the handwritten digit MNIST database represents highly biased inputs that occupy a very small fraction of the input space). Note that the results correspond to a typical generalization performance under the assumption of self-averaging, potentially with unlimited computational resources and independently of the training rule used.

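Densely connected binary weights–Once trained, networks with binary weights are highly efficient computationally, which is especially useful in devices with limited memory or computational resources Courbariaux et al. (2016); Rastegari et al. (2016). Here we consider a reference network with binary weight variables $\hat{w}_{ij}^l \in \{-1, 1\}$ drawn from the distribution $P(\hat{w}_{ij}^l) = \frac{1}{2} \delta_{\hat{w}_{ij}^l, 1} + \frac{1}{2} \delta_{\hat{w}_{ij}^l, -1}$, while the perturbed network weights follow the distribution $P(w_{ij}^l) = (1 - p^l)\, \delta_{w_{ij}^l, \hat{w}_{ij}^l} + p^l\, \delta_{w_{ij}^l, -\hat{w}_{ij}^l}$, where $p^l$ is the flipping probability at layer $l$. The covariance matrix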

$$
\Sigma_l(p^l, q^{l-1}) = \begin{bmatrix} 1 & (1 - 2p^l)\, q^{l-1} \\ (1 - 2p^l)\, q^{l-1} & 1 \end{bmatrix}, \tag{10}
$$

gives rise to overlaps, as $\beta \to \infty$, of the form

$$
q^l = \frac{2}{\pi} \sin^{-1}\left( (1 - 2p^l)\, q^{l-1} \right). \tag{11}
$$

The entropy density of the perturbed system is given by

$$
S_{\mathrm{bin}}(\{p^l\}) = \frac{1}{L} \sum_{l=1}^{L} \left[ -p^l \log p^l - (1 - p^l) \log(1 - p^l) \right]. \tag{12}
$$

Similarly, the entropy $S_{\mathrm{bin}}$ is maximized by the perturbation vector $p^* = \{p^{l*}\}$ subject to $q^L(\{p^l\}) = 1 - 2\varepsilon$ at a distance $\varepsilon$ away from the reference function. The result of $p^*(\varepsilon)$ for a four-layer binary neural network is shown in Fig. 2(b). Surprisingly, as $\varepsilon$ decreases, the first-layer weights are the first to align perfectly with those of the reference function, followed by the second-layer weights and so on. The discontinuities come from the non-convex nature of the entropy landscape when one restricts the perturbed system to the nonlinear $\varepsilon$-error surface satisfying $q^L(\{p^l\}) = 1 - 2\varepsilon$. Nevertheless, there exist many more high-$\varepsilon$ than low-$\varepsilon$ functions for densely-connected binary networks (as indicated by the entropy shown in the inset of Fig. 2(b)), and it remains to explore how functions with low generalization error could be identified.
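The perfect alignment of early layers can be seen in a reduced example. The sketch below assumes only two layers (not the four-layer network of Fig. 2(b)) and restricts flipping probabilities to $[0, 1/2)$; with the final overlap fixed, the second-layer flipping probability follows from the first, and the entropy (12) is maximized by a grid search. The function names are hypothetical.

```python
import numpy as np


def binary_entropy(p):
    """Elementwise -p log p - (1-p) log(1-p), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    inside = (p > 0.0) & (p < 1.0)
    pi = p[inside]
    out[inside] = -pi * np.log(pi) - (1.0 - pi) * np.log(1.0 - pi)
    return out


def dominant_flip_two_layers(eps, n_grid=500_000):
    """Maximize the entropy (12) for L = 2 at fixed error eps.

    From Eq. (11), q^2 = 1 - 2*eps requires (1 - 2*p2) * q^1 = cos(pi * eps).
    """
    target = np.cos(np.pi * eps)
    p1 = np.linspace(0.0, 0.5, n_grid, endpoint=False)
    q1 = 2.0 / np.pi * np.arcsin(1.0 - 2.0 * p1)
    c2 = target / q1                  # required value of 1 - 2*p2
    feasible = c2 <= 1.0              # p2 must be non-negative
    p1, c2 = p1[feasible], c2[feasible]
    p2 = 0.5 * (1.0 - c2)
    entropy = 0.5 * (binary_entropy(p1) + binary_entropy(p2))
    best = int(np.argmax(entropy))
    return float(p1[best]), float(p2[best]), float(entropy[best])


if __name__ == "__main__":
    print(dominant_flip_two_layers(0.05))
```

At small error the maximizer in this toy case sits at zero first-layer flipping probability, so the whole perturbation budget is spent in the last layer.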

Sparsely connected binary weights–Lastly, we consider sparsely connected DLM with binary weights; these topologies are of interest to practitioners due to the reduction in degrees of freedom and their computational and energy efficiency. The layered setup is similar to the previous case, except that unit $i$ at layer $l$ is randomly connected to a small number $k$ of units in layer $l-1$ and its local field is given by where the adjacency matrix represents the connectivity between the two layers. The perturbed network has the same topology but its weights are randomly flipped with probability $p^l$; the activation and the joint probability of the two systems follow from (2) and (3). Unlike the case of densely-connected networks, the magnetization $m^l = \frac{1}{N} \sum_i \langle s_i^l \rangle$ also plays an important role in the evolution of sparse networks. The GF approach gives rise to the order parameter $P^l(\hat{s}, s)$, the joint single-site distribution of the two systems, relating to the magnetization and overlap by $P^l(\hat{s}, s) = \frac{1}{4}\left(1 + \hat{s}\, \hat{m}^l + s\, m^l + \hat{s} s\, q^l\right)$.

The random topology provides an additional disorder to average over. For simplicity, we assign the reference weights to $\hat{w}_{ij}^l = 1$, which in the limit $\beta \to \infty$ relate to the $k$-majority gate (MAJ-$k$) based Boolean formulas that provide all Boolean functions with uniform probability in the large-$L$ limit Savický (1990); Mozeika et al. (2010). For a uniform perturbation over layers, $p^l = p$, we focus on functions generated in the deep regime $L \to \infty$, where the order parameters take the form

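\begin{align}
m^l &= \sum_{\{s_j\}} \prod_{j=1}^{k} \frac{1}{2}\left[ 1 + s_j m^{l-1} (1 - 2p) \right] \mathrm{sgn}\left[ \sum_{j=1}^{k} s_j \right], \tag{13} \\
q^l &= \sum_{\{s_j, \hat{s}_j\}} \prod_{j=1}^{k} \frac{1}{4}\left[ 1 + \hat{s}_j \hat{m}^{l-1} + s_j m^{l-1} (1 - 2p) + s_j \hat{s}_j q^{l-1} (1 - 2p) \right] \mathrm{sgn}\left[ \sum_{j=1}^{k} \hat{s}_j \right] \mathrm{sgn}\left[ \sum_{j=1}^{k} s_j \right]. \nonumber
\end{align}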

For finite $k$, the macroscopic observables at layer $l$ are polynomially dependent on the observables at layer $l-1$ up to order $k$. In the limit $L \to \infty$, the Boolean functions generated depend on the initial magnetization $\hat{m}^0$. Here, we consider the biased case with initial conditions $\hat{m}^0 = m^0 > 0$ and $q^0 = 1$. The reference function admits a stationary solution $\hat{m}^\infty = 1$, computing a 1-bit information-preserving majority function Mozeika et al. (2010). Both the magnetization of the perturbed function and the function error exhibit a transition from the ordered phase to the paramagnetic phase at some critical perturbation level $p_c$, below which the perturbed network computes the reference function with error $\varepsilon < 1/2$. The results for $k = 3$ are shown in Fig. 2(d). Interestingly, the critical perturbation coincides with the location of the critical thermal noise for noisy $k$-majority gate-based Boolean formulas; for $k = 3$, the critical perturbation is $p_c = 1/6$ Mozeika et al. (2009). Below $p_c$, there exist two ordered states with and the overlap satisfies  Li2 (), which is also reminiscent of the thermal noise-induced solutions Mozeika et al. (2009). However, the underlying physical implications are drastically different. Here it indicates that even in the deep network regime, there exists a large number of networks that can reliably represent the reference function when $p < p_c$. This function landscape is important for learning tasks that aim to achieve a rule similar to the reference function. The propagation of internal error , shown in Fig. 2(f), exhibits a stage of error increase followed by a stage of error decrease for . Consequently, a successful sparse DLM requires more layers to reduce errors and provide a higher similarity to the reference function as we approach $p_c$, indicating the need for deep networks in such models.
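The transition can be reproduced by iterating the layer recursion (13) directly. The sketch below enumerates all input configurations of a $k$-majority gate, which is feasible only for small odd $k$. It assumes that the reference magnetization follows the same recursion with no flipping ($p = 0$), which is an assumption of this example rather than an equation given above; the function names and the initial magnetization 0.2 are likewise hypothetical.

```python
from itertools import product
from math import prod


def majority_layer(m_ref, m, q, p, k=3):
    """One layer of the recursion (13) for MAJ-k gates (k odd).

    Assumption: the reference magnetization m_ref evolves as in (13) with p = 0.
    """
    a = 1.0 - 2.0 * p
    configs = list(product((-1, 1), repeat=k))
    sign = {c: (1 if sum(c) > 0 else -1) for c in configs}

    new_m_ref = sum(prod(0.5 * (1 + s * m_ref) for s in c) * sign[c] for c in configs)
    new_m = sum(prod(0.5 * (1 + s * m * a) for s in c) * sign[c] for c in configs)
    new_q = 0.0
    for c_ref in configs:
        for c in configs:
            weight = prod(
                0.25 * (1 + sr * m_ref + s * m * a + s * sr * q * a)
                for sr, s in zip(c_ref, c)
            )
            new_q += weight * sign[c_ref] * sign[c]
    return new_m_ref, new_m, new_q


def deep_limit(p, m0=0.2, k=3, n_layers=200):
    """Iterate from identical biased inputs (m_ref = m = m0, q = 1)."""
    m_ref, m, q = m0, m0, 1.0
    for _ in range(n_layers):
        m_ref, m, q = majority_layer(m_ref, m, q, p, k)
    return m_ref, m, q


if __name__ == "__main__":
    for p in (0.05, 0.1, 0.15, 0.2, 0.25):
        m_ref, m, q = deep_limit(p)
        print(f"p={p:.2f}  m={m:.4f}  q={q:.4f}  error={(1 - q) / 2:.4f}")
```

For $k = 3$ the perturbed magnetization stays finite below $p = 1/6$ and decays to zero above it, where the error saturates at $1/2$.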

In summary, we propose a GF analysis to probe the function landscapes of DLM, focusing on the entropy of functions, given their error with respect to a reference function. The entropy maximization of densely connected networks at fixed error with respect to the reference function indicates that weights of earlier layers are the first to align with the reference function parameters when the error decreases. It highlights the importance of early-layer weights for reliable computation Raghu et al. (2017) and sheds light on the parameter learning dynamics in function space during the learning process. We also investigate the phase-transition behavior in sparsely-connected networks, which advocates the use of deeper machines for suppressing errors with respect to the reference function in these models. The suggested GF framework is very general and can accommodate other structures and computing elements, e.g., continuous variables, other activation functions (such as the commonly used ReLU activation function Li2 ()) and more complicated weight ensembles. In Li2 (), we also demonstrate the effect of negatively/positively correlated weight variables on the expressive power of networks with ReLU activation and their impact on the function space, and investigate the behavior of simple convolutional DLM. Moreover, the GF framework allows one to investigate other aspects as well, including finite-size effects and the use of perturbative expansions to provide a systematic analysis of the interactions between network elements. This is a step towards a principled investigation of the typical behavior of DLM, and we envisage follow-up work on various aspects of the learning process.

###### Acknowledgements.
We thank K. Y. Michael Wong for helpful discussions. Support from The Leverhulme Trust (RPG-2013-48) (DS) and Research Grants Council of Hong Kong (Grants No. 605813 and No. 16322616) (BL) is acknowledged.

# Supplemental Material

## 1 Function Space and the Entropy of Solutions

Our study explores the function space and especially the number of solutions in the vicinity of a reference function. It relies on the assumption, in the absence of any other detailed information, that the larger the volume of solutions, the easier it is to find one. For instance, if there exist many low-error functions in function space (Fig. S1(a) left and S1(b) left), then it is in general easier to achieve good alignment with the reference function (generalization) by some generic unspecified learning algorithm. On the other hand, if the volume of low-error functions is very small compared to that of high-error solutions (Fig. S1(a) right and S1(b) right), then it is generally harder to achieve good alignment (generalization) since the function space is dominated by the high-error functions. This concept was introduced previously, albeit for simpler frameworks Schwartz et al. (1990).

## 2 Disorder Averaged Generating Functional for Densely-Connected Networks

Here we give detailed derivations of the generating functional for densely connected networks

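$$
\begin{aligned}
\Gamma[\hat{\psi}, \psi] &= \sum_{\{s_i^l, \hat{s}_i^l\}_{\forall l, i}} P(\hat{s}^0) \prod_{i=1}^{N} \delta_{s_i^0, \hat{s}_i^0} \prod_{l=1}^{L} P(\hat{s}^l \mid \hat{w}^l, \hat{s}^{l-1})\, P(s^l \mid w^l, s^{l-1})\, e^{-i \sum_{l,i} (\hat{\psi}_i^l \hat{s}_i^l + \psi_i^l s_i^l)} \\
&= \sum_{\{s_i^l, \hat{s}_i^l\}_{\forall l, i}} P(\hat{s}^0) \prod_{i} \delta_{\hat{s}_i^0, s_i^0}\, e^{-i \sum_{l,i} (\hat{\psi}_i^l \hat{s}_i^l + \psi_i^l s_i^l)} \prod_{l=1}^{L} \prod_{i} \frac{e^{\beta \hat{s}_i^l \hat{h}_i^l(\hat{w}^l, \hat{s}^{l-1})}}{2\cosh\left[\beta \hat{h}_i^l(\hat{w}^l, \hat{s}^{l-1})\right]}\, \frac{e^{\beta s_i^l h_i^l(w^l, s^{l-1})}}{2\cosh\left[\beta h_i^l(w^l, s^{l-1})\right]},
\end{aligned}
\tag{S1}
$$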

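with the local field $\hat{h}_i^l(\hat{w}^l, \hat{s}^{l-1}) = \frac{1}{\sqrt{N}} \sum_j \hat{w}_{ij}^l \hat{s}_j^{l-1}$, and similarly for $h_i^l(w^l, s^{l-1})$. To deal with the non-linearity of the local fields in the conditional probability, we introduce auxiliary fields $\{\hat{x}_i^l, x_i^l\}$ through the integral representation of the $\delta$-function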

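$$
1 = \int_{-\infty}^{\infty} \frac{d\hat{h}_i^l\, d\hat{x}_i^l}{2\pi}\, e^{i \hat{x}_i^l \left( \hat{h}_i^l - \frac{1}{\sqrt{N}} \sum_j \hat{w}_{ij}^l \hat{s}_j^{l-1} \right)}, \qquad 1 = \int_{-\infty}^{\infty} \frac{dh_i^l\, dx_i^l}{2\pi}\, e^{i x_i^l \left( h_i^l - \frac{1}{\sqrt{N}} \sum_j w_{ij}^l s_j^{l-1} \right)}, \tag{S2}
$$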

which allows us to express the quenched random variables $\hat{w}_{ij}^l$ and $w_{ij}^l$ linearly in the exponents, leading to

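\begin{align}
\Gamma[\hat{\psi}, \psi] = {} & \int \prod_{l=1}^{L} \prod_{i} \frac{d\hat{h}_i^l\, d\hat{x}_i^l}{2\pi} \frac{dh_i^l\, dx_i^l}{2\pi} \sum_{\{\hat{s}_i^l, s_i^l\}_{\forall l, i}} P(\hat{s}^0) \prod_{i} \delta_{\hat{s}_i^0, s_i^0}\, e^{-i \sum_{l,i} (\hat{\psi}_i^l \hat{s}_i^l + \psi_i^l s_i^l)} \nonumber \\
& \times \exp\left\{ -\frac{i}{\sqrt{N}} \sum_{l=1}^{L} \left[ \sum_{ij} \hat{w}_{ij}^l \hat{x}_i^l \hat{s}_j^{l-1} + \sum_{ij} w_{ij}^l x_i^l s_j^{l-1} \right] \right\} \tag{S3} \\
& \times \exp\left\{ \sum_{l=1}^{L} \sum_{i} \left[ \beta \hat{s}_i^l \hat{h}_i^l + \beta s_i^l h_i^l - \log 2\cosh \beta \hat{h}_i^l - \log 2\cosh \beta h_i^l + i \hat{x}_i^l \hat{h}_i^l + i x_i^l h_i^l \right] \right\}. \tag{S4}
\end{align}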

Assuming the system is self-averaging, the disorder average can be carried out ab initio De Dominicis (1978), leaving the disorder-averaged generating functional $\overline{\Gamma[\hat{\psi}, \psi]}$. We consider two types of networks, one with continuous weight variables following a Gaussian distribution and the other with binary weight variables. In both cases, we consider an input distribution of the form

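$$
P(\hat{s}^0) = \prod_{i} P(\hat{s}_i^0) = \prod_{i} \left( \frac{1}{2} \delta_{\hat{s}_i^0, 1} + \frac{1}{2} \delta_{\hat{s}_i^0, -1} \right). \tag{S5}
$$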

### 2.1 Continuous weight variables with Gaussian disorder

In this case, we assume that the weight variables $\hat{w}_{ij}^l$ are independent and follow a Gaussian density $\mathcal{N}(0, \sigma^2)$, and that the perturbed weight has the form $w_{ij}^l = \sqrt{1 - (\eta^l)^2}\, \hat{w}_{ij}^l + \eta^l\, \delta w_{ij}^l$, where $\delta w_{ij}^l$ also has density $\mathcal{N}(0, \sigma^2)$ but is independent of $\hat{w}_{ij}^l$. For weight variables that are independent and identically distributed, an alternative derivation based on the central limit theorem can be employed Poole et al. (2016). Nevertheless, the proposed GF framework is more principled and general, and can accommodate many possible extensions.

Averaging Eq. (S3) over the weights and perturbations gives

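$$
\begin{aligned}
& \int P(\hat{w})\, P(\delta w)\, d\hat{w}\, d\delta w\, \exp\left\{ -\frac{i}{\sqrt{N}} \sum_{l=1}^{L} \left[ \sum_{ij} \hat{w}_{ij}^l \left( \hat{x}_i^l \hat{s}_j^{l-1} + \sqrt{1 - (\eta^l)^2}\, x_i^l s_j^{l-1} \right) + \eta^l \sum_{ij} \delta w_{ij}^l x_i^l s_j^{l-1} \right] \right\} \\
& \quad = \exp\left\{ -\sigma^2 \sum_{l=1}^{L} \left[ \frac{1}{2} \sum_{i} (\hat{x}_i^l)^2 + \frac{1}{2} \sum_{i} (x_i^l)^2 + \sqrt{1 - (\eta^l)^2} \sum_{i} \hat{x}_i^l x_i^l \left( \sum_{j} \frac{1}{N} \hat{s}_j^{l-1} s_j^{l-1} \right) \right] \right\},
\end{aligned}
\tag{S6}
$$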

leading to the disorder-averaged generating functional

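$$
\begin{aligned}
\overline{\Gamma[\hat{\psi}, \psi]} = {} & \int \prod_{l=1}^{L} \prod_{i} \frac{d\hat{h}_i^l\, d\hat{x}_i^l}{2\pi} \frac{dh_i^l\, dx_i^l}{2\pi} \sum_{\{\hat{s}_i^l, s_i^l\}_{\forall l, i}} P(\hat{s}^0) \prod_{i} \delta_{\hat{s}_i^0, s_i^0}\, e^{-i \sum_{l,i} (\hat{\psi}_i^l \hat{s}_i^l + \psi_i^l s_i^l)} \\
& \times \exp\left\{ -\sigma^2 \sum_{l=1}^{L} \left[ \frac{1}{2} \sum_{i} (\hat{x}_i^l)^2 + \frac{1}{2} \sum_{i} (x_i^l)^2 + \sqrt{1 - (\eta^l)^2} \sum_{i} \hat{x}_i^l x_i^l \left( \sum_{j} \frac{1}{N} \hat{s}_j^{l-1} s_j^{l-1} \right) \right] \right\} \\
& \times \exp\left\{ \sum_{l=1}^{L} \sum_{i} \left[ \beta \hat{s}_i^l \hat{h}_i^l + \beta s_i^l h_i^l - \log 2\cosh \beta \hat{h}_i^l - \log 2\cosh \beta h_i^l + i \hat{x}_i^l \hat{h}_i^l + i x_i^l h_i^l \right] \right\}.
\end{aligned}
\tag{S7}
$$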

Site factorization can be achieved by defining the macroscopic order parameter $q^l \equiv \frac{1}{N} \sum_i \hat{s}_i^l s_i^l$ through the integral representation of the $\delta$-function

$$
1 = \int \frac{dQ^l\, dq^l}{2\pi / N} \exp\left\{ i N Q^l \left[ q^l - \frac{1}{N} \sum_{i} s_i^l \hat{s}_i^l \right] \right\}, \tag{S8}
$$

which leads to

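$$
\begin{aligned}
\overline{\Gamma[\hat{\psi}, \psi]} = {} & \int \prod_{l=0}^{L} \frac{dQ^l\, dq^l}{2\pi / N} \exp\left\{ i N \sum_{l=0}^{L} Q^l q^l \right\} \\
& \times \int \prod_{l=1}^{L} \prod_{i} \frac{d\hat{h}_i^l\, d\hat{x}_i^l}{2\pi} \frac{dh_i^l\, dx_i^l}{2\pi} \sum_{\{\hat{s}_i^l, s_i^l\}_{\forall l, i}} P(\hat{s}^0) \prod_{i} \delta_{\hat{s}_i^0, s_i^0}\, e^{-i \sum_{l,i} (\hat{\psi}_i^l \hat{s}_i^l + \psi_i^l s_i^l)} \\
& \times \exp\left\{ -\sigma^2 \sum_{l=1}^{L} \left[ \frac{1}{2} \sum_{i} (\hat{x}_i^l)^2 + \frac{1}{2} \sum_{i} (x_i^l)^2 + \sqrt{1 - (\eta^l)^2} \sum_{i} \hat{x}_i^l x_i^l q^{l-1} \right] \right\} \\
& \times \exp\left\{ \sum_{l=1}^{L} \sum_{i} \left[ \beta \hat{s}_i^l \hat{h}_i^l + \beta s_i^l h_i^l - \log 2\cosh \beta \hat{h}_i^l - \log 2\cosh \beta h_i^l + i \hat{x}_i^l \hat{h}_i^l + i x_i^l h_i^l \right] \right\} \\
& \times \exp\left( -i \sum_{l=0}^{L} \sum_{i} Q^l \hat{s}_i^l s_i^l \right).
\end{aligned}
\tag{S9}
$$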

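Now the integrand factorizes over sites, and the spin and field variables have the same form for any site $i$; we therefore consider the generating functional as a function of site-independent conjugate fields, i.e., $\hat{\psi}_i^l = \hat{\psi}^l$ and $\psi_i^l = \psi^l$, which takes the form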

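\begin{align}
\overline{\Gamma[\hat{\psi}, \psi]} = {} & \int \prod_{l=0}^{L} \frac{dQ^l\, dq^l}{2\pi / N} \exp\left\{ i N \sum_{l=0}^{L} Q^l q^l \right\} \nonumber \\
& \times \Bigg\{ \int \prod_{l=1}^{L} \frac{d\hat{h}^l\, d\hat{x}^l}{2\pi} \frac{dh^l\, dx^l}{2\pi} \sum_{\{\hat{s}^l, s^l\}_{\forall l}} P(\hat{s}^0)\, \delta_{\hat{s}^0, s^0}\, e^{-i \sum_{l} (\hat{\psi}^l \hat{s}^l + \psi^l s^l)} \nonumber \\
& \quad \times \exp\left( -\sigma^2 \sum_{l=1}^{L} \left[ \frac{1}{2} (\hat{x}^l)^2 + \frac{1}{2} (x^l)^2 + \sqrt{1 - (\eta^l)^2}\, \hat{x}^l x^l q^{l-1} \right] + i \sum_{l=1}^{L} \left( \hat{x}^l \hat{h}^l + x^l h^l \right) \right) \tag{S10} \\
& \quad \times \exp\left( \sum_{l=1}^{L} \left[ \beta \hat{s}^l \hat{h}^l + \beta s^l h^l - \log 2\cosh \beta \hat{h}^l - \log 2\cosh \beta h^l \right] \right) \nonumber \\
& \quad \times \exp\left( -i \sum_{l=0}^{L} Q^l \hat{s}^l s^l \right) \Bigg\}^{N}. \tag{S11}
\end{align}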

For convenience, we define the field doublet $H^l \equiv [\hat{h}^l, h^l]^T$, and the covariance matrix

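$$
\Sigma_l(\eta^l, q^{l-1}) = \sigma^2 \begin{bmatrix} 1 & \sqrt{1 - (\eta^l)^2}\, q^{l-1} \\ \sqrt{1 - (\eta^l)^2}\, q^{l-1} & 1 \end{bmatrix}. \tag{S12}
$$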
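The covariance structure (S12) can be checked empirically by sampling many independent units. The sketch below is illustrative only: it assumes Gaussian reference weights of variance `sigma**2`, the perturbation form used in (S6), and a pair of previous-layer states whose overlap is set by flipping a fixed number of entries. The function names and sizes are hypothetical.

```python
import numpy as np


def theoretical_field_covariance(eta, q_prev, sigma=1.0):
    """Covariance matrix of Eq. (S12)."""
    off = np.sqrt(1.0 - eta**2) * q_prev
    return sigma**2 * np.array([[1.0, off], [off, 1.0]])


def empirical_field_covariance(eta, q_prev, n=100, n_units=20_000, sigma=1.0, seed=0):
    """Sample local fields of n_units independent units and estimate their covariance.

    Returns the 2x2 sample covariance of (h_ref, h_pert) and the realized
    overlap of the two previous-layer state vectors.
    """
    rng = np.random.default_rng(seed)
    s_ref = rng.choice([-1.0, 1.0], size=n)
    s_pert = s_ref.copy()
    n_flip = int(round(n * (1.0 - q_prev) / 2.0))
    s_pert[:n_flip] *= -1.0                    # overlap becomes 1 - 2*n_flip/n
    q_realized = float(np.mean(s_ref * s_pert))

    w_ref = sigma * rng.standard_normal((n_units, n))
    dw = sigma * rng.standard_normal((n_units, n))
    w_pert = np.sqrt(1.0 - eta**2) * w_ref + eta * dw

    h_ref = w_ref @ s_ref / np.sqrt(n)         # local fields, one per unit
    h_pert = w_pert @ s_pert / np.sqrt(n)
    return np.cov(np.vstack([h_ref, h_pert])), q_realized


if __name__ == "__main__":
    cov, q = empirical_field_covariance(eta=0.6, q_prev=0.5)
    print(cov)
    print(theoretical_field_covariance(0.6, q))
```

The sample covariance should match the theoretical matrix up to statistical fluctuations of order `1/sqrt(n_units)`.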

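The density of $H^l$ in Eq. (S10) has the form $\int \frac{d\hat{x}^l\, dx^l}{(2\pi)^2} \exp\left\{ -\frac{1}{2} (X^l)^T \Sigma_l X^l + i (X^l)^T H^l \right\}$ with $X^l \equiv [\hat{x}^l, x^l]^T$, which can be integrated over $X^l$, yielding the joint Gaussian density of $H^l$ with precision matrix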