# Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning

## Abstract

Statistical learning theory provides bounds on the generalization gap, using in particular the Vapnik-Chervonenkis dimension and the Rademacher complexity. An alternative approach, mainly studied in the statistical physics literature, is the study of generalization in simple synthetic-data models. Here we discuss the connections between these approaches and focus on the link between the Rademacher complexity in statistical learning and the theories of generalization for typical-case synthetic models from statistical physics, involving quantities known as Gardner capacity and ground state energy. We show that in these models the Rademacher complexity is closely related to the ground state energy computed by replica theories. Using this connection, one may reinterpret many results from the literature as rigorous Rademacher bounds in a variety of models in the high-dimensional statistics limit. Somewhat surprisingly, we also show that statistical learning theory provides predictions for the behavior of the ground-state energies in some full replica symmetry breaking models.

## 1 Introduction

Empirical risk minimization is the workhorse of most modern supervised machine learning successes. Consider for instance a data-set of $m$ examples $X=\{x^{(1)},\dots,x^{(m)}\}$, with $x^{(\mu)}\in\mathbb{R}^d$, assumed to be drawn from a distribution $P_x$, with labels $y=\{y^{(1)},\dots,y^{(m)}\}$ used for a binary classification task. We consider an estimator $f_w$ that belongs to a hypothesis class $\mathcal{F}$, for instance a neural network or a linear function, with respective weights or parameters w. The latter are typically computed by minimizing the empirical risk

$$R^m_{\rm empirical}(f_w)=\frac{1}{m}\sum_{\mu=1}^m L\left(y^{(\mu)},f_w(x^{(\mu)})\right)$$

over w, where $L$ denotes a loss function, e.g. the mean-squared loss $L(y,f_w(x))=(y-f_w(x))^2$. The main theoretical issue of statistical learning theory concerns the performance of the estimator obtained by such a minimization on yet unseen data, namely the generalization problem. In fact, what we really hope to minimize is the population risk, defined as

$$R_{\rm population}(f_w)=\mathbb{E}_{y,x}\left[L\left(y,f_w(x)\right)\right].$$

Since we are optimizing the empirical risk instead, the difference between the two might be arbitrarily large. Bounding this difference between empirical and population risks is therefore a major problem of statistical learning theories.

In a large part of the literature, statistical learning analysis (see e.g. [bartlett2002rademacher, vapnik2013nature, shalev2014understanding]) relies on the Vapnik-Chervonenkis (VC) analysis and on the so-called Rademacher complexity. The latter is a measure of the complexity of $\mathcal{F}$, the hypothesis class spanned by $f_w$, and is used to bound $R_{\rm population}(f_w)-R^m_{\rm empirical}(f_w)$, the generalization gap. A gem within the literature is the Uniform Convergence result, which states the following: if the Rademacher complexity or the VC dimension is finite, then for a large enough number of samples the generalization gap will vanish uniformly over all possible values of the parameters w. Informally, uniform convergence tells us that with high probability, for any value of the weights w, the generalization gap satisfies

$$R_{\rm population}(f_w)-R^m_{\rm empirical}(f_w)=O\left(\sqrt{\frac{d_{\rm VC}(\mathcal{F})}{m}}\right), \tag{1}$$

where $d_{\rm VC}(\mathcal{F})$ denotes the Vapnik-Chervonenkis dimension of the hypothesis class $\mathcal{F}$. Tighter bounds can be obtained using the Rademacher complexity. These bounds, although useful, do not seem to fully explain the success of current deep-learning architectures ([zhang2016understanding]).

Over the last four decades, a different vision of generalization — based on the analysis of typical case problems with synthetic data created from simple generative models — was developed to a large extent in the statistical physics literature (see e.g. [Seung1992, Watkin1993, opper1995statistical, engel2001statistical] for a review). The link with the VC dimension was discussed in many of these works, notably via its connection with its twin from statistical physics, the Gardner capacity ([gardner1988optimal]). In particular, one can show that the VC capacity is always larger than half of the Gardner one ([engel2001statistical]). We shall review this discussion later on in this paper. However, to the best of our knowledge the Rademacher complexity was absent from these considerations. This omission is unfortunate: not only does the Rademacher complexity give tighter bounds than the VC dimension, it also intrinsically connects with a quantity that physicists are familiar with and have been computing from the very beginning of their studies, namely the average ground-state energy.

The goal of the present paper is to bridge this gap and unveil the deep link between ground-state energy and Rademacher complexity, and how this connection is valuable to both parties. The paper is organized as follows: After giving proper definitions of common generalization bounds in sec. 2, we detail calculations of Rademacher complexities for simple function classes in sec. 3. These sections serve as an introduction to the readers not familiar with these notions. The subsequent sections 4 and 5 provide the original content of the paper.

##### Here we summarize the main contributions of this paper:
• We point out the one-to-one connections between the Rademacher complexity in statistical learning, and the ground-state energies and Gardner capacity from statistical physics.

• We show how the heuristic replica method from statistical physics can be used to compute the Rademacher complexity in the high-dimensional statistics limit and reinterpret classical results of the statistical physics literature as Rademacher bounds in the case of perceptron and committee machines models with i.i.d data.

• We contrast these results with the generalization in the teacher-student scenario, illustrating the worst-case nature of the Rademacher bound that fails to capture the typical-case behavior.

• We finally show en passant that learning theory also bears consequences for spin glass physics and the related replica symmetry breaking scheme, by showing that it implies strong constraints on the ground-state energy of some spin glass models.

## 2 A primer on Rademacher complexity

The bound of the generalization gap involving the VC dimension is specific to binary classification, and does not depend on the data distribution. While this is a strong property, the Rademacher approach does depend on data distribution and allows for tighter bounds. Moreover, it generalizes to multi-class classification and regression problems. We recall the definition of the Rademacher complexity:

###### Definition 2.1.

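Let $f_w$ be any function in the hypothesis class $\mathcal{F}$, and let $\epsilon\in\{\pm 1\}^m$ be drawn uniformly at random. The empirical Rademacher complexity is defined as

$$\hat{R}_m(\mathcal{F},X)\equiv\mathbb{E}_{\epsilon}\left[\sup_{f_w\in\mathcal{F}}\frac{1}{m}\sum_{\mu=1}^m\epsilon_\mu f_w(x^{(\mu)})\right], \tag{2}$$

and depends on the sample examples $X$. The Rademacher complexity is defined as the population average

$$R_m(\mathcal{F})\equiv\mathbb{E}_X\left[\hat{R}_m(\mathcal{F},X)\right]. \tag{3}$$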
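To make Definition 2.1 concrete, the following is a minimal Python sketch that estimates the empirical Rademacher complexity of eq. (2) by Monte Carlo for a small finite hypothesis class, so that the supremum reduces to a maximum. The helper names `empirical_rademacher` and `threshold`, as well as the toy class of one-dimensional threshold classifiers, are illustrative assumptions and are not part of the analysis of this paper.

```python
import random


def empirical_rademacher(hypotheses, samples, n_draws=2000, seed=0):
    """Monte Carlo estimate of eq. (2) for a finite class of +/-1 classifiers.

    hypotheses: list of callables x -> +1 or -1 (illustrative finite class)
    samples:    list of inputs x^(mu)
    """
    rng = random.Random(seed)
    m = len(samples)
    # Evaluate every hypothesis once on every sample.
    outputs = [[f(x) for x in samples] for f in hypotheses]
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the class of the correlation with the random signs.
        best = max(sum(e * o for e, o in zip(eps, out)) for out in outputs)
        total += best / m
    return total / n_draws


def threshold(t):
    """Hypothetical toy classifier on the real line: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1
```

For instance, `empirical_rademacher([threshold(t) for t in (0.0, 0.5, 1.0)], [0.1, 0.4, 0.7, 0.9])` returns a value between 0 and 1, and a class that realizes every labeling of the samples gives exactly 1.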

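In this paper, we shall focus on binary classification and consider the corresponding loss function $L(y,f_w(x))=\mathbb{1}[y\neq f_w(x)]$ that counts the number of misclassified samples. We will therefore be interested in a hypothesis class $\mathcal{F}$ of functions taking values in $\{\pm 1\}$. Defining the training and generalization errors for any function $f_w\in\mathcal{F}$ by

$$\epsilon^m_{\rm train}(f_w)\equiv\frac{1}{m}\sum_{\mu=1}^m\mathbb{1}\left[y^{(\mu)}\neq f_w(x^{(\mu)})\right]\quad\text{and}\quad\epsilon_{\rm gen}(f_w)\equiv\mathbb{E}_{y,x}\left[\mathbb{1}\left[y\neq f_w(x)\right]\right], \tag{4}$$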

the Rademacher complexity provides a generalization error bound as expressed by the following theorem, and many of its variants (see e.g. [bartlett2002rademacher, vapnik2013nature, shalev2014understanding, Mohri2018]):

###### Theorem 2.2.

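Uniform convergence bound - Binary classification
Fix a distribution $P_x$ and let $\delta>0$. Let $X=\{x^{(1)},\dots,x^{(m)}\}$ be drawn
i.i.d. from $P_x$. Then with probability at least $1-\delta$ (over the draw of $X$),

$$\forall f_w\in\mathcal{F},\quad\epsilon_{\rm gen}(f_w)-\epsilon^m_{\rm train}(f_w)\leq R_m(\mathcal{F})+\sqrt{\frac{\log(1/\delta)}{m}}. \tag{5}$$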

Thus, the Rademacher complexity is a uniform bound on the generalization gap. In the high-dimensional limit where both $m$ and $d$ go to infinity, which we will consider in the remainder of the paper, we shall see that we can discard the $\delta$-dependent term and that only the first term remains finite.

Note that this theorem can be used to recover the classical result (1). Indeed it can be shown ([massart2000some, ledoux2013probability, dudley1967sizes]) that the Rademacher complexity can be bounded by the VC dimension, so that for some constant $C$,

$$R_m(\mathcal{F})\leq C\sqrt{\frac{d_{\rm VC}(\mathcal{F})}{m}}. \tag{6}$$

We remind the reader that the VC dimension $d_{\rm VC}(\mathcal{F})$ is the size of the largest set that can be fully shattered by the hypothesis class $\mathcal{F}$. Informally, if $m>d_{\rm VC}(\mathcal{F})$ then for every set of $m$ data points, there exists an assignment of labels that cannot be fully fitted by the function class ([vapnik2013nature]).
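The notion of shattering can be checked by brute force on very small examples. The sketch below is only an illustration under our own assumptions: the function `is_shattered` and the toy class `toy_class` of oriented thresholds on the real line are hypothetical names, chosen because their behavior is easy to verify by hand (two points can be shattered, three cannot).

```python
def is_shattered(points, hypotheses):
    """True if the class realizes all 2^n labelings of the n given points."""
    realized = {tuple(f(x) for x in points) for f in hypotheses}
    return len(realized) == 2 ** len(points)


def oriented_threshold(t, s):
    """Hypothetical toy classifier on the real line: s if x >= t, else -s."""
    return lambda x: s if x >= t else -s


# A small grid of thresholds with both orientations (illustrative choice).
toy_class = [oriented_threshold(t, s)
             for t in (0.0, 0.25, 0.5, 0.75, 1.0)
             for s in (1, -1)]
```

With this toy class, `is_shattered([0.2, 0.6], toy_class)` holds while `is_shattered([0.2, 0.6, 0.9], toy_class)` does not, since a single threshold cannot produce the alternating labeling of three ordered points.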

## 3 Synthetic models in the high-dimensional statistics limit

In this section, we consider data generated by a simple generative model. We suppose that each vector of input data points has been generated i.i.d. from a factorized, e.g. Gaussian, distribution, that is $P_x(x^{(\mu)})=\prod_{i=1}^d P_x(x_i^{(\mu)})$ for every $\mu$. In the following, we will focus on this simple data distribution, but sec. 5.5 presents a generalization to rotationally invariant data matrices with arbitrary spectrum. The main interest of such settings is to use the analysis of typical case problems with synthetic data created from simple generative models as a means of getting additional insight on real world applications where data are not worst case ([Seung1992, Watkin1993, opper1995statistical, engel2001statistical, Zdeborova2016]). In particular, we shall be interested in the high-dimensional statistics limit when $m,d\to\infty$, with $\alpha=m/d=\Theta(1)$. In this paper, the aim is to compute the Rademacher complexity for such problems exactly (rather than merely bounding it) in this asymptotic limit.

### 3.1 Linear model

As the simplest example, we first tackle the computation of the Rademacher complexity for a simple function class containing all linear models with weights $w\in\mathbb{R}^d$,

$$\mathcal{F}_{\rm linear}=\left\{f_w:\begin{cases}\mathbb{R}^d\to\mathbb{R}\\ x\mapsto\frac{1}{\sqrt{d}}w^\intercal x\end{cases},\ w\in\mathbb{R}^d\ /\ \|w\|_2=\Gamma\sqrt{d}\right\}. \tag{7}$$

From eq. (3), computing the Rademacher complexity amounts to finding the vector $w$ that maximizes the scalar product between $w$ and $Xy\equiv\sum_{\mu=1}^m y^{(\mu)}x^{(\mu)}$, where y replaces the variable $\epsilon$. It is thus sufficient to take $w\propto Xy$, and the Rademacher complexity (3) thus reads

$$R_m(\mathcal{F}_{\rm linear})=\mathbb{E}_{y,X}\left[\frac{1}{m}\frac{1}{\sqrt{d}}\|Xy\|_2\|w\|_2\right]. \tag{8}$$

Since each component of $Xy$ is a sum of $m$ i.i.d. terms, taken here with zero mean and unit variance, we can apply the central limit theorem, which enforces $\|Xy\|_2^2\simeq md$, hence $\|Xy\|_2\simeq\sqrt{md}$. Assuming that the weights are restricted to lie on the sphere of radius $\Gamma\sqrt{d}$, we set $\|w\|_2=\Gamma\sqrt{d}$ and finally obtain

$$R_m(\mathcal{F}_{\rm linear})=\frac{\Gamma}{\sqrt{\alpha}}, \tag{9}$$

where we recall that $\alpha=m/d$. The above result for the simple linear hypothesis class helps to grasp the meaning of the Rademacher complexity: at fixed input dimension $d$, it decreases with the number of samples as $1/\sqrt{m}$, closing the generalization gap in the limit of infinitely many samples. Illustrating the bias-variance trade-off, we also see that increasing the radius $\Gamma$ of the weights expands the function complexity (and might help for fitting the data-set), but unfortunately leads to a looser generalization bound.

Note also that the fact that the Rademacher complexity equals $\Gamma/\sqrt{\alpha}$ shows that it remains finite in the high-dimensional statistics limit. In this case, we see indeed that we can disregard the $\delta$-dependent term in eq. (5), which goes to zero as $1/\sqrt{m}$.
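The asymptotic value $\Gamma/\sqrt{\alpha}$, with $\alpha=m/d$, can be checked numerically at moderate sizes. The sketch below is a simple simulation under stated assumptions: standard Gaussian inputs, uniform random signs for the labels, and the closed-form supremum over the sphere $\|w\|_2=\Gamma\sqrt{d}$. The function name `linear_rademacher` and its parameters are ours.

```python
import math
import random


def linear_rademacher(m, d, gamma=1.0, n_trials=10, seed=0):
    """Average over random draws of sup_w (1/m) sum_mu y_mu w.x_mu / sqrt(d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # v = sum_mu y^(mu) x^(mu), accumulated component by component.
        v = [0.0] * d
        for _ in range(m):
            y = rng.choice((-1, 1))
            for i in range(d):
                v[i] += y * rng.gauss(0.0, 1.0)
        norm_v = math.sqrt(sum(c * c for c in v))
        # The supremum over ||w||_2 = gamma*sqrt(d) is reached for w parallel
        # to v and equals gamma*sqrt(d)*||v|| / (m*sqrt(d)) = gamma*||v|| / m.
        total += gamma * norm_v / m
    return total / n_trials
```

Calling `linear_rademacher(200, 100)` corresponds to $\alpha=2$ and can be compared with $1/\sqrt{2}$; finite-size deviations are expected.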

### 3.2 Perceptron model

The scaling of the Rademacher complexity as $1/\sqrt{\alpha}$ in the high-dimensional statistics limit is actually not restricted to the linear model but appears to be a universal property, at least at large enough $\alpha$. To see this we now focus on a different hypothesis class: the perceptron, denoted $\mathcal{F}_{\rm sign}$. This class contains linear classifiers which output binary variables, and will fit the labels much better in the binary classification task. The class reads

$$\mathcal{F}_{\rm sign}=\left\{f_w:\begin{cases}\mathbb{R}^d\to\{\pm 1\}\\ x\mapsto{\rm sign}\left(\frac{1}{\sqrt{d}}w^\intercal x\right)\end{cases},\ w\in\mathbb{R}^d\right\}. \tag{10}$$

Let us consider a sample i.i.d matrix with .

###### Theorem 3.1.

For the perceptron model class eq. (10) with random i.i.d. input data in the high-dimensional limit, $R_m(\mathcal{F}_{\rm sign})=\Theta\left(1/\sqrt{\alpha}\right)$.

The proof is given in Appendix A. In a nutshell, it uses the fact that the Rademacher complexity is upper-bounded through the VC dimension as in eq. (6), and lower-bounded by one particular example of its function class, obtained when the weights are chosen according to Hebb’s rule ([hebb1962organization]), which also gives a behavior scaling as $1/\sqrt{\alpha}$.
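The lower-bound half of this argument can be illustrated numerically: any particular choice of weights gives a lower bound on the supremum in eq. (2), and Hebb's rule is one such choice. The sketch below assumes standard Gaussian inputs and uniformly random labels, and takes the Hebbian weights to be $w=\sum_\mu y^{(\mu)}x^{(\mu)}$. The function name `hebb_correlation` is ours, and this is a finite-size simulation, not the proof of Appendix A.

```python
import math
import random


def hebb_correlation(m, d, seed=0):
    """Correlation (1/m) sum_mu y_mu f_w(x_mu) reached by Hebbian weights."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    ys = [rng.choice((-1, 1)) for _ in range(m)]
    # Hebb's rule: each weight accumulates label times input.
    w = [sum(ys[mu] * xs[mu][i] for mu in range(m)) for i in range(d)]
    corr = 0
    for x, y in zip(xs, ys):
        field = sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d)
        corr += y * (1 if field >= 0 else -1)
    # This value lower-bounds the supremum over the perceptron class.
    return corr / m
```

For example, `hebb_correlation(200, 100)` gives a strictly positive correlation with purely random labels at $\alpha=2$, which is the mechanism behind the lower bound.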

Heuristically, this result generalizes as well to a two-layer neural network with $K$ hidden neurons. Indeed, the two-layer function class contains, as a particular case, the single-layer one, so the lower bound goes through. The upper bound is however harder to control rigorously. Since neural networks have a finite VC dimension, the Rademacher complexity is again upper-bounded as in eq. (6); however, we do not know of any theorem that would ensure that the VC dimension is bounded by a constant times $d$ ([bartlett2003vapnik]). Nevertheless, anticipating the statistical physics approach, we indeed expect from the concentration (self-averaging) properties of the ground-state energy ([talagrand2003spin]) in the high-dimensional limit that it will yield a Rademacher complexity that is a function of $\alpha$ only at fixed $K$. From this argument, we expect the $1/\sqrt{\alpha}$ dependence of the Rademacher complexity to be very generic in the high-dimensional limit.

## 4 The statistical physics approach

### 4.1 Average case problems: Statistical physics of learning

As anticipated in the previous section, the approach inspired by statistical physics to understand neural networks considers a set of data points coming from known distributions. Again, for the purpose of this presentation we focus on a simple example, where the inputs $x^{(\mu)}$ are i.i.d. Gaussian vectors, with $\alpha=m/d=\Theta(1)$. Sec. 5.5 is devoted to a generalization to random input data corresponding to random matrices with arbitrary singular value density.

Consider a function class, for instance the perceptron one, $\mathcal{F}_{\rm sign}$; a typical question in the literature was to compute how many misclassified examples can be obtained for a given rule used to generate the labels ([engel2001statistical]). Given $m$ samples $\{y,X\}$, in order to count the number of wrongly classified training samples, we define the Hamiltonian, or "energy" function [Mezard1986]:

$$\mathcal{H}(\{y,X\},w)\equiv\sum_{\mu=1}^m\mathbb{1}\left[y^{(\mu)}\neq f_w(x^{(\mu)})\right]=\frac{1}{2}\left(m-\sum_{\mu=1}^m y^{(\mu)}f_w(x^{(\mu)})\right). \tag{11}$$
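The two forms of the Hamiltonian in eq. (11) agree because, for $\pm 1$ labels and outputs, $y f_w(x)=+1$ on a correct sample and $-1$ on an error. A minimal sketch of this identity follows; the function names are ours.

```python
def hamiltonian(labels, predictions):
    """Number of misclassified samples, first form of eq. (11)."""
    return sum(1 for y, f in zip(labels, predictions) if y != f)


def hamiltonian_from_overlap(labels, predictions):
    """Second form of eq. (11): (m - sum_mu y_mu f_mu) / 2 for +/-1 values."""
    m = len(labels)
    return (m - sum(y * f for y, f in zip(labels, predictions))) / 2
```

With labels `[1, -1, 1, 1]` and predictions `[1, 1, -1, 1]`, both functions count two errors.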

A classical problem in statistical physics is to compute the random capacity, also called the Gardner capacity ([Gardner1989]): given $m$ examples and labels chosen at random in $\{\pm 1\}$, it consists in finding how many samples can be correctly classified.

It turns out there exists a deep connection between the Gardner capacity and the VC dimension, as their common aim is to measure the maximum number of points such that there exists a function in the hypothesis class being able to fit the data set. In particular, using Sauer’s lemma ([SAUER1972145]) in the large size limit $d\to\infty$, with the rescaled quantities $\alpha_c\equiv m_c/d$ and $\alpha_{\rm VC}\equiv d_{\rm VC}/d$ kept finite, where $m_c$ is the largest number of randomly labeled samples that can typically be fitted, it is possible to show that the Gardner capacity provides a lower bound on the VC dimension ([engel2001statistical]):

$$\alpha_c\leq 2\alpha_{\rm VC}. \tag{12}$$

To illustrate this inequality, let us consider again the perceptron classifier hypothesis class $\mathcal{F}_{\rm sign}$, for which the above inequality is saturated. In fact, the VC dimension is in this case (linear classification with binary outputs) simply $d_{\rm VC}=d$. Hence on one hand $\alpha_{\rm VC}=1$, while on the other hand the Gardner capacity amounts to $\alpha_c=2$ ([cover1965geometrical, Gardner1989]).
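The value $\alpha_c=2$ for the perceptron can be illustrated with Cover's function-counting formula: for $m$ points in general position in $\mathbb{R}^d$, the number of labelings realizable by a homogeneous linear separator is $2\sum_{k=0}^{d-1}\binom{m-1}{k}$. The sketch below evaluates the corresponding fraction of realizable labelings; the function name `cover_fraction` is ours, and the general-position assumption is part of the formula.

```python
from math import comb


def cover_fraction(m, d):
    """Fraction of the 2^m labelings of m points in general position in R^d
    that a homogeneous linear separator can realize (Cover's counting)."""
    count = 2 * sum(comb(m - 1, k) for k in range(d))
    return count / 2 ** m
```

The fraction equals 1 for $m\leq d$, is exactly $1/2$ at $m=2d$, and drops quickly beyond, which is the finite-size picture behind $\alpha_{\rm VC}=1$ and $\alpha_c=2$.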

It is fair to say that a large part of the statistical physics literature focused mainly on the Gardner capacity, in particular in a series of works in the 90’s ([Gardner1989, Krauth1989]) that led to more recent rigorous works ([talagrand2003spin, talagrand2006parisi, Sun2018, Aubin2019]).

### 4.2 The Rademacher complexity and the ground-state energy

As we shall see now, computing the Rademacher complexity for random input data can be directly reduced to a more natural object in the physics literature: the ground-state energy. Defining the Gibbs measure at inverse temperature $\beta$, that weighs configurations with their respective cost, as

$$\langle\dots\rangle_\beta\equiv\frac{\int dw\,\dots\,e^{-\beta\mathcal{H}(\{y,X\},w)}}{\int dw\,e^{-\beta\mathcal{H}(\{y,X\},w)}}, \tag{13}$$

we observe that averaging the Hamiltonian in eq. (11) over $y,X$ and the Gibbs measure, for any function $f_w\in\mathcal{F}$, provides

$$e_\beta\equiv\mathbb{E}_{y,X}\left\langle\frac{\mathcal{H}(\{y,X\},w)}{d}\right\rangle_\beta=\frac{\alpha}{2}\left(1-\mathbb{E}_{y,X}\left\langle\frac{1}{m}\sum_{\mu=1}^m y^{(\mu)}f_w(x^{(\mu)})\right\rangle_\beta\right), \tag{14}$$

where $\alpha=m/d$. Taking the zero temperature limit, i.e. $\beta\to\infty$, in the above equation, the Gibbs measure concentrates on the minimizers of $\mathcal{H}$ and we finally obtain the ground-state energy $e_{\rm gs}\equiv\lim_{\beta\to\infty}e_\beta$, a quantity commonly used in physics. Interestingly, we recognize the definition of the Rademacher complexity

$$R_m(\mathcal{F})=1-\frac{2e_{\rm gs}}{\alpha}, \tag{15}$$

where the random labels y play the role of the Rademacher variable $\epsilon$ in (2). The above equation shows a simple correspondence between the ground-state energy of the perceptron model and the Rademacher complexity of the corresponding hypothesis class, and shall bring insights to both the machine learning and statistical physics communities. Consequently, as we shall see, this connection means that the Rademacher complexity can be computed (rather than bounded) for many models using the replica method from statistical physics. As far as we are aware, this basic connection between the ground state energy and Rademacher complexity was not previously stated in the literature.
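For a single sample $\{y,X\}$, eq. (11) implies $\sup_w\frac{1}{m}\sum_\mu y^{(\mu)}f_w(x^{(\mu)})=1-\frac{2}{m}\min_w\mathcal{H}$, which is the sample-wise version of the correspondence between ground-state energy and Rademacher complexity. The sketch below checks this by exhaustive enumeration on a tiny perceptron. Restricting the weights to $\{\pm 1\}^d$ is our own simplifying assumption, made only so that the supremum can be enumerated, and the function names are hypothetical.

```python
import itertools
import random


def sign(v):
    return 1 if v >= 0 else -1


def ground_state_and_rademacher(xs, ys):
    """Enumerate w in {-1,+1}^d; return (min_w H, max_w (1/m) sum_mu y_mu f_w(x_mu))."""
    m, d = len(xs), len(xs[0])
    best_energy, best_corr = m, -1.0
    for w in itertools.product((-1, 1), repeat=d):
        # The 1/sqrt(d) factor does not change the sign, so it is omitted.
        preds = [sign(sum(wi * xi for wi, xi in zip(w, x))) for x in xs]
        energy = sum(1 for y, f in zip(ys, preds) if y != f)
        corr = sum(y * f for y, f in zip(ys, preds)) / m
        best_energy = min(best_energy, energy)
        best_corr = max(best_corr, corr)
    return best_energy, best_corr
```

Averaging the second output over many random draws of `xs` and `ys` would give a finite-size estimate of the Rademacher complexity of this restricted class, while the first output divided by $d$ estimates the ground-state energy.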

### 4.3 An intuitive understanding on the Rademacher bounds on generalization

At this point, the Rademacher complexity becomes a more familiar object to the physics-minded reader. However, could we understand more intuitively why the Rademacher complexity, or equivalently the ground-state energy, is involved in the generalization gap bound? Let us present an intuitive hand-waving explanation. Consider the fraction of mistakes made by a classifier on unknown samples, namely the generalization error $\epsilon_{\rm gen}(f_w)$, and on the training set the training error $\epsilon^m_{\rm train}(f_w)$. The worst case scenario that could occur is trying to fit while there exists no underlying rule, meaning that labels are purely random and uncorrelated with the inputs. The estimator will purely overfit and its generalization error will remain equal to $1/2$ in any case. This leads to the following heuristic generalization bound:

$$\epsilon_{\rm gen}(f_w)-\epsilon^m_{\rm train}(f_w)\leq\epsilon^{\rm random\ labels}_{\rm gen}(f_w)-\epsilon^{{\rm random\ labels},m}_{\rm train}(f_w)=\frac{1}{2}-\epsilon^{{\rm random\ labels},m}_{\rm train}(f_w)=\frac{1}{2}\left(1-2\epsilon^{{\rm random\ labels},m}_{\rm train}(f_w)\right)=\frac{1}{2}\hat{R}_m(\mathcal{F}). \tag{16}$$

Note that this heuristic reasoning does not give the exact Rademacher generalization bound. In fact, the actual stronger and uniform (over all possible $f_w$) bound does not have a factor $1/2$, and surely cannot be fully captured by the simple above argument. Nevertheless, this argument reflects the crux of the Rademacher bound: it provides a very pessimistic bound by assuming the worst possible scenario, i.e. fitting data and trying to make predictions while the labels are random. Of course, in real data problems the rule is not random and the labels are correlated with the inputs; it is then no surprise that the Rademacher bound is not tight ([zhang2016understanding]).

## 5 Consequences and bounds for simple models

In this section, we illustrate our previous arguments and the connection between the spin glass approach and the Rademacher complexity, still for the case of a Gaussian i.i.d. input data matrix in the high-dimensional limit when $m,d\to\infty$ with $\alpha=m/d$ fixed.

### 5.1 Ground state energies of the perceptron

For a number of samples smaller than the Gardner capacity, $\alpha<\alpha_c$, it is by definition possible to fit all random labels y. Accordingly, the number of misclassified examples is zero and the ground state energy vanishes, $e_{\rm gs}=0$. This means that the Rademacher complexity is asymptotically equal to $1$ for $\alpha<\alpha_c$. However above the Gardner capacity, $\alpha>\alpha_c$, the estimator cannot perfectly fit the random labels and will misclassify some of them, or equivalently $e_{\rm gs}>0$. From the arguments given in sec. 3, we thus expect

$$R_m(\mathcal{F})=1\ \text{ for }\alpha<\alpha_c,\qquad R_m(\mathcal{F})\approx\Theta\left(\sqrt{\frac{\alpha_c}{\alpha}}\right)\ \text{ for }\alpha\gg\alpha_c. \tag{17}$$

This relation is already non-trivial, as it yields a link between the Gardner capacity and the Rademacher complexity. Using the replica method from spin glass analysis, and the mapping with ground state energies (15), we shall now see how one can go beyond these simple arguments, and compute the actual precise asymptotic value of the Rademacher complexity.

### 5.2 Computing the ground-state energy with the replica method

Since the statistical physics literature focused mainly on the Gardner capacity, the connection between the ground-state energy and the Rademacher complexity suggests that it would be worth looking at these old results in a new light. In fact, the replica method allows for an exact computation of the Rademacher complexity for random input data in the large size limit. In the following, we handle computations by focusing on a simple generalization of the linear functions hypothesis class. Fixing any activation function $\varphi:\mathbb{R}\to\{-1,1\}$, we define the following hypothesis class

$$\mathcal{F}_\varphi\equiv\left\{f_w:\begin{cases}\mathbb{R}^d\to\{-1,1\}\\ x\mapsto\varphi\left(\frac{1}{\sqrt{d}}w^\intercal x\right)\end{cases},\ w\in\mathbb{R}^d\right\}. \tag{18}$$

Starting with the posterior distribution

$$P(w|y,X)=\frac{P(y|w,X)P(w)}{P(y,X)}=\frac{e^{-\beta\mathcal{H}(\{y,X\},w)}P_w(w)}{\mathcal{Z}(\{y,X\},\alpha,\beta)}, \tag{19}$$

we introduce the partition function associated with the Hamiltonian eq. (11) at inverse temperature $\beta$

$$\mathcal{Z}(\{y,X\},\alpha,\beta)=\int_{\mathbb{R}^d}dw\,e^{-\beta\mathcal{H}(\{y,X\},w)}P_w(w). \tag{20}$$

In the large size limit $d\to\infty$, the posterior distribution becomes highly peaked in particular regions of parameters. In physics we are interested in these dominant regions and focus on the free energy at inverse temperature $\beta$ defined as

$$\Phi_{y,X}(\{y,X\},\alpha,\beta)\equiv-\lim_{d\to\infty}\frac{1}{d\beta}\log\mathcal{Z}(\{y,X\},\alpha,\beta). \tag{21}$$

However, as we are interested in computing quantities in the typical case, we want to average over all potential training sets and compute instead the averaged free energy

$$\Phi(\alpha,\beta)\equiv\mathbb{E}_{y,X}\left[\Phi_{y,X}(\{y,X\},\alpha,\beta)\right]. \tag{22}$$

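Computing this average directly and rigorously is difficult, hence we will carry out the computation using the so-called replica method, starting by writing the replica trick

$$\mathbb{E}_{y,X}\left[\log\mathcal{Z}(\{y,X\},\alpha,\beta)\right]=\lim_{r\to 0}\frac{\partial\log\mathbb{E}_{y,X}\left[\mathcal{Z}(\{y,X\},\alpha,\beta)^r\right]}{\partial r}, \tag{23}$$

which replaces the expectation of $\log\mathcal{Z}$ by the moments of $\mathcal{Z}$, which are easier to compute. Taking the limit $d\to\infty$, and assuming that we can exchange it with the limit $r\to 0$, we finally obtain

$$\Phi(\alpha,\beta)=\lim_{r\to 0}\left[\lim_{d\to\infty}-\frac{1}{d\beta}\frac{\partial\log\mathbb{E}_{y,X}\left[\mathcal{Z}(\{y,X\},\alpha,\beta)^r\right]}{\partial r}\right]. \tag{24}$$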
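The identity underlying the replica trick, namely that the expectation of $\log\mathcal{Z}$ equals the derivative at $r\to 0$ of $\log\mathbb{E}[\mathcal{Z}^r]$, can be checked numerically on a toy scalar random variable. The sketch below is only such a sanity check under our own assumptions (samples of $\log\mathcal{Z}$ drawn from a Gaussian, small finite $r$); it says nothing about the exchange of the $d\to\infty$ and $r\to 0$ limits, and the function name is ours.

```python
import math
import random


def replica_estimate(samples_log_z, r):
    """Finite-r approximation of d/dr log E[Z^r] at r -> 0.

    Since log E[Z^0] = 0, the derivative at the origin is approximated
    by log E[Z^r] / r for a small r.
    """
    mean_zr = sum(math.exp(r * lz) for lz in samples_log_z) / len(samples_log_z)
    return math.log(mean_zr) / r
```

With samples of $\log\mathcal{Z}$ generated by `random.gauss`, `replica_estimate(samples, 1e-3)` is close to the plain average of the samples, the difference being of order $r$ times their variance.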

We give some details on the replica computation in Appendix B.1, and we also refer the reader to the relevant literature in physics ([Mezard1986, Hertz1993, engel2001statistical, mezard2009information, Zdeborova2016]) and in mathematics ([talagrand2003spin, talagrand2006parisi, bolthausen2007spin, panchenko2004bounds, panchenko2018free]). The computation of the free energy by the replica method is done by deriving a hierarchy of approximate ansätze, named replica symmetric (RS), one-step replica-symmetry breaking (1RSB), two-step replica-symmetry breaking (2RSB), and so on. While in some problems the RS or the 1RSB ansatz is sufficient, in others only the infinite-step solution (full-RSB) gives the exact answer ([Mezard1989, talagrand2003spin, talagrand2006parisi]), although the 1RSB approach is usually an accurate approximation.

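Computing the ground state energy consists in taking the zero temperature limit $\beta\to\infty$ above the capacity, $\alpha>\alpha_c$, in the replica free energy $\Phi=e-s/\beta$, where $e$ and $s$ denote respectively the energy and entropy contributions. The simplest form of the replica computation is known as Replica Symmetry (RS) and the next simplest is one-step Replica Symmetry Breaking (1RSB); plugged into eq. (24), they lead to the expressions [Majer1993a, Erichsen1992, Whyte1996]

$$\begin{aligned}
\Phi^{\rm (rs)}_{\rm iid}(\alpha,\beta)&=-\frac{1}{\beta}\underset{q_0,\hat{q}_0}{\rm extr}\left\{\frac{1}{2}(q_0\hat{q}_0-1)+\Psi^{\rm (rs)}_w(\hat{q}_0)+\alpha\Psi^{\rm (rs)}_{\rm out}(q_0,\beta)\right\},\\
\Phi^{\rm (1rsb)}_{\rm iid}(\alpha,\beta)&=-\frac{1}{\beta}\underset{q_0,q_1,\hat{q}_0,\hat{q}_1,x}{\rm extr}\left\{\frac{1}{2}(q_1\hat{q}_1-1)+\frac{x}{2}(q_0\hat{q}_0-q_1\hat{q}_1)+\Psi^{\rm (1rsb)}_w(\hat{q}_0,\hat{q}_1)+\alpha\Psi^{\rm (1rsb)}_{\rm out}(q_0,q_1,\beta)\right\},
\end{aligned} \tag{25}$$

with auxiliary functions

$$\begin{aligned}
\Psi^{\rm (rs)}_w(\hat{q}_0)&\equiv\mathbb{E}_{\xi_0}\log\mathbb{E}_w\left[\exp\left(\frac{(1-\hat{q}_0)}{2}w^2+\xi_0\sqrt{\hat{q}_0}\,w\right)\right],\\
\Psi^{\rm (rs)}_{\rm out}(q_0,\beta)&\equiv\mathbb{E}_y\mathbb{E}_{\xi_0}\log\mathbb{E}_z\left[\mathcal{I}\left(y\,\middle|\,\sqrt{Q-q_0}\,z+\sqrt{q_0}\,\xi_0,\beta\right)\right],\\
\Psi^{\rm (1rsb)}_w(\hat{q}_0,\hat{q}_1)&\equiv\frac{1}{x}\mathbb{E}_{\xi_0}\log\left(\mathbb{E}_{\xi_1}\mathbb{E}_w\left[\exp\left(\frac{(1-\hat{q}_1)}{2}w^2+\left(\sqrt{\hat{q}_0}\,\xi_0+\sqrt{\hat{q}_1-\hat{q}_0}\,\xi_1\right)w\right)\right]^x\right),\\
\Psi^{\rm (1rsb)}_{\rm out}(q_0,q_1,\beta)&\equiv\frac{1}{x}\mathbb{E}_y\mathbb{E}_{\xi_0}\log\left(\mathbb{E}_{\xi_1}\mathbb{E}_z\left[\mathcal{I}\left(y\,\middle|\,\sqrt{q_0}\,\xi_0+\sqrt{q_1-q_0}\,\xi_1+\sqrt{1-q_1}\,z,\beta\right)\right]^x\right).
\end{aligned} \tag{26}$$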

We introduced a temperature-dependent constraint function $\mathcal{I}(y|z,\beta)=e^{-\beta V(y|z)}$, where the generic cost function reads in our case $V(y|z)=\mathbb{1}[y\neq\varphi(z)]$. The above expressions are valid for any generic weight distribution $P_w$ and non-linearity $\varphi$. The detailed computation can be found in Appendix B.1, in particular eq. (53) and eq. (65). Then the general method to find the ground state energy is to take the zero temperature limit

$$e_{\rm gs,iid}(\alpha)\equiv\lim_{\beta\to\infty}\Phi_{\rm iid}(\alpha,\beta), \tag{27}$$

while handling carefully the scaling of the optimized order parameters in this limit.

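##### Spherical perceptron

The most commonly studied model ([gardner1988optimal, Gardner1988a, Gardner1989]) with continuous weights is the spherical model with $w\in\mathbb{R}^d$ such that $\|w\|_2^2=d$. The spherical constraint gives a well-defined model which excludes diverging or vanishing weights. In this case, the Gardner capacity is rigorously known to be equal to $\alpha_c=2$ ([cover1965geometrical]).

We computed both the RS and 1RSB free energies ([Majer1993a, Erichsen1992, Whyte1996], see also Appendix B.1.4.). Taking the zero temperature limit $\beta\to\infty$, together with the corresponding limits of the order parameters in the 1RSB case, while keeping the rescaled parameters $\chi$ and $\Omega$ finite, leads to the following expressions of the ground-state energies:

$$e^{\rm (rs)}_{\rm gs,iid}=\underset{\chi}{\rm extr}\left\{-\frac{1}{2\chi}+\alpha\,\mathbb{E}_{y,\xi_0}\min_z\left[V(y|z)+\frac{(z-\xi_0)^2}{2\chi}\right]\right\}, \tag{28}$$

$$e^{\rm (1rsb)}_{\rm gs,iid}=\underset{\chi,\Omega,q_0}{\rm extr}\left\{\frac{1}{2\Omega\chi}\log\left(1+\Omega(1-q_0)\right)+\frac{q_0}{2\chi\left(1+\Omega(1-q_0)\right)}+\frac{\alpha}{\chi\Omega}\mathbb{E}_{\xi_0}\log\mathbb{E}_{\xi_1}e^{-\Omega\chi\min_z\left[V(y|z)+\frac{1}{2\chi}\left(z-\sqrt{q_0}\,\xi_0-\sqrt{1-q_0}\,\xi_1\right)^2\right]}\right\}, \tag{29}$$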

where the cost function is $V(y|z)=\mathbb{1}[y\neq\varphi(z)]$. The details of the derivation via the replica method are given in Appendix B.1.4. The results for Rademacher labels $y\in\{\pm 1\}$ and with $\varphi={\rm sign}$ are depicted in Fig. 1.
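As an illustration of how eq. (28) can be evaluated, the sketch below works out the RS expression for the sign activation under our own reduction, which should be checked against Appendix B.1.4. We assume $V(y|z)=\mathbb{1}[y\neq{\rm sign}(z)]$, so that the inner minimum is $0$ when $y\xi_0>0$ and $\min(1,\xi_0^2/2\chi)$ otherwise. Stationarity in $\chi$ then gives $\alpha\int_0^{\sqrt{2\chi}}t^2\,Dt=1$ with $Dt$ the standard Gaussian measure, and the energy at that point is $\alpha$ times the Gaussian tail probability beyond $\sqrt{2\chi}$. The function names are ours.

```python
import math


def gaussian_second_moment(x):
    """I(x) = int_0^x t^2 Dt, with Dt the standard Gaussian measure."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return 0.5 * math.erf(x / math.sqrt(2.0)) - x * pdf


def rs_ground_state(alpha):
    """Return (e_gs, 1 - 2 e_gs / alpha) at the RS stationary point of eq. (28),
    assuming V(y|z) = 1[y != sign(z)] (our reduction, not the paper's code)."""
    if alpha <= 2.0:
        # Below the capacity alpha_c = 2 no stationary point with finite chi
        # exists in this reduction: the energy vanishes.
        return 0.0, 1.0
    lo, hi = 0.0, 50.0
    for _ in range(200):
        # Bisection on x = sqrt(2 chi); I(x) increases from 0 to 1/2.
        mid = 0.5 * (lo + hi)
        if alpha * gaussian_second_moment(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    e_gs = 0.5 * alpha * math.erfc(x / math.sqrt(2.0))
    return e_gs, 1.0 - 2.0 * e_gs / alpha
```

In this sketch the RS estimate of the Rademacher complexity multiplied by $\sqrt{\alpha}$ keeps growing with $\alpha$, which is consistent with the statement that the RS solution does not reproduce the $1/\sqrt{\alpha}$ scaling.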

Interestingly, the bounds on the Rademacher complexity also imply consequences for the ground state energy. Indeed the Rademacher complexity scales as $1/\sqrt{\alpha}$ for large values of $\alpha$, namely there exists a constant $C$ such that $R_m(\mathcal{F})\simeq C/\sqrt{\alpha}$. Since eq. (11) implies $e_{\rm gs}=\frac{\alpha}{2}\left(1-R_m(\mathcal{F})\right)$, the ground state energy therefore behaves for large $\alpha$ as

$$e_{\rm gs}(\alpha)\underset{\alpha\to\infty}{\simeq}\frac{\alpha}{2}\left(1-\frac{C}{\sqrt{\alpha}}\right). \tag{30}$$

We first notice that the replica symmetric (RS) solution fails to deliver the correct scaling as sketched in Fig. 1, so the scaling in eq. (30) must not be entirely trivial. On the other hand, the 1RSB solution we used (which is expected to be numerically very close to the harder to evaluate full-RSB one) seems to yield the correct scaling (see Fig. 1). It is rather striking that the statistical learning connection allows one to predict, through eq. (30), the scaling of the energy in the large $\alpha$ regime, which is only satisfied with a replica symmetry breaking ansatz. This yields an open question for replica theory: in practice, can one compute exactly the value of the constant $C$? Given that the full-RSB solution is notoriously hard to evaluate, this might be an issue worth investigating in mathematical physics.

##### Binary perceptron

Another common choice for the weights distribution is the binary prior $w\in\{\pm 1\}^d$, studied e.g. in [Krauth1989]. In this case, the Gardner capacity is predicted to be $\alpha_c\simeq 0.83$, a prediction which, remarkably, is still not entirely rigorously proven, but see [Sun2018, Aubin2019].

To compute the ground state energy, we use eq. (25). In the binary perceptron, the landscape of the model is said to be frozen 1RSB (f1RSB), i.e. clustered in point-like dominant solutions, and the RS and 1RSB free energies are the same (even though their entropies are different). In this case computing the ground state can be tackled by finding the effective inverse temperature $\beta^\star$ at which the entropy vanishes, $s(\beta^\star)=0$, which can be plugged back to find the ground state energy $e_{\rm gs}=e(\beta^\star)$. Again, we note that even though the 1RSB ansatz is unstable and should be replaced by a more complex (and ultimately full-RSB) solution, it already gives the correct $1/\sqrt{\alpha}$ scaling of the Rademacher complexity, and satisfies the scaling eq. (30) for large $\alpha$, as in the case of the spherical model, see Fig. 2.

### 5.3 Teacher-student scenario versus worst case Rademacher

The Rademacher bounds are really interesting as they depend only on the data distribution, and are valid for any rule used to generate the labels, no matter how complicated. In this sense, it is a worst-case scenario on the rule that prescribes labels to data. A different approach, again pioneered in statistical physics [Gardner1989], is to focus on the behavior for a given rule, called the teacher rule. Given that the Rademacher bounds tackle the worst case with respect to that rule, it is interesting to consider the generalization error one actually gets for the best case, i.e. fitting the labels according to the same teacher rule. This is the so-called teacher-student approach. In the wake of the need to understand the effectiveness of neural networks, and the limitations of the classical approaches, it is of interest to revisit the results that have emerged thanks to the physics perspective.

We shall thus assume that the actual labels are given by the rule

$$y={\rm sign}\left(\frac{1}{\sqrt{d}}w^{\star\intercal}x\right), \tag{31}$$

with $w^\star\in\mathbb{R}^d$ the teacher weights, which can be taken as Rademacher variables or Gaussian ones. Now that labels are generated by feeding i.i.d. random samples to a neural network architecture (the teacher) and are then presented to another neural network (the student) that is trained using this data, it is interesting to compare the worst case Rademacher bound with the actual generalization error of this student on such synthetic data.

We now consider the error of a typical solution w from the posterior distribution (this is often called the Gibbs rule) for the student. Given that the rule outputs $\pm 1$ variables, this yields

$$\epsilon^{\rm Gibbs}_{\rm gen}=1-\mathbb{E}_{x,w^\star}\left[\left\langle f_{w^\star}(x)\times f_w(x)\right\rangle\right]=1-q^\star, \tag{32}$$

where $q^\star\equiv\mathbb{E}_{x,w^\star}\left[\left\langle f_{w^\star}(x)f_w(x)\right\rangle\right]$. Computing $q^\star$ can be done within the statistical mechanics approach ([Seung1992, Watkin1993, opper1995statistical, engel2001statistical]) and can be done rigorously as well ([Barbier2017b]). Notice that this error is equal to the Bayes optimal error for the quadratic loss (see as well [Barbier2017b]).

The two optimistic (teacher-student) and pessimistic (Rademacher) errors can be seen in Fig. 1 for spherical and in Fig. 2 for binary weights. In this case, since a perfect fit is always possible, the training error is zero and the Rademacher complexity is itself the bound on the generalization error. These two figures show how different the worst and teacher-student case can be in practice, and demonstrate that one should perhaps not be surprised by the fact that the empirical Rademacher complexity does not always give the correct answer [zhang2016understanding], as after all it deals only with worst case scenarios.

### 5.4 Committee machine with Gaussian weights

Given the large gap between the Rademacher bound and the teacher-student setting, we can ask whether we can find a case where the Rademacher bound is void, in the sense that the Rademacher complexity is equal to $1$, yet generalization is good for the teacher-student setting. This can be done by moving to two-layer networks. Consider a simple version of this function class, namely the committee machine [engel2001statistical]. It is a two-layer network where the second layer has been fixed, such that only the weights of the first layer are learnt. The function class for a committee machine with $K$ hidden units is defined by

$$\mathcal{F}_{\rm com}=\left\{f_W:\begin{cases}\mathbb{R}^d\to\{\pm 1\}\\ x\mapsto{\rm sign}\left(\sum_{k=1}^K{\rm sign}\left(\frac{1}{\sqrt{d}}w_k^\intercal x\right)\right)\end{cases},\ W=(w_1,\dots,w_K)\in\mathbb{R}^{d\times K}\right\}. \tag{33}$$

Instead of computing the Rademacher complexity with the replica method, it is sufficient for the purpose of this section to understand its rough behavior. As discussed in sec. 5.1, this requires knowing the Gardner capacity. A generic bound by [Mitchison1989] states that it is upper bounded by $\Theta(K\log K)$. Additionally, the Gardner capacity has been computed by the replica method in [Monasson_1995, urbanczik1997storage, xiong1998storage], who obtained that $\alpha_c=\Theta(K\sqrt{\log K})$. We thus expect that

$$R_m(\mathcal{F}_{\rm com})=1\ \text{ for }\alpha<\Theta\left(K\sqrt{\log K}\right),\qquad R_m(\mathcal{F}_{\rm com})\approx\Theta\left(\sqrt{\frac{K\sqrt{\log K}}{\alpha}}\right)\ \text{ for }\alpha\gg\Theta\left(K\sqrt{\log K}\right). \tag{34}$$

To compare with the teacher-student case, when the labels are produced by a teacher committee machine as

$$y={\rm sign}\left(\sum_{k=1}^K{\rm sign}\left(\frac{1}{\sqrt{d}}w_k^{\star\intercal}x\right)\right), \tag{35}$$

the error of the Gibbs algorithm reads

$$\epsilon^{\rm Gibbs}_{\rm gen}=1-\mathbb{E}_{x,w^\star}\left[\left\langle f_{w^\star}(x)\times f_w(x)\right\rangle\right]=1-q^\star, \tag{36}$$

where, again, $q^\star$ has been computed in a series of papers in statistical physics [Hertz1993, schwarze1993learning], and using the Guerra interpolation method in [Aubin2018]. Interestingly, in this case, one can get an error that decays as $1/\alpha$ as soon as $\alpha\gg K$. One thus observes a huge gap between the Rademacher bound, which scales as $1/\sqrt{\alpha}$, and the actual generalization error for large sample size. This large gap further illustrates the considerable difference in behavior one can get between the worst case and teacher-student case analysis, see Fig. 3.

### 5.5 Extension to rotationally invariant matrices

The previous computation for an i.i.d. data matrix X can be generalized to rotationally invariant (RI) random matrices $X=USV^\intercal$, with rotation matrices $U$ and $V$ independently sampled from the Haar measure, and $S$ a diagonal matrix of singular values. Computations for this kind of matrix can be handled again using the replica method ([Kabashima2008, barbier2018mutual, Gabrie2018]) and lead to RS and 1RSB free energies

 (37)

where each term is properly defined in Appendix B.2. Note that for a Gaussian i.i.d. matrix, the eigenvalue density follows the Marchenko-Pastur distribution, and (37) matches the free energies of eqs. (53) and (65) as well as the ground-state energies of eq. (28) in the spherical case. The ground-state energy (and therefore the Rademacher complexity) can again be computed as in the i.i.d. case, by taking the zero-temperature limit

$$e_{\rm gs, RI}(\alpha) = \lim_{\beta \to \infty} \Phi_{\rm RI}(\alpha, \beta), \tag{38}$$

keeping in particular and finite in the limits .

## 6 Conclusion

In this paper, we discussed the deep connection between the Rademacher complexity and some of the classical quantities studied in the statistical physics literature on neural networks, namely the Gardner capacity, the ground-state energy of the random perceptron model, and the generalization error in the teacher-student model. We believe it is instructive to draw the link with approaches inspired by statistical physics, and to compare their findings with the worst-case results. Given the need to understand the effectiveness of neural networks, and the limitations of the classical approaches, it is of interest to revisit the results that have emerged from the physics perspective. This direction is currently experiencing a strong revival, see e.g. [Chaudhari2016, martin2017rethinking, Advani2017a, Baity-Jesi2018]. The connection discussed in this paper opens the way to a unified presentation of these often contrasted approaches, and we hope it will help bridge the gap between researchers in traditional statistics and in statistical physics. There are many possible follow-ups, the most natural one being the computation of Rademacher complexities with statistical physics methods for more complicated and realistic models of data, starting for instance with the correlated matrices discussed in section 5.5.

## 7 Acknowledgements

This work is supported by the ERC under the European Union’s Horizon 2020 Research and Innovation Program 714608-SMiLe, by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL, and by the Chaire CFM-ENS. We thank Henry Pfister for insightful and clarifying discussions that partly inspired this work. We would also like to thank the Kavli Institute for Theoretical Physics (KITP) for welcoming us during part of this research, with the support of the National Science Foundation under Grant No. NSF PHY-1748958.

## Appendix A Rademacher scaling of the perceptron

###### Proof.

Upper bound

For a linear classifier with binary outputs such as the perceptron, the VC dimension is easy to compute: $d_{\rm VC}(\mathcal{F}_{\rm sign}) = d$. Hence we know from Massart’s theorem [massart2000some] that

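$$\mathfrak{R}_m(\mathcal{F}_{\rm sign}) \leq \Theta\left( \sqrt{\frac{d_{\rm VC}(\mathcal{F}_{\rm sign})}{m}} \right) = \Theta\left( \sqrt{\frac{d}{m}} \right) = \Theta\left( \alpha^{-1/2} \right).$$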

Lower bound
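Let us consider the following estimator (known as Hebb’s rule [hebb1962organization]): $\mathbf{w}^\star = \frac{1}{\sqrt{d}} \sum_{\nu=1}^{m} y^{(\nu)} \mathbf{x}^{(\nu)}$. Hence, for a given sample $\mathbf{x}^{(\mu)}$, the above estimator outputs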

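$$f_{\mathbf{w}^\star}(\mathbf{x}^{(\mu)}) = \mathrm{sign}\left( \frac{1}{\sqrt{d}} \mathbf{w}^{\star\intercal} \mathbf{x}^{(\mu)} \right) = \mathrm{sign}\left( \left( \frac{1}{d} \sum_{\nu=1}^{m} y^{(\nu)} \mathbf{x}^{(\nu)} \right)^{\intercal} \mathbf{x}^{(\mu)} \right).$$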

Inserting this expression into the definition of the Rademacher complexity, eq. (3), one obtains:

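$$\begin{aligned}
\mathfrak{R}_m(\mathcal{F}_{\rm sign}) &\equiv \mathbb{E}_{\mathbf{y},\mathbf{X}}\left[ \sup_{\mathbf{w}} \frac{1}{m} \sum_{\mu=1}^{m} y^{(\mu)} f_{\mathbf{w}}(\mathbf{x}^{(\mu)}) \right] \geq \mathbb{E}_{\mathbf{y},\mathbf{X}}\left[ \frac{1}{m} \sum_{\mu=1}^{m} y^{(\mu)} f_{\mathbf{w}^\star}(\mathbf{x}^{(\mu)}) \right] \\
&= \mathbb{E}_{\mathbf{y},\mathbf{X}}\left[ \frac{1}{m} \sum_{\mu=1}^{m} \mathrm{sign}\left( y^{(\mu)} \frac{1}{d} \left( \sum_{\nu=1}^{m} y^{(\nu)} \mathbf{x}^{(\nu)} \right)^{\intercal} \mathbf{x}^{(\mu)} \right) \right] \\
&= \mathbb{E}_{\mathbf{y},\mathbf{X}}\left[ \frac{1}{m} \sum_{\mu=1}^{m} \mathrm{sign}\left( 1 + \frac{1}{d} \sum_{\nu \neq \mu}^{m} y^{(\mu)} y^{(\nu)} \mathbf{x}^{(\nu)\intercal} \mathbf{x}^{(\mu)} \right) \right].
\end{aligned}$$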

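Here the diagonal term $\nu = \mu$ contributes $\|\mathbf{x}^{(\mu)}\|^2 / d \simeq 1$ in the high-dimensional limit. As $y \sim \mathcal{U}(\pm 1)$ and $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$, $\mathbf{z} \equiv y\,\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$. Hence let us define the random variable, Gaussian in the high-dimensional limit,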

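$$\theta_\mu \equiv \frac{1}{d} \sum_{\nu \neq \mu}^{m} y^{(\mu)} y^{(\nu)} \mathbf{x}^{(\nu)\intercal} \mathbf{x}^{(\mu)} = \frac{1}{d} \sum_{\nu \neq \mu}^{m} \mathbf{z}^{(\nu)\intercal} \mathbf{z}^{(\mu)},$$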

and compute its first two moments

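$$\begin{aligned}
\mathbb{E}[\theta_\mu] &= \mathbb{E}_{\mathbf{z}}\left[ \frac{1}{d} \sum_{\nu \neq \mu}^{m} \mathbf{z}^{(\nu)\intercal} \mathbf{z}^{(\mu)} \right] = \mathbb{E}_{\mathbf{z}}\left[ \frac{1}{d} \sum_{\nu \neq \mu}^{m} \sum_{i=1}^{d} z_i^{(\nu)} z_i^{(\mu)} \right] = 0, \\
\mathbb{E}[\theta_\mu^2] &= \mathbb{E}\left[ \frac{1}{d^2} \left( \sum_{\nu \neq \mu}^{m} \mathbf{z}^{(\nu)\intercal} \mathbf{z}^{(\mu)} \right)^{2} \right] = \frac{m-1}{d} \underset{m \to \infty}{\longrightarrow} \alpha.
\end{aligned}$$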
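These two moments can be checked numerically. The following is a minimal Monte Carlo sketch, assuming that the vectors $\mathbf{z}^{(\nu)}$ have i.i.d. standard Gaussian entries; the function name `sample_theta` and the sizes used are illustrative choices, not part of the original computation.

```python
import numpy as np

def sample_theta(m, d, n_trials, seed=0):
    """Draw n_trials independent realizations of theta_mu, taking mu = 0."""
    rng = np.random.default_rng(seed)
    # z[t, nu, :] plays the role of z^(nu) in trial t (assumed i.i.d. N(0, 1) entries).
    z = rng.standard_normal((n_trials, m, d))
    # theta = (1/d) * sum over nu != 0 of the scalar product z^(nu) . z^(0)
    return np.einsum("tnd,td->t", z[:, 1:, :], z[:, 0, :]) / d

m, d = 20, 10
theta = sample_theta(m, d, n_trials=20000)
print("empirical mean    :", theta.mean())
print("empirical variance:", theta.var())
print("predicted variance:", (m - 1) / d)
```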

Hence, by the central limit theorem, in the high-dimensional limit $\theta_\mu \sim \mathcal{N}(0, \alpha)$. Finally

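$$\begin{aligned}
\mathfrak{R}_m(\mathcal{F}_{\rm sign}) &\geq \mathbb{E}_{\theta}\left[ \frac{1}{m} \sum_{\mu=1}^{m} \mathrm{sign}\left( 1 + \theta_\mu \right) \right] = \mathbb{E}_{\theta}\left[ \mathrm{sign}(1 + \theta) \right] \\
&= \mathbb{P}[\theta \geq -1] - \mathbb{P}[\theta \leq -1] = 2\,\mathbb{P}[\theta \geq -1] - 1.
\end{aligned}$$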

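Noting that, since $\mathrm{erfc}(-x) = 1 + \mathrm{erf}(x) \simeq 1 + 2x/\sqrt{\pi}$ for small $x$,

$$\mathbb{P}[\theta \geq -1] = \int_{-\frac{1}{\sqrt{\alpha}}}^{\infty} \mathcal{D}\theta = \frac{1}{2} \mathrm{erfc}\left( -\frac{1}{\sqrt{2\alpha}} \right) \underset{\alpha \to \infty}{\simeq} \frac{1}{2} + \frac{1}{\sqrt{2\pi\alpha}},$$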

we obtain a lower bound for the Rademacher complexity

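$$\mathfrak{R}_m(\mathcal{F}_{\rm sign}) \geq \sqrt{\frac{2}{\pi}} \frac{1}{\sqrt{\alpha}} = \Theta\left( \frac{1}{\sqrt{\alpha}} \right).$$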
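The lower-bound argument can be illustrated with a small simulation. The sketch below, which assumes Gaussian i.i.d. data and uniform random $\pm 1$ labels, measures the average correlation between random labels and the predictions of the Hebb estimator, and compares it with the Gaussian prediction $2\,\mathbb{P}[\theta \geq -1] - 1 = \mathrm{erf}(1/\sqrt{2\alpha})$ and with its large-$\alpha$ form $\sqrt{2/(\pi\alpha)}$. All function names and sizes are hypothetical choices made for this illustration.

```python
import math
import numpy as np

def hebb_correlation(m, d, n_trials=50, seed=0):
    """Monte Carlo estimate of E[(1/m) sum_mu y_mu f_{w*}(x_mu)] for the Hebb rule."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        X = rng.standard_normal((m, d))       # assumed Gaussian i.i.d. data
        y = rng.choice([-1.0, 1.0], size=m)   # random +/-1 labels
        w = X.T @ y / math.sqrt(d)            # Hebb rule: w* = d^{-1/2} sum_nu y_nu x_nu
        preds = np.sign(X @ w / math.sqrt(d))
        values.append(np.mean(y * preds))
    return float(np.mean(values))

def gaussian_prediction(alpha):
    """2 P[theta >= -1] - 1 for theta ~ N(0, alpha), i.e. erf(1 / sqrt(2 alpha))."""
    return math.erf(1.0 / math.sqrt(2.0 * alpha))

def asymptotic_form(alpha):
    """Large-alpha expansion sqrt(2 / (pi alpha))."""
    return math.sqrt(2.0 / (math.pi * alpha))

m, d = 800, 200
alpha = m / d
estimate = hebb_correlation(m, d)
print("Monte Carlo estimate:", estimate)
print("Gaussian prediction :", gaussian_prediction(alpha))
print("asymptotic form     :", asymptotic_form(alpha))
```

Since $\mathrm{erf}(x) \leq 2x/\sqrt{\pi}$ for $x \geq 0$, the asymptotic form approaches the Gaussian prediction from above, so the final inequality should be read as a statement about the large-$\alpha$ scaling rather than as a bound valid at every finite $\alpha$.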

## Appendix B Replica computation of the perceptron ground state energy

### B.1 Gaussian i.i.d. matrix

In this section, we present the replica computation for Generalized Linear Models (GLM), corresponding to the hypothesis class in eq. (18). We focus on data drawn from i.i.d distribution , and labels y drawn randomly from . We consider for the moment a generic prior distribution that factorizes, and activation function . Let us define the cost function of a given sample that is 0 if the estimator classifies the example correctly and 1 otherwise, where . Finally, we define the constraint function at inverse temperature $\beta$, which depends explicitly on the Hamiltonian of eq. (11),

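$$\mathcal{I}(\mathbf{y}|\mathbf{z}, \beta) \equiv \prod_{\mu=1}^{m} e^{-\beta V(y_\mu | z_\mu)} = e^{-\beta \mathcal{H}(\{\mathbf{y}, \mathbf{X}\}, \mathbf{z})}, \tag{39}$$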

and note that the constraint function converges at zero temperature to a hard constraint function