Replica Symmetry Breaking in Bipartite Spin Glasses and Neural Networks
Some interesting recent advances in the theoretical understanding of neural networks have been informed by results from the physics of disordered many-body systems. Motivated by these findings, this work uses the replica technique to study the mathematically tractable bipartite Sherrington-Kirkpatrick (SK) spin glass model, which is formally similar to a Restricted Boltzmann Machine (RBM) neural network. The bipartite SK model has been previously studied assuming replica symmetry; here this assumption is relaxed and a replica symmetry breaking analysis is performed. The bipartite SK model is found to have many features in common with Parisi’s solution of the original, unipartite SK model, including the existence of a multitude of pure states which are related in a hierarchical, ultrametric fashion. As an application of this analysis, the optimal cost for a graph partitioning problem is shown to be simply related to the ground state energy of the bipartite SK model. As a second application, empirical investigations reveal that the Gibbs sampled outputs of an RBM trained on the MNIST data set are more ultrametrically distributed than the input data itself.
School of Mathematical Sciences
University of Southampton
††institutetext: Department of Physics
University of California at Santa Barbara
Santa Barbara, CA 93106
Research into deep neural networks has advanced at a stunning rate in recent years, leading to tremendous progress for a number of difficult machine learning problems and benchmarks lecun2015deep (). Despite these impressive accomplishments, there is a sense in which much of the recent progress has been made in engineering new methods and techniques, and that the development of a robust theoretical understanding has lagged behind. This state of affairs is a natural consequence both of the inherent difficulty in building a strong theoretical framework, and in the unprecedented rate of success that neural networks have had in solving machine learning problems.
Recently, some interesting advances in improving the theoretical understanding of neural networks have been inspired by results from physics, particularly the physics of disordered many-body systems. For example, in bray2007statistics (), Bray and Dean considered Gaussian random fields in high-dimensional spaces and calculated many properties of the distribution of critical points. Although the neural networks used in practical applications are certainly not Gaussian random fields, Dauphin et al. dauphin2014identifying () found through an empirical analysis that Bray and Dean’s observations for Gaussian random fields hold for neural networks trained on common machine learning data sets, such as MNIST and CIFAR-10. These include the result that high-error saddles pose a difficulty for training neural networks, and that most local minima have errors very close to the global minimum error. By drawing inspiration from physics, machine learning researchers were able to improve their understanding of neural networks, and moreover were led to propose a new Newton’s method algorithm for optimization problems which suffer from many saddle points.
Another advance in the theoretical understanding of neural networks that borrows from physics was made in choromanska2015loss (), where it was observed that under certain simplifying assumptions, the error surface of feed-forward neural networks can be mapped precisely to the energy landscape for a physics model known as the -spin spherical spin-glass, for which exact results have recently been derived in auffinger2013random (); auffinger2013complexity (). As in the Dauphin et al. study, these authors then empirically verified that these mathematical insights qualitatively apply to neural networks trained on MNIST, even though the assumptions behind the spin glass analysis no longer hold.111It is also worth noting that interesting progress has been made recently in the other direction, namely using machine learning methods to learn more about condensed matter systems carrasquilla2017machine (). Yet another recent work along these lines is schoenholz2017correspondence ().
Inspired by these results, in this work we study in detail a mathematically tractable spin glass model, and then use these results to better understand a family of neural networks. The spin glass model we consider is a simple extension of the paradigmatic Sherrington-Kirkpatrick (SK) model sherrington1975solvable (). The SK model is an infinite-range spin glass with Hamiltonian
Here , are binary spins, . The 2-spin couplings are independent identically distributed (iid) random variables drawn from a normal distribution with mean and standard deviation . The SK model is an ideal toy model; despite its simplicity its theoretical analysis is very rich and involves phenomena such as a continuous order parameter, an ultrametric space of states, and stochastic branching processes. In fact, the historical analysis of the SK model is very closely tied to the development of spin glass theory.
As every spin in the SK model couples to every other spin, the couplings may be characterized in terms of a complete graph. In this paper, we consider the extension of the SK model to complete bipartite graphs, which we will refer to as the “bipartite SK model.” The special case of balanced bipartite graphs has been studied for quite some time korenblit1985spin (); korenblit1985magnetic (); korenblit1987magnetic (); fyodorov1987antiferromagnetic (); fyodorov1987phase (); oppermann2004modulated (), but the general case has only been studied much more recently barra2017phase (); barra2017phase2 (); barra2011equilibrium (); barra2012glassy (); barra2014mean (); barra2015multi (); panchenko2015free (); genovese2016non (); fedele2012rigorous (); genovese2012universality (). These papers have calculated the model’s phase diagram in certain regimes fyodorov1987antiferromagnetic (); barra2011equilibrium (); barra2017phase () (related to our results in the next section), but to our knowledge the replica symmetry breaking of the bipartite SK model, which is crucial for correctly describing the physics of the spin glass phase, has been much less well-studied.222Some relevant works are as follows. oppermann2004modulated () considered the replica symmetry breaking for the special case of balanced bipartite graphs. barra2015multi () considered the replica symmetry breaking of multi-species spin models and proved an upper bound on the free energy, and panchenko2015free () subsequently showed that the bound is sharp. However, the model they considered necessarily allows self-interactions between species, unlike the bipartite model considered here. baik2017free () considered the bipartite spherical SK model. decelle2017spectral (); decelle2018thermodynamics () performed a replica-symmetric analysis very similar to the one we perform in Sec. 2.1.
Our primary motivation for studying the bipartite SK model is its similarity to a family of neural networks known as Restricted Boltzmann Machines (RBMs) smolensky1986information (). RBMs are generative models used in unsupervised learning, and are appealing to study from a physics perspective because they are energy-based models with a Hamiltonian of the same parametric form as the bipartite SK model. In fact, the only real difference between the bipartite SK model and the RBM lies in how the parameters are chosen. In the bipartite SK model they are iid normally distributed, whereas in the RBM they are determined by a learning algorithm. Therefore, similar in spirit to Dauphin et al dauphin2014identifying () and Choromanska et al choromanska2015loss (), it is plausible that the spin glass analysis of the bipartite SK model could provide novel insights into RBMs. Indeed, we will find this to be the case.
The outline of this paper is as follows. First, in Sec. 2 we analyze the bipartite SK model, considering both the replica symmetric and replica symmetry breaking cases. Then in Sec. 3 we consider a combinatorial optimization problem for which the optimal cost can be shown to be simply related to the ground state energy of the bipartite SK model. In Sec. 4 we present the results from a range of numerical investigations, including Markov-chain Monte Carlo (MC) simulations, and find good agreement. In Sec. 5, we empirically investigate the extent to which RBMs trained on realistic data exhibit spin glass phenomena. We conclude in Sec. 6.
2 The Bipartite Sherrington-Kirkpatrick Model
In this section we study the statistical physics of an infinite-range spin glass model called the bipartite SK model, which generalizes the famous Sherrington-Kirkpatrick (SK) model sherrington1975solvable () to a bipartite coupling graph. The Hamiltonian is
Here are spin variables which, anticipating the connection with RBMs in Sec. 5, we will call the visible and hidden spins, respectively. There are spins in total, divided into visible and hidden spins. We will use the convention that the index runs from 1 to , and the index runs from to . The parameters of the model are an matrix of couplings , and the - and -length vectors and which correspond to an independent external magnetic field for each spin. Following the machine learning literature, we will also refer to these as the biases. The Hamiltonian defines a Gibbs probability distribution over the spin variables,
with the inverse temperature and
the partition function. The Tr symbol indicates a sum over all configurations of the spin variables. The sum is intractable for both sets of spins together, but because there are no visible-visible or hidden-hidden couplings, it can be performed analytically if one set of spins is held fixed. For example, the marginal distribution over the visible spins can be written as , with
and similarly for the hidden spins.
In the bipartite SK model, the parameters appearing in the Hamiltonian are independently drawn from normal distributions,
where the notation indicates that the random variable is drawn from a normal distribution with mean and standard deviation . Our motivation for studying the bipartite model is its similarity to a class of neural networks used for unsupervised learning known as Restricted Boltzmann Machines (RBMs). RBMs are probabilistic models whose Hamiltonians take the same general form as Eq. 2, but with the parameters allowed to take on arbitrary values.333In the symmetric, uniform-bias case , , , the bipartite SK model is known as the Korenblit-Shender model korenblit1985spin (); korenblit1985magnetic (). and are typically equal in physical spin systems but not in RBMs. The spin-glass properties of RBMs will be the focus of Sec. 5.
Calculating the free energy
for a specific choice of parameters (known as a “disorder realization” in physics terminology) is an intractable problem, but in the large- limit we can calculate the expectation value of the free energy with respect to the probability distributions given in Eq.’s 6. We will use the notation to denote this expectation value (known as the “disorder-averaged value”) for any quantity .
Crucially, in the large- limit, certain “self-averaging” quantities (including the free energy) become independent of the particular disorder realization. In order to have a well-defined large- limit in which the number of spins is infinite, the moments of the 2-spin coupling will need to be rescaled. Defining and , the limit may be taken while keeping , finite. We will also find it useful to introduce
We can calculate using the replica method, which is based on the simple mathematical identity
is initially computed with taken to be a positive integer, and each copy of is referred to as a replica. At the end of the calculation, the above formula is used to compute and hence also . There are many mathematical subtleties associated with the replica method which have been discussed extensively in the physics literature; we therefore omit any further discussion here.
For the bipartite SK model, the disorder average may be taken immediately, yielding
Here the indices , run from 1 to , and label each replica. indicates a trace over all replicas, and a bar indicates that the inverse temperature has been absorbed into the parameter, for example .
In the analysis of the original, unipartite SK model, the Hubbard-Stratonovich integral transform is used to convert the argument of the exponential to be quadratic in the spin variables, at the cost of introducing an integral. For the bipartite SK model, there are two species of spin variables, and a different integral transform is required. Such a transform is provided by
which converges for . This transform may be applied to the term in Eq. 10 for
and to the term for
The resulting expression is
In the above, the site indices for the spins have been dropped because the sums over sites factorize and each site is treated equally due to the disorder average, and refers to a trace over all replicas of just the visible spins only, and similarly for .
In the large- limit, the integral may be evaluated by steepest descents: . The indicates that has been evaluated at a saddle, which is determined by the following equations:
where we have introduced the average magnetizations
as well as the overlaps
Angled brackets denote a thermal average with respect to the Gibbs distribution Eq. 3.
It is not useful to immediately impose the saddle point equations because the trace must still be carried out in the terms. Before carrying out the traces, however, it is convenient to first convert to complex variables, defined by
These variables have the useful property that the hatted variables do not depend on the traces, and so their saddle equations may be immediately solved for, yielding444Note that the hatted quantities are not necessarily the complex conjugates of the un-hatted quantities: this is true only if the saddle solutions of the variable are all real, which will turn out not to be the case.
(The notation means that the equivalent equation with every appearance of the labels and interchanged also holds.) This choice of variables has the additional property that the saddle equations Eq. 17 become simply , , , .
At this point the calculation has been formulated as a simple extension of the usual replica symmetry breaking calculation of the unipartite model – see for example dotsenko2005introduction (); mezard1987spin (); nishimori2001statistical (). One interesting point to note is that if and , , then there is a symmetry between the visible and hidden spins and the expression reduces to exactly that of the unipartite SK model. We have verified that our results reduce to those of the SK model when this condition is imposed.
For simplicity, we will henceforth neglect the bias terms by setting them to zero.555korenblit1985spin (); korenblit1985magnetic (); korenblit1987magnetic (); fyodorov1987antiferromagnetic (); fyodorov1987phase (); oppermann2004modulated () considered the effect of a uniform bias in the special case of the Korenblit-Shender model defined in footnote 3. The free energy is invariant under the transformation , which corresponds to flipping all the spins in the visible part (and also under the equivalent transformation for the hidden part). In the bipartite SK model, this corresponds to and either or but not both). In the zero-bias case we will consider, the free energy is therefore symmetric in , and without loss of generality we will only consider the ferromagnetic case .
As in the SK model, an ansatz for the matrix structure of the integration variables will need to be specified in order to proceed. We will first assume a replica symmetric ansatz and analyze the phase diagram. Replica symmetry is broken in the spin glass phase, and in the subsequent section the replica symmetry breaking (RSB) ansatz of Parisi will be used.
2.1 Replica Symmetric Analysis
First, we assume a replica symmetric ansatz, with all matrix and vector entries equal:
which results in
The quadratic spin terms in may be linearized through the Hubbard-Stratonovich transform, after which the traces become trivial. Finally, Eq. 9 may be used to yield the free energy density, , with the result that
where indicates an expectation value taken over the Gaussian random variable with zero mean and unit variance that was introduced by the Hubbard-Stratonovich transform,
Extremizing the free energy with respect to the variables yields the saddle equations
and the equivalent equations with .
We can now map out the phase diagram of the zero-bias system, and we find three phases. In the paramagnetic phase, with , each spin fluctuates randomly over time. In the ferromagnetic phase, with , the spins are frozen into a single globally aligned symmetry-breaking direction. In the spin-glass phase, with but , each individual spin is frozen in a particular direction, but an equal number are frozen in each direction, so there is no net magnetization.
The equations Eq. 28 must be solved numerically, except in the vicinity of the transitions that border the paramagnetic phase, at which both the magnetizations and overlaps are small and a perturbative analysis may be applied. The ferromagnetic/spin glass transition is not accessible via perturbative methods as the overlap is non-zero on both sides of the transition. Perturbative analysis yields the following critical lines: for as a phase boundary between the paramagnetic and ferromagnetic phases, and as the boundary between the paramagnetic and spin glass phases. The ferromagnet/spin glass phase boundary can be solved numerically by expanding the equations Eq. 28 around zero magnetization. The paramagnetic/spin glass phase boundary for was previously derived (with different conventions) in barra2011equilibrium (); barra2017phase (). We plot the zero-bias phase diagram in Fig. 1.666This phase diagram only applies for the case of zero biases. When the bias means are non-zero, the Hamiltonian is no longer symmetric under the transformation and the ferromagnetic and antiferromagnetic cases are no longer equivalent. Assuming the bias means have the same sign, in the ferromagnetic case the transition between the paramagnetic and ferromagnetic phases disappears and they merge into a single phase, which has but is best thought of as the paramagnetic phase, because the ground state is unique and there is no symmetry breaking. In the antiferromagnetic case , the two parts magnetize unequally, with the larger part more strongly magnetized (in the symmetric case , there remains a symmetry-breaking phase in which the larger magnetization is selected spontaneously). Both cases still have a spin glass phase characterized by many pure states mezard1987spin (); korenblit1985spin (), but with a net magnetization.
In more detail, the ferromagnet/spin glass boundary may be solved for by first specifying a value of . Then the overlaps are determined by solving the Eq.’s 28a, 28b evaluated at zero magnetization. Finally, the value of corresponding to the phase boundary may be determined by solving Eq.’s 28a, 28b expanded to first order in the magnetizations, which can be combined to yield the equation
In the paramagnetic phase, the magnetizations and overlaps are zero, and the expression for the free energy Eq. 26 simplifies to
This expression was first derived in barra2011equilibrium (), and it was also proven that this replica-symmetric expression is valid in this phase. In the SK model, the assumption of replica symmetry fails to hold in the spin glass phase and in a low-temperature subset of the ferromagnetic phase due to the presence of unstable modes, as shown by de1978stability (). We expect the same is true for the bipartite SK model, however we will omit a detailed stability analysis here. Instead, the unphysical nature of the replica symmetric solution may be illustrated by the fact that it predicts a negative entropy in the zero temperature limit within the spin glass phase. In particular, setting , the zero temperature limit of the saddle equations Eq. 28 yield zero magnetization, , as well as
This may be inserted into the free energy to yield
As the free energy is related to the entropy via at fixed volume and particle number, the assumption of replica symmetry predicts an entropy density , which is negative for all and hence unphysical.
2.2 Replica Symmetry Breaking Analysis
In this section we drop the assumption of replica symmetry, which fails to hold in the spin glass phase. (oppermann2004modulated () performed a similar calculation for the related Korenblit-Shender model defined in footnote 3, in which ; here we consider zero biases but arbitrary .)
2.2.1 Review: Parisi Ansatz
Parisi famously proposed in parisi1980sequence () a very interesting ansatz for replica symmetry breaking in the context of the original unipartite SK model. We will consider the natural extension of his ansatz to the bipartite model studied here. In order for the presentation to be as self-contained as possible, we first briefly review the Parisi ansatz for the unipartite SK model for a single set of spins , with Hamiltonian given by Eq. 1.
The spin glass phase of the unipartite SK model is characterized by a free energy with many local minima separated by high free energy barriers. As a result, thermodynamic averages may be decomposed into sums of pure states. If the pure states are indexed by (the reason for using the same index as for the replicas will become apparent soon), then for any observable ,
where the weights of the pure states are given by , with the free energy of the pure state. Pure states satisfy cluster decomposition, which for an infinite-dimensional model such as the bipartite SK model implies that correlation functions of observables at different sites must factorize: . The large free energy barriers between pure states cause ergodicity to be broken, so that the system becomes trapped in a single pure state during time evolution.
The relationship between the existence of many distinct pure states and the replica method was first worked out by Parisi parisi1980sequence (), who showed that the matrix which measures the overlap between replicas and may be identified with the matrix which measures the overlap between pure states and . This corresponds to equating
and justifies using the same index for replicas and pure states. Under time evolution, ergodicity is broken, and different replicas can spontaneously equilibrate into different pure states despite having been initialized equivalently. Thus, to properly capture the physics of the many pure states, an ansatz for the overlap matrix must be specified which breaks the permutation symmetry between replicas.
Parisi introduced a particularly interesting ansatz, which may be constructed according to the following prescription. First, a series of integers is introduced, , , with , and , and all assumed to be integers. The off-diagonal elements of the overlap matrix are then parametrized according to
where is the ceiling function. The form a sequence of variational parameters which replace the role of in the replica-symmetric analysis. An example of this ansatz is depicted below in Fig. 2.
This ansatz breaks the permutation symmetry among replicas in a sequential, iterative manner. The symmetry is broken between pure states belonging to different blocks, and as increases the number of different blocks increases. The case corresponds to the replica symmetric case considered previously. Finite corresponds to partial symmetry breaking, and is referred to as the RSB-k scheme. In the extreme case where , the symmetry is fully broken. Regardless of the degree to which replica symmetry is broken, eventually the limit is taken and the overlap matrix becomes .
At this point, the relationship between the replica symmetry breaking ansatz for and the many pure states can be seen by computing
which is the probability distribution of finding two pure states, and with a given overlap . The subscript indicates that this expression is not self-averaging; unlike the free energy, it depends on the particular realization for the coupling matrix. After imposing the Parisi ansatz for , taking the disorder average, and the limit, the overlap probability can be shown to be expressible as
With , the sequence of overlaps has been converted to a continuous function of a single variable, , and moreover that function has been related to the probability of finding two states with a given overlap. The inverse function is the cumulative distribution function for the random variable , and gives the probability of finding an overlap less than or equal to . The overlap is the quantile function for , with domain and range .777A subtley here is the fact that in the zero-bias case, for every pure state there is a partner state with the each spin reversed. This means that the overlap functions should lie within the interval . The replica symmetry breaking ansatz implicitly assumes that , and so the probability distribution across the full interval is related to those of Eq. 37 via . See p. 32 of mezard1987spin () for more details on this point. Thus, breaking of replica symmetry has allowed for a multitude of distinct pure states with different overlaps between each other.
Perhaps the most fascinating feature of Parisi’s replica symmetry breaking solution of the SK model is the fact that the space of pure states is ultrametric mezard1984nature (); mezard1987replica (). Just as is defined to be the disorder-averaged probability that two arbitrary pure states, and , will have an overlap , a 3-state overlap distribution may also be defined. For pure states , may be defined to be the probability that , , and . Interestingly, unlike the case of which can only be analytically calculated near criticality, the 3-state overlap distribution of the SK model may be calculated in the spin glass phase for any value of the temperature mezard1984nature (); mezard1987replica ():
where is the Heaviside theta function.
The simple expression for the 3-overlap distribution, Eq. 38, implies that the pure states of the model are arranged in a tree-like structure. This can be best seen by first introducing the distance between pure states:
where is the state-independent Edwards-Anderson order parameter. Assuming without loss of generality that , Eq. 38 implies that is non-zero only for
Thus, all triangles in the space of states are either equilateral or isosceles with the unequal side shorter than the equal sides.
Metric spaces satisfying Eq. 40 are called ultrametric.888The defining equation for ultrametric spaces is normally written as , which is equivalent to Eq. 40. For a review of ultrametricity from the perspective of physics, see rammal1986ultrametricity (). One interesting property of ultrametric spaces which is relevant here is that points in such a space may be associated with the leaves of a tree diagram, and therefore, the dendrogram in Fig. 2 serves to illustrate how the assumption of ultrametricity is implicitly contained within the Parisi ansatz, Eq. 35. The phenomenon of ultrametricity may be understood dynamically. As the temperature is lowered, ancestor pure states split into descendant pure states according to a stochastic branching process mezard1987spin () depicted in Fig. 3 below.
2.2.2 Free Energy
Given the form of Eq. 21, it is natural to separately assume the Parisi ansatz for each of the two overlap matrices, , . We will not attempt to rigorously prove the correctness of this ansatz in the bipartite model, but in Sec. 4 we show good agreement with Monte Carlo simulations. One immediate implication of the ansatz is that the pure states in the bipartite model will also be ultrametrically distributed. The derivation of Eq. 38 only relies on the properties of the algebra of Parisi matrices (matrices of the form Eq. 35), and not on the functional form of the Hamiltonian. Thus, by repeating the original calculation for the marginal distributions , , one finds that the pure states in each reduced system will exhibit the same ultrametric structure.
The calculation of the free energy proceeds analogously to the SK calculation, which can be found in many textbooks mezard1987spin (); dotsenko2005introduction (); nishimori2001statistical (), and so we will not repeat the derivation here. A set of Parisi equations may be derived for infinite breaking of replica symmetry:
where we have introduced and to simplify the expressions. The functions , for , are determined to be a solution of the differential equation:
The boundary condition is that .
These equations may be solved to a high degree of numerical precision using the methods of crisanti2002analysis (). As these techniques are somewhat involved, we will instead be content to solve the equations in the RSB-1 scheme. The variational parameters are (which should not be confused with the magnetization), and a sequence of -values for each spin species: , . The RSB-1 expression for the free energy is
where is a Gaussian measure, and the variables are related to the variables as in the above. The variational parameters are , and are chosen to maximize999In RSB calculations the free energy must be maximized rather than minimized due to the peculiarity of the limit. , subject to the constraint that , and , and similarly for the . We present the numerical solution of this equation, along with the results of MC simulations, in Sec. 4 below.
2.2.3 Near-Critical Solution
Analytic results for the spin glass phase are in general not possible for the replica symmetry breaking analysis. However, they are obtainable in the vicinity of the spin-glass phase transition. Here we will work out the near-critical overlap distributions, and verify that in the RSB analysis the bipartite SK model exhibits a spin glass transition which is very similar to that of the standard, unipartite model. For simplicity, we will again set .
The starting point is the free energy, Eq. 21. In the paramagnetic phase, the overlap matrices will be zero, and in the spin glass phase, they will be non-zero. In the spin glass phase and near the phase boundary , they will be non-zero but small, and therefore the free energy may be expanded as