# Approximate Survey Propagation for Statistical Inference

Fabrizio Antenucci Institut de physique théorique, Université Paris Saclay, CNRS, CEA, F-91191 Gif-sur-Yvette, France Soft and Living Matter Lab., Rome Unit of CNR-NANOTEC, Institute of Nanotechnology, Piazzale Aldo Moro 5, I-00185, Rome, Italy    Florent Krzakala Laboratoire de Physique Statistique, CNRS & Sorbonnes Universités & École Normale Supérieure, PSL University, Paris, France.    Pierfrancesco Urbani Institut de physique théorique, Université Paris Saclay, CNRS, CEA, F-91191 Gif-sur-Yvette, France    Lenka Zdeborová Institut de physique théorique, Université Paris Saclay, CNRS, CEA, F-91191 Gif-sur-Yvette, France
###### Abstract

Approximate message passing algorithms have enjoyed considerable attention in the last decade. In this paper we introduce a variant of the AMP algorithm that takes into account the glassy nature of the system under consideration. We call this algorithm approximate survey propagation (ASP) and derive it for a class of low-rank matrix estimation problems. We derive the state evolution for the ASP algorithm and prove that it reproduces the one-step replica symmetry breaking (1RSB) fixed-point equations, well known in the physics of disordered systems. Our derivation thus gives a concrete algorithmic meaning to the 1RSB equations that is of independent interest. We characterize the performance of ASP in terms of convergence and mean-squared error as a function of the free Parisi parameter $s$. We conclude that when there is a model mismatch between the true generative model and the inference model, the performance of AMP rapidly degrades both in terms of MSE and of convergence, while ASP converges in a larger regime and can reach lower errors. Among other results, our analysis leads us to a striking hypothesis: whenever $s$ (or other parameters) can be set in such a way that the Nishimori condition is restored, the corresponding algorithm is able to reach a mean-squared error as low as the Bayes-optimal error obtained when the model and its parameters are known and exactly matched in the inference procedure.

## I Introduction

### i.1 General Motivation

Many problems in data analysis and other data-related sciences can be formulated as optimization or inference in high-dimensional and non-convex landscapes. General classes of algorithms that are able to deal with some of these problems include Monte Carlo and Gibbs-based sampling neal2000markov (), variational mean field methods wainwright2008graphical (), stochastic gradient descent bottou2010large () and their variants. Approximate message passing (AMP) algorithms ThoulessAnderson77 (); DonohoMaleki09 () are another such class that has one very remarkable advantage over all those mentioned above: for a large class of models where instances are generated from a probabilistic model, the performance of AMP on large such instances can be tracked via a so-called state evolution bolthausen2014iterative (); bayati2011dynamics (); javanmard2013state (). This development has very close connections to the statistical physics of disordered systems, because the state evolution that describes AMP coincides with fixed-point equations that arise from the replica and cavity methods as known in the theory of glasses and spin glasses MPV87 ().

AMP and its state evolution have so far been mainly used in two contexts. On the one hand, for optimization in cases where the associated landscape is convex. This is the case e.g. in the original work of Donoho, Maleki, Montanari DonohoMaleki09 () where $\ell_1$-norm minimization for sparse linear regression is analyzed, or in the study of the so-called M-estimators donoho2016high (); advani2016statistical (). On the other hand, in the setting of Bayes-optimal inference where the model that generated the data is assumed to be known perfectly, see e.g. zdeborova2015statistical (); LesKrzZde17 (), where the so-called Nishimori conditions ensure that the associated posterior probability measure is of the so-called replica symmetric kind in the context of spin glasses MPV87 (); NishimoriBook (). Many (if not most) of the inference and optimization problems that are solved in the currently most challenging applications are highly non-convex, and the underlying model that generated the data is not known. It is hence an important general research direction to understand the behavior of algorithms and find their more robust generalizations encompassing such settings.

In the present paper we make a step in this general direction for the class of AMP algorithms. We analyze in detail the phase diagram and phases of convergence of the AMP algorithm on a prototypical example of a problem that is non-convex and not in the Bayes-optimal setting. The example we choose is the rank-one matrix estimation problem, which has the same phase diagram as the Sherrington-Kirkpatrick (SK) model with ferromagnetic bias SherringtonKirkpatrick75 (). AMP reproduces the corresponding replica symmetric phase diagram, with the region of non-convergence given by the instability towards replica symmetry breaking (RSB). We note that while this phase diagram is well known in the literature on spin glasses, its algorithmic consequences are obscured in the early literature by the fact that, unless the AMP algorithm is iterated with the correct time-indices bolthausen2014iterative (), convergence is not reached, see discussion e.g. in zdeborova2015statistical ().

Our main contribution is the development of a new type of approximate message passing algorithm that takes into account breaking of the replica symmetry and reduces to AMP for a special choice of parameters. We call this the approximate survey propagation (ASP) algorithm, following up on the influential work on survey propagation in sparse random constraint satisfaction problems MPZ02 (); BMZ05 (). We show that there are regions (away from the Nishimori line) in which ASP converges while AMP does not, and where at the same time ASP provides lower estimation error than AMP. We show that the state evolution of ASP leads to the one-step replica symmetry breaking (1RSB) fixed-point equations well known from the study of spin glasses. This is the first algorithm that provably converges towards fixed-points of the 1RSB equations. Again we stress that, while the 1RSB phase diagram and corresponding physics of the ferromagnetically biased SK model are well known, their algorithmic confirmation is new to our knowledge: even if 1RSB versions of the Thouless-Anderson-Palmer (TAP) ThoulessAnderson77 () equations were previously discussed, e.g. in MPV87 (); yasuda2012replica (), the corresponding time-indices in ASP are crucial in order to reproduce this phase diagram algorithmically. Our work gives a concrete algorithmic meaning to the 1RSB fixed-point equations, and can thus be potentially used to understand this concept independently of the heuristic replica or cavity methods.

### i.2 The low-rank estimation model and Bayesian setting

As this is an exploratory paper, we focus on the problem of rank-one matrix estimation. This problem is among the simplest on which the ASP algorithm can be tested. In particular, for binary (Ising) variables it is equivalent to the Sherrington-Kirkpatrick model with ferromagnetically biased couplings as studied e.g. in SherringtonKirkpatrick75 (); de1978stability (); NishimoriBook (). Low-rank matrix estimation is a problem ubiquitous in modern data processing. Commonly it is stated in the following way: given an observed matrix $Y$, one aims to write it as $Y \approx UV^\top$, where the factors $U$ and $V$ have a small number $r$ of columns, $r$ being the rank, and the remainder is noise or a part of the data matrix that is not well explained by this decomposition. Commonly used methods such as principal component analysis or clustering can be formulated as low-rank matrix estimation. Applications of these methods range from medicine to signal processing or marketing. From an algorithmic point of view, and under various constraints on the factors and the noise, the problem is highly non-trivial (non-convex and NP-hard in the worst case). The development of algorithms for low-rank matrix estimation and the study of their theoretical properties is the object of numerous works in statistics, machine learning, signal processing, applied mathematics, etc. wright2009robust (); candes2011robust (); candes2009exact (). In this paper we focus on symmetric low-rank matrix estimation, where the matrix $Y$ is symmetric and the desired decomposition is $Y \approx XX^\top$, where $X \in \mathbb{R}^{N\times r}$.

A model for low-rank matrix estimation that can be solved exactly in the large size limit, using the techniques studied in this paper, is the so-called general-output low-rank matrix estimation model lesieur2015mmse (); LesKrzZde17 (), where the matrix $Y$ is generated from a given probability distribution $P^{(0)}_{\rm out}(Y_{ij}|w^{(0)}_{ij})$ with

$$w^{(0)}_{ij} \equiv \frac{\left(x^{(0)}_i\right)^\top x^{(0)}_j}{\sqrt{N}}, \qquad (1)$$

and where each component $x^{(0)}_i$ is generated independently from a probability distribution $P_0$. The matrix $X^{(0)}$ can then be seen as an unknown signal that is to be reconstructed from the observation of the matrix $Y$.

Following the Bayesian approach, a way to infer $X$ given $Y$ is to analyze the posterior probability distribution

$$P(X|Y) = \frac{1}{Z_X(Y)} \prod_{1\le i\le N} P_X(x_i) \prod_{1\le i<j\le N} \exp\!\left[g\!\left(Y_{ij}\,\Big|\,\frac{x_i^\top x_j}{\sqrt{N}}\right)\right], \qquad (2)$$

where $P_X$ is the assumed prior on the signal components and $g$ is the assumed model for the noise. The normalization $Z_X(Y)$ is the partition function in the statistical physics notation, and the posterior distribution (2) is nothing else but the Boltzmann measure on the $r$-dimensional spin variables $x_i$.

In Bayes-optimal estimation, we know exactly the model that generated the data, i.e.

$$P_X = P_0, \qquad P_{\rm out} = P^{(0)}_{\rm out}, \qquad r = r_0. \qquad (3)$$

The Bayes-optimal estimator is defined as the one that, among all estimators, minimizes the mean-squared error with respect to the ground truth signal $X^{(0)}$. It is computed as the mean of the posterior marginals or, in physics language, the local magnetizations.

In this paper we will focus on inference with a mismatched model, where at least one of the above equalities (3) does not hold. Yet, we will still aim to evaluate the estimator that computes the mean of the posterior marginals

$$\hat{x}_i = \int x_i\, P(X|Y)\prod_{j=1}^N \mathrm{d}x_j. \qquad (4)$$

We will call this the marginalization estimator.

Another common estimator in inference is the maximum posterior probability (MAP) estimator where

$$\hat{X}_{\rm MAP} = \underset{X}{\mathrm{argmax}}\ P(X|Y). \qquad (5)$$

Generically, optimizing a complicated cost function is simpler than marginalizing over it, because optimization is in general simpler than counting. Thus MAP is often thought of as a simpler estimator to evaluate. Moreover, in statistics the problems are usually set in a regime where the MAP and the marginalization estimator coincide. However, this is not the case in the setting considered in the present paper and we will comment on the difference in the subsequent sections.

In our analysis we are interested in the high-dimensional statistics limit, where $N \to \infty$ while the rank stays of order one. The factor $\sqrt{N}$ in Eq. (1) ensures that in this limit the estimation error we are able to obtain is in a regime where it goes from zero to randomly bad as the parameters vary. This case is different from traditional statistics, where one is typically concerned with estimation errors going to zero as the sample size grows.

In the $N\to\infty$ limit, the model defined above benefits from a universality property in the noise channel lesieur2015mmse (); LesKrzZde17 () (see krzakala2016mutual () for a rigorous proof), as the estimation error depends on the output function $g$ only through the matrices

$$S_{ij} = \left.\frac{\partial g(Y_{ij}|w)}{\partial w}\right|_{w=0}, \qquad \hat{R}_{ij} = -\left.\frac{\partial^2 g(Y_{ij}|w)}{\partial w^2}\right|_{w=0}. \qquad (6)$$

Because of this universality, in the following we will restrict the assumed output channel to be Gaussian with

$$g(Y|w) = -\frac{(Y-w)^2}{2\Delta} - \frac{1}{2}\log 2\pi\Delta, \qquad S_{ij} = \frac{Y_{ij}}{\Delta}, \qquad \hat{R}_{ij} = \frac{1}{\Delta}. \qquad (7)$$

The ground truth channel $g^{(0)}$ that was used to generate the observation is also Gaussian, centered around $w^{(0)}_{ij}$ with variance $\Delta_0$.

### i.3 Ferromagnetically biased SK model and related work

The numerical and example sections of this paper will focus on one of the simplest cases of rank-one estimation, with a binary signal, i.e.

$$P_0(x_i) = P_X(x_i) = \frac{1}{2}\left[\delta(x_i - 1) + \delta(x_i + 1)\right]. \qquad (8)$$

In physics language such a prior corresponds to Ising spins, and the Boltzmann measure (2) is then the one of a Sherrington-Kirkpatrick (SK) model SherringtonKirkpatrick75 () with interaction matrix $Y_{ij}/\sqrt{N}$. After a gauge transformation $x_i \to x_i x^{(0)}_i$, this is equivalent to the SK model at temperature $\Delta$ with random iid interactions of mean $1/N$ and variance $\Delta_0/N$ (see for instance the discussion in Sec. II.B of zdeborova2015statistical () or Sec. II.F of zdeborova2010generalization ()).
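As a quick numerical sanity check of this mapping (a minimal sketch; the sizes, noise level and seed are illustrative choices, not taken from the paper), one can verify that the gauged couplings have mean $1/N$ and variance $\Delta_0/N$, as the gauge argument implies:

```python
import numpy as np

# Hypothetical instance: planted SK couplings Y_ij ~ N(x0_i x0_j / sqrt(N), Delta0)
rng = np.random.default_rng(1)
N, Delta0 = 2000, 0.5
x0 = rng.choice([-1.0, 1.0], size=N)                      # bimodal ground truth
G = rng.normal(size=(N, N))
noise = np.sqrt(Delta0) * (G + G.T) / np.sqrt(2.0)        # symmetric Gaussian noise
Y = np.outer(x0, x0) / np.sqrt(N) + noise

# Gauge transformation x_i -> x_i x0_i maps J = Y/sqrt(N) onto
# J'_ij = Y_ij x0_i x0_j / sqrt(N), ferromagnetically biased iid couplings
Jp = Y * np.outer(x0, x0) / np.sqrt(N)
iu = np.triu_indices(N, k=1)                              # independent entries only
print(N * Jp[iu].mean(), N * Jp[iu].var())                # ~ 1 and ~ Delta0
```

The empirical mean and variance (rescaled by $N$) should concentrate around $1$ and $\Delta_0$ respectively, recovering the ferromagnetically biased SK statistics quoted in the text.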

This variant of the SK model with ferromagnetically biased interactions is very well known in the statistical physics literature. The original paper SherringtonKirkpatrick75 () presents the replica symmetric phase diagram of this model, de1978stability () computes the AT line below which the replica symmetry is broken, and Parisi famously presented the exact solution of this model below the AT line in parisi1979infinite (). A rather complete account of the physical properties of this model is reviewed in MPV87 ().

Note that while the replica solution and phase diagram of this model is very well known in the physics literature, the algorithmic interpretation of the phase diagram in terms of the AMP algorithm is recent. It is due to Bolthausen bolthausen2014iterative () who noticed that in order for the TAP equations ThoulessAnderson77 () to converge and to reproduce the features of the well known phase diagram, one needs to adjust the iteration indices in the TAP equations. We will call the TAP equations with adjusted time indices the AMP algorithm. Bolthausen proved that in the Sherrington-Kirkpatrick model the AMP algorithm converges if and only if above the de Almeida-Thouless (AT) line de1978stability ().

The work of Bolthausen together with the development of AMP for sparse linear estimation and compressed sensing DonohoMaleki09 () revived the interest in the algorithmic aspects of dense spin glasses. For a review of the recent progress see e.g. zdeborova2015statistical (). AMP for low-rank matrix estimation was studied e.g. in rangan2012iterative (); NIPS2013_5074 (); DeshpandeM14 (); lesieur2015phase (); deshpande2015asymptotic (); lesieur2015mmse (), its rigorous asymptotic analysis called the state evolution in bayati2011dynamics (); rangan2012iterative (); deshpande2015asymptotic (); javanmard2013state ().

Correctness of the replica theory in the Bayes-optimal setting was proven rigorously in a sequence of works DeshpandeM14 (); deshpande2015asymptotic (); krzakala2016mutual (); barbier2016mutual (); LelargeMiolane16 (); barbier2017stochastic (); alaoui2018estimation (). While the first complete proof is due to krzakala2016mutual (), the Ising case discussed here is equivalent to the gauge-symmetric Sherrington-Kirkpatrick model proven earlier in korada2009exact (). Various applications and phase diagrams for the problem are discussed in detail in e.g. LesKrzZde17 ().

While it is well known in the physics literature de1978stability (); parisi1979infinite (); MPV87 () that below the AT line replica symmetry breaking is needed to describe correctly the Boltzmann measure (2), the algorithmic consequences of this have remained unexplored to date. There are several versions of the TAP equations embodying a replica symmetry breaking structure in the literature MPV87 (); yasuda2012replica (), but they do not include the proper time-indices and hence are not able to reproduce quantitatively the RSB phase diagram (just as was the case for the TAP equations before the work of DonohoMaleki09 (); bolthausen2014iterative ()).

In this paper we close this gap and derive approximate survey propagation, an AMP-type of algorithm that takes into account replica symmetry breaking. Using the state evolution theory rangan2012iterative (); DeshpandeM14 (); deshpande2015asymptotic () we prove that the ASP algorithm reproduces the 1RSB phase diagram in the limit of large system sizes. We study properties of the ASP algorithm, resulting estimation error as a function of the Parisi parameter, its convergence and finite size behavior.

## Ii Properties of approximate message passing

### ii.1 Reminder of AMP and its state evolution

In this section we recall the standard Approximate Message Passing (AMP) algorithm. Within the context of low-rank matrix estimation, the AMP equations are referred to as Low-RAMP and are discussed extensively in LesKrzZde17 (). In the physics literature, Low-RAMP would be equivalent to the TAP equations ThoulessAnderson77 () (with corrected iteration indices) for a model of vectorial spins with local magnetic fields and a general kind of two-body interactions. In this sense, the Low-RAMP equations encompass as special cases the original TAP equations for the Sherrington-Kirkpatrick model SherringtonKirkpatrick75 (), for the Hopfield model Mezard17 (); MPV87 () and for the restricted Boltzmann machine GabTraKrz15 (); Traelal16 (); Mezard17 (); TubMon17 ().

#### ii.1.1 Low-RAMP equations.

Let us state the Low-RAMP equations to emphasize the differences and similarities with the replica symmetry breaking approach of Sec. III. The Low-RAMP algorithm evaluates the marginals of Eq. (2) starting from the belief propagation (BP) equations yedidia2003understanding () (cf. the factor graph of Fig. 5). The main assumption of BP is that the incoming messages are probabilistically independent when conditioned on the value of the root. In the present case the factor $\sqrt{N}$ in Eq. (1) makes the interactions sufficiently weak so that the assumption of independence of the incoming messages is plausible at the leading order in the large size limit. Moreover, this assumption is particularly beneficial in the case of continuous variables: since we have an accumulation of many independent messages, the central limit theorem assures that it is sufficient to consider only means and variances to represent the result (relaxed belief propagation) instead of dealing with whole probability distributions. To finally obtain the Low-RAMP equations, the further step is the so-called TAPification: from the relaxed belief propagation equations one notices that the algorithm can be further simplified if, instead of dealing with messages associated to each directed edge, one works with only node-dependent quantities. This generates the so-called Onsager terms. Keeping track of the correct time indices under iteration, in order to preserve convergence of the iterative scheme zdeborova2015statistical (), one ends up with the Low-RAMP equations LesKrzZde17 ()

$$B^t_i = \frac{1}{\sqrt{N}}\sum_{k=1}^N S_{ki}\,\hat{x}^t_k - \left(\frac{1}{N}\sum_{k=1}^N S^2_{ki}\,\sigma^t_k\right)\hat{x}^{t-1}_i, \qquad (9)$$
$$A^t_i = \frac{1}{N}\sum_{k=1}^N\left[\hat{R}_{ki}\left(\hat{x}^t_k\hat{x}^{t,\top}_k + \sigma^t_k\right) - S^2_{ki}\,\sigma^t_k\right], \qquad (10)$$
$$\hat{x}^{t+1}_i = \frac{\partial f_{\rm in}}{\partial B}\left[A^t_i, B^t_i\right], \qquad (11)$$
$$\sigma^{t+1}_i = \frac{\partial^2 f_{\rm in}}{\partial B^2}\left[A^t_i, B^t_i\right], \qquad (12)$$

where

$$f_{\rm in}[A,B] \equiv \log\left[\int \mathrm{d}x\, P_X(x)\, \exp\left(B^\top x - \frac{1}{2}x^\top A x\right)\right]. \qquad (13)$$

Note also that these equations can be further simplified by replacing $S^2_{ki}$ with its mean, without changing the leading order in $N$ of the expressions. This simplification is also exploited in the rigorous derivation of the state evolution, cf. Sec. II.1.4.

Practically, one initializes $\hat{x}^0_i$ and $\sigma^0_i$ to some small numbers, then evaluates $B^0_i$ and $A^0_i$, then $\hat{x}^1_i$ and $\sigma^1_i$, and keeps going until convergence. The values of $\hat{x}_i$ and $\sigma_i$ at convergence are the estimators of the mean and the covariance matrix of the variable $x_i$. The mean squared error (MSE) with respect to the ground truth that is reached by the algorithm is then

$$\mathrm{MSE}(\hat{x}) = \frac{1}{N}\sum_{i=1}^N \left\|\hat{x}_i - x^{(0)}_i\right\|^2. \qquad (14)$$
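The iterative scheme just described can be made concrete with a minimal self-contained sketch for the rank-one bimodal case of Eq. (8), where $f_{\rm in}(A,B)=\log 2\cosh(B)-A/2$, so that $\partial_B f_{\rm in}=\tanh(B)$ and $\partial^2_B f_{\rm in}=1-\tanh^2(B)$ and the $A$-dependence drops out of the update. The instance size, noise level, iteration budget and initialization below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic planted instance on the Nishimori line (illustrative parameters)
N, Delta0 = 500, 0.2
x0 = rng.choice([-1.0, 1.0], size=N)                      # bimodal signal, Eq. (8)
G = rng.normal(size=(N, N))
Y = np.outer(x0, x0) / np.sqrt(N) + np.sqrt(Delta0) * (G + G.T) / np.sqrt(2.0)

# Low-RAMP iteration, Eqs. (9)-(12), for the Ising prior (tanh denoiser)
Delta = Delta0                       # assumed noise matches the ground truth here
S = Y / Delta                        # Eq. (7)
x = 0.01 * rng.normal(size=N)        # small random initialization
sigma = np.ones(N)
x_old = np.zeros(N)
for t in range(200):
    B = S @ x / np.sqrt(N) - (S**2 @ sigma / N) * x_old   # Eq. (9) with Onsager term
    x_old = x
    x = np.tanh(B)                                        # Eq. (11)
    sigma = 1.0 - x**2                                    # Eq. (12)
    if np.mean(np.abs(x - x_old)) < 1e-8:
        break

# Eq. (14); the posterior is invariant under X -> -X, so compare up to a global sign
mse = min(np.mean((x - x0) ** 2), np.mean((x + x0) ** 2))
print(f"MSE = {mse:.3f}")
```

Well inside the replica-symmetric region (here $\Delta=\Delta_0=0.2$) the iteration converges in a few tens of steps and reaches a small MSE, in line with the behavior described below for Fig. 2.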

#### ii.1.2 Bethe Free Energy.

The fixed points of the Low-RAMP equations are stationary points of the Bethe free energy of the model. In general, the free energy of a probability measure is defined as the logarithm of its normalization (note that in physics the free energy is usually defined as the negative logarithm). Within the same assumptions as Low-RAMP, the free energy can be approximated using the Plefka expansion plefka1982convergence (); georges1991expand (), obtaining the Bethe free energy LesKrzZde17 ()

$$\begin{aligned} \Phi_{\rm Bethe} = \max_{\{A_i,B_i\}} &\sum_{1\le i\le N} \left\{ f_{\rm in}(A_i,B_i) - B_i^\top \hat{x}_i + \tfrac{1}{2}\mathrm{Tr}\!\left[A_i\left(\hat{x}_i\hat{x}_i^\top + \sigma_i\right)\right] \right\} \\ + \tfrac{1}{2}\sum_{1\le i,j\le N}\Big\{ &\tfrac{S_{ij}}{\sqrt{N}}\,\hat{x}_i^\top\hat{x}_j - \tfrac{\hat{R}_{ij}}{2N}\,\mathrm{Tr}\!\left[\left(\hat{x}_i\hat{x}_i^\top+\sigma_i\right)\left(\hat{x}_j\hat{x}_j^\top+\sigma_j\right)\right] \\ &+ \tfrac{S_{ij}^2}{2N}\,\mathrm{Tr}\!\left[\hat{x}_i\hat{x}_i^\top\sigma_j + \sigma_i\hat{x}_j\hat{x}_j^\top - \sigma_i\sigma_j\right] \Big\} \end{aligned} \qquad (15\text{--}17)$$

where $\hat{x}_i$ and $\sigma_i$ are considered explicit functions of $A_i$ and $B_i$ as in Eqs. (11), (12). The Bethe free energy is useful to analyze situations in which the AMP equations have more than one fixed point: the best achievable mean squared error is associated with the largest free energy. The Bethe free energy is also useful in order to use adaptive damping to improve the convergence of the Low-RAMP equations rangan2016fixed (); vila2015adaptive ().

#### ii.1.3 State Evolution.

One of the advantages of AMP-type algorithms is that one can analyze their performance in the large size limit via the so-called State Evolution (SE), equivalent to the cavity method in the physics literature. Assume that $Y$ is generated from the following process: the signal $X^{(0)}$ is extracted from a probability distribution $P_0$ and then measured through a Gaussian channel of zero mean and variance $\Delta_0$, so that the probability distribution of the matrix elements is given by

$$P_{\rm out}(Y_{ij}) = \exp\left[g^{(0)}\!\left(Y_{ij}\,\Big|\,\frac{x^{(0),\top}_i x^{(0)}_j}{\sqrt{N}}\right)\right], \qquad g^{(0)}(Y|w) = -\frac{1}{2}\ln(2\pi\Delta_0) - \frac{(Y-w)^2}{2\Delta_0}. \qquad (18)$$

Note that here we are considering the general situation in which the prior and the noise channel (and possibly also the rank) are not known and are in principle different from the ones used in the posterior Eq. (2). If both the prior and the channel are exactly known, we say we are in the Bayes-optimal case.

The central limit theorem assures that $B^t_i$ and $A^t_i$, averaged over the disorder, are Gaussian with

$$\overline{B^t_i} = \frac{M^t}{\Delta}\,x^{(0)}_i, \qquad \overline{\left(B^t_i\right)^2} - \overline{B^t_i}^{\,2} = \frac{\Delta_0}{\Delta^2}\,Q^t, \qquad \overline{A^t_i} = \frac{\Delta_0}{\Delta^2}\,Q^t - \frac{\Delta_0-\Delta}{\Delta^2}\,\Sigma^t_X, \qquad (19)$$

while the variance of $A^t_i$ is of smaller order in $N$, and where we have defined the order parameters

$$M^t = \frac{1}{N}\sum_{k=1}^N \hat{x}^t_k x^{(0),\top}_k \in \mathbb{R}^{r\times r_0}, \qquad Q^t = \frac{1}{N}\sum_{k=1}^N \hat{x}^t_k\hat{x}^{t,\top}_k \in \mathbb{R}^{r\times r}, \qquad \Sigma^t_X = \frac{1}{N}\sum_{k=1}^N \sigma^t_k \in \mathbb{R}^{r\times r}. \qquad (20)$$

Using then Eqs. (11), (12) to fix self-consistently the values of the order parameters, one obtains the state evolution equations LesKrzZde17 ()

$$M^{t+1} = \mathbb{E}_{x^{(0)},W}\left[\frac{\partial f_{\rm in}}{\partial B}(A^t,B^t)\, x^{(0),\top}\right], \qquad (21)$$
$$Q^{t+1} = \mathbb{E}_{x^{(0)},W}\left[\frac{\partial f_{\rm in}}{\partial B}(A^t,B^t)\,\frac{\partial f_{\rm in}}{\partial B}(A^t,B^t)^\top\right], \qquad (22)$$
$$\Sigma^{t+1}_X = \mathbb{E}_{x^{(0)},W}\left[\frac{\partial^2 f_{\rm in}}{\partial B^2}(A^t,B^t)\right], \qquad (23)$$

where $x^{(0)}$ is distributed according to $P_0$, $W$ is a Gaussian noise of zero mean and unit covariance matrix, and we have defined

$$A^t \equiv \frac{\Delta_0}{\Delta^2}\,Q^t - \frac{\Delta_0-\Delta}{\Delta^2}\,\Sigma^t_X, \qquad B^t \equiv \frac{M^t}{\Delta}\,x^{(0)} + \sqrt{\frac{\Delta_0}{\Delta^2}\,Q^t}\, W. \qquad (24)$$

Similarly, the large size limit of the Bethe free energy Eq. (17) is given by

$$\Phi_{\rm RS} = \max\left\{\phi_{\rm RS}(M,Q,\Sigma)\,:\, \frac{\partial\phi_{\rm RS}}{\partial M} = \frac{\partial\phi_{\rm RS}}{\partial Q} = \frac{\partial\phi_{\rm RS}}{\partial\Sigma} = 0\right\} \qquad (25)$$

with

$$\phi_{\rm RS}(M,Q,\Sigma) = \mathbb{E}_{x^{(0)},W}\left[f_{\rm in}(A,B)\right] + \frac{\Delta_0}{4\Delta^2}\,\mathrm{Tr}\left[QQ^\top\right] - \frac{1}{2\Delta}\,\mathrm{Tr}\left[MM^\top\right] - \frac{\Delta_0-\Delta}{4\Delta^2}\,\mathrm{Tr}\left[(Q+\Sigma)(Q+\Sigma)^\top\right]. \qquad (26)$$

This free energy coincides with the one obtained by replica theory under the replica symmetric (RS) ansatz.

For Low-RAMP, the mean squared error (MSE) can be evaluated from the state evolution as

$$\mathrm{MSE}_{\text{Low-RAMP}} = \mathrm{Tr}\left[\mathbb{E}_{x^{(0)}}\left[x^{(0)}x^{(0),\top}\right] - 2M + Q\right], \qquad (27)$$

allowing us to assess the typical performance of the algorithm.
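For the bimodal case of Eq. (8), the state evolution Eqs. (21)-(24) reduce to scalar updates that can be iterated directly. The sketch below (a minimal illustration; the grid-based Gaussian average, parameter values and initialization are my own choices, not taken from the paper) tracks $M^t$ and $Q^t$ and evaluates the MSE of Eq. (27):

```python
import numpy as np

# Gaussian averages E_W[.] approximated on a dense grid (illustrative choice)
z = np.linspace(-8.0, 8.0, 4001)
w = np.exp(-z**2 / 2.0)
w /= w.sum()

def state_evolution(Delta, Delta0, iters=200, M0=1e-3, Q0=1e-3):
    """Iterate Eqs. (21)-(24) for the bimodal prior, where dfin/dB = tanh(B)."""
    M, Q = M0, Q0
    for _ in range(iters):
        B = M / Delta + np.sqrt(Delta0 * Q) / Delta * z   # Eq. (24), with x0 = +1
        t = np.tanh(B)
        M, Q = w @ t, w @ t**2                            # Eqs. (21), (22)
    return M, Q

# On the Nishimori line (Delta = Delta0) the fixed point satisfies M = Q
M, Q = state_evolution(0.5, 0.5)
mse = 1.0 - 2.0 * M + Q                                   # Eq. (27), E[x0^2] = 1
print(f"M = {M:.4f}, Q = {Q:.4f}, MSE = {mse:.4f}")
```

Running it on the Nishimori line yields $M = Q$ at the fixed point, anticipating the Nishimori condition discussed in Sec. II.2, so the MSE reduces to $1 - M$.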

#### ii.1.4 The rigorous approach

The fact that state evolution accurately tracks the behavior of the AMP algorithm has been proven bayati2011dynamics (); bolthausen2014iterative (); rangan2012iterative (); DeshpandeM14 (); deshpande2015asymptotic (). In this section, we shall discuss the main lines of this progress.

First, let us rewrite the AMP equations (9)-(12) as follows, using a vectorial form:

$$B^t = \frac{S}{\sqrt{N}}\,\hat{x}^t - b_t\,\hat{x}^{t-1}, \qquad (28)$$
$$\hat{x}^{t+1} = \eta_t(B^t), \qquad (29)$$

where the scalar quantity $b_{t+1}$ is computed as

$$b_{t+1} = \frac{\mathbb{E}[S^2]}{N}\sum_{i=1}^N \eta'_t(B^t_i), \qquad (30)$$

where $\mathbb{E}[S^2]$ is the average value of the square of the elements of the matrix $S$. The link with the AMP equations written previously is direct when one chooses the denoising function to be

$$\eta_t(B) := \partial_B f_{\rm in}(B, A^t). \qquad (31)$$

The strong advantage of the rigorous theorem is that it can be stated for any function $\eta_t$ (under some Lipschitz conditions), and we shall take advantage of this in the next chapter.

With these notations, the state evolution is rigorous thanks to a set of works due to Rangan and Fletcher rangan2012iterative (), Deshpande and Montanari DeshpandeM14 (), and Deshpande, Abbe, and Montanari deshpande2015asymptotic () (see in particular Lemma 4.4 in deshpande2015asymptotic (); one can go from their notation to ours by a simple change of variables, which also maps their denoising function onto $\eta_t$).

###### Theorem 1 (State Evolution for Low-RAMP rangan2012iterative (); DeshpandeM14 (); deshpande2015asymptotic ()).

Consider the problem as in (1), and define $S$ as in Eq. (6). Let $(\eta_t)_{t\ge0}$ be a sequence of functions such that both $\eta_t$ and $\eta'_t$ are Lipschitz continuous. Then the empirical averages

$$M^t = \frac{1}{N}\sum_i \hat{x}^t_i x^{(0)}_i, \qquad Q^t = \frac{1}{N}\sum_i \left(\hat{x}^t_i\right)^2, \qquad \Psi^t = \frac{1}{N}\sum_i \psi\!\left(B^t_i, x^{(0)}_i\right), \qquad (32)$$

for a large class of functions $\psi$ (see deshpande2015asymptotic ()) converge, when $N\to\infty$, to their state evolution predictions where

$$M^t_{\rm SE} = \mathbb{E}\left[x^{(0)}\eta_t(Z)\right], \qquad Q^t_{\rm SE} = \mathbb{E}\left[\eta_t(Z)^2\right], \qquad \Psi^t_{\rm SE} = \mathbb{E}\left[\psi\!\left(Z, x^{(0)}\right)\right], \qquad (33)$$

where $Z$ is a Gaussian random variable with mean $M^t_{\rm SE}\,x^{(0)}/\Delta$ and variance $\Delta_0 Q^t_{\rm SE}/\Delta^2$, and $x^{(0)}$ is distributed according to the prior $P_0$.

This means that the variable $B^t_i$ converges to a Gaussian with mean and variance as predicted within the cavity method in (19). This provides a rigorous basis for the analysis of AMP. In particular, one can choose $\eta_t(B) = \partial_B f_{\rm in}(B, A^t)$, and this covers the AMP algorithm discussed in this section.

### ii.2 Phase diagram and convergence of AMP out of the Nishimori line

In the Bayes-optimal setting, where the knowledge of the model is complete as in Eq. (3), the statistical physics analysis of the problem presents important simplifications known as the Nishimori conditions NishimoriBook (). Specifically, as a direct consequence of the Bayes formula and basic properties of probability distributions, it is easy to see that zdeborova2015statistical ()

$$\text{Bayes-optimal:}\qquad \mathbb{E}\left[f(x^1,x^2)\right] = \mathbb{E}\left[f(x^{(0)},x)\right], \qquad (34)$$

where $f$ is a generic function, $x^1$ and $x^2$ (as well as $x$) are distributed according to the posterior, while $x^{(0)}$ is distributed according to $P_0$. The most striking property of systems that verify the Nishimori conditions is that there cannot be any replica symmetry breaking in the equilibrium solution of these systems nishimori2001absence () (note, however, that the dynamics can still be complicated, and in fact a dynamic spin glass (d1RSB) phase can appear also in the Bayes-optimal case LesKrzZde17 ()). This simplifies considerably the analysis of Bayes-optimal inference. From the algorithmic point of view, the replica symmetry assures that the marginals are asymptotically exactly described by the belief-propagation algorithms zdeborova2015statistical (). In this sense, AMP provides an optimal analysis of Bayes-optimal inference. In real-world situations it is, however, difficult to ensure the satisfaction of the Nishimori conditions, as the knowledge of the prior and of the noise is limited and/or the parameter learning is not perfect. The understanding of what happens beyond the Bayes-optimal condition is then crucial.
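A minimal numerical illustration of such a Nishimori identity in the scalar Bayes-optimal case (the value of `mu` and the sample size are illustrative): if $B \sim \mathcal{N}(\mu,\mu)$, i.e. mean equal to variance as holds on the Nishimori line, then $\mathbb{E}[\tanh B] = \mathbb{E}[\tanh^2 B]$, so the overlap with the ground truth and the self-overlap coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 1.3                                           # illustrative value
B = rng.normal(mu, np.sqrt(mu), size=2_000_000)    # mean equal to variance
m1 = np.tanh(B).mean()                             # overlap-like first moment
m2 = (np.tanh(B) ** 2).mean()                      # self-overlap-like second moment
print(m1, m2)                                      # nearly equal, up to sampling noise
```

Off the Nishimori line (mean different from variance) the two averages separate, which is precisely the $M \neq Q$ behavior discussed in Sec. II.3.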

Using the state evolution Eqs. (21)-(23) - or equivalently replica theory within a replica symmetric ansatz - one can obtain the phase diagram of Low-RAMP in a mismatched-model setting. As soon as we deviate from the Bayes-optimal setting and mismatch the models, the Nishimori conditions are no longer valid, so the replica symmetry is not guaranteed and Low-RAMP is typically not optimal. If the mismatch is substantial, the replica symmetry gets broken and we enter a glassy phase.

To illustrate the behaviour of AMP out of the Bayes-optimal case, let us consider the planted Sherrington-Kirkpatrick (SK) model SherringtonKirkpatrick75 (), a well-known model in both the computer science and physics communities. This is a particular case of the low-rank matrix estimation model Eq. (1) for a bimodal prior, cf. Eq. (8). The output noise is Gaussian, with the assumed variance $\Delta$ mismatching the ground-truth variance $\Delta_0$, cf. Eqs. (7),(18). In Fig. 1 we show the phase diagram of the model obtained from the SE Eqs. (21)-(23) as $\Delta$ and $\Delta_0$ are independently changed. The dashed line is the Nishimori line $\Delta=\Delta_0$, where the Bayes-optimal condition holds. For $\Delta_0>1$ the signal-to-noise ratio is too low and estimation is impossible: the overlap $M$ is always zero. Note that an MSE larger than 1 is worse than the error achieved by a random guess from the prior. At $\Delta_0=1$, inference starts to be successful only on the Nishimori line. As one moves further and further away from the Nishimori line, the algorithm needs a lower and lower value of the noise $\Delta_0$ in order to achieve a positive overlap and an MSE lower than random guess. In particular, if the assumed noise $\Delta$ is too large, one always reaches the trivial fixed point $M=Q=0$. The limit $\Delta\to 0$ is also especially meaningful, as it corresponds to an evaluation of the MAP estimator Eq. (5). In this limit, the overlap with the ground truth is zero down to a finite threshold in $\Delta_0$ NishimoriBook (), while the MSE is larger than 1 down to an even smaller threshold, almost half of the equivalent threshold on the Nishimori line.

The SE equations describe the average (or large size) behaviour of Low-RAMP and as such they converge in the whole phase diagram. Single finite-size instances of Low-RAMP do not converge point-wise too far from the Nishimori line, cf. Fig. 2.

The analysis of the point-wise convergence on single instances of Low-RAMP can be carried out by looking at the Hessian eigenvalues of the Low-RAMP equations. A simple way to derive a necessary convergence criterion is to ask whether or not the fixed point of AMP is stable with respect to weak, infinitesimal, random perturbations. Let us see how it can be derived with the notations of Section II.1.4. If we perform, in Eq. (28), the perturbation $\hat{x}^t \to \hat{x}^t + \epsilon^t$, where $\epsilon^t$ is an i.i.d. infinitesimal random vector, then we may ask how this perturbation is carried over to the next step of the iteration. From the recursion (28)-(31), we see that (a) $b_t$ is not modified to leading order, as the average in (30) suppresses the contribution of the perturbation, and (b) the perturbation propagates with a factor $\eta'_t(B^t_i)$ through the matrix $S/\sqrt{N}$. The squared norm of the perturbation has thus been multiplied, up to the constant $\mathbb{E}[S^2]$, by the empirical average of $\eta'_t(B^t_i)^2$. The fixed point will be stable if the perturbation does not grow. Coming back to the main notations of the article, this yields the following criterion:

$$\lambda = 1 - \frac{\Delta_0}{\Delta^2}\,\mathbb{E}_{x^{(0)},W}\left[\left(\left.\frac{\partial^2 f_{\rm in}(A,B)}{\partial B^2}\right|_{A=\frac{\Delta_0}{\Delta^2}Q-\frac{\Delta_0-\Delta}{\Delta^2}\Sigma,\; B=\frac{M}{\Delta}x^{(0)}+\sqrt{\frac{\Delta_0 Q}{\Delta^2}}\,W}\right)^{\!2}\,\right]. \qquad (35)$$

For positive $\lambda$ the perturbation decreases and the AMP algorithm converges; for negative $\lambda$ it grows and the algorithm does not converge. Interestingly, condition (35) is equivalent to the stability of the replica symmetric (RS) solution in replica theory, given by the RS replicon de1978stability () (and indeed bolthausen2014iterative () has shown rigorously, in the SK model, convergence of the AMP algorithm in the phase where the replica symmetric solution is stable). The line where the RS solution becomes unstable (and Low-RAMP stops converging point-wise) is shown in Fig. 1 as a yellow line. Note that, as expected, the RS stability line always lies below the Nishimori line, and the two touch at the tri-critical point $(\Delta_0=1,\Delta=1)$. For small $\Delta_0$ the stability line touches the $\Delta_0=0$ axis only at $\Delta=0$. This means that for any finite $\Delta_0$ the RS solution always becomes unstable for $\Delta$ small enough. The stability line always lies above the MSE $=1$ line, cf. Fig. 1. Note that it is possible to get RS-stable solutions with MSE larger than 1: it is thus possible that AMP converges point-wise to something worse than a random guess.
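In the bimodal case, where $\partial^2_B f_{\rm in} = 1-\tanh^2(B)$, the criterion (35) can be evaluated numerically on top of the state-evolution fixed point. The sketch below (illustrative parameter values and numerical scheme of my own choosing) finds $\lambda>0$ on the Nishimori line and $\lambda<0$ far below it:

```python
import numpy as np

# Dense-grid approximation of Gaussian averages (illustrative numerical choice)
z = np.linspace(-10.0, 10.0, 8001)
w = np.exp(-z**2 / 2.0)
w /= w.sum()

def rs_lambda(Delta, Delta0, iters=300):
    """Run SE Eqs. (21)-(24) to its fixed point, then evaluate Eq. (35)."""
    M, Q = 1e-3, 1e-3
    for _ in range(iters):
        B = M / Delta + np.sqrt(Delta0 * Q) / Delta * z
        t = np.tanh(B)
        M, Q = w @ t, w @ t**2
    B = M / Delta + np.sqrt(Delta0 * Q) / Delta * z
    second = 1.0 - np.tanh(B)**2          # d2 fin / dB2 for the bimodal prior
    return 1.0 - Delta0 / Delta**2 * (w @ second**2)

print(rs_lambda(0.5, 0.5))   # on the Nishimori line: lambda > 0, AMP converges
print(rs_lambda(0.2, 2.0))   # far below it: lambda < 0, RS unstable
```

The sign change of $\lambda$ as the mismatch grows reproduces, point by point, the yellow stability line of Fig. 1.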

In Fig. 2 we show the results obtained by running Low-RAMP on single instances of the same problem. We iterate the Low-RAMP equations, without any damping, for a fixed maximal number of steps, or we stop when the average change of the variables in a single step of the iteration falls below a small threshold. The four regions highlighted in Fig. 1 are well distinguishable. We also show the value of $Q-M$, which is zero on the Nishimori line. In particular, it is clear that the point-wise convergence is very fast (fewer than 100 iterations) well inside the RS stability region; the convergence time then increases rapidly at the boundary of the RS stability region, and the algorithm stops converging point-wise outside of it.

The scenario illustrated in this section for SK is quite general within low-rank matrix estimation problems. Some differences arise when the prior $P_X$ is very sparse: in this case there is a first order transition on the Nishimori line LesKrzZde17 () and the scenario becomes more complicated, with the coexistence of different phases close to the transition.

### ii.3 Optimality and restored Nishimori condition

The Bayes-optimal setting, as recalled in Sec. II.2, is a very special situation: the Nishimori condition Eq. (34) guarantees that replica symmetry is preserved and that AMP is optimal (in the absence of a first-order phase transition). One consequence of the Nishimori condition is that the typical overlap with the ground truth and the self-overlap , as defined in Eqs. (20), are equal. In this case the MSE is given by

 \mathrm{MSE} = \mathbb{E}[x_0^2] - 2M + Q = \mathbb{E}[x_0^2] - M, \qquad (36)

and the minimum MSE is obtained at the maximum overlap . For mismatching models, and are typically different from each other, and it is not immediately clear under which conditions the MSE is minimized. We highlighted this property in the main panel of Fig. 2: apart from the trivial solutions (for large and ) and (for very small ), the overlap with the ground truth and the self-overlap are close only near the Nishimori line. The Nishimori line is also the region in which AMP is optimal, so it is natural to ask what relation (if any) exists between these two properties in a general setting.
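Equation (36) can be sketched directly; the helper below simply encodes the relation between the MSE, the overlap $M$ and the self-overlap $Q$, and the numerical values are arbitrary illustrations.

```python
def mse_from_overlaps(ex2, M, Q):
    """MSE in terms of the second moment of the ground truth E[x_0^2],
    the overlap M with the truth and the self-overlap Q, cf. Eq. (36):
    MSE = E[x_0^2] - 2 M + Q."""
    return ex2 - 2.0 * M + Q

ex2 = 1.0  # e.g. a Rademacher ground truth, for which E[x_0^2] = 1
# On the Nishimori line M = Q, so the MSE reduces to E[x_0^2] - M:
on_nishimori = mse_from_overlaps(ex2, M=0.7, Q=0.7)
# Off the Nishimori line M and Q differ and both overlaps matter:
off_nishimori = mse_from_overlaps(ex2, M=0.6, Q=0.8)
```

Note that for fixed $Q$ the MSE decreases with $M$, while for fixed $M$ a larger self-overlap $Q$ is penalized, which is why the two trivial solutions mentioned above need not be optimal.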

Consider a situation in which we have two or more parameters to tune. These can be parameters in the prior, the variance of the noise, or the free parameters in the general belief-propagation equations, see for example the Parisi parameter $s$ of the ASP algorithm that we will describe in Eqs. (59)-(64). The fundamental question is then for which set of values of the parameters it is possible to obtain optimal estimation error, and how to find them algorithmically. One evident answer is that if all the parameters are chosen to exactly match those of the ground-truth distribution, one ends up on the Nishimori line, where we know that the inference is optimal and the MSE is minimal. But this is not the only answer.

Let us illustrate our findings with a simple example. The data are generated by the planted SK model with . In this case Bayes-optimal inference, knowing the correct prior and noise, would obtain . Consider instead a situation in which one (wrongly!) assumes that the variance of the noise is and, not being sure about the correct prior, adopts the more general Rademacher-Bernoulli prior

 P(x) = \frac{\rho}{2}\,\delta(x-1) + \frac{\rho}{2}\,\delta(x+1) + (1-\rho)\,\delta(x), \qquad (37)

with some unknown parameter . In Fig. 4 we show the values of , and the MSE obtained from the SE Eqs. (21)-(23) as a function of , while is kept fixed at . Taking , which corresponds to the correct prior, we obtain , and , considerably worse than the Bayes-optimal error. The maximum is obtained for , where and . If instead we tune the parameter so as to restore the optimality condition , we are led to the value : there and , equal to the value obtained by Bayes-optimal inference.
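For concreteness, the Rademacher-Bernoulli prior of Eq. (37) can be sampled directly; the sketch below also checks the moments $\mathbb{E}[x]=0$ and $\mathbb{E}[x^2]=\rho$ that enter the state-evolution equations (the value of $\rho$ is an arbitrary illustration).

```python
import numpy as np

def sample_prior(rho, size, rng=None):
    """Draw i.i.d. samples from the Rademacher-Bernoulli prior of Eq. (37):
    +1 and -1 with probability rho/2 each, 0 with probability 1 - rho."""
    rng = np.random.default_rng(rng)
    return rng.choice([-1.0, 0.0, 1.0], size=size,
                      p=[rho / 2, 1.0 - rho, rho / 2])

rho = 0.6
x = sample_prior(rho, size=200_000, rng=0)
# Under this prior E[x] = 0 and E[x^2] = rho, so rho directly controls
# the second moment of the assumed signal.
mean, second_moment = x.mean(), (x ** 2).mean()
```

Note that allowing $\rho > 1$, as discussed below for the restored optimality condition, makes Eq. (37) a signed measure rather than a probability distribution, and the sampler above no longer applies there.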

The above observations lead us to a hypothesis: any combination of parameter values that restores the condition achieves the same performance as the Bayes-optimal setting. We tested this hypothesis in several settings of the low-rank matrix estimation problem, including sparse cases with a first-order transition LesKrzZde17 (). While we always found it to hold, the underlying reason eludes us and is left for future work. Optimality through the restoration of the condition is relevant in cases in which we assume a wrong functional form of the prior, as in the previous example, so that it is genuinely impossible to reduce to the Bayes-optimal setting.

Note that it is nontrivial to turn the above observation into an algorithmic procedure. A common parameter-learning procedure is to maximize the Bethe free energy of the model, Eq. (17). This procedure gives asymptotically the Bayes-optimal parameters if learning of the exact prior and noise is possible. However, if the functional form of the prior and noise is incorrect, this procedure does not return the optimal values of the parameters, which, under our hypothesis, would be those associated with the restoration of the condition. In the previous example, the Bethe free energy is monotonically increasing in and has a local maximum only at , where the prior is not a well-defined probability distribution and the MSE is larger than 1, cf. Fig. 4. It is interesting to look at the RS eigenvalue Eq. (35) for this solution, which is associated with the point-wise convergence of AMP. The eigenvalue is positive for , and in particular it is positive at the value where the optimality condition is restored. Moreover, for very low , in this case for , the only solution is the trivial , . These two extremes delimit the finite interval in which one should look for the optimal . We will discuss this observation further in Sec. III.3 in the context of ASP.
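Restoring the condition $M = Q$ amounts to a one-dimensional root search over the mismatched parameter. The sketch below uses plain bisection; the function `overlap_gap` is a hypothetical stand-in for $M(\rho) - Q(\rho)$, which in practice would be obtained by iterating the state-evolution Eqs. (21)-(23) to a fixed point at each candidate $\rho$.

```python
def bisect_root(f, a, b, tol=1e-10, max_iter=200):
    """Plain bisection: find x in [a, b] with f(x) ~ 0, assuming f changes
    sign on the interval (here, the interval delimited by the two extremes
    discussed in the text)."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        fm = f(m)
        if abs(fm) < tol or b - a < tol:
            return m
        if fa * fm < 0:
            b, fb = m, fm
        else:
            a, fa = m, fm
    return 0.5 * (a + b)

# Hypothetical stand-in for M(rho) - Q(rho); chosen only so that the gap
# vanishes at rho = 1.3, illustrating a restored condition at rho > 1.
overlap_gap = lambda rho: 0.5 * (rho - 1.3)

rho_star = bisect_root(overlap_gap, 1.0, 2.0)
```

The bracketing interval here plays the role of the finite interval mentioned above, between the loss of the nontrivial solution and the region where the eigenvalue changes behavior.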

## III The Approximate Survey Propagation: a 1RSB version of AMP

AMP is an established approach to analyzing systems with a ferromagnet-like transition, where one expects just two possible fixed points of the iterations. There are, however, situations in which a huge number of fixed points of the AMP equations exist. In particular, we have shown in the previous section that in the low-rank matrix estimation problem with mismatching models there is a region where replica symmetry is broken and the AMP algorithm does not converge. In this case one needs to use the cavity method in conjunction with a replica symmetry breaking (RSB) approach, as introduced by Mézard and Parisi MP01 (); mezard2003cavity (); MPZ02 (). In the following we show how this approach can be carried out for the low-rank matrix estimation problem and how it provides a systematic method to deal with mismatching models in a natural way. Note that we need to insist on the independence of the noise elements, which is essential for belief-propagation-based approaches to work.

### iii.1 Derivation of the ASP algorithm for the low-rank matrix estimation problem

We derive here the 1-step replica symmetry breaking (1RSB) approximate message passing, which we call the approximate survey propagation (ASP) algorithm, for the low-rank matrix estimation problem. To derive the correct equations in this case, let us consider a replicated inference problem, essentially turning the method of Monasson Mo95 () into a message-passing algorithm. Given the matrix $Y$ and $s$, the number of real replicas, we assume that

 Y_{ij} = \frac{x^{(a)}_i x^{(a)}_j}{\sqrt{N}} + \xi^{(a)}_{ij}, \qquad (38)

where the $\xi^{(a)}_{ij}$ are independent Gaussian noises with zero mean and variance . The partition function becomes

 Z_{\rm rep}(Y) = \int \Big[\prod_{a=1}^{s}\prod_{i=1}^{N} dx^{(a)}_i\, P_X(x^{(a)}_i)\Big] \exp\Big[\sum_{i\le j}\sum_{a=1}^{s} g(Y_{ij}|w^{(aa)}_{ij})\Big], \qquad (39)

where $w^{(aa)}_{ij} = x^{(a)}_i x^{(a)}_j/\sqrt{N}$, and $g$ and $P_X$ have the same definitions as in the previous sections, cf. Eq. (2). If $s=1$ we recover the usual inference problem of Eq. (1).
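The generative reading of Eq. (38) can be sketched numerically. The snippet below is an illustration only: it draws a common planted configuration, builds the rank-one signal and adds independent Gaussian noise per replica; it does not symmetrize the noise, since only the entries $i \le j$ enter the likelihood of Eq. (39).

```python
import numpy as np

def replicated_observations(x, s, delta, rng=None):
    """Generate s observation matrices as in Eq. (38):
    Y^(a)_ij = x_i x_j / sqrt(N) + xi^(a)_ij, with xi^(a)_ij i.i.d.
    zero-mean Gaussian noise of (assumed, possibly mismatched) variance
    delta, drawn independently for each replica a."""
    rng = np.random.default_rng(rng)
    N = len(x)
    signal = np.outer(x, x) / np.sqrt(N)  # rank-one planted part
    return [signal + np.sqrt(delta) * rng.normal(size=(N, N))
            for _ in range(s)]

# planted SK-like ground truth with +/-1 entries
x0 = np.random.default_rng(1).choice([-1.0, 1.0], size=50)
Ys = replicated_observations(x0, s=3, delta=1.0, rng=2)
```

At $s = 1$ this reduces to a single noisy observation of the rank-one matrix, i.e. the original inference problem.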

In order to evaluate the marginals over the variables $\underline{x}_i$, we need to write the BP equations for the replicated system. In the following we omit normalization factors when they are irrelevant (they can be fixed a posteriori). The BP equations are

 m_{i\to ij}(\underline{x}_i) \propto \Big[\prod_{a=1}^{s} P_X(x^{(a)}_i)\Big] \prod_{k\ne j} m_{ki\to i}(\underline{x}_i), \qquad (40)
 m_{ij\to i}(\underline{x}_i) \propto \int d\underline{x}_j\, m_{j\to ij}(\underline{x}_j)\, \exp\Big[\sum_{a=1}^{s} g(Y_{ij}|w^{(aa)}_{ij})\Big], \qquad (41)

where we have introduced the notation $\underline{x}_i = (x^{(1)}_i, \dots, x^{(s)}_i)$. This follows directly from the factor graph represented in Fig. 5.

At this point we introduce a 1RSB parametrization for the cavity distributions

 m_{j\to ij}(\underline{x}_j) \propto \int dh\, \exp\Big[-\frac{(h-\hat{x}_{j\to ij})^2}{2\Delta^{(0)}_{j\to ij}}\Big] \prod_{a=1}^{s} \exp\Big[-\frac{(x^{(a)}_j-h)^2}{2\Delta^{(1)}_{j\to ij}}\Big]. \qquad (42)

The form of this ansatz elucidates why we are considering a replicated system: if the posterior measure Eq. (2) develops a 1RSB structure, the phase space of the solutions of the inference problem gets clustered in exponentially many basins of solutions MPV87 (). Therefore, if we consider a set of real copies of the same inference problem and we infinitesimally couple them, they will acquire a finite probability to fall inside the same cluster of solutions. This line of reasoning is exactly the same as the one considered to describe the real replica approach for glasses Mo95 (); CKPUZ17 () and the ansatz of Eq. (42) reproduces the infinite dimensional caging ansatz (at the 1RSB level) for hard spheres in infinite dimension CKPUZ14 (); CKPUZ13 ().

The ansatz of Eq. (42) allows for three relevant correlation functions

 \langle x^{(a)}_j \rangle = \hat{x}_{j\to ij},
 \langle x^{(a)}_j x^{(b),\top}_j \rangle_{a\ne b} = \Delta^{(0)}_{j\to ij} + \hat{x}_{j\to ij}\hat{x}^{\top}_{j\to ij}, \qquad (43)
 \langle x^{(a)}_j x^{(a),\top}_j \rangle = \Delta^{(1)}_{j\to ij} + \Delta^{(0)}_{j\to ij} + \hat{x}_{j\to ij}\hat{x}^{\top}_{j\to ij},

where the averages are taken with respect to the measure defined in Eq. (42). The last two lines give access to the correlations between cavity messages from one replica to another. Note that if $\Delta^{(0)}_{j\to ij}=0$ different replicas become uncorrelated; therefore in the limit $\Delta^{(0)}_{j\to ij}\to 0$ one recovers the replica symmetric AMP. Equivalently, if one fixes $s=1$ the ansatz reduces naturally to the RS parametrization for the cavity messages and one gets back again the AMP algorithm.
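The moments in Eqs. (43) follow from simple Gaussian algebra under the ansatz (42), and can be checked by Monte Carlo in the scalar (rank-one) case: draw a cluster center $h$, then independent replicas around it. The parameter values below are arbitrary illustrations.

```python
import numpy as np

# Monte Carlo check of Eqs. (43) for the scalar 1RSB ansatz (42):
# h ~ N(xhat, D0) is the cluster center, and each replica a is drawn
# independently as x^(a) ~ N(h, D1).
rng = np.random.default_rng(0)
xhat, D0, D1, s, n = 0.3, 0.5, 0.2, 4, 400_000

h = rng.normal(xhat, np.sqrt(D0), size=n)                   # cluster centers
x = h[:, None] + rng.normal(0.0, np.sqrt(D1), size=(n, s))  # replicas

mean_x = x[:, 0].mean()             # should approach xhat
cross = (x[:, 0] * x[:, 1]).mean()  # a != b: D0 + xhat^2
diag = (x[:, 0] ** 2).mean()        # a == b: D1 + D0 + xhat^2
```

The cross-replica moment picks up only the variance of the shared center $h$, while the same-replica moment also contains the within-cluster fluctuation, which is exactly the structure of the last two lines of Eqs. (43).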

Given the 1RSB ansatz, we can plug Eq. (42) in Eq. (41) to get

 m_{ij\to i}(\underline{x}_i) \propto \exp\Big[\frac{1}{\sqrt{N}} S_{ij}\hat{x}_{j\to ij}\sum_{a=1}^{s} x^{(a)}_i - \frac{1}{2N}\hat{R}_{ij}\big(\Delta^{(1)}_{j\to ij}+\Delta^{(0)}_{j\to ij}+(\hat{x}_{j\to ij})^2\big)\sum_{a=1}^{s}(x^{(a)}_i)^2
 \qquad\qquad + \frac{1}{2N} S^2_{ij}\Delta^{(0)}_{j\to ij}\Big(\sum_{a=1}^{s} x^{(a)}_i\Big)^2 + \frac{1}{2N} S^2_{ij}\Delta^{(1)}_{j\to ij}\sum_{a=1}^{s}(x^{(a)}_i)^2\Big], \qquad (44)

where the matrices $S$ and $\hat{R}$ are the same as introduced in Eq. (7). Plugging this result into Eq. (40) we get

 m_{i\to ij}(\underline{x}_i) \propto \Big[\prod_{a=1}^{s} P_X(x^{(a)}_i)\Big] \exp\Big[T_{i\to ij}\sum_{a=1}^{s} x^{(a)}_i - \frac{1}{2}V^{(1)}_{i\to ij}\sum_{a=1}^{s}(x^{(a)}_i)^2 + \frac{1}{2}V^{(0)}_{i\to ij}\Big(\sum_{a=1}^{s} x^{(a)}_i\Big)^2\Big],

where we have defined

 T_{i\to ij} = \frac{1}{\sqrt{N}}\sum_{k\ne j} S_{ik}\hat{x}_{k\to ik}, \qquad (45)
 V^{(1)}_{i\to ij} = \frac{1}{N}\sum_{k\ne j}\Big[\hat{R}_{ik}\big(\Delta^{(1)}_{k\to ik}+\Delta^{(0)}_{k\to ik}+(\hat{x}_{k\to ik})^2\big) - S^2_{ik}\Delta^{(1)}_{k\to ik}\Big], \qquad (46)
 V^{(0)}_{i\to ij} = \frac{1}{N}\sum_{k\ne j} S^2_{ik}\Delta^{(0)}_{k\to ik}. \qquad (47)
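As an illustration of how Eqs. (45)-(47) are evaluated in practice, the sketch below computes the three fields in the scalar (rank-one) case with the cavity exclusion $k \ne j$ dropped, i.e. as full TAP-style sums. The matrices `S` and `R` and all cavity quantities are random placeholders, used only to exercise the formulas.

```python
import numpy as np

def asp_fields(S, R, xhat, D0, D1):
    """Scalar sketch of Eqs. (45)-(47) as full sums over k:
      T_i  = sum_k S_ik xhat_k / sqrt(N)
      V1_i = sum_k [ R_ik (D1_k + D0_k + xhat_k^2) - S_ik^2 D1_k ] / N
      V0_i = sum_k S_ik^2 D0_k / N
    """
    N = len(xhat)
    T = S @ xhat / np.sqrt(N)
    V1 = (R @ (D1 + D0 + xhat ** 2) - (S ** 2) @ D1) / N
    V0 = (S ** 2) @ D0 / N
    return T, V1, V0

N = 6
rng = np.random.default_rng(3)
S = rng.normal(size=(N, N))
R = rng.normal(size=(N, N)) ** 2  # placeholder for the channel matrix R-hat
T, V1, V0 = asp_fields(S, R, rng.normal(size=N),
                       np.full(N, 0.5), np.full(N, 0.2))
```

Note that $V^{(0)}$ is a sum of non-negative terms $S_{ik}^2 \Delta^{(0)}_{k\to ik}$, so it stays non-negative, consistent with its role as a variance of the cluster-center field in Eq. (51).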

We can now close the equations. We have

 \langle x^{(a)}_i \rangle = \frac{1}{s}\frac{\partial}{\partial T_{i\to ij}} f_{\rm in}[T_{i\to ij}, V^{(1)}_{i\to ij}, V^{(0)}_{i\to ij}], \qquad (48)
 \langle (x^{(a)}_i)^2 \rangle = -\frac{2}{s}\frac{\partial}{\partial V^{(1)}_{i\to ij}} f_{\rm in}[T_{i\to ij}, V^{(1)}_{i\to ij}, V^{(0)}_{i\to ij}], \qquad (49)
 \langle x^{(a)}_i x^{(b)}_i \rangle_{a\ne b} = \frac{2}{s(s-1)}\Big[\frac{\partial}{\partial V^{(0)}_{i\to ij}} + \frac{\partial}{\partial V^{(1)}_{i\to ij}}\Big] f_{\rm in}[T_{i\to ij}, V^{(1)}_{i\to ij}, V^{(0)}_{i\to ij}], \qquad (50)

where we have introduced the function

 f_{\rm in}[T, V^{(1)}, V^{(0)}] = \ln \int dh\, \sqrt{\frac{V^{(0)}}{2\pi}}\, e^{-\frac{1}{2}V^{(0)}h^2} \left\{\int dx\, P_X(x)\, \exp\Big[-\frac{1}{2}V^{(1)}x^2 + (T+V^{(0)}h)x\Big]\right\}^{s}. \qquad (51)
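For a discrete prior, $f_{\rm in}$ can be evaluated numerically. The sketch below does this for the Rademacher-Bernoulli prior of Eq. (37), where the inner integral over $x$ is a three-point sum; the $h$-integral is computed by Gauss-Hermite quadrature after the rescaling $h = u/\sqrt{V^{(0)}}$, under which the coupling becomes $(T + \sqrt{V^{(0)}}\,u)x$ with $u \sim \mathcal{N}(0,1)$. All parameter values are arbitrary illustrations.

```python
import numpy as np

def f_in(T, V1, V0, s, rho, n_quad=60):
    """Numerical f_in of Eq. (51) for the Rademacher-Bernoulli prior
    of Eq. (37) (support x in {-1, 0, +1})."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    u = np.sqrt(2.0) * t                       # standard-normal nodes
    xs = np.array([-1.0, 0.0, 1.0])
    px = np.array([rho / 2, 1.0 - rho, rho / 2])
    # inner partition function z(u) = sum_x P(x) exp(-V1 x^2/2 + (T + sqrt(V0) u) x)
    field = T + np.sqrt(V0) * u[:, None]
    z = (px * np.exp(-0.5 * V1 * xs ** 2 + field * xs)).sum(axis=1)
    # average z(u)^s over u ~ N(0,1) via Gauss-Hermite weights
    return np.log((w / np.sqrt(np.pi) * z ** s).sum())

# Sanity check: as V0 -> 0 the replicas decouple and
# f_in -> s * log z(T, V1) with a single-replica partition function z.
val = f_in(T=0.4, V1=0.3, V0=1e-10, s=3, rho=0.6)
z_single = 0.3 * np.exp(-0.15 + 0.4) + 0.4 + 0.3 * np.exp(-0.15 - 0.4)
```

The derivatives in Eqs. (48)-(50) can then be obtained from this function by finite differences, or analytically by differentiating under the quadrature sum.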

Using Eqs. (43), we obtain

 \hat{x}_{i\to ij} = \frac{1}{s}\frac{\partial}{\partial T_{i\to ij}} f_{\rm in}[T_{i\to ij}, V^{(1)}_{i\to ij}, V^{(0)}_{i\to ij}], \qquad (52)
 \Delta^{(0)}_{i\to ij} = \frac{2}{s(s-1