Constrained Low-rank Matrix Estimation:
Phase Transitions, Approximate
Message Passing and Applications.
Abstract
This article is an extended version of previous work of the authors lesieur2015phase (); lesieur2015mmse () on low-rank matrix estimation in the presence of constraints on the factors into which the matrix is factorized. Low-rank matrix factorization is one of the basic methods used in data analysis for unsupervised learning of relevant features and other types of dimensionality reduction. We present a framework to study constrained low-rank matrix estimation for a general prior on the factors and a general output channel through which the matrix is observed. We draw a parallel with the study of vector-spin glass models, presenting a unifying way to study a number of problems considered previously in separate statistical physics works. We present a number of applications of the problem in data analysis. We derive in detail a general form of the low-rank approximate message passing (Low-RAMP) algorithm, which is known in statistical physics as the TAP equations. We thus unify the derivation of the TAP equations for models as different as the Sherrington-Kirkpatrick model, the restricted Boltzmann machine, the Hopfield model, or vector (xy, Heisenberg and other) spin glasses. The state evolution of the Low-RAMP algorithm is also derived, and is equivalent to the replica symmetric solution for this large class of vector-spin glass models. In the section devoted to results we study in detail phase diagrams and phase transitions for Bayes-optimal inference in low-rank matrix estimation. We present a typology of phase transitions and their relation to the performance of algorithms such as Low-RAMP or commonly used spectral methods.
Contents
 I Introduction
 II Low-rank approximate message passing, Low-RAMP
 III State Evolution
 IV General results about low-rank matrix estimation
 V Phase diagrams for Bayes-optimal low-rank matrix estimation
 VI Discussion and perspective
 VII Acknowledgement
 A Mean Field equations
 B Bethe free energy derived by the Plefka expansion
 C Replica computation case.
 D Small expansion
 E Large rank behavior for the symmetric community detection
I Introduction
i.1 Problem Setting
In this paper we study a generic class of statistical physics models whose Boltzmann probability measure can be written in one of the two following forms:

Symmetric vector-spin glass model:
(1) P(X) = \frac{1}{Z_X(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} e^{g\left(Y_{ij},\, x_i^\top x_j/\sqrt{N}\right)}. Here Y and X are real-valued matrices, Y of size N×N and X of size N×r. In this case Y is a symmetric matrix. In statistical physics Y is called the quenched disorder. In the whole paper we denote by x_i the vector-spin (an r-dimensional column vector) that collects the elements of the i-th row of the matrix X; x_i^\top x_j is the scalar product of the two vectors. Z_X(Y) is the corresponding partition function playing the role of the normalization.

Bipartite vector-spin glass model:
(2) P(U,V) = \frac{1}{Z_{UV}(Y)} \prod_{i=1}^{N} P_U(u_i) \prod_{j=1}^{M} P_V(v_j) \prod_{i,j} e^{g\left(Y_{ij},\, u_i^\top v_j/\sqrt{N}\right)}. Defined as above; this time Y is an N×M matrix, U is of size N×r and V of size M×r. Again we denote by u_i, v_j the vector-spins of dimension r that collect the rows of the matrices U, V. In this case the graph of interactions between spins is bipartite.
The main motivation of this work is twofold. On the one hand, the above-mentioned probability measures are posterior probability measures of an important class of high-dimensional inference problems known as constrained low-rank matrix estimation. In what follows we give examples of applications of these matrix estimation problems in data processing and statistics. On the other hand, our motivation from the physics point of view is to present a unified formalism providing the (replica symmetric) solution for a large class of mean-field vector-spin models with disorder.
The general nature of the present work stems from the fact that the probability distributions P_X, P_U, P_V and the function g are very generic (the assumptions are summarized in section I.3). These functions can even depend on the node i or the edge (ij). For simplicity we will treat site-independent functions P_X, P_U, P_V and g, but the theory developed here generalizes very straightforwardly to the site- or edge-dependent case. From a statistical physics point of view, the terms P_X, P_U, P_V play the role of generic local magnetic fields acting on the individual spins. The distributions P_X, P_U, P_V describe the nature of the vector-spin variables and the fields that act on them. The simplest example is the widely studied case of Ising spins, for which r = 1 and x_i ∈ {-1, +1}, with a prior P_X whose bias is related to the usual magnetic field and inverse temperature. In this paper we treat a range of other examples, with r ≥ 1 and the elements of x_i being either discrete or continuous. This involves for instance spherical spin models with P_X being Gaussian, or Heisenberg spins where r = 3 and each x_i is confined to the surface of a sphere.
Denoting
(3) w_{ij} = \frac{x_i^\top x_j}{\sqrt{N}} \quad \text{or} \quad w_{ij} = \frac{u_i^\top v_j}{\sqrt{N}}
according to the symmetric or bipartite context, the terms g(Y_{ij}, w_{ij}) are then interactions between pairs of spins that depend only on the scalar product between the corresponding vectors. The most commonly considered form of interaction in statistical physics is simply
(4) g(Y_{ij}, w_{ij}) = \beta\, Y_{ij}\, w_{ij}
with \beta being a constant called the inverse temperature, leading to a range of widely considered models with pairwise interactions. We will refer to this form of the function g as the conventional Hamiltonian.
In order to complete the setting of the problem we need to specify how the quenched disorder Y is chosen. We will consider two main cases of quenched disorder, defined below. We note that even for problems where the matrix Y is not generated by either of the cases below our approach might still be relevant, e.g. for the restricted Boltzmann machine, which is equivalent to the above bipartite model with a matrix Y that was learned from data (see for instance gabrie2015training (); tramel2016inferring () and the discussion below).

Randomly quenched disorder: In this case the matrix elements of Y are chosen independently at random from some probability distribution. In this paper we will consider this distribution to be independent of the indices (ij), and in later parts for simplicity we will restrict its mean to be zero. This case of randomly quenched disorder encompasses many well-known and studied spin glass models such as the Ising spin glass, the Heisenberg spin glass, the spherical spin glass, or the Hopfield model.

Planted models: Concerning applications in data science this is the more interesting case, and most of this paper will focus on it. In this case we consider that there is some ground-truth value of X (or U, V) with rows that are generated independently at random from some probability distribution P_X (or P_U, P_V). The disorder Y is then generated element-wise as a noisy observation of the product x_i^\top x_j/\sqrt{N} (or u_i^\top v_j/\sqrt{N}) via an output channel characterized by the output probability distribution P_out.
i.2 Preliminaries on the planted setting
i.2.1 Bayes optimal inference
Many applications, listed and analyzed below, in which the planted setting is relevant concern problems where we aim to infer the ground-truth matrices X (or U, V) from the observed data Y and from the information we have about the distributions P_X, P_U, P_V and P_out. The information-theoretically optimal way of doing inference, if we know how the data and the ground-truth matrices were generated, is to follow Bayesian inference and compute marginals of the corresponding posterior probability distribution. According to the Bayes formula, the posterior probability distribution for the symmetric case is
(5) P(X|Y) = \frac{1}{Z_X(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}\!\left(Y_{ij} \,\Big|\, \frac{x_i^\top x_j}{\sqrt{N}}\right)
For the bipartite case it is
(6) 
Making the link with the Boltzmann probability measures (1) and (2), we see that Bayes-optimal inference of the planted configuration is equivalent to the statistical physics of the above vector-spin models with
(7) g(Y_{ij}, w_{ij}) = \log P_{\rm out}(Y_{ij} | w_{ij})\,, \qquad P_X = P_{X^0}\,, \quad P_U = P_{U^0}\,, \quad P_V = P_{V^0}
This approach is optimal in the sense that the statistical estimator computed from the data that minimizes the expected mean-squared error between the estimator and the ground truth is given by the mean of the marginal of the variable x_i in the probability distribution (5)
(8) \hat{x}_i = \int \mathrm{d}X\; x_i\, P(X|Y)
Analogously for the bipartite case.
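The optimality of the posterior mean can be illustrated numerically. The following sketch (our own toy example, not code from the paper) uses a scalar Rademacher signal observed through a Gaussian channel; the noise variance `Delta` is an arbitrary illustrative choice. The posterior mean here is tanh(y/Δ), and it achieves a lower mean-squared error than the maximum-a-posteriori estimator sign(y):

```python
import numpy as np

rng = np.random.default_rng(0)
Delta = 0.5          # noise variance (illustrative choice)
n_trials = 200_000

x_star = rng.choice([-1.0, 1.0], size=n_trials)          # Rademacher ground truth
y = x_star + np.sqrt(Delta) * rng.normal(size=n_trials)  # Gaussian observations

# posterior mean for a uniform +-1 prior and Gaussian channel: tanh(y/Delta)
x_mmse = np.tanh(y / Delta)
# maximum-a-posteriori estimator: sign(y)
x_map = np.sign(y)

mse_mmse = np.mean((x_mmse - x_star) ** 2)
mse_map = np.mean((x_map - x_star) ** 2)
assert mse_mmse < mse_map   # the posterior mean minimizes the MSE
```

The same comparison can be made for any prior and channel; only the form of the posterior mean changes.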
In the Bayes-optimal setting defined by conditions (7), the statistical physics analysis of the problem presents important simplifications known as the Nishimori conditions NishimoriBook (); zdeborova2015statistical (), which will be used extensively in the present paper. These conditions can be proven and stated without the methodology developed below; they are a direct consequence of the Bayes formula for conditional probability and of basic properties of probability distributions.
Assume Bayes-optimality of the output channel, that is g(Y, w) = log P_out(Y|w). First let us notice that every probability distribution has to be normalized
(9) \int \mathrm{d}Y\; P_{\rm out}(Y|w) = 1
Differentiating the above equation with respect to w one gets
(10)  
(11) 
Anticipating the derivation that follows, we also define the inverse Fisher information \Delta of an output channel at w = 0 as
(12) \frac{1}{\Delta} = \int \mathrm{d}Y\; \frac{\left[\partial_w P_{\rm out}(Y|w)\big|_{w=0}\right]^2}{P_{\rm out}(Y|w=0)}
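For the Gaussian output channel introduced later in eq. (21), the Fisher information at w = 0 equals 1/Δ, the inverse of the noise variance. This can be checked by Monte Carlo, estimating the expectation of the squared score; the sketch below is our own illustration with an arbitrarily chosen `Delta`:

```python
import numpy as np

rng = np.random.default_rng(1)
Delta = 0.3
# samples of Y drawn from the channel at w = 0, i.e. Y ~ N(0, Delta)
y = rng.normal(0.0, np.sqrt(Delta), size=1_000_000)

# score of the Gaussian channel: d/dw log P_out(y|w) at w = 0 equals y/Delta
score = y / Delta
fisher = np.mean(score ** 2)   # Monte Carlo estimate of the Fisher information
assert abs(fisher - 1.0 / Delta) < 0.05 * (1.0 / Delta)
```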
Let us now assume Bayes-optimality also for the prior distributions, i.e. that the assumed priors equal the distributions P_X, P_U, P_V from which the ground truth X^0, U^0, V^0 was generated. Then under averages over the posterior distribution (5) or (6) we can replace the random variables X, U, V by the ground truth X^0, U^0, V^0 and vice versa. To prove this let us consider the symmetric case; the generalization to the bipartite case is straightforward. Take X^1, X^2, X^3 to be three independent samples from the posterior probability distribution P(X|Y), eq. (5). We then consider some function f of two configurations of the variables. Consider the following two expectations
(13)  
(14)  
where we used the Bayes formula. We further observe that X^1 is independent of X^2 when conditioned on Y. In the Bayes-optimal case, i.e. when the assumed prior and output channel match those used to generate the data, we then obtain
(15) 
meaning that under expectations there is no statistical difference between the ground-truth assignment of variables and an assignment sampled uniformly at random from the posterior probability distribution (5). This is a simple yet important property that will lead to numerous simplifications in the Bayes-optimal case, and it will be used in several places in this paper under the name Nishimori condition.
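The Nishimori condition can be verified numerically on a tiny instance. The sketch below (our own toy, not code from the paper) plants a Rademacher signal in a 3-spin symmetric model with a Gaussian channel, computes the posterior exactly by enumeration, and checks that the average overlap of the posterior mean with the ground truth equals the average overlap between two independent posterior samples:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, Delta, n_samples = 3, 1.0, 10_000
# all 2^N spin configurations and all pairs (i < j)
configs = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]

lhs = rhs = 0.0
for _ in range(n_samples):
    x_star = rng.choice([-1.0, 1.0], size=N)
    # planted observations Y_ij = x*_i x*_j / sqrt(N) + Gaussian noise
    y = {p: x_star[p[0]] * x_star[p[1]] / np.sqrt(N)
            + np.sqrt(Delta) * rng.normal() for p in pairs}
    # exact posterior over the 2^N configurations (uniform Rademacher prior)
    logw = np.array([sum(-(y[p] - c[p[0]] * c[p[1]] / np.sqrt(N)) ** 2 / (2 * Delta)
                         for p in pairs) for c in configs])
    probs = np.exp(logw - logw.max())
    probs /= probs.sum()
    m = probs @ configs        # posterior mean <x>
    lhs += m @ x_star          # overlap with the ground truth
    rhs += m @ m               # overlap between two independent posterior samples
lhs /= n_samples
rhs /= n_samples
assert abs(lhs - rhs) < 0.1   # Nishimori: the two overlaps agree on average
```

Replacing the prior or the channel in the posterior by a mismatched one breaks the equality, which is a useful sanity check for Bayes-optimal implementations.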
From the point of view of the statistical physics of disordered systems, the most striking property of systems that verify the Nishimori conditions is that there cannot be any replica symmetry breaking in the equilibrium solution of these systems nishimori2001absence (); NishimoriBook (); zdeborova2015statistical (). This simplifies considerably the analysis of Bayes-optimal inference. Note, however, that metastable (out-of-equilibrium) properties of Bayes-optimal inference do not have to satisfy the Nishimori conditions, and replica symmetry breaking might be needed for their correct description (this will be relevant in the cases of first-order phase transitions described in section IV.3).
i.2.2 Principal component analysis
In the above probabilistic inference setting the Bayesian approach of computing the marginals is optimal. However, in a majority of the interesting cases it is computationally intractable (NP-hard) to compute the marginals exactly. In all low-rank estimation problems, the method that (for the bipartite case) comes to mind first when we look for a low-rank matrix close to an observed matrix Y is the singular value decomposition (SVD), where a rank-r matrix that minimizes the squared difference is computed
(16) \hat{Y} = \sum_{k=1}^{r} \sigma_k\, u_k v_k^\top
where \sigma_k is the k-th largest singular value of Y, and u_k, v_k are the corresponding left-singular and right-singular vectors of the matrix Y. The above property is known as the Eckart-Young-Mirsky theorem eckart1936approximation (). In the symmetric case we simply replace the singular values by eigenvalues and the singular vectors by the eigenvectors. The above unconstrained low-rank approximation of the matrix Y, eq. (16), is also often referred to as principal component analysis (PCA), because indeed when the matrix Y is interpreted as N samples of M-dimensional data, the right-singular vectors are the directions of greatest variability in the data.
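The Eckart-Young-Mirsky statement is easy to check numerically. The following sketch (our own illustration, with arbitrary dimensions) builds the best rank-r approximation from the SVD and compares its Frobenius error with that of another rank-r matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 30, 20, 3
A = rng.normal(size=(n, m))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] * s[:r] @ Vt[:r]   # best rank-r approximation (Eckart-Young-Mirsky)

err_svd = np.linalg.norm(A - A_r) ** 2
# the Frobenius error equals the sum of the discarded squared singular values
assert np.isclose(err_svd, np.sum(s[r:] ** 2))

# any other rank-r matrix, e.g. a projection onto a random r-dimensional subspace,
# has a larger (or equal) Frobenius error
P = rng.normal(size=(m, r))
B = A @ P @ np.linalg.pinv(P)     # rank at most r
assert np.linalg.norm(A - B) ** 2 >= err_svd
```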
PCA and SVD are the methods of choice when the measure of performance is the sum of squared differences between the observed and estimated matrices, and when there are no particular requirements on the elements of the matrices X, U or V.
The methodology developed in this paper for the planted probabilistic models generalizes to an arbitrary cost function that can be expressed as a product of element-wise terms, and to arbitrary constraints on the rows of the matrices X, U, V, as long as they can be described by row-wise probability distributions P_X, P_U, P_V. Systematically comparing our results to the performance of PCA is useful because PCA is well known and many researchers have good intuition about its strengths and limitations in various settings.
i.3 The large size limit, assumptions and channel universality
In this article we focus on the thermodynamic limit where N → ∞ (and, in the bipartite case, M → ∞ with the ratio M/N fixed), whereas the rank r is of order one and all the elements of x_i, u_i and v_j are of order 1. The functions P_X, P_U, P_V and g do not depend on N explicitly. In the planted model the distributions P_X, P_U, P_V and P_out also do not depend on N explicitly. The only other requirement we impose on the distributions P_X, P_U, P_V and P_out is that they all have a finite second moment.
The factor 1/\sqrt{N} in the second argument of the function g ensures that the behaviour of the above models is nontrivial and that there is an interesting competition between the local magnetic fields P_X, P_U, P_V and the interactions. Physics readers acquainted with the Sherrington-Kirkpatrick (SK) model will recognize this factor, because in the SK model the interactions between the Ising spins that lead to an extensive free energy are also of this order (with a mean that is of order 1/N). This is to be compared to the ferromagnetic Ising model on a fully connected lattice, for which the interactions leading to an extensive free energy scale as 1/N.
For readers interested in inference problems, i.e. the planted setting, the factor 1/\sqrt{N} is the scaling of the signal-to-noise ratio for which inference of the unknowns from the measurements Y is neither trivially easy nor trivially impossible. In the planted setting Y can be viewed as a random matrix with a rank-r perturbation. The regime where the eigenvalues of dense random matrices with low-rank perturbations split from the bulk of the spectral density is precisely when the strength of the perturbation scales in this way, see e.g. baik2005phase ().
We are therefore looking at statistical physics models with pairwise interactions where each of the interactions depends only weakly on the configuration of the vector-spins. As a consequence, properties of the system in the thermodynamic limit depend only weakly on the details of the interaction function g(Y_{ij}, w_{ij}), with w_{ij} given by (3). The results of this paper hold for every function g for which the following Taylor expansion is well defined
(17) g(Y, w) = g(Y, 0) + w\, \partial_w g(Y, 0) + \frac{w^2}{2}\, \partial_w^2 g(Y, 0) + O(w^3)
In order to simplify the notation, in the following we denote
(18) S_{ij} = \frac{\partial g(Y_{ij}, w)}{\partial w}\Big|_{w=0}
(19) R_{ij} = S_{ij}^2 + \frac{\partial^2 g(Y_{ij}, w)}{\partial w^2}\Big|_{w=0}
We will refer to the matrix S as the Fisher score matrix. The above expansion can now be written in a more compact way
(20) e^{g(Y_{ij}, w_{ij})} = e^{g(Y_{ij}, 0)}\left[1 + S_{ij}\, w_{ij} + \frac{R_{ij}\, w_{ij}^2}{2} + O(w_{ij}^3)\right]
Let us now analyze the orders in this expansion. In the Boltzmann measures (1) and (2) the term g(Y_{ij}, w_{ij}) appears in a product over O(N^2) pairs, while w_{ij} is of order 1/\sqrt{N}. At the same time, only terms of order N in the exponent of the Boltzmann measure influence the leading order (in N) of the marginals (local magnetizations); therefore all the terms that depend on the 3rd and higher orders of w are negligible. This means that the leading order of the marginals depends on the function g only through the matrices of its first and second derivatives at w = 0, denoted S and R (18-19). This means in particular that in order to understand the phase diagram of a model with general g we only need to consider one more order than in the conventional Hamiltonian considered in statistical physics (4).
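For the Gaussian output channel of eq. (21) this expansion can be made fully explicit: the score is S = Y/Δ, the second derivative of log P_out is −1/Δ, and the expansion of log P_out truncates exactly at second order. A minimal numerical check (our own sketch, with arbitrarily chosen values of y, w and Δ):

```python
import numpy as np

Delta = 0.5
y, w = 0.7, 0.01   # one observation and a small interaction term, w = O(1/sqrt(N))

def log_pout(y, w):
    # Gaussian output channel: P_out(y|w) is Gaussian with mean w, variance Delta
    return -(y - w) ** 2 / (2 * Delta) - 0.5 * np.log(2 * np.pi * Delta)

S = y / Delta   # Fisher score, d/dw log P_out at w = 0
exact = log_pout(y, w) - log_pout(y, 0.0)
# second-order expansion; for this channel all higher orders vanish identically
expansion = S * w - w ** 2 / (2 * Delta)
assert np.isclose(exact, expansion)
```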
For the sake of specificity, let us state here the two examples of output channels considered most prominently in this paper and their corresponding matrices S and R. The first example corresponds to observations of low-rank matrices with additive Gaussian noise; we will refer to this as the Gaussian output channel
(21) g(Y, w) = \log P_{\rm out}(Y|w) = -\frac{(Y-w)^2}{2\Delta} - \frac{1}{2}\log(2\pi\Delta)\,, \qquad S = \frac{Y}{\Delta}\,, \quad R = \frac{Y^2 - \Delta}{\Delta^2}
where \Delta is the variance of the noise; for the specific case of the Gaussian output channel, \Delta also coincides with the inverse Fisher information as defined in eq. (12). The second example is the one most often considered in physics, given by eq. (4)
(22) g(Y, w) = \beta\, Y w\,, \qquad S = \beta Y\,, \quad R = \beta^2 Y^2
with \beta being a constant called the inverse temperature. Another, quite different, example of an output channel will be given to model community detection in networks in section I.4.2.
i.4 Examples and applications
The Boltzmann measures (1) and (2), together with the model for the disorder, can be used to describe a range of problems of practical and scientific interest studied previously in physics and/or in data science. In this section we list several examples and applications for each of the four categories – the symmetric and bipartite case, with randomly quenched and with planted disorder.
i.4.1 Examples with randomly quenched disorder
SherringtonKirkpatrick (SK) model.
The SK model SherringtonKirkpatrick75 () stands at the roots of the theory of spin glasses. It can be described by the symmetric Boltzmann measure (1) with the conventional Hamiltonian (4).
The x_i are Ising spins, i.e. r = 1 and x_i \in \{-1, +1\}, with distribution
(23) P_X(x) = \frac{e^{h x}}{2\cosh h}\left[\delta(x-1) + \delta(x+1)\right]
The parameter h is related to the inverse temperature \beta and the external magnetic field B as h = \beta B. Note that the parameter could also be site-dependent and our approach would generalize, but in this paper we work with site-independent functions P_X.
The elements of the (symmetric) matrix Y are the quenched random disorder, i.e. they are generated independently at random from some probability distribution. The most commonly considered distributions of disorder are the normal distribution, or binary elements \pm J taken with probability 1/2 each.
The algorithm developed in this paper for the general case corresponds to the Thouless-Anderson-Palmer ThoulessAnderson77 () equations for the SK model. The theory developed here corresponds to the replica symmetric solution of SherringtonKirkpatrick75 (). Famously, this solution is wrong below a certain temperature where effects of replica symmetry breaking (RSB) have to be taken into account. In this paper we focus on the replica symmetric solution, which leads to exact and novel phase diagrams for the planted models. The RSB solution in the present generic setting will be presented elsewhere. We present the form of the TAP equations in the general case, encompassing a range of existing works.
Spherical spin glass.
Next to the SK model, the spherical spin glass model kosterlitz1976spherical () stands behind a large fraction of our understanding of spin glasses. Mathematically much simpler than the SK model, it stands as a prominent case in developments in mathematical physics. The spherical spin glass is formulated via the symmetric Boltzmann measure (1) with the conventional Hamiltonian (4). The prior P_X enforces (canonically) the spherical constraint on the configurations. An external magnetic field can also be included in P_X.
The disorder is most commonly randomly quenched in physics studies of the spherical spin glass model.
Heisenberg spin glass.
In the Heisenberg spin glass sommers1981theory () the Hamiltonian is again the conventional symmetric one with randomly quenched disorder. The spins are three-dimensional vectors, r = 3, of unit length. The magnetic field influences the direction of the spin, so that
(24) P_X(x_i) \propto \delta(\|x_i\|_2 - 1)\, e^{h^\top x_i}
where h is the external field. The more general m-component model was also studied extensively in the spin glass literature gabay1981coexistence ().
Restricted Boltzmann Machine.
Restricted Boltzmann machines (RBMs) are one of the triggers of the recent revolution in machine learning called deep learning hinton2010practical (); hinton2006fast (); lecun2015deep (). The way RBMs are used in machine learning is that one considers the bipartite Boltzmann measure (2). In the training phase one searches for a weight matrix such that the data, represented as a set of configurations of the visible variables, have high likelihood (low energy). The other variables are called the hidden units, and the columns of the weight matrix (each corresponding to one hidden unit) are often interpreted as features that are useful to explain the structure of the data.
The RBM is most commonly considered for the conventional Hamiltonian and for binary visible and hidden variables. But other distributions for both the data variables and the hidden variables were considered in the literature, and the approach of the present paper applies to all of them.
We note that the disorder obtained for an RBM trained on real datasets does not belong to the classes for which the theory developed in this paper is valid (training introduces involved correlations). However, it was shown recently that the Low-RAMP equations as studied in the present paper can be used efficiently for training of the RBM kappen1998boltzmann (); gabrie2015training ().
The RBM with Gaussian hidden variables is related to the well-known Hopfield model of associative memory hopfield1982neural (). Therefore the properties of the bipartite Boltzmann measure (2) with randomly quenched disorder are in one-to-one correspondence with the properties of the Hopfield model. This relation, in the view of the TAP equations, was studied recently in mezard2016mean ().
i.4.2 Examples with planted disorder
So far we covered examples where the disorder was randomly quenched (or more complicated, as in the RBM). The next set of examples involves the planted disorder that is more relevant for applications in signal processing or statistics, where the variables X, U, V represent some signal we aim to recover from its measurements Y. Sometimes it is the low-rank matrix itself that we aim to recover from its noisy measurements Y. In the literature the general planted problem can be called low-rank matrix factorization, matrix recovery, matrix denoising or matrix estimation.
Gaussian estimation
The most elementary example of the planted case is when the measurement channel is Gaussian as in eq. (21), and the distributions P_X, P_U and P_V are also Gaussian, i.e.
(25) P_X(x_i) = \mathcal{N}(x_i;\, \bar{x},\, \Sigma_X)
(26) P_U(u_i) = \mathcal{N}(u_i;\, \bar{u},\, \Sigma_U)
(27) P_V(v_j) = \mathcal{N}(v_j;\, \bar{v},\, \Sigma_V)
where \bar{x}, \bar{u}, \bar{v} are the means of the distributions and \Sigma_X, \Sigma_U, \Sigma_V the covariance matrices, with w given by (3).
We speak about the estimation problem as Bayes-optimal Gaussian estimation if the disorder was generated according to
(28) Y_{ij} \sim P_{\rm out}(\,\cdot \mid w^0_{ij})
where P_out is given by eq. (21), and
(29) w^0_{ij} = \frac{(x^0_i)^\top x^0_j}{\sqrt{N}} \quad \text{or} \quad w^0_{ij} = \frac{(u^0_i)^\top v^0_j}{\sqrt{N}}
with the ground-truth factors X^0, or U^0 and V^0, generated from the probability distributions (25)-(27). The goal is to estimate the matrices X, or U and V, from Y.
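A generative sketch of this Bayes-optimal Gaussian estimation problem (our own illustration; the sizes, rank and noise variance are arbitrary choices) draws Gaussian factors with zero mean and identity covariance and passes their product through the Gaussian output channel:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, r, Delta = 200, 150, 2, 0.5   # illustrative sizes and noise variance

# ground-truth factors with i.i.d. standard Gaussian rows
U = rng.normal(size=(N, r))
V = rng.normal(size=(M, r))

# observations through the Gaussian output channel: Y = U V^T / sqrt(N) + noise
Y = U @ V.T / np.sqrt(N) + np.sqrt(Delta) * rng.normal(size=(N, M))

noise = Y - U @ V.T / np.sqrt(N)
assert Y.shape == (N, M)
assert abs(noise.var() - Delta) < 0.05 * Delta   # empirical noise variance is ~Delta
```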
Gaussian Mixture Clustering
Another example belonging to the class of problems discussed in this paper is the model for Gaussian mixture clustering. In this case the spin variables are such that
(30) P_U(u_i) = \frac{1}{k}\sum_{a=1}^{k} \delta(u_i - e_a)
where k is the number of clusters, and e_a is a k-dimensional unit vector with all components except the a-th equal to zero, and the a-th component equal to 1; e.g. for k = 3 we have e_1 = (1,0,0), e_2 = (0,1,0), e_3 = (0,0,1). Having u_i = e_a is interpreted as data point i belonging to cluster a. We have N data points.
The columns of the matrix V then represent the centroids of each of the clusters in the M-dimensional space. The distribution P_V can, as an example, take the Gaussian form (27) with the covariance being the identity and the mean being zero. The output channel is Gaussian as in (21). All together this means that Y collects the positions of N points in M-dimensional space that are organized in k Gaussian clusters. The goal is to estimate the centers of the clusters and the cluster memberships from Y.
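A generative sketch of the Gaussian mixture model in this low-rank form (our own illustration; the number of points, dimension and number of clusters are arbitrary choices) is:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, k, Delta = 300, 50, 3, 1.0   # N points in M dimensions, k clusters

labels = rng.integers(k, size=N)
U = np.eye(k)[labels]              # one-hot cluster indicators, the rows of U
V = rng.normal(size=(M, k))        # columns of V are the cluster centroids

# each data point is the centroid of its cluster (scaled by 1/sqrt(N)) plus noise
Y = U @ V.T / np.sqrt(N) + np.sqrt(Delta) * rng.normal(size=(N, M))

assert Y.shape == (N, M)
assert np.all(U.sum(axis=1) == 1)  # every point belongs to exactly one cluster
```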
Standard algorithms for data clustering include those based on the spectral decomposition of the matrix Y, such as principal component analysis hastie2005elements (); wasserman2013all (), or Lloyd's k-means lloyd1982least (). Works on Gaussian mixtures that are methodologically closely related to the present paper include the application of the replica method to the case of two clusters in watkin1994optimal (); barkai1994statistical (); biehl1994statistical (), or the AMP algorithm of NIPS2013_5074 (). Note that for two clusters with the two centers being symmetric around the origin, the resulting Boltzmann measure of the case with randomly quenched disorder is equivalent to the Hopfield model as treated e.g. in mezard2016mean ().
Note also that there are interesting variants of Gaussian mixture clustering, such as subspace clustering parsons2004subspace (), where only some of the directions are relevant for the clustering. This can be modeled by a prior on the centroid vectors that has a nonzero weight on the null vector.
The approach described in the present paper for Bayes-optimal inference in the Gaussian mixture clustering problem has been used in work of the authors with other collaborators, presented in lesieur2016phase ().
Sparse PCA
Sparse Principal Component Analysis (PCA) johnstone2004sparse (); zou2006sparse () is a dimensionality reduction technique where one seeks a low-rank representation of a data matrix with additional sparsity constraints on the obtained representation. The motivation is that sparsity helps to interpret the resulting representation. Formulated within the above setting, sparse PCA corresponds to the bipartite case (2). The variables u_i are considered unconstrained; as an example one often considers a Gaussian prior (26). The variables v_j are such that many of the matrix elements of V are zero.
In the literature the sparse PCA problem was mainly considered in the rank-one case, for which a series of intriguing results was derived. The authors of johnstone2004sparse () suggested an efficient algorithm called diagonal thresholding that solves the sparse PCA problem (i.e. estimates correctly the positions and the values of the nonzero elements) whenever the number of data samples exceeds a threshold depending on the number of nonzeros amini2008high (). More recent works show the existence of efficient algorithms that need fewer samples deshpande2014sparse (). For very sparse signals this is a striking improvement over conventional PCA. This is why sparsity linked together with PCA brought excitement into data processing methods. At the same time, this result is not as positive as it may seem, because by searching exhaustively over all positions of the nonzeros the correct support can be discovered with high probability with an even smaller number of samples.
Naively one might think that polynomial algorithms that need fewer samples might exist, and a considerable amount of work was devoted to their search, without success. However, some works suggest that polynomial algorithms that solve the sparse PCA problem in this regime of sample sizes do not exist. Among the most remarkable is the work krauthgamer2015semidefinite () showing that the SDP algorithm, which is otherwise considered rather powerful, fails in this regime. The work of berthet2013computational () goes even further, showing that if sparse PCA can be solved in this regime then a problem known as the planted clique problem can also be solved in a regime that has been considered algorithmically hard for several decades already.
The problem of sparse PCA is hence one of the first examples of a relatively simply stated problem that presents a wide barrier between computational and statistical tractability. A deeper description of the origin of this barrier is likely to shed light on our understanding of typical algorithmic complexity in a broader scope.
The above works consider the scaling where the number of nonzeros is fixed (or grows more slowly than the dimension). A regime natural to many applications is when the number of nonzeros is proportional to the dimension. This regime was considered in DeshpandeM14 (), where it was shown that there is a region in which an efficient algorithm achieving the information-theoretically optimal performance exists. This immediately brings the question of what exactly happens outside this region, and of how the barrier described above appears. This question was illuminated in a work by the present authors lesieur2015phase () and will be developed further in this paper.
We consider sparse PCA in the bipartite case, with a Gaussian prior (27) on one factor and a sparse prior on the other,
(31) 
as corresponds to the formulation of johnstone2004sparse (); zou2006sparse (); amini2008high (); berthet2013computational () and others. This probabilistic setting of sparse PCA was referred to as the spiked Wishart model in DeshpandeM14 (), a denomination that we will adopt in the present paper. This model is also equivalent to the one studied recently in monasson2015estimating (), where the authors simply integrate over the Gaussian variables.
In DeshpandeM14 () the authors also considered a symmetric variant of sparse PCA, and refer to it as the spiked Wigner model. The spiked Wigner model is closer to the planted clique problem, which can be formulated using (1) with a prior having many zero elements. In the present work we will consider several models for the prior distribution P_X. The first is the spiked Bernoulli model as in DeshpandeM14 (), where
(32) P_X(x) = \rho\, \delta(x-1) + (1-\rho)\, \delta(x)
The spiked Bernoulli model can also be interpreted as a problem of submatrix localization, where a submatrix of the matrix Y has a larger mean than a randomly chosen submatrix. Submatrix localization is also relevant in the bipartite case, where it has many potential applications. The most striking one is in gene expression, where large-mean submatrices of the matrix of gene expressions of different patients may correspond to groups of patients having the same type of disease madeira2004biclustering (); cheng2000biclustering ().
In this paper we will also consider the spiked Rademacher-Bernoulli model with
(33) P_X(x) = \frac{\rho}{2}\, \delta(x-1) + \frac{\rho}{2}\, \delta(x+1) + (1-\rho)\, \delta(x)
as well as the spiked Gauss-Bernoulli model
(34) P_X(x) = \rho\, \mathcal{N}(x; 0, 1) + (1-\rho)\, \delta(x)
where \mathcal{N}(x; 0, 1) is the Gaussian distribution with zero mean and unit variance.
So far we discussed the sparse PCA problem in the rank-one case, but the case with larger rank r is also interesting, especially in view of the question of how the algorithmic barrier depends on the rank. To investigate this question, in lesieur2015phase () we also considered jointly-sparse PCA, where entire r-dimensional rows of X are zero at once, and the nonzero rows are Gaussian with zero mean and identity covariance. Mathematically,
(35) P_X(x_i) = \rho\, \mathcal{N}(x_i; 0, I_r) + (1-\rho)\, \delta(x_i)
Another example to consider is independently-sparse PCA, where each of the components of the rows of X is taken independently from the Gauss-Bernoulli distribution; we then have
(36) P_X(x_i) = \prod_{a=1}^{r}\left[\rho\, \mathcal{N}(x_{ia}; 0, 1) + (1-\rho)\, \delta(x_{ia})\right]
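The different sparse priors above can be sampled as follows (our own sketch; the sparsity level `rho`, rank and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
N, r, rho = 100_000, 2, 0.1   # rho = fraction of nonzeros

mask = rng.random(N) < rho
x_bernoulli = mask.astype(float)                        # spiked Bernoulli: x in {0, 1}
x_rademacher = mask * rng.choice([-1.0, 1.0], size=N)   # Rademacher-Bernoulli
x_gauss = mask * rng.normal(size=N)                     # Gauss-Bernoulli

# jointly-sparse rank-r prior: entire r-dimensional rows are zero at once
row_mask = rng.random(N) < rho
X_joint = row_mask[:, None] * rng.normal(size=(N, r))

# independently-sparse prior: each component is Gauss-Bernoulli on its own
X_indep = (rng.random((N, r)) < rho) * rng.normal(size=(N, r))

assert abs(np.mean(mask) - rho) < 0.01        # empirical sparsity is ~rho
assert np.all(X_joint[~row_mask] == 0)        # zero rows are entirely zero
```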
Community detection
Detection of communities in networks is often modeled by the stochastic block model (SBM), where pairs of nodes get connected with a probability that depends on the indices of the groups to which the two nodes belong. Community detection is a widely studied problem, see e.g. the review fortunato2010community (). Studies of statistical and computational barriers in the SBM recently became very active in mathematics and statistics, starting perhaps with a series of statistical physics conjectures about the existence of phase transitions in the problem decelle2011inference (); decelle2011asymptotic (). The particular interest of those works is that they are set in the sparse regime where every node has O(1) neighbors.
The dense regime, where every node has O(N) neighbors, is also theoretically interesting when the difference between the probabilities to connect depends only weakly on the group membership. The relevant scaling is the same as in the Potts glass model studied in GrossKanter85 (); in fact dense community detection is exactly the planted version of this Potts glass model. In the setting of the present model (1), community detection was already considered in deshpande2016asymptotic () for two symmetric groups, in lesieur2015mmse (), and in barbier2016mutual (). In the present paper we detail the results reported briefly in lesieur2015mmse () and in barbier2016mutual (). We consider the case with a general number of equal-sized groups, the symmetric case, and also a case with two groups of different sizes, but such that the average degree in each of the groups is the same.
To set up the dense community detection problem we consider a network with N nodes. Each node belongs to one of the communities. For each pair of nodes we create an edge with a probability given by the corresponding element of the connectivity matrix. In the examples of this paper we will consider two special cases of the community detection problem.
The first example has q symmetric equally-sized groups: for each pair of nodes we create an edge between the two nodes with probability p_in if they are in the same group and with probability p_out if not:
(37) C = (p_{\rm in} - p_{\rm out})\, I_q + p_{\rm out}\, J_q
where I_q is the q-dimensional identity matrix, and J_q is a q×q matrix filled with unit elements. The scaling we consider here is
(38)  
(39) 
so that the average degree in the graph is extensive. Note, however, that by a rather generic correspondence between diluted and dense models, which has been made rigorous recently deshpande2016asymptotic (); caltagirone2016recovering (), the results derived in this case hold even for average degrees that diverge only very mildly with N. The goal is to infer the group index of each node purely from the adjacency matrix of the network (up to a permutation of the group indices). This problem is transformed into the low-rank matrix factorization problem through the use of the following prior probability distribution
$$P_X(x) = \frac{1}{r} \sum_{t=1}^{r} \delta(x - v_t)\,, \qquad (40)$$
where $v_t$ is the vector with $-1/r$ everywhere except a $1-1/r$ at position $t$. Eq. (40) is just a special case of (30). The output channel that describes the process of creation of the graph is
$$P_{\rm out}(A_{ij}=1\,|\,w) = \bar p + \mu\, w\,, \qquad (41)$$
$$P_{\rm out}(A_{ij}=0\,|\,w) = 1 - \bar p - \mu\, w\,, \qquad (42)$$
with $w = x_i^\top x_j/\sqrt{N}$.
Next to the conventional Hamiltonian (22) and the Gaussian noise (21), the SBM output (41-42) is a third example of an output channel that we consider in this article. It will be used to illustrate the simplicity that arises due to the channel universality, as also considered in deshpande2016asymptotic () and lesieur2015mmse (). Here, we obtain for the output matrices
$$S_{ij} = \frac{\mu\,(A_{ij} - \bar p)}{\bar p\,(1-\bar p)}\,, \qquad (43)$$
$$R_{ij} = 0\,. \qquad (44)$$
Here $\mu$ is a parameter that can be used to fix the signal to noise ratio.
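As an illustration, the following minimal Python sketch (our own, not part of any released code) samples a dense symmetric SBM with the above scaling and forms the corresponding rescaled score matrix; the variable names and the particular values of $N$, $r$, $\bar p$, $\mu$ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, p_bar, mu = 600, 3, 0.5, 5.0

t = rng.integers(r, size=N)                        # hidden group labels
p_in = p_bar + mu / np.sqrt(N) * (r - 1) / r       # same-group edge probability
p_out = p_bar - mu / np.sqrt(N) / r                # cross-group edge probability

P = np.where(t[:, None] == t[None, :], p_in, p_out)
U = np.triu(rng.random((N, N)) < P, k=1).astype(float)
A = U + U.T                                        # symmetric adjacency, no self-loops

S = mu * (A - p_bar) / (p_bar * (1.0 - p_bar))     # rescaled score matrix of the channel
```

Since the difference between $p_{\rm in}$ and $p_{\rm out}$ is only of order $1/\sqrt{N}$, the group structure is invisible in the degrees and only detectable through the matrix $S$.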
Another example of community detection is the one with two balanced communities, i.e. having different sizes but the same average degree. In that setting there are two communities of sizes $\rho N$ and $(1-\rho) N$ with $\rho \le 1/2$. The connectivity matrix of this model is given by
$$C = \bar p\, \mathbb{J}_2 + \frac{\mu}{\sqrt{N}} \begin{pmatrix} \frac{1-\rho}{\rho} & -1 \\ -1 & \frac{\rho}{1-\rho} \end{pmatrix}\,. \qquad (45)$$
This can be modelled as the symmetric matrix factorization with rank $r=1$ and the prior given as
$$P_X(x) = \rho\, \delta\!\left(x - \sqrt{\frac{1-\rho}{\rho}}\right) + (1-\rho)\, \delta\!\left(x + \sqrt{\frac{\rho}{1-\rho}}\right)\,. \qquad (46)$$
The values in $C$ are chosen so that each community has an average degree of $\bar p N$. The fact that in both of these cases each community has the same average degree means that one cannot hope to just use the histogram of degrees to make the distinction between the communities. The output channel here is identical to the one given in (41-42).
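The degree-balance property can be checked numerically; the short sketch below (ours, with illustrative parameter values) evaluates the expected degrees of the two groups for a rank-one factor taking one value on each group and edge probabilities of the form $\bar p + \mu\, v_t v_s/\sqrt{N}$.

```python
import numpy as np

N, rho, p_bar, mu = 10000, 0.3, 0.4, 2.0

# Rank-one factor values for the two groups; zero mean under the prior
v1 = np.sqrt((1.0 - rho) / rho)
v2 = -np.sqrt(rho / (1.0 - rho))

# Edge probabilities p_{ts} = p_bar + mu * v_t * v_s / sqrt(N)
p11 = p_bar + mu * v1 * v1 / np.sqrt(N)
p12 = p_bar + mu * v1 * v2 / np.sqrt(N)
p22 = p_bar + mu * v2 * v2 / np.sqrt(N)

deg1 = N * (rho * p11 + (1.0 - rho) * p12)   # expected degree in group 1
deg2 = N * (rho * p12 + (1.0 - rho) * p22)   # expected degree in group 2
# both equal N * p_bar: the size imbalance leaves no trace in the degrees
```

The exact cancellation of the $O(\sqrt{N})$ terms in both expected degrees is what makes degree-based detection impossible in this balanced setting.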
A third example of community detection is locating one denser community in a dense network, as considered in montanari2015finding () (specifically the large degree limit considered in that paper). We note that thanks to the output channel universality (Sec. I.3) this case is equivalent to the spiked Bernoulli model of symmetric sparse PCA.
As a side remark we note that the community detection setting as considered here is also relevant in the bipartite matrix factorization setting where it becomes the problem of biclustering madeira2004biclustering (); cheng2000biclustering (). The analysis developed in this paper can be straightforwardly extended to the bipartite case.
I.5 Main results and related work
The present paper is built upon two previous shorter papers by the same authors lesieur2015phase (); lesieur2015mmse (). It focuses on the study of the general type of models described by probability measures (1) and (2). On the one hand, these represent Boltzmann measures of vectorial-spin systems on fully connected symmetric or bipartite graphs. Examples of previously studied physical models that are special cases of the setting considered here are the SherringtonKirkpatrick model SherringtonKirkpatrick75 (), the Hopfield model hopfield1982neural (); mezard2016mean (), and the inference (not learning) in the restricted Boltzmann machine gabrie2015training (); mezard2016mean (); tubiana2016emergence (). Our work hence provides a unified replica symmetric solution and TAP equations for a generic class of prior distributions and Hamiltonians.
On the other hand, these probability distributions (1) and (2) represent the posterior probability distribution of a lowrank matrix estimation problem that finds a wide range of applications in highdimensional statistics and data analysis. The thermodynamic limit where the system size diverges while the rank stays of order one was widely considered in the studies of spin glasses; in the context of lowrank matrix estimation this limit corresponds to the challenging highdimensional regime, whereas traditional statistics typically considers the low-dimensional case. We focus on the analysis of the phase diagrams and phase transitions of lowrank matrix estimation corresponding to the Bayes optimal matrix estimation. We note that because we assume the data were generated from random factors, we obtain a much tighter control of the highdimensional behaviour, including the constant factors, than some traditional bounds in statistics that aim not to assume any generative model but instead craft proofs under verifiable conditions on the observed data.
We note at this point that a methodologically closely related series of works on matrix factorization is concerned with the case of high rank, i.e. when the rank grows with the system size krzakala2013phase (); parker2014bilinear (); kabashima2014phase (). While that case also has a set of important applications (among them the learning of overcomplete dictionaries) it is different from the lowrank case considered here. The theory developed for the high rank case requires the prior distribution to be separable componentwise. The high rank case also does not present any known output channel universality; the details of the channel enter the resulting state evolution explicitly. Whereas the lowrank case can be viewed as a generalization of a spin model with pairwise interactions, in the graphical model for the highrank case the interactions involve an extensive number of variables.
No attempt is made at mathematical rigor in the present article. Contrary to the approach traditionally taken in statistics, most of the analysis in this paper is nonrigorous and relies on the accepted assumptions of the replica and cavity methods. It is worth, however, mentioning that for the case of Bayesoptimal inference, a large part of the results of this paper were proven rigorously in a recent series of works rangan2012iterative (); javanmard2013state (); DeshpandeM14 (); krzakala2016mutual (); deshpande2016asymptotic (); barbier2016mutual (); LelargeMiolane16 (); miolane2017fundamental (). These proofs include the mutual information (related to the replica free energy) in the Bayesoptimal setting and the corresponding minimum meansquarederror (MMSE), and the rigorous establishment that the state evolution indeed describes the asymptotic evolution of the LowRAMP algorithm. The study out of the Bayesoptimal conditions (without the Nishimori conditions) is more involved.
It has become a tradition in related literature zdeborova2015statistical () to conjecture that the performance of the LowRAMP algorithm cannot be improved by other polynomial algorithms. We do analyze here in detail the cases where LowRAMP does not achieve the MMSE, and we remark that since effects of replica symmetry breaking need to be taken into account when evaluating the performance of the best polynomial algorithms, the conjecture of the LowRAMP optimality among the polynomial algorithms deserves further detailed investigation.
This section gives a brief summary of our main results and their relation to existing work.

Approximate Message Passing for LowRank matrix estimation (LowRAMP): In section II we derive and detail the approximate message passing algorithm to estimate marginal probabilities of the probability measures (1) and (2) for general prior distribution, rank and Hamiltonian (output channel). We describe various special cases of these equations that arise due to the Nishimori conditions or due to selfaveraging. In the physics literature these would be the TAP equations ThoulessAnderson77 () generalized to vectorial spins with general local magnetic fields and generic types of pairwise interactions. The LowRAMP equations encompass as special cases the original TAP equations for the SherringtonKirkpatrick model, the TAP equations for the Hopfield model MezardParisi87b (); mezard2016mean (), and those for the restricted Boltzmann machine gabrie2015training (); tramel2016inferring (); mezard2016mean (); tubiana2016emergence (). Within the context of lowrank matrix estimation, the AMP equations were discussed in rangan2012iterative (); NIPS2013_5074 (); DeshpandeM14 (); lesieur2015phase (); lesieur2015mmse (); deshpande2016asymptotic (); lesieur2016phase (). Recently the LowRAMP algorithm was even generalized to spin variables that are not real vectors but live on compact groups perry2016message ().
AMP-type algorithms are a promising alternative to gradient-descent-type methods to minimize the likelihood. One of the main advantages of AMP is that it provides estimates of uncertainty, which is crucial for assessing reliability and interpretability of the result. Compared to other Bayesian algorithms, AMP tends to be faster than Monte Carlo based algorithms and more precise than variational meanfield based algorithms.
We distribute two opensource Julia and Matlab versions of LowRAMP at http://krzakala.github.io/LowRAMP/. We strongly encourage the reader to download, modify, and improve on it.
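For concreteness, here is a minimal Python/NumPy sketch (ours; not the released Julia/Matlab code) of the Bayes-optimal LowRAMP iteration in its simplest special case: rank one, Rademacher prior, Gaussian output channel with noise variance $\Delta$. Note the Onsager correction evaluated at the previous iterate, consistent with the role of time indices discussed in this paper.

```python
import numpy as np

rng = np.random.default_rng(42)
N, Delta, T = 1000, 0.2, 50

# Planted spike and noisy symmetric observation Y = x x^T / sqrt(N) + noise
x = rng.choice([-1.0, 1.0], size=N)
noise = rng.normal(0.0, np.sqrt(Delta), size=(N, N))
Y = np.outer(x, x) / np.sqrt(N) + (noise + noise.T) / np.sqrt(2.0)
np.fill_diagonal(Y, 0.0)

S = Y / Delta                         # score matrix of the Gaussian output channel

xhat = rng.normal(0.0, 0.1, size=N)   # uninformative random initialization
xhat_old = np.zeros(N)
sigma = np.ones(N)
for _ in range(T):
    # local field with the Onsager term evaluated at the previous iterate
    B = S @ xhat / np.sqrt(N) - (sigma.mean() / Delta) * xhat_old
    xhat_old = xhat
    xhat = np.tanh(B)                 # Bayes denoiser for the Rademacher prior
    sigma = 1.0 - xhat ** 2           # posterior variances

overlap = abs(xhat @ x) / N           # recovers the spike up to a global sign
```

With the noise variance well below the detectability threshold, the iteration escapes the uninformative fixed point and the overlap with the planted spike approaches the value predicted by state evolution.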

In physics, message passing equations are always closely linked with the Bethe free energy, whose stationary points correspond to the fixed points of the message passing equations. In the presence of multiple fixed points it is the value of the Bethe free energy that decides which of the fixed points is the correct one. In section II.5 and appendix B we derive the Bethe free energy on a single instance of the lowrank matrix estimation problem. The form of free energy that we derive has the convenience of being variational in the sense that in order to find the fixed point we are looking for a maximum of the free energy, not for a saddle. The corresponding free energy for the compressed sensing problem was derived in krzakala2014variational (); DBLP:journals/corr/abs13016295 () and can be used to guide the iterations of the LowRAMP algorithm DBLP:journals/corr/VilaSRKZ14 ().

In section III we derive the general form of the state evolution (under the replica symmetric assumption) of the LowRAMP algorithm, generalizing previous works, e.g. rangan2012iterative (); DeshpandeM14 (); lesieur2015phase (); lesieur2015mmse (). We present simplifications for the Bayesoptimal inference and for the conventional form of the Hamiltonian. We also give the corresponding expression for the free energy. We derive the state evolution and the free energy using both the cavity and the replica method.
For the Bayesoptimal setting the replica Bethe free energy is, up to a simple term, related to the mutual information from which one can deduce the value of the minimum informationtheoretically achievable mean squared error. Specifically, the MMSE corresponds to the global maximum of the replica free energy (defined here with the opposite sign than in physics), while the performance of LowRAMP corresponds to the maximum of the replica free energy that has the highest MSE.
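As a concrete illustration of these fixed points, for the rank-one case with Rademacher prior and Gaussian noise of variance $\Delta$ the Bayes-optimal state evolution reduces to a scalar recursion on the overlap $m$. The sketch below (our parametrization, using Gauss-Hermite quadrature for the Gaussian expectation) exhibits the transition at $\Delta_c = 1$ between a phase where the overlap flows to a positive fixed point and a phase where it flows to zero.

```python
import numpy as np

# Gauss-Hermite quadrature: E_Z f(Z) ~ sum_k w_k f(sqrt(2) z_k) / sqrt(pi)
nodes, weights = np.polynomial.hermite.hermgauss(40)

def state_evolution(Delta, iters=200, m0=1e-4):
    """Iterate m_{t+1} = E_Z tanh(m_t/Delta + sqrt(m_t/Delta) Z)."""
    m = m0
    for _ in range(iters):
        snr = max(m, 0.0) / Delta
        m = float(weights @ np.tanh(snr + np.sqrt(2.0 * snr) * nodes)) / np.sqrt(np.pi)
    return m

m_easy = state_evolution(Delta=0.5)  # below Delta_c = 1: positive fixed point
m_hard = state_evolution(Delta=2.0)  # above Delta_c = 1: overlap flows to zero
```

Linearizing the recursion around $m=0$ gives $m_{t+1} \approx m_t/\Delta$, which makes the instability of the uninformative fixed point for $\Delta < 1$ apparent.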
We stress here that the LowRAMP algorithm belongs to the same class of approximate Bayesian inference algorithms as generic Markov chain Monte Carlo or variational meanfield methods. Yet LowRAMP is very particular compared to these other two because on the class of random models considered here its performance can be analyzed exactly via the state evolution, and (out of the hard region) LowRAMP asymptotically matches the performance of the Bayesoptimal estimator. The study of AMP-type algorithms hence opens a way to put the variational mean field algorithms into a more solid theoretical framework.

We discuss the output channel universality as known in special cases in statistical physics (the replica solution of the SK model depends only on the mean and variance of the quenched disorder, not on other details of the distribution) and in statistics deshpande2016asymptotic () (for the two group stochastic block model). The general form of this universality was first put into light for the Bayesoptimal estimation in lesieur2015mmse () and proven in krzakala2016mutual (); in this paper we discuss this universality also out of the Bayesoptimal setting.

In section IV.1 we show that the state evolution with a Gaussian prior can be used to analyse the asymptotic performance of spectral algorithms such as PCA (symmetric case) or SVD (bipartite case) and to derive the corresponding spectral meansquared errors and phase transitions as studied in the random matrix literature baik2005phase (). For a recent closely related discussion see perry2016optimality ().
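The spectral phase transition referred to here is the BBP-type transition for low-rank perturbations of random matrices. The following sketch (ours, with illustrative parameters) checks numerically that for a rank-one spike of strength $\mu > 1$ added to a GOE matrix the top eigenvalue sticks out of the bulk at $\mu + 1/\mu$ and the top eigenvector has squared overlap close to $1 - 1/\mu^2$ with the spike.

```python
import numpy as np

rng = np.random.default_rng(7)
N, mu = 1500, 3.0

x = rng.normal(size=N)
x *= np.sqrt(N) / np.linalg.norm(x)          # normalize so that ||x||^2 = N

G = rng.normal(size=(N, N))
W = (G + G.T) / np.sqrt(2.0 * N)             # GOE matrix, bulk spectrum in [-2, 2]
M = mu * np.outer(x, x) / N + W              # rank-one spike of strength mu

vals, vecs = np.linalg.eigh(M)
top_val = vals[-1]                           # close to mu + 1/mu for mu > 1
overlap_sq = (vecs[:, -1] @ x) ** 2 / N      # close to 1 - 1/mu^2 for mu > 1
```

For $\mu < 1$ the top eigenvalue instead stays at the bulk edge 2 and the eigenvector overlap vanishes, which is the spectral counterpart of the detectability threshold discussed below.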

In section IV.2 and IV.3 we discuss the typology of phase transitions and phases that arise in Bayesoptimal lowrank matrix estimation. We provide sufficient criteria for the existence of phases where estimation better than random guessing from the prior distribution is not informationtheoretically possible. We analyze the linear stability of the fixed point of the LowRAMP algorithm related to this phase of undetectability, to conclude that the threshold where the LowRAMP algorithm starts to have better performance than randomly guessing from the prior agrees with the spectral threshold known in the literature on lowrank perturbations of random matrices. We also provide sufficient criteria for the existence of first order phase transitions, related to the existence of phases where informationtheoretically optimal estimation is not achievable by existing algorithms, and analyze the three thresholds related to the first order phase transition.

In section V.1 we give a number of examples of phase diagrams for the following models: the rankone symmetric case with the prior distribution being Bernoulli, RademacherBernoulli, GaussBernoulli, or corresponding to two balanced groups. For generic rank we give the phase diagram for the symmetric case of the jointlysparse PCA, and for the symmetric community detection.

Section V.2 and appendix D are devoted to the small-sparsity analysis of the above models. This is motivated by the fact that in most existing literature with sparsity constraints the number of nonzeros is usually considered to be a vanishing fraction of the system size. In our analysis the number of nonzeros is a finite fraction $\rho$ of the system size; we call this the regime of linear sparsity. We investigate whether the $\rho \to 0$ limit corresponds to previously studied cases. Concerning the informationtheoretically optimal performance and the related threshold, our small-$\rho$ limit agrees with the results known for sublinear sparsity. Concerning the performance of efficient algorithms, from our analysis we conclude that for linear sparsity in the leading order the existing algorithms do not beat the threshold of the naive spectral method. This corresponds to the known results for the planted dense subgraph problem (even when the size of the planted subgraph is sublinear). However, for the sparse PCA problem with sublinear sparsity, algorithms such as covariance thresholding are known to beat the naive spectral threshold deshpande2014sparse (). In the regime of linear sparsity we do not recover such behaviour, suggesting that for linear sparsity efficient algorithms that take advantage of the sparsity do not exist.

In section V.3 and appendix E we discuss analytical results that we can obtain for the community detection and jointly-sparse PCA models in the limit of large rank. These large rank results match the rigorous bounds derived for these problems in banks2016information ().
II Lowrank approximate message passing, LowRAMP
In this section we derive the approximate message passing for the lowrank matrix estimation problems, i.e. the estimation of marginals of the probability distributions (1) and (2). We detail the derivation for the symmetric case (1) and state the resulting equations for the bipartite case (2). We derive the algorithm in several stages. We aim to derive a tractable algorithm that under certain conditions (replica symmetry, and absence of the phase coexistence related to a first order phase transition) estimates the marginal probabilities of the probability distribution (1) exactly in the large size limit. We start with the belief propagation (BP) algorithm, as well explained e.g. in Yedidia:2003 ().
It should be noted at this point that the BP algorithm is guaranteed to be exact for a general system only when the underlying graphical model is a tree. Yet it is often applied to locally tree-like graphical models. On tree-like graphical models the usage of BP is justified when correlations decay fast enough so that the far-away boundary of a rooted tree typically does not influence the root. This is formalized as point-to-set correlations montanari2006rigorous (). In the present case the graphical model is fully connected and hence a derivation based on belief propagation might seem puzzling at first sight. The main assumption made in belief propagation is that the incoming messages are probabilistically independent when conditioned on the value of the root. For this to be true the underlying graphical model does not have to be tree-like; the independence can also hold because the interactions are sufficiently weak. In the present case the $1/\sqrt{N}$ factor in the Hamiltonian (12) makes the interactions sufficiently weak so that the assumption of independence of incoming messages is correct in the large size limit. This is quite a common situation known from the cavity method derivation of the solution of the SherringtonKirkpatrick spin glass model mezard1986sk (). The resulting equations can also be derived by other means, e.g. the Plefka expansion 03054470156035 () that we use later to derive the Bethe free energy. We prefer the derivation from BP, because it is the only one we know of that provides time indices that lead to a converging algorithm (note that the TAP equations with the ”naive” time indices were notoriously known not to converge even in the paramagnetic phase).
Dealing with continuous variables makes the BP algorithm quite unsuitable for implementation, as one would need to represent whole probability distributions. One takes advantage of the fact that BP assumes the incoming messages to be statistically independent: since there are many incoming messages, by the central limit theorem only the means and variances of the messages need to be tracked. The resulting algorithm is called relaxed belief propagation (in analogy to a similar algorithm for sparse linear estimation rangan2010estimation ()).
The next step is analogous to what is done for the TAP equations in the SK model ThoulessAnderson77 (), where one realizes that the algorithm can be further simplified: instead of dealing with one message for every directed edge, we can work only with node-dependent quantities. This gives rise to the so-called Onsager terms. Keeping track of the correct time indices under iteration zdeborova2015statistical () in order to preserve convergence of the iterative scheme, we end up with the LowRAMP algorithm that was derived in various special forms in several previous works. In the early literature on the Hopfield model these equations were written in various forms MezardParisi87b () without paying close attention to the time iteration indices that cause convergence problems zdeborova2015statistical (). For a more recent revival of interest in this approach, motivated mainly by the application of a similar algorithmic scheme to the problem of linear estimation DonohoMaleki09 (); bayati2011dynamics () and by the need for new algorithms for constrained lowrank estimation, see rangan2012iterative (); NIPS2013_5074 (); DeshpandeM14 (); lesieur2015phase (); mezard2016mean (). We finish this section with simplifications that can be made due to selfaveraging, due to the Nishimori condition in the Bayesoptimal case, or due to the particular form of the Hamiltonian.
We note that the LowRAMP algorithm is related to but different from the variational mean field equations for this problem. To make the difference apparent, we state the variational mean field algorithms in the notation of this paper in Appendix A.
II.1 From BP to relaxed BP
To write the BP equations for the probability measure (1) we represent it by a fully connected factor graph, Fig. 1, where every index $i$ corresponds to a variable node $x_i$, every pair $(ij)$ corresponds to a pairwise factor node, and every variable node is moreover connected to a single-site factor node representing the prior.
We introduce the messages $m_{i\to ij}(x_i)$ and $\tilde m_{ij\to i}(x_i)$, respectively the messages (probability distributions over the $r$-dimensional vectors $x_i$) sent from a variable node to a factor node and from a factor node to a variable node. The belief propagation equations then are
$$m_{i\to ij}(x_i) = \frac{1}{Z^{i\to ij}}\, P_X(x_i) \prod_{k\neq i,j} \tilde m_{ki\to i}(x_i)\,, \qquad (47)$$
$$\tilde m_{ij\to i}(x_i) = \frac{1}{Z^{ij\to i}} \int \mathrm{d}x_j\; e^{g\left(Y_{ij},\, x_i^\top x_j/\sqrt{N}\right)}\, m_{j\to ij}(x_j)\,. \qquad (48)$$
The factor graph from which these messages are derived is given in Fig. 1. The most important assumption made in the BP equations is that the messages are, conditionally on the value of $x_i$, independent of each other, thus allowing to write the product of incoming messages in eq. (47).
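In the rank-one case with $x_i = \pm 1$ and Gaussian output channel the messages reduce to scalar cavity fields (log-likelihood ratios) and the BP update can be iterated directly. The sketch below (our notation and illustrative parameter choices) does this on a planted instance; mild damping is added for stability of the parallel update.

```python
import numpy as np

rng = np.random.default_rng(3)
N, Delta, sweeps = 300, 0.15, 60

x = rng.choice([-1.0, 1.0], size=N)                 # planted spins
noise = rng.normal(0.0, np.sqrt(Delta), size=(N, N))
Y = np.outer(x, x) / np.sqrt(N) + (noise + noise.T) / np.sqrt(2.0)
np.fill_diagonal(Y, 0.0)

J = Y / (np.sqrt(N) * Delta)                        # effective pairwise couplings
H = rng.normal(0.0, 0.1, size=(N, N))               # H[i, j] = cavity field h_{i -> j}

for _ in range(sweeps):
    # factor-to-variable log-likelihood ratios: U[j, i] is the message onto i
    U = np.arctanh(np.tanh(J) * np.tanh(H))
    np.fill_diagonal(U, 0.0)
    col = U.sum(axis=0)                             # total incoming field on each i
    H = 0.5 * H + 0.5 * (col[:, None] - U.T)        # exclude the message from j; damp

U = np.arctanh(np.tanh(J) * np.tanh(H))
np.fill_diagonal(U, 0.0)
marg = np.tanh(U.sum(axis=0))                       # approximate marginal magnetizations
overlap = abs(marg @ x) / N
```

Working with the full set of directed-edge messages is what later gets compressed into node-dependent quantities plus Onsager terms in the LowRAMP algorithm.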