
# Constrained Low-rank Matrix Estimation: Phase Transitions, Approximate Message Passing and Applications.

Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová Institut de Physique Théorique, CNRS, CEA, Université Paris-Saclay, F-91191, Gif-sur-Yvette, France.
Laboratoire de Physique Statistique, Ecole Normale Supérieure, & Université Pierre et Marie Curie, Sorbonne Universités.
To whom correspondence shall be sent: lenka.zdeborova@cea.fr
July 6, 2019
###### Abstract

This article is an extended version of previous work of the authors lesieur2015phase (); lesieur2015mmse () on low-rank matrix estimation in the presence of constraints on the factors into which the matrix is factorized. Low-rank matrix factorization is one of the basic methods used in data analysis for unsupervised learning of relevant features and other types of dimensionality reduction. We present a framework to study constrained low-rank matrix estimation for a general prior on the factors, and a general output channel through which the matrix is observed. We draw a parallel with the study of vector-spin glass models – presenting a unifying way to study a number of problems considered previously in separate statistical physics works. We present a number of applications of the problem in data analysis. We derive in detail a general form of the low-rank approximate message passing (Low-RAMP) algorithm, which is known in statistical physics as the TAP equations. We thus unify the derivation of the TAP equations for models as different as the Sherrington-Kirkpatrick model, the restricted Boltzmann machine, the Hopfield model or vector (xy, Heisenberg and other) spin glasses. The state evolution of the Low-RAMP algorithm is also derived, and is equivalent to the replica symmetric solution for the large class of vector-spin glass models. In the section devoted to results we study in detail phase diagrams and phase transitions for Bayes-optimal inference in low-rank matrix estimation. We present a typology of phase transitions and their relation to the performance of algorithms such as Low-RAMP or commonly used spectral methods.

## I Introduction

### i.1 Problem Setting

In this paper we study a generic class of statistical physics models having Boltzmann probability measure that can be written in one of the two following forms:

• Symmetric vector-spin glass model:

 P(X|Y)=1ZX(Y)∏1≤i≤NPX(xi)∏1≤i<j≤Neg(Yij,x⊤ixj/√N). (1)

Here Y (of size N×N) and X (of size N×r) are real valued matrices. In this case Y is a symmetric matrix. In statistical physics Y is called the quenched disorder. In the whole paper we denote by xi the vector-spin (r-dimensional column vector) that collects the elements of the ith row of the matrix X, and x⊤ixj is the scalar product of the two vectors. ZX(Y) is the corresponding partition function playing the role of the normalization.

• Bipartite vector-spin glass model:

 P(U,V|Y)=1ZUV(Y)∏1≤i≤NPU(ui)∏1≤j≤MPV(vj)∏1≤i≤N,1≤j≤Meg(Yij,u⊤ivj/√N). (2)

Defined as above, this time Y is an N×M matrix and U, V are of sizes N×r and M×r, respectively. Again we denote by ui, vj the vector-spins of dimension r that collect the rows of matrices U, V. In this case the graph of interactions between spins is bipartite.

The main motivation of this work is twofold. On the one hand, the above mentioned probability measures are posterior probability measures of an important class of high-dimensional inference problems known as constrained low-rank matrix estimation. In what follows we give examples of applications of these matrix estimation problems in data processing and statistics. On the other hand, our motivation from the physics point of view is to present a unified formalism providing the (replica symmetric) solution for a large class of mean-field vectorial spin models with disorder.

The general nature of the present work stems from the fact that the probability distributions PX, PU, PV, and the function g are very generic (assumptions are summarized in section I.3). These functions can even depend on the node i or edge (ij). For simplicity we will treat site-independent functions PX, PU, PV and g, but the theory developed here generalizes very straightforwardly to the site or edge dependent case. From a statistical physics point of view the terms PX, PU, PV play the role of generic local magnetic fields acting on the individual spins. Distributions PX, PU, PV describe the nature of the vector-spin variables and the fields that act on them. The simplest example is the widely studied case of Ising spins for which r=1 and PX(xi)=ρδ(xi−1)+(1−ρ)δ(xi+1), where here ρ would be related to the usual magnetic field h and inverse temperature β as ρ=e^{βh}/(2cosh βh). In this paper we treat a range of other examples with the elements of xi being both discrete or continuous. This involves for instance spherical spin models with PX being Gaussian, or Heisenberg spins where r=3 and each xi is confined to the surface of a sphere.

Denoting

 wij=x⊤ixj/√N,orwij=u⊤ivj/√N (3)

according to the symmetric or bipartite context, the terms g(Yij,wij) are then interactions between pairs of spins that depend only on the scalar product between the corresponding vectors. The most commonly considered form of interaction in statistical physics is simply

 g(Y,w)=βYw (4)

with β being a constant called the inverse temperature, leading to a range of widely considered models with pair-wise interactions. We will refer to this form of the function g as the conventional Hamiltonian.

In order to complete the setting of the problem we need to specify how the quenched disorder Y is chosen. We will consider two main cases of the quenched disorder defined below. We note that even for problems where the matrix Y is not generated by either of the below our approach might still be relevant, e.g. for the restricted Boltzmann machine that is equivalent to the above bipartite model with interactions Y that were learned from data (see for instance gabrie2015training (); tramel2016inferring () and the discussion below).

• Randomly quenched disorder: In this case the matrix elements Yij of Y are chosen independently at random from some probability distribution P(Yij). In this paper we will consider this distribution to be independent of the indices i, j and in later parts for simplicity we will restrict its mean to be zero. This case of randomly quenched disorder will encompass many well known and studied spin glass models such as the Ising spin glass, the Heisenberg spin glass, the spherical spin glass or the Hopfield model.

• Planted models: Concerning applications in data science this is the more interesting case and most of this paper will focus on it. In this case we consider that there is some ground truth value of X0 (or U0, V0) with rows that are generated independently at random from some probability distribution PX0 (or PU0, PV0). Then the disorder Y is generated element-wise as a noisy observation of the product X0X0⊤/√N (or U0V0⊤/√N) via an output channel characterized by the output probability distribution Pout(Yij|w0ij).
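As a concrete illustration of this planted construction, the following sketch samples a symmetric planted instance through an additive Gaussian output channel (defined later in eq. (21)); the system size, rank, prior and noise variance are arbitrary choices of this example, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, Delta = 500, 2, 0.5  # size, rank and noise variance: illustrative values only

# Ground truth X0: rows x_i^0 drawn i.i.d. from a Gaussian prior P_X0
X0 = rng.standard_normal((N, r))

# Noiseless low-rank part w0_ij = x_i^0 . x_j^0 / sqrt(N)
W0 = X0 @ X0.T / np.sqrt(N)

# Gaussian output channel: Y_ij = w0_ij + noise, symmetrized so that Y = Y^T
noise = rng.standard_normal((N, N))
noise = (noise + noise.T) / np.sqrt(2)  # off-diagonal entries keep unit variance
Y = W0 + np.sqrt(Delta) * noise
```

Recovering X0 (up to a rotation in the r-dimensional factor space) from such a Y is exactly the symmetric low-rank estimation problem studied in this paper.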

### i.2 Preliminaries on the planted setting

#### i.2.1 Bayes optimal inference

Many applications, listed and analyzed below, in which the planted setting is relevant, concern problems where we aim to infer some ground truth matrices X0, U0, V0 from the observed data Y and from the information we have about the distributions PX0, PU0, PV0 and Pout. The information-theoretically optimal way of doing inference, if we know how the data and the ground-truth matrices were generated, is to follow Bayesian inference and compute marginals of the corresponding posterior probability distribution. According to the Bayes formula, the posterior probability distribution for the symmetric case is

 P(X|Y)=1ZX(Y)∏1≤i≤NPX0(xi)∏1≤i<j≤NPout(Yij∣∣ ∣∣x⊤ixj√N). (5)

For the bipartite case it is

 P(U,V|Y)=1ZUV(Y)∏1≤i≤NPU0(ui)∏1≤j≤MPV0(vj)∏1≤i≤N,1≤j≤MPout(Yij∣∣ ∣∣u⊤ivj√N). (6)

Making link with the Boltzmann probability measures (1) and (2) we see that the Bayes optimal inference of the planted configuration is equivalent to the statistical physics of the above vector-spin models with

 PX0=PX,PU0=PU,PV0=PV,Pout(Y|w)=eg(Y,w). (7)

This approach is optimal in the sense that the statistical estimator ^xi computed from the data Y that minimizes the expected mean-squared error between the estimator and the ground truth x0i is given by the mean of the marginal of the variable xi in the probability distribution (5)

 ^xi(Y)=∫dxxμi(x),whereμi(x)=∫P(X|Y)∏{xj}j≠idxj. (8)

Analogously for the bipartite case.

In the Bayes-optimal setting defined by conditions (7) the statistical physics analysis of the problem presents important simplifications known as the Nishimori conditions NishimoriBook (); zdeborova2015statistical (), which will be largely used in the present paper. These conditions can be proven and stated without the usage of the methodology developed below, they are a direct consequence of the Bayesian formula for conditional probability and basic properties of probability distributions.

Assume Bayes-optimality of the output channel, that is Pout(Y|w)=eg(Y,w). First let us notice that every probability distribution has to be normalized

 ∀w,∫dYPout(Y|w)=1. (9)

Differentiating the above equation with respect to w one gets

 ∀w,∫dYPout(Y|w)∂g(Y,w)∂w=EPout(Y|w)[∂g(Y,w)∂w]=0, (10)
 ∀w,∫dYPout(Y|w)⎡⎣(∂g(Y,w)∂w)2+∂2g(Y,w)∂w2⎤⎦=EPout(Y|w)⎡⎣(∂g(Y,w)∂w)2+∂2g(Y,w)∂w2⎤⎦=0. (11)

Anticipating the derivation in the following we also define the inverse Fisher information Δ of an output channel at w=0 via

 1Δ=EPout(Y|w=0)[(∂g(Y,w)∂w∣∣w=0)2]. (12)

Let us now assume Bayes-optimality also for the prior distributions, i.e. PX=PX0 and PU=PU0, PV=PV0. Then under averages over the posterior distribution (5) or (6) we can replace the random variables X, U, V by the ground truth X0, U0, V0 and vice versa. To prove this let us consider the symmetric case, the generalization to the bipartite case being straightforward. Take X1, X2, X to be three independent samples from the posterior probability distribution P(X|Y), eq. (5). We then consider some function f of two configurations of the variables. Consider the following two expectations

 E[f(X1,X2)] = ∫f(X1,X2)P(Y)P(X1|Y)P(X2|Y)dX1dX2dY, (13) E[f(X0,X)] = ∫f(X0,X)P(X0,X)dXdX0=∫f(X0,X)P(X0,X,Y)dXdX0dY (14) = ∫f(X0,X)P(X|Y,X0)Pout(Y|X0)P0(X0)dXdX0dY.

where we used the Bayes formula. We further observe that P(X|Y,X0)=P(X|Y) because X is independent of X0 when conditioned on Y. In the Bayes optimal case, i.e. when PX0=PX and Pout(Y|w)=eg(Y,w), we then obtain

 Bayesoptimal:E[f(X1,X2)]=E[f(X0,X)], (15)

meaning that under expectations there is no statistical difference between the ground truth assignment of variables and an assignment sampled uniformly at random from the posterior probability distribution (5). This is a simple yet important property that will lead to numerous simplifications in the Bayes optimal case and it will be used in several places of this paper, under the name Nishimori condition.
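The identity (15) can be illustrated numerically in the simplest scalar case (a Gaussian prior with a Gaussian channel, where the posterior is Gaussian and explicit); the noise level and sample count below are arbitrary choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
Delta = 0.5   # noise variance of the scalar Gaussian channel
n = 200000    # Monte Carlo samples

# Generative model: x0 ~ N(0,1), y = x0 + sqrt(Delta) * noise
x0 = rng.standard_normal(n)
y = x0 + np.sqrt(Delta) * rng.standard_normal(n)

# Bayes posterior P(x|y) is Gaussian with mean y/(1+Delta) and variance Delta/(1+Delta)
post_mean = y / (1 + Delta)
post_var = Delta / (1 + Delta)
x1 = post_mean + np.sqrt(post_var) * rng.standard_normal(n)
x2 = post_mean + np.sqrt(post_var) * rng.standard_normal(n)

# Nishimori condition: overlap of two posterior samples equals overlap
# of the ground truth with one posterior sample
lhs = np.mean(x1 * x2)
rhs = np.mean(x0 * x1)
```

Both averages concentrate on the same value 1/(1+Delta), as the condition (15) with f taken to be the overlap predicts.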

From the point of view of statistical physics of disordered systems the most striking property of systems that verify the Nishimori conditions is that there cannot be any replica symmetry breaking in the equilibrium solution of these systems nishimori2001absence (); NishimoriBook (); zdeborova2015statistical (). This simplifies considerably the analysis of the Bayes-optimal inference. Note, however, that metastable (out-of-equilibrium) properties of Bayes-optimal inference do not have to satisfy the Nishimori conditions and replica symmetry breaking might be needed for their correct description (this will be relevant in the cases of first order phase transition described in section IV.3).

#### i.2.2 Principal component analysis

In the above probabilistic inference setting the Bayesian approach of computing the marginals is optimal. However, in a majority of the interesting cases it is computationally intractable (NP hard) to compute the marginals exactly. In all low-rank estimation problems the method that (for the bipartite case) first comes to mind when we look for a low-rank matrix close to an observed matrix Y is the singular value decomposition (SVD), where a rank-r matrix ~Y that minimizes the squared difference is computed

 argmin~Y[∑i,j(Yij−~Yij)2]=r∑s=1usλsv⊤s, (16)

where λs is the sth largest singular value of Y, and us, vs are the corresponding left-singular and right-singular vectors of the matrix Y. The above property is known as the Eckart-Young-Mirsky theorem eckart1936approximation (). In the symmetric case we simply replace the singular values by eigenvalues and the singular vectors by the eigenvectors. The above unconstrained low-rank approximation of the matrix Y, eq. (16), is also often referred to as principal component analysis (PCA), because indeed when the matrix Y is interpreted as N samples of M-dimensional data then the right-singular vectors are directions of greatest variability in the data.
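A quick numerical check of the Eckart-Young-Mirsky property (16): the rank-r truncated SVD achieves the minimal squared error, here compared against a random rank-r projection (matrix sizes and rank are arbitrary choices of this sketch).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, r = 80, 60, 3

Y = rng.standard_normal((N, M))

# Truncated SVD: keep the r largest singular values/vectors
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_svd = U[:, :r] * s[:r] @ Vt[:r, :]

# A competing rank-r approximation: project columns onto a random r-dim subspace
Q, _ = np.linalg.qr(rng.standard_normal((N, r)))
Y_rand = Q @ (Q.T @ Y)

err_svd = np.sum((Y - Y_svd) ** 2)    # equals the sum of the discarded s_i^2
err_rand = np.sum((Y - Y_rand) ** 2)  # strictly larger (almost surely)
```

The residual of the truncated SVD is exactly the sum of the squared singular values beyond the rth, and no other rank-r matrix can do better.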

PCA and SVD are methods of choice when the measure of performance is the sum of square differences between the observed and estimated matrices, and when there are no particular requirements on the elements of the matrices X, U or V.

The methodology developed in this paper for the planted probabilistic models generalizes to an arbitrary cost function that can be expressed as a product of element-wise terms, and to arbitrary constraints on the rows of the matrices X, U, V, as long as they can be described by row-wise probability distributions PX, PU, PV. Systematically comparing our results to the performance of PCA is useful because PCA is well known and many researchers have good intuition about its strengths and limitations in various settings.

### i.3 The large size limit, assumptions and channel universality

In this article we focus on the thermodynamic limit where N→∞ whereas M/N=α=O(1) and r=O(1), and all the elements of X, U, V and Y are of order 1. The functions PX, PU, PV and g do not depend on N explicitly. In the planted model also the distributions PX0, PU0, PV0 and Pout do not depend on N explicitly. The only other requirement we impose on the distributions PX, PU, PV and PX0, PU0, PV0 is that they all have a finite second moment.

The factor 1/√N in the second argument of the function g ensures that the behaviour of the above models is non-trivial and that there is an interesting competition between the N local magnetic field terms PX, PU, PV and the N² interactions. To physics readership familiar with the Sherrington-Kirkpatrick (SK) model this factor will be familiar because in the SK model the interactions between the Ising spins that lead to an extensive free energy are also of this order 1/√N (with a mean that is of order 1/N). This is compared to the ferromagnetic Ising model on a fully connected lattice for which the interactions leading to an extensive free energy scale as 1/N.

For readers interested in inference problems, i.e. the planted setting, the factor 1/√N is the scaling of the signal-to-noise ratio for which inference of the unknown X, U, V from the measurements Y is neither trivially easy nor trivially impossible. In the planted setting Y can be viewed as a random matrix with a rank-r perturbation. The regime where the eigenvalues of dense random matrices with low-rank perturbations split from the bulk of the spectral density is precisely when the strength of the perturbation is of this order, see e.g. baik2005phase ().

We are therefore looking at statistical physics models with pairwise interactions where each of the interactions depends only weakly on the configuration of the vector-spins. As a consequence, properties of the system in the thermodynamic limit depend only weakly on the details of the interaction function g(Y,w), with w given by (3). The results of this paper hold for every function g for which the following Taylor expansion is well defined

 eg(Yij,wij)=eg(Yij,0)⎧⎨⎩1+∂g(Yij,w)∂w∣∣w=0wij+⎡⎣(∂g(Yij,w)∂w∣∣w=0)2+∂2g(Yij,w)∂w2∣∣w=0⎤⎦w2ij2+O(w3ij)⎫⎬⎭. (17)

In order to simplify the notation in the following we denote

 Sij ≡ ∂g(Yij,w)∂w∣∣w=0, (18) Rij ≡ (∂g(Yij,w)∂w∣∣w=0)2+∂2g(Yij,w)∂w2∣∣w=0. (19)

We will refer to the matrix S as the Fisher score matrix. The above expansion can now be written in a more compact way

 eg(Yij,wij)=eg(Yij,0)[1+Sijwij+Rijw2ij2+O(w3ij)]=eg(Yij,0)+Sijwij+12(Rij−S2ij)w2ij+O(w3ij). (20)

Let us now analyze the orders in this expansion. In the Boltzmann measures (1) and (2) the term eg(Yij,wij) appears in a product over O(N²) terms, with wij=O(1/√N). At the same time only terms of order N in the exponent of the Boltzmann measure influence the leading order (in N) of the marginals (local magnetizations), therefore all the terms that depend on the 3rd and higher orders of wij are negligible. This means that the leading order of the marginals depends on the function g only through the matrices of its first and second derivatives at w=0, denoted S and R in (18-19). This means in particular that in order to understand the phase diagram of a model with general g we only need to consider one more order than in the conventional Hamiltonian considered in statistical physics (4).

For the sake of specificity let us state here the two examples of the output channels considered most prominently in this paper and their corresponding matrices S and R. The first example corresponds to observations of low-rank matrices with additive Gaussian noise, we will refer to this as the Gaussian output channel

 GaussianNoiseChannel:g(Y,w)=−(Y−w)22Δ−12log2πΔ,Sij=YijΔ,Rij=Y2ijΔ2−1Δ, (21)

where Δ is the variance of the noise, which for the specific case of the Gaussian output channel is also the inverse Fisher information as defined in eq. (12). The second example is the one most often considered in physics, given by eq. (4)

 ConventionalHamiltonian:g(Y,w)=βYw,Sij=βYij,Rij=β2Y2ij. (22)

with β being a constant called the inverse temperature. Another quite different example of the output channel will be given to model community detection in networks in section I.4.2.
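The expressions for Sij and Rij of eqs. (18-19) are easy to verify numerically for the Gaussian output channel (21) by finite differences of g(Y,w) at w=0; the values of Yij and Δ below are arbitrary.

```python
import numpy as np

Delta = 0.7
Yij = 1.3
eps = 1e-5

def g(Y, w):
    # Gaussian output channel of eq. (21)
    return -(Y - w) ** 2 / (2 * Delta) - 0.5 * np.log(2 * np.pi * Delta)

# Numerical first and second derivatives of g with respect to w at w = 0
g1 = (g(Yij, eps) - g(Yij, -eps)) / (2 * eps)
g2 = (g(Yij, eps) - 2 * g(Yij, 0.0) + g(Yij, -eps)) / eps ** 2

S_num = g1             # eq. (18)
R_num = g1 ** 2 + g2   # eq. (19)

# Closed-form values from eq. (21)
S_formula = Yij / Delta
R_formula = Yij ** 2 / Delta ** 2 - 1 / Delta
```

Since g is quadratic in w for this channel, the finite differences agree with the closed forms up to numerical roundoff.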

### i.4 Examples and applications

The Boltzmann measures (1) and (2) together with the model for the disorder can be used to describe a range of problems of practical and scientific interest studied previously in physics and/or in data sciences. In this section we list several examples and applications for each of the four categories – the symmetric and bipartite case, and the randomly quenched and planted disorder.

#### i.4.1 Examples with randomly quenched disorder

##### Sherrington-Kirkpatrick (SK) model.

The SK model SherringtonKirkpatrick75 () stands at the roots of the theory of spin glasses. It can be described by the symmetric Boltzmann measure (1) with the conventional Hamiltonian g(Y,w)=βYw.

The xi are Ising spins, i.e. r=1 and xi∈{+1,−1}, with distribution

 PX(xi)=ρδ(xi−1)+(1−ρ)δ(xi+1). (23)

The parameter ρ is related to the inverse temperature β and an external magnetic field h as ρ=e^{βh}/(2cosh βh). Note that the parameter ρ could also be site-dependent and our approach would generalize, but in this paper we work with a site independent function PX.

The elements of the (symmetric) matrix Y are the quenched random disorder, i.e. they are generated independently at random from some probability distribution. The most usually considered distributions of disorder would be the normal distribution with zero mean, or binary Yij=+1 with probability 1/2 and Yij=−1 otherwise.

The algorithm developed in this paper for the general case corresponds to the Thouless-Anderson-Palmer ThoulessAnderson77 () equations for the SK model. The theory developed here corresponds to the replica symmetric solution of SherringtonKirkpatrick75 (). Famously this solution is wrong below a certain temperature where effects of replica symmetry breaking (RSB) have to be taken into account. In this paper we focus on the replica symmetric solution, which leads to exact and novel phase diagrams for the planted models. The RSB solution in the present generic setting will be presented elsewhere. We present the form of the TAP equations in the general case, encompassing a range of existing works.

##### Spherical spin glass.

Next to the SK model, the spherical spin glass model kosterlitz1976spherical () stands behind a large fraction of our understanding of spin glasses. Mathematically much simpler than the SK model, it stands as a prominent case in the development of mathematical physics. The spherical spin glass is formulated via the symmetric Boltzmann measure (1) with the conventional Hamiltonian g(Y,w)=βYw. A Gaussian PX with r=1 enforces (canonically) the spherical constraint ∑ix2i=N. An external magnetic field can also be included in PX.

The disorder is most commonly randomly quenched in physics studies of the spherical spin glass model.

##### Heisenberg spin glass.

In the Heisenberg spin glass sommers1981theory () the Hamiltonian is again the conventional symmetric one with randomly quenched disorder. The spins are 3-dimensional vectors, r=3, of unit length, x⊤ixi=1. A magnetic field h influences the direction of the spin so that

 PX(xi)=eβh⊤xi, (24)

where h∈ℝ3. The more general m-component model was also studied extensively in the spin glass literature gabay1981coexistence ().

##### Restricted Boltzmann Machine.

Restricted Boltzmann machines (RBMs) are one of the triggers of the recent revolution in machine learning called deep learning hinton2010practical (); hinton2006fast (); lecun2015deep (). The way RBMs are used in machine learning is that one considers the bipartite Boltzmann measure (2). In the training phase one searches for a matrix Y such that the data, represented as a set of configurations of the u-variables, have high likelihood (low energy). The v-variables are called the hidden units, and the columns of the matrix Y (each corresponding to one hidden unit) are often interpreted as features that are useful to explain the structure of the data.

The RBM is most commonly considered for the conventional Hamiltonian g(Y,w)=βYw and for binary variables u and v. But other distributions for both the data-variables and the hidden variables were considered in the literature, and the approach of the present paper applies to all of them.

We note that the disorder Y that is obtained for an RBM trained on real datasets does not belong to the classes for which the theory developed in this paper is valid (training introduces involved correlations). However, it was shown recently that the Low-RAMP equations as studied in the present paper can be used efficiently for the training of the RBM kappen1998boltzmann (); gabrie2015training ().

The RBM with Gaussian hidden variables is related to the well known Hopfield model of associative memory hopfield1982neural (). Therefore the properties of the bipartite Boltzmann measure (2) with a randomly quenched disorder are in one-to-one correspondence with the properties of the Hopfield model. This relation in the view of the TAP equations was studied recently in mezard2016mean ().

#### i.4.2 Examples with planted disorder

So far we covered examples where the disorder was randomly quenched (or more complicated as in the RBM). The next set of examples involves the planted disorder that is more relevant for applications in signal processing or statistics, where the variables X, U, V represent some signal we aim to recover from its measurements Y. Sometimes it is the low-rank matrix XX⊤/√N or UV⊤/√N that we aim to recover from its noisy measurements Y. In the literature the general planted problem can be called low-rank matrix factorization, matrix recovery, matrix denoising or matrix estimation.

##### Gaussian estimation

The most elementary example of the planted case is when the measurement channel is Gaussian as in eq. (21), and the distributions PX, PU and PV are also Gaussian, i.e.

 PX(xi) = 1√Det(2πσX)e−12(xi−μX)⊤σ−1X(xi−μX), (25) PU(ui) = 1√Det(2πσU)e−12(ui−μU)⊤σ−1U(ui−μU), (26) PV(vi) = 1√Det(2πσV)e−12(vi−μV)⊤σ−1V(vi−μV), (27)

where μX, μU, μV are the means of the distributions, σX, σU, σV are the r×r covariance matrices, with w being given by (3).

We speak about the estimation problem as Bayes-optimal Gaussian estimation if the disorder Y was generated according to

 Pout(Yij|w0ij)=eg(Yij,w0ij), (28)

where g is given by eq. (21), and

 w0ij=x0⊤ix0j/√N,orw0ij=u0i⊤v0j/√N. (29)

with x0i, u0i, and v0j being generated from the probability distributions PX, PU, PV. The goal is to estimate the matrices X0, U0, and V0 from Y.

##### Gaussian Mixture Clustering

Another example belonging to the class of problems discussed in this paper is the model for Gaussian mixture clustering. In this case the spin variables ui are such that

 PU(ui)=r∑s=1nsδ(ui−es), (30)

where r is the number of clusters, ns is the fraction of points in cluster s, and es is a unit r-dimensional vector with all components except the sth equal to zero, and the sth component equal to 1; e.g. for r=3 we have e1=(1,0,0)⊤, e2=(0,1,0)⊤, e3=(0,0,1)⊤. Having ui=es is interpreted as data point i belonging to cluster s. We have N data points.

The columns of the matrix V then represent the centroids of each of the r clusters in the M-dimensional space. The distribution PV can as an example take the Gaussian form (27) with the covariance σV being the identity and the mean μV being zero. The output channel is Gaussian as in (21). All together this means that Y collects the positions of N points in M-dimensional space that are organized in r Gaussian clusters. The goal is to estimate the centers of the clusters V and the cluster memberships U from Y.
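The generative model of this clustering setting can be sketched as follows (number of points, dimension, number of clusters and noise level are illustrative choices of this example):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, r, Delta = 400, 300, 3, 0.1  # points, dimension, clusters, noise variance

# Cluster memberships: each row of U is a one-hot indicator e_s as in eq. (30),
# here with uniform cluster weights n_s = 1/r
labels = rng.integers(0, r, size=N)
U = np.eye(r)[labels]

# Centroids: columns of V are the r cluster centers in M-dimensional space
V = rng.standard_normal((M, r))

# Observed data: Y = U V^T / sqrt(N) + Gaussian noise, as in the channel (21)
Y = U @ V.T / np.sqrt(N) + np.sqrt(Delta) * rng.standard_normal((N, M))
```

Each row of Y is then a noisy copy of its cluster's centroid (scaled by 1/√N), which is the signal that clustering algorithms must extract.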

Standard algorithms for data clustering include those based on the spectral decomposition of the matrix Y, such as principal component analysis hastie2005elements (); wasserman2013all (), or Lloyd's k-means lloyd1982least (). Works on Gaussian mixtures that are methodologically closely related to the present paper include the application of the replica method to the case of two clusters in watkin1994optimal (); barkai1994statistical (); biehl1994statistical () or the AMP algorithm of NIPS2013_5074 (). Note that for two clusters with the two centers being symmetric around the origin, the resulting Boltzmann measure of the case with randomly quenched disorder is equivalent to the Hopfield model as treated e.g. in mezard2016mean ().

Note also that there are interesting variants of Gaussian mixture clustering such as subspace clustering parsons2004subspace () where only some of the directions are relevant for the clustering. This can be modeled by a prior on the centroid vectors that has a non-zero weight of being the null vector.

The approach described in the present paper to Bayes-optimal inference in the Gaussian mixture clustering problem has been used in a work of the authors with other collaborators, presented in lesieur2016phase ().

##### Sparse PCA

Sparse Principal Component Analysis (PCA) johnstone2004sparse (); zou2006sparse () is a dimensionality reduction technique where one seeks a low-rank representation of a data matrix with additional sparsity constraints on the obtained representation. The motivation is that the sparsity helps to interpret the resulting representation. Formulated within the above setting, sparse PCA corresponds to the bipartite case (2). The variable U is considered unconstrained, as an example one often considers a Gaussian prior (26) on ui. The variable V is such that many of its matrix-elements are zero.

In the literature the sparse PCA problem was mainly considered in the rank-one case, r=1, for which a series of intriguing results was derived. The authors of johnstone2004sparse () suggested an efficient algorithm called diagonal thresholding that solves the sparse PCA problem (i.e. estimates correctly the position and the value of the non-zero elements of V) whenever the number of data samples is N≥Ck²logM amini2008high (), where k is the number of non-zeros and C is some constant. More recent works show the existence of efficient algorithms that only need N=O(k²) samples deshpande2014sparse (). For very sparse systems, i.e. small k, this is a striking improvement over the conventional PCA that would need O(M) samples. This is why sparsity linked together with PCA brought excitement into data processing methods. At the same time, this result is not as positive as it may seem, because by searching exhaustively over all positions of the non-zeros the correct support can be discovered with high probability with a number of samples N=O(klogM).
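A minimal sketch of the diagonal thresholding idea for a rank-one spiked model: the sample variances of the coordinates on the support of the sparse spike are inflated, so the k largest diagonal entries of the empirical covariance estimate the support. The sizes, sparsity and signal strength below are arbitrary choices of this illustration, not the constants of amini2008high ().

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, k, snr = 2000, 500, 10, 5.0  # samples, dimension, non-zeros, signal strength

# Sparse unit-norm spike v with k non-zero entries
support = np.sort(rng.choice(M, size=k, replace=False))
v = np.zeros(M)
v[support] = 1.0 / np.sqrt(k)

# Data: y_a = sqrt(snr) * u_a * v + standard Gaussian noise, u_a scalar Gaussian
u = rng.standard_normal(N)
Y = np.sqrt(snr) * np.outer(u, v) + rng.standard_normal((N, M))

# Diagonal thresholding: keep the k coordinates with the largest empirical
# variance, which on the support is 1 + snr * v_j^2 instead of 1
diag_cov = np.mean(Y ** 2, axis=0)
support_hat = np.sort(np.argsort(diag_cov)[-k:])

recovered = np.intersect1d(support, support_hat).size
```

At this (generous) signal strength the support is recovered; the point of the sample-complexity results quoted above is how small N can be before such diagonal statistics stop separating from the noise.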

Naively one might think that polynomial algorithms that need fewer than O(k²) samples might exist, and a considerable amount of work was devoted to their search without success. However, some works suggest that polynomial algorithms that solve the sparse PCA problem for a number of samples N≪k² perhaps do not exist. Among the most remarkable ones is the work krauthgamer2015semidefinite () showing that the SDP algorithm, which is otherwise considered rather powerful, fails in this regime. The work of berthet2013computational () goes even further, showing that if sparse PCA can be solved for N≪k² then also a problem known as the planted clique problem can be solved in a regime that has been considered algorithmically hard for already several decades.

The problem of sparse PCA is hence one of the first examples of a relatively simple to state problem that currently presents a wide barrier between computational and statistical tractability. Deeper description of the origin of this barrier is likely to shed light on our understanding of typical algorithmic complexity in a broader scope.

The above works consider the scaling when M→∞ and k is fixed (or growing slower than M). A regime natural to many applications is when k=ρM with ρ=O(1). This regime was considered in DeshpandeM14 () where it was shown that for ρ large enough an efficient algorithm that achieves the information-theoretically optimal performance exists. This immediately brings up the question of what exactly happens for smaller ρ and how the barrier described above appears as ρ→0. This question was illuminated in a work by the present authors lesieur2015phase () and will be developed further in this paper.

We consider sparse PCA in the bipartite case, with PU Gaussian (26) and PV sparse,

 PV(vi)=ρδ(vi−1)+(1−ρ)δ(vi), (31)

as corresponds to the formulation of johnstone2004sparse (); zou2006sparse (); amini2008high (); berthet2013computational () and others. This probabilistic setting of sparse PCA was referred to as the spiked Wishart model in DeshpandeM14 (), a notation that we will adopt in the present paper. This model is also equivalent to the one studied recently in monasson2015estimating () where the authors simply integrate over the Gaussian variables.

In DeshpandeM14 () the authors also considered a symmetric variant of the sparse PCA, and refer to it as the spiked Wigner model. The spiked Wigner model is closer to the planted clique problem, which can be formulated using (1) with X having many zero elements. In the present work we will consider several models for the prior distribution PX. First the Bernoulli model as in DeshpandeM14 (), where

 Bernoullimodel:PX(xi)=ρδ(xi−1)+(1−ρ)δ(xi). (32)

The spiked Bernoulli model can also be interpreted as a problem of submatrix localization where a submatrix of size ρN×ρN of the matrix Y has a larger mean than a randomly chosen submatrix. The submatrix localization is also relevant in the bipartite case, where it has many potential applications. The most striking one is in gene expression, where large-mean submatrices of the matrix of gene expressions of different patients may correspond to groups of patients having the same type of disease madeira2004biclustering (); cheng2000biclustering ().

In this paper we will also consider the spiked Rademacher-Bernoulli model with

 Rademacher−Bernoullimodel:PX(xi)=(ρ/2)δ(xi−1)+(ρ/2)δ(xi+1)+(1−ρ)δ(xi), (33)

as well as the spiked Gauss-Bernoulli

 Gauss−Bernoullimodel:PX(xi)=ρN(xi,0,1)+(1−ρ)δ(xi), (34)

where N(xi,0,1) is the Gaussian distribution of xi with zero mean and unit variance.

So far we discussed the sparse PCA problem in the case of rank one, r=1, but the case with larger rank is also interesting, especially in view of the question of how the algorithmic barrier depends on the rank. To investigate this question, in lesieur2015phase () we also considered the jointly-sparse PCA, where whole r-dimensional lines of X are zero at once, and the non-zeros are Gaussians with mean zero and covariance being the identity. Mathematically, with

 PX(xi)=ρ(2π)r/2e−x⊤x2+(1−ρ)δ(xi). (35)

Another example to consider is the independently-sparse PCA where each of the components of the lines in X is taken independently from the Gauss-Bernoulli distribution; for general r we then have

 PX(xi)=∏1≤k≤r[ρ√2πe−x2ik2+(1−ρ)δ(xik)]. (36)
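The difference between the jointly-sparse prior (35) and the independently-sparse prior (36) is easy to see by sampling from both (the sparsity ρ, rank r and sample size below are arbitrary): under (35) a fraction 1−ρ of the rows is entirely zero, while under (36) a row is entirely zero only with probability (1−ρ)^r.

```python
import numpy as np

rng = np.random.default_rng(5)
N, r, rho = 100000, 3, 0.1

# Jointly-sparse prior (35): a whole r-dimensional row is zero w.p. 1-rho,
# otherwise it is a standard Gaussian vector
joint_mask = rng.random(N) < rho
X_joint = joint_mask[:, None] * rng.standard_normal((N, r))

# Independently-sparse prior (36): every component is Gauss-Bernoulli on its own
indep_mask = rng.random((N, r)) < rho
X_indep = indep_mask * rng.standard_normal((N, r))

# Empirical fraction of all-zero rows under the two priors
frac_joint = np.mean(np.all(X_joint == 0, axis=1))
frac_indep = np.mean(np.all(X_indep == 0, axis=1))
```

With ρ=0.1 and r=3 the all-zero-row fractions concentrate around 0.9 and 0.9³≈0.729 respectively, which is the structural difference the two priors encode.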
##### Community detection

Detection of communities in networks is often modeled by the stochastic block model (SBM) where pairs of nodes get connected with a probability that depends on the indices of the groups to which the two nodes belong. Community detection is a widely studied problem, see e.g. the review fortunato2010community (). Studies of statistical and computational barriers in the SBM recently became very active in mathematics and statistics, starting perhaps with a series of statistical physics conjectures about the existence of phase transitions in the problem decelle2011inference (); decelle2011asymptotic (). The particular interest of those works is that they are set in the sparse regime where every node has O(1) neighbors.

The dense regime, where every node has $O(N)$ neighbors, is also theoretically interesting when the difference between the connection probabilities depends only weakly on the group membership. The relevant scaling is the same as in the Potts glass model studied in GrossKanter85 (); in fact, dense community detection is exactly the planted version of this Potts glass model. In the setting of the present model (1) community detection was already considered in deshpande2016asymptotic () for two symmetric groups, in lesieur2015mmse (), and in barbier2016mutual (). In the present paper we detail the results reported briefly in lesieur2015mmse () and in barbier2016mutual (). We consider the symmetric case with a general number $r$ of equal-sized groups, as well as a case with two groups of different sizes, but such that the average degree in each of the groups is the same.

To set up the dense community detection problem we consider a network with $N$ nodes. Each node $i$ belongs to a community indexed by $s_i \in \{1,\dots,r\}$. For each pair of nodes $(i,j)$ we create an edge with probability $C_{s_i s_j}$, where $C$ is an $r\times r$ matrix called the connectivity matrix. In the examples of this paper we will consider two special cases of the community detection problem.

The first example has $r$ symmetric, equally-sized groups: for each pair of nodes we create an edge between the two nodes with probability $p_{\rm in}$ if they are in the same group and with probability $p_{\rm out}$ if not:

 $C = \begin{pmatrix} p_{\rm in} & p_{\rm out} & \cdots & p_{\rm out} \\ p_{\rm out} & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & p_{\rm out} \\ p_{\rm out} & \cdots & p_{\rm out} & p_{\rm in} \end{pmatrix} = p_{\rm in} I_r + p_{\rm out}(J_r-I_r)\, ,$ (37)

where $I_r$ is the $r$-dimensional identity matrix, and $J_r$ is the $r\times r$ matrix filled with unit elements. The scaling we consider here is

 $p_{\rm out} = O(1)\, , \quad p_{\rm in}=O(1)\, ,$ (38)
 $|p_{\rm in}-p_{\rm out}| = \frac{\mu}{\sqrt{N}}\, , \quad \mu=O(1)\, ,$ (39)

so that the average degree in the graph is extensive. Note, however, that by a rather generic correspondence between diluted and dense models, which has been made rigorous recently deshpande2016asymptotic (); caltagirone2016recovering (), the results derived in this case hold even for average degrees that diverge only very mildly with $N$. The goal is to infer the group index of each node purely from the adjacency matrix of the network (up to a permutation of the group indices). This problem is transformed into the low-rank matrix factorization problem through the use of the following prior probability distribution

 $P_X(x_i)=\frac{1}{r}\sum_{s=1}^{r}\delta(x_i-e_s)\, ,$ (40)

where $e_s$ is the $r$-dimensional vector with zeros everywhere except a one at position $s$. Eq. (40) is just a special case of (30). The output channel that describes the process of creation of the graph is

 $P_{\rm out}(Y_{ij}=1\,|\,x_i^\top x_j/\sqrt{N})=p_{\rm out}+\mu\frac{x_i^\top x_j}{\sqrt{N}}\, ,$ (41)
 $P_{\rm out}(Y_{ij}=0\,|\,x_i^\top x_j/\sqrt{N})=1-p_{\rm out}-\mu\frac{x_i^\top x_j}{\sqrt{N}}\, .$ (42)

Next to the conventional Hamiltonian (22) and the Gaussian noise (21), the SBM output (41-42) is a third example of an output channel that we consider in this article. It will be used to illustrate the simplicity that arises due to the channel universality, as also considered in deshpande2016asymptotic () and lesieur2015mmse (). Here, we obtain for the output matrices

 $S_{ij}(Y_{ij}=1)=\frac{\mu}{p_{\rm out}}\, , \quad S_{ij}(Y_{ij}=0)=-\frac{\mu}{1-p_{\rm out}}\, ,$ (43)
 $R_{ij}(Y_{ij}=1)=0\, , \quad R_{ij}(Y_{ij}=0)=0\, .$ (44)

Here $\mu$ is a parameter that can be used to fix the signal-to-noise ratio.
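As an illustration, the following sketch (parameter values are our own illustrative choices) generates a dense SBM graph according to the prior (40) and channel (41)-(42), and builds the score matrix $S$ of (43)-(44); note that $S$ takes only two values, one per edge state:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 1500, 3
p_out, mu = 0.5, 2.0                       # scaling (38)-(39): |p_in - p_out| = mu/sqrt(N)

# prior (40): uniform one-hot community labels, x_i = e_{s_i}
s = rng.integers(r, size=N)
X = np.eye(r)[s]                           # N x r matrix whose rows are the vectors e_{s_i}

# channel (41)-(42): edge probability p_out + mu x_i^T x_j / sqrt(N)
P = p_out + mu * (X @ X.T) / np.sqrt(N)
U = np.triu(rng.random((N, N)) < P, 1)     # sample each pair once, no self-loops
Y = (U | U.T).astype(int)

# score matrix (43)-(44): the only statistic of the channel the algorithm needs
S = np.where(Y == 1, mu / p_out, -mu / (1 - p_out))
```

The fact that only $S$ (and $R=0$) enters the inference is a concrete instance of the output channel universality discussed in the text.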

Another example of community detection is the one with two balanced communities, i.e. communities having different sizes but the same average degree. In that setting there are two communities of sizes $\rho N$ and $(1-\rho)N$. The connectivity matrix of this model is given by

 $C=\begin{pmatrix} p_{\rm out} & p_{\rm out} \\ p_{\rm out} & p_{\rm out}\end{pmatrix}+\frac{\mu}{\sqrt{N}}\begin{pmatrix} \frac{1-\rho}{\rho} & -1 \\ -1 & \frac{\rho}{1-\rho}\end{pmatrix}\, .$ (45)

This can be modelled as symmetric matrix factorization with rank $r=1$ and the prior given as

 $P_X(x) = \rho\,\delta\!\left(x-\sqrt{\frac{1-\rho}{\rho}}\right)+(1-\rho)\,\delta\!\left(x+\sqrt{\frac{\rho}{1-\rho}}\right)\, .$ (46)

The values in (46) are chosen so that both communities have the same average degree. This means that one cannot hope to simply use the degree distribution to distinguish between the communities. The output channel here is identical to the one given in (41-42).
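This property can be checked directly: the prior (46) has zero mean, so the rank-one perturbation does not shift the average degree of either community, and it has unit second moment. A quick numerical sanity check (our own, for a few values of $\rho$):

```python
import numpy as np

# prior (46): x = +sqrt((1-rho)/rho) w.p. rho, x = -sqrt(rho/(1-rho)) w.p. 1-rho
for rho in (0.1, 0.25, 0.5):
    a = np.sqrt((1 - rho) / rho)            # value on the community of size rho*N
    b = -np.sqrt(rho / (1 - rho))           # value on the community of size (1-rho)*N
    mean = rho * a + (1 - rho) * b          # -> 0: both communities keep the same degree
    second = rho * a**2 + (1 - rho) * b**2  # -> 1: normalization of the spike strength
    assert abs(mean) < 1e-12 and abs(second - 1.0) < 1e-12
```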

A third example of community detection is locating one denser community in a dense network, as considered in montanari2015finding () (specifically the large degree limit considered in that paper). We note that thanks to the output channel universality (Sec. I.3) this case is equivalent to the spiked Bernoulli model of symmetric sparse PCA.

As a side remark we note that the community detection setting as considered here is also relevant in the bipartite matrix factorization setting where it becomes the problem of biclustering madeira2004biclustering (); cheng2000biclustering (). The analysis developed in this paper can be straightforwardly extended to the bipartite case.

### i.5 Main results and related work

The present paper is built upon two previous shorter papers by the same authors lesieur2015phase (); lesieur2015mmse (). It focuses on the study of the general type of models described by the probability measures (1) and (2). On the one hand, these represent Boltzmann measures of vectorial-spin systems on fully connected symmetric or bipartite graphs. Examples of previously studied physical models that are special cases of the setting considered here are the Sherrington-Kirkpatrick model SherringtonKirkpatrick75 (), the Hopfield model hopfield1982neural (); mezard2016mean (), and inference (not learning) in the restricted Boltzmann machine gabrie2015training (); mezard2016mean (); tubiana2016emergence (). Our work hence provides a unified replica symmetric solution and TAP equations for a generic class of prior distributions and Hamiltonians.

On the other hand, the probability distributions (1) and (2) represent the posterior probability distribution of a low-rank matrix estimation problem that finds a wide range of applications in high-dimensional statistics and data analysis. The thermodynamic limit $N\to\infty$ with the rank $r=O(1)$ was widely considered in the studies of spin glasses; in the context of low-rank matrix estimation this limit corresponds to the challenging high-dimensional regime, whereas traditional statistics considers the regime where the number of observations is much larger than the number of parameters to estimate. We focus on the analysis of the phase diagrams and phase transitions of Bayes-optimal low-rank matrix estimation. We note that because we assume the data were generated from random factors, we obtain a much tighter control of the high-dimensional behaviour, including the constant factors, than traditional bounds in statistics that aim not to assume any generative model but instead craft proofs under verifiable conditions on the observed data.

We note at this point that a methodologically closely related series of works on matrix factorization is concerned with the case of high rank, i.e. when $r=O(N)$ krzakala2013phase (); parker2014bilinear (); kabashima2014phase (). While that case also has a set of important applications (among them the learning of overcomplete dictionaries), it is different from the low-rank case considered here. The theory developed for the high-rank case requires the prior distribution to be separable component-wise. The high-rank case also does not present any known output channel universality: the details of the channel enter explicitly into the resulting state evolution. Whereas the low-rank case can be viewed as a generalization of a spin model with pairwise interactions, in the graphical model for the high-rank case the interactions involve an extensive number of variables.

No attempt is made at mathematical rigor in the present article. Contrary to the approach traditionally taken in statistics, most of the analysis in this paper is non-rigorous and relies instead on the accepted assumptions of the replica and cavity methods. It is worth mentioning, however, that for the case of Bayes-optimal inference a large part of the results of this paper were proven rigorously in a recent series of works rangan2012iterative (); javanmard2013state (); DeshpandeM14 (); krzakala2016mutual (); deshpande2016asymptotic (); barbier2016mutual (); LelargeMiolane16 (); miolane2017fundamental (). These proofs include the mutual information (related to the replica free energy) in the Bayes-optimal setting and the corresponding minimum mean-squared error (MMSE), and the rigorous establishment that the state evolution indeed describes the asymptotic evolution of the Low-RAMP algorithm. Studies out of the Bayes-optimal conditions (without the Nishimori conditions) are more involved.

It has become a tradition in related literature zdeborova2015statistical () to conjecture that the performance of the Low-RAMP algorithm cannot be improved upon by other polynomial algorithms. We do analyze here in detail the cases where Low-RAMP does not achieve the MMSE, and we remark that since effects of replica symmetry breaking need to be taken into account when evaluating the performance of the best polynomial algorithms, the conjecture of Low-RAMP optimality among polynomial algorithms deserves further detailed investigation.

This section gives a brief summary of our main results and their relation to existing work.

• Approximate Message Passing for Low-Rank matrix estimation (Low-RAMP): In section II we derive and detail the approximate message passing algorithm to estimate marginal probabilities of the probability measures (1) and (2) for general prior distribution, rank and Hamiltonian (output channel). We describe various special cases of these equations that arise due to the Nishimori conditions or due to self-averaging. In the physics literature these would be the TAP equations ThoulessAnderson77 () generalized to vectorial spins with general local magnetic fields and a generic type of pairwise interactions. The Low-RAMP equations encompass as special cases the original TAP equations for the Sherrington-Kirkpatrick model, the TAP equations for the Hopfield model MezardParisi87b (); mezard2016mean (), and those for the restricted Boltzmann machine gabrie2015training (); tramel2016inferring (); mezard2016mean (); tubiana2016emergence (). Within the context of low-rank matrix estimation, the AMP equations were discussed in rangan2012iterative (); NIPS2013_5074 (); DeshpandeM14 (); lesieur2015phase (); lesieur2015mmse (); deshpande2016asymptotic (); lesieur2016phase (). Recently the Low-RAMP algorithm was even generalized to spin variables that are not real vectors but live on compact groups perry2016message ().

AMP-type algorithms are a promising alternative to gradient descent type methods that minimize the likelihood. One of the main advantages of AMP is that it provides estimates of uncertainty, which is crucial for assessing the reliability and interpretability of the result. Compared to other Bayesian algorithms, AMP tends to be faster than Monte-Carlo based algorithms and more precise than variational mean-field based algorithms.

We distribute two open-source Julia and Matlab versions of LowRAMP at http://krzakala.github.io/LowRAMP/. We strongly encourage the reader to download, modify, and improve on it.

• In physics, message passing equations are always closely linked with the Bethe free energy, whose stationary points correspond to the fixed points of the message passing equations. In the presence of multiple fixed points it is the value of the Bethe free energy that decides which of the fixed points is the correct one. In section II.5 and appendix B we derive the Bethe free energy on a single instance of the low-rank matrix estimation problem. The form of free energy that we derive has the convenient property of being variational, in the sense that in order to find the fixed point we look for a maximum of the free energy, not for a saddle point. The corresponding free energy for the compressed sensing problem was derived in krzakala2014variational (); DBLP:journals/corr/abs-1301-6295 () and can be used to guide the iterations of the Low-RAMP algorithm DBLP:journals/corr/VilaSRKZ14 ().

• In section III we derive the general form of the state evolution (under the replica symmetric assumption) of the Low-RAMP algorithm, generalizing previous works, e.g. rangan2012iterative (); DeshpandeM14 (); lesieur2015phase (); lesieur2015mmse (). We present simplifications for the Bayes-optimal inference and for the conventional form of the Hamiltonian. We also give the corresponding expression for the free energy. We derive the state evolution and the free energy using both the cavity and the replica method.

In the Bayes-optimal setting the replica Bethe free energy is, up to a simple term, related to the mutual information, from which one can deduce the minimum information-theoretically achievable mean squared error. Specifically, the MMSE corresponds to the global maximum of the replica free energy (defined here with the opposite sign than in physics), while the performance of Low-RAMP corresponds to the maximum of the replica free energy that has the highest MSE.

We stress here that the Low-RAMP algorithm belongs to the same class of approximate Bayesian inference algorithms as generic Monte Carlo Markov chain or variational mean-field methods. Yet Low-RAMP is very particular compared to these other two, because on the class of random models considered here its performance can be analyzed exactly via the state evolution and (out of the hard region) Low-RAMP asymptotically matches the performance of the Bayes-optimal estimator. The study of AMP-type algorithms hence opens a way to put variational mean-field algorithms into a more solid theoretical framework.
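To give the simplest flavor of what such a state evolution looks like: for the rank-one Rademacher ($\pm 1$) prior and a Gaussian output channel of variance $\Delta$, the order parameter reduces to a scalar overlap $m$ iterated as $m^{t+1}=\mathbb{E}_z \tanh(m^t/\Delta+\sqrt{m^t/\Delta}\,z)$. A Monte-Carlo sketch of this recursion (parameter values and sample size are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(200_000)           # samples for the Gaussian expectation E_z

def state_evolution(Delta, T=200, m0=1e-4):
    """Scalar SE for the rank-one +-1 prior and Gaussian channel of variance Delta."""
    m = m0
    for _ in range(T):
        mu = max(m, 0.0) / Delta           # effective signal-to-noise of the scalar channel
        m = np.tanh(mu + np.sqrt(mu) * z).mean()
    return m

m_low = state_evolution(1.5)   # Delta above the transition at Delta_c = 1: overlap ~ 0
m_high = state_evolution(0.5)  # Delta below the transition: macroscopic overlap
```

Linearizing the map around $m=0$ gives $m^{t+1}\simeq m^t/\Delta$, which is how the transition at $\Delta_c=1$ appears in this example.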

• We discuss the output channel universality as known in special cases in statistical physics (the replica solution of the SK model depends only on the mean and variance of the quenched disorder, not on other details of its distribution) and in statistics deshpande2016asymptotic () (for the two-group stochastic block model). The general form of this universality was first brought to light for Bayes-optimal estimation in lesieur2015mmse () and proven in krzakala2016mutual (); in this paper we discuss this universality out of the Bayes-optimal setting.

• In section IV.1 we show that the state evolution with a Gaussian prior can be used to analyse the asymptotic performance of spectral algorithms such as PCA (symmetric case) or SVD (bipartite case), and we derive the corresponding spectral mean-squared errors and phase transitions as studied in the random matrix literature baik2005phase (). For a recent closely related discussion see perry2016optimality ().
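The spectral phase transition in question can be observed directly on a synthetic spiked matrix. In the sketch below (size and SNR are illustrative choices of ours), for SNR $\lambda>1$ the top eigenvalue detaches from the bulk edge at $2$ and sits near $\lambda+1/\lambda$, while the leading eigenvector has squared overlap close to $1-1/\lambda^2$ with the spike baik2005phase ():

```python
import numpy as np

rng = np.random.default_rng(2)
N, lam = 3000, 2.0                          # system size and spectral SNR (illustrative)

v = rng.choice([-1.0, 1.0], size=N)         # Rademacher spike, |v|^2 = N
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2 * N)              # Wigner noise, bulk spectrum in [-2, 2]
Y = (lam / N) * np.outer(v, v) + W          # low-rank perturbation of a random matrix

eigval, eigvec = np.linalg.eigh(Y)          # eigenvalues in ascending order
top = eigval[-1]                            # close to lam + 1/lam for lam > 1
overlap = (eigvec[:, -1] @ v) ** 2 / N      # close to 1 - 1/lam^2 for lam > 1
```

For $\lambda<1$ the top eigenvalue sticks to the bulk edge and the overlap vanishes, which is the spectral counterpart of the detectability threshold.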

• In sections IV.2 and IV.3 we discuss the typology of phase transitions and phases that arise in Bayes-optimal low-rank matrix estimation. We provide sufficient criteria for the existence of phases where estimation better than random guessing from the prior distribution is not information-theoretically possible. We analyze the linear stability of the fixed point of the Low-RAMP algorithm related to this phase of undetectability, and conclude that the threshold where the Low-RAMP algorithm starts to perform better than random guessing from the prior agrees with the spectral threshold known in the literature on low-rank perturbations of random matrices. We also provide sufficient criteria for the existence of first order phase transitions, related to the existence of phases where information-theoretically optimal estimation is not achievable by existing algorithms, and analyze the three thresholds related to the first order phase transition.

• In section V.1 we give a number of examples of phase diagrams for the following models: the rank-one symmetric case with the prior distribution being Bernoulli, Rademacher-Bernoulli, or Gauss-Bernoulli, and the prior corresponding to balanced 2-groups. For generic rank we give the phase diagram for the symmetric jointly-sparse PCA and for the symmetric community detection.

• Section V.2 and appendix D are devoted to the small-$\rho$ analysis of the above models. This is motivated by the fact that in most existing literature with sparsity constraints the number of non-zeros is considered to be a vanishing fraction of the system size. In our analysis the number of non-zeros is a finite fraction of the system size; we call this the regime of linear sparsity. We investigate whether the limit $\rho\to 0$ corresponds to previously studied cases. Concerning the information-theoretically optimal performance and the related threshold, our small-$\rho$ limit agrees with the results known for sub-linear sparsity. Concerning the performance of efficient algorithms, we conclude from our analysis that for linear sparsity, in the leading order, the existing algorithms do not beat the threshold of the naive spectral method. This corresponds to the known results for the planted dense subgraph problem (even when the size of the planted subgraph is sub-linear). However, for the sparse PCA problem with sub-linear sparsity, algorithms such as covariance thresholding are known to beat the naive spectral threshold deshpande2014sparse (). In the regime of linear sparsity we do not recover such behaviour, suggesting that for linear sparsity efficient algorithms that take advantage of the sparsity do not exist.

• In section V.3 and appendix E we discuss analytical results that we can obtain for the community detection and jointly-sparse PCA models in the limit of large rank $r$. These large rank results match the rigorous bounds derived for these problems in banks2016information ().

## Ii Low-rank approximate message passing, Low-RAMP

In this section we derive the approximate message passing algorithm for the low-rank matrix estimation problems, i.e. for the estimation of marginals of the probability distributions (1) and (2). We detail the derivation for the symmetric case (1) and state the resulting equations for the bipartite case (2). We derive the algorithm in several stages, aiming for a tractable algorithm that under certain conditions (replica symmetry, and absence of phase coexistence related to a first order phase transition) estimates the marginal probabilities of the probability distribution (1) exactly in the large size limit. We start with the belief propagation (BP) algorithm, as explained e.g. in Yedidia:2003 ().

It should be noted at this point that the BP algorithm is guaranteed to be exact for a general system only when the underlying graphical model is a tree. Yet it is often applied to locally tree-like graphical models, where its usage is justified when correlations decay fast enough so that the far-away boundary of a rooted tree typically does not influence the root. This is formalized via point-to-set correlations montanari2006rigorous (). In the present case the graphical model is fully connected and hence a derivation based on belief propagation might seem puzzling at first sight. The main assumption made in belief propagation is that the incoming messages are probabilistically independent when conditioned on the value of the root. For this to be true the underlying graphical model does not have to be tree-like; it is also sufficient that the interactions are weak. In the present case the factor $1/\sqrt{N}$ in the Hamiltonian (1-2) makes the interactions sufficiently weak so that the assumption of independence of incoming messages is correct in the large size limit. This is quite a common situation, known from the cavity method derivation of the solution of the Sherrington-Kirkpatrick spin glass model mezard1986sk (). The resulting equations can also be derived by other means, e.g. the Plefka expansion 0305-4470-15-6-035 () that we use later to derive the Bethe free energy. We prefer the derivation from BP, because it is the only one we know of that provides time indices that lead to a converging algorithm (note that the TAP equations with the "naive" time indices were notoriously known not to converge even in the paramagnetic phase).

Dealing with continuous variables makes the BP algorithm unsuitable for direct implementation, as one would need to represent whole probability distributions. One instead takes advantage of the fact that BP assumes incoming messages to be statistically independent: since there are many incoming messages, by the central limit theorem only the means and variances of the messages need to be tracked. The resulting algorithm is called relaxed belief propagation (in analogy to a similar algorithm for sparse linear estimation rangan2010estimation ()).

The next step is analogous to what is done for the TAP equations in the SK model ThoulessAnderson77 (): one realizes that the algorithm can be further simplified so that instead of dealing with one message for every directed edge, we work only with node-dependent quantities. This gives rise to the so-called Onsager terms. Keeping track of the correct time indices under iteration zdeborova2015statistical () in order to preserve convergence of the iterative scheme, we end up with the Low-RAMP algorithm, which was derived in various special forms in several previous works. In the early literature on the Hopfield model these equations were written in various forms MezardParisi87b () without paying close attention to the time iteration indices, which causes convergence problems zdeborova2015statistical (). For a more recent revival of interest in this approach, motivated mainly by the application of similar algorithmic schemes to the problem of linear estimation DonohoMaleki09 (); bayati2011dynamics () and by the need for new algorithms for constrained low-rank estimation, see rangan2012iterative (); NIPS2013_5074 (); DeshpandeM14 (); lesieur2015phase (); mezard2016mean (). We finish this section with simplifications that can be made due to self-averaging, due to the Nishimori conditions in the Bayes-optimal case, or due to the particular form of the Hamiltonian.
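To give a concrete feel for the resulting iteration, here is a minimal rank-one Low-RAMP sketch for the Gaussian output channel of variance $\Delta$ and the $\pm 1$ (Rademacher) prior, for which the Bayes denoiser is simply $\tanh$, the score matrix is $S=Y/\Delta$, and the Onsager term uses the self-averaged variances. Parameter values are our own illustrative choices; for the general-prior, general-rank algorithm see the derivation below and the LowRAMP package:

```python
import numpy as np

rng = np.random.default_rng(3)
N, Delta, T = 2000, 0.5, 50                  # Delta < 1: above the spectral threshold

# ground truth and Gaussian-channel observation Y = x x^T / sqrt(N) + noise
x = rng.choice([-1.0, 1.0], size=N)
A = rng.standard_normal((N, N))
Y = np.outer(x, x) / np.sqrt(N) + np.sqrt(Delta) * (A + A.T) / np.sqrt(2)

S = Y / Delta                                # score matrix of the Gaussian channel
m = 1e-3 * rng.standard_normal(N)            # small random initialization of the means
m_old = np.zeros(N)

for _ in range(T):
    sigma = 1.0 - m**2                       # posterior variances for the +-1 prior
    # local field with the Onsager correction (the TAP reaction term)
    B = S @ m / np.sqrt(N) - (sigma.mean() / Delta) * m_old
    m_old = m
    m = np.tanh(B)                           # Bayes denoiser for the Rademacher prior

overlap = abs(m @ x) / N                     # matches the state-evolution prediction
```

Note the time indices: the Onsager term multiplies the estimate from the previous iteration, which is precisely the bookkeeping discussed above.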

We note that the Low-RAMP algorithm is related to but different from the variational mean-field equations for this problem. To make the difference apparent, we state the variational mean-field algorithm in the notation of this paper in Appendix A.

### ii.1 From BP to relaxed BP

To write the BP equations for the probability measure (1) we represent it by a fully connected factor graph, Fig. 1, where every index $i$ corresponds to a variable node $x_i$, every pair $(ij)$ corresponds to a pairwise factor node, and every variable node is moreover connected to a single-site factor node representing the prior $P_X$.

We introduce the messages $n_{i\to ij}(x_i)$ and $\tilde n_{ki\to i}(x_i)$, respectively the $r$-dimensional message from a variable node to a factor node and from a factor node to a variable node. The belief propagation equations then are

 $\tilde n_{ki\to i}(x_i) = \frac{1}{Z^{ki\to i}}\int {\rm d}x_k\, n_{k\to ki}(x_k)\, e^{g(Y_{ki},\, x_k^\top x_i/\sqrt{N})}\, ,$ (47)
 $n_{i\to ij}(x_i) = \frac{P_X(x_i)}{Z^{i\to ij}}\prod_{1\le k\le N,\, k\neq i,j} \tilde n_{ki\to i}(x_i)\, .$ (48)

The factor graph from which these messages are derived is given in Fig. 1. The most important assumption made in the BP equations is that the incoming messages are, conditionally on the value of $x_i$, independent of each other, thus allowing us to write the product in eq. (48).

The message in (47) can be expanded as in (20) around $x_k^\top x_i/\sqrt{N}=0$ thanks to the $1/\sqrt{N}$ factor. One gets

 $\tilde n_{ki\to i}(x_i)=\frac{e^{g(Y_{ki},0)}}{Z^{ki\to i}}\int {\rm d}x_k\, n_{k\to ki}(x_k)\left[1+S_{ki}\frac{x_k^\top x_i}{\sqrt{N}}+\frac{x_i^\top x_k x_k^\top x_i}{2N}R_{ki}+O\!\left(\frac{1}{N^{3/2}}\right)\right]\, ,$ (49)

where the matrices $S$ and $R$ were defined in (18-19). One then defines the mean ($r$-dimensional vector) and covariance matrix (of size