Bilinear Generalized Approximate Message Passing
Abstract
We extend the generalized approximate message passing (GAMP) approach, originally proposed for high-dimensional generalized-linear regression in the context of compressive sensing, to the generalized-bilinear case, which enables its application to matrix completion, robust PCA, dictionary learning, and related matrix-factorization problems. In the first part of the paper, we derive our Bilinear GAMP (BiG-AMP) algorithm as an approximation of the sum-product belief propagation algorithm in the high-dimensional limit, where central-limit-theorem arguments and Taylor-series approximations apply, and under the assumption of statistically independent matrix entries with known priors. In addition, we propose an adaptive damping mechanism that aids convergence under finite problem sizes, an expectation-maximization (EM)-based method to automatically tune the parameters of the assumed priors, and two rank-selection strategies. In the second part of the paper, we discuss the specializations of EM-BiG-AMP to the problems of matrix completion, robust PCA, and dictionary learning, and present the results of an extensive empirical study comparing EM-BiG-AMP to state-of-the-art algorithms on each problem. Our numerical results, using both synthetic and real-world datasets, demonstrate that EM-BiG-AMP yields excellent reconstruction accuracy (often best in class) while maintaining competitive runtimes and avoiding the need to tune algorithmic parameters.
I Introduction
In this work, we present a new algorithmic framework for the following generalized bilinear inference problem: estimate the matrices $\mathbf{A} \in \mathbb{R}^{M\times N}$ and $\mathbf{X} \in \mathbb{R}^{N\times L}$ from a matrix observation $\mathbf{Y} \in \mathbb{R}^{M\times L}$ that is statistically coupled to their product, $\mathbf{Z} = \mathbf{A}\mathbf{X}$. In doing so, we treat $\mathbf{A}$ and $\mathbf{X}$ as realizations of random matrices $\mathsf{A}$ and $\mathsf{X}$ with known separable pdfs (or pmfs in the case of discrete models), i.e.,

(1) $p_{\mathsf{A}}(\mathbf{A}) = \prod_{m=1}^{M} \prod_{n=1}^{N} p_{\mathsf{a}_{mn}}(a_{mn})$

(2) $p_{\mathsf{X}}(\mathbf{X}) = \prod_{n=1}^{N} \prod_{l=1}^{L} p_{\mathsf{x}_{nl}}(x_{nl})$

and we likewise assume that the likelihood function of $\mathbf{Z}$ is known and separable, i.e.,

(3) $p_{\mathsf{Y}|\mathsf{Z}}(\mathbf{Y} \mid \mathbf{Z}) = \prod_{m=1}^{M} \prod_{l=1}^{L} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}(y_{ml} \mid z_{ml}).$
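As a concrete illustration of the separable model (1)-(3), the following sketch draws $\mathbf{A}$ and $\mathbf{X}$ entry-wise from illustrative priors (i.i.d. Gaussian for $\mathbf{A}$, Bernoulli-Gaussian for $\mathbf{X}$) and observes $\mathbf{Z}=\mathbf{A}\mathbf{X}$ through an AWGN likelihood; the dimensions, priors, and noise level here are assumptions chosen for the example, not prescriptions of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 50, 10, 40      # A is M x N, X is N x L (sizes chosen arbitrarily)
nu_w = 0.01               # AWGN variance, an assumption for this example

# Separable prior on A: i.i.d. Gaussian entries, as in (1).
A = rng.normal(0.0, 1.0, size=(M, N))
# Separable prior on X: i.i.d. Bernoulli-Gaussian (sparse) entries, as in (2).
support = rng.random((N, L)) < 0.3
X = support * rng.normal(0.0, 1.0, size=(N, L))
# Separable likelihood, as in (3): here p(y_ml | z_ml) = N(y_ml; z_ml, nu_w).
Z = A @ X
Y = Z + rng.normal(0.0, np.sqrt(nu_w), size=(M, L))
```

Note that $\mathbf{Z}$ has rank at most $N = 10$, the low-rank structure exploited by the matrix-completion and robust-PCA applications below.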
Recently, various special cases of this problem have attracted intense interest from the research community, e.g.,

Matrix Completion: In this problem, one observes a few (possibly noise-corrupted) entries of a low-rank matrix and the goal is to infer the missing entries. In our framework, $\mathbf{Z} = \mathbf{A}\mathbf{X}$ would represent the complete low-rank matrix (with tall $\mathbf{A}$ and wide $\mathbf{X}$) and $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ the observation mechanism, which would be (partially) informative about $z_{ml}$ at the observed entries $(m,l) \in \Omega$ and non-informative at the missing entries $(m,l) \notin \Omega$.

Robust PCA: Here, the objective is to recover a low-rank matrix (or its principal components) observed in the presence of noise and sparse outliers. In our framework, $\mathbf{Z} = \mathbf{A}\mathbf{X}$ could again represent the low-rank matrix, and $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ the noise-and-outlier-corrupted observation mechanism. Alternatively, the factorization itself could capture the outliers, as described in the sequel.

Dictionary Learning: Here, the objective is to learn a dictionary $\mathbf{A}$ for which there exists a sparse data representation $\mathbf{X}$ such that $\mathbf{A}\mathbf{X}$ closely matches the observed data $\mathbf{Y}$. In our framework, $p_{\mathsf{x}_{nl}}$ would be chosen to induce sparsity, $\mathbf{Z} = \mathbf{A}\mathbf{X}$ would represent the noiseless observations, and $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ would model the (possibly noisy) observation mechanism.
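To make the role of the separable likelihood concrete in the applications above, the sketch below writes per-entry log-likelihoods $\log p(y_{ml} \mid z_{ml})$ for two of the cases; the function names, mixture parameterization, and noise variances are illustrative assumptions rather than the paper's prescribed choices (dictionary learning would typically reuse the plain AWGN case).

```python
import numpy as np

def loglike_completion(y, z, nu_w, observed):
    """Matrix completion: AWGN at observed entries; a constant (here 0)
    non-informative log-likelihood at missing entries."""
    ll = -0.5 * np.log(2 * np.pi * nu_w) - 0.5 * (y - z) ** 2 / nu_w
    return np.where(observed, ll, 0.0)

def loglike_robust_pca(y, z, nu_w, nu_out, lam):
    """Robust PCA: a two-component Gaussian mixture -- small-variance noise
    with probability (1 - lam), large-variance outlier with probability lam."""
    small = (1 - lam) * np.exp(-0.5 * (y - z) ** 2 / nu_w) / np.sqrt(2 * np.pi * nu_w)
    out = lam * np.exp(-0.5 * (y - z) ** 2 / nu_out) / np.sqrt(2 * np.pi * nu_out)
    return np.log(small + out)
```

In both cases the likelihood factors over entries $(m,l)$, exactly as required by (3).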
While a plethora of approaches to these problems have been proposed based on optimization techniques (e.g., [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), greedy methods (e.g., [15, 16, 17, 18, 19]), Bayesian sampling methods (e.g., [20, 21]), variational methods (e.g., [22, 23, 24, 25, 26]), and discrete message passing (e.g., [27]), ours is based on the Approximate Message Passing (AMP) framework, an instance of loopy belief propagation (LBP) [28] that was recently developed to tackle linear [29, 30, 31] and generalized-linear [32] inference problems encountered in the context of compressive sensing (CS). In the generalized-linear CS problem, one estimates $\mathbf{x}$ from observations $\mathbf{y}$ that are statistically coupled to the transform outputs $\mathbf{z} = \mathbf{\Phi}\mathbf{x}$ through a separable likelihood function $p_{\mathsf{y}|\mathsf{z}}(\mathbf{y} \mid \mathbf{z})$, where in this case the transform $\mathbf{\Phi}$ is fixed and known.
In the context of CS, the AMP framework yields algorithms with remarkable properties: i) solution trajectories that, in the large-system limit (i.e., as $M, N \to \infty$ with $M/N$ fixed, under i.i.d. sub-Gaussian $\mathbf{\Phi}$) are governed by a state evolution whose fixed points (when unique) yield the true posterior means [33, 34] and ii) a low implementation complexity (i.e., dominated by one multiplication with $\mathbf{\Phi}$ and $\mathbf{\Phi}^{\mathsf{T}}$ per iteration, and relatively few iterations) [31]. Thus, a natural question is whether the AMP framework can be successfully applied to the generalized bilinear problem described earlier.
In this manuscript, we propose an AMP-based approach to generalized bilinear inference that we henceforth refer to as Bilinear Generalized AMP (BiG-AMP), and we uncover special cases under which the general approach can be simplified. In addition, we propose an adaptive damping [35] mechanism, an expectation-maximization (EM)-based [36] method of tuning the parameters of $p_{\mathsf{x}_{nl}}$, $p_{\mathsf{a}_{mn}}$, and $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ (in case they are unknown), and methods to select the rank $N$ (in case it is unknown). In the case that $p_{\mathsf{x}_{nl}}$, $p_{\mathsf{a}_{mn}}$, and/or $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ are completely unknown, they can be modeled as Gaussian mixtures with mean/variance/weight parameters learned via EM [37]. Finally, we present a detailed numerical investigation of BiG-AMP applied to matrix completion, robust PCA, and dictionary learning. Our empirical results show that BiG-AMP yields an excellent combination of estimation accuracy and runtime when compared to existing state-of-the-art algorithms for each application.
Although the AMP methodology is itself restricted to separable known pdfs (1)-(3), the results of Part II suggest that this limitation is not an issue for many practical problems of interest. However, in problems where the separability assumption is too constraining, it can be relaxed through the use of hidden (coupling) variables, as originally proposed in the context of “turbo-AMP” [38] and applied to BiG-AMP in [39]. Due to space limitations, however, this approach will not be discussed here. Finally, although we focus on real-valued random variables, all of the methodology described in this work can be easily extended to circularly symmetric complex-valued random variables.
We now discuss related work. One possibility of applying AMP methods to matrix completion was suggested by Montanari in [31, Sec. 9.7.3], but the approach described there differs from BiG-AMP in that it was i) constructed from a factor graph with vector-valued variables and ii) restricted to the (incomplete) additive white Gaussian noise (AWGN) observation model. Moreover, no concrete algorithm nor performance evaluation was reported. Since we first reported on BiG-AMP in [1, 2], Rangan and Fletcher [40] proposed an AMP-based approach for the estimation of rank-one matrices from AWGN-corrupted observations, and showed that it can be characterized by a state evolution in the large-system limit. More recently, Krzakala, Mézard, and Zdeborová [41] proposed an AMP-based approach to blind calibration and dictionary learning in AWGN that bears similarity to a special case of BiG-AMP, and derived a state evolution using the cavity method. Their method, however, was not numerically successful in solving dictionary learning problems [41]. The BiG-AMP algorithm that we derive here is a generalization of those in [40, 41] in that it handles generalized bilinear observations rather than AWGN-corrupted ones. Moreover, our work is the first to detail adaptive damping, parameter tuning, and rank-selection mechanisms for AMP-based bilinear inference, and it is the first to present an in-depth numerical investigation involving both synthetic and real-world datasets. An application/extension of the BiG-AMP algorithm described here to hyperspectral unmixing (an instance of nonnegative matrix factorization) was recently proposed in [39].
The remainder of the document is organized as follows. Section II derives the BiG-AMP algorithm, and Sec. III presents several special-case simplifications of BiG-AMP. Section IV describes the adaptive damping mechanism, and Sec. V the EM-based tuning of prior parameters and selection of rank $N$. Application-specific issues and numerical results demonstrating the efficacy of our approach for matrix completion, robust PCA, and dictionary learning are discussed in Sections VI-VIII, respectively, and concluding remarks are offered in Sec. IX.
Notation: Throughout, we use sans-serif font (e.g., $\mathsf{x}$) for random variables and serif font (e.g., $x$) otherwise. We use boldface capital letters (e.g., $\mathsf{X}$ and $\mathbf{X}$) for matrices, boldface small letters (e.g., $\mathsf{x}$ and $\mathbf{x}$) for vectors, and non-bold small letters (e.g., $\mathsf{x}$ and $x$) for scalars. We then use $p_{\mathsf{x}}(x)$ to denote the pdf of random quantity $\mathsf{x}$, and $\mathcal{N}(x; \hat{x}, \nu^x)$ to denote the Gaussian pdf for a scalar random variable with mean $\hat{x}$ and variance $\nu^x$. Also, we use $\mathrm{E}\{\mathsf{x}\}$ and $\mathrm{var}\{\mathsf{x}\}$ to denote the mean and variance of $\mathsf{x}$, respectively, and $D(p_1 \| p_2)$ for the Kullback-Leibler (KL) divergence between pdfs $p_1$ and $p_2$. For a matrix $\mathbf{X}$, we use $x_{nl} = [\mathbf{X}]_{nl}$ to denote the entry in the $n$th row and $l$th column, $\|\mathbf{X}\|_F$ to denote the Frobenius norm, and $\mathbf{X}^{\mathsf{T}}$ to denote transpose. Similarly, we use $x_n = [\mathbf{x}]_n$ to denote the $n$th entry in vector $\mathbf{x}$ and $\|\mathbf{x}\|_p$ to denote the $\ell_p$ norm.
II Bilinear Generalized AMP
II-A Problem Formulation
For the statistical model (1)-(3), the posterior distribution is

(4) $p_{\mathsf{X},\mathsf{A}|\mathsf{Y}}(\mathbf{X}, \mathbf{A} \mid \mathbf{Y}) \propto p_{\mathsf{Y}|\mathsf{X},\mathsf{A}}(\mathbf{Y} \mid \mathbf{X}, \mathbf{A})\, p_{\mathsf{X}}(\mathbf{X})\, p_{\mathsf{A}}(\mathbf{A})$

(5) $= p_{\mathsf{Y}|\mathsf{Z}}(\mathbf{Y} \mid \mathbf{A}\mathbf{X})\, p_{\mathsf{X}}(\mathbf{X})\, p_{\mathsf{A}}(\mathbf{A})$

(6) $= \Big[\prod_{m=1}^{M}\prod_{l=1}^{L} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}\big(y_{ml} \,\big|\, \textstyle\sum_{k=1}^{N} a_{mk}x_{kl}\big)\Big] \Big[\prod_{n=1}^{N}\prod_{l=1}^{L} p_{\mathsf{x}_{nl}}(x_{nl})\Big] \Big[\prod_{m=1}^{M}\prod_{n=1}^{N} p_{\mathsf{a}_{mn}}(a_{mn})\Big]$

where (4) employs Bayes’ rule and $\propto$ denotes equality up to a constant scale factor.
The posterior distribution can be represented with a factor graph, as depicted in Fig. 1. There, the factors of the posterior from (6) are represented by “factor nodes” that appear as black boxes, and the random variables $\{\mathsf{x}_{nl}\}$ and $\{\mathsf{a}_{mn}\}$ are represented by “variable nodes” that appear as white circles. Each variable node is connected to every factor node in which that variable appears. The observed data $\{y_{ml}\}$ are treated as parameters of the $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ factor nodes in the middle of the graph, and not as random variables. The structure of Fig. 1 becomes intuitive when recalling that $\mathbf{Z} = \mathbf{A}\mathbf{X}$ implies $z_{ml} = \sum_{k=1}^{N} a_{mk} x_{kl}$.
II-B Loopy Belief Propagation
In this work, we aim to compute minimum mean-squared error (MMSE) estimates of $\mathbf{X}$ and $\mathbf{A}$, i.e., the means^1 of the marginal posteriors $p_{\mathsf{x}_{nl}|\mathsf{Y}}(\cdot \mid \mathbf{Y})$ and $p_{\mathsf{a}_{mn}|\mathsf{Y}}(\cdot \mid \mathbf{Y})$, for all pairs $(n,l)$ and $(m,n)$. Although exact computation of these quantities is generally prohibitive, they can be efficiently approximated using loopy belief propagation (LBP) [28].

^1 Another worthwhile objective could be to compute the joint MAP estimate $\arg\max_{\mathbf{X},\mathbf{A}} p_{\mathsf{X},\mathsf{A}|\mathsf{Y}}(\mathbf{X}, \mathbf{A} \mid \mathbf{Y})$; we leave this to future work.
In LBP, beliefs about the random variables (in the form of pdfs or log-pdfs) are propagated among the nodes of the factor graph until they converge. The standard way to compute these beliefs, known as the sum-product algorithm (SPA) [42, 43], stipulates that the belief emitted by a variable node along a given edge of the graph is computed as the product of the incoming beliefs from all other edges, whereas the belief emitted by a factor node along a given edge is computed as the integral of the product of the factor associated with that node and the incoming beliefs on all other edges. The product of all beliefs impinging on a given variable node yields the posterior pdf for that variable. In cases where the factor graph has no loops, exact marginal posteriors result from two (i.e., forward and backward) passes of the SPA [42, 43]. For loopy factor graphs, exact inference is in general NP-hard [44] and so LBP does not guarantee correct posteriors. That said, LBP has shown state-of-the-art performance in many applications, such as inference on Markov random fields [45], turbo decoding [46], LDPC decoding [47], multiuser detection [48], and compressive sensing [29, 30, 32, 33, 34].
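As a minimal illustration of the SPA rules just described (a toy example, not taken from the paper), consider a loop-free graph with two discrete variables joined by one pairwise factor; a single factor-to-variable message pass reproduces the exact marginal obtained by brute-force summation over the joint.

```python
import numpy as np

g1 = np.array([0.7, 0.3])       # prior factor on discrete variable x1
g2 = np.array([0.4, 0.6])       # prior factor on discrete variable x2
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # pairwise factor f(x1, x2)

# Factor-to-variable message to x2: sum out x1 against its incoming belief.
msg_f_to_x2 = (g1[:, None] * f).sum(axis=0)
# Posterior marginal of x2: product of all incoming beliefs, normalized.
post_x2 = g2 * msg_f_to_x2
post_x2 /= post_x2.sum()

# Brute-force marginal from the exact joint, for comparison.
joint = g1[:, None] * f * g2[None, :]
exact_x2 = joint.sum(axis=0) / joint.sum()
```

On this tree the two agree exactly; on loopy graphs such as Fig. 1, iterating the same rules yields only approximate marginals.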
In high-dimensional inference problems, exact implementation of the SPA is impractical, motivating approximations of the SPA. A notable example is the generalized approximate message passing (GAMP) algorithm, developed in [32] to solve the generalized CS problem, which exploits the “blessings of dimensionality” that arise when $\mathbf{\Phi}$ is sufficiently large and dense, and which was rigorously analyzed in [34]. In the sequel, we derive the BiG-AMP algorithm for generalized bilinear inference, which employs GAMP-like approximations to the SPA on the factor graph in Fig. 1. As we shall see, the approximations are primarily based on central-limit-theorem (CLT) and Taylor-series arguments.
II-C Sum-Product Algorithm
In our formulation of the SPA, messages take the form of log-pdfs with arbitrary constant offsets. For example, the iteration-$t$ message $\Delta^x_{m\to nl}(t, \cdot)$ (where $t$ denotes the iteration index) can be converted to the pdf $\frac{1}{C}\exp(\Delta^x_{m\to nl}(t, x_{nl}))$, where the choice of scale factor $C = \int_{x_{nl}} \exp(\Delta^x_{m\to nl}(t, x_{nl}))$ ensures that the pdf integrates to one. Four types of message will be used, as specified in Table I. We also find it convenient to express the (iteration-$t$ SPA-approximated) posterior pdfs of $\mathsf{x}_{nl}$ and $\mathsf{a}_{mn}$ in the log domain as $\Delta^x_{nl}(t, \cdot)$ and $\Delta^a_{mn}(t, \cdot)$, respectively, again with arbitrary constant offsets.
$\Delta^x_{m\to nl}(t, \cdot)$: SPA message from node $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ to node $\mathsf{x}_{nl}$
$\Delta^x_{m\leftarrow nl}(t, \cdot)$: SPA message from node $\mathsf{x}_{nl}$ to node $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$
$\Delta^a_{l\to mn}(t, \cdot)$: SPA message from node $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$ to node $\mathsf{a}_{mn}$
$\Delta^a_{l\leftarrow mn}(t, \cdot)$: SPA message from node $\mathsf{a}_{mn}$ to node $p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}$
$\Delta^x_{nl}(t, \cdot)$: SPA-approximated log posterior pdf of $\mathsf{x}_{nl}$
$\Delta^a_{mn}(t, \cdot)$: SPA-approximated log posterior pdf of $\mathsf{a}_{mn}$
Applying the SPA to the factor graph in Fig. 1, we arrive at the following update rules for the four messages in Table I.
(7) $\Delta^x_{m\to nl}(t, x_{nl}) = \mathrm{const} + \log \int_{\{a_{mr}\}_{r=1}^{N}, \{x_{rl}\}_{r\neq n}} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}\big(y_{ml} \,\big|\, \textstyle\sum_{k} a_{mk}x_{kl}\big) \prod_{r\neq n} \exp\big(\Delta^x_{m\leftarrow rl}(t, x_{rl})\big) \prod_{r=1}^{N} \exp\big(\Delta^a_{l\leftarrow mr}(t, a_{mr})\big)$

(8) $\Delta^x_{m\leftarrow nl}(t{+}1, x_{nl}) = \mathrm{const} + \log p_{\mathsf{x}_{nl}}(x_{nl}) + \sum_{k\neq m} \Delta^x_{k\to nl}(t, x_{nl})$

(9) $\Delta^a_{l\to mn}(t, a_{mn}) = \mathrm{const} + \log \int_{\{a_{mr}\}_{r\neq n}, \{x_{rl}\}_{r=1}^{N}} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}\big(y_{ml} \,\big|\, \textstyle\sum_{k} a_{mk}x_{kl}\big) \prod_{r=1}^{N} \exp\big(\Delta^x_{m\leftarrow rl}(t, x_{rl})\big) \prod_{r\neq n} \exp\big(\Delta^a_{l\leftarrow mr}(t, a_{mr})\big)$

(10) $\Delta^a_{l\leftarrow mn}(t{+}1, a_{mn}) = \mathrm{const} + \log p_{\mathsf{a}_{mn}}(a_{mn}) + \sum_{k\neq l} \Delta^a_{k\to mn}(t, a_{mn})$
where const denotes an arbitrary constant (w.r.t. $x_{nl}$ in (7) and (8), and w.r.t. $a_{mn}$ in (9) and (10)). In the sequel, we denote the mean and variance of the pdf $\frac{1}{C}\exp(\Delta^x_{m\leftarrow nl}(t, \cdot))$ by $\hat{x}_{m,nl}(t)$ and $\nu^x_{m,nl}(t)$, respectively, and we denote the mean and variance of $\frac{1}{C}\exp(\Delta^a_{l\leftarrow mn}(t, \cdot))$ by $\hat{a}_{l,mn}(t)$ and $\nu^a_{l,mn}(t)$. For the log-posteriors, the SPA implies
(11) $\Delta^x_{nl}(t{+}1, x_{nl}) = \mathrm{const} + \log p_{\mathsf{x}_{nl}}(x_{nl}) + \sum_{m=1}^{M} \Delta^x_{m\to nl}(t, x_{nl})$

(12) $\Delta^a_{mn}(t{+}1, a_{mn}) = \mathrm{const} + \log p_{\mathsf{a}_{mn}}(a_{mn}) + \sum_{l=1}^{L} \Delta^a_{l\to mn}(t, a_{mn})$
and we denote the mean and variance of $\frac{1}{C}\exp(\Delta^x_{nl}(t, \cdot))$ by $\hat{x}_{nl}(t)$ and $\nu^x_{nl}(t)$, and the mean and variance of $\frac{1}{C}\exp(\Delta^a_{mn}(t, \cdot))$ by $\hat{a}_{mn}(t)$ and $\nu^a_{mn}(t)$.
II-D Approximated Factor-to-Variable Messages
We now apply AMP approximations to the SPA updates (7)-(12). As we shall see, the approximations are based primarily on central-limit-theorem (CLT) and Taylor-series arguments that become exact in the large-system limit, where $M, L, N \to \infty$ with fixed ratios $M/N$ and $L/N$. (Due to the use of finite $M, L, N$ in practice, we still regard them as approximations.) In particular, our derivation will neglect terms that vanish relative to others as $N \to \infty$, which requires that we establish certain scaling conventions. First, we assume w.l.o.g.^2 that $x_{nl}$ and $z_{ml}$ scale as $O(1)$, i.e., that the magnitudes of these elements do not change as $N \to \infty$. In this case, the relationship $z_{ml} = \sum_{k=1}^{N} a_{mk} x_{kl}$ implies that $a_{mn}$ must scale as $O(1/\sqrt{N})$. These scalings are assumed to hold for random variables $\mathsf{x}_{nl}$, $\mathsf{a}_{mn}$, and $\mathsf{z}_{ml}$ distributed according to the prior pdfs, according to the pdfs corresponding to the SPA messages (7)-(10), and according to the pdfs corresponding to the SPA posterior approximations (11)-(12). These assumptions lead straightforwardly to the scalings of the means and variances $\hat{x}_{m,nl}(t)$, $\nu^x_{m,nl}(t)$, $\hat{a}_{l,mn}(t)$, $\nu^a_{l,mn}(t)$, $\hat{x}_{nl}(t)$, $\nu^x_{nl}(t)$, $\hat{a}_{mn}(t)$, and $\nu^a_{mn}(t)$ specified in Table II. Furthermore, because $\Delta^x_{nl}(t, \cdot)$ and $\Delta^x_{m\leftarrow nl}(t, \cdot)$ differ by only one term out of $M$, it is reasonable to assume [31, 32] that the corresponding differences in means and variances, $\hat{x}_{nl}(t) - \hat{x}_{m,nl}(t)$ and $\nu^x_{nl}(t) - \nu^x_{m,nl}(t)$, are both $O(1/\sqrt{M})$, which then implies that $\hat{x}^2_{nl}(t) - \hat{x}^2_{m,nl}(t)$ is also $O(1/\sqrt{M})$. Similarly, because $\Delta^a_{mn}(t, \cdot)$ and $\Delta^a_{l\leftarrow mn}(t, \cdot)$ differ by only one term out of $L$, where $\hat{a}_{mn}(t)$ and $\hat{a}_{l,mn}(t)$ are $O(1/\sqrt{N})$, it is reasonable to assume that $\hat{a}_{mn}(t) - \hat{a}_{l,mn}(t)$ is $O(1/\sqrt{NL})$ and that both $\nu^a_{mn}(t) - \nu^a_{l,mn}(t)$ and $\hat{a}^2_{mn}(t) - \hat{a}^2_{l,mn}(t)$ are $O(1/(N\sqrt{L}))$. The remaining entries in Table II will be explained below.

^2 Other scalings on $x_{nl}$, $a_{mn}$, and $z_{ml}$ could be used as long as they are consistent with the relationship $z_{ml} = \sum_{k} a_{mk} x_{kl}$.
We start by approximating the message $\Delta^x_{m\to nl}(t, \cdot)$. Expanding (7), we find

(13) $\Delta^x_{m\to nl}(t, x_{nl}) = \mathrm{const} + \log \int_{\{a_{mr}\}_{r=1}^{N}, \{x_{rl}\}_{r\neq n}} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}\Big(y_{ml} \,\Big|\, \underbrace{a_{mn}x_{nl} + \textstyle\sum_{r\neq n} a_{mr}x_{rl}}_{\triangleq\, z_{ml}}\Big) \prod_{r\neq n} \exp\big(\Delta^x_{m\leftarrow rl}(t, x_{rl})\big) \prod_{r=1}^{N} \exp\big(\Delta^a_{l\leftarrow mr}(t, a_{mr})\big).$
For large $N$, the CLT motivates the treatment of $\mathsf{z}_{ml}$, the random variable associated with the $z_{ml}$ identified in (13), conditioned on $\mathsf{x}_{nl} = x_{nl}$, as Gaussian and thus completely characterized by a (conditional) mean and variance. Defining the zero-mean r.v.s $\tilde{\mathsf{a}}_{l,mn}(t) \triangleq \mathsf{a}_{mn} - \hat{a}_{l,mn}(t)$ and $\tilde{\mathsf{x}}_{m,nl}(t) \triangleq \mathsf{x}_{nl} - \hat{x}_{m,nl}(t)$, where $\mathsf{a}_{mn} \sim \frac{1}{C}\exp(\Delta^a_{l\leftarrow mn}(t, \cdot))$ and $\mathsf{x}_{nl} \sim \frac{1}{C}\exp(\Delta^x_{m\leftarrow nl}(t, \cdot))$, we can write

(14) $\mathsf{z}_{ml} = x_{nl}\big(\hat{a}_{l,mn}(t) + \tilde{\mathsf{a}}_{l,mn}(t)\big) + \sum_{r\neq n} \big(\hat{a}_{l,mr}(t) + \tilde{\mathsf{a}}_{l,mr}(t)\big)\big(\hat{x}_{m,rl}(t) + \tilde{\mathsf{x}}_{m,rl}(t)\big)$

after which it is straightforward to see that

(15) $\mathrm{E}\{\mathsf{z}_{ml} \mid \mathsf{x}_{nl} = x_{nl}\} = x_{nl}\,\hat{a}_{l,mn}(t) + \hat{p}_{n,ml}(t)$

(16) $\mathrm{var}\{\mathsf{z}_{ml} \mid \mathsf{x}_{nl} = x_{nl}\} = x_{nl}^2\,\nu^a_{l,mn}(t) + \nu^p_{n,ml}(t)$

for

(17) $\hat{p}_{n,ml}(t) \triangleq \sum_{r\neq n} \hat{a}_{l,mr}(t)\,\hat{x}_{m,rl}(t)$

(18) $\nu^p_{n,ml}(t) \triangleq \sum_{r\neq n} \big(\hat{a}^2_{l,mr}(t)\,\nu^x_{m,rl}(t) + \nu^a_{l,mr}(t)\,\hat{x}^2_{m,rl}(t) + \nu^a_{l,mr}(t)\,\nu^x_{m,rl}(t)\big).$
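The conditional moments (15)-(18) follow from the mutual independence of the terms in (14), with each product term contributing variance $\hat{a}^2\nu^x + \nu^a\hat{x}^2 + \nu^a\nu^x$. The short Monte-Carlo check below (with arbitrary made-up means and variances, and Gaussian message pdfs assumed purely for simulation convenience) verifies the leave-one-out mean and variance formulas numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
Nr = 5                                    # number of terms r in the sum
a_hat, nu_a = rng.normal(size=Nr), rng.random(Nr) + 0.1
x_hat, nu_x = rng.normal(size=Nr), rng.random(Nr) + 0.1
n = 0                                     # index of the conditioned variable
x_nl = 1.7                                # fixed value of x_nl

# (17)-(18): leave-one-out mean p_hat and variance nu_p over r != n
mask = np.arange(Nr) != n
p_hat = np.sum(a_hat[mask] * x_hat[mask])
nu_p = np.sum(a_hat[mask]**2 * nu_x[mask] + nu_a[mask] * x_hat[mask]**2
              + nu_a[mask] * nu_x[mask])

# (15)-(16): conditional moments of z_ml given x_nl
mean_z = x_nl * a_hat[n] + p_hat
var_z = x_nl**2 * nu_a[n] + nu_p

# Monte-Carlo draws of z_ml = sum_r a_mr x_rl with x_nl held fixed
S = 400000
a = a_hat + rng.normal(size=(S, Nr)) * np.sqrt(nu_a)
x = x_hat + rng.normal(size=(S, Nr)) * np.sqrt(nu_x)
x[:, n] = x_nl
z = (a * x).sum(axis=1)
```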
With this conditional-Gaussian approximation, (13) becomes

(19) $\Delta^x_{m\to nl}(t, x_{nl}) \approx \mathrm{const} + \log \int_{z_{ml}} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}(y_{ml} \mid z_{ml})\, \mathcal{N}\big(z_{ml};\, x_{nl}\hat{a}_{l,mn}(t) + \hat{p}_{n,ml}(t),\; x_{nl}^2\nu^a_{l,mn}(t) + \nu^p_{n,ml}(t)\big)$

(20) $= \mathrm{const} + H_{ml}\big(x_{nl}\hat{a}_{l,mn}(t) + \hat{p}_{n,ml}(t),\; x_{nl}^2\nu^a_{l,mn}(t) + \nu^p_{n,ml}(t)\big)$

in terms of the function

(21) $H_{ml}(\hat{q}, \nu^q) \triangleq \log \int_{z} p_{\mathsf{y}_{ml}|\mathsf{z}_{ml}}(y_{ml} \mid z)\, \mathcal{N}(z; \hat{q}, \nu^q).$
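For the illustrative AWGN likelihood $p(y \mid z) = \mathcal{N}(y; z, \nu^w)$ (one possible choice; (21) only requires that the scalar integral exist), the Gaussian integral in (21) has the closed form $H(\hat{q}, \nu^q) = \log \mathcal{N}(y; \hat{q}, \nu^q + \nu^w)$. The sketch below checks this closed form against direct numerical integration:

```python
import numpy as np

def H_awgn(y, q_hat, nu_q, nu_w):
    # Closed form under AWGN: the convolution of two Gaussians is Gaussian.
    v = nu_q + nu_w
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (y - q_hat) ** 2 / v

def H_numeric(y, q_hat, nu_q, nu_w, npts=200001):
    # Direct evaluation of the integral in (21) on a fine grid.
    halfwidth = 12.0 * np.sqrt(nu_q)
    z = np.linspace(q_hat - halfwidth, q_hat + halfwidth, npts)
    dz = z[1] - z[0]
    lik = np.exp(-0.5 * (y - z) ** 2 / nu_w) / np.sqrt(2 * np.pi * nu_w)
    prior = np.exp(-0.5 * (z - q_hat) ** 2 / nu_q) / np.sqrt(2 * np.pi * nu_q)
    return np.log(np.sum(lik * prior) * dz)

y, q_hat, nu_q, nu_w = 0.8, 0.3, 0.5, 0.2
```

For non-Gaussian likelihoods (e.g., the outlier mixture used for robust PCA), the integral can be evaluated by similar one-dimensional quadrature.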
Unlike the original SPA message (7), the approximation (20) requires only a single integration. Still, additional simplifications are possible. First, notice that $\hat{p}_{n,ml}(t)$ and $\nu^p_{n,ml}(t)$ differ from the corresponding $n$-invariant quantities

(22) $\bar{p}_{ml}(t) \triangleq \sum_{n=1}^{N} \hat{a}_{l,mn}(t)\,\hat{x}_{m,nl}(t)$

(23) $\bar{\nu}^p_{ml}(t) \triangleq \sum_{n=1}^{N} \big(\hat{a}^2_{l,mn}(t)\,\nu^x_{m,nl}(t) + \nu^a_{l,mn}(t)\,\hat{x}^2_{m,nl}(t) + \nu^a_{l,mn}(t)\,\nu^x_{m,nl}(t)\big)$
by one term. In the sequel, we will assume that $\bar{p}_{ml}(t)$ and $\bar{\nu}^p_{ml}(t)$ are $O(1)$ since these quantities can be recognized as the mean and variance, respectively, of an estimate of $\mathsf{z}_{ml}$, which is $O(1)$. Writing the $H_{ml}$ term in (20) using (22)-(23),

(24) $H_{ml}\big(x_{nl}\hat{a}_{l,mn}(t) + \bar{p}_{ml}(t) - \hat{a}_{l,mn}(t)\hat{x}_{m,nl}(t),\; x_{nl}^2\nu^a_{l,mn}(t) + \bar{\nu}^p_{ml}(t) - \hat{a}^2_{l,mn}(t)\nu^x_{m,nl}(t) - \nu^a_{l,mn}(t)\hat{x}^2_{m,nl}(t) - \nu^a_{l,mn}(t)\nu^x_{m,nl}(t)\big)$

(25) $= H_{ml}\big(\bar{p}_{ml}(t) + \hat{a}_{l,mn}(t)(x_{nl} - \hat{x}_{m,nl}(t)),\; \bar{\nu}^p_{ml}(t) + \nu^a_{l,mn}(t)(x_{nl}^2 - \hat{x}^2_{m,nl}(t)) - O(1/N)\big)$

where in (25) we used the facts that $\hat{a}^2_{l,mn}(t)\nu^x_{m,nl}(t)$ and $\nu^a_{l,mn}(t)\nu^x_{m,nl}(t)$ are both $O(1/N)$.
Rewriting (20) using a Taylor-series expansion in $x_{nl}$ about the point $\hat{x}_{m,nl}(t)$, we get

(26) $\Delta^x_{m\to nl}(t, x_{nl}) \approx \mathrm{const} + H_{ml}\big(\bar{p}_{ml}(t), \bar{\nu}^p_{ml}(t)\big) + \hat{a}_{l,mn}(t)\big(x_{nl} - \hat{x}_{m,nl}(t)\big)\, H'_{ml}\big(\bar{p}_{ml}(t),\, \bar{\nu}^p_{ml}(t) + \nu^a_{l,mn}(t)(x_{nl}^2 - \hat{x}^2_{m,nl}(t))\big) + \tfrac{1}{2}\hat{a}^2_{l,mn}(t)\big(x_{nl} - \hat{x}_{m,nl}(t)\big)^2\, H''_{ml}\big(\bar{p}_{ml}(t),\, \bar{\nu}^p_{ml}(t) + \nu^a_{l,mn}(t)(x_{nl}^2 - \hat{x}^2_{m,nl}(t))\big) + \nu^a_{l,mn}(t)\big(x_{nl}^2 - \hat{x}^2_{m,nl}(t)\big)\, \dot{H}_{ml}\big(\bar{p}_{ml}(t), \bar{\nu}^p_{ml}(t)\big) + O(1/(N\sqrt{N}))$

where $H'_{ml}$ and $H''_{ml}$ are the first two derivatives of $H_{ml}$ w.r.t. its first argument and $\dot{H}_{ml}$ is the first derivative w.r.t. its second argument. Note that, in (26) and elsewhere, the higher-order terms in the Taylor expansion are written solely in terms of their scaling dependence on $N$, which is what will eventually allow us to neglect these terms (in the large-system limit).
We now approximate (26) by dropping terms that vanish, relative to the second-to-last term in (26), as $N \to \infty$. Since this second-to-last term is $O(1/N)$ due to the scalings of $\nu^a_{l,mn}(t)$, $x_{nl}$, and $\hat{x}_{m,nl}(t)$, we drop terms that are of order $O(1/(N\sqrt{N}))$, such as the final term. We also replace $\hat{a}_{l,mn}(t)$ with $\hat{a}_{mn}(t)$, and $\nu^a_{l,mn}(t)$ with $\nu^a_{mn}(t)$, since in both cases the difference is asymptotically negligible. Finally, we drop the $\nu^a_{l,mn}(t)(x_{nl}^2 - \hat{x}^2_{m,nl}(t))$ terms inside the derivatives, which can be justified by taking a Taylor-series expansion of these derivatives with respect to these $O(1/N)$ perturbations and verifying that the higher-order terms in this latter expansion are $O(1/(N\sqrt{N}))$. All of these approximations are analogous to those made in previous AMP derivations, e.g., [30], [31], and [32].
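The first-order Taylor step underlying (26) can be illustrated numerically. The sketch below (an illustration only, using a self-contained AWGN instance of $H$ with the channel noise variance folded into the second argument) shows that for an $O(1/\sqrt{N})$ perturbation $d$ of the first argument, the linearization error is $O(d^2) = O(1/N)$ and shrinks accordingly as $N$ grows:

```python
import numpy as np

def H(y, q, v):
    # AWGN instance of (21): H = log N(y; q, v), noise variance folded into v.
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (y - q) ** 2 / v

def dH_dq(y, q, v):
    # First derivative of H w.r.t. its first argument.
    return (y - q) / v

y, p_bar, v_bar = 0.8, 0.3, 0.7        # arbitrary illustrative values
errors = []
for N_dim in [10, 100, 1000]:
    d = 1.0 / np.sqrt(N_dim)           # an O(1/sqrt(N)) perturbation
    exact = H(y, p_bar + d, v_bar)
    first_order = H(y, p_bar, v_bar) + d * dH_dq(y, p_bar, v_bar)
    errors.append(abs(exact - first_order))
```

Since this AWGN $H$ is quadratic in its first argument, the linearization error here is exactly $d^2/(2v)$, consistent with the $O(1/N)$ bookkeeping above.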