
# Bilinear Generalized Approximate Message Passing

Jason T. Parker, Philip Schniter, and Volkan Cevher. J. Parker is with the Sensors Directorate, Air Force Research Laboratory, Dayton, OH 45433, USA, e-mail: jason.parker.13@us.af.mil. His work on this project has been supported by AFOSR Lab Task 11RY02COR. P. Schniter is with the Dept. ECE, The Ohio State University, 2015 Neil Ave., Columbus OH 43210, e-mail: schniter@ece.osu.edu, phone 614.247.6488, fax 614.292.7596. His work on this project has been supported by NSF grants IIP-0968910, CCF-1018368, CCF-1218754, and by DARPA/ONR grant N66001-10-1-4090. V. Cevher is with École Polytechnique Fédérale de Lausanne, e-mail: volkan.cevher@epfl.ch. His work is supported in part by the European Commission under the grants MIRG-268398 and ERC Future Proof, and by the Swiss Science Foundation under the grants SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633. Portions of this work were presented at the 2011 Workshop on Signal Processing with Adaptive Sparse Structured Representations [1], the 2012 Information Theory and Applications Workshop [2], and at the 2012 Asilomar Conference on Signals, Systems, and Computers [3].
###### Abstract

We extend the generalized approximate message passing (G-AMP) approach, originally proposed for high-dimensional generalized-linear regression in the context of compressive sensing, to the generalized-bilinear case, which enables its application to matrix completion, robust PCA, dictionary learning, and related matrix-factorization problems. In the first part of the paper, we derive our Bilinear G-AMP (BiG-AMP) algorithm as an approximation of the sum-product belief propagation algorithm in the high-dimensional limit, where central-limit theorem arguments and Taylor-series approximations apply, and under the assumption of statistically independent matrix entries with known priors. In addition, we propose an adaptive damping mechanism that aids convergence under finite problem sizes, an expectation-maximization (EM)-based method to automatically tune the parameters of the assumed priors, and two rank-selection strategies. In the second part of the paper, we discuss the specializations of EM-BiG-AMP to the problems of matrix completion, robust PCA, and dictionary learning, and present the results of an extensive empirical study comparing EM-BiG-AMP to state-of-the-art algorithms on each problem. Our numerical results, using both synthetic and real-world datasets, demonstrate that EM-BiG-AMP yields excellent reconstruction accuracy (often best in class) while maintaining competitive runtimes and avoiding the need to tune algorithmic parameters.

## I Introduction

In this work, we present a new algorithmic framework for the following generalized bilinear inference problem: estimate the matrices $\mathbf{A}\in\mathbb{R}^{M\times N}$ and $\mathbf{X}\in\mathbb{R}^{N\times L}$ from a matrix observation $\mathbf{Y}\in\mathbb{R}^{M\times L}$ that is statistically coupled to their product, $\mathbf{Z}=\mathbf{AX}$. In doing so, we treat $\mathbf{A}$ and $\mathbf{X}$ as realizations of random matrices $\mathsf{A}$ and $\mathsf{X}$ with known separable pdfs (or pmfs in the case of discrete models), i.e.,

$$
\begin{aligned}
p_{\mathsf{A}}(\mathbf{A}) &= \prod_m \prod_n p_{a_{mn}}(a_{mn}) \quad (1)\\
p_{\mathsf{X}}(\mathbf{X}) &= \prod_n \prod_l p_{x_{nl}}(x_{nl}), \quad (2)
\end{aligned}
$$

and we likewise assume that the likelihood function of $\mathbf{Z}$ is known and separable, i.e.,

$$
p_{\mathsf{Y}|\mathsf{Z}}(\mathbf{Y}\,|\,\mathbf{Z}) = \prod_m \prod_l p_{y_{ml}|z_{ml}}(y_{ml}\,|\,z_{ml}). \quad (3)
$$
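As a concrete illustration (ours, not from the paper), the separability assumptions (1)-(3) say that each log density decomposes into a sum of per-entry terms. The sketch below uses standard-normal priors and an AWGN likelihood purely as hypothetical stand-ins for the arbitrary entrywise pdfs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
M, N, L = 4, 3, 5
A = rng.standard_normal((M, N))                 # realization of the random matrix A
X = rng.standard_normal((N, L))                 # realization of the random matrix X
Y = A @ X + 0.1 * rng.standard_normal((M, L))   # observation coupled to Z = A X

# Separability: each log density is a sum of entrywise terms, as in (1)-(3).
log_pA = norm.logpdf(A).sum()                   # sum_m sum_n log p_amn(a_mn)
log_pX = norm.logpdf(X).sum()                   # sum_n sum_l log p_xnl(x_nl)
log_pYgZ = norm.logpdf(Y, loc=A @ X, scale=0.1).sum()  # sum_m sum_l log p(y_ml|z_ml)
```

This entrywise decomposition is precisely what later lets the factor graph contain only scalar-valued variable and factor nodes.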

Recently, various special cases of this problem have gained the intense interest of the research community, e.g.,

1. Matrix Completion: In this problem, one observes a few (possibly noise-corrupted) entries of a low-rank matrix and the goal is to infer the missing entries. In our framework, $\mathbf{Z}=\mathbf{AX}$ would represent the complete low-rank matrix (with tall $\mathbf{A}$ and wide $\mathbf{X}$) and $p_{y_{ml}|z_{ml}}$ the observation mechanism, which would be (partially) informative about $z_{ml}$ at the observed entries $(m,l)\in\Omega$ and non-informative at the missing entries $(m,l)\notin\Omega$.

2. Robust PCA: Here, the objective is to recover a low-rank matrix (or its principal components) observed in the presence of noise and sparse outliers. In our framework, $\mathbf{Z}=\mathbf{AX}$ could again represent the low-rank matrix, and $p_{y_{ml}|z_{ml}}$ the noise-and-outlier-corrupted observation mechanism. Alternatively, one of the factors could also capture the outliers, as described in the sequel.

3. Dictionary Learning: Here, the objective is to learn a dictionary $\mathbf{A}$ for which there exists a sparse data representation $\mathbf{X}$ such that $\mathbf{AX}$ closely matches the observed data $\mathbf{Y}$. In our framework, $p_{x_{nl}}$ would be chosen to induce sparsity in $\mathbf{X}$, $\mathbf{Z}=\mathbf{AX}$ would represent the noiseless observations, and $p_{y_{ml}|z_{ml}}$ would model the (possibly noisy) observation mechanism.
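To make the matrix-completion case above concrete, the sketch below (our illustration, with an assumed Gaussian noise model) builds a separable likelihood that is informative at observed entries and non-informative, i.e., constant, at missing ones:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M, N, L = 40, 3, 30
Z = rng.standard_normal((M, N)) @ rng.standard_normal((N, L))  # low-rank Z = A X
omega = rng.random((M, L)) < 0.3                               # observed-entry mask
Y = np.where(omega, Z + 0.01 * rng.standard_normal((M, L)), 0.0)

def log_likelihood(Y, Z, omega, noise_std=0.01):
    # Gaussian p(y_ml|z_ml) at observed entries; log 1 = 0 (up to a
    # constant) at missing entries, i.e., non-informative about z_ml.
    ll = np.where(omega, norm.logpdf(Y, loc=Z, scale=noise_std), 0.0)
    return ll.sum()
```

Perturbing $\mathbf{Z}$ only at missing entries leaves this likelihood unchanged, which is exactly what "non-informative" means in the matrix-completion item.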

While a plethora of approaches to these problems have been proposed based on optimization techniques (e.g., [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]), greedy methods (e.g., [15, 16, 17, 18, 19]), Bayesian sampling methods (e.g., [20, 21]), variational methods (e.g., [22, 23, 24, 25, 26]), and discrete message passing (e.g., [27]), ours is based on the Approximate Message Passing (AMP) framework, an instance of loopy belief propagation (LBP) [28] that was recently developed to tackle linear [29, 30, 31] and generalized linear [32] inference problems encountered in the context of compressive sensing (CS). In the generalized-linear CS problem, one estimates $\mathbf{x}$ from observations $\mathbf{y}$ that are statistically coupled to the transform outputs $\mathbf{z}=\mathbf{Ax}$ through a separable likelihood function $p_{\mathsf{y}|\mathsf{z}}$, where in this case the transform $\mathbf{A}$ is fixed and known.

In the context of CS, the AMP framework yields algorithms with remarkable properties: i) solution trajectories that, in the large-system limit (i.e., as $M,N\to\infty$ with $M/N$ fixed, under iid sub-Gaussian $\mathbf{A}$) are governed by a state evolution whose fixed points—when unique—yield the true posterior means [33, 34] and ii) a low implementation complexity (i.e., dominated by one multiplication with $\mathbf{A}$ and $\mathbf{A}^{\mathsf{T}}$ per iteration, and relatively few iterations) [31]. Thus, a natural question is whether the AMP framework can be successfully applied to the generalized bilinear problem described earlier.

In this manuscript, we propose an AMP-based approach to generalized bilinear inference that we henceforth refer to as Bilinear Generalized AMP (BiG-AMP), and we uncover special cases under which the general approach can be simplified. In addition, we propose an adaptive damping [35] mechanism, an expectation-maximization (EM)-based [36] method of tuning the parameters of $p_{x_{nl}}$, $p_{a_{mn}}$, and $p_{y_{ml}|z_{ml}}$ (in case they are unknown), and methods to select the rank $N$ (in case it is unknown). In the case that $p_{x_{nl}}$, $p_{a_{mn}}$, and/or $p_{y_{ml}|z_{ml}}$ are completely unknown, they can be modeled as Gaussian mixtures with mean/variance/weight parameters learned via EM [37]. Finally, we present a detailed numerical investigation of BiG-AMP applied to matrix completion, robust PCA, and dictionary learning. Our empirical results show that BiG-AMP yields an excellent combination of estimation accuracy and runtime when compared to existing state-of-the-art algorithms for each application.

Although the AMP methodology is itself restricted to separable known pdfs (1)-(3), the results of Part II suggest that this limitation is not an issue for many practical problems of interest. However, in problems where the separability assumption is too constraining, it can be relaxed through the use of hidden (coupling) variables, as originally proposed in the context of “turbo-AMP” [38] and applied to BiG-AMP in [39]. Due to space limitations, however, this approach will not be discussed here. Finally, although we focus on real-valued random variables, all of the methodology described in this work can be easily extended to circularly symmetric complex-valued random variables.

We now discuss related work. One possibility of applying AMP methods to matrix completion was suggested by Montanari in [31, Sec. 9.7.3] but the approach described there differs from BiG-AMP in that it was i) constructed from a factor graph with vector-valued variables and ii) restricted to the (incomplete) additive white Gaussian noise (AWGN) observation model. Moreover, no concrete algorithm nor performance evaluation was reported. Since we first reported on BiG-AMP in [1, 2], Rangan and Fletcher [40] proposed an AMP-based approach for the estimation of rank-one matrices from AWGN-corrupted observations, and showed that it can be characterized by a state evolution in the large-system limit. More recently, Krzakala, Mézard, and Zdeborová [41] proposed an AMP-based approach to blind calibration and dictionary learning in AWGN that bears similarity to a special case of BiG-AMP, and derived a state-evolution using the cavity method. Their method, however, was not numerically successful in solving dictionary learning problems [41]. The BiG-AMP algorithm that we derive here is a generalization of those in [40, 41] in that it handles generalized bilinear observations rather than AWGN-corrupted ones. Moreover, our work is the first to detail adaptive damping, parameter tuning, and rank-selection mechanisms for AMP-based bilinear inference, and it is the first to present an in-depth numerical investigation involving both synthetic and real-world datasets. An application/extension of the BiG-AMP algorithm described here to hyperspectral unmixing (an instance of non-negative matrix factorization) was recently proposed in [39].

The remainder of the document is organized as follows. Section II derives the BiG-AMP algorithm, and Sec. III presents several special-case simplifications of BiG-AMP. Section IV describes the adaptive damping mechanism, and Sec. V the EM-based tuning of prior parameters and selection of rank $N$. Application-specific issues and numerical results demonstrating the efficacy of our approach for matrix completion, robust PCA, and dictionary learning are discussed in Sections VI, VII, and VIII, respectively, and concluding remarks are offered in Sec. IX.

Notation: Throughout, we use sans-serif font (e.g., $\mathsf{x}$) for random variables and serif font (e.g., $x$) otherwise. We use boldface capital letters (e.g., $\mathbf{\mathsf{X}}$ and $\mathbf{X}$) for matrices, boldface small letters (e.g., $\mathbf{\mathsf{x}}$ and $\mathbf{x}$) for vectors, and non-bold small letters (e.g., $\mathsf{x}$ and $x$) for scalars. We then use $p_{\mathsf{x}}$ to denote the pdf of random quantity $\mathsf{x}$, and $\mathcal{N}(x;\hat{x},\nu^x)$ to denote the Gaussian pdf for a scalar random variable with mean $\hat{x}$ and variance $\nu^x$. Also, we use $\mathrm{E}\{\mathsf{x}\}$ and $\mathrm{var}\{\mathsf{x}\}$ to denote the mean and variance of $\mathsf{x}$, respectively, and $D(p_1\|p_2)$ for the Kullback-Leibler (KL) divergence between pdfs $p_1$ and $p_2$. For a matrix $\mathbf{X}$, we use $x_{nl}=[\mathbf{X}]_{nl}$ to denote the entry in the $n$th row and $l$th column, $\|\mathbf{X}\|_F$ to denote the Frobenius norm, and $\mathbf{X}^{\mathsf{T}}$ to denote transpose. Similarly, we use $x_n=[\mathbf{x}]_n$ to denote the $n$th entry in vector $\mathbf{x}$ and $\|\mathbf{x}\|_p$ to denote the $\ell_p$ norm.

## II Bilinear Generalized AMP

### II-A Problem Formulation

For the statistical model (1)-(3), the posterior distribution is

$$
\begin{aligned}
p_{\mathsf{X},\mathsf{A}|\mathsf{Y}}(\mathbf{X},\mathbf{A}\,|\,\mathbf{Y}) &= p_{\mathsf{Y}|\mathsf{X},\mathsf{A}}(\mathbf{Y}\,|\,\mathbf{X},\mathbf{A})\, p_{\mathsf{X}}(\mathbf{X})\, p_{\mathsf{A}}(\mathbf{A}) \,/\, p_{\mathsf{Y}}(\mathbf{Y}) \quad (4)\\
&\propto p_{\mathsf{Y}|\mathsf{Z}}(\mathbf{Y}\,|\,\mathbf{AX})\, p_{\mathsf{X}}(\mathbf{X})\, p_{\mathsf{A}}(\mathbf{A}) \quad (5)\\
&= \Big[\prod_m \prod_l p_{y_{ml}|z_{ml}}\big(y_{ml}\,\big|\,{\textstyle\sum_k} a_{mk}x_{kl}\big)\Big] \times \Big[\prod_n \prod_l p_{x_{nl}}(x_{nl})\Big]\Big[\prod_m \prod_n p_{a_{mn}}(a_{mn})\Big], \quad (6)
\end{aligned}
$$

where (4) employs Bayes' rule and $\propto$ denotes equality up to a constant scale factor.

The posterior distribution can be represented with a factor graph, as depicted in Fig. 1. There, the factors of the posterior from (6) are represented by "factor nodes" that appear as black boxes, and the random variables are represented by "variable nodes" that appear as white circles. Each variable node is connected to every factor node in which that variable appears. The observed data $\{y_{ml}\}$ are treated as parameters of the factor nodes in the middle of the graph, and not as random variables. The structure of Fig. 1 becomes intuitive when recalling that $\mathbf{Z}=\mathbf{AX}$ implies $z_{ml}=\sum_n a_{mn}x_{nl}$.
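As a quick numerical sanity check (our sketch, with hypothetical Gaussian priors and an AWGN likelihood standing in for the generic factors), the unnormalized log posterior (6) can be evaluated entrywise using $z_{ml}=\sum_k a_{mk}x_{kl}$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
M, N, L = 4, 2, 3
A = rng.standard_normal((M, N))
X = rng.standard_normal((N, L))
Y = A @ X + 0.1 * rng.standard_normal((M, L))

def log_posterior_unnorm(A, X, Y, noise_std=0.1):
    # log of (6), up to the constant log p_Y(Y): entrywise likelihood of
    # Y given Z = A X, plus entrywise priors on X and A.
    Z = A @ X
    return (norm.logpdf(Y, loc=Z, scale=noise_std).sum()
            + norm.logpdf(X).sum()
            + norm.logpdf(A).sum())
```

Each summand here corresponds to one factor node of Fig. 1, which is why the graph contains $ML$ likelihood nodes, $NL$ prior nodes for $\mathbf{X}$, and $MN$ prior nodes for $\mathbf{A}$.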

### II-B Loopy Belief Propagation

In this work, we aim to compute minimum mean-squared error (MMSE) estimates of $\mathbf{X}$ and $\mathbf{A}$, i.e., the means¹ of the marginal posteriors $p_{x_{nl}|\mathsf{Y}}(\cdot\,|\,\mathbf{Y})$ and $p_{a_{mn}|\mathsf{Y}}(\cdot\,|\,\mathbf{Y})$, for all pairs $(n,l)$ and $(m,n)$. Although exact computation of these quantities is generally prohibitive, they can be efficiently approximated using loopy belief propagation (LBP) [28].

¹ Another worthwhile objective could be to compute the joint MAP estimate $\arg\max_{\mathbf{X},\mathbf{A}} p_{\mathsf{X},\mathsf{A}|\mathsf{Y}}(\mathbf{X},\mathbf{A}\,|\,\mathbf{Y})$; we leave this to future work.

In LBP, beliefs about the random variables (in the form of pdfs or log pdfs) are propagated among the nodes of the factor graph until they converge. The standard way to compute these beliefs, known as the sum-product algorithm (SPA) [42, 43], stipulates that the belief emitted by a variable node along a given edge of the graph is computed as the product of the incoming beliefs from all other edges, whereas the belief emitted by a factor node along a given edge is computed as the integral of the product of the factor associated with that node and the incoming beliefs on all other edges. The product of all beliefs impinging on a given variable node yields the posterior pdf for that variable. In cases where the factor graph has no loops, exact marginal posteriors result from two (i.e., forward and backward) passes of the SPA [42, 43]. For loopy factor graphs, exact inference is in general NP hard [44] and so LBP does not guarantee correct posteriors. That said, LBP has shown state-of-the-art performance in many applications, such as inference on Markov random fields [45], turbo decoding [46], LDPC decoding [47], multiuser detection [48], and compressive sensing [29, 30, 32, 33, 34].

In high-dimensional inference problems, exact implementation of the SPA is impractical, motivating approximations of the SPA. A notable example is the generalized approximate message passing (GAMP) algorithm, developed in [32] to solve the generalized CS problem, which exploits the "blessings of dimensionality" that arise when $\mathbf{A}$ is sufficiently large and dense, and which was rigorously analyzed in [34]. In the sequel, we derive the BiG-AMP algorithm for generalized bilinear inference by applying GAMP-like approximations to the SPA on the factor graph in Fig. 1. As we shall see, the approximations are primarily based on central-limit-theorem (CLT) and Taylor-series arguments.

### II-C Sum-Product Algorithm

In our formulation of the SPA, messages take the form of log-pdfs with arbitrary constant offsets. For example, the iteration-$t$ (where $t\in\mathbb{Z}$) message $\Delta^x_{m\to nl}(t,\cdot)$ can be converted to the pdf $\frac{1}{C}\exp(\Delta^x_{m\to nl}(t,\cdot))$, where the choice of scale factor $C$ ensures that the pdf integrates to one. Four types of message will be used, as specified in Table I. We also find it convenient to express the (iteration-$t$ SPA-approximated) posterior pdfs $p_{x_{nl}|\mathsf{Y}}(\cdot\,|\,\mathbf{Y})$ and $p_{a_{mn}|\mathsf{Y}}(\cdot\,|\,\mathbf{Y})$ in the log domain as $\Delta^x_{nl}(t,\cdot)$ and $\Delta^a_{mn}(t,\cdot)$, respectively, again with arbitrary constant offsets.

Applying the SPA to the factor graph in Fig. 1, we arrive at the following update rules for the four messages in Table I.

$$
\begin{aligned}
\Delta^x_{m\to nl}(t,x_{nl}) &= \log \int_{\{a_{mk}\}_{k=1}^N,\,\{x_{rl}\}_{r\neq n}} p_{y_{ml}|z_{ml}}\Big(y_{ml}\,\Big|\,\sum_{k=1}^N a_{mk}x_{kl}\Big)\\
&\qquad \times \prod_{r\neq n}\exp\big(\Delta^x_{m\leftarrow rl}(t,x_{rl})\big)\prod_{k=1}^N \exp\big(\Delta^a_{l\leftarrow mk}(t,a_{mk})\big) + \mathsf{const} \quad (7)\\
\Delta^x_{m\leftarrow nl}(t{+}1,x_{nl}) &= \log p_{x_{nl}}(x_{nl}) + \sum_{k\neq m}\Delta^x_{k\to nl}(t,x_{nl}) + \mathsf{const} \quad (8)\\
\Delta^a_{l\to mn}(t,a_{mn}) &= \log \int_{\{x_{kl}\}_{k=1}^N,\,\{a_{mr}\}_{r\neq n}} p_{y_{ml}|z_{ml}}\Big(y_{ml}\,\Big|\,\sum_{k=1}^N a_{mk}x_{kl}\Big)\\
&\qquad \times \prod_{k=1}^N \exp\big(\Delta^x_{m\leftarrow kl}(t,x_{kl})\big)\prod_{r\neq n}\exp\big(\Delta^a_{l\leftarrow mr}(t,a_{mr})\big) + \mathsf{const} \quad (9)\\
\Delta^a_{l\leftarrow mn}(t{+}1,a_{mn}) &= \log p_{a_{mn}}(a_{mn}) + \sum_{k\neq l}\Delta^a_{k\to mn}(t,a_{mn}) + \mathsf{const}, \quad (10)
\end{aligned}
$$

where $\mathsf{const}$ is an arbitrary constant (w.r.t. $x_{nl}$ in (7) and (8), and w.r.t. $a_{mn}$ in (9) and (10)). In the sequel, we denote the mean and variance of the pdf $\frac{1}{C}\exp(\Delta^x_{m\leftarrow nl}(t,\cdot))$ by $\hat{x}_{m,nl}(t)$ and $\nu^x_{m,nl}(t)$, respectively, and we denote the mean and variance of $\frac{1}{C}\exp(\Delta^a_{l\leftarrow mn}(t,\cdot))$ by $\hat{a}_{l,mn}(t)$ and $\nu^a_{l,mn}(t)$. For the log-posteriors, the SPA implies

$$
\begin{aligned}
\Delta^x_{nl}(t{+}1,x_{nl}) &= \log p_{x_{nl}}(x_{nl}) + \sum_{m=1}^M \Delta^x_{m\to nl}(t,x_{nl}) + \mathsf{const} \quad (11)\\
\Delta^a_{mn}(t{+}1,a_{mn}) &= \log p_{a_{mn}}(a_{mn}) + \sum_{l=1}^L \Delta^a_{l\to mn}(t,a_{mn}) + \mathsf{const}, \quad (12)
\end{aligned}
$$

and we denote the mean and variance of $\frac{1}{C}\exp(\Delta^x_{nl}(t,\cdot))$ by $\hat{x}_{nl}(t)$ and $\nu^x_{nl}(t)$, and the mean and variance of $\frac{1}{C}\exp(\Delta^a_{mn}(t,\cdot))$ by $\hat{a}_{mn}(t)$ and $\nu^a_{mn}(t)$.

### II-D Approximated Factor-to-Variable Messages

We now apply AMP approximations to the SPA updates (7)-(12). As we shall see, the approximations are based primarily on central-limit-theorem (CLT) and Taylor-series arguments that become exact in the large-system limit, where $M,L\to\infty$ with fixed ratios $M/N$ and $L/N$. (Due to the use of finite $M$, $L$, $N$ in practice, we still regard them as approximations.) In particular, our derivation will neglect terms that vanish relative to others as $N\to\infty$, which requires that we establish certain scaling conventions. First, we assume w.l.o.g.² that $x_{nl}$ and $z_{ml}$ scale as $O(1)$, i.e., that the magnitudes of these elements do not change as $N\to\infty$. In this case, the relationship $z_{ml}=\sum_{k=1}^N a_{mk}x_{kl}$ implies that $a_{mn}$ must scale as $O(1/\sqrt{N})$. These scalings are assumed to hold for random variables $x_{nl}$, $a_{mn}$, and $z_{ml}$ distributed according to the prior pdfs, according to the pdfs corresponding to the SPA messages (7)-(10), and according to the pdfs corresponding to the SPA posterior approximations (11)-(12). These assumptions lead straightforwardly to the scalings of $\hat{x}_{m,nl}(t)$, $\nu^x_{m,nl}(t)$, $\hat{x}_{nl}(t)$, $\nu^x_{nl}(t)$, $\hat{a}_{l,mn}(t)$, $\nu^a_{l,mn}(t)$, $\hat{a}_{mn}(t)$, $\nu^a_{mn}(t)$, $\hat{s}_{ml}(t)$, and $\nu^s_{ml}(t)$ specified in Table II. Furthermore, because $\Delta^x_{m\leftarrow nl}(t,\cdot)$ and $\Delta^x_{nl}(t,\cdot)$ differ by only one term out of $M$, it is reasonable to assume [31, 32] that the corresponding difference in means and variances are both $O(1/\sqrt{N})$, which then implies that $\hat{x}^2_{m,nl}(t)-\hat{x}^2_{nl}(t)$ is also $O(1/\sqrt{N})$. Similarly, because $\Delta^a_{l\leftarrow mn}(t,\cdot)$ and $\Delta^a_{mn}(t,\cdot)$ differ by only one term out of $L$, where $a_{mn}$ and $\hat{a}_{mn}(t)$ are $O(1/\sqrt{N})$, it is reasonable to assume that $\hat{a}_{l,mn}(t)-\hat{a}_{mn}(t)$ is $O(1/N)$ and that both $\hat{a}^2_{l,mn}(t)-\hat{a}^2_{mn}(t)$ and $\nu^a_{l,mn}(t)-\nu^a_{mn}(t)$ are $O(1/N^{3/2})$. The remaining entries in Table II will be explained below.

² Other scalings on $x_{nl}$, $a_{mn}$, and $z_{ml}$ could be used as long as they are consistent with the relationship $\mathbf{Z}=\mathbf{AX}$.
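The scaling conventions can be sanity-checked by simulation: if the $x$ entries are $O(1)$ in magnitude and the $a$ entries are $O(1/\sqrt{N})$, then $z=\sum_k a_k x_k$ has variance that stays $O(1)$ as $N$ grows. A minimal Monte Carlo sketch (our illustration, assuming iid Gaussian entries):

```python
import numpy as np

rng = np.random.default_rng(3)
variances = {}
for N in (100, 1000, 10000):
    T = 20000                                        # Monte Carlo trials
    a = rng.standard_normal((T, N)) / np.sqrt(N)     # a entries of size O(1/sqrt(N))
    x = rng.standard_normal((T, N))                  # x entries of size O(1)
    z = np.einsum('tn,tn->t', a, x)                  # z = sum_n a_n x_n
    variances[N] = z.var()                           # stays near 1 for every N
```

The empirical variance is essentially constant across three decades of $N$, consistent with $z$ being $O(1)$ under this scaling.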

We start by approximating the message $\Delta^x_{m\to nl}(t,\cdot)$. Expanding (7), we find

$$
\begin{aligned}
\Delta^x_{m\to nl}(t,x_{nl}) &= \log \int_{\{a_{mk}\}_{k=1}^N,\,\{x_{rl}\}_{r\neq n}} p_{y_{ml}|z_{ml}}\Big(y_{ml}\,\Big|\,\underbrace{{\textstyle\sum_{k=1}^N} a_{mk}x_{kl}}_{z_{ml}}\Big)\\
&\qquad \times \prod_{r\neq n}\exp\big(\Delta^x_{m\leftarrow rl}(t,x_{rl})\big)\prod_{k=1}^N \exp\big(\Delta^a_{l\leftarrow mk}(t,a_{mk})\big) + \mathsf{const}. \quad (13)
\end{aligned}
$$

For large $N$, the CLT motivates the treatment of $z_{ml}$, the random variable associated with the sum identified in (13), conditioned on $x_{nl}=x_{nl}$, as Gaussian and thus completely characterized by a (conditional) mean and variance. Defining the zero-mean r.v.s $\tilde{a}_{l,mn} \triangleq a_{mn}-\hat{a}_{l,mn}(t)$ and $\tilde{x}_{m,nl} \triangleq x_{nl}-\hat{x}_{m,nl}(t)$, where $a_{mn}$ is distributed according to $\frac{1}{C}\exp(\Delta^a_{l\leftarrow mn}(t,\cdot))$ and $x_{nl}$ according to $\frac{1}{C}\exp(\Delta^x_{m\leftarrow nl}(t,\cdot))$, we can write

$$
z_{ml} = \big(\hat{a}_{l,mn}(t)+\tilde{a}_{l,mn}\big)x_{nl} + \sum_{k\neq n}\big(\hat{a}_{l,mk}(t)\hat{x}_{m,kl}(t) + \hat{a}_{l,mk}(t)\tilde{x}_{m,kl} + \tilde{a}_{l,mk}\hat{x}_{m,kl}(t) + \tilde{a}_{l,mk}\tilde{x}_{m,kl}\big) \quad (14)
$$

after which it is straightforward to see that

$$
\begin{aligned}
\mathrm{E}\{z_{ml}\,|\,x_{nl}=x_{nl}\} &= \hat{a}_{l,mn}(t)\,x_{nl} + \hat{p}_{n,ml}(t) \quad (15)\\
\mathrm{var}\{z_{ml}\,|\,x_{nl}=x_{nl}\} &= \nu^a_{l,mn}(t)\,x_{nl}^2 + \nu^p_{n,ml}(t) \quad (16)
\end{aligned}
$$

for

$$
\begin{aligned}
\hat{p}_{n,ml}(t) &\triangleq \sum_{k\neq n}\hat{a}_{l,mk}(t)\hat{x}_{m,kl}(t) \quad (17)\\
\nu^p_{n,ml}(t) &\triangleq \sum_{k\neq n}\big(\hat{a}^2_{l,mk}(t)\nu^x_{m,kl}(t) + \nu^a_{l,mk}(t)\hat{x}^2_{m,kl}(t) + \nu^a_{l,mk}(t)\nu^x_{m,kl}(t)\big). \quad (18)
\end{aligned}
$$
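The conditional moments (15)-(18) can be verified by Monte Carlo. The sketch below (ours, with Gaussian stand-ins for the message pdfs and hypothetical parameter values) pins $x_{nl}$ to a fixed value, draws the remaining variables from their message pdfs, and compares the empirical mean and variance of $z_{ml}$ against (15)-(16):

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 50, 0                    # dimension and the conditioned index n
xnl = 1.7                       # conditioning value x_nl = xnl

# Gaussian stand-ins for the message pdfs of a_mk and x_kl (means/variances)
a_hat = rng.standard_normal(N) / np.sqrt(N); nu_a = np.full(N, 1.0 / N)
x_hat = rng.standard_normal(N);              nu_x = np.full(N, 0.5)

# (17)-(18): mean/variance contributions excluding the k = n term
mask = np.arange(N) != n
p_hat_n = np.sum(a_hat[mask] * x_hat[mask])
nu_p_n = np.sum(a_hat[mask]**2 * nu_x[mask] + nu_a[mask] * x_hat[mask]**2
                + nu_a[mask] * nu_x[mask])
mean_15 = a_hat[n] * xnl + p_hat_n            # right side of (15)
var_16 = nu_a[n] * xnl**2 + nu_p_n            # right side of (16)

# Monte Carlo: draw a, x from the message pdfs with x_nl pinned to xnl
T = 200000
a = a_hat + np.sqrt(nu_a) * rng.standard_normal((T, N))
x = x_hat + np.sqrt(nu_x) * rng.standard_normal((T, N))
x[:, n] = xnl
z = np.einsum('tk,tk->t', a, x)               # z_ml = sum_k a_mk x_kl
```

The empirical `z.mean()` and `z.var()` match (15) and (16) to within Monte Carlo error.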

With this conditional-Gaussian approximation, (13) becomes

$$
\begin{aligned}
\Delta^x_{m\to nl}(t,x_{nl}) &\approx \mathsf{const} + \log \int_{z_{ml}} p_{y_{ml}|z_{ml}}(y_{ml}\,|\,z_{ml})\, \mathcal{N}\big(z_{ml};\, \hat{a}_{l,mn}(t)x_{nl}+\hat{p}_{n,ml}(t),\ \nu^a_{l,mn}(t)x_{nl}^2+\nu^p_{n,ml}(t)\big) \quad (19)\\
&= H_{ml}\big(\hat{a}_{l,mn}(t)x_{nl}+\hat{p}_{n,ml}(t),\ \nu^a_{l,mn}(t)x_{nl}^2+\nu^p_{n,ml}(t);\, y_{ml}\big) + \mathsf{const} \quad (20)
\end{aligned}
$$

in terms of the function

$$
H_{ml}(\hat{q},\nu^q;\,y) \triangleq \log \int_z p_{y_{ml}|z_{ml}}(y\,|\,z)\,\mathcal{N}(z;\hat{q},\nu^q). \quad (21)
$$
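For intuition, $H_{ml}$ from (21) can be evaluated by direct numerical integration. Under a hypothetical AWGN likelihood $p(y|z)=\mathcal{N}(y;z,\nu^w)$ (our choice for illustration), the integral also has the closed form $\log\mathcal{N}(y;\hat{q},\nu^q+\nu^w)$, which the sketch below compares against:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

NU_W = 0.3  # assumed AWGN noise variance (illustrative)

def H(q_hat, nu_q, y):
    # H(q_hat, nu_q; y) of (21) by brute-force numerical integration
    integrand = lambda z: (norm.pdf(y, loc=z, scale=np.sqrt(NU_W))
                           * norm.pdf(z, loc=q_hat, scale=np.sqrt(nu_q)))
    val, _ = quad(integrand, q_hat - 10.0, q_hat + 10.0)
    return np.log(val)

def H_closed(q_hat, nu_q, y):
    # AWGN closed form: the z-integral of two Gaussians is again Gaussian
    return norm.logpdf(y, loc=q_hat, scale=np.sqrt(nu_q + NU_W))
```

For non-Gaussian likelihoods the quadrature route still works, which is why BiG-AMP only ever needs the single scalar integration in (21) rather than the high-dimensional integrals of (7).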

Unlike the original SPA message (7), the approximation (20) requires only a single integration. Still, additional simplifications are possible. First, notice that $\hat{p}_{n,ml}(t)$ and $\nu^p_{n,ml}(t)$ differ from the corresponding $n$-invariant quantities

$$
\begin{aligned}
\hat{p}_{ml}(t) &\triangleq \sum_{k=1}^N \hat{a}_{l,mk}(t)\hat{x}_{m,kl}(t) \quad (22)\\
\nu^p_{ml}(t) &\triangleq \sum_{k=1}^N \big(\hat{a}^2_{l,mk}(t)\nu^x_{m,kl}(t) + \nu^a_{l,mk}(t)\hat{x}^2_{m,kl}(t) + \nu^a_{l,mk}(t)\nu^x_{m,kl}(t)\big) \quad (23)
\end{aligned}
$$

by one term. In the sequel, we will assume that $\hat{p}_{ml}(t)$ and $\nu^p_{ml}(t)$ are $O(1)$, since these quantities can be recognized as the mean and variance, respectively, of an estimate of $z_{ml}$, which is $O(1)$. Writing the $H_{ml}$ term in (20) using (22)-(23),

$$
\begin{aligned}
&H_{ml}\big(\hat{a}_{l,mn}(t)x_{nl}+\hat{p}_{n,ml}(t),\ \nu^a_{l,mn}(t)x_{nl}^2+\nu^p_{n,ml}(t);\, y_{ml}\big)\\
&\quad= H_{ml}\big(\hat{a}_{l,mn}(t)(x_{nl}-\hat{x}_{m,nl}(t))+\hat{p}_{ml}(t),\\
&\qquad\qquad \nu^a_{l,mn}(t)(x_{nl}^2-\hat{x}^2_{m,nl}(t)) - \hat{a}^2_{l,mn}(t)\nu^x_{m,nl}(t) - \nu^a_{l,mn}(t)\nu^x_{m,nl}(t) + \nu^p_{ml}(t);\, y_{ml}\big) \quad (24)\\
&\quad= H_{ml}\big(\hat{a}_{l,mn}(t)(x_{nl}-\hat{x}_{nl}(t))+\hat{p}_{ml}(t)+O(1/N),\\
&\qquad\qquad \nu^a_{l,mn}(t)(x_{nl}^2-\hat{x}^2_{nl}(t))+\nu^p_{ml}(t)+O(1/N);\, y_{ml}\big) \quad (25)
\end{aligned}
$$

where in (25) we used the facts that $\hat{a}^2_{l,mn}(t)\nu^x_{m,nl}(t)$ and $\nu^a_{l,mn}(t)\nu^x_{m,nl}(t)$ are both $O(1/N)$.
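The one-term relationships between (17)-(18) and (22)-(23) that justify the rewriting in (24) are easy to verify numerically; a sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 30, 7
a_hat = rng.standard_normal(N) / np.sqrt(N)   # a-message means, O(1/sqrt(N))
nu_a = np.full(N, 1.0 / N)                    # a-message variances
x_hat = rng.standard_normal(N)                # x-message means, O(1)
nu_x = np.full(N, 0.4)                        # x-message variances

# n-invariant quantities (22)-(23): sums over all k
p_hat = np.sum(a_hat * x_hat)
nu_p = np.sum(a_hat**2 * nu_x + nu_a * x_hat**2 + nu_a * nu_x)

# (17)-(18): the same sums with the k = n term removed
mask = np.arange(N) != n
p_hat_n = np.sum(a_hat[mask] * x_hat[mask])
nu_p_n = np.sum(a_hat[mask]**2 * nu_x[mask]
                + nu_a[mask] * x_hat[mask]**2
                + nu_a[mask] * nu_x[mask])
```

Removing the single $k=n$ term subtracts exactly the quantities that (24) moves into the arguments of $H_{ml}$, and each of those quantities is small under the scalings of Table II.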

Rewriting (20) using a Taylor-series expansion in $x_{nl}$ about the point $\hat{x}_{nl}(t)$, we get

$$
\begin{aligned}
\Delta^x_{m\to nl}(t,x_{nl}) \approx\ &\mathsf{const} + H_{ml}\big(\hat{p}_{ml}(t)+O(1/N),\ \nu^p_{ml}(t)+O(1/N);\, y_{ml}\big)\\
&+ \hat{a}_{l,mn}(t)\big(x_{nl}-\hat{x}_{nl}(t)\big)\, H'_{ml}\big(\hat{p}_{ml}(t)+O(1/N),\ \nu^p_{ml}(t)+O(1/N);\, y_{ml}\big)\\
&+ 2\nu^a_{l,mn}(t)\hat{x}_{nl}(t)\big(x_{nl}-\hat{x}_{nl}(t)\big)\, \dot{H}_{ml}\big(\hat{p}_{ml}(t)+O(1/N),\ \nu^p_{ml}(t)+O(1/N);\, y_{ml}\big)\\
&+ \nu^a_{l,mn}(t)\big(x_{nl}-\hat{x}_{nl}(t)\big)^2\, \dot{H}_{ml}\big(\hat{p}_{ml}(t)+O(1/N),\ \nu^p_{ml}(t)+O(1/N);\, y_{ml}\big)\\
&+ \tfrac{1}{2}\hat{a}^2_{l,mn}(t)\big(x_{nl}-\hat{x}_{nl}(t)\big)^2\, H''_{ml}\big(\hat{p}_{ml}(t)+O(1/N),\ \nu^p_{ml}(t)+O(1/N);\, y_{ml}\big)\\
&+ O(1/N^{3/2}), \quad (26)
\end{aligned}
$$

where $H'_{ml}$ and $H''_{ml}$ are the first two derivatives of $H_{ml}$ w.r.t. its first argument and $\dot{H}_{ml}$ is the first derivative w.r.t. its second argument. Note that, in (26) and elsewhere, the higher-order terms in the Taylor expansion are written solely in terms of their scaling dependence on $N$, which is what will eventually allow us to neglect these terms (in the large-system limit).

We now approximate (26) by dropping terms that vanish, relative to the second-to-last term in (26), as $N\to\infty$. Since this second-to-last term is $O(1/N)$ due to the scalings of $\hat{a}_{l,mn}(t)$, $x_{nl}$, and $\hat{x}_{nl}(t)$, we drop terms that vanish relative to $O(1/N)$, such as the final term. We also replace $\hat{a}_{l,mn}(t)$ with $\hat{a}_{mn}(t)$, and $\nu^a_{l,mn}(t)$ with $\nu^a_{mn}(t)$, since in both cases the difference is $O(1/N)$. Finally, we drop the $O(1/N)$ terms inside the $H_{ml}$ derivatives, which can be justified by taking a Taylor-series expansion of these derivatives with respect to the $O(1/N)$ perturbations and verifying that the higher-order terms in this latter expansion are negligible. All of these approximations are analogous to those made in previous AMP derivations, e.g., [30], [31], and [32].

Applying these approximations to (26) and absorbing $x_{nl}$-invariant terms into the $\mathsf{const}$ term, we obtain

$$
\begin{aligned}
\Delta^x_{m\to nl}(t,x_{nl}) \approx\ &\big[\hat{s}_{ml}(t)\hat{a}_{mn}(t) + \nu^s_{ml}(t)\hat{a}^2_{mn}(t)\hat{x}_{nl}(t)\big]\, x_{nl}\\
&- \tfrac{1}{2}\big[\nu^s_{ml}(t)\hat{a}^2_{mn}(t) - \nu^a_{mn}(t)\big(\hat{s}^2_{ml}(t)-\nu^s_{ml}(t)\big)\big]\, x_{nl}^2 + \mathsf{const}, \quad (27)
\end{aligned}
$$

where we used the relationship

$$
\dot{H}_{ml}(\hat{q},\nu^q;\,y) = \tfrac{1}{2}\Big[\big(H'_{ml}(\hat{q},\nu^q;\,y)\big)^2 + H''_{ml}(\hat{q},\nu^q;\,y)\Big] \quad (28)
$$

and defined

$$
\begin{aligned}
\hat{s}_{ml}(t) &\triangleq H'_{ml}\big(\hat{p}_{ml}(t),\nu^p_{ml}(t);\,y_{ml}\big) \quad (29)\\
\nu^s_{ml}(t) &\triangleq -H''_{ml}\big(\hat{p}_{ml}(t),\nu^p_{ml}(t);\,y_{ml}\big). \quad (30)
\end{aligned}
$$

Note that (27) is essentially a Gaussian approximation to the pdf $\frac{1}{C}\exp(\Delta^x_{m\to nl}(t,\cdot))$.

We show in Appendix A that

$$
\begin{aligned}
\hat{s}_{ml}(t) &= \frac{1}{\nu^p_{ml}(t)}\big(\hat{z}_{ml}(t)-\hat{p}_{ml}(t)\big) \quad (31)\\
\nu^s_{ml}(t) &= \frac{1}{\nu^p_{ml}(t)}\Big(1-\frac{\nu^z_{ml}(t)}{\nu^p_{ml}(t)}\Big), \quad (32)
\end{aligned}
$$

for the conditional mean and variance

 ˆzml(t)
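The relationships (28) and (31)-(32) can be checked in closed form for a hypothetical AWGN likelihood $p(y|z)=\mathcal{N}(y;z,\nu^w)$, under which $H$, its derivatives, and the conditional moments of $z$ are all elementary (our sketch with illustrative values):

```python
import numpy as np

# Assumed AWGN setting: H(q, nu; y) = log N(y; q, nu + nu_w), so all
# derivatives and the conditional moments z_hat, nu_z are in closed form.
p_hat, nu_p, y, nu_w = 0.3, 0.8, 1.5, 0.2
s2 = nu_p + nu_w

H1 = (y - p_hat) / s2                              # H'  w.r.t. first argument
H2 = -1.0 / s2                                     # H'' w.r.t. first argument
Hdot = -0.5 / s2 + (y - p_hat) ** 2 / (2 * s2**2)  # derivative w.r.t. second argument

# Conditional mean/variance of z given y under the pseudo-prior N(z; p_hat, nu_p)
z_hat = p_hat + (nu_p / s2) * (y - p_hat)
nu_z = nu_p * nu_w / s2

s_hat = (z_hat - p_hat) / nu_p                     # right-hand side of (31)
nu_s = (1.0 - nu_z / nu_p) / nu_p                  # right-hand side of (32)
```

In this Gaussian case (28) and (31)-(32) hold exactly, not just approximately, which makes it a convenient unit test for any implementation of the $H_{ml}$ machinery.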