Expectation Propagation for Approximate Inference:
Free Probability Framework
Abstract
We study asymptotic properties of expectation propagation (EP), a method for approximate inference originally developed in machine learning. Applied to generalized linear models, EP iteratively computes a multivariate Gaussian approximation to the exact posterior distribution. The computational cost of repeatedly updating the covariance matrices severely limits the application of EP to large problem sizes. In this study, we present a rigorous analysis by means of free probability theory that allows us to overcome this computational bottleneck, provided that certain data matrices in the problem fulfill suitable properties of asymptotic freeness. We demonstrate the relevance of our approach on the gene selection problem of a microarray dataset.
I Introduction
We consider the problem of approximate (Bayesian) inference for a latent vector with its posterior probability density function (pdf) of the form where is a prior pdf and is an matrix. Here all variables are real-valued. This probabilistic model is often referred to as a “generalized linear model” [1]. For a (multivariate) Gaussian , it covers, e.g., many Gaussian process inference problems in machine learning. But we are also interested in more general cases where both factors can be non-Gaussian. Such models naturally appear in signal recovery problems where a signal described by a vector is linearly coupled and then passed through a channel with the output signal vector and the channel pdf .
It is convenient to introduce the latent vector as an auxiliary vector [2] and study the posterior pdf of the latent vector^{1} (^{1}By abuse of notation, for column vectors and we introduce the conjugate column vector notation , with denoting the (conjugate, in general) transposition.), which factorizes according to
(1) 
Here, for brevity, we omit the dependencies on and in the notation of and , respectively.
We are interested in the method of (Gaussian) expectation propagation (EP) [3, 4]. It approximates the marginal posterior pdf by , which is a marginal of the Gaussian pdf
(2) 
Here, the parameters of and are obtained from the expectation consistency (EC) conditions
(3) 
with the expectation notation for a pdf . The set is specified in such a way that yields the set of all natural parameters of the Gaussian pdf . For example, the mean and the covariance are required in general, i.e. .
The conventional approach in machine learning is to restrict the matrices and to be diagonal. Consequently, the EC conditions in (3) are fulfilled for
(4) 
The diagonal-EP approach has shown excellent performance for many relevant machine learning problems [5, 3, 6].
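The EC conditions above amount to matching Gaussian moments against “tilted” (generally non-Gaussian) distributions site by site. The following is a minimal, generic sketch of this single-site moment-matching operation, not the authors' implementation: the site factor, the quadrature grid, and the function name are our own illustrative choices, and practical EP code would use analytic moments or Gauss–Hermite quadrature instead.

```python
import numpy as np

def match_moments(cavity_mu, cavity_var, site_fn, half_width=30.0, n=20001):
    """Match mean/variance of the tilted distribution N(x; mu, v) * t(x).

    `site_fn` is a (generally non-Gaussian) nonnegative site factor t(x).
    Returns the mean and variance of the matched Gaussian. A plain grid
    sum is used purely for illustration; the normalization cancels out.
    """
    s = np.sqrt(cavity_var)
    x = np.linspace(cavity_mu - half_width * s, cavity_mu + half_width * s, n)
    tilted = np.exp(-0.5 * (x - cavity_mu) ** 2 / cavity_var) * site_fn(x)
    z = tilted.sum()
    mean = (x * tilted).sum() / z
    var = ((x - mean) ** 2 * tilted).sum() / z
    return mean, var

# Sanity check with a Gaussian site: moment matching must then be exact.
# The product of N(0, 1) and N(1, 1) is proportional to N(0.5, 0.5).
mu, v = match_moments(0.0, 1.0, lambda x: np.exp(-0.5 * (x - 1.0) ** 2))
```

With a truly non-Gaussian site (e.g. a Laplace or spike-and-slab factor) the same routine returns the matched Gaussian moments needed by the EP update.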
In the diagonal-EP case, where and are (arbitrary) diagonal, an iterative algorithm associated with the EP fixed-point equations (3) requires a matrix inversion at each iteration. This gives the algorithm a cubic computational complexity, say , per iteration, which makes a direct application of the method to large systems problematic. A simple way to resolve this problem is the scalar-EP approach [4], in which the diagonal matrices for are restricted to , where is the identity matrix. Consequently, the EC conditions in (3) are fulfilled for
(5) 
This approach has recently attracted considerable attention in the information theory community [7, 8, 9, 10, 11] because of its strong connection with approximate message passing techniques [12, 13, 14]. Moreover, important theoretical analyses of the trajectories of the scalar-EP fixed-point algorithms have been developed in the context of the linear observation model [15, 16, 17].
At first sight, the diagonal-EP method may appear to be a plain approximation. But it actually coincides with the powerful cavity method introduced in statistical physics [5, 18]. On the other hand, though the asymptotic consistency of the scalar-EP method with the powerful replica ansatz [19] is now well understood [15, 16, 17], it is still unclear how to obtain the method per se from an explicit mathematical argument.
In this study, we introduce a theoretical approach by means of free probability theory [20, 21] to relate the diagonal and scalar EP methods. Specifically, we show that if the data matrix fulfills certain properties of asymptotic freeness, both methods become asymptotically equivalent. Though the underlying asymptotic freeness conditions will typically hold only approximately in practice, we are nevertheless convinced that they can be reasonable approximations for many real-world problems. Indeed, in Section V we demonstrate the relevance of our free probability approximations on the gene selection problem of a microarray dataset.
I-A Related Works
The main technical step in the proof of the aforementioned contribution is to show the following: Let and be real diagonal and Hermitian matrices, respectively, and let be an diagonal matrix independent of and . The diagonal entries of are independent and equal to with equal probabilities. If and are asymptotically free, we have as , where is an appropriately computed scalar. Such a “local law” for the addition of free matrices has been shown in [22] (see also [23]), but for invariant , i.e. , where is a Haar matrix that is independent of the diagonal .
While invariant matrices are important ensembles for which asymptotic freeness holds, there are important matrix ensembles that are not invariant but for which asymptotic freeness still holds. For example, consider as above. Then, substituting the Haar matrix with a randomly permuted Fourier or Hadamard matrix preserves the asymptotic freeness; see [24] for further information.
The proof technique of [22] is based on a probabilistic analysis of certain operations under the Haar measure. Our proof technique is algebraic and considerably simpler, which could make it a useful approach for obtaining such local-law properties of random matrices in different problems. Indeed, in our context we need to extend the above result to a more involved matrix model, for which our algebraic approach is convenient. On the other hand, a weak aspect of our method is that we prove convergence in the norm sense (see the explanation in the first paragraph of Section IV), while [22] shows pointwise convergence. This weakness is potentially remediable, but addressing it exceeds the scope of the present contribution and is left for future study.
Finally, in [25] we and our coauthors provide a non-rigorous (and rather complicated) approach to the problem.
I-B Outline
This paper is organized as follows: Section II briefly presents the concept of “asymptotic freeness” (in the almost-sure sense) in random matrix theory; see [26] for a detailed exposition. Section III presents a more explicit form of the fixed-point equations in (3) for the diagonal and scalar EP methods. Our main result is presented in Section IV. Section V is devoted to the simulation results. Section VI gives a summary and outlook. The proof of the main result is located in the Appendix.
II Asymptotic Freeness
In probability theory, we say that the real and bounded random variables and are independent if for all integers we have . The role of independence in expectations of products of (commuting) random variables is analogous to the role of “asymptotic freeness” in asymptotic normalized traces of products of (non-commuting) matrices, with for an matrix . Specifically, we say that the matrices and are asymptotically free if for all and for all integers we have [22]
(6) 
Note that adjacent factors in the product must be polynomials in different matrices. This encodes the highest degree of non-commutativity in the definition of asymptotic freeness. We next present the concept of asymptotic freeness in a more general and formal setup [26]. To this end, we define a non-commutative polynomial set in matrices of order :
(7) 
For example, consists of matrices of the form
Definition 1
We refer the reader to [26, Section 4.3] for some important examples of asymptotically free matrices.
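The defining property above can be sanity-checked numerically: for a diagonal matrix and an independently Haar-rotated one, the normalized trace of an alternating product of trace-centered polynomials should be close to zero at moderate sizes. The matrix size, the entry distributions, and the particular polynomials below are arbitrary illustrative choices on our part.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

ntr = lambda M: np.trace(M) / N          # normalized trace tr_N

# A: diagonal matrix; B: independently Haar-rotated diagonal matrix.
A = np.diag(rng.uniform(-1.0, 1.0, N))
Q, R = np.linalg.qr(rng.standard_normal((N, N)))
O = Q @ np.diag(np.sign(np.diag(R)))     # sign fix makes O Haar-distributed
B = O @ np.diag(rng.uniform(-1.0, 1.0, N)) @ O.T

I = np.eye(N)
# Trace-centered polynomials p(A) - tr_N(p(A)) I and q(B) - tr_N(q(B)) I.
p1 = A @ A - ntr(A @ A) * I
q1 = B - ntr(B) * I
p2 = A - ntr(A) * I

# Asymptotic freeness predicts that this alternating mixed trace vanishes
# as N grows; at N = 1000 its fluctuations are already of order 1/N.
mixed = ntr(p1 @ q1 @ p2 @ q1)
```

Replacing the Haar rotation by a randomly permuted Hadamard transform (as discussed in Section I-A) leaves the check qualitatively unchanged.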
III The EP Fixed-Point Equations
In the sequel, we present a more explicit form of the fixed-point equations in (3) for the diagonal and scalar EP methods. First, regarding the first expectation term in (3), we define for convenience the vectors
(9) 
Second, regarding the third expectation term in (3), we define
(10) 
Here, we point out the immediate identities and . In particular, by invoking standard Gaussian integral identities, we obtain the explicit expressions of and as
(11)  
(12) 
where, for notational convenience, associated with the vectors and we consider the EP parameters for as
(13) 
with and . Note that for scalar-EP, and , where is the identity matrix of appropriate dimension. Similarly to (13), we also write and . Thus, the fixed-point equations (3) for the diagonal and scalar EPs take the form [25]
(14)  
(15)  
(16) 
Here, for a matrix and a vector we define and . Note that (14) & (15) and (14) & (16) constitute the fixedpoint equations of the diagonal and the scalar EPs, respectively.
III-A Cubic versus quadratic computational complexities
An iterative algorithm that solves the diagonal-EP fixed-point equations, i.e. (14) & (15), has an computational complexity per iteration (here we assume that ), as it requires updating the matrix inverse (11) at each iteration.
On the other hand, an iterative algorithm for the scalar-EP fixed-point equations, i.e. (14) & (16), can have computational complexity per iteration. Specifically, by computing the singular value decomposition of before the iterations start, the cubic computational complexity of updating (11), i.e. , at each iteration reduces to a quadratic computational complexity; see, e.g., [27].
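The precomputation trick can be sketched as follows. We emphasize that the regularized-inverse form below, together with all sizes and variable names, is an illustrative assumption on our part and not the exact EP update: once the SVD is available, each new regularization level costs only matrix-vector work.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 300, 200
X = rng.standard_normal((M, N))
b = rng.standard_normal(N)

# One-off O(M N^2) cost before the iterations start.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def solve_regularized(lam, b):
    """Return (X^T X + lam I)^{-1} b in O(N^2) using the precomputed SVD."""
    d = 1.0 / (s**2 + lam)                 # spectrum of the inverse
    y = Vt.T @ (d * (Vt @ b))
    if len(s) < N:                         # null space of X (when M < N)
        y += (b - Vt.T @ (Vt @ b)) / lam
    return y

# Each iteration may change lam; only O(N^2) work is needed per call.
lam = 0.7
direct = np.linalg.solve(X.T @ X + lam * np.eye(N), b)
fast = solve_regularized(lam, b)
```

The per-iteration saving is exactly the point made in the text: the cubic factorization is paid once, outside the loop.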
Within the scalar-EP method, it is also possible to circumvent the need for the singular value decomposition. In short, this can be done by bypassing the matrix inverse (11) in fulfilling (14) (see the next subsection) and fulfilling (16) by means of the so-called R-transform of free probability (see [25]), which in general can be estimated from spectral moments up to some order [28].
III-B An equivalent representation of (14)
From (12), in (14) holds if and only if
(17) 
Combining this equality with the first equality of (14), i.e. , we have
(18) 
Here, we express implicitly via : Consider the scalar functions , see (9). One can show that is a strictly increasing function of , and hence its inverse (with respect to functional composition) exists. Thus, holds if and only if
(19) 
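Numerically, the inverse of such a strictly increasing scalar map can be obtained by simple bracketing. The function below is a hypothetical stand-in for the map in (9) (whose symbols are not reproduced here); the bracket and iteration count are illustrative choices.

```python
import numpy as np

def invert_increasing(f, y, lo=-1e6, hi=1e6, iters=200):
    """Solve f(x) = y for a strictly increasing f by bisection.

    The bracket [lo, hi] is assumed to contain the solution; 200
    bisections shrink it far below double precision.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative strictly increasing map (a stand-in for the function in (9)).
f = lambda x: x + np.tanh(x)
x_star = invert_increasing(f, 1.5)   # then f(x_star) is close to 1.5
```

Monotonicity is what guarantees both the existence of the inverse and the correctness of the bisection; no derivative information is needed.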
IV The Main Result
For fixed , since (18) & (19) do not depend on , both the diagonal and scalar EP methods share the same fixed-point equations. The fixed-point equations of differ, however: that of diagonal-EP is given by (15) and that of scalar-EP by (16). We next show, under certain asymptotic freeness conditions, that and , where & are as in (15) and & fulfill (16). Here, for , stands for , where the distribution function of is defined as the empirical distribution function of ^{2} (^{2}Specifically, the empirical distribution function of the entries of is given by .), i.e. converges in the norm to a random variable .
Theorem 1
Let and be and real diagonal and invertible matrices, respectively, and let be an matrix. Furthermore, let and be and diagonal matrices, respectively, whose diagonal entries are introduced through
(20) 
where . Moreover, let and be and diagonal matrices, respectively, independent of , and . The diagonal entries of both and are independent and equal to with equal probabilities. Throughout, we assume that , and each have almost surely (a.s.) a limiting eigenvalue distribution (LED) as with .

If has a LED a.s. and is asymptotically free of a.s., then there exists a deterministic variable such that we have a.s.
(21) whenever has a.s. an asymptotically bounded spectral norm.

If has a LED a.s. and is asymptotically free of a.s., then there exists a deterministic variable such that we have a.s.
(22) whenever has a.s. an asymptotically bounded spectral norm.
Theorem 1 is self-contained in the sense that its premises make no reference to any of the EP methods, as and are arbitrary diagonal matrices. Moreover, instead of assuming that, e.g., is asymptotically free of for any , it is sufficient to assume that is asymptotically free of .
The asymptotic freeness conditions in Theorem 1 hold if is bi-invariant (i.e. the probability distribution of is invariant under multiplication from the left and from the right by any independent orthogonal matrices), , and are independent of each other, and , and have LEDs with uniformly bounded spectral norms [29]. Of course, the independence condition becomes artificial in an application of EP. Nevertheless, since the diagonal entries of and each have a limiting (deterministic) distribution, we may treat the underlying matrices as asymptotically independent.
Note that (23) may not yield unique solutions for and . Thus, for the asymptotic equivalence between the diagonal and scalar EP methods to hold, we need to assume that the latter yields unique solutions for and . It would be interesting to carry out a “region of stability” analysis [30] to determine where the solutions offered by (23) are unique. This could clarify the region of system parameters in which both methods yield reliable solutions, but we consider this to be out of the scope of the current contribution.
V Simulation Results on Real Data
To demonstrate that our free probability assumptions are applicable in practice, we address a novel application of EP to gene selection^{3} (^{3}The work [31] applied a factorizing EP method to this problem, which neglects the dependencies of the latent variables in the posterior and yields a cruder approximation.). This problem is based on microarray data [32], which are used to analyze tissue samples taken from patients. The data consist of , where the matrix element stands for the measurement of the expression level of the th gene for the th patient, for and , and the vector contains binary labels, where or indicates whether the th patient is healthy or has cancer. Given the data, the problem is to identify the genes relevant for the cancer type of interest and to diagnose new patients. We model the inference problem as [2]
(24) 
Here, stands for a white Gaussian noise vector and for the vector of regression parameters, where corresponds to the th gene. The number of genes is typically in the range from to . To model the prior assumption that only a small subset of genes is relevant for the cancer type of interest [33], we postulate a sparsity-promoting prior distribution for in the form of a spike-and-slab (a.k.a. Bernoulli-Gaussian) prior, for .
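The spike-and-slab prior can be sketched as follows. The inclusion probability `rho` and the slab variance below are illustrative values of our own choosing, not parameters fitted to the dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_spike_and_slab(n, rho=0.05, slab_var=1.0):
    """Draw n coefficients from a Bernoulli-Gaussian (spike-and-slab) prior.

    With probability rho a gene is 'active' and its coefficient is drawn
    from the Gaussian slab; otherwise the coefficient is exactly zero
    (the spike), encoding the belief that most genes are irrelevant.
    """
    active = rng.random(n) < rho
    w = np.zeros(n)
    w[active] = np.sqrt(slab_var) * rng.standard_normal(active.sum())
    return w

w = sample_spike_and_slab(5000)
sparsity = np.mean(w == 0.0)   # close to 1 - rho = 0.95
```

The EP site update for this prior reduces to matching the moments of a two-component Gaussian mixture, which is available in closed form.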
We assume that the expression level values are irregular enough (owing to both biological and experimental variability) to be considered “random”, and that and are large enough to justify the application of the free probability approach. Then, Theorem 1 offers the approximations of the fixed-point equations of diagonal-EP in (15) as
(25a)  
(25b) 
where and are appropriately computed scalars that are asymptotically consistent with (23). In Figures 1a and 1b, we illustrate the validity of the approximations in (25) on the prostate cancer microarray dataset [34] with patients and genes. Approximation (25) is excellent for the large matrix but (as might be expected) yields cruder results for the comparably small matrix related to the small number of patients available. Nevertheless, such a mismatch does not necessarily deteriorate the approximate posterior of . To show this, we compare the posterior mean obtained using diagonal-EP, denoted by , with that of scalar-EP, denoted by . As shown in Figure 1c, both methods indeed yield close results.
VI Summary and Outlook
By utilizing the concept of asymptotic freeness, we have introduced a theoretical approach to the EP approximate inference method. Our analysis bridges the gap between the perspectives of the machine learning and information theory communities, i.e. the diagonal and scalar EP approaches, respectively. We have shown a concrete application with real data (gene selection) where the free probability framework is applicable.
For data matrices that show strong deviations from asymptotic freeness, we may need to take the deviations into account by a perturbation approach. Specifically, our proof technique is based on the exact infinite-sum representation of the inverse of a sum of matrices, originally utilized in [35] for the additivity of the R-transforms of a sum of free matrices. While freeness of the matrices leads to the vanishing of most terms under the trace operation, we may keep a small, possibly increasing, number of such terms to account for the deviations. This could lead to an improvement of the scalar-EP method.
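The infinite-sum representation referred to above is, in essence, a Neumann-series expansion of the inverse of a sum of matrices. The toy check below (with arbitrary small sizes and a scaling chosen so that the series converges) illustrates the identity itself; it is not the matrix model used in the proof.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
A = np.diag(rng.uniform(1.0, 2.0, N))               # invertible diagonal part
B = 0.3 * rng.standard_normal((N, N)) / np.sqrt(N)  # small perturbation

# Neumann series: (A + B)^{-1} = sum_k (-A^{-1} B)^k A^{-1},
# valid whenever the spectral radius of A^{-1} B is below one.
Ainv = np.diag(1.0 / np.diag(A))
term = Ainv.copy()
series = np.zeros_like(A)
for _ in range(200):
    series += term
    term = -Ainv @ B @ term

exact = np.linalg.inv(A + B)
```

Taking normalized traces of the individual terms of this expansion is precisely where freeness makes most mixed terms vanish, and where retaining a few of them yields the proposed perturbative correction.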
Acknowledgment
The authors would like to thank Daniel Hernández-Lobato and José M. Hernández-Lobato for discussions on the real-data application of our approach and for providing the microarray datasets, and the anonymous reviewers for their suggestions and comments.
References
 [1] P. McCullagh and J. A. Nelder, Generalized Linear Models. Monograph on Statistics and Applied Probability (Chapman and Hall), 1989, no. 37.
 [2] J. H. Albert and S. Chib, “Bayesian analysis of binary and polychotomous response data,” Journal of the American statistical Association, vol. 88, no. 422, pp. 669–679, 1993.
 [3] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, ser. UAI ’01, 2001, pp. 362–369.
 [4] M. Opper and O. Winther, “Expectation consistent approximate inference,” Journal of Machine Learning Research, vol. 6, pp. 2177–2204, 2005.
 [5] ——, “Gaussian processes for classification: Meanfield algorithms,” Neural Computation, pp. 2655–2684, 2000.
 [6] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning. Cambridge: MIT press, 2006, vol. 1.
 [7] Y. Kabashima and M. Vehkaperä, “Signal recovery using expectation consistent approximation for linear observations,” in 2014 IEEE International Symposium on Information Theory, June 2014, pp. 226–230.
 [8] B. Çakmak, O. Winther, and B. H. Fleury, “SAMP: Approximate message passing for general matrix ensembles,” in Proc. IEEE Information Theory Workshop (ITW), November 2014.
 [9] J. Ma and L. Ping, “Orthogonal AMP,” IEEE Access, vol. 5, pp. 2020–2033, 2017.
 [10] B. Çakmak, O. Winther, and B. H. Fleury, “SAMP for nonlinear observation models,” in 2015 IEEE International Symposium on Information Theory (ISIT), June 2015, pp. 2807–2811.
 [11] A. Fletcher, M. SahraeeArdakan, S. Rangan, and P. Schniter, “Expectation consistent approximate inference: Generalizations and convergence,” in 2016 IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 190–194.
 [12] Y. Kabashima, “A CDMA multiuser detection algorithm on the basis of belief propagation,” Journal of Physics A: Mathematical and General, vol. 36, no. 43, p. 11111, October 2003.
 [13] D. L. Donoho, A. Maleki, and A. Montanari, “Messagepassing algorithms for compressed sensing,” Proceedings of the National Academy of Sciences, vol. 106, no. 45, pp. 18914–18919, September 2009.
 [14] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE International Symposium on Information Theory (ISIT), SaintPetersburg, Russia, July 2011.
 [15] K. Takeuchi, “Rigorous dynamics of expectationpropagationbased signal recovery from unitarily invariant measurements,” in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 501–505.
 [16] S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 1588–1592.
 [17] B. Çakmak, M. Opper, O. Winther, and B. H. Fleury, “Dynamical functional theory for compressed sensing,” in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 2143–2147.
 [18] M. Opper and O. Winther, “Adaptive and selfaveraging ThoulessAndersonPalmer mean field theory for probabilistic modeling,” Physical Review E, vol. 64, 056131, October 2001.
 [19] A. M. Tulino, G. Caire, S. Shamai, and S. Verdú, “Support recovery with sparsely sampled free random matrices,” IEEE Transactions on Information Theory, vol. 59, pp. 4243–4271, July 2013.
 [20] F. Hiai and D. Petz, The Semicircle Law, Free Random Variables and Entropy. American Mathematical Society, 2006.
 [21] J. A. Mingo and R. Speicher, Free probability and random matrices, ser. Fields Institute Monographs. Springer, 2017, vol. 35.
 [22] Z. Bao, L. Erdős, and K. Schnelli, “Local law of addition of random matrices on optimal scale,” Communications in Mathematical Physics, vol. 349, no. 3, pp. 947–990, Feb 2017.
 [23] V. Kargin, “Subordination for the sum of two random matrices,” Ann. Probab., vol. 43(4), pp. 2119–2150, 2015.
 [24] G. W. Anderson and B. Farrell, “Asymptotically liberating sequences of random unitary matrices,” Advances in Mathematics, vol. 255, pp. 381–413, 2014.
 [25] B. Çakmak, M. Opper, B. H. Fleury, and O. Winther, “Selfaveraging expectation propagation,” arXiv preprint arXiv:1608.06602, August 2016.
 [26] R. R. Müller, G. Alfano, B. M. Zaidel, and R. de Miguel, “Applications of large random matrices in communications engineering,” arXiv preprint arXiv:1310.5479, 2013.
 [27] P. Schniter, S. Rangan, and A. K. Fletcher, “Vector approximate message passing for the generalized linear model,” in 50th Asilomar Conference on Signals, Systems and Computers, Nov 2016, pp. 1525–1529.
 [28] J. Pielaszkiewicz, D. von Rosen, and M. Singull, “Cumulantmoment relation in free probability theory,” Acta et Commentationes Universitatis Tartuensis de Mathematica, vol. 18.2, pp. 265–278, 2014.
 [29] B. Collins and P. Śniady, “Integration with respect to the Haar measure on unitary, orthogonal and symplectic group,” Communications in Mathematical Physics, vol. 264, no. 3, pp. 773–795, 2006.
 [30] J. R. L. De Almeida and D. J. Thouless, “Stability of the SherringtonKirkpatrick solution of a spin glass model,” Journal of Physics A: Mathematical and General, vol. 11, no. 5, p. 983, 1978.
 [31] D. Hernández-Lobato, J. M. Hernández-Lobato, and A. Suárez, “Expectation propagation for microarray data classification,” Pattern Recognition Letters, vol. 31, no. 12, pp. 1618–1626, 2010.
 [32] K. E. Lee, N. Sha, E. R. Dougherty, M. Vannucci, and B. K. Mallick, “Gene selection: a Bayesian variable selection approach,” Bioinformatics, vol. 19, no. 1, pp. 90–97, 2003.
 [33] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, pp. 531–537, 1999.
 [34] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie et al., “Gene expression correlates of clinical prostate cancer behavior,” Cancer cell, vol. 1, no. 2, pp. 203–209, 2002.
 [35] T. Tao, Topics in Random Matrix Theory. American Mathematical Society, 2012, vol. 132.
 [36] H. Bercovici and D. Voiculescu, “Free convolution of measures with unbounded support,” Indiana University Mathematics Journal, vol. 42, no. 3, pp. 733–773, 1993.
 [37] U. Haagerup and F. Larsen, “Brown’s spectral distribution measure for Rdiagonal elements in finite von Neumann algebras,” Journal of Functional Analysis, vol. 176, pp. 331–367, 1999.
 [38] R. R. Müller, D. Guo, and A. L. Moustakas, “Vector precoding for wireless MIMO systems and its replica analysis,” IEEE Journal on Selected Areas in Communications, vol. 26, no. 3, pp. 530–540, April 2008.
 [39] J. S. Geronimo and T. P. Hill, “Necessary and sufficient condition that the limit of Stieltjes transforms is a Stieltjes transform,” Journal of Approximation Theory, vol. 121, no. 1, pp. 54–60, 2003.
Appendix A Proof of Theorem 1: Preliminaries
A-A The R- and S-transforms
Definition 2
[36, 37] Let be a probability distribution function on the real line and be its Stieltjes transform, i.e.
(26) 
Moreover, let us denote the inverse (with respect to functional composition) of the Stieltjes transform by . Then, the R-transform of is given by
(27) 
Moreover, for , the S-transform of is given by
(28) 
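A quick numerical consequence of the additivity of the R-transform under free additive convolution is that the first two free cumulants, i.e. the mean and variance of the LED, of a sum of free matrices simply add. The following toy check uses independent (hence asymptotically free) Wigner matrices; the size and the variances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1500

def wigner(sigma):
    """Symmetric random matrix whose LED is a semicircle of variance sigma^2."""
    G = rng.standard_normal((N, N))
    return sigma * (G + G.T) / np.sqrt(2 * N)

A, B = wigner(1.0), wigner(2.0)

def led_var(M):
    """Variance of the eigenvalue distribution: tr_N(M^2) - tr_N(M)^2."""
    return np.sum(M * M) / N - (np.trace(M) / N) ** 2  # M is symmetric

# R-transform additivity implies that the second free cumulant (the LED
# variance) of A + B approaches var(A) + var(B) = 1 + 4 = 5 as N grows.
v = led_var(A + B)
```

The same additivity at all orders is what underlies the free-convolution identities used in the proof of Theorem 2.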
Lemma 1
Let be a probability distribution function on the real line and be its R-transform. Moreover, let and . Then, we have .
Lemma 2
[38] Let be a probability distribution function on the real line and be its R-transform. Furthermore, let
(30) 
Moreover, let . Then, we have
(31) 
A-B The Additive and Multiplicative Free Convolutions
If a matrix has a LED, the Stieltjes transform of the LED of is defined as
(32) 
Also, let . Thus, stands for the R-transform of the LED of . Similarly, stands for the S-transform of the LED of .
A-C A local law for the addition of free matrices
Theorem 2
Let and be an real diagonal matrix and an matrix, respectively. Furthermore, let be an diagonal matrix with its diagonal entries introduced through
(35) 
Moreover, let be an diagonal matrix independent of and . The diagonal entries of are independent and equal to with equal probabilities.
Let and each have a LED a.s. and let be a.s. asymptotically free of as . Then, for and , we have a.s. that
(36) 
whenever has a.s. an asymptotically bounded spectral norm.
Proof of Theorem 2
For convenience we define . By invoking Lemma 1 and the additive free convolution (33) we have that
(37)  
(38)  
(39)  
(40) 
Here, since is assumed to have an asymptotically bounded spectral norm (a.s.). We now define
(41) 
To prove Theorem 2 it is sufficient to show (a.s.) that
(42) 
i.e. where is distributed according to the empirical distribution function of .
Note that . Hence, we have
(43)  
(44) 
Here, and throughout this section, the expectation is taken over the random matrix . Thus, showing the condition
(45) 
is sufficient for the proof. To that end, we first represent the matrix in a form convenient for the algebra of asymptotic freeness, i.e. (8).
Lemma 3
Let the matrices , and be given as in (41). Furthermore, let . Moreover, for let us define