Inference for Generalized Linear Models via Alternating Directions and Bethe Free Energy Minimization

Sundeep Rangan, Alyson K. Fletcher, Philip Schniter, and Ulugbek S. Kamilov
S. Rangan (email: srangan@poly.edu) is with the Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY. His work was supported by the National Science Foundation under Grant No. 1116589. A. K. Fletcher (email: akfletcher@ucla.edu) is with the Departments of Statistics, Mathematics, and Electrical Engineering, University of California, Los Angeles. P. Schniter (email: schniter@ece.osu.edu) is with the Department of Electrical and Computer Engineering, The Ohio State University. His work was supported in part by the National Science Foundation Grants CCF-1218754, CCF-1018368, and CCF-1527162. U. S. Kamilov (email: ulugbek.kamilov@epfl.ch) is with the Biomedical Imaging Group, École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne VD, Switzerland. His work was supported by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC Grant Agreement 267439. This work was presented in part at the 2015 IEEE Symposium on Information Theory.
Abstract

Generalized Linear Models (GLMs), where a random vector $\mathbf{x}$ is observed through a noisy, possibly nonlinear, function of a linear transform $\mathbf{z}=\mathbf{A}\mathbf{x}$, arise in a range of applications in nonlinear filtering and regression. Approximate Message Passing (AMP) methods, based on loopy belief propagation, are a promising class of approaches for approximate inference in these models. AMP methods are computationally simple, general, and admit precise analyses with testable conditions for optimality for large i.i.d. transforms $\mathbf{A}$. However, the algorithms can diverge for general $\mathbf{A}$. This paper presents a convergent approach to the generalized AMP (GAMP) algorithm based on direct minimization of a large-system limit approximation of the Bethe Free Energy (LSL-BFE). The proposed method uses a double-loop procedure, where the outer loop successively linearizes the LSL-BFE and the inner loop minimizes the linearized LSL-BFE using the Alternating Direction Method of Multipliers (ADMM). The proposed method, called ADMM-GAMP, is similar in structure to the original GAMP method, but with an additional least-squares minimization. It is shown that, for strictly convex, smooth penalties, ADMM-GAMP is guaranteed to converge to a local minimum of the LSL-BFE, thus providing a convergent alternative to GAMP that is stable under arbitrary transforms $\mathbf{A}$. Simulations are also presented that demonstrate the robustness of the method for non-convex penalties as well.

Index Terms: Belief propagation, ADMM, variational optimization, message passing, generalized linear models.

I Introduction

Consider the problem of estimating a random vector $\mathbf{x}$ from observations $\mathbf{y}$ as shown in Fig. 1. The unknown vector $\mathbf{x}$ is assumed to have a prior density of the form $p(\mathbf{x}) \propto \exp(-f_x(\mathbf{x}))$, and the observations are described by a likelihood function of the form $p(\mathbf{y}\mid\mathbf{z}) \propto \exp(-f_z(\mathbf{z},\mathbf{y}))$ with $\mathbf{z} = \mathbf{A}\mathbf{x}$ for some known transform $\mathbf{A} \in \mathbb{R}^{m\times n}$. In statistics, this model is a special case of a generalized linear model (GLM) [1, 2] and arises in a range of applications including statistical regression, filtering, inverse problems, and nonlinear forms of compressed sensing. The posterior density of $\mathbf{x}$ given $\mathbf{y}$ in the GLM model is given by

$p(\mathbf{x}\mid\mathbf{y}) = \frac{1}{Z(\mathbf{y})}\, \exp\big(-f_x(\mathbf{x}) - f_z(\mathbf{A}\mathbf{x}, \mathbf{y})\big)$ (1)

where $Z(\mathbf{y})$ is a normalization constant. In the sequel, we will often omit the dependence on $\mathbf{y}$ and simply write

$p(\mathbf{x}) = \frac{1}{Z}\, \exp\big(-f_x(\mathbf{x}) - f_z(\mathbf{A}\mathbf{x})\big)$ (2)

so that the dependence on $\mathbf{y}$ in the function $f_z$ and the normalization constant $Z$ is implicit. In this work, we consider the inference problem of estimating the posterior marginal distributions $p(x_j\mid\mathbf{y})$. From these posterior marginals, one can compute the posterior means and variances

$\widehat{x}_j := \mathbb{E}[x_j \mid \mathbf{y}]$ (3a)
$\tau_{x_j} := \operatorname{var}(x_j \mid \mathbf{y}).$ (3b)

We study this inference problem in the case where the functions $f_x$ and $f_z$ are separable, in that they are of the form

$f_x(\mathbf{x}) = \sum_{j=1}^{n} f_{x_j}(x_j)$ (4a)
$f_z(\mathbf{z}) = \sum_{i=1}^{m} f_{z_i}(z_i)$ (4b)

for some scalar functions $f_{x_j}$ and $f_{z_i}$. The separability assumption (4a) corresponds to the components of $\mathbf{x}$ being a priori independent. Recalling the implicit dependence of $f_z$ on $\mathbf{y}$, the separability assumption (4b) corresponds to the observations $y_i$ being conditionally independent given the transform outputs $\mathbf{z} = \mathbf{A}\mathbf{x}$.

Fig. 1: Generalized Linear Model (GLM): an unknown random vector $\mathbf{x}$ with independent components passes through a linear transform $\mathbf{z} = \mathbf{A}\mathbf{x}$ and a componentwise output map to yield the measurement vector $\mathbf{y}$.
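As a concrete (and purely illustrative) instance of the model in Fig. 1, the short Python sketch below draws one realization from a GLM with a Bernoulli-Gaussian prior and an additive-Gaussian output channel; the dimensions, sparsity rate, and noise variance are assumed values, not ones taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 100           # signal and measurement dimensions (illustrative)
rho, tau_w = 0.1, 0.01    # sparsity rate and noise variance (assumed values)

# Separable prior: Bernoulli-Gaussian components, a priori independent
x = rng.binomial(1, rho, size=n) * rng.standard_normal(n)

# Linear transform
A = rng.standard_normal((m, n)) / np.sqrt(n)
z = A @ x

# Componentwise likelihood p(y_i | z_i): here, additive Gaussian noise
y = z + np.sqrt(tau_w) * rng.standard_normal(m)
```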

For posterior densities of the form (2), there are several computationally efficient methods to find the maximum a posteriori (MAP) estimate, which is given by

$\widehat{\mathbf{x}}_{\rm MAP} := \operatorname*{arg\,max}_{\mathbf{x}}\; p(\mathbf{x}\mid\mathbf{y}) = \operatorname*{arg\,min}_{\mathbf{x}}\; \big[ f_x(\mathbf{x}) + f_z(\mathbf{A}\mathbf{x}) \big].$ (5)

Under the separability assumptions (4), the MAP minimization (5) admits a factorizable dual decomposition that can be exploited by a variety of approaches, including variants of the iterative shrinkage and thresholding algorithm (ISTA) [3, 4, 5, 6, 7, 8] and the alternating direction method of multipliers (ADMM) [9, 10, 11, 12].
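For instance, in the common special case of a quadratic output penalty and an $\ell_1$ input penalty, the MAP problem (5) can be attacked with ISTA-style proximal-gradient iterations. The sketch below is a minimal illustration of that approach; the step size and regularization weight are chosen arbitrarily rather than prescribed by this paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (componentwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_map(A, y, lam=0.1, step=None, n_iter=200):
    """ISTA for the MAP problem min_x 0.5*||y - A x||^2 + lam*||x||_1,
    i.e., (5) with a quadratic f_z and an l1-type f_x."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz const. of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                 # gradient of the quadratic output penalty
        x = soft_threshold(x - step * grad, step * lam)
    return x
```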

In contrast, the inference problem of estimating the posterior marginals and the corresponding minimum mean squared error (MMSE) estimates (3a) is often more difficult—even in the case when $f_x$ and $f_z$ are convex. As a simple example, consider the case where each $f_{x_j}$ and $f_{z_i}$ constrains its argument to belong to some interval, so that the posterior constrains $\mathbf{x}$ to belong to some polytope. The MAP estimate (5) is then given by any point in the polytope. Such a point can be computed via a linear program. However, the MMSE estimate (3a) is the centroid of the polytope, which is, in general, #P-hard to compute [13].

GLM inference methods often use a penalized quasi-likelihood method [14] or some form of Gibbs sampling [15, 16]. In recent years, Bayesian forms of approximate message passing (AMP) have been considered as a potential alternative class of methods for approximate inference in GLMs [17, 18, 19, 20, 21, 22]. AMP methods are based on Gaussian and quadratic approximations to loopy belief propagation (loopy BP) in graphical models and are both computationally simple and applicable to arbitrary separable penalty functions $f_x$ and $f_z$. In addition, for certain large i.i.d. transforms $\mathbf{A}$, they have the benefit that the behavior of the algorithm can be exactly predicted by a state evolution analysis, which then provides testable conditions for Bayes optimality [23, 22, 24].

Unfortunately, for general $\mathbf{A}$, AMP methods may diverge [25, 26]—a situation that is not surprising given that AMP is based on loopy BP, which may also diverge. Several recent modifications have been proposed to improve the stability of AMP, including damping [25], sequential updating [27], and adaptive damping [28]. However, while these methods appear to perform well empirically, little has been proven rigorously about their convergence.

The main goal of this paper is to provide a provably convergent approach to AMP. We focus on the generalized AMP (GAMP) method of [22], which allows arbitrary separable functions for both $f_x$ and $f_z$. Our approach to stabilizing GAMP is based on reconsidering the inference problem as a type of free-energy minimization. Specifically, it is known that GAMP can be considered as an iterative procedure for minimizing a large-system-limit approximation of the so-called Bethe Free Energy (BFE) [29, 30], which we abbreviate as “LSL-BFE” in the sequel. The BFE also plays a central role in loopy BP [31], and we review both the BFE and LSL-BFE in Section III.

In contrast to GAMP, which implicitly minimizes the LSL-BFE through an approximation of the sum-product algorithm, our proposed approach explicitly minimizes the LSL-BFE. We propose a double-loop algorithm, similar to the well-known Convex Concave Procedure (CCCP) [32]. Specifically, the outer loop of our method successively approximates the LSL-BFE by partially linearizing the LSL-BFE around the current belief estimate, while the inner loop minimizes the linearized LSL-BFE using ADMM [9]. Similar applications of ADMM have also been proposed for related free-energy minimizations [33, 34]. Interestingly, our proposed double-loop algorithm, which we dub ADMM-GAMP, is similar in structure to the original GAMP method of [22], but with an additional least squares optimization. We discuss these differences in detail in Section VIII.

Our main theoretical result shows that, for strictly convex penalties, the proposed ADMM-GAMP algorithm is guaranteed to converge to at least a local minimum of the LSL-BFE. In this way, we obtain a variant of the GAMP method with a provable convergence guarantee for arbitrary transforms $\mathbf{A}$. In addition, using hardening arguments similar to [35, 36], we show that ADMM-GAMP can also be applied to the MAP estimation problem, in which case we can obtain global convergence for strictly convex, smooth penalties. Also, while our theory requires convex penalties, we present simulations that show robust behavior even in non-convex cases.

II The GLM and Examples

Before describing our optimization approach, it is useful to briefly provide some examples of the model (1) to illustrate the generality of the framework. As a first example, consider the simple linear model

$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w},$ (6)

where $\mathbf{A}$ is a known matrix, $\mathbf{x}$ is an unknown vector, and $\mathbf{w}$ is a noise vector. In statistics, $\mathbf{A}$ would be the data matrix of predictors, $\mathbf{x}$ the vector of regression coefficients, $\mathbf{y}$ the vector of target or response variables, and $\mathbf{w}$ would represent the model errors. To place this model in the framework of this paper, we must impose a prior on $\mathbf{x}$ and model the noise $\mathbf{w}$ as a random vector independent of $\mathbf{x}$ and $\mathbf{A}$. Under these assumptions, the posterior density of $\mathbf{x}$ given $\mathbf{y}$ will be of the form (1) if we define

$f_x(\mathbf{x}) = -\ln p(\mathbf{x}), \qquad f_z(\mathbf{z}) = -\ln p(\mathbf{y}\mid\mathbf{z}).$ (7)

The separability assumption (4) will be valid if the components of $\mathbf{x}$ and $\mathbf{w}$ are independent, so that the prior and noise densities factor into products of scalar densities.

If the output noise is Gaussian with independent components $w_i \sim \mathcal{N}(0,\tau_w)$, the output factor in (7) has a quadratic cost, $f_{z_i}(z_i) = \frac{1}{2\tau_w}(y_i - z_i)^2 + \mathrm{const}$.

Similarly, if $\mathbf{x}$ has an i.i.d. Gaussian prior with $x_j \sim \mathcal{N}(0,\tau_0)$, the input factor will be given by the quadratic $f_{x_j}(x_j) = \frac{1}{2\tau_0}x_j^2 + \mathrm{const}$.

Note that, in this quadratic case, estimation reduces to standard least-squares estimation.
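For concreteness, under these zero-mean Gaussian assumptions (with the illustrative variances $\tau_w$ and $\tau_0$ introduced above), both the MAP and MMSE estimates coincide with the familiar ridge-regression solution:

$\widehat{\mathbf{x}} = \operatorname*{arg\,min}_{\mathbf{x}} \Big\{ \tfrac{1}{2\tau_w}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2 + \tfrac{1}{2\tau_0}\|\mathbf{x}\|^2 \Big\} = \Big(\mathbf{A}^{\sf T}\mathbf{A} + \tfrac{\tau_w}{\tau_0}\,\mathbf{I}\Big)^{-1}\mathbf{A}^{\sf T}\mathbf{y}.$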

However, much more general models are possible. For example, for Bayesian forms of compressed sensing problems [37], one can impose a sparse prior such as a Bernoulli-Gaussian or a heavy-tailed density.

Also, for the output, any likelihood that factorizes as $p(\mathbf{y}\mid\mathbf{z}) = \prod_{i=1}^m p(y_i\mid z_i)$ can be incorporated. This model would occur, for example, under componentwise output nonlinearities as considered in [38], $y_i = \phi(z_i) + w_i$,

where $\phi(\cdot)$ is a known, nonlinear function and $w_i$ is noise. The model can also include logistic regression [39], where $y_i$ is a binary class variable and $\Pr(y_i = 1 \mid z_i) = \sigma(z_i)$

for some sigmoidal function $\sigma(\cdot)$. One-bit and quantized compressed sensing [40] as well as Poisson output models [41] can also be easily modeled.
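To make the separable penalties concrete, the sketch below codes a few of the scalar penalties $f_{z_i}$ and $f_{x_j}$ corresponding to the models mentioned above; the function names and default parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Example scalar output penalties f_{z_i}(z_i) = -ln p(y_i | z_i) + const
def f_z_gaussian(z, y, tau_w=0.01):
    """Additive Gaussian noise: y = z + w, w ~ N(0, tau_w)."""
    return (y - z) ** 2 / (2 * tau_w)

def f_z_logistic(z, y):
    """Logistic regression: Pr(y=1|z) = 1/(1+exp(-z)), y in {0,1}."""
    return np.log1p(np.exp(-(2 * y - 1) * z))

def f_z_poisson(z, y):
    """Poisson output with rate exp(z): y ~ Poisson(exp(z))."""
    return np.exp(z) - y * z          # the ln(y!) constant is dropped

# Example scalar input penalty f_{x_j}(x_j) = -ln p(x_j) + const
def f_x_laplace(x, lam=1.0):
    """Heavy-tailed (Laplacian) prior, p(x) proportional to exp(-lam*|x|)."""
    return lam * np.abs(x)
```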

III Bethe Free Energy Minimization

We next provide a brief review of the Bethe Free Energy (BFE) minimization approach to estimation of marginal densities in GLMs. A more complete treatment of this topic, along with related ideas in variational inference, can be found in [31, 42].

For a generic density $p(\mathbf{u})$ of a random vector $\mathbf{u} = (u_1,\dots,u_d)$, exact computation of the marginal densities $p(u_j)$ is difficult, because it involves a potentially high-dimensional integration. BFE minimization provides an approximate approach to marginal density computation for the case when the joint density admits a factorizable structure of the form

$p(\mathbf{u}) = \frac{1}{Z} \prod_{\alpha} \psi_\alpha(\mathbf{u}_\alpha),$ (8)

where, for each factor index $\alpha$, $\mathbf{u}_\alpha$ is a sub-vector of $\mathbf{u}$ created from the indices in the subset $\alpha \subseteq \{1,\dots,d\}$ and $\psi_\alpha$ is a potential function on that sub-vector. In this case, BFE minimization aims to compute the vectors of densities $b = \big(\{b_j\}_j, \{b_\alpha\}_\alpha\big),$

where $b_j(u_j)$ represents an estimate of the marginal density $p(u_j)$ and where $b_\alpha(\mathbf{u}_\alpha)$ represents an estimate of the joint density of the sub-vector $\mathbf{u}_\alpha$. These density estimates, often called “beliefs,” are computed using an optimization of the form

$\widehat{b} = \operatorname*{arg\,min}_{b \in \mathcal{B}} J_{\rm BFE}(b),$ (9)

where $J_{\rm BFE}(b)$ is the BFE, given by

$J_{\rm BFE}(b) := \sum_{\alpha} D\big(b_\alpha \,\big\|\, \psi_\alpha\big) + \sum_{j} (n_j - 1)\, H(b_j),$ (10)

where $D(\cdot\|\cdot)$ is the KL divergence,

$D(b \,\|\, \psi) := \int b(\mathbf{u}) \ln \frac{b(\mathbf{u})}{\psi(\mathbf{u})} \, d\mathbf{u},$ (11)

where $H(\cdot)$ is the entropy or differential entropy, and where (for each $j$) $n_j$ is the number of factors $\alpha$ such that $j \in \alpha$. The BFE minimization (9) is performed over the set $\mathcal{B}$ of all $b$ whose components satisfy a particular “matching” condition: for each $j$ and each factor $\alpha$ containing $j$, the marginal density of $u_j$ under the belief $b_\alpha$ must agree with the belief $b_j$. That is, the set $\mathcal{B}$ contains all $b$ such that

$b_j(u_j) = \int b_\alpha(\mathbf{u}_\alpha) \, d\mathbf{u}_{\alpha \setminus j} \quad \text{for all } \alpha \ni j,$ (12)

where the integration is over the components of the sub-vector $\mathbf{u}_\alpha$ other than $u_j$, which is held constant. Note that $\mathcal{B}$ imposes a set of linear constraints on the belief vectors $\{b_j\}$ and $\{b_\alpha\}$.

The BFE minimization exactly recovers the true marginals in certain cases (e.g., when the factor graph has no cycles) and provides good estimates in many other scenarios as well; see [42] for a complete discussion. In addition, due to its separable structure, the BFE can typically be minimized “locally,” by solving a set of minimizations over the densities $b_j$ and $b_\alpha$. When the cardinalities of the subsets $\alpha$ are small, these local minimizations may involve much less computation than directly calculating the marginals of the full joint density $p(\mathbf{u})$. In fact, the classic result of [31] is that loopy belief propagation can be interpreted precisely as one type of iterative local minimization of the BFE.
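As a small self-contained check of these definitions (with toy potentials not taken from the paper), the following Python sketch evaluates the BFE of a tree-structured factor graph of three binary variables. Because the graph is a tree, plugging the exact marginals into (10) attains the minimum value, which equals $-\ln Z$.

```python
import numpy as np
from itertools import product

# Toy factor graph: binary chain u1 - u2 - u3 with two pairwise potentials.
rng = np.random.default_rng(1)
psi12 = rng.uniform(0.5, 2.0, size=(2, 2))
psi23 = rng.uniform(0.5, 2.0, size=(2, 2))

# Exact joint p(u) = psi12(u1,u2) * psi23(u2,u3) / Z, by brute force.
joint = np.zeros((2, 2, 2))
for u1, u2, u3 in product(range(2), repeat=3):
    joint[u1, u2, u3] = psi12[u1, u2] * psi23[u2, u3]
Z = joint.sum()
joint /= Z

# Beliefs taken as the exact marginals (valid since the graph is a tree).
b12 = joint.sum(axis=2)          # belief on (u1, u2)
b23 = joint.sum(axis=0)          # belief on (u2, u3)
b2 = joint.sum(axis=(0, 2))      # belief on u2, the only variable with n_j = 2

def kl(b, psi):
    return np.sum(b * np.log(b / psi))

def entropy(b):
    return -np.sum(b * np.log(b))

# BFE = sum of KL terms plus (n_j - 1) * H(b_j) terms; only u2 contributes.
bfe = kl(b12, psi12) + kl(b23, psi23) + entropy(b2)
print(bfe, -np.log(Z))           # equal on a tree: the BFE minimum is -ln Z
```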

For the GLM in Section I, the separability assumption (4) allows us to write the density (2) in the factorized form (8) using the potentials

$\psi_{x_j}(x_j) := \exp\big(-f_{x_j}(x_j)\big)$ (13a)
$\psi_{z_i}(\mathbf{x}) := \exp\big(-f_{z_i}(\mathbf{a}_i^{\sf T}\mathbf{x})\big),$ (13b)

where $\mathbf{a}_i^{\sf T}$ is the $i$-th row of $\mathbf{A}$. Note that, if $\mathbf{A}$ is a non-sparse matrix, then each $\psi_{z_i}$ depends on all components of the vector $\mathbf{x}$. In this case, the application of traditional loopy BP—as described for example in [43]—does not generally yield a significant computational improvement.

The GAMP algorithm from [22] can be seen as an approximate BFE minimization method for GLMs with possibly dense transforms $\mathbf{A}$. Specifically, it was shown in [29] that the stationary points of GAMP coincide with the local minima of the constrained optimization

(14b)

where $b_x$ and $b_z$ are product densities, i.e.,

$b_x(\mathbf{x}) = \prod_{j=1}^{n} b_{x_j}(x_j), \qquad b_z(\mathbf{z}) = \prod_{i=1}^{m} b_{z_i}(z_i),$ (15)

and the objective function is given by

(16)
(17)
(18)
(19)
(20)

Above, and in the sequel, we use $\mathbb{E}[\mathbf{x} \mid b_x]$ to denote the expectation of $\mathbf{x}$ under the density $b_x$, and we use $\operatorname{var}(\mathbf{x} \mid b_x)$ to denote the vector whose $j$th entry is the variance of $x_j$ under $b_x$. Note that $\operatorname{var}(\mathbf{x} \mid b_x)$ is not a full covariance matrix. Also, a scale factor is included to make the corresponding exponential a valid density. Although it is not essential for this paper, we note that the final term in (16) is an upper bound on the differential entropy of $b_z$ that is tight when $\mathbf{z}$ has independent Gaussian entries with the corresponding variances. It was then shown in [30] that the objective function in (16) can be interpreted as an approximation of the BFE for the GLM from Section I in a certain large-system limit, where the dimensions grow without bound and $\mathbf{A}$ has i.i.d. entries. We thus call the approximate BFE in (16) the large-system limit Bethe Free Energy or LSL-BFE.

Similar to the case of loopy BP, it has been shown in [29, 30] that the stationary points of (14) are precisely the fixed points of sum-product GAMP. Thus, GAMP can be interpreted as an iterative procedure to find local minima of the LSL-BFE, much in the same way that loopy BP can be interpreted as an iterative way to find local minima of the BFE. The trouble with GAMP, however, is that it does not always converge (see, e.g., the negative results in [25, 26, 28]). The situation is similar to the case of loopy BP. Although several modifications of GAMP have been proposed with the goal of improving convergence, such as damping [25], sequential updating [27], and adaptive damping [28], a globally convergent GAMP modification remains elusive.

IV Minimization via Iterative Linearization

Our approach to finding a convergent algorithm for minimizing the constrained LSL-BFE employs a generalization of the convex-concave procedure (CCCP) of [32] that we will refer to as Minimization via Iterative Linearization.

IV-A The Convex-Concave Procedure

We first briefly review the CCCP. Observe that, in the BFE (10), the KL-divergence terms are convex in the beliefs, while the entropy terms are concave in the beliefs. Thus, the BFE (10) can be written as a sum of two terms, $J_{\rm BFE}(b) = J_{\rm vex}(b) + J_{\rm cave}(b),$

where $J_{\rm vex}$ is convex and $J_{\rm cave}$ is concave. The CCCP finds a sequence of estimates $b^t$ of a BFE minimizer by iteratively linearizing the concave part of this function, i.e.,

$b^{t+1} = \operatorname*{arg\,min}_{b \in \mathcal{B}} \; J_{\rm vex}(b) + \big\langle g^t, b \big\rangle$ (21a)
$g^{t+1} = \nabla J_{\rm cave}(b^{t+1}),$ (21b)

where $\nabla J_{\rm cave}(b^{t+1})$ denotes the gradient of $J_{\rm cave}$ at $b^{t+1}$. The resulting procedure is often called a “double-loop” algorithm, since each iteration involves a minimization (21a) that is itself usually performed by an iterative procedure. Because $J_{\rm vex}$ is convex and the constraint set $\mathcal{B}$ is defined by linear constraints, the minimization problem (21a) is convex. Thus, the CCCP converts the non-convex BFE minimization into a sequence of convex minimizations. In fact, it can be shown that the CCCP will monotonically decrease the BFE for arbitrary convex $J_{\rm vex}$ and concave $J_{\rm cave}$ [32].
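As a toy illustration of the CCCP mechanics (the scalar objective below is chosen purely for illustration and is unrelated to the BFE), consider minimizing $J(x) = x^4 - 2x^2$, with convex part $x^4$ and concave part $-2x^2$; linearizing the concave part at each iteration gives a closed-form inner minimization.

```python
import numpy as np

def cccp_toy(x0=0.25, n_iter=20):
    """CCCP for J(x) = x^4 - 2x^2 = J_vex(x) + J_cave(x),
    with J_vex(x) = x^4 (convex) and J_cave(x) = -2x^2 (concave)."""
    x = x0
    for _ in range(n_iter):
        g = -4.0 * x                    # gradient of the concave part at x
        # Minimize the convex surrogate x^4 + g*x:  4x^3 + g = 0
        x = np.cbrt(-g / 4.0)
        print(x, x**4 - 2 * x**2)       # the objective decreases monotonically
    return x

cccp_toy()                              # converges to x = 1, a minimizer of J
```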

IV-B Minimization via Iterative Linearization

For the LSL-BFE, it is not convenient to decompose the objective function into a convex term plus a concave term. To handle problems like LSL-BFE minimization, we consider optimization problems of the form

$\min_{v \in \mathcal{V}} J(v), \qquad J(v) := J_1(v) + J_3\big(J_2(v)\big),$ (22)

where now $v$ is a vector in a Hilbert space $\mathcal{H}$, $\mathcal{V}$ is a closed affine subspace of $\mathcal{H}$, $J_1$ is a convex functional, $J_2$ is a mapping from $\mathcal{H}$ to $\mathbb{R}^d$ for some $d$, and $J_3$ is an arbitrary functional on $\mathbb{R}^d$. Below, we use $\mathbf{u} = J_2(v)$ to denote the input to $J_3$. Note that the functionals $J_2$ and $J_3$ may be neither concave nor convex.

To solve (22), we propose the iterative procedure shown in Algorithm 1, which is reminiscent of the CCCP. At each iteration $t$, an estimate $v^{t+1}$ of the minimizer is computed by minimizing the functional

$\widehat{J}(v, g^t) := J_1(v) + \big\langle g^t, J_2(v) \big\rangle,$ (23)

where $g^t$ is a “damped” version of the gradient $\nabla J_3(J_2(v^t))$. In particular, when the damping parameter $\theta_t$ is set to unity, the linearization vector is exactly equal to the gradient at the most recent estimate, i.e., $g^t = \nabla J_3(J_2(v^t))$, similar to CCCP. However, in Algorithm 1, we have the option of setting $\theta_t < 1$, which has the effect of slowing the update on $g^t$. We will see that, by setting $\theta_t < 1$, we can guarantee convergence when $J_2$ and/or $J_3$ is non-concave.

0:  Optimization problem (22).
1:  
2:   Select initial linearization
3:  repeat
4:      {Minimize the linearized function}
5:     
6:     
7:      {Update the linearization}
8:     Select a damping parameter
9:     
10:     
11:  until Terminated
Algorithm 1 Minimization via Iterative Linearization
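The Python sketch below mirrors the generic structure of Algorithm 1 as described above; the callable names (J1, J2, grad_J3), the constant damping factor, the omission of the affine constraint $v \in \mathcal{V}$, and the use of a generic off-the-shelf solver for the inner minimization are all simplifying assumptions rather than the paper's exact specification.

```python
import numpy as np
from scipy.optimize import minimize

def iterative_linearization(J1, J2, grad_J3, v0, theta=0.5, n_outer=50):
    """Sketch of Algorithm 1: minimize J1(v) + J3(J2(v)) by repeatedly
    minimizing the linearized objective J1(v) + <g, J2(v)> and applying
    a damped update to the linearization vector g."""
    v = np.asarray(v0, dtype=float)
    g = grad_J3(J2(v))                        # initial linearization
    for _ in range(n_outer):
        # Inner step: minimize the linearized objective (a generic solver
        # stands in for the inner loop; the constraint v in V is omitted).
        obj = lambda w: J1(w) + g @ J2(w)
        v = minimize(obj, v).x
        # Outer step: damped update of the linearization vector.
        g = (1 - theta) * g + theta * grad_J3(J2(v))
    return v
```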

IV-C Convergence of Minimization via Iterative Linearization

Observe that when $J_1$ is convex, $J_3$ is concave, $\mathcal{H}$ is finite-dimensional (as when the underlying variables are discrete), $J_2$ is the identity map (i.e., $J_2(v) = v$), and there is no damping (i.e., $\theta_t = 1$), Algorithm 1 reduces to the CCCP. However, we are interested in a possibly non-concave composition $J_3 \circ J_2$, in which case we cannot directly apply the results of [32]. We instead consider the following alternate conditions.

Assumption 1

Consider the optimization problem (22), and suppose that the functions $J_1$, $J_2$, and $J_3$ have components that are twice differentiable with uniformly bounded second derivatives. Also, assume that there exists a convex set $\mathcal{G}$ such that, for all $g \in \mathcal{G}$:

  (a) The minimization of the linearized function,

    $\widehat{v}(g) := \operatorname*{arg\,min}_{v \in \mathcal{V}} \widehat{J}(v, g),$ (24)

    exists and is unique.

  (b) At each minimum, the linearized objective is uniformly strictly convex on $\mathcal{V}$, in that there exist constants $c_1, c_2$ with $0 < c_1 \leq c_2 < \infty$ such that

    $c_1 \|w\|^2 \;\leq\; w^{\sf T} \mathbf{H}(g)\, w \;\leq\; c_2 \|w\|^2 \quad \text{for all } w \text{ in the linear subspace parallel to } \mathcal{V},$ (25)

    where $\mathbf{H}(g)$ is the Hessian of $\widehat{J}(\cdot, g)$ with respect to $v$ at $\widehat{v}(g)$, i.e.,

    $\mathbf{H}(g) := \nabla_v^2\, \widehat{J}(v, g)\big|_{v = \widehat{v}(g)},$ (26)

    and where the constants $c_1$ and $c_2$ do not depend on $g$.

  (c) The gradient obeys $\nabla J_3\big(J_2(\widehat{v}(g))\big) \in \mathcal{G}$ whenever $g \in \mathcal{G}$.

Theorem 1

Consider Algorithm 1 under Assumption 1. There exists a constant $\overline{\theta} > 0$ such that if the damping parameters are selected with $\theta_t \in (0, \overline{\theta}]$ for all $t$, and if the initialization obeys $g^0 \in \mathcal{G}$, then $g^t \in \mathcal{G}$ for all $t$ and the objective monotonically decreases, i.e.,

$J(v^{t+1}) \leq J(v^t) \quad \text{for all } t.$ (27)
Proof:

See Appendix A.

The simplest case where Assumption 1 holds is the setting where $J_1$ is strictly convex and smooth, $J_2$ is linear, and $J_3$ is smooth (but neither necessarily convex nor concave). Under these assumptions, the linearized objective $\widehat{J}(\cdot, g)$ would be strictly convex for all $g$, thereby satisfying Assumptions (a) and (b). The assumption would also be valid for strictly convex $J_1$ and convex $J_2$, provided we restrict $\mathcal{G}$ to positive vectors $g$. In this case, to satisfy assumption (c), we would require that $\nabla J_3 \geq 0$, i.e., $J_3$ is increasing in each of its components. Interestingly, in the setting we will use below, $J_2$ will be convex but $J_3$ will be concave. Nevertheless, we will show that the assumption will be satisfied.

IV-D Application to LSL-BFE Minimization

To apply Algorithm 1 to the LSL-BFE minimization (14), we first take $v$ to be the pair of separable densities $(b_x, b_z)$ satisfying the moment-matching constraint

$\mathbb{E}[\mathbf{z} \mid b_z] = \mathbf{A}\, \mathbb{E}[\mathbf{x} \mid b_x].$ (28)

Then, if we define the functions

(29a)
(29b)
(29c)

we see that the LSL-BFE from (16) can be cast into the form in (22). Observe that, while $J_1$ is convex, the composition $J_3(J_2(\cdot))$ is, in general, neither convex nor concave. Thus, while the CCCP does not apply, we can apply the iterative linearization method from Algorithm 1.

We will partition the linearization vector conformally with function in (29b) as

(30)

where we use “” to denote componentwise division of two vectors and “” to denote vertical concatenation. The notation in (30) will help to clarify the connections to the original GAMP algorithm. Using the above notation, the linearized objective (23) can be written as

(31)

Finally, we compute the gradient of the function from (29c). Similar to , we will partition the gradient into two terms,

(32)

From (17), the derivative of with respect to is

(33)

Similarly, using the chain rule and (33), we find

(34)

We can then write (33) and (34) in vector form as

(35)

Substituting the above computations into the iterative linearization algorithm, Algorithm 1, we obtain Algorithm 2. We refer to this as the outer loop, since each iteration involves a minimization of the linearized LSL-BFE in line 5. We discuss this latter minimization next and show that it can itself be performed iteratively using a set of iterations that we will refer to as the inner loop.

We will also show shortly that, under certain convexity conditions, the conditions of Assumption 1 are satisfied, so that Algorithm 2 will converge to a local minimum of the LSL-BFE.

0:  LSL-BFE objective function (16) with a matrix .
1:  
2:   Select initial linearization , .
3:  repeat
4:      {Minimize the linearized LSL-BFE}
5:     
6:     
7:      {Compute the gradient terms}
8:      ,
9:     
10:     
11:     
12:     
13:      {Update the linearization}
14:     Select a damping parameter
15:     
16:     
17:  until Terminated
Algorithm 2 Minimizing LSL-BFE via Iterative Linearization

IV-E Alternative Methods

While the method proposed in this paper is based on the CCCP of [32], there are other methods for direct minimization of the BFE that may apply to the LSL-BFE as well. For example, for problems with binary variables and pairwise penalty functions, [44, 45] propose a clever re-parametrization that converts the constrained BFE minimization into an unconstrained optimization on which gradient descent can be used. Unfortunately, it is not obvious whether the LSL-BFE here admits such a re-parametrization, since its penalty functions are not pairwise and its variables are not binary.

V Inner-Loop Minimization and ADMM-GAMP

V-A ADMM Principle

The outer loop algorithm, Algorithm 2, requires that in each iteration we solve a constrained optimization of the form

(36)

We will show that this optimization can be performed by the Alternating Direction Method of Multipliers (ADMM) [9]. ADMM is a general approach to constrained optimizations of the form

(37)

where is an objective function and is some constraint matrix. Corresponding to this optimization, let us define the augmented Lagrangian

(38)

where is a dual vector, is a vector of positive weights and . The ADMM procedure then produces a sequence of estimates for the optimization (37) through the iterations

(39a)
(39b)

where creates a diagonal matrix from the vector . The algorithm thus alternately updates the primal variables and dual variables . The vector can be interpreted as a step-size on the primal problem and an inverse step-size on the dual problem.

The key benefit of the ADMM method is that, for any positive vector of step-sizes, the procedure is guaranteed to converge to a global optimum for convex objective functions under mild conditions on the constraint matrix.
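To illustrate the ADMM template in a familiar special case (not the splitting used for the LSL-BFE below), the following sketch solves the lasso problem $\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|^2 + \lambda\|\mathbf{x}\|_1$ by splitting $\mathbf{x}$ from a copy $\mathbf{z}$ and alternating two primal updates with a dual update; a single scalar step size $\rho$ stands in for the vector of weights.

```python
import numpy as np

def admm_lasso(A, y, lam=0.1, rho=1.0, n_iter=200):
    """ADMM for  min_x 0.5*||y - A x||^2 + lam*||z||_1  s.t.  x - z = 0."""
    m, n = A.shape
    x, z, s = np.zeros(n), np.zeros(n), np.zeros(n)   # s is the scaled dual variable
    Aty = A.T @ y
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))      # cached for the x-update
    for _ in range(n_iter):
        x = M @ (Aty + rho * (z - s))                 # primal update in x
        z = np.sign(x + s) * np.maximum(np.abs(x + s) - lam / rho, 0.0)  # prox of the l1 term
        s = s + x - z                                 # dual (multiplier) update
    return z

# Example usage on random data
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = admm_lasso(A, y, lam=0.5)
```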

V-B Application of ADMM to LSL-BFE Optimization

The ADMM procedure can be applied to the linearized LSL-BFE optimization (36) as follows. First, we replace the moment-matching constraint with two equivalent constraints by introducing an auxiliary variable. Variable splittings of this form are commonly used in the context of monotropic programming [46]. With this splitting, the augmented Lagrangian for the LSL-BFE (14) becomes

(40)

where and represent the dual variables. Note that the vectors and that appear in the linearized LSL-BFE have been used for the augmentation terms (i.e., the last two terms) in (40). This choice will be critical. From (39), the resulting ADMM recursion becomes

(41a)
(41b)
(41c)
(41d)

To compute the minimization in (41a), we first note that the second and fourth terms in (40) can be rewritten as

(42)

where in (a) we used ; in (b) we used “” to denote componentwise multiplication between vectors; and “const” includes terms that are constant with respect to and . A similar development yields