Bayesian Lasso Posterior Sampling via Parallelized Measure Transport

Abstract

It is well known that the Lasso can be interpreted as a Bayesian posterior mode estimate with a Laplacian prior. Obtaining samples from the full posterior distribution, the Bayesian Lasso, confers major performance advantages compared to having only the Lasso point estimate. Traditionally, the Bayesian Lasso is implemented via Gibbs sampling methods, which suffer from lack of scalability, unknown convergence rates, and generation of samples that are necessarily correlated. We provide a measure transport approach to generate i.i.d. samples from the posterior by constructing a transport map that transforms a sample from the Laplacian prior into a sample from the posterior. We show how the construction of this transport map can be parallelized into modules that iteratively solve Lasso problems and perform closed-form linear algebra updates. With this posterior sampling method, we perform maximum likelihood estimation of the Lasso regularization parameter via the EM algorithm. We provide comparisons to traditional Gibbs samplers using the diabetes dataset of Efron et al. Lastly, we give an example implementation on a computing system that leverages parallelization, a graphics processing unit, whose execution time has much less dependence on dimension as compared to a standard implementation.


Keywords: Bayesian Lasso, optimal transport theory, convex optimization, distributed optimization, Monte Carlo sampling

1 Introduction

A quintessential formulation for sparse approximation is Tibshirani’s Lasso, which simultaneously induces shrinkage and sparsity in the estimation of regression coefficients [32]. The formulation of the standard Lasso is as follows:

x^* = \arg\min_{x \in \mathbb{R}^d} \|y - \Phi x\|_2^2 + \lambda \|x\|_1 \qquad (1.1)

where $y \in \mathbb{R}^n$ is a vector of responses, $\Phi \in \mathbb{R}^{n \times d}$ is a matrix of standardized regressors, and $x \in \mathbb{R}^d$ is the vector of regressor coefficients to be estimated.
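As a concrete illustration of (1.1), the following minimal Python sketch solves the Lasso by iterative soft-thresholding (one of the standard solver families discussed below). The function names and the step-size choice are ours, for illustration only; they are not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(y, Phi, lam, n_iter=500):
    """Solve min_x ||y - Phi x||_2^2 + lam * ||x||_1 by iterative soft-thresholding."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ x - y)  # gradient of ||y - Phi x||_2^2
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

On a well-conditioned problem this recovers a sparse solution close to the truth; accelerated variants (FISTA) improve the convergence rate without changing the per-iteration cost.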

It is known that the Lasso can be interpreted as a Bayesian posterior mode estimate with a Laplacian prior [32]. Imposing a Laplacian prior is equivalent to $\ell_1$-regularization, which has desirable properties, including robustness and logarithmic sample complexity [24]. Various algorithms for solving (1.1) are typically employed, including Least Angle Regression (LARS), iterative soft-thresholding and its successors, and iteratively reweighted least squares (IRLS) [6], [1], [9], [10]. These methods are scalable, yet they only provide a point (maximum a posteriori) estimate. With i.i.d. samples $Z_1, \dots, Z_N$ from the posterior distribution, the Bayes optimal decision can be approximately computed for any set of possible decisions $\mathcal{D}$ and any loss function $l$ by minimizing the empirical conditional expectation:

d^*(y) = \arg\min_{d \in \mathcal{D}} \mathbb{E}[l(X, d) \mid Y = y] \simeq \arg\min_{d \in \mathcal{D}} \frac{1}{N} \sum_{i=1}^{N} l(Z_i, d) \qquad (1.2)
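The empirical minimization in (1.2) is straightforward once posterior samples are available. The sketch below (our own illustration; the helper name and the toy candidate set are hypothetical) scores each candidate decision by its empirical posterior expected loss and returns the minimizer.

```python
import numpy as np

def bayes_decision(samples, decisions, loss):
    """Approximate the Bayes-optimal decision as in (1.2): pick the candidate
    d minimizing the empirical posterior expected loss over the samples."""
    risks = [np.mean([loss(z, d) for z in samples]) for d in decisions]
    return decisions[int(np.argmin(risks))]
```

For squared-error loss this selects (approximately) the posterior mean; for absolute-error loss, the posterior median — the estimator used in Section 5.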

Previous approaches have been developed [25, 13] to sample from the posterior distribution corresponding to the Lasso problem based on Markov Chain Monte Carlo (MCMC) methods. However, these methods necessarily introduce correlations between the generated samples, are sequential in nature, and often do not scale well with dataset size or model complexity [15], [23], [19].

We here consider a measure transport approach, where i.i.d. samples from the posterior distribution associated with the Lasso are produced by first generating i.i.d. samples $X_i$ from the Laplacian prior and constructing a map $S$ that transforms a sample from the prior into a sample from the posterior. From here, the posterior samples are constructed as $Z_i = S(X_i)$. We exploit previous results that cast Bayesian posterior sampling from a measure transport perspective [7], [21], that turn the construction of $S$ into a relative entropy minimization problem [16], and that exploit the Bayesian Lasso posterior's log-concavity to turn construction of $S$ with polynomial chaos into a convex optimization problem [18] that is amenable to parallelization [22]. In this Lasso setting, we further show that constructing the optimal map to transform prior samples to posterior samples can be performed with off-the-shelf Lasso solvers and closed-form linear algebra updates.

1.1 Relevant Work

Park and Casella proposed a Gibbs sampler for a variation of the original Bayesian Lasso problem, where the latent variance scale variable in the Gauss-scale mixture representation of the Laplacian distribution has a prior distribution [25]. This structure leads to a tractable three-step Gibbs sampler that can be used to draw approximate samples from the posterior and construct credible intervals. Hans [13] obviated the need for hyperparameters and used a direct characterization of the posterior distribution to develop a Gibbs sampler to generate posterior samples. As an MCMC algorithm, the Gibbs sampler generates a Markov chain of samples, each of which is correlated with its previous sample. The correlation between these samples can decay slowly and lead to burn-in periods where samples have to be discarded [28]. Although theoretical upper bounds on the convergence of Gibbs samplers have been proved [27], these guarantees are weaker in the case of the Bayesian Lasso. [26] developed a two-step Gibbs sampler for the Bayesian Lasso with improved convergence behavior. However, a way to derive i.i.d. samples from the Bayesian Lasso posterior without burn-in periods has remained elusive.

El Moselhy and Marzouk first proposed an alternative method for directly sampling from a target distribution based on a measure transport approach [7], [21] using a polynomial chaos expansion [11]. Bayesian inference can be cast as a special case of this, where the original distribution is the prior (which in many cases is easy to sample from) and the target distribution is the posterior. Recently, Kim et al. further investigated the Bayesian transport sampling problem and showed that when the prior and likelihood satisfy a log-concavity property, the relative entropy minimization approach to find a transport map with a polynomial chaos representation is a convex optimization problem [16]. Mesa et al. introduced an Alternating Direction Method of Multipliers (ADMM) reformulation and showed that this minimization can be performed by iteratively solving a series of convex optimization problems in parallel [22]. Wang et al. used a measure transport approach to extend the randomize-then-optimize MCMC approach by transforming a non-Gaussian prior distribution into a Gaussian one [34].

1.2 Our Contribution

We present a technique to sample from the Bayesian Lasso posterior based on a measure transport approach [7], [16]. The formulation is conceptually different from the Gibbs sampler methodology; our initial objective is not to compute samples from the posterior, but rather to compute a transport map based upon training samples. Once the transport map is computed, one can generate an arbitrary number of posterior samples by drawing samples from the prior and passing them through the transport map. We show that construction of the transport map can be performed in a parallelized fashion based on an Alternating Direction Method of Multipliers (ADMM) formulation, as in [22]. Unique to the Lasso formulation, our solution only requires off-the-shelf Lasso solvers and linear algebra updates. This provides opportunities to leverage computing architectures with parallelization and attain significant completion time improvements.

We exploit the ability to draw i.i.d. samples from the posterior to develop an Expectation Maximization (EM) algorithm for maximum likelihood estimation of $\lambda$ in (1.1). Additionally, we compare our results to a Bayesian Lasso Gibbs sampler from [25] and show that we achieve similar results when analyzing the diabetes dataset presented in [6]. Finally, we show that our framework is amenable to implementation in architectures that leverage parallelization and provide performance of an example implementation with a graphics processing unit (GPU).

The rest of the paper is organized as follows. In Section 2, we provide some preliminaries and definitions. In Section 3, we introduce a relative entropy minimization formulation for Bayesian Lasso posterior sampling via measure transport. We consider transport maps described in terms of polynomial chaos and show how, under a log-concavity assumption (which applies for the Lasso problem), the relative entropy minimization can be performed with ADMM methods from convex optimization. We then show how, unique to the Bayesian Lasso, this formulation can be reduced to iteratively solving a collection of Lasso problems in parallel and performing closed-form linear algebra updates. In Section 4, we exploit the ability to draw i.i.d. samples from the posterior to develop an Expectation Maximization (EM) algorithm for maximum likelihood estimation of $\lambda$ in (1.1). In Section 5, we compare our results to a Bayesian Lasso Gibbs sampler from [25] and achieve similar results when analyzing the diabetes dataset from Efron et al. in [6]. In Section 6, we show that our framework is amenable to implementation in architectures that leverage parallelization and provide performance of an example implementation with a graphics processing unit (GPU). In Section 7, we conclude and discuss future potential directions.

2 Definitions

2.1 Bayesian Lasso Statistical Model

We consider the following generative model of how a latent and sparse random vector $X \in \mathbb{R}^d$ relates to a measurement $Y \in \mathbb{R}^n$:

Y = \Phi X + \epsilon \qquad (2.1)

and the measurement noise satisfies $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.

We assume an i.i.d. Laplacian statistical model on $X$ with parameter $\tau$. Therefore, the following Bayesian Lasso regression model is specified as:

p(y \mid x; \sigma^2) = \mathcal{N}(y; \Phi x, \sigma^2 I_n) \qquad (2.2)

p(x; \tau) = \prod_{i=1}^{d} \frac{\tau}{2} e^{-\tau |x_i|} \qquad (2.3)

where $\mathcal{N}(y; \mu, \Sigma)$ represents the density function, evaluated at $y$, of a multivariate normal random variable with expectation $\mu$ and covariance matrix $\Sigma$. We note that the negative log posterior density satisfies

-\log p(x \mid y; \sigma^2, \tau) \propto \frac{1}{2\sigma^2} \|y - \Phi x\|_2^2 + \tau \|x\|_1 \qquad (2.4)

As such, the standard Lasso problem for a given $\lambda = 2\sigma^2 \tau$,

x^* = \arg\min_{x \in \mathbb{R}^d} \|y - \Phi x\|_2^2 + \lambda \|x\|_1 \qquad (2.5)

is a maximum a posteriori estimation problem for the Laplacian prior in (2.3).

In the model above and throughout our methodology, we assume that the parameter $\sigma^2$ is fixed and known, deviating from the results of the Bayesian Lasso Gibbs sampler first presented in [25], for which $\sigma^2$ is imparted with a prior. As such, we are considering the posterior distribution associated with the original Bayesian interpretation of the Lasso, for which the solution to (1.1) is the MAP estimate. We will discuss the assumptions associated with [25] in Section 5.
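The generative model (2.1)-(2.3) is easy to simulate. A minimal sketch (the function name is ours; note NumPy parameterizes the Laplacian by scale $1/\tau$):

```python
import numpy as np

def sample_bayesian_lasso_model(Phi, tau, sigma2, rng):
    """Draw (x, y) from the model (2.1)-(2.3): x has i.i.d. Laplacian(tau)
    entries and y = Phi x + Gaussian noise with variance sigma2."""
    d = Phi.shape[1]
    x = rng.laplace(loc=0.0, scale=1.0 / tau, size=d)  # Laplacian prior, scale 1/tau
    eps = np.sqrt(sigma2) * rng.standard_normal(Phi.shape[0])
    return x, Phi @ x + eps
```

A quick sanity check is that $\mathbb{E}|x_i| = 1/\tau$ under the prior, which also underlies the EM update derived in Section 4.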

2.2 Transport Maps

Define $\mathcal{P}(\mathbb{R}^d)$ as the space of probability measures over $\mathbb{R}^d$ endowed with the Borel sigma-field.

Definition 2.1 (Push-forward).

Given $P, Q \in \mathcal{P}(\mathbb{R}^d)$ and a map $S: \mathbb{R}^d \to \mathbb{R}^d$, we say that the map $S$ pushes forward $P$ to $Q$ (denoted as $S_\# P = Q$) if a random variable $X$ with distribution $P$ results in $S(X)$ having distribution $Q$.

This gives insight into the notion of “transport maps” because they transform random variables from one distribution to another.

We say that $S$ is a diffeomorphism on $\mathbb{R}^d$ if $S$ is invertible and both $S$ and $S^{-1}$ are differentiable. We say that a diffeomorphism $S$ is monotonic on $\mathbb{R}^d$ if its Jacobian $J_S(x)$ is positive definite for all $x \in \mathbb{R}^d$, i.e.

J_S(x) \succ 0 \quad \forall x \in \mathbb{R}^d \qquad (2.6)

Define the set of all monotonic diffeomorphisms as $\mathcal{M}_D^+$.

With this, we have the following lemma from standard probability:

Lemma 2.2.

Consider any $P, Q \in \mathcal{P}(\mathbb{R}^d)$ that have densities $p$ and $q$, respectively, with respect to the Lebesgue measure. Then $S_\# P = Q$ if and only if

p(x) = q(S(x)) \det(J_S(x)) \quad \forall x \in \mathbb{R}^d \qquad (2.7)

where $J_S(x)$ is the Jacobian matrix of the map $S$ at $x$.

We note from a classical result in optimal transport theory [4, 33] that if $P$ and $Q$ have densities $p$ and $q$ with respect to the Lebesgue measure, there will always exist a transport map that is a monotonic diffeomorphism (i.e. $S \in \mathcal{M}_D^+$) that pushes $P$ to $Q$.
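Lemma 2.2 is easy to verify numerically in one dimension. The sketch below (our illustration, not from the paper) checks the Jacobian equation (2.7) for the monotone affine map that pushes $\mathcal{N}(0,1)$ to $\mathcal{N}(\mu, \sigma^2)$:

```python
import numpy as np

def std_normal_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

# 1-D illustration of Lemma 2.2: S(x) = mu + sigma * x pushes N(0,1) forward
# to N(mu, sigma^2), and p(x) = q(S(x)) * det(J_S(x)) holds pointwise,
# with J_S(x) = sigma here.
mu, sigma = 1.5, 2.0
x = np.linspace(-4, 4, 101)
S = mu + sigma * x
q = std_normal_pdf((S - mu) / sigma) / sigma  # density of N(mu, sigma^2)
lhs = std_normal_pdf(x)                       # p(x)
rhs = q * sigma                               # q(S(x)) * J_S(x)
assert np.allclose(lhs, rhs)
```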

Note that in the case of the Bayesian Lasso, $P$ is the prior distribution on $X$, which has the Laplacian density given by (2.3), and $Q$ is the posterior distribution on $X$ given $Y = y$, which has a density described up to a proportionality constant by (2.4).

3 Bayesian Lasso via Measure Transport

In this section, we provide background on measure transport theory and show that for the Bayesian Lasso, we can efficiently find a transport map that transforms samples from the prior distribution in (2.3) to samples from the posterior. We utilize the ADMM framework introduced in [22] and develop a Bayesian Lasso transport map by iteratively solving a collection of Lasso problems (which themselves can be solved with existing efficient sparse approximation algorithms) in parallel and performing linear algebra updates.

As an alternative to the Gibbs sampling approaches described previously, we consider finding a transport map $S$ that pushes the prior $P$ to the posterior $Q$. Throughout, we will assume $p$ and $q$ are the densities associated with the prior and posterior, respectively. Given such a map, we can sample from the posterior distribution by transforming i.i.d. prior samples $X_i$ with $S$ into i.i.d. samples $S(X_i)$ from the posterior. Figure 3 shows the effect of a transport map on samples for the Bayesian Lasso in (2.3).
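Once a map has been fitted, the prior-to-posterior sampling pipeline is embarrassingly parallel. A minimal sketch, assuming (as in Section 3.1 below) the map takes the linear-in-basis form $S(x) = B\,A(x)$; the function and argument names are ours:

```python
import numpy as np

def posterior_samples(B, basis, n, tau, rng):
    """Generate n i.i.d. posterior samples by pushing Laplacian prior samples
    through a fitted map S(x) = B @ A(x). `basis` maps x -> A(x)."""
    d = B.shape[0]
    xs = rng.laplace(scale=1.0 / tau, size=(n, d))  # i.i.d. prior draws
    return np.stack([B @ basis(x) for x in xs])
```

With the identity map (identity basis and $B = I$) the output is simply the prior, which gives a quick correctness check.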

3.1 A Convex Optimization Formulation

Recent work [16, 18] has shown that for problems where the prior density and likelihood are log-concave in $x$, constructing a transport map that pushes $P$ to $Q$ can be performed with convex optimization.

Consider fixing $q$ as the true posterior density. Given an arbitrary $S \in \mathcal{M}_D^+$, there will be an induced distribution $\tilde{P}_S$ (with density $\tilde{p}_S$) for which $S$ pushes $\tilde{P}_S$ to $Q$. That is, from the Jacobian equation:

\tilde{p}_S(u) = q(S(u)) \det(J_S(u)) \quad \text{for all } u \in \mathbb{R}^d \qquad (3.1)

For an arbitrary $S$, $\tilde{P}_S$ need not be the same as $P$. From this perspective, we re-define our problem as finding the transport map that minimizes a distance between $P$ and the induced $\tilde{P}_S$. We use the Kullback-Leibler divergence and arrive at the following optimization problem:

S^* = \arg\min_{S \in \mathcal{M}_D^+} D(P \,\|\, \tilde{P}_S) \qquad (3.2)

By defining

g(x) \triangleq -\log f_{Y|X}(y \mid x) - \log f_X(x) \qquad (3.3)

where the densities refer to the likelihood and prior, (3.2) becomes

S^* = \arg\min_{S \in \mathcal{M}_D^+} \mathbb{E}_P\!\left[ g(S(X)) - \log\det(J_S(X)) \right] \qquad (3.4)

Moreover, when the posterior density is log-concave (equivalently, when $g$ is convex), this (infinite-dimensional) optimization problem is convex. Note also that the proportionality constant in the denominator of the posterior density from Bayes' rule is not required to perform this minimization, because it does not vary with $S$.

Parametrization of the Transport Map: In order to solve (3.4), we parametrize the problem to arrive at a finite-dimensional convex optimization problem. We approximate any $S \in \mathcal{M}_D^+$ as a linear combination of basis functions $\phi^{(j)}$ through a Polynomial Chaos Expansion (PCE) [35], [8], where the $\phi^{(j)}$ are the polynomials orthonormal with respect to the prior $P$:

S(x) = \sum_{j \in J} b_j \phi^{(j)}(x) \qquad (3.5)

\int_{x \in \mathbb{R}^d} \phi^{(i)}(x) \phi^{(j)}(x) p(x) \, dx = \delta_{i,j} \qquad (3.6)

with $\delta_{i,j}$ being $1$ if $i = j$ and $0$ otherwise. Now truncate the expansion to $K$ terms and define:

F = [b_1, \dots, b_K] \in \mathbb{R}^{d \times K} \qquad (3.7)

A(x) = [\phi^{(1)}(x), \dots, \phi^{(K)}(x)]^T \in \mathbb{R}^{K \times 1} \qquad (3.8)

S(x) = F A(x) \in \mathbb{R}^{d \times 1} \qquad (3.9)

J(x) = \left[ \frac{\partial \phi^{(i)}}{\partial x_j}(x) \right]_{i,j} \in \mathbb{R}^{K \times d} \qquad (3.10)

J_S(x) = F J(x) \in \mathbb{R}^{d \times d} \qquad (3.11)

We can then approximate the expectation from (3.4) using an empirical expectation based upon $N$ i.i.d. samples $x_1, \dots, x_N$ from the prior $P$. Within the context of the Bayesian Lasso, $P$ is the Laplacian (doubly exponential) distribution, which is easy to sample from. Letting $A_i \triangleq A(x_i)$ and $J_i \triangleq J(x_i)$, we arrive at the following finite-dimensional problem:

F^* = \arg\min_{F : F J_i \succ 0} \frac{1}{N} \sum_{i=1}^{N} g(F A_i) - \log\det(F J_i) \qquad (3.12)

Whenever the posterior is log-concave (equivalently, $g$ is convex), this is a finite-dimensional convex optimization problem. Moreover, as $N, K \to \infty$, from the PCE theory, the map $S(x) = F^* A(x)$ converges (in the relative entropy sense) to a transport map that pushes $P$ to $Q$ (see [17]).

3.2 Parallelized Convex Solver with ADMM

More recently, [22] demonstrated a scalable framework to solve (3.12) which only requires iterative linear algebra updates and solving, in parallel, a number of quadratically regularized point estimation problems. The distributed architecture involves an augmented Lagrangian and a consensus Alternating Direction Method of Multipliers (ADMM) formulation:

\min_{F, Z, p, B} \; \frac{1}{N} \sum_{i=1}^{N} \left[ g(p_i) - \log\det Z_i + \frac{\rho}{2} \|F_i - B\|_2^2 + \frac{\rho}{2} \|B A_i - p_i\|_2^2 + \frac{\rho}{2} \|B J_i - Z_i\|_2^2 \right]

\text{s.t.} \quad B A_i = p_i : \gamma_i \; (d \times 1), \qquad B J_i = Z_i : \beta_i \; (d \times d), \qquad F_i - B = 0 : \alpha_i \; (d \times K), \qquad Z_i \succ 0

where the dual variables are listed after the colons,

for any fixed $\rho > 0$.

A penalized Lagrangian is solved iteratively by first solving for $B$:

B^{k+1} = \frac{1}{N} \sum_{i=1}^{N} \left[ \rho \left( F_i^k + p_i^k A_i^T + Z_i^k J_i^T \right) + \gamma_i^k A_i^T + \beta_i^k J_i^T + \alpha_i^k \right] M \qquad (3.13)

M \triangleq \left[ \rho \left( I + \frac{1}{N} \sum_{i=1}^{N} A_i A_i^T + J_i J_i^T \right) \right]^{-1} \qquad (3.14)

and then solving, in parallel for each $i \in \{1, \dots, N\}$, the other variable updates:

F_i^{k+1} = -\frac{1}{\rho} \alpha_i^k + B^{k+1} \qquad (3.15a)

Z_i^{k+1} = Q \tilde{Z}_i Q^T \qquad (3.15b)

p_i^{k+1} = \arg\min_{p_i} \; g(p_i) + \frac{\rho}{2} \|B^{k+1} A_i - p_i\|_2^2 + \gamma_i^{kT} (p_i - B^{k+1} A_i) \qquad (3.15c)

\gamma_i^{k+1} = \gamma_i^k + \rho (p_i^{k+1} - B^{k+1} A_i) \qquad (3.15d)

\beta_i^{k+1} = \beta_i^k + \rho (Z_i^{k+1} - B^{k+1} J_i) \qquad (3.15e)

\alpha_i^{k+1} = \alpha_i^k + \rho (F_i^{k+1} - B^{k+1}) \qquad (3.15f)

ADMM guarantees convergence to the optimal solution [3]. To emphasize, each $i$th update in (3.15) can be solved in parallel. As (3.15b) is an eigenvalue-eigenvector decomposition (details can be found in [22]), it follows that all the updates involve linear algebra, with the exception of (3.15c), which is a quadratically regularized point estimation problem.
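The linear-algebra portion of the iteration is compact in code. The sketch below mirrors (3.13)-(3.14): $M$ is precomputed once, and the consensus update for $B$ averages per-sample terms, each computable in parallel. Function and variable names are ours; the dual-variable conventions follow the display above.

```python
import numpy as np

def build_M(A_list, J_list, rho):
    """Precompute the K x K matrix M of (3.14); it is fixed across iterations."""
    K, N = A_list[0].shape[0], len(A_list)
    G = sum(A @ A.T + J @ J.T for A, J in zip(A_list, J_list)) / N
    return np.linalg.inv(rho * (np.eye(K) + G))

def update_B(F, p, Z, gamma, beta, alpha, A_list, J_list, M, rho):
    """Consensus update (3.13) for B; a pure linear-algebra step.
    Shapes: A_i (K,1), J_i (K,d), F_i (d,K), p_i (d,1), Z_i (d,d)."""
    N = len(A_list)
    acc = sum(
        rho * (F[i] + p[i] @ A_list[i].T + Z[i] @ J_list[i].T)
        + gamma[i] @ A_list[i].T + beta[i] @ J_list[i].T + alpha[i]
        for i in range(N)
    ) / N
    return acc @ M
```

In a GPU or cluster deployment the per-sample summands of (3.13), like the per-sample updates (3.15), are the natural unit of parallel work.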

3.3 Efficiently Solving the Bayesian Lasso

We exploit the unique problem structure of Bayesian Lasso to simplify a scalable implementation.

Lemma 3.1.

The PCE basis for the Laplacian distribution is $\phi^{(j)}(x) = \phi_E^{(j)}(|x|)$, where the $\phi_E^{(j)}$ are the Laguerre polynomials.

Proof.
\int_{-\infty}^{\infty} \phi_E^{(i)}(|x|) \, \phi_E^{(j)}(|x|) \, p_L(x) \, dx = \int_{-\infty}^{\infty} \phi_E^{(i)}(|x|) \, \phi_E^{(j)}(|x|) \, \tfrac{1}{2} p_E(|x|) \, dx = 2 \int_{0}^{\infty} \phi_E^{(i)}(x) \, \phi_E^{(j)}(x) \, \tfrac{1}{2} p_E(x) \, dx = \delta_{i,j} \qquad (3.16)

where the first equality holds because the Laplacian density is related to the exponential density by $p_L(x) = \frac{1}{2} p_E(|x|)$, the second equality holds by symmetry of the function being integrated, and the third follows because the PCE for the exponential distribution is obtained with the Laguerre polynomials [35]. ∎
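Lemma 3.1 can be checked by Monte Carlo. The sketch below (our illustration; it uses NumPy's Laguerre module and the standard Laplacian with $\tau = 1$) verifies that the Gram matrix of $\phi_E^{(j)}(|x|)$ under the Laplacian is approximately the identity:

```python
import numpy as np
from numpy.polynomial import laguerre

# Monte Carlo check of Lemma 3.1: Laguerre polynomials evaluated at |x| are
# orthonormal with respect to the standard Laplacian density (tau = 1 here),
# since the Laguerre polynomials are orthonormal under the Exp(1) weight.
rng = np.random.default_rng(5)
x = rng.laplace(scale=1.0, size=1_000_000)

def L(n, t):
    """n-th Laguerre polynomial evaluated at t."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return laguerre.lagval(t, c)

# E[phi_i(|X|) phi_j(|X|)] over the Laplacian should approximate delta_ij.
G = np.array([[np.mean(L(i, np.abs(x)) * L(j, np.abs(x))) for j in range(3)]
              for i in range(3)])
```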

We now show that for Bayesian Lasso, the only ADMM update that is not linear algebra is simply a Lasso problem.

Theorem 3.2.

For the Bayesian Lasso statistical model given by (2.4), the ADMM update (3.15c) is a d-dimensional Lasso point estimation problem:

p_i^{k+1} = \arg\min_{p_i} \|\hat{y} - \hat{\Phi} p_i\|_2^2 + \lambda \|p_i\|_1 \qquad (3.17)

where $\hat{\Phi}$ and $\hat{y}$ satisfy

\hat{\Phi}^T \hat{\Phi} = \Phi^T \Phi + \frac{\rho}{2} I \qquad (3.18)

\hat{y} = \left( \left[ y^T \Phi + \frac{\rho}{2} (B^{k+1} A_i)^T - \frac{1}{2} \gamma_i^{kT} \right] \hat{\Phi}^{+} \right)^T \qquad (3.19)

and $\hat{\Phi}^{+}$ represents the pseudo-inverse of $\hat{\Phi}$.

Proof.

Dropping the $i$ indices of (3.15c), the objective is a quadratic in $p$ plus the $\ell_1$ penalty, which can be rewritten in the Lasso form (3.17). Here (3.19) follows from performing a Cholesky decomposition to build a unique factor and then zero padding to build $\hat{\Phi}$, obeying the relationship given in (3.18). Then we complete the square in order to get an expression of the form $\|\hat{y} - \hat{\Phi} p\|_2^2$ by matching the linear terms:

-2 \hat{y}^T \hat{\Phi} p = \left( \gamma^T - 2 y^T \Phi - \rho (B A)^T \right) p. \qquad ∎
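The construction of $\hat{\Phi}$ in (3.18) is a one-line factorization in practice. A minimal sketch (our illustration; shapes and the $\rho/2$ augmented-Lagrangian convention follow the display above):

```python
import numpy as np

# Sketch of (3.18): build a Phi_hat whose Gram matrix equals
# Phi^T Phi + (rho/2) I, via a Cholesky factorization. The target matrix is
# positive definite for any rho > 0, so the factor exists and is unique.
rng = np.random.default_rng(6)
n, d, rho = 8, 5, 1.0
Phi = rng.standard_normal((n, d))
G = Phi.T @ Phi + (rho / 2.0) * np.eye(d)
Phi_hat = np.linalg.cholesky(G).T   # upper-triangular d x d factor
assert np.allclose(Phi_hat.T @ Phi_hat, G)
```

Any off-the-shelf Lasso solver applied to $(\hat{y}, \hat{\Phi})$ then performs the update (3.15c).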

Corollary 3.3.

For the Bayesian Lasso, the problem of finding a map to generate i.i.d. samples from the posterior can be solved by iteratively performing linear algebra updates and solving, in parallel, a collection of $d$-dimensional Lasso problems (1.1).

The procedure for Bayesian Lasso via measure transport is outlined in Algorithm 1.

Algorithm 1: Parallelized Bayesian Lasso

Input: i.i.d. samples x_1, …, x_N from the prior p in (2.3); ADMM parameter ρ; regularization λ
Output: B, holding the coefficients of the map S such that S(x) = B A(x)

1. Construct A_i and J_i via the Polynomial Chaos Expansion for each x_i, as in (3.8) and (3.10)
2. Construct M as in (3.14)
3. Initialize B and the ADMM variables randomly
4. While B has not converged:
   a. Update B as in (3.13)
   b. Update F_i, Z_i, γ_i, β_i, α_i in parallel for each i as in (3.15), and p_i with a Lasso solver as in (3.17)

4 Choosing λ via Maximum Likelihood Estimation

The parameter of the standard Lasso in (1.1), $\lambda$, can be chosen by cross-validation, generalized cross-validation, and ideas based on unbiased risk minimization [32]. Park and Casella used empirical Bayes Gibbs sampling [5] to find marginal maximum likelihood estimates of hyperparameters via the EM algorithm [25]. This empirical scheme, however, is specific to the Gibbs sampler and the hierarchical model introduced in [5]. Here, we propose an EM algorithm to calculate a maximum likelihood estimate of $\lambda$ according to the statistical model where $\sigma^2$ and $\tau$ (and thus $\lambda$) are assumed fixed and non-random. Without loss of generality, for the remainder of this section, we assume that $2\sigma^2 = 1$ and so $\lambda = \tau$.

As such, our statistical model is of the form $p(x, y; \lambda) = p(x; \lambda) \, p(y \mid x)$, where $p(x; \lambda)$ is a Laplacian density with parameter $\lambda$ and $p(y \mid x)$ is the Gaussian likelihood (2.2).

In the EM framework, it is our objective to find

\hat{\lambda} = \arg\max_{\lambda} p(y; \lambda) \qquad (4.1)

= \arg\max_{\lambda} \int p(x, y; \lambda) \, dx \qquad (4.2)

This can be performed by iteratively solving in the E-step for the posterior density at the current iterate $\lambda^{(k)}$: $g_k(x) \triangleq p(x \mid y; \lambda^{(k)})$. Since we can generate i.i.d. posterior samples with our transport map, define $Z_1^{(k)}, \dots, Z_N^{(k)}$ as i.i.d. samples from $g_k$. The M-step is found by solving for $\lambda^{(k+1)}$:

\lambda^{(k+1)} = \arg\max_{\lambda} \mathbb{E}_{g_k}[\log p(X, y; \lambda) \mid Y = y] \qquad (4.3)

= \arg\max_{\lambda} \; d \log \lambda - \lambda \, \mathbb{E}_{g_k}[\|X\|_1 \mid Y = y] \qquad (4.4)

= \frac{d}{\mathbb{E}_{g_k}[\|X\|_1 \mid Y = y]} \qquad (4.5)

\simeq \frac{d}{\frac{1}{N} \sum_{i=1}^{N} \|Z_i^{(k)}\|_1} \qquad (4.6)

where (4.4) follows from (2.2) and (2.3) after dropping terms that do not vary with $\lambda$, (4.5) follows from closed-form maximization, and (4.6) follows from approximating the conditional expectation with an empirical expectation using the i.i.d. $Z_i^{(k)}$ from $g_k$.

Altogether, this becomes

1. Choose an initial $\lambda^{(1)}$ and set $k = 1$.

2. (E-Step): Perform Algorithm 1 with $\lambda^{(k)}$ to find a transport map $S^{(k)}$ and generate samples from the posterior distribution as $Z_i^{(k)} = S^{(k)}(X_i)$ for $i = 1, \dots, N$, where the $X_i$ are i.i.d. samples from the prior in (2.3).

3. (M-Step): Update $\lambda^{(k+1)}$ according to (4.6).

4. If convergence has occurred, stop; else let $k \leftarrow k + 1$ and return to step 2.
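The M-step (4.6) is a one-liner given posterior samples. A minimal sketch (our function name):

```python
import numpy as np

def em_lambda_update(Z):
    """M-step (4.6): update lambda from N posterior samples Z, shape (N, d)."""
    N, d = Z.shape
    return d / np.mean(np.sum(np.abs(Z), axis=1))
```

As a sanity check, if the posterior samples happened to come from the Laplacian prior itself with parameter $\lambda$, then $\mathbb{E}\|Z\|_1 = d/\lambda$, and the update recovers $\lambda$.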

5 Comparisons to Gibbs Sampling

We now present comparisons of our measure transport methodology with a Gibbs sampler for the Bayesian Lasso based on the diabetes data of Efron et al. [6], which has $n = 442$ observations and $d = 10$ predictors. We will follow the analysis presented in [25] and compare our results with the respective Gibbs sampler. We note that our Bayesian Lasso model differs from that in [25] in that we do not place a prior on $\sigma^2$. We first introduce and clarify models of the Bayesian Lasso. Then we compare regression estimates obtained using Gibbs sampling with those obtained using our methodology.

5.1 Models of Bayesian Lasso

Park and Casella [25] extend the Bayesian Lasso regression model to account for uncertainty in the hyperparameters by not only placing a prior on the latent scale variables in accordance with the interpretation of the Laplacian as a Gauss scale mixture, but in addition placing a prior on $\sigma^2$. Hans [14] introduced computational approaches to handle model uncertainty under the Bayesian Lasso regression model and provided an implementation to run Gibbs sampling for different regression models. Here we give a summary of these varying Bayesian Lasso regression models.

Succinct Fixed Parameters

A “succinct” fixed parameter model operates with $\tau$ (proportional to $\lambda$) and $\sigma^2$ fixed and non-random. The generative model is expressed by (2.2) and (2.3). Therefore, for any $(\sigma^2, \tau)$, the posterior distribution can be expressed in density form as

p(x \mid y; \sigma^2, \tau) \propto p(y \mid x; \sigma^2) \, p(x; \tau) \qquad (5.1)

\propto \exp\!\left( -\frac{1}{2\sigma^2} \|y - \Phi x\|_2^2 - \tau \|x\|_1 \right) \qquad (5.2)

In our measure transport approach, we operate on this fixed parameter model. We also refer to this model as the Bayesian-frequentist approach.

Fixed Parameter Gauss Scale Mixtures Model

In order to develop Gibbs samplers, it is natural to interpret the Laplacian distribution as a scale mixture of Gaussians.

We next consider the model in which $\sigma^2$ and $\tau$ are still non-random and the prior (2.3) is represented as a Gauss-scale mixture with latent random variables $t_1^2, \dots, t_d^2$. For any $(\sigma^2, \tau)$, the posterior distribution on $x$ given $y$ and the latent variables,

p(x \mid y, t_1^2, \dots, t_d^2; \sigma^2, \tau) \qquad (5.3)

is proportional to the product of the following probabilities

p(y \mid x; \sigma^2, \Phi) = \mathcal{N}_n(\Phi x, \sigma^2 I_n) \qquad (5.4)

p(x \mid t_1^2, \dots, t_d^2) = \mathcal{N}_d(0, D_t), \quad D_t = \mathrm{diag}(t_1^2, \dots, t_d^2) \qquad (5.5)

p(t_1^2, \dots, t_d^2; \tau) = \prod_{j=1}^{d} \frac{\tau^2}{2} e^{-\tau^2 t_j^2 / 2} \, dt_j^2 \qquad (5.6)

Note that marginalizing out the latent variables $t_1^2, \dots, t_d^2$ in (5.5)-(5.6) recovers the Laplacian prior (2.3), so this model induces the same posterior on $x$ as the succinct fixed-parameter model. Thus this is also a Bayesian-frequentist approach; its primary purpose is to introduce latent variables for Gibbs sampling.

Prior on σ2

Another Bayesian perspective formulated by Park and Casella [25] treats $\sigma^2$ as a random variable with a prior density $\pi(\sigma^2)$. The prior on $x$ given $\sigma^2$ is the conditional Laplacian

p(x \mid \sigma^2; \tau) = \left( \frac{\tau}{2\sigma} \right)^{d} \exp\!\left( -\frac{\tau}{\sigma} \|x\|_1 \right) \qquad (5.7)

that is, the penalty parameter is now scaled by the square root of the error variance. This gives rise to the following hierarchical representation of the full model:

p(y \mid x, \sigma^2; \tau) = \mathcal{N}_n(\Phi x, \sigma^2 I_n) \qquad (5.8)

p(x \mid t_1^2, \dots, t_d^2, \sigma^2) = \mathcal{N}_d(0, \sigma^2 D_t), \quad D_t = \mathrm{diag}(t_1^2, \dots, t_d^2) \qquad (5.9)

p(t_1^2, \dots, t_d^2; \tau) = \prod_{j=1}^{d} \frac{\tau^2}{2} e^{-\tau^2 t_j^2 / 2} \, dt_j^2 \qquad (5.10)

p(\sigma^2) = \pi(\sigma^2) \qquad (5.11)

For any $\tau$, the joint posterior distribution on $(x, \sigma^2)$ given $y$ takes the form:

p(x, \sigma^2 \mid y; \tau) \propto \pi(\sigma^2) (\sigma^2)^{-(n-1)/2} \exp\!\left( -\frac{1}{2\sigma^2} \|y - \Phi x\|_2^2 - \frac{\tau}{\sigma} \|x\|_1 \right) \qquad (5.12)

5.2 Analysis on Diabetes Data

We analyze the diabetes data set [6] and compare results when using the Gibbs sampler presented in [25], which utilizes a prior on $\sigma^2$, and when using samples from our transport-based methodology, which operates on the fixed parameter model. We show that despite our treatment of $\sigma^2$ as fixed, we achieve similar results, as we capture the complexity of the posterior distribution in this real dataset.

Figure 7 compares our measure transport Bayesian Lasso posterior median estimates with the ordinary Lasso estimates and the Gibbs sampler posterior median estimates. We take the vector of posterior medians as the one that minimizes the $\ell_1$ norm loss averaged over the posterior. For all three methods, the estimates were computed by sweeping over a grid of values for the regularization parameter. We implemented our measure transport Bayesian Lasso with a Polynomial Chaos Expansion order of 3 and trained with i.i.d. prior samples to compute a transport map. The specifications for the Gibbs sampler were to use a scale-invariant prior on $\sigma^2$ and to run for 10,000 iterations after 1,000 iterations of burn-in.

Figure 7 also shows the resulting optimal $\lambda$ (depicted with a vertical line) found by the EM algorithm presented in Section 4. The vertical line for the Gibbs sampler is the optimal $\lambda$ found by [25] (with a prior on $\sigma^2$) by running a Monte Carlo EM algorithm corresponding to the particular Gibbs implementation. The vertical line in the Lasso graph represents the estimate chosen by n-fold cross-validation.

Despite treating $\sigma^2$ as fixed, the paths are very similar to those of the Bayesian Lasso imparted with a prior on $\sigma^2$. As already noted in previous work, the Bayesian Lasso paths are smoother than the Lasso estimates.

We further compare the 95% credible intervals for the diabetes data obtained with a fixed $\lambda$ (the optimal $\lambda$ corresponding to the Gibbs sampler) for the marginal posterior distributions of the Bayesian Lasso estimates. Figure 8 shows the corresponding result for the Lasso, the Gibbs sampler, and our proposed methodology. The Gibbs sampler credible intervals are wider in comparison due to the treatment of $\sigma^2$ as random.
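Given i.i.d. posterior samples, marginal credible intervals reduce to empirical quantiles. A minimal sketch (our function name, equal-tailed convention assumed):

```python
import numpy as np

def credible_intervals(Z, level=0.95):
    """Equal-tailed marginal credible intervals from posterior samples Z (N x d)."""
    alpha = (1.0 - level) / 2.0
    lo = np.percentile(Z, 100 * alpha, axis=0)
    hi = np.percentile(Z, 100 * (1 - alpha), axis=0)
    return lo, hi
```

For standard normal samples the 95% interval approaches $(-1.96, 1.96)$, which provides a quick check of the quantile conventions.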

Figure 11 shows kernel density estimates of two of the regression variables obtained by Gibbs sampling and by sampling with a transport map. The density estimates obtained through a Gibbs sampler operating on the fixed parameter Gauss-scale-mixture model are similar in shape and support to those from a transport map, verifying the convergence to the posterior distribution of both methods. The small differences seen in the estimates might be due to the polynomial nature of the transport map. We also show kernel density estimates obtained with a Gibbs sampler operating on the model with a prior on $\sigma^2$ for comparison.

6 Parallelized Implementation and Applications

The fundamentally parallel nature of our Bayesian Lasso formulation allows for implementation on a variety of platforms. The fact that our solution relies solely on linear algebra and Lasso solvers allows it to be deployed in a variety of architectures for parallel computing. In order to leverage the parallel nature of the algorithm presented above, we here present an implementation with an Iteratively Reweighted Least Squares (IRLS) Lasso solver in a Graphics Processing Unit (GPU) solution.

6.1 IRLS solver within a GPU Implementation

In the last several years, GPUs have gained significant attention for their parallel programmability. In this work, we made use of the ArrayFire library that abstracts low-level GPU programming and provides highly parallelized and optimized linear algebra algorithms [20].

We implemented Algorithm 1 using ArrayFire. To solve the Lasso problems of (2.5), we implemented a generalized iteratively reweighted least-squares (GIRLS) [2] algorithm. The GIRLS algorithm requires solving only least-squares sub-problems with linear algebra operations, thus facilitating its implementation in ArrayFire.
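To make the appeal of this solver class concrete, here is a minimal NumPy sketch of an $\epsilon$-smoothed IRLS Lasso iteration (our own simplified variant, not the paper's ArrayFire GIRLS code): each step is a ridge-type linear solve, which is exactly the kind of dense linear algebra that maps well onto a GPU.

```python
import numpy as np

def lasso_irls(y, Phi, lam, n_iter=100, eps=1e-8):
    """Solve the Lasso (2.5) by iteratively reweighted least squares.
    Each iteration is a single weighted ridge solve (pure linear algebra)."""
    d = Phi.shape[1]
    x = np.ones(d)
    PtP, Pty = Phi.T @ Phi, Phi.T @ y
    for _ in range(n_iter):
        # |x_i| is majorized by a quadratic with weight 1/(|x_i| + eps).
        W = np.diag(1.0 / (np.abs(x) + eps))
        x = np.linalg.solve(PtP + 0.5 * lam * W, Pty)
    return x
```

The smoothing constant `eps` prevents division by zero as coordinates shrink toward exact zero; convergence analysis for this family of algorithms is given in [2].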

Figure 12 shows execution times for computing a transport map with PCE order of 3, running a Python implementation on an Intel Core i7 processor at 2.40 GHz (4 CPUs) and running the ArrayFire implementation on an NVIDIA GeForce 840M GPU. As the complexity of the problem increases (determined by increasing $d$), the ArrayFire implementation readily outperforms the Python implementation. This showcases the future possibilities for rapid computation of transport maps on architectures that feature parallelization capabilities.

7 Discussion and Conclusion

We have shown that an i.i.d. posterior Bayesian Lasso sampler can be constructed within a measure transport framework by iteratively solving a collection of standard Lasso problems in parallel and performing closed-form linear algebra updates. This formulation allows one to leverage the diversity of Lasso solvers to sample from the posterior. For example, we show how posterior Bayesian Lasso transport samplers can be constructed with a GPU. We also note that this algorithm could be readily implemented in other systems for parallelization, such as cloud computing.

Another potential application for inference with this transport-based approach is within the context of the Internet-of-Things (e.g. wearable electronics). In these settings, energy efficiency is of paramount importance, and wireless transmission is usually the most energy-consuming operation. Developing a framework such as ours, where inference is performed on chip, obviates the need to transmit collected waveforms. Instead, one only needs to transmit information about the posterior distribution, which is a sufficient statistic for any Bayesian decision making problem. In our case, this boils down to transmission of the coefficients of the polynomials representing the transport map. From there, in the cloud for instance, i.i.d. prior samples may be transformed into i.i.d. posterior samples. Figure 13 shows a potential use of our posterior transmission scheme and a comparison to current transmission schemes.

Another energy-efficient application could be achieved by implementing our Bayesian Lasso algorithm in analog systems. The Local Competitive Algorithm (LCA) first presented in [29] is an analog dynamical system inspired by neural image processing and exactly solves (1.1). This system has already been implemented in field-programmable analog arrays [31] and integrate-and-fire neurons [12], thus showing promising results for reduced energy in hardware implementations.

In the LCA, a set of parallel nodes, each associated with a column of the basis $\hat{\Phi}$, compete with each other for representation of the input. The dynamics of the LCA are expressed by a set of non-linear ordinary differential equations (ODEs) which represent simple analog components. The system's steady state is the solution to (1.1). Using the formulation presented in Theorem 3.2, we could solve (3.15c) by presenting the LCA dynamics in terms of $\hat{\Phi}$ and $\hat{y}$:

\dot{u}_m(t) = \frac{1}{\tau} \left[ \langle \hat{\Phi}_m, \hat{y} \rangle - u_m(t) - \sum_{n \neq m} \langle \hat{\Phi}_m, \hat{\Phi}_n \rangle \, a_n(t) \right]

a_n(t) \triangleq T_\lambda(u_n(t)) = \max(0, u_n(t) - \lambda)

where $\hat{\Phi}_m$ denotes the $m$th column of $\hat{\Phi}$ and $T_\lambda$ is a thresholding function that induces local non-linear competition between nodes.
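The ODEs above can be simulated digitally with a simple Euler scheme. A minimal sketch (our illustration; it uses the one-sided threshold exactly as printed above, so it targets nonnegative codes, and the time constant and step size are assumptions):

```python
import numpy as np

def lca(Phi_hat, y_hat, lam, tau=0.1, dt=0.01, n_steps=5000):
    """Euler-integrate the LCA dynamics above and return the steady-state
    output a(t); with the one-sided threshold this is a nonnegative code."""
    d = Phi_hat.shape[1]
    u = np.zeros(d)
    drive = Phi_hat.T @ y_hat                 # <Phi_m, y_hat> feedforward inputs
    gram = Phi_hat.T @ Phi_hat
    G = gram - np.diag(np.diag(gram))         # lateral inhibition (n != m terms)
    for _ in range(n_steps):
        a = np.maximum(0.0, u - lam)          # T_lambda as printed above
        u += (dt / tau) * (drive - u - G @ a)
    return np.maximum(0.0, u - lam)
```

For an orthonormal dictionary the inhibition vanishes and the steady state is simply the thresholded feedforward drive, which makes the fixed point easy to verify by hand.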

We have presented a framework to sample from the posterior for the Bayesian Lasso; however, this parallelizable formulation could be easily extended to other priors and sparsity problems. Dynamic formulations for spectrotemporal estimation of time series [30] could be extended to a fully Bayesian perspective to enable improved statistical inference and decision making.

Acknowledgments

M. Mendoza acknowledges support from the NSF Graduate Research Fellowship. T.P. Coleman acknowledges support from the NSF Center for Science of Information under Grant CCF-0939370, NSF grant IIS-1522125, NIH grant 1R01MH110514, and ARO MURI grant ARO-W911NF-15-1-0479.

References

1. Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
2. Nicolai Bissantz, Lutz Dümbgen, Axel Munk, and Bernd Stratmann. Convergence analysis of generalized iteratively reweighted least squares algorithms on convex function spaces. SIAM Journal on Optimization, 19(4):1828–1845, 2009.
3. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
4. Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
5. George Casella. Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485–500, 2001.
6. Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
7. Tarek A El Moselhy and Youssef M Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
8. Oliver G Ernst, Antje Mugler, Hans-Jörg Starkloff, and Elisabeth Ullmann. On the convergence of generalized polynomial chaos expansions. ESAIM: Mathematical Modelling and Numerical Analysis, 46(2):317–339, 2012.
9. Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
10. Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
11. Roger G Ghanem and Pol D Spanos. Stochastic finite elements: a spectral approach. Courier Corporation, 2003.
12. Gerd Gruenert, Konrad Gizynski, Gabi Escuela, Bashar Ibrahim, Jerzy Gorecki, and Peter Dittrich. Understanding networks of computing chemical droplet neurons based on information flow. International Journal of Neural Systems, 2014.
13. Chris Hans. Bayesian lasso regression. Biometrika, 96(4):835–845, 2009.
14. Chris Hans. Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing, 20(2):221–229, 2010.
15. W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
16. Sanggyun Kim, Rui Ma, Diego Mesa, and Todd P Coleman. Efficient Bayesian inference methods via convex optimization and optimal transport. In ISIT, 2013.
17. Sanggyun Kim, Diego Mesa, Rui Ma, and Todd P Coleman. Tractable fully Bayesian inference via convex optimization and optimal transport theory. arXiv preprint arXiv:1509.08582, 2015.
18. Sanggyun Kim, Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Dynamic and succinct statistical analysis of neuroscience data. Proceedings of the IEEE, 2014.
19. Anthony Lee, Christopher Yau, Michael B Giles, Arnaud Doucet, and Christopher C Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19(4):769–789, 2010.
20. James Malcolm. ArrayFire: a GPU acceleration platform. In Proc. of SPIE, volume 8403, pages 84030A–1, 2012.
21. Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via Measure Transport: An Introduction, pages 1–41. Springer International Publishing, Cham, 2016.
22. Diego Mesa, Sanggyun Kim, and Todd P Coleman. A scalable framework to transform samples from one continuous distribution to another. In ISIT, 2015.
23. Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
24. Andrew Y Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM, 2004.
25. Trevor Park and George Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
26. Bala Rajaratnam and Doug Sparks. Fast Bayesian lasso for high-dimensional regression. arXiv preprint arXiv:1509.03697, 2015.
27. Bala Rajaratnam and Doug Sparks. MCMC-based inference in the era of big data: a fundamental analysis of the convergence complexity of high-dimensional chains. arXiv preprint arXiv:1508.00947, 2015.
28. Christian Robert and George Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
29. Christopher J Rozell, Don H Johnson, Richard G Baraniuk, and Bruno A Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10):2526–2563, 2008.
30. Gabriel Schamberg, Demba Ba, and Todd P Coleman. A modularized efficient framework for non-Markov time series estimation. arXiv preprint arXiv:1706.04685, 2017.
31. Samuel Shapero, Adam S Charles, Christopher J Rozell, and Paul Hasler. Low power sparse approximation on reconfigurable analog hardware. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2(3):530–541, 2012.
32. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
33. Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
34. Zheng Wang, Johnathan M Bardsley, Antti Solonen, Tiangang Cui, and Youssef M Marzouk. Bayesian inverse problems with $\ell_1$ priors: a randomize-then-optimize approach. arXiv preprint arXiv:1607.01904, 2016.
35. Dongbin Xiu and George Em Karniadakis. The Wiener–Askey polynomial chaos for stochastic differential equations. SIAM Journal on Scientific Computing, 24(2):619–644, 2002.