Lasso Regression: Estimation and Shrinkage via Limit of Gibbs Sampling

Bala Rajaratnam, Steven Roberts (Australian National University), Doug Sparks (Stanford University), and Onkar Dalal (Stanford University)
Abstract

The application of the lasso is espoused in high-dimensional settings where only a small number of the regression coefficients are believed to be nonzero (i.e., the solution is sparse). Moreover, statistical properties of high-dimensional lasso estimators are often proved under the assumption that the correlation between the predictors is bounded. In this vein, coordinatewise methods, the most common means of computing the lasso solution, naturally work well in the presence of low to moderate multicollinearity. The computational speed of coordinatewise algorithms, while excellent for sparse and low to moderate multicollinearity settings, degrades as sparsity decreases and multicollinearity increases. Though lack of sparsity and high multicollinearity can be quite common in contemporary applications, model selection is still a necessity in such settings. Motivated by the limitations of coordinatewise algorithms in such “non-sparse” and “high-multicollinearity” settings, we propose the novel “Deterministic Bayesian Lasso” algorithm for computing the lasso solution. This algorithm is developed by considering a limiting version of the Bayesian lasso. In contrast to coordinatewise algorithms, the performance of the Deterministic Bayesian Lasso improves as sparsity decreases and multicollinearity increases. Importantly, in non-sparse and high-multicollinearity settings the proposed algorithm can offer substantial increases in computational speed over coordinatewise algorithms. A rigorous theoretical analysis demonstrates that (1) the Deterministic Bayesian Lasso algorithm converges to the lasso solution, and (2) it leads to a representation of the lasso estimator which shows how it achieves both $\ell_1$ and $\ell_2$ types of shrinkage simultaneously. Connections between the Deterministic Bayesian Lasso and other algorithms are also provided. The benefits of the Deterministic Bayesian Lasso algorithm are then illustrated on simulated and real data.

1 Introduction

The process of estimating regression parameters subject to a penalty on the $\ell_1$-norm of the parameter estimates, known as the lasso (Tibshirani, 1996), has become ubiquitous in modern statistical applications. In particular, in settings of low to moderate multicollinearity where the solution is believed to be sparse, the application of the lasso is almost de rigueur. Outside of the sparse and low to moderate multicollinearity setting the performance of the lasso is suboptimal (Zou and Hastie, 2005a). In this vein, many of the theoretical and algorithmic developments for the lasso assume and/or cater to a sparse estimator in the presence of low to moderate multicollinearity. A prime example of this phenomenon is coordinatewise algorithms, which have become the most common means of computing the lasso solution. The performance of coordinatewise algorithms, while ideal for sparse and low to moderate correlation settings, degrades as sparsity decreases and multicollinearity increases. However, the model selection capabilities of the lasso can still be essential even in the presence of high multicollinearity or in the absence of sparsity. The limitations of coordinatewise algorithms in such settings motivate us to propose in this paper the novel Deterministic Bayesian Lasso algorithm for computing the lasso solution. The performance of this proposed algorithm improves as sparsity decreases and multicollinearity increases, and hence our approach offers substantial advantages over coordinatewise techniques in such settings.

The popularity of the lasso comes despite the inability to express the lasso estimator in any convenient closed form. Hence, there is keen interest in algorithms capable of efficiently computing the lasso solution (Efron et al., 2004; Friedman et al., 2007; Osborne et al., 2000). Arguably, the two most well known algorithms for computing the lasso solution are least angle regression (Efron et al., 2004) and the even faster pathwise coordinate optimization (Friedman et al., 2007). Least angle regression (LARS) can be viewed as a form of stagewise regression. By exploiting the geometry of the lasso problem, LARS is able to efficiently compute the entire sequence of lasso solutions. Pathwise coordinate optimization is based on the idea of cycling through the coefficients and minimizing the objective function “one coefficient at a time”, while holding the other coefficients fixed. Since it has been shown to be considerably faster than competing methods, including LARS (Friedman et al., 2010), pathwise coordinate optimization is today the most commonly utilized algorithm for computing lasso solutions. While pathwise coordinate optimization is generally a fast and efficient algorithm for computing the lasso solution, the algorithm is not without limitations. In particular, the computational speed of pathwise coordinate optimization degrades as sparsity decreases and multicollinearity increases (Friedman et al., 2010).
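To fix ideas, the following is a minimal sketch of the coordinatewise strategy just described, written for the objective $\lVert \mathbf{y} - X\boldsymbol\beta \rVert_2^2 + \lambda \lVert \boldsymbol\beta \rVert_1$. It illustrates only the basic soft-thresholding update; it is not the pathwise implementation of Friedman et al. (2007), and the function names, scaling, and stopping rule are our own illustrative choices.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def cd_lasso(X, y, lam, n_sweeps=1000, tol=1e-10):
    """Cyclic coordinate descent for min_b ||y - X b||_2^2 + lam * ||b||_1.

    A bare-bones illustration of the coordinatewise update; the pathwise
    implementation of Friedman et al. (2007) adds warm starts, active-set
    tricks, and a grid of regularization values on top of this basic step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)        # x_j' x_j for each column
    resid = y - X @ beta                 # residual for the current fit
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            resid += X[:, j] * old       # remove coordinate j from the fit
            beta[j] = soft_threshold(X[:, j] @ resid, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]   # put the updated coordinate back
            max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta
```

Each sweep costs on the order of $np$ operations, so the overall cost is driven by the number of sweeps required; as the correlations among the columns of $X$ grow, the coordinates become strongly coupled and many more sweeps are typically needed, which is the degradation described above.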

In addition to the efficient computation of the lasso solution, the development of methods for quantifying the uncertainty associated with lasso coefficient estimates has proved difficult (Park and Casella, 2008). The difficulty primarily relates to assigning measures of uncertainty to (exact) zero lasso coefficient estimates. The recently developed Bayesian lasso (Park and Casella, 2008) addresses this issue by natural and economical uncertainty quantification, in the form of posterior credible intervals. The Bayesian lasso is based on the observation of Tibshirani (1996) that the lasso can be interpreted as a maximum a posteriori Bayesian procedure under a double-exponential prior. In their development of the Bayesian lasso, Park and Casella (2008) expressed the double exponential prior as a mixture of normals and derived a Gibbs sampler for generating from the posterior.

In this paper, we exploit the structure of the Bayesian lasso and its corresponding Gibbs sampler, not for uncertainty quantification, but rather for the computation of the lasso point estimate itself. Our approach is predicated upon the role played by the sampling variance in the lasso problem, commonly denoted by $\sigma^2$. Importantly, the lasso objective function does not depend on $\sigma^2$, and hence neither does the lasso solution. The sampling variance $\sigma^2$ does, however, play a role in the Bayesian lasso posterior. The value of $\sigma^2$ essentially controls the spread of the posterior around its mode. Hence, if $\sigma^2$ is small, the posterior will be tightly concentrated around its mode, which in turn is close to the lasso solution. This implies that the (Bayesian lasso) Gibbs sampler with a small, fixed value of $\sigma^2$ will yield a sequence that is tightly concentrated around the lasso solution. We also note that: (1) the lasso solution is exactly the mode of the marginal posterior of the regression coefficients, and (2) the mode of the joint posterior of the regression coefficients and hyperparameters used by the Gibbs sampler differs from the lasso solution by a distance that vanishes as $\sigma^2 \to 0$. For computation of the lasso point estimate, the relevance of the discussion in the immediately preceding paragraph is realized by the fact that, in the limit as $\sigma^2 \to 0$, the Gibbs sampler reduces to a deterministic sequence. Moreover, the limit of this deterministic sequence can be shown to be the lasso solution. This realization motivates our Deterministic Bayesian Lasso algorithm for computing the lasso point estimate.
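The role of $\sigma^2$ can be made concrete with a small calculation. Under a double-exponential prior calibrated so that the posterior mode is the lasso solution (one such calibration is written out in Section 2.1), the marginal posterior takes, up to proportionality, the form

$$\pi(\boldsymbol\beta \mid \mathbf{y}) \;\propto\; \exp\!\left\{ -\frac{1}{2\sigma^{2}} \Bigl( \lVert \mathbf{y} - X\boldsymbol\beta \rVert_{2}^{2} + \lambda \lVert \boldsymbol\beta \rVert_{1} \Bigr) \right\},$$

a sketch whose constants are tied to that particular calibration. The quantity in parentheses does not involve $\sigma^{2}$, so the mode is the lasso solution for every value of $\sigma^{2}$, while the factor $1/(2\sigma^{2})$ sharpens the peak of the posterior as $\sigma^{2} \to 0$.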

A rigorous theoretical analysis demonstrates that (1) the Deterministic Bayesian Lasso converges to the lasso solution with probability 1, and (2) it leads to a representation of the lasso estimator that demonstrates how it achieves both $\ell_1$ and $\ell_2$ types of shrinkage simultaneously. Connections between the Deterministic Bayesian Lasso and the EM algorithm, and modifications to the Deterministic Bayesian Lasso for the purposes of computing other lasso-like estimators, are also provided. We also study the connections between our proposed algorithm and Iteratively Reweighted Least Squares, an approach motivated by optimization. The probabilistic underpinning of our proposed methodology provides (1) a theoretical backing for our proposed procedure and (2) a means of avoiding certain technical difficulties that optimization methods in the literature have to contend with.

Further, it will be demonstrated, via simulation and real data analysis, that in non-sparse and/or high-multicollinearity settings the Deterministic Bayesian Lasso has computational advantages over coordinatewise algorithms. Such non-sparse and high-multicollinearity settings are highly prevalent in high dimensional and big data applications. In the remainder of the paper, in reference to its motivation, and for brevity, we shall interchangeably refer to the Deterministic Bayesian Lasso framework by its acronym SLOG: Shrinkage via Limit of Gibbs Sampling.

We note that one of the goals of the paper is to obtain a faster means of calculating the lasso solution in high-multicollinearity and/or low-sparsity settings. There are of course other computationally fast methods for high-dimensional regression, including “2-step” methods such as thresholding and then regressing (“marginal regression”), or Bayesian variants such as SSVS. Besides the lasso, these “2-step” methods are also useful and have their respective strengths. One of the primary advantages of the lasso is that the chance of bringing in many predictors that have low predictive power in the presence of other covariates is comparatively small.

2 Methodology

2.1 The Lasso and the Bayesian Lasso Posterior

Our developments assume that we are in the standard regression model setting with a length-$n$ response vector $\mathbf{y}$ that is centered ($\sum_{i=1}^{n} y_i = 0$) and has distribution

$$\mathbf{y} \;\sim\; N_{n}\!\left( X\boldsymbol\beta,\; \sigma^{2} I_{n} \right),$$

where $X$ is the $n \times p$ design matrix and $\sigma^{2}$, without loss of generality, is assumed known. We assume that the columns of $X$ have been standardized so that $\lVert \mathbf{x}_{j} \rVert_{2}$ is the same for each $j$. The frequentist lasso estimator (Tibshirani, 1996) of the coefficient vector $\boldsymbol\beta$ is

$$\hat{\boldsymbol\beta}_{\lambda} \;=\; \operatorname*{arg\,min}_{\boldsymbol\beta \in \mathbb{R}^{p}} \Bigl\{ \lVert \mathbf{y} - X\boldsymbol\beta \rVert_{2}^{2} + \lambda \lVert \boldsymbol\beta \rVert_{1} \Bigr\}, \qquad (1)$$

where $\lVert \cdot \rVert_{q}$ denotes the usual $\ell_{q}$ vector norm and $\lambda > 0$ is the regularization parameter. Here it should be noted that if $\operatorname{rank}(X) = p$, then the minimization in (1) is strictly convex, and hence $\hat{\boldsymbol\beta}_{\lambda}$ is unique. However, if $\operatorname{rank}(X) < p$, which for example is necessarily true when $p > n$, then there may be uncountably many solutions which achieve the minimum in (1), i.e., $\hat{\boldsymbol\beta}_{\lambda}$ may not be uniquely defined. Nevertheless, uniqueness of $\hat{\boldsymbol\beta}_{\lambda}$ can still be obtained when $\operatorname{rank}(X) < p$ under quite mild conditions. For instance, Tibshirani (2013) showed that $\hat{\boldsymbol\beta}_{\lambda}$ is unique if the columns of $X$ are in a state called “general position.” In turn, a simple sufficient condition for the columns of $X$ to be in general position is that the entries of $X$ are drawn from a distribution that is absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^{n \times p}$ (Tibshirani, 2013). We will henceforth make the assumption that the columns of $X$ are in general position (henceforth referred to as Assumption 1), which in turn implies that $\hat{\boldsymbol\beta}_{\lambda}$ is unique. Note that another consequence of Assumption 1 is that the solution to the lasso problem for any subset of the columns of $X$ is also unique, since the columns in the subset are also in general position.
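As a purely numerical point of reference (not part of the original development), the minimization in (1) can be carried out with off-the-shelf software. The snippet below uses scikit-learn, whose Lasso minimizes $\frac{1}{2n}\lVert\mathbf{y} - X\boldsymbol\beta\rVert_2^2 + \alpha\lVert\boldsymbol\beta\rVert_1$, so matching it to the scaling in (1) requires $\alpha = \lambda/(2n)$; the function name and tolerance settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_reference(X, y, lam):
    """Solve (1), min_b ||y - X b||_2^2 + lam * ||b||_1, via scikit-learn.

    scikit-learn's Lasso minimizes (1/(2n)) ||y - X b||_2^2 + alpha ||b||_1,
    so the two problems coincide when alpha = lam / (2 n).  The response is
    assumed to be centered, hence fit_intercept=False.
    """
    n = X.shape[0]
    fit = Lasso(alpha=lam / (2.0 * n), fit_intercept=False,
                max_iter=100000, tol=1e-12).fit(X, y)
    return fit.coef_
```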

The lasso estimator $\hat{\boldsymbol\beta}_{\lambda}$ may be interpreted from a Bayesian perspective as

$$\hat{\boldsymbol\beta}_{\lambda} \;=\; \operatorname*{arg\,max}_{\boldsymbol\beta \in \mathbb{R}^{p}} \; \pi(\boldsymbol\beta \mid \mathbf{y}),$$

where $\pi(\boldsymbol\beta \mid \mathbf{y})$ is the posterior distribution of $\boldsymbol\beta$ under the Bayesian model

$$\mathbf{y} \mid \boldsymbol\beta \;\sim\; N_{n}(X\boldsymbol\beta,\, \sigma^{2}I_{n}), \qquad \pi(\boldsymbol\beta) \;\propto\; \prod_{j=1}^{p} \exp\!\left\{ -\frac{\lambda\,|\beta_{j}|}{2\sigma^{2}} \right\}.$$

Note that the double exponential distribution may be expressed as a scale mixture of normals (e.g., Andrews and Mallows, 1974). Hence, the Bayesian model above may be rewritten as the hierarchical model

(2)

which is popularly referred to as the Bayesian lasso (Park and Casella, 2008). Here it should be explicitly noted that our hierarchy appears to differ from that of Park and Casella. However, it can be seen that the two representations are in fact equivalent by noting that our regularization parameter $\lambda$ and the regularization parameter of Park and Casella are related by a simple one-to-one reparameterization (which involves $\sigma$), and that we take $\sigma^{2}$ as known. Under our model (2), the joint posterior is then

(3)

For convenience, we introduce a single symbol to denote the quantity in parentheses in the last line of (3); this quantity will appear again in Section 4.
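For reference, the scale-mixture representation invoked above (Andrews and Mallows, 1974) can be written generically as

$$\frac{a}{2}\, e^{-a|z|} \;=\; \int_{0}^{\infty} \frac{1}{\sqrt{2\pi s}}\, e^{-z^{2}/(2s)} \cdot \frac{a^{2}}{2}\, e^{-a^{2}s/2}\, \mathrm{d}s, \qquad a > 0,$$

that is, a double-exponential variate is a normal variate whose variance is itself exponentially distributed. The rate $a$ must be matched to the prior actually used in (2), so the display above is a template for the mixing step rather than a restatement of (2).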

Park and Casella (2008) used the joint posterior (3) to derive a Gibbs sampler for drawing from the joint lasso posterior. The convergence properties of such a sequence were subsequently investigated by Kyung et al. (2010). The above Gibbs sampler cycles through the conditionals

(4)
(5)

where the remaining notation is as in (3), and where we may replace (5) by an alternative expression whenever an element of $\boldsymbol\beta$ is zero.
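For orientation, in the original parameterization of Park and Casella (2008), with regularization parameter $\lambda_{\mathrm{PC}}$ and latent scales $\tau_{1}^{2}, \ldots, \tau_{p}^{2}$, the corresponding full conditionals are

$$\boldsymbol\beta \mid \boldsymbol\tau^{2}, \sigma^{2}, \mathbf{y} \;\sim\; N_{p}\!\bigl( A^{-1}X^{\top}\mathbf{y},\; \sigma^{2}A^{-1} \bigr), \qquad A = X^{\top}X + D_{\tau}^{-1}, \quad D_{\tau} = \operatorname{diag}(\tau_{1}^{2}, \ldots, \tau_{p}^{2}),$$

$$1/\tau_{j}^{2} \mid \boldsymbol\beta, \sigma^{2} \;\sim\; \text{Inverse-Gaussian}\Bigl( \text{mean} = \sqrt{\lambda_{\mathrm{PC}}^{2}\sigma^{2}/\beta_{j}^{2}},\; \text{shape} = \lambda_{\mathrm{PC}}^{2} \Bigr), \qquad j = 1, \ldots, p.$$

Because of the reparameterization noted above, the constants in (4)–(5) need not match these displays exactly; they are reproduced here only to make the structure of the sampler concrete.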

2.2 The Deterministic Bayesian Lasso Algorithm

The Gibbs sampler of Park and Casella (2008) was motivated by its ability to provide credible intervals for the lasso estimates. However, we discovered that the particular form of the conditionals (4) and (5) that comprise the Gibbs sampler suggests a novel method for calculating the lasso point estimate itself. Specifically, notice that as $\sigma^{2} \to 0$, the conditional distribution of $\boldsymbol\beta$ given in (4) converges to degeneracy at its mean. Similarly, the conditional distribution of each latent scale parameter given in (5) converges to degeneracy in both the case where the corresponding component of $\boldsymbol\beta$ is nonzero and the case where it is zero. (The nonzero case is clear from the properties of the inverse Gaussian distribution, while the zero case follows from a convergence-in-probability argument as $\sigma^{2} \to 0$.)
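One way to see the degeneracy of the latent-scale update is the following sketch, written under the Park–Casella parameterization displayed at the end of Section 2.1 and under the additional assumption that the reparameterization takes the form $\lambda_{\mathrm{PC}} = \lambda/(2\sigma)$ (an assumption about constants, not a quotation of (5)). The inverse-Gaussian conditional then has

$$\mathbb{E}\!\left[ 1/\tau_{j}^{2} \mid \beta_{j} \right] = \sqrt{\lambda_{\mathrm{PC}}^{2}\sigma^{2}/\beta_{j}^{2}} = \frac{\lambda}{2\,|\beta_{j}|}, \qquad \text{shape} = \lambda_{\mathrm{PC}}^{2} = \frac{\lambda^{2}}{4\sigma^{2}} \;\to\; \infty \quad \text{as } \sigma^{2} \to 0,$$

so its mean is free of $\sigma^{2}$ while its variance (the cubed mean divided by the shape) tends to zero; the conditional therefore collapses onto a deterministic value depending only on $|\beta_{j}|$ and $\lambda$.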

Thus, in the limit as $\sigma^{2} \to 0$, the Bayesian lasso Gibbs sampler reduces to a deterministic sequence $\{\boldsymbol\beta^{(t)}\}_{t \ge 0}$ given by

where the two updates are the degenerate limits of the conditionals (4) and (5), and where $\boldsymbol\beta^{(0)}$ is some specified starting point. Substituting the limiting form of the latent-scale update into the update for $\boldsymbol\beta^{(t+1)}$ yields

(6)

Note that if every component of $\boldsymbol\beta^{(t)}$ is nonzero, then we may replace (6) by the simpler representation

(7)

Suppose the starting point $\boldsymbol\beta^{(0)}$ is drawn randomly from some distribution $\nu$ on $\mathbb{R}^{p}$, where $\nu$ is absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^{p}$. Then under mild regularity conditions, $\boldsymbol\beta^{(t)} \to \hat{\boldsymbol\beta}_{\lambda}$ as $t \to \infty$ with $\nu$-probability $1$, where $\nu$-probability simply denotes probability under the distribution $\nu$ from which the starting point is drawn. This result will be shown in Section 2.3. Thus, the recursive sequence given by (6) or (7), which we call the SLOG algorithm, provides a straightforward method of calculating $\hat{\boldsymbol\beta}_{\lambda}$ that holds regardless of the values of $n$ and $p$. From (7) it is observed that each iteration of SLOG requires the inversion of a $p \times p$ matrix. This inversion can become unduly time consuming in high dimensions. In Section 5.2.1 a variant of SLOG is developed that successfully overcomes this problem. This variant, termed rSLOG, is able to rapidly reduce the size of the matrix that needs inverting at each iteration of SLOG.
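The following is a compact implementation sketch of the recursion, in the spirit of (6)–(7). Because the exact constants in (6)–(7) are tied to the parameterization of (2), the update below is written for the objective $\lVert\mathbf{y} - X\boldsymbol\beta\rVert_2^2 + \lambda\lVert\boldsymbol\beta\rVert_1$ of (1), for which the natural reweighted-ridge step is $\boldsymbol\beta^{(t+1)} = \bigl(X^{\top}X + \tfrac{\lambda}{2}D_t^{-1}\bigr)^{-1}X^{\top}\mathbf{y}$ with $D_t = \operatorname{diag}(|\beta^{(t)}_1|, \ldots, |\beta^{(t)}_p|)$; a fixed point of this step with all components nonzero satisfies the stationarity condition of (1). The constant $\lambda/2$, the random-start rule, and the numerical zero tolerance are illustrative choices, not the paper's exact specification.

```python
import numpy as np

def slog_sketch(X, y, lam, n_iter=500, zero_tol=1e-12, conv_tol=1e-10, rng=None):
    """Deterministic reweighted-ridge recursion in the spirit of SLOG.

    Targets min_b ||y - X b||_2^2 + lam * ||b||_1 via
        b <- (X'X + (lam/2) diag(1/|b_j|))^{-1} X'y,
    with components that are (numerically) zero frozen at zero, as in the
    active-set form of the update.  Constants are tied to this objective
    scaling and are assumptions rather than the paper's exact recursion.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = rng.normal(size=p)                 # absolutely continuous random start
    for _ in range(n_iter):
        active = np.abs(beta) > zero_tol      # indices of (numerically) nonzero components
        beta_new = np.zeros(p)
        if active.any():
            ridge = np.diag(0.5 * lam / np.abs(beta[active]))
            beta_new[active] = np.linalg.solve(
                XtX[np.ix_(active, active)] + ridge, Xty[active])
        if np.max(np.abs(beta_new - beta)) < conv_tol:
            return beta_new
        beta = beta_new
    return beta
```

On simulated data, slog_sketch(X, y, lam) can be checked against the lasso_reference helper given in Section 2.1; agreement to several decimal places is a useful sanity check on the scaling.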

Essentially, the SLOG algorithm may be interpreted as providing the components of a Gibbs sampler in its degenerate limit as $\sigma^{2} \to 0$. Some intuition for this connection may be gained by noting that the lasso estimator does not depend on the value of $\sigma^{2}$. Thus, for the purposes of finding the lasso estimator, the value of $\sigma^{2}$ may be taken as any value that may be convenient. Now observe from the form of the joint posterior (3) that the smaller the value of $\sigma^{2}$, the more concentrated the posterior is around its mode. (It should be noted that the lasso estimator is the mode of the marginal posterior, and the modes of the joint and marginal posteriors do not coincide. However, they do coincide in their limits as $\sigma^{2} \to 0$.) Thus, the Gibbs sampler can be made arbitrarily tightly concentrated around the lasso solution by taking the value of $\sigma^{2}$ small enough. The SLOG algorithm simply carries this idea to its limiting conclusion by “sampling” directly from the degenerate limits of the conditional distributions. An annealing-type variant of SLOG, in which the cycles of the Gibbs sampler are based on a decreasing sequence of $\sigma^{2}$ values, is investigated in Supplemental Section B.

2.3 Alternative Representations and Fixed-Point Results

In this section, we provide theoretical results to justify the use of the SLOG algorithm for the calculation of $\hat{\boldsymbol\beta}_{\lambda}$.

2.3.1 Alternative Representation

The Deterministic Bayesian Lasso algorithm has already been written in both a general form (6) and a simpler form (7), with the simpler form only applicable in the absence of components that are exactly zero. (Note that the relevant issue is zeros in the components of the iterates $\boldsymbol\beta^{(t)}$ of the sequence generated by the SLOG algorithm. Zeros in the components of $\hat{\boldsymbol\beta}_{\lambda}$ itself are irrelevant.) In fact, we will eventually show in Lemma 7 that we may use the simpler form (7) for all $t$ with $\nu$-probability 1. However, a more general form that can be applied for any point in $\mathbb{R}^{p}$ will still be useful. The following lemma introduces a somewhat more intuitive representation of (6). We first define some additional notation. For each $t$, let $A_{t}$ denote the set of indices of the nonzero components of $\boldsymbol\beta^{(t)}$, and let $X_{A_{t}}$ denote the matrix formed by retaining the $j$th column of $X$ if and only if $j \in A_{t}$. Similarly, let $\boldsymbol\beta^{(t)}_{A_{t}}$ be the vector formed by retaining the $j$th element of $\boldsymbol\beta^{(t)}$ if and only if $j \in A_{t}$, and let $D_{A_{t}}$ be the diagonal matrix with the absolute values of the elements of $\boldsymbol\beta^{(t)}_{A_{t}}$ on its diagonal. Also, let $\boldsymbol\beta^{(t+1)}_{A_{t}}$ denote the vector formed by retaining the $j$th element of $\boldsymbol\beta^{(t+1)}$ if and only if $j \in A_{t}$. (Note that $\boldsymbol\beta^{(t+1)}_{A_{t}}$ is selected according to $A_{t}$, not $A_{t+1}$.) Then we have the following result.

Lemma 1.

For each $t$, the components of $\boldsymbol\beta^{(t+1)}$ with indices outside $A_{t}$ are all zero, and $\boldsymbol\beta^{(t+1)}_{A_{t}}$ is obtained by applying the simpler form (7) with $X_{A_{t}}$, $\boldsymbol\beta^{(t)}_{A_{t}}$, and $D_{A_{t}}$ in place of $X$, $\boldsymbol\beta^{(t)}$, and its diagonal matrix of absolute values.

Proof.

For convenience, we assume without loss of generality that $A_{t} = \{1, \ldots, k\}$ for some $k \le p$. Now observe that

from which it follows that the recursion relation (6) may be written as

and the result follows from the fact that the matrix being inverted is invertible. ∎

Remark.

Lemma 1 establishes that a modified version of the simpler form (7) can still be used even in the presence of zeros in the components of $\boldsymbol\beta^{(t)}$. Any such zero components simply remain zero in the next iteration. Meanwhile, the nonzero components are updated by applying the simpler form (7) using the subvector of these nonzero components and the submatrix of the corresponding columns of $X$.
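In symbols, and under the same objective scaling assumed in the implementation sketches above (so that the constant $\lambda/2$ below is our assumption rather than a quotation of the paper's display), the update described in this remark reads

$$\beta^{(t+1)}_{j} = 0 \;\; \text{for } j \notin A_{t}, \qquad \boldsymbol\beta^{(t+1)}_{A_{t}} = \Bigl( X_{A_{t}}^{\top}X_{A_{t}} + \tfrac{\lambda}{2}\, D_{A_{t}}^{-1} \Bigr)^{-1} X_{A_{t}}^{\top}\mathbf{y}.$$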

2.3.2 Fixed Points

We now establish results on fixed points of the SLOG algorithm. To this end, it will be helpful to consider the recursion relation as a function. Specifically, let $g : \mathbb{R}^{p} \to \mathbb{R}^{p}$ be the function that maps $\boldsymbol\beta^{(t)}$ to $\boldsymbol\beta^{(t+1)}$ according to (6), or equivalently Lemma 1.

Suggestions of the relationship between the sequence $\{\boldsymbol\beta^{(t)}\}$ and the lasso estimator are provided by the following lemmas. The first states that the lasso's objective function, which we define here on the scale for which larger is better as $h(\boldsymbol\beta) = -\bigl( \lVert \mathbf{y} - X\boldsymbol\beta \rVert_{2}^{2} + \lambda \lVert \boldsymbol\beta \rVert_{1} \bigr)$, so that $\hat{\boldsymbol\beta}_{\lambda}$ is its unique maximizer under Assumption 1, is nondecreasing as a function of $t$ when evaluated at each iteration of the SLOG sequence $\{\boldsymbol\beta^{(t)}\}$, while the second uses this result to conclude that the lasso estimator is a fixed point of the recursion under broad conditions.

Lemma 2.

$h(\boldsymbol\beta^{(t+1)}) \ge h(\boldsymbol\beta^{(t)})$ for all $t$, with strict inequality if $\boldsymbol\beta^{(t+1)} \neq \boldsymbol\beta^{(t)}$. Moreover, $h(\boldsymbol\beta^{(t)})$ converges as $t \to \infty$.

Proof.

Let , and define . Also, for convenience, define and . Observe that by Lemma 1, we may write and  as

where denotes the nonzero components of and where is the corresponding positive definite diagonal matrix (analogous to the definition of and from and ). Then

where . Now note that we may write since each element of is nonzero, and hence

which establishes the first result. To obtain the second result, note that is equivalent to , noting that for each  such that , we necessarily have as well. Then the strict inequality follows immediately from the fact that is positive definite. To obtain convergence of the sequence , simply combine the first result with the fact that for all .

Lemma 3.

Let $\boldsymbol\beta^{(0)}$ be drawn from a distribution that is absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^{p}$. Then the lasso estimator $\hat{\boldsymbol\beta}_{\lambda}$ is a fixed point of $g$.

Proof.

Note from Lemma 2 that for all , and recall that by definition. Then . Observe that is the unique maximizer of  by the condition on . It follows that . ∎

Lemma 3 above establishes that the lasso estimator is a fixed point of the recursion $g$ that maps $\boldsymbol\beta^{(t)}$ to $\boldsymbol\beta^{(t+1)}$. It is natural to ask whether there exist other fixed points for this recursion. The following lemma answers this question in the affirmative. For the sake of clarity, we temporarily introduce somewhat more cumbersome notation for the lasso estimator. Namely, we will explicitly indicate the dependence of $\hat{\boldsymbol\beta}_{\lambda}$ on $\lambda$, $X$, and $\mathbf{y}$ by writing $\hat{\boldsymbol\beta}(\lambda, X, \mathbf{y})$ to mean precisely (1).

Lemma 4.

A point $\boldsymbol\beta^{\star} \in \mathbb{R}^{p}$ satisfies $g(\boldsymbol\beta^{\star}) = \boldsymbol\beta^{\star}$ if and only if the vector $\boldsymbol\beta^{\star}_{A}$ of the nonzero components of $\boldsymbol\beta^{\star}$ (with $A$ denoting the corresponding index set) satisfies $\boldsymbol\beta^{\star}_{A} = \hat{\boldsymbol\beta}(\lambda, X_{A}, \mathbf{y})$, where $X_{A}$ is the matrix formed by retaining the columns of $X$ corresponding to the elements of $\boldsymbol\beta^{\star}$ retained in $\boldsymbol\beta^{\star}_{A}$.

Proof.

By Lemma 1, the fixed-point condition $g(\boldsymbol\beta^{\star}) = \boldsymbol\beta^{\star}$ is equivalent to a linear system in $\boldsymbol\beta^{\star}_{A}$, in which the relevant diagonal matrix has the absolute values of the elements of $\boldsymbol\beta^{\star}_{A}$ as its diagonal entries. This system may be rewritten in a form that may be recognized as the Karush–Kuhn–Tucker condition for the lasso problem using only the covariates in $X_{A}$ (more precisely, as the case of this condition in which all components of the possible solution are nonzero). Thus, $g(\boldsymbol\beta^{\star}) = \boldsymbol\beta^{\star}$ if and only if $\boldsymbol\beta^{\star}_{A} = \hat{\boldsymbol\beta}(\lambda, X_{A}, \mathbf{y})$. ∎

Lemma 4 has several consequences. First, it may be seen that $\hat{\boldsymbol\beta}_{\lambda}$ itself satisfies this characterization, in agreement with Lemma 3. Second, since the lasso solution for each subset of the columns of $X$ is unique by Assumption 1, there are at most $2^{p}$ fixed points of $g$. (In fact, there are fewer than $2^{p}$ fixed points whenever some components of these subset solutions are already zero.) Third, every fixed point of $g$ has at least one zero component, except for possibly $\hat{\boldsymbol\beta}_{\lambda}$ itself (if each of its components is nonzero).
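For reference, the stationarity (Karush–Kuhn–Tucker) condition invoked in the proof of Lemma 4, specialized to a candidate whose retained components are all nonzero and written for the objective scaling of (1) assumed in these sketches, is

$$X_{A}^{\top}\bigl( \mathbf{y} - X_{A}\boldsymbol\beta_{A} \bigr) \;=\; \frac{\lambda}{2}\, \operatorname{sign}(\boldsymbol\beta_{A}),$$

so a fixed point of $g$ supported on $A$ is precisely the lasso solution for the reduced design $X_{A}$, which is the content of Lemma 4.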

3 Convergence Analysis

In this section, we establish under mild regularity conditions that, with $\nu$-probability $1$, the sequence $\{\boldsymbol\beta^{(t)}\}$ generated by the Deterministic Bayesian Lasso algorithm converges to $\hat{\boldsymbol\beta}_{\lambda}$. We begin by stating and proving two lemmas that motivate a simplifying assumption.

Lemma 5.

If $X^{\top}\mathbf{y} = \mathbf{0}$, then $\hat{\boldsymbol\beta}_{\lambda} = \mathbf{0}$.

Proof.

If $X^{\top}\mathbf{y} = \mathbf{0}$, then $\lVert \mathbf{y} - X\boldsymbol\beta \rVert_{2}^{2} + \lambda \lVert \boldsymbol\beta \rVert_{1} = \lVert \mathbf{y} \rVert_{2}^{2} + \lVert X\boldsymbol\beta \rVert_{2}^{2} + \lambda \lVert \boldsymbol\beta \rVert_{1}$, which is clearly minimized by $\boldsymbol\beta = \mathbf{0}$. ∎

Lemma 6.

Suppose the columns of $X$ may be permuted and partitioned as $X = [\,X_{1}\;\; X_{2}\,]$, where $X_{1}^{\top}X_{2}$ equals the zero matrix of the appropriate size. Then $\hat{\boldsymbol\beta}(\lambda, X, \mathbf{y})$ is obtained (up to the same permutation) by combining $\hat{\boldsymbol\beta}(\lambda, X_{1}, \mathbf{y})$ and $\hat{\boldsymbol\beta}(\lambda, X_{2}, \mathbf{y})$.

Proof.

The proof is given in the Supplemental Section.
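The separability behind Lemma 6 can be seen directly from the objective: when the columns are permuted so that $X = [\,X_{1}\;\; X_{2}\,]$ with $X_{1}^{\top}X_{2} = 0$, the cross term in the quadratic vanishes, and writing $\boldsymbol\beta = (\boldsymbol\beta_{1}, \boldsymbol\beta_{2})$ accordingly,

$$\lVert \mathbf{y} - X_{1}\boldsymbol\beta_{1} - X_{2}\boldsymbol\beta_{2} \rVert_{2}^{2} + \lambda\lVert\boldsymbol\beta_{1}\rVert_{1} + \lambda\lVert\boldsymbol\beta_{2}\rVert_{1} = \Bigl( \lVert \mathbf{y} - X_{1}\boldsymbol\beta_{1} \rVert_{2}^{2} + \lambda\lVert\boldsymbol\beta_{1}\rVert_{1} \Bigr) + \Bigl( \lVert \mathbf{y} - X_{2}\boldsymbol\beta_{2} \rVert_{2}^{2} + \lambda\lVert\boldsymbol\beta_{2}\rVert_{1} \Bigr) - \lVert \mathbf{y} \rVert_{2}^{2},$$

so the two blocks can be minimized separately. This is only a sketch of the idea; the full argument is the one given in the Supplemental Section.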

The point of Lemma 6 is that when the covariates may be permuted and partitioned into sets $X_{1}$ and $X_{2}$ that are uncorrelated with each other, then solving the lasso problem for $X$ is equivalent to solving the lasso problems for $X_{1}$ and $X_{2}$ separately and combining the solutions. With this result in mind, we now assume that for any permutation and partition of the columns of $X$ such that $X_{1}^{\top}X_{2}$ is the zero matrix, both $X_{1}^{\top}\mathbf{y}$ and $X_{2}^{\top}\mathbf{y}$ are nonzero (henceforth referred to as Assumption 2). This assumption is not restrictive, as can be seen from the preceding lemmas. If Assumption 2 did not hold, then the problem could be split into finding $\hat{\boldsymbol\beta}(\lambda, X_{1}, \mathbf{y})$ and $\hat{\boldsymbol\beta}(\lambda, X_{2}, \mathbf{y})$ separately by Lemma 6, and one of these solutions would be exactly zero by Lemma 5. Thus, the effect of Assumption 2 is merely to ensure that we are not attempting to solve a problem that can be trivially reduced to a simpler one, though it will also be needed to avoid a technical difficulty in proving the following useful result.

Lemma 7.

Under Assumption 2, every component of $\boldsymbol\beta^{(t)}$ is nonzero for all $t$ with $\nu$-probability 1, so that the simpler form (7) may be used at every iteration.

Proof.

The proof is given in the Supplemental Section.

We now state and prove the following result, which states that our SLOG algorithm converges to the lasso estimator.

Theorem 8.

Under Assumptions 1 and 2, $\boldsymbol\beta^{(t)} \to \hat{\boldsymbol\beta}_{\lambda}$ as $t \to \infty$ with $\nu$-probability $1$.

Proof.

The proof is long and technical and is therefore given in the Supplemental Section.

It should be remarked that the only purpose of the random starting point is to ensure that, with $\nu$-probability 1, our sequence avoids “accidentally” landing exactly on a fixed point other than $\hat{\boldsymbol\beta}_{\lambda}$. If a rule could be obtained by which the starting point could be chosen to avoid such a possibility, then we could choose the starting point by this rule instead.

4 Properties of the Deterministic Bayesian Lasso Algorithm

4.1 Connections to Other Methods

4.1.1 EM Algorithm

An alternative interpretation of the Deterministic Bayesian Lasso algorithm may be obtained by comparing it to the EM algorithm (Dempster et al., 1977). Recall that the Deterministic Bayesian Lasso is based on a Gibbs sampler that includes both the parameter of interest $\boldsymbol\beta$ and a vector of latent scale parameters. An EM algorithm for the same parameter and latent vector can be considered in which the log-likelihood is

where the same symbol as before denotes the quantity in parentheses in the last line of (3). The iterates of the resulting EM algorithm coincide with those of the SLOG algorithm, as we now demonstrate.

First, suppose that the value of $\boldsymbol\beta$ at the $t$th step of the EM algorithm is $\boldsymbol\beta^{(t)}$. For the E-step of the EM algorithm, we obtain a function defined by

(8)

where the expectation is taken with respect to the distribution in which $\boldsymbol\beta$ is fixed at $\boldsymbol\beta^{(t)}$ and the latent scale parameters have the distribution

()

Note that the above coincides with the conditional distribution of the latent scale parameters, with $\boldsymbol\beta = \boldsymbol\beta^{(t)}$, under the Bayesian model, i.e., with (5). Then (8) becomes

(9)

where the remaining term does not depend on $\boldsymbol\beta$. The expectation may be evaluated as

noting that the last case holds by the fact that when since the shape parameter of the inverse gamma distribution in is . Then due to this last case, whenever and . Now consider the M-step of the EM algorithm, which takes

If for some , then as well, since otherwise and the maximum is not obtained. (Note that , so a value greater than is clearly obtainable.) Then the M-step essentially maximizes the function subject to the restriction that for every such that . For convenience, we now assume without loss of generality that for each and for each , where . Also, partition as

where is and is , and let , noting that is invertible. Then we have that

where

Then