
# A Variational Algorithm for Bayesian Variable Selection

Xichen Huang (xhuang43@illinois.edu), University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA

Jin Wang (jinwang8@illinois.edu), University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA

Feng Liang (liangf@illinois.edu), University of Illinois, Champaign, IL 61820, USA
###### Abstract

There has been intense development on the estimation of a sparse regression coefficient vector in statistics, machine learning and related fields. In this paper, we focus on the Bayesian approach to this problem, where sparsity is incorporated through the so-called spike-and-slab prior on the coefficients. Instead of relying on MCMC for posterior inference, we propose a fast and scalable algorithm based on a variational approximation to the posterior distribution. The updating scheme employed by our algorithm is different from the one proposed by Carbonetto and Stephens (2012). These changes turn out to be crucial for showing that our algorithm can achieve asymptotic consistency even when the feature dimension diverges exponentially fast with the sample size. Empirical results demonstrate the effectiveness and efficiency of the proposed algorithm.

Keywords: variable selection, variational approximation, spike-and-slab prior, consistency, Bayesian consistency

## 1 Introduction

Consider a standard linear regression problem, where we model $Y$, a continuous response variable, by a linear function of a set of $p$ features via

$$Y = X_1\beta_1 + \cdots + X_p\beta_p + \epsilon.$$

In the past three decades or so, there has been intense development on the estimation of a sparse regression model. Here "sparse" means that only a small fraction of the $\beta_j$'s is believed to be non-zero. Identifying the set $\{j : \beta_j \neq 0\}$ is often referred to as the variable selection problem.

The current approaches to variable selection can be roughly divided into two categories. One category contains approaches based on penalized likelihood, including the classical variable selection procedures like AIC/BIC and more recent ones like LASSO (Tibshirani, 1994) and SCAD (Fan and Li, 2001). As the name suggests, the penalized likelihood approach estimates the regression parameter $\beta$ by minimizing the negative log-likelihood plus some penalty function on $\beta$. With a proper choice of the penalty function, the solution will have some of its components exactly zero; that is, parameter estimation and variable selection are carried out simultaneously. For an overview of the recent developments on penalized likelihood approaches to variable selection in high dimensions, see Fan and Lv (2010) and Bühlmann and van de Geer (2011).

We focus on the other category, the Bayesian approach, which starts with a hierarchical prior on all the unknown parameters. For example, a widely used prior on $\beta$ is the so-called spike-and-slab prior (Mitchell and Beauchamp, 1988):

$$\beta_j \mid \gamma_j, \sigma^2 \;\sim\; \gamma_j N(0, v_1\sigma^2) + (1 - \gamma_j)\delta_0, \quad j = 1, \ldots, p, \tag{1}$$

where $\delta_0$ denotes a point mass at 0, and $\gamma_j = 1$ if the $j$-th variable is included and $\gamma_j = 0$ otherwise. The $p$-dimensional binary vector $\gamma = (\gamma_1, \ldots, \gamma_p)$, which serves as a model index for all the sub-models, is then modeled by a product of i.i.d. Bernoulli distributions with parameter $\theta$.

An advantage of the Bayesian approach is that, in addition to the posterior distribution on $\beta$, we can also obtain a posterior distribution over all the sub-models. For example, we can discuss the probability of a sub-model or the inclusion probability of a particular feature, which can be of more interest than a point estimate of $\beta$. Further, for prediction, it is well known that model combination or aggregation has better performance than a single model (Breiman, 2001). The Bayesian approach for variable selection gives rise to a natural averaging scheme: the predictions from various sub-models can be averaged with respect to their posterior probabilities (Raftery et al., 1998; Clyde and George, 2004).

Despite the aforementioned advantages, in practice, Bayesian variable selection is less preferable than penalization algorithms. A major disadvantage of Bayesian variable selection is the computational cost. The posterior distribution usually does not have a closed-form expression, so posterior inference has to rely on MCMC, which can be time-consuming, especially when the number of predictors is large.

In this paper, we propose a variational algorithm for Bayesian variable selection. It is a deterministic algorithm, seeking an approximation of the true posterior distribution over $(\beta, \gamma)$, instead of running an MCMC chain. It converges very fast and can scale to large data sets. Our work is motivated by an earlier variational algorithm proposed by Carbonetto and Stephens (2012). The two algorithms have the same prior specification and the same set of variational parameters $\{(\phi_j, \mu_j, \sigma_j^2)\}_{j=1}^p$. The two algorithms, however, update the variational parameters differently. In the algorithm by Carbonetto and Stephens (2012), the parameters associated with each feature are updated sequentially given the others; such a component-wise updating scheme is prone to error accumulation, especially when $p$ is large and predictors are correlated. In our algorithm, all features are updated simultaneously, which we refer to as the batch-wise updating scheme, and which is therefore more robust to errors and correlations among predictors. Indeed, the batch-wise updating scheme employed by our algorithm turns out to be crucial for us to show that our algorithm achieves both frequentist consistency and Bayesian consistency even when $p$ diverges at an exponential rate of the sample size $n$. To the best of our knowledge, no asymptotic results on variational algorithms for Bayesian variable selection are available in the literature.

The remainder of the paper is organized as follows: Section 2 presents the two variational Bayes (VB) algorithms; Section 3 investigates the asymptotic properties of our new algorithm; empirical results are given in Sections 4 and 5, and conclusions are given in Section 6.

### 1.1 Notation

We define some symbols that will be used in the following sections. For sequences $a_n$ and $b_n$, we write

• $a_n = O(b_n)$, if $\exists\, C > 0$ and $N$, s.t. $\forall n > N$, $|a_n| \leq C|b_n|$;

• $a_n = o(b_n)$, if $\lim_{n\to\infty} a_n / b_n = 0$;

• $a_n = \Omega(b_n)$, if $\exists\, C > 0$ and $N$, s.t. $\forall n > N$, $|a_n| \geq C|b_n|$;

• $a_n \succ b_n$ if $b_n = o(a_n)$, and $a_n \asymp b_n$ if $a_n = O(b_n)$ and $b_n = O(a_n)$.

For a random variable sequence $X_n$ and a constant sequence $a_n$, we write $X_n = O_P(a_n)$ if $\forall \epsilon > 0$, $\exists\, C$ and $N$, s.t. $\forall n > N$, $P(|X_n| > C a_n) < \epsilon$, and $X_n = o_P(a_n)$ when $\forall C > 0$, $P(|X_n| > C a_n) \to 0$. For $a, b \in \mathbb{R}$, we write $a \vee b$ to represent the larger of $a$ and $b$, and $a \wedge b$ to represent the smaller of $a$ and $b$.

## 2 Variational Approximation

### 2.1 The Model

Represent the linear regression model in matrix form:

$$y = X\beta + \epsilon, \tag{2}$$

where $\epsilon$ is a vector of $n$ i.i.d. random errors generated from a normal distribution $N(0, \sigma^2)$, $y$ is the response vector of length $n$, $X$ is an $n \times p$ design matrix, and $\beta$ is the coefficient vector of length $p$. Like in many other variable selection algorithms, we center and scale the data as follows:

$$\sum_i y_i = 0, \qquad \sum_i x_{ij} = 0, \qquad \sum_i x_{ij}^2 = \|X_j\|_2^2 = n,$$

where $X_j$ denotes the $j$-th column of $X$.
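For concreteness, this pre-processing step can be sketched in a few lines (the function name `center_and_scale` is our own, not from the paper):

```python
import numpy as np

def center_and_scale(X, y):
    """Center y and the columns of X, then rescale each column of X
    so that sum_i y_i = 0, sum_i x_ij = 0, and ||X_j||_2^2 = n."""
    n = X.shape[0]
    y = y - y.mean()
    X = X - X.mean(axis=0)
    X = X * (np.sqrt(n) / np.linalg.norm(X, axis=0))  # now ||X_j||_2^2 = n
    return X, y
```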

The hierarchical prior is specified as follows:

$$\begin{aligned}
\beta_j \mid \gamma_j, \sigma^2 &\sim \gamma_j N(0, v_1\sigma^2) + (1 - \gamma_j)\delta_0, \qquad (3)\\
\gamma_j &\overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\theta),\\
\sigma^2 &\sim \mathrm{IG}\Big(\frac{\nu}{2}, \frac{\nu\lambda}{2}\Big),\\
\theta &\sim \mathrm{Beta}(a_0, b_0),
\end{aligned}$$

where $j = 1, \ldots, p$, and $v_1$, $(\nu, \lambda)$, and $(a_0, b_0)$ are hyper-parameters.

### 2.2 A Variational EM Algorithm

Variational methods have been widely used in different models, such as graphical models (Jordan et al., 1999). In the ordinary variational Bayesian approach (Bishop, 2006), an approximating distribution of all the latent variables and parameters, which takes a factorized form $Q = \prod_i Q_i$, is selected from a restricted family of distributions $\mathcal{Q}$, such that the negative KL-divergence from the true posterior $P$ to $Q$ is maximized, i.e.,

$$\max_{Q \in \mathcal{Q}} \; \mathbb{E}_Q \log\frac{P}{Q} = \int Q \log\frac{P}{Q}.$$

Then one can solve for each $Q_i$ sequentially by fixing the other $Q_{i'}$'s until convergence.

Our variational algorithm is a hybrid of Expectation-Maximization (EM) and variational inference, the same as the one used by Blei et al. (2003) for topic models. Next we give a general description of the framework we use for posterior inference.

Let $(\Theta_1, \Theta_2)$ denote the set of parameters of interest, and $\eta$ denote the hyper-parameters. The goal is to obtain an approximation of the posterior distribution on $(\Theta_1, \Theta_2)$. Define the following objective function

$$\Omega(q_1, q_2, \eta) = \mathbb{E}_{q_1, q_2}\log\frac{\pi(\Theta_1, \Theta_2, \eta \mid \text{Data})}{q_1(\Theta_1)\,q_2(\Theta_2)}, \tag{4}$$

where $q_1$ and $q_2$ are distributions on $\Theta_1$ and $\Theta_2$ respectively. Our goal is to find $q_1$, $q_2$, and a point estimate $\hat\eta$ to maximize the objective function. We will refer to $\hat\eta$ as the maximum a posteriori (MAP) estimate: if we optimize (4) with respect to a joint distribution $q(\Theta_1, \Theta_2)$, instead of restricting it to take a product form, then the corresponding $\hat\eta$ will be exactly the MAP estimate of $\eta$.

Applying the framework above to the Bayesian variable selection model, we estimate the MAP for $\theta$ and $\sigma^2$, and approximate the posteriors for the $\beta_j$'s and $\gamma_j$'s. The approximating posterior distribution of $(\beta, \gamma)$ takes the following form:

$$q(\beta, \gamma) = \prod_{j=1}^p q_j(\beta_j, \gamma_j) = \prod_{j=1}^p \big[\phi_j f_j(\beta_j)\big]^{\gamma_j}\big[(1 - \phi_j)\delta_0(\beta_j)\big]^{1-\gamma_j},$$

where $f_j$ is a probability density function. That is, we approximate the posterior distribution of $\beta_j$ by a point mass at zero with probability $1 - \phi_j$, and by a continuous distribution $f_j$ with probability $\phi_j$.

Given all the information above, we define the following objective function for this problem:

$$\Omega(q_1, \ldots, q_p, \theta, \sigma^2) = \mathbb{E}_{q_1, \ldots, q_p}\log\frac{p(y \mid \beta, \sigma^2)\, p(\beta \mid \gamma)\, p(\gamma \mid \theta)\, \pi(\theta)\, \pi(\sigma^2)}{\prod_{j=1}^p \big[\phi_j f_j(\beta_j)\big]^{\gamma_j}\big[(1 - \phi_j)\delta_0(\beta_j)\big]^{1 - \gamma_j}}.$$

### 2.3 Algorithm 1: Component-wise VB

The first algorithm is similar to the variational algorithm proposed by Carbonetto and Stephens (2012). In detail, we iteratively update the approximating distributions $q_j$, and the MAP estimates $\hat\theta$ and $\hat\sigma^2$. Since the algorithm loops over the dimensions feature by feature, we refer to it as a "component-wise" VB algorithm, to highlight its difference from Algorithm 2, which we shall propose.

#### 2.3.1 Updating Equations

Update $q_j$.

For some $j$, by fixing the other approximating distributions and point estimates, we maximize the objective function with respect to $q_j$. As shown in Carbonetto and Stephens (2012), $f_j$ is the probability density function of a Normal distribution $N(\mu_j, \sigma_j^2)$ (albeit we do not assume $f_j$ to be normal at the beginning) with

$$\mu_j = \frac{X_j^T\,\mathbb{E}_{[-j]}\big(y - \sum_{l\neq j} X_l\beta_l\big)}{n + \frac{1}{v_1}}, \qquad \sigma_j^2 = \frac{\hat\sigma^2}{n + \frac{1}{v_1}},$$

where $\mathbb{E}_{[-j]}$ denotes the expectation over all the $\beta_l$'s with $l \neq j$ with respect to the variational distributions. By symmetry, we know that the other $f_l$'s are also Normal density functions. As such, we can write $\mu_j$ as

$$\mu_j = \frac{\big(y - X_{[-j]}\bar\beta_{[-j]}\big)^T X_j}{n + \frac{1}{v_1}}, \tag{5}$$

where $X_{[-j]}$ denotes the design matrix without the $j$-th column, and $\bar\beta_l = \phi_l\mu_l$ is the mean of $\beta_l$ w.r.t. $q_l$.

The log-odds of $\phi_j$ can be updated as

$$\mathrm{Logit}(\phi_j) = \mathrm{Logit}(\hat\theta) + \frac{1}{2}\log\frac{\sigma_j^2}{v_1\hat\sigma^2} + \frac{\mu_j^2}{2\sigma_j^2}.$$
Update $\hat\theta$.

The point estimate of $\theta$ is updated by

$$\hat\theta = \frac{\sum_{j=1}^p \phi_j + a_0 - 1}{p + a_0 + b_0 - 2}. \tag{6}$$

Update $\hat\sigma^2$.

The point estimate of $\sigma^2$ is updated by

$$\hat\sigma^2 = \frac{\|y - X\bar\beta\|_2^2 + \sum_{j=1}^p\big[\big(n(1-\phi_j) + 1/v_1\big)\phi_j\mu_j^2 + \big(n + 1/v_1\big)\phi_j\sigma_j^2\big] + \nu\lambda}{n + \sum_{j=1}^p \phi_j + \nu + 2}. \tag{7}$$
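The component-wise updates above can be sketched as a single coordinate sweep; the interface and the residual bookkeeping below are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def componentwise_sweep(X, y, phi, mu, theta, sigma2, v1):
    """One sweep of the component-wise VB updates (Algorithm 1 sketch):
    each (mu_j, phi_j) is updated in turn given all the others, with
    sigma_j^2 = sigma2_hat / (n + 1/v1) shared across j."""
    n, p = X.shape
    s2 = sigma2 / (n + 1.0 / v1)            # sigma_j^2, same for all j
    beta_bar = phi * mu                     # E_q[beta_j] = phi_j * mu_j
    r = y - X @ beta_bar                    # full residual
    for j in range(p):
        r = r + X[:, j] * beta_bar[j]       # remove j's own contribution
        mu[j] = X[:, j] @ r / (n + 1.0 / v1)        # Eq (5)
        logit = (np.log(theta / (1.0 - theta))       # logit update
                 + 0.5 * np.log(s2 / (v1 * sigma2))
                 + mu[j] ** 2 / (2.0 * s2))
        phi[j] = 1.0 / (1.0 + np.exp(-logit))
        beta_bar[j] = phi[j] * mu[j]
        r = r - X[:, j] * beta_bar[j]       # restore updated contribution
    return phi, mu
```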

#### 2.3.2 The Drawback of Algorithm 1 in High Dimension

To reveal the potential drawback of Algorithm 1 when applied to a high-dimensional data set, we examine its asymptotic property.

Assume the response $y$ is generated from the normal linear regression model (2) with $\beta^*$ being the true regression coefficients. Consider a relatively easy setting where the minimal eigenvalue of $\frac{1}{n}X^TX$ is $\Omega(1)$, i.e., the correlation among columns of $X$ is small, and our starting values for the $\phi_j$'s and $\mu_j$'s are very close to the truth: $\phi_j \approx 1$ if $\beta_j^* \neq 0$, $\phi_j \approx 0$ if $\beta_j^* = 0$, and

$$\mu_j - \beta_j^* = O_P\Big(\frac{1}{\sqrt{n}}\Big), \quad \text{for all } j = 1, \ldots, p.$$

Then, suppose we are updating the parameters associated with the $j$-th feature $X_j$. After the update, will $\mu_j$ and $\phi_j$ still be close to the truth?

From Eq (5) we have

$$\mu_j = \frac{X_j^T\big(y - X_{[-j]}\bar\beta_{[-j]}\big)}{n + \frac{1}{v_1}} = \frac{X_j^T\big(X_j\beta_j^* + \epsilon + X_{[-j]}(\beta^*_{[-j]} - \bar\beta_{[-j]})\big)}{n + \frac{1}{v_1}}.$$

Suppose $v_1$ is chosen such that $v_1 n \to \infty$, a condition required for consistency as will be made clear in our analysis in Section 3. Then, we have

$$\mu_j = \beta_j^* + O_P\Big(\frac{p}{\sqrt{n}}\Big). \tag{8}$$

The result above shows the price we pay for Algorithm 1: even if we start with $\mu_j$ within an $O_P(1/\sqrt{n})$ ball around the truth $\beta_j^*$, after the update, the new $\mu_j$ could be very far away from $\beta_j^*$ when $p$ is large, due to the accumulation of the errors from the other dimensions via $\bar\beta_{[-j]}$.

Next we examine how $\phi_j$ is affected by the update. Suppose the $j$-th feature is an irrelevant feature, i.e., $\beta_j^* = 0$. The new log-odds of $\phi_j$ is computed as

$$\mathrm{Logit}(\phi_j) = \mathrm{Logit}(\hat\theta) + \frac{1}{2}\log\frac{\sigma_j^2}{v_1\hat\sigma^2} + \frac{\mu_j^2}{2\sigma_j^2} = O(1) - \frac{1}{2}\log(v_1 n + 1) + \frac{\mu_j^2\big(n + \frac{1}{v_1}\big)}{2\hat\sigma^2},$$

where $\mathrm{Logit}(\hat\theta) = O(1)$ as long as we do not start with $\hat\theta = 0$ or $1$. Since $\mu_j = O_P(p/\sqrt{n})$ by (8), we have

$$2\,\mathrm{Logit}(\phi_j) = -\log(v_1 n + 1) + O_P(p^2). \tag{9}$$

When $p$ is very large, the right-hand side of (9) may be positive. That is, the new $\phi_j$ could be bigger than $1/2$, although we start with $\phi_j \approx 0$, a value that is very close to the truth.

Our analysis above is not rigorous, but it clearly reveals an issue with Algorithm 1: the noise can accumulate due to the feature-by-feature updating scheme. To address this issue, we propose another algorithm which updates all features simultaneously.

### 2.4 Algorithm 2: Batch-wise VB

Recall that the variational parameters we need to update are $\{(\phi_j, \mu_j, \sigma_j^2)\}_{j=1}^p$. At iteration $t$, instead of updating the triplet $(\phi_j, \mu_j, \sigma_j^2)$ sequentially for each $j$ as in Algorithm 1, we consider the following batch-wise update: update $(\sigma_1^2, \ldots, \sigma_p^2)$, then update $(\mu_1, \ldots, \mu_p)$, and finally update $(\phi_1, \ldots, \phi_p)$. Given these, we can update $(\hat\theta, \hat\sigma^2)$ using Eq (6) and Eq (7).

#### 2.4.1 Updating Equations

Update $\sigma_j^2$'s.

We update the $\sigma_j^2$'s by maximizing

$$\mathbb{E}_{q_1,\ldots,q_p}\Big\{\sum_{i=1}^n\Big[-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\Big] + \sum_{j=1}^p \gamma_j\Big[-\frac{\beta_j^2}{2v_1\sigma^2} - \log f_j(\beta_j)\Big]\Big\} \;\propto\; \sum_{j=1}^p\Big[-\phi_j\frac{n}{2\hat\sigma^2}\sigma_j^2 - \phi_j\frac{1}{2v_1\hat\sigma^2}\sigma_j^2 + \frac{1}{2}\phi_j\log(\sigma_j^2)\Big],$$

and the updating equation for $\sigma_j^2$ is

$$\sigma_j^2 = \frac{\hat\sigma^2}{n + \frac{1}{v_1}}, \qquad j = 1, \ldots, p. \tag{10}$$

Note that the $\sigma_j^2$'s take the same form for all $j$. The term $n$ in the denominator is due to the fact that each column of $X$ has been pre-processed such that $\|X_j\|_2^2 = n$. Later in Section 3, in light of the asymptotic analysis, we will suggest that this term be replaced as in Eq (20).

Update $\mu_j$'s.

We update the $\mu_j$'s by maximizing

$$\mathbb{E}_{q_1,\ldots,q_p}\Big\{\sum_{i=1}^n\Big[-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\Big] + \sum_{j=1}^p \gamma_j\Big[-\frac{\beta_j^2}{2v_1\sigma^2} - \log f_j(\beta_j)\Big]\Big\},$$

and the updating equation for $\mu = (\mu_1, \ldots, \mu_p)^T$ is

$$\mu = \Big(\Phi X^TX\Phi + \Delta + \frac{1}{v_1}\Phi\Big)^{-1}\Phi X^Ty, \tag{11}$$

where $\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_p)$ and $\Delta = n\Phi(I - \Phi)$.
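Using the equivalent form $\mu = (X^TX\Phi + n(I - \Phi) + \frac{1}{v_1}I)^{-1}X^Ty$ given in Section 2.4.2, the batch update of $\mu$ can be sketched as follows (a sketch in our own notation, not the authors' code):

```python
import numpy as np

def batch_update_mu(X, y, phi, v1):
    """Batch-wise update of all mu_j at once, solving the p x p linear
    system A mu = X'y with A = X'X Phi + n(I - Phi) + I/v1."""
    n, p = X.shape
    A = X.T @ X * phi[None, :] + np.diag(n * (1.0 - phi) + 1.0 / v1)
    return np.linalg.solve(A, X.T @ y)
```

With all $\phi_j = 1$ and orthogonal columns of norm $\sqrt{n}$, this reduces to $\mu_j = X_j^Ty/(n + 1/v_1)$, matching the component-wise formula.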

Update $\phi_j$'s.

The objective function at this step involves a quadratic form of $\mu$. To simplify the computation, we apply a linear approximation to replace the quadratic term. The detailed derivation can be found in Appendix A, and the final updating equation for $\phi_j$ is given by

$$\mathrm{Logit}(\phi_j) = \mathrm{Logit}(\hat\theta) + \frac{1}{2}\log\frac{\sigma_j^2}{v_1\hat\sigma^2} + \frac{\mu_j^2}{2\hat\sigma^2}\Big(n + \frac{1}{v_1}\Big) = \mathrm{Logit}(\hat\theta) + \frac{1}{2}\log\frac{\sigma_j^2}{v_1\hat\sigma^2} + \frac{\mu_j^2}{2\sigma_j^2}. \tag{12}$$
Truncate $\phi_j$'s.

The final expression of $\phi_j$ involves the exponential of $\mathrm{Logit}(\phi_j)$, which could trigger numerical overflow when the magnitude of the logit value is large. In our implementation, we truncate the logit value, or equivalently restrict $\phi_j \in [c, 1-c]$, where $c$ is a small constant.

We also stop updating any $\phi_j$'s once they reach the extreme values $c$ or $1-c$. That is, for $t = 1, 2, \ldots$,

$$\phi_j^{(t)} = \begin{cases}\mathrm{Logit}^{-1}\Big(\log\frac{\hat\theta}{1-\hat\theta} + \frac{1}{2}\log\frac{\sigma_j^2}{v_1\hat\sigma^2} + \frac{\mu_j^2}{2\sigma_j^2}\Big), & \text{if } \min\big(\phi_j^{(t-1)},\, 1 - \phi_j^{(t-1)}\big) > c;\\[4pt] \phi_j^{(t-1)}, & \text{if } \min\big(\phi_j^{(t-1)},\, 1 - \phi_j^{(t-1)}\big) \leq c. \end{cases} \tag{13}$$

This stop-early updating scheme can dramatically reduce the computation cost for our algorithm, as explained in Section 2.4.2.
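A minimal sketch of the truncated, stop-early update in Eq (13); the function name and the default value for $c$ are our own illustrative choices:

```python
import numpy as np

def update_phi(phi_prev, mu, sigma2_j, theta, sigma2, v1, c=1e-3):
    """Eq (13) sketch: phi_j is frozen once it is within c of 0 or 1;
    otherwise it receives the logistic update, clipped to [c, 1 - c]."""
    logit = (np.log(theta / (1.0 - theta))
             + 0.5 * np.log(sigma2_j / (v1 * sigma2))
             + mu ** 2 / (2.0 * sigma2_j))
    phi_new = np.clip(1.0 / (1.0 + np.exp(-logit)), c, 1.0 - c)
    frozen = np.minimum(phi_prev, 1.0 - phi_prev) <= c   # reached an extreme
    return np.where(frozen, phi_prev, phi_new)
```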

Stopping Criterion.

After we loop over all the parameters mentioned above, we need to decide when to stop. A natural choice is to stop when the change of the objective function is less than some threshold. Since our primary focus is variable selection, we instead use a maximum-entropy criterion: we compute the (Bernoulli) entropy of each $\phi_j$, and stop if the maximum change of these entropies between consecutive iterations is less than a pre-specified threshold.
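The stopping rule can be sketched as follows, taking the entropy of each $\phi_j$ to be the Bernoulli entropy (a small clipping constant is added purely for numerical safety):

```python
import numpy as np

def max_entropy_change(phi_old, phi_new, eps=1e-12):
    """Maximum absolute change of the Bernoulli entropies of the phi_j's;
    the algorithm stops when this falls below a pre-specified threshold."""
    def ent(p):
        p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
        return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return np.max(np.abs(ent(phi_new) - ent(phi_old)))
```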

#### 2.4.2 Computational Complexity

The main computational cost lies in inverting a $p \times p$ matrix in Eq (11). The direct computation involves $O(p^3)$ operations, which is time-consuming when both $n$ and $p$ are large. Next we describe the computational trick used in our implementation, which can dramatically reduce the cost.

At iteration $t$, Eq (11) can be rewritten as

$$\mu = \Big(X^TX\Phi^{(t)} + n\big(I - \Phi^{(t)}\big) + \frac{1}{v_1}I\Big)^{-1}X^Ty = \Big(A^{(t-1)} + \big(A^{(t)} - A^{(t-1)}\big)\Big)^{-1}X^Ty,$$

where $A^{(t)} = X^TX\Phi^{(t)} + n(I - \Phi^{(t)}) + \frac{1}{v_1}I$ and $A^{(t-1)}$ is defined analogously.

At iteration $t$, we would have $\big(A^{(t-1)}\big)^{-1}$ in hand. If the rank of $A^{(t)} - A^{(t-1)}$ is lower than $n$ or $p$, then the problem can be reformulated as inverting a matrix of a lower rank.

Write $B = X^TX - nI$ and $D = \Phi^{(t)} - \Phi^{(t-1)}$. Then $A^{(t)} - A^{(t-1)} = BD$. Based on our experience, after the first several iterations, many $\phi_j$'s are close to $c$ or $1-c$, i.e., they will not be updated any more according to Eq (13). So many diagonal elements of $D$ are zero. Without loss of generality, assume only the first $q$ diagonal elements of $D$ are non-zero. Then we have $BD = UCV$, where $U$ contains only the first $q$ columns of $B$, $C = \mathrm{diag}(d_1, \ldots, d_q)$ holds the non-zero diagonal elements of $D$, and $V$ consists of the first $q$ rows of $I_p$. Applying the Woodbury formula, we have

$$\mu = \Big(A_{(t-1)}^{-1} - A_{(t-1)}^{-1}U\underbrace{\big(C^{-1} + VA_{(t-1)}^{-1}U\big)^{-1}}_{q\times q}VA_{(t-1)}^{-1}\Big)X^Ty.$$

So we only need to invert a $q \times q$ matrix, where $q$ is much smaller than $n$ and $p$.
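The low-rank correction can be sketched with the Woodbury identity as follows; this is our own illustration of the trick in Section 2.4.2, not the authors' code:

```python
import numpy as np

def woodbury_mu(Ainv_prev, X, phi_new, phi_old, y, tol=1e-12):
    """Update mu at iteration t by correcting the stored inverse of the
    previous system matrix A_{t-1} via the Woodbury formula, instead of
    re-inverting A_t. Only the q coordinates where phi changed enter."""
    n, p = X.shape
    d = phi_new - phi_old
    S = np.flatnonzero(np.abs(d) > tol)        # the q changed coordinates
    # A_t - A_{t-1} = (X'X - n I) diag(d), non-zero only in columns S
    U = (X.T @ X - n * np.eye(p))[:, S]        # p x q
    C = np.diag(d[S])                          # q x q
    V = np.eye(p)[S, :]                        # q x p row selector
    core = np.linalg.inv(np.linalg.inv(C) + V @ Ainv_prev @ U)  # q x q
    Ainv = Ainv_prev - Ainv_prev @ U @ core @ V @ Ainv_prev
    return Ainv @ (X.T @ y), Ainv
```

The result agrees with directly solving the new system, while only a $q \times q$ matrix is inverted.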

## 3 Asymptotic Analysis

Assume the response $y$ is generated from the normal linear regression model (2) with $\beta^*$ being the true regression coefficients. Let $\gamma^*$ denote the true model index, i.e., $\gamma_j^* = 1$ if $\beta_j^* \neq 0$ and $\gamma_j^* = 0$ if $\beta_j^* = 0$. Also define the true set of relevant variables as

$$S^*_n = \{j : \beta_j^* \neq 0\} = \{j : \gamma_j^* = 1\}.$$

In our analysis, the dimension $p$ is allowed to diverge with $n$, and therefore $\beta^*$, $\gamma^*$, and $S^*_n$ may also vary with $n$.

We will show that Algorithm 2 achieves both frequentist consistency and Bayesian consistency. Recall that Algorithm 2 returns an estimate of the relevant variable set via

$$\hat S_n = \{j : \phi_j > 0.5\}.$$

The frequentist consistency refers to the convergence (in probability) of $\hat S_n$ to $S^*_n$. The cut-off value $0.5$ can be changed to any other fixed value in $(0, 1)$, which will not affect the consistency result as shown in our analysis.

In addition to a point estimate of the true variable set, our algorithm also returns a probability distribution over all variable sets (or sub-models), namely, $q(\gamma) = \prod_{j=1}^p \phi_j^{\gamma_j}(1 - \phi_j)^{1-\gamma_j}$.

The aforementioned frequentist consistency corresponds to $q(\gamma^*)$ being the largest, i.e., the true model receives the largest posterior probability, while the Bayesian consistency requires that $q(\gamma^*)$ converge to $1$ in probability, which is stronger than the frequentist consistency.

In addition to the ordinary regularity conditions, for our algorithm to achieve consistency, we need to let $v_1$, the prior variance of the non-zero $\beta_j$'s as in Eq (3), grow to infinity at a certain rate of $n$. A similar condition also arises in the asymptotic study of Bayesian variable selection by Narisetty and He (2014), although their prior specification is different from ours. To help the reader understand this condition, we first give the asymptotic analysis for a simple orthogonal design and then describe the general result.

### 3.1 The Orthogonal Design

Consider a simple case in which the design matrix is orthogonal, i.e., . To simplify our discussion, we also assume that is known, is set to be , and the minimal non-zero coefficient is bigger than some constant (i.e., the non-zero coefficients will not diminish to zero). These conditions will be relaxed in our result for the general case.

Suppose we run our algorithm for one step. From the updating equations of Algorithm 2, we have

$$2\,\mathrm{Logit}(\phi_j) = -\log(v_1 n + 1) + \frac{n\hat\beta_j^2}{\sigma^2}\cdot\frac{n}{n + \frac{1}{v_1}}, \tag{14}$$

where $\hat\beta_j = X_j^Ty/n$ is the OLS estimator of the $j$-th coefficient.

When $p$ is fixed, as in the classical asymptotic setting, it is easy to show that our algorithm has the desired asymptotic behavior as long as

$$v_1 n \to \infty, \qquad \log(v_1 n) = o(n). \tag{15}$$

This is because: when $\beta_j^* = 0$, since $n\hat\beta_j^2/\sigma^2 = O_P(1)$, the leading term in (14) is the first term, which goes to $-\infty$, therefore $\phi_j$ goes to $0$; when $\beta_j^* \neq 0$, the leading term in (14) is the second term, which goes to $+\infty$, therefore $\phi_j$ goes to $1$.

When $p$ increases with $n$, the coefficients $\beta^*$ and the true variable set $S^*_n$ may vary with $n$. As such, it is no longer meaningful to discuss the limit of Eq (14) for a fixed $j$. Instead, we need to examine the limiting behavior of $\max_{j\notin S^*_n}\mathrm{Logit}(\phi_j)$ and $\min_{j\in S^*_n}\mathrm{Logit}(\phi_j)$.

First we show that the frequentist variable selection consistency could be achieved with $p = o\big(\sqrt{v_1 n}\big)$, in addition to condition (15). Let $C$ be an arbitrary positive number. It suffices to show that

$$P\Big(\max_{j\notin S^*_n}\mathrm{Logit}(\phi_j) > -\frac{C}{2}\Big) + P\Big(\min_{j\in S^*_n}\mathrm{Logit}(\phi_j) < \frac{C}{2}\Big) \to 0. \tag{16}$$

By the Bonferroni correction and the tail probability inequality of the normal distribution, we have

$$P\Big(\max_{j\notin S^*_n}\mathrm{Logit}(\phi_j) > -\frac{C}{2}\Big) \leq \sum_{j\notin S^*_n} P\Big(\frac{n\hat\beta_j^2}{\sigma^2}\cdot\frac{n}{n + \frac{1}{v_1}} > \log(v_1 n + 1) - C\Big) \leq \frac{c}{\sqrt{\log(v_1 n) - C}}\exp\Big\{-\frac{\log(v_1 n) - C}{2} + \log p\Big\} \to 0, \tag{17}$$

as long as $p = o\big(\sqrt{v_1 n}\big)$ and condition (15) holds, where $c$ is some constant.

 P(minj∈S∗nLogit(ϕj)r√n),

which also goes to zero by the tail probability of the normal distribution, where is some constant.

Secondly, we show that the Bayesian consistency could be achieved with $p = o\big((v_1 n)^{s/2}\big)$ for some $s > 0$. We can show that, if $p = o\big((v_1 n)^{s/2}\big)$, Eq (16) still holds with a varying constant $C_n$, where $C_n > s\log(v_1 n)$. Then, with probability going to 1, we have

$$\max_{j\in S^*_n}(1 - \phi_j)\;\bigvee\;\max_{j\notin S^*_n}\phi_j < \frac{1}{\exp\{\frac{1}{2}C_n\}} < \frac{1}{(v_1 n)^{s/2}}.$$

Using the inequality $\prod_j(1 - x_j) \geq 1 - \sum_j x_j$, we have

$$q(\gamma^*) = \Big[\prod_{j:\gamma_j^*=1}\phi_j\Big]\Big[\prod_{j:\gamma_j^*=0}(1 - \phi_j)\Big] \geq 1 - \Big[\sum_{j\in S^*_n}(1 - \phi_j) + \sum_{j\notin S^*_n}\phi_j\Big] \geq 1 - \frac{p}{(v_1 n)^{s/2}} \to 1.$$

Our analysis for the orthogonal case indicates that the choice of $v_1$ affects how fast we can allow $p$ to diverge with $n$. For example, if $v_1$ is a constant, then our algorithm can achieve the frequentist consistency for $p = o(\sqrt{n})$ and the Bayesian consistency for a correspondingly restricted range of $p$. In order to achieve consistency for larger $p$, we need to let $v_1$ grow to infinity with $n$.

### 3.2 The General Case

Without loss of generality, assume $S^*_n = \{1, \ldots, s_n\}$ and the true coefficient vector is $\beta^* = (\beta^{*T}_{(1)}, \mathbf{0}^T)^T$, i.e., the first $s_n$ features are the relevant ones. Write the design matrix as $X = (X_{(1)}, X_{(0)})$ accordingly, where $X_{(1)}$ is the matrix corresponding to the signal features and $X_{(0)}$ is the matrix corresponding to the noise features.

In our analysis, we assume the following conditions hold.

1. Condition on model identification: Denote $P_{(1)}$ and $P_{(0)}$ as the projection matrices of $X_{(1)}$ and $X_{(0)}$ onto their column spaces respectively; we assume the rank of $X_{(1)}$ is $s_n$ and the spectral norm of $P_{(1)}P_{(0)}$ is upper bounded by a constant strictly less than 1.

2. Condition on the design matrix: Let $\lambda_{n1}$ denote the minimal non-zero eigenvalue of the matrix $X_{(1)}^TX_{(1)}$, and assume it satisfies

$$\lambda_{n1}^{-1} = O(n^{-\eta_1}), \quad 0 < \eta_1 \leq 1.$$
3. Condition on the sparsity of $\beta^*$: The norm of the true regression coefficient satisfies the following sparsity condition

$$\|\beta^*\|_2^2 = O(n^{\eta_2}), \quad 0 \leq \eta_2 < \eta_1.$$
4. The beta-min condition: The minimal non-zero coefficient satisfies

$$\liminf_n \; \min_{j\in S^*_n} \frac{\sqrt{n}\,|\beta_j^*|}{n^{\eta_3/2}} \geq M, \tag{18}$$

where $\eta_3$ and $M > 0$ are two constants not depending on $n$.

5. Condition on the initial values: The initial values for all the inclusion probabilities $\phi_j$ should be set to 1, i.e., we start our algorithm with all the variables included. The initial values for the error variance and the Bernoulli parameter could be any constants satisfying $\hat\sigma^{2(0)} > 0$ and $\hat\theta^{(0)} \in (0, 1)$. For the proof, we set

$$\phi_j^{(0)} = 1, \qquad \hat\sigma^{2(0)} = 1, \qquad \hat\theta^{(0)} = \frac{1}{2}.$$