Bayesian Linear Regression for Multivariate Responses Under Group Sparsity

Abstract

We study the frequentist properties of a Bayesian high-dimensional multivariate linear regression model with correlated responses. Two features of the model are unique: (i) group sparsity is imposed on the predictors; (ii) the covariance matrix is unknown and its dimension can be high. We choose a product of independent spike-and-slab priors on the regression coefficients and a Wishart prior with increasing dimension on the inverse of the covariance matrix. Each spike-and-slab prior is a mixture of a point mass at zero and a multivariate density involving the $\ell_2$-norm. We first obtain the posterior contraction rate and bounds on the effective dimension of the model that hold with high posterior probability. We then show that the multivariate regression coefficients can be recovered under certain compatibility conditions. Finally, we quantify the uncertainty for the regression coefficients with frequentist validity through a Bernstein-von Mises type theorem, which leads to selection consistency for the Bayesian method. We derive the posterior contraction rate using the general theory by constructing a suitable test from first principles, bounding moments of likelihood ratio statistics around points in the alternative; as a result, the posterior concentrates around the truth with respect to the average log-affinity. This technique for obtaining the posterior contraction rate could be useful in many other problems.

Keywords: average log-affinity; Bayesian variable selection; covariance matrix; group sparsity; multivariate linear regression; posterior contraction rate; spike-and-slab prior.

1 Introduction

The asymptotic behavior of variable selection methods, such as the lasso, has been extensively studied (Bühlmann and van de Geer, 2011). However, theoretical studies on Bayesian variable selection methods are limited to relatively simple settings (Castillo et al., 2015; Chae et al., 2016; Martin et al., 2017; Ročková, 2018; Belitser and Ghosal, 2017; Song and Liang, 2017). For example, Castillo et al. (2015) studied a sparse linear regression model in which the response variable is one-dimensional and the variance is known. However, it is not straightforward to extend those results to multivariate linear regression models with an unknown covariance matrix (or even to the univariate case with unknown variance).

In many applications, predictors are naturally clustered in groups. Below, we give three examples.

  • Cancer genomics study. It is important for biologists to understand the relationship between clinical phenotypes and DNA mutations, which are detected by DNA sequencing. Since these mutations are spaced linearly along the DNA sequence, it is often assumed that adjacent DNA mutations on the chromosome have similar genetic effects and thus should be grouped together (Li and Zhan, 2010).

  • Multi-task learning. When information is shared across multiple tasks, it is preferable to solve these tasks at the same time to improve learning efficiency and prediction accuracy. Relevant information is preserved across the different regression equations by grouping the corresponding coefficients together (Lounici et al., 2009).

  • Causal inference in advertising. Measuring the effectiveness of an advertising campaign running in stores is an important task for advertising companies. To conduct a causal analysis, counterfactuals are constructed from the sales data of a few control stores, chosen by a variable selection method from a large number of stores not subject to the advertising campaign (Ning et al., 2018). Control stores within the same geographical region, as they share the same demographic information, can be grouped together and selected or not selected at the same time.

Driven by these applications, new variable selection methods designed to select or not select variables as groups, by imposing group sparsity on the regression coefficients, have been developed. For example, the group lasso (Yuan and Lin, 2006) replaces the $\ell_1$-norm penalty of the lasso with the $\ell_{2,1}$-norm, in which the $\ell_2$-norm is applied to the predictors within each group and the $\ell_1$-norm is applied across the groups; a minimal sketch of this penalty is given below. Theoretical properties of the group lasso have been studied (Nardi and Rinaldo, 2008) and its benefits over the lasso in the group selection problem have been demonstrated (Lounici et al., 2009, 2011; Huang and Zhang, 2010). Recently, various Bayesian methods for selecting variables as groups have also been proposed (Li and Zhan, 2010; Curtis et al., 2014; Ročková and Lesaffre, 2014; Xu and Ghosh, 2015; Chen et al., 2016; Greenlaw et al., 2017; Liquet et al., 2017). However, large-sample frequentist properties of these Bayesian methods have not yet been studied.
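
To make the grouped penalty concrete, the following minimal Python sketch computes the $\ell_{2,1}$-norm penalty that defines the group lasso; the coefficient vector, group partition and tuning parameter are illustrative and not taken from the paper.

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam=1.0):
    """l_{2,1} penalty: lam times the sum over groups of the l_2-norm
    of the coefficients within each group (illustrative sketch)."""
    return lam * sum(np.linalg.norm(beta[idx]) for idx in groups)

# Six coefficients split into three pre-specified, non-overlapping groups.
beta = np.array([0.0, 0.0, 1.5, -0.5, 0.0, 2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(group_lasso_penalty(beta, groups))  # only the non-zero groups contribute
```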

In this paper, we study a Bayesian method for the multivariate linear regression model with two distinct features: group sparsity imposed on the regression coefficients and an unknown covariance matrix. To the best of our knowledge, even in the simpler setting without group sparsity, convergence and selection properties of methods for high-dimensional regression with a multivariate response having an unknown covariance matrix have not been studied in either the frequentist or the Bayesian literature. It is nevertheless important to understand the theoretical properties of such models, because correlated responses arise in many applications. For example, in the study of the causal effect of an advertising campaign, sales in different stores are often spatially correlated (Ning et al., 2018). Furthermore, when the dimension of the covariance matrix is large, its estimation affects the quality of the estimation of the regression coefficients.

When the covariance matrix is unknown and high-dimensional, the techniques that were developed for deriving posterior concentration rates in sparse regression (Castillo et al., 2015; Martin et al., 2017; Belitser and Ghosal, 2017) cannot be applied. The general theory of posterior concentration in its basic form (Ghosal and van der Vaart, 2017) is also not appropriate here, because it typically deals with the average Hellinger distance, which is not sufficient for our analysis. Thus, in order to apply the general theory to derive a rate, we construct the required tests directly by controlling the moments of likelihood ratios on small pieces of the parameter space.

In this study, we consider a multivariate linear regression model

(1.1)

where is the response variable, is the predictor variable, is the matrix containing the regression coefficients, and the errors are independent and identically distributed (i.i.d.) multivariate normal with an unknown covariance matrix. In other words, in the regression model there are non-overlapping groups of predictor variables, with the group structure pre-determined. When each group contains a single element, the model reduces to the setting in which sparsity is imposed on the individual coordinates; thus the results derived in this paper apply to the ungrouped setting as well. The model can be rewritten in vector form as

(1.2)

where is a matrix, where , and is a vector. The number of predictors can be very large; the dimension of the response can also be large, although to a lesser extent, when the sample size is large. The total number of groups is clearly bounded by the number of predictors. We refer to groups that contain at least one non-zero coordinate as non-zero groups and to the remaining groups as zero groups.
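
As a concrete illustration of the model and its group structure, the following Python sketch simulates data from a multivariate linear regression with group-sparse coefficients and correlated errors; the dimensions, group sizes and covariance matrix are illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n observations, d response components,
# and predictors partitioned into pre-determined, non-overlapping groups.
n, d = 50, 3
group_sizes = [2, 3, 2, 4]
p = sum(group_sizes)
groups = np.split(np.arange(p), np.cumsum(group_sizes)[:-1])

X = rng.standard_normal((n, p))

# Group-sparse coefficient matrix: a group is either entirely zero or
# entirely free; here only the first and third groups are non-zero.
B = np.zeros((p, d))
for g in (0, 2):
    B[groups[g], :] = rng.standard_normal((len(groups[g]), d))

# Correlated errors: rows are i.i.d. N_d(0, Sigma); Sigma is treated as
# unknown by the method but is fixed here to generate data.
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)
E = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

Y = X @ B + E  # multivariate linear regression with group sparsity
```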

To allow the derivation of asymptotic properties of estimation and selection, certain conditions on the growth of , , and need to be imposed. We allow (which means that ), but require that the total number of coefficients in all non-zero groups together is of smaller order than . We further assume that the number of coordinates in any single group is of order less than and that . Finally, to make the covariance matrix consistently estimable, we assume that its dimension satisfies the condition that .

As for the priors, we choose a product of independent spike-and-slab priors for and a Wishart prior for , the precision matrix. The spike-and-slab prior is a mixture of a point mass at zero for the zero coordinates and a density for the non-zero coordinates. In the ungrouped setting, commonly used densities for the non-zero coordinates are a Laplace density (Castillo et al., 2015), a Cauchy density (Castillo and Mismer, 2018) and a normal density with mean chosen by empirical Bayes methods (Martin et al., 2017; Belitser and Ghosal, 2017). In this paper, we choose a special density for the non-zero coordinates (see (3.1)). This density involves the $\ell_2$-norm, which is the penalty used to obtain the group lasso in a non-Bayesian setting.

The remainder of the paper is organized as follows. Section 2 introduces the notation used in this paper. Section 3 describes the priors, along with the necessary assumptions. Section 4 provides the main results. The proofs of the main results are given in Section 5. Auxiliary lemmas are provided in Appendix A.

2 Notation

We assume that are disjoint groups such that . Since these groups are given and kept fixed throughout, they will be dropped from the subscript notation. Clearly, is the number of elements in . Let . For each , let stand for the set containing the indices of the non-zero groups for the -th component and let be the cardinality of that set. Also, define and . Let be the set containing the indices of the true non-zero groups, where . Define and

For a vector , let , and be the -, - and -norm of respectively, where with is the submatrix of consisting of coordinates. For a matrix , we denote as the -th column of and as the Frobenius norm of . For a symmetric positive definite matrix , let denote the eigenvalues of ordered from the smallest to the largest and stand for the determinant of . For a scalar , we denote to be the absolute value of .

Let be the negative log-affinity between densities and and be their squared Hellinger distance. The Kullback-Leibler divergence between and is given by and the Kullback-Leibler variation between and is denoted by . The symbol stands for the density of with the parameters at their true values. The notation denotes the total variation distance between two probability measures and .

We let stand for the -covering number of a set with respect to , which is the minimal number of -balls needed to cover the set . Let stand for the dimensional identity matrix and stand for the indicator function.

The symbols and will be used to denote inequality up to a constant in either direction, while stands for for two constants and . The notations and stand for and respectively. The symbol stands for a Dirac measure.

3 Prior specifications

In this section, we introduce the priors used in this study. Since both and are unknown, we place independent priors on them: a product of independent spike-and-slab priors on and a Wishart prior on the precision matrix , the inverse of the covariance matrix.

3.1 Prior for regression coefficients

We denote the -th column of by , and let and denote the collections of the regression coordinates in the non-zero groups and the zero groups, respectively. Each spike-and-slab prior is constructed as follows. First, a dimension is chosen from a prior on the set . Next, a subset of cardinality is chosen at random from the set . Finally, a vector is chosen from the probability density on given by (3.3), and the remaining coordinates are set to 0. To summarize, the prior for is

(3.1)

where and the density is the prior for the dimension .

Assumption 1 (Prior on dimension).

There are positive constants , , , with

(3.2)

for , .

For example, the complexity prior given by Castillo et al. (2015), with replaced by , satisfies the above assumption.

The Laplace density (Castillo et al., 2015) or the Cauchy density (Castillo and Mismer, 2018) is generally chosen as , since the normal density has tails that are too sharp and overshrinks the non-zero coefficients, although some empirical Bayes modifications of the mean can overcome this issue (see Martin et al., 2017; Belitser and Ghosal, 2017). However, in our setting, as sparsity is imposed at the group level, a more natural choice is a density that incorporates the $\ell_2$-norm. We thus consider the prior

(3.3)

where (see Lemma A.1). This density has lighter tails than the corresponding Laplace density. From Stirling's approximation, it follows that . We note that Xu and Ghosh (2015) developed a posterior computational strategy for a similar prior; they also incorporated the $\ell_2$-norm into their prior, although they did not provide an explicit expression for its normalizing constant.
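
For intuition about this slab, here is a minimal Python sketch that samples from a density on $\mathbb{R}^m$ proportional to $\exp(-\lambda\|\beta\|_2)$, which is one plausible reading of (3.3) given that the prior incorporates the $\ell_2$-norm; the exact form and constants used in the paper may differ. For such a density, the radius follows a Gamma distribution with shape $m$ and rate $\lambda$, and the direction is uniform on the unit sphere.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

def sample_l2_slab(m, lam, size=1):
    """Draw from a density on R^m proportional to exp(-lam * ||beta||_2):
    radius ~ Gamma(shape=m, rate=lam), direction uniform on the sphere."""
    r = rng.gamma(shape=m, scale=1.0 / lam, size=size)
    z = rng.standard_normal((size, m))
    u = z / np.linalg.norm(z, axis=1, keepdims=True)
    return r[:, None] * u

def log_normalizing_constant(m, lam):
    """log of the constant making exp(-lam * ||beta||_2) integrate to one:
    C = lam^m * Gamma(m/2) / (2 * pi^(m/2) * Gamma(m))."""
    return (m * np.log(lam) + gammaln(m / 2)
            - np.log(2.0) - (m / 2) * np.log(np.pi) - gammaln(m))

print(sample_l2_slab(m=3, lam=2.0, size=5))
print(log_normalizing_constant(3, 2.0))
```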

The tuning parameter needs to be bounded both from above and from below. Its value cannot be too large, or it will shrink the non-zero coordinates too much towards 0. It should not be too small either, because a very small value cannot prevent many false signals from appearing in the model, which makes the posterior contract more slowly. The permissible upper and lower bounds are stated below.

Assumption 2.

For each , , where

(3.4)

The lower bound of is derived from (5.12). When each group has only one element, the lower bound reduces to , which is analogous to the lower bound displayed in Castillo et al. (2015).

The upper bound of is motivated by the following lemma.

Lemma 3.1.

Under Assumption 2,

(3.5)

3.2 Prior for the covariance matrix

We put a Wishart prior on the precision matrix: , where stands for a -dimensional Wishart distribution, is a symmetric positive definite matrix, , , and .

Although other priors can be used, the Wishart prior is the most commonly used in practice, as it is conjugate for the multivariate normal likelihood.
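
The conjugacy is easy to illustrate: if the regression residuals were known, the Wishart prior on the precision matrix would combine with the multivariate normal likelihood to give another Wishart distribution. The Python sketch below shows this standard update; the prior degrees of freedom and scale matrix are illustrative and the residuals are simulated, so this is not the posterior computation used in the paper, only the conjugate step it relies on.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(2)

# Residual rows are i.i.d. N_d(0, Omega^{-1}) with Omega the precision matrix.
n, d = 200, 3
Omega_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.5, 0.3],
                       [0.0, 0.3, 1.0]])
E = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Omega_true), size=n)

# Wishart prior Omega ~ W_d(nu0, S0); hyperparameters here are illustrative.
nu0, S0 = d + 2, np.eye(d)

# Conjugate update: Omega | residuals ~ W_d(nu0 + n, (S0^{-1} + E^T E)^{-1}).
nu_n = nu0 + n
S_n = np.linalg.inv(np.linalg.inv(S0) + E.T @ E)

draws = wishart.rvs(df=nu_n, scale=S_n, size=2000)
print(draws.mean(axis=0))  # close to Omega_true when n is large
```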

4 Main results

4.1 Posterior contraction rate

We study the posterior contraction rate for the model (1.1) and the priors given in Section 3. We write and for the true values of and , respectively. Recall that is the set containing the indices of the true non-zero groups of and is the cardinality of that set. Let and . Define the set .

The general theory of posterior contraction for independent, non-identically distributed observations (see Theorem 8.23 of Ghosal and van der Vaart, 2017) is often used to derive a posterior contraction rate based on the average squared Hellinger distance. However, a small average squared Hellinger distance between multivariate normal densities with an unknown covariance does not necessarily imply that the parameters of the two densities are close on average. We therefore work directly with the average log-affinity, which is still very tractable in the multivariate normal setting.

To derive a posterior contraction rate, we construct a suitable test from first principles by breaking up the effective parameter space, given by a sequence of appropriate sieves under the alternative hypothesis, into several pieces. For each piece sufficiently separated from the truth, we pick a representative and obtain a most powerful test (i.e., the Neyman-Pearson test) for the truth against that alternative. We bound the moments of the likelihood ratio of an arbitrary density in the piece to the density of the representative of that piece to show that the most powerful test for the truth against the representative has adequate power against any alternative in the corresponding piece.

With this approach, we require the true values of and to lie in certain regions to ensure that the prior concentration around the true point is not too small, so that the posterior contraction rate is sufficiently fast. This is unlike Castillo et al. (2015), who obtained results uniformly over the whole space, as their setting (univariate response with known variance and a Laplace prior) allows explicit expressions for a direct treatment. More precisely, we require and , where and are given in the following assumption.

Assumption 3.

The true values of and , where

(4.1)

, are two fixed positive values, , is given in (4.4) and satisfies Assumption 2.

The largest value of is obtained by taking for all and , which is the ungrouped case. Then the upper bound becomes . For the sake of simplicity, we assume that is bounded above by a large constant. Then that upper bound increases to infinity very quickly since . When , the upper bound increases, as increases, at a slower rate than the bound when .

Theorem 4.1.

For the model (1.1) and the priors given in Section 3, suppose that for a fixed positive number , , , where , and that Assumptions 1–3 hold. Then, for sufficiently large,

(4.2)
(4.3)

where

(4.4)
Remark 1.

A major contribution of this paper in proving this theorem is the construction of exponentially powerful tests for the truth against the complement of a ball by splitting the complement into suitable pieces (not necessarily balls) on which we can control a moment of the likelihood ratio between two points within each piece. This gives a general technique for constructing the tests required in the application of the general theory, which can be useful in many other problems.

Remark 2.

Instead of using the prior given in (3.3), one can also choose a Laplace density for the coordinates in the non-zero groups. Then the -norm of , , in the set should be replaced by . Clearly, , hence in the latter case the set will be smaller. In fact, one can replace the -norm with any other -norms, for . Then the norm of in needs to be adjusted accordingly.

Remark 3.

When , the rate reduces to . The first part of the rate is the same as the rate obtained when sparsity is imposed at the individual level, as in Bühlmann and van de Geer (2011) and Castillo et al. (2015). When , the first rate is obtained when the ratio (i.e., when the number of coordinates in each group is fixed) and grows sufficiently slowly.

Remark 4.

The second rate in (4.4) reveals that in some situations, by imposing group sparsity, the posterior contracts at a slower rate than when sparsity is imposed at the individual level. This happens when too many zeros are placed in non-zero groups (a situation often known as weak group sparsity; Huang and Zhang, 2010).

From Theorem 4.1, if the dimension of the covariance matrix is too large, then the posterior contraction rate can be much slower. In this situation, the rate may be improved if the precision matrix has a special structure. Here we give two examples.

Example 1 (Independent responses).

If the responses are independent across the components, then the model (1.1) can be written as independent models, each of the form

Then one can estimate the parameters in these models separately. The posterior concentration rate for each corresponding posterior becomes , where , .

Example 2 (Sparse precision matrix).

The third rate in may be improved if the precision matrix is appropriately sparse. Banerjee and Ghosal (2014) showed that when the matrix has an exact banding structure with banding size , then using an appropriate G-Wishart prior, the posterior for the precision matrix contracts at the rate with respect to the spectral norm. When the sparsity does not possess a specific structure, Banerjee and Ghosal (2015) showed that the rate reduces from to with respect to the Frobenius norm, where is the number of non-zero off-diagonal elements.

As a consequence of posterior contraction near the truth, the following estimate is easily obtained (see page 200 of Ghosal and van der Vaart, 2017).

Lemma 4.2.

For positive constants and , and the rate in (4.4), define the event

then

(4.5)

4.2 Dimensionality and recovery

In this section, we study the dimensionality and recovery properties of the marginal posterior of .

Lemma 4.3 (Dimension).

Let a prior satisfying (3.2) for all be given. Assume that , , and . Then for a sufficiently large number ,

(4.6)

Lemma 4.3 also implies that the sum of the cardinalities of the non-zero groups in different columns will not exceed . We state this result in the following corollary.

Corollary 4.4.

Under the setup of Lemma 4.3, with ,

(4.7)

From Corollary 4.4, when either or . This means that the support of the posterior can substantially overshoot the true dimension . In the next corollary, we show that the posterior is still able to recover even when .

Corollary 4.5 (Recovery).

Under Assumption 2, if , then for a sufficiently large constant ,

(4.8)

where is the restricted eigenvalue (see Definition 4.6 below).

Definition 4.6 (Restricted eigenvalue).

The smallest scaled singular value of dimension is defined as

(4.9)

As , the smallest eigenvalue of the design matrix must be 0. The restricted eigenvalue condition assumes that the smallest eigenvalue of the sub-matrix of the design matrix corresponding to the coefficients in the non-zero groups is not 0.

Results for other norms of the difference between and can also be derived under different assumptions on the smallest eigenvalue of the sub-matrix of the design matrix. For example, using the uniform compatibility condition (Definition 4.7 below), we can conclude that for a sufficiently large number ,

The proof is almost identical to that of Corollary 4.5.

Definition 4.7 (Uniform compatibility, -norm).

The -compatibility number in vectors of dimension is defined as

(4.10)

By the Cauchy-Schwarz inequality, , it follows that for any .
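
The quantities in Definitions 4.6 and 4.7 only need to be bounded away from zero on small supports. The following Python sketch computes a simple proxy, the smallest eigenvalue of the normalized Gram matrix of the design columns restricted to a candidate support; the exact scaling used in the paper's definitions is not reproduced here, so this is only an illustration of why restriction to a small support makes the quantity positive even when the full design is rank-deficient.

```python
import numpy as np

def restricted_smallest_eigenvalue(X, support_cols):
    """Smallest eigenvalue of the normalized Gram matrix of the columns of X
    indexed by support_cols; an illustrative proxy for Definition 4.6."""
    Xs = X[:, support_cols]
    gram = Xs.T @ Xs / X.shape[0]
    return np.linalg.eigvalsh(gram).min()

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 500))   # p >> n: the full Gram matrix is singular
support = np.arange(6)                # columns belonging to the non-zero groups
print(restricted_smallest_eigenvalue(X, support))          # strictly positive
print(restricted_smallest_eigenvalue(X, np.arange(500)))   # essentially zero
```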

4.3 Distributional approximation

In this section, we show that the posterior distribution can be approximated by a mixture of multivariate normal densities.

We first rewrite the model (1.1) as

(4.11)

where is the vector obtained by stacking all the columns of into a single vector, and is a block-diagonal matrix. The above model can also be written as

where is a vector, which consists of coordinates of from the set , and is a matrix, which is the submatrix of .

Then the log-likelihood function is given by

(4.12)

If , then the maximum likelihood estimator (MLE) of is unique. We denote the MLE by and the Fisher information matrix by . From (4.12), we obtain that and .
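
Although the explicit expressions are not reproduced here, the MLE on a given support is, for a Gaussian likelihood with a given precision matrix, simply the generalized least squares estimator of the corresponding submodel, with the Fisher information given by the weighted Gram matrix of the selected columns. The Python sketch below illustrates this under assumed notation (column-stacked responses, block-diagonal design, precision $\Omega \otimes I_n$); all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 40, 10, 3

X = rng.standard_normal((n, p))
B = np.zeros((p, d)); B[:3, :] = rng.standard_normal((3, d))
Sigma = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
Y = X @ B + rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Vectorized form: y = vec(Y) (columns stacked), design I_d (x) X,
# error precision Omega (x) I_n with Omega = Sigma^{-1} (treated as given).
y = Y.flatten(order="F")
X_tilde = np.kron(np.eye(d), X)
Omega = np.linalg.inv(Sigma)
P = np.kron(Omega, np.eye(n))

# Columns corresponding to a support set S (here, the true non-zero rows).
S = np.concatenate([j * p + np.arange(3) for j in range(d)])
Xs = X_tilde[:, S]

# Generalized least squares MLE on the submodel and its Fisher information.
info = Xs.T @ P @ Xs
theta_hat = np.linalg.solve(info, Xs.T @ P @ y)
print(theta_hat.round(2))
print(B[:3, :].flatten(order="F").round(2))  # true coefficients, same ordering
```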

Based on the model (4.11), the marginal posterior distribution of is

(4.13)

with

It is clear that the posterior distribution is a mixture density over different subsets .

Let . In the next theorem, we show that the posterior can be approximated by a mixture of multivariate normal densities given by

(4.14)

where

(4.15)

with , for all .

Before stating the theorem, one more piece of terminology should be introduced. We recall the notion of the small region (see Castillo et al., 2015). In our setting, each belongs to the small region if . When belongs to this region, the MLE is asymptotically unbiased and does not depend on the choice of . When lies outside the small region, the MLE is no longer asymptotically unbiased and will depend on the choice of (cf. Theorem 11 of the supplementary material of Castillo et al., 2015). As a result, the posterior concentrates near a distribution whose center can differ substantially for different values of .

Theorem 4.8 (Distributional approximation).

For , if satisfies (3.2) and , then for a positive constant ,

(4.16)

Note that the above theorem does not require the cardinality of the set to be close to . The result still holds when .

4.4 Selection

In the previous two sections, we have shown that even if , the marginal posterior of can recover the truth and can be approximated by a mixture of multivariate normal densities. In this section, we derive conditions for selection consistency. Since selection consistency requires , we need to assume that the dimension of the covariance matrix and the number of coordinates in the non-zero groups are sufficiently small. We also need to assume that the smallest signal is not too small, which is a group-sparse version of the beta-min condition. This condition is stated below:

(4.17)

The lower bound displayed in the condition is derived from (4.8). Unlike the beta-min condition in Castillo et al. (2015), in which the individual components are bounded away from 0, our condition allows a zero coordinate to be included in a non-zero group.

Theorem 4.9 (Selection consistency).

If satisfies Assumption 1 for all , , , and , then for with , where is defined in (4.1), and a positive number ,

If the conditions of Theorem 4.9 are satisfied, then the marginal posterior distribution of in non-zero groups can be approximated by a multivariate normal distribution with mean and the covariance matrix . Therefore, credible intervals for can be obtained directly from the approximating multivariate normal density.
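
As a purely illustrative sketch of this last step, the Python snippet below reads off componentwise credible intervals from a normal approximation with a given center and covariance; the center and covariance used here are placeholders, not the quantities defined in the theorem.

```python
import numpy as np
from scipy.stats import norm

def marginal_credible_intervals(center, cov, level=0.95):
    """Componentwise credible intervals from a multivariate normal
    approximation N(center, cov) to the posterior of selected coefficients."""
    z = norm.ppf(0.5 + level / 2)
    se = np.sqrt(np.diag(cov))
    return np.column_stack([center - z * se, center + z * se])

# Placeholder center and covariance of the approximating normal distribution.
center = np.array([1.2, -0.7, 0.4])
cov = np.diag([0.04, 0.09, 0.01])
print(marginal_credible_intervals(center, cov))
```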

5 Proofs

Proof of Lemma 3.1.

Let , then

Therefore, the probability in (3.5) can also be written as

(5.1)

where

By (A.4), the probability (5.1) is bounded above by , where . We plug in the expression for from (A.2) and ; then the last display is bounded below by if

(5.2)

By choosing and , (5.1) is bounded above by , which goes to as or . ∎

Proof of Theorem 4.1.

The proof contains two parts. In the first part, we quantify prior concentration around the truth in the sense of Kullback-Leibler divergence from the true density. In the second part, using the results obtained from the first part, we derive (4.2) and (4.3).

Part I. The method we use to obtain the posterior contraction rate is as follows. We construct a test from first principles by breaking up the effective parameter space into several pieces that are sufficiently separated from the truth. For each piece, we consider the likelihood ratio test for the truth against a representative in the piece, and show that this test works for the entire piece by controlling the likelihood ratio. Finally, we consider the maximum of these tests and control its size by estimating the total number of pieces.

Let be a suitable “sieve”. We shall verify that

(5.3)
(5.4)

for positive constants and and the following condition.

Let and , where and . Then there exists a test such that