Bayesian Linear Regression for Multivariate Responses Under Group Sparsity
We study the frequentist properties of a Bayesian high-dimensional multivariate linear regression model with correlated responses. Two features of the model are unique: (i) group sparsity is imposed on the predictors. (ii) the covariance matrix is unknown and its dimensions can be high. We choose a product of independent spike-and-slab priors on the regression coefficients and a Wishart prior with increasing dimension on the inverse of the covariance matrix. Each spike-and-slab prior is a mixture of a point mass at zero and a multivariate density involving a -norm. We first obtain the posterior contraction rate, the bounds on the effective dimension of the model with high posterior probabilities. We then show that the multivariate regression coefficients can be recovered under certain compatibility conditions. Finally, we quantify the uncertainty for the regression coefficients with frequentist validity through a Bernstein-von Mises type theorem. The result leads to selection consistency for the Bayesian method. We derive the posterior contraction rate using the general theory through constructing a suitable test from the first principle by bounding moments of likelihood ratio statistics around points in the alternative. This leads to the posterior concentrates around the truth with respect to the average log-affinity. The technique of obtaining the posterior contraction rate could be useful in many other problems.
Bayesian variable selection for multivariate responses
Average log-affinity \kwdBayesian variable selection \kwdcovariance matrix \kwdgroup sparsity \kwdmultivariate linear regression \kwdposterior contraction rate \kwdspike-and-slab prior
Asymptotic behavior of variable selection methods, such as the lasso, have been extensively studied (Bühlmann and van der Geer, 2011). However, theoretical studies on Bayesian variable selection methods are limited to relatively simple settings (Castillo et al., 2015; Chae et al., 2016; Martin et al., 2017; Ročková, 2018; Belitser and Ghosal, 2017; Song and Liang, 2017). For example, Castillo et al. (2015) studied a sparse linear regression model in which the response variable is one-dimensional and the variance is known. However, it is not straightforward to extent those results to study the multivariate linear regression models with unknown covariance matrix (or even the univariate case with unknown variance).
In many applications, predictors are naturally clustered in groups. Below, we give three examples.
Cancer genomics study. It is important for biologists to understand the relationship between clinical phenotypes and DNA mutations, which are detected by DNA sequencing. Since these mutations are spaced linearly along the DNA sequence, it is often assumed that the adjacent DNA mutations on the chromosome have a similar genetic effect and thus should be grouped together (Li and Zhan, 2010).
Multi-task learning. When information for multiple tasks is shared, it is preferable to solve these tasks at the same time to improve learning efficiency and prediction accuracy. Relevant information is preserved across different equations by grouping them together (Lounici et al., 2009).
Causal inference in advertising. Measuring the effectiveness of an advertising campaign running on stores is an important task for advertising companies. Counterfactuals, which are constructed using the sales data of a few stores, chosen by a variable selection method, from a large number of control stores not subject to the advertising campaign, are needed to conduct a causal analysis (Ning et al., 2018). Control stores within the same geographical region—as they share the same demographic information—can be grouped together and selected or not selected at the same time.
Driven by these applications, new variable selection methods designed to select or not select variables as groups, through imposing group sparsity on the regression coefficients, have been developed. For example, the group lasso method is proposed (Yuan and Lin, 2006). It replaces the -norm in the lasso with the -norm, where the -norm is put on the predictors within each group and the -norm is put across the groups. Theoretical properties of the group lasso have been studied (Nardi and Rinaldo, 2008) and its benefits over the lasso in the group selection problem have been demonstrated (Lounici et al., 2009, 2011; Huang and Zhang, 2010). Recently, various Bayesian methods for selecting variables as groups were also proposed (Li and Zhan, 2010; Curtis et al., 2014; Ročková and Lesaffre, 2014; Xu and Ghosh, 2015; Chen et al., 2016; Greenlaw et al., 2017; Liquet et al., 2017). However, large-sample frequentist properties of these Bayesian methods have not been studied yet.
In this paper, we study a Bayesian method for the multivariate linear regression model with two distinct features: group sparsity that is imposed on the regression coefficients and an unknown covariance matrix. To the best of our knowledge, even in a simpler setting without assuming group sparsity, convergence and selection properties of methods for high-dimensional regression with a multivariate response having an unknown covariance matrix have not been studied in either the frequentist or the Bayesian literature. However, it is important to understand the theoretical properties of those models because correlated responses arise in many applications. For example, in the study of the causal effect of an advertising campaign, sales in different stores are often spatially correlated (Ning et al., 2018). Furthermore, when the dimension of the covariance matrix is large, it would affect the quality of the estimation of the regression coefficients.
When the covariance matrix is unknown and high-dimensional, the techniques that were developed for deriving posterior concentration rates (Castillo et al., 2015; Martin et al., 2017; Belitser and Ghosal, 2017) cannot be applied. Also, The general theory of posterior concentration in its basic form (Ghosal and van der Vaart, 2017) is not appropriate to use because it typically deals with the average Hellinger distance which is not sufficient for our analysis. Thus in order to apply the general theory to derive a rate, we shall construct required tests directly by controlling the moments of likelihood ratios in small pieces.
In this study, we consider a multivariate linear regression model
where is a response variable, , is a predictor variable, , is a matrix containing the regression coefficients, and are independent identically distributed (i.i.d) as with being a unknown covariance matrix. In other words, in the regression model, there are non-overlapping groups of predictor variables with the group structure being pre-determined. When , it reduces to the setting that the sparsity is imposed on the individual coordinates. Thus the results derived in our paper are applicable to the ungrouped setting as well. The model can be rewritten in the vector form as
where is a matrix, where , and is a vector. The dimension can be very large. The dimension can be large as well to a lesser extent when the sample size is large. The number of total groups is clearly bounded by . We denote the groups which contain at least a non-zero coordinate as non-zero groups and the remaining groups as zero groups.
To allow derivation of asymptotic properties of estimation and selection, certain conditions on the growth of , , and need to be imposed. We allow (which means that ) but require that the total number of the coefficients in all non-zero groups together are less than in order. We further assume that the number of coordinates in any single group must be of order less than and that . Finally, to make the covariance matrix is consistently estimable, we assume that the dimension of the covariance matrix satisfies the condition that .
As for the priors, we choose a product of independent spike-and-slab priors for and a Wishart prior for , the precision matrix. The spike-and-slab prior is a mixture of point mass for the zero coordinates and a density for non-zero coordinates. In the ungrouped setting, commonly used densities for non-zero coordinates are a Laplace density (Castillo et al., 2015), a Cauchy density (Castillo and Mismer, 2018) and a normal density with mean chosen by empirical Bayes methods (Martin et al., 2017; Belitser and Ghosal, 2017). In this paper, we choose a special density for the non-zero coordinates (see (3.1)). This density involves the -norm, which is used as a penalty to obtain the group lasso in a non-Bayesian setting.
The remainder of the paper is organized as follows. Section 2 introduces notations that will be used in this paper. Section 3 describes the priors, along with the necessary assumptions. Section 4 provides the main results. The proofs of the main results are given in Section 5. Auxiliary lemmas are provided in Appendix A.
We assume that are disjoint groups such that . Since these groups are given and will be kept the same throughout, their notaions will be dropped from subscription notations. Clearly, is the number of elements in . Let . For each , let stand for the set which contains the indices of the non-zero groups for the -th component and be the cardinality of the set . Also, define and . Let be the set containing the indices of the true non-zero groups, where . Define and
For a vector , let , and be the -, - and -norm of respectively, where with is the submatrix of consisting of coordinates. For a matrix , we denote as the -th column of and as the Frobenius norm of . For a symmetric positive definite matrix , let denote the eigenvalues of ordered from the smallest to the largest and stand for the determinant of . For a scalar , we denote to be the absolute value of .
Let be the negative log-affinity between densities and and be their squared Hellinger distance. The Kullback-Leibler divergence between and is given by and the Kullback-Leibler variation between and is denoted by . The symbol stands for the density of with the parameters at their true values. The notation denotes the total variation distance between two probability measures and .
We let stand for the -covering number of a set with respect to , which is the minimal number of -balls needed to cover the set . Let stand for the dimensional identity matrix and stand for the indicator function.
The symbols and will be used to denote inequality up and down to a constant while stand for for two constants and . The notations and stand for and respectively. The symbol stands for a Dirac measure.
3 Prior specifications
In this section, we introduce the priors used in this study. We place two independent priors on and as they are both unknown. We place products of independent spike-and-slab priors on and a Wishart prior on , which is known as the precision matrix.
3.1 Prior for regression coefficients
We denote the -th column of as and the notations and as collections of the regression coordinates in the non-zero groups and the zero groups respectively. Each spike-and-slab prior is constructed as follows. First, a dimension is chosen from a prior on the set . Next, a subset of cardinality is randomly chosen from the set . Finally, A vector is chosen from a probability density on given by (3.3). The remaining coordinates set to 0. To summarize, the prior for is
where and the density is the prior for the dimension .
Assumption 1 (Prior on dimension).
There are positive constants , , , with
for , .
For example, the complexity prior given by Castillo et al. (2015) by replacing by shall satisfy the above assumption.
The Laplace density (Castillo et al., 2015) or the Cauchy density (Castillo and Mismer, 2018) are generally chosen as , since the normal density has too sharp tail that overshrinks the non-zero coefficients, although some empirical Bayes modifications of the mean can overcome the issue (see Martin et al., 2017; Belitser and Ghosal, 2017). However, in our setting, as sparsity is imposed at the group level, a more natural choice of the prior is the density which incorporates the -norm. We thus consider the prior
where (see Lemma A.1). This density has its tail lighter than the corresponding Laplace density. From Stirling’s approximation, it follows that . We would like to mention that Xu and Ghosh (2015) developed a posterior computational strategy for a similar prior. They also incorporated the -norm into their prior, except for that they did not provide the explicit expression of the normalizing constant for that prior.
The tuning parameter needs to be bounded both from above and below. The value of cannot be too large or it will shrink the non-zero coordinates too much towards 0. It should not be too small because a very small value will be unable to prevent many false signals appear in the model and hence making the posterior to contract slower. The upper and lower bounds for the permissible limits are stated in below.
For each , , where
The upper bound of is motivated from the following lemma.
Under Assumption 2,
3.2 Prior for the covariance matrix
We put a Wishart prior on the precision matrix: , where stands for a -dimensional Wishart distribution, is a symmetric positive definite matrix, , , and .
Although, other priors can be used, the Wishart prior is the most commonly used prior in practice as it is conjugate for the multivariate normal likelihood.
4 Main results
4.1 Posterior contraction rate
We study the posterior contraction rate for the model (1.1) and the priors given in Section 3. We write and for the true values for and respectively. Recall that is the set which includes the index of the true non-zero groups of , is the cardinality of that set . Let and . Define the set .
The general theory of posterior contraction for independent non-identically distributed observations (see Theorem 8.23 of Ghosal and van der Vaart, 2017) is often used to derive a posterior contraction rate, which is based on the average squared Hellinger distance. Since the average squared Hellinger distance between multivariate normal densities with an unknown covariance is small does not necessarily imply that the parameters in the two densities are also close on average, we work directly with on the average log-affinity which is still very tractable in the multivariate normal setting.
To derive a posterior contraction rate, we construct a suitable test from the first principle by breaking up the effective parameter space given by a sequence of appropriate sieves under the alternative hypothesis into several pieces. For each piece sufficiently separated from the truth, we pick up a representative and obtain a most powerful test (i.e., the Neymann-Pearson test) for the truth against that alternative. We bound the moments of the likelihood ratio of an arbitrary density in the piece to the density of the representive of that piece to show that the most powerful test for the truth against the representative has adequate power for any alternatives in the corresponding piece.
By using this approach, we require the true values of and to be restricted into certain regions to ensure that the prior concentration around the true point is not too small so that the posterior contraction rate is sufficiently fast. This is unlike Castillo et al. (2015), who obtained results uniformly over the whole space as their case (univariate with known variance and Laplace prior) allows explicit expressions for a direct treatment. More precisely, we require and , where and are shown in the following assumption.
The largest value of is obtained is by taking for all and , which is the ungrouped case. Then the upper bound becomes . For the sake of simplicity, we assume that is bounded above by a large constant. Then that upper bound increases to infinity very quickly since . When , the upper bound increases at a slower rate than the bound when , as increases.
A major contribution we make to prove this theorem is the construction of exponentially powerful tests for the truth against the complement of a ball by splitting the complement in suitable pieces (not necessarily balls) where we can control a moment of the likelihood ratio for two points within each piece. This gives a general technique of construction of tests required for the application of the general theory, which can be useful in many other problems.
Instead of using the prior given in (3.3), one can also choose a Laplace density for the coordinates in the non-zero groups. Then the -norm of , , in the set should be replaced by . Clearly, , hence in the latter case the set will be smaller. In fact, one can replace the -norm with any other -norms, for . Then the norm of in needs to be adjusted accordingly.
When , the rate reduces to }. The first part of the rate is the same as the rate obtained when the sparsity is imposed at the individual level, such as in Bühlmann and van der Geer (2011) and Castillo et al. (2015). When , the first rate can be obtained when the ratio (i.e., when the number of coordinates in each group takes a fixed number) and is sufficiently slowly growing.
The second rate in (4.4) reveals that in some situations, by imposing group sparsity, the posterior will contract at a slower rate than imposing sparsity at the individual level. This happens when too many zeros are put into non-zero groups (often known as weakly group-sparse (Huang and Zhang, 2010)).
From Theorem 4.1, if the dimension of the covariance is too large, then the posterior contraction rate can be much slower. Under such a situation, the rate may be improved if we know any special structures for the precision matrix. Here we give two examples.
Example 1 (Independent responses).
If the responses are independent across components, then the model (1.1) can be written as independent model with each one is
Then one can estimate the parameters in the models separately. The posterior concentration rate for each corresponding posterior becomes , where }, .
Example 2 (Sparse precision matrix).
The third rate in may be improved if the precision matrix is appropriately sparse. Banerjee and Ghosal (2014) showed that when the matrix has an exact banding structure with banding size , then using an appropriate -Wishart prior, the posterior for the precision matrix contracts at the rate with respect to the spectral norm. When the sparsity does not possess a specific structure, Banerjee and Ghosal (2015) showed that the rate reduces from to with repsect to the Frobenius norm, where is the number of non-zero off-diagonal elements.
As a consequence of posterior contraction near the truth, the following estimate is easily obtained (see page 200 of Ghosal and van der Vaart, 2017).
For positive constants and , and the rate in (4.4), define the event
4.2 Dimensionality and recovery
In this section, we study the dimensionality and recovery properties of the the marginal posterior of .
Lemma 4.3 (Dimension).
Let a prior satisfying (3.2) for all be given. Assume that , , and . Then for a sufficiently large number ,
Lemma 4.3 also implies that the sum of the cardinalities of the non-zero groups in different columns will not exceed . We state this result in the following corollary.
Under the setup of Lemma 4.3, with ,
From Corollary 4.4, when either or . This means that the support of the posterior can substantially overshoot the true dimension . In the next corollary, we show that the posterior is still able to recover even when ;
Corollary 4.5 (Recovery).
Definition 4.6 (Restricted eigenvalue).
The smallest scaled singular value of dimension is defined as
As , the smallest eigenvalue of the design matrix must be 0. The restricted eigenvalue condition assumes that the smallest eigenvalue for the sub-matrix of the design matrix, which corresponds to the coefficients within non-zero groups, is not 0.
The results for other norms for the difference between and can be also derived by assuming different assumptions on the smallest eigenvalue for the sub-matrix of the design matrix. For example, using the uniform compatibility condition (in Definition 4.7 below), we can conclude that for a sufficiently large number ,
The proof is almost identical to that of Corollary 4.5.
Definition 4.7 (Uniform compatibility, -norm).
The -compatibility number in vectors of dimension is defined as
By the Cauchy-Schwarz inequality, , it follows that for any .
4.3 Distributional approximation
In this section, we show that the posterior distribution can be approximated by a mixture of multivariate normal densities.
We first rewrite the model (1.1) as
where is a vector by stacking all the columns of into a row vector, is a block diagonal matrix. The above model can be also written as
where is a vector, which consists of coordinates of from the set , and is a matrix, which is the submatrix of .
Then log-likelihood function is given by
If , then the maximum likelihood estimator (MLE) of is unique. We denote the MLE as and the Fisher information matrix as . From (4.12), we can obtain that and .
Based on the model (4.11), the marginal posterior distribution of is
It is clear that the posterior distribution is a mixture density over different subsets .
Let . In the next theorem, we show that the posterior can be approximated by a mixture of multivariate normal densities given by
with , for all .
Before we state the theorem, one more terminology should be introduced. We recall the notion of the small region (see Castillo et al., 2015). In our setting, each belongs to the small region if When belongs to this region, the MLE, , is an asymptotically unbiased estimator and does not depend on the choice of different values of . When choosing the value of outsides the small region, this MLE is no longer asymptotically unbiased and will depend on the choice of (cf. see Theorem 11 of the supplementary material of Castillo et al., 2015). As a result, the posterior will concentrate near a distribution with center differing a lot with different values of .
Theorem 4.8 (Distributional approximation).
For , if satisfies (3.2) and , then for a positive constant ,
Note that the above theorem does not require that the cardinality of the set to be close to . The result still holds when .
In the previous two sections, we have shown that even if , the marginal posterior of can recover the truth and can be approximated by a mixture of multivariate normal densities. In this section, we derive conditions for selection consistency. Since selection consistency requires , we need to assume the dimension of the covariance and the coordinates in the non-zero groups are sufficiently small. We also need to assume that the smallest signal cannot be too small, which is a group sparse version of the Beta-min condition. This condition is stated in below:
The lower bound displayed in the condition is derived from (4.8). Unlike the Beta-min condition in Castillo et al. (2015) which the individual components are bounded away from 0, our condition allows a zero to be included in a non-zero group.
Theorem 4.9 (Selection consistency).
If the conditions of Theorem 4.9 are satisfied, then the marginal posterior distribution of in non-zero groups can be approximated by a multivariate normal distribution with mean and the covariance matrix . Therefore, credible intervals for can be obtained directly from the approximating multivariate normal density.
Proof of Lemma 3.1.
Let , then
Therefore, the probability in (3.5) can be also written as
Proof of Theorem 4.1.
The proof contains two parts. In the first part, we quantify prior concentration around the truth in the sense of Kullback-Leibler divergence from the true density. In the second part, using the results obtained from the first part, we derive (4.2) and (4.3).
Part I. The method we use to obtain the posterior contraction rate is described as follows. We construct a test from the first principle by breaking up the effective parameter space into several pieces which are sufficiently separated from the truth. Then for each piece, we consider the likelihood ratio test for the truth against a representative in the piece. We show that this test works for the entire piece by controlling the likelihood ratio. Finally, we consider the maximum of these tests and control its size by estimating the total number of pieces.
Let be a suitable “sieve”. We shall verify that
for positive constants and and the following condition.
Let and , where and . Then there exists a test such that