Bayesian semiparametric analysis of multivariate continuous responses, with variable selection
We develop models for multivariate Gaussian responses with nonparametric models for the means, the variances and the correlation matrix, with automatic variable selection based on spike-slab priors. We use the separation strategy to factorize the covariance matrix of the multivariate responses into a product of matrices involving the variances and the correlation matrix. We model the means and the logarithm of the variances nonparametrically, utilizing radial basis function expansions. We describe parametric and nonparametric models for the correlation matrix. The parametric model assumes a normal prior for the elements of the correlation matrix, constrained to lie in the space of correlation matrices, while the nonparametric model utilizes Dirichlet process mixtures of normal distributions. We discuss methods for posterior sampling and inference and present results from a simulation study and two applications.
The software we implemented can handle response vectors of arbitrary dimension and is freely available via the R package BNSP.
Keywords: Clustering; Covariance matrix models; Model averaging; Seemingly unrelated regression models; Separation strategy
Many systems are too complex to be adequately described by a single response variable. For instance, human reaction to a drug may require results on multiple blood tests, evaluation of student performance may require results on tests on diverse topics, and in medicine evaluation of the health of an individual may require multiple measurements. Multivariate response models would be needed for the analysis of data arising from these and many other experimental setups. Our main goal here is to develop Bayesian multivariate response models for continuous responses, assuming multivariate Gaussian distributions and with nonparametric models for the mean vectors and covariance matrices.
Modelling unconstrained means non-parametrically, as general functions of the covariates, is straightforward and by now fairly standard. In the work that we present here, nonparametric effects are represented as linear combinations of radial basis functions. Generally, our approach is to utilize a large number of basis functions, which enables flexible estimation of true effects that are locally adaptive. Potential over-fitting is mitigated by utilizing spike-slab priors for variable selection (see e.g. O’Hara and Sillanpää (2009) for a review of variable selection methods).
Modelling covariance matrices non-parametrically is not as straightforward as modelling the means, due to the positive definiteness constraint that complicates matters. To overcome this constraint and model the elements of the covariance matrix in terms of regressors, a first, necessary step is to decompose the covariance matrix into a product of matrices. Such decompositions include the spectral and Cholesky decompositions, and variations of the latter. Pinheiro and Bates (1996) review the spectral and Cholesky decompositions and several parameterisations. Based on the spectral decomposition and the matrix logarithmic transformation, Chiu et al. (1996) model the structure of a covariance matrix in terms of explanatory variables. Pourahmadi (1999) and Chen and Dunson (2003) describe two modifications of the Cholesky decomposition that result in statistically meaningful, unconstrained reparameterisations of the covariance matrix, provided that there is a natural ordering in the responses (Pourahmadi, 2007), as happens in longitudinal studies, where the time of observation provides this natural ordering.
The spectral and the modified Cholesky decompositions, outside the context of longitudinal studies, lack simple statistical interpretations, making it difficult for practitioners to incorporate their prior beliefs into the model. A decomposition, however, that is statistically simple and intuitive comes from the separation strategy of Barnard et al. (2000), according to which the covariance matrix is separated into a diagonal matrix of variances and a correlation matrix. This decomposition makes it easy to model the variances in terms of covariates, as the only constraint on them is positivity. Here we use a log-link and linear predictors that are constructed in the same way as for the mean parameters.
Chan et al. (2006) describe several reasons why allowing the variances to be general functions of the covariates is meaningful. Firstly, prediction intervals obtained from heteroscedastic regression models can be more realistic than those obtained by assuming constant error variance or, as Müller and Mitra (2013) put it, they can result in a more honest representation of uncertainties. Secondly, it allows the practitioner to examine and understand which covariates drive the variances, and in the multivariate response case, examine whether the same or different subsets of covariates are associated with the variances of the responses. Thirdly, modelling the variances in terms of covariates results in more efficient estimation of the mean functions, and lastly, it produces more accurate standard errors for the estimates of unknown parameters.
Our approach for variable selection and model averaging can be thought of as a generalization of the approach of George and McCulloch (1993), who describe methods for univariate linear regression, and the approach of Chan et al. (2006), who focus on methods for flexible mean and variance modelling for a single response. We develop an efficient stochastic search variable selection algorithm by using Zellner’s g-prior (Zellner, 1986) that allows integrating out the regression coefficients in the mean function. Further, in the Markov chain Monte Carlo (MCMC) algorithm that we have implemented, we generate the variable selection indicators in blocks (Chan et al., 2006) and choose the MCMC tuning parameters adaptively (Roberts and Rosenthal, 2001).
Of course, the separation of the variances from the correlations alone does not solve the problem of positive definiteness, as the constraint has now been transferred from the covariance matrix to the correlation matrix. Here, we place a normal prior on Fisher’s transformation of the elements of the correlation matrix, together with an indicator function that restricts the matrix to the space of correlation matrices; this restriction bounds the range of the correlations and induces dependence among them (Daniels and Kass, 1999). We rely on the ‘shadow prior’ of Liechty et al. (2004) to maintain positive definiteness. The model is intuitive and easy to interpret, allowing practitioners to represent their substantive prior knowledge.
However, the normal model for the correlations is quite restrictive, which can have a negative impact on the estimated correlations, especially in small samples (Daniels and Kass, 1999). Here, to achieve a nonparametric model for the correlation matrix, we consider mixtures of normal distributions for the transformed correlations. This is in the spirit of the ‘grouped correlations model’ of Liechty et al. (2004), who also propose a ‘grouped variables model’. The latter clusters the variables instead of the correlations and is more structured than the nonparametric grouped correlations model. Here, we consider both the grouped correlations and grouped variables models.
In what follows, we work with generic Dirichlet process (Ferguson, 1973) mixtures of normal distributions for the correlations, utilizing the stick-breaking construction (Sethuraman, 1994). However, one of the attractive features of the grouped correlations and variables models is that they allow the researcher to represent prior information and beliefs about the strength of correlations among variables and the general structure of the correlation matrix; see Liechty et al. (2004) and Tsay and Pourahmadi (2017) for examples of structured correlation matrices.
Our work is related to two further strands of the literature. The first one is known as ‘seemingly unrelated regressions’ (SUR) and originates from the work of Zellner (1962). The second one is known as ‘generalized additive models for location, scale and shape’ (GAMLSS) and originates from the work of Rigby and Stasinopoulos (2005).
Concerning SUR, Zellner (1962) showed how efficiency gains can be achieved by simultaneous estimation of linear regression equations, accommodating potentially correlated error terms. This gain in efficiency, measured in terms of the reduction in the variance of the estimates of the regression coefficients, can be substantial when the correlations among error terms are high and the covariates in different regression equations are not highly correlated. As the methodology presented in this paper is a Bayesian semiparametric version of Zellner’s model, similar gains are to be expected from our approach too, and these are investigated in a simulation study presented in Section 4.
GAMLSS, and its Bayesian analogue termed BAMLSS (Umlauf et al., 2018), provides a general framework for the analysis of data from a very wide class of univariate distributions, utilizing flexible models for the distribution parameters. The popularity of these methods stems from the fact that for most realistic problems, the assumption that the parameters are linearly dependent on the covariates, or even constant (as in homoscedastic regression), is not tenable. Applying this level of regression flexibility to multivariate response models is currently an active area of research. Smith and Kohn (2000) implemented the multivariate normal regression model with smooth additive terms in the mean function and with homoscedastic errors. Klein et al. (2015) present applications of the GAMLSS framework to bivariate regression with normal and t-distributed errors, and to Dirichlet regression. Klein and Kneib (2016) used copulas in bivariate response models, relating the parameters of the marginals and those of the dependence structure to additive predictors. Here, we focus attention on models with Gaussian errors and we develop a fully multivariate model with nonparametric models for the means, the variances and the correlation matrix, with automatic variable selection based on spike-slab priors.
The remainder of this paper is arranged as follows. Section 2 introduces the proposed model. Posterior sampling is discussed in Section 3. Section 4 presents results from a simulation study that examines the efficiency gains one may obtain when fitting a multivariate model instead of many univariate ones. The simulation study also examines the run times needed to fit a multivariate model instead of the univariate ones. In Section 5 we present two applications. The first one represents a standard situation where there is a multivariate response, and the objective is to estimate its dependence, in both the mean and the variance, on a common set of covariates. The second one is on graphical modelling, where the conditional independence properties of the inverse covariance matrix are combined with flexible regression modelling. The paper concludes with a brief discussion. The software that we used for obtaining the results in this paper is available in the R package BNSP (Papageorgiou, 2019).
2 Multivariate response model
Let y_i denote a p-dimensional response vector and let u_i and w_i denote covariate vectors, observed on the ith sampling unit, i = 1, …, n.
We model the mean of the jth response, E(y_ij), using
where j = 1, …, p and i = 1, …, n. As we detail below, the linear predictor may include parametric and nonparametric terms. Even though it may appear from (1) that all regression equations have the same set of predictors, the introduction of binary indicators for variable selection will allow each response to have its own set of covariates.
The implied model for the mean of the vector y_i is
The mean vector can also be written in an equivalent stacked form, with the design matrix and coefficient vector having the same structure as above, but with the response-specific quantities replaced by their stacked counterparts.
We let Σ_i denote the covariance matrix of the ith response vector
which, just like the mean function in (2), will be modelled in terms of a linear predictor and, additionally, in terms of a correlation matrix R.
The model specification is completed by assuming a normal distribution for the response vector
Alternatively, the model can be written in the usual form
where the stacked response vector, design matrix, coefficient vector and error vector are obtained by concatenation over the sampling units.
In the following subsections we detail how the mean and covariance functions are modelled nonparametrically.
2.1 Mean model
The mean function takes the following general form
where one group of regressors has parametrically modelled effects and the other group has effects that are modelled as unknown functions. The total number of regressors that enter the mean models is the sum of the sizes of the two groups.
Unknown functions are represented using
where the two vectors collect the basis functions and the corresponding regression coefficients.
In the current paper, the basis functions of choice are radial basis functions, which are functions of the distances of the covariate values from a set of knots.
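As an illustration, a radial basis design matrix can be computed as in the following sketch. The Gaussian form of the basis, the knot placement and the scale parameter are illustrative assumptions, not necessarily the exact basis used in the paper.

```python
import numpy as np

def rbf_design(x, knots, scale=1.0):
    """Design matrix of Gaussian radial basis functions.

    Column k evaluates phi_k(x) = exp(-(x - knot_k)^2 / (2 * scale^2)).
    The Gaussian form is one common radial basis; the specific basis in
    the paper may differ.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # (n, 1)
    knots = np.asarray(knots, dtype=float).reshape(1, -1)  # (1, K)
    return np.exp(-((x - knots) ** 2) / (2.0 * scale ** 2))

# An unknown function is represented as a linear combination of the columns:
x = np.linspace(0, 1, 200)
knots = np.linspace(0, 1, 20)        # many knots: flexible, locally adaptive
Z = rbf_design(x, knots, scale=0.1)  # (200, 20) basis matrix
coef = np.zeros(20); coef[5] = 1.0   # e.g. a single active basis function
fhat = Z @ coef                      # fitted curve
```

Using many knots with spike-slab selection on `coef` is what allows locally adaptive fits without over-fitting.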
Our general approach for representing unknown functions is to utilize a large number of basis functions; with this approach, under-fitting may be avoided. Our approach for dealing with over-fitting is to allow positive prior probability that the regression coefficients are exactly zero. The latter is achieved by the introduction of binary variables that allow coefficients to drop out of the model, with separate indicators for parametric and for nonparametric effects. The binary indicators are grouped in the same way as the regression coefficients after (6).
2.2 Covariance model
A first step in modelling the covariance matrices in terms of covariates is to employ the separation strategy of Barnard et al. (2000), according to which Σ_i is expressed as the product of a diagonal matrix of variances and a correlation matrix R,
The next subsections consider models for the diagonal elements of the variance matrix and for the correlation matrix R.
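The separation strategy itself is easy to sketch numerically. The following minimal example, with an arbitrary 2 × 2 covariance matrix chosen for illustration, factorizes a covariance matrix into a diagonal matrix of standard deviations and a correlation matrix, so that the two factors can be modelled separately.

```python
import numpy as np

def separate(Sigma):
    """Separation strategy (Barnard et al., 2000): Sigma = D R D,
    with D a diagonal matrix of standard deviations and R a
    correlation matrix with unit diagonal."""
    sd = np.sqrt(np.diag(Sigma))
    D = np.diag(sd)
    R = Sigma / np.outer(sd, sd)
    return D, R

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
D, R = separate(Sigma)
# The variances (diagonal of D squared) can now be modelled via a
# log-link in terms of covariates, separately from the correlations in R.
```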
2.2.1 Diagonal variance matrices
Modelling the diagonal matrices in terms of covariates is straightforward, as the only requirement on their elements is that they are positive. Hence, an additive model with a log-link may be utilised
where one group of covariates has parametric and the other nonparametric effects on the log-variance, and the total number of effects that enter the variance models is the sum of the sizes of the two groups. Further, the unknown functions of covariates are represented as linear combinations of radial basis functions and regression coefficients. Hence, model (8) may be written as
where the design and coefficient vectors are defined by stacking, as in the mean model.
Consider now vectors of indicator variables for selecting the covariates that enter the jth variance regression model. These are denoted in line with the indicator variables for the mean model.
Given the indicator variables, model (9) can be expressed as
Stacking over the p responses, the model for the vector of log-variances can be expressed as
where the design matrix consists of p rows, with the jth row containing the elements of the basis expansion that correspond to the non-zero elements of the jth indicator vector.
2.2.2 Common correlations model
Turning our attention now to the correlation matrix R, the first prior model we consider, termed the ‘common correlations model’, takes the following form
Here the indicator function restricts R to the space of correlation matrices, thereby ensuring positive definiteness, and the prior involves the normalizing constant
The transformation may be taken to be Fisher’s transformation, z(r) = (1/2) log{(1 + r)/(1 − r)}, considered within Bayesian hierarchical modeling by Daniels and Kass (1999). With this choice, the transformed correlations are unconstrained on the real line. Another choice is the identity function, which simplifies the model formulation.
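Fisher’s transformation and its inverse can be sketched as follows; the point of the transformation is that a normal prior on the transformed scale is unconstrained, while tanh maps back into (−1, 1).

```python
import numpy as np

def fisher_z(r):
    """Fisher's z-transformation: maps a correlation r in (-1, 1)
    to the whole real line; equals arctanh(r)."""
    return 0.5 * np.log((1.0 + r) / (1.0 - r))

def inv_fisher_z(z):
    """Inverse transformation: maps the real line back to (-1, 1)."""
    return np.tanh(z)

# A normal prior on z = fisher_z(r) is unconstrained, while the
# implied r = tanh(z) always lies in (-1, 1).
r = 0.7
z = fisher_z(r)
```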
The ‘common correlations model’ utilizes two parameters, a common mean and a common variance, to describe the distribution of the nonredundant elements of . As this can be restrictive, we consider two more flexible models, the ‘grouped correlations’ and ‘grouped variables’ models.
2.2.3 Grouped correlations model
The ‘grouped correlations model’ includes a clustering on the elements of , and it takes the form
where the mixture is over a number of correlation groups, each with its own group mean.
2.2.4 Grouped variables model
The ‘grouped variables model’ is another clustering model that clusters the variables instead of the correlations. The prior takes the form
where the variables are distributed into a number of groups, which in turn creates clusters of correlations.
A clustering on the variables is more structured than a clustering on the correlations: a clustering on the variables implies a clustering on the correlations, but the converse is not necessarily true. Figure 1 provides an example in which the response variables are grouped into two clusters. These two groups create three groups of correlations, two of which describe the correlations within each group and one that describes the correlations between the two groups.
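The implication from variable groups to correlation groups can be sketched directly. The configuration below (four variables in two groups) is a hypothetical example, not necessarily the one shown in Figure 1.

```python
import numpy as np

# Hypothetical example: p = 4 variables split into two groups,
# {0, 1} and {2, 3}. Each correlation (j, k) inherits the (unordered)
# pair of group labels of its two variables, so two variable groups
# induce three correlation groups: within group 1, within group 2,
# and between the groups.
var_group = np.array([0, 0, 1, 1])   # cluster label of each variable
p = len(var_group)

corr_group = {}
for j in range(p):
    for k in range(j + 1, p):
        corr_group[(j, k)] = tuple(sorted((var_group[j], var_group[k])))

n_corr_groups = len(set(corr_group.values()))
```

Reversing the direction fails: an arbitrary partition of the six correlations generally cannot be generated by any partition of the four variables, which is why the grouped variables model is the more structured of the two.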
2.3 Prior specification
The prior for the vector of mean regression coefficients is specified, following Zellner (1986), as
Further, the prior for the error variance is specified as inverse Gamma.
For the vectors of indicator variables, we specify independent binomial priors for each of its subvectors,
where the inclusion probability is common to the coefficients of each parametric effect and to the coefficients of each nonparametric effect. We work with Beta priors for the inclusion probabilities, although sparsity-inducing, zero-inflated Beta priors are also an attractive option.
Continuing with the priors on the covariance parameters, we specify independent normal priors for
Further, the priors we consider for the variance parameters are the half-normal and the inverse Gamma.
For the subvectors of we specify independent binomial priors
where, as before, the inclusion probability is common within each parametric and each nonparametric effect. We specify independent Beta priors for the inclusion probabilities.
For the remaining scale parameters we consider inverse Gamma and half-normal priors.
Lastly, we describe the priors on the parameters of the correlation models. Starting with the ‘common correlations model’ in (11), we place the following priors on its parameters
We take the ‘grouped correlations model’ to arise from the ‘common correlations model’ by treating the prior mean as a draw from an unknown distribution G. Here, we place a Dirichlet process (DP) prior on G (Ferguson, 1973). Due to the almost sure discreteness of the DP, the prior admits the following representation
where the representation involves point masses at the atoms. The prior weights are constructed utilising the so-called stick-breaking process (Sethuraman, 1994). Let v_1, v_2, … be independent draws from a Beta(1, α) distribution. Then the first weight equals v_1 and the kth weight equals v_k multiplied by the product of (1 − v_l) over l < k. We take the concentration parameter α to be unknown and we assign to it a gamma prior. Further, the atoms are generated from the so-called base distribution, here taken to be normal.
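The stick-breaking construction can be sketched as follows; the concentration value and the number of components are illustrative.

```python
import numpy as np

def stick_breaking(alpha, n_components, rng):
    """Stick-breaking weights (Sethuraman, 1994):
    v_k ~ Beta(1, alpha); pi_1 = v_1, pi_k = v_k * prod_{l<k} (1 - v_l).

    Each v_k breaks off a fraction of the stick length that remains
    after the first k-1 breaks.
    """
    v = rng.beta(1.0, alpha, size=n_components)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

rng = np.random.default_rng(0)
w = stick_breaking(alpha=2.0, n_components=50, rng=rng)
# The weights are nonnegative and sum to just under one for a large
# truncation; a truncated sampler sets the last weight to the
# remaining stick length so that the weights sum exactly to one.
```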
The ‘grouped correlations model’ of Section 2.2.3 is obtained by first writing
where the two vectors collect the group means and the weights, respectively.
In practice, we truncate the mixture to a finite number of components. In this case, the prior weights are constructed as before, except for the last one, which is set to the remaining stick length so that the weights sum to one. Further, we introduce allocation variables to indicate the component to which each correlation has been assigned. The stick-breaking weights provide the prior on the allocation variables. With these observations, it is clear how the model of Section 2.2.3 follows.
The development on the ‘grouped variables model’ is very similar, with the clustering now performed on the variables rather than the correlations.
In the simulation study and applications that we present in Sections 4 and 5, we use the following priors. For the mean regression coefficients we specify a multivariate analogue of the prior of Liang et al. (2008). For all inclusion probabilities we define Beta or uniform priors. The priors on the error variances are specified to be inverse Gamma, and for the remaining scale parameters we define half-normal priors. Lastly, the DP base distribution is taken to be the standard normal, while the concentration parameter is taken to have a gamma prior.
3 Posterior Sampling
To carry out posterior sampling we consider two likelihood functions and use the one that is more computationally convenient for each step of the MCMC algorithm.
We first consider the full likelihood, i.e. the one that involves all model parameters. The contribution of the ith observation, using decomposition (7), may be expressed as
where the stacked quantities are as defined above. Hence, the likelihood function, based on all n observations, is
To improve the mixing of the MCMC algorithm, we can integrate out the vector of mean regression coefficients from the likelihood (14), to obtain
where the additional quantity denotes the total number of columns in the mean design matrix.
A computationally more convenient way of evaluating this expression is provided by the following
where the two matrices involved are defined accordingly.
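To see why Zellner’s g-prior makes this integration tractable, the following sketch checks the closed-form marginal likelihood for a single-response linear model with known error variance against a direct multivariate-normal evaluation. The setup (dimensions, g, the known-variance simplification) is illustrative and not the paper’s exact multivariate expression.

```python
import numpy as np

# With y ~ N(X beta, s2 I) and the g-prior beta ~ N(0, g s2 (X'X)^{-1}),
# marginally y ~ N(0, s2 (I + g H)) with H = X (X'X)^{-1} X'. Since H is
# a rank-p projection: |I + g H| = (1 + g)^p and
# (I + g H)^{-1} = I - (g / (1 + g)) H, giving a closed form.
rng = np.random.default_rng(1)
n, p, g, s2 = 30, 3, 50.0, 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat (projection) matrix

def log_marginal(y, H, g, s2, p):
    n = len(y)
    quad = y @ y - (g / (1.0 + g)) * (y @ H @ y)
    return (-0.5 * n * np.log(2 * np.pi * s2)
            - 0.5 * p * np.log(1.0 + g)
            - 0.5 * quad / s2)

# Direct evaluation of the same marginal density for comparison.
C = s2 * (np.eye(n) + g * H)
direct = (-0.5 * n * np.log(2 * np.pi)
          - 0.5 * np.linalg.slogdet(C)[1]
          - 0.5 * y @ np.linalg.solve(C, y))
```

The closed form avoids any n × n determinant or inverse, which is what makes stochastic search over the variable selection indicators cheap.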
Sampling from the posterior of the parameters of the correlation matrices poses the greatest challenge. Consider, for instance, sampling from the posterior of the mean parameter of the ‘common correlations model’, given in (11), using the Metropolis-Hastings algorithm. For current and proposed values of the parameter, the acceptance probability will involve the ratio of the corresponding normalising constants, which can be very computationally demanding to calculate.
Posterior sampling, however, may be simplified by utilising the ‘shadow prior’ (Liechty et al., 2004). The basic idea is to introduce latent variables between the correlations and the mean parameter, by which prior (11) becomes
Further, the latent variables are assumed to be independently distributed as
and the variance of the shadow prior is taken to be a small constant. Sampling from the posterior of the mean will still involve the ratio of the normalising constants but, as was argued by Liechty et al. (2004), for a small shadow-prior variance this ratio can reasonably be approximated by one. In addition, sampling from the posterior of the mean given the latent variables is now straightforward. Hence, the computational burden is greatly alleviated.
To obtain a proposal density and sample from (18), we utilize the method of Zhang et al. (2006) and Liu and Daniels (2006). We start by considering a symmetric, positive definite and otherwise unconstrained matrix in place of R, assumed to have an inverse Wishart prior with mean equal to the realization of R from the previous iteration of the sampler. Given the inverse Wishart prior, we obtain the following easy-to-sample-from inverse Wishart posterior
We decompose the sampled matrix into a diagonal matrix of variances and a correlation matrix, and account for the Jacobian associated with this transformation. It follows that the joint density for the variances and the correlation matrix is
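The proposal mechanism can be sketched as follows: draw an unconstrained covariance matrix from an inverse Wishart distribution and keep only its correlation part as the candidate R. The dimensions and degrees of freedom are illustrative, and centring the inverse Wishart scale on the current R, as in the paper, is omitted for brevity (an identity scale is used instead).

```python
import numpy as np

rng = np.random.default_rng(2)
p, df = 4, 12

# A draw W ~ Wishart(df, I) gives Sigma = W^{-1}, an inverse Wishart
# draw with identity inverse-scale; in the sampler the scale would be
# chosen so that the prior mean equals the current R.
A = rng.standard_normal((df, p))
Sigma = np.linalg.inv(A.T @ A)

# Separate the unconstrained draw into standard deviations and a
# correlation matrix; only the correlation part is proposed.
sd = np.sqrt(np.diag(Sigma))
R_proposed = Sigma / np.outer(sd, sd)
```

Because the inverse Wishart draw is positive definite, the extracted R_proposed is automatically a valid (positive definite, unit-diagonal) correlation matrix, which is what makes this a convenient proposal.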