Flexible linear mixed models with improper priors for longitudinal and survival data
Abstract
We propose a Bayesian approach using improper priors for hierarchical linear mixed models with flexible random effects and residual error distributions. The error distribution is modelled using scale mixtures of normals, which can capture tails heavier than those of the normal distribution. This generalisation is useful to produce models that are robust to the presence of outliers. The case of asymmetric residual errors is also studied. We present general results for the propriety of the posterior that also cover cases with censored observations, allowing for the use of these models in the contexts of popular longitudinal and survival analyses. We consider the use of copulas with flexible marginals for modelling the dependence between the random effects, but our results cover the use of any random effects distribution. Thus, our paper provides a formal justification for Bayesian inference in a very wide class of models (covering virtually all of the literature) under attractive prior structures that limit the amount of required user elicitation.
stats.ME/1301.0463
Flexible linear mixed models
T1This research was partially supported by EPSRC. We gratefully acknowledge insightful comment from two referees.
and
Bayesian inference \kwdheavy tails \kwdMEAFT models \kwdposterior propriety \kwdskewness \kwdstochastic frontier models
class=AMS] \kwd62F15 \kwd62J05 \kwd62N01
Contents:
1 Introduction
Longitudinal and survival data appear in many application areas such as medicine, biology, agriculture, epidemiology, economics and small area estimation. This kind of observations typically present within and among subject dependence due to grouping and/or repeated measurements. Hierarchical linear mixed models (LMM) are often used to account for parameter variation across groups of observations as well as dependence. The general formulation of this type of model is given by
(1) 
where is the observed response for subject at time , with recording the number of repeated measurements for subject , and where denotes the number of subjects, is a vector of fixed effects, are mutually independent random vectors (often called random effects) and are i.i.d. residual errors. Let be the vector of response variables and and denote the known design matrices of dimension and , respectively, while is the vector of residual errors, and . We assume that , and throughout. In matrix notation we can write (1) as
These models are used in different fields under different names and notation. In Bayesian analysis of variance, the use of these models goes back to Tiao and Tan (1965). In survival analysis, the survival times grouped in are usually transformed to logarithms and modelled through
(2) 
which is often referred to as a Mixed Effects Accelerated Failure Time model (MEAFT, Komárek and Lesaffre, 2007). In this literature, the random effects are typically called frailties. An additional challenge that often appears in this case is the presence of censored observations. In the context of production and cost frontiers used in economics, the logarithm of the output (or cost) of firms are modelled using (1) with certain regularity restrictions on the regression coefficients and sign restrictions on the random effects , which have a clear interpretation here in terms of inefficiencies. These models are known as stochastic frontier models (see e.g. Fernández et al., 1997).
The random vectors and are often assumed to be normally distributed. Given that the normality assumption can be restrictive in practice (and even impossible for stochastic frontier models), alternative distributional assumptions have been explored, such as smooth flexible (seminonparametric) distributions for the random effects and normal residual errors (Zhang and Davidian, 2001), skew Student distributions for the random effects (Lee and Thompson, 2008), normal random effects with residual errors modelled through a finite mixture of normals (Komárek and Lesaffre, 2007, 2008), and residual errors and random effects jointly distributed as a multivariate scale mixture of skewnormal distributions (Lachos et al., 2010), which makes the residual errors and the random effects dependent. Jara et al. (2008) model both the residual errors and the random effects using multivariate skew–elliptical distributions, and Dunson (2010) and Jara et al. (2009) use Bayesian nonparametric approaches. Although flexible, Bayesian nonparametric approaches tend to be more computationally expensive than simpler flexible parametric models, while leading to similar conclusions as shown in our examples.
In a Bayesian framework with improper priors, Hobert and Casella (1996) and Sun et al. (2001) present conditions for the propriety of the posterior under a certain improper prior structure for the parameters of model (1) with normal assumptions on both the random effects and the residual errors. When using improper prior structures, it becomes essential to check the propriety of the posterior distribution in order to justify their use and interpretation in a probabilistic framework. Hobert and Casella (1996) characterise cases where certain improper prior structures used in practice lead to proper posteriors. The use of such priors was becoming common at that time since they do allow for the construction of a Gibbs sampler. However, Hobert and Casella (1996) show that the posterior may still be improper in some cases and, consequently, their use is not theoretically justified. Rubio (2015) presents extensions of these propriety results to the use of certain classes of flexible random effects distributions. Hobert and Casella (1996) and Rubio (2015) assume independence of the random effects, while Sun et al. (2001) assume a specific structure for the covariance of the random effects. Fernández et al. (1997) proposed an improper prior structure which allows for using arbitrary random effects distributions, as long they are proper. More specifically, they adopt the following stochastic structure
(3) 
where is proper, and is proper or bounded. The condition that is proper is rather mild as this allows for using any parametric distribution for the random effects with proper priors on the “deeper” parameters that index the random effects distribution. The propriety of is essential since an improper never leads to a posterior (see Theorem 2 from Fernández et al., 1997). The prior structure on is interesting as it includes priors with useful properties such as invariance with respect to reparameterisation under affine transformations. In this paper, we extend (1) by relaxing the assumption of normality of the residual errors and derive conditions for the existence of the corresponding posterior distribution that also allow for censored observations, covering virtually all models in the recent literature. Before presenting our results, we clarify our main contributions relative to our previous work in flexible linear regression models. Rubio and Genton (2016) study linear regression models with skewsymmetric errors and improper priors, covering the case with censored observations. Analogous results are presented in Rubio and Yu (2017) in the context of linear regression models with twopiece residual errors. Both papers focus on the effect of using skewed residual error distributions in terms of the prediction of the residual life of censored individuals. They do not study the propriety of the posterior distribution of linear regression models with random effects, which requires special attention as shown in this paper.
If not presented in the text, proofs are provided in the Supplementary Material (Appendix A).
2 Multivariate scale mixtures of normals: a warning
In the context of linear regression, a common strategy for relaxing the assumption of normality of the residual errors consists in using multivariate scale mixtures of normals (Lachos et al., 2010). Recall that a variate scale mixture of normal distribution () with location parameter , symmetric positive definite scale matrix , and parametric mixing distribution is defined through
For example, when is a Gamma distribution with parameters (and mean one), we obtain the variate Student distribution with degrees of freedom. Osiewalski and Steel (1993) and Breusch et al. (1997) have pointed out inferential peculiarities using such error distributions, both in Bayesian and classical regression settings. Here we explore the implications of using multivariate scale mixtures of normal in the context of LMM.
Consider the following hierarchical structure for model (1)
(4)  
(5) 
where is proper. For our theoretical results, it does not matter whether is assumed to be a fixed distribution without further recourse to deeper parameters or whether it is a marginal distribution obtained from a proper joint distribution. In practice, the latter is typically the case and it is derived from a proper parametric distribution with a proper prior on its parameters. Throughout Sections 2 and 3, we will only refer to for the sake of simplicity of notation.
We adopt the following prior structure
(6) 
where , is bounded, and is a proper prior with support . The following result shows that the marginal likelihood of the data for this hierarchical model can be factorised as the product of the marginal likelihood under normality multiplied by a quantity that does not depend on the data.
Remark 1.
The factorisation of the marginal likelihood of the data can be used to obtain conditions for the propriety of the posterior distribution as follows.
Corollary 1.
Consider the model (1), (4)(6) and let denote the entire design matrix, and suppose that is proper. Consider also the following conditions:

,

,

.
Then, conditions (a) and (c) are necessary for the propriety of the posterior and conditions (a), (b), and (c) are sufficient for the propriety of the posterior.
For , condition (c) is satisfied by any mixing distribution. However, for this condition may rule out the use of some mixing distributions and/or priors on , as discussed in the Supplementary Material (Appendix B). Remark 1 and Corollary 1 point out some issues with this extension. If we use Bayes factors to compare two models and of the type described in Remark 1, with different distributions for the residuals and corresponding marginal likelihoods and , we obtain (using obvious notation)
Therefore, the Bayes factor is determined solely by the mixing distributions and their priors, but not by the data (!). Intuitively, these results indicate that the data do not contain information about the additional shape parameter . The factorisation (7) indicates that the use of this kind of multivariate extension is vacuous, in the sense that the additional shape parameters do not really help model selection. Similar issues with the use of certain classes of multivariate distributions and priors have been studied in Maruyama and Strawderman (2014) for linear regression models. The next section presents an alternative extension that overcomes such problems.
3 The proposed extension: flexible linear mixed models
The inferential issues pointed out in the previous section are due to the use of a single mixing variable in the construction of the joint distribution of the residual errors. In other words, (5) really implies only a single realisation of the mixing variable, irrespective of the size of . Thus, we only have a single observation of the mixing variable, and inference on its (precision) parameter is impossible from the sample. One way to avoid these issues consists of using one mixing variable for each individual error. In this case, each observation contributes to the estimation of the parameter of the distribution of the mixing variable (see West, 1984 for a discussion in the context of Bayesian linear regression). Thus, we consider the model in (1) and (4), but replace (5) by
(8) 
This is a rather wide class of symmetric and unimodal distributions, with tails which are either normal or fatter than normal, covering e.g. the normal, Student, logistic, Laplace, Cauchy and the exponential power family with power (see Fernández and Steel, 2000 for more details). These distributions naturally lead to models that are robust to the presence of outliers (see Rosa et al., 2003, who also adopt residual errors, but restrict their study to the use of proper priors and normal random effects).
Using the same prior structure (6), we derive the following conditions for the propriety of the posterior distribution.
Theorem 1.
Condition (a) indicates the need for repeated measurements (otherwise and for sensible choices of ), while conditions (b) and (c) characterise the priors and mixing distributions that produce a proper posterior. Given that all the distributional assumptions in Theorem 1 are continuous, it follows that condition (d) is satisfied with probability one. Condition (d) is equivalent to saying that has no solution, which can be checked numerically.
The only requirement on the distribution of the random effects is that the marginal is proper, so as stated before, we can use an arbitrary parametric distribution , with a proper prior on the parameter . This includes, for example, the use of finite mixtures of normals, scale mixtures of normals, skewed scale mixtures of normals, truncated distributions, etc. In our examples, we explore the use of distributions that can capture departures from normality in terms of asymmetry and tail behaviour.
3.1 Mixed effects accelerated failure time models with censoring
The propriety result presented in Theorem 1 can be extended to models used in survival analysis. Let be a sample of survival times. The MEAFT in (2) is exactly the same as (1) for the log survival times. However, a common difficulty that arises with survival regression models is the presence of censored observations (Komárek and Lesaffre, 2007; Vallejos and Steel, 2015). If the sample of survival times does not contain censored observations, the existence of the posterior is provided by Theorem 1. If the sample contains both censored and uncensored observations, the propriety of the posterior can be based on the uncensored observations as follows.
Theorem 2.
The following result presents conditions for the case when the sample contains only censored observations.
Theorem 3.
Suppose that survival times , , are observed as closed intervals , and the rest of the observations exhibit another type of censoring. Thus, let be finitelength intervals on the positive real line, and let represent the design matrix associated to the interval–censored observations. Consider the model (2), (4), (8) with prior (6), assumed to be proper, and the following condition:

The set and the column space of are disjoint.
Conditions (a)–(c) from Theorem 1 together with (d’) are sufficient for the propriety of the posterior.
Condition (d’) means that there is no –unobserved– true response that can be written as a linear combination of the columns of and , and can be relaxed in terms of the dimensionality as follows.

There exists a set , , for some set of indexes , such that and the column space of are disjoint, where and are the design submatrices associated to the indexes .
Another way of checking condition (d’) consists of formulating it as a linear programming problem. Denote , , and , . Define the problem:
Find  
Subject to  
and  (9) 
Then, condition (d’) is equivalent to verifying the infeasibility of (3.1), for which several theoretical and numerical tools are available.
3.2 Stochastic frontier models
The stochastic frontier model with composed error takes the form of (1) where with . Several specifications of interest in the context of economics are studied in Fernández et al. (1997). The propriety results in Theorem 1 apply to this model, which represents an extension of Theorem 1 in Fernández et al. (1997).
3.3 Asymmetric errors
So far, we have considered symmetric extensions of the standard linear mixed model with normal errors. We now extend to asymmetric errors by using the class of two–piece distributions. Let be a continuous density with a unique mode at , support on and shape parameter . A random variable is distributed according to a twopiece distribution (denoted ) if its density function can be written as:
(10) 
where denotes the indicator function, and are positive differentiable functions. This density is continuous, unimodal, with mode at , scale parameter , and skewness parameter . It coincides with the density when , and is asymmetric for while retaining the tails of in each direction. The most common choices for and correspond to the inverse scale factors parameterisation , (Fernández and Steel, 1998), and the epsilonskew parameterisation , (Mudholkar and Hutson, 2000). Rubio and Steel (2014) study other parameterisations as well as some interpretable choices for the prior of the skewness parameter .
Suppose now that in (10) is a univariate scale mixture of normals with mixing distribution and consider the model (1) with the following error structure
(11) 
and prior
(12) 
where , is bounded, and and are proper priors with
Theorem 4.
Condition (e) is automatically satisfied for and any parameterisation . For the epsilonskew parameterisation, condition (e) is satisfied for any and any proper prior given that the functions and are upper bounded. Thus, the introduction of skewness does not destroy the existence of the posterior distribution while allowing for more flexibility.
4 Numerical examples
4.1 Simulation study I (Student errors, skewed random effects)
In this section we conduct a simulation study, inspired by that presented in Zhang and Davidian (2001), in order to assess the effect of different distributional assumptions on the random effects and the residual errors. We study the model:
(13) 
where , , , , , . According to Zhang and Davidian (2001), represents a case where the amount of information available to estimate both the fixed effects and the random effects is modest. Four scenarios are considered for the distribution of the residual errors and the random effects . For the first scenario we simulate from and ; where represents a Student distribution with location parameter , scale parameter and degrees of freedom, and represents a two–piece normal with location parameter , scale parameter and skewness parameter using the epsilonskew parameterisation. For the second scenario we use and . The third scenario consists of and . The fourth scenario uses and . We simulate data sets under these configurations. For each of these simulated samples, the model (13) was fitted assuming that and with the prior structure:
(14) 
where is the weakly informative priors for the degrees of freedom of a Student distribution proposed in Rubio and Steel (2015), is a uniform prior on , is a halfCauchy density with mode and unit scale parameter, which has been shown to induce a posterior with good frequentist properties in the context of Bayesian hierarchical models (Polson and Scott, 2012). Finally, is a uniform prior on , which represents a weakly informative prior as described in Rubio and Steel (2014). The prior in (21) is consistent with prior (6) and ensures a proper . The propriety of the posterior is then guaranteed by Theorem 1. For each of the simulated sets we obtain a posterior sample of size using an adaptive Metropolis within Gibbs algorithm (Roberts and Rosenthal, 2009) using the ‘spBayes’ R package (Finley et al., 2007). The posterior samples are obtained after a burnin period of iterations and thinned every iterations in order to reduce correlation ( draws in total). Table 1 presents summary statistics of the posterior samples, the median (over the datasets) Bayes factors in favour of (symmetric random effects), calculated using the SavageDickey density ratio, and the average odds , where , which provides information about the appropriateness of the assumption of normal tails for the residual errors. The SavageDickey density ratio (Verdinelli and Wasserman, 1995) is calculated as the ratio of the marginal posterior density function and the prior density of the parameter under consideration () evaluated at the point of interest (). In order to evaluate the posterior density, we employ a kernel density estimator (with Gaussian kernel) of the posterior samples. The use of the (standard) SavageDickey density ratio is valid in our setting as we are using independent proper priors for the parameters that differ across models (see Verdinelli and Wasserman, 1995). Posterior medians are presented instead of posterior means as the existence of the posterior mean is not guaranteed in all the scenarios. From this table, we can observe that despite the relatively small sample size, these Bayesian model selection criteria correctly identify the true model in each scenario.
In order to assess the impact of the assumption of normality, we implement the model with normal residual errors and random effects. Table 2 summarizes posterior inference for this model. The misspecification of the distribution of the residual errors leads to an overestimation of , which is in line with the fact that this scale parameter has a different interpretation in normal and Student cases, while the presence of unmodelled skewness affects the estimation of the location parameter . Note also that misspecification dramatically increases the uncertainty about the fixed effects.
Scenario 1  Scenario 2  Scenario 3  Scenario 4  
Parameter  
1.99 (1.96,2.04)  2.00 (1.98,2.03)  2.00 (1.96, 2.03)  2.00 (1.98,2.03)  
0.99 (0.74,1.23)  0.99 (0.80,1.20)  1.01 (0.75, 1.21)  1.00 (0.76,1.19)  
0.50 (0.42,0.57)  0.48 (0.44,0.52)  0.50 (0.42,0.57)  0.48 (0.44,0.52)  
1.97 (1.61,2.71)  31.54 (13.64, 203.5)  1.96 (1.60,2.70)  30.48 (13.73,203.4)  
1.50 (1.97, 1.20)  1.50 (1.92,1.22)  1.47 (1.91,1.04)  1.50 (1.95,1.11)  
0.49 (0.38,0.60)  0.49(0.41,0.57)  0.48 (0.37,0.59)  0.49 (0.41,0.59)  
0.50 (0.00,0.79)  0.54 (0.04,0.79)  0.00 (0.51,0.58)  0.01 (0.52,0.41)  
Median BF  
0.78  0.43  1.98  2.50  
Median odds  
0  17.54  0  19.02 
Scenario 1  Scenario 2  Scenario 3  Scenario 4  
Parameter  
2.00 (1.89,2.09)  2.00 (1.98, 2.03)  2.00 (1.89, 2.10)  2.00 (1.97,2.02)  
0.99 (0.05,2.24)  1.07 (0.13,2.39)  1.08 (0.05,2.05)  1.01 (0.78,1.20)  
1.53 (1.01,3.05)  0.50 (0.47,0.53)  1.53 (1.01,3.05)  0.50 (0.46,0.53)  
3.41 (4.21,2.82)  3.51 (4.28,2.97)  1.53 (2.18,0.92)  1.50 (1.65,1.35)  
2.57 (2.18,3.00)  5.27 (2.24,2.94)  2.44 (2.07,2.86)  0.50 (0.42,0.59) 
4.2 Simulation study II (Student errors and random effects with censoring)
In this section, we present a simulation study with censored responses. We consider a model which is used in practice for modelling the evolution of a marker in longitudinal studies (see e.g. Vaida and Liu, 2009), in particular
where and are mutually independent, , and . We generate data using two scenarios: (I) , and ; and (II) , and . The theoretical values of the regression parameters are . We simulate 100 data sets under these two configurations and truncate the values that are greater than . This truncation strategy produces samples with 7%–18% censored observations. Then, we fit the following four models: Model 1, , and , Model 2, , and , Model 3, , and and Model 4, , . For the most general model (Model 4) we adopt the prior structure:
(15) 
where and are the weakly informative priors for the degrees of freedom of a Student distribution proposed in Rubio and Steel (2015), is a halfCauchy density with mode and unit scale parameter. For Models 13 we use the obvious reduction of this prior. The propriety of the posterior is then guaranteed by Theorem 2.
Tables 3 and 4 (with a star indicating the model used to generate the data) give a summary of the posterior results.
Parameter  Model 1*  Model 2  Model 3  Model 4 

2.49 (2.37,2.61)  2.50 (2.37,2.61)  2.50 (2.38,2.61)  2.49 (2.37,2.61)  
2.99 (2.90,3.08)  2.99 (2.91,3.08)  2.99 (2.91,3.07)  2.98 (2.90,3.07)  
3.51 (3.39,3.60)  3.51 (3.39,3.60)  3.51 (3.39,3.60)  3.51 (3.39,3.60)  
3.99 (3.90,4.14)  3.99 (3.90,4.14)  3.99 (3.89,4.14)  4.00 (3.90,4.14)  
4.49 (4.40,4.62)  4.50 (4.41,4.63)  4.49 (4.40,4.63)  4.49 (4.40,4.62)  
0.50 (0.47,0.54)  0.49 (0.46,0.53)  0.50 (0.47,0.54)  0.48 (0.44,0.53)  
–  59.51 (29.15,201.12)  –  32.73 (11.28,212.87)  
0.25 (0.20,0.31)  0.25 (0.20,0.31)  0.22 (0.15,0.28)  0.22 (0.16,0.28)  
–  –  11.33 (4.64,97.28)  10.77 (4.41, 18.78) 
Table 3 suggests that using a more complex model than necessary does not affect the inference on the parameters in the simple model (Model 1), which is encouraging, as we do not really seem to lose anything by allowing for flexibility. We merely find out that the assumed Student distributions have large degrees of freedom parameters, which makes them close to normal.
Parameter  Model 1  Model 2  Model 3  Model 4* 

2.45 (2.06,2.67)  2.49 (2.33,2.68)  2.49 (2.18,2.69)  2.50 (2.35,2.66)  
2.96 (2.55,3.26)  3.00 (2.83,3.22)  3.00 (2.65,3.34)  3.00 (2.85,3.20)  
3.43 (3.08,3.70)  3.48 (3.31,3.68)  3.47 (3.22,3.70)  3.49 (3.34,3.65)  
3.98 (3.68,4.21)  4.00 (3.80,4.20)  4.00 (3.71,4.24)  4.01 (3.82,4.20)  
4.54 (4.18,4.87)  4.50 (4.31,4.67)  4.57(4.20,4.92)  4.50 (4.33,4.64)  
1.10 (0.87,2.64)  0.50 (0.41,0.58)  1.09 (0.87,2.47)  0.50 (0.43,0.58)  
–  1.91 (1.33,2.84)  –  1.96 (1.52,2.86)  
0.52 (0.26,1.52)  0.48 (0.34,0.76)  0.20 (0.06,0.40)  0.28 (0.18,0.41)  
–  –  1.65 (0.97,5.07)  2.17 (1.31,6.55) 
From Table 4 we conclude that wrongly assuming normality of the distributions tends to make inference less precise, especially if we get the tails of the residual errors wrong. It also seems that the (more latent) tail parameter of the random effects is harder to estimate than that of the ’s.
4.3 Framingham heart study
We now illustrate the performance of the proposed models with real data. We use the data set reported in Zhang and Davidian (2001), which consists of measurements of the cholesterol level for 200 randomly selected individuals from the Framingham heart study. The measurements were taken at the beginning of the study and then every 2 years for 10 years. We are interested in the relationship between the cholesterol level and the age (at baseline) and gender of the patients. Zhang and Davidian (2001) model this relationship through the LMM:
(16) 
where represents the cholesterol level divided by 100 at the th time for subject and is , with time measured in years from baseline. Zhang and Davidian (2001) assume normal residual errors and the density of the random effects is represented by a seminonparametric truncated series expansion.
Model (16) is simply the LMM in (1) with and . Here we adopt the following hierarchical structure:
where GC denotes a bivariate Gaussian copula with marginals and , and correlation parameter . For these marginal distributions we use a twopiece sinharcsinh distribution with the epsilon–skew parameterisation (Rubio et al., 2016; Rubio and Steel, 2015) which contains a scale parameter , a skewness parameter , and a kurtosis parameter , . We denote this model as Model 1. We adopt the prior structure:
(17) 
As a weakly informative prior for the degrees of freedom of the Student distribution, we employ the prior proposed in Rubio and Steel (2015). For each of the scale parameters ) we adopt a halfCauchy prior (Polson and Scott, 2012). The shape parameters are assigned the prior proposed in Rubio and Steel (2015) for a general class of kurtosis parameters. For each of the skewness parameters we adopt a uniform distribution on . In a bivariate Gaussian copula the Spearman measure of association, , can be calculated in closed form as
for which we assume a uniform prior , inducing the following prior density on
The propriety of the corresponding posterior distribution is guaranteed by Theorem 1. We also consider the following submodels: Model 2 (two–piece normal marginal random effects); Model 3 (one marginally normal random effect and one marginally twopiece normal random effect); Model 4 (normal random effects); and Model 5 (normal random effects and normal residual errors). Finally, Model 6 assumes normality of all distributions as well as independence between both random effects. A posterior sample of size was obtained after a burnin period of iterations and thinning every iterations ( simulations in total). Table 6 presents a summary of the posterior results.
The Bayes factors, calculated using the SavageDickey density ratio, slightly favour the general model. However, the evidence in favour of this model is not decisive (see Table 5). The Bayes factors associated to Models 4–6 (against Model 1) are not shown in Table 5 since they are virtually zero, but the Bayes factors associated to other submodels are presented in this table for illustration. Bayes factors clearly indicate support for skewness of and dependence of both random effects, which leaves us with Models 13. An argument of parsimony could, then, lead to selecting Model 3. Notice that the intervals for the general model are more spread out (e.g. for in Table 6). This indicates that the sample does not contain enough information to accurately estimate all the parameters of this model. The posterior probability of in Models 1 and 3 is approximately and for Model 2 is approximately , which suggests that models with heavier tails than normal are favoured.
To assess the predictive performance of the different models we use the log pseudomarginal likelihood (LPML), which is defined as the sum of the log conditional predictive ordinate (CPO) statistic for each observation. The CPO for subject is defined as the predictive density of the vector of observations given the rest of the observations. This is, , where the predictive density is evaluated at the vector of observations corresponding to the subject (see Bandyopadhyay et al., 2012). These quantities are approximated using a harmonic mean estimator as described in Bandyopadhyay et al. (2012). Table 6 shows the LPML for the different models, which favour Model 3. A larger sample would likely provide stronger evidence for model selection.
The analysis in Zhang and Davidian (2001) suggests departures from normality in the distribution of random intercepts, while the estimated random slope distribution can be approximated by a normal density. Our conclusions are in line with the conclusions of the parametric, seminonparametric and the nonparametric analyses in Zhang and Davidian (2001), Jara et al. (2008), and Jara et al. (2009). However, the models proposed here also allow us to interpret the meaning of the parameters, separating those controlling the shape of the distribution from those controlling the dependence between the random effects.
Model  

BF  0.04  0.65  0.77  1.04  0.02  0.63  0.42 
Model  1  2  3  4  5  6 

(intercept)  1.494 (1.218,1.794)  1.649 (1.436,1.973)  1.640 (1.366,1.878)  1.621 (1.336,1.957)  1.610 (1.223,1.939)  1.715 (1.399,1.980) 
(age)  0.016 (0.008,0.024)  0.012 (0.005,0.017)  0.013 (0.007,0.020)  0.017 (0.010,0.024)  0.018 (0.010,0.027)  0.015 (0.009, 0.023) 
(sex)  0.057 (0.154,0.044)  0.059 (0.162,0.040)  0.061 (0.159,0.042)  0.048 (0.159,0.074)  0.060 (0.172, 0.048)  0.015 (0.123,0.094) 
(time)  0.130 (0.011,0.384)  0.121 (0.002,0.321)  0.290 (0.243,0.336)  0.289 (0.241,0.337)  0.282 (0.233,0.327)  0.282 (0.235,0.329) 
0.163 (.146,0.182)  0.163 (0.147,0.181)  0.164 (0.147,0.182)  0.163 (.146,0.182)  0.209 (0.198,0.221)  0.209 (0.198,0.220)  
5.094 (3.513,8.095)  5.116 (3.522,8.153)  5.202 (3.586,8.459)  5.115 (3.524,8.235)  –  –  
0.275 (0.181,0.412)  0.372 (0.336,0.418)  0.372 (0.335,0.416)  0.384 (0.347,0.428)  0.380 (0.344,0.425)  0.376 (0.340,0.419)  
0.201 (0.075,0.829)  0.186 (0.133,0.244)  0.203 (0.144,0.257)  0.204 (0.147,0.261)  0.193 (0.125,0.254)  0.193 (0.127,0.252)  
0.331 (0.524,0.115)  0.345 (0.521,0.145)  0.339 (0.524,0.138)  –  –  –  
0.546 (0.972,0.277)  0.597 (0.969,0.086)  –  –  –  –  
0.820 (0.656,1.057)  –  –  –  –  –  
1.047 (0.630,3.407)  –  –  –  –  –  
0.425 (0.177,0.640)  0.397 (0.158,0.611)  0.383 (0.137,0.620)  0.397 (0.141,0.638)  0.4243 (0.168,0.686)  –  
LPML  22.81  25.24  19.12  26.75  30.88  32.98 
4.4 HIV1 viral load after unstructured treatment interruption
Here we revisit the data set analysed in Vaida and Liu (2009), which concerns the study of 72 perinatally HIVinfected children and adolescents. Some of the subjects present unstructured treatment interruption, which is a common phenomenon among HIVinfected people, due mainly to treatment fatigue (Vaida and Liu, 2009). The number of observations from baseline (month 0) to month 24 ranges from 13 to 71. Out of 362 observations, 26 observations (7%) were below the detection limits and were censored at these values. Vaida and Liu (2009) proposed the model:
where is the HIV RNA (HIV Ribonucleic acid, which is used to monitor the status of HIV) for subject at time , using , , , , , , and months.
We allow for Student tails in the random effects and the errors (with zero locations), and thus use Model 4 in Subsection 4.2 with the same prior structure as in (15). We also try the simpler models 13 as defined in Subsection 4.2.
The propriety of the posterior is ensured by Theorem 2.
Table 7 reveals clear evidence for fat tails in both distributions, especially the residual errors. The predictive criterion LPML also favours the general model and particularly penalizes model with normal residual errors. Clearly, ignoring the heavy tails affects the inference on the fixed effects, especially for .
Model  1  2  3  4 

3.62 (3.36,3.87)  4.00 (3.76,4.24)  3.70 (3.45,3.93)  4.16 (3.95,4.38)  
4.19 (3.93,4.43)  4.22 (4.00,4.45)  4.26 (4.02,4.50)  4.37 (4.18,4.56)  
4.26 (4.00,4.52)  4.26 (4.04,4.48)  4.34 (4.09,4.59)  4.41 (4.22,4.60)  
4.38 (4.12,4.64)  4.43 (4.20,4.65)  4.46 (4.21,4.70)  4.58 (4.38,4.77)  
4.61 (4.33,4.89)  4.53 (4.31,4.76)  4.68 (4.41,4.96)  4.68 (4.48,4.87)  
4.60 (4.30,4.89)  4.51 (4.28,4.74)  4.67 (4.38,4.95)  4.65 (4.45,4.85)  
4.70 (4.36,5.02)  4.70 (4.45,4.95)  4.77 (4.46,5.08)  4.84 (4.61,5.06)  
4.81 (4.40,5.22)  4.73 (4.44,5.03)  4.88 (4.47,5.28)  4.88 (4.62,5.14)  
0.60 (0.55,0.66)  0.20 (0.16,0.25)  0.61 (0.56,0.66)  0.20 (0.16,0.24)  
–  1.27 (0.97,1.66)  –  1.30 (1.01,1.69)  
0.90 (0.75,1.10)  0.88 (0.72,1.07)  0.66 (0.46,0.90)  0.56 (0.37,0.79)  
–  –  3.86 ( 1.65,16.81)  2.48 (1.24,7.71)  
LPML  397.4  291.4  401.5  287.6 
Finally, the predictive performance as measured by LPML is substantially better for our models 4 and 2 than for the best model in Bandyopadhyay et al. (2012), who obtain on the same data using the Skewcontaminated normal model of Lachos et al. (2010), which has normal tails for both random effects and error distributions. Also, their posterior distribution of is substantially shifted towards smaller values (consistent with our models 1 and 3, which impose normal errors).
5 Discussion
This paper focuses on hierarchical linear mixed models which are used frequently for both longitudinal and survival modelling in a number of application areas, such as medicine, epidemiology and economics. These models are often based on simple distributional assumptions, such as normality for both the residual errors and the random effects. There is ample evidence in the literature that such assumptions can seriously affect inference on the random effects and the predictive distribution (see e.g. Lee and Thompson, 2008; Zhang and Davidian, 2001) and we add to that evidence in our analysis of simulated data. Misspecification of the distribution of the residual errors and the random effects tends to have a substantial effect on the predictive distribution, which explicitly depends on the aforementioned distributions. The use of normal assumptions when the true generating model is not normal also has a marked effect on the estimation of the variance of the residual errors and the random effects, as well as a decrease in the precision of the estimation of the fixed effects.
We consider Bayesian inference for these models with flexible distributions and sensible prior structures. These priors are intended to convey a lack of strong prior information; as they are based on a combination of formal arguments (such as invariance) and pragmatic common sense, they end up being improper. Our results provide a formal justification for Bayesian inference in these wide classes of models: our models accommodate any proper distribution for the random effects (which could even be dependent, if required) combined with twopiece scale mixtures of normal distributions for the residual errors, as well as potential censoring. Our results thus cover all random effects distributions (both parametric and nonparametric) that were proposed in the literature except for those that are dependent on the residual errors (such as those in Lachos et al., 2010, which where used in Bandyopadhyay et al., 2012). The conditions for the propriety of the posterior distribution are rather mild and easy to check in practice, especially compared to those in previous papers (Hobert and Casella, 1996; Sun et al., 2001; Rubio, 2015), which obtain sets of conditions that combine the sample size, the number of clusters, the design matrix, and the value of the hyperparameters of the priors on the random effects. This also suggests that most of the posterior impropriety issues come from assuming improper priors on the “deeper” parameters of the random effects.
The analysis of simulated datasets leads to quite promising results, which suggest that priors are wellcalibrated and sensible, and Bayes factors seem to provide reliable indications. In addition, Appendix C in the Supplementary Material presents results for the case where both the residual errors and the random effects display skewness, which corroborate the results presented in the main text.
Inference on both real datasets strongly indicates departures from normality in that the residual error distributions have much fatter tails than normal. In addition, the Framingham heart data support dependence between the random effect components with normal random slope and skewnormal random intercept. The HIV data present clear evidence for heavytailed random effects. By using simple Student distributions on both errors and random effects, we obtain considerably better predictive results on the HIV data than those presented by Bandyopadhyay et al. (2012), using their preferred model. We remind the reader that the distributions considered in Bandyopadhyay et al. (2012) are multivariate scale mixtures of skewnormal distributions, so perhaps might be affected by the kind of issues we describe for multivariate scale mixtures of normals in Section 2. For the real data, evidence based on Bayes factors points in the same direction as the predictive performance measured by LPML, which is reassuring.
Here we have employed the R package ‘spBayes’ in all of our numerical examples, mainly for reasons of efficiency and simplicity. However, the user can also implement the models proposed here in any other “general purpose” MCMC software such as Stan (gradientbased MCMC), PROC MCMC (from SAS, which implements a random walk Metropolis), or any package which includes a MetropoliswithinGibbs sampler.
Supplementary material
Supplementary material includes proofs not mentioned in the text (Appendix A), conditions on the mixing distributions (Appendix B) and a further simulation study with skewed errors and random effects (Appendix C). The R codes used in the real data examples are available at http://www.rpubs.com/FJRubio.
Acknowledgements
We thank two referees, an Associate Editor, and the Editor for helpful comments.
Supplementary Material
Appendix A: Proofs
Proof of Remark 1
The marginal density of the data is given by
Consider the change of variable , with corresponding Jacobian . Then,
where we can identify the first factor as the marginal density of the data associated to the model with normal distributional assumptions on .
Proof of Theorem 1
For practical purposes we will work with instead of . The marginal density of the data is given by
(18)  
Denote and . Then, we can obtain the following lower bound
Consider the change of variable