# Robust Bayesian model selection for heavy-tailed linear regression using finite mixtures

## Abstract

In this paper we present a novel methodology to perform Bayesian model selection in linear models with heavy-tailed distributions. We consider a finite mixture of distributions to model a latent variable where each component of the mixture corresponds to one possible model within the symmetrical class of normal independent distributions. Naturally, the Gaussian model is one of the possibilities. This allows for a simultaneous analysis based on the posterior probability of each model. Inference is performed via Markov chain Monte Carlo - a Gibbs sampler with Metropolis-Hastings steps for a class of parameters. Simulated examples highlight the advantages of this approach compared to a segregated analysis based on arbitrarily chosen model selection criteria. Examples with real data are presented and an extension to censored linear regression is introduced and discussed.

Departamento de Estatística, Universidade Federal de Minas Gerais, Brazil

Keywords: Scale mixtures of normal; Student-t; Slash; Penalised complexity priors; MCMC.

## 1 Introduction

Statistical practitioners generally use model selection criteria to select the best model in different applications. However, model selection has been shown not to be an easy task and each criterion performs better in different situations. For more complex models, it is not clear which criterion is preferable (Carlin:2006; Chen:2006; Gelman). Recently, Gelman studied and compared different model selection criteria and concluded that “The current state of the art of measurement of predictive model fit remains unsatisfying”. From their study it is clear that different criteria (AIC, DIC, WAIC) fail to select the most adequate model under a variety of circumstances, for example, settings with strong prior information, cases where the posterior distribution is not well summarised by its mean, or spatial and network setups (for more details, see Gelman). Other authors also comment on the model selection problem, e.g., “In summary, model choice is to Bayesians what multiple comparisons is to frequentists: a really hard problem for which there exist several potential solutions, but no consensus choice” (Carlin:2006); “we saw that no single measure is dominant in all three cases. The L measure performed better when the true model becomes more complex and BIC performed better when the true model is more parsimonious.” (Chen:2006).

Under the Bayesian paradigm, a more robust and elegant solution is available, at least in theory, by considering one “full model” that embeds all the individual models of interest. More specifically, this means that a multinomial r.v. with each category corresponding to one of the individual models is specified. This way, model selection may be performed based on the posterior distribution of this r.v., i.e., the posterior probability of each model. Nevertheless, this approach may be challenging in some cases, especially when the individual models have different dimensions and distinct parameters. Available solutions may need to rely on complicated reversible jump MCMC algorithms and, therefore, other model selection criteria may be preferred.

A simple and generally efficient solution may be obtained when a mixture distribution can be adopted for one of the model’s components (a parameter or latent variable) in a way that each mixture component corresponds to one of the individual models (see, for example, flavio; G&M). This typically leads to a simple and efficient solution, allowing for model selection to be based on the models’ posterior probabilities.

We consider a model selection problem concerning the specification of the error distribution in linear regression models. In particular, we consider different heavy-tailed distributions and the traditional Gaussian specification. Existing solutions rely on arbitrarily chosen model selection criteria (see lachos; basso2010robust; cabral2012multivariate), which motivates the development of a more robust methodology.

The distributions of random errors and other random variables are routinely assumed to be Gaussian. However, the normality assumption is doubtful and lacks robustness, especially when the data contain outliers or show a significant violation of normality. Thus, previous works have shown the importance of considering more general structures than the Gaussian distribution for this component, such as heavy-tailed distributions (Fern:1999; Galea2003; Rosa:2003; Galea:2005; garay2015). These structures provide appealing robust and adaptable models, for example, the Student-t linear mixed model presented by pinheiro2001efficient, who showed that it performed well in the presence of outliers. Furthermore, the scale mixtures of normal (SMN) distributions have also been applied to a wide variety of regression models (see Lange93; osorio2007assessment; lachos2011estimation). This is one of the most important subclasses of the elliptically symmetric distributions. The SMN class contains many members with heavier-than-normal tails, such as the Student-t, Slash, power exponential and contaminated normal. Recently, lin2013estimation (see also lachos2011estimation) investigated the inference of a measurement error model under SMN distributions and demonstrated its robustness against outliers through extensive simulations.

As defined by Andrews74, a continuous random variable Y has a SMN distribution if it can be expressed as follows

 Y = μ + κ^{1/2}(U) W,

where μ is a location parameter, W is a normal random variable with zero mean and variance σ², κ(⋅) is a positive weight function, U is a positive mixing random variable with density h(u ∣ ν) and ν is a scalar or parameter vector indexing the distribution of U. As in Lange93 and Choy08, we restrict our attention to the case where κ(u) = 1/u, that is, the normal independent (NI) class of distributions. Thus, Y ∣ U = u ∼ N(μ, u⁻¹σ²) and the marginal pdf of Y is given by

 f(y ∣ μ, σ², ν) = ∫₀^∞ ϕ(y ∣ μ, u⁻¹σ²) h(u ∣ ν) du. (1)
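To make the mixture representation in (1) concrete, the sketch below evaluates the marginal SMN density by numerically integrating the conditional normal density against a mixing density h. As a sanity check it uses Gamma(ν/2, ν/2) mixing, which (as discussed below) should recover the closed-form Student-t density; this is an illustration, not the paper's code.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def smn_pdf(y, mu, sigma2, h_pdf):
    """Marginal pdf of Eq. (1): integrate N(y | mu, sigma2/u) against h(u)."""
    integrand = lambda u: stats.norm.pdf(y, loc=mu, scale=np.sqrt(sigma2 / u)) * h_pdf(u)
    val, _ = quad(integrand, 0.0, np.inf)
    return val

# Gamma(nu/2, rate nu/2) mixing should recover the Student-t density
nu = 5.0
h = lambda u: stats.gamma.pdf(u, a=nu / 2, scale=2 / nu)
print(smn_pdf(1.3, 0.0, 1.0, h))   # numeric SMN density at y = 1.3
print(stats.t.pdf(1.3, df=nu))     # closed-form Student-t density
```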

Note that when U is degenerate at 1, we retrieve the normal distribution. Following the steps of basso2010robust, we have the following properties for the SMN family

• If E[U^{−1/2}] < ∞, then E[Y] = μ,

• If E[U⁻¹] < ∞, then Var[Y] = σ²E[U⁻¹] = σ²k₂,

• If E[U⁻²] < ∞, then the excess kurtosis coefficient is given by

 γ₂ = E[(Y − E[Y])⁴]/(Var[Y])² − 3 = 3k₄/k₂² − 3,

where k_m = E[U^{−m/2}].
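The moment identities above can be verified numerically. The snippet below computes k_m = E[U^{−m/2}] for Gamma(ν/2, ν/2) mixing (the Student-t case discussed next) and checks that the resulting excess kurtosis 3k₄/k₂² − 3 reduces to the familiar 6/(ν − 4); a small sketch, not part of the paper.

```python
from math import gamma

def k_m(m, nu):
    """k_m = E[U^{-m/2}] for U ~ Gamma(nu/2, rate nu/2) (Student-t mixing)."""
    return (nu / 2) ** (m / 2) * gamma((nu - m) / 2) / gamma(nu / 2)

nu = 7.0
k2, k4 = k_m(2, nu), k_m(4, nu)
excess_kurtosis = 3 * k4 / k2 ** 2 - 3
print(k2)               # equals nu/(nu-2) = 1.4
print(excess_kurtosis)  # equals 6/(nu-4) = 2.0
```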

Apart from the normal model, we explore two different types of heavy-tailed densities based on the choice of the mixing distribution h(u ∣ ν).

• The Student-t distribution, U ∼ G(ν_t/2, ν_t/2).
The use of the Student-t distribution as an alternative robust model to the normal distribution has frequently been suggested in the literature (Lange89). For the Student-t distribution with location μ, scale σ² and degrees of freedom ν_t, the pdf can be expressed as

 f(y ∣ μ, σ², ν_t) = ∫₀^∞ ϕ(y ∣ μ, u⁻¹σ²) f_G(u ∣ ν_t/2, ν_t/2) du,

where f_G(⋅ ∣ a, b) is the Gamma density function with shape and rate parameters given by a and b, respectively. That is, the Student-t distribution is equivalent to the following hierarchical form:

 Y ∣ μ, σ², ν_t, u ∼ N(μ, u⁻¹σ²),  U ∣ ν_t ∼ G(ν_t/2, ν_t/2).

For the Student-t distribution we have that

 k_m = (ν_t/2)^{m/2} Γ((ν_t − m)/2) / Γ(ν_t/2),

therefore, the Student-t has variance σ²ν_t/(ν_t − 2), for ν_t > 2, and excess kurtosis 6/(ν_t − 4), for ν_t > 4.

• The Slash distribution, U ∼ B(ν_s, 1).
This distribution presents heavier tails than those of the normal distribution and it includes the normal case when ν_s → ∞. Its pdf is given by

 f(y ∣ μ, σ², ν_s) = ν_s ∫₀¹ u^{ν_s − 1} ϕ(y ∣ μ, u⁻¹σ²) du.

Thus, the Slash distribution is equivalent to the following hierarchical form:

 Y ∣ μ, σ², ν_s, u ∼ N(μ, u⁻¹σ²),  U ∣ ν_s ∼ B(ν_s, 1),

where B(ν_s, 1) denotes the Beta distribution. For the Slash distribution we have that

 k_m = 2ν_s/(2ν_s − m),

therefore, the Slash has variance σ²ν_s/(ν_s − 1), for ν_s > 1, and excess kurtosis 3(ν_s − 1)²/(ν_s(ν_s − 2)) − 3, for ν_s > 2.
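The closed form k_m = 2ν_s/(2ν_s − m) follows from a one-line integral against the Beta(ν_s, 1) density; the check below confirms it numerically for the variance factor k₂ = ν_s/(ν_s − 1). Purely illustrative.

```python
from scipy.integrate import quad

def k_m_slash(m, nu):
    # E[U^{-m/2}] for U ~ Beta(nu, 1), whose density is nu * u^(nu-1) on (0, 1)
    val, _ = quad(lambda u: u ** (-m / 2) * nu * u ** (nu - 1), 0.0, 1.0)
    return val

nu = 3.0
print(k_m_slash(2, nu))        # numeric integral
print(2 * nu / (2 * nu - 2))   # closed form: nu/(nu-1) = 1.5
```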

The SMN formulation described above is used in a linear regression approach by taking μ = Xβ, where β is the vector of regression coefficients and X is the design matrix.

The aim of this paper is to propose a general formulation to perform Bayesian model selection for heavy-tailed linear regression models in a simultaneous setup. That is achieved by specifying a full model which includes the space of all individual models under consideration - specified using the SMN approach described above. This way, the model selection criterion can be based on the posterior probability of each model. A mixture distribution is adopted for one of the full model’s variables, with each component of the mixture referring to one of the individual models. This approach has two main advantages when compared to an ordinary analysis where each model is fitted separately and some model selection criterion is used. Firstly, there is a significant gain in computational cost since we eliminate the need to fit all the individual models separately. Secondly, the proposed model selection criterion is fully based on the Bayesian paradigm, meaning that the model choice is based on the posterior probability of each model. This is more robust when compared to some other arbitrarily chosen model selection criteria such as DIC, EAIC, EBIC (Spiegelhalter), CPO (Geisser) and WAIC (Watanabe). The examples presented in the paper are meant to provide empirical evidence for this argument. The posterior distribution of the unknown quantities has a significant level of complexity, which motivates the derivation of an MCMC algorithm to obtain a sample from this distribution.

This paper is organised as follows: Section 2 presents the general model; Section 3 presents an MCMC algorithm to perform inference for the proposed model; a variety of simulated examples are presented in Section 4 and the analysis of two real data sets is shown in Section 5. Finally, Section 6 discusses some extensions of the proposed methodology.

## 2 Linear regression model with heavy-tailed mixture structured errors

Model selection is an important and complex problem in statistical analysis and the Bayesian approach is particularly appealing to solve it. In particular, the use of mixtures is a nice way to pose and solve the problem, whenever possible. It allows for an analysis where all models are considered and compared in a simultaneous setup without the need of complicated reversible jump MCMC algorithms. Note that, from (1), each model is determined by the distribution of the scale factor U, which suggests that a mixture distribution could be used for this latent variable. We present a general finite mixture model framework capable of capturing different behaviors of the response and of indicating which individual distribution is preferred.

### 2.1 The model

Define the n-dimensional response vector Y, the n × q design matrix X, the q-dimensional coefficient vector β and two K-dimensional vectors Z = (Z₁,…,Z_K) and p = (p₁,…,p_K). Finally, let diag(u⁻¹) be an n-dimensional diagonal matrix with i-th diagonal entry u_i⁻¹, i = 1,…,n. We propose the following general model:

 (Y ∣ Z_j = 1, U = u) ∼ N(Xβ, σ²γ_j diag(u⁻¹)), (2)
 (U_i ∣ Z_j = 1) ∼ iid F_j(ν_j), i = 1,…,n, (3)
 Z ∼ Mult(1, p₁,…,p_K), (4)
 γ_j = g_j(ν_j), j = 1,…,K, (5)

where Mult denotes the Multinomial distribution and each F_j represents a positive distribution controlled by parameter(s) ν_j, which may need to be truncated to guarantee that Y has finite variance under each model.

The particular structure chosen for the variance in (2) was thought of so that, for each j, the variance of Y is the same - σ². This is achieved through specific choices for the functions g_j and allows us to treat σ² as a common parameter to all of the individual models. Otherwise we would need one scale parameter for each model. Therefore, in our approach, since we have a common β and σ², the models will mainly differ from each other in terms of tail behavior, which will favor the model selection procedure.
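The standardising role of the γ_j can be checked directly: multiplying each component's variance σ²k₂ by γ_j = 1/k₂ returns σ² for every model. The arithmetic below uses the Student-t and Slash factors given later in (9); it is a sketch of the device, not the paper's code.

```python
sigma2 = 2.5
nu_t, nu_s = 6.0, 3.0

gamma_t = (nu_t - 2) / nu_t   # g(nu_t) for the Student-t component
gamma_s = (nu_s - 1) / nu_s   # g(nu_s) for the Slash component

# marginal variance of each component: sigma2 * gamma_j * k_2(model j)
var_t = sigma2 * gamma_t * nu_t / (nu_t - 2)
var_s = sigma2 * gamma_s * nu_s / (nu_s - 1)
print(var_t, var_s)   # both equal sigma2 = 2.5
```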

Note that each component of the mixture distribution of U corresponds to one of the models being considered. Model selection is made through the posterior distribution of Z. A subtle but important point here is the fact that there is no index i for Z. This means that we assume that all the observations come from the same model, which poses the inference problem in the model selection framework.

Another advantage of the simultaneous approach is that it allows the use of Bayesian model averaging (see RMV). This is particularly useful in cases where more than one model has a significant posterior probability. Note that the models we consider can be quite similar in some situations - especially for higher values of the degrees of freedom (df) parameters.

### 2.2 Prior distributions

The Bayesian model is fully specified by (2)-(5) and the prior distribution of the parameter vector (β, σ², p, ν_j), for j = 1,…,K. Due to the complexity of the proposed model, the prior distribution plays an important role in the model identifiability and selection process and, for that reason, needs to be carefully specified.

Prior specification firstly assumes independence among all the components of the parameter vector. Secondly, the standard priors β ∼ N_q(μ₀, τ₀²I_q) and σ² ∼ IG(a₀, b₀) are adopted.

The prior distributions of the tail behavior parameters require special attention. This type of parameter is known to be hard to estimate (see steel) and the most promising solutions found in the literature tackle the problem through special choices of prior distributions (see fonseca). Recently, rue proposed a general family of prior distributions for flexibility parameters which includes tail behavior parameters.

In this paper we adopt the penalised complexity priors (PC priors) from rue. In a simple way, the main principle of the PC priors is to prefer a simpler base model f₀ and penalise the more complex one f. To do so, the Kullback-Leibler divergence (KLD) (kullback) is used to define a measure of information loss when the simpler model f₀ is used to approximate the more flexible model f. The measure d(ν) = √(2 KLD(f ‖ f₀)) is defined to be a measure of complexity of model f in comparison to f₀. Further, an exponential density with rate λ is set for the measure d(ν). Finally, the prior distribution of ν_j is given by

 π(ν_j) = λ exp(−λ d(ν_j)) |∂d(ν_j)/∂ν_j|, j = 1,…,K.

martins showed that, in practice, for the Student-t regression model the PC prior can behave very similarly to the Jeffreys priors constructed by fonseca. Another interesting practical aspect of this prior is that the selection of an appropriate λ is done by allowing the researcher to control the prior tail behavior of the model. For example, for the Student-t distribution the user must select a threshold ν₀ and a probability α such that P(ν_t < ν₀) = α, in other words, how much probability mass is assigned to the heavier-tailed region of the parameter space. Clearly, the same procedure applies to any other distribution in the NI family that has a flexibility parameter. For more details on the PC priors see rue.

The prior distribution for p also requires special attention. Note that even in the extreme (unrealistic) case where Z is observed, it does not provide much information about p; in fact, it is equivalent to the information contained in a sample of size one from a Mult(1, p₁,…,p_K) distribution. The fact that Z is unknown aggravates the problem. A simple and practical way to understand the consequences of this is given by the following lemma, which is a generalisation of Lemma 1 from flavio where, to the best of our knowledge, this problem was first encountered.

###### Lemma 1.

For a Dirichlet prior distribution with parameters (a₁,…,a_K), the posterior mean of p_j is restricted to the interval (a_j/(a₀ + 1), (a_j + 1)/(a₀ + 1)), where a₀ = Σ_{k=1}^K a_k.

###### Proof.

See Appendix A. ∎

For example, if K = 3 and a_j = 1 for all j, then the posterior mean of each p_j is restricted to (0.25, 0.5). This result indicates that the estimation of p may be compromised by unreasonable choices of the a_j’s.

A reasonable solution for this problem is to use a Dirichlet prior distribution with parameters (much) smaller than 1, which makes it sparse. It is important, though, to choose reasonable values for the a_j’s in the light of Lemma 1. flavio claim that suitably small values of the a_j’s lead to good results and, in the cases where prior information is available, some of the a_j’s may be increased accordingly.
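The effect described by Lemma 1 is easy to visualise. Under a Dirichlet(a) prior and a single multinomial draw Z, the posterior is Dirichlet(a + Z), so the posterior mean of p_j can only range between a_j/(a₀ + 1) (when Z_j = 0) and (a_j + 1)/(a₀ + 1) (when Z_j = 1). The sketch below computes these bounds for a uniform and a sparse prior; it assumes the single-draw interpretation of Z used in the paper.

```python
import numpy as np

def posterior_mean_bounds(a):
    """Range of E[p_j | Z] under a Dirichlet(a) prior with one observed Z."""
    a = np.asarray(a, dtype=float)
    a0 = a.sum()
    lower = a / (a0 + 1)          # attained when Z_j = 0
    upper = (a + 1) / (a0 + 1)    # attained when Z_j = 1
    return lower, upper

lo, up = posterior_mean_bounds([1.0, 1.0, 1.0])
print(lo, up)   # uniform prior: means stuck in (0.25, 0.5)
lo2, up2 = posterior_mean_bounds([0.01, 0.01, 0.01])
print(lo2, up2)  # sparse prior: means can approach 0 or 1
```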

## 3 Bayesian Inference

We derive an MCMC algorithm considering the three most common choices in the NI family - Normal, Student-t and Slash. Nevertheless, based on the formulation presented in Section 2.1, including other possibilities is straightforward. One should be careful, however, as similarities among the individual models may lead to serious identifiability issues. The model is given by:

 (Y ∣ Z_j = 1, U = u) ∼ N(Xβ, σ²γ_j diag(u⁻¹)), (6)
 Z ∼ Mult(1, p₁, p₂, p₃), (7)
 U_i ∼ iid δ₁ if Z₁ = 1; G(ν_t/2, ν_t/2) if Z₂ = 1; B(ν_s, 1) if Z₃ = 1; i = 1,…,n, (8)
 γ_j = 1 if j = 1; (ν_t − 2)/ν_t if j = 2; (ν_s − 1)/ν_s if j = 3, (9)

where δ₁ denotes a r.v. degenerate at 1 and G and B are the Gamma and Beta distributions, respectively. We impose ν_t > 2 and ν_s > 1 so that Y has finite variance (σ²) under each individual model.

Inference is performed via MCMC - a Gibbs sampler with Metropolis-Hastings (MH) steps for the degrees of freedom parameters. Details of the algorithm are presented below.

### 3.1 MCMC

We choose the following blocking scheme for the Gibbs sampler:

 (p,Z,U),\boldmathβ,σ2,(νt,νs). (10)

This blocking scheme minimises the number of blocks among the algorithms with only one MH step (which is inevitable for the df parameters). The minimum number of blocks reduces the correlation among the components, which speeds up the convergence of the chain. Moreover, the most important and difficult step is the one that samples (p, Z, U), and sampling directly from its full conditional also favors the convergence properties of the chain.

The full conditional densities of (10) are all derived from the joint density of all random components of the model.

 π(Y, β, σ², p, Z, U, γ, ν_t, ν_s ∣ X) ∝ π(Y ∣ β, σ², Z, U, γ, X) π(U ∣ Z, ν_t, ν_s) π(Z ∣ p) π(p) π(ν_t) π(ν_s) π(β) π(σ²).

The first two terms on the right-hand side of (3.1) are given in Section 1, for each individual model (j = 1, 2, 3). The remaining terms are given in Section 2.

The full conditional distributions of β and σ² are easily devised and given by:

 (β ∣ ⋅) ∼ N_q(Σ_β((τ₀²I_q)⁻¹μ₀ + (√u ⊙ X)′(√u ⊙ y)/(γ_jσ²)), Σ_β),
 (σ² ∣ ⋅) ∼ IG(a₀ + n/2, b₀ + Σ_{i=1}^n u_i(y_i − X_{i·}β)²/(2γ_j)),

where Σ_β = ((τ₀²I_q)⁻¹ + X′diag(u)X/(γ_jσ²))⁻¹, √u is the n-dimensional vector with entries √u_i, ⊙ is the Hadamard product, which multiplies matrices of the same dimension term by term, and I_q is the identity matrix of dimension q.

The df parameters are sampled in an MH step with the following transition distribution (at the k-th iteration):

 q(ν_t^k, ν_s^k) = q(ν_t^k) q(ν_s^k), (12)
 q(ν_t^k) = (1 − Z₂)𝟙(ν_t^k = ν_t^{k−1}) + Z₂ f_N(ν_t^k; ν_t^{k−1}, τ_t²), (13)
 q(ν_s^k) = (1 − Z₃)𝟙(ν_s^k = ν_s^{k−1}) + Z₃ f_N(ν_s^k; ν_s^{k−1}, τ_s²), (14)

where f_N(⋅; a, b²) is the density of a normal distribution with mean a and variance b² evaluated at ⋅. The respective acceptance probability of a move is

 α(k−1 → k) = min{1, Z₁ + Z₂ π(ν_t^k ∣ ⋅)/π(ν_t^{k−1} ∣ ⋅) + Z₃ π(ν_s^k ∣ ⋅)/π(ν_s^{k−1} ∣ ⋅)}, (15)

where

 π(ν_t ∣ ⋅) ∝ π(U ∣ Z₂ = 1, ν_t) π(ν_t),
 π(ν_s ∣ ⋅) ∝ π(U ∣ Z₃ = 1, ν_s) π(ν_s).

This result is obtained by adopting the following dominating measure for both the numerator and the denominator of the acceptance probability: the product of the counting measure in the coordinate(s) that remain fixed and the Lebesgue measure in the coordinate that moves. The detailed balance condition, along with the fact that the chain is irreducible, makes this a valid MH algorithm (see tierney).

Note that, once we have the output of the chain, estimates of the df parameters will be based on the samples of ν_t for which Z₂ = 1 and of ν_s for which Z₃ = 1, which justifies the transition distributions in (12)-(14).
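One iteration of the df update (12)-(15) can be sketched as follows: only the df of the currently visited model is moved by a Gaussian random walk, and when the Normal model is visited (Z₁ = 1) the move is accepted with probability one because both proposals are degenerate. The log full conditionals `lt` and `ls` below are hypothetical placeholders, not the paper's actual targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_df_step(z, nu_t, nu_s, log_target_t, log_target_s, tau_t=0.5, tau_s=0.5):
    """One MH update of (nu_t, nu_s) following (12)-(15); z is one-hot."""
    prop_t = rng.normal(nu_t, tau_t) if z[1] == 1 else nu_t
    prop_s = rng.normal(nu_s, tau_s) if z[2] == 1 else nu_s
    if z[0] == 1:
        log_alpha = 0.0                                    # Z1 = 1: ratio is 1
    elif z[1] == 1:
        log_alpha = log_target_t(prop_t) - log_target_t(nu_t)
    else:
        log_alpha = log_target_s(prop_s) - log_target_s(nu_s)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return prop_t, prop_s
    return nu_t, nu_s

# hypothetical log full conditionals, respecting nu_t > 2 and nu_s > 1
lt = lambda v: -0.5 * (v - 5.0) ** 2 if v > 2 else -np.inf
ls = lambda v: -0.5 * (v - 3.0) ** 2 if v > 1 else -np.inf
print(mh_df_step([1, 0, 0], 4.0, 2.5, lt, ls))   # Normal visited: unchanged
```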

From (3.1), the full conditional density of (U, Z, p) is

 π(U, Z, p ∣ ⋅) ∝ π(y ∣ β, σ², Z, U, γ, X) [∏_{i=1}^n π(U_i ∣ Z, ν_t, ν_s)] π(Z ∣ p) π(p)
 ∝ [∏_{i=1}^n π(U_i ∣ ⋅)] (r₁p₁)^{Z₁} (r₂p₂)^{Z₂} (r₃p₃)^{Z₃} π(p).

Defining w = r₁p₁ + r₂p₂ + r₃p₃ and w_j = r_jp_j/w, for j = 1, 2, 3, we get

 π(U, Z, p ∣ ⋅) ∝ [∏_{i=1}^n π(U_i ∣ ⋅)] w₁^{Z₁} w₂^{Z₂} w₃^{Z₃} w π(p). (16)

We can sample from (16) using the following algorithm.

1. Simulate p from a density proportional to w π(p);
2. Simulate Z ∣ p ∼ Mult(1, w₁, w₂, w₃);
3. Simulate U from the density ∏_{i=1}^n π(U_i ∣ Z, ⋅);
OUTPUT (p, Z, U).

Steps 2 and 3 are straightforward once we have that:

 r₁ = ∏_{i=1}^n exp(−ỹ_i²/(2γ₁σ²));
 r₂ = ((ν_t − 2)/ν_t)^{−n/2} (ν_t/2)^{nν_t/2} [Γ((ν_t + 1)/2)]^n [Γ(ν_t/2)]^{−n} ∏_{i=1}^n (ỹ_i²/(2γ₂σ²) + ν_t/2)^{−(ν_t+1)/2};
 r₃ = ((ν_s − 1)/ν_s)^{−n/2} ν_s^n [Γ(ν_s + 1/2)]^n ∏_{i=1}^n (ỹ_i²/(2γ₃σ²))^{−(ν_s+1/2)} F_G(1 ∣ ν_s + 1/2, ỹ_i²/(2γ₃σ²)),

where ỹ_i = y_i − X_{i·}β and F_G(x ∣ a, b) is the distribution function of a Gamma distribution with parameters (a, b) evaluated at x. Moreover,

 (U_i ∣ Z₁ = 1, ⋅) ∼ δ₁;
 (U_i ∣ Z₂ = 1, ⋅) ∼ G((ν_t + 1)/2, ỹ_i²/(2γ₂σ²) + ν_t/2);
 (U_i ∣ Z₃ = 1, ⋅) ∼ G_{[0,1]}(ν_s + 1/2, ỹ_i²/(2γ₃σ²)),

where G_{[0,1]}(a, b) denotes a Gamma(a, b) distribution truncated to the interval [0, 1].

Step 1 is performed via rejection sampling (RS), proposing from the prior π(p) and accepting with probability w/max{r₁, r₂, r₃}, which is guaranteed to be at most 1 since w is a convex combination of the r_j’s. Simulated studies indicated that the algorithm is computationally efficient.
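The RS step can be sketched directly: propose p from the Dirichlet prior and accept with probability w(p)/max_j r_j, which is a valid bound because w(p) = Σ r_j p_j ≤ max_j r_j for any point p on the simplex. The r values below are arbitrary toy numbers, not outputs of the model.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_p(r, a):
    """Rejection sampler for Step 1: target density proportional to w(p)·π(p)."""
    r = np.asarray(r, dtype=float)
    bound = r.max()                       # w(p) <= max_j r_j on the simplex
    while True:
        p = rng.dirichlet(a)              # proposal from the Dirichlet prior
        if rng.uniform() < np.dot(r, p) / bound:
            return p

p = sample_p([0.2, 1.0, 0.6], [0.01, 0.01, 0.01])
print(p, p.sum())   # a point on the 3-simplex
```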

Monte Carlo estimates of the posterior distribution of Z (denoted by ρ), i.e. the models’ posterior probabilities, based on a sample of size M, are given by

 ρ̂_j = P̂(Z_j = 1 ∣ y) = (1/M) Σ_{m=1}^M 𝟙(Z_j^{(m)} = 1), j = 1, 2, 3.
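In code, this estimator is just the column-wise mean of the one-hot draws of Z kept from the Gibbs output; the toy draws below stand in for actual MCMC output.

```python
import numpy as np

# Z_samples: (M, 3) one-hot draws of Z from the Gibbs output (toy values here)
Z_samples = np.array([[0, 1, 0],
                      [0, 1, 0],
                      [1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1]])
rho_hat = Z_samples.mean(axis=0)   # Monte Carlo estimates of P(Z_j = 1 | y)
print(rho_hat)                     # [0.2 0.6 0.2]
```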

### 3.2 Practical implementation

The MCMC algorithm described in the previous section requires special attention to some aspects to guarantee its efficiency.

An indispensable strategy consists of warming up the chain inside each of the heavy-tailed models (Student-t and Slash). It contributes in several ways to the efficiency of the algorithm.

Firstly, it contributes to the mixing of the chain among the different models. If the chain starts at arbitrary values for the df parameters, it may move to high posterior density values for one of them while the other is still at a low posterior density value. This will make moves from the former model to the latter very unlikely, jeopardising the convergence. More specifically, one may take the sample mean of the df parameters from their respective warm-up chains, after discarding a burn-in, as the starting values for the full chain.

Secondly, the warm-up chains will achieve or approach local convergence (inside each model). This will significantly speed up the convergence of the full chain, whose main remaining purpose is the convergence of the Z coordinate.

Finally, the warm-up chains are a good opportunity to tune the MH steps of the df parameters. Given the unidimensional nature of the step and the random walk structure, the acceptance rates should be around 0.44 (see roberts).

### 3.3 Prediction

A common step in any regression analysis is prediction for a new configuration X_{n+1} of the covariates. This procedure is straightforward in an MCMC context, where a sample from the posterior predictive distribution of Y_{n+1} can be obtained by adding two simple steps at each iteration of the Gibbs sampler after the burn-in.

Let (β^{(m)}, σ²^{(m)}, Z^{(m)}, ν_t^{(m)}, ν_s^{(m)}) be the state of the chain at the m-th iteration after the burn-in. Then, for each m, firstly sample u_{n+1}^{(m)} from (8) and finally sample

 Y_{n+1}^{(m)} ∼ N(X_{n+1}β^{(m)}, σ²^{(m)} (Z′^{(m)}γ^{(m)}) (u_{n+1}^{(m)})⁻¹), (17)

where X_{n+1} is a row vector and γ^{(m)} = (γ₁, γ₂^{(m)}, γ₃^{(m)})′ is a column vector, so that Z′^{(m)}γ^{(m)} picks the variance factor of the model visited at iteration m.

One can also consider the posterior predictive distribution of Y_{n+1} under one particular model, for example, the one with the highest posterior probability. In that case, it is enough to consider the sub-sample of the sample above corresponding to the chosen model.
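The two added steps can be sketched as a single function: draw u_{n+1} from the mixing distribution selected by Z, then draw Y_{n+1} from (17). The parameter values below are hypothetical states of the chain, used only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)

def predictive_draw(x_new, beta, sigma2, z, gamma, nu_t, nu_s):
    """One draw of Y_{n+1}: sample u_{n+1} from (8) given Z, then from (17)."""
    if z[0] == 1:
        u = 1.0                               # degenerate at 1 (Normal model)
    elif z[1] == 1:
        u = rng.gamma(nu_t / 2, 2 / nu_t)     # G(nu_t/2, rate nu_t/2)
    else:
        u = rng.beta(nu_s, 1.0)               # B(nu_s, 1)
    var = sigma2 * np.dot(z, gamma) / u       # sigma2 * (Z' gamma) * u^{-1}
    return rng.normal(np.dot(x_new, beta), np.sqrt(var))

# hypothetical chain state: Student-t model visited (Z = (0, 1, 0))
beta = np.array([1.0, 0.5])
gamma = np.array([1.0, (6.0 - 2) / 6.0, (3.0 - 1) / 3.0])   # gamma_j as in (9)
y_new = predictive_draw(np.array([1.0, 2.0]), beta, 2.5, [0, 1, 0], gamma, 6.0, 3.0)
print(y_new)
```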

## 4 Simulated examples

In this section we introduce synthetic data examples to better understand the properties of the proposed methodology. Our goals are twofold: to provide strong empirical evidence that (1) as long as enough information is available, the true model is selected by the proposed methodology, and (2) the proposed methodology selects the correct or more adequate model more often than some traditional criteria.

We firstly present a study to see how well the proposed methodology correctly identifies the true model. A second study shows the performance of the criteria when the model has correlated covariates, which may cause problems in the estimation of the fixed effects and the variance parameter. Finally, a third synthetic data set is generated from a residual mixture model to investigate whether the model that better approximates the true mixture distribution is chosen.

### 4.1 Study I

In this study, data is generated from one of the proposed distributions: Normal, Student-t and Slash. We consider an intercept and two covariates, i.e. , where is a standard Normal random variable, is a Bernoulli random variable with parameter and . The regression coefficients are . Finally, for all models, the variance is set to . The synthetic data were generated from each of the following distributions:

1. Normal;

2. Student-t with degrees of freedom and ;

3. Slash with degrees of freedom and .

Different sample sizes are also considered - , , and , giving a total of 20 scenarios. The degrees of freedom for the Slash were chosen to minimise the Kullback-Leibler divergence between the Student-t with and , respectively.
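The device of matching a Slash to a given Student-t by minimising the Kullback-Leibler divergence can be reproduced numerically. The sketch below uses a closed form for the standard Slash density (obtained from the lower incomplete gamma function), evaluates KL(t_ν ‖ Slash_{ν_s}) on a grid of y values, and grid-searches over ν_s. Locations and scales are both fixed at 0 and 1, an assumption made here for illustration; this is not the paper's exact computation.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn, gammainc
from scipy.integrate import trapezoid

def slash_pdf(y, nu):
    """Closed-form Slash(0, 1, nu) density via the lower incomplete gamma."""
    a, x = nu + 0.5, y ** 2 / 2
    return nu / np.sqrt(2 * np.pi) * x ** (-a) * gammainc(a, x) * gamma_fn(a)

def kl_t_to_slash(nu_t, nu_s, y):
    """Numerical KL(t_{nu_t} || Slash_{nu_s}) on the grid y."""
    ft = stats.t.pdf(y, df=nu_t)
    fs = slash_pdf(y, nu_s)
    return trapezoid(ft * np.log(ft / fs), y)

y = np.linspace(-30, 30, 4000)        # even point count keeps y != 0
grid = np.linspace(1.2, 20, 400)
kl = np.array([kl_t_to_slash(4.0, ns, y) for ns in grid])
best = grid[int(np.argmin(kl))]
print(best)   # Slash df closest (in KL) to a Student-t with 4 df
```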

For each simulated scenario a Markov chain is run and a burn-in period is discarded. Convergence is checked using Geweke’s criterion (Gewe:eval:1992) since we ran only one chain. The same chain size, burn-in period and convergence verification were used for all the examples in the paper. Notice that the parametrisation adopted allows some parameters to be estimated using the whole chain, independently of the model that is visited in each iteration. This favors the convergence of the chain and reduces the MC variance of the estimates of the common parameters.

The summary posterior results of one run are presented in Table 1. They show that, as the sample size increases, the proposed methodology selects the correct model. Moreover, in the case where data is generated from the Normal distribution, not only is the correct model chosen in all but one case, but also the estimated degrees of freedom of the Student-t and of the Slash distributions are high - making these distributions similar to the Gaussian. Another important feature presented in Table 1 is that the degrees of freedom parameter of the generating model is well estimated. For the non-generating model, the degrees of freedom parameter is reasonably estimated, in the sense of making the respective model as close as possible to the true one. For example, when the data is generated from the Student-t, the Slash degrees of freedom is estimated close to the value that minimises the Kullback-Leibler divergence between the two distributions. Table 1 also emphasises that, for small sample sizes, there is not enough information about the tail behavior to clearly distinguish among the models.

To check the capability of the proposed methodology in selecting the correct model, we performed a Monte Carlo study with replicates of each of the generation schemes. Table 2 presents the Mean Square Error (MSE) of the posterior estimates of the model parameters, calculated considering only the replications in which the true model was selected. The MSE reported for the model probability is the mean square error between the posterior estimate of the true model’s probability and 1. From Table 2 we can see that, even for small sample sizes, the MSE values of the regression coefficients and of the variance indicate a very good recovery of the true values. The MSE values of the df parameters include some large values for small sample sizes, when there is not enough information to estimate the degrees of freedom precisely. The difference in the magnitudes of the two df parameters’ MSEs is explained by the difference in the scale of those parameters. Finally, the last column of the table shows the percentage of times that the correct model was selected (i.e. had the highest posterior probability).

### 4.2 Study II

This study investigates how the model selection procedure and the parameter estimation are affected in the presence of correlated covariates. We generate data from a model with , , and which induces an average correlation of between the two covariates.

We reproduce replicates of this scenario with different sample sizes. Table 3 shows the percentage of times that the proposed methodology selects each model and compares it with the other model selection criteria. It is clear that the traditional criteria have problems distinguishing between models with heavy tails even when the sample size increases, whilst the proposed methodology performs a robust selection, especially for large sample sizes where tail information is more abundant.

Figure 1 shows some results regarding the estimation of the regression coefficients. They are quite similar between the proposed methodology and the other model selection criteria. The same, however, does not happen when we look at the estimates of the variance σ². The poor performance of the other criteria in selecting the correct model is clearly reflected in the estimation of the variance, which is clearly inferior to the respective estimates obtained with the proposed methodology.

### 4.3 Study III

In this study the generating distribution for the error term is not one specific distribution, as in Sections 4.1 and 4.2, but a mixture of the Normal, Student-t and Slash distributions. More specifically, we consider , with the same , ’s and from Study I. The sample sizes considered are the same as in Study II.

Again, replications are generated. It is important to notice that our modeling framework to perform robust model selection cannot retrieve the generating model, since it assumes that all the residuals come from the same distribution. Nevertheless, a good fit may still be provided by one of the individual models. Table 4 shows the results of one of the replications. It is clear that the posterior distribution identifies the Student-t distribution as the best candidate model, especially as the sample size increases.

For sample size , Figure 3 shows the fit of the selected model (Student-t) and the other two models, Normal and Slash, fitted individually. It also shows the true generating distribution for the error term. It is clear that, although the selected model is different from the true generating distribution, by definition, it approximates the original one fairly well.

Table 5 shows that the proposed methodology consistently chooses the dominating model (Student-t) as the sample size increases. The same does not happen for the other model selection criteria.

## 5 Application

### 5.1 AIS

In this section we analyse a biomedical study on athletes from the Australian Institute of Sport (AIS) (cook). To exemplify our modeling we consider the body mass index (BMI) as the response and the percentage of body fat (Bfat) as the covariate, yielding a simple linear regression model.

We fit each individual model separately and the proposed mixture model. Results are presented in Tables 6 and 7. Note that, although the Slash model is chosen by all the criteria, and the estimates of the regression coefficients are similar between the individual fit and our model, significant, though not large, differences can be found in the estimates of the variance σ². This highlights the model averaging feature of our approach, which is particularly appealing when no single model is chosen with very high probability, as in this example.

### 5.2 Wage

The wage rate data set presented in mroz is used to extend our modeling framework to censored data. The data consist of the wage of 753 married white women, with ages between 30 and 60 years old in 1975. Out of the 753 women considered in this study, 428 worked at some point during that year. When the wives did not work in 1975, the wage rates were set equal to zero. However, it is considered that they may have had a latent wage in that year and, therefore, these observations are considered left-censored at zero. The considered response is the wage rate, and the explanatory variables are the wife’s age, years of schooling, number of children younger than six years old in the household and number of children between six and nineteen years old.

Since the Wage data is censored, we have the following characteristic for our response variables

 Y_i^{obs} = κ_i, if Y_i ≤ κ_i; Y_i, if Y_i > κ_i,

with κ_i = 0 for all i.

Suppose that, out of the n responses, C of them are censored. From a Bayesian perspective, these observations, Y_c, c = 1,…,C, can be viewed as latent and sampled at each step of the MCMC. Because of the model structure presented in (6)-(9), it is simple to notice that

 (Y_c ∣ Z_j = 1, u_c, β, σ², ν_j) ∼ TN(X_cβ, σ²γ_j u_c⁻¹; (−∞, κ_c]), c = 1,…,C, (18)

where TN(μ, σ²; (a, b]) denotes a Normal distribution truncated to the interval (a, b]. Therefore, we simply add a new sampling step to the blocking scheme as

 (p, Z, U), Y_C, β, σ², (ν_t, ν_s).

This simple extension allows our modeling framework to deal with any kind of censored data, where, for each type of censoring scheme, the new limits in (18) must be calculated.
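The added sampling step in (18) amounts to drawing from a one-sided truncated normal for each censored observation. A minimal sketch, using `scipy.stats.truncnorm` and hypothetical parameter values rather than actual chain states:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def sample_censored(mu, sd, kappa, size=1):
    """Draw latent Y_c from N(mu, sd^2) truncated to (-inf, kappa], as in (18)."""
    a, b = -np.inf, (kappa - mu) / sd        # standardised truncation limits
    return stats.truncnorm.rvs(a, b, loc=mu, scale=sd, size=size, random_state=rng)

# left-censored at zero, as in the wage data
draws = sample_censored(mu=1.0, sd=2.0, kappa=0.0, size=1000)
print(draws.max())   # every draw lies below the censoring point
```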

To obtain the final chain, a Markov chain is run and an initial portion of the draws is discarded as burn-in. The posterior estimate of the mixture weights gives the highest probability to the Slash component, which indicates the Slash distribution as the preferred one.

Table 8 summarises the posterior results. garay2015 studied this data set from a Bayesian perspective, fitting a variety of independent models in the NI family. In their study, the Slash distribution was also selected as the preferred one, as in our case. Moreover, the posterior mean estimates of the fixed effects parameters in Table 8 are very similar to the ones presented in garay2015, as is the statistical significance of each covariate. The wife's age, the number of children younger than six years old in the household and the number of children between six and nineteen years old tend to decrease the wage rate, while years of schooling tends to increase it. The posterior estimates obtained by garay2015 when fitting separate models also agree with our results. Our mixture model approach was able to correctly capture the Slash distribution without separately fitting the three models. Moreover, it provides a substantial computational gain, given the high posterior probability of the Slash model.

## 6 Conclusions and some extensions

Our proposed methodology has shown considerable flexibility in performing model selection for heavy-tailed data explained by covariates under a regression framework. Theoretical arguments, simulation studies and applications to real data sets make it clear that the methodology provides a robust alternative for selecting the best model, instead of relying on model selection criteria, which can be unstable (Gelman).

In Section 5.2 we extend the methodology to censored heavy-tailed regression, showing that the extension is straightforward and achieved by adding one simple step to the Gibbs sampler. Also, the extension of the algorithm described in Section 3.1 to include more distributions in the finite mixture is almost direct. Finally, it is clear from our results that this finite mixture idea can be used in a variety of problems where a common parametrisation exists for a family of distributions.

Besides the computational advantage of fitting one general model instead of separate models, we also emphasise that our robust model selection framework automatically performs multiple comparisons between the models, which is an advantage over using Bayes factors to perform pairwise comparisons of individually fitted models. Moreover, our approach also allows the use of model averaging.

Although the proposed methodology enriches the class of traditional censored regression models, we conjecture that it may not provide satisfactory results when the response exhibits asymmetry in addition to non-normal tail behavior. To overcome this limitation, extending the work to account for skewness is a possibility, for example by using the scale mixtures of skew-normal (SMSN) distributions proposed in lachos. Nevertheless, a deeper investigation of those modifications to the parametrisation and implementation is beyond the scope of this paper, but provides stimulating topics for further research. Another possibility for future research is to generalise this modeling framework to linear mixed models, e.g., with clustered, temporal or spatial dependence. These extensions are being studied in a separate manuscript.

## Appendix A Proof of Lemma 1

The posterior density of $p$ is given by

$$
f(p\mid Y=y)=\sum_{k=1}^{K} f(p\mid Y=y, Z_k=1)\, P(Z_k=1\mid Y=y).
$$

If we multiply both sides by $p_j$, integrate with respect to $p$ and use the fact that $p$ and $Y$ are conditionally independent given $Z$, we get

$$
E[p_j\mid y]=\sum_{k=1}^{K} E[p_j\mid Z_k=1]\, P(Z_k=1\mid Y=y),
$$

which is a weighted average of the $E[p_j\mid Z_k=1]$ and, therefore, implies that

$$
E[p_j\mid y]\in\Big(\min_k\{E[p_j\mid Z_k=1]\},\; \max_k\{E[p_j\mid Z_k=1]\}\Big).
$$

Now note that $E[p_j\mid Z_k=1]=\frac{\alpha_j+\mathbb{I}(k=j)}{\alpha_0+1}$, which is $\frac{\alpha_j+1}{\alpha_0+1}$ if $k=j$ and $\frac{\alpha_j}{\alpha_0+1}$ if $k\neq j$, where $\alpha_0=\sum_{k=1}^{K}\alpha_k$. This concludes the proof.
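The weighted-average bound above can be checked numerically. The sketch below assumes a Dirichlet($\alpha$) prior on the weights (standard for mixture weights, and an assumption here) and draws arbitrary posterior model probabilities as a stand-in for $P(Z_k=1\mid y)$.

```python
# Numerical illustration of the bound in Lemma 1: a convex combination
# of the conditional means E[p_j | Z_k = 1] stays between their min and max.
# A Dirichlet(alpha) prior on p is assumed purely for illustration.
import numpy as np

alpha = np.array([1.0, 2.0, 0.5])
alpha0 = alpha.sum()
j = 0

# E[p_j | Z_k = 1] under a Dirichlet prior: (alpha_j + 1{k=j}) / (alpha0 + 1)
cond_means = np.array([(alpha[j] + (k == j)) / (alpha0 + 1)
                       for k in range(len(alpha))])

rng = np.random.default_rng(1)
weights = rng.dirichlet(np.ones(len(alpha)))   # stand-in for P(Z_k = 1 | y)
posterior_mean = np.dot(weights, cond_means)

assert cond_means.min() <= posterior_mean <= cond_means.max()
```

Any choice of non-negative weights summing to one satisfies the same inequality, which is exactly the content of the lemma.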

## Appendix B Model Comparison Criteria

The DIC (Spiegelhalter) is a generalisation of the Akaike information criterion (AIC) and is based on the posterior mean of the deviance, which is also a measure of goodness-of-fit. The DIC is defined by

$$
DIC=\bar{D}(\boldsymbol\theta)+\rho_D = 2\bar{D}(\boldsymbol\theta)-D(\tilde{\boldsymbol\theta}),
$$

where $\bar{D}(\boldsymbol\theta)=E[D(\boldsymbol\theta)\mid y]$ is the posterior expectation of the deviance and $\rho_D$ is a measure of the effective number of parameters in the model. The effective number of parameters, $\rho_D$, is defined as $\bar{D}(\boldsymbol\theta)-D(\tilde{\boldsymbol\theta})$, with $\tilde{\boldsymbol\theta}=E[\boldsymbol\theta\mid y]$.

The computation of the integral is complex, but a good approximation can be obtained using the MCMC sample from the posterior distribution. Thus, we can approximate the DIC by first computing the sample posterior mean of the deviances, $\bar{D}$, and then $\widehat{DIC}=2\bar{D}-D(\tilde{\boldsymbol\theta})$.
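The MCMC approximation of the DIC can be sketched as follows. A Gaussian likelihood and synthetic "posterior draws" are assumed purely for illustration; in practice the draws come from the sampler of Section 3.

```python
# Sketch of the MCMC approximation of the DIC: average the deviance over
# posterior draws, then subtract the deviance at the posterior mean.
# Gaussian likelihood and synthetic draws assumed for illustration only.
import numpy as np
from scipy.stats import norm

def deviance(theta, y):
    mu, sigma = theta
    return -2.0 * np.sum(norm.logpdf(y, loc=mu, scale=sigma))

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=100)
# stand-ins for posterior draws of (mu, sigma)
draws = np.column_stack([rng.normal(0, 0.1, 2000),
                         np.abs(rng.normal(1, 0.05, 2000))])

D_bar = np.mean([deviance(t, y) for t in draws])   # posterior mean deviance
theta_bar = draws.mean(axis=0)                     # posterior mean of theta
rho_D = D_bar - deviance(theta_bar, y)             # effective number of parameters
DIC = D_bar + rho_D                                # = 2 * D_bar - D(theta_bar)
```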

The expected Akaike information criterion (EAIC), and the expected Bayesian information criterion (EBIC) (see discussion at Spiegelhalter) are given by

$$
\widehat{EAIC}=\bar{D}+2\vartheta \quad\text{and}\quad \widehat{EBIC}=\bar{D}+\vartheta\log(n),
$$

respectively, where $\vartheta$ is the number of model parameters; both can be used for model comparison.

Recently, Watanabe introduced the Widely Applicable Information Criterion (WAIC). The WAIC is a fully Bayesian approach for estimating the out-of-sample predictive accuracy. The idea is to compute the log pointwise posterior predictive density, $lppd=\sum_{i=1}^{n}\log\left(\frac{1}{S}\sum_{s=1}^{S} f(y_i\mid\boldsymbol\theta^{(s)})\right)$, and then, to adjust for overfitting, subtract a correction for the effective number of parameters, $\rho_{WAIC}=\sum_{i=1}^{n}\mathrm{Var}_{s}\left[\log f(y_i\mid\boldsymbol\theta^{(s)})\right]$, where $\boldsymbol\theta^{(1)},\dots,\boldsymbol\theta^{(S)}$ are posterior draws. Finally, as proposed by Gelman, the WAIC is given by

$$
WAIC=-2\,(lppd-\rho_{WAIC}).
$$

For the DIC, EAIC, EBIC and WAIC, the model that best fits a data set is the model with the smallest value of the criterion.
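The two WAIC ingredients can be computed directly from the matrix of pointwise log densities, as sketched below; again a Gaussian likelihood and synthetic posterior draws are assumed purely for illustration.

```python
# Sketch of the WAIC computation: lppd from pointwise posterior predictive
# densities, rho_WAIC from pointwise posterior variances of the log density.
# Gaussian likelihood and synthetic draws assumed for illustration only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=50)
mu_draws = rng.normal(0.0, 0.1, size=1000)          # stand-in posterior draws

# log f(y_i | theta^(s)) for every observation i (rows) and draw s (columns)
logp = norm.logpdf(y[:, None], loc=mu_draws[None, :], scale=1.0)

lppd = np.sum(np.log(np.mean(np.exp(logp), axis=1)))  # log pointwise pred. density
rho_waic = np.sum(np.var(logp, axis=1, ddof=1))       # effective n. of parameters
WAIC = -2.0 * (lppd - rho_waic)
```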

Another common alternative is the conditional predictive ordinate (CPO) approach (Geisser). This statistic is based on a cross-validation criterion to compare models. Let $y=(y_1,\dots,y_n)$ be an observed sample. For the $i$-th observation, the $CPO_i$ can be written as:

$$
CPO_i=p(y_i\mid y_{(-i)})=\int_{\boldsymbol\theta\in\boldsymbol\Theta} f(y_i\mid\boldsymbol\theta)\,\pi(\boldsymbol\theta\mid y_{(-i)})\,d\boldsymbol\theta=\left\{\int_{\boldsymbol\theta\in\boldsymbol\Theta}\frac{\pi(\boldsymbol\theta\mid y)}{f(y_i\mid\boldsymbol\theta)}\,d\boldsymbol\theta\right\}^{-1},
$$

where $y_{(-i)}$ denotes the sample with the $i$-th observation removed.
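The second integral above leads to the usual Monte Carlo estimator of $CPO_i$ as a harmonic mean of the pointwise likelihoods over posterior draws, sketched below with an assumed Gaussian likelihood and synthetic draws.

```python
# Sketch of the CPO_i estimator via the harmonic-mean identity:
# CPO_i ≈ [ (1/S) * sum_s 1 / f(y_i | theta^(s)) ]^{-1}.
# Gaussian likelihood and synthetic draws assumed for illustration only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
y = rng.normal(0.0, 1.0, size=50)
mu_draws = rng.normal(0.0, 0.1, size=1000)          # stand-in posterior draws

dens = norm.pdf(y[:, None], loc=mu_draws[None, :], scale=1.0)  # f(y_i | theta^s)
cpo = 1.0 / np.mean(1.0 / dens, axis=1)                        # harmonic mean
```

Larger $CPO_i$ values indicate observations that are better predicted by the model, so models with larger CPO values overall are preferred.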