Bayesian Robust Quantile Regression
Traditional Bayesian quantile regression relies on the Asymmetric Laplace distribution (ALD) mainly because of its satisfactory empirical and theoretical performances. However, the ALD displays medium tails and it is not suitable for data characterized by strong deviations from the Gaussian hypothesis. In this paper, we propose an extension of the ALD Bayesian quantile regression framework to account for fat–tails using the Skew Exponential Power (SEP) distribution. Beside having the -level quantile as parameter, the SEP distribution has an additional key parameter governing the decay of the tails, making it attractive for robust modeling of conditional quantiles at different confidence levels. Linear and Generalized Additive Models (GAM) with penalized spline are considered to show the flexibility of the SEP in the Bayesian quantile regression context. Lasso priors are considered in both cases to account for shrinking parameters problem when the parameters space becomes wide. To implement the Bayesian inference we propose a new adaptive Metropolis–Hastings algorithm in the linear model and an adaptive Metropolis within Gibbs one in the GAM framework. Empirical evidence of the statistical properties of the proposed SEP Bayesian quantile regression method is provided through several example based on simulated and real dataset.
keywords:Bayesian quantile regression, robust methods, Skew Exponential Power, GAM.
bernardi]email@example.com bottone]firstname.lastname@example.org petrella]email@example.com
Quantile regression has become a very popular approach to provide a wide description of the distribution of a response variable conditionally on a set of regressors. While linear regression analysis aims to estimate the conditional mean of a variable of interest, in quantile regression we may estimate any conditional quantile of order with . Since the seminal works of Koenker and Basset (1978) and Koenker and Machado (1999), several papers have been proposed in literature considering the quantile regression analysis both from a frequentist and a Bayesian points of view. For the former, following Koenker (2005) and the references therein, the estimation strategy relies on the minimization of a given loss function. From the Bayesian point of view Yu and Moyeed (2001) introduced the ALD as likelihood tool to perform the inference. After that a wide Bayesian literature has been growing on quantile regression and ALD see for example Dunson and Taylor (2005), Kottas and Gelfand (2001), Kottas and Krnjiajic (2009), Thomson et al. (2010), Salazar et al (2012) Lum and Gelfand (2012), Sriram et al (2013) and Bernardi et al. (2015). Although the ALD is widely used in the Bayesian framework it has the main disadvantage of displaying medium tails which may give misleading informations for extreme quantiles in particular when the data are characterized by the presence of outliers and heavy tails. In fact the absence for the ALD of a parameter governing the tails fatness may influence the final inference. Recently Wichitaksorn et al. (2014) tried to generalize the classical Bayesian quantile regression by using some skew distributions obtained through mixture of scaled Normal ones. This class of distributions allows for different degrees of asymmetry of the response variable but they all impose a given structure of the tails. To overcome this drawback we propose an extension of the Bayesian quantile regression by using the Skew Exponential Power (SEP) distribution proposed in Zhu and Zinde–Walsh, 2009. The SEP distribution, like the ALD, has the property of having the -level quantile as the natural location parameter but it also has an additional parameter (the shape parameter) governing the decay of the tails. Using the proposed distribution in quantile regression we are able to robustify the inference in particular when outliers or extreme values are observed. In linear regression analysis several works have extensively considered the non skewed version of the SEP i.e. the Exponential Power distribution (EP), for the related robustness properties given by the shape parameter. Box and Tiao (1973) first show how to robustify the classical Gaussian linear regression model introducing the EP as distribution assumption for the error term. Choy and Smith (1997), explore the robustness properties of posterior moment based on the EP distribution, while Choy and Walker (2003) present further extension of the work of Choy and Smith (1997) introducing the case in which the shape parameter assumes values greater than two.
Finally, Naranjo et al. (2015) and Kobayashi (2016) consider the use of the SEP distribution in the regression and stochastic volatility models.
For the best of our knowledge this is the first attempt to consider the SEP distribution in order to provide a robust framework for quantile regression analysis.
In this paper we propose to use of the SEP distribution to develop a Bayesian robust quantile regression framework. In particular due to the specific characteristics of the SEP distribution we will show how to estimate the quantile function firstly considering the simple linear regression problem then extending it to the generalized additive models (GAM) one. For the latter case we will adopt the Penalized Spline (P–Spline) approach to carry out the statistical inference. The Bayesian paradigm is implemented by means of a new adaptive Metropolis MCMC sampling scheme with a full set of informative prior. In particular, for the GAM framework, the proposed algorithm turns into an Adaptive Metropolis within Gibbs MCMC for an efficient estimate of the penalization parameter and the P–Spline coefficients.
When dealing with model building the choice of appropriate predictors and consequently the variable selection issue plays an important role. In this paper we approach the problem in the Bayesian quantile regression framework, by considering the Bayesian version of Lasso penalization methodology introduced by Tibshirani (1996). In particular for the linear quantile regression model we will assume, as prior distribution on each regressors, the generalized version of the univariate independent Laplace distribution proposed by Park and Casella (2008) and Hans (2009) already considered in Alhamzawi et al. (2012). With this prior we shrink each parameter separately. As second step, when dealing with the GAM models we generalize the Lang and Brezger (2004) second order random walk prior for the Spline coefficients assuming a multivariate Laplace distribution accounting for a correlation structure among parameters. This prior corresponds to the group lasso penalty one of Yuan and Lin (2006), Meier et al. (2008) and Li et al. (2010) which in the spline contest has a natural interpretation in terms of knots associated with each regressor.
To analyze the performance of the proposed models we consider simulation studies in which we control for the weight of the outliers, the number of the parameters, the shape of the regressors and the presence of heteroschedasticity. Furthermore we analyze three popular real dataset: the corrected version (see Li et al. 2010) of the Boston housing data first analyzed by Harrison and Rubinfeld (1978); the Munich rental dataset with geoadditive spatial effect considered in Rue and Held (2005) and Yue and Rue (2011) among the others; the Barro growth data firstly studied by Barro and Sala i-Martin (1995) and then extended in the quantile regression framework by Koenker and Machado (1999). Compared with the existing literature, the models we propose introduce robustness, variable selection and non linearity in the estimation process, providing a more flexible framework and new interesting interpretation of some regression coefficient and, on average, lower posterior standard deviations.
The remainder of the paper is organized as follows. In Section 2, we introduce the SEP distribution and discuss its properties relevant to model conditional quantiles as function of exogenous covariates. In Section 3 we introduce the model specification and the MCMC algorithms proposed. In Section 4 we approach the non–linear extension of the linear quantile approach via GAM models. Section 5 explores the sampling performances of the proposed models through some simulation experiments. Section 6 discusses three well known empirical applications while Section 7 concludes.
2 The Skewed Exponential Power distribution
A random variable is said to be Skewed Exponential Power distributed, i.e. , if its density has the following form:
where is the location parameter, and are the scale and shape parameters respectively, while with being the complete gamma function. Moreover, the parameter controls for the skewness of the distribution.
One of the most nice property of (1), which induces us to propose it for quantile regression inference, is that the location parameter coincides with the –level quantile (we will theoretically prove it in Appendix A). It can be also shown (see Zhu and Zinde–Walsh, 2009) that the kurtosis of the SEP is directly determined by its parameter . In Figure 1 we present the pdf of the SEP distribution for different values of shape () and skewness () parameters, with fixed values of the location and scale parameters .
It is worth noting that, for a fixed value of (see subfigure 1(a)), we retrive the Laplace and the Normal distribution when the shape parameter is equal to and , respectively. Moreover the smaller is , the fatter are the tails of the distribution and in particular as the SEP becomes the Chauchy distribution while as it becomes equal to the Uniform one. It is hence evident the importance of the parameter in capturing the behaviour of the tails which may be fundamental when outliers or heavy tails data are modelled. Furthermore, subfigure 1(b) displays the behavior of the SEP for different combination of and . In this case, the ALD () and the Skew Normal distribution () can be obtained because of the role of the skewness parameter . In the same figure, it should be also evident the relation between and the location parameter . For a fixed ( in the graph) by varying , the shape of the distribution changes in such a way that becomes its quantile of level .
3 Robust Bayesian linear quantile regression
In this section we propose the use of the SEP distribution to implement the Bayesian inference for linear quantile regression combined with the prior distributions specification. Since we are interested in Lasso penalization problem in order to achieve sparsity within the quantile regression model, we propose as prior distribution for the regression parameters, a generalized version of the univariate independent Laplace distribution proposed by Park and Casella (2008) and Hans (2009). In line with Alhamzawi et al. (2012), for each quantile regression parameter we assume a Laplace distribution having different scale parameter in order to shrink each regression parameter in a different way. To achieve the Bayesian procedure we provide an adaptive MCMC sampling scheme obtained by running a block-move Independent Metropolis within Gibbs.
3.1 Model specification
Let a random sample of observations and , the associated set of covariates . Consider the following linear quantile regression model
where is the vector of unknown regression parameters varying with the quantile level. Here, , for any , are independent random variables which are supposed to have zero – quantile and constant variance. Assuming a realization of and a realization of , then the likelihood function for the model (2) based on the SEP distribution (1) with fixed can be written as
where in this case the parameter of equation (1) has been replaced by the regression function . As discussed in the previous section, due to the property of the SEP distribution, the regression function corresponds to the conditional –level quantile of i.e.
. In what follows, we omit the subscript for sake of simplicity.
The Bayesian inferential procedure requires the specification of the prior distribution for the unknown vector of parameters . As mentioned before, for the parameters of the regression function we generalize the prior proposed in Park and Casella assuming the hierarchical structure in (5) and (6) which allows to efficiently shrink each parameter. The priors for the parameters are:
where , are given positive hyperparameters and are the parameters of the univariate Laplace distribution:
with zero location and scale parameter. In (6)-(8) , and denote the Gamma, Inverse Gamma and Beta distributions, respectively.
As known due to its characteristics, the Laplace distribution is the Bayesian counterpart of the Lasso penalization methodology introduced by Tibshirani (1996) to achieve sparsity within the classical regression framework. The original Bayesian Lasso, see also, e.g., Park and Casella (2008) and Hans (2009), introduces the same univariate independent Laplace prior distribution for each regression parameters. Here, as in Alhamzawi et al. (2012), we consider a more general case using the parameters , allowing us to overcome the problem that may arise in presence of regressors with different scales of measurement by shrinking each regression parameter in a different way.
As shown in Park and Casella (2008) and Kozumi and Kobayashi (2011), the Laplace distribution can be expressed as a location–scale mixture of Gaussians which adapted to our case becomes
for , where the mixing variable is exponentially distributed with shape parameter . Furthermore, to retain a parsimonious model parameterization, we introduce a second layer hierarchical prior representation for the vector of shape parameters , in equation (6). Using the location–scale representation of the Laplace distribution, the prior structure defined in equations (5)–(6), can be represented as follows
where is a column vector of zeros of dimension , , and is the exponential distribution.
Concerning the specification of the values for the hyperparameters of the prior distributions,
typically, vague priors are imposed on the scale because it is regarded as a nuisance parameter, see e.g. Yu and Moyeed (2001) and Tokdar and Kadane (2012). Concerning the prior specification for the shape parameter , we impose a Beta distribution with and in order to allow for a large prior variance without incurring in the problem of U–shaped Beta distribution which gives large probability mass to extreme values. Moreover, we extend the Beta distribution to cover the support where the special case allows to consider the so called conditional “expectile” of Newey and Powell (1987), while the case the conditional quantiles based on the ALD introduced by Yu and Moyeed (2001).
As mentioned in Section 2, the parameter regulates the tails–fatness of the SEP distribution so that smaller values implies larger probabilities of extreme observations. Therefore, choosing we encompass both quantile and expectile regression issue addressing at the same time the robustness task relying on a distribution with fatter tails than the Skew Normal.
In the following Section, we introduce the Bayesian parameter estimation procedure which aims to simulate from the posterior distribution using an Adaptive Independent Metropolis–Hastings MCMC algorithm.
3.2 Adaptive IMG for linear quantile regression
The Bayesian inference is carried out using an adaptive MCMC sampling scheme based on the following posterior distribution
where indicates the likelihood function specified in equation (3). After choosing a set of initial values for the parameter vector , simulations from the posterior distribution at the –th iteration of , for , are obtained by running iteratively a block–move Independent Metropolis within Gibbs (IMG). The simulation algorithm requires as first step the specification of a proposal distribution for the parameters .
To propose a move for each block of the parameters, we choose the following proposal distributions:
where the scale parameter is considered on a log–scale and subsequently transformed to preserve positiveness. The jacobian term in equation (16) is required to get the distribution of the transformation . At each iteration , the IMG algorithm proceeds by simulating a candidate draw from each parameter block, i.e. which is subsequently accepted or rejected. The generic probability that the proposed candidate parameter , for becomes the new state of the chain is evaluated on the basis of the following acceptance probability
for , where indicates the probability to move from the old to the proposed state of the chain, is the generic prior given in equations (5) - (8) and refers to the whole set of parameters at iteration without the –th element of . To complete the algorithm we sample , for with a Gibbs step by simulating directly from the respective full conditional distributions
where denotes the Generalized Inverse Gaussian distribution. Since most of the statistical properties of the Markov chain as well as the performance of the Monte Carlo estimators crucially depend on the definition of the proposal distribution (see Andrieu and Moulines, 2006 and Andrieu and Thoms, 2008) we improve the basic IMG–MCMC algorithm with an additional step adapting the proposal parameters using the following equations:
where denotes a tuning parameter that should be carefully selected at each iteration to ensure the convergence and the ergodicity of the resulting chain (see Andrieu and Moulines, 2006). Roberts and Rosenthal (2007) provide two conditions for the convergence of the chain: the diminishing adaptation condition, which is satisfied if and only if , as , and the bounded convergence condition, which essentially guarantees that all transition kernels considered have bounded convergence time. Andrieu and Moulines (2006) show that both conditions are satisfied if and only if where . For those reasons we choose where is set to 10, i.e. . As argued by Roberts and Rosenthal (2007), together these two conditions ensure asymptotic convergence and a weak law of large numbers for this algorithm.
4 Nonlinear extension
In this section, we propose an additive extension of the robust linear quantile regression model considered previously to the class of Generalized Additive Models (GAM) introduced by Hastie and Tibshirani (1986). We will set up GAM models using the SEP likelihood considered before. In order to define the quantile function we make use of the P-Spline functions ending up with a semiparametric problem. The Bayesian analysis is carried out by generalizing the Lang and Brezger (2004) second order random walk prior for the Spline coefficients assuming a multivariate Laplace distribution accounting for a correlation structure among parameters able to take into account for selection variable issue.
4.1 Non–linear model specification
Generalized Additive models extend multiple linear regression by allowing for the response variable to be modeled as sum of unknown smooth functions of continuous covariates. The aim of this section is to set up a robust non linear and semi–parametric framework for quantile regression following a GAM approach using the SEP likelihood. In particular we assume that the –level conditional quantile can be modeled as a parametric component jointly with a sum of smooth functions as follows:
where is the parametric component while each is a nonparametric continuous smooth function and is an additional set of covariates. To implement the statistical analysis we assume that the nonparametric component can be approximated using a polynomial spline of order , with equally spaced knots between and . Using the well known representation of splines in terms of linear combinations of B–splines, we can rewrite equation (24) as:
where denote B–spline basis functions and are the unknown coefficients. In this framework, the value of the estimated coefficients and the shape of the fitted functions strongly depend upon the number and the position of the knots. With respect to the position, in absence of any prior information we consider equidistant knots as a natural choice. Regarding the number of knots, to catch properly the smoothness of the data a careful trade off needs to be considered between few and too many knots since it may cause underfitting or overfitting respectively. A possible solution to this problem is known as Penalized Spline (P–Spline) proposed by O’Sullivan (1986 and 1988) and generalized by Eilers and Marx (1996) which relies on the introduction of a penalty element on the first or second differences of the B–Spline coefficients. This setting has been embedded in the Bayesian framework by Lang and Brezger (2004), Brezger and Lang (2006) and Brezger and Steiner (2008) using a second order random walk for all the B–Spline coefficients, i.e.:
where the generic stochastic component and and are initialized with diffuse priors, i.e., , for . In their work Lang and Brezger (2004) assume that the stochastic components driving the random walk process are independent, i.e. , for all with . Since there are no reasons to assume a priori and independent here we consider an extension of (26) and we assume a multivariate Laplace distribution on the vector of regressors accounting for a correlation structure among them. Moreover it can easily proved, that the original marginal shrinkage effect is preserved under the assumed prior structure, because each marginal prior reduces to a univariate Laplace, see, e.g., Kotz et al. (2001).
Moreover using the Laplace distribution as prior distribution allows to extend the Bayesian Lasso approach to estimation.
Let , we assume , where denotes the multivariate Laplace distribution and is the identity matrix of dimension . Furthermore, let to be the difference matrix of dimension and is the order of the differential operator, such that , then
where . As shows in Kotz et al. (2001), the multivariate Laplace distribution can be expressed as a location–scale mixture of Gaussians, where the mixing variable follows a Gamma distribution
where the integrand in the previous equation (31) is proportional to a Generalized Inverse Gaussian distribution with parameters , and from which we have
where . Substituting this latter expression into equation (31) we obtain
which corresponds to the ALD defined in equations (27)–(28). The proposed prior distribution for corresponds to the group lasso penalty of Yuan and Lin (2006), Meier et al. (2008) and Li et al. (2010) which accounts for the group structure when performing variable selection. It is worth emphasizing that, in our context, the group variables have a natural interpretation because they correspond to knots accounting for the smoothness level of the same regressor over different regions of the support. The overall smoothness of the fitted curve is controlled by the variance of the error term which correspond to the inverse of the penalization parameter used by Eilers et al. (1996) in the frequentist framework. We choose a conjugate Gamma prior for , that is with . Different choice of hyper parameters may be considered but they all bring to very similar results. Summarizing, putting a Gamma prior on , and assuming the prior structure defined in equations (5)–(6) for the shape and scale parameters , we then have the following hierarchical model
4.2 Adaptive IMG for quantiles GAM
In order to perform the Bayesian inference the Adaptive MCMC algorithm proposed in Section 3.2 will be slightly modified to deal with the simulation from the posterior distribution of the generalized quantiles GAM parameters. The posterior distribution becomes equal to
where the vector contains now three more additional set of parameters, namely , and . The likelihood function defined in equation (3) for the linear model should be adapted to account for the additional splines coefficients. Here to perform the Bayesian analysis it is necessary to add three more steps to the algorithm described in Section 3.2. Specifically, after having updated all the parameters of the linear part of the model, a candidate for , for is drawn from a Gaussian proposal distribution, i.e., and accepted on the basis of the following acceptance probability
where denote the whole set of B–Spline coefficients without the –th component. Furthermore, as specified in Section 3.2 for the regression parameters, an adaptive step for the mean and the variance of the proposal distribution of each is implemented using the following equation
for , where is the vanishing factor fixed as discussed above. The hyperparameters are updated by single–move Gibbs sampling steps by simulating from the following full conditionals which are proportional to the GIG distribution
5 Simulation Studies
In this Section, simulation studies are performed to highlight the improvements in robustness obtained by implementing SEP based quantile regression compared with those obtained by the traditional Bayesian quantile regression based on the ADL distribution. More specifically, our purpose is to illustrate how the SEP misspecified model assumption in the quantile regression framework generate posterior distributions of the regression parameters centered on the true values. The first simulation experiment assesses the robustness properties of the proposed methodology for quantile estimation when the joint distribution of the couple , for , is contaminated by the presence of outliers. The second study shows the effectiveness of shrinkage effect obtained by imposing the Lasso–type prior proposed when multiple quantile linear model is the main concern. The last experiment aims at highlighting the ability of the model to adapt to non–linear shapes when data come from heterogeneous fat–tailed distributions. The hyperparameters of the priors distribution are all chosen such that the priors are non informative. In particular regarding the nuisance parameter we choose which corresponds to a proper Inverse Gamma distributions with infinite second moments. When lasso prior is assumed the hyperparameters in the Gamma priors for are chosen to be .
5.1 Simple linear quantile regression
For this experiment we consider a sample of drawn from the following homoskedastic mixture of distributions
where denote the density function of a Gaussian distribution with mean and variance and covariance matrix and is the vector of weights. We fix the number of components equal to , with mixture weights , locations and scale matrix specified as
The quantile regression model implemented is a simple model with only one exogenous variable i.e. for . The aim of this toy example is to show the performances of the Bayesian quantile linear regression analysis assuming both the ALD and the SEP likelihood when the data are strongly contaminated by the presence of outliers. Since we have only one regressor, for this illustrative example we use a simplified version of the sampler proposed in Section 3.2, in which a simple Gaussian prior is considered for . For we run the MCMC algorithm with iterations and a burn–in of . For both the ALD and the SEP distribution assumption, initial values for the parameters to be estimated, namely , are randomly drawn from and , respectively. The additional initial value for the parameter , required only for the SEP distribution case, is randomly drawn form .
Figure 2 depicts the estimated regression lines as well as the % HPD credible sets. The blue line refers to the ALD estimation while the red line to the SEP one. It can be easily observed that the two curves overlap for and increasingly diverge for more extreme level quantiles i.e. for . It is in fact the case that for the posterior mean of is very close to one, which implies that the SEP reduces to the the ALD distribution.
As far as we move away from the median, it is notable the differences in the estimated regression quantile parameters under the ALD and the SEP assumption. Looking at the subplots 2(a) and 2(c) it is evident that the intercept and the slope of the regression line obtained using ALD distribution is strongly influenced by the of the outliers from the two external components of the mixture. In those cases, the estimated of the SEP is considerably smaller than one. The estimation of the ’s parameters is therefore made under a distribution with fatter tails than the ALD strongly penalizing more extreme observations and providing, as consequence, more robust results.
For the regression parameters Table 1 contains the estimated posterior means and standard deviations under the ALD and the SEP assumption. Under the data generating process the theoretical slope should be always equal to . It can be seen that moving from the median to more extreme quantiles, the posterior mean of the intercept and the slope, estimated with the ALD, is farther from the true value than those obtained with the SEP. In addition it is worth noting that also the standard errors are always lower implying that the estimated are more sharp when using the SEP distribution .
5.2 Multiple quantile regression
In this section, we carry out a Monte Carlo simulation study specifically tailored to evaluate the performance of the model when the Lasso prior (5) is considered for the regression parameters. The simulation are similar to the one proposed in Li et al. (2010) and Alhamzawi et al. (2012). Specifically, we simulate observations from the linear model , where the true values for the regressors are set as follows:
The first simulation corresponds to a sparse regression case, the second to a dense case while the third to a very sparse one. The covariates are independently generated from a with . Two different choices for the error terms distribution of the generating process distribution are considered for each simulation study. The first choice is a Gaussian distribution , with chosen so that the -th quantile is 0, while is set as 9, as in Li et al. (2010). The second choice is a Generalized Student’t distribution with two degrees of freedom, i.e. , and chosen so that the -th quantile is 0. For three different quantile level, we run 50 simulations for each vector of parameters () and each choice of the error term. Table 2 reports the median of mean absolute deviation (MMAD), i.e. median, and the median of the parameters over 50 estimates. To be concise only results for simulation1 are reported since results from the other two simulations are similar. It is immediate to see that the proposed Bayesian quantile regression method based on the SEP likelihood performs better in terms of MMAD for both the distributions of the error term. This evidence confirms that the presence of the shape parameter in the likelihood allows to better capture the behavior of the data. The estimated shape parameter is indeed greater and lower then 1 in the Gaussian and Generalized Student case respectively, giving us a more reliable estimation of the vector regardless to the tails weight of the distribution of the error term. These results are confirmed in simulation 2 and simulation 3 (not reported here) in which we exasperate the density and the sparsity in the structure of the predictors. Furthermore, looking at the values of we can see that the proposed robust method reduces the bias of the estimates for all the quantile confidence levels. Concerning the shrinkage ability of the proposed estimator we observe that where the true parameters are zero, the SEP distribution is able to identify them better than the ALD.
Generalized Student t
5.3 Non Linear Model
In this simulation example we illustrate the performances of model assumptions when a simple GAM model is considered with a single continuous smooth non–linear function and where the parametric linear components are set to zero. Following Chen and Yu (2009), we consider two data generating process , for and where represents the wave function and the doppler function, defined as follows
with . These functions are usually used (see also Denison et al. 1998) to check the nonlinear fitting ability of a proposed model. Starting form these two curves, we generate a sample of and observations for the wave and the doppler functions, respectively using four different sources of error