Prediction Measures in Beta Regression Models

# Prediction Measures in Beta Regression Models

Patrícia L. Espinheira Luana Cecília Meireles da Silva Alisson de Oliveira Silva Departamento de Estatística, Universidade Federal de Pernambuco, Cidade Universitária, Recife/PE, 50740–540, Brazil
###### Abstract

We consider the issue of constructing PRESS statistics and coefficients of prediction for a class of beta regression models. We aim at displaying measures of predictive power of the model regardless goodness-of-fit. Monte Carlo simulation results on the finite sample behavior of such measures are provided. We also present an application that relates to the distribution of natural gas for home usage in São Paulo, Brazil. Faced with the economic risk of to overestimate or to underestimate the distribution of gas was necessary to construct prediction limits using beta regression models (Espinheira et al., 2014). Thus, it arises the aim of this work, the selection of best predictive model to construct best prediction limits.

###### keywords:
Beta distribution, beta regression, PRESS, prediction coefficient.

, Corresponding author. ,

## 1 Introduction

The beta distribution is commonly used to model random variables that assume values in , such as percentages, rates and proportions. The beta density can display quite different shapes depending on the parameter values. Oftentimes the variable of interest is related to a set of independent (explanatory) variables. Ferrari and Cribari-Neto (2004) introduced a regression model in which the response is beta-distributed, its mean being related to a linear predictor through a link function. The linear predictor includes independent variables and regression parameters. Their model also includes a precision parameter whose reciprocal can be viewed as a dispersion measure. In the standard formulation of the beta regression model it is assumed that the precision is constant across observations. However, in many practical situations this assumption does not hold. Smithson and Verkuilen (2006) consider a beta regression specification in which dispersion is not constant, but is a function of covariates and unknown parameters. Parameter estimation is carried out by maximum likelihood (ML) and standard asymptotic hypothesis testing can be easily performed. Practitioners can use the betareg package, which is available for the R statistical software (http://www.r-project.org), for fitting beta regressions. Cribari-Neto and Zeileis (2010) provide an overview of varying dispersion beta regression modeling using the betareg package.

Recently Espinheira et al. (2014) built and evaluated bootstrap-based prediction intervals for the class of beta regression models with varying dispersion. However, a prior approach it is necessary, namely: the selection of the model with the best predictive ability, regardless of the goodness-of-fit. Indeed, the model selection is a crucial step in data analysis, since all inferential performance is based on the selected model. Bayer and Cribari-Neto (2014) evaluated the performance of different selection criteria models in samples of finite size in beta regression model, such as Akaike Information Criterion (AIC) (Akaike, 1973), Schwarz Bayesian Criterion (SBC) (Schwarz, 1978), residual sum of squares (RSS), and various functions of RSS such as the coefficient of determination, and the adjusted . However, these methods do not offer any insight about the quality of the predictive values. In this context, Allen (1974), proposed the PRESS (Predictive Residual Sum of Squares) criterion, that can be used as an indication of the predictive power of a model. The PRESS statistic is independent from the goodness-of-fit of the model, since that its calculation is made by leaving out the observations that the model is trying to predict. The PRESS statistics can be viewed as a sum of squares of external residuals. Thus, similarly of the approach of Mediavilla et al. (2008) proposed a coefficient of prediction based on PRESS namely . The statistic can be used to select models from a predictive perspective adding important information about the predictive ability of the model in various scenarios.

## 2 On beta regression residuals

Let be independent random variables such that each , for , is beta distributed, i.e., each has density function given by

 f(yt;μt,ϕt)=Γ(ϕt)Γ(μtϕt)Γ((1−μt)ϕt)yμtϕt−1t(1−yt)(1−μt)ϕt−1,0

where and . Here, and , where . In the beta regression model introduced by Ferrari and Cribari-Neto (2004) the mean of can be written as

 g(μt)=x⊤tβ=ηt. (2)

In addition to the relation given in (2), it is possible to assume that the precision parameter is not constant and write

 h(ϕt)=z⊤tγ=ϑt. (3)

In (2) and (3), and are linear predictors, and are unknown parameter vectors (; ), and are fixed covariates () and and are link functions, which are strictly increasing and twice-differentiable.

The PRESS statistic is based on sum of external residuals obtained from exclusion of observations. For beta regression models Ferrari et al. (2011) present a standardized residual obtained using Fisher’s scoring iterative algorithm for under varying dispersion. Here, we propose a new residual based on a combination of ordinary residuals obtained using the algorithms for and under varying dispersion. At the outset, consider the Fisher’s scoring iterative algorithm for estimating (see the Appendix A). From (21) it follows that the th step of the scoring scheme is

 β(m+1)=β(m)+(X⊤Φ(m)W(m)X)−1Φ(m)X⊤T(m)(y∗−μ∗(m)), (4)

where the th elements of the vectors and are given, respectively, by

 y∗t=log{yt/(1−yt)}andμ∗t=ψ(μtϕt)−ψ((1−μt)ϕt), (5)

denoting the digamma function, i.e., for . The matrices and are given in (17) and (19), respectively, is an matrix whose th row is and . Note that (see (22); Appendix A). Similarly, from (21) it follows that the th step of the scoring scheme for is given by

 γ(m+1)=γ(m)+(Z⊤D(m)Z)−1Z⊤H(m)a(m), (6)

where the th element of is give by

 at=μt(y∗t−μ∗t)+log(1−yt)−ψ((1−μt)ϕt)+ψ(ϕt) (7)

and the matrices and are given in (18) and (20), respectively, and is an matrix that th row is . It is possible to write the iterative schemes in (4) and (6) in terms of weighted least squares regressions, respectivelly as and . Where , with , , with and given in (7). Upon convergence,

 ˆβ=(X⊤ˆΦˆWX)−1ˆΦX⊤ˆWu1%andˆγ=(Z⊤ˆDZ)−1Z⊤ˆDu2,whereu1=ˆη+ˆW−1ˆT(y∗−ˆμ∗)andu2=ˆϑ+ˆD−1ˆHˆa. (8)

Here, , , and are the matrices , , and respectively, evaluated at the maximum likelihood estimator. We note that and in (8) can be viewed as the least squares estimates of and obtained by regressing and on and , respectively. The residuals ordinary obtained of interactive process of and are given by and , respectively. Hence, using the definitions of the matrices given from (17) to (21), we can rewrite the residuals obtained from the iterative process of and respectively, as

 rβt=y∗t−ˆμ∗t√ˆvtandrγt=ˆat√ˆςt, (9)

where and are given in (19) and (20), respectively. Thus, we propose a new residual based on and , which we shall refer to as the combined residual where and are given in (5). Assuming that and are known and from (22) to (26) it follows that , with

 ζt=(1+μt)2ψ′(μtϕt)+μ2tψ′((1−μt)ϕt)−ψ′(ϕt). (10)

Then, we can define the following standardized combined residual:

 rβγp,t=(y∗t−ˆμ∗t)+ˆat√ˆζt (11)

Here, is in (10) evaluated at e . It is important to note that when is constant it is only necessary replace by at all elements of (11). We should emphasize that here we are just interested in evaluating the in the composition of the PRESS statistic.

## 3 P2 Statistics

Consider the linear model, where is a vector of responses, is a known matrix of covariates of dimension , is the parameter vector of dimension and is a vector of errors distributed as . Let , , and let be the estimate of without the i observation and be the case deleted predicted value of the response when the independent variable has value . Thus, for multiple regression which can be rewritten as , where is the t diagonal element of the matrix .

In the beta regression model in (8) can be viewed as the least squares estimate of obtained by regressing

 ˇy=ˆΦ1/2ˆW1/2u1onˇX=ˆΦ1/2tˆW1/2X. (12)

Thus, the prediction error is . Using the ideas proposed by (Pregibon, 1981) and fact that

 ˆβ(t)=ˆβ−(X⊤ˆΦˆWX)−1xtˆϕ1/2tˆw1/2trβt(1−h∗tt),

where is given in (9) and is the th diagonal element of

 H∗=(ˆWˆΦ)1/2X(XˆΦˆWX)−1X⊤(ˆΦˆW)1/2

it then follows that Finally, for the beta regression model the PRESS statistic is given by

 PRESS=n∑t=1(ˇyt−^ˇy(t))2=n∑t=1(rβt1−h∗tt)2. (13)

In (13) the th observation is not used in fitting the regression model to predict , then both the external predicted values and the external residuals are independent of . This fact enables the PRESS statistic to be a true assessment of the prediction capabilities of the regression model regardless of the overall quality of the fit of the model.

Considering the same approach of the coefficient of determination , we can think in a prediction coefficient based on PRESS, namely

 P2=1−PRESSSST(t), (14)

wherein and is the arithmetic average of the . It can be shown that , wherein is the number of model parameters. In the beta regression model with varying dispersion, , is the is the arithmetic average of the given in (12) and .

Cook and Weisberg (1982) suggest other versions of PRESS statistics based on different residuals. Thus, we present another version of PRESS statistics and associated considering a new residual presented in (11), such that

 PRESSβγ=n∑t=1⎛⎝rβγp,t1−h∗tt⎞⎠2andP2βγ=1−PRESSβγSST(t), (15)

respectively. It is noteworthy that the measures and are distinct, since that the propose to measure the quality of fit of the model and the and measure the predictive power. Additionally, and are not positive measure. In fact, the is a positive quantity, thus the and the associated given in (14) and (15), respectively, take values in . The closer to one the better is the predictive power of the model. In order to check the goodness-of-fit of the estimated model, we used the approach suggested by Bayer and Cribari-Neto (2014) for beta regression models with varying dispersion, a version of based on likelihood ratio, given by: wherein is the maximum likelihood achievable (saturated model) and is the achieved by the model under investigation.

### 3.1 Monte Carlo results

The Monte Carlo experiments were carried out using using both fixed and varying dispersion beta regressions as data generating processes. All results are based on 10,000 Monte Carlo replications. Table 1 contains numerical results for the fixed dispersion beta regression model as data generating processe, given by

 log(μt1−μt)=β1+β2xt2+β3xt3+β4xt4+β5xt5,t=1,…,n,

The covariate values were independently obtained as random draws of the following distributions: , and were kept fixed throughout the experiment. The precisions, the sample sizes and the mean response are, respectively, , , , and . To investigate the performances of statistics in the omission of covariates, we considered the Scenarios 1, 2 and 3, in which are omitted, three, two and one covariate, respectively. In the fourth scenario the estimated model is correctly specified. Additionally we calculate the for the same scenarios. The results in Table 1 show that the values of all statistics increase as important covariates are included in the model. Statistics behave similarly as the sample size and the precisions values indicating that the most important factor is the correct specification of the model. Considering the three ranges for the it should be noted that the statistic values are considerably larger when and the values approaching one when the estimated model is closest to the true model. For instance, in Scenario 4 for , the values of and are, respectively, (0.8354, 0.9357, 0.9748) and (0.8349, 0.9376, 0.9758).

The statistics finite sample behavior substantially change when . It is noteworthy the reduction of the statistic values, revealing the difficulty in to fit the model and make prediction when . Indeed, in this range of is more difficult to make prediction that to fit the model. For example, in Scenario 1, when three covariates are omitted from the model, when and the values equals, 0.0580, 0.0636 and 0.0972 whereas the the values are 0.1553, 0.1999 and 0.2496, respectively. Similar results were obtained for . Even when for the correctly specified four covariate model (Scenario 4) the predictive power of the model is more affected than the quality of fit of the model by the fact of . In this situation, it is noteworthy that the finite sample performances predictive power model improve when the value of the precision parameter increases. For instance, when and we have and , respectively. Here it is possible see that the statistic always shows larger values than the statistic when the mean responses are close to of the upper limit of the standard unit interval. However, the two measures behave similarly when used to investigate model misspecification.

The same difficulty in obtaining predictions and in fitting the regression model occurs when . Once again the greatest difficulty lies on the predictive power of the model. It is also noteworthy that when the point prediction becomes even less reliable than when , since the and values decreased substantially and become considerably distant from the values. When the mean responses are close to of the lower limit of the standard unit interval, the seems to be more able in identify poor predictions. For instance, in Scenario 4 (model correctly specified; four covariates) when and , we have and , respectively.

We have also carried out Monte Carlo simulations using a varying dispersion beta regression model, in which we increased the number of covariates, used different covariates in the mean and precision submodels. In this case the data generating process and the postulated model is the same . We report results for , , , and . Here,

 λ=ϕmaxϕmin=maxt=1,…,n{ϕt}mint=1,…,n{ϕt}, (16)

is the measure the intensity of nonconstant dispersion. The covariate values in the mean submodel and in the precision submodel were obtained as random draws from the and distributions, respectively, such that the covariate values in the two submodels are not the same. At the end, we also considered a covariate values generated from (Student’s t-distribution with 3 degrees of freedom). The results are presented in Table 2. We should emphasize that were generated only covariates values and the covariates values are replications of original set. In this sense, the intensity of nonconstant dispersion remains the same over the sample size.

When the mean responses are scattered on the standard unit interval () the three statistics display similar values. It seems that neither the degree of intensity of nonconstant dispersion nor the simultaneous increase in the number of covariates in the two submodels noticeably affect the predictive power and fit of the model when the sample size is fixed. However, it is noteworthy a reduction of statistic values when the response values are close to one or close to zero, making clear the difficulty in fitting the regression model and obtaining good predictions when or and the precision is modelled. The minor values of statistic reveals the problem in to make good predictions when , whereas when this problem is singled out by smaller values of statistic. Here, the model fit is more affect when the number of covariates increases simultanealy in the two submodels. For instance consider , and . At the Scenario 5 (one covariate in both submodels), we have = 0.8117, = 0.3677 and = 0.8228. Whereas in Scenario 8 (four covariate in both submodels) we have = 0.8627, = 0.4863 and = 0.6447;

We also displayed in Table 2 the statistic values when the model is correctly specified, but we introduced leverage points in the data. To that end, only the values were obtained as random draws of the distribution and concerned ourselves with , which yielded one point which has leverage measure ten times greater than the average value when , two high leverage points when and three, . Here, we used as measure of leverage the leverage generalized (Espinheira et al., 2008). Notice that in Scenarios 5, 6 and 7 the measure seems more able to identify correctly that the leverage points affect the goodness of prediction than the measure. On the order hand, the outperforms the in Scenario 8. It is interesting to notice that in Scenario 5, which represents one covariate in both submodels, with the only one covariate of mean submodel had values generated from the occurs the smaller values of the three statistics. Thus, the statistics correctly lead to the conclusion that as greatest is the influence of leverage point in the data, worst are the predictions and the model fit.

Finally, were carried out Monte Carlo simulations to assess the performance of statistics when the dispersion modelling is neglected. To that end, the true data generating process considers varying dispersion but a fixed dispersion beta regression is estimated; see Table 3. In this case we have misspecification. Thus, we hope that the statistics display smaller values in comparison with Table 3. In this sense, when