A Random Forest Approach for Modeling Bounded Outcomes
Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of complex predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical random forest approaches may severely suffer as they do not account for the heteroscedasticity in the data. A random forest approach is proposed for relating beta distributed outcomes to explanatory variables. The approach explicitly makes use of the likelihood function of the beta distribution for the selection of splits during the tree-building procedure. In each iteration of the tree-building algorithm one chooses the combination of explanatory variable and splitting rule that maximizes the log-likelihood function of the beta distribution with the parameter estimates derived from the nodes of the currently built tree. Several simulation studies demonstrate the properties of the method and compare its performance to classical random forest approaches as well as to parametric regression models.
Keywords: Random forests; Beta distribution; Bounded outcome variables; Regression modeling
In observational studies one frequently encounters bounded outcome variables. Important examples are (i) relative frequency measures restricted to the unit interval (0,1), like the household coverage rate (Fang and Ma, 2013), the percentage of body fat (Fang et al., 2018) or DNA methylation levels (Weinhold et al., 2016), and (ii) continuous response scales bounded between 0 and 100, like visual analogue scales (Bilcke et al., 2017) or health-related quality of life (HRQoL) scales (Hunger et al., 2012).
The objective of statistical analyses typically is to build a regression model that relates the bounded outcome to a set of explanatory variables = . When modeling bounded outcomes a problem arises when the variance is not constant, which implies heteroscedasticity and violates the assumption of Gaussian regression. Therefore, a popular approach is to apply a transformation, e.g. by using the logistic function, that maps the unit interval (0,1) to and to fit a Gaussian regression model to the transformed outcome. A remaining limitation of an analysis based on transformations is that inference is not possible on the original scale but only on the (logit-)transformed scale. This affects the interpretability of the estimated regression coefficients and the conclusions drawn from the results of associated hypothesis tests.
A more flexible method that does not require variable transformation is beta regression, see Ferrari and Cribari-Neto (2004), Cribari-Neto and Zeileis (2010) and the extensions considered by Smithson and Verkuilen (2006), Simasand et al. (2010) and Grün et al. (2012). One of the main benefits that makes the beta regression a popular tool for modeling bounded outcome variables is the flexibility of the shape of the probability density function (p.d.f.) of the beta distribution, including symmetrical shape and left or right skewness. Furthermore, beta regression allows for a simple interpretation of the predictor-response relationships in terms of multiplicative increase/decrease of the expected outcome, which is analogous to logistic regression for binary outcome.
A large part of the methodology on beta regression refers to parametric regression models using simple linear combinations of the explanatory variables, that is, one assumes that the effects of the explanatory variables on the outcome are linear. In many applications, however, this assumption is too restrictive, for example, when higher-order interactions between the explanatory variables are present. Also, the specification of a parametric model may results in a large number of parameters relative to the sample size, which may effect estimation accuracy. When the number of explanatory variables exceeds the number of observations, parameter estimation is even infeasible.
These issues can be addressed by the use of recursive partitioning techniques or tree-based modeling, which has its root in automatic interaction detection. The most popular version is due to Breiman et al. (1984) and is known by the name classification and regression trees (CART). An easily accessible introduction into the basic concepts is found in Hastie et al. (2009). A disadvantage of tree-based methods, however, is that the resulting trees are often affected by a large variance, i.e. that minor changes in the data may result in completely different trees. Therefore, in particular when the focus is on prediction accuracy, it is often worth stabilizing the results obtained from single trees by applying ensemble methods, like random forests (Breiman, 2001; Ishwaran, 2007). In general, the idea of random forests is to reduce the variance of the predictions while retaining low bias by averaging over many noisy trees.
In this article, we propose a random forest approach tailored to the modeling of bounded outcome variables. In contrast to classical approaches, which use impurity measures (for classification) or the mean squared error (for regression), we propose to use the likelihood of the beta distribution as splitting criterion for tree building. In each iteration of the tree-building algorithm one chooses the combination of explanatory variable and split point that maximizes the log-likelihood function of the beta distribution, with the parameter estimates directly derived from the nodes of the currently built tree. Similar strategies have been proposed by Su et al. (2004) for continuous outcomes and Schlosser et al. (2018) in the context of distributional regression. The proposed method has been implemented in an extended version of the R add-on package ranger (Wright and Ziegler, 2017).
The remainder of the article is organized as follows: in Section 2 we present the notation and methodology of the proposed random forest approach for modeling bounded outcomes, along with a description of classical parametric beta regression models. Section 3 compares the properties and the performance of the proposed method to classical beta regression models and alternative random forest approaches considering several simulation studies. The main findings of the article are summarized in Section 4.
2.1 Notation and Methodology
Let be the bounded outcome variable of interest. In this article we assume that follows a beta distribution with the following probability density function
where is the gamma function. The p.d.f. in (1) is defined by two parameters, namely the location parameter and the precision parameter . With this parametrization, the mean and variance of are, respectively,
Depending on the values of the parameters and , the p.d.f. of the beta distribution can take on a number of different shapes, including symmetrical shape and left or right skewness, or even uniform shape (Cribari-Neto and Zeileis, 2010). The flexibility is illustrated in Figure 1, which shows the p.d.f. of for several different parameter combinations. Of note, the beta distribution can not only be used to characterize outcome variables bound to the unit interval , but also for bounded variables of the type (with , ) after scaling to the standard unit interval by the transformation (Cribari-Neto and Zeileis, 2010). For further details on the beta distribution and alternative parameterizations, see Ferrari and Cribari-Neto (2004).
To relate to the vector of explanatory variables , one usually considers the general class of parametric regression models and assumes that the relationship between the conditional mean and is given by
where is a monotonic, twice differentiable response function, e.g. the inverse logit function, is the predictor function, and is a vector of unknown real-valued coefficients. When using the inverse logit function, the terms have a simple interpretation in terms of multiplicative increase or decrease of the conditional mean . For example, if , increasing by one unit implies that the odds of the outcome () is increased by the factor . The model based on the inverse logit function will be used for comparison purposes in the simulation study in Section 3.
Estimates of the regression parameters are obtained by using classical maximum likelihood estimation. Let be a set of independent realizations of , with mean and precision and let be real-valued observations on explanatory variables. Then, the log-likelihood function of model (3) is defined as
Typically, in beta regression models, the precision parameter is assumed to be constant for all observations (). This is the same as the assumption for the variance in a Gaussian regression model. The more general form of the log-likelihood function specified in Equations (4) and (5) is exploit by the proposed approach (see Section 2.2).
The parametric model (3) is linear in , implying that the explanatory variables have a linear effect on the transformed outcome. In practice, this assumption may be too restrictive, because explanatory variables may show complex interaction patterns. In principle the linear relation in (3) can be extended by incorporating interaction terms. However, for modeling interactions in classical regression approaches, the specification of the corresponding terms in the predictor function of the model is required, i.e. they need to be known beforehand. This is a major challenge in real applications, due to the mostly unknown nature of these terms. Yet another problem when applying classical regression approaches arises when the number of parameters exceeds the number of observations (which is a likely scenario when many interaction terms are contained), i.e. when . This leads to an overdetermined system for which maximum likelihood estimation is bound to fail.
2.2 Tree-Based Beta Regression
To address the aforementioned problems, several statistical methods were developed, including the popular Classification and Regression Trees (CART; Breiman et al., 1984). CART are built by recursively dividing the predictor space (defined by the explanatory variables X) into disjoint regions (i.e. subgroups of the observations) applying binary splits. Starting from the root of the tree (called top node), which contains the whole predictor space, in each splitting step, a single explanatory variable and a corresponding split point is selected, along which the node, denoted by , is divided into two subsets and (called child nodes). The scale of the variable defines the form of the binary splitting rule. For a metrically scaled or ordinal variable , the partition into two subsets has the form and , with regard to split point . For multi-categorical variables without ordering , the partition has the form and , where and \ are disjoint, non-empty subsets. Splitting is repeated in each newly created child node until a specified stopping criterion based on predefined tuning parameters is met (see Section 2.3). In each resulting terminal node, the conditional mean of all observations belonging to this node is fitted by a constant, e.g. the mean of the outcome values in regression trees and the most frequent class in classification trees.
For classification trees, one of the most common splitting criteria (to choose among the explanatory variables and splitting rules in each step), is the Gini impurity, which is minimal if all the observations are correctly classified. For regression trees, the usual splitting criterion is the mean squared error (MSE). In the presence of a beta distributed outcome variable, one inherently assumes heteroscedasticity, because the variance depends on the mean parameter, see Equation (2). Hence, the use of the mean squared error as splitting criterion may lead to biased predictions, since splits are stronger affected by data with high variability. More specifically, because regression trees seek to minimize the within-node variance, there will be a tendency to split nodes with high variance, which may result in poor predictions in low-variance nodes (Moisen, 2008).
Therefore, we introduce an alternative splitting criterion based on the log-likelihood of the beta distribution, which forms the building block of the proposed random forest approach (see Section 2.3). Given the data , the log-likelihood (4) can be obtained by estimating the distribution parameters and for each observation . During the tree-building procedure, the estimation of the two parameters is conducted node-wise: in each node the location parameter and the precision parameter of the beta distribution are estimated by
where is the set of observations and is the number of observations falling into node , respectively. Given nodes of a currently built tree, the individual distribution parameters for each observation are then obtained by
In each iteration of the proposed tree-building algorithm, one chooses the combination of explanatory variable and splitting rule, that maximizes the log-likelihood function (4), when node is split into the child nodes and . As in the classical tree approach, after termination of the algorithm the fitted conditional mean for each observations is obtained by Equation (7).
2.3 Random Forests for Beta Regression
A great advantage of tree-based methods is the simple and intuitive interpretability of the model. This is particularly important when the aim is to build an easy-to-interpret prediction formula and to quantify the effect of a specific explanatory variable on the outcome. A drawback, however, is that the resulting tree estimators are often very unstable. This means that a small variation in the data can result in very different splits, which particularly affects prediction accuracy. To overcome this problem, we propose a random forest algorithm, an ensemble method originally developed by Breiman (2001). The proposed approach is based on the splitting criterion introduced in Section 2.2 and is referred to as beta forest in the following.
The main principle, of the beta forest is to generate a fixed number of samples from the original data (denoted by ntree) using bootstrap procedures, i.e. by drawing with replacement, and to apply the tree-building algorithm to each of the samples. To mitigate the similarity of the resulting trees, the number of explanatory variables that are available for splitting in each node (denoted by mtry) is reduced. During tree building, each node is split until the number of observations falls below a (predefined) minimal node size (denoted by min.node.size). The algorithm finally terminates when all current nodes are flagged as terminal nodes. As in the classical random forest approach, the fitted conditional mean for an observation is obtained by averaging the predicted values over all trees where the observation was part of the out-of-bag sample, i.e. was not used for tree building. A schematic overview of the beta forest algorithm is provided in Table 1.
In R, the proposed algorithm has been implemented in an extended version of the add-on package ranger (Wright and Ziegler, 2017). The default values of the tuning parameters of the beta forest were set to , and , as for all the previously implemented regression methods.
In this section we present the results of several simulations to evaluate the performance of the beta forest. The aims of the study were: (i) to compare the beta forest to classical random forest approaches in the presence of a bounded outcome variable, and (ii) to compare the beta forest to parametric beta regression models in higher dimensional settings with a large number of non-influential variables, and in the presence of interactions between the explanatory variables.
3.1 Simulation Design
We considered several simulation scenarios, where each simulated data set consisted of observations. The outcome variable of each simulated data set was randomly drawn from a beta distribution with a constant precision parameter and the conditional mean , with defined as the inverse logit function and the predictor function
composed of four independent binary explanatory variables with . Thus, the true predictor function (9) contained four linear main effects and four two-factor interactions.
We simulated symmetrically distributed data with average location parameter value by setting , and left-skewed data with average location parameter value by setting . For the precision parameter we used the values (low), (moderate) and (high), yielding average variance parameter values of 0.08, 0.05 and 0.03 for , and 0.05, 0.03 and 0.02 for . Furthermore, to assess the robustness of the methods against the degree of noise in the data we added non-informative explanatory variables to the predictor space. Including the four informative variables, we considered scenarios with (informative), (low-dimensional), (moderate-dimensional) and (high-dimensional) explanatory variables. In total this resulted in 2 x 3 x 4 = 24 different scenarios. Each simulation scenario was replicated 100 times.
3.2 Competing Methods
In all simulation scenarios we compared the beta forest (i) to alternative random forest approaches differing with regard to the transformation applied to the outcome variable , and (ii) to parametric beta regression models differing with regard to the pre-specified predictor function. Specifically, we examined:
the proposed beta forest (beta-rF),
the classical random forest approach with splitting criterion MSE on the original, untransformed data (rF),
the classical random forest approach with splitting criterion MSE on arcsine square root transformed data (asin-rF),
the classical random forest approach with splitting criterion MSE on logit-transformed data (logit-rF),
a fully specified parametric beta regression model with linear predictor (linear-bR),
a parametric beta regression model without any explanatory variable, i.e. predictor (int-bR) and
a parametric beta regression model with predictor function (9) of the underlying data generating process (true-bR).
All random forest approaches (a) – (d) were computed with the R package ranger (Wright and Ziegler, 2017) using the default values for ntree and mtry and . The parametric beta regression models (e) – (g) were computed using the R package betareg (Cribari-Neto and Zeileis, 2010).
We evaluated the performance of the modeling approaches with respect to predicting outcomes of future observations. This was done by computing the predictive log-likelihood values (based on the beta, Gaussian, logit-normal or arcsine-normal distribution) on an independently drawn test data set of observations in each replication. To obtain the precision parameter of model (a), the log-likelihood function of the beta distribution was optimized on the sample of observations, after plugging in the conditional mean estimates of the random forest. The variance parameters of the models (b) – (d) were obtained accordingly.
3.3 Comparison to Parametric Beta Regression
The prediction accuracy obtained from the beta forest and the parametric beta regression models (e) – (g) for the 12 scenarios with symmetric outcome (average ) and varying (rows) and (columns) are shown in Figure 2.
It is seen that the performance of the fully specified beta regression model (linear-bR) was comparable to the beta forest in the scenarios with only informative explanatory variables as well as in the scenarios with moderate-dimensional data (first and second column), with a slight advantage of the beta forest in the informative, high precision scenario (, ). Further, the performance of the linear-bR substantially deteriorated with an increasing number of non-informative variables (third and fourth column of Figure 2), strongly favoring the proposed beta forest. The difference becomes even more apparent for smaller values of the precision parameter . For example in the moderate-dimensional, low precision scenario (, ) the average difference in the predictive log-likelihood was 47.3. As was to be expected and throughout all settings, the parametric beta regression model with the predictor specified according to the true data-generating process (true-bR) outperformed all other models. The relatively good performance of the simple int-bR model in the high-dimensional scenarios revealed that the explained variance in these data sets was rather small. For example, the average simulated was in the high-dimensional, moderate precision scenario (, ).
The superiority of the beta forest in the moderate- and high-dimensional scenarios is mainly explained by the variable selection mechanism, which is enforced when fitting random forests. Moreover, the result in the informative, high precision scenario (where the performance is almost equal to the true-bR) stresses that the beta forest adequately accounted for the interaction terms contained in the data-generating model (9).
3.4 Comparison to Random Forest Approaches
Figure 3 shows the prediction acuracy of the random forest approaches (a) – (d) for the 12 scenarios with symmetric outcome (average ). Throughout all scenarios, the beta forest outperformed all competing random forest approaches. Generally, all approaches gained in prediction accuracy with an increasing value of the precision parameter and a decreasing number of non-informative variables. The worst performance was obtained for the classical random forest on the untransformed data (rF), even though the difference vanished with increasing precision . The models based on arcsine square root transformed data (asin-rF) and logit transformed data (logit-rF) performed roughly equal in all scenarios. The largest differences to the beta forest were seen in the four low precision scenarios (first row of Figure 3). For example, in the moderate-dimensional, low precision scenario, the average difference between logit-rF and beta-rF was .
The results for the 12 scenarios with left skewed data (average ) are shown in Figure 5 in Appendix A. One notable difference in these scenarios was the prediction performance of the logit-rF: in the low precision scenarios (first row of Figure 5), the logit-rF yielded comparable results to the beta-rF and even outperformed the beta-rF for in the scenario with only informative explanatory variables and the low-dimensional data. Furthermore, the performance of the logit-rF strongly deviated from the asin-rF in these scenarios.
The largely observed superiority of the beta forest over the competing random forest approach revealed that the beta forest was best able to account for the bounded structure of the outcome variable, which was simulated from a beta distribution.
4 Concluding Remarks
We proposed a random forest approach for modeling bounded outcome variables. The beta forest is an alternative modeling strategy to parametric models if one finds to struggle with the specification of predictor-response relationships in higher dimensional settings. Furthermore, the random forest algorithm provides variable importance measures, which can be used to rank the explanatory variables in terms of their effect on the outcome variable.
The simulation study showed that (i) the beta forest yielded more accurate predictions of bounded outcomes than classical random forest approach based on the MSE as splitting criterion, and (ii) outperformed classical parametric beta regression models in the presence of high-dimensional data, as the method is capable to account for complex interaction patterns and carries out variable selection.
The beta forest is implemented in the user-friendly R add-on package ranger (Wright and Ziegler, 2017) and can therefore easily applied by practitioners.
- Bilcke et al. (2017) Bilcke, J., N. Hens, and P. Beutels (2017). Quality-of-life: a many-splendored thing? Belgian population norms and 34 potential determinants explored by beta regression. Quality of Life Research 26, 2011–2023.
- Breiman (2001) Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
- Breiman et al. (1984) Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). CART: Classification and Regression Trees. New York: Chapman and Hall, Wadsworth.
- Cribari-Neto and Zeileis (2010) Cribari-Neto, F. and A. Zeileis (2010). Beta regression in R. Journal of Statistical Software 34.
- Fang et al. (2018) Fang, K., X. Fan, W. Lan, and B. Wang (2018). Nonparametric additive beta regression for fractional response with application to body fat data. Annals of Operations Research, published online.
- Fang and Ma (2013) Fang, K. and S. Ma (2013). Three-part model for fractional response variables with application to chinese household health insurance coverage. Journal of Applied Statistics 40, 925–940.
- Ferrari and Cribari-Neto (2004) Ferrari, S. and F. Cribari-Neto (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31, 799–815.
- Grün et al. (2012) Grün, B., I. Kosmidis, and A. Zeileis (2012). Extended beta regression in R: shaken, stirred, mixed, and partitioned. Journal of Statistical Software 48, 1–25.
- Hastie et al. (2009) Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning (2nd Ed.). New York: Springer.
- Hunger et al. (2012) Hunger, M., A. Döring, and R. Holle (2012). Longitudinal beta regression models for analyzing health-related quality of life scores over time. BMC Medical Research Methodology 12:144.
- Ishwaran (2007) Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics 1, 519–537.
- Moisen (2008) Moisen, G. (2008). Classification and regression trees. In S. E. Jørgensen; B.D. Fath (Editor-in-Chief). Encyclopedia of Ecology, Volume 1, pp. 582–588. Oxford, UK: Elsevier.
- Schlosser et al. (2018) Schlosser, L., T. Hothorn, R. Stauffer, and A. Zeileis (2018). Distributional regression forests for probabilistic precipitation forecasting in complex terrain. arXiv:1804.02921.
- Simasand et al. (2010) Simasand, A., W. Barreto-Souza, and A. Rocha (2010). Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis 54, 348–366.
- Smithson and Verkuilen (2006) Smithson, M. and J. Verkuilen (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods 11, 54–71.
- Su et al. (2004) Su, X., M. Wang, and J. Fan (2004). Maximum likelihood regression trees. Journal of Computational and Graphical Statistics 13, 586–598.
- Weinhold et al. (2016) Weinhold, L., S. Wahl, S. Pechlivanis, P. Hoffmann, and M. Schmid (2016). A statistical model for the analysis of beta values in DNA methylation studies. BMC Bioinformatics 17:480.
- Wright and Ziegler (2017) Wright, M. N. and A. Ziegler (2017). ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77, 1–17.