Super learning in the SAS system
Abstract
Background and objective Stacking is an ensemble machine learning method that averages predictions from multiple other algorithms, such as generalized linear models and regression trees. A recent iteration of stacking, called super learning, has been developed as a general approach to black box learning and has seen frequent usage, in part due to the availability of an R package. I develop super learning in the SAS software system using a new macro, and demonstrate its performance relative to the R package.
Methods I follow closely previous work using the R SuperLearner package and assess the performance of super learning in a number of domains. I compare the R package with the new SAS macro in a small set of simulations assessing curve fitting in a prediction model, a set of 14 publicly available datasets to assess crossvalidated, expected loss, and data from a randomized trial of job seekers’ training to assess the utility of super learning in causal inference using inverse probability weighting.
Results Across the simulated data and the publicly available data, the macro performed similarly to the R package, even with a different set of potential algorithms available natively in R and SAS. The example with inverse probability weighting demonstrated the ability of the SAS macro to include algorithms developed in R.
Conclusions The super learner macro performs as well as the R package at a number of tasks. Further, by extending the macro to include the use of R packages, the macro can leverage both the the robust, enterprise oriented procedures in SAS and the nimble, cutting edge packages in R. In the spirit of ensemble learning, this macro extends the potential library of algorithms beyond a single software system and provides a simple avenue into machine learning in SAS.
1 Introduction
Supervised machine learning algorithms are emerging as an essential tool for prediction and causal inference in biomedicine. Ensemble machine learning algorithms combine multiple algorithms into a single learner that can improve prediction characteristics such as classification accuracy or mean squared error. One ensemble machine learning method, referred to as stacking, is an approach to combining an arbitrary set of learning algorithms, including other ensemble methodsWolpert (1992); Breiman (1996). A recent approach to stacking, referred to as super learning, has demonstrated theoretical and practical properties that make it an ideal framework for prediction van der Laan et al. (2007); Polley and van der Laan (2010). One of the practical properties is the relative ease of software implementation that has lead to the development of several software packages for super learning. In turn, the availability of software has made the approach relatively simple to implement in some analysis systems Naimi and Balzer (2018).
Existing implementations of Super Learner include an R package Polley et al. (2017), maintained by the developers of the super learning algorithm, and unofficial, open source releases in Python 2Lendle (2014) that has been updated for Python 3 by one of us Keil (2017), and a small open source version in SAS Brooks (2016). Unfortunately, the SAS implementation does not appear to be under active development and has a limited library of algorithms.
I demonstrate usage of super learning in the SAS system using a macro written by us. This macro improves on existing software by providing a general approach to super learning with an extensive existing library that is easily extensible by the user to incorporate new learners, including algorithms from the R programming language. I demonstrate the use of this macro using one simulated example and two examples using real data, and I compare performance with existing implementations. The first two examples closely follow the analyses of Polley and van der Laan, which demonstrated the R SuperLearner package Polley and van der Laan (2010); Polley et al. (2017), while the third example demonstrates some unique features of the SAS macro: namely, the ability to include algorithms from both SAS and R in the same ensemble machine learner.
2 Methods: supervised and super learning
We first provide a brief review of supervised machine learning, and then describe the super learning algorithm.
2.1 Supervised learning
Suppose one is interested in learning about how lung cancer mortality rates vary according to age, smoking, and radon exposures in a population of uranium miners from Colorado in the 1950s. One can frame such learning in terms of a causal inference problem (e.g. what would be the change in the lung cancer mortality rates if one could eliminate smoking among the miners?) or in terms of a prediction problem (e.g. what is the expected lung cancer mortality rate among a nonsmoking 70 year old former miner who was exposed at the Mining Safety and Health Administration occupational exposure limit from ages 20 to 65?). Supervised machine learning is one way to use the inputs (smoking, age, radon exposure) and outputs (lung cancer mortality) as a way to describe or uncover patterns in how relates to . The way these variables relate is through a function that yields the average (or ”predicted”) lung cancer mortality rate for a given pattern of smoking and radon exposure at a given age within the context of our study sample . The parameters determine the shape of the function that relates inputs to outputs. For example could represent logodds ratios if our function is a logistic regression model or it could represent the rules that define the nodes of classification and regression trees.
Using the study data and a some function (e.g. logistic model, regression tree), one “trains” the parameters of that function by choosing the that minimize some estimated expected loss function, which could be the negative loglikelihood (as in maximum likelihood estimation) or some other estimated expected loss function such as the squared sum of the fitted residuals where (as in ordinary leastsquares). In this context, supervised machine learning is the act of training the parameters such that the function carries information about how and relate to each other in the study sample. That is, the model is “learned” from the data by minimizing an estimate of an expected loss function.
2.2 Super learning
Super learning is a way of combining multiple functions, or learners, using crossvalidation. Precise, theoretic descriptions of super learning are given in van der Laan et al. (2007) and Polley and van der Laan (2010), but I review basic principles here using alternative notation. Let be the number of learners in the library (set) of learners, and index each learner as for . I denote predictions from the learners by the vector . For example, M could equal 3 for a continuous and our library could contain a generalized linear model (glm), a generalized additive model (g.a.m.), and a regression tree (tree), and . The super learner prediction is given as a combination of the predictions from the leaners, which can be expressed as in equations 1 and 2.
(1)  
(2) 
I adopt Wolpert’s terminology and refer to the functions as “level0” models, which are essentially regression models for the observed on covariates indexed by parameters ; I refer to the function as a “level1” model in which the observed is regressed on the set of predictions using a model indexed by parameters Wolpert (1992). The backbone for “stacked generalization” was laid out by Wolpert Wolpert (1992) and developed further by Breiman Breiman (1996) using parametric level1 models. The algorithm given in equations 1 and 2 was generalized to arbitrary functions and by van der Laan et al., who allowed that could be, for example, a penalized regression model or a data adaptive approach like random forest van der Laan et al. (2007); this generalization was termed “super learning.” In practice, however, modern super learning algorithms are relatively unchanged from stacking algorithms in place by the late 1990s, which rely on parametric models in which the parameters form a convex combination (i.e. , and ); thus, super learner predictions can often be expressed as weighted combinations of a set of other machine learning algorithms.
Because we do not know the parameters , we must estimate them. In combination with the level1 model, fold crossvalidation is used as a way to estimate without overfitting to the data.
fold crossvalidation proceeds by partitioning the data into equally sized folds, where denotes the th fold, and is the size of the study sample. Typically, is chosen as 10 or 20. I denote a set of crossvalidated predictions by . This notation emphasizes that a crossvalidated prediction for the th fold results from first training the parameters of the function on the remaining folds, denoted by , and then using the trained value to predict , given the values of in the th fold of the study sample, denoted The final set of crossvalidated predictions for learner are given as the set of all N crossvalidated predictions (that is, if is a vector of length , then will also be a vector of length ). The coefficients are estimated in a model of the form , which is essentially a fitted regression model identical to the level1 model above, but crossvalidated predictions from each of the level0 models are used in place of the “true” predictions. That is, the super learning algorithm finds that minimize the estimated expected crossvalidated loss function . The final superlearner prediction is made using predictions from the level1 model with inputs estimated in the full study sample and parameters .
In the remainder of the manuscript, we give three example applications of super learning using the %SuperLearner macro, including one simulation study and two examples with realworld data. The %SuperLearner macro is available from the github page of the author, and a current version of the macro can be enabled in SAS by including the following at the top of a SAS program:
FILENAME sl URL "https://goo.gl/ihuq66";
%INCLUDE sl;
3 Results using the %SuperLearner macro.
3.1 Example 1: Simulation studies
Polley and van der Laan (2010) demonstrated the performance of super learner in a simple simulation problem. This simulation involved learning a regression function characterizing the mean of a continuous variable as a function of a single continuous predictor , given as a uniform random variable with min=4, max=4. was generated under four different scenarios, given by:
 Sim 1

 Sim 2

 Sim 3

 Sim 4

The simulations assess outofsample prediction accuracy for super learner. Specifically, this set of simulations quantifies how well a learning algorithm with parameters estimated in a training set of N=100 could predict outcomes in a validation data set (with the same data generating mechanism) of size 10,000. The metric used is an estimated statistic given by
Polley and van der Laan showed that the optimal value of (the expected value under the true parametric model) is 0.80 for all four simulations, where is estimated in the validation data. The simulations were each repeated 100 times and was calculated using estimated by super learner as well as for each learner in the super learner library.
I repeated Polley and van der Laan’s original simulation analysis using SAS and R with some modifications: I used the R package ‘xgboost’ for boosting (rather than ‘gbm’), and b.a.r.t. was dropped from the library (there is no SAS procedure for b.a.r.t.). The final super learner library contained algorithms for linear regression, linear regression with all first order interaction terms (glm, glm + intx), random forest, bootstrap aggregation of trees (baggingPeters and Hothorn (2017)), generalized additive models (g.a.m.Hastie (2018)), gradient boosting (boostingChen et al. (2018)), neural networks (neural netVenables and Ripley (2002)), multivarate adaptive regression splines (m.a.r.s.Kooperberg (2015)), Bayesian additive regression trees (b.a.r.t.Chipman et al. (2010)), and local polynomial regression (loessCleveland (1991)). Variations of some of these algorithms were added to the super learner library: bagging algorithms with complexity parameters set to 0.0, 0.01, and 0.1 were used, as well as one with a mean split size (ms) of 5; g.a.m. algorithms were created using splines with 2, 3, or 4 degrees of freedom; neural net algorithms were created with 2,3,4, or 5 hidden nodes; finally loess algorithms were created with smoothing parameters set to 0.75, 0.5, 0.25, or 0.1.
An example call to the %SuperLearner macro for the analysis of the simulated data is given in Figure 1. For each algorithm, and for super learner, I estimated the mean and interquartile range of the estimates across the 100 simulations. As shown in Figure 2, super learner performed equally well in both R and SAS. There were few meaningful differences across software platforms in the performance of individual learners, with the exception of loess, which is likely due to in parameterization of the smoothing kernel.
3.2 Example 2: Performance in 14 real world datasets
To extend beyond simple simulation scenarios, I tested the performance of the %SuperLearner macro in 14 real world data sets. The analyses in this section closely follow those reported in section 3.2 of Polley and van der Laan. I used 14 publicly available, real world datasets with continuous target variables and sample sizes between 200 and 700 (Appendix Table A1). The datasets vary in terms of the number of predictors and are from diverse subject matter areas including economic, health, engineering, and biologic data. Some of the datasets used by Polley and van der Laan are no longer available, so only 10 of our 14 datasets were featured in Polley and van der Laan’s analyses.
The objective of these analyses is to assess crossvalidated predictive accuracy, which I quantify using 10fold cross validated meansquared error (CVMSE). I followed Polley and van der Laan by scaling the CVMSE for each learner by the CVMSE for the generalized linear model using . To facilitate comparisons of the average performance of each learner, I calculated the geometric mean of across all datasets for each learner.
In addition to a number of the learners used in the simulation analyses of section 3.1, I also included b spline (basis splines, SAS only), b.a.r.t. (R only), stepwise (stepwise selection of a linear model Hastie and Pregibon (1992)), ridge (ridge regression Hastie and Pregibon (1992)), l.a.s.s.o.(least absolute shrinkage and selection operator Friedman et al. (2010)), random forest Liaw and Wiener (2002), bayes glm (R only) s.v.m (support vector machine regression, R only Meyer et al. (2017)), and d.s.a. (the deletion/substitution/addition algorithm Molinaro et al. (2010), R only).
In both SAS and R, the super learner algorithm had the lowest (best) average of all the algorithms examined (3). Among the other algorithms, g.a.m. performed well under several different parameterizations. Notably, the SAS version of g.a.m. yielded very few estimates that performed worse than a linear model, and seem to have a more consistent average performance than the R version. In contrast with the findings of Polley and van der Laan, b.a.r.t. did not perform the best among the other algorithms, which may be partly due to differences in the packages used (the b.a.r.t. algorithm used by Polley and van der Laan was not available at the time of this analysis) or simply due to differences in the datasets used for each analysis. Interestingly, the l.a.s.s.o. algorithm performed on average better than the linear model in R, but not in SAS. The SAS and R versions of the l.a.s.s.o. differ in terms of how the model is chosen, so the nominal category of the learner will not necessarily dictate its performance. Some individual algorithmic differences aside, there appeared to be no important difference in average across these 14 datasets between Super Learner implemented with nativeSAS procedures and Super Learner implemented with nativeR packages. Within a given dataset, however, they may not yield the same result due to differing performance among the available algorithms.
3.3 Example 3: Super learner estimates of inverse probability weights
One recent application of prediction algorithms is inverse probability weighting. Inverse probability weighting has been used for causal effect estimation in observational studies Robins et al. (2000), generalizing study results from randomized trials Cole and Stuart (2010) and observational studies Keil and Richardson (2017), and addressing potential bias from noncompliance and dependent censoring in clinical trials Robins and Finkelstein (2000); Cain and Cole (2009).
Under assumptions of exchangeability, consistency, positivity, no interference, no measurement error, and correct model specification, inverse probability weighting can be used to correct for selection bias from informative censoring in order to estimate the perprotocol (i.e. perfect adherence) effect of a treatment in a randomized trial. Super learner provides one potential avenue to relax the statistical assumption of correct model specification in the spirit of other machine learning approaches to estimate propensity scores Lee et al. (2010); Westreich et al. (2010); Watkins et al. (2013).
I used the %SuperLearner macro to estimate inverseprobabilityoftreatment and inverseprobabilityofcensoring weights in a randomized trial of a job training protocol among patients with substance abuse disorders. This trial was previously described in detail Svikis et al. (2012). Briefly, 628 un or underemployed patients seeking treatment for substance abuse disorders were randomized either to receive standard of care or to be offered weekly training via a series of small group sessions. Participants were assessed at randomization, as well as 4, 12, and 24 weeks (approximately) following randomization to enquire about current and former employment status, current and prior substance use (including a testing program), reading skills, and demographics. Over the course of the study, 127(20%) participants were losttofollowup.
In our analysis, I included only individuals with nonmissing baseline values for race, selfreported illicit substance use, or reading scores. I censored individuals after their first missed study visit or drug test, and individuals were censored at the time of the scheduled (missed) visit or test. The outcome of interest () is the first selfreported change in employment (gaining employment). Specific dates of employment were not available, so the time of the job change is recorded to be the time of the interview. The treatment of interest is participation in the small group jobtraining sessions. The target parameters are the study arm specific cumulative probabilities of gaining employment during the study period, had all participants remained under followup and compliant to the study arm to which they were assigned. Such estimands are often referred to as the “perprotocol” effects.
3.3.1 Estimating the probability of gaining employment
Inverse probability weighting was used to adjust for chance baseline imbalances in randomization by measured variables (treatment weights Greenland and Robins (2009)) and informative losstofollowup by measured, possibly timevarying variables (censoring weights Robins and Finkelstein (2000)). Using these weights, I estimated nonparametric cumulative distribution curves that estimate the perprotocol cumulative probability of obtaining employment in each study arm Cole and Hernán (2004), and I compared these curves with the estimated curves from an intenttotreat (ITT) analysis which is a crude comparison of estimated probabilities of employment across study arms.
3.3.2 Estimating inverse probability weights using super learner
Following the notation of Section 2, inverse probability of treatment weights were based on estimated probabilities derived from (where is treatment arm) and predictions from each learner depended baseline variables given by age, selfreported number of jobs ever worked, WRAT subtest scores for word and letters, any work in the last 4 weeks, race, the number of positive drug tests (out of 10 different substances), and selfreported everuse of cocaine or heroin. The timevarying censoring weights were based on (where is a censoring indicator at time ). Predictions from each learner depended on all variables included in treatment weights as well as the most recent number of positive drug tests, study arm and day of the study as predictors.
The library of learners for both treatment and censoring weights comprise a logistic model (glm), b.a.r.t., m.a.r.s., boosting, bagging, stepwise, and neural net. I expanded the neural net algorithms in the previous examples to potentially include multiple hidden layers (indexed by #layers, #nodes per layer), which is a simple version of “deep learning” LeCun et al. (2015). The weight used in analysis was calculated using the product of treatment and censoring weights, and weights were stabilized by multiplying by the crude probabilities of treatment and censoring. The loss function for both treatment and censoring level1 models was the negative of the binomial likelihood with a logit link and were constrained to be a convex combination.
A notable feature of this analysis is that the b.a.r.t. algorithm fit via an R routine. However, using the RLANG system option in SAS, it can be called via the %SuperLearner macro using the library member ‘r_bart’ (as can R versions of g.a.m., s.v.m., random forest, boosting, bagging, and m.a.r.s.). Thus, the %SuperLearner macro subsumes many of the learners available in the R SuperLearner package.
3.3.3 Results
As shown in Table 1, the learning algorithms that contributed the greatest to the super learner fit were l.a.s.s.o. (treatment weights) and m.a.r.s. (censoring weights). The strong influence of l.a.s.s.o. on the super learner fit for predicting treatment coincides with prior knowledge that most baseline covariates would not contribute to predictions of treatment (on the logit scale) because treatment is randomized. The other explicit variable selection algorithm in the library, stepwise, had the second lowest expected crossvalidated loss. Such prior knowledge is not available for predicting censoring. Interestingly, b.a.r.t. demonstrated the lowest expected crossvalidated loss of all learners in the library, but did not contribute substantially to the super learner fit. Intuition about regression modeling may provide one explanation: regressors that demonstrate high bivariate correlation with the dependent variable may not display high correlation conditional on other regressors. That is, this may simply be a level1 model analogue of Simpson’s paradox Hernán et al. (2011).
The estimated 10fold cross validated expected loss function was lowest for l.a.s.s.o. for the treatment weights and was lowest for super learner for the censoring weights. Notably, the method used to fit the level1 model of super learner had a noticeable impact on whether or not super learner achieved the minimum (or near minimum) estimated crossvalidated expected loss among the set of learners. For example, using the “NNLS” (the R SuperLearner package default: nonnegative, normalized least squares) method resulted in expected crossvalidated loss for super learner that was higher than 8/14 learners in the library for prediction of censoring (Appendix Table A2). This result emphasizes the utility of using an additional layer of crossvalidation to assess whether super learner adequately fits the data. It also underscores the continued importance of the analyst in evaluating the fit and plausibility of findings from automated algorithms such as super learner Keil and Edwards (2018).
Treatment  Censoring  

model  model  
Learner  Loss  Loss  
super learner  0.695  
glm  0.20  0.703  0.00  0.292 
b.a.r.t.  0.00  0.696  0.00  0.277 
m.a.r.s.  0.10  0.698  0.36  0.280 
boosting  0.00  0.748  0.00  0.290 
bagging  0.18  0.705  0.00  0.280 
stepwise  0.00  0.697  0.00  0.292 
random forest  0.00  0.717  0.00  0.288 
l.a.s.s.o.  0.51  0.00  0.290  
naive Bayes  0.00  0.700  0.19  0.623 
s.v.m.  0.00  0.816  0.18  0.367 
neural net (1,5)  0.00  0.714  0.00  0.284 
neural net (1,15)  0.00  0.699  0.05  0.286 
neural net (2,15)  0.00  0.698  0.12  0.284 
random forest 2  0.02  0.732  0.09  0.303 
Estimated expected 10fold crossvalidated loss: or ; given as the negative of the binomial likelihood with a logit link. Learners with lowest expected loss are denoted in bold.
using the “dbarts” package in R
Random forest using sampling with replacement (SAS default is sampling without replacement)
The mean (sd) of the stabilized weights (estimated using super learner) was 0.93 (0.13), while weights estimated using only logistic models for treatment and censoring had a mean (sd) of 1.03 (0.40). Ideally, a mean stabilized weight will equal 1.0, though it has previously been reported that machine learning methods to estimate propensity scores have resulted in stabilized weights that deviated substantially from 1.0 but nonetheless resulted in treatment effect estimates with low bias relative to misspecified parametric models Lee et al. (2010). As demonstrated in figure 4, the perprotocol effect of the intervention did not differ appreciably from the intent to treat analysis. Based on the small difference between the intent to treat analysis and the perprotocol analysis, bias due to informative dropout by the variables I included as covariates did not appear to be meaningfully large. Statistical analysis using weighted Cox models agreed with graphical analysis which suggested a larger effect of treatment in the latter half of the study (not shown).
4 Conclusions
While the SuperLearner package has been available in R since 2010, there has been no such facility that is generally available in the SAS software system. With the addition of the SAS Enterprise Miner software, the availability of machine learning algorithms in SAS has warranted a need for a way to combine inference from both parametric and data adaptive models. The %SuperLearner macro fills this gap and, as demonstrated, performs similarly to the R package in a number of problems, even with a different set of natively available machine learning algorithms in each software system.
The reliance on SAS places some constraints on the available features of the %SuperLearner macro. Namely, while the %SuperLearner macro can be used to train super learner in one dataset in order to make predictions in another (as in the simulation example shown in figure 1), these processes must currently be done simultaneously. The R package, on the other hand, allows one to save a trained super learner model for making predictions at a later time. The procedural programming oriented approach of SAS make such a feature difficult to implement in SAS. Further, many of the procedures underlying machine learning algorithms in SAS require the (paid) installation of SAS Enterprise Miner (e.g. random forest and neural net) and even basic implementations require SAS/STAT and SAS/OR, which may or may not be included in some SAS installations.
Limitations notwithstanding, the %SuperLearner macro is a power tool that can draw on both the SAS and R systems for machine learning algorithms. Thus, in the spirit of ensemble machine learning algorithms, this approach is appealing in the number of different learners that can be implemented as part of the super learner library. This macro draws on a number of strengths from the SAS system, including the robust, enterprise oriented procedures, bygroup processing, and the default capability to handle datasets that are too large to fit in memory. Rather than replacing the SuperLearner package in R, this macro provides a valuable alternative to researchers more familiar with the SAS system or who use SAS due to enterprise features or collaborative ease.
References
References
 Wolpert (1992) D. H. Wolpert, Stacked generalization, Neural networks 5 (1992) 241–259.
 Breiman (1996) L. Breiman, Stacked regressions, Machine learning 24 (1996) 49–64.
 van der Laan et al. (2007) M. J. van der Laan, E. C. Polley, A. E. Hubbard, Super learner, Report, Division of Biostatistics, University of California, Berkeley, 2007.
 Polley and van der Laan (2010) E. C. Polley, M. J. van der Laan, Super learner in prediction, Report, Division of Biostatistics, University of California, Berkeley, 2010.
 Naimi and Balzer (2018) A. I. Naimi, L. B. Balzer, Stacked generalization: an introduction to super learning, Eur J Epidemiol (2018).
 Polley et al. (2017) E. Polley, E. LeDell, C. Kennedy, M. van der Laan, SuperLearner: Super Learner Prediction, 2017. URL: https://CRAN.Rproject.org/package=SuperLearner, r package version 2.022.
 Lendle (2014) S. Lendle, Supy learner, 2014. URL: https://github.com/lendle/SuPyLearner.
 Keil (2017) A. P. Keil, Supy learner (python 3), 2017. URL: https://github.com/alexpkeil1/SuPyLearner.
 Brooks (2016) J. Brooks, Superlearner sas macro, 2016. URL: https://github.com/BerkeleyBiostats/SuperLearnerSAS.
 Peters and Hothorn (2017) A. Peters, T. Hothorn, ipred: Improved Predictors, 2017. URL: https://CRAN.Rproject.org/package=ipred, r package version 0.96.
 Hastie (2018) T. Hastie, gam: Generalized Additive Models, 2018. URL: https://CRAN.Rproject.org/package=gam, r package version 1.15.
 Chen et al. (2018) T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, xgboost: Extreme Gradient Boosting, 2018. URL: https://CRAN.Rproject.org/package=xgboost, r package version 0.6.4.1.
 Venables and Ripley (2002) W. N. Venables, B. D. Ripley, Modern Applied Statistics with S, fourth ed., Springer, New York, 2002. URL: http://www.stats.ox.ac.uk/pub/MASS4, iSBN 0387954570.
 Kooperberg (2015) C. Kooperberg, polspline: Polynomial Spline Routines, 2015. URL: https://CRAN.Rproject.org/package=polspline, r package version 1.1.12.
 Chipman et al. (2010) H. A. Chipman, E. I. George, R. E. McCulloch, Bart: Bayesian additive regression trees, The Annals of Applied Statistics 4 (2010) 266–298.
 Cleveland (1991) W. S. Cleveland, Local regression models, Statistical Models in S. (1991).
 Hastie and Pregibon (1992) T. J. Hastie, D. Pregibon, Generalized linear models. Chapter 6 of Statistical Models in S, Wadsworth & Brooks/Cole, 1992.
 Friedman et al. (2010) J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (2010) 1–22.
 Liaw and Wiener (2002) A. Liaw, M. Wiener, Classification and regression by randomforest, R News 2 (2002) 18–22.
 Meyer et al. (2017) D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2017. URL: https://CRAN.Rproject.org/package=e1071, r package version 1.68.
 Molinaro et al. (2010) A. M. Molinaro, K. Lostritto, M. J. van der Laan, partdsa: Deletion/substitution/addition algorithm for partitioning the covariate space in prediction, Bioinformatics 26 (2010) 1357–63.
 Robins et al. (2000) J. M. Robins, M. A. Hernán, B. Brumback, Marginal structural models and causal inference in epidemiology, Epidemiology 11 (2000) 550–60.
 Cole and Stuart (2010) S. R. Cole, E. A. Stuart, Generalizing evidence from randomized clinical trials to target populations: The actg 320 trial, American journal of epidemiology 172 (2010) 107–115.
 Keil and Richardson (2017) A. P. Keil, D. B. Richardson, Quantifying cancer risk from radiation, Risk Analysis (2017).
 Robins and Finkelstein (2000) J. M. Robins, D. M. Finkelstein, Correcting for noncompliance and dependent censoring in an aids clinical trial with inverse probability of censoring weighted (ipcw) logrank tests, Biometrics 56 (2000) 779–788.
 Cain and Cole (2009) L. E. Cain, S. R. Cole, Inverse probabilityofcensoring weights for the correction of timevarying noncompliance in the effect of randomized highly active antiretroviral therapy on incident aids or death, Statistics in medicine 28 (2009) 1725–1738.
 Lee et al. (2010) B. K. Lee, J. Lessler, E. A. Stuart, Improving propensity score weighting using machine learning, Statistics in medicine 29 (2010) 337–346.
 Westreich et al. (2010) D. Westreich, J. Lessler, M. J. Funk, Propensity score estimation: neural networks, support vector machines, decision trees (cart), and metaclassifiers as alternatives to logistic regression, Journal of clinical epidemiology 63 (2010) 826–833.
 Watkins et al. (2013) S. Watkins, M. JonssonFunk, M. A. Brookhart, S. A. Rosenberg, T. M. O’Shea, J. Daniels, An empirical comparison of treebased methods for propensity score estimation, Health Serv Res 48 (2013) 1798–817.
 Svikis et al. (2012) D. S. Svikis, L. KeyserMarcus, M. Stitzer, T. Rieckmann, L. Safford, P. Loeb, T. Allen, C. LunaAnderson, S. E. Back, J. Cohen, et al., Randomized multisite trial of the job seekersÂ workshop in patients with substance use disorders, Drug & Alcohol Dependence 120 (2012) 55–64.
 Greenland and Robins (2009) S. Greenland, J. M. Robins, Identifiability, exchangeability and confounding revisited, Epidemiol Perspect Innov 6 (2009) 4.
 Cole and Hernán (2004) S. R. Cole, M. A. Hernán, Adjusted survival curves with inverse probability weights, Comput Methods Programs Biomed 75 (2004) 45–9.
 LeCun et al. (2015) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (2015) 436.
 Hernán et al. (2011) M. A. Hernán, D. Clayton, N. Keiding, The simpson’s paradox unraveled, Int J Epidemiol 40 (2011) 780–5.
 Keil and Edwards (2018) A. P. Keil, J. K. Edwards, You are smarter than you think:(super) machine learning in context, European journal of epidemiology (2018) 1–4.
 Cook and Weisberg (2009) R. D. Cook, S. Weisberg, An introduction to regression graphics, volume 405, John Wiley & Sons, 2009.
 Penrose et al. (1985) K. W. Penrose, A. Nelson, A. Fisher, Generalized body composition prediction equation for men using simple measurement techniques, Medicine & Science in Sports & Exercise 17 (1985) 189.
 James et al. (2013) G. James, D. Witten, T. Hastie, R. Tibshirani, An introduction to statistical learning, volume 112, Springer, 2013.
 Berndt (1991) E. R. Berndt, The practice of econometrics: classic and contemporary, Addison Wesley Publishing Company, 1991.
 Kibler et al. (1989) D. Kibler, D. W. Aha, M. K. Albert, Instancebased prediction of realvalued attributes, Computational Intelligence 5 (1989) 51–57.
 Harrell (2001) F. E. Harrell, Regression modeling strategies, with applications to linear models, survival analysis and logistic regression, GET ADDRESS: Springer (2001).
 Chu (2001) S. Chu, Pricing the c’s of diamond stones, Journal of Statistics Education 9 (2001).
 Rosner (2015) B. Rosner, Fundamentals of biostatistics, Nelson Education, 2015.
 Harrison Jr and Rubinfeld (1978) D. Harrison Jr, D. L. Rubinfeld, Hedonic housing prices and the demand for clean air, Journal of environmental economics and management 5 (1978) 81–102.
 Cook (2009) R. D. Cook, Regression graphics: Ideas for studying regressions through graphics, volume 482, John Wiley & Sons, 2009.
 Gelman et al. (2014) A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin, Bayesian data analysis, volume 2, CRC press Boca Raton, FL, 2014.
 Carroll et al. (1997) R. J. Carroll, D. Ruppert, L. A. Stefanski, J. Buonaccorsi, Measurement error in nonlinear models, Metrika 45 (1997) 182–183.
 Western (1996) B. Western, Vague theory and model uncertainty in macrosociology, Sociological Methodology (1996) 165–192.
Appendix A Appendix
Data  N  p  Citation 

ais  202  10  Cook and Weisberg (2009) 
bodyfat  252  14  Penrose et al. (1985) 
cholesterol  297  13  James et al. (2013) 
cps78  550  18  Berndt (1991) 
cps85  534  18  Berndt (1991) 
cpu  209  6  Kibler et al. (1989) 
diabetes  375  15  Harrell (2001) 
diamond  308  17  Chu (2001) 
fev  654  4  Rosner (2015) 
house  506  13  Harrison Jr and Rubinfeld (1978) 
mussels  201  3  Cook (2009) 
presidential  591  20  Gelman et al. (2014) 
sat  339  4  Carroll et al. (1997) 
strike  625  6  Western (1996) 
number of predictors
does not appear in Polley and van der Laan (2010).
Treatment  Censoring  

model  model  
Learner  Loss  Loss  
super learner  0.250  0.079  
glm  0.32  0.251  0.00  0.079 
b.a.r.t.  0.00  0.251  0.00  
m.a.r.s.  0.13  0.251  0.41  0.079 
boosting  0.00  0.273  0.00  0.081 
bagging  0.13  0.255  0.00  0.079 
stepwise  0.00  0.252  0.00  0.079 
random forest  0.00  0.260  0.04  0.080 
l.a.s.s.o.  0.40  0.00  0.079  
naive Bayes  0.00  0.253  0.14  0.221 
s.v.m.  0.00  0.296  0.00  0.084 
neural net (1,5)  0.00  0.257  0.00  0.079 
neural net (1,15)  0.00  0.252  0.00  0.079 
neural net (2,15)  0.00  0.252  0.40  0.078 
random forest 2  0.02  0.266  0.02  0.083 
Estimated expected 10fold crossvalidated loss: or ; given as the mean squared error. Learners with lowest expected loss are denoted in bold.
using the “dbarts” package in R
Random forest using sampling with replacement (SAS default is sampling without replacement)