# A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation


University of Chicago Booth School of Business 1

Matt Gardner

EBay

Liyun Chen

EBay

David Draper

University of California, Santa Cruz

Abstract: Randomized controlled trials play an important role in how Internet companies predict the impact of policy decisions and product changes. In these ‘digital experiments’, different units (people, devices, products) respond differently to the treatment. This article presents a fast and scalable Bayesian nonparametric analysis of such heterogeneous treatment effects and their measurement in relation to observable covariates. New results and algorithms are provided for quantifying the uncertainty associated with treatment effect measurement via both linear projections and nonlinear regression trees (CART and Random Forests). For linear projections, our inference strategy leads to results that are mostly in agreement with those from the frequentist literature. We find that linear regression adjustment of treatment effect averages (i.e., post-stratification) can provide some variance reduction, but that this reduction will be vanishingly small in the low-signal and large-sample setting of digital experiments. For regression trees, we provide uncertainty quantification for the machine learning algorithms that are commonly applied in tree-fitting. We argue that practitioners should look to ensembles of trees (forests) rather than individual trees in their analysis. The ideas are applied on and illustrated through an example experiment involving 21 million unique users of EBay.com.


## 1 Introduction

The Internet is host to a massive amount of experimentation. Online companies are constantly experimenting with changes to the ‘user’ experience. Randomized controlled trials are particularly common; they are referred to within technology companies as ‘A/B testing’ for the random assignment of control (option A) and treatment (option B) to experimental units (often users, but also products, auctions, or other dimensions). The treatments applied can involve changes to choice of the advertisements a user sees, the flow of information to users, the algorithms applied in product promotion, the pricing scheme and market design, or any aspect of website look and function. EBay, the source of our motivating application, experiments with these and other parts of the user experience with the goal of making it easier for buyers and sellers of specific items to find each other.

Heterogeneous treatment effects (HTE) refer to the phenomenon where the treatment effect for any individual user – the difference between how they would have responded under treatment rather than control – is different from the average. It is self-evident that HTE exist: different experimental units (people, products, or devices) each have unique responses to treatment. The task of interest is to measure this heterogeneity. Suppose that for each user $i$ with response $y_i$, in either control or treatment, $d_i = 0$ or $d_i = 1$ respectively, there are available some pre-experiment attributes, $x_i$. These attributes might be related to the effect of $d_i$ on $y_i$. For example, if $y_i$ is during-experiment user spend, then $x_i$ could include pre-experiment spend by user $i$ on the company website. We can then attempt to index the HTE as a function of $x_i$.

Digital (i.e., Internet) A/B experiments differ from most prior experimentation in important ways. First, the sample sizes are enormous. Our example EBay experiment (described in Section 2) has a sample size of over 21 million unique users. Second, the effect sizes are tiny. Our example treatment – increasing the size of product images – has response standard deviation around 1000 times larger than the estimated treatment effect. Finally, the response of interest (some transaction, such as user clicks or money spent) tends to be distributed with a majority of mass at zero, density spikes at other discrete points such as 1 or 99, a long tail, and variance that is correlated with available covariates. These data features – large samples, tiny effects that require careful uncertainty quantification, and unusual distributions that defy summarization through a parametric model – provide a natural setting for nonparametric analysis.

This article proposes a scalable framework for Bayesian nonparametric analysis of heterogeneous treatment effects. Our approach, detailed in Section 3, has two main steps.

1. Choose some statistic that is useful for decision making, regardless of the true data distribution. For example, this could be a difference in means between two groups.

2. Quantify uncertainty about this statistic as induced by the posterior distribution for a flexible Bayesian model of the data generating process (DGP).

This differs from the usual Bayesian nonparametric analysis strategy, in which a model for the data generating process is applied directly in prediction for future observations. See Hjort et al. (2010) and Taddy and Kottas (2010) for examples. In contrast, we consider a scenario where there is a given statistic that will be used in decision making regardless of the true DGP. The role of Bayesian modeling is solely to quantify uncertainty about this statistic. For example, in Section 5 we study regression trees; you do not need to believe that your data were generated from a tree in order for regression trees to be useful for prediction. Our statistic in step 1 is then the output of a tree-fitting algorithm, and in step 2 we seek to evaluate the stability of this algorithm under uncertainty about the true DGP.

We refer to this style of analysis as distribution-free Bayesian nonparametrics, in analogy to classical distribution-free statistics (e.g., as in Hollander and Wolfe, 1999) whose null-hypothesis distribution can be derived under minimal assumptions on the DGP (or without any assumptions at all, as in the case of the rank-sum test of Wilcoxon, 1945). In both Bayesian and classical setups, the statistic of interest is decoupled from assumptions about the DGP. One advantage of this strategy in the Bayesian context is that it allows us to apply a class of simple but flexible DGP models whose posterior can be summarized analytically or easily sampled via a bootstrap algorithm. That is, we can avoid the computationally intensive Markov chain Monte Carlo algorithms that are usually required in Bayesian nonparametric analysis (and which currently do not scale for data of the sizes encountered in digital experiments). Moreover, decoupling the tool of interest from the DGP model lets us provide guidance to practitioners without requiring them to change how they are processing data and the statistics that they store.

Our Bayesian DGP model, detailed in Section 3, treats the observed sample as a draw from a multinomial distribution over a large but finite set of support points. We place a Dirichlet prior on the probabilities in this multinomial, and the posterior distribution over possible DGPs is induced by the posterior on these probabilities. This Dirichlet-multinomial setup was introduced in Ferguson (1973) as a precursor to his Dirichlet process model (which has infinite support). We are not the first to propose its use in data analysis. Indeed, it is the foundation for the Bayesian bootstrap of Rubin (1981), and Chamberlain and Imbens (2003) provide a nice survey of some econometric applications. Similarly, our goal is to show the potential for distribution-free Bayesian nonparametric analysis of treatment effects in massive datasets and to understand its implications for some common practices.
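To make the two-step recipe concrete, the following is a minimal sketch of a Bayesian bootstrap for one simple statistic, the difference in weighted means between treatment and control. It assumes the limiting case of the Dirichlet prior in which posterior weights are Dirichlet(1, ..., 1), simulated here by normalizing independent Exp(1) draws; the toy data, the function names, and the number of draws are illustrative assumptions rather than the authors' implementation.

```python
import random
import statistics

def dirichlet_weights(n, rng):
    """One draw of Dirichlet(1, ..., 1) weights via normalized Exp(1) variables."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [v / total for v in g]

def weighted_mean(y, w):
    """Mean of y under observation weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def bayesian_bootstrap_diff(y_c, y_t, draws=2000, seed=0):
    """Posterior sample for the treatment-minus-control difference in means.

    Treatment and control weights are drawn independently of each other.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(draws):
        w_c = dirichlet_weights(len(y_c), rng)
        w_t = dirichlet_weights(len(y_t), rng)
        sample.append(weighted_mean(y_t, w_t) - weighted_mean(y_c, w_c))
    return sample

def toy_response(rng, shift):
    """Hypothetical response: a spike at zero plus a long right tail."""
    return 0.0 if rng.random() < 0.7 else rng.expovariate(0.05) + shift

data_rng = random.Random(1)
y_c = [toy_response(data_rng, 0.0) for _ in range(300)]
y_t = [toy_response(data_rng, 1.0) for _ in range(300)]

posterior = bayesian_bootstrap_diff(y_c, y_t)
post_mean = statistics.mean(posterior)
post_sd = statistics.stdev(posterior)
print(f"posterior mean {post_mean:.3f}, posterior sd {post_sd:.3f}")
```

No MCMC is involved: each posterior draw costs one pass over the data, which is what makes the approach attractive at the sample sizes discussed above.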

After introducing our motivating dataset in Section 2 and the multinomial-Dirichlet framework in Section 3, the remainder of the paper is devoted to working through the analysis of two application areas and illustrating the results on our EBay experiment data. First, Section 4 considers linear least-squares HTE projections and their use in adjusting the measurement of average treatment effects. We provide approximations for the posterior mean and variance associated with regression-adjusted treatment effects, and the results predict a phenomenon that we have observed repeatedly in practice: the influence of linear regression adjustment tends to be vanishingly small in the low-signal and large-sample setting of digital experiments. Second, Section 5 considers the use of the CART (Breiman et al., 1984) algorithm for partitioning of units (users) according to their treatment effect heterogeneity. This is an increasingly popular strategy (e.g., Athey and Imbens, 2015), and we demonstrate that there can be considerable uncertainty associated with the resulting partitioning rules. As a result, we advocate and analyze ensembles of trees that allow one to average over and quantify posterior uncertainty.

## 2 Data

First, some generic notation. For each independent experimental unit $i$, which we label a ‘user’ following our eBay example, there is a response $y_i$, a binary treatment indicator $d_i$ with $d_i = 0$ (in control) and $d_i = 1$ (in treatment), and a length-$p$ covariate vector $x_i$. (We focus on scalar binary treatment for ease of exposition, but it is straightforward to generalize our results to multi-factor experiments.) There are $n_c$ users in control, $n_t$ in treatment, and $n = n_c + n_t$ in total. The user feature matrices for control and treatment groups are $X_c$ and $X_t$, respectively, and these are accompanied by response vectors $y_c$ and $y_t$. Stacked features and response are $X = [X_c' \; X_t']'$ and $y = [y_c' \; y_t']'$, so that the first $n_c$ rows are in control and the remaining $n_t$ rows are treated.

Our example experiment involves 21 million users of the website EBay.com, randomly assigned 2/3 in treatment and 1/3 in control over a five week period. The treatment of interest is a change in image size for items in a user’s myEBay page – a dashboard that keeps track of items that the user has marked as interesting. In particular, the pictures are increased from pixels in control to pixels for the treated.

## 5 Regression tree prediction for HTE

Regression trees partition the feature (covariate) space into regions of response homogeneity, such that the response associated with any point in a given partition can be predicted from the average response of its neighbors. The partitions are typically formed through a series of binary splits, and after this series of splits each terminal leaf node contains a rectangular subset of the covariate support. Advantages of using trees as prediction rules include that they can model a response that is nonlinear in the original covariates, they can represent complex interactions (every variable that is split upon is interacting with those above and below it in the tree), and they allow for error variance that changes with the covariates (there is no homoscedasticity restriction across leaves). Implemented as part of Random Forest (Breiman, 2001) or gradient boosting machine (Friedman, 2001) ensembles, trees play a central role in contemporary industrial machine learning, to an extent that is difficult to overstate.

The CART algorithm of Breiman et al. (1984) is the most common and successful recipe for building trees. It grows greedily and recursively: for a given node (subset of data), a split location is chosen to minimize some impurity (sum-squared-error for regression trees) across the two resulting children; this splitting procedure is repeated on each child, and hence recursively until the algorithm encounters a stopping rule (e.g., if a new child contains fewer observations than a specified minimum leaf size). After the tree is fit, it is common to use cross-validation to prune it by evaluating whether the splits near to the leaves improve out-of-sample prediction and removing those that do not. Variations on CART include the random removal of input dimensions as candidates for the split location at each impurity minimization. The resulting prediction rule is then the average across repeated runs of this randomized-input CART (e.g., see Breiman, 2001). Introduction of such stochasticity can improve upon the performance of greedy search in datasets where, e.g., you have high-dimensional inputs.

To quantify uncertainty for CART, we study a population CART algorithm that (analogously to the population OLS in (5)) optimizes over a realization of our Bayesian nonparametric DGP model from (2). Consider a node $\eta$, containing a subset of the data indices $\{1, \ldots, n\}$. This node is to be partitioned into two child nodes according to a binary split on one of the covariate locations: a split on input $j$ of observation $k$, say $x = x_{kj}$, so that the two resulting child nodes are $\mathrm{left}(\eta, j, x)$ and $\mathrm{right}(\eta, j, x)$. Given a realization of the DGP weights $\theta$, the population CART algorithm chooses $(j, x)$ to minimize

$$
E_{\mathrm{left}(\eta,j,x)}(\theta) + E_{\mathrm{right}(\eta,j,x)}(\theta), \tag{14}
$$

where for a generic node the impurity (error) is

$$
E_s(\theta) = \sum_{i \in s} \theta_i (y_i - \mu_s)^2 \quad \text{with} \quad \mu_s = y_s'\theta_s / |\theta_s|. \tag{15}
$$

As in sample CART, this splitting is repeated recursively until we encounter a stopping rule. For randomized-input versions of CART, one minimizes the same DGP-dependent impurity in (14) but over a random subset of candidate split dimensions $j$. The statistic of interest is then itself a random object: we have decoupled variability due to uncertainty about the DGP from algorithmic stochasticity that does not diminish as you accumulate data.
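The single splitting step in (14) and (15) can be sketched as an exhaustive search over observed split locations. This is only an illustration under stated assumptions: the convention that the left child takes observations with values less than or equal to the split point, the `min_leaf` stopping rule, the function names, and the toy data are all hypothetical choices, and a production implementation would sort once per input rather than rescanning the node.

```python
def node_impurity(idx, y, theta):
    """Weighted squared error around the weighted node mean, as in (15)."""
    mass = sum(theta[i] for i in idx)
    if mass == 0:
        return 0.0
    mu = sum(theta[i] * y[i] for i in idx) / mass
    return sum(theta[i] * (y[i] - mu) ** 2 for i in idx)

def best_split(node, X, y, theta, min_leaf=1):
    """Return (loss, j, x) minimizing the summed child impurity in (14).

    Assumed convention: left child has X[i][j] <= x, right child has X[i][j] > x.
    """
    best = None
    n_inputs = len(X[0])
    for j in range(n_inputs):
        for x in sorted({X[i][j] for i in node}):
            left = [i for i in node if X[i][j] <= x]
            right = [i for i in node if X[i][j] > x]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # stopping rule: skip splits that create tiny leaves
            loss = node_impurity(left, y, theta) + node_impurity(right, y, theta)
            if best is None or loss < best[0]:
                best = (loss, j, x)
    return best

# Toy data: the response jumps when input 0 exceeds 3; input 1 is noise.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
y = [0, 0, 0, 10, 10, 10]
node = list(range(len(y)))
theta = [1.0] * len(y)  # equal weights reproduce the ordinary sample split

split = best_split(node, X, y, theta)
print(split)
```

Passing a different `theta` (one realization of the DGP weights) and recursing on the two children gives one population CART fit; repeating over many weight draws gives a posterior sample of trees.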

The posterior over trees (i.e., over CART fits) can be sampled via the Bayesian bootstrap of Section 3.1. This leads to a posterior sample of trees that we label a Bayesian Forest. The algorithm is studied in detail in Taddy et al. (2015), along with an Empirical Bayes approximation for distributed computation across many machines. Taddy et al. (2015) demonstrate that the average prediction from a Bayesian Forest (i.e., the posterior mean) outperforms prediction from a single CART tree (with cross-validated pruning) and many other common tree-based prediction algorithms. Of particular interest, Bayesian Forests tend to perform similarly to, although usually slightly better than, Random Forests. The only difference between the two algorithms is that while the Bayesian Forest uses independent observation weights, the Random Forest draws a vector of discrete weights from a multinomial distribution with probability $1/n$ and size $n$ (this is the frequentist nonparametric bootstrap).
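The contrast between the two weighting schemes can be seen in a few lines. The sketch below assumes that the Bayesian bootstrap weights are independent Exp(1) draws (the impurity minimization is unaffected by rescaling the weights), while the Random Forest weights are multinomial resampling counts; the function names are hypothetical.

```python
import random
from collections import Counter

def bayesian_forest_weights(n, rng):
    """Independent continuous weights, one per observation (assumed Exp(1))."""
    return [rng.expovariate(1.0) for _ in range(n)]

def random_forest_weights(n, rng):
    """Multinomial counts with size n and probability 1/n per observation."""
    counts = Counter(rng.randrange(n) for _ in range(n))
    return [counts[i] for i in range(n)]

rng = random.Random(0)
n = 10
w_bayes = bayesian_forest_weights(n, rng)
w_rf = random_forest_weights(n, rng)

print([round(w, 2) for w in w_bayes])  # every observation keeps some weight
print(w_rf)                            # integer counts; some observations get 0
```

The practical difference is that the multinomial draw drops some observations entirely from each tree, whereas the continuous weights keep every observation in play.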

### 5.1 Single tree prediction for HTE

Bayesian Forests provide uncertainty quantification for a machine learning algorithm – CART – that is often viewed as a black-box. Recently published applications of regression trees in HTE prediction include Foster et al. (2011) for medical clinical trials and Dudík et al. (2011) for user browsing behavior, and we have observed that CART is commonly employed in industry for the segmentation of customers according to their response to advertisement and promotions.

Athey and Imbens (2015) study various strategies for the use of CART-like algorithms in prediction of HTE. We will focus on their transformed outcome tree (TOT) method, which is simply the application of CART in prediction of a transformed response, $y_i^\star$, which has expected value equal to the treatment effect of interest. In the language of the Neyman-Rubin causal model (e.g., Rubin, 2005), each unit of observation in an experiment is associated with two potential outcomes: $\upsilon_i(c)$, their response if they are allocated to the control group; and $\upsilon_i(t)$, their response under treatment. Of course, only one of these two potential outcomes is ever realized and observed: $y_i = d_i \upsilon_i(t) + (1 - d_i) \upsilon_i(c)$. Athey and Imbens (2015) define

$$
y_i^\star = y_i \frac{d_i - q}{q(1-q)}, \tag{16}
$$

where $q$ is the probability of treatment ($q = 2/3$ in our EBay example). Then, with $\mathbb{E}_d$ denoting expectation over unknown independent treatment allocation $d_i$,

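$$
\mathbb{E}_d\left[y_i^\star \mid \upsilon_i\right] = q\,\upsilon_i(t)\frac{1-q}{q(1-q)} - (1-q)\,\upsilon_i(c)\frac{q}{q(1-q)} = \upsilon_i(t) - \upsilon_i(c). \tag{17}
$$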

Thus a tree that is trained to fit the expectation for $y_i^\star$ can be used to predict the treatment effect.
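The identity in (17) can be checked directly for a single unit. The sketch below uses hypothetical potential outcomes (3.0 under control, 5.5 under treatment) and the treatment probability of 2/3 from the example experiment; the function names are illustrative.

```python
def transformed_outcome(y, d, q):
    """Transformed response from (16) for observed response y and allocation d."""
    return y * (d - q) / (q * (1 - q))

def expected_transformed_outcome(v_c, v_t, q):
    """Average (16) over the allocation: d = 1 with probability q, else d = 0."""
    under_treatment = transformed_outcome(v_t, 1, q)  # observe v_t when d = 1
    under_control = transformed_outcome(v_c, 0, q)    # observe v_c when d = 0
    return q * under_treatment + (1 - q) * under_control

q = 2 / 3
v_c, v_t = 3.0, 5.5  # hypothetical potential outcomes for one unit

print(transformed_outcome(v_t, 1, q))             # realized value if treated
print(transformed_outcome(v_c, 0, q))             # realized value if in control
print(expected_transformed_outcome(v_c, v_t, q))  # equals v_t - v_c up to rounding
```

Note how far each realized value sits from the effect it estimates on average: the transformed response is unbiased for the treatment effect but very noisy for any one unit, which is consistent with the unstable tree structure reported in the example that follows.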

#### Example: posterior uncertainty for transformed outcome trees

Sample-fit TOT trees for our EBay example experiment are shown in Figure 2, fit to the data accumulated through one and five weeks of experimentation. The leaf nodes are marked with the corresponding prediction rule: the mean of $y_i^\star$ over the observations in the leaf, which is an estimate for the treatment effect $\upsilon_i(t) - \upsilon_i(c)$ following (17). Recall that these are dollar-value effects. We fit all of our trees and forests via adaptations of MLlib’s decision tree methods in Apache Spark, and in this case the algorithm stops at either a maximum depth of 5 or a minimum leaf size of 100,000 users. Athey and Imbens (2015) recommend the use of cross-validated pruning for TOT tree fitting. In our examples, cross-validation selects deeper trees than those shown in Figure 2, such that these can be viewed as the trunks of some more complex optimal sample TOT.

The population version of the TOT algorithm simply replaces $y_i$ with $y_i^\star$ in the impurity minimized in (14), and a posterior sample over TOT trees – a Bayesian Forest – is obtained via Bayesian bootstrapping as described above. We fit Bayesian Forests of 1000 trees to study the uncertainty associated with the TOT trees in Figure 2. For the variables split upon in each sample TOT tree, Table 2 contains the posterior probability that each variable is split upon, at or above a given depth, for a new realization of the DGP. This is simply the proportion of trees in the forest in which such splits occur. The internal decision nodes in Figure 2 are colored according to these probabilities.
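The split probabilities described here are simple proportions over the forest. As a minimal sketch, suppose each fitted tree is summarized as a list of `(depth, variable)` pairs with the root at depth 1; this representation, the four-tree toy forest, and the function name are hypothetical, although the variable names are taken from the example.

```python
def split_probability(forest, variable, max_depth):
    """Share of trees that split on `variable` at depth <= max_depth."""
    hits = sum(
        any(v == variable and d <= max_depth for d, v in tree)
        for tree in forest
    )
    return hits / len(forest)

# Toy forest: each tree lists its internal splits as (depth, variable).
forest = [
    [(1, "lastmonth"), (2, "Bids Other")],
    [(1, "Bids Other"), (2, "lastmonth"), (2, "Bids Fashion")],
    [(1, "lastmonth"), (2, "Bids Fashion")],
    [(1, "Bids Fashion"), (2, "Bids Other")],
]

p_root = split_probability(forest, "lastmonth", max_depth=1)
p_top2 = split_probability(forest, "lastmonth", max_depth=2)
print(p_root, p_top2)
```

With a forest sampled via the Bayesian bootstrap, these proportions are Monte Carlo estimates of the posterior probability that a variable appears in the top of the tree.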

After one week and 7.45 million users, only Bids Fashion – the number of bids on ‘fashion’ items – is split upon with greater than 1/3 probability at a depth . After observing 5 weeks of purchasing from 13.22 million users, the structure is more stable: five variables occur in more than 1/3 of depth-5 trees. Two variables occur with probability greater than 1/2: the lastmonth indicator, for whether the user made a purchase in the past month; and Bids Other, the number of bids on un-categorized items (the split location for Bids Other was always between 6 and 10). We could hence, say, partition users into four groups according to the splits on lastmonth and Bids Other and have a better than 1/2 chance that for any posterior DGP realization a similar partitioning would be included in the top of the corresponding TOT fit.

However, even after 5 weeks, there remains considerable uncertainty associated with the full tree structure. For example, the very first (root) split in the sample tree occurs in only 40% of depth-5 trees, it targets a small subset of the data (less than 1% of users had SI Fashion ), and it predicts an extreme treatment effect for this subset (-\$38.77 as the effect of slightly larger images). Moreover, at a depth of 5 all variables except for lastpurch have low split probability. This uncertainty contradicts the examples and results in Taddy et al. (2015), which finds high posterior probability for the trunks of CART fits and takes advantage of this stability for efficient computation. We hypothesize that the difference here is due to the tiny signal available for prediction of HTE in our (and probably many other) digital experiments.

### 5.2 Bayesian Forest HTE prediction

A single CART fit is a fragile object; even if the trunks are more stable than we find in the example above, deep tree structure will have near-zero posterior probability. Splits that cross-validated pruning finds useful for out-of-sample prediction will often disappear under small jitter to the dataset. See Breiman (1996) for the classic study of this phenomenon, which was the motivation for his Random Forest algorithm. Breiman showed that by averaging across many trees, each individually unstable and over-fit, he could obtain a response surface that was both stable and a strong performer in out-of-sample prediction. The act of averaging removes noisy structure that exists in only a small number of trees, and it smooths across uncertainty, e.g., about split locations or the order of nodes in a tree path.

From our perspective, a Bayesian Forest (which is nearly equivalent to a Random Forest) is a posterior over CART predictors. The average leaf value associated with a given $x$ is the posterior mean prediction rule. This contrasts with the predictions implied by the single sample CART tree, which is the CART prediction rule at the posterior mean DGP, where every observation receives equal weight. Experience shows that this can make a big difference: the forest average response surface will be different from and provide better prediction than the sample CART tree. (More generally, see Clyde and Lee (2001) for discussion on the Bayesian bootstrap and model averaging.)

For our final example, we consider the posterior distribution on the difference between two prediction rules: CART fit to each of treatment and control DGPs. Write $\hat y_d(x)$ for the prediction rule at $x$ resulting from population CART fit to support $\{X_d, y_d\}$ with weights $\theta_d$. That is, if the realized CART fit allocates $x$ to the leaf node containing observations in set $s$, then $\hat y_d(x) = \mu_s$ as described in (15). Thus $\hat y_d(x)$ is a random variable and so is the predicted treatment effect

$$
\hat y_t(x) - \hat y_c(x). \tag{18}
$$

As in Section 4, the DGPs for treatment and control are independent from each other and we can obtain posterior samples of (18) via separate Bayesian Forests for each treatment group.

The framework implied by (18) is related to a semi-parametric literature (e.g. Hill, 2011; Green and Kern, 2012; Grimmer et al., 2013; Imai and Ratkovic, 2013) that studies the difference between flexible regression functions in each of the treatment groups. In a prominent example, Hill (2011) applies Bayesian additive regression trees (BART; Chipman et al., 2010) and interprets the difference between posterior predictive distributions across treatment groups as effect heterogeneity. In contrast to our approach, where the trees are just a convenient prediction rule and we do not assume that the data were actually generated from a tree, Hill assumes that her regression functions are representative of the true underlying DGP. Which strategy is best will depend upon your application. For example, BART includes a homogeneous Gaussian additive error and is thus inappropriate for the heteroscedastic errors in Internet transaction data. BART is outperformed by forest algorithms in such settings (see, e.g., Taddy et al., 2015), but will outperform the forests when the homoscedasticity assumption is more valid.

Due to the similarity between Random and Bayesian Forests, our approach is also related to recent work by Wager and Athey (2015) on the use of Random Forests in HTE estimation. Wager and Athey use the forests to construct confidence intervals for a true treatment effect surface. This is more ambitious than our contribution, which interprets the forest as a posterior distribution for optimal prediction of treatment effects within a certain class of algorithms. Indeed, Wager and Athey are studying the frequentist properties of our posterior mean.

#### Example: Differenced treatment group forests

Returning to our EBay example, we focus on HTE prediction for the completed experiment including 13.22 million users over 5 weeks. Bayesian Forests of 1000 trees each were fit to the treatment and control group samples. Each forest is the posterior distribution over a population-CART algorithm run with maximum depth of 10 and no minimum leaf size. CART was applied without any random variable subsetting; hence, variability in the resulting prediction surface is due entirely to posterior uncertainty about the DGP.

The outcome is a posterior sample over prediction rules for the conditional average treatment effects, as in (18). Figure 3 shows four example posteriors for individual user treatment effects; this type of uncertainty quantification is available for any new user whose treatment effect you wish to predict. It is also possible to summarize, for a given DGP realization, the average treatment effect conditional on variable $j$ being in set $\mathcal{X}$ as

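$$
\hat y_t^j(\mathcal{X}) - \hat y_c^j(\mathcal{X}) := \frac{\sum_{i:\, x_{ij} \in \mathcal{X}} \theta_i \left(\hat y_t(x_i) - \hat y_c(x_i)\right)}{\sum_{i:\, x_{ij} \in \mathcal{X}} \theta_i}. \tag{19}
$$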

The sum in (19) is over all observations in the sample (both treatment and control groups) that have their feature $x_{ij}$ in $\mathcal{X}$. Figure 4 shows change in the posterior distributions for conditional average treatment effects corresponding to change in the user’s last purchase date and their spending (in dollars and items bought) over the period prior to the experiment. Both posterior mean and uncertainty tend to increase for groups of more active users. As in our OLS analysis of Figure 1, the posteriors can be highly skewed.

Finally, each DGP realization provides a prediction for the average treatment effect,

$$
\hat y_t - \hat y_c := \frac{1}{|\theta|} \sum_{i=1}^{n} \theta_i \left(\hat y_t(x_i) - \hat y_c(x_i)\right). \tag{20}
$$

The random $\theta$ play a role here both in weighting each treatment effect prediction, $\hat y_t(x_i) - \hat y_c(x_i)$, and in the CART fits that underlie those predictions. The last row of Table 1 shows posterior mean and standard deviation for $\hat y_t - \hat y_c$ from (20) after 1-5 weeks of experimentation. We have no basis here to argue that this statistic is preferable to the unadjusted difference in means or the adjusted metrics of Section 4.2; however, it presents an intuitively appealing option if you believe that the expected response within each treatment group is nonlinear in $x$.
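For a single DGP realization, the summaries in (19) and (20) are weighted averages of per-observation predicted effects. The sketch below assumes the two forests have already produced predictions for every observation; the four-observation inputs and the function name are hypothetical.

```python
def average_effect(theta, yhat_t, yhat_c, keep=None):
    """Weighted mean of predicted effects; `keep(i)` optionally restricts to a subgroup."""
    idx = [i for i in range(len(theta)) if keep is None or keep(i)]
    mass = sum(theta[i] for i in idx)
    return sum(theta[i] * (yhat_t[i] - yhat_c[i]) for i in idx) / mass

theta = [0.5, 1.5, 1.0, 1.0]   # one realization of the DGP weights
yhat_t = [2.0, 4.0, 1.0, 3.0]  # treatment-forest predictions at each x_i
yhat_c = [1.0, 1.0, 1.0, 1.0]  # control-forest predictions at each x_i
x_j = [0, 1, 1, 0]             # feature j for each observation

ate = average_effect(theta, yhat_t, yhat_c)                                 # as in (20)
cate = average_effect(theta, yhat_t, yhat_c, keep=lambda i: x_j[i] in {1})  # as in (19)
print(ate, cate)
```

Repeating this calculation for each posterior draw of the weights, with predictions refit under that draw, yields a posterior sample for either summary.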

## 6 Discussion

This article outlines a nonparametric Bayesian framework for treatment effect analysis in A/B experiments. The approach is simple, practical, and scalable. It applies beyond the two studied classes of HTE statistics; for example, an earlier version of the work (Taddy et al., 2014) considered HTE summarization via moment conditions and our CART trees are just one possible prediction rule amongst many available machine learning tools.

One area for future research is in semi-parametric extensions of this framework. For example, we know from existing theory on the frequentist nonparametric bootstrap that it can fail for distributions with infinite variance (Athreya, 1987). This scenario could occur in digital experiments where the response is extremely heavy tailed. In response, Taddy et al. (2015) propose combining the Dirichlet-multinomial model with a parametric tail distribution.

Throughout, we have referenced large existing literatures on Bayesian parametric and semi-parametric and frequentist analysis of HTE. We are not aiming to replace these existing frameworks, nor are we advocating for any one HTE statistic over another. Instead, we simply present a novel set of Bayesian nonparametric analyses for some common and useful tools. The hope is that frequentists and parametric Bayesians alike will benefit from this alternative point of view.

## Appendix A Population OLS gradient

Define $S_d = X_d'\Theta_d X_d$ and use $\nabla\beta_d$ to denote the gradient of $\beta_d$ on $\theta_d$. Then

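$$
\begin{aligned}
\nabla\beta_d &= \nabla\left(S_d^{-1}X_d'\Theta_d y_d\right) \\
&= S_d^{-1}\,\nabla\mathrm{vec}(X_d'\Theta_d y_d) + (y_d'\Theta_d X_d \otimes I_p)\,\nabla\mathrm{vec}(S_d^{-1}) \\
&= S_d^{-1}(y_d' \otimes X_d')\,\nabla\mathrm{vec}(\Theta_d) - (y_d'\Theta_d X_d \otimes I_p)(S_d^{-1} \otimes S_d^{-1})(X_d' \otimes X_d')\,\nabla\mathrm{vec}(\Theta_d) \\
&= \left[\, y_d' \otimes S_d^{-1}X_d' - \beta_d'X_d' \otimes S_d^{-1}X_d' \,\right]\nabla\mathrm{vec}(\Theta_d) \\
&= \left[\,(y_d - X_d\beta_d)' \otimes S_d^{-1}X_d'\,\right]\nabla\mathrm{vec}(\Theta_d)
\end{aligned}
\tag{21}
$$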

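via repeated applications of $\mathrm{vec}(ABC) = (C' \otimes A)\,\mathrm{vec}(B)$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$ for appropriately sized matrices, and using the chain rule with a standard result from matrix calculus to get $\nabla\mathrm{vec}(S_d^{-1}) = -(S_d^{-1} \otimes S_d^{-1})\,\nabla\mathrm{vec}(S_d)$. Since $\Theta_d = \mathrm{diag}(\theta_d)$, the formula in (21) reduces to $\nabla\beta_d = S_d^{-1}X_d'\,\mathrm{diag}(y_d - X_d\beta_d)$.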
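The final line of (21) can be sanity-checked numerically in the simplest case. The sketch below assumes a single covariate with no intercept, so that the weighted least-squares coefficient is a scalar and the gradient with respect to each weight is the covariate times the residual, divided by the weighted sum of squared covariates. The data, weights, and function names are hypothetical, and the comparison is against central finite differences.

```python
def wls_slope(theta, x, y):
    """Weighted least-squares slope for one covariate and no intercept."""
    s = sum(t * xi * xi for t, xi in zip(theta, x))
    return sum(t * xi * yi for t, xi, yi in zip(theta, x, y)) / s

def analytic_gradient(theta, x, y):
    """Derivative of the slope in each weight: x_i * residual_i / S."""
    s = sum(t * xi * xi for t, xi in zip(theta, x))
    beta = wls_slope(theta, x, y)
    return [xi * (yi - xi * beta) / s for xi, yi in zip(x, y)]

def numeric_gradient(theta, x, y, h=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((wls_slope(up, x, y) - wls_slope(down, x, y)) / (2 * h))
    return grad

x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.4, 3.8]
theta = [0.7, 1.3, 0.4, 1.6]

analytic = analytic_gradient(theta, x, y)
numeric = numeric_gradient(theta, x, y)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap)
```

The two gradients should agree up to finite-difference error, which gives a quick check on an implementation before it is used in a first-order variance approximation.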

## Appendix B Posterior inference for regression-adjusted ATE

Our first-order approximation to is . Writing , this approximation has variance .

###### Theorem B.1.
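$$
\begin{aligned}
\mathrm{var}\left(\bar x'[\tilde\beta_t - \tilde\beta_c]\right) = {} & \frac{s^2_{yc}}{n_t^2} + \frac{s^2_{yc}}{n_c^2} - \left(R_t^2\frac{s^2_{yc}}{n_t^2} + R_c^2\frac{s^2_{yc}}{n_c^2}\right) \\
& + (\bar x - \bar x_t)'\Sigma_{\tilde\beta_t}(\bar x - \bar x_t) + (\bar x - \bar x_c)'\Sigma_{\tilde\beta_c}(\bar x - \bar x_c),
\end{aligned}
\tag{22}
$$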

where for generic length- vector and .

###### Proof.

Consider the shifted OLS projections , using design matrix that has been centered within each group (except for the intercept) so that . Say is the first-order approximation of (6) applied to , with variance . Note that the residuals are unchanged and that the non-intercept coefficients are exactly equal: for . Thus with variance . Using and summing completes the result. ∎

Making the rough equivalences and , the result in (22) leads to our expression in (13). Note that (22) ignores variance in the covariate mean, , which is correlated with and has variance

### Footnotes

1. Taddy is also a research fellow at EBay. The authors thank others at EBay who have contributed, especially Jay Weiler who assisted in data collection.

### References

1. Athey, S. and G. Imbens (2015). Machine learning methods for estimating heterogeneous causal effects. arXiv: 1504.01132.
2. Athreya, K. (1987). Bootstrap of the mean in the infinite variance case. The Annals of Statistics, 724–731.
3. Berk, R., E. Pitkin, L. Brown, A. Buja, E. George, and L. Zhao (2013). Covariance adjustments for the analysis of randomized field experiments. Evaluation Review 37, 170–196.
4. Breiman, L. (1996). Heuristics of instability and stabilization in model selection. The Annals of Statistics 24(6), 2350–2383.
5. Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
6. Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and regression trees. Chapman & Hall/CRC.
7. Chamberlain, G. and G. W. Imbens (2003). Nonparametric applications of Bayesian inference. Journal of Business and Economic Statistics 21, 12–18.
8. Chipman, H. A., E. I. George, and R. E. McCulloch (2010). BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics 4, 266–298.
9. Clyde, M. and H. Lee (2001). Bagging and the Bayesian bootstrap. In Artificial Intelligence and Statistics.
10. Deng, A., Y. Xu, R. Kohavi, and T. Walker (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 123–132. ACM.
11. Dudík, M., J. Langford, and L. Li (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011).
12. Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
13. Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209–230.
14. Foster, J. C., J. M. Taylor, and S. J. Ruberg (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine 30, 2867–2880.
15. Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 1189–1232.
16. Green, D. P. and H. L. Kern (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly, 491–511.
17. Grimmer, J., S. Messing, and S. J. Westwood (2013). Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods.
18. Hill, J. L. (2011, January). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics 20(1), 217–240.
19. Hjort, N. L., C. Holmes, P. Müller, and S. G. Walker (2010). Bayesian nonparametrics. Cambridge University Press.
20. Hollander, M. and D. Wolfe (1999). Nonparametric Statistical Methods (2nd ed.). Wiley.
21. Imai, K. and M. Ratkovic (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics 7, 443–470.
22. Imbens, G. (2004). Nonparametric Estimation Of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics 86, 4–29.
23. Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
24. Lancaster, T. (2003). A note on bootstraps and robustness. Technical report, Working Paper, Brown University, Department of Economics.
25. Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics 7, 295–318.
26. Lin, W. (2014). Comments on ‘Covariance adjustments for the analysis of randomized field experiments’. Evaluation Review 38, 449–451.
27. Miratrix, L. W., J. S. Sekhon, and B. Yu (2013). Adjusting treatment effect estimates by post-stratification in randomized experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(2), 369–396.
28. Pitkin, E., R. Berk, L. Brown, A. Buja, E. George, K. Zhang, and L. Zhao (2013). Improved precision in estimating average treatment effects. arXiv:1311.0291.
29. Poirier, D. J. (2011). Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap. Econometric Reviews 30, 457–468.
30. Rubin, D. (1981). The Bayesian Bootstrap. The Annals of Statistics 9, 130–134.
31. Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association 100, 322–331.
32. Taddy, M., C.-S. Chen, J. Yu, and M. Wyle (2015). Bayesian and empirical Bayesian forests. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015).
33. Taddy, M., H. Freitas, D. Goldberg, and M. Gardner (2015). Semi-parametric Bayesian inference for the means of heavy-tailed distributions. In prep.
34. Taddy, M., M. Gardner, L. Chen, and D. Draper (2014). Heterogeneous treatment effects in digital experimentation. arXiv:1412.8563v3.
35. Taddy, M. and A. Kottas (2010). A Bayesian nonparametric approach to inference for quantile regression. Journal of Business and Economic Statistics 28, 357–369.
36. Wager, S. and S. Athey (2015). Estimation and inference of heterogeneous treatment effects using random forests. arXiv: 1510.04342.
37. White, H. (1980, May). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(4), 817.
38. Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 80–83.