A Population OLS gradient

\marginsize

1.1in.9in.3in1.4in \sectionfont \subsectionfont \subsubsectionfont

\setstretch

1.1


A nonparametric Bayesian analysis of heterogeneous

treatment effects in digital experimentation


Matt Taddy (taddy@chicagobooth.edu)

University of Chicago Booth School of Business 1

Matt Gardner

EBay

Liyun Chen

EBay

David Draper

University of California, Santa Cruz


Abstract: Randomized controlled trials play an important role in how Internet companies predict the impact of policy decisions and product changes. In these ‘digital experiments’, different units (people, devices, products) respond differently to the treatment. This article presents a fast and scalable Bayesian nonparametric analysis of such heterogeneous treatment effects and their measurement in relation to observable covariates. New results and algorithms are provided for quantifying the uncertainty associated with treatment effect measurement via both linear projections and nonlinear regression trees (CART and Random Forests). For linear projections, our inference strategy leads to results that are mostly in agreement with those from the frequentist literature. We find that linear regression adjustment of treatment effect averages (i.e., post-stratification) can provide some variance reduction, but that this reduction will be vanishingly small in the low-signal and large-sample setting of digital experiments. For regression trees, we provide uncertainty quantification for the machine learning algorithms that are commonly applied in tree-fitting. We argue that practitioners should look to ensembles of trees (forests) rather than individual trees in their analysis. The ideas are applied on and illustrated through an example experiment involving 21 million unique users of EBay.com.

\setstretch

1.5


1 Introduction

The Internet is host to a massive amount of experimentation. Online companies are constantly experimenting with changes to the ‘user’ experience. Randomized controlled trials are particularly common; they are referred to within technology companies as ‘A/B testing’ for the random assignment of control (option A) and treatment (option B) to experimental units (often users, but also products, auctions, or other dimensions). The treatments applied can involve changes to choice of the advertisements a user sees, the flow of information to users, the algorithms applied in product promotion, the pricing scheme and market design, or any aspect of website look and function. EBay, the source of our motivating application, experiments with these and other parts of the user experience with the goal of making it easier for buyers and sellers of specific items to find each other.

Heterogeneous treatment effects (HTE) refer to the phenomenon where the treatment effect for any individual user – the difference between how they would have responded under treatment rather than control – is different from the average. It is self-evident that HTE exist: different experimental units (people, products, or devices) each have unique responses to treatment. The task of interest is to measure this heterogeneity. Suppose that for each user with response , in either control or treatment, or respectively, there are available some pre-experiment attributes, . These attributes might be related to the effect of on . For example, if is during-experiment user spend, then could include pre-experiment spend by user on the company website. We can then attempt to index the HTE as a function of .

Digital (i.e., Internet) A/B experiments differ from most prior experimentation in important ways. First, the sample sizes are enormous. Our example EBay experiment (described in Section 2) has a sample size of over 21 million unique users. Second, the effect sizes are tiny. Our example treatment – increasing the size of product images – has response standard deviation around 1000 times larger than the estimated treatment effect. Finally, the response of interest (some transaction, such as user clicks or money spent) tends to be distributed with a majority of mass at zero, density spikes at other discrete points such as 1 or 99, a long tail, and variance that is correlated with available covariates. These data features – large samples, tiny effects that require careful uncertainty quantification, and unusual distributions that defy summarization through a parametric model – provide a natural setting for nonparametric analysis.

This article proposes a scalable framework for Bayesian nonparametric analysis of heterogeneous treatment effects. Our approach, detailed in Section 3, has two main steps.

  1. Choose some statistic that is useful for decision making, regardless of the true data distribution. For example, this could be a difference in means between two groups.

  2. Quantify uncertainty about this statistic as induced by the posterior distribution for a flexible Bayesian model of the data generating process (DGP).

This differs from the usual Bayesian nonparametric analysis strategy, in which a model for the data generating process is applied directly in prediction for future observations. See Hjort et al. (2010) and Taddy and Kottas (2010) for examples. In contrast, we consider a scenario where there is a given statistic that will be used in decision making regardless of the true DGP. The role of Bayesian modeling is solely to quantify uncertainty about this statistic. For example, in Section 5 we study regression trees; you do not need to believe that your data was generated from a tree in order for regression trees to be useful for prediction. Our statistic in () is then the output of a tree-fitting algorithm and in () we seek to evaluate stability in this algorithm under uncertainty about the true DGP.

We refer to this style of analysis as distribution-free Bayesian nonparametrics, in analogy to classical distribution-free statistics (e.g., as in Hollander and Wolfe, 1999) whose null-hypothesis distribution can be derived under minimal assumptions on the DGP (or without any assumptions at all, as in the case of the rank-sum test of Wilcoxon, 1945). In both Bayesian and classical setups, the statistic of interest is decoupled from assumptions about the DGP. One advantage of this strategy in the Bayesian context is that it allows us to apply a class of simple but flexible DGP models whose posterior can be summarized analytically or easily sampled via a bootstrap algorithm. That is, we can avoid the computationally intensive Markov chain Monte Carlo algorithms that are usually required in Bayesian nonparametric analysis (and which currently do not scale for data of the sizes encountered in digital experiments). Moreover, decoupling the tool of interest from the DGP model lets us provide guidance to practitioners without requiring them to change how they are processing data and the statistics that they store.

Our Bayesian DGP model, detailed in Section 3, treats the observed sample as a draw from a multinomial distribution over a large but finite set of support points. We place a Dirichlet prior on the probabilities in this multinomial, and the posterior distribution over possible DGPs is induced by the posterior on these probabilities. This Dirichlet-multinomial setup was introduced in Ferguson (1973) as a precursor to his Dirichlet process model (which has infinite support). We are not the first to propose its use in data analysis. Indeed, it is the foundation for the Bayesian bootstrap of Rubin (1981), and Chamberlain and Imbens (2003) provide a nice survey of some econometric applications. Similarly, our goal is to show the potential for distribution-free Bayesian nonparametric analysis of treatment effects in massive datasets and to understand its implications for some common practices.

After introducing our motivating dataset in Section 2 and the multinomial-Dirichlet framework in Section 3, the remainder of the paper is devoted to working through the analysis of two application areas and illustrating the results on our EBay experiment data. First, Section 4 considers linear least-squares HTE projections and their use in adjusting the measurement of average treatment effects. We provide approximations for the posterior mean and variance associated with regression-adjusted treatment effects, and the results predict a phenomena that we have observed repeatedly in practice: the influence of linear regression adjustment tends to be vanishingly small in the low-signal and large-sample setting of digital experiments. Second, Section 5 considers the use of the CART (Breiman et al., 1984) algorithm for partitioning of units (users) according to their treatment effect heterogeneity. This is an increasingly popular strategy (e.g., Athey and Imbens, 2015), and we demonstrate that there can be considerable uncertainty associated with the resulting partitioning rules. As a result, we advocate and analyze ensembles of trees that allow one to average over and quantify posterior uncertainty.

2 Data

First, some generic notation. For each independent experimental unit , which we label a ‘user’ following our eBay example, there is a response , a binary treatment indicator with (in control) and (in treatment), and a length- covariate vector . (We focus on scalar binary treatment for ease of exposition, but it is straightforward to generalize our results to multi-factor experiments.) There are users in control, in treatment, and in total. The user feature matrices for control and treatment groups are and , respectively, and these are accompanied by response vectors and . Stacked features and response are and , so that are in control and are treated.

Our example experiment involves 21 million users of the website EBay.com, randomly assigned 2/3 in treatment and 1/3 in control over a five week period. The treatment of interest is a change in image size for items in a user’s myEBay page – a dashboard that keeps track of items that the user has marked as interesting. In particular, the pictures are increased from pixels in control to pixels for the treated.

At EBay, where buyers and sellers transact sales and purchases of items listed on the website, an important outcome variable is the per-buyer total value of merchandise bought: the amount that a user spends on purchases during the experiment. This variable is measured in US$. The response is typical of Internet transaction data, in that it has

  • a majority at zero, since most users do not make a transaction during the experiment;

  • an long and fat right tail, corresponding to heavy spend by a minority of users;

  • density spikes at, e.g., psychological price thresholds such as $1 or $99; and

  • a variance that is correlated with both the treatment and sources of treatment heterogeneity. For our experiment, is much higher than .

Despite these unusual features, raw spending is the business-relevant variable of interest. For our results to be useful in decision making we need to understand treatment effects on the scale upon which EBay makes money. Common transformations employed by statisticians to facilitate modeling can lead to unintended consequences. For example, focusing on could drive the firm to target low-cost (and low-profit) items, and it is not difficult to define scenarios in which the mean effect is positive in but negative in . This motivates our use of nonparametric methods in analysis of the original dollar-valued response.

2.1 User features

Each user feature vector , representing potential covariates of heterogeneity, is constructed from user behavior tracked before the beginning of the experiment. Most metrics are aggregated over the four weeks prior to the start of the experiment, but we also include longer-term information in a three-dimensional indicator for whether the user made any purchases in the past month, quarter, or year. The metrics tracked include

  • transaction information such as total spending and number of bought items, sold items, or average price per bought item (treated as zero for zero bought items); and

  • activity information such as counts for site-session visits, page or item views, and actions such as bidding on an item or asking a seller a question.

The variables are tracked in aggregate, as well as broken out by product category (e.g., collectibles, fashion, or ‘unknown’) and market platform (e.g., auction or fixed price).

This gives around 100 total raw variables; they are extremely sparse and highly correlated. For the linear analysis of Section 4, we expand the raw data into indicators for whether the variable is greater than or equal to each of its positive quintiles. That is, there is a binary element to indicate when each variable is greater than 0 and when it is greater than or equal to the , , , and percentile of nonzero sample values for that variable. After collapsing across equal quintiles (e.g., some variables have up to nonzero percentile equal to one), this results in around 400 covariates. Finally, we set unless otherwise specified, so that the regression design matrices include an intercept. For the tree partitioning of Section 5, we do no preprocessing and simply input the 100 raw features to our algorithms.

3 Bayesian nonparametric sampling model

We employ Dirichlet-multinomial sampling as a flexible representation for the data generating process (DGP). The approach dates back to Ferguson (1973), Chamberlain and Imbens (2003) overview it in the context of econometric problems, and Lancaster (2003) and Poirier (2011) provide detailed analysis of linear projections. Rubin (1981) proposed the Bayesian bootstrap as an algorithm for sampling from versions of the posterior implied by this strategy, and the algorithm has since become closely associated with this model.

This model represents the DGP through a probability mass function on a large but finite number of possible data points (including response, covariates, and treatment),

(1)

where is the support of the DGP and are random weights with . We will often suppress and write for unless the weights need to be made explicit. Here denotes , the norm. Observations are assumed drawn independently from (1) by first sampling with probability and then assigning . A posterior over is induced by the posterior over . Functionals of , such as for arbitrary function and where implies expectation over , are thus random variables.

The conjugate prior for the normalized weight vector, , is a Dirichlet distribution, written . We specify a single concentration parameter , such that for and . Note that, from the fact that a Dirichlet realization can be generated as a normalized vector of independent gamma random variables, this conjugate prior is equivalent to independent exponential prior distributions on each un-normalized weight: for .

Now, suppose you have the observed sample . For notational convenience, we allow for in the case of repeated values and write so that and . The posterior distribution for is proportional to . Again using the constructive definition for the Dirichlet distribution, the posterior distribution on can thus be written as . That is, each weight is an independent exponential random variable with both mean and standard deviation equal to .

This model places no real restrictions on the DGP beyond that of independence across observations. In particular, could be so large as to include an practically exhaustive set of values. However, from here on we will focus on the limiting prior that arises as . This ‘non-informative’ limit yields a massive computational convenience: as the weights for unobserved support points converge to a degenerate random variable at zero: for . Our posterior for the DGP is then a multinomial sampling model with random weights on the observed data points. We modify notation and re-write as the vector of weights, so that each DGP is realized in the posterior as

(2)

The resulting inference model is thus similar to that underlying the frequentist nonparametric bootstrap of Efron (1979): the observed support acts as a stand-in for the population support. The prior is obviously not our ‘true’ prior, but we view this limiting case as a realistic theoretical construction that provides a convenient foundation for uncertainty quantification.

3.1 Posterior Inference

Consider some statistic of the DGP, say . This could be as simple as the DGP mean, written in the posterior as . Following Rubin (1981), we can obtain a sample from the posterior on through a Bayesian bootstrap. For :

  • draw ; then

  • calculate .

It is also often possible to derive analytic expressions for the posterior on . These are useful for understanding the sources of uncertainty and they allow us to avoid Monte Carlo computation. For example, if is some scalar associated with each observation (e.g., an element of ) and is the sample vector, the DGP mean for is written as . We can derive the first two posterior moments for this DGP mean as

(3)

Throughout, and to refer to posterior mean and variance operators unless otherwise indicated. In more complex examples, we proceed by studying the posterior on first-order Taylor series expansions for the statistic of interest around the posterior mean DGP, which has . The availability of simple analytic expressions for the posterior mean and variance associated with arbitrary statistics is an important part of our approach.

4 Linear HTE projection and adjusted average effects

Linear regression is probably the most commonly applied tool in measurement for heterogeneous treatment effects; Lin (2013) includes a nice overview. Specifically, one can study HTE through the difference in ordinary least-squares (OLS) projections within each treatment group,

(4)

This statistic is obviously relevant if the response is truly linear in . In that case, under a wide range of additive error models, is a consistent and unbiased estimator for the conditional mean , with here denoting the frequentist’s expectation under a true but unknown DGP . In our setting of randomized controlled trials (i.e., an A/B test), treatment randomization implies that is independent from both and , so that can be interpreted as the conditional average causal effect of on given (see Imbens, 2004 and Imbens and Rubin, 2015 for background on causal inference). Moreover, there are many applications where the difference in linear projections will be useful even if the true DGP mean is nonlinear in . For example, we consider in Section 4.2 below the common frequentist practice of using as a measure of the average treatment effect.

4.1 Linear analysis of heterogeneous treatment effects

From our nonparametric Bayesian perspective, the difference between treatment group population OLS projections is a statistic of the underlying DGP. Write the group projections as

(5)

where and is the sub-vector of weights for (recall and ). Since the treatment groups are independent, and so that the posterior for the difference can be derived directly from the posterior on (5). A Bayesian bootstrap for thus takes the difference between weighted-OLS fits for treatment and control groups, with weights drawn independently from the distribution.

To provide an analytic summary of the posterior on , we proceed by studying the exact posterior on first-order Taylor series expansions for around its value at the posterior mean DGP, where . Lancaster (2003) and Poirier (2011) also apply this approach, under the same Dirichlet-multinomial model that we use, in analysis of OLS projections. A first-order Taylor approximation to the group-specific population OLS in (5) is

(6)

where is the OLS projection at posterior mean DGP. As detailed in Appendix A, the gradient is and

(7)

where and is the vector of group-specific OLS residuals, for . Since each has an independent exponential posterior distribution, such that , the approximation in (5) has exact posterior variance

(8)

This matches the Huber-White heteroskedastic-consistent OLS variance formula (White, 1980); hence, the first-order approximate nonparametric Bayesian posterior variance formula is the same as a widely used frequentist estimator for sampling variance in OLS.

Example: treatment effect heterogeneity between new and old users

Figure 1: Conditional average treatment effects for new vs existing users. Posterior samples of given data accumulated through 1, 3, and 5 experiment weeks. The contours correspond to values of in a bivariate normal density centered at and with variance .

To briefly illustrate these ideas, we turn to the EBay experiment data of Section 2 and consider the change in conditional average treatment effect across a single influential covariate: whether or not the user has made a purchase in the past year. Any user who has made such a purchase is labeled existing, while those who have not are new.

Write the feature vector as , with each element an indicator for whether the user is new or existing (so that always). In this simple case, the four random variables of interest – – each correspond to the means associated with independent sub-vectors of and are thus independent from each other. We can apply the formulas in (3) to obtain the exact posterior moments for each and hence for the treatment effects, Working through the formula in (8) shows that, in this setup, the variance of our first-order approximation is very nearly equal to the true posterior variance:

(9)

where is the number of members of treatment group that were of user type . Note that the variances in (9) would be exactly equal if the series in (6) was expanded for around , as is done in Poirier (2011), instead of for as we have here.

Figure 1 presents the posterior distribution for . Results are shown conditional upon the 7.45 million users who visited their myEBay page in the first week of the experiment, the 10.97 million users who visited in the first three weeks, and the 13.22 million who visited at any time during the experiment. The posteriors are summarized through 1000 draws from the Bayesian bootstrap and as well as through a normal distribution centered at the sample OLS difference, , and with variance . Following the discussion above, this approximate variance is very nearly equal to the true posterior variance and the new and existing user average treatment effects are independent from each other in the posterior. Comparing the normal approximation to the bootstrap samples, we see that the former’s tails are thinner than those of the true posterior. Even after 5 weeks and 13 million exposed users, the posterior is skewed with a long right tail for .

4.2 Average treatment effects

A primary goal of many A/B experiments is to infer the average treatment effect (ATE): the average difference between what a user will spend if they experience option B versus if they experience option A. This value, multiplied by the number of users on your website, is the change in revenue that you will receive if moving from option A to option B.

Again referring the reader to Imbens and Rubin (2015) for additional detail, the massive benefit of randomized controlled trials is that each user’s group assignment, , is independent of their treatment effect. The difference between the expected response given that a user has been assigned to treatment versus control can then be interpreted causally as the treatment effect. This motivates the usual measure of ATE for a randomized trial as the difference in group means, . Writing , the corresponding DGP statistic of interest is . Applying (3), we get the posterior moments

(10)

with the sum-squared-error for generic length- vector . The posterior is thus centered on the usual estimate for the difference in means between two populations, with variance that is a slight deflation of the common sampling variance formula for this estimator.

There is a large frequentist literature on inference for ATE that advocates replacing with alternative metrics that condition upon the information in covariates, . One recently popular strategy promotes a regression-adjusted average treatment effect (e.g., Lin, 2013),

(11)

The advantage of such estimators is that they can have lower sampling variance than . When (11) is also unbiased for some definition of the true ATE (see Imbens, 2004, for a survey of frequentist inferential targets), it thus leads to more efficient estimation.

One way to justify (11) as representative of the ATE is to assume a linear relationship between and conditional upon . Linearity holds trivially if represents a partitioning of the data into mutually exclusive strata. Application of (11) in this situation is referred to as post stratification; see Deng et al. (2013) for use in the context of digital experiments and Miratrix et al. (2013) for a detailed theoretical overview. But (11) can also be justified for situations where linearity does not hold. Berk et al. (2013), Pitkin et al. (2013), Lin (2013), and Lin (2014) study ATE estimation under a variety of assumed sampling models. In each case, they show that is an asymptotically more efficient estimator of average treatment effects than the simple difference in means, .

Both and (under the sampling models in the literature cited above) are unbiased for commonly targeted definitions of the ATE (e.g., the sample average treatment effect – what would have been the average effect if we could have observed every user under both treatment and control). The argument for use of the latter regression-adjusted statistic is that it reduces your estimation variance. Quantifying the amount of variance reduction requires assumptions about the underlying DGP (see the Berk, Pitken et al. and Lin references above) and it is not always clear what one should expect in real examples. Our personal experience with digital experiments is that the frequentist nonparametric bootstrap often shows little, if any, variance reduction after regression adjustment.

Consider covariate averages projected through the population OLS lines from (5),

(12)

where . This is a random variable; it is the Bayesian nonparametric target analogous to . As detailed in Appendix B, a first-order approximation to (12) is . This has exact posterior mean and Theorem B.1 shows that

(13)

where is the proportion of deviance explained for the group- OLS regression. Thus regression adjustment yields a large variance reduction only if elements of have large covariances with and if the sample size is not too big. Since our digital experiments have tiny signals and massive samples, we should expect precisely what we have experienced in practice: little variance reduction from regression adjustment.

Example: regression-adjustment for average treatment effects

Table 1 shows posterior inference for average treatment effects in our EBay experiment, conditional upon the data accumulated through each week of the experiment. We have applied here the full set of user features, expanded into length- covariate vectors as described in Section 2. The table shows posterior means and standard deviations for three ATE statistics: the unadjusted difference in group means, ; the regression-adjusted ATE, ; and the approximation to this regression-adjusted ATE, . For later reference, we also include the average difference between regression trees fit in each treatment group, labeled ; we defer discussion to Section 5. Posterior means and standard deviations for are obtained via Bayesian bootstrap sampling, while moments for the other statistics area available analytically as described above.

As predicted by Theorem B.1, since our samples are large and our correlations are small, there is practically zero reduction in variance between and . The standard deviations are the same up to two decimal places ($0.01) in every case. Moreover, the Bayesian bootstrap posterior sample for has means and standard deviations that are similar to those of its approximation, , giving evidence to support the practical relevance of (13). While neither this example nor our analytic results imply that regression-adjustment cannot yield lower-variance inference, it does fit with our practical experience that such adjustments makes little difference in experiments like the one studied here.


Posterior Mean (and SD)

\setstretch

1.5 week 1 week 2 week 3 week 4 week 5 3.30 (2.04) 1.34 (1.24) 1.44 (0.95) 1.71 (0.80) 1.90 (0.76) 2.95 (2.07) 1.26 (1.29) 1.39 (0.84) 1.55 (0.72) 1.83 (0.75) 3.05 (2.04) 1.15 (1.24) 1.29 (0.95) 1.59 (0.80) 1.80 (0.76) 2.99 (1.46) 1.03 (1.46) 1.20 (0.90) 1.74 (0.80) 1.75 (0.82) number of users, in mil 7.45 9.48 10.97 12.20 13.22

Table 1: Posterior means (and standard deviations) for ATE statistics conditional upon the sets of users who visited their myEBay page through 1-5 weeks, cumulative. Values for and are exact, while moments for are based on 100 draws from the Bayesian bootstrap. Finally, refers to the difference between treatment-group-specific regression trees, as detailed in Section 5.2; posterior moments here are based on two forests of 1000 trees each.

5 Regression tree prediction for HTE

Regression trees partition the feature (covariate) space into regions of response homogeneity, such that the response associated with any point in a given partition can be predicted from the average for that of its neighbors. The partitions are typically formed through a series of binary splits, and after this series of splits each terminal leaf node contains a rectangular subset of the covariate support. Advantages of using trees as prediction rules include that they can model a response that is nonlinear in the original covariates, they can represent complex interactions (every variable that is split upon is interacting with those above and below it in the tree), and they allow for error variance that changes with the covariates (there is no homoscedasticity restriction across leaves). Through their implementation as part of Random Forest (Breiman, 2001) or gradient boosting machine (Friedman, 2001) ensembles, it is difficult to overstate the extent to which trees play a central role in contemporary industrial machine learning.

The CART algorithm of Breiman et al. (1984) is the most common and successful recipe for building trees. It grows greedily and recursively: for a given node (subset of data), a split location is chosen to minimize some impurity (sum-squared-error for regression trees) across the two resulting children; this splitting procedure is repeated on each child, and hence recursively until the algorithm encounters a stopping rule (e.g., if a new child contains fewer observations than a specified minimum leaf size). After the tree is fit, it is common to use cross-validation to prune it by evaluating whether the splits near to the leaves improve out-of-sample prediction and removing those that do not. Variations on CART include the random removal of input dimensions as candidates for the split location at each impurity minimization. The resulting prediction rule is then the average across repeated runs of this randomized-input CART (e.g., see Breiman, 2001). Introduction of such stochasticity can improve upon the performance of greedy search in datasets where, e.g., you have high-dimensional inputs.

To quantify uncertainty for CART, we study a population CART algorithm that (analogously to the population OLS in (5)) optimizes over a realization of our Bayesian nonparametric DGP model from (2). Consider a node , containing a subset of the data indices . This node is to be partitioned into two child nodes according to a binary split on one of the covariate locations: a split on input of observation , say , so that the two resulting child nodes are and . Given a realization of the DGP weights , the population CART algorithm chooses to minimize

(14)

where for a generic node the impurity (error) is

(15)

As in sample CART, this splitting is repeated recursively until we encounter a stopping rule. For randomized-input versions of CART, one minimizes the same DGP-dependent impurity in (14) but over a random subset of candidate split dimensions . The statistic of interest is then itself a random object: we have decoupled variability due to uncertainty about the DGP from algorithmic stochasticity that does not diminish as you accumulate data.

The posterior over trees (i.e., over CART fits) can be sampled via the Bayesian bootstrap of Section 3.1. This leads to a posterior sample of trees that we label a Bayesian Forest. The algorithm is studied in detail in Taddy et al. (2015), along with an Empirical Bayes approximation for computation in distribution across many machines. Taddy et al. (2015) demonstrate that the average prediction from a Bayesian Forest (i.e., the posterior mean) outperforms prediction from a single CART tree (with cross-validated pruning) and many other common tree-based prediction algorithms. Of particular interest, Bayesian Forests tend to perform similarly to, although usually slightly better than, Random Forests. The only difference between the two algorithms is that while the Bayesian Forest uses independent observation weights, the Random Forest draws a vector of discrete weights from a multinomial distribution with probability and size (this is the frequentist nonparametric bootstrap).

5.1 Single tree prediction for HTE

Bayesian Forests provide uncertainty quantification for a machine learning algorithm – CART – that is often viewed as a black-box. Recently published applications of regression trees in HTE prediction include Foster et al. (2011) for medical clinical trials and Dudík et al. (2011) for user browsing behavior, and we have observed that CART is commonly employed in industry for the segmentation of customers according to their response to advertisement and promotions.

Athey and Imbens (2015) study various strategies for the use of CART-like algorithms in prediction of HTE. We will focus on their transformed outcome tree (TOT) method, which is simply the application of CART in prediction of a transformed response, , which has expected value equal to the treatment effect of interest. In the language of the Neyman-Rubin causal model (e.g., Rubin, 2005), each unit of observation in an experiment is associated with two potential outcomes: , their response if they are allocated to the control group; and , their response under treatment. Of course, only one of these two potential outcomes is ever realized and observed: . Athey and Imbens (2015) define

(16)

where is the probability of treatment ( in our EBay example). Then, with denoting expectation over unknown independent treatment allocation ,

(17)

Thus a tree that is trained to fit the expectation for can be used to predict the treatment effect.

1 Week                                                     5 Weeks

Figure 2: Sample TOT (CART for ) trees after one and five weeks of experimentation. The algorithm was run with a minimum leaf size of 100,000 users and a maximum depth of 5 internal (splitting) nodes. The right-hand children contain data for which the split condition is true, and left-hand children are the complement. Leaves are marked with the predicted treatment effect in that leaf. Decision nodes are colored for the posterior probability (see Table 2) that the corresponding variable is included in the TOT fit for new a DGP realization: light grey for , dark grey for , and red for .
 
Terminology: Views is the count of page impressions, Bids is the count of bids on the auction platform, Sessions is the number of sessions where an event occurs, SRP is a search results page, VI is where the user clicked on a item to view details, BI is the count of bought items, SI denotes the count of items sold by the user and SYI denotes the count of items that the user attempted to sell. Other, Fashion, and Electronics denote restriction of the associated metrics to activity in these product categories. Finally, lastmonth indicates whether the user made any purchases in the month prior to the experiment.

Variable split probabilities
1 Week depth in tree
BidsFashion .32 .35 .38 .39 .40
Bids .02 .08 .14 .23 .27
Bids Other .15 .20 .22 .23 .23
BI Electronics .02 .04 .11 .19 .22
Spend Electronics .04 .11 .16 .20 .28
5 Weeks depth in tree
SI Fashion .38 .42 .42 .42 .42
Bids Other .26 .52 .55 .57 .67
VI sessions .02 .08 .20 .26 .36
lastmonth .03 .13 .31 .59 .72
SRP views .00 .06 .11 .22 .42
VI views .01 .03 .08 .14 .27
SYI views .00 .02 .04 .09 .13

Table 2: Posterior probabilities of the TOT (CART for ) algorithm splitting at or above depths 1 to 5 on each of the variables in the corresponding sample TOT tree (see Figure 2), after one and five weeks of experimentation. These probabilities are obtained from a Bayesian Forest (sample) of 1000 TOT trees fit under the same settings as our sample trees: maximum depth of 5 and minimum leaf size of 100,000.

Example: posterior uncertainty for transformed outcome trees

Sample-fit TOT trees for our EBay example experiment are shown in Figure 2, fit to the data accumulated through one and five weeks of experimentation. The leaf nodes are marked with the corresponding prediction rule: , the mean of over , which is an estimate for following (17). Recall that these are dollar-value effects. We fit all of our trees and forests via adaptations of MLLib’s decision tree methods in Apache Spark, and in this case the algorithm stops at either a maximum depth of 5 or a minimum leaf size of 100,000 users. Athey and Imbens (2015) recommend the use of cross-validated pruning for TOT tree fitting. In our examples, cross-validation selects deeper trees than those shown in Figure 2, such that these can be viewed as the trunks of some more complex optimal sample TOT.

The population version of the TOT algorithm simply replaces with in (14), and a posterior sample over TOT trees – a Bayesian Forest – is obtained via Bayesian bootstrapping as described above. We fit Bayesian Forests of 1000 trees to study the uncertainty associated with the TOT trees in Figure 2. For the variables split upon in each sample TOT tree, Table 2 contains the posterior probability that each variable is split upon, at or above a given depth, for a new realization of the DGP. This is simply the proportion of trees in the forest in which such splits occur. The internal decision nodes in Figure 2 are colored according to these probabilities.

After one week and 7.45 million users, only Bids Fashion – the number of bids on ‘fashion’ items – is split upon with greater than 1/3 probability at a depth . After observing 5 weeks of purchasing from 13.22 million users, the structure is more stable: five variables occur in more than 1/3 of depth-5 trees. Two variables occur with probability greater than 1/2: the lastmonth indicator, for whether the user made a purchase in the past month; and Bids Other, the number of bids on un-categorized items (the split location for Bids Other was always between 6 and 10). We could hence, say, partition users into four groups according to the splits on lastmonth and Bids Other and have a better than 1/2 chance that for any posterior DGP realization a similar partitioning would be included in the top of the corresponding TOT fit.

However, even after 5 weeks, there remains considerable uncertainty associated with the full tree structure. For example, the very first (root) split in the sample tree occurs in only 40% of depth-5 trees, it targets a small subset of the data (less than 1% of users had SI Fashion ), and it predicts an extreme treatment effect for this subset (-$38.77 as the effect of slightly larger images). Moreover, at a depth of 5 all variables except for lastpurch have low split probability. This uncertainty contradicts the examples and results in Taddy et al. (2015), which finds high posterior probability for the trunks of CART fits and takes advantage of this stability for efficient computation. We hypothesize that the difference here is due to the tiny signal available for prediction of HTE in our (and probably many other) digital experiments.

5.2 Bayesian Forest HTE prediction

A single CART fit is a fragile object; even if the trunks are more stable than we find in the example above, deep tree structure will have near-zero posterior probability. Splits that cross-validated pruning finds useful for out-of-sample prediction will often disappear under small jitter to the dataset. See Breiman (1996) for the classic study of this phenomenon, which was the motivation for his Random Forest algorithm. Breiman showed that by averaging across many trees, each individually unstable and over-fit, he could obtain a response surface that was both stable and a strong performer in out-of-sample prediction. The act of averaging removes noisy structure that exists in only a small number of trees, and it smooths across uncertainty, e.g., about split locations or the order of nodes in a tree path.

From our perspective, a Bayesian Forest (which is nearly equivalent to a Random Forest) is a posterior over CART predictors. The average leaf value associated with a given is the posterior mean prediction rule. This contrasts with the predictions implied by the single sample CART tree, which is the CART prediction rule at posterior mean DGP, where . Experience shows that this can make a big difference: the forest average response surface will be different from and provide better prediction than the sample CART tree. (More generally, see Clyde and Lee (2001) for discussion on the Bayesian bootstrap and model averaging.)

For our final example, we consider the posterior distribution on the difference between two prediction rules: CART fit to each of treatment and control DGPs. Write for the prediction rule at resulting from population CART fit to support with weights . That is, if the realized CART fit allocates to the leaf node containing observations in set , then as described in (15). Thus is a random variable and so is the predicted treatment effect

(18)

As in Section 4, the DGPs for treatment and control are independent from each other and we can obtain posterior samples of (18) via separate Bayesian Forests for each treatment group.

The framework implied by (18) is related to a semi-parametric literature (e.g. Hill, 2011; Green and Kern, 2012; Grimmer et al., 2013; Imai and Ratkovic, 2013) that studies the difference between flexible regression functions in each of the treatment groups. In a prominent example, Hill (2011) applies Bayesian additive regression trees (BART; Chipman et al., 2010) and interprets the difference between posterior predictive distributions across treatment groups as effect heterogeneity. In contrast to our approach, where the trees are just a convenient prediction rule and we do not assume that the data were actually generated from a tree, Hill assumes that her regression functions are representative of the true underlying DGP. Which strategy is best will depend upon your application. For example, BART includes a homogeneous Gaussian additive error and is thus inappropriate for the heteroscedastic errors in Internet transaction data. BART is outperformed by forest algorithms in such settings (see, e.g., Taddy et al., 2015), but will outperform the forests when the homoscedasticity assumption is more valid.

Due to the similarity between Random and Bayesian Forests, our approach is also related to recent work by Wager and Athey (2015) on the use of Random Forests in HTE estimation. Wager and Athey use the forests to construct confidence intervals for a true treatment effect surface. This is more ambitious than our contribution, which interprets the forest as a posterior distribution for optimal prediction of treatment effects within a certain class of algorithms. Indeed, Wager and Athey are studying the frequentist properties of our posterior mean.

Figure 3: Bayesian Forest treatment effects after 5 weeks. Each plot shows the posterior distribution for population CART treatment effect prediction, , at features from a sampled user.

Example: Differenced treatment group forests

Returning to our EBay example, we focus on HTE prediction for the completed experiment including 13.22 million users over 5 weeks. Bayesian Forests of 1000 trees each were fit to the treatment and control group samples. Each forest is the posterior distribution over a population-CART algorithm run with maximum depth of 10 and no minimum leaf size. CART was applied without any random variable subsetting; hence, variability in the resulting prediction surface is due entirely to posterior uncertainty about the DGP.

The outcome is a posterior sample over prediction rules for the conditional average treatment effects, as in (18). Figure 3 shows four example posteriors for individual user treatment effects; this type of uncertainty quantification is available for any new user whose treatment effect you wish to predict. It is also possible to summarize, for a given DGP realization, the average treatment effect conditional on variable being in set as

(19)

The sum in (19) is over all observations in the sample (both treatment and control groups) that have their feature in . Figure 4 shows change in the posterior distributions for conditional average treatment effects corresponding to change in the user’s last purchase date and their spending (in dollars and items bought) over the period prior to the experiment. Both posterior mean and uncertainty tend to increase for groups of more active users. As in our OLS analysis of Figure 1, the posteriors can be highly skewed.

Finally, each DGP realization provides a prediction for the average treatment effect,

(20)

The random play a role here both in weighting each treatment effect prediction, , and in the CART fits that underly those predictions. The last row of Table 1 shows posterior mean and standard deviation for from (20) after 1-5 weeks of experimentation. We have no basis here to argue that this statistic is preferable to unadjusted or the adjusted metrics of Section 4.2; however, it presents an intuitively appealing option if you believe that the expected response within each treatment group is nonlinear in .

Figure 4: Variable average effects as defined in (19) for three important user features: the user’s last purchase date and their spending (in dollars and items bought) over the period prior to this experiment. Points mark the posterior mean and line segments extend from the to posterior percentile.

6 Discussion

This article outlines a nonparametric Bayesian framework for treatment effect analysis in A/B experiments. The approach is simple, practical, and scalable. It applies beyond the two studied classes of HTE statistics; for example, an earlier version of the work (Taddy et al., 2014) considered HTE summarization via moment conditions and our CART trees are just one possible prediction rule amongst many available machine learning tools.

One area for future research is in semi-parametric extensions of this framework. For example, we know from existing theory on the frequentist nonparametric bootstrap that it can fail for distributions with infinite variance (Athreya, 1987). This scenario could occur in digital experiments where the response is extremely heavy tailed. In response, Taddy et al. (2015) propose combining the Dirichlet-multinomial model with a parametric tail distribution.

Throughout, we have referenced large existing literatures on Bayesian parametric and semi-parametric and frequentist analysis of HTE. We are not aiming to replace these existing frameworks, nor are we advocating for any one HTE statistic over another. Instead, we simply present a novel set of Bayesian nonparametric analyses for some common and useful tools. The hope is that frequentists and parametric Bayesians alike will benefit from this alternative point of view.

Appendix A Population OLS gradient

Define and use to denote the gradient of on . Then

(21)

via repeated applications of and for appropriately sized matrices, and using the chain rule with a standard result from matrix calculus to get . Since , the formula in (21) reduces to .

Appendix B Posterior inference for regression-adjusted ATE

Our first-order approximation to is . Writing , this approximation has variance .

Theorem B.1.
(22)

where for generic length- vector and .

Proof.

Consider the shifted OLS projections , using design matrix that has been centered within each group (except for the intercept) so that . Say is the first-order approximation of (6) applied to , with variance . Note that the residuals are unchanged and that the non-intercept coefficients are exactly equal: for . Thus with variance . Using and summing completes the result. ∎

Making the rough equivalences and , the result in (22) leads to our expression in (13). Note that (22) ignores variance in the covariate mean, , which is correlated with and has variance

Footnotes

  1. Taddy is also a research fellow at EBay. The authors thank others at EBay who have contributed, especially Jay Weiler who assisted in data collection.

References

  1. Athey, S. and G. Imbens (2015). Machine learning methods for estimating heterogeneous causal effects. arXiv: 1504.01132.
  2. Athreya, K. (1987). Bootstrap of the mean in the infinite variance case. The Annals of Statistics, 724–731.
  3. Berk, R., E. Pitkin, L. Brown, A. Buja, E. George, and L. Zhao (2013). Covariance adjustments for the analysis of randomized field experiments. Evaluation Review 37, 170–196.
  4. Breiman, L. (1996). Heuristics of instability and stabilization in model selection. The Annals of Statistics 24(6), 2350–2383.
  5. Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
  6. Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and regression trees. Chapman & Hall/CRC.
  7. Chamberlain, G. and G. W. Imbens (2003). Nonparametric applications of Bayesian inference. Journal of Business and Economic Statistics 21, 12–18.
  8. Chipman, H. A., E. I. George, and R. E. McCulloch (2010). BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics 4, 266–298.
  9. Clyde, M. and H. Lee (2001). Bagging and the Bayesian bootstrap. In Artificial Intelligence and Statistics.
  10. Deng, A., Y. Xu, R. Kohavi, and T. Walker (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 123–132. ACM.
  11. Dudík, M., J. Langford, and L. Li (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011).
  12. Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
  13. Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics 1, 209–230.
  14. Foster, J. C., J. M. Taylor, and S. J. Ruberg (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine 30, 2867–2880.
  15. Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232.
  16. Green, D. P. and H. L. Kern (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public opinion quarterly, 491–511.
  17. Grimmer, J., S. Messing, and S. J. Westwood (2013). Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods.
  18. Hill, J. L. (2011, January). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics 20(1), 217–240.
  19. Hjort, N. L., C. Holmes, P. Müller, and S. G. Walker (2010). Bayesian nonparametrics. Cambridge University Press.
  20. Hollander, M. and D. Wolfe (1999). Nonparametric Statistical Methods (2nd ed.). Wiley.
  21. Imai, K. and M. Ratkovic (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics 7, 443–470.
  22. Imbens, G. (2004). Nonparametric Estimation Of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics 86, 4–29.
  23. Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
  24. Lancaster, T. (2003). A note on bootstraps and robustness. Technical report, Working Paper, Brown University, Department of Economics.
  25. Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics 7, 295–318.
  26. Lin, W. (2014). Comments on ‘Covariance adjustments for the analysis of randomized field experiments’. Evaluation Review 38, 449–451.
  27. Miratrix, L. W., J. S. Sekhon, and B. Yu (2013). Adjusting treatment effect estimates by post-stratification in randomized experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(2), 369–396.
  28. Pitkin, E., R. Berk, L. Brown, A. Buja, E. George, K. Zhang, and L. Zhao (2013). Improved precision in estimating average treatment effects. arXiv:1311.0291.
  29. Poirier, D. J. (2011). Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap. Econometric Reviews 30, 457–468.
  30. Rubin, D. (1981). The Bayesian Bootstrap. The Annals of Statistics 9, 130–134.
  31. Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association 100, 322–331.
  32. Taddy, M., C.-S. Chen, J. Yu, and M. Wyle (2015). Bayesian and empirical Bayesian forests. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015).
  33. Taddy, M., H. Freitas, D. Goldberg, and M. Gardner (2015). Semi-parametric Bayesian inference for the means of heavy-tailed distributions. In prep.
  34. Taddy, M., M. Gardner, L. Chen, and D. Draper (2014). Heterogeneous treatment effects in digital experimentation. arXiv:1412.8563v3.
  35. Taddy, M. and A. Kottas (2010). A Bayesian nonparametric approach to inference for quantile regression. Journal of Business and Economic Statistics 28, 357–369.
  36. Wager, S. and S. Athey (2015). Estimation and inference of heterogeneous treatment effects using random forests. arXiv: 1510.04342.
  37. White, H. (1980, May). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(4), 817.
  38. Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics bulletin, 80–83.
100800
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
Edit
-  
Unpublish
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel
0
Comments 0
""
The feedback must be of minumum 40 characters
Add comment
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question