A nonparametric Bayesian analysis of heterogeneous
treatment effects in digital experimentation
Matt Taddy (firstname.lastname@example.org)
University of Chicago Booth School of Business ††Taddy is also a research fellow at EBay. The authors thank others at EBay who have contributed, especially Jay Weiler who assisted in data collection.
University of California, Santa Cruz
Abstract: Randomized controlled trials play an important role in how Internet companies predict the impact of policy decisions and product changes. In these ‘digital experiments’, different units (people, devices, products) respond differently to the treatment. This article presents a fast and scalable Bayesian nonparametric analysis of such heterogeneous treatment effects and their measurement in relation to observable covariates. New results and algorithms are provided for quantifying the uncertainty associated with treatment effect measurement via both linear projections and nonlinear regression trees (CART and Random Forests). For linear projections, our inference strategy leads to results that are mostly in agreement with those from the frequentist literature. We find that linear regression adjustment of treatment effect averages (i.e., post-stratification) can provide some variance reduction, but that this reduction will be vanishingly small in the low-signal and large-sample setting of digital experiments. For regression trees, we provide uncertainty quantification for the machine learning algorithms that are commonly applied in tree-fitting. We argue that practitioners should look to ensembles of trees (forests) rather than individual trees in their analysis. The ideas are applied on and illustrated through an example experiment involving 21 million unique users of EBay.com.
The Internet is host to a massive amount of experimentation. Online companies are constantly experimenting with changes to the ‘user’ experience. Randomized controlled trials are particularly common; they are referred to within technology companies as ‘A/B testing’ for the random assignment of control (option A) and treatment (option B) to experimental units (often users, but also products, auctions, or other dimensions). The treatments applied can involve changes to choice of the advertisements a user sees, the flow of information to users, the algorithms applied in product promotion, the pricing scheme and market design, or any aspect of website look and function. EBay, the source of our motivating application, experiments with these and other parts of the user experience with the goal of making it easier for buyers and sellers of specific items to find each other.
Heterogeneous treatment effects (HTE) refer to the phenomenon where the treatment effect for any individual user – the difference between how they would have responded under treatment rather than control – is different from the average. It is self-evident that HTE exist: different experimental units (people, products, or devices) each have unique responses to treatment. The task of interest is to measure this heterogeneity. Suppose that for each user with response , in either control or treatment, or respectively, there are available some pre-experiment attributes, . These attributes might be related to the effect of on . For example, if is during-experiment user spend, then could include pre-experiment spend by user on the company website. We can then attempt to index the HTE as a function of .
Digital (i.e., Internet) A/B experiments differ from most prior experimentation in important ways. First, the sample sizes are enormous. Our example EBay experiment (described in Section 2) has a sample size of over 21 million unique users. Second, the effect sizes are tiny. Our example treatment – increasing the size of product images – has response standard deviation around 1000 times larger than the estimated treatment effect. Finally, the response of interest (some transaction, such as user clicks or money spent) tends to be distributed with a majority of mass at zero, density spikes at other discrete points such as 1 or 99, a long tail, and variance that is correlated with available covariates. These data features – large samples, tiny effects that require careful uncertainty quantification, and unusual distributions that defy summarization through a parametric model – provide a natural setting for nonparametric analysis.
This article proposes a scalable framework for Bayesian nonparametric analysis of heterogeneous treatment effects. Our approach, detailed in Section 3, has two main steps.
Choose some statistic that is useful for decision making, regardless of the true data distribution. For example, this could be a difference in means between two groups.
Quantify uncertainty about this statistic as induced by the posterior distribution for a flexible Bayesian model of the data generating process (DGP).
This differs from the usual Bayesian nonparametric analysis strategy, in which a model for the data generating process is applied directly in prediction for future observations. See hjort2010bayesian and taddy_bayesian_2010 for examples. In contrast, we consider a scenario where there is a given statistic that will be used in decision making regardless of the true DGP. The role of Bayesian modeling is solely to quantify uncertainty about this statistic. For example, in Section LABEL:sec:tree we study regression trees; you do not need to believe that your data was generated from a tree in order for regression trees to be useful for prediction. Our statistic in () is then the output of a tree-fitting algorithm and in () we seek to evaluate stability in this algorithm under uncertainty about the true DGP.
We refer to this style of analysis as distribution-free Bayesian nonparametrics, in analogy to classical distribution-free statistics (e.g., as in hollander_nonparametric_1999) whose null-hypothesis distribution can be derived under minimal assumptions on the DGP (or without any assumptions at all, as in the case of the rank-sum test of wilcoxon1945individual). In both Bayesian and classical setups, the statistic of interest is decoupled from assumptions about the DGP. One advantage of this strategy in the Bayesian context is that it allows us to apply a class of simple but flexible DGP models whose posterior can be summarized analytically or easily sampled via a bootstrap algorithm. That is, we can avoid the computationally intensive Markov chain Monte Carlo algorithms that are usually required in Bayesian nonparametric analysis (and which currently do not scale for data of the sizes encountered in digital experiments). Moreover, decoupling the tool of interest from the DGP model lets us provide guidance to practitioners without requiring them to change how they are processing data and the statistics that they store.
Our Bayesian DGP model, detailed in Section 3, treats the observed sample as a draw from a multinomial distribution over a large but finite set of support points. We place a Dirichlet prior on the probabilities in this multinomial, and the posterior distribution over possible DGPs is induced by the posterior on these probabilities. This Dirichlet-multinomial setup was introduced in ferguson_bayesian_1973 as a precursor to his Dirichlet process model (which has infinite support). We are not the first to propose its use in data analysis. Indeed, it is the foundation for the Bayesian bootstrap of rubin_bayesian_1981, and chamberlain_nonparametric_2003 provide a nice survey of some econometric applications. Similarly, our goal is to show the potential for distribution-free Bayesian nonparametric analysis of treatment effects in massive datasets and to understand its implications for some common practices.
After introducing our motivating dataset in Section 2 and the multinomial-Dirichlet framework in Section 3, the remainder of the paper is devoted to working through the analysis of two application areas and illustrating the results on our EBay experiment data. First, Section 4 considers linear least-squares HTE projections and their use in adjusting the measurement of average treatment effects. We provide approximations for the posterior mean and variance associated with regression-adjusted treatment effects, and the results predict a phenomena that we have observed repeatedly in practice: the influence of linear regression adjustment tends to be vanishingly small in the low-signal and large-sample setting of digital experiments. Second, Section LABEL:sec:tree considers the use of the CART (breiman_classification_1984) algorithm for partitioning of units (users) according to their treatment effect heterogeneity. This is an increasingly popular strategy (e.g., athey_machine_2015), and we demonstrate that there can be considerable uncertainty associated with the resulting partitioning rules. As a result, we advocate and analyze ensembles of trees that allow one to average over and quantify posterior uncertainty.
First, some generic notation. For each independent experimental unit , which we label a ‘user’ following our eBay example, there is a response , a binary treatment indicator with (in control) and (in treatment), and a length- covariate vector . (We focus on scalar binary treatment for ease of exposition, but it is straightforward to generalize our results to multi-factor experiments.) There are users in control, in treatment, and in total. The user feature matrices for control and treatment groups are and , respectively, and these are accompanied by response vectors and . Stacked features and response are and , so that are in control and are treated.
Our example experiment involves 21 million users of the website EBay.com, randomly assigned 2/3 in treatment and 1/3 in control over a five week period. The treatment of interest is a change in image size for items in a user’s myEBay page – a dashboard that keeps track of items that the user has marked as interesting. In particular, the pictures are increased from pixels in control to pixels for the treated.
At EBay, where buyers and sellers transact sales and purchases of items listed on the website, an important outcome variable is the per-buyer total value of merchandise bought: the amount that a user spends on purchases during the experiment. This variable is measured in US$. The response is typical of Internet transaction data, in that it has
a majority at zero, since most users do not make a transaction during the experiment;
an long and fat right tail, corresponding to heavy spend by a minority of users;
density spikes at, e.g., psychological price thresholds such as $1 or $99; and
a variance that is correlated with both the treatment and sources of treatment heterogeneity. For our experiment, is much higher than .
Despite these unusual features, raw spending is the business-relevant variable of interest. For our results to be useful in decision making we need to understand treatment effects on the scale upon which EBay makes money. Common transformations employed by statisticians to facilitate modeling can lead to unintended consequences. For example, focusing on could drive the firm to target low-cost (and low-profit) items, and it is not difficult to define scenarios in which the mean effect is positive in but negative in . This motivates our use of nonparametric methods in analysis of the original dollar-valued response.
2.1 User features
Each user feature vector , representing potential covariates of heterogeneity, is constructed from user behavior tracked before the beginning of the experiment. Most metrics are aggregated over the four weeks prior to the start of the experiment, but we also include longer-term information in a three-dimensional indicator for whether the user made any purchases in the past month, quarter, or year. The metrics tracked include
transaction information such as total spending and number of bought items, sold items, or average price per bought item (treated as zero for zero bought items); and
activity information such as counts for site-session visits, page or item views, and actions such as bidding on an item or asking a seller a question.
The variables are tracked in aggregate, as well as broken out by product category (e.g., collectibles, fashion, or ‘unknown’) and market platform (e.g., auction or fixed price).
This gives around 100 total raw variables; they are extremely sparse and highly correlated. For the linear analysis of Section 4, we expand the raw data into indicators for whether the variable is greater than or equal to each of its positive quintiles. That is, there is a binary element to indicate when each variable is greater than 0 and when it is greater than or equal to the , , , and percentile of nonzero sample values for that variable. After collapsing across equal quintiles (e.g., some variables have up to nonzero percentile equal to one), this results in around 400 covariates. Finally, we set unless otherwise specified, so that the regression design matrices include an intercept. For the tree partitioning of Section LABEL:sec:tree, we do no preprocessing and simply input the 100 raw features to our algorithms.
3 Bayesian nonparametric sampling model
We employ Dirichlet-multinomial sampling as a flexible representation for the data generating process (DGP). The approach dates back to ferguson_bayesian_1973, chamberlain_nonparametric_2003 overview it in the context of econometric problems, and lancaster_note_2003 and poirier_bayesian_2011 provide detailed analysis of linear projections. rubin_bayesian_1981 proposed the Bayesian bootstrap as an algorithm for sampling from versions of the posterior implied by this strategy, and the algorithm has since become closely associated with this model.
This model represents the DGP through a probability mass function on a large but finite number of possible data points (including response, covariates, and treatment),
where is the support of the DGP and are random weights with . We will often suppress and write for unless the weights need to be made explicit. Here denotes , the norm. Observations are assumed drawn independently from (1) by first sampling with probability and then assigning . A posterior over is induced by the posterior over . Functionals of , such as for arbitrary function and where implies expectation over , are thus random variables.
The conjugate prior for the normalized weight vector, , is a Dirichlet distribution, written . We specify a single concentration parameter , such that for and . Note that, from the fact that a Dirichlet realization can be generated as a normalized vector of independent gamma random variables, this conjugate prior is equivalent to independent exponential prior distributions on each un-normalized weight: for .
Now, suppose you have the observed sample . For notational convenience, we allow for in the case of repeated values and write so that and . The posterior distribution for is proportional to . Again using the constructive definition for the Dirichlet distribution, the posterior distribution on can thus be written as . That is, each weight is an independent exponential random variable with both mean and standard deviation equal to .
This model places no real restrictions on the DGP beyond that of independence across observations. In particular, could be so large as to include an practically exhaustive set of values. However, from here on we will focus on the limiting prior that arises as . This ‘non-informative’ limit yields a massive computational convenience: as the weights for unobserved support points converge to a degenerate random variable at zero: for . Our posterior for the DGP is then a multinomial sampling model with random weights on the observed data points. We modify notation and re-write as the vector of weights, so that each DGP is realized in the posterior as
The resulting inference model is thus similar to that underlying the frequentist nonparametric bootstrap of efron1979bootstrap: the observed support acts as a stand-in for the population support. The prior is obviously not our ‘true’ prior, but we view this limiting case as a realistic theoretical construction that provides a convenient foundation for uncertainty quantification.
3.1 Posterior Inference
Consider some statistic of the DGP, say . This could be as simple as the DGP mean, written in the posterior as . Following rubin_bayesian_1981, we can obtain a sample from the posterior on through a Bayesian bootstrap. For :
draw ; then
It is also often possible to derive analytic expressions for the posterior on . These are useful for understanding the sources of uncertainty and they allow us to avoid Monte Carlo computation. For example, if is some scalar associated with each observation (e.g., an element of ) and is the sample vector, the DGP mean for is written as . We can derive the first two posterior moments for this DGP mean as
Throughout, and to refer to posterior mean and variance operators unless otherwise indicated. In more complex examples, we proceed by studying the posterior on first-order Taylor series expansions for the statistic of interest around the posterior mean DGP, which has . The availability of simple analytic expressions for the posterior mean and variance associated with arbitrary statistics is an important part of our approach.
4 Linear HTE projection and adjusted average effects
Linear regression is probably the most commonly applied tool in measurement for heterogeneous treatment effects; lin_agnostic_2013 includes a nice overview. Specifically, one can study HTE through the difference in ordinary least-squares (OLS) projections within each treatment group,
This statistic is obviously relevant if the response is truly linear in . In that case, under a wide range of additive error models, is a consistent and unbiased estimator for the conditional mean , with here denoting the frequentist’s expectation under a true but unknown DGP . In our setting of randomized controlled trials (i.e., an A/B test), treatment randomization implies that is independent from both and , so that can be interpreted as the conditional average causal effect of on given (see imbens_nonparametric_2004 and imbens2015causal for background on causal inference). Moreover, there are many applications where the difference in linear projections will be useful even if the true DGP mean is nonlinear in . For example, we consider in Section 4.2 below the common frequentist practice of using as a measure of the average treatment effect.
4.1 Linear analysis of heterogeneous treatment effects
From our nonparametric Bayesian perspective, the difference between treatment group population OLS projections is a statistic of the underlying DGP. Write the group projections as
where and is the sub-vector of weights for (recall and ). Since the treatment groups are independent, and so that the posterior for the difference can be derived directly from the posterior on (5). A Bayesian bootstrap for thus takes the difference between weighted-OLS fits for treatment and control groups, with weights drawn independently from the distribution.
To provide an analytic summary of the posterior on , we proceed by studying the exact posterior on first-order Taylor series expansions for around its value at the posterior mean DGP, where . lancaster_note_2003 and poirier_bayesian_2011 also apply this approach, under the same Dirichlet-multinomial model that we use, in analysis of OLS projections. A first-order Taylor approximation to the group-specific population OLS in (5) is
where is the OLS projection at posterior mean DGP. As detailed in Appendix LABEL:sec:olsgrad, the gradient is and
where and is the vector of group-specific OLS residuals, for . Since each has an independent exponential posterior distribution, such that , the approximation in (5) has exact posterior variance
This matches the Huber-White heteroskedastic-consistent OLS variance formula (white_heteroskedasticity-consistent_1980); hence, the first-order approximate nonparametric Bayesian posterior variance formula is the same as a widely used frequentist estimator for sampling variance in OLS.
4.1.1 Example: treatment effect heterogeneity between new and old users
To briefly illustrate these ideas, we turn to the EBay experiment data of Section 2 and consider the change in conditional average treatment effect across a single influential covariate: whether or not the user has made a purchase in the past year. Any user who has made such a purchase is labeled existing, while those who have not are new.
Write the feature vector as , with each element an indicator for whether the user is new or existing (so that always). In this simple case, the four random variables of interest – – each correspond to the means associated with independent sub-vectors of and are thus independent from each other. We can apply the formulas in (3) to obtain the exact posterior moments for each and hence for the treatment effects, Working through the formula in (8) shows that, in this setup, the variance of our first-order approximation is very nearly equal to the true posterior variance:
where is the number of members of treatment group that were of user type . Note that the variances in (9) would be exactly equal if the series in (6) was expanded for around , as is done in poirier_bayesian_2011, instead of for as we have here.
Figure 1 presents the posterior distribution for . Results are shown conditional upon the 7.45 million users who visited their myEBay page in the first week of the experiment, the 10.97 million users who visited in the first three weeks, and the 13.22 million who visited at any time during the experiment. The posteriors are summarized through 1000 draws from the Bayesian bootstrap and as well as through a normal distribution centered at the sample OLS difference, , and with variance . Following the discussion above, this approximate variance is very nearly equal to the true posterior variance and the new and existing user average treatment effects are independent from each other in the posterior. Comparing the normal approximation to the bootstrap samples, we see that the former’s tails are thinner than those of the true posterior. Even after 5 weeks and 13 million exposed users, the posterior is skewed with a long right tail for .
4.2 Average treatment effects
A primary goal of many A/B experiments is to infer the average treatment effect (ATE): the average difference between what a user will spend if they experience option B versus if they experience option A. This value, multiplied by the number of users on your website, is the change in revenue that you will receive if moving from option A to option B.
Again referring the reader to imbens2015causal for additional detail, the massive benefit of randomized controlled trials is that each user’s group assignment, , is independent of their treatment effect. The difference between the expected response given that a user has been assigned to treatment versus control can then be interpreted causally as the treatment effect. This motivates the usual measure of ATE for a randomized trial as the difference in group means, . Writing , the corresponding DGP statistic of interest is . Applying (3), we get the posterior moments
with the sum-squared-error for generic length- vector . The posterior is thus centered on the usual estimate for the difference in means between two populations, with variance that is a slight deflation of the common sampling variance formula for this estimator.
There is a large frequentist literature on inference for ATE that advocates replacing with alternative metrics that condition upon the information in covariates, . One recently popular strategy promotes a regression-adjusted average treatment effect (e.g., lin_agnostic_2013),
The advantage of such estimators is that they can have lower sampling variance than . When (11) is also unbiased for some definition of the true ATE (see imbens_nonparametric_2004, for a survey of frequentist inferential targets), it thus leads to more efficient estimation.
One way to justify (11) as representative of the ATE is to assume a linear relationship between and conditional upon . Linearity holds trivially if represents a partitioning of the data into mutually exclusive strata. Application of (11) in this situation is referred to as post stratification; see deng_improving_2013 for use in the context of digital experiments and miratrix_adjusting_2013 for a detailed theoretical overview. But (11) can also be justified for situations where linearity does not hold. berk_covariance_2013, pitkin_improved_2013, lin_agnostic_2013, and lin_comments_2014 study ATE estimation under a variety of assumed sampling models. In each case, they show that is an asymptotically more efficient estimator of average treatment effects than the simple difference in means, .
Both and (under the sampling models in the literature cited above) are unbiased for commonly targeted definitions of the ATE (e.g., the sample average treatment effect – what would have been the average effect if we could have observed every user under both treatment and control). The argument for use of the latter regression-adjusted statistic is that it reduces your estimation variance. Quantifying the amount of variance reduction requires assumptions about the underlying DGP (see the Berk, Pitken et al. and Lin references above) and it is not always clear what one should expect in real examples. Our personal experience with digital experiments is that the frequentist nonparametric bootstrap often shows little, if any, variance reduction after regression adjustment.
Consider covariate averages projected through the population OLS lines from (5),
where . This is a random variable; it is the Bayesian nonparametric target analogous to . As detailed in Appendix LABEL:sec:regappendix, a first-order approximation to (12) is . This has exact posterior mean and Theorem LABEL:ratevarthm shows that
where is the proportion of deviance explained for the group- OLS regression. Thus regression adjustment yields a large variance reduction only if elements of have large covariances with and if the sample size is not too big. Since our digital experiments have tiny signals and massive samples, we should expect precisely what we have experienced in practice: little variance reduction from regression adjustment.
4.2.1 Example: regression-adjustment for average treatment effects
Table LABEL:tab:perweek shows posterior inference for average treatment effects in our EBay experiment, conditional upon the data accumulated through each week of the experiment. We have applied here the full set of user features, expanded into length- covariate vectors as described in Section 2. The table shows posterior means and standard deviations for three ATE statistics: the unadjusted difference in group means, ; the regression-adjusted ATE, ; and the approximation to this regression-adjusted ATE, . For later reference, we also include the average difference between regression trees fit in each treatment group, labeled ; we defer discussion to Section LABEL:sec:tree. Posterior means and standard deviations for are obtained via Bayesian bootstrap sampling, while moments for the other statistics area available analytically as described above.
As predicted by Theorem LABEL:ratevarthm, since our samples are large and our correlations are small, there is practically zero reduction in variance between and . The standard deviations are the same up to two decimal places ($0.01) in every case. Moreover, the Bayesian bootstrap posterior sample for has means and standard deviations that are similar to those of its approximation, , giving evidence to support the practical relevance of (13). While neither this example nor our analytic results imply that regression-adjustment cannot yield lower-variance inference, it does fit with our practical experience that such adjustments makes little difference in experiments like the one studied here.