Applying the Delta method in metric analytics:
A practical guide with novel ideas
Abstract.
During the last decade, the information technology industry has adopted a datadriven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly and solve big data problems using distributed algorithms. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four reallife examples in metric development and analytics for largescale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tool from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics, by highlighting both its classic and novel applications.
1. Introduction
1.1. Background
The era of big data brings both blessings and curses (Fan et al., 2014). On one hand, it allows us to measure more accurately and efficiently, to study smaller and more subtle effects, and to tackle problems with smaller signalnoiseratio. On the other hand, it demands larger storage and more intensive computations, forcing the data science community to strive for efficient algorithms which can run in parallel in a distributed system. Many existing algorithms and methods (e.g., support vector machines) that are known to work well in small data scenarios do not scale well in a big data setting (Cortes and Vapnik, 1995; Forero et al., 2010; Wang and Zhou, 2012). Recently, there has been an increasing amount of research interest in metaalgorithms, which can extend algorithms that are difficult to parallelize into distributed algorithms (Bottou, 2010; Zinkevich et al., 2010), and ideas that resemble the divideandconquer algorithm (Kleiner et al., 2014; Jordan et al., 2018).
At the same time, there is a class of algorithms which are trivially parallelizable and therefore can downsize big data problems into smaller ones. The key idea behind them, which dates back to Fisher (1922), is to summarize the original data set by a lowdimensional vector of summary statistics, which can often be computed in a parallel and distributed way. For example, to estimate the mean and variance of a normal distribution from independent and identically distributed (i.i.d.) observations, we only need to obtain their sum and sum of squares, which are the corresponding summary statistics^{1}^{1}1In fact, they are sufficient statistics (Casella and Berger, 2002), i.e., they can represent the original dataset perfectly without losing any information. and can be trivially computed in a distributed fashion. In datadriven businesses such as information technology, these summary statistics are often referred to as metrics, and used for measuring and monitoring key performance indicators (Dmitriev and Wu, 2016; Deng and Shi, 2016). In practice, it is often the changes or differences between metrics, rather than measurements at the most granular level, that are of greater interests. In the context of randomized controlled experimentation (or A/B testing), inferring the changes of metrics can establish causality (Kohavi et al., 2013; Tang et al., 2010; Xu et al., 2015), e.g., whether a new user interface design causes longer view time and more clicks.
1.2. Central limit theorem and Delta method
We advocate directly modeling the metrics rather than the original dataset. When analyzing data at the most granular level (e.g., user), we need some basic assumptions of the underlying probabilistic model, such as i.i.d. observations. When looking at the metrics level, we also need to know their joint distribution. This is where the blessing of big data comes into play. Given large sample sizes, the metrics often possess desirable asymptotic properties due to the central limit theorem (Van der Vaart, 2000). To ensure that the paper is selfcontained, we first review the central limit theorem, in its most wellknown form. Let be i.i.d. observations with finite mean and variance We let denote the sample average, then as the sample size in distribution
A common application of the central limit theorem is to construct the confidence interval of as where is the th quantile for This is arguably one of the most influential results of asymptotic statistics used in almost all scientific fields whenever an estimate with error bars is presented.
While an influential result, the central limit theorem in its basic form only applies to the average of i.i.d. random variables, and in practice our metrics are often more complex. To enjoy the blessing of big data, we employ the Delta method, which extends the normal approximations from the central limit theorem broadly. For illustration we only review the univariate case. For any random variable (the subscript indicates its dependency on e.g., sample average) and constant such that in distribution as the Delta method allows us to extend its asymptotic normality to any continuous transformation To be more specific, by using the fact that and the first order Taylor expansion (Rudin et al., 1964)
(1) 
we have in distribution
This is the Delta method. Without relying on any assumptions other than “big data,” the Delta method is general. It is also memorable – anyone with basic knowledge of calculus can derive it. Moreover, the calculation is trivially parallelizable and can be easily implemented in a distributed system. Nevertheless, although conceptually and theoretically straightforward, the practical difficulty is to find the right “link” function that transforms the simple average to our desired metric. Because of different attributes of the metrics, such as the underlying data generating process and aggregation levels, the process of discovering the corresponding transformation can be challenging. However, unfortunately, although various applications of the Delta method have previously appeared in the data mining literature (Deng et al., 2013; Kohavi et al., 2013; Tang et al., 2010; Kohavi et al., 2010), the method itself and the discovery of were often deemed technical details and only briefly mentioned or relegated to appendices. Motivated by this gap, we aim to provide a practical guide that highlights the effectiveness and importance of the Delta method, hoping to help fellow data scientists, developers and applied researchers conduct trustworthy metric analytics.
1.3. Scope and contributions
As a practical guide, this paper presents four applications of the Delta method in reallife scenarios, all of which have been deployed in Microsoft’s online A/B testing platform ExP (Kohavi et al., 2013, 2009a) and employed to analyze experimental results on a daily basis:

In Section 2, we derive the asymptotic distributions of ratio metrics^{2}^{2}2Pedantically, ratio of two measurements of the same metric, from the treatment and control groups.. Compared with standard approaches by Fieller (1940, 1954), the Delta method provides a much simpler and yet almost equally accurate and effective solution;

In Section 4, by extending the Delta method to outer confidence intervals (Meyer, 1987), we propose a novel hybrid method to construct confidence intervals for quantile metrics. Unlike most existing methods, Our proposal does not require estimating an unknown density function, which itself is often a theoretically challenging and computationally intensive task, and has been a center of criticism (Brown and Wolfe, 1983);

In Section 5, we handle missing data in withinsubject studies by combining the Delta method with data augmentation. We demonstrate the effectiveness of “big data small problem” approach when directly modeling metrics.
The main purpose of this paper is to promote the Delta method as a general, memorable and efficient tool for metric analytics. In particular, our contributions to the existing data mining literature include:

Practical guidance for scenarios such as inferring ratio metrics, where the Delta method offers scalable and easiertoimplement solutions;

A novel and computationally efficient solution to estimate the sampling variance of quantile metrics;

A novel data augmentation technique that employs the Delta method to model metrics resulting from withinsubject or longitudinal studies with unspecified missing data patterns.
For reproducibility and knowledge transfer, we provide all relevant computer programs at https://aka.ms/exp/deltamethod.
2. Inferring percent changes
2.1. Percent change and Fieller interval
Measuring change is a common theme in applied data science. In online A/B testing (Kohavi et al., 2013; Tang et al., 2010; Deng and Shi, 2016; Kohavi et al., 2007; Dmitriev et al., 2017), we estimate the average treatment effect (ATE) by the difference of the same metric measured from treatment and control groups, respectively. In time series analyses and longitudinal studies, we often track a metric over time and monitor changes between different time points. For illustration, let be i.i.d. observations from the control group with mean and variance i.i.d. observations from the treatment group with mean and variance and the covariance^{3}^{3}3For A/B testing where the treatment and control groups are independently sampled from a super population, between ’s and ’s. Let and be two measurements of the same metric from the treatment and control groups, respectively, and their difference is an unbiased estimate of the ATE Because both and are approximately normally distributed, their difference also follows an approximate normal distribution with mean and variance
Consequently, the wellknown 100(1)% confidence interval of is where is the finitesample analogue of and can be computed using the sample variances and covariance of the treatment and control observations, denoted as and respectively.
In practice, however, absolute differences as defined above are often hard to interpret because they are not scaleinvariant. Instead, we focus on the relative difference or percent change estimated by The key problem of this section is constructing a 100(1)% confidence interval for . For this classic problem, Fieller (1940, 1954) seems to be the first to provide a solution. To be specific, let denote the th quantile for the distribution with degree of freedom and then Fieller’s interval of is
(2) 
Although widely considered as the standard approach for estimating variances of percent changes (Kohavi et al., 2009b), deriving (2) is cumbersome (see (Von Luxburg and Franz, 2009) for a modern revisit). The greater problem is that, even when given this formula, applying it often requires quite some effort. According to (2) we need to not only estimate the sample variances and covariance, but also the parameter
2.2. Delta method and Edgeworth correction
The Delta method provides a more intuitive alternative solution. Although they can be found in classic textbooks such as (Casella and Berger, 2002), this paper (as a practical guide) still provides all the relevant technical details. We let , and A multivariate analogue of (1) suggests that which implies that
(3) 
For let which are also i.i.d. observations. Consequently, we can rewrite (3) as leading to the Delta method based confidence interval
(4) 
(4) is easier to implement than (2), and is in fact the limit of (2) in the large sample case, because as we have and Although Fieller’s confidence interval can be more accurate for small samples (Hirschberg and Lye, 2010), this benefit appears rather limited for big data. Moreover, the Delta method can also be easily extended for a better approximation by using Edgeworth expansion (Hall, 2013; Boos and HughesOliver, 2000). To be more specific, (3) suggests that in distribution whose cumulative distribution function
contains a correction term in addition to the cumulative distribution function of the standard normal, and , the skewness of the ’s. By simply replacing the terms “” in (4) with and the th and th quantiles of respectively, we obtain the corresponding Edgeworth expansion based confidence interval. Finally, we can add a secondorder bias correction term to the point estimate in (4); the same correct term can be applied to the Edgeworth expansion based interval.
2.3. Numerical examples
To illustrate the performance of Fieller’s interval in (2), the Delta method interval in (4), and the Edgeworth interval with and without bias correction under different scenarios, we let the sample size For each fixed we assume that the treatment and control groups are independent, and consider three simulation models for experimental i.i.d. experimental units
The above models aim to mimic the behaviors of our most common metrics, discrete or continuous. For each case, we repeatedly sample data sets, and for each data set we construct Fieller’s and the Delta method intervals, respectively, and then add correction to the Delta method result. We also construct the Edgeworth expansion interval without and with the bias correction. We report the corresponding coverage rates in Table 1, from which we can draw several conclusions. For big data (), all methods achieve nominal (i.e., ) coverage rates for all simulation models. For small data (), although Fieller’s interval seems more accurate for some simulation models (e.g., normal), other methods perform comparably, especially after the bias correction. For simplicity in implementation and transparency in applications, we recommend using the Delta method based interval (4), with the bias correction.
Method  n  Fieller  Delta  Delta(BC)  Edgeworth  Edgeworth(BC) 

Normal  20  0.9563  0.9421  0.9422  0.9426  0.9426 
Normal  50  0.9529  0.9477  0.9477  0.9478  0.9477 
Normal  200  0.9505  0.9490  0.9491  0.9490  0.9490 
Normal  2000  0.9504  0.9503  0.9503  0.9503  0.9503 
Poisson  20  0.9400  0.9322  0.9370  0.9341  0.9396 
Poisson  50  0.9481  0.9448  0.9464  0.9464  0.9478 
Poisson  200  0.9500  0.9491  0.9493  0.9496  0.9498 
Poisson  2000  0.9494  0.9494  0.9495  0.9494  0.9495 
Bernoulli  20  0.9539  0.9403  0.9490  0.9476  0.9521 
Bernoulli  50  0.9547  0.9507  0.9484  0.9513  0.9539 
Bernoulli  200  0.9525  0.9513  0.9509  0.9517  0.9513 
Bernoulli  2000  0.9502  0.9500  0.9499  0.9501  0.9500 
3. Decoding cluster randomization
3.1. The variance estimation problem
Two key concepts in a typical A/B test are the randomization unit – the granularity level where sampling or randomization is performed, and the analysis unit – the aggregation level of metric computation. Analysis is straightforward when the randomization and analysis units agree (Deng et al., 2017), e.g., when randomizing by user while also computing the average revenue per user. However, often the randomization unit is a cluster of analysis units (it cannot be more granular than the analysis unit, otherwise the analysis unit would contain observations under both treatment and control, nullifying the purpose of differentiating the two groups). Such cases, sometimes referred to as cluster randomized experiments in the econometrics and statistics literature (Athey and Imbens, 2017; Klar and Donner, 2001), are quite common in practice, e.g., enterprise policy prohibiting users within the same organization from different experiences, or the need to reduce bias in the presence of network interference (Backstrom and Kleinberg, 2011; Eckles et al., 2017; Gui et al., 2015). Perhaps more ubiquitously, for the same experiment we usually have metrics with different analysis units. For example, to meet different business needs, most userrandomized experiments run by ExP contain both userlevel and pagelevel metrics.
We consider two average metrics^{4}^{4}4Pedantically, they are two measurements of the same metric. We often use metric to refer to both the metric itself (e.g. revenueperuser) and measurements of the metric (e.g. revenueperuser of the treatment group) and this distinction would be clear in the context. of the treatment and control groups, assumed to be independent. Without loss of generality, we only focus on the treatment group with clusters. For the th cluster contains analysis unit level observations (). Then the corresponding average metric is We assume that within each cluster the observations ’s are i.i.d. with mean and variance , and across clusters are also i.i.d.
3.2. Existing and Delta method based solutions
is not an average of i.i.d. random variables, and the crux of our analysis is to estimate its variance. Under strict assumptions, closedform solutions for this problem exist (Donner, 1987; Klar and Donner, 2001). For example, when and for all
(5) 
where is the betweencluster variance and is the coefficient of intracluster correlation, which quantifies the contribution of betweencluster variance to the total variance. To facilitate understanding of the variance formula (5), two extreme cases are worth mentioning:

If then for each and all In this case, and

If , then for all and therefore the observations ’s are in fact i.i.d. In this case, and (5) reduces to .
However, unfortunately, although theoretically sound, (5) has only limited practical value because the assumptions it makes are unrealistic. In reality, the cluster sizes and distributions of for each cluster are different, which means that and are different.
Another common approach is the mixed effect model, also known as multilevel/hierarchical regression (Gelman and Hill, 2006), where depends on and while the parameters themselves follow a “higher order” distribution. Under this setting, we can infer the treatment effect as the “fixed” effect for the treatment indicator term^{5}^{5}5Y Treatment + (1—User) in lme4 notation. A detailed discussion is beyond the scope of this paper; see Gelman and Hill (2006); Bates et al. (2014a).. Stan (Carpenter et al., 2016) offers a Markov Chain Monte Carlo (MCMC) implementation aiming to infer the posterior distribution of those parameters, but this needs significant computational effort for big data. Moreover, the estimated ATE, i.e., the coefficient for the treatment assignment indicator, is for the randomization unit (i.e., cluster) but not the analysis unit level, because it treats all clusters with equal weights and can be viewed as the effect on the double average which is usually different than the population average (Deng and Shi, 2016). This distinction doesn’t make a systematic difference when effects across clusters are homogeneous. However, in practice the treatment effects are often heterogeneous, and using mixed effect model estimates without further adjustment steps could lead to severe biases.
On the contrary, the Delta method solves the problem directly from the metric definition. Rewrite into Let , and divide both the numerator and denominator by ,
Albeit not an average of i.i.d. random variables, is a ratio of two averages of i.i.d. randomization unit level quantities (Deng et al., 2017). Therefore, by following (3) in Section 2.2,
(6) 
Therefore, the variance of depends on the variance of a centered version of (by a multiple of , not the constant as typically done). Intuitively, both the variance of cluster size and that of withincluster sum of observations contributes to (3.2). In particular, is an important contributor of the variance of leading to a practical recommendation for the experimenters – it is desirable to make the cluster sizes homogeneous; otherwise it can be difficult to detect small treatment effects due to low statistical power.
3.3. Numerical examples
Because the coverage property of (3.2) has been extensively covered in Section 2.3, we only focus on comparing it with the mixed effect model here. We first describe the data generating process, which consists of a twolevel hierarchy as described in the previous section. First, at the randomization unit level, let the total number of clusters To mimic cluster heterogeneity, we divide clusters into three categories: small, medium and large. We generate the numbers of clusters in the three categories by the following multinomial distribution:
For the th cluster, depending on which category it belongs to, we generate and in the following way^{6}^{6}6The positive correlation between and is not important, and reader can try out code with different configuration.:

Small:

Medium:

High:
Second, for each fixed let for all This choice is motivated by binary metrics such as pageclickrate, and because of it we can determine the ground truth by computing the weighted average of weighted by the cluster sizes and the mixture of small, medium and large clusters.
Our goal is to infer and we compare the following three methods:

A naive estimator pretending all observations are i.i.d;

Fitting a mixed effect model with a cluster random effect

Using the metric as in the first method, but using the Delta method for variance estimation.
Based on the aforementioned data generating mechanism, we repeatedly and independently generate 1000 data sets. For each data set, we compute the point and variance estimates of using the naive, mixed effect, and delta methods. We then compute empirical variances for the three estimators, and compare them to the average of estimated variances. We report the results in Table 2.
Method  Ground Truth  SD(True)  Estimate  Avg. SE(Model) 

Naive  0.667  0.00895  0.667  0.00522 
Mixed effect  0.667  0.00977  0.547  0.00956 
Delta method  0.667  0.00895  0.667  0.00908 
Not surprisingly, the naive method underestimates the true variance – we had treated the correlated observations as independent. Both the Delta method and the mixed effect model produced satisfactory variance estimates. However, although both the naive and the Delta method correctly estimated , the mixed effect estimator is severely biased. This shouldn’t be a big surprise if we look deeper into the model and where the random effects are centered so . The sum of the intercept terms and stands for the percluster mean , and represents the average of percluster mean, where we compute the mean within each cluster first, and then average over clusters. This is different from the metrics defined as simple average of in the way that in the former all clusters are equally weighted and in the latter case bigger clusters have more weight. The two definitions will be the same if and only if either there is no heterogeneity, i.e. percluster means are all the same, or all clusters have the same size. We can still use the mixed effect model to get a unbiased estimate. This requires us to first estimate every (thus ), and then compute by applying the correct weight . The mixed effect model with the above formula gave a new estimate , much closer to the ground truth. Unfortunately, it is still hard to get the variance of this new estimator.
In this study we didn’t consider the treatment effect. In ATE estimation, the mixed effect model will similarly result in a biased estimate for the ATE for the same reason, as long as percluster treatment effects vary and cluster sizes are different. The fact that the mixed effect model provides a double average type estimate and the Delta method estimates the “population” mean is analogous to the comparison of the mixed effect model with GEE (generalized estimating equations) (Liang and Zeger, 1986). In fact, in the Gaussian case, the Delta method can be seen as the ultimate simplification of GEE’s sandwich variance estimator after summarizing data point into sufficient statistics. But the derivation of GEE is much more involved than the central limit theorem, while we can explain the Delta method in a few lines and it is not only more memorable but also provides more insights in (3.2).
4. Efficient variance estimation for quantile metrics
4.1. Sample quantiles and their asymptotics
Although the vast majority of metrics are averages of user telemetry data, quantile metrics form another category that is widely used to focus on the tails of distributions. In particular, this is often the case for performance measurements, where we not only care about an average user’s experience, but even more so about those that suffer from the slowest responses. Within the web performance community, quantiles (of, for example, page loading time) at 75%, 95% or 99% often take the spotlight. In addition, the 50% quantile (median) is sometimes used to replace the mean, because it is more robust to outlier observations (e.g., errors in instrumentation). This section focuses on estimating the variances of quantile estimates.
Suppose we have i.i.d. observations generated by a cumulative distribution function and a density function ^{7}^{7}7We do not consider cases when has discrete mass, and will have jumps. In this case the quantile can take many values and is not well defined. In practice this case can be seen as continuous case with some discrete correction.. The theoretical th quantile for the distribution is defined as . Let be the ascending ordering of the original observations. The sample quantile at is if is an integer. Otherwise, let be the floor of , then the sample quantile can be defined as any number between and or a linear interpolation of the two^{8}^{8}8When p is 0.5, the 50% quantile, or median, is often defined as the average of the middle two numbers if we have even number of observations.. For simplicity here we use which will not affect any asymptotic results. It is a wellknown fact that, if are i.i.d. observations, following the central limit theorem and a rather straightforward application of the Delta method, the sample quantile is approximately normal (Casella and Berger, 2002; Van der Vaart, 2000):
(7) 
where However, unfortunately, in practice we rarely have i.i.d. observations. A common scenario is in search engine performance telemetry, where we receive an observation (e.g., page loading time) for each server request or pageview, while randomization is done at a higher level such as device or user. This is the same situation we have seen in Section 3, where are clustered. To simplify future notations, we let where is the indicator function. Then (3.2) can be used to compute , and (7) holds in the clustered case with . This generalizes the i.i.d. case where . Note that the Delta method is instrumental in proving (7) itself, but a rigorous proof involves a rather technical last step that is beyond our scope. A formal proof can be found in (Van der Vaart, 2000).
4.2. A Delta method solution to a practical issue
Although theoretically sound, the difficulty of applying (7) in practice lies in the denominator whose computation requires the unknown density function at the unknown quantile A common approach is to estimate from the observed using nonparametric methods such as kernel density estimation (Wasserman, 2003). However, any nonparametric density estimation method is trading off between bias and variance. To reduce variance, more aggressive smoothing and hence larger bias need to be introduced to the procedure. This issue is less critical for quantiles at the body of the distribution, e.g. median, where density is high and more data exists around the quantile to make the variance smaller. As we move to the tail, e.g. 90%, 95% or 99%, however, the noise of the density estimation gets bigger, so we have to introduce more smoothing and more bias. Because the density shows up in the denominator and density in the tail often decays to , a small bias in estimated density can lead to a big bias for the estimated variance (Brown and Wolfe (1983) raised similar criticisms with their simulation study). A second approach is to bootstrap, resampling the whole dataset many times and computing quantiles repeatedly. Unlike an average, computing quantiles requires sorting, and sorting in distributed systems (data is distributed in the network) requires data shuffling between nodes, which incurs costly network I/O. Thus, bootstrap works well for small scale data but tends to be too expensive in large scale in its original form (efficient bootstrap on massive data is a research area of its own (Kleiner et al., 2014)).
An alternative method without the necessity for density estimation is more desirable, especially from a more practical perspective. One such method is called outer confidence interval (outer CI) (Krewski, 1976; Meyer, 1987), which produces a closedform formula for quantile confidence intervals using combinatorial arguments. Recall that and is the count of observations no greater than the quantile. In the aforementioned i.i.d. case, follows a binomial distribution. Consequently, when is large where . If the quantile value then . The above equation can be inverted into a confidence interval for . This means with 95% probability the true percentile is between the lower rank and upper rank !
The traditional outer CI depends on being i.i.d. But when there are clusters, simply takes a different formula (3.2) by the Delta method and the result above still holds. Hence the confidence interval for a quantile can be computed in the following steps. First we compute the quantile, which can done efficiently in distributed system utilizing sketching and external sort. Then we compute using (3.2). The last step is to compute and and then fetch the two ranks. We call this outer CI with preadjustment. This method reduces the complexity of computing a quantile and its confidence interval into a Delta method step and subsequently fetching three “ntiles”. However, in this algorithm the ranks depend on the data (more specifically the , requiring a pass through the data to be computed), which means that it is more difficult to optimize the quantile retrieval. To make the algorithm more efficient, we noticed that (7) also implies that the change from being i.i.d. to clustered only requires an adjustment to the numerator , which is a simple rescaling step, and the correction factor does not depend on the unknown density function in the denominator. If the outer CI were to provide a good confidence interval in i.i.d. case, a rescaled outer CI with the same correction term should also work for the clustered case, at least when is large. This leads to the outer CI with postadjustment algorithm, where we first compute the ranks according to outer CI for the i.i.d. case (which only requires and ) and retrieve them, and then compute the correction factor via the Delta method and apply it directly to the previously computed confidence interval. We implemented this latter method in ExP using Apache Spark (Zaharia et al., 2016) and SCOPE (Chaiken et al., 2008).
4.3. Numerical examples
To test the validity and performance of the adjusted outer CI method, we compare its coverage to a standard nonparametric bootstrap ( replicates). The simulation setup consists of users (clusters) with observations each ( are uniformly distributed). Each observation is the sum of two i.i.d. random variables , where is constant for each user. We consider two cases, one symmetric and one heavytailed distribution:
First, we find the ”true” 95th percentile value of these distribution by computing its value for a very large sample (). Second, we compute the confidence intervals for simulation runs using bootstrap and outer CI with pre and postadjustment and compare their coverage estimates (0.002 standard error), shown in Table 3. We found that when the sample contains 1000 or more clusters, all methods provide good coverage. Pre and postadjustment outer CI results are both very close to the much more computationally expensive bootstrap (in our unoptimized simulations, the outer CI method was 20 times faster than bootstrap). When the sample size was smaller than 1000 clusters, bootstrap was noticeably inferior to outer CI. For all sample sizes, preadjustment provided slightly larger coverage than postadjustment, and this difference increased for smaller samples. In conclusion, postadjustment outer CI works very well for large sample sizes and reasonably well for smaller samples, but has slightly inferior coverage compared to preadjustment. For big data, outer CI with postadjustment is recommended due its efficiency, while for median to small sample sizes, outer CI with preadjustment is preferred if accuracy is paramount.
Distribution  Bootstrap  Outer CI  Outer CI  

(preadj.)  (postadj.)  
Normal  100  0.9039  0.9465  0.9369 
Normal  1000  0.9500  0.9549  0.9506 
Normal  10000  0.9500  0.9500  0.9482 
Lognormal  100  0.8551  0.9198  0.9049 
Lognormal  1000  0.9403  0.9474  0.9421 
Lognormal  10000  0.9458  0.9482  0.9479 
5. Missing Data and WithinSubject Analyses
5.1. Background
Consider the case that we are tracking a metric overtime. For example, businesses need to track key performance indicators like user engagement and revenue daily or weekly for a given audience. To simplify, let us say we are tracking weekly visitsperuser. Visitsperuser is defined as a simple average for each week. If there is a drop between and , how do we set up an alert so that we can avoid sending triggering it due to random noise and control the false alarm rate? Looking at the variance, we see
and we need to have a good estimate of the covariance term because we know its nonzero.
Normally, estimating the covariance of two sample averages is trivial and the procedure is very similar to the estimation of the sample variance. But there is something special about this case — missing data. Not everyone uses the product every week. For any online and offline metric, we are only able to define the metric using observed data. If for every user we observe its value in week and , covariance can be estimated using the sample covariance formula. But if there are many users who appears in week but not in , it is unclear how to proceed. The naive approach is to estimate the covariance using complete observations, i.e. users who appear in both weeks. However, this only works if the data is missing completely at random. In reality, active users will show up more often and are more likely to appear in both weeks, meaning the missing data are obviously not random.
In this section we show how the Delta method can be very useful for estimating for any two time points and . We then use this to show how we can analyze withinsubject studies, also known as prepost, repeated measurement, or longitudinal studies. Our method starts with metrics directly, highly contrasting with traditional methods such as mixed effect models which build models from individual user’s data. Because we study metrics directly, our model is small and easy to solve with its complexity constant to the scale of data.^{9}^{9}9Part of the work in this section has previously appeared in a technical report (Guo and Deng, 2015). Since authoring the technical report, the authors developed a better understanding of the differences between mixed effect models and the proposed method, and we present our new findings here. The technical report has more details about other types of repeated measurement models.
5.2. Methodology
How do we estimate covariance with a completely unknown missing data mechanism? We present a novel idea using the Delta method after data augmentation. Specifically, we use an additional indicator for the presence/absence status of a user in each period . For user in period let if user appears in period , and otherwise. For each user in period , instead of one scalar metric value , we augment it to a vector . When , i.e. user is missing, we define . Under this simple augmentation, the metric value , taking average over those nonmissing measurements in period , is the same as In this connection,
where the last equality is by dividing both numerator and denominator by the same total number of users who have appeared in any of the two periods.^{10}^{10}10Actually, if there are more than two period, we can either use only users appeared in any of the two periods, or users appeared in any of all the periods. It is mathematically the same thing if we added more users and then treat them as not appeared in and , i.e. remains the same. Thanks to the central limit theorem, the vector is also asymptotically (multivariate) normal with covariance matrix , which can be estimated using sample variance and covariance because there is no missing data after augmentation. In Section 2 we already applied the Delta method to compute the variance of a ratio of metrics by taking the first order Taylor expansion. Here, we can expand the two ratios to their first order linear form where and are the means of and and the expansion for is similar. can then be computed using .
We can apply this to the general case of a withinsubject study. Without loss of any generality and with the benefit of a concrete solution, we only discuss a crossover design here. Other designs including more periods are the same in principle. In a crossover design, we have two groups I and II. For group I, we assign them to the treatment for the first period and control for the second. For group II, the assignment order is reversed. We often pick the two groups using random selection, so each period is an A/B test by itself.
Let be the metric for group in period . Let . We know for a mean vector and covariance matrix . has the form since the two groups are independent with the same distribution. With the help of the Delta method, we can estimate from the data and will treating it as a constant covariance matrix (hence ). Our model concerns the mean vector using other parameters which represent our interest. In a crossover design, where the first two are baseline means for the two periods and our main interest is the treatment effect . (We assume there is no treatment effect for group I carried over to the 2nd period. To study carry over effect, a more complicated design need to be employed.) In this parameterization,
(8) 
The maximum likelihood estimator and Fisher information theory (Van der Vaart, 2000) paved a general way for us to estimate as well as the variance of the estimators for various mean vector models. For example, if we want to model the relative change directly, we just need to change the addition of into multiplication of . Notice the model we are fitting here is very small, almost a toy example from a text book. All the heavy lifting is in computing which is dealt with by the Delta method where all computations are trivially distributed. When is linear as in the additive crossover model above, and where the Fisher Information .
5.3. Numerical examples
We simulate two groups with 1000 users each. The first group receives treatment in period 1, then control in period 2, while the second group receives the inverse. For each user, we impose an independent user effect that follows a normal distribution with mean 10 and standard deviation 3, and independent noises with mean 0 and standard deviation 2. Each user’s base observations (before treatment effect) for the two periods are . We then model the missing data and treatment effect such that they are correlated. We define a user’s engagement level by its user effect through , i.e. the probability that a user’s is bigger than another random user. We model the treatment effect as an additive normal with mean 10 and standard deviation 0.3, multiplied by . For each user and each of the two periods, there is a 1  max(0.1, ) probability of this user being missing. We can interpret the missing data as a user not showing up, or as a failure to observe. In this model, without missing data, every user has two observations and the average treatment effect should be because . Since we have missing data and it is more likely for lower engagement levels to be missing, we expect the average treatment effect for the population of all observed users to be between 5 to 10. In current model we didn’t add a time period effect such that the second period could have a different mean from the first period’s mean , but in analysis we always assume that this effect could exist.
We simulate the following 1000 times: each time we run both the mixed effect model as well as the additive crossover model (8) and record their estimates of the ATE and the corresponding estimated variance. We then estimate the true estimator variance using the sample variance of those estimates among 1000 trials and compare that to the mean of the estimated variance to evaluate the quality of variance estimations. We also compute the average of ATE estimates and compare to the true ATE to assess the bias.
Ground Truth  SD(True)  Estimate  Avg. SE(Model)  

mixed effect  6.592  0.1295  7.129  0.1261 
Delta method  6.592  0.1573  6.593  0.1568 
Table 4 summarizes the results. We found both methods provide good variance estimation and the mixed effect model shows smaller variance. However, mixed effect also displays an upward bias in the ATE estimate while the Delta method closely tracks the true effect. To further understand the difference between the two, we separate the data into two groups: users with complete data, and users who only appear in one period (incomplete group). We run the mixed effect model for the the two groups separately. Note that in the second group each user only appears once in the data, so the model is essentially a linear model. Our working hypothesis is the following: because the mixed effect model assumes a fixed treatment effect, the effect for the complete and incomplete groups must be the same. The mixed effect model can take advantage of this assumption and construct an estimator by weighted average of the two estimates from the two groups, with the optimal weight inversely proportional to their variances. Table 5 shows the weighted average estimator is indeed very close to mixed effect model for both estimate and variance. The weighted average is closer to the complete group because the variance there is much smaller than that of the incomplete group since withinsubject comparison significantly reduces noise. This explains why the mixed effect model can produce misleading results in withinsubject analysis whenever missing data patterns can be correlated with the treatment effect. The Delta method, on the other hand, offers a flexible and robust solution.
Estimate  Var  

mixed effect model  7.1290  0.00161 
mixed effect model on complete group  7.3876  0.00180 
linear model on incomplete group  5.1174  0.01133 
weighted avg estimator  7.0766  0.00155 
6. Concluding remarks
Measure everything is not only an inspiring slogan, but also a crucial step towards the holy grail of datadriven decision making. In the big data era, innovative technologies have tremendously advanced user telemetry and feedback collection, and distilling insights and knowledge from them is an imperative task for business success. To do so, we typically apply certain statistical models at the level of individual observations, fit the model using numerical procedures such as solving optimization problems, and draw conclusions and assess uncertainties from the fitted models. However, for big data such an analytical procedure can be very challenging. Based on the key observation that metrics are approximately normal due to the central limit theorem, this paper offers an alternative perspective by advocating modeling metrics, i.e., summaries or aggregations of the original data sets, in a direct manner. By doing so, we can decompose big data into small problems. However, although conceptually sound, in practice these metric level models often involve nonlinear transformations of data or complex data generating mechanisms, posing several challenges for trustworthy and efficient metric analytics.
To address these issues, we promoted the Delta method’s central role in making it possible to extend the central limit theorem to new territories. We demonstrated how to apply the Delta method in four reallife applications, and illustrated how this approach naturally leads to trivially parallelizable and highly efficient implementations. Among these applications, ratio metrics, clustered randomizations and quantile metrics are all common and important scenarios for A/B testing, and business analytics in general. Withinsubject studies are becoming more popular for superior sensitivities, and missing data with an unknown mechanism is ubiquitous in both the online and offline worlds. Our contribution to technical improvements, novel ideas and new understandings includes bias correction for ratio metric estimation, combination of the Delta method and outer confidence intervals for quantile metric variance estimation, and the idea of data augmentation for general missing data problems in withinsubject studies. We also revealed the connection between the Delta method and mixed effect models, and explained their differences. In addition, we pointed out the advantage of the Delta method in the presence of an unknown missing data mechanism. Overall speaking, we hope this paper can serve as a practical guide of applying the Delta method in largescale metric analyses and A/B tests, so that it is no longer just a technical detail, but a starting point and a central piece of analytical thinking.
Although the Delta method can help tackle big data problems, it does not replace the need for rigorous experimental designs and probabilistic modeling. For example, “optimal” choices for cluster configurations, randomization mechanisms and data transformations are known to increase the sensitivities of metrics (Xie and Aurisset, 2016; Kharitonov et al., 2017; Budylin et al., 2018). We leave them as future research directions.
Acknowledgements.
We benefited from insightful discussions with Ya Xu and Daniel Ting on quantile metrics and outer confidence intervals. We thank Jong Ho Lee for implementing the efficient quantile metric estimation and confidence interval construction in ExP using Apache Spark, and Yu Guo for previous work on repeated measurements.References
 (1)
 Athey and Imbens (2017) Susan Athey and Guido W Imbens. 2017. The econometrics of randomized experiments. Handbook of Economic Field Experiments 1 (2017), 73–140.
 Backstrom and Kleinberg (2011) Lars Backstrom and Jon Kleinberg. 2011. Network bucket testing. In Proceedings of the 20th international conference on World wide web. ACM, 615–624.
 Bates et al. (2014a) Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2014a. Fitting linear mixedeffects models using lme4. arXiv preprint arXiv:1406.5823 (2014).
 Bates et al. (2014b) Douglas Bates, Martin Maechler, Ben Bolker, Steven Walker, et al. 2014b. lme4: Linear mixedeffects models using Eigen and S4. R package version 1, 7 (2014), 1–23.
 Boos and HughesOliver (2000) Dennis D Boos and Jacqueline M HughesOliver. 2000. How large does have to be for and intervals? The American Statistician 54, 2 (2000), 121–128.
 Bottou (2010) Léon Bottou. 2010. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010. Springer, 177–186.
 Brown and Wolfe (1983) Morton B Brown and Robert A Wolfe. 1983. Estimation of the variance of percentile estimates. Computational Statistics & Data Analysis 1 (1983), 167–174.
 Budylin et al. (2018) Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 55–63.
 Carpenter et al. (2016) Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2016. Stan: A probabilistic programming language. Journal of Statistical Software 20 (2016), 1–37.
 Casella and Berger (2002) George Casella and Roger L Berger. 2002. Statistical Inference, Second Edition. Duxbury Press: Pacific Grove, CA.
 Chaiken et al. (2008) Ronnie Chaiken, Bob Jenkins, PerÅke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment 1 (2008), 1265–1276.
 Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Supportvector networks. Machine Learning 20 (1995), 273–297.
 Deng et al. (2017) A. Deng, J. Lu, and J. Litz. 2017. Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 641–649.
 Deng and Shi (2016) A. Deng and X. Shi. 2016. Datadriven metric development for online controlled experiments: Seven lessons learned. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
 Deng et al. (2013) Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing preexperiment data. In Proceedings of the 6th ACM WSDM Conference. 123–132.
 Dmitriev et al. (2017) Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 1427–1436. https://doi.org/10.1145/3097983.3098024
 Dmitriev and Wu (2016) Pavel Dmitriev and Xian Wu. 2016. Measuring Metrics. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 429–437.
 Donner (1987) Allan Donner. 1987. Statistical methodology for paired cluster designs. American Journal of Epidemiology 126, 5 (1987), 972–979.
 Eckles et al. (2017) Dean Eckles, Brian Karrer, and Johan Ugander. 2017. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference 5, 1 (2017).
 Fan et al. (2014) Jianqing Fan, Fang Han, and Han Liu. 2014. Challenges of big data analysis. National Science Review 1 (2014), 293–314.
 Fieller (1940) Edgar C Fieller. 1940. The biological standardization of insulin. Supplement to the Journal of the Royal Statistical Society 7, 1 (1940), 1–64.
 Fieller (1954) Edgar C Fieller. 1954. Some problems in interval estimation. Journal of the Royal Statistical Society. Series B (Methodological) (1954), 175–185.
 Fisher (1922) Ronald Aylmer Fisher. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222 (1922), 309–368.
 Forero et al. (2010) Pedro A Forero, Alfonso Cano, and Georgios B Giannakis. 2010. Consensusbased distributed support vector machines. Journal of Machine Learning Research 11, May (2010), 1663–1707.
 Gelman and Hill (2006) Andrew Gelman and Jennifer Hill. 2006. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
 Gui et al. (2015) Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 399–409.
 Guo and Deng (2015) Yu Guo and Alex Deng. 2015. Flexible Online Repeated Measures Experiment. arXiv preprint arXiv:1501.00450 (2015).
 Hall (2013) Peter Hall. 2013. The bootstrap and Edgeworth expansion. Springer Science & Business Media.
 Hirschberg and Lye (2010) Joe Hirschberg and Jenny Lye. 2010. A geometric comparison of the delta and Fieller confidence intervals. The American Statistician 64 (2010), 234–241.
 Jordan et al. (2018) Michael I Jordan, Jason D Lee, and Yun Yang. 2018. Communicationefficient distributed statistical inference. J. Amer. Statist. Assoc. in press (2018). doi:10.1080/01621459.2018.1429274.
 Kharitonov et al. (2017) Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning sensitive combinations of a/b test metrics. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 651–659.
 Klar and Donner (2001) Neil Klar and Allan Donner. 2001. Current and future challenges in the design and analysis of cluster randomization trials. Statistics in medicine 20, 24 (2001), 3729–3740.
 Kleiner et al. (2014) Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I Jordan. 2014. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 4 (2014), 795–816.
 Kohavi et al. (2009a) Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. 2009a. Online experimentation at Microsoft. In Proceedings of the Third International Workshop on Data Mining Case Studies, held at the 5th ACM SIGKDD Conference. 11–23.
 Kohavi et al. (2013) Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. Proceedings of the 19th ACM SIGKDD Conference (2013).
 Kohavi et al. (2007) Ron Kohavi, Randal M Henne, and Dan Sommerfield. 2007. Practical guide to controlled experiments on the web: listen to your customers not to the hippo. In Proceedings of the 13th ACM SIGKDD Conference. 959–967.
 Kohavi et al. (2009b) Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009b. Controlled experiments on the web: survey and practical guide. Data mining and knowledge discovery 18, 1 (2009), 140–181.
 Kohavi et al. (2010) R. Kohavi, R. Longbotham, and T. Walker. 2010. Online Experiments: Practical Lessons. Computer 43, 9 (Sept 2010), 82–85.
 Krewski (1976) Daniel Krewski. 1976. Distributionfree confidence intervals for quantile intervals. J. Amer. Statist. Assoc. 71, 354 (1976), 420–422.
 Liang and Zeger (1986) KungYee Liang and Scott L Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73, 1 (1986), 13–22.
 Meyer (1987) John S Meyer. 1987. Outer and inner confidence intervals for finite population quantile intervals. J. Amer. Statist. Assoc. 82, 397 (1987), 201–204.
 Rudin et al. (1964) Walter Rudin et al. 1964. Principles of mathematical analysis. Vol. 3. McGrawhill New York.
 Tang et al. (2010) Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. 2010. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. Proceedings of the 16th ACM SIGKDD Conference (2010).
 Van der Vaart (2000) Aad W Van der Vaart. 2000. Asymptotic statistics. Vol. 3. Cambridge university press.
 Von Luxburg and Franz (2009) Ulrike Von Luxburg and Volker H Franz. 2009. A geometric approach to confidence sets for ratios: Fieller’s theorem, generalizations and bootstrap. Statistica Sinica (2009), 1095–1117.
 Wang and Zhou (2012) Dongli Wang and Yan Zhou. 2012. Distributed support vector machines: An overview. In Control and Decision Conference (CCDC), 2012 24th Chinese. IEEE, 3897–3901.
 Wasserman (2003) Larry Wasserman. 2003. All of Statistics: A Concise Course in Statistical Inference. Springer.
 Xie and Aurisset (2016) Huizhi Xie and Juliette Aurisset. 2016. Improving the sensitivity of online controlled experiments: Case studies at netflix. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 645–654.
 Xu et al. (2015) Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2227–2236.
 Zaharia et al. (2016) Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
 Zinkevich et al. (2010) Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems. 2595–2603.