Mediation Analysis in Online Experiments at Booking.com: Disentangling Direct and Indirect Effects
Online experimentation is at the core of Booking.com’s customer-centric product development. While randomised controlled trials are a powerful tool for estimating the overall effects of product changes on business metrics, they often fall short in explaining the mechanism of change. This becomes problematic when decision-making depends on being able to distinguish between the direct effect of a treatment on some outcome variable and its indirect effect via a mediator variable. In this paper, we demonstrate the need for mediation analyses in online experimentation, and use simulated data to show how these methods help identify and estimate direct causal effects. Failing to take into account all confounders can lead to biased estimates, so we include sensitivity analyses to help gauge the robustness of estimates to missing causal factors.
At Booking.com, one of the key ingredients to customer-centric product development is not (just) bright minds having great ideas, but collecting the evidence to support these ideas. We test each idea addressing a customer pain point via an AB test using our in-house experiment platform. This platform is able to test thousands of changes simultaneously, with real customers, collecting data on the outcome within minutes of a change being implemented. However, as Booking.com expands to more countries and more languages, and as our products grow in scope from hotels to other types of accommodation, cruises, car rentals, and more, our product development procedures must also become more flexible and more sensitive to the interactions between these many goals.
A randomised controlled trial (AB test) helps us assess the causal effect of the implementation of an idea (from now on referred to as treatment) on some desired outcome. Even though we can never observe the outcome for a given person under both the exposed and the not exposed condition, we can still get an unbiased estimate of the effect of the treatment on the intended population, referred to as the Average Treatment Effect (ATE). If the ATE were all we cared about, we would be done. However, at Booking.com, we care about two things:
i. The ATE, the total effect of our treatment on the outcome variable.
ii. The mechanism of change, how the treatment affected our visitors’ behaviour and, as a result, how the outcome variable changed.
We value ii over i. This is best illustrated by a working example.
Experiment: Reducing ‘Cancellations per visitor’
Cancellations stemming from a lack of clarity around accommodation policies, facilities or prices lead to a bad customer experience, which we want to avoid as a general principle. In addition, cancellations make our partners’ availability calculations more difficult and lead to a bad partner experience as well. A confused, dissatisfied customer or partner is more likely to call our customer service, which also increases our customer service agents’ load.
We have teams working on solutions to address the pain points of customers and partners regarding cancellations. In one experiment, one such team might change the design of a page to bring more clarity to potential guests before they make a reservation, adding a text box containing an explanation of the property policies. Their goal is to make guests more aware of the prices around their trips, so that they have a lower chance of cancelling later, resulting in a reduction of the metric ‘cancellations per visitor’. We can represent such a scenario visually using the causal graph in Figure 1.
If the only goal were to reduce cancellations, the experimenter could go ahead and use the ATE to see if this was achieved, testing the difference in average cancellations per visitor between the control and treatment groups. However, this would never tell us how this result was achieved. The learning comes with understanding the mechanism of change, and teams need this understanding to explore other ideas or abandon those that don’t work.
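The ATE comparison described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual test: the function name `average_treatment_effect`, the normal-approximation interval, and the made-up Poisson cancellation counts are all assumptions introduced here.

```python
import numpy as np

def average_treatment_effect(control, treatment):
    """Difference in means with a normal-approximation 95% interval."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    ate = treatment.mean() - control.mean()
    # Standard error for two independent samples with unequal variances
    se = np.sqrt(control.var(ddof=1) / control.size
                 + treatment.var(ddof=1) / treatment.size)
    return ate, (ate - 1.96 * se, ate + 1.96 * se)

# Made-up per-visitor cancellation counts, for illustration only
rng = np.random.default_rng(0)
control = rng.poisson(0.10, size=50_000)
treatment = rng.poisson(0.08, size=50_000)

ate, (low, high) = average_treatment_effect(control, treatment)
print(f"ATE = {ate:.4f}, 95% interval = ({low:.4f}, {high:.4f})")
```

An interval that excludes zero says the treatment changed cancellations per visitor, but nothing about the route by which it did so.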
Modelling the flow of effects
We encourage teams to monitor complementary metrics that can provide additional support for their hypothesised mechanism. For the example above, as the new information is displayed in a text box, a supporting metric might be whether visitors hovered on the text box or not. We can extend the causal graph from Figure 1 to include these supplementary metrics as shown in Figure 2.
Here the team is explicitly checking one specific mechanism: a reduction in cancellations accompanied by hovers on the text box. If hovers increase and cancellations decrease, we better understand the mechanism. However, if cancellations change without the number of hovers changing, then either an unforeseen mechanism is at work or we are dealing with a false positive. Either way, this additional insight can help the team better understand how their treatment is affecting visitor behaviour.
Direct & Indirect Treatment Effects
This approach gets more complicated when the metric that will help explain the mechanism is directly related to the outcome variable, while treatment is also expected to directly affect the outcome variable. For instance, suppose we observed a decrease in cancellations per visitor, but the number of bookings was also reduced. Did the reduction in cancellations originate from the new feature saving customers from making bookings, most of which would have been cancelled anyway, or did it inadvertently scare off previously satisfied customers from making bookings at all? Figure 3 expresses this mediation scenario graphically.
In this case, Bookings is a direct parent of the child metric Cancellations, meaning any change in bookings will automatically cause a change in cancellations. If you have more bookings, you have more opportunities for one of the bookings to get cancelled. If you have fewer bookings, you have fewer opportunities for a cancellation to happen.
The Average Treatment Effect in this example can be broken down into two effects:
The Indirect Effect is the effect of treatment on cancellations via bookings and the Direct Effect is the effect of treatment on cancellations directly. A standard AB test will only tell us the sum of these two effects (the ATE), which is itself a crucial quantity. If cancellations increase beyond an amount that the company can tolerate, it may not matter whether the increase is due to additional bookings or to the treatment’s direct effect. However, in most cases, to be able to understand the mechanism and to be able to make a decision, we need to disentangle these two effects.
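In the potential-outcomes notation commonly used in the mediation literature (for example by Imai et al.), this decomposition can be written as below. The symbols are assumptions introduced here for illustration: $T_i$ is the treatment indicator, $M_i(t)$ the bookings visitor $i$ would make under treatment status $t$, and $Y_i(t, m)$ the cancellations under treatment status $t$ and bookings $m$.

```latex
\begin{align}
  \delta(t) &= \mathbb{E}\bigl[\,Y_i(t, M_i(1)) - Y_i(t, M_i(0))\,\bigr]
      && \text{(average indirect effect)} \\
  \zeta(t)  &= \mathbb{E}\bigl[\,Y_i(1, M_i(t)) - Y_i(0, M_i(t))\,\bigr]
      && \text{(average direct effect)} \\
  \tau      &= \mathbb{E}\bigl[\,Y_i(1, M_i(1)) - Y_i(0, M_i(0))\,\bigr]
             = \delta(t) + \zeta(1 - t), \qquad t \in \{0, 1\}
      && \text{(ATE)}
\end{align}
```

The last line follows by adding and subtracting $Y_i(1, M_i(0))$ (for $t = 1$) or $Y_i(0, M_i(1))$ (for $t = 0$) inside the expectation.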
Another lurking problem is that of known and unknown confounder variables. In the example above, imagine some visitors to the site are business travellers. We will use the causal graph with one confounder (‘is visitor travelling for business’) shown in Figure 4 as a running example in this analysis.
Confounders can add a layer of non-identifiability to the problem of interpreting the causal effect. While methods do exist which enable us to adjust for post-treatment covariates, the moment we do this we lose the unbiasedness of the causal effect unless we can control for all potential confounders. The reason for this is that the confounders and treatment are not independent conditional on the mediator variable, bookings.
By design, we randomise the visitors so that treatment and whether a visitor is a business traveller are independent. However, they are not conditionally independent given the number of bookings. Imagine a scenario in which business visitors book more and visitors in the treatment group also book more. If we know how many bookings a visitor has, then knowing whether they are in treatment gives us information about their likelihood of being a business traveller, and vice versa.
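A small simulation makes this conditional dependence visible. The booking rates below are hypothetical values chosen only to match the imagined scenario (business visitors and treated visitors both book more); they are not taken from the paper's simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
treated = rng.integers(0, 2, size=n)   # randomised assignment
business = rng.random(n) < 0.4         # pre-treatment trait, independent of assignment

# Hypothetical mean bookings: +1 for business visitors, +1 for treated visitors
rate = 1.0 + 1.0 * business + 1.0 * treated
bookings = rng.poisson(rate)

def business_share(mask):
    """Fraction of business travellers among the selected visitors."""
    return business[mask].mean()

# Marginally, treatment status says nothing about traveller type
marginal_gap = business_share(treated == 1) - business_share(treated == 0)

# Among visitors with the same number of bookings, it does
same_bookings = bookings == 3
conditional_gap = (business_share(same_bookings & (treated == 1))
                   - business_share(same_bookings & (treated == 0)))

print(f"gap in business share, all visitors:        {marginal_gap:+.3f}")
print(f"gap in business share, exactly 3 bookings:  {conditional_gap:+.3f}")
```

Conditioning on bookings opens a path between treatment and traveller type: for a control visitor, three bookings are most easily explained by being a business traveller, whereas for a treated visitor the treatment itself is a competing explanation.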
The method we employ follows Imai et al. Given a set of pre-treatment confounders X, we assume the following, known as the sequential ignorability assumption:
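In the notation of Imai et al., sequential ignorability is usually stated as the following pair of conditional independence statements. The symbols are introduced here as assumptions for illustration: $T_i$ is treatment, $M_i(t)$ the potential value of the mediator, $Y_i(t', m)$ the potential outcome, and $X_i$ the pre-treatment confounders.

```latex
\begin{align}
  \{\,Y_i(t', m),\; M_i(t)\,\} \;&\perp\!\!\!\perp\; T_i \;\mid\; X_i = x \\
  Y_i(t', m) \;&\perp\!\!\!\perp\; M_i(t) \;\mid\; T_i = t,\; X_i = x
\end{align}
```

Both statements are required to hold for $t, t' \in \{0, 1\}$ and for all $x$ and $m$ in the support of $X_i$ and $M_i$, together with $0 < \Pr(T_i = t \mid X_i = x)$ and $0 < p(M_i = m \mid T_i = t, X_i = x)$.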
The first equation is guaranteed to hold by design as the context we work in is randomised online experiments. The second equation says the mediator variable is ignorable conditioned on the pre-treatment confounders X. In Directed Acyclic Graph (DAG) terms, we assume that all the back-door paths we may be opening by conditioning on the mediator variable will be blocked by the set of pre-treatment variables X. If this assumption holds, then the direct and indirect effects of treatment on the outcome variable are identified and we can estimate them non-parametrically.
We simulated 100,000 data points with one pre-treatment confounder: whether a visitor is a business traveller or not. Half the visitors are randomly assigned to treatment and half to control, and within each group the probability of a visitor being a business traveller is 0.4. Without treatment, business and non-business travellers alike make 1 booking on average, with draws coming from Poisson distributions. The treatment adds 2 more bookings on average for business travellers but does not affect the bookings of non-business travellers at all. In addition, treatment has no direct effect on cancellations; the cancellation rate per booking therefore stays the same for each group: 14% for business travellers and 7% for non-business travellers, drawn from binomial distributions with said probabilities.
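The data-generating process just described can be sketched as follows. This is a reconstruction from the prose rather than the authors' code; the function name `simulate_experiment`, the seed, and the exact way of splitting visitors into two equal groups are assumptions.

```python
import numpy as np

def simulate_experiment(n=100_000, seed=0):
    """Simulate treatment, traveller type, bookings and cancellations."""
    rng = np.random.default_rng(seed)
    # Half of the visitors in treatment, half in control, in random order
    treated = rng.permutation(np.arange(n) % 2)
    # Pre-treatment confounder: business traveller with probability 0.4
    business = (rng.random(n) < 0.4).astype(int)
    # Mean bookings: 1 for everyone, plus 2 for treated business travellers
    mean_bookings = 1.0 + 2.0 * treated * business
    bookings = rng.poisson(mean_bookings)
    # Cancellation probability per booking depends only on traveller type,
    # so treatment has no direct effect on cancellations
    cancel_prob = np.where(business == 1, 0.14, 0.07)
    cancellations = rng.binomial(bookings, cancel_prob)
    return treated, business, bookings, cancellations

treated, business, bookings, cancellations = simulate_experiment()
for t in (0, 1):
    for u in (0, 1):
        cell = (treated == t) & (business == u)
        per_booking = cancellations[cell].sum() / bookings[cell].sum()
        print(f"treated={t} business={u}: "
              f"bookings per visitor={bookings[cell].mean():.2f}, "
              f"cancellations per booking={per_booking:.3f}")
```

The printed summary gives, for each subpopulation, the two quantities the text discusses: bookings per visitor and cancellations per booking.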
As seen in Table 1, the treatment has no direct effect on cancellations as the cancellation rates per booking stay the same for the subpopulations. All the treatment does is make business travellers book more, from 1 booking per visitor on average to 3 on average. In this scenario, adding the number of bookings as a covariate in a regression model yields a positive regression coefficient for the effect of treatment on cancellations even though there is no direct effect of treatment on cancellations. Since there is a pre-treatment confounder, and treatment and this confounder are not independent once we condition on bookings, this approach gives a biased result. However, using the two-stage method and including the pre-treatment confounder in the models, we get direct effects close to zero, as expected. Table 2 contains point estimates for no adjustment, adjustment in a linear regression, and two-stage model results. Results in Table 2 further assume the experiment was run for 30 days in order to bring the point estimates down to familiar per day units.
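The contrast between the naive and the confounder-adjusted estimate can be reproduced in spirit with ordinary least squares. This sketch is a stand-in, not the two-stage procedure of the `mediation` R package: the helper names, the plain OLS fits, and the bookings-by-traveller-type interaction in the adjusted model are assumptions made here, and the numbers are per visitor rather than the per-day units of Table 2.

```python
import numpy as np

def simulate(n, seed):
    """Same data-generating process as described in the text."""
    rng = np.random.default_rng(seed)
    treated = rng.permutation(np.arange(n) % 2)
    business = (rng.random(n) < 0.4).astype(int)
    bookings = rng.poisson(1.0 + 2.0 * treated * business)
    cancellations = rng.binomial(bookings, np.where(business == 1, 0.14, 0.07))
    return treated, business, bookings, cancellations

def ols(columns, y):
    """Least-squares coefficients; index 0 is the intercept."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return coef

def direct_effect_estimates(n=100_000, seed=0):
    t, u, b, c = simulate(n, seed)
    # Total effect: plain difference in means between the two groups
    ate = c[t == 1].mean() - c[t == 0].mean()
    # Naive adjustment: condition on the mediator only
    naive = ols([t, b], c)[1]
    # Adjusted: also include the confounder and let the cancellation
    # rate per booking differ by traveller type
    adjusted = ols([t, b, u, b * u], c)[1]
    return {"ate": ate, "naive_direct": naive,
            "adjusted_direct": adjusted, "indirect": ate - adjusted}

for name, value in direct_effect_estimates().items():
    print(f"{name:>16}: {value:+.4f}")
```

In this setup the treatment coefficient of the naive model stays positive because bookings alone cannot absorb the higher cancellation rate of business travellers, who are over-represented among treated visitors with many bookings. Once traveller type enters the model, the coefficient should be close to zero and almost all of the ATE is attributed to the indirect path.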
| No Adjustment - ATE | 384 | 0.00 |
Next, we do a sensitivity analysis to see how robust the point estimates are to missing covariates. Data can never tell us whether we have successfully taken into account all pre-treatment confounders. However, a sensitivity analysis can tell us how robust the estimates are. In our simulation data, we know there is only one pre-treatment confounder. So we expect the estimate to be quite robust when we include this variable in our models, and not robust when we omit it.
Figure 5 shows how the direct effect of the treatment on cancellations changes with the correlation of the error terms of stage 1 and stage 2 models when we include the pre-treatment covariate. Observe that even when the correlation term on the x axis is at the extremes, -0.9 and 0.9, the 95 percent intervals still include 0, meaning we cannot conclude that the effect is non-zero. On the other hand, when we omit the pre-treatment covariate (business booker), the conclusions change drastically depending on the value of this correlation, as shown in Figure 6.
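For readers unfamiliar with this sensitivity parameter, a linear version of the two-stage setup, as used in the mediation literature by Imai and co-authors, can be written as below. The linear form and the symbols are assumptions made here for illustration; the simulated outcomes in this paper are counts, so this is a simplification of whatever models were actually fitted.

```latex
\begin{align}
  M_i &= \alpha_2 + \beta_2 T_i + \xi_2^{\top} X_i + \varepsilon_{i2}
      && \text{(stage 1: mediator model)} \\
  Y_i &= \alpha_3 + \beta_3 T_i + \gamma M_i + \xi_3^{\top} X_i + \varepsilon_{i3}
      && \text{(stage 2: outcome model)} \\
  \rho &= \operatorname{Corr}(\varepsilon_{i2}, \varepsilon_{i3})
\end{align}
```

In this linear setting the direct effect is $\beta_3$ and the indirect effect is $\beta_2 \gamma$. Sequential ignorability implies $\rho = 0$; an omitted confounder that drives both the mediator and the outcome makes $\rho \neq 0$, so re-deriving the estimates over a range of assumed $\rho$ values shows how strong such a confounder would have to be to change the conclusion.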
AB testing allows for rapid customer-centric product development, but is likely to produce biased direct effect estimates, as the method cannot account for confounders, nor identify missing confounders. This leads to decisions being governed by hidden factors, which will at best inject randomness into the decision making process (costing time and effort), and at worst erode the quality of a product. For example, in the hypothetical scenario above, naive use of standard AB testing might lead to a product better tailored for business travellers, and as a consequence shrink our customer pool significantly. If identifying the mechanism of change is important for decision making, then we suggest (a) using multiple experiments to replicate findings and protect against false positives, (b) measuring important health metrics, as well as metrics known to be causally related to the outcome, and including these in the model as demonstrated, and (c) performing a sensitivity analysis to check for missing confounders. We encourage experimenters to always interpret results in context, making decisions which keep in mind the big picture, the mechanism of change, and the long term impact on customers.
The ideas put forward in this paper were greatly influenced by our work on the in-house experimentation platform at Booking.com, as well as conversations with colleagues and other online experimentation practitioners. In particular, this work would not have been possible if not for the relentless challenging by Raphael Lopez Kaufman. Causal graph figures were crafted by our amazing designer Sergey Alimskiy.
- Raphael Lopez Kaufman, Jegar Pitchforth, and Lukas Vermeer. Democratizing online controlled experiments at Booking.com. arXiv preprint arXiv:1710.08217, 2017.
- Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688, 1974.
- Judea Pearl. Causality. Cambridge University Press, 2009.
- Kosuke Imai, Luke Keele, Dustin Tingley, and Teppei Yamamoto. Unpacking the Black Box of Causality: Learning about Causal Mechanisms from Experimental and Observational Studies. American Political Science Review, 105(4), 765-789, 2011.
- Dustin Tingley, Teppei Yamamoto, Kentaro Hirose, Luke Keele, and Kosuke Imai. mediation: R package for causal mediation analysis. Journal of Statistical Software, 59(5), 2014.