Causal inference using Bayesian non-parametric quasi-experimental design
The de facto standard for causal inference is the randomized controlled trial, where one compares an manipulated group with a control group in order to determine the effect of an intervention. However, this research design is not always realistically possible due to pragmatic or ethical concerns. In these situations, quasi-experimental designs may provide a solution, as these allow for causal conclusions at the cost of additional design assumptions. In this paper, we provide a generic framework for quasi-experimental design using Bayesian model comparison, and we show how it can be used as an alternative to several common research designs. We provide a theoretical motivation for a Gaussian process based approach and demonstrate its convenient use in a number of simulations. Finally, we apply the framework to determine the effect of population-based thresholds for municipality funding in France, of the 2005 smoking ban in Sicily on the number of acute coronary events, and of the effect of an alleged historical phantom border in the Netherlands on Dutch voting behaviour.
The bread and butter of scientific research is the randomized-controlled trial (RCT) (Hill1952). In this design, the sample population is randomly divided into two groups; one that is manipulated (e.g. a drug is administered or a treatment is performed), while the other is left unchanged. Because of the random group assignment, we may assume (if the sample size is large enough) that the distributions of all irrelevant variables are the same between the two groups, so that if we measure a difference, this can only be due to the intervention. The consequence is that the RCT allows us to perform causal inference, that is, via this approach we can learn about the causal effect of the intervention (Pearl2018).
While the RCT is a staple tool in the toolbox of any scientist, in practice there may be several insurmountable hurdles that deter one from using it. For instance, ethical considerations may force a study to halt midway when the intervention appears to cause serious adverse effects. Other objections may be pragmatic; it may be impossible to obtain sufficiently equivalent groups of samples if these samples represent for instance cities or species. Luckily, even in these cases all is not lost for causal inference. There exist several quasi-experimental designs that replace random assignment with deterministic assignment, that still allow for valid causal inferences at the cost of some additional assumptions. For example, regression discontinuity (RD) design (Campbell1963; Lauritzen2001; Pearl2009) and difference-within-differences (DiD) (Lechner2011) assign a sample to one of the two groups based on it passing or failing a threshold on an assignment variable. For instance, a patient may be assigned to the intervention group if their blood cholesterol levels exceed a certain predetermined limit. A closely related quasi-experimental design is interrupted time series (ITS) analysis, which is essentially RD, but with time as its assignment variable (McDowall1980). The idea behind these approaches is that, around the assignment threshold, data points are distributed essentially randomly, so that locally the conditions of RCT are recreated (Lee2010; Imbens2008). Crucially, for this to work, these methods assume that the sampled population cannot directly influence its score on the assignment variable. For example, if a patient can actively manipulate their own cholesterol level they can determine whether they are in the intervention or the control group, which would render the conclusion invalid.
The methodological pipeline of quasi-experimental designs like these generally consists of three steps (Rischard2018). First, a regression is fit to each of the two groups individually. Typically, one parametrizes this as linear regression. Next, the fits are extrapolated to the boundary (that is, the threshold of the assignment variable). Finally, the difference between the extrapolations of the two groups at the boundary is taken as the effect size of the intervention. A straightforward statistical test can be applied to check whether the effect is present. Although RD, ITS and DiD have been applied frequently in education, econometrics and political science, recent studies have pointed out that they are underutilized in other domains, such as neuroscience, behavioural science and medicine (Marinescu2018; Moscoe2015).
In this paper we provide a novel Bayesian non-parametric framework for quasi-experimental design, which we refer to as BNQD. BNQD is a Bayesian approach to quasi-experimental design that offers several advantages to the aforementioned examples (RD, DiD and ITS), as described in the following.
First, we frame the problem of detecting a difference between the two groups as a model comparison. Instead of comparing the pre- and post-intervention regressions, we introduce a continuous model and a discontinuous model. The continuous model is fitted on all observations, while in the discontinuous model we have two sub-models: one is fitted on the intervened group, while the other is fitted on the control group. We then perform Bayesian model comparison between the continuous and the discontinuous model. This allows us to quantify the evidence in favor of either model, rather than only for the alternative model (Wagenmakers2007). In addition, this makes it possible to compute the Bayesian model averaged effect size, which provides a more nuanced estimate compared to implicitly conditioning on the alternative model (Hoeting1999; Hinne2019).
Second, our approach is non-parametric and is based on Gaussian process (GP) regression. This means that we do not have to impose strong constraints on the functional form of the regression in our models. Instead, we specify a kernel that describes how similar any two points are depending on their distance from each other. The result is a highly flexible model that can capture all sorts of nonlinear interactions between the assignment and the outcome variable, such as periodicity in time series. By appropriate selection of the GP kernel, BNQD serves as an alternative for several traditional quasi-experimental designs. For instance, with a linear kernel, BNQD can be used for RD and DiD studies. Alternatively, with an exponential kernel BNQD serves as a non-parametric form of RD design. Finally, BNQD can just as easily be applied to multiple assignment variables (Reardon2012; Papay2011; Wong2013), as we demonstrate in one of our examples.
Third, in most methods for quasi-experimental design, a bandwidth parameter determines the trade-off between estimation reliability and the local randomness assumptions that are needed to draw causal inferences (Geneletti2015). In BNQD, all observations are used to estimate both the continuous and discontinuous model, but by optimizing the length scale parameter of the GP kernels we control the sensitivity to different types of discontinuities and adherence to locality assumptions.
The paper is organized as follows. In Section 2 we explain the proposed method. We assess its performance in simulations in Section 3, and demonstrate how it can be applied in practice in Section 4. In Section 5 we discuss related work and and future directions for Bayesian quasi-experimental designs. Lastly, we conclude in Section 6.
2 Bayesian non-parametric quasi-experimental design
Suppose we have collected data , where is a predictor and is its corresponding response. For this exposition we will assume , but extension to multidimensional input is straightforward, as we will show in an example later on. In addition to our data, we have a label function that returns per data point whether it is in the intervention group , or in the control group . In randomized controlled trials, data points would be assigned at random and performs the role of a simple look-up table, but in regression discontinuity studies, for instance, compares to a threshold value and assigns its label according to the outcome of that comparison. With these preliminaries, BNQD tries to establish whether the observations are best described by a single, continuous, regression model , or by a discontinuous model that has a separate regression for the intervention and the control group. Mathematically, this comes down to computing the Bayes factor (Goodman1999b; Kass1995):
where and are the subsets of the data belonging to the control and the intervention, respectively. Note that . This Bayes factor provides an intuitive representation of the model comparison. That is, it indicates how much more likely the data are given the discontinuous model, compared to the continuous model (Jarosz2014). Unlike a -value, it can provide evidence for either model, so that it is possible to find evidence supporting the absence of a discontinuity (Wagenmakers2007; Goodman1999a).
2.1 Gaussian process regression
In order to compute the Bayes factor in Eq. (1), we need to define the generative models for that describe how the observed responses follow from the predictor variable. We assume that the responses follow a Gaussian distribution around a latent function of the predictors , such that
in which is the observation noise. Note that this observation model is the same for the continuous and the discontinuous models. In the continuous model, there is a single latent function , while in the discontinuous model we have two functions, and , that apply to the observations in the control or the intervention group, respectively. In standard linear regression, these functions would be given a parametric form, namely , in which is the bias term and the slope. In BNQD however, we suppose that we do not know the exact form of , and instead assign a prior distribution, which is a Gaussian process (GP). We write:
The parameters of the GP, and , are functions themselves and have the following meaning. The mean function specifies the expected mean of the functions that may be drawn from the process. Typically, this is simply set to zero for every . An essential component of the Gaussian process is the kernel function which specifies for two distinct points and how similar the function values and are. Depending on what kernel function we specify, widely different behaviour for can be achieved. For instance, we may specify that we expect functions to be smooth in , or that functions exhibit periodicity. The properties of these kernels, like their length scale or frequency, are captured by the hyperparameters . In terms of parameter learning and optimization, the observation noise is often considered an element of as well. Note that and share the hyperparameters , so that differences in their predictions at the intervention point are not due to non-stationarities in the latent process. An illustration of the BNQD procedure is shown in Fig. 1A, in which the left panel shows the continuous model and the right panel shows the discontinuous model , both using linear kernels. In this example, we find a Bayes factor of 1.8, indicating that the data are 1.8 more likely under the discontinuous model.
For reference, we list the kernels that are used throughout this paper in Appendix A. Computational details of hyperparameter optimization and the Bayes factor of (1) are discussed in Appendix B. For an extensive discussion of Gaussian processes, we refer the reader to Rasmussen2004.
The GP forms a very flexible model. By choosing the appropriate kernel function, GP regression becomes equivalent to many parametrized models. For instance, if we select a linear kernel, the functions drawn from the GP prior are straight lines. This way, BNQD can be used as an alternative to classic quasi-experimental designs such as RD, DiD and ITS. Furthermore, BNQD can also recover true randomized controlled trials, by using the assignment function as a look-up table to see whether an observation was part of the control or the intervention group. By using a constant kernel, the functions drawn from the GP are horizontal lines, turning BNQD into a t-test between the intervention and the control groups. Figure 1B visualizes these ideas. It shows the output of three different applications of BNQD. In the leftmost panel, we have used a linear kernel, and the two groups of observations are cleanly separated at . That is, . In the middle panel, the threshold is applied in a fuzzy way, where most, but not all, data points follow the threshold-based assignment. In the final panel, assignment is completely at random and there is no notion of a threshold here. By choosing a constant kernel, we obtain the GP regressions for both groups that simply reflect their means and variances, which suffices to perform a z-test.
2.2 Estimating the size of a discontinuity
If we apply BNQD to detect a potential discontinuity at an intervention point , like in RD or ITS, we can estimate the effect size of the discontinuity (we use the term effect size synonymously with the average treatment effect (Kim2016) or average causal effect (Pearl2009b)) using the Bayesian model average (BMA) (Hoeting1999). The BMA accounts not only for the uncertainty in our estimate of the effect size given a particular model, but also the uncertainty in the models themselves. For an effect size , the BMA is given by
that is, the distribution of the effect size is the sum of the distributions of given each model, weighted by the posterior probability of the respective model. Given the discontinuous model, the discontinuity is given by
where and give the predictions by the two GP regressions within . Since both of these predictions are Gaussian distributions, their difference is once again Gaussian with mean and variance . Since under the continuous model we have by definition, this has a regularizing effect, shrinking the effect size estimate towards zero. The estimation of the effect size for the earlier example is shown in Fig. 1C. Note that the BMA is a mixture model of the (Gaussian) distribution and a spike density at 0.
2.3 Model averaging across GP kernels
One of the key advantages of our approach to quasi-experimental design is that by changing the kernel of the GP regression, widely different behaviour can be modelled. However, the pragmatic researcher may be agnostic about which kernel best describes their observations. In this case, multiple kernels may be considered, and a final Bayes factor may be obtained by Bayesian model averaging across all kernels. Consider the following example in which we entertain both a rigid linear kernel and a flexible RBF kernel, so that . We can compute the Bayes factor for each kernel, but we can also obtain the total Bayes factor given the set of kernels:
The quantity serves as a final decision metric to determine an effect in a quasi-experimental design, while a detailed report is provided by inspecting the Bayes factors corresponding to each considered kernel. In practice, the evidence of one kernel can dominate all others, i.e. , in which case the BMA procedure is effectively equivalent to performing the analysis with the best kernel only.
2.4 Theoretical motivation of kernel choices
The choice of the Gaussian process kernels plays two conceptually distinct roles in BNQD. First, our choice of kernel reflects our beliefs about the latent process that generated the observations. In traditional RD designs, one assumes a parametrized model such as linear regression. In BNQD, this explicit parametric form is replaced by a GP prior that assigns a probability distribution to the space of functions. However, BNQD can still replicate parametrized models by selecting degenerate kernels, such as the constant or linear kernel. These modeling choices are crucial in RD design as model misspecification can lead to incorrect inference. In the case that we do not have clear prior beliefs about an appropriate kernel, we can look at the weighted average of several kernels via the Bayesian model average.
|Kernel||Non-parametric||Difference in differences||Order of discontinuity|
The second role of the kernel choice is that it determines to which types of discontinuities BNQD is sensitive. Importantly, different kernels can be used to test fundamentally different hypotheses, as the kernel determines which features of the latent function are part of the alleged effect. For example, the simplest (degenerate) kernel, the constant kernel, is sensitive only to differences in the means of the two groups, while the linear kernel is sensitive to both the difference in mean as well as the difference in slope (which means that it can be used both for regression discontinuity as well as difference-in-differences designs). The RBF kernel on the other hand can detect discontinuities in a function, but also in any of its derivatives. A middle ground is provided by the Matérn kernel, which is actually a class of kernels with an order parameter . This kernel can detect discontinuities in up to the -th derivative. An interesting special case is the exponential covariance function (which is equivalent to a Matérn kernel with ). This non-parametric kernel is only sensitive to 0-th order discontinuities, that is, to discontinuities in the function itself. Note that, while some covariance functions like the RBF can theoretically detect discontinuities in derivatives of any order, the amount of data requires to detect these subtle effects can become prohibitively high. The properties of these standard kernels are summarized in Table 1. The use of more sophisticated covariance functions allows one to test more global discontinuity effects that cannot be captured as discontinuities in any of the derivatives. For example, periodic covariance functions can detect discontinuities in oscillatory signals, by identifying differences in frequency, the amplitude or waveform.
BNQD is implemented in Python using the GPy module (GPy2014) and is available at https://github.com/mhinne/BNQD.
In the following, we test the performance of BNQD using simulations in different experimental settings.
3.1 Discontinuity detection
Arguably the most straightforward quasi-experimental setup is regression discontinuity (RD) design. Here, we study the application of BNQD on simulated data with a known discontinuity . We compare the performance of BNQD to a two-step approach by Branson2019, which first fits a discontinuous Gaussian process model to the data and then uses the predictions at the intervention to obtain a -value of the test for discontinuity, as well as to a classical RD approach111The implementation of the classic RDD is available at https://github.com/evan-magnusson/rdd.. We consider two different regression discontinuity cases; one based on a linear model and the other based on a nonlinear one.
We apply BNQD using four kernels: a linear, an exponential, a first-order Matérn, and an RBF kernel. For each kernel , we obtain a Bayes factor as well as estimates of the discontinuity according to either the discontinuous model () or using the Bayesian model average (, see Eq. (4)). Furthermore, we obtain these measures by aggregating over all four kernels. In total, this results in five Bayes factors, five posterior distributions of the effect size assuming a discontinuity, and five posterior distributions that average over continuous and discontinuous models.
The estimated discontinuity , according to the discontinuous models, can be used to compute the -value of the test for a discontinuity (Branson2019) according to
where is the Gaussian cumulative density, with and the statistics of the discontinuity distribution at the boundary (see Section 2).
3.1.1 Discontinuity in a linear model
The data are generated as follows. For each of the effect sizes , we simulate 100 data sets with observations each, according to
where we set and . Note that effect sizes are indistinguishable from the observation noise. An example data set is shown in Fig. 2A.
Figure 2B-D shows the result of the simulation study. Notice that for small effect sizes (, the Bayes factors provide positive evidence for the continuous models, while the -values only express that the null hypothesis cannot be rejected. The largest Bayes factors are obtained for the linear kernel. The other kernels provide more flexible regressions, which makes the corresponding continuous models fit the data better, in spite of a present discontinuity. Still, all kernels provide the same qualitative assessment of the simulations.
The two-step approach is the least sensitive in detecting a small discontinuity. For example, at , it leads to using the linear kernel, while the classical RDD approach gives . BNQD is confident about the discontinuity and gives a log Bayes factor of 6.6, which reflects that the data are over 700 times more likely under the discontinuous model.
3.1.2 Discontinuity in a nonlinear model
In the second simulation we consider what happens if the underlying data generating process itself is nonlinear. We proceed as before, but now generate data from a shifted periodic function:
where again we set and . Some example data sets for different discontinuities are shown in Fig. 3A.
The results for the second simulation are shown in Fig. 3B. Most striking is that the linear kernel does result in large Bayes factors, indicating evidence for a discontinuity, but this is mostly due to its poor fit to the data. This is also apparent from the estimated discontinuity sizes, which is consistently overestimated by the model with a linear kernel. The flexible alternative kernels do capture the data much better and recover the size of the discontinuity well.
Whereas regression discontinuity design investigates a potential discontinuity in the magnitude of the latent process, difference-within-differences aims to detect such a discontinuity in the derivative of the latent process, that is, it detects changes in slope. Here, we simulate data sets from an underlying process that has a changing slope at the location of intervention and use BNQD with linear, exponential, RBF and first-order Matérn kernels as before. Because of the flexibility of all kernels except the linear one, a higher signal-to-noise ratio is needed to effectively detect a difference. We therefore generate data according to
where , , and with . We simulate 100 data sets for each combination of and .
The results of this simulation are shown in Fig. 4. As to be expected, the difference in slope is easily picked up by the linear kernel. On the other hand, flexible kernels such as the RBF and the Matérn can fit the data well enough in spite of the slope discontinuity, and so conclude that there is no effect unless the noise level is very small. As expected from the motivation of the different kernels (see Section 2.4), the exponential kernel is unable to detect a difference in slope, even for small noise levels. If the amount of observation noise is increased, only the linear kernel detects an effect.
3.3 From random to threshold-based assignment
Several extensions exist to standard quasi-experimental paradigms. For instance, in regression discontinuity the threshold-based assignment may not always strictly be adhered to (Feir2016). Here, we generate data with a true underlying discontinuity, but we randomly assign data points to the intervention or control group according to a sigmoid function
where the logistic gain determines how sharply changes. This illustrates that randomized controlled trials and threshold-based quasi-experimental designs lie on a continuum; for we recover randomized assignments, while for we obtain a sharp discontinuity design.
Figure 5 shows how the log Bayes factor of the model comparison approach changes as a function of the gain. This illustrates that BNQD can be used regardless of the randomness in the assignment, as long as the labels per observation (that is, whether they ended up in the control or the intervention group) are known. This also applies to fuzzy applications of RD design, where a threshold is provided but not always complied with (Geneletti2015).
BNQD can be used in several different quasi-experimental studies, as we show below.
4.1 RD and DiD: French municipality population sizes
Many governmental policy decisions at the subnational levels, such as salaries of officials or the number of board positions, depend (only) on the number of inhabitants of a municipality passing a particular threshold. However, this creates an incentive to report inflated population sizes, in order to acquire for example increased funding. Here, we re-analyse data on small French municipalities (Eggers2018; Grembi2017) using BNQD to see whether there is active intervention of the (official) population counts in order to strategically fall right below or above the threshold; an effect known as sorting (Imbens2008; McCrary2008).
Figure 6 shows the number of municipalities that fall within a certain distance from their nearest population threshold. The data have been binned into bins of size 1. Visually, the data for small municipalities (with either around 100 or 500 inhabitants; red squares and yellow diamonds, respectively) seem to display a discontinuity around the population threshold, and for the smallest municipalities a difference in slope is clearly visible. For the larger municipalities, it is not immediately obvious whether such a discontinuity exists.
|Kernel||BNQD (log BF)||2-step GP ()||RDD ()|
In this example, we apply BNQD with a linear and an RBF kernel. The results of the regressions and model comparisons are shown in Fig. 6 and listed in Table 2. Since the RBF kernel is very flexible compared to the linear model, the regression fits differ widely, in particular for the models trained on small municipalities. For the smallest municipalities with a threshold of 100 inhabitants, the choice in kernel leads to a qualitatively different result. Using a linear kernel we find clear evidence for a discontinuity, but with the flexible RBF kernel we instead find strong evidence for the absence of this discontinuity. For the largest municipalities, the different kernels are in agreement and find no effect. If we consider only the Bayesian model averaged Bayes factor, then for the smallest and largest municipalities there seems to be no influence of the population threshold, while for the middle two municipality sizes, there is a clear effect. In contrast, the two-step GP approach finds evidence for a discontinuity in all but the largest municipalities, while the classical RDD approach finds evidence in favor of a discontinuity for each threshold. In conclusion, it appears as though indeed the population counts for the medium-sized municipalities are reported fraudulently, while for the smallest municipalities the conclusion depends on the specific kernel. For the largest municipalities in this data set, the reported numbers are clear of these manipulations.
4.2 ITS: Effects of the Sicilian smoking ban
On January 10th 2005, the Italian government installed a national smoking ban in public places. One of the main reasons for this decision is the mounting evidence for the adverse effects of secondhand smoking (Richiardi2009). Several studies indeed report a reduction in rates of hospital admissions for acute coronary events (ACEs) after similar bans have been instantiated (Meyers2009; Gasparrini2009), but the benefits of this policy are not undisputed. Widely different effect sizes have been reported, ranging from large to practically absent (Barone-Adesi2011). The effect of the smoking ban in Sicily was investigated by Barone-Adesi2011 who collected data on the number of ACEs in the four years prior to the ban as well as up to two years after the ban. Barone-Adesi2011 and later Bernal2017 used an interrupted time-series design to analyse these data. Here, we apply BNQD to the same data for comparison. As in (Bernal2017), the goal of this example is not to provide definitive evidence in favor or against the effects of a smoking ban, but rather to illustrate how BNQD is applicable in similar situations.
The data set shows strong seasonal effects, for which both aforementioned studies corrected a priori. In the GP framework, such adjustment is unnecessary when using flexible kernels such as the RBF. Here, we perform the model comparison for a linear kernel and an RBF kernel. For both kernels, its corresponding hyperparameters are optimized using maximum likelihood. The results of the Bayesian non-parametric analysis are shown in Fig. 7. For each of the kernels, we find weak evidence in favor of the continuous model, i.e., the absence of an effect.
|Kernel||BNQD (log BF)||2-step GP ()||RDD ()|
The effect size estimates are similar across the two kernels. The discontinuous models estimate, in expectation, ACE risk reductions of 4.5% and 1.6%. However, when we take into account the substantial evidence in favor of the continuous models, these reductions are regularized to 0.2%, and 0.1%. None of the model comparisons support a discontinuity; the respective log Bayes factors are -2.9 and -2.9, and -3.3 for the BMA, indicating moderate to substantial evidence in favor of the continuous model (see Table 3). The two-step GP approach agrees with BNQD, as it cannot find evidence for a discontinuity either. The classical RDD approach does find a discontinuity, but this comes at the cost of the more strict assumptions.
4.3 GeoRDD: Dutch 2017 parliament elections
In 2017, the Dutch parliament elections were held and soon after the results were published online by the Dutch government (KOOP2018). According to Dutch electorate geographer De Voogd, the share of votes that go to populist parties222As this example merely serves to illustrate a two-dimensional quasi-experimental design, we refrain from an extensive discussion of the definition of populism. Instead, we use part of the definition by Mueller2016 and refer to populist parties as those parties that emphasize an alleged chasm between the elite and the general population. In the Dutch 2017 elections, parties that fit this description were PVV, SP, 50Plus and FvD. is different between the Dutch areas north and south of the so-called ‘phantom border’, a line that traditionally divided the catholic south of the Netherlands from the protestant north, and which originated in the Eighty Years’ War (1568–1648) (DeVoogd2016; DeVoogd2017). This border serves as a two-dimensional threshold along which one can apply RD design. This special case of RD design where the assignment threshold is a geographical boundary is referred to as GeoRDD (Rischard2018). Here, we test the hypothesis by De Voogd and demonstrate that BNQD is straightforward to apply a GeoRDD context.
|Kernel||log Bayes factor|
We apply BNQD using the linear, exponential, RBF and first-order Matérn kernels. The results of the analysis are shown in Fig. 8 and Table 4. Figure 8A shows the map of the Netherlands with the fraction of populist votes per municipality superimposed. In addition, each map shows a hand-drawn approximate phantom border representing the supposed divide in voting behaviour. As the results for the flexible three kernels are visually similar, we show only the linear and exponential kernel results. The top row of Fig. 8A shows the results of a linear kernel, which in two dimensions fits a plane (or two planes for the discontinuous model) to the data. The bottom row shows the results using an exponential kernel. Figure 8B shows the effect sizes conditioned on the discontinuous model, which are distributions at each point of the distribution (the shaded interval indicates one standard deviation around the mean).
If we assume a linear underlying process, there is strong evidence for a discontinuity (), confirming the hypothesis by De Voogd. Visually however, the data do not appear to follow these linear trends strongly. Of the three more flexible models, the first-order Matérn kernel results in the strongest evidence against an effect (). Overall, the continuous model with an exponential kernel has the largest marginal likelihood, indicating that it fits the data best of these four kernels. This causes it to dominate the Bayesian model average Bayes factor as well, which is approximately equal to the Bayes factor when using just the exponential kernel. Note that the effect size shown in Fig. 8B for the exponential kernel is conditioned on the discontinuous model, but this kernel strongly favors a continuous model. The BMA of the effect size is therefore approximately zero across the phantom border.
While some quasi-experimental designs, such as regression discontinuity design, have been around since the 1960s (Cook2008), recently there has been a renewed interest in this class of methods (Imbens2008; Choi2017; Shadish2002), in particular in epidemiology (Harris2004) and education (Bloom2012). At the same time, researchers from different domains are actively promoting the use of QED (Marinescu2018; Moscoe2015). This has prompted researchers to extend classical QED in several ways. For instance, several authors have proposed to use Bayesian models for designs such as RD (Geneletti2015), DiD (Normington2019), ITS (Freni-Sterrantino2019), single-subject designs (SS; DeVries2013) and stepped-wedge trials (SWT; Cunanan2016). In these studies, parameters such as (linear) regression coefficients are associated with prior probability distributions, and posterior distributions are estimated using Bayesian inference. Compared to classical implementations of such study designs, these approaches provide explicit descriptions of the uncertainty that is associated with their parameter estimates. In contrast to our work however, these methods focus on the estimation of the treatment effect, instead of model comparison.
Other studies have explored non-parametric models for QED, in particular for RD, to alleviate the reliance on a linear model. For example, Hahn2001 propose to use local linear non-parametric regression as an alternative to linear regression. Alternatively, one can use kernel methods that compute a smoothly weighted average of the data points to create an interpolated regression that does not depend on a specific parametric form (Bloom2012). However, in order to infer causality from QED, one assumes that the effect is local around the threshold. That is, the behaviour of the two groups changes is supposed to change sharply around the intervention instead of gradually. In standard RD studies, this locality is controlled via a bandwidth parameter that determines the sensitivity of the detection approach, but, as Bloom2012 points out, kernel methods require sufficient data points around the intervention point and they are sensitive to this bandwidth parameter. In BNQD, the bandwidth is replaced by the length scale hyperparameter of the non-degenerate kernels, which we optimize using the model marginal likelihood (for a review on the similarities between frequentist kernel-based methods and Gaussian processes we refer the interested reader to Kanagawa2018). The length scale regulates how fast the correlations between consecutive points decays as their distance increases, and thus how sensitive BNQD is around the threshold. With an exponential kernel the most rigorous form of locality can be enforced. In this case, the Markov properties of the Gaussian process guarantee that only discontinuities at the intervention threshold are detectable. On the other hand, non-local kernels such as the periodic kernel are vulnerable to spurious effects if the underlying process is non-stationary. Specifically, the presence of change points away from the intervention threshold can lead to false alarms. In this case, BNQD should be performed in a sliding-window fashion in order to ensure that the highest Bayes factor is actually at the intervention threshold.
Most similar to BNQD are the Gaussian process regression discontinuity models by Rischard2018 and Branson2019. In both cases, the authors fit a Gaussian process to the intervention and the control group, and compare the difference in their prediction at the intervention point. The primary difference with our work is that these studies are specific to regression discontinuity designs. The main inferred quantity is the treatment effect , where is the point of intervention and and correspond to the intervened and the controlled processes. Instead, we focus on differences in these processes of multiple orders, not just a straightforward discontinuity. This way, BNQD is applicable to other sorts of quasi-experimental studies, while the approaches mentioned here need to be modified substantially in order to be applied to, say, difference-within-differences design. Similarly, BNQD can be directly applied in semi-randomized data sets, where a threshold is not clearly defined, and extends naturally to the multivariate setting, where defining the effect on the boundary can become problematic. Furthermore, although the cited works use GP regression to fit the data, they use null-hypothesis significance testing to determine the presence of a discontinuity. We remain within the Bayesian paradigm, which enables us to quantify evidence in favor of, or against, an effect using Bayes factors (Wagenmakers2007), and to use Bayesian model averaging (Hoeting1999; Hinne2019). This way we can integrate over different kernel assumptions, and provide effect size estimates that are not conditioned on the discontinuous model, but on all considered models.
In most quasi-experimental designs, the non-random assignment to the control or the intervention group depends on a single variable (the assignment variable) to which some threshold is applied. However, it is possible that this decision is based instead on the interaction of multiple assignment variables (Reardon2012; Papay2011; Wong2013; Choi2017; Choi2018). A special case of multivariate QED is GeoRDD, in which two-dimensional assignment variable represents a spatial location (Rischard2018; Keele2015; Keele2017). Our approach does not assume a univariate threshold to determine the assignment to intervention and control group, and can work on arbitrary complex label functions. This can be a geographical border as in Section 4.3, but also more complex shapes such as, for example, one region versus the rest of a country, or a particular regime of diagnostic variables.
BNQD can be extended in several ways to make it more widely applicable. For instance, we do not currently account for covariates that may serve as confounds for causal inference (Harris2004). However, such covariates can be explicitly taken into account in the regression models, or even be learned from the observations (Kocaoglu2017). Other extensions include the fuzzy application of a threshold function (for instance, a drug may be prescribed according to a certain policy, but that policy is not always adhered to). Another extension is the modelling of partial effects (Choi2017; Choi2018). In our multivariate example (Section 4.3), a single assignment function is used to label each 2D point as above or below the phantom border. Alternatively, a threshold or other decision function can be applied to either of these two dimensions, which can then be used to test whether each dimension individually results in a discontinuity, in addition to the multivariate test. Furthermore, BNQD may be applied to detect the effect of interventions on oscillatory signals, such as heart rate and respiration, by using periodic kernels. In this case a discontinuity reflects changes in for instance the frequency, amplitude or phase of a signal.
Our approach is based on Gaussian process regression, which has a number of computationally attractive features. In particular, with a Gaussian observation model, many of the required computations become analytically tractable. However, via for instance the Laplace approximation, Gaussian process regression can be used in tandem with non-Gaussian observation models (Rasmussen2004). For instance, one observation model could be Poisson to model count data (Adams2009; Williams1998), which is arguably more applicable in the Sicilian smoking ban example, or Bernoulli for binary outcomes.
We have presented BNQD, a Bayesian non-parametric approach for causal inference in quasi-experimental designs. Compared to other methods for QED, BNQD estimates a causal effect based on all observations instead of only those close to the threshold. This allows us to capture more subtle effects of an intervention, such as difference-within-differences, using the same framework as for regression discontinuity design. By selecting the appropriate kernel, one has precise control over the type of discontinuity that can be detected. For example, using BNQD with an exponential kernel results in a non-parametric quasi-experimental setup that is only sensitive to discontinuities in the latent process itself, while an RBF kernel can detect discontinuities in any derivative of the latent process.
Researchers wanting to apply BNQD to their data can find our Python implementation on GitHub.
Appendix A Gaussian process kernels
A Gaussian process is completely described by its mean function and its covariance function , known as its kernel, with hyperparameters . Typically, the mean function is simply set to zero unless otherwise specified, which leaves the kernel to describe the similarity between any pair of input points. We describe a number of basic kernels that are used throughout this paper, and refer the reader to e.g. Rasmussen2004 for a more thorough exposition. We distinguish between degenerate and non-degenerate kernels. Of the first class, two kernels are used in the main text:
The constant kernel is the simplest possible kernel and captures functions that are straight horizontal lines
The linear kernel describes line with arbitrary slope, and is given by
In the linear kernel, the offset represents the point in the posterior distribution that all samples from the GP pass through (i.e. a point with zero variance). The variance is the variance of the prior bias (i.e. the height of the function at ). Finally, is the output variance.
The following non-degenerate kernels are commonly used:
The exponential kernel is given by
The radial basis function (RBF; also known as the Gaussian, squared-exponential or exponentiated quadratic kernel) is often used as a default kernel in GP regression. It is given by
in which represents the length scale of the process and the output variance.
The Matérn kernel is actually a class of kernels with a parameter . For , this kernel coincides with the exponential kernel, while for the Matérn kernel converges to the RBF kernel. A common kernel in machine learning is the Matérn kernel, which is given by
Appendix B Computation of Bayes factors
Recall that the Bayes factor for model comparison is given by
Here, the marginal likelihoods integrate out all the potential latent functions , via
Such integration is intractable in many Bayesian models, but one of the attractive properties of Gaussian processes is that this term is available analytically (Rasmussen2004; MacKay1998). However, integrating out the kernel hyperparameters remains intractable and can only be addressed by approximation. Here we make use of a Bayesian Information Criterion (BIC) approach. If the number of observations is much greater than the number of hyperparameters, and we assume that the posterior distribution over hyperparameters to be unimodal, then we can approximate the (log) evidence by
where is obtained by optimizing the marginal likelihood. However, we need to correct for the variance in this distribution which is ignored when using the MAP estimate. This is done in the form of a penalty added to (19) to compensate for the information lost through optimization. This results in
where is the number of observations and the number of hyperparameters. Although more accurate approximations are possible, such as the Laplace approximation (Rasmussen2004) or sequential Monte Carlo (Svensson2015), we consider this out of the scope of the current paper.
It is important that the hyperparameters of the two sub-models within the discontinuous model are the same, as otherwise they may reflect non-stationarities in the estimated process that are not due to the alleged discontinuity. We enforce this by simply optimizing both sub-models separately, and then setting their optimized hyperparameters to the average values of the two.