Causal inference using Bayesian non-parametric quasi-experimental design

Causal inference using Bayesian non-parametric quasi-experimental design

Max Hinne, Marcel A.J. van Gerven, Luca Ambrogioni

The de facto standard for causal inference is the randomized controlled trial, where one compares an manipulated group with a control group in order to determine the effect of an intervention. However, this research design is not always realistically possible due to pragmatic or ethical concerns. In these situations, quasi-experimental designs may provide a solution, as these allow for causal conclusions at the cost of additional design assumptions. In this paper, we provide a generic framework for quasi-experimental design using Bayesian model comparison, and we show how it can be used as an alternative to several common research designs. We provide a theoretical motivation for a Gaussian process based approach and demonstrate its convenient use in a number of simulations. Finally, we apply the framework to determine the effect of population-based thresholds for municipality funding in France, of the 2005 smoking ban in Sicily on the number of acute coronary events, and of the effect of an alleged historical phantom border in the Netherlands on Dutch voting behaviour.

1 Introduction

The bread and butter of scientific research is the randomized-controlled trial (RCT) (Hill1952). In this design, the sample population is randomly divided into two groups; one that is manipulated (e.g. a drug is administered or a treatment is performed), while the other is left unchanged. Because of the random group assignment, we may assume (if the sample size is large enough) that the distributions of all irrelevant variables are the same between the two groups, so that if we measure a difference, this can only be due to the intervention. The consequence is that the RCT allows us to perform causal inference, that is, via this approach we can learn about the causal effect of the intervention (Pearl2018).

While the RCT is a staple tool in the toolbox of any scientist, in practice there may be several insurmountable hurdles that deter one from using it. For instance, ethical considerations may force a study to halt midway when the intervention appears to cause serious adverse effects. Other objections may be pragmatic; it may be impossible to obtain sufficiently equivalent groups of samples if these samples represent for instance cities or species. Luckily, even in these cases all is not lost for causal inference. There exist several quasi-experimental designs that replace random assignment with deterministic assignment, that still allow for valid causal inferences at the cost of some additional assumptions. For example, regression discontinuity (RD) design (Campbell1963; Lauritzen2001; Pearl2009) and difference-within-differences (DiD) (Lechner2011) assign a sample to one of the two groups based on it passing or failing a threshold on an assignment variable. For instance, a patient may be assigned to the intervention group if their blood cholesterol levels exceed a certain predetermined limit. A closely related quasi-experimental design is interrupted time series (ITS) analysis, which is essentially RD, but with time as its assignment variable (McDowall1980). The idea behind these approaches is that, around the assignment threshold, data points are distributed essentially randomly, so that locally the conditions of RCT are recreated (Lee2010; Imbens2008). Crucially, for this to work, these methods assume that the sampled population cannot directly influence its score on the assignment variable. For example, if a patient can actively manipulate their own cholesterol level they can determine whether they are in the intervention or the control group, which would render the conclusion invalid.

The methodological pipeline of quasi-experimental designs like these generally consists of three steps (Rischard2018). First, a regression is fit to each of the two groups individually. Typically, one parametrizes this as linear regression. Next, the fits are extrapolated to the boundary (that is, the threshold of the assignment variable). Finally, the difference between the extrapolations of the two groups at the boundary is taken as the effect size of the intervention. A straightforward statistical test can be applied to check whether the effect is present. Although RD, ITS and DiD have been applied frequently in education, econometrics and political science, recent studies have pointed out that they are underutilized in other domains, such as neuroscience, behavioural science and medicine (Marinescu2018; Moscoe2015).

In this paper we provide a novel Bayesian non-parametric framework for quasi-experimental design, which we refer to as BNQD. BNQD is a Bayesian approach to quasi-experimental design that offers several advantages to the aforementioned examples (RD, DiD and ITS), as described in the following.

First, we frame the problem of detecting a difference between the two groups as a model comparison. Instead of comparing the pre- and post-intervention regressions, we introduce a continuous model and a discontinuous model. The continuous model is fitted on all observations, while in the discontinuous model we have two sub-models: one is fitted on the intervened group, while the other is fitted on the control group. We then perform Bayesian model comparison between the continuous and the discontinuous model. This allows us to quantify the evidence in favor of either model, rather than only for the alternative model (Wagenmakers2007). In addition, this makes it possible to compute the Bayesian model averaged effect size, which provides a more nuanced estimate compared to implicitly conditioning on the alternative model (Hoeting1999; Hinne2019).

Second, our approach is non-parametric and is based on Gaussian process (GP) regression. This means that we do not have to impose strong constraints on the functional form of the regression in our models. Instead, we specify a kernel that describes how similar any two points are depending on their distance from each other. The result is a highly flexible model that can capture all sorts of nonlinear interactions between the assignment and the outcome variable, such as periodicity in time series. By appropriate selection of the GP kernel, BNQD serves as an alternative for several traditional quasi-experimental designs. For instance, with a linear kernel, BNQD can be used for RD and DiD studies. Alternatively, with an exponential kernel BNQD serves as a non-parametric form of RD design. Finally, BNQD can just as easily be applied to multiple assignment variables (Reardon2012; Papay2011; Wong2013), as we demonstrate in one of our examples.

Third, in most methods for quasi-experimental design, a bandwidth parameter determines the trade-off between estimation reliability and the local randomness assumptions that are needed to draw causal inferences (Geneletti2015). In BNQD, all observations are used to estimate both the continuous and discontinuous model, but by optimizing the length scale parameter of the GP kernels we control the sensitivity to different types of discontinuities and adherence to locality assumptions.

The paper is organized as follows. In Section 2 we explain the proposed method. We assess its performance in simulations in Section 3, and demonstrate how it can be applied in practice in Section 4. In Section 5 we discuss related work and and future directions for Bayesian quasi-experimental designs. Lastly, we conclude in Section 6.

2 Bayesian non-parametric quasi-experimental design

Suppose we have collected data , where is a predictor and is its corresponding response. For this exposition we will assume , but extension to multidimensional input is straightforward, as we will show in an example later on. In addition to our data, we have a label function that returns per data point whether it is in the intervention group , or in the control group . In randomized controlled trials, data points would be assigned at random and performs the role of a simple look-up table, but in regression discontinuity studies, for instance, compares to a threshold value and assigns its label according to the outcome of that comparison. With these preliminaries, BNQD tries to establish whether the observations are best described by a single, continuous, regression model , or by a discontinuous model that has a separate regression for the intervention and the control group. Mathematically, this comes down to computing the Bayes factor (Goodman1999b; Kass1995):


where and are the subsets of the data belonging to the control and the intervention, respectively. Note that . This Bayes factor provides an intuitive representation of the model comparison. That is, it indicates how much more likely the data are given the discontinuous model, compared to the continuous model (Jarosz2014). Unlike a -value, it can provide evidence for either model, so that it is possible to find evidence supporting the absence of a discontinuity (Wagenmakers2007; Goodman1999a).

2.1 Gaussian process regression

Figure 1: A. A typical application of BNQD. The left panel shows the continuous model with a single continuous regression through all the observations. The right panel shows the discontinuous model with a separate fit for the control and intervention groups. B. Examples of applications of BNQD recover different special cases of experimental designs. On the left, linear kernels and a clear application of the threshold-based assignment gives regression discontinuity design. In the middle we see the same approach, but with a fuzzy application of the threshold. Note that both regressions are trained on all the data points of that respective condition, not just the ones above or below the threshold. On the right, we see how random assignment and a constant kernel results in a t-test for RCT designs. C. The estimated effect size of the discontinuity. For regression discontinuity quasi-experimental designs, one can estimate the posterior over effect size either as (which is a spike density at zero by definition), as the Gaussian distribution , or as the Bayesian model average mixture distribution of these two. Depending on which model is more likely, the BMA will be dominated by the prediction of that model.

In order to compute the Bayes factor in Eq. (1), we need to define the generative models for that describe how the observed responses follow from the predictor variable. We assume that the responses follow a Gaussian distribution around a latent function of the predictors , such that


in which is the observation noise. Note that this observation model is the same for the continuous and the discontinuous models. In the continuous model, there is a single latent function , while in the discontinuous model we have two functions, and , that apply to the observations in the control or the intervention group, respectively. In standard linear regression, these functions would be given a parametric form, namely , in which is the bias term and the slope. In BNQD however, we suppose that we do not know the exact form of , and instead assign a prior distribution, which is a Gaussian process (GP). We write:


The parameters of the GP, and , are functions themselves and have the following meaning. The mean function specifies the expected mean of the functions that may be drawn from the process. Typically, this is simply set to zero for every . An essential component of the Gaussian process is the kernel function which specifies for two distinct points and how similar the function values and are. Depending on what kernel function we specify, widely different behaviour for can be achieved. For instance, we may specify that we expect functions to be smooth in , or that functions exhibit periodicity. The properties of these kernels, like their length scale or frequency, are captured by the hyperparameters . In terms of parameter learning and optimization, the observation noise is often considered an element of as well. Note that and share the hyperparameters , so that differences in their predictions at the intervention point are not due to non-stationarities in the latent process. An illustration of the BNQD procedure is shown in Fig. 1A, in which the left panel shows the continuous model and the right panel shows the discontinuous model , both using linear kernels. In this example, we find a Bayes factor of 1.8, indicating that the data are 1.8 more likely under the discontinuous model.

For reference, we list the kernels that are used throughout this paper in Appendix A. Computational details of hyperparameter optimization and the Bayes factor of (1) are discussed in Appendix B. For an extensive discussion of Gaussian processes, we refer the reader to Rasmussen2004.

The GP forms a very flexible model. By choosing the appropriate kernel function, GP regression becomes equivalent to many parametrized models. For instance, if we select a linear kernel, the functions drawn from the GP prior are straight lines. This way, BNQD can be used as an alternative to classic quasi-experimental designs such as RD, DiD and ITS. Furthermore, BNQD can also recover true randomized controlled trials, by using the assignment function as a look-up table to see whether an observation was part of the control or the intervention group. By using a constant kernel, the functions drawn from the GP are horizontal lines, turning BNQD into a t-test between the intervention and the control groups. Figure 1B visualizes these ideas. It shows the output of three different applications of BNQD. In the leftmost panel, we have used a linear kernel, and the two groups of observations are cleanly separated at . That is, . In the middle panel, the threshold is applied in a fuzzy way, where most, but not all, data points follow the threshold-based assignment. In the final panel, assignment is completely at random and there is no notion of a threshold here. By choosing a constant kernel, we obtain the GP regressions for both groups that simply reflect their means and variances, which suffices to perform a z-test.

2.2 Estimating the size of a discontinuity

If we apply BNQD to detect a potential discontinuity at an intervention point , like in RD or ITS, we can estimate the effect size of the discontinuity (we use the term effect size synonymously with the average treatment effect (Kim2016) or average causal effect (Pearl2009b)) using the Bayesian model average (BMA) (Hoeting1999). The BMA accounts not only for the uncertainty in our estimate of the effect size given a particular model, but also the uncertainty in the models themselves. For an effect size , the BMA is given by


that is, the distribution of the effect size is the sum of the distributions of given each model, weighted by the posterior probability of the respective model. Given the discontinuous model, the discontinuity is given by


where and give the predictions by the two GP regressions within . Since both of these predictions are Gaussian distributions, their difference is once again Gaussian with mean and variance . Since under the continuous model we have by definition, this has a regularizing effect, shrinking the effect size estimate towards zero. The estimation of the effect size for the earlier example is shown in Fig. 1C. Note that the BMA is a mixture model of the (Gaussian) distribution and a spike density at 0.

2.3 Model averaging across GP kernels

One of the key advantages of our approach to quasi-experimental design is that by changing the kernel of the GP regression, widely different behaviour can be modelled. However, the pragmatic researcher may be agnostic about which kernel best describes their observations. In this case, multiple kernels may be considered, and a final Bayes factor may be obtained by Bayesian model averaging across all kernels. Consider the following example in which we entertain both a rigid linear kernel and a flexible RBF kernel, so that . We can compute the Bayes factor for each kernel, but we can also obtain the total Bayes factor given the set of kernels:


The quantity serves as a final decision metric to determine an effect in a quasi-experimental design, while a detailed report is provided by inspecting the Bayes factors corresponding to each considered kernel. In practice, the evidence of one kernel can dominate all others, i.e. , in which case the BMA procedure is effectively equivalent to performing the analysis with the best kernel only.

2.4 Theoretical motivation of kernel choices

The choice of the Gaussian process kernels plays two conceptually distinct roles in BNQD. First, our choice of kernel reflects our beliefs about the latent process that generated the observations. In traditional RD designs, one assumes a parametrized model such as linear regression. In BNQD, this explicit parametric form is replaced by a GP prior that assigns a probability distribution to the space of functions. However, BNQD can still replicate parametrized models by selecting degenerate kernels, such as the constant or linear kernel. These modeling choices are crucial in RD design as model misspecification can lead to incorrect inference. In the case that we do not have clear prior beliefs about an appropriate kernel, we can look at the weighted average of several kernels via the Bayesian model average.

Kernel Non-parametric Difference in differences Order of discontinuity
Constant 0
Linear 1
Exponential 0
1-Matérn 1
Table 1: Properties of common kernels. The columns indicate whether a kernel results in parametric or non-parametric regressions, whether it can capture differences in slopes, and what order of discontinuity it can detect (that is, a discontinuity in which derivative).

The second role of the kernel choice is that it determines to which types of discontinuities BNQD is sensitive. Importantly, different kernels can be used to test fundamentally different hypotheses, as the kernel determines which features of the latent function are part of the alleged effect. For example, the simplest (degenerate) kernel, the constant kernel, is sensitive only to differences in the means of the two groups, while the linear kernel is sensitive to both the difference in mean as well as the difference in slope (which means that it can be used both for regression discontinuity as well as difference-in-differences designs). The RBF kernel on the other hand can detect discontinuities in a function, but also in any of its derivatives. A middle ground is provided by the Matérn kernel, which is actually a class of kernels with an order parameter . This kernel can detect discontinuities in up to the -th derivative. An interesting special case is the exponential covariance function (which is equivalent to a Matérn kernel with ). This non-parametric kernel is only sensitive to 0-th order discontinuities, that is, to discontinuities in the function itself. Note that, while some covariance functions like the RBF can theoretically detect discontinuities in derivatives of any order, the amount of data requires to detect these subtle effects can become prohibitively high. The properties of these standard kernels are summarized in Table 1. The use of more sophisticated covariance functions allows one to test more global discontinuity effects that cannot be captured as discontinuities in any of the derivatives. For example, periodic covariance functions can detect discontinuities in oscillatory signals, by identifying differences in frequency, the amplitude or waveform.

2.5 Implementation

BNQD is implemented in Python using the GPy module (GPy2014) and is available at

3 Simulations

In the following, we test the performance of BNQD using simulations in different experimental settings.

3.1 Discontinuity detection

Arguably the most straightforward quasi-experimental setup is regression discontinuity (RD) design. Here, we study the application of BNQD on simulated data with a known discontinuity . We compare the performance of BNQD to a two-step approach by Branson2019, which first fits a discontinuous Gaussian process model to the data and then uses the predictions at the intervention to obtain a -value of the test for discontinuity, as well as to a classical RD approach111The implementation of the classic RDD is available at We consider two different regression discontinuity cases; one based on a linear model and the other based on a nonlinear one.

We apply BNQD using four kernels: a linear, an exponential, a first-order Matérn, and an RBF kernel. For each kernel , we obtain a Bayes factor as well as estimates of the discontinuity according to either the discontinuous model () or using the Bayesian model average (, see Eq. (4)). Furthermore, we obtain these measures by aggregating over all four kernels. In total, this results in five Bayes factors, five posterior distributions of the effect size assuming a discontinuity, and five posterior distributions that average over continuous and discontinuous models.

The estimated discontinuity , according to the discontinuous models, can be used to compute the -value of the test for a discontinuity (Branson2019) according to


where is the Gaussian cumulative density, with and the statistics of the discontinuity distribution at the boundary (see Section 2).

3.1.1 Discontinuity in a linear model

Figure 2: Linear true model. Simulation of regression discontinuity design using BNQD. A. Example generated data sets for the given true discontinuity sizes. B. The (logarithm of the) Bayes factor as a function of the discontinuity size, for the linear, exponential, RBF and Matérn kernels. Note that for undetectable discontinuities (), the log Bayes factor is negative, indicating evidence against a discontinuity. The linear kernel provides the largest Bayes factors, as the other models are more flexible and can fit the data better even in spite of a discontinuity. C. The -values obtained from the distribution of the effect size at (Branson2019) for the same kernels, as well as for the classical RDD approach. The difference approaches are qualitatively consistent, but, in contrast to the Bayes factor approach, the -values for small discontinuities cannot be interpreted as evidence in favor of the null hypothesis. D. The estimated discontinuity according to each kernel, broken down by kernel and conditioned on the discontinuous model or the Bayesian model average. The black, dashed line indicates the effect size across all models. As the Bayes factors in favor of a discontinuity increase, so is the BMA dominated by the discontinuous model; hence they solid and dashed lines converge to the same estimates for larger discontinuities. Error bars indicate one standard deviation over the 100 simulations.

The data are generated as follows. For each of the effect sizes , we simulate 100 data sets with observations each, according to


where we set and . Note that effect sizes are indistinguishable from the observation noise. An example data set is shown in Fig. 2A.

Figure 2B-D shows the result of the simulation study. Notice that for small effect sizes (, the Bayes factors provide positive evidence for the continuous models, while the -values only express that the null hypothesis cannot be rejected. The largest Bayes factors are obtained for the linear kernel. The other kernels provide more flexible regressions, which makes the corresponding continuous models fit the data better, in spite of a present discontinuity. Still, all kernels provide the same qualitative assessment of the simulations.

The two-step approach is the least sensitive in detecting a small discontinuity. For example, at , it leads to using the linear kernel, while the classical RDD approach gives . BNQD is confident about the discontinuity and gives a log Bayes factor of 6.6, which reflects that the data are over 700 times more likely under the discontinuous model.

3.1.2 Discontinuity in a nonlinear model

Figure 3: Nonlinear true model. Simulation of regression discontinuity design using BNQD. A. Example generated data sets for the given true discontinuity sizes. B. The (logarithm of the) Bayes factor as a function of the discontinuity size, for the linear, exponential, RBF and Matérn kernels. As these data are difficult to describe with a linear model, this results in large Bayes factors; after all, the data will be described much better by two separate models. However, the discontinuity will be overestimated, as the fits are poor compared to the more flexible nonlinear kernels (see D). C. The -values obtained from the distribution of the effect size at (Branson2019) for the same kernels, as well as for the classical RDD approach. The two-step approach with a linear, as well as the classical RDD approach lead to small -values, as the flexibility of the other models makes it possible to fit the data well with the continuous model. D. The estimated discontinuity according to each kernel, broken down by kernel and conditioned on the discontinuous model or the Bayesian model average. The black, dashed line indicates the effect size across all models. As the Bayes factors in favor of a discontinuity increase, so is the BMA dominated by the discontinuous model, hence, the solid and dashed lines converge to the same estimates for larger discontinuities. As mentioned above, the linear kernel overestimates the discontinuity, due to its poor fit to the data. The flexible kernels that allow for nonlinear regression do result in accurate recovery of the discontinuity. Error bars indicate one standard deviation over the 100 simulations.

In the second simulation we consider what happens if the underlying data generating process itself is nonlinear. We proceed as before, but now generate data from a shifted periodic function:


where again we set and . Some example data sets for different discontinuities are shown in Fig. 3A.

The results for the second simulation are shown in Fig. 3B. Most striking is that the linear kernel does result in large Bayes factors, indicating evidence for a discontinuity, but this is mostly due to its poor fit to the data. This is also apparent from the estimated discontinuity sizes, which is consistently overestimated by the model with a linear kernel. The flexible alternative kernels do capture the data much better and recover the size of the discontinuity well.

3.2 Difference-within-differences

Whereas regression discontinuity design investigates a potential discontinuity in the magnitude of the latent process, difference-within-differences aims to detect such a discontinuity in the derivative of the latent process, that is, it detects changes in slope. Here, we simulate data sets from an underlying process that has a changing slope at the location of intervention and use BNQD with linear, exponential, RBF and first-order Matérn kernels as before. Because of the flexibility of all kernels except the linear one, a higher signal-to-noise ratio is needed to effectively detect a difference. We therefore generate data according to


where , , and with . We simulate 100 data sets for each combination of and .

Figure 4: Differences in linear slope. Simulation of difference-within-differences design using BNQD. A. Example data sets for the indicated differences in slope, for . B. The Bayes factors for the detection of a difference in slope, for decreasing signal-to-noise ratio . Note that the linear kernel is visualized separately, as it results in much higher Bayes factors than the other kernels. However, in all cases the Bayesian model averaged Bayes factor is determined mostly by the flexible RBF and Matérn kernels, that can easily fit these data. Despite this, when the signal-to-noise ratio is increased, these two kernels are able to detect the difference in slope. In the case of more observation noise, none of the flexible kernels is able to detect a difference, even for the case. The linear kernel detects this with ease, however.

The results of this simulation are shown in Fig. 4. As to be expected, the difference in slope is easily picked up by the linear kernel. On the other hand, flexible kernels such as the RBF and the Matérn can fit the data well enough in spite of the slope discontinuity, and so conclude that there is no effect unless the noise level is very small. As expected from the motivation of the different kernels (see Section 2.4), the exponential kernel is unable to detect a difference in slope, even for small noise levels. If the amount of observation noise is increased, only the linear kernel detects an effect.

3.3 From random to threshold-based assignment

Several extensions exist to standard quasi-experimental paradigms. For instance, in regression discontinuity the threshold-based assignment may not always strictly be adhered to (Feir2016). Here, we generate data with a true underlying discontinuity, but we randomly assign data points to the intervention or control group according to a sigmoid function


where the logistic gain determines how sharply changes. This illustrates that randomized controlled trials and threshold-based quasi-experimental designs lie on a continuum; for we recover randomized assignments, while for we obtain a sharp discontinuity design.

Figure 5: In true randomized controlled trials, data points are assigned to either group at random. For threshold-based quasi-experimental designs, this assignment instead depends on the assignment variable . From left to right, the probability of being assigned to the intervention group increases from uniformly at random to completely determined by the threshold on the predictor. The rightmost panel shows the Bayes factors for the model comparison using BNQD with the same four kernels as before, as well as the Bayesian model average. The figure shows that BNQD can be used when there is complete randomization, as well as a threshold-based discontinuity, or a probabilistic application thereof.

Figure 5 shows how the log Bayes factor of the model comparison approach changes as a function of the gain. This illustrates that BNQD can be used regardless of the randomness in the assignment, as long as the labels per observation (that is, whether they ended up in the control or the intervention group) are known. This also applies to fuzzy applications of RD design, where a threshold is provided but not always complied with (Geneletti2015).

4 Applications

BNQD can be used in several different quasi-experimental studies, as we show below.

4.1 RD and DiD: French municipality population sizes

Many governmental policy decisions at the subnational levels, such as salaries of officials or the number of board positions, depend (only) on the number of inhabitants of a municipality passing a particular threshold. However, this creates an incentive to report inflated population sizes, in order to acquire for example increased funding. Here, we re-analyse data on small French municipalities (Eggers2018; Grembi2017) using BNQD to see whether there is active intervention of the (official) population counts in order to strategically fall right below or above the threshold; an effect known as sorting (Imbens2008; McCrary2008).

Figure 6: BNQD for French municipality population threshold data set (Grembi2017). Observations have been collected into bins of size 1. The left panel shows the different GP regressions using a linear kernel, while the right panel shows these fits using an RBF kernel. The dashed line indicates the nearest population threshold. The -axis represents the distance away from the threshold.

Figure 6 shows the number of municipalities that fall within a certain distance from their nearest population threshold. The data have been binned into bins of size 1. Visually, the data for small municipalities (with either around 100 or 500 inhabitants; red squares and yellow diamonds, respectively) seem to display a discontinuity around the population threshold, and for the smallest municipalities a difference in slope is clearly visible. For the larger municipalities, it is not immediately obvious whether such a discontinuity exists.

Threshold Method
Kernel BNQD (log BF) 2-step GP () RDD ()
100 Linear 4.1e2 6.7e-3 9.2e-22
RBF -2.8e1 1.5e-23 -
BMA -2.8e1 - -
500 Linear 1.3e2 2.8e-24 1.5e-27
RBF 4.9e1 9.3e-79 -
BMA 4.9e1 - -
1000 Linear 4.0e1 6.0e-7 9.1e-21
RBF 2.8e1 4.2e-20 -
BMA 2.8e1 - -
1500 Linear -6.2e0 1.9e-1 1.7e-2
RBF -6.5e0 1.9e-1 -
BMA -6.3e0 - -
Table 2: The discontinuity detection for BNQD, the two-step Gaussian process approach (Branson2019), and the classical RDD approach on the French municipality size thresholds data. The decision metric is the (logarithm of the) Bayes factor for BNQD, and a -value for the other methods.

In this example, we apply BNQD with a linear and an RBF kernel. The results of the regressions and model comparisons are shown in Fig. 6 and listed in Table 2. Since the RBF kernel is very flexible compared to the linear model, the regression fits differ widely, in particular for the models trained on small municipalities. For the smallest municipalities with a threshold of 100 inhabitants, the choice in kernel leads to a qualitatively different result. Using a linear kernel we find clear evidence for a discontinuity, but with the flexible RBF kernel we instead find strong evidence for the absence of this discontinuity. For the largest municipalities, the different kernels are in agreement and find no effect. If we consider only the Bayesian model averaged Bayes factor, then for the smallest and largest municipalities there seems to be no influence of the population threshold, while for the middle two municipality sizes, there is a clear effect. In contrast, the two-step GP approach finds evidence for a discontinuity in all but the largest municipalities, while the classical RDD approach finds evidence in favor of a discontinuity for each threshold. In conclusion, it appears as though indeed the population counts for the medium-sized municipalities are reported fraudulently, while for the smallest municipalities the conclusion depends on the specific kernel. For the largest municipalities in this data set, the reported numbers are clear of these manipulations.

4.2 ITS: Effects of the Sicilian smoking ban

On January 10th 2005, the Italian government installed a national smoking ban in public places. One of the main reasons for this decision is the mounting evidence for the adverse effects of secondhand smoking (Richiardi2009). Several studies indeed report a reduction in rates of hospital admissions for acute coronary events (ACEs) after similar bans have been instantiated (Meyers2009; Gasparrini2009), but the benefits of this policy are not undisputed. Widely different effect sizes have been reported, ranging from large to practically absent (Barone-Adesi2011). The effect of the smoking ban in Sicily was investigated by Barone-Adesi2011 who collected data on the number of ACEs in the four years prior to the ban as well as up to two years after the ban. Barone-Adesi2011 and later Bernal2017 used an interrupted time-series design to analyse these data. Here, we apply BNQD to the same data for comparison. As in (Bernal2017), the goal of this example is not to provide definitive evidence in favor or against the effects of a smoking ban, but rather to illustrate how BNQD is applicable in similar situations.

The data set shows strong seasonal effects, for which both aforementioned studies corrected a priori. In the GP framework, such adjustment is unnecessary when using flexible kernels such as the RBF. Here, we perform the model comparison for a linear kernel and an RBF kernel. For both kernels, its corresponding hyperparameters are optimized using maximum likelihood. The results of the Bayesian non-parametric analysis are shown in Fig. 7. For each of the kernels, we find weak evidence in favor of the continuous model, i.e., the absence of an effect.

Figure 7: Application of BNQD to the Sicily smoking ban data set (Barone-Adesi2011). Results are shown for two different kernels: a linear kernel and an RBF kernel. In addition to the GP regression fits as shown in the left and middle column, the right column shows estimated effect sizes according to either model as well as the Bayesian model average. The inset pie chart shows the posterior model probabilities. As the continuous model dominates the model posterior, the BMA consists mostly of the spike density at .
Kernel BNQD (log BF) 2-step GP () RDD ()
Linear -2.9 3.1e-1 4.5e-3
RBF -2.9 4.2e-1 -
BMA -3.3 - -
Table 3: The discontinuity detection for BNQD, the two-step Gaussian process approach (Branson2019), and the classical RDD approach or the Sicilian smoking ban data. The decision metric is the (logarithm of the) Bayes factor for BNQD, and a -value for the other methods.

The effect size estimates are similar across the two kernels. The discontinuous models estimate, in expectation, ACE risk reductions of 4.5% and 1.6%. However, when we take into account the substantial evidence in favor of the continuous models, these reductions are regularized to 0.2%, and 0.1%. None of the model comparisons support a discontinuity; the respective log Bayes factors are -2.9 and -2.9, and -3.3 for the BMA, indicating moderate to substantial evidence in favor of the continuous model (see Table 3). The two-step GP approach agrees with BNQD, as it cannot find evidence for a discontinuity either. The classical RDD approach does find a discontinuity, but this comes at the cost of the more strict assumptions.

4.3 GeoRDD: Dutch 2017 parliament elections

In 2017, the Dutch parliament elections were held and soon after the results were published online by the Dutch government (KOOP2018). According to Dutch electorate geographer De Voogd, the share of votes that go to populist parties222As this example merely serves to illustrate a two-dimensional quasi-experimental design, we refrain from an extensive discussion of the definition of populism. Instead, we use part of the definition by Mueller2016 and refer to populist parties as those parties that emphasize an alleged chasm between the elite and the general population. In the Dutch 2017 elections, parties that fit this description were PVV, SP, 50Plus and FvD. is different between the Dutch areas north and south of the so-called ‘phantom border’, a line that traditionally divided the catholic south of the Netherlands from the protestant north, and which originated in the Eighty Years’ War (1568–1648) (DeVoogd2016; DeVoogd2017). This border serves as a two-dimensional threshold along which one can apply RD design. This special case of RD design where the assignment threshold is a geographical boundary is referred to as GeoRDD (Rischard2018). Here, we test the hypothesis by De Voogd and demonstrate that BNQD is straightforward to apply a GeoRDD context.

Figure 8: The results for BNQD applied to the Dutch 2017 parliament elections. A. Voting behaviour per municipality, as well as the continuous and discontinuous GP regressions. The historical ‘phantom border’ is used as a two-dimensional threshold, as indicated by the black, dashed line. The top row presents the results of the model comparison using a linear kernel, the bottom row presents the results for an RBF kernel. The rigid linear kernel provides strong evidence for the discontinuous model, but the more flexible RBF kernel provides moderate evidence in favor of a single continuous model. B. The effect sizes according to the discontinuous models. As this is an example of a multivariate regression, the effect sizes are two-dimensional and shown across the border.
Kernel log Bayes factor
Linear 16.6 401.7 418.3
Exponential -10.1 484.0 473.9
RBF -6.0 464.3 458.4
Matérn -10.9 476.9 466.0
BMA -10.1 485.4 475.0
Table 4: The Bayes factors en marginal likelihoods for different kernels, applied to the Dutch 2017 parliament election results.

We apply BNQD using the linear, exponential, RBF and first-order Matérn kernels. The results of the analysis are shown in Fig. 8 and Table 4. Figure 8A shows the map of the Netherlands with the fraction of populist votes per municipality superimposed. In addition, each map shows a hand-drawn approximate phantom border representing the supposed divide in voting behaviour. As the results for the flexible three kernels are visually similar, we show only the linear and exponential kernel results. The top row of Fig. 8A shows the results of a linear kernel, which in two dimensions fits a plane (or two planes for the discontinuous model) to the data. The bottom row shows the results using an exponential kernel. Figure 8B shows the effect sizes conditioned on the discontinuous model, which are distributions at each point of the distribution (the shaded interval indicates one standard deviation around the mean).

If we assume a linear underlying process, there is strong evidence for a discontinuity (), confirming the hypothesis by De Voogd. Visually however, the data do not appear to follow these linear trends strongly. Of the three more flexible models, the first-order Matérn kernel results in the strongest evidence against an effect (). Overall, the continuous model with an exponential kernel has the largest marginal likelihood, indicating that it fits the data best of these four kernels. This causes it to dominate the Bayesian model average Bayes factor as well, which is approximately equal to the Bayes factor when using just the exponential kernel. Note that the effect size shown in Fig. 8B for the exponential kernel is conditioned on the discontinuous model, but this kernel strongly favors a continuous model. The BMA of the effect size is therefore approximately zero across the phantom border.

5 Discussion

While some quasi-experimental designs, such as regression discontinuity design, have been around since the 1960s (Cook2008), recently there has been a renewed interest in this class of methods (Imbens2008; Choi2017; Shadish2002), in particular in epidemiology (Harris2004) and education (Bloom2012). At the same time, researchers from different domains are actively promoting the use of QED (Marinescu2018; Moscoe2015). This has prompted researchers to extend classical QED in several ways. For instance, several authors have proposed to use Bayesian models for designs such as RD (Geneletti2015), DiD (Normington2019), ITS (Freni-Sterrantino2019), single-subject designs (SS; DeVries2013) and stepped-wedge trials (SWT; Cunanan2016). In these studies, parameters such as (linear) regression coefficients are associated with prior probability distributions, and posterior distributions are estimated using Bayesian inference. Compared to classical implementations of such study designs, these approaches provide explicit descriptions of the uncertainty that is associated with their parameter estimates. In contrast to our work however, these methods focus on the estimation of the treatment effect, instead of model comparison.

Other studies have explored non-parametric models for QED, in particular for RD, to alleviate the reliance on a linear model. For example, Hahn2001 propose to use local linear non-parametric regression as an alternative to linear regression. Alternatively, one can use kernel methods that compute a smoothly weighted average of the data points to create an interpolated regression that does not depend on a specific parametric form (Bloom2012). However, in order to infer causality from QED, one assumes that the effect is local around the threshold. That is, the behaviour of the two groups changes is supposed to change sharply around the intervention instead of gradually. In standard RD studies, this locality is controlled via a bandwidth parameter that determines the sensitivity of the detection approach, but, as Bloom2012 points out, kernel methods require sufficient data points around the intervention point and they are sensitive to this bandwidth parameter. In BNQD, the bandwidth is replaced by the length scale hyperparameter of the non-degenerate kernels, which we optimize using the model marginal likelihood (for a review on the similarities between frequentist kernel-based methods and Gaussian processes we refer the interested reader to Kanagawa2018). The length scale regulates how fast the correlations between consecutive points decays as their distance increases, and thus how sensitive BNQD is around the threshold. With an exponential kernel the most rigorous form of locality can be enforced. In this case, the Markov properties of the Gaussian process guarantee that only discontinuities at the intervention threshold are detectable. On the other hand, non-local kernels such as the periodic kernel are vulnerable to spurious effects if the underlying process is non-stationary. Specifically, the presence of change points away from the intervention threshold can lead to false alarms. In this case, BNQD should be performed in a sliding-window fashion in order to ensure that the highest Bayes factor is actually at the intervention threshold.

Most similar to BNQD are the Gaussian process regression discontinuity models by Rischard2018 and Branson2019. In both cases, the authors fit a Gaussian process to the intervention and the control group, and compare the difference in their prediction at the intervention point. The primary difference with our work is that these studies are specific to regression discontinuity designs. The main inferred quantity is the treatment effect , where is the point of intervention and and correspond to the intervened and the controlled processes. Instead, we focus on differences in these processes of multiple orders, not just a straightforward discontinuity. This way, BNQD is applicable to other sorts of quasi-experimental studies, while the approaches mentioned here need to be modified substantially in order to be applied to, say, difference-within-differences design. Similarly, BNQD can be directly applied in semi-randomized data sets, where a threshold is not clearly defined, and extends naturally to the multivariate setting, where defining the effect on the boundary can become problematic. Furthermore, although the cited works use GP regression to fit the data, they use null-hypothesis significance testing to determine the presence of a discontinuity. We remain within the Bayesian paradigm, which enables us to quantify evidence in favor of, or against, an effect using Bayes factors (Wagenmakers2007), and to use Bayesian model averaging (Hoeting1999; Hinne2019). This way we can integrate over different kernel assumptions, and provide effect size estimates that are not conditioned on the discontinuous model, but on all considered models.

In most quasi-experimental designs, the non-random assignment to the control or the intervention group depends on a single variable (the assignment variable) to which some threshold is applied. However, it is possible that this decision is based instead on the interaction of multiple assignment variables (Reardon2012; Papay2011; Wong2013; Choi2017; Choi2018). A special case of multivariate QED is GeoRDD, in which two-dimensional assignment variable represents a spatial location (Rischard2018; Keele2015; Keele2017). Our approach does not assume a univariate threshold to determine the assignment to intervention and control group, and can work on arbitrary complex label functions. This can be a geographical border as in Section 4.3, but also more complex shapes such as, for example, one region versus the rest of a country, or a particular regime of diagnostic variables.

BNQD can be extended in several ways to make it more widely applicable. For instance, we do not currently account for covariates that may serve as confounds for causal inference (Harris2004). However, such covariates can be explicitly taken into account in the regression models, or even be learned from the observations (Kocaoglu2017). Other extensions include the fuzzy application of a threshold function (for instance, a drug may be prescribed according to a certain policy, but that policy is not always adhered to). Another extension is the modelling of partial effects (Choi2017; Choi2018). In our multivariate example (Section 4.3), a single assignment function is used to label each 2D point as above or below the phantom border. Alternatively, a threshold or other decision function can be applied to either of these two dimensions, which can then be used to test whether each dimension individually results in a discontinuity, in addition to the multivariate test. Furthermore, BNQD may be applied to detect the effect of interventions on oscillatory signals, such as heart rate and respiration, by using periodic kernels. In this case a discontinuity reflects changes in for instance the frequency, amplitude or phase of a signal.

Our approach is based on Gaussian process regression, which has a number of computationally attractive features. In particular, with a Gaussian observation model, many of the required computations become analytically tractable. However, via for instance the Laplace approximation, Gaussian process regression can be used in tandem with non-Gaussian observation models (Rasmussen2004). For instance, one observation model could be Poisson to model count data (Adams2009; Williams1998), which is arguably more applicable in the Sicilian smoking ban example, or Bernoulli for binary outcomes.

6 Conclusion

We have presented BNQD, a Bayesian non-parametric approach for causal inference in quasi-experimental designs. Compared to other methods for QED, BNQD estimates a causal effect based on all observations instead of only those close to the threshold. This allows us to capture more subtle effects of an intervention, such as difference-within-differences, using the same framework as for regression discontinuity design. By selecting the appropriate kernel, one has precise control over the type of discontinuity that can be detected. For example, using BNQD with an exponential kernel results in a non-parametric quasi-experimental setup that is only sensitive to discontinuities in the latent process itself, while an RBF kernel can detect discontinuities in any derivative of the latent process.

Researchers wanting to apply BNQD to their data can find our Python implementation on GitHub.


Appendix A Gaussian process kernels

A Gaussian process is completely described by its mean function and its covariance function , known as its kernel, with hyperparameters . Typically, the mean function is simply set to zero unless otherwise specified, which leaves the kernel to describe the similarity between any pair of input points. We describe a number of basic kernels that are used throughout this paper, and refer the reader to e.g. Rasmussen2004 for a more thorough exposition. We distinguish between degenerate and non-degenerate kernels. Of the first class, two kernels are used in the main text:

  • The constant kernel is the simplest possible kernel and captures functions that are straight horizontal lines

  • The linear kernel describes line with arbitrary slope, and is given by


    In the linear kernel, the offset represents the point in the posterior distribution that all samples from the GP pass through (i.e. a point with zero variance). The variance is the variance of the prior bias (i.e. the height of the function at ). Finally, is the output variance.

The following non-degenerate kernels are commonly used:

  • The exponential kernel is given by

  • The radial basis function (RBF; also known as the Gaussian, squared-exponential or exponentiated quadratic kernel) is often used as a default kernel in GP regression. It is given by


    in which represents the length scale of the process and the output variance.

  • The Matérn kernel is actually a class of kernels with a parameter . For , this kernel coincides with the exponential kernel, while for the Matérn kernel converges to the RBF kernel. A common kernel in machine learning is the Matérn kernel, which is given by


Appendix B Computation of Bayes factors

Recall that the Bayes factor for model comparison is given by


Here, the marginal likelihoods integrate out all the potential latent functions , via


Such integration is intractable in many Bayesian models, but one of the attractive properties of Gaussian processes is that this term is available analytically (Rasmussen2004; MacKay1998). However, integrating out the kernel hyperparameters remains intractable and can only be addressed by approximation. Here we make use of a Bayesian Information Criterion (BIC) approach. If the number of observations is much greater than the number of hyperparameters, and we assume that the posterior distribution over hyperparameters to be unimodal, then we can approximate the (log) evidence by


where is obtained by optimizing the marginal likelihood. However, we need to correct for the variance in this distribution which is ignored when using the MAP estimate. This is done in the form of a penalty added to (19) to compensate for the information lost through optimization. This results in


where is the number of observations and the number of hyperparameters. Although more accurate approximations are possible, such as the Laplace approximation (Rasmussen2004) or sequential Monte Carlo (Svensson2015), we consider this out of the scope of the current paper.

It is important that the hyperparameters of the two sub-models within the discontinuous model are the same, as otherwise they may reflect non-stationarities in the estimated process that are not due to the alleged discontinuity. We enforce this by simply optimizing both sub-models separately, and then setting their optimized hyperparameters to the average values of the two.

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description