Optimized Regression Discontinuity Designs*

*We are grateful for helpful comments from Timothy Armstrong, Max Farrell, Michal Kolesár, Christoph Rothe, Cun-Hui Zhang, seminar participants at Berkeley, Stanford and the University of Chicago, as well as the editor and four anonymous referees.


Guido Imbens
imbens@stanford.edu
   Stefan Wager
swager@stanford.edu
Current version July 2019  
Abstract

The increasing popularity of regression discontinuity methods for causal inference in observational studies has led to a proliferation of different estimating strategies, most of which involve first fitting non-parametric regression models on both sides of a treatment assignment boundary and then reporting plug-in estimates for the effect of interest. In applications, however, it is often difficult to tune the non-parametric regressions in a way that is well calibrated for the specific target of inference; for example, the model with the best global in-sample fit may provide poor estimates of the discontinuity parameter. In this paper, we propose an alternative method for estimation and statistical inference in regression discontinuity designs that uses numerical convex optimization to directly obtain the finite-sample-minimax linear estimator for the regression discontinuity parameter, subject to bounds on the second derivative of the conditional response function. Given a bound on the second derivative, our proposed method is fully data-driven, and provides uniform confidence intervals for the regression discontinuity parameter with both discrete and continuous running variables. The method also naturally extends to the case of multiple running variables.

Keywords: Convex optimization, discrete running variable, multiple running variables, partial identification, uniform asymptotic inference.

1 Introduction

Regression discontinuity designs, first developed in the 1960s (Thistlethwaite and Campbell, 1960), often allow for simple and transparent identification of treatment effects from observational data (Hahn, Todd, and Van der Klaauw, 2001; Imbens and Lemieux, 2008; Trochim, 1984), and their statistical properties have been the subject of recent interest (Armstrong and Kolesár, 2018; Calonico, Cattaneo, and Titiunik, 2014; Cheng, Fan, and Marron, 1997; Kolesár and Rothe, 2018). The sharp regression discontinuity design assumes a treatment assignment generated by a running variable X_i, such that individuals get treated if and only if X_i ∈ A for some set A. For example, in epidemiology, X_i could be a severity index and patients are assigned a medical intervention whenever X_i exceeds some threshold c (i.e., A = {x : x ≥ c}); or, in political science, X_i could denote the latitude and longitude of a household, and A could be an administrative region that has enacted a specific policy.

Given appropriate assumptions, we can identify a causal effect by comparing subjects with X_i barely falling within the treatment region A to those with X_i just outside of it. Variants of this identification strategy have proven to be useful in education (Angrist and Lavy, 1999; Black, 1999; Jacob and Lefgren, 2004), political science (Caughey and Sekhon, 2011; Keele and Titiunik, 2014; Lee, 2008), criminal justice (Berk and Rauma, 1983), program evaluation (Lalive, 2008; Ludwig and Miller, 2007), and other areas. As discussed in more detail below, standard methods for inference in the regression discontinuity design rely on plug-in estimates from local linear regression.

In this paper, motivated by a large literature on minimax linear estimation (Armstrong and Kolesár, 2018; Cai and Low, 2003; Donoho, 1994; Donoho and Liu, 1991; Ibragimov and Khas’minskii, 1985; Johnstone, 2011; Juditsky and Nemirovski, 2009), we study an alternative approach based on directly minimizing finite sample error bounds via numerical optimization, under an assumption that the second derivative of the response surface is bounded away from the boundary of the treatment region. (Footnote 2: Of these papers, our work is most closely related to that of Armstrong and Kolesár (2018), who explicitly consider minimax linear estimation in the regression discontinuity model for an “approximately linear” model in the sense of Sacks and Ylvisaker (1978) that places restrictions on second differences relative to the response surface at the threshold. In contrast, we assume bounded second derivatives away from the threshold. An advantage of their approach is that it allows for a closed-form solution. However, a disadvantage is that they allow for discontinuities in the response surface away from the threshold, which implies that, given the same value for our bound on the second derivative and their bound on second differences, our confidence intervals can be substantially shorter (moreover, allowing for discontinuities in the response surface does not seem conceptually attractive, given that the assumption of continuity of the conditional expectation at the threshold is fundamental to the regression discontinuity design). We discuss this comparison further in Section 1.3.) This approach has several advantages relative to local regression. Our estimator is well defined regardless of the shape of the treatment region A, whether it be a half line as in the standard univariate regression discontinuity specification or an oddly shaped region as might appear with a geographic regression discontinuity; moreover, our inference is asymptotically valid for both discrete and continuous running variables. Finally, even with univariate designs, our approach strictly dominates local linear regression in terms of minimax mean-squared error.

For simplicity of exposition, we start by presenting our method in the context of classical univariate regression discontinuity designs with a single treatment cutoff, i.e., with a scalar running variable X_i and a treatment region of the form A = {x : x ≥ c}; a solution to the more general problem will then follow by direct extension. A software implementation, optrdd for R, is available on CRAN and from github.com/swager/optrdd.
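
As a quick illustration of the intended workflow, the sketch below simulates a simple univariate design with a discrete running variable and fits it with optrdd. This is a hedged example: the argument names (max.second.derivative for the curvature bound, estimation.point for the threshold) follow our reading of the package documentation and should be checked against the installed version, and the data-generating process is purely illustrative.

```r
# A minimal, hedged sketch of the univariate workflow; argument names follow
# our reading of the optrdd documentation and may differ across versions.
library(optrdd)

set.seed(1)
n <- 4000; threshold <- 0
X <- round(runif(n, -1, 1), 1)                        # discrete running variable
W <- as.numeric(X >= threshold)                       # sharp treatment assignment
Y <- 0.4 * W + 1 / (1 + exp(2 * X)) + 0.2 * rnorm(n)  # illustrative outcome model

fit <- optrdd(X = X, Y = Y, W = W,
              max.second.derivative = 0.5,            # curvature bound B (assumed value)
              estimation.point = threshold)
print(fit)  # point estimate with a bias-aware confidence interval
plot(fit)   # optimized weights gamma_i as a function of X_i
```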

1.1 Optimized Inference with Univariate Discontinuities

We have access to n independent pairs (X_i, Y_i), where X_i is the running variable and Y_i is our outcome of interest; the treatment is assigned as W_i = 1({X_i ≥ c}). Following the potential outcomes model (Imbens and Rubin, 2015; Neyman, 1923; Rubin, 1974), we posit potential outcomes Y_i(0) and Y_i(1), corresponding to the outcome subject i would have experienced had they received treatment w = 0 or w = 1, and define the conditional average treatment effect τ(x) in terms of the conditional response functions μ_(w)(x):

τ(x) = μ_(1)(x) − μ_(0)(x),  where μ_(w)(x) = E[ Y_i(w) | X_i = x ].    (1)

Provided the functions μ_(0)(x) and μ_(1)(x) are both continuous at c, the regression discontinuity identifies the conditional average treatment effect at the threshold c (Footnote 3: In the fuzzy regression discontinuity design, where the probability of receiving the treatment changes discontinuously at c but not necessarily from zero to one, the estimand can be written as the ratio of two such differences. The issues we address in this paper also arise in that setting, and the present discussion extends naturally to it; see Section 5 for a discussion.),

τ_c = τ(c) = μ_(1)(c) − μ_(0)(c) = lim_{x↓c} E[ Y_i | X_i = x ] − lim_{x↑c} E[ Y_i | X_i = x ].    (2)

Given this setup, local linear regression is a popular strategy for estimating τ_c (Hahn et al., 2001; Porter, 2003):

τ̂_c = argmin_τ min_{a, β_(0), β_(1)} { Σ_{i=1}^n K( (X_i − c) / h_n ) ( Y_i − a − W_i τ − β_(W_i) (X_i − c) )² },    (3)

where K(·) is some weighting function, h_n is a bandwidth, W_i = 1({X_i ≥ c}), and a and β_(0), β_(1) are nuisance parameters. When we do not observe data right at the boundary (e.g., when X_i has discrete support), then τ_c is not point identified; however, given smoothness assumptions on μ_(w)(x), Kolesár and Rothe (2018) propose an approach to local linear regression that can still be used to construct partial identification intervals for τ_c in the sense of Imbens and Manski (2004) (see Section 2.1 for a discussion).

The behavior of regression discontinuity estimation via local linear regression is fairly well understood. When the running variable is continuous (i.e., has a continuous positive density at ) and is twice differentiable with a bounded second derivative in a neighborhood of , Cheng, Fan, and Marron (1997) show that the triangular kernel minimizes worst-case asymptotic mean-squared error among all possible choices of , Imbens and Kalyanaraman (2012) provide a data-adaptive choice of to minimize the mean-squared error of the resulting estimator, and Calonico, Cattaneo, and Titiunik (2014) propose a method for removing bias effects due to the curvature of to allow for asymptotically unbiased estimation. Meanwhile, given a second-derivative bound , Armstrong and Kolesár (2018) and Kolesár and Rothe (2018) construct confidence intervals centered at the local linear estimator that attain uniform asymptotic coverage, even when the running variable may be discrete.

Despite its ubiquity, however, local linear regression still has some shortfalls. First of all, under the bounded second derivative assumption often used to justify local linear regression (i.e., that μ_(w)(x) is twice differentiable with |μ''_(w)(x)| ≤ B in a neighborhood of c), local linear regression is not the minimax optimal linear estimator for τ_c—even with a continuous running variable. Second, and perhaps even more importantly, all the motivating theory for local linear regression relies on X_i having a continuous distribution; however, in practice, X_i often has a discrete distribution with a modest number of points of support. When the running variable is discrete, there is no compelling reason to expect local linear regression to be particularly effective in estimating the causal effect of interest. (Footnote 4: One practical inconvenience that can arise in local linear regression with discrete running variables is that, if we use a data-driven rule to pick the bandwidth (e.g., the one of Imbens and Kalyanaraman (2012)), we may end up with no data inside the specified range (i.e., there may be no observations with |X_i − c| ≤ h_n); the practitioner is then forced to select a different bandwidth ad hoc.) Ideally, methods for regression discontinuity analysis should be fully data-driven, even when X_i is discrete. In spite of these limitations, local linear regression is still the method of choice, largely because of its intuitive appeal.

As discussed above, the goal of this paper is to show that we can systematically do better. Regardless of the shape of the kernel in (3), local linear regression yields a linear estimator (Footnote 5: We note the unfortunate terminological overlap between “local linear regression” estimators of τ_c and “linear” estimators of the form Σ_i γ_i Y_i. The word linear in these two contexts refers to different things. All “local linear regression” estimators are “linear” in the latter sense, but not vice versa.) for τ_c, i.e., one of the form τ̂_c = Σ_{i=1}^n γ_i Y_i for weights γ_i that depend only on the distances X_i − c. Here, we find that if we are willing to rely on numerical optimization tools, then minimax linear estimation of τ_c, i.e., estimation that is minimax among all estimators of the form Σ_i γ_i Y_i conditionally on X_1, …, X_n, is both simple and methodologically transparent under natural assumptions on the regularity of μ_(w)(x). If we know that Var[Y_i | X_i] = σ² and that |μ''_(w)(x)| ≤ B, we propose estimating τ_c as follows:

τ̂_c = Σ_{i=1}^n γ̂_i Y_i,   γ̂ = argmin_γ { σ² Σ_{i=1}^n γ_i²  +  sup_{ |μ''_(0)|, |μ''_(1)| ≤ B } ( Σ_{i=1}^n γ_i μ_(W_i)(X_i) − ( μ_(1)(c) − μ_(0)(c) ) )² }.    (4)

Because this estimator is minimax among the class of linear estimators, it is at least as accurate as any local linear regression estimator in a minimax sense over all problems with Var[Y_i | X_i] = σ² and |μ''_(w)(x)| ≤ B. For further discussion of related estimators, see Cai and Low (2003), Donoho (1994), Donoho and Liu (1991), and Juditsky and Nemirovski (2009).

When X_i has a discrete distribution, the parameter τ_c is usually not point identified because there may not be any observations in a small neighborhood of c. However, we can get meaningful partial identification of τ_c thanks to our bounds on the second derivative of μ_(w)(x). Moreover, because our approach controls for bias in finite samples, the estimator (4) is still justified in the partially identified setting and, as discussed further in Section 2.1, provides valid confidence intervals for τ_c in the sense of Imbens and Manski (2004). We view the fact that our estimator can seamlessly move between the point and partially identified settings as an important feature.

To motivate the above estimator qualitatively, note that the first term in the minimization problem corresponds to the conditional variance of τ̂_c given X_1, …, X_n, while the second term is the worst-case conditional squared bias given that |μ''_(w)(x)| ≤ B; thus, given our regularity assumptions, the estimator minimizes the worst-case conditional mean-squared error among all linear estimators. Because no constraints are placed on μ_(0)(c), μ'_(0)(c), μ_(1)(c) or μ'_(1)(c), the optimization in (4) also automatically enforces the constraints Σ_{W_i=1} γ_i = 1, Σ_{W_i=0} γ_i = −1, Σ_{W_i=1} γ_i (X_i − c) = 0, and Σ_{W_i=0} γ_i (X_i − c) = 0. (Footnote 6: At values of γ for which these constraints are not all satisfied, we can choose μ_(0) and μ_(1) with second derivative bounded by B so as to make the conditional bias arbitrarily bad. Thus, the solution to (4) must satisfy the constraints.) This is a convex program, and can be efficiently solved using readily available software described in, e.g., Boyd and Vandenberghe (2004).
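
To make the bias term in (4) concrete, the sketch below evaluates the worst-case conditional bias of an arbitrary linear weighting γ that satisfies the four constraints above (weights summing to +1 and −1, and zero first moments on each side). It uses the representation of the bias as an integral of μ''_(w) against ψ_w(t) = Σ_{W_i = w} γ_i (X_i − t)_+ (and its mirror image on the control side), whose supremum over |μ''_(w)| ≤ B equals B ∫ |ψ_w(t)| dt. This is our own illustration, not the optimization routine used by optrdd.

```r
# Worst-case |conditional bias| of tau.hat = sum(gamma * Y) under |mu''| <= B,
# assuming treated units have x >= threshold and gamma satisfies the four
# constraints discussed in the text (otherwise the worst case is unbounded).
worst_case_bias <- function(x, w, gamma, threshold, B, ngrid = 10000) {
  t.plus  <- seq(threshold, max(x[w == 1]), length.out = ngrid)
  t.minus <- seq(min(x[w == 0]), threshold, length.out = ngrid)
  psi1 <- sapply(t.plus,  function(t) sum(gamma[w == 1] * pmax(x[w == 1] - t, 0)))
  psi0 <- sapply(t.minus, function(t) sum(gamma[w == 0] * pmax(t - x[w == 0], 0)))
  # trapezoidal rule for the integral of |psi| on each side of the threshold
  trap <- function(tt, ff) sum(diff(tt) * (abs(ff[-1]) + abs(ff[-length(ff)])) / 2)
  B * (trap(t.plus, psi1) + trap(t.minus, psi0))
}
```

Plugging in the implicit weights of any local linear regression fit lets one compare its worst-case bias directly to that of the optimized weights.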

Although this estimator depends explicitly on knowledge of σ² and B, we note that all practical methods for estimation in the regression discontinuity model, including Calonico et al. (2014), Imbens and Kalyanaraman (2012) and Kolesár and Rothe (2018), require estimating related quantities in order to tune the algorithm. Then, once estimators for these parameters have been specified, the procedure (4) is fully automatic—in particular, there is no need to ask whether the running variable is discrete or continuous, as the optimization is conditional on the realized X_i—whereas the baseline procedures still have other choices to make, e.g., what weight function to use, or whether to debias the resulting estimator. See Sections 3 and 4 for discussions on how to choose B in practice.

Figure 1: Optimized regression discontinuity design obtained via (4), for different values of and two different distributions. The red dots show the learned weighting function in a case where the running variable is discrete, and different support points are sampled with different probabilities (the probability mass function is shown in the left panel of Figure 2); the blue line shows for standard Gaussian . We plot , motivated by the fact that, with a continuous running variable, the optimal bandwidth for local linear regression scales as . The weights were computed with and .

Figure 1 compares the weights obtained via (4) in two different settings: one with a discrete, asymmetric running variable depicted in the left panel of Figure 2, and the other with a standard Gaussian running variable. We see that, for , the resulting weighting functions look fairly similar, and are also comparable to the implicit weighting function generated by local linear regression with a triangular kernel. However, as grows and the discreteness becomes more severe, our method changes both the shape and the scale of the weights, and the discrepancy between the optimized weighting schemes for discrete versus continuous running variables becomes more pronounced.

Figure 2: Left panel: Probability mass function underlying the example in Figure 1 and the right panel of the present figure. Right panel: Comparison of our procedure (4) with local linear regression, both using a rectangular () and triangular () kernel. We compare methods in terms of their worst-case mean-squared error conditional on ; for local linear regression, we always chose the bandwidth to make this quantity as small as possible. We depict performance relative to our estimator (4).

In the right panel of Figure 2, we also compare the worst-case conditional mean-squared error of our method relative to that of optimally tuned local linear regression, both with a rectangular and a triangular kernel. For the smallest sample size we consider, the discreteness of the running variable has a fairly mild effect on estimation and—as one might have expected—the triangular kernel is noticeably better than the rectangular kernel, while our method is slightly better than the triangular kernel. However, as the sample size increases, the performance of local linear regression relative to our method ebbs and flows rather unpredictably. (Footnote 7: As a matter of intellectual curiosity, it is intriguing to ask whether there exist discrete distributions for which the rectangular kernel may work substantially better than the triangular kernel, or whether additional algorithmic tweaks—such as using different bandwidths on different sides of the threshold—may have helped (in the above example, we used the same bandwidth for local linear regression on both sides of the boundary). However, from a practical perspective, the estimator (4) removes the need to consider such questions in applied data analysis, and automatically adapts to the structure of the data at hand.)

1.2 Optimized Inference with Generic Discontinuities

The methods presented above extend naturally to the general case, where the running variable X_i may be multivariate and the treatment region A is unrestricted. The problem of regression discontinuity inference with multiple running variables is considerably richer than the corresponding problem with a single running variable, because an investigator could now plausibly hope to identify many different treatment effects along the boundary of the treated region A. Most of the existing literature on this setup, including Papay et al. (2011), Reardon and Robinson (2012) and Wong et al. (2013), has focused on these questions of identification, while using some form of local linear regression for estimation.

In the multivariate case, however, questions about how to tune local linear regression are exacerbated, as the problems of choosing the kernel function and the bandwidth are now multivariate. Perhaps for this reason, it is still popular to use univariate methods to estimate treatment effects in the multivariate setting by, e.g., using shortest distance to the boundary of the treatment region as a univariate running variable (Black, 1999), or only considering a subset of the data where univariate methods are appropriate (Jacob and Lefgren, 2004; Matsudaira, 2008).

Here, we show how our optimization-based method can be used to side-step the problem of choosing a multivariate kernel function by hand. In addition to providing a simple-to-apply algorithm, our method lets us explicitly account for the curvature of the mean-response function for statistical inference, thus strengthening formal guarantees relative to prior work.

Relative to the univariate case, the multivariate case has two additional subtleties we need to address. First, in (4) it is natural to impose the constraint |μ''_(w)(x)| ≤ B to ensure smoothness; in the multivariate case, however, we have more choices to make. For example, do we constrain μ_(w)(x) to be an additive function, or do we allow for interactions? Here, we opt for the more flexible specification, and simply require that ‖∇²μ_(w)(x)‖ ≤ B, where ‖·‖ denotes the operator norm (i.e., the largest absolute eigenvalue of the second-derivative matrix).

Moreover, as emphasized by Papay et al. (2011), whereas the univariate design only enables us to identify the conditional average treatment effect at the threshold c, the multivariate design enables us to potentially identify a larger family of treatment effect functionals. Here, we focus on the following two causal estimands. First, writing x_0 for a focal point of interest, let τ̂(x_0) = Σ_i γ̂_i Y_i with weights γ̂ defined by

(5)

denote an estimator for the conditional average treatment effect at x_0. The upside of this approach is that it gives us an estimand that is easy to interpret; the downside is that, when curvature is non-negligible, (5) can effectively only make use of data near the specified focal point x_0, thus resulting in relatively long confidence intervals.

In order to potentially improve precision, we also study weighted conditional average treatment effect estimation, with weights greedily chosen such as to make the inference as precise as possible: In the spirit of Crump et al. (2009), Li et al. (2017) or Robins et al. (2008), we consider an estimator of the same linear form, with weights chosen via

(6)

In other words, we seek to pick weights γ_i that are nearly immune to bias due to curvature of the baseline response surface μ_(0)(x). By construction, this estimator satisfies

(7)

Because the treated weights sum to one, we see that the resulting estimand is in fact a weighted average of the conditional average treatment effect function τ(x) over the treated sample. If we ignored the curvature of τ(x), we could interpret the estimate as an estimate for the conditional average treatment effect at the γ-weighted mean of the treated X_i.

In some cases, it is helpful to consider other interpretations of the estimand underlying (6). If we are willing to assume a constant treatment effect, τ(x) = τ, then the weighted estimand equals τ, and (6) yields the minimax linear estimator for τ. Relatedly, we can always use the confidence intervals from Section 2.1 built around the weighted estimator to test the global null hypothesis τ(x) = 0 for all x.

Figure 3: Weighting functions for the geographic regression discontinuity example of Keele and Titiunik (2014), where points depict potential voters within a single school district and the solid black line is a media market boundary. The left panel depicts an optimal weighting function for the conditional average treatment effect at the point marked with a boldface- as in (5), while the right one allows for a weighted treatment effect as in (6) or, equivalently, shows the optimal weighting function for a constant effect. Households below the line are treated (i.e., in the Philadelphia media market), whereas those above it are controls (i.e., in the New York media market). The color of the point depicts the -weight: Red points receive positive weight and blue points receive negative weight, while the shading indicates absolute value of the weight (darker is larger).

To gain intuition for the multivariate version of our method, we outline a simple example building on the work of Keele and Titiunik (2014) on the effect of television advertising on voter turnout in presidential elections. To estimate this effect, Keele and Titiunik (2014) examine a school district in New Jersey, half of which belongs to the Philadelphia media market and the other half to the New York media market. Before the 2008 presidential elections, the Philadelphia half was subject to heavy campaign advertising whereas the New York half was not, thus creating a natural experiment for the effect of television advertising assuming the media market boundary didn’t coincide with other major boundaries within the school district. Keele and Titiunik (2014) use this identification strategy to build a regression discontinuity design, comparing sets of households straddling the media market boundary.

However, despite the multivariate identification strategy, Keele and Titiunik (2014) then reduce the problem to a univariate regression discontinuity problem for estimation: They first compute Euclidean distances to a focal point , and then use as a univariate running variable. In contrast, our approach allows for transparent inference without needing to rely on such a reduction. Figure 3 depicts -weights generated by our optimized approach; the resulting treatment effect estimator is then . Qualitatively, we replicate the “no measurable effect” finding of Keele and Titiunik (2014), while directly and uniformly controlling for spatial curvature effects. We discuss details, including placebo diagnostics and the choice of tuning parameter, in Section 4.2.

We also see that, at least here, standard heuristics used to reduce the multivariate regression discontinuity problem to a univariate one are not sharp. In the setup of the left panel of Figure 3, where we seek to estimate the treatment effect at a focal point , some treated points due west of get a positive weight, whereas points the same distance south from get a mildly negative weight, thus violating the heuristic of Keele and Titiunik (2014) that weights should only depend on . Meanwhile, we can compare the approach in the right panel of Figure 3, where we allow for averaging of treatment effects along the boundary, to the popular heuristic of using shortest distance to the boundary of the treatment region as a univariate running variable (Black, 1999). But this reduction again does not capture the behavior of our optimized estimator: There are some points at the eastern edge of the treated region that are very close to the boundary, but get essentially zero weight (presumably because there are no nearby units on the control side of the boundary).

1.3 Related Work

The idea of constructing estimators of the type (4) that are minimax with respect to a regularity class for the underlying data-generating process has a long history in statistics. In early work, Legostaeva and Shiryaev (1971) and Sacks and Ylvisaker (1978) independently studied inference in “almost” linear models that arise from taking a Taylor expansion around a point; see also Cheng et al. (1997). For a broader discussion of minimax linear estimation over non-parametric function classes, see Cai and Low (2003), Donoho (1994), Ibragimov and Khas’minskii (1985), Johnstone (2011), Juditsky and Nemirovski (2009), and references therein. An important result in this literature is that, for many problems of interest, minimax linear estimators are within a small explicit constant of being minimax among all estimators (Donoho and Liu, 1991).

Armstrong and Kolesár (2018) apply these methods to regression discontinuity designs, resulting in an estimator of the form (4), except with the worst-case bias taken over a different function class (Footnote 8: Armstrong and Kolesár (2018) also consider a more general setting where we assume accuracy of a higher-order Taylor expansion of μ_(w)(x) around c; and, in fact, our method also extends to this setting. Here, however, we focus on second-derivative bounds, which are by far the most common in applications.):

(8)

Now, although this class of functions is cosmetically quite similar to the bounded-second-derivative class used in (4), we note that the class of functions allowed for in (8) is substantially larger, even if the value of B is the same. This is because the functions underlying the above weighting scheme need not be continuous, and can in fact have jumps whose magnitude grows with the distance from the threshold. Given that the key assumption underlying regression discontinuity designs is continuity of the conditional means of the potential outcomes at the threshold for the running variable, it would appear to be reasonable to impose continuity away from the threshold as well. Allowing for jumps through the condition (8) can make the resulting confidence intervals for τ_c substantially larger than they are under the smoothness condition with bounded second derivatives. One key motivation for the weighting scheme (8) rather than our proposed one (4) appears to be that the optimization problem induced by (8) is substantially easier, and allows for closed-form solutions for the weights. Conversely, we are aware of no closed-form solution for (4), and instead need to rely on numerical convex optimization.
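
To make the comparison concrete, one way to write the two univariate regularity classes (our notation; the cited papers may normalize constants differently) is

\[
\mathcal{F}_{2}(B) = \big\{ \mu : |\mu''(x)| \le B \text{ for all } x \big\},
\qquad
\mathcal{F}_{T}(B) = \big\{ \mu : |\mu(x) - \mu(c) - \mu'(c)(x - c)| \le \tfrac{B}{2}(x - c)^2 \text{ for all } x \big\}.
\]

By Taylor's theorem, every function in F_2(B) lies in F_T(B), but the reverse inclusion fails: a function in F_T(B) may jump at any x ≠ c by as much as B(x − c)², which is what drives the difference in interval length discussed above.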

In the special case where the running variable is assumed to have a continuous density around the threshold , there has been a considerable number of recent proposals for asymptotic confidence intervals while imposing smoothness assumptions on . Calonico, Cattaneo, and Titiunik (2014) propose a bias correction to the local linear regression estimator that allows for valid inference, and Calonico, Cattaneo, and Farrell (2018) provide further evidence that such bias corrections may be preferable to undersmoothing. Meanwhile, Armstrong and Kolesár (2016) show that when is twice differentiable and has a continuous density around , we can use local linear regression with a bandwidth chosen to optimize mean-squared error as the basis for bias-adjusted confidence intervals, provided we inflate confidence intervals by an appropriate, universal constant (e.g., to build 95% confidence intervals, one should use a critical threshold of 2.18 instead of 1.96). Gao (2017) characterizes the asymptotically optimal kernel for the regression discontinuity parameter under the bounded second derivative assumption with a continuous running variable. As discussed above, the value of our approach relative to this literature is that we allow for considerably more generality in the specification of the regression discontinuity design: the running variable may be discrete and/or multivariate, and the treatment boundary may be irregularly shaped.

Optimal inference with multiple running variables is less developed than in the univariate case. Papay et al. (2011) and Reardon and Robinson (2012) study local linear regression with a “small” bandwidth, but do not account for finite sample bias due to curvature. Zajonc (2012) extends the analysis of Imbens and Kalyanaraman (2012) to the multivariate case, and studies optimal bandwidth selection for continuous running variables given second derivative bounds; the inference, however, again requires undersmoothing. Keele, Titiunik, and Zubizarreta (2015) consider an approach to geographic regression discontinuity designs based on matching. To our knowledge, the approach we present below is the first to allow for uniform, bias-adjusted inference in the multivariate regression discontinuity setting.

Finally, although local methods for inference in the regression discontinuity design have desirable theoretical properties, many practitioners also seek to estimate τ_c by fitting μ_(w)(x) with a global polynomial expansion along with a jump at c; see Lee and Lemieux (2010) for a review and examples. However, as argued by Gelman and Imbens (2017), this approach is not recommended, as the model with the best in-sample fit may provide poor estimates of the discontinuity parameter. For example, high-order polynomials may give large influence to samples for which X_i is far from the decision boundary, and thus lead to unreliable performance.

Another approach to regression discontinuity designs (including in the discrete case) builds on randomization inference; see Cattaneo, Frandsen, and Titiunik (2015), Cattaneo, Idrobo, and Titiunik (2017) and Li, Mattei, and Mealli (2015) for a discussion. The problem of specification testing for regression discontinuity designs is considered by Cattaneo, Jansson, and Ma (2016), Frandsen (2017) and McCrary (2008).

2 Formal Considerations

2.1 Uniform Asymptotic Inference

Our main result verifies that optimized designs can be used for valid asymptotic inference about τ(c). We here consider the problem of estimating a conditional average treatment effect at a point as in (4) or (5); similar arguments extend directly to the averaging case as in (6). Following, e.g., Robins and van der Vaart (2006), we seek confidence intervals Î_α that attain uniform coverage over the whole regularity set under consideration:

liminf_{n→∞}  inf_{ μ_(0), μ_(1) : |μ''_(w)| ≤ B }  P[ τ(c) ∈ Î_α ]  ≥  1 − α.    (9)

As in Armstrong and Kolesár (2018), our approach to building such confidence intervals relies on an explicit characterization of the bias of the estimator rather than on undersmoothing. Our key result is as follows.

Theorem 1.

Suppose that we have a moment bound on a conditional moment of the residuals Y_i − μ_(W_i)(X_i) of order strictly greater than two, holding uniformly over all values of X_i, for some exponent and constant. Suppose, moreover, that the conditional variance of Y_i is bounded from below by a deterministic value, and that none of the weights γ̂_i derived in (4) or (5) dominates all the others, i.e., (Footnote 9: The bound on the relative contribution of any single weight γ̂_i is needed to obtain a Gaussian limit distribution for the estimator. In related literature, Armstrong and Kolesár (2018) follow Donoho (1994), and side-step this issue by assuming Gaussian errors Y_i − μ_(W_i)(X_i), in which case no central limit theorem is needed. Conversely, Athey, Imbens, and Wager (2018) adopt an approach more similar to ours, and explicitly bound the γ_i from above during the optimization.)

max_{1≤i≤n} γ̂_i²  /  Σ_{j=1}^n γ̂_j²  →  0.    (10)

Then, our estimator τ̂ from (4) or (5) is asymptotically Gaussian,

( τ̂ − b_n − τ ) / s_n  ⇒  N(0, 1),    (11)

where b_n = E[ τ̂ | X_1, …, X_n ] − τ denotes the conditional bias of τ̂ for the target estimand τ, and s_n² = Σ_i γ̂_i² Var[ Y_i | X_i ].

Now, in solving the optimization problem (4), we also obtain an explicit bound t̂ on the conditional bias, |b_n| ≤ t̂, and so can use the following natural construction to obtain confidence intervals for τ (Imbens and Manski, 2004),

Î_α = ( τ̂ − ℓ_α,  τ̂ + ℓ_α ),   ℓ_α = min{ ℓ : P[ | t̂ + ŝ_n Z | ≤ ℓ ] ≥ 1 − α,  Z ~ N(0, 1) },    (12)

where α is the significance level and ŝ_n is an estimate of s_n. These confidence intervals are asymptotically uniformly valid in the sense of (9), i.e., for any tolerance, there is a threshold sample size beyond which the confidence intervals (12) achieve coverage within that tolerance of the nominal level for any functions μ_(w)(x) in our regularity class.
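
The half-length of the interval in (12) can be computed numerically from the bias bound and the standard error alone; the sketch below is a self-contained illustration of this calculation (the helper name and defaults are ours, not part of the optrdd interface).

```r
# Bias-aware confidence interval: find the half-length l such that
# P(|max.bias + std.err * Z| <= l) = 1 - alpha for Z ~ N(0, 1), the worst case
# over all conditional biases bounded by max.bias in absolute value.
bias_aware_ci <- function(estimate, max.bias, std.err, alpha = 0.05) {
  coverage.gap <- function(l) {
    pnorm((l - max.bias) / std.err) - pnorm((-l - max.bias) / std.err) - (1 - alpha)
  }
  half.length <- uniroot(coverage.gap,
                         lower = std.err * qnorm(1 - alpha),
                         upper = max.bias + std.err * qnorm(1 - alpha / 2),
                         extendInt = "upX")$root
  c(lower = estimate - half.length, upper = estimate + half.length)
}

# Example: an estimate of 0.10 with worst-case bias 0.02 and standard error 0.03.
bias_aware_ci(0.10, max.bias = 0.02, std.err = 0.03)
```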

Finally, whenever X_i does not have support arbitrarily close to c, e.g., in the case where X_i has a discrete distribution, the parameter τ_c is not point identified. Rather, even with infinite data, the strongest statement we could make is that

τ(c) ∈ { μ_(1)(c) − μ_(0)(c)  :  |μ''_(w)(x)| ≤ B,  and μ_(w)(x) = E[ Y_i(w) | X_i = x ] for all x ∈ supp(X) },    (13)

where supp(X) denotes the support of X_i. In this case, our confidence intervals (12) remain valid for τ(c); however, they may not cover the whole optimal identification interval. In partially identified settings, this type of confidence interval (i.e., one that covers the parameter of interest but not necessarily the whole identification interval) is advocated by Imbens and Manski (2004). From the perspective of the practitioner, an advantage of our approach is that intervals for τ(c) have the same interpretation whether or not τ(c) is point identified, i.e., uniformly in large samples, τ(c) will be covered with probability at least 1 − α. Then, asymptotically, the intervals (12) converge to a point if and only if τ(c) is point identified. For a further discussion of regression discontinuity inference with discrete running variables, see Kolesár and Rothe (2018).

2.2 Implementation via Convex Optimization

In our presentation so far, we have discussed several non-parametric convex optimization problems, and argued that solving them is feasible given advances in the numerical optimization literature over the past few decades (e.g., Boyd and Vandenberghe, 2004). Here, we present a concrete solution strategy via quadratic programming over a discrete grid, and show that the resulting discrete solutions converge uniformly to the continuous solution as the grid size becomes small. (Footnote 10: In the case where X_i is univariate, the resulting optimization problem is a classical one, and arguments made by Karlin (1973) imply that the weights can be written as γ̂_i = g(X_i) for a perfect spline g; our proposed optimization strategy reflects this fact. However, in the multivariate case, we are not aware of a similarly simple characterization.)

To do so, we start by writing the optimization problems underlying (4), (5) and (6) in a unified form. Given a specified focal point x_0, we seek to solve

(14)

where the first ingredient is the treatment assignment function, and a second, toggle parameter lets us switch between different problem types: one setting recovers the CATE estimator at x_0 as in (5), whereas another recovers the optimally weighted CATE estimator as in (6).

To further characterize the solution to this problem, we can use Slater’s constraint qualification (e.g., Ponstein, 2004, Theorem 3.11.2) to verify that strong duality holds, and that the optimum of (14) matches the optimum of the following problem:

(15)

where d is the number of running variables (i.e., the dimension of X_i). Here, we also implicitly used von Neumann’s minimax theorem to move the maximization over the mean functions outside the statement.

The advantage of this dual representation is that, by examining first-order conditions, we can analytically solve for the weights and part of the dual variables in the dual objective, e.g.,

(16)

where the quantities on the right-hand side are evaluated at the maximizers of (15). Carrying out the substitution results in a more tractable optimization problem over a space of twice differentiable functions, along with a finite number of Lagrange parameters:

(17)

where the variables in the above problem correspond to those appearing in (15), and we can recover our weights of interest from the solution. The upshot of these manipulations is that (17) can be approximated via a finite-dimensional quadratic program. In our software implementation optrdd, we use this type of finite-dimensional approximation to obtain the weights γ̂, following the construction described in the proof of Proposition 2.

Proposition 2.

Suppose that the running variables belong to some compact, convex set. Then, for any tolerance level η > 0, there exists a finite-dimensional quadratic program that recovers the solution to (17) with error at most η.
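
For intuition, one standard way to arrive at such a finite-dimensional program (a sketch in our notation; the construction in the proof of Proposition 2 may differ in its details) is to represent each μ_(w) by its values on a grid x_1 < ⋯ < x_m with spacing h, and to replace the curvature bound by second-difference constraints,

\[
\left| \frac{\mu_{(w)}(x_{j+1}) - 2\,\mu_{(w)}(x_j) + \mu_{(w)}(x_{j-1})}{h^2} \right| \le B,
\qquad j = 2, \dots, m - 1,
\]

so that the inner maximization over μ_(w) becomes a linear program in the grid values, and the overall saddle-point problem can then be solved as a quadratic program.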

2.3 Minimizing Confidence Interval Length

As formulated in (4), our estimator seeks to minimize the worst-case mean-squared error over the specified bounded-second-derivative class. However, in some applications, we may be more interested in making the confidence intervals (12) as short as possible; and our approach can easily be adapted for this objective. To do so, consider the minimization objective in (14). Writing t for the worst-case bias and s for the standard deviation of the estimator, we see that both the worst-case mean-squared error, t² + s², and the confidence interval length in (12) are monotone increasing functions of t and s; the only difference is in how they weight these two quantities at the optimum.

Now, to derive the full Pareto frontier of pairs (t, s), we can simply re-run (14) with one of the two terms in the objective re-weighted by a constant factor. A practitioner wanting to minimize the length of confidence intervals could consider computing this whole solution path to (14), and then using the weighting that yields the shortest intervals; this construction provides minimax linear fixed-length confidence intervals (Donoho, 1994). Since this procedure never looks at the responses Y_i, the inferential guarantees for the resulting confidence intervals remain valid.

In our applications, however, we did not find a meaningful gain from optimizing confidence interval length directly instead of just minimizing worst-case mean-squared error as in (14), and so did not pursue this line of investigation further. This observation is in line with the analytic results of Armstrong and Kolesár (2016) who showed that, when X_i has a continuous density and μ_(w)(x) is twice differentiable, using the mean-squared error optimal bandwidth for local linear regression is over 99% efficient relative to using a bandwidth that minimizes the length of bias-adjusted confidence intervals.

Finally, although it is beyond the scope of the present paper, it is interesting to ask whether we can generalize our approach to obtain asymptotically quasi-minimax estimators for τ_c when the per-observation noise scale needs to be estimated from the data. The resulting question is closely related to the classical issue of when feasible generalized least squares can emulate generalized least squares; see Romano and Wolf (2017) for a recent discussion.

3 Univariate Optimized Designs in Practice

To use this result in practice, we of course need to estimate the noise level σ² and choose a bound B on the curvature. Estimating the former is relatively routine, and we recommend the following. First, we estimate μ_(w)(x) globally, or over a large plausible relevant interval around the threshold, and average the square of the residuals to obtain an estimate σ̂² of the average value of Var[Y_i | X_i]. Then, we optimize the weights using (4), with σ² set to σ̂². Finally, once we have chosen the weights γ̂_i, we estimate the sampling error of τ̂ as below, noting that the estimator will be consistent under standard conditions:

ŝ_n² = Σ_{i=1}^n γ̂_i² ( Y_i − μ̂_(W_i)(X_i) )².    (18)

Conceptually, this strategy is comparable to first running local linear regression without heteroskedasticity adjustments to get a point estimate, but then ensuring that the uncertainty quantification is heteroskedasticity-robust (White, 1980). We summarize the resulting method as Procedure 1. We always encourage plotting the weights γ̂_i against X_i when applying our method.

Conversely, obtaining good bounds on the curvature B is more difficult, and requires problem-specific insight. In particular, adapting to the true curvature without a-priori bounds for it is not always possible; see Armstrong and Kolesár (2018), Bertanha and Moreira (2016), and references therein. In applications, we recommend considering a range of plausible values of B that could be obtained, e.g., from subject-matter expertise or from considering the mean-response function globally. For example, we could estimate μ_(w)(x) using a quadratic function globally, or over a large plausible relevant interval around the threshold, and then multiply the maximal curvature of the fitted model by a constant, e.g., 2 or 4. The larger the value of B we use, the more conservative the resulting inference. In practice, it is often helpful to conduct a sensitivity analysis for the robustness of confidence intervals to changing B; see Figure 6 for an example. (Footnote 11: An interesting wrinkle is that if we are able to bound the curvature in large samples—but not uniformly—then confidence intervals built using estimated values of B will have asymptotic but not uniform coverage.)
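
One concrete version of this heuristic is sketched below (function and argument names are ours): fit a global quadratic separately on each side of the threshold, read off the implied curvature, and inflate it by a user-chosen multiplier.

```r
# Curvature-bound heuristic: the second derivative of a fitted quadratic
# a + b*z + c*z^2 is 2c, so we take the larger of the two one-sided estimates
# and multiply it by a conservativeness factor (e.g., 2 or 4).
choose_B <- function(X, Y, threshold, R = Inf, multiplier = 2) {
  curvature_one_side <- function(keep) {
    z <- X[keep] - threshold
    fit <- lm(Y[keep] ~ z + I(z^2))
    2 * abs(coef(fit)["I(z^2)"])
  }
  right <- abs(X - threshold) <= R & X >= threshold
  left  <- abs(X - threshold) <= R & X < threshold
  unname(multiplier * max(curvature_one_side(right), curvature_one_side(left)))
}
```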

Procedure 1.
Optimized Regression Discontinuity Inference. This algorithm provides confidence intervals for the conditional average treatment effect τ_c, given an a-priori bound B on the second derivative of the functions μ_(w)(x). We assume that the conditional variance parameters are unknown; if they are known, they should be used as in (4). This procedure is implemented in our R package optrdd (see Footnote 12 below).
1. Pick a large window R, such that data with |X_i − c| > R can be safely ignored without loss of efficiency. (Here, we can select R to cover the full range of the data, but this may result in unnecessary computational burden.) Run an ordinary least-squares regression of Y_i on the interaction of W_i and X_i − c over the window |X_i − c| ≤ R, and let σ̂² be the mean-squared residual error from this regression.
2. Obtain the weights γ̂ via the quadratic program (14), with the noise level set to σ̂² and with the weights of observations outside the window set to 0. Confirm that the optimized weights are small near the edge of the window; if not, start again with a larger value of R. (Footnote 13: Only considering data over an a-priori specified “large plausible relevant interval” around c that safely contains all the data relevant to fitting τ_c can also be of computational interest. Our method relies on estimating a smooth non-parametric function over the whole range of X_i, and being able to reduce the relevant range of X_i a-priori reduces the required computation. Although defining such plausibility intervals is of course heuristic, our method ought not be too sensitive to how exactly the interval is chosen. For example, in the setup considered in Section 3.1, the optimal bandwidth for local linear regression is around 3 or 6 years depending on the amount of assumed smoothness (and choosing a good bandwidth is very important); conversely, using plausibility intervals extending 10, 15, or 20 years on both sides of the threshold appears to work reasonably well. When running the method (4), one should always make sure that the weights get very small near the edge of the plausibility interval; if not, the interval should be made larger.)
3. Estimate the worst-case bias bound t̂ and the sampling variance ŝ² as in (18), where the predictions μ̂_i are taken from the least-squares regression from step 1.
4. Build confidence intervals as in (12).

Footnote 12: Here, the algorithm assumes that all observations are of roughly the same quality (i.e., we do not know that the noise level is lower for some observations than for others). If we have a-priori information about the relative magnitudes of the conditional variances of different observations, e.g., some outcomes are actually averages over many underlying observations, then we should run steps 2 and 3 of the procedure using appropriate inverse-variance weights. Our software allows for such weighting.
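
The following self-contained sketch illustrates steps 1 and 3 of Procedure 1 with our own helper names (it is not the optrdd implementation): the noise level is estimated from an interacted OLS fit over the plausibility window, and a given weight vector is then turned into the heteroskedasticity-robust standard error used in (12).

```r
# Step 1: sigma^2 from OLS of Y on W interacted with (X - c), over |X - c| <= R.
estimate_noise <- function(X, Y, W, threshold, R) {
  dat <- data.frame(Y = Y, W = W, Xc = X - threshold)
  fit <- lm(Y ~ W * Xc, data = dat, subset = abs(X - threshold) <= R)
  list(sigma.sq = mean(residuals(fit)^2),      # plug-in noise level for the QP
       mu.hat   = predict(fit, newdata = dat)) # fitted means for all observations
}

# Step 3: sampling error of tau.hat = sum(gamma * Y), as in (18), replacing the
# unknown conditional variances by squared residuals.
sampling_se <- function(gamma, Y, mu.hat) {
  sqrt(sum(gamma^2 * (Y - mu.hat)^2))
}
```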

3.1 Application: The Effect of Compulsory Schooling

In our first application, we consider a dataset from Oreopoulos (2006), who studied the effect of raising the minimum school-leaving age on earnings as an adult. The effect is identified by the UK changing its minimum school-leaving age from 14 to 15 in 1947, and the response is log-earnings among those with non-zero earnings (in 1998 pounds). This dataset exhibits notable discreteness in its running variable, and was used by Kolesár and Rothe (2018) to illustrate the value of their bias-adjusted confidence intervals for discrete regression discontinuity designs. For our analysis, we pre-process our data exactly as in Kolesár and Rothe (2018); we refer the reader to their paper and to Oreopoulos (2006) for a more in-depth discussion of the data.

As in Kolesár and Rothe (2018), we seek to identify the effect of the change in minimum school-leaving age on average earnings via a local analysis around the regression discontinuity; our running variable is the year in which a person turned 14, with a treatment threshold at 1947. Kolesár and Rothe (2018) consider analysis using local linear regression with a rectangular kernel and a bandwidth chosen such as to make their honest confidence intervals as short as possible (recall that we can measure confidence interval length without knowing the point estimate, and so tuning the interval length does not invalidate inference). Here, we also consider local linear regression with a triangular kernel, as well as our optimized design. (Footnote 14: Oreopoulos (2006) analyzes the dataset using a global polynomial specification with clustered standard errors, following Lee and Card (2008). However, as discussed in detail by Kolesár and Rothe (2018), this approach does not yield valid confidence intervals.)

Figure 4: Weighting functions produced explicitly by our estimator (4), and implicitly via local linear regression with a rectangular or triangular kernel. Both local linear regression methods have a finite bandwidth, and the effective weights of observations outside this bandwidth are not shown. The weighting functions were generated with a single fixed value of the curvature bound B.

In order to obtain confidence intervals, it remains to choose a bound B. Following the discussion in Section 3, a 2nd-order polynomial fit with a “large” bandwidth of either 12 or 18 years yields a curvature estimate that is insensitive to the choice of large bandwidth; we tried values of B at and above this estimate, together with two more extreme choices, giving the four values listed in Table 1. For σ², we proceed as discussed in Section 3. Figure 4 shows the effective weighting functions for all 3 considered methods, with B fixed at one of these values.

B        rect. kernel    tri. kernel    optimized
0.003
0.006
0.012
0.03
Table 1: Confidence intervals for the effect of raising the minimum school-leaving age on average log-earnings, as given by local linear regression with a rectangular kernel, local linear regression with a triangular kernel, and our optimized method (4). The confidence intervals account for curvature effects, provided the second derivative is bounded by B.

We present results in Table 1. Overall, these results are in line with those presented in Figure 2. The optimized method yields materially shorter confidence intervals than local linear regression with a rectangular kernel: for example, with , the rectangular kernel intervals are 11% longer. In comparison, the triangular kernel comes closer to matching the performance of our method, although the optimized method still has shorter confidence intervals. Moreover, when considering comparisons with the triangular kernel, we note that the rectangular kernel is far more prevalent in practice, and that the motivation for using the triangular kernel often builds on the optimality results of Cheng et al. (1997). And, once one has set out on a quest for optimal weighting functions, there appears to be little reason to not just use the actually optimal weighting function (4).

Finally, we note that a bound on the second derivative also implies that the approximation class underlying (8) holds with the same constant. Thus, we could in principle also use the method of Armstrong and Kolesár (2018) to obtain uniform asymptotic confidence intervals here. However, the constraint (8) is weaker than the actual assumption we were willing to make (i.e., that the functions μ_(w)(x) have a bounded second derivative), and so the resulting confidence intervals are substantially larger. Using their approach on this dataset gives confidence intervals that are not only noticeably longer than our intervals, but are also longer than the best uniform confidence intervals we can get using local linear regression with a rectangular kernel as in Kolesár and Rothe (2018). Thus, the use of numerical convex optimization tools that let us solve (4) instead of (8) can be of considerable value in practice.

4 Applications with Multivariate Discontinuities

4.1 A Discontinuity Design with Two Cutoffs

We next consider the behavior of our method in a specific variant of the multiple running variable problem motivated by a common inference strategy in education. Some school districts mandate students to attend summer school if they fail a year-end standardized test in either math or reading (Jacob and Lefgren, 2004; Matsudaira, 2008), and it is of course important to understand the value of such summer schools. The fact that students are mandated to summer school based on a sharp test score cutoff suggests a natural identification strategy via regression discontinuities; however, standard univariate techniques cannot directly be applied as the regression discontinuity now no longer occurs along a point, but rather along a surface in the bivariate space encoding both a student’s math and reading scores.
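
In our notation (with the cutoffs written generically as c_math and c_read), the treated region for this design is the L-shaped set

\[
\mathcal{A} = \big\{ (x_{\text{math}},\, x_{\text{read}}) : x_{\text{math}} < c_{\text{math}} \ \text{ or } \ x_{\text{read}} < c_{\text{read}} \big\},
\]

so the treatment boundary is the union of two half-lines meeting at the corner (c_math, c_read), rather than a single cutoff point.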

                             fail reading              pass reading
                           fail math   pass math    fail math   pass math
number of students            3,586       1,488       10,331      15,336
summer school attendance      69.6%         61%        52.9%       10.6%
Table 2: Summary statistics for a subset of the dataset of Matsudaira (2008).

We illustrate our approach using the dataset of Matsudaira (2008). As discussed above, the goal is to study the impact of summer school on future test scores, and the effect of summer school is identified by a regression discontinuity: At the end of the school year, students need to take year-end tests in math and reading; then, students failing either of these tests are mandated to attend summer school. Here, we focus on the 2001 class of graduating 5th graders, and filter the sample to only include the students whose 5th-grade math and reading scores both fall within 40 points of the passing threshold. Matsudaira (2008) analyzed this dataset with univariate methods, by using the reading score as a running variable and only considering the subset of students who passed the math exam, etc. This allows for a simple analysis, but may also result in a loss of precision. (Footnote 15: In order to make a formal power comparison, we need to compare two estimators that target the same estimand. In the simplest case where the treatment effect is constant, our estimator (6) presents an unambiguous gain in power over those considered in Matsudaira (2008).)

We present some summary statistics in Table 2. Clearly, not all students mandated to attend summer school in fact attend, and some students who pass both tests still need to attend for reasons discussed in Matsudaira (2008). That being said, the effect of passing tests on summer school attendance is quite strong and, furthermore, the treatment effect of being mandated to summer school is interesting in its own right, so here we perform an “intent to treat” analysis without considering non-compliance.

Left panel: unweighted CATE (5). Right panel: weighted CATE (6).
Figure 5: Weights underlying treatment effect estimates of the effect of summer school on the following year’s reading scores, using both (5), which seeks to estimate the conditional average treatment effect (CATE) at the focal point, and the estimator (6), which allows weighted CATE estimation. The size of each weight γ̂_i is depicted by the color, ranging from dark red (very positive) to dark blue (very negative). In the right panel, the diamond marks the weighted mean of the treated score values. These plots were generated with a fixed bound B on the second derivative.

We consider both of our optimized estimators, (5) and (6), and compare the weight functions learned by both methods in Figure 5. The estimator (5) is in fact quite conservative, and only gives large weights to students who scored close to the focal point. Our choice of estimating the conditional average treatment effect at this focal point may have been particularly challenging, as it is in a corner of control-space and so does not have particularly many control neighbors.

In contrast, the weighted method appears to have effectively learned matching: It constructs pairs of observations all along the treatment discontinuity, thus allowing it to use more data while canceling out curvature effects due to the baseline response surface μ_(0)(x). As seen in Table 2, in this sample, it is much more common to fail math and pass reading than vice versa; thus, the mean of the samples used for “matching” lies closer to the math pass/fail boundary than to the reading one.

                     unweighted CATE (5)                  weighted CATE (6)
subject      conf. int.   max bias   samp. err    conf. int.   max bias   samp. err
math                        0.030      0.038                     0.009      0.017
math                        0.041      0.052                     0.011      0.019
reading                     0.030      0.041                     0.009      0.017
reading                     0.040      0.054                     0.011      0.019
Table 3: Estimates for the effect of summer school on math and reading scores on the following year’s test, using different estimators and choices of B. Reported are bias-adjusted 95% confidence intervals, a bound on the maximum bias given our choice of B, and an estimate of the sampling error conditional on the running variables and treatment assignments.

In order to proceed with our approach, we again need to choose a value for B. Running a 2nd-order polynomial regression on the next year’s math and reading scores for both treated and control students separately, we find the largest curvature effect in the reading score of control students. Thus, we run our algorithm with both an optimistic choice of B close to this estimated curvature and a more conservative (larger) choice; we report curvatures on the “scale” of the plots in Figure 5, so that a given curvature maps directly into a worst-case bias in the corners of the plot.
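
The bivariate analogue of the curvature heuristic from Section 3 is sketched below (our own helper, not the procedure used to produce Table 3): fit a quadratic surface by OLS within each treatment group and report the operator norm of the implied constant Hessian.

```r
# Operator-norm curvature of a fitted bivariate quadratic: the Hessian of
# a + b1*x1 + b2*x2 + c1*x1^2 + c2*x2^2 + c12*x1*x2 is [[2*c1, c12], [c12, 2*c2]].
quadratic_curvature <- function(x1, x2, y) {
  fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))
  b <- coef(fit)
  H <- matrix(c(2 * b["I(x1^2)"], b["I(x1 * x2)"],
                b["I(x1 * x2)"], 2 * b["I(x2^2)"]), 2, 2)
  max(abs(eigen(H, symmetric = TRUE)$values))  # largest absolute eigenvalue
}
```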

Results are given in Table 3. As expected, the confidence intervals using the weighted method (6) are much shorter than those obtained using (5), allowing for a 0.95-level significant detection in the first case but not in the second. Since the weighting method allows for shorter confidence intervals, and in practice seems to yield a matching-like estimator, we expect it to be more often applicable than the unweighted estimator (5).

Figure 6 presents some further diagnostics for our result. The first two panels depict a sensitivity analysis for our weighted CATE result: We vary the maximum bound B on the second derivative, and examine how our confidence intervals change. (Footnote 16: We note that, if the CATE function τ(x) is not constant, then our weighted CATE estimand may vary with B. The result in Figure 6 should thus formally be interpreted either as a sensitivity analysis for the constant treatment effect parameter if we are willing to assume constant treatment effects, or as a robustness check for our rejection of the global null hypothesis τ(x) = 0 for all x.) The result on the effect of summer school on math scores appears to be fairly robust, as we still get significant bias-aware 95% confidence intervals at a bound equal to 4 times the largest apparent curvature observed in the data. The third panel plots a measure of the effective size of the treated and control samples used by the algorithm, as defined from the weights γ̂_i. Although our analysis sample has almost exactly the same number of treated and control units, it appears that our algorithm is able to make use of more control than treated samples, perhaps because the treated units are “wrapped around” the controls.

Panels, left to right: math confidence interval, reading confidence interval, effective sample size.
Figure 6: The first two panels depict a sensitivity analysis for our weighted CATE result, for the math and reading outcomes respectively. We plot point estimates along with bias-aware 95% confidence intervals for different choices of B; the choices of B used in Table 3 are indicated with dotted lines. The third panel depicts the effective sample sizes used by the procedure, separately for treated and control units, as defined from the weights γ̂_i. For a given value of B, the weights used for the math and reading outcomes are the same. In all cases, the plotted values are rescaled for readability.

Finally, it is of course natural to ask whether the bivariate specification considered here gave us anything in addition to the simpler approach used by Matsudaira (2008), i.e., estimating the treatment effect of summer school on the next year’s math exam by running a univariate regression discontinuity analysis on only those students who passed the reading exam, and vice versa for the effect on the reading exam. (Footnote 17: The estimator of Matsudaira (2008) is not exactly comparable to the two we consider. For example, when only focusing on students who passed the reading exam, his estimator effectively averages treatment effects over the math pass/fail boundary but not the reading pass/fail boundary. In contrast, we either estimate the conditional average treatment effect at a point (5), or allow for averaging over the full boundary (6). It is unclear whether the restriction of Matsudaira (2008) to averaging over only one of the two boundary segments targets a meaningfully more interpretable estimand than (6).) We ran both of these analyses using our method (14), again considering both bounds on the second derivative, and obtained 95% confidence intervals for math and for reading under each bound. In both cases, the corresponding intervals for the weighted estimator (6) in Table 3 are shorter, despite accounting for the possibility of bivariate curvature effects. The difference is particularly strong for the reading outcome, since our estimator can also use students near the math pass/fail boundary for improved precision. (Footnote 18: The corresponding headline numbers from Matsudaira (2008) are 95% confidence intervals for the effects on the math and reading scores; see his Tables 2 and 3, reduced-form estimates for 5th graders. These confidence intervals, however, do not formally account for bias: they estimate the discontinuity parameter using a global cubic fit, and such methods do not reliably eliminate bias (Gelman and Imbens, 2017).)

4.2 A Geographic Discontinuity Design

Our final application expands on the example of Keele and Titiunik (2014), briefly discussed in Section 1.2, the goal of which is to estimate the effect of political advertising on participation in presidential elections by comparing households straddling a media market boundary. As discussed in Keele and Titiunik (2014), inference hinges on the fact that they found a media market boundary that appears not to coincide with any other major administrative boundaries, thus making it more plausible that discontinuous responses across the geographic boundary are in fact caused by varying media exposure.

The way Keele and Titiunik (2014) approach this problem is that, given a focal point on the boundary, they first measure the Euclidean distance of each sample point to that focal point, and then use this distance as the running variable in a univariate regression discontinuity analysis. They establish consistency of their approach via a local regression argument following Hahn et al. (2001), and also discuss advantages of their approach relative to one that uses the shortest distance to the treatment boundary (rather than the distance to the focal point) as input to a univariate analysis. Beyond consistency arguments, however, the resulting estimator has no guarantees in terms of statistical optimality. For example, the approach of Keele and Titiunik (2014) collapses all treated (and, respectively, all control) observations at the same distance from the focal point to a single point in the univariate representation, which may increase bias.
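For concreteness, a minimal sketch of this univariate reduction is given below. The data frame dat with columns lon, lat, and treated, as well as the focal-point coordinates, are hypothetical placeholders, not data from the application.

    # Sketch of the univariate reduction of Keele and Titiunik (2014):
    # collapse the two geographic coordinates into a signed Euclidean distance
    # from a focal point on the media market boundary. All names are placeholders.
    focal <- c(lon = -75.2, lat = 40.0)   # hypothetical focal point

    dist_to_focal <- sqrt((dat$lon - focal["lon"])^2 + (dat$lat - focal["lat"])^2)

    # Sign the distance by treatment side, so that the boundary maps to zero and
    # the result can be fed to any univariate RD estimator with cutoff 0.
    running <- ifelse(dat$treated, dist_to_focal, -dist_to_focal)

    # Note: all treated (respectively, control) units at the same radius from the
    # focal point collapse to a single value of `running`, which is the source of
    # the potential extra bias discussed in the text.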

Figure 7: Non-parametric estimates for turnout in the 2008 presidential elections (left panel) and average age (right panel), as a function of the geographic covariates. There is considerable variation in the underlying signal: localized turnout rates range from 26% to 46%, while localized age averages range from 43 years to 68 years. Darker regions correspond to larger values of the response. These predictive surfaces were used to pick the curvature bound B.

We apply our method to this problem and, following Keele and Titiunik (2014), seek to measure the effect of media exposure on our main outcome of interest (participation in the 2008 presidential elections), as well as on several placebo outcomes (age, black race, political party affiliation, and gender). There were observations in this school district. Relative to the other examples considered in our paper, the responses here have a much richer dependence on the geographic running variables, thus making the problem of bounding the curvature B much more delicate. To illustrate this issue, Figure 7 depicts non-parametric estimates of expected turnout and age as a function of spatial location, and reveals strong local structure. For example, we see two regions on the south-eastern and northern edges of the school district with a notably higher average age than elsewhere.

We did not find simple quadratic methods for bounding B to be sufficiently sensitive to this strong local structure, and thus chose instead to set B to the worst-case local curvature of a global non-parametric fit. Specifically, we ran a cross-validated ridge regression with interacted 7th-order natural splines as features on each side of the media market boundary, and then set B at the 95th percentile of the estimated operator norm of the curvature over all training points. Cross-validation was carried out via the 1-SE (lambda.1se) rule in the R package glmnet (Friedman et al., 2010). We caution that this type of local choice of B is only possible on fairly large datasets with strong local effects. With smaller datasets or weaker signals, cross-validation may over-smooth the estimated response function, resulting in an over-optimistically small value of B.
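A sketch of this curvature-bound heuristic is given below, intended only as an illustration of the recipe under stated assumptions: xy (an n-by-2 coordinate matrix) and y (the outcome on one side of the boundary) are placeholders, "7th-order natural splines" is interpreted here as a natural spline basis with seven degrees of freedom per coordinate, and the curvature of the fitted surface is approximated by central finite differences rather than computed analytically.

    library(splines)
    library(glmnet)

    # Sketch of the curvature-bound heuristic described above, applied to one
    # side of the boundary. `xy` and `y` are placeholders for the actual data.
    curvature_bound <- function(xy, y, df = 7, h = 0.01, q = 0.95) {
      b1 <- ns(xy[, 1], df = df)
      b2 <- ns(xy[, 2], df = df)

      make_features <- function(pts) {
        f1 <- predict(b1, pts[, 1])
        f2 <- predict(b2, pts[, 2])
        inter <- matrix(0, nrow(pts), ncol(f1) * ncol(f2))
        k <- 0
        for (i in seq_len(ncol(f1))) {
          for (j in seq_len(ncol(f2))) {
            k <- k + 1
            inter[, k] <- f1[, i] * f2[, j]
          }
        }
        cbind(f1, f2, inter)  # main effects plus interacted (tensor-product) basis
      }

      # Ridge regression (alpha = 0) tuned by cross-validation; predictions use
      # the 1-SE rule (lambda.1se).
      cvfit <- cv.glmnet(make_features(xy), y, alpha = 0)
      mu <- function(pts) {
        as.numeric(predict(cvfit, newx = make_features(pts), s = "lambda.1se"))
      }

      # Operator norm of a finite-difference Hessian of the fitted surface at
      # each training point (slow but illustrative; h should match the scale of
      # the coordinates).
      op_norms <- apply(xy, 1, function(pt) {
        m <- function(dx, dy) mu(matrix(pt + c(dx, dy), nrow = 1))
        dxx <- (m(h, 0) - 2 * m(0, 0) + m(-h, 0)) / h^2
        dyy <- (m(0, h) - 2 * m(0, 0) + m(0, -h)) / h^2
        dxy <- (m(h, h) - m(h, -h) - m(-h, h) + m(-h, -h)) / (4 * h^2)
        H <- matrix(c(dxx, dxy, dxy, dyy), 2, 2)
        max(abs(eigen(H, symmetric = TRUE, only.values = TRUE)$values))
      })

      quantile(op_norms, q)
    }

Under this reading, the spline-plus-ridge surface is fit separately on each side of the media market boundary, with the 95th percentile then taken over the pooled training points.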

outcome pt. estimate conf. interval max. bias std. err.

at point

2008 turnout 0.130 (-0.072, 0.331) 0.071 0.079
age 8.605 (-8.949, 26.158) 6.552 6.682
black -0.006 (-0.080, 0.069) 0.032 0.026
democrat 0.019 (-0.080, 0.118) 0.035 0.039
female 0.065 (-0.060, 0.191) 0.045 0.049

weighted

2008 turnout 0.045 (-0.107, 0.197) 0.052 0.060
age 3.105 (-8.702, 14.913) 4.147 4.649
black 0.016 (-0.043, 0.075) 0.024 0.021
democrat -0.027 (-0.110, 0.055) 0.026 0.034
female 0.019 (-0.083, 0.120) 0.033 0.042
Table 4: Estimates for the effect of television advertising on voter turnout, as well as placebo outcomes (age in years, black indicator, Democrat indicator, and female indicator). The upper half of the table displays estimates for the conditional average treatment effect at the point marked X in Figure 3, whereas the lower half allows for a weighted treatment effect as in (6). We provide a point estimate for the treatment effect, a 95% confidence interval, as well as the worst case bias and sampling error used to produce the confidence interval.

Table 4 shows results from our analysis, both in terms of pointwise estimates following (5) and weighted estimates following (6). We set the curvature parameter B separately for each outcome, following the procedure outlined above. [Footnote 19: Keele and Titiunik (2014) also had a placebo outcome indicating Hispanic ethnicity. However, this outcome does not appear to have much local structure, and the cross-validated ridge regression learned a constant function (resulting in a curvature bound of zero). Thus, we omit this placebo check from our present analysis; an alternative would have been to fall back on our previous approach and select B via a global quadratic fit.] The weighting functions underlying our analysis for the primary outcome are shown in Figure 3. Overall, we replicate the finding of Keele and Titiunik (2014), in that the media market boundary does not appear to have a substantial association with either the primary outcome or the placebo outcomes. Relative to existing methods, the value of our approach is that it allows for transparent inference about causal parameters using the original untransformed data (rather than relying on a hand-constructed univariate running variable), while uniformly accounting for spatial curvature effects.
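As an aside, the worst-case bias and standard error columns in Table 4 are enough to reproduce the reported intervals: a bias-aware 95% interval can be formed using the 95% quantile of |N(t, 1)|, with t equal to the ratio of maximal bias to standard error, which is available through the noncentral chi-squared distribution. The sketch below is a generic illustration of this construction rather than a call into our software.

    # Bias-aware confidence interval from a point estimate, a worst-case bias
    # bound, and a standard error, using the quantile of |N(t, 1)|.
    bias_aware_ci <- function(estimate, max_bias, std_err, alpha = 0.05) {
      t <- max_bias / std_err
      crit <- sqrt(qchisq(1 - alpha, df = 1, ncp = t^2))
      c(lower = estimate - crit * std_err,
        upper = estimate + crit * std_err)
    }

    # Example: the "at point" 2008-turnout row of Table 4.
    bias_aware_ci(0.130, 0.071, 0.079)
    # Approximately (-0.071, 0.331), matching the reported interval up to
    # rounding of the inputs.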

5 Discussion

In this paper, we introduced an optimization-based approach to statistical inference in regression discontinuity designs. By using numerical convex optimization tools, we explicitly derive the minimax linear estimator for the regression discontinuity parameter under bounds on the second derivative of the conditional response surface. Because any method based on local linear regression is also a linear estimator of this type, our approach dominates local linear regression in terms of minimax mean-squared error. We also show how our approach can be used to build uniformly valid confidence intervals.

A key advantage of our procedure is that, given bounds on the second derivative, estimation of the regression discontinuity parameter is fully automatic. The proposed algorithm is the same whether the running variable is continuous or discrete, and whether or not the treatment effect is point identified. Moreover, it does not depend on the shape of the treatment assignment boundary when the running variable is multivariate. We end our discussion with some potential extensions of our approach.

Fuzzy regression discontinuities

In the present paper, we only considered sharp regression discontinuities, where the treatment assignment is a deterministic function of the running variable. However, there is also considerable interest in fuzzy discontinuities, where treatment take-up is random but its probability jumps discontinuously at the threshold; see Imbens and Lemieux (2008) for a review. In this case, it is common to interpret the treatment-assignment indicator as an instrument, and then to estimate a local average treatment effect via two-stage local linear regression (Imbens and Angrist, 1994). By analogy, we can estimate treatment effects in fuzzy regression discontinuity designs via two-stage optimized designs as

\hat{\tau}_{\text{fuzzy}} = \sum_{i=1}^{n} \hat{\gamma}_i Y_i \Big/ \sum_{i=1}^{n} \hat{\gamma}_i W_i,    (19)

where the weights \hat{\gamma}_i are obtained as in (14), with an appropriately chosen penalty on the maximal squared imbalance. This approach would clearly be consistent given the results established in this paper; however, deriving the best way to trade off bias and variance in specifying this penalty, and extending the approach of Donoho (1994) to obtain uniform asymptotic inference, is left for future work.
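Under this reading of (19), the second stage is simply a ratio of weighted sums. A minimal sketch, assuming the optimized weights from (14) are already available as gamma_hat, is:

    # Sketch of the two-stage (Wald-ratio) fuzzy RD estimator in (19).
    # `gamma_hat` (optimized weights from (14)), `Y` (outcomes), and `W`
    # (realized treatment indicators) are assumed to be given.
    fuzzy_estimate <- function(gamma_hat, Y, W) {
      sum(gamma_hat * Y) / sum(gamma_hat * W)
    }

Using the same weights in the numerator and the denominator mirrors two-stage local linear regression, where the same kernel and bandwidth are typically applied to both the outcome and the treatment take-up regressions.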

Balancing auxiliary covariates

In many applications, we have access to auxiliary covariates