Title
nprobust: Nonparametric KernelBased Estimation and Robust BiasCorrected Inference
Abstract
Nonparametric kernel density and local polynomial regression estimators are very popular in Statistics, Economics, and many other disciplines. They are routinely employed in applied work, either as part of the main empirical analysis or as a preliminary ingredient entering some other estimation or inference procedure. This article describes the main methodological and numerical features of the software package nprobust, which offers an array of estimation and inference procedures for nonparametric kernelbased density and local polynomial regression methods, implemented in both the R and Stata statistical platforms. The package includes not only classical bandwidth selection, estimation, and inference methods (Wand and Jones 1995; Fan and Gijbels 1996), but also other recent developments in the statistics and econometrics literatures such as robust biascorrected inference and coverage error optimal bandwidth selection (Calonico, Cattaneo, and Farrell 2018, 2019). Furthermore, this article also proposes a simple way of estimating optimal bandwidths in practice that always delivers the optimal mean square error convergence rate regardless of the specific evaluation point, that is, no matter whether it is implemented at a boundary or interior point. Numerical performance is illustrated using an empirical application and simulated data, where a detailed numerical comparison with other R packages is given.
Last update: June 24, 2019
[1]label*=()
\@oddhead
JSS
Journal of Statistical Software
MMMMMM YYYY, Volume VV, Issue II. doi: 10.18637/jss.v000.i00
nprobust: Nonparametric KernelBased Estimation and Robust BiasCorrected Inference
Sebastian Calonico Columbia University Matias D. Cattaneo Princeton University Max H. Farrell University of Chicago
Keywords: kernelbased nonparametrics, bandwidth selection, bias correction, robust inference, R, Stata.
1 Introduction
Nonparametric kernelbased density and local polynomial regression methods are popular in Statistics, Economics, and many other disciplines, as they are routinely employed in empirical work. Correct implementation of these nonparametric procedures, however, often requires subtle (but important) methodological and numerical decisions. In this article, we give a comprehensive discussion of the software package nprobust, which is available for both R and Stata, and offers modern statistical methods for bandwidth selection, estimation, and inference in the context of kernel density and local polynomial regression fitting. Furthermore, this article also presents a novel bandwidth selection approach, which always delivers the optimal rate of convergence no matter the evaluation point under consideration (i.e., the bandwidth methodology is “boundary rate adaptive”), and therefore is the default method used in the package. All the methods are fully datadriven and are implemented paying particular attention to finite sample properties as well as to recent theoretical developments in the statistics and econometrics literatures.
The package nprobust implements three types of statistical procedures for both density and local polynomial kernel smoothing. First, it provides several datadriven optimal bandwidth selection methods, specifically tailored to both (i) point estimation, based on both mean square error (MSE) and integrated MSE (IMSE) approximations, and (ii) confidence interval estimation or, equivalently, twosided hypothesis testing, based on coverage error (CE) approximations via valid higherorder Edgeworth expansions. Second, the package implements point and variance estimators for density estimation at an interior point, and regression function estimation both at interior and boundary points. Finally, the package offers conventional and robust biascorrected inference for all settings considered: density at interior point and regression function at both interior and boundary points. Some of these implementations build on classical results in the nonparametrics literature such as those reviewed in Wand and Jones (1995) and Fan and Gijbels (1996), while others build on more recent developments on nonparametrics (Calonico, Cattaneo, and Farrell 2018, 2019, and references therein) and/or new practical ideas explained below. See also Ruppert et al. (2009) for semi/nonparametric applications in statistics, and Li and Racine (2007) for semi/nonparametric applications in econometrics.
Section 2 below gives an heuristic introduction to the main methodological ideas and results underlying the package implementation, avoiding technical details as much as possible, and focusing instead on the big picture overview of the methods available. For a technical discussion of the methods, see Calonico, Cattaneo, and Farrell (2018, 2019) and their supplemental appendices. Related software implementing partitioningbased nonparametric methods, including binscatter, splines, and wavelets, are discussed in Cattaneo et al. (2019b) and Cattaneo et al. (2019a).
Unlike most other kernel smoothing implementations available in R and Stata, the package nprobust has two distinctive features, in addition to offering several new statistical procedures currently unavailable. First, all objects entering the different statistical procedures are computed using preasymptotic formulas, which has been shown to deliver superior estimation and inference. For example, when implemented in preasymptotic form, local polynomial regression inference at a boundary point exhibits higherorder boundary carpentry, while this important distributional feature is not present when asymptotic approximations are used instead for Studentization purposes. Second, also motivated by preasymptotic approximations and better finite sample properties, all inference procedures are constructed using both heteroskedasticity consistent (HC) variance estimators and cluster robust variance estimators. This approach departs from those employing homoskedasticity consistent variance estimators, which are implemented in most (if not all) packages currently available, because it has been found theoretically and in simulations to deliver more accurate inference procedures. The package nprobust implements HC variance estimators in two distinct ways: (i) using plugin estimated residuals as usually done in least squares estimation (see Long and Ervin (2000), MacKinnon (2012), and references therein), and (ii) using nearest neighbor (NN) residuals (motivated by Muller and Stadtmuller (1987) and Abadie and Imbens (2008)). In particular, both the HC and NN variance estimators have been found to perform quite well in simulations. Cluster robust variance estimation is implemented using either plugin estimated residuals with degreesoffreedom adjustment or NN residuals.
In this article, we focus the discussion on the R version of the package nprobust, which makes use of the packages ggplot2 (Wickham 2016) and Rcpp (Eddelbuettel and Balamuta 2017). Nevertheless, we remark that the exact same functionalities are available in its Stata version. The package nprobust is organized through five general purpose functions (or commands in Stata):

lprobust(): this function implements estimation and inference procedures for kernelbased local polynomial regression at both interior and boundary points in an unified way. This function takes the bandwidth choices as given, and implements point and variance estimation, bias correction, conventional and robust biascorrected confidence intervals, and related inference procedures. All polynomial orders are allowed and the implementation automatically adapts to boundary points, given the choice of bandwidth(s). When the bandwidth(s) are not specified, the companion function lpbwselect() is used to select the necessary bandwidth(s) in a fully datadriven, automatic fashion, while taking into account whether the evaluation point is an interior or boundary point.

lpbwselect(): this function implements bandwidth selection for kernelbased local polynomial regression methods at both interior and boundary points. Six distinct bandwidth selectors are available: direct plugin (DPI) and ruleofthumb (ROT) implementations of MSEoptimal and IMSEoptimal choices (four alternatives), and DPI and ROT implementations of the CEoptimal choice (two alternatives).

kdrobust(): this function implements estimation and inference procedures for kernel density estimation at an interior point. Like the function lprobust(), this function also takes the bandwidth choices as given and implements point and variance estimation, bias correction, conventional and robust biascorrected confidence intervals, and related inference procedures. When the bandwidth(s) are not specified, the companion function kdbwselect() is used. This function cannot be used for density estimation at boundary points; see Cattaneo, Jansson, and Ma (2019c) for boundaryadaptive kernel density estimation methods.

kdbwselect(): this function implements bandwidth selection for kernel density estimation at an interior point. Mimicking the structure and options of the function lpbwselect(), the same type of bandwidth selectors are available.

nprobust.plot: this function is used as wrapper for plotting results, and is only available in R because it builds on the package ggplot2. Analogous plotting capabilities are available in Stata, as we illustrate in our companion replication files for the latter platform.
We also present detailed numerical evidence on the performance of the package nprobust. First, in Section 3, we illustrate several of its main functionalities using the canonical dataset of Efron and Feldman (1991). Second, in Section 4, we investigate the finite sample performance of the package using a simulation study. We also compare its performance to four other popular R packages implementing similar nonparametric methods: see Table 1 for details. We find that the package nprobust outperforms the alternatives in most cases, and never underperforms in terms of bandwidth selection, point estimation, and inference. Finally, for comparability, the Appendix discusses the R syntax for all the packages considered.
Installation details, scripts in R and Stata replicating all numerical results, links to software repositories, and other companion information concerning the package nprobust can be found in the website: https://sites.google.com/site/nppackages/nprobust/.
2 Overview of Methods and Implementation
This section offers a brief, selfcontained review of the main methods implemented in the package nprobust for the case of local polynomial regression. Whenever possible, we employ exactly the same notation as in Calonico, Cattaneo, and Farrell (2018, CCF hereafter) to facilitate comparison and crossreference with their lengthy supplemental appendix. See also Calonico, Cattaneo, and Farrell (2019) for extensions to derivative estimation and other theoretical and practical results. The case of kernel density estimation at interior points is not discussed here to conserve space, but is nonetheless implemented in the package nprobust, as illustrated below. Further details can be found in the classical textbooks such as Wand and Jones (1995) and Fan and Gijbels (1996), as well as in CCF and references therein.
We assume that , with , is a random sample with , or its derivative, begin the object of interest. The evaluation point can be an “interior” or a “boundary” point, and we will discuss this issue when relevant. Regularity conditions such as smoothness or existence of moments are omitted herein, but can be found in the references already given. We discuss how the point estimator, its associated biascorrected point estimator, and their corresponding variance estimators are all constructed. We employ two bandwidths and , where is used to construct the original point estimator and is used to construct the bias correction (robust biascorrected inference allows for ). Finally, in our notation any object indexed by is computed with bandwidth and any object indexed by is computed with bandwidth and . We set for future reference. When appealing to asymptotic approximations, limits are taken and as unless explicitly stated otherwise. Finally, for any smooth function , we use the notation to denote its th derivative, with to save notation.
2.1 Point Estimation
The package nprobust implements fully datadriven, automatic kernelbased estimation and inference methods for the regression function and its derivatives. Since kernelbased local polynomial regression estimators are consistent at all evaluation points, we consider both interior and boundary points simultaneously.
The classical local polynomial estimator of , , is
where, for a nonnegative integer , is the th unit vector of dimension, , and denotes a symmetric kernel with bounded support. See Fan and Gijbels (1996) for further discussion and basic methodological issues. While CCF restrict attention to odd, as is standard in the local polynomial literature, the package nprobust allows for any . In particular, corresponds to the classical NadarayaWatson kernel regression estimator, while is the local linear estimator with the celebrated boundary carpentry feature. Some of the higherorder results reported in CCF for local polynomial estimators may change when is even, as we discuss further below, but all the firstorder properties remain valid and hence the package nprobust allows for any . The package uses as default, in part because all the first and higherorder results reported in Fan and Gijbels (1996), CCF, and references therein apply. The package offers three different kernel choices: Epanechnikov, Triangular, and Uniform.
For any fixed sample size, the local polynomial estimator is a weighted least squares estimator, and hence can be written in matrix form as follows: , where , , , and ; here denotes the diagonal matrix constructed using . This representation is convenient to develop preasymptotic approximations for estimation and inference, as we discuss below.
2.2 Bias and Variance
The basic statistical properties of the nonparametric estimators are captured by their bias and variance. In preasymptotic form, that is, for fixed sample size and bandwidth choice, the (conditional) bias and variance of can be represented by
respectively, where the objects , and depend on observable quantities such as , and , and each depend on only one unknown quantity: , , and the heteroskedasticity function , respectively. Throughout the paper, means that the approximation holds for large samples in probability.
To be more specific, the preasymptotic variance takes the form:
which is valid for any and and for interior and boundary points , as this formula follows straightforwardly from linear least squares algebra, where . In practice, is replaced by some “estimator”, leading to heteroskedasticity consistent (HC) or cluster robust variance estimation for (weighted) linear least squares fitting.
The preasymptotic bias, on the other hand, is more complicated as it depends on whether is an interior or a boundary point, and whether is odd or even. To be concrete, define
where . Then, if is odd or is a boundary point, is the leading term of the bias (i.e., regardless of the evaluation point when is odd), and is negligible in large samples. Alternatively, if is even and is an interior point, then
where the quantity is nonzero, and relatively easy to estimate using a pilot bandwidth and preasymptotic approximations. (It is, however, notationally cumbersome to state explicitly.) Therefore, when is even and is an interior point, both and are part of the leading bias. This distinction is important when it comes to (automatic) implementation, as the form of the bias changes depending on the evaluation point and regression estimator considered.
At a conceptual level, putting aside technical specifics on approximation and implementation, the bias and variance quantities given above play a crucial role in bandwidth selection, estimation, and inference. The package nprobust pays special attention to constructing robust estimates of these quantities, with good finite sample and boundary properties.
2.3 Mean Square Error and Bandwidth Selection
In order to implement the point estimator , a choice of bandwidth is needed. Following the classical nonparametric literature, the package nprobust employs pointwise and integrated MSE approximations for the point estimator to compute the corresponding MSEoptimal or IMSEoptimal bandwidth for point estimation.
The pointwise MSE approximation is
while the integrated MSE approximation is
with a weighting function.
2.3.1 Case 1: odd
Employing the approximations above, the MSEoptimal bandwidth is
which, in this case, has closed form solution given by
Here we abuse notation slightly: the quantities and are taken as fixed and independent of and , that is, denoting their probability limits as opposed to preasymptotic, as originally defined. Nevertheless, we abuse notation because in the end they will be approximated using their preasymptotic counterparts with a preliminary/pilot bandwidth.
Similarly, the integrated MSE approximation when is odd leads to the IMSEoptimal bandwidth
which also has a closed form solution given by
These MSEoptimal and IMSEoptimal bandwidth selectors are valid for odd, regardless of whether is a boundary or an interior point. Implementation of these selectors is not difficult, as the bias and variance quantities are taken to be preasymptotic and hence most parts are already in estimable form. For example, we have the following easy to implement bias estimator:
where and are preasymptotic and hence directly estimable once a preliminary/pilot bandwidth is set, and is simply another local polynomial fit of order . Similarly, the fixed variance estimator is given by:
where all the objects are directly computable from the data and valid for all (odd or even) and all evaluation points (interior or boundary), provided that is chosen appropriately. Following CCF, the package nprobust allows for five different under unknown conditional heteroskedasticity: HC0, HC1, HC2, HC3, and NN. The first four choices employ weighted estimated residuals, motivated by a weighted least squares interpretation of the local polynomial estimator, while the last choice uses a nearest neighbor approach to constructing residuals. Details underlying these choices, and their properties in finite and large samples, are discussed in CCF and its supplemental appendix. In addition, the package nprobust allows for two cluster robust choices for : plugin residuals with degreesoffreedom adjustment (similar to HC1) and NNcluster residuals (similar to NN).
As mentioned in the introduction, the outlined preasymptotic approach for bias and variance estimation is quite different from employing the usual asymptotic approximations, which is the way most implementations of kernel smoothing are done in both R and Stata. Importantly, using valid higherorder Edgeworth expansions, CCF established formally that using preasymptotic approximations for bias and variance quantities delivers demonstrably superior distributional approximations, and hence better finite sample performance.
2.3.2 Case 2: even
This case requires additional care. If is a boundary point, then the above formulas are valid and the same MSEoptimal and IMSEoptimal bandwidth selectors can be used. However, when is an interior point and even, the leading bias formulas change and so do the corresponding bandwidth selectors. Therefore, to account for this issue in an automatic way, since in any given application it is not possible to cleanly distinguish between interior and boundary evaluation points, the default bandwidth selector implemented in the package nprobust optimizes the preasymptotic full bias approximation, leading to the following MSEoptimal and IMSEoptimal bandwidth selectors
and
respectively.
These bandwidth selectors for local polynomial regression with even automatically adapt to the evaluation point in terms of the resulting convergence rate, i.e. the bandwidth choice will deliver point estimates with the MSE or IMSEoptimal convergence rate, but cannot be given in closed form. Furthermore, the package also offers the possibility of explicitly treating all points as interior points when even, in which case the estimator of the term is replaced by an estimator of the term , leading to the standard bandwidth selectors, which are more cumbersome to state but can be computed in closed form.
2.3.3 Boundary Adaptive Bandwidth Selection
The function lpbwselect() implements a DPI estimate of the MSEoptimal and IMSEoptimal bandwidth selectors discussed previously, taking into account whether is even or odd and whether is interior or boundary point. These bandwidth choices depend on , , , and the preliminary bias and variance estimators. For the bias quantity, the preliminary estimator is computed using a nonparametric point estimators for the unknown quantities , , etc., depending on the case of local polynomial regression considered, but in all cases preliminary quantities are estimated using MSEoptimal bandwidth choices. For the variance quantity, a choice of “estimated” residual is used, as already discussed, and a preliminary ruleofthumb bandwidth is employed to construct the remaining preasymptotic objects. The integrated versions are computed by taking a sample average of the pointwise bias and variance quantities over the evaluation points selected by the user. Finally, the function lpbwselect() also implements ROT versions of the MSEoptimal and IMSEoptimal bandwidth selectors for completeness.
All the proposed and implemented bandwidth selection methods outlined above are rate adaptive in the sense that the selected bandwidth always exhibits the (I)MSEoptimal convergence rate, regardless of the polynomial order (), the derivative order (), and the evaluation point () considered. Furthermore, in the case of odd or a boundary point, the plugin estimated constant entering the bandwidth selectors will also be consistent for the corresponding (I)MSEoptimal population constant. On the other hand, in the case of even and an interior point, the resulting bandwidth selectors will not estimate consistently the complete bias constant (only one of the two constants is estimated consistently). To resolve the later issue in cases where the user is considering only interior points, the function lpbwselect() has the option interior which implements bandwidth selection for interior points only in all cases.
2.4 Optimal Point Estimation
When implemented using the (pointwise) MSEoptimal bandwidth, the regression function estimator is an MSEoptimal point estimator, in the sense that it minimizes the pointwise asymptotic mean square error objective function. Sometimes, researchers (and other software packages) implementing kernel smoothing methods prefer to optimize the bandwidth choice in a global sense and thus focus on optimal bandwidths targeting the IMSE, as presented above, or some form of crossvalidation objective function. These alternative bandwidth selectors also lead to optimal point estimators in some sense. Regardless of the specific objective function considered, and hence bandwidth rule used, the resulting bandwidth selectors exhibit the same rate of decay as (albeit different constants) in all cases: for example, compare and given above in close form for the case of odd.
The package nprobust implements only plugin bandwidth selectors for both local (pointwise) and global (integrated) bandwidth choices, because such rules tends to have better finite sample properties. Crossvalidation methods are not implemented due to potential numerical instabilities, and potentially low convergence rates of the resulting bandwidth choices. Nevertheless, the package can of course be used to construct crossvalidation bandwidth choices manually.
2.5 Conventional and Robust BiasCorrected Inference
Given a choice of bandwidth, conventional inference in nonparametric kernelbased estimation employs a Gaussian distributional approximation for the usual Waldtype statistic. To be more precise, standard asymptotic results give
where denotes an estimator of , already discussed above, and denotes convergence in distribution. Using this classical result, we obtain the nominal percent conventional symmetric confidence intervals:
where with denoting the standard normal cumulative distribution function.
However, this standard distributional approximation result crucially requires undersmoothing ( of the bandwidth employed: the confidence interval will not have correct coverage when any of the mean square error optimal bandwidths discussed previously are used. The main reason is that, when or (or a crossvalidation bandwidth) is employed to construct the local polynomial point estimator , then , where denotes a nonvanishing asymptotic bias. Furthermore, if a larger bandwidth than an MSE optimal is used, then the distributional approximation fails altogether.
This observation is very important because most (if not all) available kernel smoothing implementations in R and Stata employ a mean square error optimal bandwidth for both point estimation and inference, which means that the corresponding inference results are incorrect. One solution to this problem is to undersmooth the bandwidth, that is, to employ a smaller bandwidth for conducting inference and constructing confidence intervals. However, while theoretically valid in large samples, this approach does not typically perform well in applications, leads to power losses, and requires employing different observations for estimation and inference purposes, all important drawbacks for empirical work.
Motivated in part by the above issues, CCF developed new inference methods based on nonparametric bias correction, which they termed Robust Bias Correction (RBC). This approach is based on traditional plugin bias correction, as an alternative to undersmoothing, but employs a different variance estimator for Studentization purposes (when compared to those discussed in textbooks). The key underlying idea is that because the bias is estimated when conducting bias correction, the variance estimator should be change to account for the additional variability introduced. This leads to a new test statistic, where both the centering (bias) and the scale (variance) have been adjusted, but the same bandwidth can be used. To be more specific, we have
where denotes a variance estimator of and denotes a bias estimator (i.e., an estimate of ). The biascorrected point estimator and its fixed variance estimator are, respectively,
where . As explained in CCF, this formula emerges from employing th order local polynomial with bandwidth to estimate the leading higherorder derivative featuring in the fixed bias approximation for the estimator . Finally, in this case, is also computed using any of the five HC variance estimators (HC0, HC1, HC2, HC3, NN) or the two clusterrobust versions mentioned above. The estimator was already introduced above for bandwidth selection, but the estimator is new because it accounts for the variance of , the variance of , and the covariance between these terms.
The RBC test statistic can be constructed with any of the (I)MSE optimal bandwidths, including the ones introduced above, and the standard Gaussian approximation remains valid. This result leads to the nominal percent robust biascorrected symmetric confidence intervals:
The package nprobust implements both conventional and RBC inference for local polynomial estimators, at interior and boundary points. CCF also show that the same ideas and results apply to density estimation at an interior point, and the package also implements those results.
The function lprobust() implements the point estimators, variance estimators, conventional inference, and robust biascorrected inference discussed above, taking into account whether the evaluation point is near the boundary or not, and employing all preasymptotic approximations for estimation of bias and variance. Two bandwidths are used: for constructing the main point estimator and related quantities, and for constructing the higherorder derivative entering the bias correction term. As discussed in CCF, setting is allowed by the theory, and leads to a simple method for bias correction. This is the default in nprobust.
2.6 Coverage Error and Bandwidth Selection
An (I)MSEoptimal bandwidth can be used to construct optimal point estimators but cannot be used directly to conduct inference or form confidence intervals, in the conventional way, since a firstorder bias renders the standard Gaussian approximation invalid as discussed above. Robust biascorrected inference allows the use of an (I)MSEoptimal bandwidth to form a test statistic with a standard Gaussian distribution in large samples, thereby leading to valid confidence intervals and hypothesis tests.
However, while valid, employing an (I)MSEoptimal bandwidth to form the robust biascorrected statistic may not be optimal from a distributional or inference perspective. CCF (for ) and Calonico, Cattaneo, and Farrell (2019) (for ) study this problem formally using valid Edgeworth expansions and show that indeed the (I)MSEoptimal bandwidth is suboptimal in general, even after robust biascorrection, because it is too “large” (i.e., it decays to zero too slowly) relative to the optimal choice. A very important exception is , , with an interior point, in which case the MSEoptimal bandwidth does have the CEoptimal rate of convergence.
Consider first the case of local polynomial regression with odd and robust bias correction inference. The asymptotic coverage of the robust biascorrected confidence interval estimator is:
where the higherorder coverage error is
This representation follows from Corollary 5 in CCF and results in Calonico, Cattaneo, and Farrell (2019) (see also their supplemental appendices), and is written in a way that is adaptive to the evaluation point , that is, it is asymptotically valid for both interior and boundary points. Thus, a CEoptimal bandwidth for implementing confidence intervals , with odd, is:
Like in the case of (I)MSEoptimal bandwidth selection, a plugin CEoptimal bandwidth estimator is obtained by replacing with consistent estimators , . The bandwidth selector , and its datadriven implementation, can be used to construct robust biascorrected confidence intervals with CEoptimal properties, that is, with minimal coverage error in large samples. Because the constants entering the coverage error are estimated consistently, the resulting CEoptimal bandwidth choice corresponds to a DPI bandwidth selector, which is available only for odd (and, of course, robust biascorrected inference). For even, only the ROT CEoptimal bandwidth selector discussed below is currently available.
As discussed in the supplemental appendix of CCF, it is always possible to construct a simple and intuitive ROT implementation of the CEoptimal bandwidth selector by rescaling any MSEoptimal bandwidth choice:
with denoting the previously discussed MSEoptimal bandwidth for the point estimator . This alternative bandwidth choice is available for all , and is implemented using the DPI MSEoptimal bandwidth choice for with denoting the polynomial order used construct the local polynomial regression point estimate. See the supplemental appendix of CCF for more details, and an alternative ROT bandwidth choice that could be used in some specific cases.
3 Empirical Illustration
This section showcases some of the features of the package nprobust employing the canonical dataset of Efron and Feldman (1991), which corresponds to a randomized clinical trial evaluating the efficacy of cholestyramine for reducing cholesterol levels. The dataset used herein consists of five variables (t, chol1, chol2, cholf, comp) for observations in a placebocontrolled doubleblind clinical trial, with units assigned to treatment and units assigned to control. As mentioned previously, we focus exclusively on the R implementation of the package only for concreteness, but reiterate that the companion Stata implementation offers exact same features. In fact, we do provide a Stata dofile replicating all the results discussed in this section.
From within R, the latest version of nprobust can be installed from CRAN (The Comprehensive R Archive Network) using the following:
Additionally, the package help manual can be accessed via:
Once the package has been installed, it can be loaded by typing:
Efron and Feldman (1991) dataset includes five variables: t is an indicator of treatment assignment, comp is a measure of posttreatment compliance, chol1 and chol2 are two measures of pretreatment measurements of cholesterol, and cholf records the average of posttreatment measurements of cholesterol. Basic summary statistics on these variables are as follows:
We first employ kdrobust() with its default options to depict the distribution of the preintervention and postintervention variables. Recall that this command is only valid for interior points, because of the wellknown boundary bias in standard density estimation, so the grid of evaluation points needs to be restricted accordingly. As a first illustration, and to display the output in a parsimonious way, we use the function kdrobust() to estimate the density function of the pretreatment variable chol1 among control units over only seven evaluation points. The command and its summary output is as follows:
Naturally, as the number of evaluation points increases the output via the R function summary() increases as well. (In Stata there is an option to suppress output, as in that platform commands typically issue detailed output by default.) Notice that the number of evaluation points was specified, via the option neval = 7, but not their location. The alternative option eval controls the number and location of evaluation points: for example, specifying eval = 3 would result in estimation at only or specifying eval = seq(1,2,.1) would result in estimation at the eleven evaluation points . Unless the option h is specified, kdrobust() employs the companion function kdbwselect(), with its default option bwselect = "IMSEDPI", to choose a datadriven bandwidth; that is, unless the user specifies the bandwidth(s) manually, the default choice is an automatic DPI implementation of the IMSEoptimal bandwidth.
We do not discuss other details of the output above to conserve space, and mainly because the generic structure and underlying ideas will be discussed in detail further below when employing the function lprobust(); both commands and outputs have the exact same structure. The key outputs needed at this point are the density point estimates (under Point Est.) and their companion robust biascorrected confidence intervals (under Robust B.C. [95% C.I.]), since these ingredients are needed for plotting the estimated density functions.
To now plot and compare the density functions for the different variables and groups, we use the function kdrobust() multiple times across a larger number of evaluation points, and store the results accordingly:
Given the above information, we can easily plot and compare the estimated densities with the function kdrobust.plot(), giving the results reported in Figure 1:
In Figure 1, the density functions with red and grey confidence intervals correspond to the control and treatment groups, respectively. Not surprisingly, this figure confirms that chol1 and chol2 are preintervention variables: the have very similar distributions for control and treatment groups. In contrast, both posttreatment cholesterol and compliance exhibit important (and statistically significant) changes between the control and treatment groups.
The average (intentiontotreat) treatment effect is about cholesterol points and highly statistically significant. This effect amounts roughly to a reduction in cholesterol given the baseline measurements chol1 and chol2. In addition, the average treatment compliance is percentage points. However, these are average effects not accounting for observed (treatment effect) heterogeneity across preintervention cholesterol levels.
We now illustrate the main functions for local polynomial regression estimation in the package nprobust by analyzing heterogeneous treatment effects. The function lprobust() provides point estimates and robust confidence intervals employing local polynomial estimators, given a grid of evaluation points and a bandwidth choice. If the evaluation points are not provided, 30 equallyspaced points are chosen over the full support of the data. In this empirical illustration, however, there are a very few observations at the extreme values of the preintervention covariates chol1 and chol2 (see Figure 1(a)(b)), and therefore we restrict the support of analysis to , which also ensure a common support across both preintervention variables. The following commands estimate local polynomial regressions at evenlyspaced evaluation points over the restricted support using the (default) IMSEoptimal plugin bandwidth selector computed by lpbwselect():
As before, the results are stored and outputs are not displayed because of their length. They look exactly like the output presented above when using the function kdrobust(). For instance, using the command summary(m0_cholf_1) after the commands above results in an output including a local polynomial regression estimate of the outcome variable cholf given the preintervention covariate chol1 for the control group () over the evaluation generated by the R function seq(250,350,length.out=30). Here is an example of the output, restricted to the first seven evaluation points:
The first panel of the output provides basic information on the options specified in the function. The default estimand is the regression function, indicated by Order of derivative estimated (deriv) = 0. The main panel of the output gives estimation results: (i) eval is the grid of evaluation points; (ii) h is the bandwidth used for point estimation; (iii) Eff.n is the effective sample size (determined by h); (iv) Est. is the point estimate using polynomial order , that is, using the notation above; (iv) Std. Error is the standard error of the point estimate, that is, using the notation above; and (v) Robust B.C. [95% C.I.] is the robust biascorrected confidence interval, that is, using the notation above.
We discuss alternative choices for implementation, estimation, and inference further below, but first we plot the results obtained above using the selected grid of evaluation points. This is with the following commands leading to Figure 2:
The results obtained in Figure 2 not only illustrate some of the main features of the nprobust package, but also are substantively interesting. We find that the (intentiontotreat) treatment effect is heterogeneous in the sense that there are no statistical significant effects for observations with low levels of preintervention cholesterol, while there are strongly statistically significant effects for units with moderate to high levels of pretreatment cholesterol (subfigures (a) and (b)). Interestingly, the effects are constant when present. Furthermore, we also find heterogeneous treatment compliance where observations with higher levels of preintervention cholesterol tend to comply with treatment while units with low to moderate levels tend not to.
As mentioned previously, the default datadriven bandwidth implementation is IMSEDPI. However, the package nprobust includes other bandwidth selections, including CEoptimal ones as explained above. These alternative bandwidth choices can be selected directly for estimation and inference via the option bwselect in the function lprobust(), or they can all be obtained using the companion bandwidth selection function lpbwselect(). We illustrate the latter by reporting the output underlying the first seven evaluation points used for local polynomial regression estimation of the outcome variable cholf given the preintervention covariate chol1 for the control group () over the evaluation generated by the R function seq(250,350,length.out=30):
To close this section, we showcase all bandwidth selection methods available. So far all the bandwidths were obtained for the choice MSEDPI or IMSEDPI. The following output exhibits all possible bandwidths, including the recently introduced CEoptimal bandwidth choices using a DPI selector (CEDPI).
This output gives MSE, IMSE, and CE bandwidth selectors, implemented using either DPI or a ROT approach. The above bandwidth selection execution included the (default) option bwcheck=21. This option forces the command to select the minimum bandwidth such that at least observations are used at each evaluation point, which is useful to mitigate issues related to sparse data at certain evaluation points. We combined this option with the IMSEDPI bandwidth because it employs a larger grid of evaluation points to approximate the outer integrals present in its construction, and some of these grid points exhibited issues of (local) sparsity in this particular application.
In the upcoming section we illustrate how these bandwidth selection methods, together with the estimation and inference procedures discussed above, perform in simulations. We also show how they compare to other available R packages.
4 Simulations and Comparisons with Other R Packages
In this section we present the results from a simulation study addressing the finitesample performance of nprobust in comparison to other R packages implementing kernel smoothing regression methods. We consider four other packages: locfit (Loader 2013), locpol, LPridge, and np (Hayfield and Racine 2008); Table 1 gives a brief comparison of the main features available in each of the packages. Among the main differences summarized there, the most important one is related to inference: none of the other packages account for the firstorder asymptotic bias present in the distributional approximation when employing automatic MSEoptimal or similar bandwidth selection. In addition, none of the other packages employ preasymptotic quantities when constructing the estimators and test statistics. As we show below, these differences in implementation will lead to substantial empirical coverage distortions of confidence intervals when contrasted with the performance of nprobust. See the Appendix for a comparison of R syntax across these packages.
We compare the performance in finite samples of the five packages described in Table 1 using a simulation study. The data generating process was previously used by Fan and Gijbels (1996) and Cattaneo and Farrell (2013), and is given as follows:
We consider simulations, where for each replication the data is generated as i.i.d. draws of size . We investigate several performance measures for point estimation and inference, whenever available in each package. Specifically, we consider bias, variance, MSE, and empirical coverage and average interval length of nominal confidence intervals, over five equallyspaced values . The regression function and evaluation points considered are plotted in Figure 3, together with a realization of the data and regression function estimate (and confidence intervals) using nprobust. Whenever possible we employ the Epanechnikov kernel, but otherwise whatever is the default kernel in each package.
We focus on two types of comparisons across packages in terms of point estimation and inference performance: (i) using a fixed, MSEoptimal bandwidth choice, and (ii) using an estimated bandwidth. This allows to disentangle the effects of constructing point estimators and inference procedures (given a fixed, common choice of bandwidth across packages) from the convoluted effects of different point estimators and inference procedures together with different datadriven bandwidth selectors.
We organize the presentation of the results by choice of polynomial order () and derivative being estimated () in five tables as follows:

Table 2: local linear estimator () for the level of the regression function ();

Table 3: local constant estimator () for the level of the regression function ();

Table 4: local quadratic estimator () for the level of the regression function ();

Table 5: local linear estimator () for the derivative of the regression function ();

Table 6: local quadratic estimator () for the derivative of the regression function ().
Recall that and is the default for the package nprobust, and corresponds to the recommended odd case for estimating ; similarly, Table 6 presents the results for the recommended odd case for estimating . Each of these tables reports results on the performance of each of the four packages available in R in addition to the package nprobust, for each of the five evaluation points mentioned previously. For the case of package nprobust, we consider all three bandwidth methods: MSEoptimal, IMSEoptimal, and CEoptimal. All this information is organized by rows in each of the tables.
Each of the Tables 2–6 have two sets of columns labeled “Population Bandwidth” and “Estimated Bandwidth”, which correspond to simulation results using either a common MSEoptimal population bandwidth or a different estimated bandwidth as computed by each package. For each of the two groups of collumns, we report: (i) population or average estimated bandwidths under the label “”; (ii) the average simulation bias of the point estimator under the label “Bias”; (iii) the average simulation variance of the point estimator under the label “Var”; (iv) the average simulation MSE of the point estimator under the label “MSE”; (v) the average simulation empirical coverage of confidence intervals under the label “EC”; and (vi) the average simulation empirical interval length of confidence intervals under the label “IL”.
For brevity, here we discuss only the default case and (Table 2), but the results are qualitatively consistent across all tables. We found that nprobust offers a point estimator with similar MSE properties as other packages available, but much better inference performance. In most cases, the empirical coverage of confidence intervals are about correct for the package nprobust (across evaluation points and bandwidth selection methods), while other packages exhibit important coverage distortions. For example, for the evaluation point , which exhibits substantial local curvature in the regression function, the package np delivers nominal confidence intervals with an empirical coverage of when the population MSEoptimal bandwidth is used and of with an estimated (by crossvalidation) bandwidth. In contrast, the empirical coverage when using the package nprobust is about in the same setting, and using any of the population or estimated bandwidth procedures available in this package. These empirical findings are in line with the underlying theory and illustrate the advantages of employing robust bias correction methods for inference (Calonico et al. 2018).
The other Tables 3–6 show qualitatively similar empirical findings. Finally, to offer a more complete set of results we include Table 7, which reports the performance of all the bandwidth selectors available in the package nprobust. Some of these results were already reported in the previous tables, but Table 7 allows for an easier comparison across models and methods, in addition to including the performance of the ROT bandwidth selectors (not reported previously). The simulation results show that these bandwidth selectors offer good performance on average relative to their corresponding population target.
5 Conclusion
We gave an introduction to the software package nprobust, which offers an array of estimation and inference procedures for nonparametric kernelbased density and local polynomial methods. This package is available in both R and Stata statistical platforms, and further installation details and replication scripts can be found at https://sites.google.com/site/nprobust/. We also offered a comprehensive simulation study comparing the finite sample performance of the package visàvis other packages available in R for kernelbased nonparametric estimation and inference. In particular, we found that the package nprobust offers on par point estimation methods with superior performance in terms of inference.
6 Acknowledgments
We thank David Drukker, Yingjie Feng, Guido Imbens, Michael Jansson, Xinwei Ma, Jeffrey Racine, Rocio Titiunik, Gonzalo VazquezBare, and two anonymous reviewers for thoughtful comments and suggestions. Cattaneo gratefully acknowledges financial support from the National Science Foundation through grant SES1459931. Farrell gratefully acknowledges financial support from the Richard N. Rosett and John E. Jeuck Fellowships.
Appendix A Syntax Comparison across R Packages
For illustration and comparison purposes, we present the R syntax for all the packages considered to yield approximately the same point estimates. Data setup:
Syntax for locfit:
Syntax for locpol:
Syntax for lpridge:
Syntax for np:
Syntax for nprobust:
References
 Abadie and Imbens (2008) Abadie A, Imbens GW (2008). “Estimation of the Conditional Variance in Paired Experiments.” Annales d’Economie et de Statistique, (91/92), 175–187.
 Calonico et al. (2018) Calonico S, Cattaneo MD, Farrell MH (2018). “On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference.” Journal of the American Statistical Association, 113(522), 767–779.
 Calonico et al. (2019) Calonico S, Cattaneo MD, Farrell MH (2019). “Coverage Error Optimal Confidence Intervals for Local Polynomial Regression.” arXiv:1808.01398.
 Cattaneo et al. (2019a) Cattaneo MD, Crump RK, Farrell MH, Feng Y (2019a). “Binscatter Regressions.” arXiv:1902.09615.
 Cattaneo and Farrell (2013) Cattaneo MD, Farrell MH (2013). “Optimal Convergence Rates, Bahadur Representation, and Asymptotic Normality of Partitioning Estimators.” Journal of Econometrics, 174, 127–143.
 Cattaneo et al. (2019b) Cattaneo MD, Farrell MH, Feng Y (2019b). “lspartition: PartitioningBased Least Squares Regression.” Working paper.
 Cattaneo et al. (2019c) Cattaneo MD, Jansson M, Ma X (2019c). “lpdensity: Local Polynomial Density Estimation and Inference.” Working paper.
 Eddelbuettel and Balamuta (2017) Eddelbuettel D, Balamuta JJ (2017). “Extending R with C++: A Brief Introduction to Rcpp.” PeerJ Preprints, 5, e3188v1. ISSN 21679843. doi:10.7287/peerj.preprints.3188v1. URL https://doi.org/10.7287/peerj.preprints.3188v1.
 Efron and Feldman (1991) Efron B, Feldman D (1991). “Compliance as an Explanatory Variable in Clinical Trials.” Journal of the American Statistical Association, 86(413), 9–17.
 Fan and Gijbels (1996) Fan J, Gijbels I (1996). Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC, New York.
 Hayfield and Racine (2008) Hayfield T, Racine J (2008). “Nonparametric Econometrics: The np Package.” Journal of Statistical Software, 27(1), 1–32.
 Li and Racine (2007) Li Q, Racine S (2007). Nonparametric Econometrics. Princeton University Press, New Yersey.
 Loader (2013) Loader C (2013). locfit: Local Regression, Likelihood and Density Estimation. R package version 1.59.1, URL http://CRAN.Rproject.org/package=locfit.
 Long and Ervin (2000) Long JS, Ervin LH (2000). “Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model.” The American Statistician, 54(3), 217–224.
 MacKinnon (2012) MacKinnon JG (2012). “Thirty years of heteroskedasticityrobust inference.” In X Chen, NR Swanson (eds.), Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis. Springer.
 Muller and Stadtmuller (1987) Muller HG, Stadtmuller U (1987). “Estimation of Heteroscedasticity in Regression Analysis.” Annals of Statistics, 15(2), 610–625.
 Ruppert et al. (2009) Ruppert D, Wand MP, Carroll R (2009). Semiparametric Regression. Cambridge University Press, New York.
 Wand and Jones (1995) Wand M, Jones M (1995). Kernel Smoothing. Chapman & Hall/CRC, Florida.
 Wickham (2016) Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. SpringerVerlag New York. ISBN 9783319242774. URL https://ggplot2.tidyverse.org.
Package  Last  Polynomial  Derivative  BW Selection  Standard  Statistical 

Name  Update  Order  Order  Method ()  Errors  Inference 
locpol  20180524  NA  NA  NA  NA  
lpridge  20180612  NA  NA  NA  
locfit  20130420  NA  NN  HOM  US  
np  20181025  CV  HOM  US  
nprobust  IMSE/MSE/CE  HET  US/RBC 
Notes: (i) NA means that either the feature is not available or, at least, not documented; (ii) NN means Nearest Neighbor based estimation; (iii) CV means crossvalidation based; (iv) HOM means homoskedastic consistent standard errors; (v) HET means heteroskedastic consistent standard errors; (vi) US means undersmoothing inference (not valid when using CV/MSE/IMSE type bandwidth selection); and (vii) RBC means robust bias corrected inference (valid when using any MSE optimization type bandwidth selection). See the Appendix for a comparison of R syntax across these packages.
Population bandwidth  Estimated bandwidth  

Bias  Var  MSE  EC  IL  Bias  Var  MSE  EC  IL  
0  
locpol  0.347  0.164  0.026  0.053  1.000  3.917  0.130  0.018  0.085  0.085  1.000  3.830 
lpridge  0.347  0.112  0.023  0.036  0.887  0.601  na  na  na  na  na  na 
locfit  0.347  0.120  0.030  0.044  0.897  0.692  na  0.312  0.015  0.113  0.293  0.488 
np  0.347  0.115  0.012  0.025  0.635  0.309  0.057  0.025  0.079  0.079  0.692  0.564 
nprobust  
0.347  0.164  0.026  0.053  0.938  0.928  0.298  0.035  0.060  0.061  0.904  1.074  
0.234  0.063  0.039  0.043  0.929  1.131  0.182  0.030  0.053  0.054  0.925  1.281  
0.254  0.078  0.036  0.042  0.929  1.084  0.135  0.019  0.079  0.080  0.910  1.520  
0.25  
locpol  0.253  0.116  0.005  0.018  1.000  3.938  0.130  0.040  0.012  0.013  1.000  3.875 
lpridge  0.253  0.116  0.005  0.018  0.610  0.271  na  na  na  na  na  na 
locfit  0.253  0.091  0.006  0.014  0.773  0.296  na  0.130  0.004  0.021  0.457  0.246 
np  0.253  0.096  0.004  0.013  0.770  0.283  0.057  0.037  0.013  0.014  0.934  0.433 
nprobust  
0.253  0.116  0.005  0.018  0.942  0.389  0.163  0.059  0.007  0.011  0.938  0.488  
0.234  0.104  0.005  0.016  0.946  0.405  0.182  0.071  0.007  0.012  0.944  0.459  
0.185  0.073  0.006  0.012  0.945  0.455  0.169  0.055  0.009  0.012  0.926  0.491  
0.5  
locpol  0.175  0.178  0.007  0.039  1.000  3.929  0.130  0.106  0.014  0.025  1.000  3.890 
lpridge  0.175  0.178  0.007  0.039  0.419  0.326  na  na  na  na  na  na 
locfit  0.175  0.132  0.009  0.026  0.686  0.354  na  0.428  0.005  0.188  0.000  0.255 
np  0.175  0.632  0.004  0.403  0.000  0.256  0.057  0.100  0.015  0.025  0.762  0.406 
nprobust  
0.175  0.178  0.007  0.039  0.941  0.468  0.159  0.150  0.009  0.031  0.943  0.491  
0.234  0.295  0.005  0.092  0.931  0.405  0.182  0.192  0.008  0.045  0.941  0.459  
0.128  0.101  0.010  0.020  0.943  0.547  0.150  0.134  0.012  0.030  0.938  0.508  
0.75  
locpol  0.270  0.095  0.005  0.014  1.000  3.955  0.130  0.031  0.011  0.012  1.000  3.880 
lpridge  0.270  0.095  0.005  0.014  0.705  0.263  na  na  na  na  na  na 
locfit  0.270  0.078  0.005  0.011  0.810  0.287  na  0.088  0.004  0.012  0.706  0.246 
np  0.270  0.045  0.004  0.006  0.838  0.219  0.057  0.029  0.012  0.013  0.931  0.410 
nprobust  
0.270  0.095  0.005  0.014  0.938  0.380  0.159  0.045  0.007  0.009  0.947  0.494  
0.234  0.081  0.005  0.012  0.945  0.406  0.182  0.056  0.007  0.010  0.946  0.461  
0.198  0.064  0.006  0.010  0.947  0.442  0.169  0.043  0.008  0.010  0.933  0.493  
1  
locpol  0.491  0.238  0.020  0.076  1.000  4.108  0.130  0.001  0.085  0.085  1.000  3.836 
lpridge  0.491  0.203  0.018  0.059  0.651  0.509  na  na  na  na  na  na 
locfit  0.491  0.195  0.022  0.060  0.757  0.597  na  0.225  0.016  0.067  0.556  0.489 
np  0.491  0.613  0.011  0.387  0.000  0.225  0.057  0.006  0.079  0.079  0.701  0.564 
nprobust  
0.491  0.238  0.020  0.076  0.937  0.783  0.324  0.010  0.059  0.060  0.895  1.043  
0.234  0.042  0.040  0.042  0.936  1.136  0.182  0.013  0.052  0.052  0.930  1.287  
0.360  0.139  0.026  0.046  0.936  0.915  0.137  0.006  0.076  0.076  0.917  1.524 
Population bandwidth  Estimated bandwidth  

Bias  Var  MSE  EC  IL  Bias  Var  MSE  EC  IL  
0  
locpol  na  na  na  na  na  na  na  na  na  na  na  na 
lpridge  0.108  0.086  0.022  0.030  0.911  0.587  na  na  na  na  na  na 
locfit  na  na  na  na  na  na  na  na  na  na  na  na 
np  0.108  0.255  0.010  0.075  0.294  0.396  0.053  0.093  0.027  0.035  0.870  0.582 
nprobust  
0.108  0.086  0.022  0.030  0.927  1.122  0.070  0.048  0.042  0.044  0.919  1.436  