
On Assumption-Free Tests and Confidence Intervals for Causal Effects Estimated by Machine Learning

Lin Liu, Rajarshi Mukherjee, and James M. Robins
Abstract.

For many causal effect parameters $\psi$ of interest, doubly robust machine learning (DR-ML) estimators $\hat\psi_1$ (Chernozhukov et al., 2018a) are the state of the art, incorporating the benefits of the low prediction error of machine learning (ML) algorithms; the decreased bias of doubly robust estimators; and the analytic tractability and bias reduction of sample splitting with cross-fitting. Nonetheless, even in the absence of confounding by unmeasured factors, when the vector of potential confounders is high dimensional, the associated Wald confidence intervals may still undercover, even in large samples, because the bias of the estimator may be of the same or even larger order than its standard error of order $n^{-1/2}$.

In this paper, we introduce novel tests that (i) can have the power to detect whether the bias of $\hat\psi_1$ is of the same or even larger order than its standard error of order $n^{-1/2}$, (ii) can provide a lower confidence limit on the degree of undercoverage of the associated Wald interval, and (iii) strikingly, are valid under essentially no assumptions whatsoever. We also introduce an estimator $\hat\psi_2$ with bias generally less, and often much less, than that of $\hat\psi_1$, yet whose standard error is not much greater than $\hat\psi_1$'s. The tests, as well as the estimator $\hat\psi_2$, are based on a U-statistic $\widehat{\mathbb{IF}}_{22,k}$ that is the second-order influence function for the parameter that encodes the estimable part of the bias of $\hat\psi_1$. For the definition and theory of higher-order influence functions see Robins et al. (2008, 2017). When the covariance matrix of the potential confounders is known, $\widehat{\mathbb{IF}}_{22,k}$ is an unbiased estimator of its parameter. When the covariance matrix is unknown, we propose several novel estimators of the U-statistic that perform almost as well as in the known covariance case in simulation experiments.

These claims need to be tempered in several important ways. First, no test, including ours, of the null hypothesis that the ratio of the bias of $\hat\psi_1$ to its standard error is small can be consistent [without making additional assumptions (e.g. smoothness or sparsity) that may be incorrect]. Furthermore, the above claims apply only to parameters in a particular class. For the others, our results are unavoidably less sharp and require more careful interpretation.

Harvard University: linliu@fas.harvard.edu, ram521@mail.harvard.edu, robins@hsph.harvard.edu

Keywords. Higher-order influence functions, Assumption-free inference, Confidence intervals, Valid inference

1. Introduction

Valid inference (i.e. valid tests and confidence intervals) for causal effects is of importance in many subject matter areas. For example, in medicine it is critical to evaluate whether a non-null treatment effect estimate could differ from zero simply because of sampling variability and, conversely, whether a null treatment effect estimate is compatible with a clinically important effect.

In observational studies, control of confounding is a necessary condition for valid inference. Historically, and assuming no confounding by unmeasured covariates, two statistical approaches have been used to control confounding by potential measured confounders, both of which require – as a component – the building of non-causal purely predictive algorithms:

  • One approach builds an algorithm to predict the conditional mean of the outcome of interest given data on potential confounders and (usually) treatment (referred to as the outcome regression);

  • The other approach builds an algorithm to predict the conditional probability of treatment given data on potential confounders (referred to as the propensity score).

The validity of a nominal $1-\alpha$ Wald confidence interval at sample size $n$ for a parameter of interest $\psi$, centered at a particular estimator $\hat\psi$, quite generally requires that the bias of $\hat\psi$ be much less than its estimated standard error $\widehat{se}[\hat\psi]$. A nominal interval is said to be valid if the actual coverage rate under repeated sampling is greater than or equal to $1-\alpha$. Under either of the above approaches, obtaining estimators with small bias generally depends on good performance of the corresponding prediction algorithm. This has motivated the application of modern machine learning (ML) methods to these prediction problems for the following reason. When the vector of potential confounding factors is high-dimensional, as is now standard owing to the “big data revolution”, it has become clear that, in general, so-called machine learning algorithms (e.g. deep neural nets (Krizhevsky et al., 2012), support vector machines (Cortes and Vapnik, 1995), boosting (Freund and Schapire, 1997), regression trees and random forests (Breiman, 2001, 2017), etc.), especially when combined with algorithm selection by cross-validation, can do a much better job of prediction than traditional parametric or nonparametric approaches (e.g. kernel or orthogonal series regression). However, even the best machine learning methods may fail to give predictions that are sufficiently accurate to provide nearly unbiased causal effect estimates and, thus, may fail to control bias due to confounding.

To partially guard against this possibility, so-called doubly robust machine learning (DR-ML) estimators (Chernozhukov et al., 2018a) have been developed that can be nearly unbiased for the causal effect even when neither of the above prediction approaches alone is sufficiently accurate. DR-ML estimators employ ML estimators $\hat b$ and $\hat p$ of both the outcome regression $b$ and the propensity score $p$. DR-ML estimators are the current state of the art for estimation of causal effects, combining the benefits of sample splitting, machine learning, and double robustness (Scharfstein et al., 1999a, b; Robins and Rotnitzky, 2001; Bang and Robins, 2005). By sample splitting we mean that the data are randomly divided into two (or more) samples - the estimation sample and the training sample. The machine learning estimators $\hat b$ and $\hat p$ of the outcome regression and the propensity score are fit using the training sample data. The estimator $\hat\psi_1$ of our causal parameter of interest is computed from the estimation sample, treating the machine learning regression estimators as fixed functions. This approach is required because the ML estimates of the regression functions generally have unknown statistical properties and, in particular, may not lie in a so-called Donsker class - a condition often needed for valid inference when sample splitting is not employed. Under conditions given in Theorem 1.3 below, the efficiency lost due to sample splitting can be recovered by cross-fitting. The cross-fit estimator $\hat\psi_{1,cf}$ averages $\hat\psi_1$ with its ‘twin’ obtained by exchanging the roles of the estimation and training samples. In the semiparametric statistics literature, the possibility of using sample splitting with cross-fitting to avoid the need for Donsker conditions has a long history (Schick, 1986; van der Vaart, 1998, Page 391), although the idea of explicitly combining cross-fitting with machine learning was not emphasized until recently. Ayyagari's (2010) Ph.D. thesis (subsequently published as Robins et al. (2013)) and Zheng and van der Laan (2011) are early examples of papers that emphasized the theoretical and finite sample advantages of DR-ML estimators. See also Belloni et al. (2012).

However, even the use of DR-ML estimators is not guaranteed to provide valid inferences, owing to the possibility that the two ML prediction models are still not sufficiently accurate for the bias to be small compared to the standard error. In particular, if the bias of the DR-ML estimator is of the same (or greater) order as its standard error, the actual coverage of nominal $1-\alpha$ confidence intervals for the causal effect will be smaller (and often much smaller) than the claimed (i.e. nominal) level, thereby producing misleading inferences.

An important point is that the aforementioned ML prediction algorithms have largely unknown statistical properties and, in that sense, can be viewed as statistical black boxes. Those ML algorithms, such as random forests (Athey et al., 2016; Wager and Athey, 2018) or neural networks (Farrell et al., 2018), for which statistical properties have been proved are either very simplified versions of the algorithms actually used in practice and/or have their statistical properties proved under restrictive modeling assumptions (such as smoothness or sparsity) that may not hold in practice.

The discussion above leads to the following questions. Can $\alpha$-level tests be developed that have the ability to detect whether the bias of a DR-ML estimator $\hat\psi_1$ or $\hat\psi_{1,cf}$ is of the same or greater order as its standard error? In particular, can we provide a lower confidence bound on the degree of under-coverage of a nominal $1-\alpha$ confidence interval? If so, when such excess bias is detected, can we construct new estimators that are less biased? Furthermore, is it possible to construct such tests and estimators without (i) refitting, modifying, or even having knowledge of the ML algorithms that have been employed, and (ii) requiring any assumptions at all (aside from a few standard, quite weak assumptions given later) - in particular, without making any assumptions about the smoothness or sparsity of the true outcome regression or propensity score functions?

In this paper, we show that, perhaps surprisingly, the answer to these questions can be “yes” by using the higher-order influence function tests and estimators introduced in Robins et al. (2008, 2017) and Mukherjee et al. (2017).

However, this claim needs to be tempered in several important ways. First, the claims in the preceding paragraph only apply to parameters of interest that are in a particular class that we characterize in Section 2.1. For other parameters, our results have a similar flavor but are unavoidably less sharp and require more careful interpretation, as we discuss in Section 2.1. Second, as explained in Remark 1.1, there is an unavoidable limitation to what can be achieved with our or any other method: no test, including ours, of the null hypothesis that the bias of a DR-ML estimator is negligible compared to its standard error can be consistent [without making additional assumptions (e.g. smoothness or sparsity), possibly incorrect, about the true but unknown propensity score and outcome regression functions]. Thus, when our $\alpha$-level test rejects the null for $\alpha$ small, we can have strong evidence that the estimators $\hat\psi_1$ and $\hat\psi_{1,cf}$ have bias at least of the order of their standard errors; nonetheless, when the test does not reject, we cannot conclude that there is good evidence that the bias is less than the standard error, no matter how large the sample size. In fact, because we make (essentially) no assumptions whatsoever, we can never empirically rule out that the bias of $\hat\psi_1$ and $\hat\psi_{1,cf}$ is as large as order 1 and thus $n^{1/2}$ times greater than the standard error!

Put another way, because we make essentially no assumptions, no methodology can (non-trivially) upper bound the bias of any estimator or lower bound the coverage of any confidence interval. What is novel about our methodology is that it can reject the claim that a particular DR-ML estimator has bias small relative to its standard error under the actual law generating the data (without claiming that the bias-corrected estimator $\hat\psi_2$ has bias of less than order 1).

We now describe our approach at a high level. Throughout, we let $A$ denote the treatment indicator, $Y$ the outcome of interest, and $X$ the vector of potential confounders with compact support. Again let $\hat\psi_1$ denote a DR-ML estimator of a particular parameter $\psi$ and let $\hat\psi_1 \pm z_{\alpha/2}\,\widehat{se}[\hat\psi_1]$ denote the associated Wald confidence interval. In this paper, for didactic purposes only, we will choose $\psi$ to be (components of) the so-called variance-weighted average causal effect of a binary treatment $A$ on a binary outcome $Y$ given a vector $X$ of confounding variables. However, the methods developed herein can be applied essentially unchanged to many other causal effect parameters (e.g. the average treatment effect and the effect of treatment on the treated) regardless of the state spaces of $A$ and $Y$, as well as to many non-causal parameters. In fact, our results extend straightforwardly to the entire class of functionals with the mixed bias property of Rotnitzky et al. (2019a, b). This class of functionals strictly contains both the class of doubly robust functionals covered in Robins et al. (2008) and Chernozhukov et al. (2018b).

We assume that we have access to the data set used to obtain both the estimate $\hat\psi_1$ and the estimated regression functions $\hat b$ and $\hat p$ outputted by some ML prediction algorithms. We do not require any knowledge of, or access to, the ML algorithms used, other than the functions that they outputted.

All DR-ML estimators are based on the first-order influence function of the parameter (van der Vaart, 1998, 2002). Our proposed approach begins by computing a second-order influence function estimator, denoted $\widehat{\mathbb{IF}}_{22,k}$, of the estimable part of the conditional bias of $\hat\psi_1$ given the training sample data. Our bias-corrected estimator is $\hat\psi_2 = \hat\psi_1 - \widehat{\mathbb{IF}}_{22,k}$. The statistic $\widehat{\mathbb{IF}}_{22,k}$ is a second-order U-statistic that depends on a choice of $k$ (with $k$ less than the sample size $n$, for reasons explained in Remark 2.7), a vector $\bar z_k(X) = (z_1(X), \dots, z_k(X))^\top$ of basis functions of the high-dimensional vector $X$ of potential confounders, and an estimator $\widehat\Sigma_k^{-1}$ of the inverse expected outer product $\Sigma_k^{-1} = \{E[\bar z_k(X)\bar z_k(X)^\top]\}^{-1}$. Both $\widehat{\mathbb{IF}}_{22,k}$ and $\hat\psi_2$ will be asymptotically normal when, as in our asymptotic set-up, $k \to \infty$ as $n \to \infty$. (If $k$ did not increase with $n$, the asymptotic distribution of $\widehat{\mathbb{IF}}_{22,k}$ would be the so-called Gaussian chaos distribution (Rubin and Vitale, 1980).) Furthermore, the variances of $\hat\psi_1$ and $\widehat{\mathbb{IF}}_{22,k}$ are of order $1/n$ and $k/n^2$, respectively.
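As a concrete illustration (ours, not the authors' code), here is a minimal Python sketch of this U-statistic, assuming the residual form of $\hat\psi_1$ given in Section 1.2, fitted values `b_hat`, `p_hat` from the training-sample regressions evaluated at the estimation-sample $X_i$, an $n \times k$ basis matrix `z`, and a known $\Sigma_k^{-1}$:

import numpy as np

def if22(y, a, z, b_hat, p_hat, sigma_inv):
    """Sketch of the second-order U-statistic IF_22,k:
    (1/(n(n-1))) * sum_{i != j} {Y_i - b_hat(X_i)} z_i' Sigma^{-1} z_j {A_j - p_hat(X_j)}.
    y, a: outcome/treatment arrays; z: n x k basis matrix."""
    u = (y - b_hat)[:, None] * (z @ sigma_inv)   # rows: eps_i * z_i' Sigma^{-1}
    v = (a - p_hat)[:, None] * z                 # rows: delta_j * z_j
    n = len(y)
    total = u.sum(axis=0) @ v.sum(axis=0)        # sum over all (i, j) pairs
    diag = np.einsum('ij,ij->', u, v)            # the i == j terms, to be removed
    return (total - diag) / (n * (n - 1))

# Bias-corrected estimator: psi2 = psi1 - if22(y, a, z, b_hat, p_hat, sigma_inv)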

The degree of the bias corrected by $\widehat{\mathbb{IF}}_{22,k}$ depends critically on (i) the choice of $k$, (ii) the accuracy of the estimator $\widehat\Sigma_k^{-1}$ of $\Sigma_k^{-1}$, and (iii) the particular $k$-vector of (basis) functions selected from a much larger, possibly countably infinite, dictionary of candidate functions. Data-adaptive choices of $k$ and $\widehat\Sigma_k^{-1}$ are discussed in Appendix I. We have developed several data-adaptive methods to choose the basis functions that will be the subject of a separate paper.

Occasionally one has semi-supervised data available; that is, a data set in which the number of subjects with complete data on $(Y, A, X)$ is manyfold less than the number of subjects on whom only data on the covariates $X$ are available. In that case, assuming the subjects with complete data are effectively a random sample of all subjects, we can estimate $\Sigma_k$ by the empirical covariance matrix based on all subjects, and then treat $\Sigma_k$, and thus $\Sigma_k^{-1}$, as known in an analysis based on the subjects with complete data (Chapelle et al., 2010; Chakrabortty and Cai, 2018). Thus, with semi-supervised data, the discussion of the estimators $\widehat\Sigma_k^{-1}$ of Section 3 is irrelevant.
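In code, the semi-supervised simplification amounts to replacing the estimated $\widehat\Sigma_k^{-1}$ in the sketch above by an empirical version computed from all $N$ subjects (a sketch; `basis` is a hypothetical function mapping covariates to the $k$ basis functions):

import numpy as np

# x_all: covariates for all N subjects (labeled and unlabeled alike)
z_all = basis(x_all)                          # N x k basis matrix ('basis' is hypothetical)
sigma_hat = z_all.T @ z_all / z_all.shape[0]  # empirical E[z(X) z(X)^T] over all N subjects
sigma_inv = np.linalg.inv(sigma_hat)          # then treated as known for the complete cases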

For further motivation and before going into technical details, we now summarize some finite sample results from simulation studies that are described in full detail in Appendix K. Depending on the study, we either simulated 200 or 1000 estimation samples, each with sample size $n = 2500$. The same training sample, also of size 2500, and thus the same estimates of the regression functions, were used in each simulation study. Thus the results are conditional on that training sample. However, additional unconditional simulation studies, wherein we simulated both 200 estimation and training samples, reported in Table 11 and Table 12 of Section K.3, produced similar results. We took $k$ to be significantly less than the sample size $n$ for the following three reasons: $k < n$ is necessary (i) for bias-corrected confidence intervals (CIs) centered at $\hat\psi_2$ to have length approximately equal to that of CIs centered at $\hat\psi_1$; (ii) for $\widehat{\mathbb{IF}}_{22,k}$'s standard error of order $k^{1/2}/n$ to be smaller than the order $n^{-1/2}$ of the standard error of $\hat\psi_1$, thereby creating the possibility of detecting, for any given $\delta > 0$, that the absolute value of the ratio of the bias of $\hat\psi_1$ to its standard error exceeds $\delta$, provided the sample size is sufficiently large; and (iii) to be able to estimate $\Sigma_k^{-1}$ accurately without imposing the additional (possibly incorrect) smoothness or sparsity assumptions required for accurate estimation when $k \geq n$.

In all our simulation studies we chose a class of data generating processes for which the minimax rates of estimation were known, in order to be able to better evaluate the properties of our proposed procedures. Specifically, both the true propensity score and outcome regression functions in our simulation studies were chosen to lie in particular Hölder smoothness classes so as to ensure that the DR-ML estimator $\hat\psi_1$ had significant asymptotic bias. Furthermore, in most of our simulation studies, we estimated these regression functions using nonparametric kernel regression estimators that are known to attain the minimax optimal rate of convergence for these smoothness classes (Tsybakov, 2009), thereby guaranteeing that $\hat\psi_1$ performed nearly as well as any other DR-ML estimator. The basis functions were chosen to be particular Daubechies wavelets that Robins et al. (2008, 2009, 2017) showed to be minimax optimal for the chosen smoothness classes. Thus, in our simulations, we used optimal versions of both $\hat\psi_1$ and $\widehat{\mathbb{IF}}_{22,k}$ to ensure a fair comparison. [Out of interest, in Table 13 and Table 14 of Section K.3, we also report additional simulation results in which the propensity score and outcome regression were estimated with convolutional neural networks (Farrell et al., 2018). Qualitatively similar results were obtained.]

Table 1. Simulation results over 1000 estimation samples. In the left panel, $\widehat{\mathbb{IF}}_{22,k}$ and $\hat\psi_2$ are computed with the true $\Sigma_k^{-1}$; in the right panel, with the data-adaptive estimator $\widehat\Sigma_k^{-1}$. Within each panel we report the MCav of the point estimate of $\widehat{\mathbb{IF}}_{22,k}$ and of its estimated standard error, in parentheses (first column); the MC coverage of the nominal 90% Wald CI centered at $\hat\psi_2$ (second column); the MC bias of $\hat\psi_2$ and the MCav of its estimated standard error, in parentheses (third column); and the empirical rejection rate of the test based on $\widehat{\mathbb{IF}}_{22,k}$ (fourth column). The values of $k$ increase down the rows, with $k = 0$ corresponding to $\hat\psi_1$ itself. For complete details, see Section K.1.

      |                oracle $\Sigma_k^{-1}$               |             estimated $\widehat\Sigma_k^{-1}$
  k   | IF22 (se)        Cov.   Bias (se)       Rej.        | IF22 (se)        Cov.   Bias (se)       Rej.
  0   | 0 (0)            0      0.197 (0.034)   0           | 0 (0)            0      0.197 (0.034)   0
      | 0.0063 (0.0034)  0      0.191 (0.034)   0           | 0.0063 (0.0034)  0      0.191 (0.034)   0
      | 0.081 (0.013)    0.034  0.116 (0.036)   96.5%       | 0.085 (0.012)    0.094  0.112 (0.036)   97.5%
      | 0.094 (0.023)    0.146  0.103 (0.041)   97.4%       | 0.102 (0.021)    0.321  0.095 (0.040)   98.9%
      | 0.150 (0.037)    0.817  0.047 (0.050)   100%        | 0.119 (0.032)    0.735  0.078 (0.046)   99.2%
      | 0.191 (0.062)    0.988  0.006 (0.071)   99.9%       | 0.198 (0.057)    0.991  0.0012 (0.070)  96.0%

Table 1 reports results from one of the simulation studies to demonstrate the empirical behavior of $\widehat{\mathbb{IF}}_{22,k}$ and $\hat\psi_2$. We examined the empirical behavior of our data-adaptive estimator $\widehat\Sigma_k^{-1}$ of $\Sigma_k^{-1}$ as $k$ varies by comparing the estimators that use $\widehat\Sigma_k^{-1}$ to the oracle estimators that use the true inverse covariance matrix $\Sigma_k^{-1}$. The target parameter of this simulation study is in the class for which our results are sharpest, as explained later in Section 2.1.

Note the estimator $\hat\psi_1$ is included as the first row of Table 1 since, by definition, it equals $\hat\psi_2$ for $k = 0$. Also by definition, $\widehat{\mathbb{IF}}_{22,0}$ and its standard error are zero. As seen in row 1, column 2 of Table 1, nominal 90% Wald confidence intervals centered at $\hat\psi_1$ had empirical coverage of 0% in 1000 simulations! However, as seen in column 2 of both the left and right panels of the last row, 90% Wald confidence intervals centered at $\hat\psi_2$, computed with either $\Sigma_k^{-1}$ or $\widehat\Sigma_k^{-1}$, had empirical coverage exceeding 90%, even though their standard errors did not greatly exceed that of $\hat\psi_1$.

In more detail, the left panel of Table 1 displays the Monte Carlo average (MCav) of the point estimates and estimated standard errors (in parentheses) of $\widehat{\mathbb{IF}}_{22,k}$ in the first column; the empirical probability that a nominal 90% Wald confidence interval centered at $\hat\psi_2$ covered the true parameter value in the second column; the MC bias (i.e. the MCav of $\hat\psi_2 - \psi$) and MCav of the estimated standard errors of $\hat\psi_2$ in the third column; and, in the fourth column, the empirical rejection rate of a one-sided $\alpha$-level test (defined in eq. 2.6 of Section 2) of the null hypothesis that the bias of $\hat\psi_1$ is smaller than a given fraction $\delta$ of its standard error. The test rejects when the ratio of $\widehat{\mathbb{IF}}_{22,k}$ to its standard error is large. Similarly, the right panel displays these same summary statistics, but with the data-adaptive estimator $\widehat\Sigma_k^{-1}$ in place of $\Sigma_k^{-1}$. Thus, for example, the difference between the MC bias in the right and left panels is an estimate of the additional bias due to the estimation of $\Sigma_k^{-1}$ by $\widehat\Sigma_k^{-1}$. (The uncertainty in the estimate of the bias itself is not given in the table, but it is negligible as it approximately equals $1/\sqrt{1000} \approx 0.03$ times the standard error given in the table.)

Reading from the first row of Table 1, we see that the MC bias of $\hat\psi_1$ was 0.197. The MC bias of the oracle and data-adaptive versions of $\hat\psi_2$ decreased with increasing $k$, becoming nearly zero at the largest $k$. The observation that the bias decreases as $k$ increases is predicted by the theory developed in Section 2. The decrease in bias reflects the increase in the absolute value of $\widehat{\mathbb{IF}}_{22,k}$ with $k$. Note further that the MCavs of the oracle and data-adaptive $\widehat{\mathbb{IF}}_{22,k}$ are quite close, as are the MCavs of their estimated standard errors, implying that our estimator $\widehat\Sigma_k^{-1}$ performs similarly to the true $\Sigma_k^{-1}$. The actual coverages of 90% Wald confidence intervals centered at the two versions of $\hat\psi_2$ both increase from 0% at $k = 0$ to over 95% at the largest $k$. From simulation experiments, we show in Appendix H that the conservative coverage at the largest $k$ is due to the estimated standard errors of $\widehat{\mathbb{IF}}_{22,k}$ and $\hat\psi_2$ being somewhat upwardly biased. In Section H.1, we demonstrate that a bootstrap resampling method can empirically provide a better variance estimator, at additional computational cost, by computing bootstrapped versions of $\widehat{\mathbb{IF}}_{22,k}$. Also, reading from the third column, we see that the standard error (0.070) of $\hat\psi_2$ is only about 2 times the standard error (0.034) of $\hat\psi_1$, confirming that the dramatic difference in coverage rates of their associated confidence intervals is due to the bias of $\hat\psi_1$. Reading from the fourth column of each panel, we see that the rejection rates of both the oracle and data-adaptive tests are already above 96% even at small $k$, indicating that the bias of $\hat\psi_1$ is much greater than $\delta$ times its standard error. Indeed, reading from row 1 of column 3, we see that the ratio of the MCav 0.197 of the bias of $\hat\psi_1$ to the MCav 0.034 of its estimated standard error is nearly 6! In Remark 2.3 of Section 2, we show that this ratio is close to that predicted by theory.

Figure 3 of Section L.1 provides a histogram, over the 1000 estimation samples, of the upper confidence bounds (defined in eq. 2.9 of Section 2 for known $\Sigma_k^{-1}$ and in Section 3 for estimated $\widehat\Sigma_k^{-1}$) for the actual conditional asymptotic coverage of the nominal 90% Wald interval centered at $\hat\psi_1$. To clarify their meaning, define the conditional actual coverage of the nominal $1-\alpha$ interval, given the training sample, to be

$$\operatorname{cover}(\theta) \equiv P_\theta\Bigl(\psi(\theta) \in \hat\psi_1 \pm z_{\alpha/2}\,\widehat{se}[\hat\psi_1] \,\Big|\, \text{training sample}\Bigr).$$

Then, by definition, a conditional $1-\alpha$ upper confidence bound $\widehat{\mathrm{UCB}}$ for the coverage is a random variable satisfying

$$P_\theta\Bigl(\operatorname{cover}(\theta) \le \widehat{\mathrm{UCB}} \,\Big|\, \text{training sample}\Bigr) \ge 1-\alpha.$$

For example, if $\widehat{\mathrm{UCB}} = 0.14$ for $1-\alpha = 0.90$, then the actual coverage of the nominal 90% interval is no more than 14% with confidence at least 90%. More precisely, the random interval $[0, \widehat{\mathrm{UCB}}]$ is guaranteed to include the actual coverage at least 90% of the time in repeated sampling of the estimation sample with the training sample fixed. Recall from row 1, column 2 of the right panel of Table 1 that the actual Monte Carlo coverage of the nominal 90% interval was 0%. Hence our nominal 90% upper confidence bounds are trivially conservative. More interestingly, they were both very close to 0% (more precisely, less than 1%) in more than 15% of the 1000 simulated estimation samples.

Organization of the paper

The remainder of the paper is organized as follows. In Sections 1.1 to 1.3 we describe our data structure, our parameters of interest, the state-of-the-art DR-ML estimators $\hat\psi_1$ and $\hat\psi_{1,cf}$, and the statistical properties of these estimators.

In Section 2, we study the theoretical statistical properties of the oracle estimator $\widehat{\mathbb{IF}}_{22,k}$ (computed with the true $\Sigma_k^{-1}$) of the estimable part of the bias of the DR-ML estimator $\hat\psi_1$, the bias-corrected estimator $\hat\psi_2$, the $\alpha$-level oracle test, and the upper confidence bound for the actual conditional asymptotic coverage of the nominal Wald interval associated with $\hat\psi_1$.

In Section 3, we define several estimators of $\Sigma_k^{-1}$ and, for each, compare both the finite sample properties (through simulation) and aspects of the theoretical asymptotic behavior of the resulting $\widehat{\mathbb{IF}}_{22,k}$ to those of the oracle $\widehat{\mathbb{IF}}_{22,k}$. To choose among these candidate estimators, we construct a data-adaptive estimator $\widehat\Sigma_k^{-1}$ (see Appendix I) in which the candidate chosen depends on both $k$ and the data. This data-adaptive estimator was the one used in the simulation study reported in the right panel of Table 1.

Although motivated by asymptotic results, our data-adaptive estimator was ultimately chosen based on its finite sample performance in simulations. This choice reflects the fact that, for certain estimators with known theoretical properties, we found that, at the values of $k$ required by the oracle to correct most of the bias of $\hat\psi_1$, the Monte Carlo variance and mean of $\widehat{\mathbb{IF}}_{22,k}$ exploded, becoming orders of magnitude larger than predicted by the asymptotics. (A proposed explanation for this discrepancy is given in Appendix B.) For this reason, it is likely that future theoretical developments and simulation experiments may lead us to somewhat modify our data-adaptive estimator.

Section 4 considers the case where $k$, rather than being less than $n$, is allowed to exceed $n$. We leave the case of $k > n$ with $\Sigma_k^{-1}$ unknown to a separate paper, because estimation of $\Sigma_k^{-1}$ with $k > n$ requires additional assumptions that may not hold. The motivation for studying the case $k > n$ is as follows. As discussed above, the estimator $\hat\psi_2$ with $k$ less than but near $n$ has standard error not much larger than the standard error of $\hat\psi_1$, but can have much smaller bias. This suggests foregoing the estimation of an upper bound on the actual coverage of a nominal Wald confidence interval centered on $\hat\psi_1$ and, rather, always reporting, when $\Sigma_k^{-1}$ is known, the nominal Wald confidence interval centered on $\hat\psi_2$ with $k$ just less than $n$. However, doing so naturally raises the question of whether the interval centered on $\hat\psi_2$ covers at its nominal rate. As for $\hat\psi_1$, we examine this by testing the null hypothesis that the ratio of the bias of $\hat\psi_2$ to its standard error is smaller than a given fraction $\delta$. In Section 4.1, we construct such a test for any such $k$. In addition, we develop a sequential multiple testing procedure, sketched below, that tests this null hypothesis at level $\alpha$ for each of $M$ different values $k_1 < k_2 < \cdots < k_M$ of $k$. Our procedure tests the null hypothesis sequentially, beginning with $k_1$, stops at the first $k_m$ for which the test does not reject, and then accepts the null for that $k_m$. The sequential procedure protects the level of all tests (Rosenbaum, 2008). (Since all of our tests are inconsistent, for reasons described earlier, “accepting the null” simply means protecting the level of the test if, in fact, the null were indeed true.)
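A sketch of this sequential procedure (our illustration; `reject_at` stands for the supplied $\alpha$-level test of Section 4.1, treated here as a black box):

def first_accepted_k(ks, reject_at):
    """Sequentially test H0(k): the bias of the bias-corrected estimator at k
    is at most a fraction delta of its standard error, over an increasing
    grid ks = [k_1, ..., k_M].  Each test is performed only if all earlier
    tests rejected, which protects the overall level (Rosenbaum, 2008)."""
    for k in sorted(ks):
        if not reject_at(k):   # alpha-level test of H0(k)
            return k           # first non-rejection: accept H0 at this k, stop
    return None                # every H0(k) was rejected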

In Section 5, we will conclude by discussing several open problems.

The following common asymptotic notations are used throughout the paper: $a_n \lesssim b_n$ (equivalently $a_n = O(b_n)$) denotes that there exists some constant $C > 0$ such that $a_n \le C b_n$; $a_n \asymp b_n$ means there exist some constants $C, C' > 0$ such that $C b_n \le a_n \le C' b_n$; $a_n \ll b_n$ or $a_n = o(b_n)$ is equivalent to $\lim_{n} a_n / b_n = 0$. For a random variable $V_n$ with law possibly depending on the sample size $n$, $V_n = O_p(b_n)$ denotes that $V_n / b_n$ is bounded in probability, and $V_n = o_p(b_n)$ means that $P(|V_n / b_n| > \epsilon) \to 0$ for every positive real number $\epsilon$.

1.1. Parameter of interest

In this part we begin to make precise the issues discussed above. For didactic purposes, we will restrict our discussion to the variance-weighted average treatment effect (defined below) for a binary treatment $A$ and binary outcome $Y$ given a vector $X$ of baseline covariates. To perform inference, we shall have access to i.i.d. samples from the joint distribution of $O = (Y, A, X)$.

We parametrize the joint distribution of $O = (Y, A, X)$ by the variation-independent parameters $\theta = (b, p, f, \mathrm{OR})$, where

$$b(X) \equiv E_\theta[Y \mid X], \qquad p(X) \equiv E_\theta[A \mid X]$$

are respectively the regression of $Y$ on $X$ and the regression of $A$ on $X$, $f$ is the marginal density of $X$, and $\mathrm{OR}(X)$ is the conditional $Y$-$A$ odds ratio given $X$. Often, $p(X)$ is referred to as the propensity score. Throughout the paper, we use $E_\theta$, $\operatorname{var}_\theta$, and $\operatorname{cov}_\theta$ to indicate the expectation, variance, and covariance over the probability measure indexed by $\theta$. We assume a non-parametric infinite-dimensional model where $\theta$ ranges over all possibilities subject to weak regularity conditions given later.

Under the assumption that the vector $X$ of measured covariates suffices to control confounding, the variance-weighted average treatment effect is identified as

$$\psi_{vw} \equiv \frac{E_\theta[\operatorname{var}_\theta(A \mid X)\,\gamma(X)]}{E_\theta[\operatorname{var}_\theta(A \mid X)]},$$

where $\gamma(x) \equiv E_\theta[Y \mid A = 1, X = x] - E_\theta[Y \mid A = 0, X = x]$ is the conditional treatment effect given $X = x$.

Some algebra shows that

$$\psi_{vw} = \frac{E_\theta[\operatorname{cov}_\theta(Y, A \mid X)]}{E_\theta[\operatorname{var}_\theta(A \mid X)]}.$$
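The algebra, worked out for the binary-treatment case at hand (included here for completeness, using only the definitions above):

$$\operatorname{cov}_\theta(Y, A \mid X) = p(X)\,E_\theta[Y \mid A = 1, X] - p(X)\bigl\{p(X)\,E_\theta[Y \mid A = 1, X] + (1 - p(X))\,E_\theta[Y \mid A = 0, X]\bigr\} = p(X)\{1 - p(X)\}\,\gamma(X) = \operatorname{var}_\theta(A \mid X)\,\gamma(X),$$

so taking expectations of the numerator and the denominator of $\psi_{vw}$ gives the displayed ratio.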

Henceforth, we shall restrict attention to the estimation of

$$\psi \equiv \psi(\theta) \equiv E_\theta[\operatorname{cov}_\theta(Y, A \mid X)].$$

The denominator $E_\theta[\operatorname{var}_\theta(A \mid X)]$ of $\psi_{vw}$ is simply the special case of $\psi$ in which $Y = A$ w.p.1. If we can construct asymptotically unbiased and normal estimators of the numerator and the denominator of $\psi_{vw}$, we can also construct the same for $\psi_{vw}$ by the functional delta method.
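To spell out the delta-method step (the symbols $\hat\psi^{num}$ and $\hat\psi^{den}$ for the two estimators are ours, introduced only for this illustration): writing $\hat\psi_{vw} = \hat\psi^{num}/\hat\psi^{den}$, a first-order expansion of the ratio gives

$$\sqrt{n}\,(\hat\psi_{vw} - \psi_{vw}) = \frac{1}{\psi^{den}}\,\sqrt{n}\,(\hat\psi^{num} - \psi^{num}) - \frac{\psi^{num}}{(\psi^{den})^{2}}\,\sqrt{n}\,(\hat\psi^{den} - \psi^{den}) + o_p(1),$$

so joint asymptotic normality of the pair $(\hat\psi^{num}, \hat\psi^{den})$ transfers to $\hat\psi_{vw}$.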

We shall see that the statistical guarantees of our bias-correction methodology differ depending on whether the parameter of interest is $\psi$ or $\psi_{vw}$. In fact, the insight into our methodology offered by this difference is the reason we chose the variance-weighted average treatment effect, rather than the average treatment effect, as the causal effect of interest in this paper.

In the next section, we describe the current state-of-the-art DR-ML estimators $\hat\psi_1$ and $\hat\psi_{1,cf}$ of $\psi$. They will depend on estimators $\hat b$ and $\hat p$ of $b$ and $p$, which may have been outputted by machine learning algorithms for estimating conditional means, with completely unknown statistical properties.

Remark 1.1.

The methods in Robins et al. (2009) and Ritov et al. (2014) can be straightforwardly combined to show that, without further unverifiable assumptions, for some $\psi_0$, no consistent $\alpha$-level test of the null hypothesis $E_\theta[\operatorname{cov}_\theta(Y, A \mid X)] = \psi_0$ versus the alternative $E_\theta[\operatorname{cov}_\theta(Y, A \mid X)] \neq \psi_0$ exists, whenever some components of $X$ have a continuous distribution. See Section 4.1 for a heuristic non-technical explanation. An analogous result holds for $\psi_{vw}$.

1.2. State-of-the-art estimators $\hat\psi_1$ and $\hat\psi_{1,cf}$ and their asymptotic properties

The state-of-the-art DR-ML estimator $\hat\psi_1$ uses sample splitting, because $\hat b$ and $\hat p$ have unknown statistical properties and, in particular, may not lie in a so-called Donsker class (see e.g. van der Vaart and Wellner (1996, Chapter 2)) - a condition often needed for valid inference when we do not split the sample. The cross-fit estimator $\hat\psi_{1,cf}$ is a DR-ML estimator that can recover the information lost by $\hat\psi_1$ due to sample splitting, provided that $\hat\psi_1$ is asymptotically unbiased.

The following algorithm defines $\hat\psi_1$ and $\hat\psi_{1,cf}$:

  • The study subjects are randomly split into 2 parts: an estimation sample of size $n$ and a training (nuisance) sample of size $N - n$, with $N$ the total sample size. Without loss of generality, we shall assume that subjects $i = 1, \dots, n$ constitute the estimation sample.

  • Estimators $\hat b$ and $\hat p$ of $b$ and $p$ are constructed from the training sample data using ML methods.

  • Compute

    $$\hat\psi_1 = \frac{1}{n}\sum_{i=1}^{n}\{Y_i - \hat b(X_i)\}\{A_i - \hat p(X_i)\}$$

    from the subjects in the estimation sample and

    $$\hat\psi_{1,cf} = \frac{1}{2}\bigl(\hat\psi_1 + \tilde\psi_1\bigr),$$

    where $\tilde\psi_1$ is $\hat\psi_1$ but with the training and estimation samples reversed.
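A minimal runnable sketch of this algorithm (ours, for illustration; `fit_b` and `fit_p` stand in for arbitrary black-box ML regression algorithms returning prediction functions):

import numpy as np

def dr_ml_crossfit(y, a, x, fit_b, fit_p, rng=np.random.default_rng(0)):
    """Sample-split and cross-fit DR-ML estimation of psi = E[cov(Y, A | X)].
    fit_b, fit_p: callables (x_train, targets) -> prediction function."""
    idx = rng.permutation(len(y))
    half1, half2 = idx[: len(y) // 2], idx[len(y) // 2 :]
    psis = []
    for est, tr in ((half1, half2), (half2, half1)):  # second pass is the 'twin'
        b_hat = fit_b(x[tr], y[tr])    # outcome regression fit on training half
        p_hat = fit_p(x[tr], a[tr])    # propensity score fit on training half
        psis.append(np.mean((y[est] - b_hat(x[est])) * (a[est] - p_hat(x[est]))))
    return psis[0], 0.5 * (psis[0] + psis[1])  # (psi1_hat, cross-fit estimator)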

1.3. Asymptotic properties of $\hat\psi_1$ and $\hat\psi_{1,cf}$

The following theorem (Theorem 1.2) gives the asymptotic properties of $\hat\psi_1$, conditional on the training sample.

Theorem 1.2.

Conditional on the training sample, $\hat\psi_1$ is asymptotically normal with conditional bias

$$\operatorname{Bias}_\theta(\hat\psi_1) \equiv E_\theta\bigl[\hat\psi_1 - \psi(\theta) \mid \hat b, \hat p\bigr] = E_\theta\bigl[\{b(X) - \hat b(X)\}\{p(X) - \hat p(X)\}\bigr], \tag{1.1}$$

where $(\hat b, \hat p)$ to the right of the conditioning bar indicates conditioning on the training sample through the estimated components of $\theta$.

Proof.

Since, conditionally, $\hat b$ and $\hat p$ are fixed functions, $\hat\psi_1$ is the sum of i.i.d. bounded random variables and thus is asymptotically normal. A straightforward calculation shows that the right-hand side of eq. (1.1) is the conditional bias. ∎
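For the reader's convenience, a worked version of that calculation (using only iterated expectations and the definitions of $b$ and $p$): conditionally on the training sample,

$$E_\theta\bigl[\{Y - \hat b(X)\}\{A - \hat p(X)\} \mid X\bigr] = \operatorname{cov}_\theta(Y, A \mid X) + \{b(X) - \hat b(X)\}\{p(X) - \hat p(X)\},$$

so averaging over $X$ and subtracting $\psi(\theta) = E_\theta[\operatorname{cov}_\theta(Y, A \mid X)]$ yields eq. (1.1). The double robustness noted next is then immediate, since the product term vanishes if either $\hat b = b$ or $\hat p = p$ with $\theta$-probability 1.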

We note that $\hat\psi_1$ is, by definition, doubly robust (Bang and Robins, 2005) because $\operatorname{Bias}_\theta(\hat\psi_1) = 0$ if either $\hat b = b$ or $\hat p = p$ with $\theta$-probability 1. Finally, before proceeding, we summarize the unconditional statistical properties of the DR-ML estimator $\hat\psi_{1,cf}$ in the following theorem (Theorem 1.3). Recall that $\operatorname{Bias}_\theta(\hat\psi_1)$ is random only through its dependence on the training sample data via $\hat b$ and $\hat p$.

Theorem 1.3.

If a) $\operatorname{Bias}_\theta(\hat\psi_1)$ is $o_p(n^{-1/2})$ and b) $\hat b$ and $\hat p$ converge to $b$ and $p$ in $L_2(\theta)$, then

  1. $\{Y - b(X)\}\{A - p(X)\} - \psi(\theta)$ is the first-order influence function of $\psi$. Further, $n^{1/2}\{\hat\psi_1 - \psi(\theta)\}$ converges conditionally and unconditionally to a normal distribution with mean zero; $\hat\psi_{1,cf}$ is a regular, asymptotically linear estimator; i.e., $N^{1/2}\{\hat\psi_{1,cf} - \psi(\theta)\}$ converges to a normal distribution with mean zero and variance equal to the semiparametric variance bound (Newey, 1990).

  2. The nominal $1-\alpha$ Wald confidence intervals (CIs)

    $$\hat\psi_1 \pm z_{\alpha/2}\,\widehat{se}[\hat\psi_1], \qquad \hat\psi_{1,cf} \pm z_{\alpha/2}\,\widehat{se}[\hat\psi_{1,cf}]$$

    are asymptotic $1-\alpha$ CIs for $\psi$. Here $\widehat{se}[\hat\psi_1] = \hat\sigma/n^{1/2}$ with $\hat\sigma^2$ the empirical variance of $\{Y_i - \hat b(X_i)\}\{A_i - \hat p(X_i)\}$ in the estimation sample.

Proof.

The proof of Theorem 1.3 is standard; see, for example, Chernozhukov et al. (2018a). ∎

Remark 1.4.

We could have divided the whole sample of size $N$ into $S$ random subsamples of equal size, say $S = 5$, with $\hat\psi_1^{(s)}$, $s = 1, \dots, S$, calculated using subsample $s$ as the estimation sample and the remaining $S - 1$ subsamples as the training sample. Then $\hat\psi_{1,cf} = S^{-1}\sum_{s=1}^{S}\hat\psi_1^{(s)}$. When, for each $s$, the conditional bias of $\hat\psi_1^{(s)}$ is of smaller order than its standard error, the asymptotic distribution of $\hat\psi_{1,cf}$ does not depend on $S$. Nonetheless, an $S$ of 5 or 10 generally performs better than $S = 2$ in finite samples. We restrict to $S = 2$ for notational convenience only.

Remark 1.5.

Had we chosen the average treatment effect, rather than the variance-weighted average treatment effect, as our parameter of interest, the outcome regression function appearing in the first-order influence function would be $E_\theta[Y \mid A = a, X]$ for $a \in \{0, 1\}$ rather than $b(X) = E_\theta[Y \mid X]$.

Remark 1.6 (Training sample squared error loss cross-validation).

How can we choose among the many (say, $J$) available machine learning algorithms if our goal is to minimize the conditional mean squared error of, say, the estimator $\hat b$ of $b$? One approach is to let the data decide by applying cross-validation restricted to the training sample, as in the sketch following this remark. Specifically, we randomly split the training sample into $V$ subsamples of equal size. For each subsample $v$, we fit the $J$ ML algorithms to the other $V - 1$ subsamples to obtain outputs $\hat b_{j,-v}$ for $j = 1, \dots, J$. Next we calculate, for each $j$, the squared error loss $\mathcal{L}_j = \sum_{v=1}^{V}\sum_{i \in v}\{Y_i - \hat b_{j,-v}(X_i)\}^2$, and finally select the ML algorithm $\hat{j} = \arg\min_j \mathcal{L}_j$.
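In code, this selection rule might look as follows (a sketch; each entry of `learners` is assumed to be a callable `(x_fit, y_fit) -> prediction function`):

import numpy as np

def select_learner(x, y, learners, n_folds=5, rng=np.random.default_rng(0)):
    """Training-sample cross-validation of Remark 1.6: return the index of
    the learner with the smallest held-out squared-error loss."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    loss = np.zeros(len(learners))
    for v in folds:
        keep = np.setdiff1d(np.arange(len(y)), v)      # the other V-1 subsamples
        for j, fit in enumerate(learners):
            b_hat = fit(x[keep], y[keep])              # fit on V-1 folds
            loss[j] += np.sum((y[v] - b_hat(x[v]))**2) # held-out squared error
    return int(np.argmin(loss))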

Although a standard result, Theorem 1.3 is of minor interest to us in this paper for several reasons. First, because of their asymptotic nature, there is no finite sample size at which any test could empirically reject either the hypothesis that $\operatorname{Bias}_\theta(\hat\psi_1)$ is $o_p(n^{-1/2})$ or the hypothesis that $n^{1/2}\{\hat\psi_1 - \psi(\theta)\}$ is asymptotically normal with mean zero. Rather, as discussed in Section 1, our interest instead lies in testing and rejecting hypotheses such as: at the actual sample size $n$, the actual asymptotic coverage of the interval $\hat\psi_1 \pm z_{\alpha/2}\,\widehat{se}[\hat\psi_1]$, conditional on the training sample, is less than a fraction of its nominal coverage. Second, since the ML estimators $\hat b$ and $\hat p$ have unknown statistical properties and we are interested in essentially assumption-free inference, our inferential statements regard the training sample as fixed rather than random. In particular (with the exception of Section 3), the only randomness referred to in any theorem is that of the estimation sample. In fact, our inferences rely on being in ‘asymptopia’ only to be able to posit that, at our sample size $n$, the quantiles of the finite sample distribution of a conditionally asymptotically normal statistic (e.g. $\hat\psi_2$) are close to the quantiles of a normal. Indeed, by using finite sample tail inequalities, our reliance on asymptotics could be totally eliminated at the expense of decreased power and increased confidence interval width.

Remark 1.7.

In Section K.3, we report results of an unconditional simulation study (i.e., both the training and estimation samples were redrawn 200 times). The study gave similar results to ones in which the initial training sample was held fixed and only the estimation sample redrawn. This similarity only serves as evidence that the initial training sample was not an outlier. It provides no evidence concerning the order in probability of the conditional bias $\operatorname{Bias}_\theta(\hat\psi_1)$.

Before starting to explain our methodology in detail, we collect some frequently used notations.

Notations

For a (random) vector $v$, $\|v\|$ denotes its $L_2(\theta)$ norm conditioning on the training sample, $\|v\|_1$ denotes its $L_1$ norm, and $\|v\|_\infty$ denotes its $L_\infty$ norm. For any matrix $M$, $\|M\|_{op}$ will be used for its operator norm. Given a $k$, the random vector $\bar z_k(X)$ of basis functions defines the population linear projection operator $\Pi[\,\cdot \mid \bar z_k]$ onto the space spanned by $\bar z_k(X)$, conditioning on the training sample: with $\Sigma_k \equiv E[\bar z_k(X)\bar z_k(X)^\top]$, $\Pi^{\perp}[\,\cdot \mid \bar z_k]$ is the projection onto the orthogonal complement of that space in the Hilbert space $L_2(F_X)$. Hence, for a random variable $W$,

$$\Pi[W \mid \bar z_k](X) = E[W\,\bar z_k(X)]^\top\,\Sigma_k^{-1}\,\bar z_k(X), \qquad \Pi^{\perp}[W \mid \bar z_k] = W - \Pi[W \mid \bar z_k].$$

By defining the projection conditionally, we can allow the vector $\bar z_k$ to depend on the training sample data. $\widehat\Sigma_k^{-1}$ denotes a generic estimator of $\Sigma_k^{-1}$. When referring to a particular estimator of $\Sigma_k^{-1}$, an identifying superscript will often be attached.

We also denote the following commonly used residuals: $\hat\varepsilon_{Y,i} \equiv Y_i - \hat b(X_i)$, $\hat\varepsilon_{A,i} \equiv A_i - \hat p(X_i)$, $\varepsilon_{Y,i} \equiv Y_i - b(X_i)$, and $\varepsilon_{A,i} \equiv A_i - p(X_i)$ for $i = 1, \dots, n$, where $\hat b$ and $\hat p$ are estimated from the training sample.