*The authors thank Alfonso Flores-Lagunes, Lutz Kilian, seminar participants at Bristol, Brown, Cambridge, Exeter, Indiana, LSE, Michigan, MSU, NYU, Princeton, Rutgers, Stanford, UCL, UCLA, UCSD, UC-Irvine, USC, Warwick and Yale, and conference participants at the 2010 Joint Statistical Meetings and the 2010 LACEA Impact Evaluation Network Conference for comments. The first author gratefully acknowledges financial support from the National Science Foundation (SES 1122994). The second author gratefully acknowledges financial support from the National Science Foundation (SES 1124174) and the research support of CREATES (funded by the Danish National Research Foundation). The third author gratefully acknowledges financial support from the National Science Foundation (SES 1132399).*

# Alternative Asymptotics and the Partially Linear Model with Many Regressors

## Abstract

Non-standard distributional approximations have received considerable attention in recent years. They often provide more accurate approximations in small samples, and theoretical improvements in some cases. This paper shows that the seemingly unrelated “many instruments asymptotics” and “small bandwidth asymptotics” share a common structure, where the object determining the limiting distribution is a V-statistic with a remainder that is an asymptotically normal degenerate U-statistic. We illustrate how this general structure can be used to derive new results by obtaining a new asymptotic distribution of a series estimator of the partially linear model when the number of terms in the series approximation possibly grows as fast as the sample size, which we call “many terms asymptotics”.

JEL classification: C13, C31.

Keywords: non-standard asymptotics, partially linear model, many terms, adjusted variance.

## 1 Introduction

Many instrument asymptotics, where the number of instruments grows as fast as the sample size, has proven useful for instrumental variables (IV) estimators. Kunitomo (1980) and Morimune (1983) derived asymptotic variances that are larger than the usual formulae when the number of instruments and sample size grow at the same rate, and Bekker (1994) and others provided consistent estimators of these larger variances. Hansen, Hausman, and Newey (2008) showed that using many instrument standard errors provides a theoretical improvement for a range of number of instruments and a practical improvement for estimating the returns to schooling. Thus, many instrument asymptotics and the associated standard errors have been demonstrated to be a useful alternative to the usual asymptotics for instrumental variables.

Instrumental variable estimators implicitly depend on a nonparametric series estimator. Many instrument asymptotics has the number of series terms growing so fast that the series estimator is not consistent. Analogous asymptotics for kernel-based density-weighted average derivative estimators has been considered by Cattaneo, Crump, and Jansson (2010, 2014b). They show that when the bandwidth shrinks faster than needed for consistency of the kernel estimator, the variance of the estimator is larger than the usual formula. They also find that correcting the variance provides an improvement over standard asymptotics for a range of bandwidths.

The purpose of this paper is to show that these results share a common structure, and to illustrate how this structure can be used to derive new results. The common structure is that the object determining the limiting distribution is a V-statistic, which can be decomposed into a bias term, a sample average, and a “remainder” that is an asymptotically normal degenerate U-statistic. Asymptotic normality of the remainder distinguishes this setting from other ones involving V-statistics. Here the asymptotically normal remainder comes from the number of series terms going to infinity or bandwidth shrinking to zero, while the behavior of a degenerate U-statistic tends to be more complicated in other settings. When the number of terms grows as fast as the sample size, or the bandwidth shrinks to zero at an appropriate rate, the remainder has the same magnitude as the leading term, resulting in an asymptotic variance larger than just the variance of the leading term. The many instrument and small bandwidth results share this structure. In keeping with this common structure, we will henceforth refer to such results under the general heading of “alternative asymptotics”.

The alternative asymptotics that we discuss in this paper applies to statistics that take a specific V-statistic representation, or may be approximated by it sufficiently accurately, and therefore it does not apply broadly to all possible semiparametric settings. Nonetheless, as we illustrate below, this structure arises naturally in several interesting problems in Economics and Statistics. In particular, we show formally that applying this common structure to a series estimator of the partially linear model leads to new results. These results allow the number of terms in the series approximation to grow as fast as the sample size. The asymptotic distribution of the estimator is derived and it is shown to have a larger asymptotic variance than the usual formula, which is in fact a natural and generic consequence of the specific structure that we highlight in this paper. We also find that under homoskedasticity, the classical degrees of freedom adjusted homoskedastic standard error estimator from linear models is consistent even when the number of terms is “large” relative to the sample size. This result offers a large sample, distribution free justification for the degrees of freedom correction when many series terms are employed. Constructing automatic consistent standard error estimator under (conditional) heteroskedasticity of unknown form in this setting turns out to be quite challenging. In Cattaneo, Jansson, and Newey (2015), we present a detailed discussion of heteroskedasticity-robust standard errors for general linear models with increasing dimension, which covers the partially linear model with many terms studied herein as a special case.

The rest of the paper is organized as follows. Section 2 describes the common structure of many instrument and small bandwidth asymptotics, and also shows how the structure leads to new results for the partially linear model. Section 3 formalizes the new distributional approximation for the partially linear model. Section 4 reports results from a small simulation study aimed to illustrate our results in small samples. Section 5 concludes. The appendix collects the proofs of our results.

## 2 A Common Structure

To describe the common structure of many instrument and small bandwidth asymptotics, let $W_1, \dots, W_n$ denote independent random vectors. We consider an estimator $\hat{\beta}$ of a generic parameter of interest $\beta_0 \in \mathbb{R}^d$ satisfying

$$\sqrt{n}(\hat{\beta} - \beta_0) = \hat{\Gamma}_n^{-1} S_n, \qquad S_n = \sum_{1 \le i,j \le n} u_{nij}(W_i, W_j), \qquad (1)$$

where $u_{nij}$ is a function that can depend on $n$, $i$, and $j$. We allow $u_{nij}$ to depend on $n$ to account for numbers of terms or bandwidths that change with the sample size. Also, we allow $u_{nij}$ to vary with $i$ and $j$ to account for dependence on variables that are being conditioned on in the asymptotics, and so treated as nonrandom.

We assume throughout this section that there exists a sequence of non-random matrices $\Gamma_n$ satisfying $\hat{\Gamma}_n^{-1}\Gamma_n \to_p I_d$ for $I_d$ the identity matrix, and hence we focus on the V-statistic $S_n$. (All limits are taken as $n \to \infty$ unless explicitly stated otherwise.) This V-statistic has a well known (Hoeffding-type) decomposition that we describe here because it is an essential feature of the common structure. For notational simplicity we will drop the $W_i$ and $W_j$ arguments and set $u_{nij} = u_{nij}(W_i, W_j)$ and $\tilde{u}_{nij} = u_{nij} + u_{nji} - E[u_{nij}] - E[u_{nji}]$.

Letting $\|\cdot\|$ denote the Euclidean norm, if $E[\|u_{nij}\|^2] < \infty$ for all $i$ and $j$, then

$$S_n = B_n + \Psi_n + U_n, \qquad (2)$$

where

$$B_n = \sum_{1 \le i,j \le n} E[u_{nij}], \qquad \Psi_n = \sum_{1 \le i \le n} \psi_{ni}(W_i), \qquad U_n = \sum_{1 \le j < i \le n} \left( \tilde{u}_{nij} - E[\tilde{u}_{nij}|W_i] - E[\tilde{u}_{nij}|W_j] \right),$$

and

$$\psi_{ni}(W_i) = u_{nii} - E[u_{nii}] + \sum_{1 \le j \le n, j \ne i} E[\tilde{u}_{nij}|W_i],$$

so that $E[\Psi_n] = 0$, $E[U_n] = 0$, and $E[S_n] = B_n$. This decomposition of a V-statistic is well known (e.g., van der Vaart (1998, Chapter 11)), and shows that $S_n$ can be decomposed into a sum of independent terms $\Psi_n$, a U-statistic remainder $U_n$ that is a martingale difference sum and uncorrelated with $\Psi_n$, and a pure bias term $B_n$. The decomposition is important in many of the proofs of asymptotic normality of semiparametric estimators, including Powell, Stock, and Stoker (1989), with the limiting distribution being determined by $\Psi_n$, and $U_n$ being treated as a “remainder” that is of smaller order under a particular restriction on the tuning parameter sequence (e.g., when the bandwidth shrinks slowly enough).
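Because (2) is an exact algebraic identity, it can be checked numerically. The sketch below (our own toy illustration, not from the paper) uses the scalar kernel $u_{nij} = W_iW_j/n$ with $W_i \sim N(\mu, \sigma^2)$, for which $E[\tilde{u}_{nij}|W_i] = 2\mu(W_i - \mu)/n$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 500, 2.0, 1.0
m2 = mu**2 + sigma**2                 # E[W^2] for W ~ N(mu, sigma^2)
W = rng.normal(mu, sigma, n)

# V-statistic S_n = sum_{i,j} u_nij with toy kernel u_nij = W_i W_j / n
S = W.sum()**2 / n

# Bias term B_n = sum_{i,j} E[u_nij]
B = (n * m2 + n * (n - 1) * mu**2) / n

# Projection term: psi_ni = u_nii - E[u_nii] + sum_{j != i} E[u~_nij | W_i]
Psi = ((W**2 - m2) / n + (n - 1) * 2 * mu * (W - mu) / n).sum()

# Degenerate U-statistic remainder: sum_{j < i} 2 (W_i - mu)(W_j - mu) / n
Wc = W - mu
U = (Wc.sum()**2 - (Wc**2).sum()) / n

print(S - (B + Psi + U))              # zero up to floating-point error
```

The three components are uncorrelated by construction, which is what makes the variance of $S_n$ split into the $\Psi_n$ and $U_n$ contributions discussed in the text.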

An interesting feature of the decomposition (2) in semiparametric settings is that $U_n$ is asymptotically normal at some rate when the number of series terms grows or the bandwidth shrinks to zero. To be specific, under regularity conditions and appropriate tuning parameter sequences that we make precise below, it turns out that

$$\begin{bmatrix} V[\Psi_n]^{-1/2}\Psi_n \\ V[U_n]^{-1/2}U_n \end{bmatrix} \to_d N(0, I_{2d}).$$

In other settings, where the underlying kernel of the U-statistic does not vary with the sample size, the asymptotic behavior of $U_n$ is usually more complicated: because it is a degenerate U-statistic, it would converge to a weighted sum of independent chi-square random variables (e.g., van der Vaart (1998, Chapter 12)). However, in semiparametric-type settings such as those considered here, the kernel of the underlying U-statistic forming $U_n$ changes with the sample size and hence, under particular tuning parameter configurations, the individual contributions to $U_n$ can be made small enough to satisfy a Lindeberg-Feller condition and thus deliver a Gaussian limiting distribution (usually employing the martingale property of $U_n$). For an interesting discussion of this phenomenon, see de Jong (1987). The asymptotic normality of $U_n$ has been established for certain classes of both series and kernel based estimators, as further explained below.

Alternative asymptotics occurs when the number of series terms grows or the bandwidth shrinks fast enough so that $\Psi_n$ and $U_n$ have the same magnitude in the limit. Because $\Psi_n$ and $U_n$ are uncorrelated, the asymptotic variance will be larger than the usual formula, which is $\lim_{n\to\infty} V[\Psi_n]$ (assuming the limit exists). As a consequence, consistent variance estimation under alternative asymptotics requires accounting for the contribution of $U_n$ to the (asymptotic) sampling variability of the statistic. Accounting for the presence of $U_n$ should also yield improvements when numbers of series terms and bandwidths do not satisfy the knife-edge conditions of alternative asymptotics, since $U_n$ is part of the semiparametric statistic. For instance, if the number of series terms grows just slightly slower than the sample size, then accounting for the presence of $U_n$ should still give a better large sample approximation. Hansen, Hausman, and Newey (2008) show such an improvement for many instrument asymptotics. It would be worthwhile to consider such improved approximations more generally, though doing so is beyond the scope of this paper.

Distribution theory under alternative asymptotics may be seen as a generalization of the conventional large sample distributional approximation in the sense that under conventional sequences of tuning parameters the asymptotic variances emerging from both approaches coincide. But the alternative asymptotic approximation also allows for other tuning parameter sequences and, in this case, the limiting asymptotic variance is seen to be larger than usual. Thus, in general, there is no reason to expect that the usual standard error formulas derived under conventional asymptotics will remain valid more generally. From this perspective, alternative asymptotics are useful for providing theoretical justification for new standard error formulas that are consistent under more general sequences of tuning parameters, that is, under both conventional and alternative asymptotics. We refer to the latter standard error formulas as being more robust than the usual standard error formulas available in the literature. For instance, using these ideas, the case for new, more robust standard error formulas was previously made for many instrument asymptotics in IV models (Hansen, Hausman, and Newey (2008)) and small bandwidth asymptotics in kernel-based semiparametrics (Cattaneo, Crump, and Jansson (2014b)).

To illustrate these ideas, we show next that both many instrument asymptotics and small bandwidth asymptotics have the structure described above, and we also employ this approach to derive new results in the case of a series estimator of the partially linear model, which we refer to as “many terms asymptotics”.

### Example 1: “Many Instrument Asymptotics”

The first example is concerned with the case of many instrument asymptotics. For simplicity we focus on the JIVE2 estimator of Angrist, Imbens, and Krueger (1999), but the idea applies to other IV estimators such as the limited information maximum likelihood estimator. See Chao, Swanson, Hausman, Newey, and Woutersen (2012) for more details, including regularity conditions under which the following discussion can be made rigorous.

Let $W_i = (y_i, x_i')'$, $i = 1, \dots, n$, be a random sample generated by the model

$$y_i = x_i'\beta_0 + \varepsilon_i, \qquad E[\varepsilon_i|z_i] = 0, \qquad (3)$$

where $y_i$ is a scalar dependent variable, $x_i$ is a $d$-dimensional vector of endogenous variables, $\varepsilon_i$ is a disturbance, and $z_i$ is a $K$-dimensional vector of instrumental variables.

To describe the JIVE2 estimator of $\beta_0$ in (3), let $Q_{ij}$ denote the $(i,j)$-th element of $Q = Z(Z'Z)^{-1}Z'$, where $Z = (z_1, \dots, z_n)'$. After centering and scaling, the JIVE2 estimator satisfies

$$\sqrt{n}(\hat{\beta} - \beta_0) = \Big( \frac{1}{n} \sum_{1 \le i,j \le n, j \ne i} Q_{ij} x_i x_j' \Big)^{-1} \Big( \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n, j \ne i} Q_{ij} x_i \varepsilon_j \Big).$$
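As a minimal sketch (our own illustrative design and variable names, not from the paper), the JIVE2 estimator can be computed by zeroing out the diagonal of the projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 5                              # illustrative sizes (assumptions)
Z = rng.normal(size=(n, K))                 # instruments
v = rng.normal(size=n)
eps = 0.8 * v + rng.normal(size=n)          # endogeneity: eps correlated with v
x = Z @ np.full(K, 0.5) + v                 # first stage (reduced form)
beta0 = 1.0
y = x * beta0 + eps

Q = Z @ np.linalg.solve(Z.T @ Z, Z.T)       # projection matrix Z(Z'Z)^{-1}Z'
Q0 = Q - np.diag(np.diag(Q))                # keep only the j != i terms

# JIVE2: (sum_{i!=j} Q_ij x_i x_j)^{-1} (sum_{i!=j} Q_ij x_i y_j), here with d = 1
beta_hat = (x @ Q0 @ y) / (x @ Q0 @ x)
print(beta_hat)                             # close to beta0 = 1.0
```

Deleting the own-observation ($i = j$) terms is exactly what removes the many-instrument bias that plagues two-stage least squares when $K$ is large.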

Conditional on $Z$, $\sqrt{n}(\hat{\beta} - \beta_0)$ has the structure in (1) with $\hat{\Gamma}_n = \frac{1}{n} \sum_{1 \le i,j \le n, j \ne i} Q_{ij} x_i x_j'$ and

$$u_{nij}(W_i, W_j) = 1(i \ne j) Q_{ij} x_i \varepsilon_j / \sqrt{n},$$

where $1(\cdot)$ is the indicator function.

Because of the indicator, $u_{nii}(W_i, W_i) = 0$. For $j \ne i$,

$$E[u_{nij}(W_i, W_j)|W_i, Z] = Q_{ij} x_i E[\varepsilon_j|Z]/\sqrt{n} = 0, \qquad E[u_{nji}(W_j, W_i)|W_i, Z] = Q_{ij} \Upsilon_j \varepsilon_i / \sqrt{n},$$

where $\Upsilon_j = E[x_j|Z]$ can be interpreted as the reduced form for observation $j$. As a consequence, (2) is satisfied with $B_n = 0$,

$$\psi_{ni}(W_i) = \Big( \sum_{1 \le j \le n, j \ne i} Q_{ij} \Upsilon_j \Big) \varepsilon_i / \sqrt{n} = \Upsilon_i (1 - Q_{ii}) \varepsilon_i / \sqrt{n} - \Big( \Upsilon_i - \sum_{1 \le j \le n} Q_{ij} \Upsilon_j \Big) \varepsilon_i / \sqrt{n},$$

and the martingale difference array $D_{ni}(W_i, \dots, W_1) = \sum_{1 \le j < i} Q_{ij} (v_i \varepsilon_j + v_j \varepsilon_i)/\sqrt{n}$, with $v_i = x_i - \Upsilon_i$, gives the representation $U_n = \sum_{2 \le i \le n} D_{ni}(W_i, \dots, W_1)$ used below.

Because $\Upsilon_i - \sum_{1 \le j \le n} Q_{ij} \Upsilon_j$ is the $i$-th residual from regressing the reduced form observations $\Upsilon_1, \dots, \Upsilon_n$ on $Z$, by appropriate definition of the reduced form this term can generally be assumed to vanish as the sample size grows. In that case,

$$\Psi_n = \frac{1}{\sqrt{n}} \sum_{1 \le i \le n} \Upsilon_i (1 - Q_{ii}) \varepsilon_i + o_p(1).$$

Furthermore, under standard asymptotics ($K/n \to 0$) the $Q_{ii}$ will go to zero on average, because $\sum_{1 \le i \le n} Q_{ii} = K$, so the limiting variance of the leading term in $\Psi_n$ corresponds to the usual asymptotic variance for IV. The degenerate U-statistic term is

$$U_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n, j < i} Q_{ij} (v_i \varepsilon_j + v_j \varepsilon_i), \qquad v_i = x_i - \Upsilon_i.$$

Chao, Swanson, Hausman, Newey, and Woutersen (2012) apply a martingale central limit theorem to show that this term will be asymptotically normal when $K \to \infty$ and certain regularity conditions hold. The conditions of the martingale central limit theorem are verified by showing that certain linear combinations with coefficients depending on the elements of $Q$ go to zero as $n \to \infty$. In the proof, this makes individual terms asymptotically negligible, with a Lindeberg-Feller condition being satisfied. Alternative asymptotics occurs when $K$ grows as fast as $n$, resulting in $\Psi_n$ and $U_n$ having the same magnitude in the limit.

### Example 2: “Small Bandwidth Asymptotics”

The second example shows that small bandwidth asymptotics for certain kernel-based semiparametric estimators also has the structure outlined above. To keep the exposition simple we focus on an estimator of the integrated squared density, but the structure of this estimator is shared by the density-weighted average derivative estimator of Powell, Stock, and Stoker (1989) treated in Cattaneo, Crump, and Jansson (2014b) and more generally by estimators of density-weighted averages and ratios thereof (see, e.g., Newey, Hsieh, and Robins (2004, Section 2) and references therein).

Suppose $x_i$, $i = 1, \dots, n$, are i.i.d. continuously distributed $p$-dimensional random vectors with smooth p.d.f. $f_0$ and consider estimation of the integrated squared density

$$\beta_0 = \int_{\mathbb{R}^p} f_0(x)^2 dx = E[f_0(x_i)].$$

A leave-one-out kernel-based estimator is

$$\hat{\beta} = \sum_{1 \le i,j \le n, i \ne j} K_h(x_i - x_j) / (n(n-1)),$$

where $K(\cdot)$ is a symmetric kernel and $K_h(u) = h^{-p} K(u/h)$. This estimator has the V-statistic form of (1) with $W_i = x_i$ and

$$\hat{\Gamma}_n = 1, \qquad u_{nij}(W_i, W_j) = 1(i \ne j)\{K_h(x_i - x_j) - \beta_0\} / (\sqrt{n}(n-1)).$$

Let $f_h(x) = E[K_h(x - x_i)]$ and $\beta_h = E[f_h(x_i)]$. By symmetry of $K$, for $j \ne i$,

$$E[u_{nij}(W_i, W_j)|W_i] = E[u_{nji}(W_j, W_i)|W_i] = \{f_h(x_i) - \beta_0\} / (\sqrt{n}(n-1)), \qquad E[u_{nij}(W_i, W_j)] = \{\beta_h - \beta_0\} / (\sqrt{n}(n-1)),$$

so the terms in the decomposition (2) are of the form

$$B_n = \sqrt{n}(\beta_h - \beta_0), \qquad \Psi_n = \frac{1}{\sqrt{n}} \sum_{1 \le i \le n} 2\{f_h(x_i) - \beta_h\}, \qquad U_n = \frac{2}{\sqrt{n}(n-1)} \sum_{1 \le i,j \le n, j < i} \{K_h(x_i - x_j) - f_h(x_i) - f_h(x_j) + \beta_h\}.$$

Here, $2\{f_h(x_i) - \beta_h\}$ is an approximation to the well known influence function $2\{f_0(x_i) - \beta_0\}$ for estimators of the integrated squared density. Under regularity conditions, $f_h$ converges to $f_0$ in mean square as $h \to 0$, so that

$$\Psi_n = \frac{1}{\sqrt{n}} \sum_{1 \le i \le n} 2\{f_0(x_i) - \beta_0\} + o_p(1).$$

A martingale central limit theorem can be applied as in Cattaneo, Crump, and Jansson (2014b) to show that the degenerate U-statistic term $U_n$ will be asymptotically normal as $n \to \infty$ and $h \to 0$, provided that $n^2 h^p \to \infty$. It is easy to show that $V[U_n] = O(1/(nh^p))$ (under regularity conditions). Alternative asymptotics occurs when $h^p$ shrinks as fast as $1/n$, resulting in $\Psi_n$ and $U_n$ having the same magnitude in the limit.
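The estimator itself is a few lines of code. In the sketch below (our own illustration, not from the paper) $p = 1$, $f_0$ is the standard normal density, and the target $\beta_0 = \int \phi(x)^2 dx = 1/(2\sqrt{\pi})$ is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 2000, 0.3                          # illustrative sample size and bandwidth
x = rng.normal(size=n)                    # p = 1, f0 = standard normal density
beta0 = 1 / (2 * np.sqrt(np.pi))          # integrated squared normal density

# Leave-one-out estimator with Gaussian kernel K_h(u) = phi(u/h)/h
diff = x[:, None] - x[None, :]
Kh = np.exp(-(diff / h) ** 2 / 2) / (h * np.sqrt(2 * np.pi))
np.fill_diagonal(Kh, 0.0)                 # drop the i = j terms
beta_hat = Kh.sum() / (n * (n - 1))
print(beta_hat, beta0)                    # estimate close to 1/(2*sqrt(pi))
```

Shrinking `h` toward zero in this sketch increases the contribution of the quadratic term $U_n$, which is the regime that small bandwidth asymptotics is designed to capture.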

### Example 3: “Many Terms Asymptotics”

The previous two examples show how several estimators share the common structure outlined above. To illustrate how this structure can be applied to derive new results, the third example studies series estimation in the context of the partially linear model. The results will shed light on the asymptotic behavior of this estimator, and the associated inference procedures, when the number of terms are allowed to grow as fast as the sample size.

Let $W_i = (y_i, x_i', z_i')'$, $i = 1, \dots, n$, be a random sample generated by the partially linear model

$$y_i = x_i'\beta_0 + g(z_i) + \varepsilon_i, \qquad E[\varepsilon_i|x_i, z_i] = 0, \qquad (4)$$

where $y_i$ is a scalar dependent variable, $x_i \in \mathbb{R}^d$ and $z_i$ are explanatory variables, $\varepsilon_i$ is a disturbance, $g$ is an unknown function, and $\Gamma = E[v_i v_i']$ with $v_i = x_i - E[x_i|z_i]$ is of full rank.

A series estimator of $\beta_0$ is obtained by regressing $y_i$ on $x_i$ and approximating functions of $z_i$. To describe the estimator, let $p_{1K}(z), \dots, p_{KK}(z)$ be approximating functions, such as polynomials or splines, and let $p^K(z) = (p_{1K}(z), \dots, p_{KK}(z))'$ be a $K$-dimensional vector of such functions. Letting $M_{ij}$ denote the $(i,j)$-th element of $M = I_n - P(P'P)^{-1}P'$, where $P = (p^K(z_1), \dots, p^K(z_n))'$, a series estimator of $\beta_0$ in (4) is given by

$$\hat{\beta} = \Big( \sum_{1 \le i,j \le n} M_{ij} x_i x_j' \Big)^{-1} \Big( \sum_{1 \le i,j \le n} M_{ij} x_i y_j \Big).$$
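For concreteness, a minimal simulation sketch of this estimator (our own choices of $g$, $h$, basis, and sample size; not the paper's design):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 1000, 8                                # illustrative sizes
z = rng.uniform(-1, 1, n)
g = np.exp(z)                                 # unknown g(z) (our choice)
h = np.sin(np.pi * z)                         # h(z) = E[x|z]  (our choice)
v = rng.normal(size=n)
eps = rng.normal(size=n)
x = h + v
beta0 = 1.0
y = x * beta0 + g + eps

# Power-series basis p^K(z) = (1, z, ..., z^{K-1}) and M = I - P(P'P)^{-1}P'
P = z[:, None] ** np.arange(K)
M = np.eye(n) - P @ np.linalg.solve(P.T @ P, P.T)

# beta_hat = (sum_{i,j} M_ij x_i x_j)^{-1} sum_{i,j} M_ij x_i y_j, here with d = 1
beta_hat = (x @ M @ y) / (x @ M @ x)
print(beta_hat)                               # close to beta0 = 1.0
```

The annihilator $M$ partials the series approximation to $g$ (and implicitly to $h$) out of both $y_i$ and $x_i$, so $\hat{\beta}$ is just OLS on the residualized variables.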

Donald and Newey (1994) gave conditions for asymptotic normality of this estimator using standard asymptotics. See also, for example, Linton (1995) and references therein for related asymptotic results when using kernel estimators.

Conditional on $Z = (z_1, \dots, z_n)'$, $\sqrt{n}(\hat{\beta} - \beta_0)$ has the structure outlined earlier:

$$\sqrt{n}(\hat{\beta} - \beta_0) = \hat{\Gamma}_n^{-1} S_n, \qquad (5)$$

with

$$\hat{\Gamma}_n = \frac{1}{n} \sum_{1 \le i,j \le n} M_{ij} x_i x_j', \qquad S_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n} x_i M_{ij} (g_j + \varepsilon_j),$$

where $g_j = g(z_j)$. In other words, $S_n$ has the V-statistic form of (1) with $W_i = (y_i, x_i')'$ and $u_{nij}(W_i, W_j) = x_i M_{ij} (g_j + \varepsilon_j)/\sqrt{n}$.

By $E[\varepsilon_i|x_i, z_i] = 0$ we have $E[u_{nij}(W_i, W_j)|Z] = M_{ij} h_i g_j / \sqrt{n}$. Therefore, letting $\tilde{u}_{nij} = u_{nij} + u_{nji} - E[u_{nij}|Z] - E[u_{nji}|Z]$ as we have done previously, we have

$$\tilde{u}_{nij} = M_{ij}(v_j g_i + v_i g_j + x_j \varepsilon_i + x_i \varepsilon_j)/\sqrt{n}, \qquad E[\tilde{u}_{nij}|W_i, Z] = M_{ij}(v_i g_j + h_j \varepsilon_i)/\sqrt{n},$$

for $j \ne i$, where $h_i = E[x_i|z_i]$ and $v_i = x_i - h_i$. In this case, the bias term in (2) is

$$B_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n} M_{ij} h_i g_j,$$

which will be negligible under regularity conditions, as shown in the next section. Moreover,

$$\Psi_n = \frac{1}{\sqrt{n}} \sum_{1 \le i \le n} M_{ii} v_i \varepsilon_i + R_n, \qquad R_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n} M_{ij} (v_i g_j + h_i \varepsilon_j),$$

where $R_n$ has mean zero and converges to zero in mean square as $n$ grows, as further discussed below. Under standard asymptotics ($K/n \to 0$) $M_{ii}$ will go to one on average and hence the limiting variance of the leading term in $\Psi_n$ corresponds to the usual asymptotic variance.

Finally, we find that the degenerate U-statistic term is

$$U_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n, j < i} M_{ij} (v_i \varepsilon_j + v_j \varepsilon_i).$$

Remarkably, this term is essentially the same as the degenerate U-statistic term for JIVE2 discussed above. Consequently, the central limit theorem of Chao, Swanson, Hausman, Newey, and Woutersen (2012) is applicable to this problem. We will employ it to show that $U_n$ is asymptotically normal as $K \to \infty$, even when $K/n$ does not converge to zero.

This example highlights a new approach to studying the asymptotic distribution of semi-linear regression under many terms asymptotics. The alternative asymptotic approximation is useful, for instance, when the number of covariates entering the nonparametric part is large relative to the sample size, as is often the case in empirical applications.

## 3 Many Terms Asymptotics

In this section we make precise the discussion given in Example 3, and also discuss consistent standard error estimation under homoskedasticity. The estimator described in Example 3 can be interpreted as a two-step semiparametric estimator with tuning parameter $K$, the first step involving series estimation of the unknown (regression) functions $g$ and $h$. Donald and Newey (1994) gave conditions for asymptotic normality of this estimator when $K/n \to 0$. Here we generalize their findings by obtaining an asymptotic distributional result that is valid even when $K/n$ is bounded away from zero.

The analysis proceeds under the following assumption.

Assumption PLM

(Partially Linear Model)

(a) $W_i = (y_i, x_i', z_i')'$, $i = 1, \dots, n$, is a random sample.

(b) There is a such that and .

(c) There is a such that and .

(d) (a.s.) and there is a such that .

(e) For some $\alpha_g > 0$ and $\alpha_h > 0$, there is a $C < \infty$ such that

$$\min_{\eta_g \in \mathbb{R}^K} E[|g(z_i) - \eta_g' p^K(z_i)|^2] \le C K^{-2\alpha_g}, \qquad \min_{\eta_h \in \mathbb{R}^{K \times d}} E[\|h(z_i) - \eta_h' p^K(z_i)\|^2] \le C K^{-2\alpha_h}.$$

Because $\sum_{1 \le i \le n} M_{ii} = n - \mathrm{rank}(P)$, an implication of part (d) is that $K \le n$, but crucially Assumption PLM does not imply that $K/n \to 0$. Part (e) is implied by conventional assumptions from approximation theory. For instance, when the support of $z_i$ is compact, commonly used bases of approximation, such as polynomials or splines, will satisfy this assumption with $\alpha_g = s_g/\dim(z_i)$ and $\alpha_h = s_h/\dim(z_i)$, where $s_g$ and $s_h$ denote the number of continuous derivatives of $g$ and $h$, respectively. Further discussion and related references for several bases of approximation may be found in Newey (1997), Chen (2007) and Belloni, Chernozhukov, Chetverikov, and Kato (2015), among others.

### 3.1 Asymptotic Distribution

From equation (5), and the discussion in the previous section, we see that the asymptotic distribution of $\sqrt{n}(\hat{\beta} - \beta_0)$ will be determined by the behavior of $\hat{\Gamma}_n$ and $S_n$. The following lemma approximates $\hat{\Gamma}_n$ without requiring that $K/n \to 0$.

###### Lemma 1

If Assumption PLM is satisfied and if $K \to \infty$, then $\hat{\Gamma}_n = \Gamma_n + o_p(1)$, where $\Gamma_n = \frac{1}{n} \sum_{1 \le i \le n} M_{ii} E[v_i v_i'|z_i]$.

Because $\sum_{1 \le i \le n} M_{ii} = n - K$, it follows from this result that in the homoskedastic case (i.e., when $E[v_i v_i'|z_i] = \Gamma$) $\hat{\Gamma}_n$ is close to

$$\Gamma_n = (1 - K/n)\Gamma, \qquad \Gamma = E[v_i v_i'],$$

in probability. More generally, with heteroskedasticity, $\hat{\Gamma}_n$ will be close to the weighted average $\frac{1}{n} \sum_{1 \le i \le n} M_{ii} E[v_i v_i'|z_i]$. Importantly, this result includes standard asymptotics as a special case: when $K/n \to 0$, using $\sum_{1 \le i \le n} (1 - M_{ii}) = K$, the law of large numbers and iterated expectations imply

$$\Gamma_n = \frac{1}{n} \sum_{i=1}^n E[v_i v_i'|z_i] - \frac{1}{n} \sum_{i=1}^n (1 - M_{ii}) E[v_i v_i'|z_i] = \frac{1}{n} \sum_{i=1}^n E[v_i v_i'|z_i] + o_p(1) = \Gamma + o_p(1).$$

Next, we study

$$S_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n} M_{ij} v_i \varepsilon_j + B_n + R_n.$$

The following lemma quantifies the magnitude of the bias term $B_n$ as well as the additional variability arising from the (remainder) term $R_n$.

###### Lemma 2

If Assumption PLM is satisfied and if $K \to \infty$, then $B_n = O_p(\sqrt{n} K^{-\alpha_g - \alpha_h})$ and $R_n = o_p(1)$.

Like the previous lemma, this lemma does not require $K/n \to 0$. Interestingly, the bias term $B_n$ involves the approximation of both unknown functions $g$ and $h$, implying an implicit trade-off between smoothness conditions for $g$ and $h$. The implied bias condition $\sqrt{n} K^{-\alpha_g - \alpha_h} \to 0$ only requires that $\alpha_g + \alpha_h$ be large enough, but not necessarily that $\alpha_g$ and $\alpha_h$ separately be large. It follows that if this bias condition holds, then

$$S_n = \frac{1}{\sqrt{n}} \sum_{1 \le i,j \le n} M_{ij} v_i \varepsilon_j + o_p(1),$$

as claimed in Example 3 above.

Having dispensed with asymptotically negligible contributions to $S_n$, we turn to its leading term. This term is shown below to be asymptotically Gaussian with (conditional) asymptotic variance given by

$$\Sigma_n = \frac{1}{n} V\Big[ \sum_{1 \le i,j \le n} M_{ij} v_i \varepsilon_j \,\Big|\, Z \Big] = \frac{1}{n} \sum_{1 \le i \le n} M_{ii}^2 E[v_i v_i' \varepsilon_i^2|z_i] + \frac{1}{n} \sum_{1 \le i,j \le n, j \ne i} M_{ij}^2 E[v_i v_i' \varepsilon_j^2|z_i, z_j].$$

Here, the first term following the second equality corresponds to the usual asymptotic approximation, while the second term accounts for large $K$. Once again it is interesting to consider what happens in some special cases. Under homoskedasticity of $\varepsilon_i$ (i.e., when $E[\varepsilon_i^2|x_i, z_i] = \sigma_\varepsilon^2$),

$$\Sigma_n = \frac{\sigma_\varepsilon^2}{n} \sum_{1 \le i,j \le n} M_{ij}^2 E[v_i v_i'|z_i] = \frac{\sigma_\varepsilon^2}{n} \sum_{1 \le i \le n} M_{ii} E[v_i v_i'|z_i] = \sigma_\varepsilon^2 \Gamma_n, \qquad \sigma_\varepsilon^2 = E[\varepsilon_i^2],$$

because $M$ is symmetric and idempotent, so that $\sum_{1 \le j \le n} M_{ij}^2 = M_{ii}$. If, in addition, $E[v_i v_i'|z_i] = \Gamma$, then $\Sigma_n = \sigma_\varepsilon^2 (1 - K/n)\Gamma$. Also, if $K/n \to 0$, then by $\sum_{1 \le i,j \le n, j \ne i} M_{ij}^2 \le K$ and the law of large numbers, we have

$$\Sigma_n = \frac{1}{n} \sum_{1 \le i \le n} M_{ii}^2 E[v_i v_i' \varepsilon_i^2|z_i] + o_p(1) = E[v_i v_i' \varepsilon_i^2] + o_p(1),$$

which corresponds to the standard asymptotics limiting variance.

The following theorem combines Lemmas 1 and 2 with a central limit theorem for quadratic forms to show asymptotic normality of $\sqrt{n}(\hat{\beta} - \beta_0)$.

###### Theorem 1

If Assumption PLM is satisfied and if $K \to \infty$, $\sqrt{n} K^{-\alpha_g - \alpha_h} \to 0$, and $\limsup_{n \to \infty} K/n < 1$, then

$$\Sigma_n^{-1/2} \Gamma_n \sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, I_d).$$

If, in addition, $E[\varepsilon_i^2|x_i, z_i] = \sigma_\varepsilon^2$ and $E[v_i v_i'|z_i] = \Gamma$, then $\Sigma_n^{-1/2} \Gamma_n = \sigma_\varepsilon^{-1} (1 - K/n)^{1/2} \Gamma^{1/2}$.

This theorem shows that $\sqrt{n}(\hat{\beta} - \beta_0)$ is asymptotically normal even when $K/n$ does not converge to zero. An implication of this result is that inconsistent series-based nonparametric estimators of the unknown functions $g$ and $h$ may be employed when forming $\hat{\beta}$, that is, $K/n \not\to 0$ is allowed (increasing the variability of the nonparametric estimators), provided that $\sqrt{n} K^{-\alpha_g - \alpha_h} \to 0$ (to remove nonparametric smoothing bias). This asymptotic distributional result does not rely on asymptotic linearity, nor on the actual convergence of the matrices $\Gamma_n$ and $\Sigma_n$, and leads to a new (larger) asymptotic variance that captures terms that are assumed away by the classical result. The asymptotic distribution result of Donald and Newey (1994) is obtained as a special case where $K/n \to 0$. More generally, when $K/n$ does not converge to zero, the asymptotic variance will be larger than the usual formula because it accounts for the contribution of the “remainder” $U_n$ in equation (2). For instance, when both $\varepsilon_i$ and $v_i$ are homoskedastic, the asymptotic variance is

$$\Gamma_n^{-1} \Sigma_n \Gamma_n^{-1} = \sigma_\varepsilon^2 \Gamma_n^{-1} = \sigma_\varepsilon^2 \Gamma^{-1} (1 - K/n)^{-1},$$

which is larger than the usual asymptotic variance $\sigma_\varepsilon^2 \Gamma^{-1}$ by the degrees of freedom correction $(1 - K/n)^{-1}$.

### 3.2 Asymptotic Variance Estimation under Homoskedasticity

Consistent asymptotic variance estimation is useful for large sample inference. If the assumptions of Theorem 1 are satisfied and if $\hat{\Sigma}_n = \Sigma_n + o_p(1)$, then

$$(\hat{\Gamma}_n^{-1} \hat{\Sigma}_n \hat{\Gamma}_n^{-1})^{-1/2} \sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, I_d),$$

implying that valid large-sample confidence intervals and hypothesis tests for linear and nonlinear transformations of the parameter vector $\beta_0$ can be based on $\hat{\Gamma}_n^{-1} \hat{\Sigma}_n \hat{\Gamma}_n^{-1}$. Under (conditional) heteroskedasticity of unknown form, constructing a consistent estimator $\hat{\Sigma}_n$ turns out to be very challenging if $K/n \not\to 0$. Intuitively, the problem arises because the estimated residuals entering the construction of $\hat{\Sigma}_n$ are not consistent unless $K/n \to 0$, implying that $\hat{\Sigma}_n \ne \Sigma_n + o_p(1)$ in general. Solving this problem is beyond the scope of this paper. Under homoskedasticity of $\varepsilon_i$, however, the asymptotic variance simplifies and admits a correspondingly simple consistent estimator. To describe this result, note that if $E[\varepsilon_i^2|x_i, z_i] = \sigma_\varepsilon^2$ then $\Sigma_n = \sigma_\varepsilon^2 \Gamma_n$, where $\hat{\Gamma}_n = \Gamma_n + o_p(1)$ by Lemma 1. It therefore suffices to find a consistent estimator of $\sigma_\varepsilon^2$. Let

$$s^2 = \frac{1}{n - d - K} \sum_{1 \le i \le n} \hat{\varepsilon}_i^2, \qquad \hat{\varepsilon}_i = \sum_{1 \le j \le n} M_{ij}(y_j - \hat{\beta}'x_j),$$

denote the usual OLS estimator of $\sigma_\varepsilon^2$ incorporating a degrees of freedom correction. The following theorem shows that $s^2$ is a consistent estimator, even when the number of terms $K$ is “large” relative to the sample size.

###### Theorem 2

Suppose the conditions of Theorem 1 are satisfied. If $E[\varepsilon_i^2|x_i, z_i] = \sigma_\varepsilon^2$, then $s^2 = \sigma_\varepsilon^2 + o_p(1)$ and $\hat{V}^{-1/2}\sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, I_d)$, where $\hat{V} = s^2 \hat{\Gamma}_n^{-1}$.

This theorem provides a distribution free, large sample justification for the degrees-of-freedom correction required for exact inference under homoskedastic Gaussian errors. Intuitively, accounting for the correct degrees of freedom is important whenever the number of terms in the semi-linear model is “large” relative to the sample size.
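To illustrate Theorem 2 numerically, the sketch below (our own DGP and basis choices, with $K/n = 0.4$) compares the naive variance estimator that divides by $n$ with the degrees-of-freedom corrected $s^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, d = 600, 240, 1                     # K/n = 0.4: "many terms" (illustrative)
z = rng.uniform(-1, 1, n)
x = np.sin(np.pi * z) + rng.normal(size=n)
y = x * 1.0 + np.exp(z) + rng.normal(size=n)   # sigma_eps^2 = 1

# Piecewise-constant (spline-type) basis: indicators of K bins of z
bins = np.clip((K * (z + 1) / 2).astype(int), 0, K - 1)
P = np.zeros((n, K)); P[np.arange(n), bins] = 1.0
P = P[:, P.sum(axis=0) > 0]               # drop empty bins
Keff = P.shape[1]
M = np.eye(n) - P @ np.linalg.solve(P.T @ P, P.T)

beta_hat = (x @ M @ y) / (x @ M @ x)
resid = M @ (y - x * beta_hat)            # hat(eps)_i = sum_j M_ij (y_j - x_j beta_hat)
s2_naive = (resid**2).sum() / n           # ignores the lost degrees of freedom
s2 = (resid**2).sum() / (n - d - Keff)    # degrees-of-freedom corrected
print(s2_naive, s2)                       # s2 close to 1; s2_naive biased downward
```

With $K/n = 0.4$, the naive estimator is biased downward by roughly the factor $1 - K/n$, while the corrected $s^2$ remains close to $\sigma_\varepsilon^2$, as the theorem predicts.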

## 4 Small Simulation Study

We conducted a Monte Carlo experiment to explore the extent to which the theoretical results obtained in the previous section are present in small samples. Using the notation already introduced, we consider the following partially linear model:

$$y_i = x_i'\beta + g(z_i) + \varepsilon_i, \qquad E[\varepsilon_i|x_i, z_i] = 0, \qquad E[\varepsilon_i^2|x_i, z_i] = \sigma_\varepsilon^2,$$
$$x_i = h(z_i) + v_i, \qquad E[v_i|z_i] = 0, \qquad E[v_i^2|z_i] = \sigma_v^2(z_i),$$

where the unknown regression functions $g$ and $h$ are not additively separable in the covariates $z_i$. The simulation study is based on independent replications, each taking a random sample of size $n$ with all random variables generated independently. We consider six data generating processes (DGPs) as follows:

Data Generating Process for Monte Carlo Experiment

| | Gaussian | Asymmetric | Bimodal |
|---|---|---|---|
| Homoskedastic | Model 1 | Model 3 | Model 5 |
| Heteroskedastic | Model 2 | Model 4 | Model 6 |
Specifically, Models 1, 3 and 5 correspond to homoskedastic (in $z_i$) DGPs, while Models 2, 4 and 6 correspond to heteroskedastic (in $z_i$) DGPs. For the latter models, the constant in $\sigma_v^2(\cdot)$ was chosen to normalize the unconditional variance of $v_i$. The three distributions considered for the unobserved error terms $\varepsilon_i$ and $v_i$ are: the standard Normal (labelled “Gaussian”) and two Mixtures of Normals inducing either an asymmetric or a bimodal distribution; their Lebesgue densities are depicted in Figure 1. We explored other specifications for the regression functions, heteroskedasticity form, and distributional assumptions, but we do not report these additional results because they were qualitatively similar to those discussed here.

The estimators considered in the Monte Carlo experiment are constructed using power series approximations. We do not impose additive separability on the basis, though we restrict the interaction terms to not exceed degree 5. To be specific, we consider the following polynomial basis expansion:

Polynomial Basis Expansion: own polynomial terms, first-order interactions, and second-order interactions.
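A generic sketch of how a power-series basis with bounded total degree can be built (our own illustration; the paper's exact basis, which handles own terms and interaction orders separately as described above, may differ):

```python
import numpy as np
from itertools import product

def power_series_basis(Z, max_degree):
    """One column z1^a1 * ... * zp^ap per multi-index with a1 + ... + ap <= max_degree."""
    n, p = Z.shape
    exponents = [a for a in product(range(max_degree + 1), repeat=p)
                 if sum(a) <= max_degree]
    exponents.sort(key=sum)                     # constant first, then degree 1, ...
    cols = [np.prod(Z ** np.array(a), axis=1) for a in exponents]
    return np.column_stack(cols), exponents

Z = np.random.default_rng(5).uniform(-1, 1, size=(100, 2))
P, expo = power_series_basis(Z, max_degree=5)
print(P.shape)      # (100, 21): C(5+2, 2) = 21 multi-indices of total degree <= 5
```

Restricting `max_degree` caps the interaction order, which is the device used above to keep the number of terms $K$ from exploding with the dimension of $z_i$.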