# Inference in Additively Separable Models with a High Dimensional Set of Conditioning Variables

Damian Kozbur
First version: September 2013. This version is of July 5, 2019. Correspondence: Haldeneggsteig 4, 8092 Zürich, Center for Law and Economics, D-GESS, ETH Zürich, damian.kozbur@gess.ethz.ch
###### Abstract.

This paper considers estimation and inference of nonparametric conditional expectation relations with a high dimensional conditioning set. Rates of convergence and asymptotic normality are derived for series estimators for models where conditioning information enters in an additively separable manner and satisfies sparsity assumptions. Conditioning information is selected through a model selection procedure which chooses relevant variables in a manner that generalizes the post-double selection procedure proposed in (?) to the nonparametric setting. The proposed method formalizes considerations for trading off estimation precision with omitted variables bias in a nonparametric setting. Simulation results demonstrate that the proposed estimator performs favorably in terms of size of tests and risk properties relative to other estimation strategies.

Key Words: nonparametric models, high dimensional-sparse regression, inference under imperfect model selection. JEL Codes: C1.

I thank Christian Hansen, Tim Conley, Matt Taddy, Azeem Shaikh, Dan Nguyen, Emily Oster, Martin Schonger, Eric Floyd, and seminar participants at University of Western Ontario, University of Pennsylvania, Rutgers University, Monash University, and the Center for Law and Economics at ETH Zurich for helpful comments. I gratefully acknowledge financial support from the ETH Postdoctoral Fellowship.

## 1. Introduction

Nonparametric estimation is common in economic and statistical problems because it is appealing in applications for which functional forms are unavailable. In many problems, the primary quantities of interest can be computed from the conditional expectation function of an outcome variable $y$ given a regressor of interest $x$. Nonparametric methods are often attractive for estimating such conditional expectations since assuming an incorrect simple parametric model between the variables of interest will lead to incorrect inference.

In many econometric models, it is important to take into account conditioning information, $z$. When $x$ is not randomly assigned, estimates of partial effects of $x$ on $y$ will often be incorrect if $z$ is ignored and, at the same time, $z$ partly influences both $x$ and $y$. However, if $x$ can be considered as approximately randomly assigned given $z$, then the conditional expectation of $y$ given $x$ and $z$ can be used to calculate causal effects of $x$ on $y$ and to evaluate counterfactuals. Therefore, properly accounting for conditioning information is of primary importance.

When conditioning information is important to the problem, it is necessary to replace the simple objective of learning the conditional mean function with the new objective of learning a family of conditional mean functions

$$E[y \mid x, z] = g_z(x) \tag{1.1}$$

indexed by $z$. Properly accounting for conditioning information in $z$ may be done in several ways and leaves the researcher with important modeling decisions. For the sake of illustration, four potential ways to account for conditioning information are (1) by specifying a partially linear model , (2) by specifying an additive model $g_z(x) = g(x) + h(z)$, (3) by specifying a multiplicative model , or (4) by specifying a fully nonparametric model . The fully nonparametric model suffers from the curse of dimensionality even for moderately many covariates, while the partially linear model may be too rigid and may miss important conditioning information.

Many applications have potentially large conditioning sets. A large conditioning set in economics may arise because the researcher wishes to control for many measured characteristics, like demographics, at the observation level. In extreme cases, the dimension of $z$ can be large enough that all four example specifications for $g_z(x)$ listed above require an infeasible amount of data in order to avoid statistical overfitting and ensure good inference.

This paper is restricted to studying the partially linear model and the additive model shown above. [Footnote 1: These models have a particular structure which makes them convenient for the problem of selecting a conditioning set. The two alternative models are likely to require conditions considerably different from the ones considered here.] Therefore, the interest is the specialization of model (1.1) to the case

$$E[y \mid x, z] = g(x) + h(z). \tag{1.2}$$

When the additive model provides a good approximation to the underlying data generating structure, it is useful since many quantities describing the relationship of $y$ and $x$ conditional on $z$ can be learned with a good understanding of $g$ alone. In addition, it provides a clear description of how the conditional relation between $y$ and $x$ changes as $z$ changes.

An important structure that has been used in recent econometrics is approximate sparsity. See for example (?), (?), and (?). In the context of this paper, approximate sparsity informally refers to the condition that the conditional expectation function can be approximated by a family of functions which depend only on a small (though a priori unknown) subset of the conditioning information contained in $z$. Sparsity is useful because, in principle, a researcher can address statistical overfitting problems by locating and controlling for only the correct conditioning information. The focus of this paper is on providing a formal model selection technique over models for the conditioning set which retrieves the relevant conditioning information. The retrieval of relevant conditioning information will be done in such a way that standard estimation techniques performed after model selection will provide correct inference for nonparametric partial effects of $x$ on $y$.

This paper takes model (1.2) as a starting point and assumes that $h(z)$, a complicated function of many conditioning variables, has sparse structure. The strategy for identifying a sparse structure is to search for a small subset of relevant terms within a long series expansion for $h$. To formalize this, a high dimensional framework that allows the long series expansion for $h$ to have more terms than the sample size is particularly convenient. Once a simple model for $h$ is found, the focus returns to the estimation of $g$. The paper contributes to the nonparametrics literature by establishing rates of convergence of estimates and asymptotic normality for functionals of $g$ after formal model selection has been performed to simplify the way the conditioning variable $z$ affects the conditional expectation relation between $x$ and $y$.

In addition to addressing questions about flexible selection of a conditioning set in a nonparametric setting, this paper contributes to a broader program aimed at conducting inference in the context of high-dimensional models. Statistical methods in high dimensions have been well developed for the purpose of prediction. Two widely used methods for estimating high dimensional predictive relationships that are important for the present paper are Lasso and Post-Lasso. The Lasso is a shrinkage procedure which estimates regression coefficients by minimizing a loss function plus a penalty for the size of the coefficients. Post-Lasso fits an ordinary least squares regression on variables with non-identically-zero estimated Lasso coefficients. For theoretical and simulation results about the performance of these two methods, see (?) (?), (?) (?) (?), (?), (?), (?), (?) (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), (?), among many more. Regularized estimation buys stability through reduction in estimate variability at the cost of a modest bias in estimates. Regularized estimators like the Lasso, where many parameter values are set identically to zero, also favor parsimony. Recently, several authors have begun the task of assessing uncertainties or estimation error of model parameter estimates in a wide variety of models with high dimensional regressors (see, for example, (?); (?); (?); (?); (?); (?); (?); and (?)).

Quantifying estimation precision has been shown to be difficult theoretically (for formal statements, see (?), (?)) because model selection mistakes and regularization typically bias estimates to the same order of magnitude (relative to the sample size) as estimation variability. This paper builds on the methodology found in (?) (named Post-Double-Selection), which gives robust statistical inference for the slope parameter of a treatment variable with high-dimensional confounders in the context of a partially linear model. [Footnote 2: Their formulation is slightly more general because it allows the equality to be an approximate equality.] The method selects elements of $z$ in two steps: step 1 selects the terms in $z$ that are most useful for predicting $x$, and step 2 selects the elements of $z$ most useful for predicting $y$. The use of two model selection steps is motivated partially by the intuition that there are two necessary conditions for omitted variables bias to occur: an omitted variable exists which is (1) correlated with the treatment $x$, and (2) correlated with the outcome $y$. Each selection step addresses one of the two concerns. In their paper, they prove that under the right regularity conditions, the two described model selection steps can be used to obtain asymptotically normal estimates of the slope parameter and in turn to construct correctly sized confidence intervals. This paper generalizes the approach from estimating a linear treatment model to estimating a component in nonparametric additively separable models. The main technical contribution is providing conditions under which nonparametric estimates for functionals of $g$ are uniformly asymptotically normal after model selection on a conditioning set over a large set of data generating processes.

## 2. A high dimensional additively separable model

This section provides an intuitive discussion of the additively separable nonparametric model explored in this paper. Recall the additive conditional expectation model described in the introduction:

$$E[y \mid x, z] = g_z(x) = g(x) + h(z).$$

The interest is in recovering the function $g$, which describes the conditional relationship between the treatment variable of interest, $x$, and the outcome $y$. The component functions $g$ and $h$ belong to ambient spaces which are restricted sufficiently to allow $g$ and $h$ to be uniquely identified. [Footnote 3: For example, it is necessary to require additional conditions like , otherwise, at best, $g$ and $h$ are identified up to addition by constants.] The function $h$ and the variable $z$ will be allowed to depend on $n$ to facilitate a high-dimensional thought experiment for the conditioning set. As an example, this allows estimation of models of the form with . Dependence on $n$ will be suppressed for ease of notation. The formulation will allow $x$ and $z$ to share variables to some extent. For instance, the setup will allow for additive interaction models like those found in Andrews and Whang (1991) so that the model where is an unknown scalar.

The estimation of $g$ proceeds by a series approximation with a dictionary that is partially data-dependent. As a review, a standard series estimator of the conditional expectation function without conditioning set, , is obtained with the aid of a dictionary of transformations . The dictionary consists of a set of functions of $x$ with the property that a linear combination of the can approximate to an increasing level of precision that depends on . is permitted to depend on and may include splines, Fourier series, orthogonal polynomials or other functions which may be useful for approximating . The series estimator is simple and implemented with standard least squares regression. Given data for , a series estimator for takes the form: for , and . Traditionally, the number of series terms, chosen in a way to simultaneously reduce bias and increase precision, must be small relative to the sample size. Thus the function of interest must be sufficiently smooth or simple in order for nonparametric estimation to work well. The econometric theory for nonparametric regression estimation using an approximating series expansion is well understood under standard regularity conditions; see, for example, (?), (?), (?).
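As a minimal illustration of the standard series estimator reviewed above, the following Python sketch regresses an outcome on a power-series dictionary by least squares. The polynomial basis, the simulated design, and the names `power_dictionary`, `series_fit`, and `series_predict` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def power_dictionary(x, K):
    """Return the n x K matrix with columns 1, x, ..., x^(K-1)."""
    return np.vander(np.asarray(x, dtype=float), N=K, increasing=True)

def series_fit(x, y, K):
    """Least squares coefficients from regressing y on the K-term dictionary."""
    P = power_dictionary(x, K)
    beta, *_ = np.linalg.lstsq(P, y, rcond=None)
    return beta

def series_predict(x_new, beta):
    """Evaluate the fitted series p(x)'beta at new points."""
    return power_dictionary(x_new, len(beta)) @ beta

# Simulated example (assumption): a smooth target with additive noise.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(n)

beta_hat = series_fit(x, y, K=6)
grid = np.linspace(-0.9, 0.9, 7)
max_err = np.max(np.abs(series_predict(grid, beta_hat) - np.sin(2.0 * grid)))
```

The number of terms `K` plays the role of the smoothing parameter: a larger `K` lowers approximation bias but raises estimation variance.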

Series estimation is particularly convenient for estimating models of the form of (1.2) because they can be approached using two dictionaries, consisting of and terms which individually approximate and . The dictionaries can simply be combined into one larger dictionary. To describe the estimation procedure in this paper, suppose such a dictionary, , compatible with the additively separable decomposition exists and is known. In what follows, dependence on and is suppressed in the notation so that and .

The two dictionaries differ in nature. The first dictionary, is traditional, and follows standard conditions imposed on series estimators, for example, (?), requiring among other conditions, that . The first dictionary must be chosen to approximate the function $g$ sufficiently well so that if $h$ were known exactly, $g$ could be estimated in the traditional nonparametric way and inferences on functionals of $g$ would be reliable.

The second dictionary, , is afforded much more flexibility. This is convenient and appropriate when entertaining a high dimensional conditioning set $z$. When the problem of interest is in recovering and performing inference for $g$, the second component $h$ may be considered a nuisance parameter. In particular, this paper will not be concerned with constructing confidence intervals for $h$, and therefore the requirements on the magnitude of bias in estimating $h$ will be less stringent. As a consequence, model selection bias of estimates of $h$, when done according to the method below, will have negligible impact on the coverage probabilities of confidence sets. Increased flexibility in modeling by allowing can make subsequent inference for $g$ more robust, but requires additional structure on . The key additional conditions are sparse approximation conditions. The first sparsity requirement is that there is a small number of components of that adequately approximate the function $h$. The second sparsity requirement is that information about functions conditional on $z$ can be suitably approximated using a small number of terms in . The identities of the contributing terms, however, can be unknown to the researcher a priori.

Aside from estimating an entire component of the conditional expectation function itself, a goal of this paper is to obtain asymptotically normal estimates of certain functionals of $g$. Given a functional , that satisfies certain regularity conditions, the model selection procedure on will deliver a model such that subsequent plug-in estimates will be asymptotically normal around . Such functionals include integrals of $g$, weighted average derivatives of $g$, evaluation of $g$ at a point , and .
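To make the idea of plug-in functionals concrete, here is a small Python sketch that evaluates two of the functionals mentioned above for a fitted series. It assumes a power-series dictionary and an unweighted sample average of the derivative; the coefficient vector `beta_hat` and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def power_dictionary(x, K):
    """Rows are p(x) = (1, x, ..., x^(K-1)) for each entry of x."""
    return np.vander(np.atleast_1d(np.asarray(x, dtype=float)), N=K, increasing=True)

def power_dictionary_derivative(x, K):
    """Rows are the derivatives (0, 1, 2x, ..., (K-1) x^(K-2)) of the dictionary."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    D = np.zeros((x.size, K))
    for k in range(1, K):
        D[:, k] = k * x ** (k - 1)
    return D

def point_evaluation(beta, x0):
    """Plug-in estimate of g(x0) = p(x0)'beta."""
    return float(power_dictionary(x0, len(beta))[0] @ beta)

def average_derivative(beta, x_sample):
    """Plug-in estimate of the sample-average derivative of g (unweighted, by assumption)."""
    return float(np.mean(power_dictionary_derivative(x_sample, len(beta)) @ beta))

# Hypothetical coefficient estimate corresponding to g(x) = 1 + 2x + 3x^2.
beta_hat = np.array([1.0, 2.0, 3.0])
value_at_half = point_evaluation(beta_hat, 0.5)
avg_slope = average_derivative(beta_hat, np.array([0.0, 1.0]))
```

Both functionals are linear in the coefficients here, so they reduce to inner products with a fixed vector once the dictionary is chosen.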

## 3. Estimation

When the number of free parameters is larger than the sample size, model selection or regularization is necessary. There are a variety of different model selection techniques available to researchers. A popular approach is via the Lasso estimator given by (?) and (?) which, in the context of regression, simultaneously performs regularization and model selection. The Lasso is used in many areas of science and image processing and has demonstrated good predictive performance. Lasso allows the estimation of regression coefficients even when the sample size is smaller than the number of parameters by adding to the quadratic objective function a penalty term which mechanically favors regression coefficients that contain zero elements. By taking advantage of ideas in regularized regression, this paper demonstrates that quality estimation of $g$ can be attained even when , the effective number of parameters, exceeds the sample size $n$. Estimating $g$ proceeds by a model selection step that effectively reduces the number of parameters to be estimated. There are many other sensible candidates for model selection devices in the statistics and econometrics literature. The appropriate choice of model selection methodology can be tailored to the application. In addition to the Lasso, variants of Lasso like the group-Lasso, as well as SCAD (see (?)), BIC, and AIC, are all feasible examples. In the exposition of the results, the model selection procedure used will be specifically the Lasso because it is simple and widely used. Section 3.2 below provides a brief review of Lasso methods, especially those that arise in econometric applications.

Estimation of will be based on a reduced dictionary comprised of a subset of the series terms in and . Because the primary object of interest is , it is natural to include all terms belonging to in the reduced dictionary, giving . Therefore, the main selection step involves choosing a subset of terms from . Given a model selection procedure which provides a new reduced dictionary , containing series terms, the post-model selection estimate of is defined by

$$\hat g(x) = p(x)'\hat\beta$$

where .

Since estimation of is of secondary concern, only the components of predictive of and need to be estimated. These two predictive goals will guide the choice of model selection procedure as described in the upcoming sections. The results will demonstrate that under standard regularity conditions, the post-model selection estimates give convergence rates for which are the same as in classic nonparametric estimation, as well as asymptotic normality results for plug-in estimators of nonlinear functionals of the underlying conditional expectation function.

### 3.1. Nonparametric Post-Double Selection in the Additive Model

The main challenge in statistical inference after model selection is in attaining robustness to model selection errors. When coefficients are small relative to the sample size (i.e., statistically indistinguishable from zero), model selection mistakes are unavoidable. [Footnote 4: Under some restrictive conditions, for example beta-min conditions which constrain nonzero coefficients to have large magnitudes, perfect model selection can be attained.] When such errors are not accounted for, subsequent inference has been shown to be potentially severely misleading. This intuition is formally developed in (?) and (?). Offering solutions to this problem is the focus of a number of recent papers; see, for example, (?), (?), (?), (?), (?), (?), (?), (?), and (?). [Footnote 5: Citations are ordered by date of first appearance on arXiv.] This section extends the approach of (?) to the nonparametric setting.

Informally, model selection in the additively separable model proceeds in two steps. The two selection steps are based on the observation that the functional relation can be learned with knowledge of the conditional expectations

$$E[\varphi(x) \mid z] \tag{3.3}$$
$$E[y \mid z] \tag{3.4}$$

for a robust enough family of test functions , for instance, smooth functions with compact support. Equivalently, the relation can be learned by projecting out the variable and working with residuals. In the additively separable model, the two selection steps are summarized as follows:

(1) First Stage Model Selection Step - Select those terms in which are relevant for predicting terms in .

(2) Reduced Form Model Selection Step - Select those terms in which are relevant for predicting .

To further describe the selection stages, it is convenient to ease notation by introducing an operator on functions that belong to :

$$T\varphi(z) = E[\varphi(x) \mid z]$$

This notion is convenient for understanding the validity behind post double selection in the additively separable model. The operator measures dependence between functions in the ambient spaces which house the functions and the conditioning is understood to be on all function .

If the operator can be suitably well approximated, then the post double selection methodology generalizes to the nonparametric additively separable case. The operator on will be approximated as a linear combination, given by of basis terms so that

$$T\varphi(z) \approx q(z)'\Gamma_\varphi.$$

Meanwhile, is approximated with linear combinations of , . The final selected model consists of the union of terms selected during the first stage model selection step and the reduced form model selection step. A practical implementation algorithm is provided in Section 3.3.

(?) develop and discuss the post-double-selection method in detail for the partially linear model. They note that including the union of the variables selected in each variable selection step helps address the issue that model selection is inherently prone to errors unless stringent assumptions are made. As noted by (?), the possibility of model selection mistakes precludes the possibility of valid post-model-selection inference based on a single Lasso regression within a large class of interesting models. The chief difficulty arises with covariates whose effects in (3.3) are small enough that the variables are likely to be missed if only (3.3) is considered but have large effects in (3.4). The exclusion of such variables, which is likely if variables are selected using only (3.3), may lead to substantial omitted variables bias. [Footnote 6: The same is true if only (3.4) is used for variable selection, exchanging the roles of (3.3) and (3.4).] Using both model selection steps guards against such model selection mistakes and guarantees that the variables excluded in both model selection steps have a negligible contribution to omitted variables bias under the conditions listed below.

### 3.2. Brief overview of Lasso methods

The following description of the Lasso estimator is a review of the particular implementation given in (?). Consider the conditional expectation and assume that is an approximating dictionary for the function , so that , with dimension . The Lasso estimates for and are defined by

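$$\hat\vartheta \in \arg\min_{t \in \mathbb{R}^M} \sum_{i=1}^n (y_i - \varrho(w_i)'t)^2 + \lambda \sum_{j=1}^M |\hat\Psi_j t_j|$$
$$\hat f_{\text{Lasso}}(w) = \varrho(w)'\hat\vartheta$$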

where $\lambda$ and $\hat\Psi_j$ are tuning parameters named the penalty level and the penalty loadings. (?) provided estimation methodology as well as results guaranteeing performance for the Lasso estimator under conditions which are common in econometrics, including heteroskedastic and non-Gaussian disturbances. Tuning parameters are chosen to balance regularization and bias considerations. [Footnote 7: For the simple heteroskedastic Lasso above, (?) recommend setting with sufficiently slowly, and . The choices and are acceptable. The exact values are unobserved, and so a crude preliminary estimate is used to give . Estimates of can be iterated on as suggested by (?). The validity of the crude preliminary estimate as well as iterative estimates is detailed in the appendix.] Performance bounds for the Lasso, including rates at which approach zero, are derived on what is called the Regularization event. The Regularization event is defined by , for a fixed constant where is the partial derivative of the least squares part of the objective in the $j$th direction. Informally, under the regularization event, the penalty level is high enough so that coefficients which cannot be statistically separated from zero are mechanically set to identically zero as a consequence of the absolute values in the above objective function. Because performance bounds are directly proportional to $\lambda$, Lasso can be shown to perform well for small values of $\lambda$ which are nevertheless large enough so that the regularization event occurs with high probability.
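The Lasso objective with penalty loadings can be minimized by cyclic coordinate descent, and the following Python sketch shows one way to do so. It is an illustration under stated assumptions, not the implementation used in the paper: the constants `c = 1.1` and `gamma = 0.05`, the crude loadings built from the centered outcome, the simulated design, and the function name `weighted_lasso` are all choices made here for the example.

```python
import numpy as np
from statistics import NormalDist

def weighted_lasso(X, y, lam, loadings, n_sweeps=100):
    """Coordinate descent for sum_i (y_i - x_i't)^2 + lam * sum_j |loadings_j * t_j|."""
    n, M = X.shape
    t = np.zeros(M)
    col_ss = (X ** 2).sum(axis=0)          # x_j'x_j for each column
    resid = y.astype(float).copy()         # residual y - X t, with t = 0 at the start
    for _ in range(n_sweeps):
        for j in range(M):
            resid += X[:, j] * t[j]        # partial residual without coordinate j
            rho = X[:, j] @ resid
            # Soft-thresholding: the absolute value in the penalty sets small
            # coordinates to exactly zero.
            shrunk = max(abs(rho) - lam * loadings[j] / 2.0, 0.0)
            t[j] = np.sign(rho) * shrunk / col_ss[j]
            resid -= X[:, j] * t[j]
    return t

rng = np.random.default_rng(0)
n, M = 200, 250                            # more dictionary terms than observations
X = rng.standard_normal((n, M))
theta = np.zeros(M)
theta[:3] = [3.0, -3.0, 3.0]               # sparse truth (simulation assumption)
y = X @ theta + 0.5 * rng.standard_normal(n)

c, gamma = 1.1, 0.05                       # illustrative constants (assumption)
lam = 2.0 * c * np.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * M))
# Crude preliminary loadings (assumption): the centered outcome stands in
# for the unobserved regression error.
loadings = np.sqrt(np.mean(X ** 2 * (y - y.mean())[:, None] ** 2, axis=0))

theta_hat = weighted_lasso(X, y - y.mean(), lam, loadings)
support = np.flatnonzero(theta_hat)
```

Because the crude loadings overstate the error variance, the fitted coefficients are heavily shrunk toward zero; this is the bias that a subsequent least squares refit on the selected terms is meant to remove.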

Lasso performs particularly well relative to some more traditional regularization schemes (e.g., ridge regression) under sparsity: the parameter satisfies for some sequence . A feature of the Lasso penalty that has granted Lasso success is that it sets some components of $\hat\vartheta$ to exactly zero in many cases. Under general conditions,

$$\hat I = \{j : \hat\vartheta_j \neq 0\}, \qquad |\hat I| \leqslant Cs \quad \text{with probability } 1 - o(1)$$

for a constant $C$ that depends on the problem. The Post-Lasso estimator is defined as the least squares series estimator that considers only terms selected by Lasso (i.e., terms with nonzero coefficients):

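$$\hat f_{\text{Post-Lasso}}(w) = \varrho(w)'\hat\vartheta_{\text{Post-Lasso}}; \qquad \hat\vartheta_{\text{Post-Lasso}} \in \arg\min_{\{t \,:\, t_j = 0,\ \forall j \notin \hat I\}} \sum_{i=1}^n (y_i - \varrho(w_i)'t)^2$$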
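The Post-Lasso definition amounts to an ordinary least squares refit restricted to the selected support, which the following Python sketch illustrates. The Lasso coefficient vector `lasso_coef` is a hypothetical input supplied by hand, and the simulated data and the function name `post_lasso` are assumptions for the example.

```python
import numpy as np

def post_lasso(X, y, lasso_coef):
    """Refit by least squares using only columns with nonzero Lasso coefficients."""
    support = np.flatnonzero(lasso_coef)
    coef = np.zeros(X.shape[1])            # coordinates outside the support stay at zero
    if support.size > 0:
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        coef[support] = sol
    return coef, support

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
theta = np.zeros(20)
theta[[0, 3]] = [2.0, -1.0]
y = X @ theta + 0.5 * rng.standard_normal(200)

# Hypothetical shrunken Lasso output with the correct support.
lasso_coef = np.zeros(20)
lasso_coef[[0, 3]] = [1.6, -0.7]

coef, support = post_lasso(X, y, lasso_coef)
```

Only the support of the Lasso solution is used; the shrunken magnitudes are discarded, which removes the shrinkage bias on the selected terms.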

### 3.3. Lasso in post-double selection in the additively separable model

In this section, Lasso is applied directly to the first and second stage problems described in Section 3.1, starting with constructing an approximation to the operator $T$. Each component of is regressed onto the dictionary giving an approximation for as a linear combination of elements for . If this can be done with all for each , then applied to a linear combination of , namely , can also be approximated by a linear combination of elements of . The estimation can be summarized with one optimization problem which is equivalent to $K$ separate Lasso problems. All nonzero components of the solution to the optimization are collected and included as elements of the refined dictionary .

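$$\hat\Gamma = \arg\min_{\Gamma \in \mathbb{R}^{L \times K}} \sum_{k=1}^K \sum_{i=1}^n (p_k(x_i) - q(z_i)'\Gamma_k)^2 + \lambda_{FS} \sum_{k=1}^K \sum_{j=1}^L |\hat\Psi^{FS}_{jk}\Gamma_{kj}|.$$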

Note that the estimate approximates in the sense that approximates . The first stage tuning parameters are chosen similarly to the method outlined above but account for the need to estimate effectively different regressions. Set

$$\lambda_{FS} = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2KL)),$$
$$\Psi^{FS}_{jk} = \sqrt{\sum_{i=1}^n q_j(z_i)^2 (p_k(x_i) - Tp_k(x_i))^2 / n}.$$

As before, the $\Psi^{FS}_{jk}$ are not directly observable and so estimates $\hat\Psi^{FS}_{jk}$ are used in their place. The mechanical implementation for calculating $\hat\Psi^{FS}_{jk}$ is described in the appendix. Details for the constants involved in choosing $\lambda_{FS}$ are also given in the appendix. The appearance of the term $KL$ in $\lambda_{FS}$ ensures that the Lasso model selection performs uniformly well over the $K$ different Lassos of the first stage.
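The effect of the $KL$ term on the first stage penalty level can be seen numerically. The Python sketch below evaluates the penalty formula for a single Lasso with $L$ candidate terms and for $K$ first stage Lassos sharing one penalty level. The values of `c`, `gamma`, `n`, `K`, and `L` are illustrative assumptions, and `penalty_level` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def penalty_level(n, n_terms, c=1.1, gamma=0.05):
    """Evaluate 2 c sqrt(n) Phi^{-1}(1 - gamma / (2 n_terms))."""
    return 2.0 * c * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * n_terms))

n, K, L = 500, 5, 200                     # illustrative sizes (assumption)
lam_single = penalty_level(n, L)          # one Lasso over L candidate terms
lam_fs = penalty_level(n, K * L)          # K first stage Lassos over L terms each
ratio = lam_fs / lam_single
```

Since the normal quantile grows only like the square root of the logarithm of its argument's reciprocal tail, accounting for all `K * L` coefficients raises the penalty level only modestly.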

Running the regression above will yield coefficient estimates of exactly zero for many of the . For each let . Then the first stage model selection step selects exactly those terms which belong in the union .

The reduced form selection step proceeds after the first stage model selection step. For this step, let

$$\hat\pi = \arg\min_{\pi} \sum_{i=1}^n (y_i - q(z_i)'\pi)^2 + \lambda_{RF} \sum_{j=1}^L |\hat\Psi^{RF}_j \pi_j|$$

where the reduced form tuning parameters are chosen according to the method outlined above with

$$\lambda_{RF} = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2L)),$$
$$\Psi^{RF}_j = \sqrt{\sum_{i=1}^n q_j(z_i)^2 (y_i - E[y_i \mid z_i])^2 / n}.$$

Let be the outcome of the reduced form step of model selection.

Consider the set of dictionary terms selected in the first stage and reduced form model selection steps. Let be the union of all dictionary terms: . Then define the refined dictionary by . Let be the matrix with the observations of the refined dictionary stacked. The post-double-model selection estimate for is defined by

$$\hat g(x) = p(x)'\hat\beta \tag{3.5}$$

where .
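The full procedure of this section can be sketched end to end in Python: one Lasso per first-dictionary term, one reduced form Lasso, a union of the selected terms, and a final least squares fit. This is a simplified illustration, not the paper's implementation. It assumes crude (non-iterated) penalty loadings, constants `c = 1.1` and `gamma = 0.05`, centered data in place of an intercept, and a simulated design; all function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def lasso_cd(X, y, lam, loadings, n_sweeps=100):
    """Coordinate descent for sum (y - Xt)^2 + lam * sum |loadings_j t_j|."""
    t = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            resid += X[:, j] * t[j]
            rho = X[:, j] @ resid
            t[j] = np.sign(rho) * max(abs(rho) - lam * loadings[j] / 2.0, 0.0) / col_ss[j]
            resid -= X[:, j] * t[j]
    return t

def select(Q, target, n_terms, c=1.1, gamma=0.05):
    """Indices of columns of Q that Lasso keeps when predicting one target."""
    n = Q.shape[0]
    lam = 2.0 * c * np.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * n_terms))
    centered = target - target.mean()
    # Crude preliminary loadings (assumption): centered target replaces the residual.
    loadings = np.sqrt(np.mean(Q ** 2 * centered[:, None] ** 2, axis=0))
    return set(np.flatnonzero(lasso_cd(Q, centered, lam, loadings)).tolist())

def post_double_selection(P, Q, y):
    """Return coefficients on P and the selected columns of Q."""
    K, L = P.shape[1], Q.shape[1]
    first_stage = set()
    for k in range(K):                               # one Lasso per term of P
        first_stage |= select(Q, P[:, k], n_terms=K * L)
    reduced_form = select(Q, y, n_terms=L)           # Lasso of y on Q
    selected = sorted(first_stage | reduced_form)    # union of both steps
    design = np.hstack([P, Q[:, selected]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:K], selected

# Simulated design (assumption): x depends on z_0; y depends on x, z_0 and z_1.
rng = np.random.default_rng(0)
n, L = 400, 60
Z = rng.standard_normal((n, L))
x = Z[:, 0] + rng.standard_normal(n)
y = x + 0.5 * x ** 2 + Z[:, 0] + 2.0 * Z[:, 1] + 0.5 * rng.standard_normal(n)

# Center everything so that no intercept is needed in this sketch.
P = np.column_stack([x, x ** 2])
P, Q, y_c = P - P.mean(axis=0), Z - Z.mean(axis=0), y - y.mean()

beta_hat, selected = post_double_selection(P, Q, y_c)
```

In this design the first stage is what guarantees that `z_0` is controlled for, since it predicts `x`, while the reduced form step picks up `z_1`, which affects only the outcome.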

## 4. Regularity and Approximation Conditions

In this section, the model described above is written formally and conditions guaranteeing convergence and asymptotic normality of the Post-Double Selection Series Estimator are given.

###### Assumption 1.

(i) are i.i.d. random variables and satisfy with and for pre-specified classes of functions .

The first assumption specifies the model. The observations are required to be identically distributed, which is stronger than the treatment of i.n.i.d. variables given in Belloni, Chernozhukov and Hansen (2011).

### 4.1. Regularity and approximation conditions concerning the first dictionary

The following few definitions help characterize smoothness properties of target function and approximating functions . Let be a function defined on the support of . Define the Sobolev norm . In addition, let where denotes the Euclidean norm. Throughout the exposition, all assumptions will be required to hold for each with the same set of implied constants.

###### Assumption 2.

There is an integer , a real number and vectors such that and as .

Assumption 2 is standard in nonparametric estimation. It requires that the dictionary can approximate at a pre-specified rate. Values of and can be derived for particular classes of functions. (?) gives approximation rates for several leading examples, for instance orthogonal polynomials, regression splines, etc.

###### Assumption 3.

For each , the smallest eigenvalue of the matrix

$$E[(p(x) - Tp(x)(z))(p(x) - Tp(x)(z))']$$

is bounded uniformly away from zero in . In addition, there is a sequence of constants satisfying and as .

This condition is a direct analogue of a combination of Assumption 2 from Newey (1997) and the necessary and sufficient conditions for estimation of partially linear models from (?). Requiring the eigenvalues of to be uniformly bounded away from zero is effectively an identifiability condition. It is an analogue of the standard condition that have eigenvalues bounded away from zero specialized to the residuals of after conditioning on . The second condition of Assumption 3 is a standard regularity condition on the first dictionary.

### 4.2. Sparsity Conditions

The next assumptions concern sparsity properties surrounding the second dictionary , used for approximating . As outlined above, sparsity will be required along two dimensions in the second dictionary: both with respect to the outcome equation (1) and with respect to the functional . Consider a sequence that controls the number of nonzero coefficients in a vector. A vector is sparse if . The following give formal restrictions regarding the sparsity of the outcome equation relative to the second approximating dictionary as well as a sparse approximation of the operator described above.

###### Assumption 4.

Sparsity Conditions: there is a sequence and such that

(i) Approximate sparsity in the outcome equation: there is a sequence of vectors that are -sparse and the approximation holds. In addition, .

(ii) Approximate sparsity in the first stage. There are sparse such that In addition, .

(iii) $s = o(n)$.

The assumption above imposes only a mild condition on the sparsity and in a sense may be thought of as definitional. In the discussion that follows, additional conditions on the size of the sparsity level will be imposed. As a preview, the conditions listed in Assumption 7 will require that , among other conditions.

The first statement requires that the second dictionary can approximate using a small number of terms. The average squared approximation error from using a sparse must be smaller than the conjectured estimation error when the subset of the correct terms is known. This restriction on the approximation error follows the convention used by (?). The second restriction on the maximum approximation error is used to simplify the proofs. The second statement of Assumption 4 generalizes the first approximate sparsity requirement. It requires that each component of the dictionary can be approximated by a linear combination of a small set of terms in .

Additional discussion of the sparsity assumptions is given in Section 6, which addresses issues arising in the implementation of nonparametric post-double selection estimates.

### 4.3. Regularity conditions concerning the second dictionary

The following conditions restrict the sample Gram matrix of the second dictionary. A standard condition for nonparametric estimation is that for a dictionary , the Gram matrix eventually has eigenvalues bounded away from zero uniformly in with high probability. If , then the matrix will be rank deficient. However, in the high-dimensional setting, to assure good performance of Lasso, it is sufficient to only control certain moduli of continuity of the empirical Gram matrix. There are multiple formalizations of moduli of continuity that are useful in different settings, see (?), (?) for explicit examples. This paper focuses on a simple condition that seems appropriate for econometric applications. In particular the assumption that only small submatrices of have well-behaved eigenvalues will be sufficient for the results that follow. In the sparse setting, it is convenient to define the following sparse eigenvalues of a positive semi-definite matrix :

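$$\varphi_{\min}(m)(M) := \min_{1 \leqslant \|\delta\|_0 \leqslant m} \frac{\delta' M \delta}{\|\delta\|^2}, \qquad \varphi_{\max}(m)(M) := \max_{1 \leqslant \|\delta\|_0 \leqslant m} \frac{\delta' M \delta}{\|\delta\|^2}$$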
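For small problems the sparse eigenvalues can be computed by brute force, which helps show why they can be well behaved even when the full Gram matrix is singular. The Python sketch below enumerates all supports of size at most `m`; this is feasible only for tiny dimensions, and the function name and simulated dictionary are assumptions for illustration.

```python
import itertools
import numpy as np

def sparse_eigenvalues(M, m):
    """Brute-force min and max of d'Md / ||d||^2 over vectors d with 1 to m nonzeros."""
    p = M.shape[0]
    lo, hi = np.inf, -np.inf
    for size in range(1, m + 1):
        for idx in itertools.combinations(range(p), size):
            # Extreme Rayleigh quotients on a fixed support are the extreme
            # eigenvalues of the corresponding principal submatrix.
            eig = np.linalg.eigvalsh(M[np.ix_(idx, idx)])
            lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi

rng = np.random.default_rng(0)
n, L = 5, 8                                # more dictionary terms than observations
Q = rng.standard_normal((n, L))
gram = Q.T @ Q / n

full_min = np.linalg.eigvalsh(gram)[0]     # essentially zero: gram is rank deficient
phi_min, phi_max = sparse_eigenvalues(gram, m=2)
```

Here the full Gram matrix has rank at most `n`, so its smallest eigenvalue is numerically zero, while every two-column submatrix remains nonsingular.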

In this paper, favorable behavior of sparse eigenvalues is taken as a high level condition and the following is imposed.

###### Assumption 5.

For every constant there are constants which may depend on such that with probability , the sparse eigenvalues obey

$$\kappa' \leqslant \varphi_{\min}(CsK)(Q'Q/n) \leqslant \varphi_{\max}(CsK)(Q'Q/n) \leqslant \kappa''.$$

Assumption 5 requires only that certain “small” submatrices of the large empirical Gram matrix are well-behaved. This condition seems reasonable and will be sufficient for the results that follow. Informally, it states that no small subset of covariates in suffers from a multicollinearity problem. The condition could be shown to hold under more primitive conditions by adapting arguments found in (?), which build upon results in (?) and (?); see also (?).
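To make the sparse eigenvalue definition concrete, the following is a minimal brute-force sketch, assuming NumPy is available. The function name `sparse_eigenvalues` and the small demo dimensions are illustrative choices, not part of the paper. The sketch enumerates every support of size m, which is only feasible for small problems; it is meant to illustrate the definition, not to verify Assumption 5 in practice.

```python
from itertools import combinations

import numpy as np


def sparse_eigenvalues(M, m):
    """Brute-force m-sparse eigenvalues of a symmetric PSD matrix M.

    Returns (phi_min, phi_max): the extreme values of d'Md / ||d||^2 over
    nonzero vectors d with at most m nonzero entries.
    """
    M = np.asarray(M, dtype=float)
    p = M.shape[0]
    size = min(m, p)
    phi_min, phi_max = np.inf, -np.inf
    # By eigenvalue interlacing, the extremes over supports of size <= m
    # are attained on supports of size exactly m, so only those are scanned.
    for support in combinations(range(p), size):
        idx = list(support)
        eigs = np.linalg.eigvalsh(M[np.ix_(idx, idx)])  # ascending order
        phi_min = min(phi_min, eigs[0])
        phi_max = max(phi_max, eigs[-1])
    return phi_min, phi_max


# Demo with more columns than observations (hypothetical sizes):
# the full Gram matrix is rank deficient, yet small submatrices need not be.
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 12))
gram = Q.T @ Q / Q.shape[0]
print("smallest full eigenvalue:", np.linalg.eigvalsh(gram)[0])
print("3-sparse eigenvalues:", sparse_eigenvalues(gram, 3))
```

The demo compares the smallest eigenvalue of the full Gram matrix, which is zero up to rounding because there are more columns than rows, with the 3-sparse eigenvalues of the same matrix.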

### 4.4. Moment Conditions

The next conditions are high level conditions about moments and the convergence of certain sample averages which ensure good performance of the Lasso as a model selection device. They allow the use of moderate deviation results given in (?), which ensure good performance of Lasso under non-Gaussian and heteroskedastic errors. (?) discuss the plausibility of these types of moment conditions for various models for the case = 1. For common approximating dictionaries for a single variable, the condition can be readily checked in a similar manner.

###### Assumption 6.

Set . For each let and define . Let be constants that do not depend on . The following conditions are satisfied with probability for each

 (i) c

### 4.5. Global Convergence

The first result is a preliminary result which gives bounds on convergence rates for the estimator . They are used in the course of the proof of Theorem 1 below, the main inferential result of this paper. The proposition is a direct analogue of the rates given in Theorem 1 of (?) which considers estimation of a conditional expectation without model selection over a conditioning set. The rates obtained in Proposition 1 match the rates in (?).

###### Proposition 1.

Under assumptions listed above, the post-double-model-selection estimates for the function given in equation 3.5 satisfy

$$\int \big(g(x) - \hat{g}(x)\big)^2 \, dF(x) = O_p\big(K/n + K^{-2\alpha}\big)$$

$$|\hat{g} - g|_d = O_P\big(\zeta_d(n)\sqrt{K}/\sqrt{n} + K^{-\alpha}\big).$$
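As a worked illustration of how the two terms in the mean-square rate trade off, consider a number of series terms growing as a power of the sample size. This is a standard balancing calculation rather than a result stated in the paper; the constant $c > 0$ is hypothetical, and the resulting choice of $K$ would still need to satisfy the rate conditions in the assumptions above.

$$K = c\, n^{1/(2\alpha+1)} \;\Longrightarrow\; \frac{K}{n} = c\, n^{-2\alpha/(2\alpha+1)}, \qquad K^{-2\alpha} = c^{-2\alpha}\, n^{-2\alpha/(2\alpha+1)},$$

so that both terms are of the same order and

$$\int \big(g(x) - \hat{g}(x)\big)^2 \, dF(x) = O_p\big(n^{-2\alpha/(2\alpha+1)}\big).$$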

## 5. Inference and asymptotic normality

In this section, formal results concerning inference are stated. Consider estimation of a functional on the class of functions . The quantity of interest, , is estimated by

 ˆθ=a(ˆg).

The following assumptions on the functional are imposed. They are regularity assumptions that imply that attains a certain degree of smoothness. For example, they imply that is Fréchet differentiable.

###### Assumption 7.

Either (i) is linear over or (ii) for as in Assumption 2, . In addition, there is a linear function that is linear in and such that for some constants and all with , , it holds that and

The function is related to the functional derivative of . The following assumption imposes further regularity on the continuity of the derivative. For shorthand, let .

###### Assumption 8.

Either (i) is scalar, . There is dependent on such that for , it holds that and or (ii) There is with finite and nonsingular with and for every . There is so that .

In order to use for inference on , an approximate expression for the variance is necessary. As is standard, the expression for the variance will be approximated using the delta method. An approximate expression for the variance of the estimator therefore requires an appropriate derivative of the function (or rather, an estimate of it). Let denote the derivatives of the functions belonging to the approximating dictionary. Let . The approximate variance from the delta method is given by :

$$V = A\,\Omega^{-1}\Sigma\,\Omega^{-1}A$$

$$\Omega = E\big[(p(x) - Tp(x))(p(x) - Tp(x))'\big]$$

$$\Sigma = E\big[(p(x) - Tp(x))(p(x) - Tp(x))'(y - g(x))^2\big]$$

These quantities are unobserved but can be estimated:

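$$\hat{V} = \hat{A}\,\hat{\Omega}^{-1}\hat{\Sigma}\,\hat{\Omega}^{-1}\hat{A}$$

$$\hat{\Omega} = \sum_{i=1}^{n} (p(x_i) - \hat{p}(x_i))(p(x_i) - \hat{p}(x_i))'/n$$

$$\hat{\Sigma} = \sum_{i=1}^{n} (p(x_i) - \hat{p}(x_i))(p(x_i) - \hat{p}(x_i))'(y_i - \hat{g}(x_i))^2/n$$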

The elements are obtained as the predictions from the least squares regression of onto the selected . Then is used as an estimator of the asymptotic variance of and assumes a sandwich form.
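The sandwich form can be sketched numerically as follows, assuming NumPy. The function name `sandwich_variance` and its argument names are hypothetical: `P` holds the first dictionary evaluated at the data, `Q_sel` holds the selected terms of the second dictionary, `resid` holds the residuals entering the middle matrix (supplied by the caller), and `A` is the estimated derivative vector or matrix. This is an illustration of the algebra only, not the author's implementation.

```python
import numpy as np


def sandwich_variance(P, Q_sel, resid, A):
    """Plug-in sandwich variance A' Omega^{-1} Sigma Omega^{-1} A.

    P      : (n, K) first dictionary evaluated at the data
    Q_sel  : (n, L) selected terms of the second dictionary
    resid  : (n,)   residuals used in the middle matrix
    A      : (K,) or (K, r) derivative of the functional (assumed given)
    """
    n = P.shape[0]
    # Least squares predictions of each column of P from the selected terms.
    coef, *_ = np.linalg.lstsq(Q_sel, P, rcond=None)
    R = P - Q_sel @ coef                      # p(x_i) - p_hat(x_i)
    omega_hat = R.T @ R / n
    sigma_hat = (R * resid[:, None] ** 2).T @ R / n
    B = np.linalg.solve(omega_hat, A)         # Omega^{-1} A
    return B.T @ sigma_hat @ B
```

Solving the linear system instead of forming the inverse explicitly is a numerical convenience. With constant squared residuals the middle matrix is proportional to the outer one, so the expression collapses to the usual homoskedastic formula, which gives a quick sanity check.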

The following assumptions are needed in order to bound :

###### Assumption 9.

Define the moments , . Let be constants. Then for

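(i) $E[|\epsilon_i|^q \mid x_i, z_i] \leqslant C$

(ii) $\varpi_{q\epsilon jk}/\sqrt{E[q_j(z_i)^2\epsilon_i^2]},\ \varpi_{W\epsilon k}/\sqrt{E[W_{ki}^2\epsilon_i^2]} \leqslant C$

(iii) $n^{2/q} K \phi/\sqrt{n} = o(1)$

(iv) $\big(\zeta_0(K)\sqrt{K}/\sqrt{n}\big)\big(s\sqrt{K}/n + \sqrt{n}K^{-\alpha}\big)\max_{i,j}|q_j(z_i)| = o_P(1)$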

The assumption that moments of are bounded slightly strengthens the condition in (?) that fourth moments are bounded. The condition is necessary for the estimation of the final standard errors. Similarly, condition (iii) is stronger than the rate condition listed in Assumption 3. The more stringent condition is also useful for estimating standard errors. Finally, condition (ii) is analogous to Assumption 7, condition (iii) and is again useful for controlling the tail behavior of certain self-normalized sums.

The next result is the main result of the paper. It establishes the validity of standard inference procedures after model selection as well as the validity of the plug-in variance estimator.

###### Theorem 1.

Under the Assumptions 1-7,9 and Assumption 8(i), and in addition then and

If Assumptions 1-7,9 and Assumption 8(ii) hold with and in addition then for , the following convergences hold.

The theorem shows that the outlined procedure gives a valid method for performing inference for functionals after selection of series terms. Note that under Assumption 8(i) the $\sqrt{n}$ rate is not achieved because the functional does not have a mean square continuous derivative. By contrast, Assumption 8(ii) is sufficient for $\sqrt{n}$-consistency. Conditions under which the particular assumptions regarding the approximation of hold are well known. For example, conditions on for various common approximating dictionaries, including power series and regression splines, follow directly from those derived in (?). Asymptotic normality of these types of estimates under the high dimensional additively separable setting should therefore be viewed as a corollary to the above result.

Consider one example with the functional of interest being evaluation of at a point : . In this case, is linear and for all functions . This particular example does not attain a $\sqrt{n}$ convergence rate provided there is a sequence of functions in the linear span of such that converges to zero but is positive for each . Another example is the weighted average derivative for a weight function which satisfies regularity conditions. For example, the theorem holds if is differentiable, vanishes outside a compact set, and the density of is bounded away from zero wherever is positive. In this case, for by a change of variables provided that is continuously distributed with nonvanishing density . This is one possible set of sufficient conditions under which the weighted average derivative does achieve $\sqrt{n}$-consistency.

## 6. Additional discussion of implementation

The theorem above states that for a fixed (nonrandom) sequence , asymptotically normal estimates are achieved for certain functionals of interest of under the right regularity conditions. In practice, the choice of is important and it is useful to have a data-driven means by which to choose . This section provides suggestions for choosing such . This paper leaves these suggestions as heuristics and does not derive formal theory for their asymptotic performance; however, the finite sample performance of these heuristics is explored in the simulations in Section 7.

There is also the question of whether the penalty levels and in the Lasso optimizations can be chosen in a more data-driven way. There is less flexibility in these choices, since the Lasso bounds are predicated on the Regularization event. Using a smaller penalty level than suggested leads to over-selection of control variables, which can bias estimates. Further discussion of the effects of post-model-selection inference with over-selection can be found in (?).

Suppose that candidates for belong to the integer set . Suppose that is nonrandom as before and does not vary with . A simple proposal is as follows: for each , construct using the post-double model selection routine, a reduced second dictionary . This results in a set of selected dictionaries which are candidates for a final estimation step:

$$\big\{\big(p^{\underline{K}}(x), \tilde{q}^{L_{\underline{K}}}(z)\big), \ldots, \big(p^{\bar{K}}(x), \tilde{q}^{L_{\bar{K}}}(z)\big)\big\}.$$

Then can be chosen by selecting from the dictionaries in the above set. In principle, this can be done in many ways, including cross validation or BIC, possibly with additional over-smoothing (i.e., choosing larger than, say, the mean-square error optimal choice).
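One of the options mentioned, cross validation over the candidate set, might be sketched as follows, assuming NumPy and assuming that the candidate design matrices have already been built by the selection routine. The names `cv_mse` and `choose_K`, the dictionary layout of `candidates`, and the fold count are hypothetical choices; the sketch does not implement BIC or the over-smoothing adjustment.

```python
import numpy as np


def cv_mse(y, X, n_folds=5, seed=0):
    """K-fold cross-validated mean squared prediction error of OLS of y on X."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        sq_err += np.sum((y[test_idx] - X[test_idx] @ beta) ** 2)
    return sq_err / n


def choose_K(y, candidates, n_folds=5, seed=0):
    """Pick the candidate with the smallest cross-validated error.

    candidates maps each K to a pair (P_K, Q_tilde_K): the first dictionary
    with K terms and the reduced second dictionary selected for that K.
    Returns the chosen K and the dictionary of scores.
    """
    scores = {
        K: cv_mse(y, np.hstack([P, Q_tilde]), n_folds, seed)
        for K, (P, Q_tilde) in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

Using the same fold assignment for every candidate keeps the comparison across candidates from being driven by differences in the random splits.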

The informal reasoning behind this proposal is that for each , the post-double model selection selects confounders in such a way that the omitted variables bias resulting from possibly omitting covariates predictive of elements of is small relative to sampling variability. But there can be another component of omitted variables bias: namely, the bias resulting from excluding all confounders predictive only of signal in which is not accounted for by . However, this second component of omitted variables bias will plausibly be small if is chosen appropriately (i.e., whenever , a similar bound on such omitted variables bias may be expected to hold).

In addition to issues arising from choosing , the sparsity conditions imposed in the previous sections are restrictive. However, without such a sparse structure, it is difficult to construct meaningful estimates of . Furthermore, at the current time, there are no widely used procedures to test the null hypothesis of a sparse model that the author is aware of.

A major restriction implicit in the sparsity assumption is that the sparse approximation errors are small for all terms in . For instance, if , it is much less demanding to ask that there exists a good sparse predictor of based on , than it is to ask that , , all have quality sparse predictors. On the other hand, it is possible that the transformations or might have much better sparse representations in terms of than and have individually.

Therefore, an alternative strategy for the first stage is to have a model selection step for many distinct linear combinations given by . The strategy is outlined as follows. First, gather the set into an extended first stage dictionary: . Select a reduced conditioning dictionary with the nonparametric post-double selection method described above, except using in the first stage model selection step. Finally, in the post-model-selection estimation step, estimate using . This strategy is potentially useful since it further reduces the possibility for omitted variables bias. A clear tradeoff with using a distinct first stage dictionary is that due to the additional model selection steps introduced, more variables from can potentially be selected, leading to higher variability of the final estimate of .
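The steps of this alternative strategy can be sketched in code, assuming NumPy. Everything here is a simplified illustration: `lasso_cd` is a plain coordinate-descent Lasso with a single user-supplied penalty `lam`, which stands in for the penalty levels and loadings used in the paper, and the names `post_double_select` and `post_selection_ols` are hypothetical.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = np.array(y, dtype=float)              # current residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]               # remove coordinate j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]               # add the updated contribution back
    return b


def post_double_select(y, P_first, Q, lam):
    """Union of Lasso supports from regressing y and each first-stage term on Q."""
    selected = set(np.flatnonzero(lasso_cd(Q, y, lam)))
    for k in range(P_first.shape[1]):
        selected |= set(np.flatnonzero(lasso_cd(Q, P_first[:, k], lam)))
    return sorted(int(j) for j in selected)


def post_selection_ols(y, P, Q, selected):
    """OLS of y on the original dictionary P plus the selected columns of Q.

    Returns only the coefficients on the columns of P.
    """
    X = np.hstack([P, Q[:, selected]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[: P.shape[1]]
```

In this sketch the extended dictionary `P_first` is used only in the selection step, while the final regression uses the original dictionary `P`, mirroring the description above. Adding columns to `P_first` can only enlarge the selected set, which is the variance tradeoff noted in the text.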

As with the data-driven choice of , this suggestion is kept at the level of a heuristic at this moment. However, arguments in the proofs of the main results can easily be extended to allow a first stage dictionary provided that it has a subdictionary, for which Assumptions 1-4 hold, and that the number of selected conditioning variables remains with high probability. For example, if then no substantive modifications to the proof are necessary.

The finite sample performance of these heuristics is explored in the simulations in Section 7.

## 7. Simulation study

The results stated in the previous section suggest that post double selection type series estimation should exhibit good inference properties for additively separable conditional expectation models when the sample size is large. The following simulation study is conducted in order to illustrate the implementation and performance of the outlined procedure. Results from several other candidate estimators are also calculated to provide a comparison between the post-double method and other methods. Estimation and inference for two functionals of the conditional expectation function are considered. Two simulation designs are considered. In one design, the high dimensional component over which model selection is performed is a large series expansion in four variables. In the other design, the high dimensional component is a linear function of a large number of different covariates.

### 7.1. Low Dimensional Additively Separable Design

Consider the following model of continuous variables of the form:

 E[y|x]=E[y|x,z]=g(x)+h(z)

where in this simulation, the true function of interest, , and the conditioning function are given by :

$$g(x) = \operatorname{logistic}(x) - \frac{1}{2}, \qquad h(z) = \operatorname{logistic}\Bigg(\sum_{j=1}^{\dim(z)} z_j\Bigg) - \frac{1}{2}$$

where and the terms in the expression for are used to ensure identifiability via . Ex post, the function is simple; however, for the sake of the simulation, the logistic form is assumed unknown. Importantly, the logistic function will not belong exactly to the span of any finite series expansion used in the simulation below. The second function is similar, being defined as a logistic function of a linear combination of the variables. The logistic part can potentially require many interaction terms unknown in advance to produce an accurate model. The component functions and will be used throughout the simulation. The remaining parameters, e.g., those dictating the data generating processes for , will be changed across simulations to illustrate performance across different settings.
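The two component functions of the design can be written down directly. The following is a small sketch using only the Python standard library; the function names are illustrative, and because the distributions of the regressors are not specified at this point in the text, the sketch defines only the conditional mean and does not draw data.

```python
import math


def logistic(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def g(x):
    """Function of interest: logistic(x) - 1/2."""
    return logistic(x) - 0.5


def h(z):
    """Conditioning function: logistic of the sum of the entries of z, minus 1/2."""
    return logistic(sum(z)) - 0.5


def conditional_mean(x, z):
    """Additively separable conditional mean g(x) + h(z)."""
    return g(x) + h(z)
```

Subtracting one half centers both functions so that each equals zero at the origin.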

The objective is to estimate a population average derivative, , and a function evaluation, given by

(i)

(ii)

and