# Double/De-Biased Machine Learning Using Regularized Riesz Representers

Victor Chernozhukov vchern@mit.edu
Massachusetts Institute of Technology

Whitney Newey wnewey@mit.edu
Massachusetts Institute of Technology

James Robins robins@hsph.harvard.edu
Harvard University

###### Abstract

We provide adaptive inference methods for linear functionals of sparse linear approximations to the conditional expectation function. Examples of such functionals include average derivatives, policy effects, average treatment effects, and many others. The construction relies on building Neyman-orthogonal equations that are approximately invariant to perturbations of the nuisance parameters, including the Riesz representer for the linear functionals. We use ℓ1-regularized methods to learn the approximations to the regression function and the Riesz representer, and construct the estimator for the linear functionals as the solution to the orthogonal estimating equations. We establish that under weak assumptions the estimator concentrates in a neighborhood of the target with deviations controlled by the normal laws, and that the estimator attains the semi-parametric efficiency bound in many cases. In particular, either the approximation to the regression function or the approximation to the Riesz representer can be “dense” as long as one of them is sufficiently “sparse”. Our main results are non-asymptotic and imply asymptotic uniform validity over large classes of models.

Keywords: Approximate Sparsity, Double/De-biased Machine Learning, Regularized Riesz Representers, Linear Functionals

## 1 Introduction

Consider a random vector W = (Y, X) with distribution P and finite second moments, where the outcome Y takes values in ℝ and the covariate X takes values in 𝒳, a Borel subset of ℝ^d. Denote the conditional expectation function map by x ↦ γ*0(x) := E[Y | X = x]. We consider a function γ0, given by γ0(x) = b(x)'β0, as a sparse linear approximation to γ*0, where β0 is a p-dimensional vector and b = (b1, …, bp)' is a dictionary of basis functions mapping 𝒳 to ℝ^p. The dimension p here can be large, potentially much larger than the sample size n.

Our goal is to construct high-quality inference methods for a real-valued linear functional of γ0 given by:

$$
\theta_0 = \mathrm{E}\, m(X, \gamma_0) = \int m(x, \gamma_0)\, dF(x),
$$

where F is the distribution of X under P; examples include the average derivative and the other functionals listed below. (See Section 2 below regarding formal requirements on m.) When the approximation error rγ := γ*0 − γ0 is small, our inference will automatically re-focus on a more ideal target – the linear functional of the conditional expectation function,

$$
\theta^*_0 = \mathrm{E}\, m(X, \gamma^*_0).
$$
###### Example 1 (Average Derivative)

Consider an average derivative in the direction a:

$$
\theta_0 = \mathrm{E}\, a' \nabla_x \gamma_0(X), \quad a \in \mathbb{R}^d.
$$

This functional corresponds to an approximation to the effect of a policy that shifts the distribution of covariates via the map x ↦ x + a, so that

$$
\int \left( \gamma_0(x + a) - \gamma_0(x) \right) dF(x) \approx \int a' \nabla_x \gamma_0(x)\, dF(x).
$$
###### Example 2 (Policy Effect from a Distribution Shift)

Consider the effect of a counterfactual change of the covariate distribution from F0 to F1:

$$
\theta_0 = \int \gamma_0(x)\, d(F_1(x) - F_0(x)) = \int \gamma_0(x) \left[ \frac{d(F_1 - F_0)(x)}{dF(x)} \right] dF(x).
$$
###### Example 3 (Average Treatment Effect)

Consider the average treatment effect under unconfoundedness. Here X = (D, Z) and γ0(X) = γ0(D, Z), where D ∈ {0, 1} is the indicator of the receipt of the treatment, and

$$
\theta_0 = \int \left( \gamma_0(1, z) - \gamma_0(0, z) \right) dF(z),
$$

where F(z) denotes the marginal distribution of the controls Z.

We consider γ ↦ E m(X, γ) as a continuous linear functional on Γ, the linear subspace of L²(P) spanned by the given dictionary b. Such functionals can be represented via the Riesz representation:

$$
\mathrm{E}\, m(X, \gamma) = \mathrm{E}\, \gamma(X)\, \alpha_0(X), \quad \forall \gamma \in \Gamma,
$$

where the Riesz representer α0(X) = b(X)'ρ0 is identified by the system of equations:

$$
\mathrm{E}\, m(X, b) = \mathrm{E}\, b(X) b(X)' \rho_0,
$$

where m(X, b) := (m(X, b1), …, m(X, bp))' is the componentwise application of the linear functional to the dictionary b. Having the Riesz representer allows us to write the following “doubly robust” representation for θ0:

$$
\theta_0 = \mathrm{E}\left[ m(X, \gamma_0) + \alpha_0(X)\, (Y - \gamma_0(X)) \right].
$$

This representation is approximately invariant to small perturbations of the parameters β and ρ around their true values β0 and ρ0 (see Lemma 2 below for details), a property sometimes referred to as Neyman-type orthogonality (see, e.g., Chernozhukov et al. (2016a)), making this representation a good basis for estimation and inference in modern high-dimensional settings. (See also Proposition 5 in Chernozhukov et al. (2016b) for a formal characterization of scores of this sort as having the double robustness property in the sense of Robins and Rotnitzky (1995).)

Our estimation and inference will exploit an empirical analog of this equation, given a random sample (W1, …, Wn) of i.i.d. copies of W = (Y, X). Instead of the unknown γ0 and α0 we plug in estimators obtained using ℓ1-regularization. We use sample-splitting in the form of cross-fitting to operate under weak assumptions on the problem, requiring only approximate sparsity of either β0 or ρ0, with some weak restrictions on the sparsity indices. For example, if both parameter values are sparse, then the product of their effective dimensions has to be much smaller than n, the sample size. Moreover, one of the parameter values, but not both, can actually be “dense” and estimated at the so-called “slow” rate, as long as the other parameter is sparse, with effective dimension smaller than √n.

We establish that the resulting “double” (or de-biased) machine learning (DML) estimator concentrates in a neighborhood of the target with deviations controlled by the normal laws,

$$
\sup_{t \in \mathbb{R}} \left| \mathrm{P}\left( \sqrt{n}\, \sigma^{-1} (\hat\theta - \theta_0) \le t \right) - \Phi(t) \right| \le \varepsilon_n,
$$

where non-asymptotic bounds on εn are also given.

As the dimension p of b grows, suppose that the subspace Γ becomes larger and approximates an infinite-dimensional linear subspace Γ* of L²(P) that contains the true γ*0. We assume the functional γ ↦ E m(X, γ) is continuous on Γ* in this case. If the approximation bias is small,

$$
\sqrt{n}\, (\theta_0 - \theta^*_0) \to 0,
$$

our inference will automatically focus on the “ideal” target θ*0. Therefore our inference can be interpreted as targeting the functionals of the conditional expectation function in regimes where we can successfully approximate them. However, our approach does not hinge on this property and retains interpretability and good properties under misspecification.

It is interesting to note that in the latter case α0 will approximate α*0, the true Riesz representer for the linear functional on Γ*, identified by the system of equations:

$$
\mathrm{E}\, m(X, \gamma) = \mathrm{E}\, \gamma(X)\, \alpha^*_0(X), \quad \forall \gamma \in \Gamma^*.
$$

Note that θ*0 has a “doubly robust” representation:

$$
\theta^*_0 = \mathrm{E}\left[ m(X, \gamma^*_0) + \alpha^*_0(X)\, (Y - \gamma^*_0(X)) \right], \qquad (1)
$$

which is invariant to perturbations of γ*0 and α*0. Hence, our approach can be viewed as approximately solving the empirical analog of these equations in the regimes where Γ approximates Γ*. In such cases our estimator attains the semi-parametric efficiency bound, because its influence function is in fact the efficient score for θ*0; see van der Vaart (1991); Newey (1994).
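The invariance with respect to the regression function can be verified in one line: for any γ ∈ Γ*, using the Riesz representation E m(X, γ) = E γ(X)α*0(X) and iterated expectations E[α*0(X)Y] = E[α*0(X)γ*0(X)],

```latex
\mathrm{E}\big[\, m(X,\gamma) + \alpha^*_0(X)\,(Y - \gamma(X)) \,\big]
  = \mathrm{E}\,\alpha^*_0(X)\,\gamma(X)
  + \mathrm{E}\,\alpha^*_0(X)\,\gamma^*_0(X)
  - \mathrm{E}\,\alpha^*_0(X)\,\gamma(X)
  = \mathrm{E}\, m(X,\gamma^*_0)
  = \theta^*_0.
```

The symmetric calculation, perturbing α*0 to any α ∈ Γ* while holding γ*0 fixed, uses E[Y − γ*0(X) | X] = 0 and delivers the same conclusion.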

When Γ* = L²(P) and the functional is continuous on it, the Riesz representer α*0 belongs to L²(P) and can be stated explicitly in many examples:

• in Example 1: α*0(X) = −a'∇x log f(X), where f is the density of X,

• in Example 2: α*0(X) = d(F1 − F0)(X)/dF(X),

• in Example 3: α*0(X) = D/π(Z) − (1 − D)/(1 − π(Z)), where π(Z) := P(D = 1 | Z) is the propensity score.

However, such closed-form solutions are not available in many other examples, or when Γ* is smaller than L²(P), which is probably the most realistic situation occurring in practice.
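As a sanity check on the closed form in Example 3, the following toy calculation (all population numbers are hypothetical, chosen only for illustration) verifies the double robustness of the representation θ0 = E[m(X, γ0) + α0(X)(Y − γ0(X))]: with the true inverse-propensity Riesz representer plugged in, the expectation recovers the true average treatment effect exactly, even when the regression function is misspecified.

```python
# Toy discrete population: Z in {0,1}, D in {0,1}; expectations computed exactly.
p_z = {0: 0.6, 1: 0.4}                    # P(Z = z), hypothetical
prop = {0: 0.3, 1: 0.7}                   # propensity P(D = 1 | Z = z), hypothetical
gamma_true = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 3.0, (1, 1): 5.0}    # E[Y | D, Z]
gamma_bad = {k: v + 0.7 * k[0] - 1.3 for k, v in gamma_true.items()}  # misspecified

def alpha(d, z):
    """Closed-form Riesz representer for the ATE (Example 3)."""
    return d / prop[z] - (1 - d) / (1 - prop[z])

def dr_theta(gamma):
    """E[m(X, gamma) + alpha(X)(Y - gamma(X))], with Y replaced by its
    conditional mean gamma_true (valid since alpha, gamma depend on (D, Z) only)."""
    total = 0.0
    for z in (0, 1):
        for d in (0, 1):
            p = p_z[z] * (prop[z] if d == 1 else 1 - prop[z])   # P(D = d, Z = z)
            m = gamma[(1, z)] - gamma[(0, z)]                   # m(X, gamma)
            total += p * (m + alpha(d, z) * (gamma_true[(d, z)] - gamma[(d, z)]))
    return total

theta0 = dr_theta(gamma_true)     # true ATE: 0.6*(2-1) + 0.4*(5-3) = 1.4
```

Here `dr_theta(gamma_bad)` also returns 1.4: the correction term exactly cancels the bias introduced by the wrong regression function, mirroring the invariance property used throughout the paper.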

Using closed-form solutions for Riesz representers in several leading examples and their machine learning estimators, Chernozhukov et al. (2016a) defined DML estimators of θ0 in high-dimensional settings and established their good properties. Compared to this approach, the new approach proposed in this paper has the following advantages and some limitations:

1. It automatically estimates the Riesz representer from the empirical analog of the equations that implicitly characterize it.

2. It does not rely on closed-form solutions for α*0, which generally are not available.

3. When closed-form solutions for α*0 are available, it avoids directly estimating α*0. For example, it avoids estimating derivatives of densities in Example 1 or inverting estimated propensity scores in Example 3. Rather, it estimates the projection of α*0 on the subspace Γ, which is a much simpler problem when the dimension of X is high.

4. Our approach remains interpretable under misspecification – when approximation errors are not small, we simply target inference on θ0 instead of θ*0.

5. While the current paper focuses only on sparse regression methods, the approach readily extends to cover other machine learning estimators as long as we can find (numerically) dictionaries that approximately span the realizations of the machine learning estimators and their probability limits.

6. The current approach is limited to linear functionals, but in ongoing work we are able to extend the approach by first performing a linear expansion and then applying our new methods to the linear part of the expansion.

The paper also builds upon ideas in classical semi-parametric learning theory with low-dimensional X, which focused inference on the ideal target θ*0 using traditional smoothing methods for estimating the nuisance parameters γ*0 and α*0 [van der Vaart (1991); Newey (1994); Bickel et al. (1998); Robins and Rotnitzky (1995); van der Vaart (1998)], methods that do not apply to the current high-dimensional setting. Our paper also builds upon and contributes to the literature on modern orthogonal/debiased estimation and inference [Zhang and Zhang (2014); Belloni et al. (2014a, b, 2015); Javanmard and Montanari (2014a, b, 2015); van de Geer et al. (2014); Ning and Liu (2014); Chernozhukov et al. (2015); Neykov et al. (2015); Ren et al. (2015); Jankova and Van De Geer (2015, 2016a, 2016b); Bradic and Kolar (2017); Zhu and Bradic (2017b, a)], which focused on inference on the coefficients in high-dimensional linear and generalized linear regression models, without considering the general linear functionals analyzed here.

Notation. Let W be a random vector with law P on the sample space 𝒲, and let (W1, …, Wn) denote i.i.d. copies of W. All models and probability measures can be indexed by n, the sample size, so that the models and their dimensions can change with n, allowing any of the dimensions to increase with n.

We use notation from empirical process theory; see Van Der Vaart and Wellner (1996). Let E_I f denote the empirical average of f(W) over i ∈ I:

$$
\mathrm{E}_I f := \mathrm{E}_I f(W) := |I|^{-1} \sum_{i \in I} f(W_i).
$$

Let G_I f denote the empirical process over i ∈ I, namely

$$
\mathbb{G}_I f := \mathbb{G}_I f(W) := |I|^{-1/2} \sum_{i \in I} \left( f(W_i) - \mathrm{P} f \right),
$$

where P f := E f(W). Denote by ‖f‖_{P,2} the L²(P) norm of a measurable function f mapping the support of W to the real line, which is also the L²(P) norm of the random variable f(W). We use ‖·‖_q to denote the ℓ_q norm on ℝ^p.

For a differentiable map x ↦ g(x), mapping ℝ^d to ℝ^k, we use ∂_{x'} g to abbreviate the partial derivatives (∂/∂x') g(x), and we correspondingly use the expression ∂_{x'} g(x0) to mean ∂_{x'} g(x) |_{x = x0}, etc. We use x' to denote the transpose of a column vector x.

## 2 The DML with Regularized Riesz Representers

### 2.1 Sparse Approximations for the Regression Function and the Riesz Representer

We work with the setup above. Consider the conditional expectation function γ*0(X) = E[Y | X] such that ‖γ*0‖_{P,2} < ∞, and a p-vector of dictionary terms b(X) = (b1(X), …, bp(X))' such that ‖bj‖_{P,2} < ∞ for each j. The dimension p of the dictionary can be large, potentially much larger than n.

We approximate γ*0 as

$$
\gamma^*_0 = \gamma_0 + r_\gamma := b'\beta_0 + r_\gamma,
$$

where rγ is the approximation error and γ0 = b'β0 is the “best sparse linear approximation,” defined via the following Dantzig Selector-type problem (Candes and Tao (2007)).

###### Definition 1 (Best Sparse Linear Predictor)

Let β0 be a minimal ℓ1-norm solution to the approximate best linear predictor equations:

$$
\beta_0 \in \arg\min \|\beta\|_1 : \left\| \mathrm{E}\left[ b(X)\, (Y - b(X)'\beta) \right] \right\|_\infty \le \lambda^0_\beta.
$$

When λ^0_β = 0, β0 becomes the best linear predictor (BLP) parameter.

We refer to the resulting approximation as “sparse”, since such solutions often are indeed sparse. Note that since E[b(X)Y] = E[b(X)γ*0(X)], the approximation error rγ is approximately orthogonal to b:

$$
\left\| \mathrm{E}[b(X)(Y - b(X)'\beta_0)] \right\|_\infty = \left\| \mathrm{E}[b(X)(\gamma^*_0(X) - b(X)'\beta_0)] \right\|_\infty = \left\| \mathrm{E}[b(X)\, r_\gamma(X)] \right\|_\infty \le \lambda^0_\beta.
$$

Consider a linear subspace Γ* of L²(P) that contains Γ, the linear subspace generated by the dictionary b. In some of the asymptotic results that follow, we can have p → ∞ as n → ∞.

• (C) For each γ ∈ Γ*, consider a linear map γ ↦ m(·, γ) from Γ* to L¹(P), such that for each γ, the map x ↦ m(x, γ) is measurable, and the functional γ ↦ E m(X, γ) is continuous on Γ* with respect to the ‖·‖_{P,2} norm.

Under the continuity condition in C, this functional admits a Riesz representer α*0 ∈ Γ*. We approximate the Riesz representer via:

$$
\alpha^*_0(X) = b(X)'\rho_0 + r_\alpha(X),
$$

where rα is the approximation error and α0 = b'ρ0 is the best sparse linear approximation, defined as follows.

###### Definition 2 (Best Sparse Linear Riesz Representer)

Let ρ0 be a minimal ℓ1-norm solution to the approximate Riesz representation equations:

$$
\rho_0 \in \arg\min \|\rho\|_1 : \left\| \mathrm{E}\, m(X, b) - \mathrm{E}\, b(X) b(X)' \rho \right\|_\infty \le \lambda^0_\rho,
$$

where λ^0_ρ ≥ 0 is a regularization parameter. When λ^0_ρ = 0, we obtain α0 = b'ρ0, the Riesz representer for the functional restricted to Γ.

As before, we refer to the resulting approximation as “sparse”, since the solutions to this problem will often be sparse. Since E m(X, b) = E α*0(X) b(X), we conclude that the approximation error rα is approximately orthogonal to b:

$$
\left\| \mathrm{E}\left[ (\alpha^*_0(X) - b(X)'\rho_0)\, b(X) \right] \right\|_\infty = \left\| \mathrm{E}[r_\alpha(X)\, b(X)] \right\|_\infty \le \lambda^0_\rho.
$$

The estimation will be carried out using the sample analogs of the problems above, which are special cases of the following Dantzig Selector-type problem.

###### Definition 3 (Regularized Minimum Distance (RMD))

Consider a parameter t ∈ T, where T is a convex subset of ℝ^p with sup_{t∈T} ‖t‖1 ≤ Bn. Consider the moment function g and the estimated moment function ĝ, mapping ℝ^p to ℝ^p:

$$
g(t) = Gt + M; \qquad \hat g(t) = \hat G t + \hat M,
$$

where G and Ĝ are p by p non-negative-definite matrices and M and M̂ are p-vectors. Assume that t0 is the target parameter, well-defined by:

$$
t_0 \in \arg\min \|t\|_1 : \|g(t)\|_\infty \le \lambda_0, \; t \in T. \qquad (2)
$$

Define the RMD estimator t̂ by solving

$$
\hat t \in \arg\min \|t\|_1 : \|\hat g(t)\|_\infty \le \lambda_0 + \lambda_1, \; t \in T,
$$

where λ1 is chosen such that ‖ĝ(t0) − g(t0)‖∞ ≤ λ1 with probability at least 1 − εn.
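The RMD program is a linear program and can be solved with off-the-shelf LP solvers. Below is a minimal sketch (our own illustration; the paper does not prescribe a solver) using the standard split t = u − v with u, v ≥ 0, which turns the ℓ1 objective and the sup-norm constraint into linear constraints; the additional constraint t ∈ T is omitted for simplicity.

```python
import numpy as np
from scipy.optimize import linprog

def rmd(G_hat, M_hat, lam):
    """Regularized minimum distance: min ||t||_1 s.t. ||G_hat @ t + M_hat||_inf <= lam."""
    p = len(M_hat)
    c = np.ones(2 * p)  # objective sum(u) + sum(v), which equals ||t||_1 at the optimum
    A_ub = np.block([[G_hat, -G_hat], [-G_hat, G_hat]])
    b_ub = np.concatenate([lam - M_hat, lam + M_hat])  # two-sided sup-norm constraint
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
```

For instance, with Ĝ the 2 by 2 identity, M̂ = (−1, 0)' and λ = 0.5, the solution is t̂ = (0.5, 0)': the cheapest point in ℓ1 norm satisfying |t1 − 1| ≤ 0.5 and |t2| ≤ 0.5.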

We define the estimators of β0 and ρ0 over a subset of the data, indexed by a non-empty subset A of {1, …, n}.

###### Definition 4 (RMD Estimator for Sparse BLP)

Define β̂_A as the RMD estimator with t = β, T a convex set with sup_{β∈T} ‖β‖1 ≤ Bn, and

$$
\hat G = \mathrm{E}_A\, b b', \quad G = \mathrm{E}\, b b', \quad \hat M = -\mathrm{E}_A\, Y b, \quad M = -\mathrm{E}\, Y b.
$$

###### Definition 5 (RMD Estimator for Sparse Riesz Representer)

Define ρ̂_A as the RMD estimator with t = ρ, T a convex set with sup_{ρ∈T} ‖ρ‖1 ≤ Bn, and

$$
\hat G = \mathrm{E}_A\, b b', \quad G = \mathrm{E}\, b b', \quad \hat M = -\mathrm{E}_A\, m(X, b), \quad M = -\mathrm{E}\, m(X, b).
$$

### 2.2 Properties of RMD Estimators

Consider sequences of positive constants ℓ1n, ℓ2n, Bn, and εn, indexed by the sample size n.

• (MD) We have t0 ∈ T with sup_{t∈T} ‖t‖1 ≤ Bn, and the empirical moments obey the following bounds with probability at least 1 − εn:

$$
\sqrt{n}\, \| \hat G - G \|_\infty \le \ell_{1n} \quad \text{and} \quad \sqrt{n}\, \| \hat M - M \|_\infty \le \ell_{2n}.
$$

Note that in many applications the factors ℓ1n and ℓ2n can be chosen to grow slowly, like √(log p), using self-normalized moderate deviation bounds (Jing et al. (2003); Belloni et al. (2014b)), under mild moment conditions, without requiring sub-Gaussianity.

Define the identifiability factor for t0 as:

$$
s^{-1}(t_0) := \inf_{\delta \in R(t_0)} \frac{|\delta' G \delta|}{\|\delta\|_1^2},
$$

where R(t0) is the restricted set:

$$
R(t_0) := \left\{ \delta : \|t_0 + \delta\|_1 \le \|t_0\|_1, \; t_0 + \delta \in T \right\},
$$

with the convention that the infimum over an empty set is +∞. The restricted set contains the estimation error t̂ − t0 for RMD estimators with probability at least 1 − εn. We call the inverse of the identifiability factor, s(t0), the “effective dimension”, as it captures the effective dimensionality of t0; see the remark below.

###### Remark 1 (Identifiability and Effective Dimension)

The identifiability factors were introduced in Chernozhukov et al. (2013) as a generalization of the restricted eigenvalue of Bickel et al. (2009). Indeed, given a vector δ, let δ_{T0} denote the vector with j-th component equal to δj if j ∈ T0 and to 0 if j ∉ T0, where T0 is the support of t0. Then

$$
s(t_0) \le 4s/k^2, \qquad k^2 := \inf_{\delta \in R(t_0)} \frac{|\delta' G \delta|}{\|\delta_{T_0}\|_2^2},
$$

where k² is the restricted eigenvalue and s := ‖t0‖0 is the number of non-zero components of t0. The inequality follows since ‖t0 + δ‖1 ≤ ‖t0‖1 implies ‖δ_{T0^c}‖1 ≤ ‖δ_{T0}‖1, so that ‖δ‖1 ≤ 2‖δ_{T0}‖1 ≤ 2√s ‖δ_{T0}‖2. Hence we can view s(t0) as a measure of the effective dimension of t0.

###### Lemma 1 (Regularized Minimum Distance Estimation)

Suppose that condition MD holds. Let

$$
\bar\lambda := \tilde\ell_n / \sqrt{n}, \qquad \tilde\ell_n := \ell_{1n} B_n + \ell_{2n}.
$$

Define the RMD estimand t0 via (2) with λ0 ≤ λ̄, and set λ1 = λ̄. Then with probability at least 1 − εn the estimator t̂ exists and obeys:

$$
(\hat t - t_0)' G (\hat t - t_0) \le \left( s(t_0)\, (4\bar\lambda)^2 \right) \wedge \left( 8 B_n \bar\lambda \right).
$$

There are two bounds on the rate: the “slow rate” bound 8Bnλ̄ and the “fast rate” bound s(t0)(4λ̄)². The “fast rate” bound is tighter in regimes where the “effective dimension” s(t0) is not too large.

###### Corollary 1 (RMD for BLP and Riesz Representer)

Suppose β0 and ρ0 belong to a convex set T with sup_{t∈T} ‖t‖1 ≤ Bn. Consider a random subset A of {1, …, n} of size n/K, where K is a fixed integer. Suppose that the dictionary b, the outcome Y, and the functional m obey, with probability at least 1 − εn,

$$
\max_{j,k \le p} \left| \mathbb{G}_A\, b_k(X) b_j(X) \right| \le \ell_{1n}, \quad \max_{j \le p} \left| \mathbb{G}_A\, Y b_j(X) \right| \le \ell_{2n}, \quad \max_{j \le p} \left| \mathbb{G}_A\, m(X, b_j) \right| \le \ell_{2n}.
$$

If we set λ0 and λ1 as in Lemma 1, with n/K playing the role of n, then with probability at least 1 − εn the estimation errors u := β̂_A − β0 and v := ρ̂_A − ρ0 obey

$$
u \in R(\beta_0), \qquad [u'Gu]^{1/2} \le r_1 := 4\left[ \left( \tilde\ell_n (s(\beta_0)/n)^{1/2} \right) \wedge \left( \tilde\ell_n^{1/2} B_n n^{-1/4} \right) \right],
$$
$$
v \in R(\rho_0), \qquad [v'Gv]^{1/2} \le r_2 := 4\left[ \left( \tilde\ell_n (s(\rho_0)/n)^{1/2} \right) \wedge \left( \tilde\ell_n^{1/2} B_n n^{-1/4} \right) \right].
$$

The corollary follows because the stated conditions imply condition MD by the Hölder inequality.

### 2.3 Approximate Neyman-orthogonal Score Functions and the DML Estimator

Our DML estimator of θ0 will be based on the following score function:

$$
\psi(W, \theta; \beta, \rho) = \theta - m(X, b)'\beta - \rho' b(X)\, (Y - b(X)'\beta).
$$
###### Lemma 2 (Basic Properties of the Score)

The score function has the following properties:

$$
\partial_\beta \psi(W, \theta; \beta, \rho) = -m(X, b) + (\rho' b(X))\, b(X), \qquad \partial_\rho \psi(W, \theta; \beta, \rho) = -b(X)\, (Y - b(X)'\beta),
$$
$$
\partial^2_{\beta\beta'} \psi(W, \theta; \beta, \rho) = \partial^2_{\rho\rho'} \psi(W, \theta; \beta, \rho) = 0, \qquad \partial^2_{\beta\rho'} \psi(W, \theta; \beta, \rho) = b(X) b(X)'.
$$

This score function is approximately Neyman orthogonal at the sparse approximation (β0, ρ0), namely

$$
\left\| \mathrm{E}\left[ \partial_\beta \psi(W, \theta_0; \beta, \rho_0) \right] \right\|_\infty = \left\| \mathrm{E}\, m(X, b) - G \rho_0 \right\|_\infty \le \lambda^0_\rho,
$$
$$
\left\| \mathrm{E}\left[ \partial_\rho \psi(W, \theta_0; \beta_0, \rho) \right] \right\|_\infty = \left\| \mathrm{E}\left[ b(X)\, (Y - b(X)'\beta_0) \right] \right\|_\infty \le \lambda^0_\beta.
$$

The second claim of the lemma is immediate from Definition 1, and the first follows from Definition 2 by elementary calculations.

The approximate orthogonality property above says that the score function is approximately invariant to small perturbations of the nuisance parameters β and ρ around their “true values” β0 and ρ0. When λ^0_β = 0 and λ^0_ρ = 0, the score function is exactly invariant and becomes the doubly robust score. This approximate invariance property plays a crucial role in removing the impact of biased estimation of the nuisance parameters β0 and ρ0 on the estimation of the main parameter θ0. We now define the double/de-biased machine learning (DML) estimator, which makes use of cross-fitting, an efficient form of data splitting.

###### Definition 6 (DML Estimator)

Consider a partition of {1, …, n} into K blocks (I_k)_{k=1}^K, with n/K observations in each block (assume for simplicity that K divides n). For each k = 1, …, K, let θ̂_{I_k} be defined as the root of:

$$
\mathrm{E}_{I_k}\, \psi(W, \hat\theta_{I_k}; \hat\beta_{I^c_k}, \hat\rho_{I^c_k}) = 0,
$$

where β̂_{I^c_k} and ρ̂_{I^c_k} are RMD estimators of β0 and ρ0 obtained using the observations with indices in I^c_k = {1, …, n} \ I_k. Define the DML estimator θ̂ as the average of the estimators obtained over the blocks:

$$
\hat\theta = \frac{1}{K} \sum_{k=1}^{K} \hat\theta_{I_k}.
$$
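The cross-fitting recipe above is mechanical once the nuisance fitters are available. The sketch below (our own illustration) takes the nuisance estimators as callables; in the paper these would be the RMD estimators of Definitions 4 and 5, but any learners with the required rates can be plugged in, and ordinary least squares is used in the test case purely for simplicity.

```python
import numpy as np

def dml(mb, b, y, fit_beta, fit_rho, K=5, seed=0):
    """Cross-fitted DML for theta_0 = E m(X, gamma_0) with gamma_0 = b'beta_0.

    mb : (n, p) array, rows m(X_i, b), the functional applied to the dictionary
    b  : (n, p) array, rows b(X_i)
    y  : (n,) outcomes
    fit_beta, fit_rho : callables returning nuisance coefficient vectors,
        fitted on the complement sample (RMD estimators in the paper).
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    thetas = []
    for I in folds:
        Ic = np.setdiff1d(np.arange(n), I)         # complement sample indices
        beta = fit_beta(b[Ic], y[Ic])              # regression nuisance
        rho = fit_rho(b[Ic], mb[Ic])               # Riesz representer nuisance
        # root of E_I psi(W, theta; beta, rho) = 0 on the held-out block:
        thetas.append(np.mean(mb[I] @ beta + (b[I] @ rho) * (y[I] - b[I] @ beta)))
    return float(np.mean(thetas))
```

Because the score is doubly robust, the block-level averages stay centered at the target as long as either nuisance is estimated well, which is what the cross-fitting structure exploits.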

### 2.4 Properties of DML with Regularized Riesz Representers

We will now state some sufficient regularity conditions for DML. Let n denote the sample size. To give asymptotic statements, let ℓ1n, ℓ2n, ℓ3n, Bn, δn, Δn, and εn denote sequences of positive constants. Let c, C, and q be positive constants such that q > 2, and let K be a fixed integer. We consider a sequence of integers n such that K divides n (to simplify notation). Fix all of these sequences and constants. Consider a law P for W that satisfies the following conditions:

• (R.1) Assume that condition C holds and that in Definitions 1 and 2 it is possible to set λ^0_β and λ^0_ρ such that (a) λ^0_β ∨ λ^0_ρ ≤ ℓ̃n/√n and (b) the resulting parameter values β0 and ρ0 are well behaved, with β0 ∈ T and ρ0 ∈ T, where T is a convex set with sup_{t∈T} ‖t‖1 ≤ Bn.

• (R.2) Given a random subset A of {1, …, n} of size n/K, the terms of the dictionary, the outcome, and the functional obey, with probability at least 1 − εn,

$$
\max_{j,k \le p} \left| \mathbb{G}_A\, b_k(X) b_j(X) \right| \le \ell_{1n}, \quad \max_{j \le p} \left| \mathbb{G}_A\, Y b_j(X) \right| \le \ell_{2n}, \quad \max_{j \le p} \left| \mathbb{G}_A\, m(X, b_j) \right| \le \ell_{2n}.
$$

• (R.3) Assume that σ² := E ψ0²(W) obeys c ≤ σ ≤ C, for ψ0 defined below, and that the following continuity relations hold for all u ∈ R(β0) and v ∈ R(ρ0):

$$
\sqrt{\mathrm{Var}\left( \left( m(X, b) + (\rho_0' b(X))\, b(X) \right)' u \right)} \le \ell_{3n} \left\| b(X)'u \right\|_{P,2},
$$
$$
\sqrt{\mathrm{Var}\left( (Y - b(X)'\beta_0)\, b(X)'v \right)} \le \ell_{3n} \left\| b(X)'v \right\|_{P,2},
$$
$$
\sqrt{\mathrm{Var}\left( u' b(X) b(X)' v \right)} \le \ell_{3n} \left( \left\| b(X)'u \right\|_{P,2} + \left\| b(X)'v \right\|_{P,2} \right).
$$

• (R.4) For

$$
r_1 := 4\left[ \left( \tilde\ell_n (s(\beta_0)/n)^{1/2} \right) \wedge \left( \tilde\ell_n^{1/2} B_n n^{-1/4} \right) \right], \qquad r_2 := 4\left[ \left( \tilde\ell_n (s(\rho_0)/n)^{1/2} \right) \wedge \left( \tilde\ell_n^{1/2} B_n n^{-1/4} \right) \right],
$$

we have:

$$
n^{1/2} r_1 r_2 + \ell_{3n} (r_1 + r_2) \le \delta_n.
$$

R.1 requires that sparse approximations of γ*0 and α*0 with respect to the dictionary b admit well-behaved parameters β0 and ρ0. R.2 is a weak assumption that can be satisfied by taking ℓ1n and ℓ2n proportional to √(log p), under weak assumptions on moments, as follows from self-normalized moderate deviation bounds (Jing et al. (2003); Belloni et al. (2014b)), without requiring sub-Gaussian tail bounds. R.3 imposes a modulus of continuity for bounding variances: if the elements of the dictionary are bounded with probability one, then we can select ℓ3n to be slowly growing for many functionals of interest, so the assumption is plausible. R.4 imposes conditions on the rates r1 and r2 of consistency of estimating β0 and ρ0, requiring in particular that √n r1 r2, an upper bound on the bias of the DML estimator, is small; see also the discussion below.

Consider the oracle estimator based upon the true score functions:

$$
\bar\theta = \theta_0 + n^{-1} \sum_{i=1}^{n} \psi_0(W_i), \qquad \psi_0(W_i) := -\psi(W_i, \theta_0; \beta_0, \rho_0).
$$
###### Theorem 1 (Adaptive Estimation and Approximate Gaussian Inference)

Under R.1–R.4, we have the adaptivity property, namely that the difference between the DML and the oracle estimator is small: for any Δn ∈ (0, 1),

$$
\left| \sqrt{n}\, (\hat\theta - \bar\theta) \right| \le R_n := \bar C \delta_n / \Delta_n
$$

with probability at least 1 − Δn − Kεn, where C̄ is an absolute constant. As a consequence, θ̂ concentrates in a neighborhood of θ0, with deviations approximately distributed according to the Gaussian law, namely:

$$
\sup_{z \in \mathbb{R}} \left| \mathrm{P}_P\left( \sigma^{-1} \sqrt{n}\, (\hat\theta - \theta_0) \le z \right) - \Phi(z) \right| \le \bar C' \left( n^{-1/2} + R_n \right) + \Pi_n, \qquad (3)
$$

where σ² := E ψ0²(W), Πn := Δn + Kεn is the probability-deficiency term from the adaptivity bound, and C̄' only depends on the constants c, C, q, and K.

The constants and sequences can be chosen such that the right-hand side of (3) converges to zero as n → ∞, yielding the following asymptotic result.

###### Corollary 2 (Uniform Asymptotic Adaptivity and Gaussianity)

Fix the constants and the sequences of constants specified at the beginning of Section 2.4. Let 𝒫n be the set of probability laws P that obey conditions R.1–R.4 for the given n. Then the DML estimator θ̂ is uniformly asymptotically equivalent to the oracle estimator θ̄, that is, √n(θ̂ − θ̄) → 0 uniformly in P ∈ 𝒫n as n → ∞. Moreover, θ̂ is asymptotically Gaussian uniformly in P ∈ 𝒫n:

$$
\lim_{n \to \infty} \sup_{P \in \mathcal{P}_n} \sup_{z \in \mathbb{R}} \left| \mathrm{P}_P\left( \sigma^{-1} \sqrt{n}\, (\hat\theta - \theta_0) \le z \right) - \Phi(z) \right| = 0.
$$

Hence the DML estimator enjoys good properties under the stated regularity conditions. Less primitive regularity conditions can be deduced from the proofs directly.
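In practice the Gaussian approximation above is used to build confidence intervals θ̂ ± z σ̂/√n, with σ² estimated by the empirical second moment of the plugged-in score; this particular variance estimator is our illustrative choice, standard in the DML literature rather than a construction made in this paper.

```python
import math

def dml_confidence_interval(theta_hat, psi_values, z=1.959964):
    """Two-sided interval theta_hat +/- z * sigma_hat / sqrt(n), where psi_values
    holds the score psi(W_i, theta_hat; beta_hat, rho_hat) on the held-out folds."""
    n = len(psi_values)
    sigma_hat = math.sqrt(sum(s * s for s in psi_values) / n)  # sigma^2 = E psi_0^2
    half_width = z * sigma_hat / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```

The default z corresponds to 95% nominal coverage; by Theorem 1 the actual coverage error is controlled by the right-hand side of (3).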

###### Remark 7 (Sharpness of Conditions)

The key regularity condition R.4 imposes that the bounds r1 and r2 on the estimation errors are small and that the product √n r1 r2 is small. Ignoring the impact of the “slow” factors ℓ's, and assuming Bn is bounded by a constant, this requirement is satisfied if one of the effective dimensions is smaller than √n:

$$
\text{either } s(\beta_0) \ll \sqrt{n} \quad \text{or} \quad s(\rho_0) \ll \sqrt{n}.
$$

The latter possibility allows one of the parameter values to be “dense”, having unbounded effective dimension, in which case this parameter can be estimated at the “slow” rate n^{-1/4}. These types of conditions are rather sharp, matching similar conditions used in Javanmard and Montanari (2015) in the case of inference on a single coefficient in Gaussian sparse linear regression models, and those in Zhu and Bradic (2017a) in the case of testing general linear hypotheses on regression coefficients in linear Gaussian regression models.

Proceeding further, we notice that the difference between our target θ0 and the ideal target θ*0 is:

$$
\left| \theta_0 - \theta^*_0 \right| = \left| \mathrm{E}\, r_\alpha(X)\, r_\gamma(X) \right| \le \left\| r_\alpha \right\|_{P,2} \left\| r_\gamma \right\|_{P,2}.
$$

Hence we obtain the following corollary.

###### Corollary 3 (Inference targets θ∗0 when approximation errors are small)

Suppose that, in addition to R.1–R.4, the product of the approximation errors rα and rγ is small:

$$
\sqrt{n}\, \left| \mathrm{E}\, r_\alpha(X)\, r_\gamma(X) \right| \le \delta_n.
$$

Then the conclusions of Theorem 1 and Corollary 2 hold with θ0 replaced by θ*0 and with δn and Rn replaced by 2δn and 2Rn.

The plausibility of this condition follows from the fact that many rich function classes admit sparse linear approximations with respect to conventional dictionaries b. For instance, Tsybakov (2012) and Belloni et al. (2014b) give examples of Sobolev and rearranged Sobolev balls, respectively, as the function classes, with elements of the Fourier basis as the dictionary, in which sparse approximations have small errors.

###### Remark 8 (Sharpness of Conditions in the Context of ATE)

In the context of vanishing approximation errors, and in the context of Example 3 on average treatment effects, our estimator implicitly estimates the inverse of the propensity score directly, rather than inverting a propensity score estimator as in most of the literature. The approximate residual balancing estimator of Athey et al. (2016) can also be thought of as implicitly estimating the inverse propensity score. An advantage of the estimator here is that its DML form allows us to trade off the rates at which the mean and the inverse propensity score are estimated while maintaining root-n consistency. Also, we do not require that the conditional mean be linear and literally sparse with a small sparsity index; in fact, we can have a completely dense conditional mean function when the approximation to the Riesz representer has effective dimension much smaller than √n. More generally, when the approximation errors do not vanish, our analysis also explicitly allows for misspecification of the regression function.

## 3 Proofs

Proof of Lemma 1. Consider the event on which

$$
\left\| \hat g(t_0) \right\|_\infty \le \lambda_0 + \lambda_1 \quad \text{and} \quad \sup_{t \in T} \left\| \hat g(t) - g(t) \right\|_\infty \le \bar\lambda \qquad (4)
$$

hold. This event holds with probability at least 1 − εn. Indeed, by the choice of λ1 and condition MD, we have with probability at least 1 − εn:

$$
\left\| \hat g(t_0) \right\|_\infty \le \left\| \hat g(t_0) - g(t_0) \right\|_\infty + \left\| g(t_0) \right\|_\infty \le \lambda_1 + \lambda_0.
$$

Hence on this event we have

$$
\left\| \hat t \right\|_1 \le \left\| t_0 \right\|_1, \qquad \left\| \hat g(\hat t) \right\|_\infty \le \lambda_0 + \lambda_1.
$$

This implies, by the triangle inequality, by ‖g(t0)‖∞ ≤ λ0, and by (4), that

$$
\left\| G(\hat t - t_0) \right\|_\infty = \left\| g(\hat t) - g(t_0) \right\|_\infty \le 2\lambda_0 + \lambda_1 + \bar\lambda.
$$