Regularized Orthogonal Machine Learning for Nonlinear Semiparametric Models
Abstract
This paper contributes to the literature on high-dimensional sparse estimation by allowing the loss function to depend on a functional nuisance parameter, which we estimate by modern machine learning tools. For a class of single-index conditional moment restrictions (CMRs), we explicitly derive the loss function. We first adjust the moment function so that the gradient of the future estimator loss is insensitive (formally, Neyman-orthogonal) with respect to the first-stage regularization bias. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single-index. The proposed regularized estimator achieves the oracle convergence rate, where the oracle knows the nuisance parameter and solves only the parametric problem. Our framework nests a novel approach to modeling heterogeneous treatment effects with a binary dependent variable. In addition, we apply our results to conditional moment models with missing data and static games of incomplete information. Finally, we generalize our results to generic extremum estimation with a nuisance component.
1 Introduction and Motivation
Conditional moment restrictions (CMRs) often emerge as natural restrictions summarizing conditional independence, exclusion, or structural assumptions in economic models. In such settings, a major challenge is to allow for a data-driven selection among a (very) large number of covariates. The sparsity assumption, which requires the number of relevant covariates to be small relative to the sample size, has emerged as an interpretable and easily justifiable alternative. If the model is sparse, the regularized estimator converges fast even if the total number of covariates exceeds the sample size, under plausible restricted identifiability conditions (Negahban et al. (2012), Loh (2017)). In this paper, we derive the estimator loss function for a class of CMRs that have the single-index property.
The single-index property requires the target parameter to enter the moment function only via a one-dimensional index of the conditioning vector, formed by a known index function. Many economic problems can be represented as a single-index CMR. For example, the partially linear model (Robinson (1988)) and single-agent dynamic discrete choice (Hotz and Miller (1993)) with a renewal choice property are special cases. The key insight of this paper is how to derive an estimator loss for a single-index CMR. A simple candidate – an indefinite integral of the moment function with respect to the single-index – turns out to be a valid estimator loss. Indeed, the gradient of the proposed estimator loss is the product of the moment function and the index function, and therefore it is a valid unconditional moment equation for the target parameter.
In addition to this insight, this paper contributes to the literature on estimation by allowing the loss function to depend on a functional nuisance parameter, such as the propensity score, the conditional choice probability, the conditional density, or the like, which can be substantially more high-dimensional than the target parameter itself. A natural approach would be to plug a machine learning/regularized estimate of the nuisance parameter into the moment function and integrate it with respect to the single-index. To achieve consistency in a high-dimensional setting, we must employ modern regularized methods whose regularization bias converges slowly. As a result, plugging such estimates into the loss function produces a biased, low-quality estimate.
We propose an estimator that has two steps. We first adjust the moment function so that the gradient of the future loss function is insensitive (i.e., Neyman-orthogonal) with respect to the first-stage regularization bias, preserving the single-index property. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single-index. If the original moment function is monotone in the single-index, we report the global minimum of the regularized estimator loss, following Negahban et al. (2012). Otherwise, we report a local minimum, searching for the estimator in a small neighborhood of a preliminary suboptimal estimator. Under mild conditions, the proposed estimator converges at the oracle rate, where the oracle knows the true value of the nuisance parameter and solves only the parametric problem.
We demonstrate the method’s utility with three applications. We first derive the orthogonal estimator loss function for conditional moment models with missing data, as studied in Carroll et al. (1995), Carroll and Wand (1991), Chen et al. (2008), Lee and Sepanski (1995), and Sepanski and Carroll (1993). We then treat the estimation of agent utilities in static games of incomplete information (see, e.g., Bajari et al. (2010) and Bajari et al. (2013), among others). Finally, we introduce a novel approach to heterogeneous treatment effects with a nonlinear link function, targeting binary dependent variables. In all these settings, we derive an orthogonal estimator loss function and concrete primitive conditions on the nuisance parameters under which the proposed estimator exhibits oracle convergence.
This paper is organized as follows. Section 2 introduces our main examples of interest and gives a nontechnical overview of the results. Section 3 formally states our results. Section 4 derives concrete conditions for our applications. Section 5 generalizes our results to generic extremum estimation with a nuisance component. Section 6 gives Monte Carlo simulations for heterogeneous treatment effects with a nonlinear link function. Appendix A contains proofs. Appendix B verifies the high-level first-stage convergence assumption in high-dimensional sparse models.
1.1 Literature Review
This paper is related to three lines of research: single-index models, high-dimensional sparse models, and orthogonal/debiased inference based on machine learning methods.
The first line of research concerns the estimation of single-index models (Manski (1975), Powell (1984), Manski (1985), Ichimura (1993), Klein and Spady (1993), Hardle et al. (1993)). This work focuses on a low-dimensional target parameter whose dimension can be treated as fixed. Focusing on smooth models, we allow the parameter’s dimension to grow with the sample size and even exceed it. We show that the single-index property helps to prevent overfitting in high-dimensional settings, extending the generalization bounds in the machine learning literature (see, e.g., Shalev-Shwartz and Ben-David (2014)) to single-index CMRs.
The second line of research establishes finite-sample bounds for a high-dimensional sparse parameter (Belloni et al. (2011), Negahban et al. (2012), Loh and Wainwright (2013), van de Geer et al. (2014), Loh (2017), Zhu (2017), Zhu et al. (2019)). Our contribution is to allow the loss function to depend on a functional nuisance parameter. In the convex case, we establish global convergence of the regularized estimator, following Negahban et al. (2012). In the nonconvex case, we establish local convergence, building on Loh and Wainwright (2013) and Loh (2017). Focusing on the double robustness property, Tan (2020b) and Tan (2020a) establish convergence guarantees in high-dimensional models that are potentially misspecified. After we released the working paper version of this article (arXiv:1702.06240), several papers have proposed similar estimation approaches. Focusing on missing data, Chakrabortty et al. (2019) develop an estimator relying on the classical Robins and Rotnitzky (1995) orthogonal score. In contrast, we derive an orthogonal loss function for an arbitrary CMR that has the single-index property. Finally, the follow-up paper by Foster and Syrgkanis (2019) extends our result to estimators with decomposable regularizers, including the ℓ1 penalty as a special case.
The third line of research obtains a consistent and asymptotically normal estimator of a low-dimensional target parameter in the presence of a nonparametric nuisance parameter (Neyman (1959), Neyman (1979), Hardle and Stoker (1989), Bickel et al. (1993), Newey and Stoker (1993), Andrews (1994), Newey (1994), Robins and Rotnitzky (1995), Ai and Chen (2003)). A statistical procedure is called Neyman-orthogonal if it is locally insensitive to the estimation error of the first-stage nuisance parameter. Combining orthogonality and sample splitting, the Double Machine Learning framework of Chernozhukov et al. (2016) and Chernozhukov et al. (2018) derives a root-N consistent, asymptotically normal estimator of the target parameter based on first-stage machine learning estimates. Extending this work, we establish convergence rates for regularized estimators whose loss function gradient is an orthogonal moment. Next, we also contribute to the literature that derives orthogonal moments from non-orthogonal ones. Specifically, we construct a bias correction term for a nuisance parameter that is identified by a general conditional exogeneity restriction, covering conditional expectations (Newey (1994)) and conditional quantiles (Ichimura and Newey (2015)) as leading special cases. Finally, we contribute to the growing literature on orthogonal/doubly robust estimation based on machine learning methods (Sasaki and Ura (2018), Chiang et al. (2019), Sasaki et al. (2020), Chiang (2018)), in particular heterogeneous treatment effects estimation (Nie and Wager (2017), Semenova et al. (2017), Oprescu et al. (2018), Fan et al. (2019), Zimmert and Lechner (2019), Colangelo and Lee (2020), Semenova and Chernozhukov (2018)). In contrast to this work, our main example features a nonlinear link function, where standard bias correction ideas do not work.
To the best of our knowledge, our approach to heterogeneous treatment effects with a nonlinear link function is completely new.
2 Set-Up
2.1 Example: Nonlinear Treatment Effects
Our framework nests a novel approach to modeling heterogeneous treatment effects with a binary dependent variable. Consider a binary response model
E[Y | D, X] = Λ(D · X′θ₀ + f₀(X)),   (2.1)
where D is a one-dimensional base treatment, X is a vector of controls, Y ∈ {0, 1} is a binary outcome, and Z = (D, X, Y) is the data vector. We assume that the treatment and the controls affect the outcome via an additively separable index
D · X′θ₀ + f₀(X)   (2.2)
entering the logistic link function Λ(t) = 1/(1 + e⁻ᵗ). The base treatment enters the index (2.2) through its interactions D · X with the control vector X. In addition, the control vector enters the index (2.2) through an unknown function f₀(X). Our main object of interest is the heterogeneous treatment effects vector θ₀.
We allow the total number p of heterogeneous treatment interactions to exceed the sample size n (i.e., p > n), assuming that the number s of relevant interactions is sufficiently small (i.e., s ≪ n). The researcher does not know which interactions are relevant and thus estimates the p-dimensional parameter θ₀. As a special case, the model (2.1) nests the models considered in van de Geer (2008) and Belloni et al. (2016), both of which assume a constant treatment effect and require the confounding function f₀(X) to be a linear sparse function of the controls:
f₀(X) = X′γ₀.   (2.3)
In contrast to this literature, we allow the effect of the base treatment on the outcome to be heterogeneous and do not impose assumption (2.3).
A standard approach to estimating θ₀ would be to jointly estimate θ₀ and γ₀ under a sparsity requirement on (θ₀, γ₀). Define the logistic loss function as
ℓ(Z; θ, γ) = log(1 + exp(D · X′θ + X′γ)) − Y · (D · X′θ + X′γ)   (2.4)
and an ℓ1-regularized estimator of (θ₀, γ₀) as
(θ̂, γ̂) = argmin over (θ, γ) of (1/n) Σᵢ ℓ(Zᵢ; θ, γ) + λ (‖θ‖₁ + ‖γ‖₁).   (2.5)
We argue that this standard approach should not be used. First, the estimator (2.5) relies on the functional form assumption (2.3), ruling out other structural assumptions on (2.1), such as approximability by trees and neural networks, that could be more economically plausible. Second, even if assumption (2.3) holds, the complexity of the control function γ₀ is likely to be much larger than that of the heterogeneous interactions θ₀. As a result, an admissible penalty parameter λ sufficient to dominate the noise in the problem (2.5) substantially exceeds the oracle penalty, where the oracle knows the control function γ₀ and estimates only θ₀. The regularization bias of γ̂ directly impacts θ̂. Consequently, the estimator θ̂ does not converge at the oracle rate, which we show to be attainable, and is therefore not optimal.
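To make the discussion concrete, the joint (non-orthogonal) approach of (2.4)–(2.5) can be sketched as follows. The simulated design, variable names, and penalty level are illustrative assumptions, not the paper's design:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical simulated data following the paper's notation (D, X, Y, theta);
# the sparsity pattern and the confounding function are illustrative only.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
D = rng.normal(size=n)
theta_true = np.zeros(p)
theta_true[:2] = 1.0                      # sparse treatment-interaction effects
f0 = 0.5 * X[:, 0]                        # confounding function f_0(X)
index = D * (X @ theta_true) + f0
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-index)))

# "Standard" joint approach: l1-penalized logistic regression on the stacked
# design [D*X, X]; a single penalty must dominate the noise in BOTH the
# treatment interactions and the (possibly far more complex) control part.
W = np.hstack([D[:, None] * X, X])
fit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(W, Y)
theta_hat = fit.coef_[0][:p]              # first p coefficients estimate theta
```

The single penalty level illustrates the tension the text describes: tuning it for the complex control part over-shrinks the treatment interactions.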
We propose a novel approach to estimating θ₀, extending the partialling-out idea of Robinson (1988) to nonlinear link functions. We represent (2.1) as a conditional moment restriction whose nuisance parameter is identified separately from θ₀. The new CMR takes the form
E[Y − Λ((D − p₀(X)) · X′θ₀ + q₀(X)) | D, X] = 0,   (2.6)
where p₀(X) is the conditional expectation of the treatment,
p₀(X) = E[D | X],   (2.7)
and q₀(X) is the conditional expectation of the link function’s argument,
q₀(X) = E[D · X′θ₀ + f₀(X) | X] = p₀(X) · X′θ₀ + f₀(X).   (2.8)
In contrast to equation (2.1), the nuisance parameters in equation (2.6) are identified by the separate conditions (2.7)–(2.8), which do not rely on the functional form assumption (2.3). Because p₀ and q₀ involve only conditional expectation functions, they can be estimated by flexible supervised machine learning techniques. If assumption (2.3) holds, q₀ reduces to p₀(X) · X′θ₀ + X′γ₀, and one can estimate it by plugging in the estimate (2.5) and a preliminary estimate of p₀.
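A minimal sketch of the first stage, under the assumption that the nuisances are conditional expectation functions of X as in (2.7)–(2.8). The learner and simulated design are illustrative; any supervised method with a fast enough rate could replace the one shown:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative data: the treatment D is confounded by the controls X.
rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.normal(size=(n, p))
D = X[:, 0] + rng.normal(size=n)

# p_0(X) = E[D | X], estimated here by gradient boosting (placeholder choice).
p_hat_model = GradientBoostingRegressor(max_depth=2, n_estimators=100).fit(X, D)
p_hat = p_hat_model.predict(X)

# The partialled-out treatment D - p_hat(X) feeds into the second stage;
# q_0 would be estimated analogously as a conditional expectation given X.
residual = D - p_hat
```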
We now construct the estimator loss function. We first adjust the moment function (2.6) to make it insensitive (formally, Neyman-orthogonal) with respect to p₀ and q₀. The adjusted CMR takes the form
(2.9) 
where the true value of the weighting function is the conditional variance
(2.10) 
Integrating (2.9) with respect to the single-index, we obtain the weighted logistic loss function
(2.11) 
so that
and the true value of the nuisance parameter is
The gradient of the proposed loss function is orthogonal with respect to perturbations of each component of the nuisance parameter
(2.12)  
The final estimate is based on the weighted logistic loss function (2.11), an ℓ1-regularized estimate, and a moderate choice of the penalty parameter. If the first-stage regularization bias converges sufficiently fast, the oracle penalty parameter is admissible, and the estimate converges at the same rate as the oracle estimator, where the oracle knows the nuisance parameters and solves only a parametric problem.
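The second stage can be sketched as a penalized weighted logistic-type minimization with the first-stage quantities plugged in. The exact loss (2.11) is model-specific; the residual `r`, offset `q_hat`, and weights `a_hat` below are placeholders for first-stage estimates, and the penalty level is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative setup: treat the first-stage quantities as given.
rng = np.random.default_rng(2)
n, p = 400, 10
X = rng.normal(size=(n, p))
r = rng.normal(size=n)                 # stands in for D - p_hat(X)
q_hat = np.zeros(n)                    # stands in for the offset q_hat(X)
a_hat = np.ones(n)                     # stands in for the weights
theta0 = np.zeros(p)
theta0[0] = 1.0
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(r * (X @ theta0) + q_hat))))

def loss(theta, lam=0.05):
    # penalized weighted logistic-type loss; logaddexp avoids overflow
    t = r * (X @ theta) + q_hat
    ll = a_hat * (np.logaddexp(0.0, t) - Y * t)
    return ll.mean() + lam * np.abs(theta).sum()

# For illustration we use L-BFGS-B; a proximal-gradient solver would handle
# the nondifferentiable l1 kink more carefully.
theta_hat = minimize(loss, np.zeros(p), method="L-BFGS-B").x
```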
2.2 General Framework
This paper considers a class of singleindex conditional moment restrictions (CMRs):
(2.13) 
where Z is the data vector, X is the conditioning vector, and θ₀ is the target parameter, which enters (2.13) only via a single-index. Both the index function and the moment function depend on a vector-valued nuisance function. Examples of the nuisance function include the propensity score, the conditional choice probability, the regression function, or a combination of these. Our goal is to construct an estimator loss function for θ₀ whose gradient is an orthogonal moment with respect to perturbations of the nuisance function.
2.3 Examples
In the examples below, we construct the orthogonal loss function for Nonlinear Treatment Effects, General Moment Problems with Missing Data, and Static Games of Incomplete Information.
Example 2.1 (Nonlinear Treatment Effects).
Suppose the treatment effect is identified by the conditional moment restriction (2.1), where the link function is known and monotone but not necessarily logistic. As discussed in the introduction, the CMR (2.1) can be rearranged into the CMR (2.9), which is a special case of (2.13) with the nuisance components defined in (2.7), (2.8), and (2.10). As discussed in Section 1, the orthogonal loss function is an arbitrary solution to the ODE
(2.14) 
Corollary 4.1 establishes the mean square convergence rate for the Regularized Orthogonal Estimator based on the loss function (2.14).
Remark 2.1 (Linear Link).
Consider Example 2.1 with the linear link Λ(t) = t. The weighting function in (2.10) simplifies, and the nuisance parameter reduces to the pair (p₀, q₀). The moment function (2.13) takes the form
The orthogonal loss function is the least squares loss function
The gradient of the composed loss function is a Robinson (1988)-type orthogonal score
The least squares loss function has been previously used in Semenova et al. (2017) and Nie and Wager (2017).
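For the linear-link case, the partialling-out estimator can be sketched on simulated data. The design and learners are illustrative assumptions (sample splitting is omitted for clarity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative model: Y = D * X'theta + f(X) + noise, D confounded by X.
rng = np.random.default_rng(3)
n, p = 2000, 5
X = rng.normal(size=(n, p))
D = X[:, 0] + rng.normal(size=n)
theta_true = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
Y = D * (X @ theta_true) + X[:, 1] + rng.normal(size=n)

# E[D|X] is linear in X here; E[Y|X] contains quadratic terms, so a degree-2
# polynomial basis serves as the (placeholder) outcome learner.
XX = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
D_res = D - LinearRegression().fit(X, D).predict(X)    # D - E[D|X]
Y_res = Y - LinearRegression().fit(XX, Y).predict(XX)  # Y - E[Y|X]

# Regress the outcome residual on the treatment residual interacted with X.
theta_hat = LinearRegression(fit_intercept=False).fit(
    D_res[:, None] * X, Y_res).coef_
```

Because the residual-on-residual moment is orthogonal, first-order errors in the two nuisance regressions do not propagate into `theta_hat`.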
Remark 2.2 (Logistic Link).
Example 2.2 (Missing Data).
Suppose a researcher is interested in the parameter θ₀ identified by the CMR:
(2.15) 
where Y* is a partially observed outcome variable and X is a covariate vector. Let D indicate whether Y* is observed, let Y = D · Y* be the observed outcome, and let Z = (D, X, Y) be the data vector. A standard way to make progress is to assume that D is as good as randomly assigned conditional on X.
Assumption 2.1 (Observed Confounders (OC)).
The presence indicator D is independent of Y* conditional on X.
Define the conditional probability of observing Y* given the covariates, and let the remaining nuisance components be defined analogously. The moment function takes the form
(2.16) 
where the nuisance parameter collects the components defined above. The orthogonal loss function takes the form:
(2.17) 
where the weighting function is an arbitrary solution to the ODE
(2.18) 
Corollary 4.2 establishes the mean-square convergence rate for the Regularized Orthogonal Estimator based on the loss function (2.17).
Remark 2.3 (Quantile Regression with Missing Data).
Example 2.3 (Static Games of Incomplete Information).
Consider a two-player binary-choice static game of incomplete information. Focusing on player 1, suppose that the utility of action 1 is given by
(2.21) 
where X is the covariate vector and the action of player 2 (player 1’s opponent) enters player 1’s utility through a strategic interaction term. The private shock is mean-independent of the covariates and follows a Gumbel distribution. The target vector θ₀ consists of the covariate effect and the strategic interaction effect. The utility of action 0 is normalized to zero. Assuming players’ choices correspond to a Bayes–Nash equilibrium, the equilibrium choice probability can be determined as
Denote the data vector by Z and the conditioning vector by X. Denote the conditional probability that player 2 chooses action 1 by
and let the weighting function be
The moment function takes the form
where the nuisance parameter collects the components defined above. The orthogonal loss function takes the form
(2.22) 
Corollary 4.3 establishes the mean-square convergence rate for the Regularized Orthogonal Estimator based on the loss function (2.22).
2.4 Overview of Main Results.
Given a single-index CMR (2.13), we construct an estimator loss function as a solution of a one-dimensional ordinary differential equation
(2.23) 
Then the gradient of the loss function, evaluated at the true values of the parameters, is a valid moment equation for θ₀:
(2.24) 
As discussed in the introduction, we require equation (2.24) to be an orthogonal moment with respect to perturbations of the nuisance parameter. A sufficient, though not necessary, condition for orthogonality of (2.24) is conditional orthogonality:
(2.25) 
In Theorem 3.3, we derive a conditionally orthogonal moment function, starting from a non-orthogonal one, for a class of nuisance parameters.
We construct the Regularized Orthogonal Estimator, a two-stage estimator of θ₀, as follows. In the first stage, we construct an estimate of the nuisance parameter using a high-quality machine learning estimator capable of dealing with the high-dimensional covariate vector X. In the second stage, we estimate the target parameter as described below. We use different samples for the first and second stages in a form of cross-fitting, defined below.
Definition 2.1 (CrossFitting).

For a random sample of size n, denote a K-fold random partition of the sample indices {1, …, n} by (I_k), k = 1, …, K, where K is the number of folds and each fold has size n/K. For each k, define the complement I_k^c = {1, …, n} ∖ I_k.

For each k, construct an estimator of the nuisance parameter using only the data indexed by I_k^c. For any observation i ∈ I_k, evaluate the nuisance at this off-fold estimate.
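The cross-fitting scheme of Definition 2.1 can be sketched as follows; `fit_nuisance` is a placeholder for any first-stage learner:

```python
import numpy as np

def cross_fit(Z, K, fit_nuisance, rng=None):
    """K-fold cross-fitting: the nuisance estimate used at observation i is
    trained on all folds except i's own."""
    n = len(Z)
    rng = rng or np.random.default_rng(0)
    folds = np.array_split(rng.permutation(n), K)
    eta_hat = [None] * n
    for k, fold in enumerate(folds):
        # train on the complement of fold k
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit_nuisance(Z[train])
        for i in fold:
            eta_hat[i] = model            # used only on its own fold
    return eta_hat

# usage with a trivial "learner" that records the training-sample mean
Z = np.arange(12.0)
models = cross_fit(Z, K=3, fit_nuisance=lambda z: z.mean())
```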
Definition 2.2 (Regularized Orthogonal Estimator).
Given the first-stage estimate, the search set, and the penalty parameter λ, define
(2.26) 
Definition 2.2 is a high-level definition of the Regularized Orthogonal Estimator that depends on the first-stage estimate, the search set, and the penalty parameter. If the orthogonal loss function is convex in θ, we invoke the Regularized Orthogonal Estimator of Definition 2.2 with an unrestricted search set and the cross-fitting estimator of the first-stage parameter. Otherwise, we take the search set to be a small neighborhood of a preliminary suboptimal estimator, constructed on an auxiliary sample (Definition 3.1). In both cases, the penalty parameter is chosen to dominate the sampling noise and the first-stage regularization bias, whose effect is second-order due to orthogonality.
If the first-stage regularization bias converges sufficiently fast, the Regularized Orthogonal Estimator delivers a high-quality estimate of θ₀ that obeys the following properties:
(2.27) 
In particular, the estimator achieves the same rate of convergence as the oracle estimator, where the oracle knows the true value of the nuisance parameter and solves only a parametric problem.
Our results accommodate high-dimensional, highly complex modern machine learning (ML) methods to estimate the nuisance parameter, such as random forests, neural networks, and shrinkage estimators. The only requirement we impose is that the estimator converge to the true nuisance parameter at a fast enough rate. This requirement is satisfied under structural assumptions on the nuisance parameter, such as approximate sparsity with respect to some dictionary, or approximability by trees or by sparse and deep neural nets. In Appendix B, we give sufficient primitive conditions in high-dimensional sparse models. For a broader overview of low-level primitive conditions covering neural networks and random forests, see, e.g., Jeong and Namkoong (2020), Appendix 1.
3 Theoretical results
In this section, we state our main theoretical result. Section 3.1 states the result under a high-level rate assumption on the first-stage nuisance parameter (Assumption 3.2), convexity of the loss function (Assumption 3.5), and orthogonality of its gradient (Assumption 3.6). Appendix B verifies Assumption 3.2 for high-dimensional sparse models. Section 3.2 relaxes Assumption 3.5. Section 3.3 relaxes Assumption 3.6.
We will use the following notation. For two sequences of random variables aₙ and bₙ, aₙ = O_P(bₙ) means that aₙ/bₙ is bounded in probability. For two sequences of numbers aₙ and bₙ, aₙ = O(bₙ) means that aₙ/bₙ is bounded. For a fixed vector θ, its ℓ2 norm is denoted by ‖θ‖₂, its ℓ1 norm by ‖θ‖₁, its ℓ∞ norm by ‖θ‖∞, and its ℓ0 "norm" (the number of nonzero coordinates) by ‖θ‖₀. Let T be the set of coordinates of θ₀ that are not equal to zero, and let T^c be the complement of T. For a vector δ, let δ_T coincide with δ on T and equal zero on T^c, and define δ_{T^c} analogously. Define the restricted cone as
(3.1) 
For a vector-valued function, denote its L₂(P) and L∞ norms as
(3.2) 
Furthermore, assume that the index function is sufficiently smooth with respect to θ, so that the gradient and the Hessian of each of its coordinates are well-defined.
3.1 Main Result
Denote the population covariance matrix of the index function as
(3.3) 
Assumption 3.1 is a standard condition for θ₀ to be identified by the conditional moment restriction (2.13).
Assumption 3.1 (Identification).
There exists a constant κ > 0 such that the minimum eigenvalue of the covariance matrix (3.3) is bounded below by κ.
Assumption 3.2 formalizes the convergence of the nuisance parameter’s estimator. It introduces a sequence of nuisance realization sets that contain the true value and the estimator with high probability. As the sample size increases, the sets shrink. The speed of shrinkage is measured by a rate sequence, referred to as the first-stage rate.
Assumption 3.2 (Firststage rate).
There exist a vanishing sequence of probabilities, a rate sequence, and a sequence of nuisance realization sets such that, with probability approaching one, the realization set contains both the true nuisance parameter and its first-stage estimator. The sets shrink at the rate
where the norm is either the L₂(P) or the L∞ norm defined in equation (3.2).
Assumption 3.2 is satisfied by many machine learning estimators under structural assumptions on the model, in one or both of these norms. For example, it holds for Lasso (Belloni et al. (2011), Belloni and Chernozhukov (2013)) in linear and generalized linear models, for boosting in sparse models (Luo and Spindler (2016)), for neural networks (Chen and White (1999)), and for random forests in low-dimensional (Wager and Walther (2015)) and high-dimensional sparse (Syrgkanis and Zampetakis (2020)) models. Lemmas B.1 and B.2 in Appendix B establish Assumption 3.2 under low-level sparsity conditions.
Assumption 3.3 ensures that the moment function is sufficiently smooth in its second and third arguments. We work with open bounded sets containing the supports of the data vector, the single-index, and the nuisance realizations, and assume that these sets do not change with the sample size n.
Assumption 3.3 (Smooth and bounded design).
The following assumptions hold.

There exists a constant C so that, for any data realization, index value, and nuisance value in their respective domains, the following conditions hold:
Assumptions 3.1 and 3.4 ensure that θ₀ is the unique minimizer of the population estimator loss function. Assumption 3.4 is a sufficient condition for the restricted strong convexity of Negahban et al. (2012).
Assumption 3.4 (Bounded derivative on ).
There exists a constant c > 0 so that the following bound holds:
Assumption 3.5 ensures that the loss function is a convex function of θ for any data realization and any value of the nuisance parameter.
Assumption 3.5 (Monotonicity in single index).
The moment function is increasing in the single-index for any data realization and any value of the nuisance parameter.
Assumption 3.6 introduces an orthogonality condition for the loss function gradient.
Assumption 3.6 (Orthogonality).
The gradient of the loss function is an orthogonal moment equation with respect to perturbations of the nuisance parameter:
Assumption 3.6 simplifies our exposition but is not required for our results. For a class of nuisance parameters, we derive a CMR obeying Assumption 3.6 from an arbitrary non-orthogonal (original) moment restriction in Section 3.3. Define the following set as
(3.4) 
Theorem 3.1 (Regularized Orthogonal Estimator: Convex Case).
Theorem 3.1 is our first main result. It establishes the convergence rate of the Regularized Orthogonal Estimator. In the general case, this rate is the larger of two rates: the sampling error rate and the rate at which the first-stage regularization bias propagates. If the loss function is orthogonal (i.e., Assumption 3.6 holds), the effect of the first-stage regularization bias is second-order rather than first-order. If, in addition, the first-stage bias vanishes fast enough, the Regularized Orthogonal Estimator attains the oracle rate, where the oracle knows the nuisance parameter and solves only a parametric problem. In our applications, we invoke Theorem 3.1 (1) to construct a preliminary estimator of θ₀ based on a non-orthogonal convex loss function. We then invoke Theorem 3.1 (2) to construct the final estimator of θ₀ based on an orthogonal loss function that depends on the preliminary estimate.
Theorem 3.1 shows that the first-stage bias has no effect on the curvature of the convex loss function. By convexity, the minimizer of the regularized loss belongs to the restricted cone in equation (3.1), on which the first-stage bias scales with the sparsity index s rather than the total dimension p and is therefore negligible. This result was previously shown for the linear link function, considered in Example 2.1, in Semenova et al. (2017). Extending this result to a general class of single-index CMRs is the novelty of this work.
3.2 Relaxation of Assumption 3.5
In this section, we drop Assumption 3.5. As a result, the loss function defined in equation (2.23) is no longer convex. The Regularized Orthogonal Estimator is the output of Definition 2.2 with a moderate penalty level and a search set equal to a neighborhood of a slowly converging preliminary estimator. The preliminary estimator itself is the output of Definition 2.2 with an aggressive choice of the penalty and a search set taken to be an ℓ1 ball around zero.
Definition 3.1 (Regularized Orthogonal Estimator with Preliminary Step).

Partition the sample into three subsamples of equal size.

Estimate the nuisance parameter on the first subsample.

Let the search set be
(3.8) and choose the penalty aggressively. Define the preliminary estimator as the output of Definition 2.2 with the nuisance parameter estimated on the first subsample and the loss computed on the second.

Let the search set be
(3.9) where the neighborhood radius is defined in
(3.10) and choose the penalty at a moderate level. Define the final estimator as the output of Definition 2.2 with the nuisance parameter estimated on the first subsample and the loss computed on the third.
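The control flow of Definition 3.1 can be sketched as follows. The loss, the penalty levels, and the neighborhood radius are illustrative placeholders rather than the paper's exact choices, and the nuisance-estimation and sample-splitting steps are omitted:

```python
import numpy as np
from scipy.optimize import minimize

def two_step_estimator(loss_fn, p, lam_prelim=0.5, lam_final=0.05, radius=1.0):
    """Preliminary estimator with an aggressive l1 penalty near zero, then a
    final estimator with a moderate penalty, searched in a neighborhood of
    the preliminary one. Powell is derivative-free, so it tolerates the
    nondifferentiable l1 kink; the l1-ball constraint of (3.8) is only
    approximated here by the aggressive penalty."""
    pen = lambda th, lam: loss_fn(th) + lam * np.abs(th).sum()

    # Step 1: preliminary estimator, aggressive penalty, start at zero.
    prelim = minimize(pen, np.zeros(p), args=(lam_prelim,), method="Powell").x

    # Step 2: final estimator, moderate penalty, box neighborhood of prelim.
    bounds = [(b - radius, b + radius) for b in prelim]
    final = minimize(pen, prelim, args=(lam_final,), method="Powell",
                     bounds=bounds).x
    return prelim, final

# usage on a toy smooth loss with a sparse minimizer
target = np.array([1.0, 0.0, 0.0])
loss = lambda th: float(np.sum((th - target) ** 2))
prelim, final = two_step_estimator(loss, p=3)
```

The aggressive first-step penalty over-shrinks but localizes the search; the moderate second-step penalty then refines the estimate inside the neighborhood.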