Regularized Orthogonal Machine Learning for Nonlinear Semiparametric Models


Abstract

This paper contributes to the literature on high-dimensional sparse M-estimation by allowing the loss function to depend on a functional nuisance parameter, which we estimate by modern machine learning tools. For a class of single-index conditional moment restrictions (CMRs), we explicitly derive the loss function. We first adjust the moment function so that the gradient of the future M-estimator loss is insensitive (formally, Neyman-orthogonal) with respect to the first-stage regularization bias. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single-index. The proposed ℓ1-regularized M-estimator achieves the oracle convergence rate, where the oracle knows the nuisance parameter and solves only the parametric problem. Our framework nests a novel approach to modeling heterogeneous treatment effects with a binary dependent variable. In addition, we apply our results to conditional moment models with missing data and static games of incomplete information. Finally, we generalize our results to generic extremum estimation with a nuisance component.


1 Introduction and Motivation

Conditional moment restrictions (CMRs) often emerge as natural restrictions summarizing conditional independence, exclusion, or structural assumptions in economic models. In such settings, a major challenge is to allow for data-driven selection among a (very) large number of covariates. The sparsity assumption, which requires the number of relevant covariates to be small relative to the sample size, has emerged as an interpretable and easily justifiable alternative. If the model is sparse, the ℓ1-regularized M-estimator converges fast, even when the total number of covariates exceeds the sample size, under plausible restricted identifiability conditions (Negahban et al. (2012), Loh (2017)). In this paper, we derive the M-estimator loss function for a class of CMRs that have the single-index property.

The single-index property requires the target parameter to enter the moment function only via the one-dimensional index , where is a conditioning vector and is a known index function. Many economic problems can be represented as a single-index CMR. For example, the partially linear model (Robinson (1988)) and single-agent dynamic discrete choice (Hotz and Miller (1993)) with a renewal choice property are special cases. The key insight of this paper is how to derive an M-estimator loss for a single-index CMR. A simple candidate – an indefinite integral of the moment function with respect to the single-index – turns out to be a valid M-estimator loss. Indeed, the gradient of the proposed M-estimator loss is the product of the moment function and the index function, and it is therefore a valid unconditional moment equation for the target parameter.

In addition to this insight, this paper contributes to the literature on M-estimation by allowing the loss function to depend on a functional nuisance parameter, such as the propensity score, the conditional choice probability, the conditional density, or the like, that can be substantially higher-dimensional than the target parameter itself. A natural approach would be to plug a machine learning/regularized estimate of the nuisance parameter into the moment function and integrate it with respect to the single-index. However, to achieve consistency in a high-dimensional setting, we must employ modern regularized methods whose regularization bias converges slowly. As a result, plugging such estimates into the loss function produces a biased, low-quality estimate.

We propose an M-estimator that has two steps. We first adjust the moment function so that the gradient of the future loss function is insensitive (i.e., Neyman-orthogonal) with respect to the first-stage regularization bias, preserving the single-index property. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single-index. If the original moment function is monotone in the single-index, we report the global minimum of the ℓ1-regularized M-estimator loss, following Negahban et al. (2012). Otherwise, we report a local minimum, searching for the estimator in a small neighborhood of a preliminary suboptimal estimator. Under mild conditions, the proposed estimator converges at the oracle rate, where the oracle knows the true value of the nuisance parameter and solves only the parametric problem.

We demonstrate the method’s utility with three applications. We first derive the orthogonal M-estimator loss function for conditional moment models with missing data, as studied in Carroll et al. (1995), Carroll and Wand (1991), Chen et al. (2008), Lee and Sepanski (1995), and Sepanski and Carroll (1993). We then consider the estimation of agent utilities in static games of incomplete information (see, e.g., Bajari et al. (2010) and Bajari et al. (2013), among others). Finally, we introduce a novel approach to heterogeneous treatment effects with a nonlinear link function, targeting binary dependent variables. In all these settings, we derive an orthogonal M-estimator loss function and concrete primitive conditions on the nuisance parameters under which the proposed M-estimator exhibits oracle convergence.

This paper is organized as follows. Section 2 introduces our main examples of interest and gives a non-technical overview of the results. Section 3 formally states our results. Section 4 derives concrete conditions for our applications. Section 5 generalizes our results to extremum estimation beyond M-estimation. Section 6 presents Monte Carlo simulations for heterogeneous treatment effects with a nonlinear link function. Appendix A contains the proofs. Appendix B verifies the high-level first-stage convergence assumption in high-dimensional sparse models.

1.1 Literature Review

This paper is related to three lines of research: single-index models, high-dimensional sparse models, and orthogonal/debiased inference based on machine learning methods.

The first line of research concerns the estimation of single-index models (Manski (1975), Powell (1984), Manski (1985), Ichimura (1993), Klein and Spady (1993), Hardle et al. (1993)). This work focuses on a low-dimensional target parameter whose dimension can be treated as fixed. Focusing on smooth models, we allow the parameter’s dimension to grow with the sample size and even exceed it. We show that the single-index property helps to prevent overfitting in high-dimensional settings, extending the generalization bounds in the machine learning literature (see, e.g., Shalev-Shwartz and Ben-David (2014)) to single-index CMRs.

The second line of research establishes finite-sample bounds for a high-dimensional sparse parameter (Belloni et al. (2011), Negahban et al. (2012), Loh and Wainwright (2013), van de Geer et al. (2014), Loh (2017), Zhu (2017), Zhu et al. (2019)). Our contribution is to allow the loss function to depend on a functional nuisance parameter. In the convex case, we establish global convergence of the ℓ1-regularized M-estimator, following Negahban et al. (2012). In the non-convex case, we establish local convergence, building on Loh and Wainwright (2013) and Loh (2017). Focusing on the double robustness property, Tan (2020b) and Tan (2020a) establish convergence guarantees in high-dimensional models that are potentially misspecified. After we released the working paper version of this article (arXiv:1702.06240), many papers have proposed similar M-estimation approaches. Focusing on missing data, Chakrabortty et al. (2019) develop an M-estimator relying on the classical Robins and Rotnitzky (1995) orthogonal score. In contrast, we derive an orthogonal loss function for an arbitrary CMR that has the single-index property. Finally, the follow-up paper by Foster and Syrgkanis (2019) extends our result to M-estimators with decomposable regularizers, including the ℓ1 penalty as a special case.

The third line of research obtains a root-N consistent and asymptotically normal estimator of a low-dimensional target parameter in the presence of a nonparametric nuisance parameter (Neyman (1959), Neyman (1979), Hardle and Stoker (1989), Bickel et al. (1993), Newey and Stoker (1993), Andrews (1994), Newey (1994), Robins and Rotnitzky (1995), Ai and Chen (2003)). A statistical procedure is called Neyman-orthogonal if it is locally insensitive with respect to the estimation error of the first-stage nuisance parameter. Combining orthogonality and sample splitting, the Double Machine Learning framework of Chernozhukov et al. (2016) and Chernozhukov et al. (2018) delivers a root-N consistent, asymptotically normal estimator of the target parameter based on first-stage machine learning estimates. Extending this work, we establish convergence rates for ℓ1-regularized M-estimators whose loss function gradient is an orthogonal moment. We also contribute to the literature that derives orthogonal moments starting from non-orthogonal ones. Specifically, we construct a bias correction term for a nuisance parameter that is identified by a general conditional exogeneity restriction, covering the conditional expectation (Newey (1994)) and the conditional quantile (Ichimura and Newey (2015)) as leading special cases. Finally, we also contribute to the growing literature on orthogonal/doubly robust estimation based on machine learning methods (Sasaki and Ura (2018), Chiang et al. (2019), Sasaki et al. (2020), Chiang (2018)), in particular, heterogeneous treatment effects estimation (Nie and Wager (2017), Semenova et al. (2017), Oprescu et al. (2018), Fan et al. (2019), Zimmert and Lechner (2019), Colangelo and Lee (2020), Semenova and Chernozhukov (2018)). In contrast to this work, our main example features a nonlinear link function, for which standard bias correction ideas do not work. To the best of our knowledge, our approach to heterogeneous treatment effects with a nonlinear link function is new.

2 Set-Up

2.1 Example: Nonlinear Treatment Effects

Our framework nests a novel approach to modeling heterogeneous treatment effects with a binary dependent variable. Consider a binary response model

(2.1)

where is a one-dimensional base treatment, is a vector of controls, is a binary outcome, and is the data vector. We assume that the treatment and the controls affect the outcome via an additively separable index

(2.2)

entering the logistic link function . The base treatment enters the index (2.2) through its interactions with the control vector . In addition, the control vector enters the index (2.2) through an unknown function . Our main object of interest is the heterogeneous treatment effects vector .

We allow the total number of heterogeneous treatment interactions

to exceed the sample size (i.e., ), assuming that the number of the relevant ones

is sufficiently small (i.e., ). The researcher does not know which interactions are relevant and thus estimates the -dimensional parameter . As a special case, the model (2.1) nests the models considered in van de Geer (2008) and Belloni et al. (2016), both of which assume a constant treatment effect () and require the confounding function to be a linear sparse function of the controls

(2.3)

In contrast to this literature, we allow the effect of the base treatment on to be heterogeneous and do not impose assumption (2.3).
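
For concreteness, one way to write the model (2.1)-(2.2) described above is the following sketch, in hypothetical notation that is not fixed by the text: Y denotes the binary outcome, D the base treatment, X the control vector, θ₀ the heterogeneous treatment effects vector, f₀ the unknown confounding function, and Λ the logistic link,

\[
\mathbb{E}\left[ Y \mid D, X \right] \;=\; \Lambda\big( D \cdot X'\theta_0 + f_0(X) \big),
\qquad \Lambda(t) = \frac{1}{1 + e^{-t}},
\]

so that the additively separable index is D·X'θ₀ + f₀(X): the base treatment enters through the interactions D·X, and the controls enter both through those interactions and through f₀.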

A standard approach to estimating would be to jointly estimate and under a sparsity requirement on . Define the logistic loss function as

(2.4)

and an ℓ1-regularized M-estimator of as

(2.5)
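
To fix ideas, here is a minimal sketch of this standard approach under the linear-sparse assumption (2.3): a single ℓ1-penalized logistic regression over the stacked interaction and control features. The variable names (y, d, x), the scikit-learn implementation, and the tuning constant are illustrative assumptions, not the paper’s procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_joint_estimator(y, d, x, C=1.0):
    """Sketch of the naive l1-regularized logistic M-estimator (2.5): jointly
    estimate the heterogeneous effects (coefficients on d*x) and the
    confounding coefficients (on x), under the linear-sparse form (2.3)."""
    interactions = d[:, None] * x              # heterogeneous treatment interactions
    features = np.hstack([interactions, x])    # [d*x, x]
    model = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=10000)
    model.fit(features, y)
    p = x.shape[1]
    theta_hat = model.coef_[0, :p]             # target: heterogeneous effects
    beta_hat = model.coef_[0, p:]              # nuisance: control coefficients
    return theta_hat, beta_hat
```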

We argue that the standard approach should not be used. First, the estimator (2.5) relies on the functional form assumption (2.3), ruling out other structural assumptions on (2.1), such as approximability by trees and neural networks, that could be more economically plausible. Second, even if assumption (2.3) holds, the complexity of the control function is likely to be much larger than that of the heterogeneous interactions

As a result, an admissible penalty parameter sufficient to dominate the noise in the problem (2.5) substantially exceeds the oracle penalty , where the oracle knows the parameter and estimates only . This regularization bias of directly impacts :

As a result, the estimator does not converge at the oracle rate, which we show to be attainable, and, therefore, is not an optimal one.

We propose a novel approach to estimating , extending the partialling-out idea of Robinson (1988) to nonlinear link functions. We represent (2.1) as a conditional moment restriction whose nuisance parameter is identified separately from . The new CMR takes the form

(2.6)

where is the conditional expectation of the treatment

(2.7)

and is the conditional expectation of the link function’s argument

(2.8)

In contrast to equation (2.1), the nuisance parameter in equation (2.6) is identified by separate conditions (2.7)-(2.8) that do not depend on . Because and involve only conditional expectation functions, they can be estimated by flexible supervised machine learning techniques. If assumption (2.3) holds, reduces to

and one can estimate it by plugging in the estimate (2.5) and a preliminary estimate of .
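
Because the nuisance components in (2.7)-(2.8) are ordinary conditional expectations, any supervised learner with a fit/predict interface can be used. A minimal sketch, with hypothetical names h_hat for the estimated conditional mean of the treatment and q_hat for the estimated conditional mean of the link function’s argument (the latter trained on a preliminary index estimate, e.g. from the naive fit (2.5)):

```python
from sklearn.ensemble import RandomForestRegressor

def fit_nuisances(x_train, d_train, index_train, x_eval):
    """Sketch: estimate the conditional expectations (2.7)-(2.8) with random
    forests. `index_train` is a preliminary estimate of the link function's
    argument on the training observations (e.g. from the naive fit (2.5))."""
    h_model = RandomForestRegressor(n_estimators=500).fit(x_train, d_train)
    q_model = RandomForestRegressor(n_estimators=500).fit(x_train, index_train)
    h_hat = h_model.predict(x_eval)   # estimate of E[treatment | controls]
    q_hat = q_model.predict(x_eval)   # estimate of E[index | controls]
    return h_hat, q_hat
```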

We now construct the M-estimator loss function. We first adjust the moment function (2.6) to make it insensitive (formally, Neyman-orthogonal) with respect to and . The adjusted CMR takes the form

(2.9)

where the true value of function is the conditional variance

(2.10)

Integrating (2.9) with respect to the single-index , we obtain the weighted logistic loss function

(2.11)

so that

and the true value of the nuisance parameter is

The gradient of the proposed loss function is orthogonal with respect to perturbations of each component of the nuisance parameter

(2.12)

The final estimate is based on the weighted logistic loss function (2.11), a regularized estimate , and a moderate choice of the penalty parameter. If the first-stage regularization bias converges sufficiently fast, the oracle penalty parameter is admissible, and the estimate converges at the same rate as the oracle estimator, where the oracle knows and solves only a parametric problem.
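
To make the second stage concrete, the sketch below evaluates a weighted logistic objective of the kind described in (2.9)-(2.11): the link function’s argument is written through the residualized treatment and a plug-in offset, and each observation carries a plug-in weight. The composition of the index and the names omega_hat, h_hat, q_hat are illustrative reconstructions; in particular, omega_hat stands in for the conditional-variance-based weight (2.10), whose exact formula is not reproduced here.

```python
import numpy as np

def weighted_logistic_loss(theta, y, d, x, h_hat, q_hat, omega_hat):
    """Schematic second-stage objective in the spirit of (2.11): a weighted
    logistic log-loss evaluated at the re-parameterized index
    (d - h_hat) * (x @ theta) + q_hat, with plug-in nuisance estimates."""
    index = (d - h_hat) * (x @ theta) + q_hat
    # numerically stable logistic log-loss: log(1 + exp(index)) - y * index
    loss_terms = np.logaddexp(0.0, index) - y * index
    return np.mean(omega_hat * loss_terms)
```

The final estimator minimizes an objective of this kind plus an ℓ1 penalty, with the nuisance estimates computed on held-out folds as in Definition 2.1 below.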

2.2 General Framework

This paper considers a class of single-index conditional moment restrictions (CMRs):

(2.13)

where is the data vector, is the conditioning vector, and is the target parameter that enters (2.13) via the single-index . Both the index function and the moment function depend on a vector-valued nuisance function . Examples of the nuisance function

include the propensity score, the conditional choice probability, the regression function, or a combination of these functions. Our goal is to construct an M-estimator loss function for

so that its gradient

is an orthogonal moment with respect to the perturbations of .

2.3 Examples

In the examples below, we construct the orthogonal loss function for Nonlinear Treatment Effects, General Moment Problems with Missing Data, and Static Games of Incomplete Information.

Example 2.1 (Nonlinear Treatment Effects).

Suppose the treatment effect is identified by the conditional moment restriction (2.1), where the link function is known and monotone but not necessarily logistic. As discussed in the introduction, the CMR (2.1) can be re-arranged into the CMR (2.9), which is a special case of (2.13) with , , and defined in (2.7), (2.8), and (2.10). As discussed in Section 1, the orthogonal loss function is an arbitrary solution to the ODE

(2.14)

Corollary 4.1 establishes the mean square convergence rate for the Regularized Orthogonal Estimator based on the loss function (2.14).

Remark 2.1 (Linear Link).

Consider Example 2.1 with . The function in (2.10) simplifies to

As a result, the nuisance parameter simplifies to and . The moment function (2.13) takes the form

The orthogonal loss function is the least squares loss function

The gradient of the composed loss function is a Robinson (1988)-type orthogonal score

The least squares loss function has been previously used in Semenova et al. (2017) and Nie and Wager (2017).
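
In the hypothetical notation used above, writing Ỹ = Y − E[Y | X] and D̃ = D − E[D | X] for the partialled-out outcome and treatment, a standard rendering of the objects in this remark is

\[
\ell(W;\theta) \;=\; \tfrac{1}{2}\,\big(\tilde{Y} - \tilde{D}\, X'\theta\big)^{2},
\qquad
\partial_\theta\, \ell(W;\theta) \;=\; -\big(\tilde{Y} - \tilde{D}\, X'\theta\big)\, \tilde{D}\, X,
\]

which is the residual-on-residual objective of the R-learner of Nie and Wager (2017); the notation here is assumed for illustration only.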

Remark 2.2 (Logistic Link).

Consider Example 2.1 with . The moment function (2.13) takes the form

The orthogonal loss function is a weighted logistic likelihood

The gradient of the composed loss function is

Example 2.2 (Missing Data).

Suppose a researcher is interested in the parameter identified by a CMR:

(2.15)

where is a partially observed outcome variable and is a covariate vector. Let indicate whether is observed, be the observed outcome, and be the data vector. A standard way to make progress is to assume that is as good as randomly assigned conditional on .

Assumption 2.1 (Observed Confounders (OC)).

The presence indicator is independent of conditional on .

Define the conditional probability of observing as

Let be

The moment function takes the form

(2.16)

where , and . The orthogonal loss function takes the form:

(2.17)

where is an arbitrary solution to an ODE

(2.18)

Corollary 4.2 establishes the mean square convergence rate for the Regularized Orthogonal Estimator based on the loss function (2.17).
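
For orientation, and without reproducing the exact single-index form of (2.16)-(2.18), recall the classical Robins and Rotnitzky (1995) correction that underlies constructions of this type. Writing (in illustrative notation) ρ(Y, X; θ) for the complete-data moment in (2.15), S for the presence indicator, and p₀(X) = Pr(S = 1 | X) for the observation probability, the augmented inverse-probability-weighted moment

\[
\frac{S}{p_0(X)}\,\rho(Y, X; \theta)\;-\;\frac{S - p_0(X)}{p_0(X)}\,\mathbb{E}\big[\rho(Y, X; \theta) \mid X\big]
\]

has, under Assumption 2.1, the same conditional mean given X as ρ(Y, X; θ), and it is insensitive, to first order, to estimation errors in p₀ and in the conditional expectation. The orthogonal loss (2.17) is built from a moment of this flavor, adapted to the single-index structure.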

Remark 2.3 (Quantile Regression with Missing Data).

Consider the setting of Example 2.2 with

(2.19)

where is a quantile level. The moment function (2.19) identifies a quantile treatment effect parameter . The orthogonal loss function (2.17) takes the form

(2.20)

Example 2.3 (Static Games of Incomplete Information).

Consider a two-player binary choice static game of incomplete information. Focusing on player , suppose his utility of action is given by

(2.21)

where is the covariate vector, is the action of player (player ’s opponent), and is a mean-independent private shock that follows the Gumbel distribution. The target -vector consists of the covariate effect and the interaction effect . The utility of action is normalized to zero. Assuming that players’ choices correspond to a Bayes-Nash equilibrium, can be determined as

Denote the data vector as and the conditioning vector as . Denote the conditional probability of player 2 action as

and let be

The moment function takes the form

where , , and . The orthogonal loss function takes the form

(2.22)

Corollary 4.3 establishes the mean square convergence rate for the Regularized Orthogonal Estimator based on the loss function (2.22).

2.4 Overview of Main Results

Given a single-index CMR (2.13), we construct an M-estimator loss function as a solution of a one-dimensional ordinary differential equation (ODE)

(2.23)

Then, the gradient of the loss function at and is a valid moment equation for :

(2.24)
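
In schematic form, with m(W, t, ξ) denoting the moment function in (2.13), V(X, ξ) the index function, and t = V(X, ξ)'θ the single index (placeholder symbols used here only for illustration), the construction in (2.23)-(2.24) reads

\[
\ell(W, t, \xi) \;=\; \int^{t} m(W, u, \xi)\, du,
\qquad
\partial_\theta\, \ell\big(W, V(X,\xi)'\theta, \xi\big) \;=\; m\big(W, V(X,\xi)'\theta, \xi\big)\, V(X, \xi),
\]

so the gradient of the loss is the product of the moment function and the index function, and its expectation vanishes at the true parameter by the law of iterated expectations applied to the CMR (2.13).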

As discussed in the introduction, we will require equation (2.24) to be an orthogonal moment with respect to the perturbations of . A sufficient, though not necessary, condition for the orthogonality of (2.24) is conditional orthogonality of :

(2.25)

In Theorem 3.3, we derive a conditionally orthogonal moment function starting from a non-orthogonal one, for a class of nuisance parameters.

We construct the Regularized Orthogonal Estimator , a two-stage estimator of , as follows. In the first stage, we construct an estimate of the nuisance parameter using a high-quality machine learning estimator capable of dealing with the high-dimensional covariate vector . In the second stage, we estimate the target parameter as described below. We use different samples to estimate the nuisance parameter in the first stage and the target parameter in the second stage, in the form of cross-fitting defined below.

Definition 2.1 (Cross-Fitting).

  1. For a random sample of size , denote a -fold random partition of the sample indices by , where is the number of partitions and the sample size of each fold is . For each define .

  2. For each , construct an estimator of the nuisance parameter using only the data . For any observation , define .
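
A minimal sketch of Definition 2.1, using scikit-learn’s fold generator; the estimator passed in can be any regressor with a fit/predict interface, and the function returns out-of-fold nuisance predictions, so that no observation’s own fold is used to form its prediction:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def cross_fit_predict(estimator, x, target, n_folds=5, random_state=0):
    """Sketch of Definition 2.1: for each fold k, fit the nuisance estimator on
    the complement of fold k and predict for the observations in fold k, so that
    every observation receives an out-of-fold nuisance prediction."""
    out_of_fold = np.empty(len(target), dtype=float)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    for train_idx, fold_idx in folds.split(x):
        model = clone(estimator).fit(x[train_idx], target[train_idx])
        out_of_fold[fold_idx] = model.predict(x[fold_idx])
    return out_of_fold
```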

Definition 2.2 (Regularized Orthogonal Estimator).

Given , the search set and the penalty parameter , define

(2.26)

Definition 2.2 is a high-level definition of the Regularized Orthogonal Estimator that depends on the first-stage estimate , the search set , and the penalty parameter . If the orthogonal loss function is convex in , we invoke the Regularized Orthogonal Estimator of Definition 2.2 with the search set and the cross-fitting estimator of the first-stage parameter . Otherwise, we take the search set to be a small neighborhood of a preliminary suboptimal estimator, constructed on an auxiliary sample (Definition 3.1). In both cases, the penalty parameter is chosen to dominate the sampling noise and the first-stage regularization bias, whose effect is second-order due to orthogonality.
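
As an illustration of Definition 2.2 in the convex case, the following sketch minimizes a generic smooth loss plus an ℓ1 penalty by proximal gradient descent (soft-thresholding). The gradient callable would be the gradient of the orthogonal loss evaluated at the cross-fitted nuisance estimates (for instance, the weighted logistic objective sketched in Section 2.1); the step size and iteration count are illustrative choices, and the restriction to a search set, which matters in the non-convex case of Section 3.2, is omitted.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 penalty: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_regularized_m_estimator(grad_loss, dim, lam, step=0.1, n_iter=5000):
    """Sketch of the estimator in Definition 2.2 (convex case) via proximal
    gradient descent: grad_loss(theta) returns the gradient of the sample
    average of the orthogonal loss at the cross-fitted nuisance estimates,
    and lam is the penalty level."""
    theta = np.zeros(dim)
    for _ in range(n_iter):
        theta = soft_threshold(theta - step * grad_loss(theta), step * lam)
    return theta
```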

If the first-stage regularization bias converges sufficiently fast , the Regularized Orthogonal Estimator delivers a high-quality estimate of that obeys the following properties:

(2.27)

In particular, the estimator achieves the same rate of convergence as the oracle estimator, where the oracle knows the true value of the nuisance parameter and solves only a parametric problem.

Our results accommodate high-dimensional/highly complex modern machine learning (ML) methods for estimating the nuisance parameter, such as random forests, neural networks, and ℓ1-shrinkage estimators. The only requirement we impose on the estimation of the nuisance parameter is that it converges to the true value at a fast enough rate . This requirement is satisfied under structural assumptions, such as approximate sparsity of the nuisance parameter with respect to some dictionary, or if it is well approximated by trees or by sparse or deep neural nets. In Appendix B, we give sufficient primitive conditions for this requirement in high-dimensional sparse models. For a broader overview of low-level primitive conditions covering neural networks and random forests, see, e.g., Jeong and Namkoong (2020), Appendix 1.

3 Theoretical Results

In this section, we state our main theoretical result. Section 3.1 states the result under a high-level rate assumption on the first-stage nuisance parameter (Assumption 3.2), convexity of the loss function (Assumption 3.5) and orthogonality of its gradient (Assumption 3.6). Appendix B verifies Assumption 3.2 for high-dimensional sparse models. Section 3.2 relaxes Assumption 3.5. Section 3.3 relaxes Assumption 3.6.

We will use the following notation. For two sequences of random variables means . For two sequences of numbers , means . For a fixed vector, its norm is denoted by , the norm is denoted by , the is denoted by , and is denoted by . Let be the set of coordinates of that are not equal to zero, and let be the complement of . For a vector , let for each and for . Define the restricted cone as

(3.1)

For a vector-valued function , denote its and norms as

(3.2)

Furthermore, assume that the index function is sufficiently smooth with respect to , so that the gradient and the Hessian of each coordinate are well-defined.
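
For reference, the norms and the restricted cone typically take the following standard forms in the literature on regularized M-estimation (e.g., Negahban et al. (2012)); in particular, the constant 3 in the cone is the conventional choice and should be read as an assumption here rather than as part of the paper’s exact definitions:

\[
\|\delta\|_1 = \sum_{j} |\delta_j|, \qquad
\|\delta\|_2 = \Big(\sum_{j} \delta_j^{2}\Big)^{1/2}, \qquad
\|\delta\|_\infty = \max_{j} |\delta_j|, \qquad
\mathcal{C}(S) = \big\{\delta:\ \|\delta_{S^c}\|_1 \le 3\,\|\delta_S\|_1\big\},
\]

and, for a vector-valued function f of the conditioning vector,

\[
\|f\|_{P,2} = \Big(\mathbb{E}\,\|f(X)\|_2^{2}\Big)^{1/2},
\qquad
\|f\|_{P,\infty} = \sup_{x}\, \|f(x)\|_\infty .
\]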

3.1 Main Result

Denote the population covariance matrix of the index function as

(3.3)

Assumption 3.1 is a standard condition for to be identified by the conditional moment restriction (2.13).

Assumption 3.1 (Identification).

There exists so that .

Assumption 3.2 formalizes the convergence of the nuisance parameter’s estimator. It introduces a sequence of nuisance realization sets that contain the true value and the estimator with probability . As the sample size increases, the sets shrink. The shrinkage speed is measured by the rate and is referred to as the first-stage rate.

Assumption 3.2 (First-stage rate).

There exist sequences of numbers and and a sequence of sets so that w.p. at least and . The sets shrink at the rate

where is either or norm as defined in equation (3.2).

Assumption 3.2 is satisfied by many machine learning estimators under structural assumptions on the model, in either the or the norm. For example, it holds for the Lasso in linear and generalized linear models (Belloni et al. (2011), Belloni and Chernozhukov (2013)), for ℓ2-boosting in sparse models (Luo and Spindler (2016)), for neural networks (Chen and White (1999)), and for random forests in low-dimensional (Wager and Walther (2015)) and high-dimensional sparse (Syrgkanis and Zampetakis (2020)) models. Lemmas B.1 and B.2 in Appendix B establish Assumption 3.2 under low-level sparsity conditions.

Assumption 3.3 ensures that the moment function is sufficiently smooth in its second and third arguments. Let be an open bounded set containing the support of the data vector . Likewise, let be an open set containing the support of . Finally, let be an open bounded set containing the support of vector , when . In what follows, we assume that does not change with .

Assumption 3.3 (Smooth and bounded design).

The following assumptions hold.

  1. There exists a constant so that and for any vector , number and vector the following conditions hold:

Assumptions 3.1 and 3.4 ensure that is the unique minimizer of the population M-estimator loss function. Assumption 3.4 is a sufficient condition for the restricted strong convexity condition of Negahban et al. (2012).

Assumption 3.4 (Bounded derivative on ).

There exists a constant so that the following bound holds:

Assumption 3.5 ensures that the loss function is a convex function of for any and .

Assumption 3.5 (Monotonicity in single index).

The moment function is increasing in for any and any .

Assumption 3.6 introduces an orthogonality condition for the loss function gradient.

Assumption 3.6 (Orthogonality).

The gradient of the loss function is an orthogonal moment equation with respect to the perturbations of :

Assumption 3.6 simplifies our exposition, but is not required for our results. For a class of nuisance parameters, we derive a CMR obeying Assumption 3.6 from an arbitrary non-orthogonal (original) moment restriction in Section 3.3. Define the following set as

(3.4)

Theorem 3.1 (Regularized Orthogonal Estimator: Convex Case).

Suppose that the nuisance space is equipped with either the or the norm, defined in equation (3.2). Suppose Assumptions 3.1, 3.2, 3.3, 3.4, and 3.5 hold with and as in equation (3.4). Then, for large enough, the following statements hold.

  1. If Assumption 3.6 does not hold, the Regularized Orthogonal Estimator of Definition 2.2 with the search set and an aggressive penalty satisfying obeys

    (3.5)
  2. Suppose Assumption 3.6 holds. Then, the Regularized Orthogonal Estimator of Definition 2.2 with a moderate penalty satisfying obeys

    (3.6)

    If , the oracle penalty choice is admissible and the Regularized Orthogonal Estimator converges at the oracle rate

    (3.7)

Theorem 3.1 is our first main result. It establishes the convergence rate of the Regularized Orthogonal Estimator. In the general case, this rate is the larger of two rates: the sampling error rate and the effect of the first-stage regularization bias . If, in addition, the loss function is orthogonal (i.e., Assumption 3.6 holds), the effect of the first-stage regularization bias is rather than . Furthermore, if , the Regularized Orthogonal Estimator attains the oracle rate, where the oracle knows the nuisance parameter and solves only a parametric problem. In our applications, we invoke Theorem 3.1 (1) to construct a preliminary estimator of based on a non-orthogonal convex loss function. We then invoke Theorem 3.1 (2) to construct the final estimator of based on an orthogonal loss function that depends on the preliminary estimate .

Theorem 3.1 shows that the first-stage bias has no effect on the curvature of the convex loss function. By convexity, the minimizer of the ℓ1-regularized loss belongs to the restricted cone defined in equation (3.1), where the first-stage bias scales with the sparsity index rather than the total dimension , and is therefore negligible. This result has been previously shown for the linear link function, considered in Example 2.1, in Semenova et al. (2017). Extending it to a general class of single-index CMRs is the novelty of this work.

3.2 Relaxation of Assumption 3.5

In this section, we drop Assumption 3.5. As a result, the loss function defined in equation (2.23) is no longer convex. The Regularized Orthogonal Estimator is the output of Definition 2.2 with a moderate penalty level and a search set equal to a neighborhood of a slowly converging preliminary estimator . The preliminary estimator itself is the output of Definition 2.2 with an aggressive choice of the penalty and a search set taken to be an -ball around zero.

Definition 3.1 (Regularized Orthogonal Estimator with Preliminary Step).

  1. Partition the sample into three samples, each of size .

  2. Estimate the nuisance parameter on the sample .

  3. Let the search set be

    (3.8)

    and the penalty be chosen as . Define as the output of Definition 2.2 with estimated on and the loss computed on .

  4. Let the search set be

    (3.9)

    where

    (3.10)

    and the penalty be chosen as . Define as the output of Definition 2.2 with estimated on and the loss computed on .

Theorem 3.2 (Regularized Orthogonal Estimator: Non-convex Case).

Suppose that the nuisance space is equipped with either the or the norm, defined in equation (3.2). Suppose Assumptions 3.1, 3.2, 3.3, 3.4, 3.6 hold with , and as in (3.8). Then, the Regularized Orthogonal Estimator of Definition 3.1 obeys