
# Improved Methods for Moment Restriction Models with Marginally Incompatible Data Combination and an Application to Two-sample Instrumental Variable Estimation


Heng Shu & Zhiqiang Tan

Abstract. Combining information from multiple samples is often needed in biomedical and economic studies, but the differences between these samples must be appropriately taken into account in the analysis of the combined data. We study estimation for moment restriction models with data combination from two samples under an ignorability-type assumption, while allowing for different marginal distributions of common variables between the two samples. Suppose that an outcome regression model and a propensity score model are specified. By leveraging semiparametric efficiency theory, we derive an augmented inverse probability weighted (AIPW) estimator that is locally efficient and doubly robust with respect to the outcome regression and propensity score models. Furthermore, we develop calibrated regression and likelihood estimators that are not only locally efficient and doubly robust, but also intrinsically efficient in achieving smaller variances than the AIPW estimator when the propensity score model is correctly specified but the outcome regression model may be misspecified. As an important application, we study the two-sample instrumental variable problem and derive the corresponding estimators while allowing for incompatible distributions of common variables between the two samples. Finally, we provide a simulation study and an econometric application on public housing projects to demonstrate the superior performance of our improved estimators.

Key words and phrases. Data combination; Double robustness; Inverse probability weighting; Intrinsic efficiency; Local efficiency; Moment restriction models; Two-sample instrumental variable estimation.

## 1 Introduction

Empirical studies in biomedical and social sciences typically involve drawing inferences about a population. However, there are various situations where information needs to be combined from two or more samples, possibly drawn from populations different from the target (e.g., Ridder & Moffitt 2007). For example, a single sample may not contain all the relevant variables, or some variables in the sample may be measured with errors. Even if all the relevant variables are collected in one sample, the sample size may be too small to achieve accurate estimation.

Suppose that two random samples are obtained: a primary sample from the target population and an auxiliary sample from another population possibly different from the target population. The primary sample provides measurements of the variables (Y, U), and the auxiliary sample contains measurements of the variables (X, U). That is, the variable Y is available only from the primary data, X only from the auxiliary data, and U from both. We distinguish two different settings:

• (I) The parameter of interest can be defined through a set of moment restrictions in (X, U), without involving Y, under the primary population (Chen et al. 2008).

• (II) The parameter can also be defined through moment restrictions that are separable in (Y, U) and in (X, U) under the primary population, as studied in Graham et al. (2016).

Setting (I) is more basic than (II), because the inferential difficulty mainly lies in the lack of primary data on (X, U) jointly. On the other hand, setting (I) can be subsumed under (II) with degenerate restrictions in (Y, U). A special case of such settings is estimation of average treatment effects on the treated (ATT) (Hahn 1998). Identification of the parameter can be achieved provided that the conditional distributions of X given U are the same under the primary and auxiliary populations. The marginal distributions of U may, however, differ between the two populations.

The foregoing setting (I), with only (X, U) involved but not Y, is called the “verify-out-of-sample” case in Chen et al. (2008), because the auxiliary sample is obtained independently of the primary sample, so that no individual units are linked between the two samples. This setting differs from the missing-data and causal inference problems studied in Robins et al. (1994) and Tan (2010a, 2011) among others, called the “verify-in-sample” case in Chen et al. (2008), where the auxiliary sample is a subset of the primary sample by design or by happenstance. A particular example of the latter setting is estimation of average treatment effects in the overall population (ATE) (e.g., Hahn 1998; Imbens 2004). The current setting should also be contrasted with the analysis of linked data, where common units are linked between different samples by probabilistic record linkage (e.g., Lahiri & Larsen 2005).

A large body of work exists on statistical theory and methods for estimation in moment restriction models with auxiliary data in the “verify-out-of-sample” case, in addition to the “verify-in-sample” case. Semiparametric efficiency bounds are studied by Hahn (1998) for ATT estimation, by Chen et al. (2008) for moment restriction models with only (X, U) involved in setting (I), and by Graham et al. (2016) for moment restriction models that are separable in (Y, U) and (X, U) in setting (II). Moreover, asymptotically globally efficient estimators are proposed in these cases by Hahn (1998), Hirano et al. (2003), and Chen et al. (2008) among others, using nonparametric series/sieve estimation of the propensity score (PS) or the outcome regression (OR) function. But the smoothness conditions typically assumed for such methods can be problematic in many practical situations with a high-dimensional vector of common variables U (e.g., Robins & Ritov 1997). Recently, Graham et al. (2016) proposed a locally efficient and doubly robust method with separable moment restrictions, using parametric PS and OR models. But methods achieving local efficiency and double robustness alone may still suffer from large variances due to inverse probability weighting. Such a phenomenon is well known in the “verify-in-sample” case of missing-data problems (Kang & Schafer 2007), and can be seen to motivate various recent methodological developments (e.g., Tan 2010a; Cao et al. 2009).

We develop improved methods for moment restriction models with data combination and make three contributions. First, we derive augmented inverse probability weighted (AIPW) estimators in setting (I), by using efficient influence functions as estimating functions with the true outcome regression function and propensity score replaced by their fitted values. The idea of constructing estimating equations from influence functions (including efficient influence functions) is widely known, at least for missing-data problems in the “verify-in-sample” case (e.g., Tsiatis 2006; Graham 2011). But our application of this idea to the “verify-out-of-sample” case seems new, and reveals subtle properties associated with the fact that the semiparametric efficiency bounds vary under a nonparametric model, a correctly specified propensity score model, or known propensity scores in the “verify-out-of-sample” case, instead of staying the same as in the “verify-in-sample” case (Hahn 1998; Chen et al. 2008).

On one hand, we show that the AIPW estimator based on the efficient influence function calculated under the nonparametric model is locally nonparametric efficient (i.e., achieves the nonparametric variance bound if both the OR and PS models are correctly specified) and doubly robust (i.e., remains consistent if either the OR model or the PS model is correctly specified). This AIPW estimator is simpler and more flexible than the related estimator of Graham et al. (2016), which is shown to be locally efficient and doubly robust only under the restrictions that the PS model is logistic regression, the OR model is linear, and all the regressors of the OR model are included in the linear span of those of the PS model. On the other hand, we find that the AIPW estimator based on the efficient influence function calculated with known propensity scores is locally semiparametric efficient (i.e., achieves the semiparametric variance bound calculated under the PS model used if both the OR and PS models are correctly specified), but generally not doubly robust.

Second, we propose in setting (I) calibrated regression and likelihood estimators which are not only locally efficient and doubly robust, but also intrinsically efficient (i.e., asymptotically more efficient than the corresponding locally efficient and doubly robust AIPW estimator when the PS model is correctly specified but the OR model may be misspecified). Such improved estimators have been obtained in the “verify-in-sample” case of missing-data problems (e.g., Tan 2006, 2010a, 2010b; Cao et al. 2009). But due to the aforementioned difference between the locally nonparametric and semiparametric efficient AIPW estimators, a direct application of existing techniques would not yield an estimator with the desired properties in the “verify-out-of-sample” setting. We introduce a new idea to overcome this difficulty and develop estimators with the desired properties, by working with an augmented propensity score model which includes the fitted outcome regression functions as additional regressors.

Third, our theory and methods from setting (I) can be applied and extended to setting (II). As a concrete application, we study two-sample instrumental variable estimation and derive the improved estimators in setting (II). The two-sample instrumental variable (TSIV) estimator (Angrist & Krueger 1992) is generally consistent only when the marginal distributions of the common variables U are the same in the two samples or, equivalently, the propensity score for selection into the samples is constant in U. The two-sample two-stage least squares (TS2SLS) estimator (Bjorklund & Jantti 1997) is consistent if either the propensity score is constant in U or the linear regression in the first stage is correctly specified. In contrast, our calibrated estimators are doubly robust, i.e., remain consistent if either a general OR model or a general PS model is correctly specified. Moreover, our estimators tend to achieve smaller variances than related doubly robust AIPW estimators when the PS model is correctly specified but the OR model may be misspecified. We present a simulation study and an econometric application on public housing projects to demonstrate the advantage of our estimators compared with existing estimators.

## 2 Moment restriction models with auxiliary data

Throughout this section, consider setting (I) described in the Introduction, where we are interested in the estimation of a parameter vector θ0 through the moment conditions

 E^(1){Φ(X,U;θ0)} = 0, (1)

where E^(1) denotes the expectation under the primary population, Φ(·) is a vector of known functions, and θ0 is a vector of unknown parameters. Suppose that (X, U) are defined on an i.i.d. sample of size n1 from the primary population, but X are missing and only U are observed. For a remedy, suppose that additional data on (X, U) are obtained from an i.i.d. sample of size n0 from an auxiliary population, possibly different from the primary population. To draw valid inference about θ0, we need to combine the U-data from the primary population and the (X, U)-data from the auxiliary population.

For technical convenience, we make the following assumption:

• (A1) The sample sizes (n1, n0) are determined from binomial sampling: the combined set of n = n1 + n0 units are independently drawn from either the primary or the auxiliary population with a fixed probability ρ = P(T = 1). As a result, n1/n converges in probability to the finite constant ρ in (0, 1) as n → ∞.

With some additional work (not pursued here), it is possible to adapt our methods and results to other sampling schemes with non-random (n1, n0). Under Assumption (A1), the combined set of variables are i.i.d. realizations from a mixture of the primary and auxiliary populations, where T is an indicator variable, equal to 1 or 0 if the ith unit is drawn from the primary or the auxiliary population respectively. The combined set of observed data are

 {(Ti,Ui,(1−Ti)Xi):i=1,…,n}.

The moment conditions (1) can be represented as

 E{Φ(X,U;θ0)|T=1}=0, (2)

where E denotes the expectation with respect to the mixture distribution. This setup is exactly the “verify-out-of-sample” case in Chen et al. (2008). Because (X, U) are not jointly observed given T = 1 (primary population), we need to borrow information from (X, U) jointly observed given T = 0 (auxiliary population).

To achieve identification of θ0 by information-borrowing from the auxiliary data, two basic assumptions are needed (e.g., Chen et al. 2008). The first assumption is that the conditional distributions of X given U are the same under the primary and auxiliary populations, that is,

• (A2) T and X are conditionally independent given U.

Assumption (A2) is similar to unconfoundedness for controls in the identification of ATT (e.g., Imbens 2004). The marginal distributions of U are, however, allowed to differ between the primary and auxiliary populations. The second assumption is that the support of the common variables U in the primary population is contained within that in the auxiliary population, that is,

• (A3) π*(U) = P(T = 1 | U) < 1 for all U.

Assumption (A3) allows that π*(u) is 0 for some values u, i.e., subjects with certain U-values will always be in the auxiliary population, because information from those subjects is not needed for inference about θ0 in the primary population. Nevertheless, Assumption (A3) will be strengthened in our asymptotic theory such that π*(U) ≤ 1 − ε for all U, where ε > 0 is a constant. See Condition (C5) and the associated discussion preceding Proposition 1.

### 2.1 Modeling approaches for estimation

There are two types of working/assisting models typically postulated for estimation of θ0 in (2), focusing on either the relationship between X and U or between T and U, similarly to those for estimation in missing-data problems (e.g., Kang & Schafer 2007; Tan 2006, 2010a). The two approaches roughly correspond to conditional expectation projection and inverse probability weighting in Chen et al. (2008).

The first approach is to build a (parametric) regression model for the outcome regression (OR) function E{Φ(X,U;θ) | U}, such that for any value θ,

 E{Φ(X,U;θ) | U} = ψθ(U;α), (3)

where ψθ(·;α) is a vector of known functions and α is a vector of unknown parameters. In general, this model can be derived from a conditional density (not just mean) model of X given U, say p(x|u;α), by the relationship ψθ(u;α) = ∫ Φ(x,u;θ) p(x|u;α) dx. In special cases as discussed in Section 4.2.1, model (3) can be directly induced from a conditional mean model of X given U. Let ^α be an estimator of α from the auxiliary sample (i.e., T = 0), and denote by ^ψθ(U) = ψθ(U;^α) the fitted outcome regression function. Define ^θ_OR as an estimator of θ0 that solves the equation

 ∑_{i=1}^n Ti ^ψθ(Ui) = 0. (4)

If OR model (3) is correctly specified for each possible value θ (not just the true θ0), for example, when a conditional density model p(x|u;α) is correctly specified, then ^θ_OR is a consistent estimator of θ0 under standard regularity conditions. See Conditions (C1), (C4), and (C6) in Supplementary Material.
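As a concrete toy illustration of the OR approach, consider a simulated example of our own (not from the paper): take Φ(X,U;θ) = X − θ, so that θ0 = E(X | T = 1), and fit a linear OR model on the auxiliary sample. The data-generating design and variable names below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
T = rng.binomial(1, 0.5, n)              # T = 1: primary, T = 0: auxiliary
U = rng.normal(1.0 * T, 1.0)             # marginal of U differs between samples
X = 2.0 + U + rng.normal(0.0, 1.0, n)    # X is treated as observed only when T == 0

# Fit the OR model E(X | U) = a0 + a1*U on the auxiliary sample (T = 0),
# which is valid for the primary sample under Assumption (A2).
F = np.column_stack([np.ones(n), U])
alpha = np.linalg.lstsq(F[T == 0], X[T == 0], rcond=None)[0]

# With Phi(X, U; theta) = X - theta, Eq. (4) reduces to averaging the
# fitted OR values over the primary sample.
theta_or = (F @ alpha)[T == 1].mean()
```

Because the OR model is fitted only where X is observed (T = 0) and then averaged over the primary sample, the estimator remains valid even though the marginal distributions of U differ between the two samples.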

The other approach is to build a (parametric) regression model for the propensity score (PS) π*(U) = P(T = 1 | U), such that (Rosenbaum & Rubin 1983)

 P(T=1|U) = π(U;γ) = Π{γᵀf(U)}, (5)

where Π(·) is an inverse link function, f(U) is a vector of known functions including the constant 1, and γ is a vector of unknown parameters. The score function of γ is:

 Sγ(T,U) = {T/π(U;γ) − (1−T)/{1−π(U;γ)}} ∂π(U;γ)/∂γ.

Typically, a logistic regression model is used:

 π(U;γ) = [1 + exp{−γᵀf(U)}]⁻¹.

Denote by ^γ the maximum likelihood estimator (MLE) of γ, which solves ~E{Sγ(T,U)} = 0 and in the case of logistic regression reduces to

 ~E[{T − π(U;γ)}f(U)] = 0, (6)

where ~E denotes the sample average over the merged sample. For convenience, write the fitted propensity score as ^π(U) = π(U;^γ). Similarly to inverse probability weighting (IPW) for the estimation of ATT (e.g., Imbens 2004), an IPW estimator ^θ_IPW for θ0 is defined as a solution to

 ~E[(1−T) ^π(U)/{1−^π(U)} Φ(X,U;θ)] = 0. (7)

If PS model (5) is correctly specified, then ^θ_IPW is consistent under standard regularity conditions. See Conditions (C2), (C4), (C5), and (C7) in Supplementary Material. However, because the fitted propensity score is used for inverse weighting in Eq. (7), ^θ_IPW can be very sensitive to possible misspecification of model (5).
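In the same spirit, here is a toy sketch of the IPW estimator (7) under a simulated design of our own (not the paper's application): a logistic PS model is fitted by Newton-Raphson iterations for the score equation (6), and the auxiliary observations are weighted by the fitted odds ^π(U)/{1−^π(U)}.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
T = rng.binomial(1, 0.5, n)
U = rng.normal(1.0 * T, 1.0)             # true PS is expit(U - 0.5): logistic
X = 2.0 + U + rng.normal(0.0, 1.0, n)

# Fit the logistic PS model pi(U; gamma) = expit(g0 + g1*U) by solving
# the score equation (6) with Newton-Raphson.
F = np.column_stack([np.ones(n), U])
gamma = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-F @ gamma))
    gamma += np.linalg.solve((F * (p * (1 - p))[:, None]).T @ F, F.T @ (T - p))
pihat = 1.0 / (1.0 + np.exp(-F @ gamma))

# Eq. (7) with Phi(X, U; theta) = X - theta: the solution is the
# odds-weighted (self-normalized) mean of X over the auxiliary sample.
odds = pihat[T == 0] / (1.0 - pihat[T == 0])
theta_ipw = np.sum(odds * X[T == 0]) / np.sum(odds)
```

Solving (7) for Φ(X,U;θ) = X − θ gives the odds-weighted mean shown above; the weights transport the auxiliary sample to the primary population.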

### 2.2 AIPW estimators

As discussed in Section 2.1, the consistency of ^θ_OR depends on the correct specification of OR model (3), and the consistency of ^θ_IPW depends on the correct specification of PS model (5). We exploit semiparametric theory to derive locally efficient and doubly robust estimators of θ0 in the form of augmented IPW (AIPW), using both OR model (3) and PS model (5). Understanding these estimators will be important for our development of improved estimation in Section 2.3.

In Supplementary Material, Proposition S1 restates the semiparametric efficiency results from Chen et al. (2008) for estimation of θ0 under (2) in three settings:

• the propensity score is unknown with no parametric restriction;

• the propensity score is assumed to belong to a parametric family π(U;γ);

• the propensity score is known.

The efficient influence functions in the three settings are denoted by τ1, τ2, and τ3, and their variances (i.e., semiparametric efficiency bounds) are denoted by V1, V2, and V3. It holds that V1 ≥ V2 ≥ V3, in general with strict inequalities, which is in contrast with other missing-data problems such as Robins et al. (1994) and the “verify-in-sample” case in Chen et al. (2008), where V1 = V2 = V3.

Two estimators of θ0 can be derived by directly taking the efficient influence functions as estimating functions, with the unknown true functions ψθ(U) and π*(U) replaced by the fitted values ^ψθ(U) and ^π(U). The first estimator, denoted by ^θ_AIPW(1), is based on τ1 and defined as a solution to

 ~E[(1−T) ^π(U)/{1−^π(U)} {Φ(X,U;θ) − ^ψθ(U)} + T ^ψθ(U)] = 0. (8)

The second estimator, denoted by ^θ_AIPW(2), is based on τ2 or equivalently on τ3 and defined as a solution to

 ~E[(1−T) ^π(U)/{1−^π(U)} {Φ(X,U;θ) − ^ψθ(U)} + ^π(U) ^ψθ(U)] = 0. (9)

Proposition 1 shows that both estimators possess local efficiency, but of different types, and that only ^θ_AIPW(1) is doubly robust. For clarity, the semiparametric efficiency bound under the nonparametric PS model is hereafter called the nonparametric efficiency bound. See, for example, Newey (1990), Robins & Rotnitzky (2001), and Tsiatis (2006) for general discussions of local efficiency and double robustness.

We briefly describe regularity conditions for the asymptotic results below. See Appendix II in Supplementary Material for details. To match the Supplementary Material, the numbering of the conditions is not consecutive here.

• (C1) For a constant α*, it holds that ^α converges in probability to α*. If OR model (3) is correctly specified, then ψθ(U;α*) is the true OR function E{Φ(X,U;θ) | U}.

• (C2) For a constant γ*, it holds that ^γ converges in probability to γ*. If PS model (5) is correctly specified, then π(U;γ*) is the true propensity score π*(U).

• (C4) The vector of estimating functions Φ(X,U;θ) satisfies regularity conditions to ensure √n-convergence of the estimator of θ0 that would solve ∑_{i=1}^n Ti Φ(Xi,Ui;θ) = 0 if, hypothetically, (X, U) were jointly observed given T = 1.

• (C5) There exists a constant ε > 0 such that π(U;γ*) ≤ 1 − ε for all U. We assume that π(U;γ*) is bounded away from 1, to avoid “irregular identification” for inverse weighting (Khan & Tamer 2010). Moreover, we assume that π(U;γ*) is nonzero (but possibly close to 0), to simplify technical arguments.

• (C7)–(C8) Partial derivative matrices of the estimating functions in (4)–(9) are uniformly integrable in neighborhoods of α*, γ*, and θ0.

###### Proposition 1

In addition to Assumptions (A1)–(A2), suppose that Conditions (C1), (C2), (C4), (C5), (C7), and (C8) are satisfied, allowing for possible model misspecification (e.g., White 1982). Then the following results hold.

1. The estimator ^θ_AIPW(1) is doubly robust: it remains consistent when either model (3) or model (5) is correctly specified. Moreover, ^θ_AIPW(1) is locally nonparametric efficient: it achieves the nonparametric efficiency bound V1 when both model (3) and model (5) are correctly specified.

2. The estimator ^θ_AIPW(2) is locally semiparametric efficient: it achieves the semiparametric efficiency bound V2 when both model (3) and model (5) are correctly specified. But ^θ_AIPW(2) is, generally, not doubly robust.

For both estimators ^θ_AIPW(1) and ^θ_AIPW(2), the estimating equations (8) and (9) can be expressed in the following AIPW form, with the choice h(U) = ^ψθ(U) or h(U) = ^π(U)^ψθ(U) respectively:

 ~E[(1−T) ^π(U)/{1−^π(U)} Φ(X,U;θ) − {(1−T)/{1−^π(U)} − 1} h(U)] = 0. (10)

Setting h(U) = 0 leads to the IPW estimator ^θ_IPW. By the local semiparametric efficiency in Proposition 1(ii), ^θ_AIPW(2) achieves the minimum asymptotic variance among all regular estimators under PS model (5), including AIPW estimators obtained as solutions to (10) over possible choices of h(U), when both PS model (5) and OR model (3) are correctly specified. However, ^θ_AIPW(2) is not doubly robust, whereas ^θ_AIPW(1) is. This situation should be contrasted with other missing-data problems, such as the “verify-in-sample” case, where the nonparametric and semiparametric efficiency bounds are the same and there exists an AIPW estimator that is locally nonparametric and semiparametric efficient and doubly robust simultaneously (e.g., Robins et al. 1994; Tan 2006, 2010a). These differences present new challenges in our development of improved estimation; see the discussion after Proposition 2.
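Proposition 1(i) can be checked numerically in a toy simulation of our own (an illustrative sketch, not the paper's implementation): an estimator solving Eq. (8) remains consistent under a correct logistic PS model even when the fitted OR function is deliberately misspecified as a constant.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
T = rng.binomial(1, 0.5, n)
U = rng.normal(1.0 * T, 1.0)             # true PS expit(U - 0.5) is logistic
X = 2.0 + U + rng.normal(0.0, 1.0, n)    # target: theta0 = E(X | T=1) = 3

# Correctly specified logistic PS model, fitted by Newton-Raphson.
F = np.column_stack([np.ones(n), U])
gamma = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-F @ gamma))
    gamma += np.linalg.solve((F * (p * (1 - p))[:, None]).T @ F, F.T @ (T - p))
pihat = 1.0 / (1.0 + np.exp(-F @ gamma))

# Deliberately misspecified OR fit: a constant (auxiliary-sample mean of X).
mhat = np.full(n, X[T == 0].mean())

# Eq. (8) with Phi = X - theta and psihat_theta(U) = mhat(U) - theta:
# theta cancels inside the braces, giving a closed-form solution.
odds = pihat / (1.0 - pihat)
theta_aipw = (np.sum(odds[T == 0] * (X[T == 0] - mhat[T == 0]))
              + np.sum(mhat[T == 1])) / np.sum(T)
```

Under a correct PS model, both (8) and (9) remain consistent in this scenario; the distinctive double robustness of (8) shows up instead when the PS model is misspecified but the OR model is correct.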

### 2.3 Improved estimation

We develop improved estimators of θ0 under moment conditions (2) which are not only doubly robust and locally nonparametric efficient, but also intrinsically efficient: as long as PS model (5) is correctly specified, these estimators attain the smallest asymptotic variance among a class of AIPW estimators including ^θ_AIPW(1), but with ^π(U) replaced by the fitted value from an augmented propensity score model as defined later in (11). The new estimators are then similar to ^θ_AIPW(1) in achieving local nonparametric efficiency and double robustness, but often achieve smaller variances than ^θ_AIPW(1) when PS model (5) is correctly specified but the OR model is misspecified.

#### 2.3.1 Calibrated regression estimator

We derive regression estimators for θ0, similar to the regression estimators of ATE in Tan (2006), but with an important new idea as follows. For simplicity, assume that PS model (5) is logistic regression. With additional technical complexity, the approach can be extended to the case where PS model (5) is non-logistic, similarly as discussed in Shu & Tan (2018). Consider an augmented PS model

 P(T=1|U) = π_aug(U; γ, δ, ^α) = expit{γᵀf(U) + δᵀ^ψθ(U)}, (11)

where expit(a) = {1 + exp(−a)}⁻¹ and δ is a vector of unknown coefficients for the additional regressors ^ψθ(U). Let (~γ, ~δ) be the MLE of (γ, δ), depending on θ through ^ψθ(U), and write the fitted value as ~π(U) = π_aug(U; ~γ, ~δ, ^α). This dependency on θ is suppressed for convenience in the notation. A consequence of including the additional regressors ^ψθ(U) is that, by Eq. (6), we have the two equations

 ~E[{T − ~π(U)}f(U)] = 0, (12)

 ~E[{T − ~π(U)}^ψθ(U)] = 0. (13)

For augmented PS model (11), there may be linear dependency in the variables f(U), ^ψθ(U). In this case, the regressors should be redefined to remove redundancy.

We define the regression estimator ~θ_reg as a solution to

 ~E{~τ_reg(θ)} = 0, (14)

with ~τ_reg(θ) = ~τ_init(θ) − ~βᵀ~ξ and ~β = {~E(~ζ~ξᵀ)}⁻¹~E(~ζ~τᵀ_init(θ)), where

 ~τ_init(θ) = (1−T) ~π(U)/{1−~π(U)} Φ(X,U;θ),

 ~ξ = [(1−T)/{1−~π(U)} − 1] ~h(U)/~π(U),  ~ζ = [(1−T)/{1−~π(U)}] ~h(U)/~π(U),

and ~h(U) = {~h1ᵀ(U), ~h2ᵀ(U)}ᵀ is defined as follows,

 ~h1(U) = ~π(U) ^ψθ(U),

 ~h2(U) = ~π(U){1−~π(U)} {fᵀ(U), ^ψθᵀ(U)}ᵀ.

The dependency of ~τ_init(θ), ~ξ, and ~ζ on θ through ^ψθ(U) is suppressed in the notation. To compute ~θ_reg, the equations (12)–(13) and (14) can be solved jointly by alternating Newton-Raphson iterations to update (γ, δ) and θ, as in Tan (2010b). The computation can be simplified in special cases, as discussed in Section 4.2.2.

The variables in ~h(U) are included to achieve different properties. First, f(U), which contains the constant 1, is included through ~h2(U) to ensure efficiency gains over the ratio estimator. Second, ^ψθ(U) is included through ~h1(U) to achieve double robustness and local nonparametric efficiency, as later seen from Eq. (16). Finally, ~h2(U) is included to account for the variation of (~γ, ~δ), to achieve intrinsic efficiency as described in Proposition 2 below. The variables in ~ξ corresponding to ~h2(U) are exactly (up to sign) the scores for the augmented PS model (11). A subvector of ~h2(U) can be removed to reduce the dimension of ~h(U), with little sacrifice or even improvement in finite samples.

We impose the following regularity conditions in addition to those described earlier for Proposition 1. See Appendix II in Supplementary Material for details.

• (C3) For θ in a neighborhood of θ0 and some constants (γ*, δ*), it holds that (~γ, ~δ) converges in probability to (γ*, δ*). If PS model (5) is correctly specified, then δ* = 0 and π_aug(U; γ*, δ*, ^α) = π*(U).

• (C6) There exists a constant ε > 0 such that π_aug(U; γ*, δ*, ^α) ≤ 1 − ε for all U and θ.

• (C9) Partial derivative matrices of the estimating functions in (12)–(14) are uniformly integrable in neighborhoods of (γ*, δ*) and θ0.

Condition (C6) is similar to (C5), whereas (C9) is similar to (C8). In particular, if PS model (5) is correctly specified, then (C6) is equivalent to (C5). If PS model (5) is misspecified, then (C6) requires that the limit propensity score under the augmented PS model (11) be bounded away from 1 for θ in a neighborhood of θ0.

###### Proposition 2

Suppose that Assumptions (A1)–(A2) and Conditions (C1)–(C9) are satisfied, and that PS model (5) is logistic regression. Then the following results hold.

1. ~θ_reg is doubly robust: it remains consistent when either model (3) or model (5) is correctly specified.

2. ~θ_reg is locally nonparametric efficient: it achieves the nonparametric efficiency bound V1 when both model (3) and model (5) are correctly specified.

3. ~θ_reg is intrinsically efficient: if model (5) is correctly specified, then it achieves the lowest asymptotic variance among the class of estimators of θ0 that are solutions to estimating equations of the form

 ~E{~τ_init(θ) − bᵀ~ξ} = 0, (15)

where b is a matrix of constants.

In the following, we provide several remarks to discuss Proposition 2.

Double robustness.  We explain why the use of the augmented propensity score is important for ~θ_reg to achieve double robustness, in addition to the fact that ^ψθ(U) is included in ~h(U). If OR model (3) is correctly specified, then, as shown in the proof of Proposition 2 in Supplementary Material, ~θ_reg is asymptotically equivalent, up to op(n^(−1/2)), to a solution of the equation

 ~E[(1−T) ~π(U)/{1−~π(U)} {Φ(X,U;θ) − ^ψθ(U)} + ~π(U)^ψθ(U)] = 0, (16)

mainly because ^ψθ(U) is included in ~h(U). By the use of the augmented PS model, Eq. (13) holds and hence Eq. (16) is identical to the equation

 ~E[(1−T) ~π(U)/{1−~π(U)} {Φ(X,U;θ) − ^ψθ(U)} + T ^ψθ(U)] = 0, (17)

which has exactly the same form as Eq. (8), but with ^π(U) replaced by the augmented propensity score ~π(U). Let ~θ_AIPW(1) be a solution of Eq. (17). Then ~θ_AIPW(1) is doubly robust, similarly as ^θ_AIPW(1) based on Eq. (8), by Proposition 1. Therefore, ~θ_reg is consistent when OR model (3) is correctly specified even if PS model (5) is misspecified.
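The complementary half of double robustness can also be checked numerically in a toy simulation of our own (an illustrative sketch, not the paper's implementation): with a correctly specified linear OR model, an estimating equation of the AIPW form (8)/(17) remains consistent even under a badly misspecified, intercept-only PS model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
T = rng.binomial(1, 0.5, n)
U = rng.normal(1.0 * T, 1.0)             # true PS depends on U
X = 2.0 + U + rng.normal(0.0, 1.0, n)    # target: theta0 = E(X | T=1) = 3

# Misspecified intercept-only PS model: the fitted value is just mean(T).
pihat = np.full(n, T.mean())
odds = pihat / (1.0 - pihat)

# Correctly specified OR model: linear regression of X on U, auxiliary sample.
F = np.column_stack([np.ones(n), U])
alpha = np.linalg.lstsq(F[T == 0], X[T == 0], rcond=None)[0]
mhat = F @ alpha

# AIPW equation of the form (8): the augmentation term absorbs the
# useless constant weights, so consistency follows from the OR model.
theta_dr = (np.sum(odds[T == 0] * (X[T == 0] - mhat[T == 0]))
            + np.sum(mhat[T == 1])) / np.sum(T)
```

By contrast, replacing the final term ∑ mhat over T = 1 by ∑ pihat·mhat over all units, as in the form (9), would generally lose consistency here, since E{^π(U)^ψθ(U)} ≠ E{T^ψθ(U)} under a misspecified PS model.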

Local efficiency.  The asymptotic equivalence between ~θ_reg and ~θ_AIPW(1) discussed above under OR model (3) also implies that when model (3) is correctly specified, ~θ_reg is locally nonparametric efficient, similarly as ^θ_AIPW(1) and ~θ_AIPW(1). It should be noted that ~θ_reg is generally not locally semiparametric efficient in terms of PS model (5), but is locally semiparametric efficient in terms of PS model (11): it achieves the semiparametric efficiency bound calculated under model (11), not under model (5), when both model (3) and model (5) are correctly specified. In fact, when PS model (5) holds, the efficiency bound under model (11) coincides with the nonparametric efficiency bound V1, because ^ψθ(U) can be shown to be included in the score function of model (11). On the other hand, the estimator defined by Eq. (9) with ^π(U) replaced by ~π(U) throughout would be locally semiparametric efficient with respect to the original PS model (5), but generally not doubly robust, similarly as ^θ_AIPW(2).

Intrinsic efficiency.  The regression coefficient ~β is constructed by the approach of Tan (2006), for ~θ_reg to achieve intrinsic efficiency beyond local nonparametric efficiency and double robustness. In fact, we did not apply ^b_opt = {~E(~ξ~ξᵀ)}⁻¹~E(~ξ~τᵀ_init), the classical estimator of the optimal choice of b in minimizing the asymptotic variance of (15). The estimator that solves the equation ~E{~τ_init(θ) − ^bᵀ_opt~ξ} = 0 is asymptotically equivalent, to the first order, to ~θ_reg when the PS model is correctly specified. But that estimator, unlike ~θ_reg, is generally inconsistent for θ0 when the OR model is correctly specified and the PS model may be misspecified. The particular form of ~β can also be derived through empirical efficiency maximization (Rubin & van der Laan 2008; Tan 2008) and design-optimal regression estimation for survey calibration (Tan 2013). See further discussion related to calibration estimation after Proposition 3.

The advantage of achieving intrinsic efficiency can be seen as follows. Let ~θ_IPW and, as done before, ~θ_AIPW(1) be solutions to Eq. (7) and Eq. (8) respectively, with ^π(U) replaced by ~π(U). Moreover, consider an extension of the auxiliary-to-study tilting (AST) estimator in Graham et al. (2016) under our general setting using the augmented PS model (11), as mentioned in the Introduction: let ~θ_AST be a solution to the corresponding tilted estimating equation, where the tilting parameters are computed from the MLE for model (11) and chosen such that the augmented regressors are balanced between the two samples. Then ~θ_AIPW(1) and ~θ_AST can be shown to be doubly robust and locally nonparametric efficient.

###### Corollary 1

Under the setting of Proposition 2, if PS model (5) is correctly specified, then the estimator ~θ_reg is asymptotically at least as efficient as ~θ_IPW, ~θ_AIPW(1), and ~θ_AST.

The concept of intrinsic efficiency was introduced in related works on missing-data and causal inference problems (Tan 2006, 2010a, 2010b), and is useful for comparing various estimators that are all shown to be doubly robust and locally efficient when both OR and PS models are involved. Roughly speaking, intrinsic efficiency indicates that an estimator achieves the smallest possible asymptotic variance among a class of AIPW-type estimators, such as (15), using the same fitted OR function, as long as the PS model is correctly specified. It is tempting, but remains an open question, to formulate a similar property in terms of a correctly specified OR model and construct estimators with the desired property. See Tan (2007) for a discussion of the different characteristics of PS and OR approaches related to doubly robust (DR) estimation.

#### 2.3.2 Calibrated likelihood estimator

A practical limitation of the regression estimator ~θ_reg is that it may take outlying values due to large inverse weights in the terms ~τ_init(θ) and ~ξ. In this section, we derive a likelihood estimator of θ0 which is doubly robust, locally nonparametric efficient, and intrinsically efficient similarly to ~θ_reg, but tends to be less sensitive to large weights.

We take two steps to derive a likelihood estimator achieving all the desirable properties. First, we derive a locally nonparametric efficient, intrinsically efficient, but non-doubly robust estimator of θ0 by the approach of empirical likelihood with estimating equations (Owen 2001; Qin & Lawless 1994) or, equivalently, the approach of nonparametric likelihood (Tan 2006, 2010a). Specifically, our approach is to maximize the log empirical likelihood ∑_{i=1}^n log pi subject to the constraints

 ∑_{i=1}^n pi = 1,  ∑_{i=1}^n pi ~ξi = 0, (18)

where pi is a nonnegative weight assigned to the ith observation, and ~ξi denotes ~ξ evaluated at (Ti, Ui, (1−Ti)Xi). Let (^p1, …, ^pn) be the weights obtained from the maximization. The empirical likelihood estimator of θ0, ^θ_EL, is defined as a solution to

 ∑_{i=1}^n ^pi [(1−Ti) ~π(Ui)/{1−~π(Ui)} Φ(Xi,Ui;θ)] = 0, (19)

where ~π(U) is the fitted propensity score under model (11), as in Section 2.3.1. In the just-identified setting with θ and Φ of the same dimension, this approach is equivalent to maximizing the empirical likelihood subject to (18) and (19) together. In Supplementary Material, we show that Eq. (19) can also be expressed as

 (1/n) ∑_{i=1}^n [(1−Ti)/{1−ω(Ui;^λ)} ~π(Ui) Φ(Xi,Ui;θ)] = 0, (20)

where ω(U;λ) = ~π(U) − λᵀ~h(U) and ^λ is a maximizer of the function

 ℓ(λ) = ~E[T log ω(U;λ) + (1−T) log{1−ω(U;λ)}],

subject to ω(Ui;λ) > 0 if Ti = 1 and ω(Ui;λ) < 1 if Ti = 0, for i = 1, …, n. Setting the gradient of ℓ(λ) to zero shows that ^λ is a solution to

 ~E[ [{T − ω(U;λ)} / (ω(U;λ){1−ω(U;λ)})] ~h(U) ] = 0. (21)

The estimator ^θ_EL can be shown to be intrinsically efficient among the class of estimators (15) and locally nonparametric efficient, but generally not doubly robust. Next we introduce the following modified likelihood estimator, to achieve double robustness without affecting the first-order asymptotic behavior.

Partition ~h(U) as {~h1ᵀ(U), ~h2ᵀ(U)}ᵀ as before and λ accordingly as (λ1ᵀ, λ2ᵀ)ᵀ. Define ~λ = (~λ1ᵀ, ^λ2ᵀ)ᵀ, where ^λ2 are the corresponding components of ^λ, write ~v(U) = ~h1(U), and let ~λ1 be a maximizer of the function

 κ(λ1) = ~E[(1−T) log[{1−ω(U;λ1,^λ2)} / {1−ω(U;^λ)}] − λ1ᵀ~v(U)],

subject to ω(Ui;λ1,^λ2) < 1 if Ti = 0, for i = 1, …, n. Setting the gradient of κ(λ1) to 0 shows that ~λ1 is a solution to

 ~E[{(1−T)/{1−ω(U;λ1,^λ2)} − 1} ~v(U)] = 0. (22)

The resulting estimator of , , is defined as a solution to

$$\tilde E\left\{\frac{1-T}{1-\omega(U;\tilde\lambda)}\,\tilde\pi(U)\,\Phi(X,U;\theta)\right\} = 0, \tag{23}$$

where $\theta$ is also involved in $\tilde v(U)$ and hence in $\tilde\lambda$, although this dependency is suppressed in the notation. To compute the estimator, the equations (12)–(13), (21)–(22), and (23) can be solved by alternating Newton–Raphson iterations. See Section 4.2.2 for simplification in special cases. The estimator has several desirable properties, as follows.
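The alternating scheme can be illustrated schematically: each block of equations is solved in turn with the other blocks held fixed, and the updates are repeated until a joint fixed point is reached. The toy system below merely stands in for the blocks (12)–(13), (21)–(22), and (23); it is not the paper's equations:

```python
import math

def solve_blockwise(n_iter=100, tol=1e-10):
    """Schematic alternating scheme: repeatedly solve one block of
    equations with the other held fixed until a joint fixed point.
    Toy blocks (NOT the paper's equations):
        block 1:  a - cos(b) = 0   (solved for a given b)
        block 2:  b - a / 2  = 0   (solved for b given a)"""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        a_new = math.cos(b)     # solve block 1 given current b
        b_new = a_new / 2.0     # solve block 2 given updated a
        done = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if done:
            break
    return a, b
```

Because each inner solve is exact here, the iteration is a contraction and converges quickly; in the actual computation, each block is itself solved by Newton–Raphson steps.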

###### Proposition 3

Under the setting of Proposition 2, the likelihood estimator has the following properties.

1. The estimator is doubly robust, as in Proposition 2.

2. If model (5) is correctly specified, then the estimator is asymptotically equivalent, to the first order, to the empirical likelihood estimator defined by (19). Hence it is intrinsically efficient among the class (15) and locally nonparametrically efficient, as in Proposition 2.

The double robustness of the likelihood estimator holds mainly for two reasons. First, Eq. (22) yields a calibration identity for the functions included in $\tilde v(U)$. Second, an analogous identity holds by Eq. (13) for the augmented PS model (11). Combining the two identities can be shown to imply that the estimator remains consistent when OR model (3) is correctly specified even if PS model (5) is misspecified.

The doubly robust regression and likelihood estimators can be regarded as calibrated estimators, with an important connection to calibration estimation using auxiliary information in survey sampling (Deville & Sarndal 1992; Tan 2013). In fact, the estimating equations (14) and (23) can both be expressed as $\sum_{1\le i\le n:\,T_i=0} w_i\,\Phi(X_i,U_i;\theta) = 0$, where the (possibly negative) weights $w_i$ are determined to satisfy the calibration equation

$$\sum_{1\le i\le n:\,T_i=0} w_i\,\tilde v(U_i) = \sum_{i=1}^{n} \tilde v(U_i). \tag{24}$$

For the likelihood estimator, the calibration equation (24) holds by Eq. (22). For the regression estimator, (24) can be verified by direct calculation. However, the weights associated with the regression estimator may be negative, whereas those for the likelihood estimator are always nonnegative by construction. As a result, the likelihood estimator tends to perform better (i.e., is less likely to yield outlying values) than the regression estimator, especially under possible PS model misspecification. It remains an interesting but challenging topic to provide further theoretical analysis of the performance of the two estimators in the presence of model misspecification.
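To illustrate the calibration idea in (24), one standard way to produce nonnegative calibrated weights is exponential tilting (raking), familiar from survey calibration (Deville & Sarndal 1992). This is a generic sketch of weights satisfying (24), not the specific weights of the regression or likelihood estimator:

```python
import numpy as np

def raking_weights(v, T, n_iter=100, tol=1e-10):
    """Calibrate weights w_i = exp(lam' v_i) over units with T_i = 0 so
    that sum_{T_i=0} w_i v(U_i) = sum_i v(U_i), as in (24).
    Exponential tilting keeps every w_i positive by construction.
    Plain Newton iteration; adequate for well-behaved inputs when v
    includes an intercept column."""
    v0 = v[T == 0]                 # v(U_i) for the auxiliary units
    target = v.sum(axis=0)         # right-hand side of (24)
    lam = np.zeros(v.shape[1])
    for _ in range(n_iter):
        w = np.exp(v0 @ lam)
        g = (w[:, None] * v0).sum(axis=0) - target
        if np.linalg.norm(g) < tol:
            break
        # Hessian of the convex dual objective
        jac = (w[:, None, None] * v0[:, :, None] * v0[:, None, :]).sum(axis=0)
        lam = lam - np.linalg.solve(jac, g)
    return np.exp(v0 @ lam)
```

The calibrated weights from the paper's estimators differ in form, but satisfy the same constraint (24); the point of the sketch is that nonnegativity can be built in, as for the likelihood estimator.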

## 3 Data combination

Consider setting (II) described in the Introduction, where another variable $Y$ in addition to $U$ is observed from the primary data. The moment restriction model of interest is postulated in a separable form as

$$E^{(1)}\{\Phi_1(Y,U;\theta) - \Phi_0(X,U;\theta)\} = 0, \tag{25}$$

where $\Phi_1(Y,U;\theta)$ is a vector of known functions of $(Y,U)$ only, whereas $\Phi_0(X,U;\theta)$ is a vector of known functions of $(X,U)$ only. The expectation $E^{(1)}\{\Phi_1(Y,U;\theta)\}$ can be directly estimated as a simple sample average from the primary data. For estimation of $\theta$, the main challenge is then to estimate $E^{(1)}\{\Phi_0(X,U;\theta)\}$ using both the primary and secondary data, which is exactly the problem addressed in Section 2.

As in Section 2, we assume that the sample sizes are determined from binomial sampling. The combined set of observed data is

$$\{(T_i,\, U_i,\, (1-T_i)X_i,\, T_iY_i)\;:\; i=1,\dots,n\}, \tag{26}$$

where $T_i$ is an indicator variable, equal to 1 or 0 according to whether the $i$th unit is in the primary or the auxiliary sample. The moment conditions (25) can be represented by

$$E\{\Phi_1(Y,U;\theta) - \Phi_0(X,U;\theta) \mid T=1\} = 0. \tag{27}$$
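The link between (25) and (27) rests on the binomial-sampling setup, under which the primary-population expectation $E^{(1)}$ amounts to conditioning on $T=1$ in the combined sample: for any integrable function $g$,

```latex
E^{(1)}\{g(Y,X,U)\}
  \;=\; E\{g(Y,X,U)\mid T=1\}
  \;=\; \frac{E\{T\,g(Y,X,U)\}}{E(T)},
```

so (27) can equivalently be written in the unconditional form $E[\,T\{\Phi_1(Y,U;\theta)-\Phi_0(X,U;\theta)\}\,]=0$.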

Various statistical problems can be studied in the above setup of data combination as discussed by Graham et al. (2016) and references therein.

The methods and theory developed in setting (I) for moment restriction models with auxiliary data can be adapted and extended to setting (II). An AIPW estimator of $\theta$ in (27) can be defined as a solution to an equation similar to (9),

$$\tilde E\{T\,\Phi_1(Y,U;\theta)\} - \tilde E\left[\frac{(1-T)\,\hat\pi(U)}{1-\hat\pi(U)}\{\Phi_0(X,U;\theta) - \hat\psi_\theta(U)\} + T\,\hat\psi_\theta(U)\right] = 0, \tag{28}$$

where $\hat\pi(U)$ is the fitted propensity score from model (5) as before, and $\hat\psi_\theta(U)$ is the fitted outcome regression function from model (3), with $\Phi$ replaced by $\Phi_0$. A calibrated regression estimator can be defined as a solution to

$$\tilde E\{T\,\Phi_1(Y,U;\theta)\} - \tilde E\{\tilde\tau_{\rm reg}(\theta)\} = 0, \tag{29}$$

and a calibrated likelihood estimator defined as a solution to

$$\tilde E\{T\,\Phi_1(Y,U;\theta)\} - \tilde E\left\{\frac{1-T}{1-\omega(U;\tilde\lambda)}\,\tilde\pi(U)\,\Phi_0(X,U;\theta)\right\} = 0, \tag{30}$$

where $\tilde\tau_{\rm reg}(\theta)$ is defined as in (14), and $\tilde\lambda$ as in (22), with $\Phi$ replaced by $\Phi_0$ throughout, including in the newly defined augmented PS model (11). As in Section 2, the AIPW, calibrated regression, and calibrated likelihood estimators can be shown to be doubly robust and locally nonparametrically efficient. Moreover, the calibrated regression and likelihood estimators are expected to yield smaller variances than the AIPW estimator and the doubly robust estimator in Graham et al. (2016) when the propensity score model is correctly specified. In general, the calibrated estimators do not achieve intrinsic efficiency or the theoretical guarantee as in Corollary 1, mainly due to the inefficiency of the first term $\tilde E\{T\,\Phi_1(Y,U;\theta)\}$ in (29) and (30), which, however, tends to be of less concern than the variability from the second term. It is possible to construct calibrated estimators of the first term differently to achieve intrinsic efficiency, as shown in Shu & Tan (2018) for ATT estimation. But such estimators are more complex than those above and may not be preferable when $\Phi_1$ and $\Phi_0$ are multi-dimensional, due to finite-sample considerations. See the end of Section 4.2.2 for related discussion.
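To make (28) concrete, consider the simple illustrative moment $\Phi_1(Y,U;\theta)=Y-\theta$ and $\Phi_0(X,U;\theta)=X$, so that $\theta = E(Y-X\mid T=1)$ and (28) admits a closed-form solution. The sketch below is a minimal check with user-supplied fitted values, not the paper's implementation; the variable names are ours:

```python
import numpy as np

def aipw_theta(T, U, X, Y, pi_hat, psi_hat):
    """Closed-form solution of the AIPW equation (28) for the
    illustrative moment Phi1(Y,U;theta) = Y - theta, Phi0(X,U;theta) = X,
    so that theta = E(Y - X | T = 1).  pi_hat and psi_hat are the fitted
    propensity score and outcome regression evaluated at U.  X enters
    only through (1 - T) X, consistent with X being observed in the
    secondary sample only."""
    # second term of (28): AIPW estimate of E{T * Phi0}
    aipw_phi0 = np.mean((1.0 - T) * pi_hat / (1.0 - pi_hat) * (X - psi_hat)
                        + T * psi_hat)
    # mean{T (Y - theta)} - aipw_phi0 = 0  =>  solve for theta
    return (np.mean(T * Y) - aipw_phi0) / np.mean(T)
```

With correctly specified fitted values, the returned estimate is consistent for $E(Y-X\mid T=1)$; misspecifying one of the two fitted values while keeping the other correct leaves it consistent, reflecting double robustness.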

In the next section, we study two-sample instrumental variable estimation (Klevmarken 1982; Angrist & Krueger 1992) as a concrete application, where estimating equations (28)–(30) can be simplified to yield closed-form estimators.

## 4 Two-sample instrumental variable estimation

A typical problem in econometrics involves estimating regression coefficients in a linear regression model with endogeneity,

$$Y = \beta X + \beta_c^{\mathrm T} Z_c + \varepsilon, \tag{31}$$

where $Y$ is a response variable, $X$ is a scalar endogenous variable possibly correlated with the error term $\varepsilon$, and $Z_c$ is a vector of exogenous variables uncorrelated with