# Generalized Method of Moments Estimator Based On Semiparametric Quantile Regression Imputation

Senniang Chen and Cindy Yu
Department of Statistics, Iowa State University, Ames, IA 50011, USA. snchen@iastate.edu, cindyyu@iastate.edu
###### Abstract

In this article, we consider an imputation method to handle missing response values based on semiparametric quantile regression estimation. In the proposed method, the missing response values are generated using the estimated conditional quantile regression function at given values of covariates. We adopt the generalized method of moments for estimation of parameters defined through a general estimation equation. We demonstrate that the proposed estimator, which combines both semiparametric quantile regression imputation and the generalized method of moments, has a competitive edge over some of the most widely used parametric and non-parametric imputation estimators. The consistency and the asymptotic normality of our estimator are established, and variance estimation is provided. Results from a limited simulation study and an empirical study are presented to show the adequacy of the proposed method.

Key Words: generalized method of moments, imputation, semiparametric quantile regression.

## 1 Introduction

Missing data is frequently encountered in many disciplines. Missing data analyses are important because inference based on ignoring missingness undermines efficiency and often leads to biases and misleading conclusions. The large literature handling missing data can basically be grouped into three categories: observed likelihood-based approaches, inverse probability weighting methods, and imputation methods. The main motivation for imputation is to provide a complete data set so that standard analytical techniques can be applied and the resulting point estimates are consistent among different users. Due to its intuitive simplicity, imputation becomes particularly popular among practitioners and is the focus of our paper.

Many different imputation approaches have been developed in the literature and some prominent examples are included here. Rubin’s pioneering work (1987) discussed multiple imputation (MI) based on Bayesian methods to generate pseudo values from the posterior predictive distribution and impute multiple data sets. Despite its simple form, however, the variance estimator of MI will have convergence problems if congeniality and self-sufficiency conditions are not met (Meng 1994). Fractional imputation was proposed to retain both the estimation efficiency of multiple imputation and the consistency of the Rao-Shao variance estimator (Rao and Shao 1992). In fractional imputation, multiple values are imputed for each missing cell with assigned weights. Kim (2011) proposed parametric fractional imputation (PFI), drawing inspiration from importance sampling and calibration weighting to reduce the computational burden. Notably, both PFI and MI assume a parametric regression model, and therefore may suffer from model misspecification. While MI and PFI resort to the creation of artificial responses, hot-deck imputation (HDI) replaces missing units with observed data through matching methods. By using covariate information, the matching method could classify donors and recipients into similar categorical classes (Brick and Kalton 1996; Kim and Fuller 2004), or create metrics to match donors and recipients (Rubin 1986; Little 1988). More examples are documented in Andridge and Little (2010). In a recent work by Wang and Chen (2009), multiple imputed values are independently drawn from observed respondents with probabilities proportional to kernel distances between missing cells and donors. Both HDI and Wang and Chen (2009) are purely non-parametric, so the stability and accuracy of the estimators depend on the dimensionality and the sample size concerned. In fact, finite sample biases are observed in both of these non-parametric methods in our simulation study.
This may be because a donor with a higher probability of being present is more likely to be selected for imputation than a donor with a lower probability of being present, which can result in a distorted conditional density when the covariate is non-uniformly distributed. For more detailed discussions about this issue, see Section 3.

To leverage the advantages of both parametric and non-parametric methods and avoid the limitations of a pure or exclusive approach, we propose an imputation method based on semiparametric quantile regression, which has the following set-up. Define $f(y|x)$ as the conditional density of $Y$ given $X=x$, where $Y$ is the response subject to missingness and $X$ is the covariate that is always observed, and define $q_\tau(x)$ as the $\tau$-th conditional quantile function, which is the inverse of the conditional distribution function $F(y|x)$. Instead of estimating $f(y|x)$ parametrically or non-parametrically, we estimate $q_\tau(x)$ semiparametrically using observed data under the missing at random (MAR) assumption, in the sense intended by Rubin (1976). Then multiple imputed values are obtained via $y^* = \hat q_\tau(x)$, where $\tau$ is independently drawn from Uniform$(0,1)$. The semiparametric quantile regression imputation (hereafter called SQRI) is expected to have appealing features. Firstly, the entire conditional distribution function is used to draw imputed values, hence preserving the conditional density of the filled-in response values. Secondly, because different conditional quantiles instead of conditional means or actual observations are used in imputation, the method is less sensitive to outliers, as quantiles are known to be less affected by extremes. Thirdly, it does not require strong model assumptions as in a fully parametric solution, and is therefore robust against model violations. Lastly, imputed values can be easily created through random numbers generated from Uniform$(0,1)$ once $\hat q_\tau(\cdot)$ is estimated.

In this paper, we are interested in estimating parameters defined through a general estimation equation. After imputation, the data set is regarded as complete and the generalized method of moments (GMM) is used for parameter estimation. However, combining GMM estimation with SQRI (hereafter called SQRI-GMM) has not been studied, to the best of our knowledge. So it is not clear, despite its aforementioned theoretical appeals, whether the proposed method can be advocated as an effective alternative in imputation. There are two main goals in this paper. The first goal is to rigorously establish a large sample theory of our GMM estimator based on SQRI, and the second goal is to evaluate its finite sample performance through numerical simulation. We examine the first goal in Section 2 and investigate the second goal in Section 3 through addressing the following three research questions: (1) Can our SQRI-GMM method significantly reduce biases caused by model misspecification, compared with MI and PFI? Our simulations are designed to cover different kinds of misspecified mean structures, and performances of the estimators are compared. (2) Can our SQRI-GMM method have competitive finite sample performance, compared with some established non-parametric imputation methods? This question is interesting since both hot-deck imputation and Wang and Chen (2009) are also robust against model violations. (3) Can our SQRI-GMM method provide credible inference? The coverage probability of the confidence interval based on our SQRI-GMM estimator is studied in the simulation. Through the analyses of these three important questions, this paper demonstrates some numerical advantages of our SQRI-GMM estimator for imputation.

We are not the first to use quantile regression for imputation. Papers pertaining to quantile regression imputation include Munoz and Rueda (2009), Wei et al (2012) and Yoon (2013). Our paper is distinct from these papers in terms of objective, type of imputation and theory. (i) For objective, while Wei et al (2012) and Yoon (2013) limited their attention to the estimation of quantile regression coefficients, our method can be used for estimating parameters defined through any general estimation equation. Munoz and Rueda (2009) focused on the imputation strategy only, and parameter estimation was not an objective of the paper. It is worth noting that the setting in Wei et al (2012) is also different since they dealt with missing covariates, not missing responses. (ii) For type, Wei et al (2012) imputed multiple data sets, while Munoz and Rueda (2009) proposed a single and deterministic imputation. Our method, by contrast, utilizes fractional imputation. (iii) For theory, instead of assuming a linear quantile regression model, as Wei et al (2012) and Yoon (2013) did, we rely on a flexible semiparametric approach incorporating a penalty for model complexity. The key idea used to arrive at consistency and asymptotic normality in our proofs is also substantially different from that of Wei et al (2012) and Yoon (2013). Because the primary interest of Munoz and Rueda (2009) was the computation strategy, no theory was offered in their paper. Our paper is unique in its contribution to theory building and its emphasis on application in a general framework with less restrictive assumptions.

The rest of the paper is organized as follows. In Section 2, we introduce our imputation method through semiparametric quantile regression with penalty and present large sample theories of our SQRI-GMM estimator. Section 3 compares our method with some competing methods through simulation studies and reports the statistical inference results of our SQRI-GMM estimator. Section 4 analyzes an income data set from the Canadian Census Public Use Tape. The Appendix outlines proofs of the theorems appearing in the main text. Details of the proofs are in the supplemental file Chen and Yu (2014).

## 2 Proposed GMM Estimator Based On Semiparametric Quantile Regression Imputation (SQRI)

In this section, we introduce our GMM estimator based on SQRI. Section 2.1 builds the framework and discusses SQRI using penalized B-splines. Section 2.2 establishes the asymptotic consistency and normality of the unweighted SQRI-GMM estimator. Section 2.3 extends the large sample theory to the weighted SQRI-GMM estimator.

### 2.1 SQRI using penalized B-splines

We consider $(x_i, y_i)$, $i = 1, \cdots, n$, to be a set of i.i.d. observations of random variables $(X, Y)$, where $X$ is a $d$-dimensional variable that is always observed and $Y$ is the response variable subject to missingness. Let $\delta_i = 1$ if $y_i$ is observed and $\delta_i = 0$ if $y_i$ is missing. We assume that $\delta$ and $Y$ are conditionally independent given $X$, i.e.

$$P(\delta = 1 \mid Y = y, X = x) = P(\delta = 1 \mid X = x) := p(x),$$

a condition called “missing at random” by Rubin (1976). The primary interest of this article is to estimate a $d_\theta$-dimensional parameter $\theta_0$, which is the unique solution to $E\{g(Y, X; \theta)\} = 0$, and to make inference on $\theta_0$. Here $g(Y, X; \theta)$ is a vector of estimating functions for $\theta$. Let $q_\tau(x)$ be the unknown conditional $100\tau\%$ quantile of the response $Y$ given $X = x$. It satisfies $P(Y \le q_\tau(x) \mid X = x) = \tau$ for a given $\tau \in (0, 1)$. When $\tau = 0.5$, $q_\tau(x)$ is the conditional median of $Y$ given $X = x$. It is easy to show that $q_\tau(x)$ satisfies

$$q_\tau(x) = \operatorname*{arg\,min}_{h(x)} E\left\{\rho_\tau\left(Y - h(x)\right) \mid X = x\right\},$$

where $\rho_\tau(u) = u\{\tau - 1_{\{u < 0\}}\}$ is the check function proposed in Koenker and Bassett (1978). Many have studied the estimation of $q_\tau(x)$ based on parametric methods, and a summary of the relevant literature can be found in Koenker (2005). Parametric model assumptions may not always hold, giving rise to nonparametric methods. Nonparametric quantile regression, including the kernel quantile regression in Yu and Jones (1994) and the smoothing spline method in Koenker et al (1994), has also been intensively studied. Among many findings is the well-known trade-off between computational cost and smoothness: smoothing spline methods demand massive computation, while the unpenalized spline tends to give wiggly curves despite its cheap computational cost. In this paper, we employ a semiparametric quantile regression method based on penalized B-splines, as suggested in Yoshida (2013), which yields a relatively smooth quantile function at a reduced computational burden.
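As a quick numerical illustration (our own sketch, not code from the paper), minimizing the average check loss over a constant recovers the corresponding sample quantile:

```python
import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

# Minimizing the empirical risk of rho_tau(Y - c) over constants c on a
# grid recovers the tau-th sample quantile (up to grid resolution).
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
grid = np.linspace(-3.0, 3.0, 601)
risk = [check_loss(y - c, 0.9).mean() for c in grid]
c_star = grid[int(np.argmin(risk))]
```

Replacing the constant $c$ with a function $h(x)$ gives the conditional version displayed above.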

To simplify notation, we assume $X$ is a univariate variable with distribution function $F_X$ on $[0, 1]$. We discuss how to deal with a multivariate $X$ in Section 3 and how to rescale an $X$ on any compact set into $[0, 1]$ in Section 4. Let $K_n$ be the number of knots within the range $[0, 1]$, and $p$ be the degree of the B-splines. In order to construct the $p$-th degree B-spline basis, we define equidistantly located knots $\kappa_k = k/K_n$ for $k = -p, -p+1, \cdots, K_n + p$. Note there are $K_n + 1$ knots located in $[0, 1]$. The $p$-th degree B-spline basis is

$$B(x) = \left(B^{[p]}_{-p+1}(x), B^{[p]}_{-p+2}(x), \cdots, B^{[p]}_{K_n}(x)\right)^T,$$

where the functions $B^{[s]}_k(x)$ are defined recursively as

• For $s = 0$:

$$B^{[0]}_k(x) = B_k(x) = \begin{cases} 1, & \kappa_{k-1} \le x < \kappa_k, \\ 0, & \text{otherwise}; \end{cases}$$
• For $s = 1, 2, \cdots, p$:

$$B^{[s]}_k(x) = \frac{x - \kappa_{k-1}}{\kappa_{k+s-1} - \kappa_{k-1}}\, B^{[s-1]}_k(x) + \frac{\kappa_{k+s} - x}{\kappa_{k+s} - \kappa_k}\, B^{[s-1]}_{k+1}(x),$$

where $\kappa_k$, $k = -p, \cdots, K_n + p$, are the knots defined above.
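The recursion can be sketched directly in numpy (a minimal illustration with our own function name; the array entry `knots[i]` plays the role of $\kappa_{k-1}$, so the bookkeeping is shifted by one index):

```python
import numpy as np

def bspline_basis(x, knots, s):
    """Evaluate all degree-s B-spline functions at a scalar x using the
    recursion above; knots[i] corresponds to kappa_{k-1}."""
    if s == 0:
        # degree-0 base case: indicator of [kappa_{k-1}, kappa_k)
        return np.array([1.0 if knots[i] <= x < knots[i + 1] else 0.0
                         for i in range(len(knots) - 1)])
    low = bspline_basis(x, knots, s - 1)
    out = np.zeros(len(knots) - s - 1)
    for i in range(len(out)):
        left = (x - knots[i]) / (knots[i + s] - knots[i])
        right = (knots[i + s + 1] - x) / (knots[i + s + 1] - knots[i + 1])
        out[i] = left * low[i] + right * low[i + 1]
    return out
```

On equidistant knots the basis functions are non-negative and sum to one wherever the full set of overlapping splines is available.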

Readers can refer to de Boor (2001) for more details and properties of the B-spline functions. The estimated conditional quantile regression function is $\hat q_\tau(x) = B^T(x)\hat b(\tau)$, where $\hat b(\tau)$ is a $(K_n + p)$-dimensional vector obtained by

$$\hat b(\tau) = \operatorname*{arg\,min}_{b(\tau)} \sum_{i=1}^n \delta_i\, \rho_\tau\!\left[y_i - B^T(x_i)\, b(\tau)\right] + \frac{\lambda_n}{2}\, b^T(\tau)\, D_m^T D_m\, b(\tau). \qquad (1)$$

Here $\lambda_n$ is the smoothing parameter, and $D_m$ is the $m$-th difference matrix, of dimension $(K_n + p - m) \times (K_n + p)$, with its $(i, j)$-th element defined as

$$d_{ij} = \begin{cases} (-1)^{|i-j|}\dbinom{m}{|i-j|}, & 0 \le j - i \le m, \\ 0, & \text{otherwise}, \end{cases}$$

where $\binom{m}{k} = \frac{m!}{k!(m-k)!}$ is the choose function and $m$ is the order of the penalty. As discussed in Yoshida (2013), the difference penalty is used to remove the computational difficulty occurring when the penalty term is defined through an integral, and it controls the smoothness of the estimated quantile regression function. Section 3 discusses how we choose these quantities in practice.
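For concreteness, the difference matrix and the penalty term in (1) can be formed as follows (a numpy sketch with our own function names; `np.diff` applied to the identity gives rows equal to $d_{ij}$ up to a common factor $(-1)^m$, which leaves $D_m^T D_m$, and hence the penalty, unchanged):

```python
import numpy as np

def difference_matrix(K, m):
    """m-th order difference matrix of shape (K - m, K); its rows match
    the d_ij above up to the sign factor (-1)^m."""
    return np.diff(np.eye(K), n=m, axis=0)

def penalty(b, lam, m):
    """Difference penalty (lambda_n / 2) * b' D_m' D_m b from (1)."""
    D = difference_matrix(len(b), m)
    return 0.5 * lam * b @ (D.T @ D) @ b
```

The penalty vanishes exactly when the coefficient vector follows a polynomial of degree below $m$ in its index, which is how it rewards smooth fits.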

To control the variability of the estimating functions with imputed values, we generate $J$ independent imputed values when $y_i$ is missing by the following procedure.

1. Simulate $\tau_j \sim$ Uniform$(0, 1)$ independently for $j = 1, 2, \cdots, J$;

2. For each $\tau_j$, $\hat b(\tau_j)$ is calculated as

$$\hat b(\tau_j) = \operatorname*{arg\,min}_{b(\tau)} \sum_{i=1}^n \delta_i\, \rho_{\tau_j}\!\left[y_i - B^T(x_i)\, b(\tau)\right] + \frac{\lambda_n}{2}\, b^T(\tau)\, D_m^T D_m\, b(\tau);$$
3. For each missing unit $i$ (i.e., $\delta_i = 0$), $J$ independent values are generated as

$$y^*_{ij} \mid x_i = \hat q_{\tau_j}(x_i) = B^T(x_i)\, \hat b(\tau_j), \quad j = 1, 2, \cdots, J.$$

Repeat step 3 for every missing unit in the data set. Then we use $\delta_i\, g(y_i, x_i; \theta) + (1 - \delta_i)\, J^{-1} \sum_{j=1}^J g(y^*_{ij}, x_i; \theta)$ as the estimating function for the $i$-th observation.
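The three steps can be sketched end to end as follows (our own minimal implementation: for brevity the penalized B-spline fit is replaced by an unpenalized fit, i.e. $\lambda_n = 0$, computed through the standard linear-programming form of quantile regression, and a generic design matrix stands in for $B(x_i)$):

```python
import numpy as np
from scipy.optimize import linprog

def quantile_fit(X, y, tau):
    """Minimize sum_i rho_tau(y_i - X_i b) via the LP: min tau*1'u +
    (1-tau)*1'v subject to X b + u - v = y, with u >= 0, v >= 0."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

def impute(X_obs, y_obs, X_mis, J, rng):
    """Steps 1-3: draw tau_j ~ Uniform(0,1), fit b(tau_j) on the
    respondents, and impute y*_ij = B(x_i)' b(tau_j) per missing unit."""
    taus = rng.uniform(size=J)                                   # step 1
    B = np.stack([quantile_fit(X_obs, y_obs, t) for t in taus])  # step 2
    return X_mis @ B.T                                           # step 3: (n_mis, J)
```

Because the fits for different $\tau_j$ use only the respondents, the imputation step never needs to be repeated during parameter estimation.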

Sometimes the conditional mean of $Y$ given $X$ is used for imputation, as in Cheng (1994) and Wang and Rao (2002), but it does not work for general parameter estimation. For some parametric imputation methods, the imputation and estimation steps are entwined, in that parameters are updated and values re-imputed based on the most recently updated parameters in an iterative fashion. This might require heavy computing time. In the SQRI described above, the imputation and estimation steps are totally separate, making general-purpose parameter estimation possible. Also, in SQRI, standard analytical tools can be directly applied to imputed data without re-imputation. The PFI by Kim (2011) avoids re-imputation by adjusting the weights of imputed values based on iteratively updated parameters. However, any parametric imputation method, including PFI and MI, might suffer from model misspecification. Non-parametric imputation, such as HDI or the method proposed in Wang and Chen (2009) using kernel distance, assumes no parametric model, but the stability and accuracy of non-parametric estimators depend on the sample size and the dimensionality of the problem. The SQRI provides a useful compromise between a fully parametric approach and a purely non-parametric approach.

Assume the number of knots $K_n$ and the smoothing parameter $\lambda_n$ depend on $n$. By Barrow and Smith (1978), there exists $b^*(\tau)$ that satisfies

$$\sup_{x \in (0,1)} \left| q_\tau(x) + b^a_\tau(x) - B^T(x)\, b^*(\tau) \right| = o\!\left(K_n^{-(p+1)}\right), \qquad (2)$$

where $b^a_\tau(x)$ is the spline approximation bias, which depends on $q^{(p+1)}_\tau(x)$, the $(p+1)$-th derivative of $q_\tau(x)$ with respect to $x$, and on the $(p+1)$-th Bernoulli polynomial; the $s$-th Bernoulli polynomial is defined inductively through the $s$-th Bernoulli number (Barrow and Smith (1978) and Yoshida (2013)). The following lemma gives the asymptotic property of $\hat q_\tau(x) = B^T(x)\hat b(\tau)$, where $\hat b(\tau)$ is defined in (1).

Lemma 1: Under condition 1 given in the Appendix, and under suitable rate conditions on $K_n$ and $\lambda_n$ as $n \to \infty$, we have
(i)

$$\sqrt{n/K_n}\left[\hat q_\tau(x) - B^T(x)\, b^*(\tau) + b^\lambda_\tau(x)\right] \to_d N(0, V_\tau), \qquad (3)$$

(ii)

$$\sqrt{n/K_n}\left[\hat q_\tau(x) - q_\tau(x) + b^a_\tau(x) + b^\lambda_\tau(x)\right] \to_d N(0, V_\tau), \qquad (4)$$

for a given $\tau$ and $x$, where

$$\begin{aligned}
b^\lambda_\tau(x) &= \frac{\lambda_n}{n}\, B^T(x)\left(\Phi(\tau) + \frac{\lambda_n}{n} D_m^T D_m\right)^{-1} D_m^T D_m\, b^*(\tau),\\
V_\tau(x) &= \lim_{n\to\infty} \tau(1-\tau)\, K_n\, B^T(x)\left(\Phi(\tau) + \frac{\lambda_n}{n} D_m^T D_m\right)^{-1} \Phi \left(\Phi(\tau) + \frac{\lambda_n}{n} D_m^T D_m\right)^{-1} B(x),\\
\Phi &= \int_0^1 p(x)\, B(x) B^T(x)\, dF_X(x),\\
\Phi(\tau) &= \int_0^1 p(x)\, f_{Y|X}(q_\tau(x))\, B(x) B^T(x)\, dF_X(x). \qquad (5)
\end{aligned}$$

Here $f_{Y|X}$ is the conditional density of $Y$ given $X$. There exist two sources of asymptotic bias in $\hat q_\tau(x)$. One is $b^a_\tau(x)$, which is the model bias between the true function and the spline model used; see equation (2). The other, $b^\lambda_\tau(x)$, is introduced by adding the penalty term into the quantile regression. When there is no penalty term ($\lambda_n = 0$), this bias vanishes. Both bias terms are asymptotically negligible under the rate conditions of Lemma 1. The proof of this lemma draws from Theorem 1 of Yoshida (2013), which deals with complete data. The detailed proofs of these orders and of Lemma 1 can be found in the supplemental file Chen and Yu (2014).

We define our estimating function as

$$G_n(\theta) = \frac{1}{n}\sum_{i=1}^n\left\{\delta_i\, g(y_i, x_i; \theta) + (1 - \delta_i)\, \frac{1}{J}\sum_{j=1}^J g(y^*_{ij}, x_i; \theta)\right\}. \qquad (6)$$

We consider the generalized method of moments (GMM), a standard estimating equation approach, to make inference on $\theta_0$. Our proposal of combining SQRI with GMM is attractive, thanks to GMM's applicability to general parameter estimation and the aforementioned appeals of SQRI.

### 2.2 Unweighted GMM estimator based on SQRI

The unweighted SQRI-GMM estimator is obtained as

$$\hat\theta_n = \operatorname*{arg\,min}_{\theta \in \Theta}\; G_n^T(\theta)\, G_n(\theta). \qquad (7)$$
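A minimal numerical sketch of (6)–(7) follows (our own code; the toy moment function, data, and optimizer choice are illustrative assumptions, not part of the paper):

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, y, x, delta, y_star, g):
    """G_n(theta)'G_n(theta): observed units contribute g(y_i, x_i; theta),
    missing units the average of g over their J imputed values, as in (6)."""
    terms = [g(y[i], x[i], theta) if delta[i] == 1
             else np.mean([g(v, x[i], theta) for v in y_star[i]], axis=0)
             for i in range(len(y))]
    G = np.atleast_1d(np.mean(terms, axis=0))
    return float(G @ G)

# toy moment g(y, x; theta) = y - theta identifies E(Y)
g = lambda yv, xv, th: yv - th[0]
y = np.array([1.0, 2.0, 0.0, 0.0]); x = np.zeros(4)
delta = np.array([1, 1, 0, 0]); y_star = {2: [2.0, 4.0], 3: [3.0, 5.0]}
fit = minimize(gmm_objective, x0=np.array([0.0]),
               args=(y, x, delta, y_star, g), method="Nelder-Mead")
```

With this scalar moment the minimizer is simply the pooled mean of the observed responses and the within-unit averages of the imputed responses.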

We first present Lemma 2, which concerns the asymptotic normality of $\sqrt{n}\, G_n(\theta_0)$.

Lemma 2: Under conditions 1 and 2(a)(b) given in the Appendix, and under suitable rate conditions on $K_n$ and $\lambda_n$, as $n \to \infty$ and $J \to \infty$ we have

$$\sqrt{n}\, G_n(\theta_0) \to_d N\!\left(0, V_G(\theta_0)\right), \qquad (8)$$

where

$$V_G(\theta) = \mathrm{Var}\left(\xi_i(\theta)\right), \qquad (9)$$
$$\xi_i(\theta) = g(y_i, x_i; \theta) + (1 - \delta_i)\left[\mu_{g|x}(x_i; \theta) - g(y_i, x_i; \theta)\right] + \delta_i\, C_p\, h_n(y_i, x_i; \theta)\, B(x_i), \qquad (10)$$
$$h_n(y_i, x_i; \theta) = \int_{-\infty}^{+\infty}\!\!\int_0^1 \dot g_y(q_\tau(x), x; \theta)\, B^T(x)\, H_n^{-1}(\tau)\, \psi_\tau(e_i(\tau))\, d\tau\, dF_X(x), \qquad (11)$$
$$\mu_{g|x}(x; \theta) = E\{g(Y, x; \theta) \mid X = x\}, \qquad H_n(\tau) = \Phi(\tau) + \frac{\lambda_n}{n} D_m^T D_m,$$
$$\dot g_y(y, x; \theta) = \frac{\partial g(y, x; \theta)}{\partial y}, \qquad \psi_\tau(u) = \tau - 1_{\{u < 0\}},$$
$$e_i(\tau) = y_i - B^T(x_i)\, b^*(\tau), \qquad\text{and}\qquad C_p = E\{1 - p(X)\}.$$

Justification of Lemma 2 is crucial to show the consistency and asymptotic normality of our SQRI-GMM estimator (Pakes and Pollard 1989). We decompose $\sqrt{n}\, G_n(\theta_0)$ into three terms

$$\sqrt{n}\, G_n(\theta_0) = \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n g(y_i, x_i; \theta_0)}_{:=B_1} + \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n (1 - \delta_i)\left[\mu_{g|x}(x_i; \theta_0) - g(y_i, x_i; \theta_0)\right]}_{:=B_2} + \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n (1 - \delta_i)\left[\hat\mu_{g|x}(x_i; \theta_0) - \mu_{g|x}(x_i; \theta_0)\right]}_{:=B_3}, \qquad (12)$$

where $\mu_{g|x}(x; \theta) = E\{g(Y, x; \theta) \mid X = x\}$ and $\hat\mu_{g|x}(x_i; \theta) = J^{-1}\sum_{j=1}^J g(y^*_{ij}, x_i; \theta)$. The terms $B_1$ and $B_2$ are simple since they are sums of i.i.d. random variables. The term $B_3$ is much more complicated because it involves additional randomness from the uniformly distributed random variables $\tau_j$, and it also depends on the estimated coefficients $\hat b(\tau_j)$ calculated using all respondents. Therefore the summands in $B_3$ are not independent. The key idea in the proof is to replace $\hat\mu_{g|x}(x_i; \theta_0)$ in $B_3$ by its limiting counterpart obtained by integrating over $\tau \in (0, 1)$, and to show the following two results: (1) the replacement error is $o_p(1)$, and (2) the replaced term is asymptotically equivalent to $n^{-1/2}\sum_{i=1}^n \delta_i\, C_p\, h_n(y_i, x_i; \theta_0)\, B(x_i)$. Combining these two results with equation (12) gives the asymptotic normality in Lemma 2.

Remark 1: When there is no missingness, $\xi_i(\theta)$ in equation (10) coincides with $g(y_i, x_i; \theta)$.

Theorem 1: Under conditions 1 and 2 given in the Appendix, and under suitable rate conditions on $K_n$ and $\lambda_n$, as $n \to \infty$ and $J \to \infty$ we have

(i)

$$\hat\theta_n \to_p \theta_0;$$

(ii)

$$\sqrt{n}\,\Sigma^{-1/2}(\theta_0)\left(\hat\theta_n - \theta_0\right) \to_d N(0, I_{d_\theta \times d_\theta}),$$

where

$$\Sigma(\theta) = \{\Gamma^T(\theta)\Gamma(\theta)\}^{-1}\,\Gamma^T(\theta)\, V_G(\theta)\, \Gamma(\theta)\,\{\Gamma^T(\theta)\Gamma(\theta)\}^{-1}, \qquad (13)$$
$$\text{and}\quad \Gamma(\theta) = E\left\{\frac{\partial g(Y, X; \theta)}{\partial \theta}\right\}. \qquad (14)$$

Theorem 1 shows that $\hat\theta_n$ is consistent and asymptotically normal. With Lemma 2 and the fulfillment of two additional conditions requiring uniform convergence of $G_n(\theta)$ over shrinking neighborhoods of $\theta_0$, Theorem 1 can be proved following Corollary 3.2 and Theorem 3.3 of Pakes and Pollard (1989). Here $\|\cdot\|$ denotes a matrix norm.

To consider variance estimation for $\hat\theta_n$, let an estimator of $\xi_i(\theta)$ be

$$\hat\xi_i(\theta) = g(y_i, x_i; \theta) + (1 - \delta_i)\left\{\hat\mu_{g|x}(x_i; \theta) - g(y_i, x_i; \theta)\right\} + \delta_i\, \hat C_p\, \hat h_n(y_i, x_i; \theta)\, B(x_i), \qquad (15)$$

where

$$\hat h_n(y_i, x_i; \theta) = \frac{1}{n}\frac{1}{J}\sum_{k=1}^n\sum_{j=1}^J \dot g_y(\hat q_{\tau_j}(x_k), x_k; \theta)\, B^T(x_k)\, \hat H_n^{-1}(\tau_j)\, \psi_{\tau_j}(\hat e_i(\tau_j)),$$
$$\hat e_i(\tau_j) = y_i - B^T(x_i)\hat b(\tau_j), \qquad \hat H_n(\tau_j) = \hat\Phi(\tau_j) + \frac{\lambda_n}{n} D_m^T D_m,$$
$$\hat\Phi(\tau_j) = \frac{1}{n}\sum_{i=1}^n \delta_i\, \hat f_{Y|X}(x_i, \hat q_{\tau_j}(x_i))\, B(x_i) B^T(x_i) \quad\text{with}\quad \hat q_{\tau_j}(x_i) = B^T(x_i)\hat b(\tau_j),$$
$$\hat f_{Y|X}(x, y) = \frac{\frac{1}{nab}\sum_{i=1}^n \delta_i\, K\!\left(\frac{y - y_i}{a}\right) K\!\left(\frac{x - x_i}{b}\right)}{\frac{1}{nb}\sum_{i=1}^n \delta_i\, K\!\left(\frac{x - x_i}{b}\right)}, \quad\text{and}\quad \hat C_p = n^{-1}\sum_{i=1}^n (1 - \delta_i).$$

Here the estimation of $\hat f_{Y|X}$ uses a Normal kernel $K$ with bandwidth $a$ (or $b$) for $y$ (or $x$). The estimator of $\Gamma(\theta)$ is obtained by

$$\hat\Gamma(\hat\theta_n) = \frac{1}{n}\sum_{i=1}^n\left\{\delta_i\, \frac{\partial g(y_i, x_i; \hat\theta_n)}{\partial \theta} + (1 - \delta_i)\, \frac{1}{J}\sum_{j=1}^J \frac{\partial g(y^*_{ij}, x_i; \hat\theta_n)}{\partial \theta}\right\}.$$

Then, the variance estimator of $\hat\theta_n$ is $n^{-1}\hat\Sigma(\hat\theta_n)$, where $\hat\Sigma(\hat\theta_n)$ is calculated as

$$\hat\Sigma(\hat\theta_n) = \{\hat\Gamma^T(\hat\theta_n)\hat\Gamma(\hat\theta_n)\}^{-1}\,\hat\Gamma^T(\hat\theta_n)\, \hat V_G(\hat\theta_n)\, \hat\Gamma(\hat\theta_n)\,\{\hat\Gamma^T(\hat\theta_n)\hat\Gamma(\hat\theta_n)\}^{-1}, \quad\text{and} \qquad (16)$$
$$\hat V_G(\theta) = \frac{1}{n-1}\sum_{i=1}^n\left\{\hat\xi_i(\theta) - \frac{1}{n}\sum_{k=1}^n \hat\xi_k(\theta)\right\}\left\{\hat\xi_i(\theta) - \frac{1}{n}\sum_{k=1}^n \hat\xi_k(\theta)\right\}^T. \qquad (17)$$

Corollary 1: Under conditions 1–3 given in the Appendix, and under suitable rate conditions on $K_n$ and $\lambda_n$, as $n \to \infty$ and $J \to \infty$ we have

$$\sqrt{n}\,\hat\Sigma^{-1/2}(\hat\theta_n)\left(\hat\theta_n - \theta_0\right) \to_d N(0, I_{d_\theta \times d_\theta}).$$

Corollary 1 allows us to construct confidence intervals based on the asymptotic normality and the variance estimator.
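Given estimates of $\Gamma$ and $V_G$, the sandwich formula (16) and a normal-theory interval from Corollary 1 take only a few lines (a sketch with our own function names; the 95% normal quantile is hard-coded):

```python
import numpy as np

def sandwich_variance(Gamma_hat, V_hat, n):
    """Variance estimator of theta_hat: (1/n) * (G'G)^{-1} G'V G (G'G)^{-1},
    following (16); Gamma_hat is d_g x d_theta, V_hat is d_g x d_g."""
    bread = np.linalg.inv(Gamma_hat.T @ Gamma_hat)
    Sigma = bread @ Gamma_hat.T @ V_hat @ Gamma_hat @ bread
    return Sigma / n

def confidence_interval(theta_hat, Sigma_over_n):
    """95% normal-theory interval based on the asymptotic normality."""
    z = 1.959963984540054
    se = np.sqrt(np.diag(Sigma_over_n))
    return theta_hat - z * se, theta_hat + z * se
```

In the just-identified case ($d_g = d_\theta$) the sandwich collapses to $\Gamma^{-1} V_G \Gamma^{-T}$.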

### 2.3 Weighted GMM estimator based on SQRI

A weighted GMM estimator is calculated by minimizing $G_n^T(\theta)\, W\, G_n(\theta)$ for a positive definite weight matrix $W$. It can be shown that taking $W = V_G^{-1}(\theta_0)$ will result in the most efficient estimator in the class of all asymptotically normal estimators using arbitrary weight matrices. In practice, $V_G^{-1}(\theta_0)$ is replaced by the inverse of the random matrix $\hat V_G(\theta)$ defined in (17), and the weighted GMM estimator is obtained as

$$\hat\theta^w_n = \operatorname*{arg\,min}_{\theta \in \Theta}\; G_n^T(\theta)\, \hat V_G^{-1}(\theta)\, G_n(\theta). \qquad (18)$$
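In practice (18) is often computed in two steps; a sketch with our own helper names, where $\hat V_G(\cdot)$ enters as a user-supplied function rather than through formula (17):

```python
import numpy as np
from scipy.optimize import minimize

def weighted_gmm(G, V_hat, theta0):
    """Two-step GMM: step 1 uses the identity weight as in (7); step 2
    plugs W = V_hat(theta_step1)^{-1} into (18) and re-minimizes."""
    step1 = minimize(lambda th: G(th) @ G(th), theta0, method="Nelder-Mead")
    W = np.linalg.inv(V_hat(step1.x))
    step2 = minimize(lambda th: G(th) @ W @ G(th), step1.x,
                     method="Nelder-Mead")
    return step2.x
```

The weight matrix only rescales the moments, so in over-identified problems it changes the estimator's efficiency without changing the population solution.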

The following lemma proves that $\hat V_G^{-1}(\theta)$ is close to the fixed non-singular matrix $V_G^{-1}(\theta_0)$ uniformly over a sequence of shrinking neighborhoods of $\theta_0$, an important condition for $\hat\theta^w_n$ to be consistent and asymptotically normal.

Lemma 3: Under conditions 1–3 given in the Appendix, and under suitable rate conditions on $K_n$ and $\lambda_n$, as $n \to \infty$ and $J \to \infty$ we have

$$\sup_{|\theta - \theta_0| < \zeta_n}\left|\hat V_G^{-1}(\theta) - V_G^{-1}(\theta_0)\right| = o_p(1),$$

for a sequence of positive numbers $\{\zeta_n\}$ that converges to zero.

The following theorem presents the large sample properties of the weighted GMM estimator $\hat\theta^w_n$.

Theorem 2: Under conditions 1–3 given in the Appendix, and under suitable rate conditions on $K_n$ and $\lambda_n$, as $n \to \infty$ and $J \to \infty$ we have

(i)

$$\hat\theta^w_n \to_p \theta_0;$$

(ii)

$$\sqrt{n}\,\Sigma_w^{-1/2}(\theta_0)\left(\hat\theta^w_n - \theta_0\right) \to_d N(0, I_{d_\theta \times d_\theta}),$$

where $\Sigma_w(\theta_0) = \left\{\Gamma^T(\theta_0)\, V_G^{-1}(\theta_0)\, \Gamma(\theta_0)\right\}^{-1}$.

When Lemma 3 holds, the results in Theorem 2 follow immediately from Lemmas 3.4 and 3.5 of Pakes and Pollard (1989).

Remark 2: The asymptotic variance of the most efficient GMM estimator based on the complete data is $\{\Gamma^T(\theta_0)\,\mathrm{Var}^{-1}\{g(y, x; \theta_0)\}\,\Gamma(\theta_0)\}^{-1}$. It can be shown that $V_G(\theta_0)$ in equation (9) can also be expressed as

$$\begin{aligned}
V_G(\theta_0) = {}& \mathrm{Var}\{g(y, x; \theta_0)\} - E\{(1 - p(x))\,\sigma^2_{g|x}(x; \theta_0)\}\\
&+ C_p^2\, E\{\delta_i\, h_n(y_i, x_i; \theta_0)\, B(x_i) B^T(x_i)\, h_n^T(y_i, x_i; \theta_0)\}\\
&+ 2 C_p\, E\{\delta_i\, h_n(y_i, x_i; \theta_0)\, B(x_i)\, g^T(y_i, x_i; \theta_0)\}, \qquad (19)
\end{aligned}$$

where $\sigma^2_{g|x}(x; \theta_0) = \mathrm{Var}\{g(Y, x; \theta_0) \mid X = x\}$. So when the missingness is low, i.e. $p(x)$ is large and $C_p$ is close to zero, the efficiency of $\hat\theta^w_n$ is close to the asymptotic efficiency of the best GMM estimator under no missingness.

Remark 3: The semiparametric efficiency bound defined in Chen, Hong and Tarozzi (2008) is

$$\Sigma_{speb}(\theta_0) = \left[\Gamma^T(\theta_0)\, E^{-1}\!\left\{\sigma^2_{g|x}(x; \theta_0)/p(x) + \mu_{g|x}(x; \theta_0)\,\mu^T_{g|x}(x; \theta_0)\right\}\Gamma(\theta_0)\right]^{-1}.$$

Rewrite $V_G(\theta_0)$ as

$$\begin{aligned}
V_G(\theta_0) = {}& E\{p(x)\,\sigma^2_{g|x}(x; \theta_0)\} + V\{\mu_{g|x}(x; \theta_0)\}\\
&+ C_p^2\, E\{\delta_i\, h_n(y_i, x_i; \theta_0)\, B(x_i) B^T(x_i)\, h_n^T(y_i, x_i; \theta_0)\}\\
&+ 2 C_p\, E\{\delta_i\, h_n(y_i, x_i; \theta_0)\, B(x_i)\, g^T(y_i, x_i; \theta_0)\}.
\end{aligned}$$

Our estimator will achieve the semiparametric efficiency bound if the following holds:

$$E\left\{\left(\frac{1}{p(x)} - p(x)\right)\sigma^2_{g|x}(x; \theta_0)\right\} \ge C_p^2\, E\left\{\delta_i\, h_n(y_i, x_i; \theta_0)\, B(x_i) B^T(x_i)\, h_n^T(y_i, x_i; \theta_0)\right\}