Abstract
The theory of statistical inference, together with the strategy of divide-and-conquer for large-scale data analysis, has recently attracted considerable interest due to the great popularity of the MapReduce programming paradigm in the Apache Hadoop software framework. The central analytic task in the development of statistical inference in the MapReduce paradigm pertains to the method of combining results yielded from separately mapped data batches. One seminal solution based on the confidence distribution has recently been established in the literature for the setting of maximum likelihood estimation. This paper concerns a more general inferential methodology based on estimating functions, termed the Rao-type confidence distribution, of which maximum likelihood is a special case. This generalization provides a unified framework of statistical inference that allows regression analyses of massive data sets of important types in a parallel and scalable fashion via a distributed file system, including longitudinal data analysis, survival data analysis, and quantile regression, which cannot be handled using the maximum likelihood method. This paper investigates four important properties of the proposed method: computational scalability, statistical optimality, methodological generality, and operational robustness. In particular, the proposed method is shown to be closely connected to Hansen's generalized method of moments (GMM) and Crowder's optimality. An interesting theoretical finding is that the asymptotic efficiency of the proposed Rao-type confidence distribution estimator is always greater than or equal to that of the estimator obtained by processing the full data once. All these properties of the proposed method are illustrated via numerical examples in both simulation studies and real-world data analyses.
Scalable and Efficient Statistical Inference with Estimating Functions in the MapReduce Paradigm for Big Data
Ling Zhou and Peter X.K. Song
1 Introduction
In response to the rapidly growing demands of big data analytics and computational tools, parallel computing and distributed data storage have become the leading innovations for solving big data problems. For instance, multi-core and cloud computing platforms, including the popular open-source Apache Hadoop (2006), are now standard software technology used extensively in academia and industry (Hadoop, 2017). This new distributed file system necessitates developing general statistical methodology that allows massive data to be analyzed through parallel and scalable operations in the Hadoop software framework. Being the core of Hadoop programming, MapReduce (Dean and Ghemawat, 2008; Lämmel, 2008) represents a computing architecture that provides fast processing of massive data sets. Built upon the strategy of divide-and-conquer, the MapReduce paradigm refactors data processing into two primitives: a map function, written by the user, to process distributed local data batches and generate intermediate results, and a reduce function, also written by the user, to combine all intermediate results and then generate summary outputs. For detailed examples and implementation, we refer to Dean and Ghemawat (2008). Figure 1 displays a schematic outline of the MapReduce workflow, which splits data and performs computing tasks in the form of parallel computation. The salient features of MapReduce include scalability and independence of data storage; the former enables automatic parallelization and allocation of large-scale computations, and the latter allows data to be processed without requiring them to be loaded into a common data server. In this way, it avoids the high computational cost of loading input data into a centralized data server prior to the analysis.
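To make the two primitives concrete, here is a minimal, purely illustrative sketch in Python (it runs serially; in Hadoop the map tasks would execute on separate nodes, and all function names here are our own):

```python
from functools import reduce

# Toy illustration of the two MapReduce primitives: a user-written map
# function turns each local data batch into an intermediate summary, and a
# user-written reduce function combines the summaries into the final output.
# Here the task is a global sample mean, whose summaries are (sum, count).

def map_batch(batch):
    # Intermediate result for one batch.
    return (sum(batch), len(batch))

def reduce_pair(a, b):
    # Combine two intermediate results.
    return (a[0] + b[0], a[1] + b[1])

def mapreduce_mean(batches):
    total, count = reduce(reduce_pair, (map_batch(b) for b in batches))
    return total / count

batches = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(mapreduce_mean(batches))  # 3.5
```

Only the per-batch summaries cross node boundaries; the raw observations never need to be loaded into a common server.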
Although MapReduce and its variants have shown superb power in processing large-scale data-intensive applications on high-performance clusters, most of these systems are restricted to an acyclic data flow, which is not suitable for general statistical analysis in that iterative numerical operations are involved (Yang et al., 2007). This is because the operation of an iterative algorithm, such as Newton-Raphson, requires repeatedly reloading data from multiple data disks into a common server, incurring a significant performance penalty (Zaharia et al., 2010). This paper is motivated to address this computational hurdle by a new combined inference approach in the framework of confidence distributions, so that the resulting methodology of statistical estimation and inference can avoid repeated data-reloading operations in iterative numerical jobs, and truly enjoys the power of scalability offered by a distributed file system such as Hadoop.
In most Hadoop platforms, data partition and allocation in the Map-step may be operated by certain built-in system software to accommodate specific hardware configurations. As far as statistical inference is concerned, methodological needs arise mostly in the Reduce-step, in which statistical approaches for combining separate results are called for. Unlike the ordinary least squares method in the linear model, most nonlinear regression models rely on certain iterative numerical algorithms to obtain point estimates and quantities for statistical inference. Technically, these iterative algorithms typically require processing the entire data set through centralized, non-separable calculations, except for some simple linear operations on the data, such as the arithmetic mean, count, and proportion. Repeatedly loading local datasets into a central data server is not only extremely time-consuming, but is prohibited if the data volume exceeds the memory, or if, physically, the data sets are stored in different servers located at different sites with no data merging agreement in place. This indeed presents the challenge of generalizing the MapReduce paradigm to many important statistical models and their numerical toolboxes, such as generalized estimating equations for longitudinal data, the Cox proportional hazards model for survival data, and quantile regression, among others.
We consider a class of parametric models $\{f(z;\theta)\colon \theta \in \Theta\}$, with the parameter space $\Theta \subset \mathbb{R}^p$. Here $p$ is assumed to be fixed. In many important statistical problems where the underlying probability density function by which the data are generated cannot be fully specified, estimating functions built upon some aspects of the probability mechanism, such as moment conditions, are utilized to carry out parameter estimation and inference. Given independent samples $z_1, \ldots, z_N$, in the current literature, parameter $\theta$ may be estimated as a solution, denoted by $\hat\theta_N$, of the following estimating equation:
\[ \Psi_N(\theta) \;:=\; \sum_{i=1}^{N} \psi(z_i; \theta) \;=\; 0, \tag{1} \]
that is, $\Psi_N(\hat\theta_N) = 0$, where $\{z_1, \ldots, z_N\}$ denotes the entire data. See, for example, McLeish and Small (2012), Heyde (2008), and Song (2007, Chapter 3) for the theory of estimating functions, and more references therein. When $\psi$ in (1) is a score function, namely the first-order derivative of a log-likelihood, the solution is the maximum likelihood estimator (MLE). In other words, the method of MLE may be regarded as a special case of the estimating function method. In general, equation (1) encompasses many important cases, such as the generalized estimating equation (GEE) for longitudinal data, the partial likelihood score function in the Cox model for survival data, and the quantile regression estimating function, among others. If there were a "God-made" computer with infinite power, there would be no pressing need to develop new methods for processing big data, and the existing methodologies and associated software could be directly applied to solve equation (1). Unfortunately, thus far no human-made computer has such capacity, so the MapReduce paradigm that implements the divide-and-conquer strategy has emerged as one of the state-of-the-art computational solutions to make big data computation feasible. Using this computing platform to implement the divide-and-conquer strategy for statistical estimation and inference leads to two primary methodological questions:

(a) Is it possible, and if so how, to establish a statistical inference procedure that is suitable for implementing the Reduce-step in the MapReduce paradigm? Specifically, consider a scheme that partitions the data into, say, $K$ disjoint subsets. In the Map-step, each sub-dataset is processed in a parallelized fashion, where equation (1) is solved separately at individual computer nodes by existing statistical software (e.g., the R package gee), resulting in estimates $\hat\theta_k$, $k = 1, \ldots, K$. Then, in the Reduce-step, there is a need to develop a procedure that gathers these separate estimates and their variances to perform a valid statistical inference, if possible.
(b) Suppose that there exists an established procedure in part (a) that enables us to derive a combined or meta estimator, say, $\tilde\theta_N$. Then, there is a need to assess the performance of the proposed $\tilde\theta_N$, in terms of its estimation bias, estimation efficiency, and robustness, and to compare it to the solution of equation (1) obtained by processing the entire data once using a "God-made" computer. For convenience, the latter solution, denoted as $\hat\theta_N$, serves as the benchmark solution in the rest of this paper.
Solutions to the two questions above have been discussed in the setting of maximum likelihood estimation in the literature. Recently, Lin and Zeng (2010) and Liu et al. (2015) proposed meta estimators defined as an inverse-variance weighted average of the sub-dataset estimates, with each being the MLE obtained from one sub-dataset. Liu et al. (2015) showed that their meta estimator is asymptotically as efficient as the MLE derived from using the entire dataset once. In the setting of random-effects models, Zeng and Lin (2015) reported a similar finding; that is, their meta estimator is at least as efficient as the one obtained from the entire data. The divide-and-conquer scheme has also been studied in other statistical problems, such as hypothesis testing; see Battey et al. (2015), Chen and Xie (2014), Lee et al. (2017), Li et al. (2013), Zhang et al. (2015), and Zhao et al. (2016), among others.
One of the most promising approaches to statistical inference suitable for the Reduce-step is the so-called confidence distribution (Xie and Singh, 2013), which is the method that Liu et al. (2015) applied to derive an asymptotically fully efficient solution. As a "distributional estimator", the confidence distribution (CD) has gained increasing attention due to its computational convenience, giving rise to a theoretically unified framework for estimation and statistical inference. The concept of confidence distribution may be traced back to Bayes and Price (1763), Fisher (1930, 1956), and later Efron (1993) in the field of fiducial inference; see Xie and Singh (2013) for a comprehensive review and more references therein. Relevant to this paper, the most critical question here is why the CD approach is suitable for deriving a combined estimation and inference in the Reduce-step. The key insight learned from the setting of maximum likelihood estimation lies in the fact that the construction of a CD only requires summary statistics rather than individual subject-level data, and that the resulting inference shows asymptotically no loss of statistical power. This is aligned with the analytic goal that the Reduce-step aims to achieve. Thus, in this paper we consider generalizing the CD approach to the setting of estimating functions, through which we hope to integrate separate pieces of inferential information to obtain a combined inference with mathematical rigor. This differs from the currently popular strategy of directly combining estimators. Our proposed Reduce-step can be applied to a broad range of statistical models, especially in cases where the likelihood is not available.
To facilitate our discussion, we begin with a simple heuristic presentation of the CD approach in the Reduce-step. Suppose that regularity conditions hold such that the standard large-sample properties apply; then, for each sub-dataset of size $n_k$, the estimator $\hat\theta_k$ satisfies $\sqrt{n_k}\,(\hat\theta_k - \theta_0) \xrightarrow{d} \mathcal{N}\big(0,\ j_k^{-1}(\theta_0)\big)$, where $\theta_0$ is the underlying true parameter and $j_k(\theta_0)$ is the Godambe information, whose inverse is the sandwich covariance matrix. Then $\Phi_p\big(j_k^{1/2}(\theta_0)\sqrt{n_k}\,(\hat\theta_k - \theta_0)\big)$ asymptotically follows the $p$-dimensional independent copula, i.e., the $p$-dimensional distribution with independent uniform marginals, where $\Phi_p$ is the $p$-variate normal cumulative distribution function with mean $\mathbf{0}$ and the identity variance matrix. According to Fisher (1935), it constitutes a pivotal quantity, a distributional element essential for so-called fiducial inference. The pivotal density is termed the confidence density by Efron (1993), whose expression takes the form $h_k(\theta) \propto \exp\big\{-\tfrac{n_k}{2}(\theta - \hat\theta_k)^{T}\hat j_k(\theta - \hat\theta_k)\big\}$, where $\hat j_k$ is a consistent estimate of the information matrix $j_k(\theta_0)$. Clearly, the above confidence density may be used to construct confidence regions of $\theta$ at any given confidence level. As suggested by Singh et al. (2005), a meta estimator of $\theta$, under the assumption that the sub-datasets share a common true parameter $\theta_0$, may be obtained by maximizing a joint confidence density of the following form:
\[ h(\theta) \;=\; \prod_{k=1}^{K} h_k(\theta), \]
where the $K$-fold product is due to the independence across the sub-datasets. This procedure has been thoroughly discussed by Liu et al. (2015) in the context of the maximum likelihood method, where the matrix $\hat j_k$ is the observed Fisher information matrix, a special case of the Godambe information matrix when the Bartlett identity holds (Song, 2007, Chapter 3).
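The practical appeal of this route is that inference needs only summary statistics. A hypothetical one-dimensional sketch in Python (the function name and the scalar setting are our own illustrative choices, not the paper's notation):

```python
from math import sqrt

# Sketch: from one sub-dataset we retain only the summary pair
# (theta_hat, j_hat): the point estimate and its estimated information
# (the inverse of the squared standard error). The confidence density is
# then the N(theta_hat, 1/j_hat) density in theta, and confidence intervals
# follow without revisiting any subject-level data.

def cd_interval(theta_hat, j_hat, z=1.959963984540054):
    # z is the 97.5% standard-normal quantile, giving a 95% interval.
    se = 1.0 / sqrt(j_hat)
    return theta_hat - z * se, theta_hat + z * se

lo, hi = cd_interval(theta_hat=1.2, j_hat=100.0)  # se = 0.1
print(round(lo, 3), round(hi, 3))  # 1.004 1.396
```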
In the setting of estimating functions, there is another way to establish asymptotic normality, based directly on the estimating functions: $n_k^{-1/2}\, v_k^{-1/2}(\theta_0)\, \Psi_k(\theta_0) \xrightarrow{d} \mathcal{N}(0, I_p)$, where the matrix $v_k(\theta_0)$ is the variability matrix, i.e., the variance of the estimating function $\psi$, different from the sandwich covariance matrix above. Likewise, we may construct another pivotal quantity, $\Phi_p\big(\hat v_k^{-1/2}\, n_k^{-1/2}\, \Psi_k(\theta)\big)$, to obtain a different CD, where $\hat v_k$ is a consistent estimate of the variability matrix $v_k(\theta_0)$. Godambe and Thompson (1978) strongly advocated the use of the estimating function itself, instead of its root estimator, for statistical inference, due to better finite-sample performance. This motivates us to take a new route of investigation: constructing pivotal quantities, and then confidence distributions, directly from the estimating functions, which results in a different meta estimation. To differentiate these two routes of CD construction, we borrow terms from the classical hypothesis testing theory and name the estimator-driven CD the Wald-type CD, and our new estimating-function-based CD the Rao-type CD. Moreover, for convenience of exposition, they are abbreviated as Wald-CD and Rao-CD, respectively, in this paper. There has been little work in the literature concerning MapReduce approaches to parameter estimation and inference with estimating functions; Lin and Xi (2011) proposed an aggregated estimating equation (AEE), which was not developed in the CD framework and thus is less general than the proposed Rao-CD method; a detailed comparison between the AEE and our Rao-CD is given in both the methodological discussion and the simulation studies later in this paper.
The primary objective of this paper is to develop, assess, and compare our proposed Rao-CD approach in the Reduce-step with existing methods. The focus of investigation is on the following four aspects. (i) Scalability. Being implemented by the strategy of divide-and-conquer within the MapReduce paradigm, the Rao-CD estimation and inference procedures are scalable to massive, large-scale data. (ii) Optimality. The Rao-CD approach is closely related to Hansen's seminal generalized method of moments (GMM), which supplies a powerful analytic tool for establishing theoretical justifications for the optimality of the proposed method. Moreover, the Rao-CD approach is also shown to be connected to Crowder's optimality, another important perspective on the optimality of the Rao-CD meta estimation. The most interesting theoretical result in this paper is given by Theorem 5 in Section 3; that is, the asymptotic estimation efficiency of the Rao-CD meta estimator is always equal to or higher than that of the benchmark estimator obtained by processing the entire data once. (iii) Generality. The proposed Rao-CD method provides a generally valid inference procedure for combining results from separately fitted models where the likelihood is not available. It also includes Lin and Xi (2011)'s aggregated estimating equation (AEE) estimator as a special case. (iv) Robustness. The proposed Rao-CD method is shown to be robust against certain forms of heterogeneity and/or contaminated data, which has not been studied for any divide-and-conquer method from the frequentist perspective. The robustness is rooted in two facts. (a) Diluted abnormality. Data partition in the Map-step may allocate abnormal data cases to some sub-datasets while the others contain no outliers. When this happens, the analysis would be affected only within a small number of sub-datasets, while analyses of the majority of sub-datasets remain unaffected. (b) Automatic down-weighting.
The Rao-CD confidence density provides an automatic down-weighting scheme that minimizes the contributions from poor estimators (those with inflated variances), owing to the fact that the weighting is inversely proportional to the variance of an estimator (Qu and Song, 2004). The resulting combined estimator in the Reduce-step is robust through these two layers of protection. In contrast, if the entire data set is processed together in equation (1), the analysis will be affected by a few strongly influential data cases, unless certain robustness treatments are applied to the estimating functions. In Bayesian inference, Minsker et al. (2014) proposed a robust and scalable approach to Bayesian analysis in a big data framework.
The rest of the paper is organized as follows. In Section 2, we introduce the Rao-CD in detail. Section 3 discusses the development of the Rao-CD in the Reduce-step, including its optimality. Section 4 presents the theoretical properties of CD meta estimators. Section 5 focuses on a fast MapReduce implementation procedure. Section 6 presents three useful examples. The numerical performance is evaluated in Section 7. Finally, we apply the Rao-CD method to several real-world data sets in Section 8, and we conclude with a discussion of the Rao-CD's limitations and future work in Section 9. All the conditions and proofs are given in the Appendix.
2 Rao-type confidence distribution
When the likelihood is not available, the theory of estimating functions provides an appealing approach to obtaining an estimator of $\theta$ as a solution to equation (1) (e.g., Heyde, 2008). In this theoretical framework, there exist two forms of information matrices, the variability matrix and the sensitivity matrix, denoted by $v(\theta) = \mathrm{var}\{\psi(z;\theta)\}$ and $s(\theta) = -E\{\partial\psi(z;\theta)/\partial\theta^{T}\}$, respectively. Both are assumed to be positive definite in this paper. The Bartlett identity refers to the equality $v(\theta) = s(\theta)$ for $\theta \in \Theta$, which holds in the case of $\psi$ being the score function. Under some regularity conditions (Song, 2007, Chapter 3), the estimating function has the following asymptotic normal distribution:
\[ N^{-1/2}\, \Psi_N(\theta_0) \xrightarrow{d} \mathcal{N}\big(0,\ v(\theta_0)\big). \tag{2} \]
It follows that $\hat v_N^{-1/2}\, N^{-1/2}\, \Psi_N(\theta_0) \xrightarrow{d} \mathcal{N}(0, I_p)$, where $\hat v_N$ is the sample variance, $\hat v_N = N^{-1}\sum_{i=1}^{N} \psi(z_i;\hat\theta_N)\,\psi(z_i;\hat\theta_N)^{T}$, a root-$N$ consistent estimator of $v(\theta_0)$. Under the same regularity conditions, it is also known that the estimator $\hat\theta_N$ is asymptotically normally distributed,
\[ \sqrt{N}\,\big(\hat\theta_N - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\ j^{-1}(\theta_0)\big), \tag{3} \]
where $j(\theta) = s(\theta)^{T}\, v^{-1}(\theta)\, s(\theta)$ is the Godambe information matrix.
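In code, the two information matrices and the resulting Godambe (sandwich) information can be estimated from per-observation contributions. The following Python sketch uses our own function name and array conventions, which are assumptions for illustration only:

```python
import numpy as np

# Empirical variability matrix V, sensitivity matrix S, and Godambe
# (sandwich) information J = S^T V^{-1} S, computed from per-observation
# estimating-function values and their Jacobians, both evaluated at the
# estimate (so the psi values average to roughly zero).

def godambe_information(psi, dpsi):
    """psi: (n, p) estimating-function contributions;
    dpsi: (n, p, p) Jacobians of those contributions w.r.t. theta."""
    n = psi.shape[0]
    V = psi.T @ psi / n               # variability: Var of one psi term
    S = -dpsi.mean(axis=0)            # sensitivity: -E[d psi / d theta^T]
    J = S.T @ np.linalg.solve(V, S)   # Godambe information
    return V, S, J

# Example: the Gaussian-mean score psi_i = x_i - theta has constant
# Jacobian -1, so S = I and J = V^{-1}.
x = np.array([0.0, 2.0])
psi = (x - 1.0).reshape(-1, 1)
dpsi = np.full((2, 1, 1), -1.0)
V, S, J = godambe_information(psi, dpsi)
print(float(J[0, 0]))  # 1.0
```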
According to the definition of confidence distribution (Schweder and Hjort, 2002; Singh et al., 2005), and from the asymptotic normality in (2), in this paper we define a Rao-type confidence distribution with respect to $\theta$ as follows:
\[ H_r(\theta) \;=\; \Phi_p\big(\hat v_N^{-1/2}\, N^{-1/2}\, \Psi_N(\theta)\big), \tag{4} \]
where $\Phi_p$ is the $p$-variate normal cumulative distribution function with mean $\mathbf{0}$ and the identity variance matrix. Clearly, $H_r$ is asymptotically a valid confidence distribution for $\theta$. Likewise, from the asymptotic normality in (3), a Wald-type confidence distribution for $\theta$ is given by
\[ H_w(\theta) \;=\; \Phi_p\big(\hat j_N^{1/2}\, \sqrt{N}\,(\theta - \hat\theta_N)\big), \tag{5} \]
where $\hat j_N = \hat s_N^{T}\, \hat v_N^{-1}\, \hat s_N$ is the observed Godambe information matrix, with $\hat s_N$ being the observed sensitivity matrix, namely $\hat s_N = -N^{-1}\sum_{i=1}^{N} \partial\psi(z_i;\hat\theta_N)/\partial\theta^{T}$, provided that the estimating function is differentiable. Again, $H_w$ is asymptotically a valid confidence distribution. In the current literature, this type of Wald-CD (Liu et al., 2015; Xie and Singh, 2013) has been the choice of confidence distribution considered in the setting of maximum likelihood estimation.
A question arises naturally: what is the relationship between the Rao-CD and the Wald-CD? An answer may be drawn from an analysis resembling the classical comparison between the Wald test and the score test in the theory of hypothesis testing; see, for example, Engle (1984). By the definition of a CD, both the Wald-CD and the Rao-CD are distributional estimators for statistical inference, and they can be shown to be asymptotically equivalent for inference on $\theta$ (see Theorem 4 and Lemma 1 in Appendix B.4). On the other hand, in comparison to the Wald-CD, the Rao-CD has the following advantages. First, the Rao-CD is invariant under a one-to-one parameter transformation, say $\eta = g(\theta)$, which leads to a different distribution function of the parameter but an equivalent estimating function (Godambe and Kale, 1991). Second, the Rao-CD is favorable if calculation of the sensitivity matrix, which is involved in the Wald-CD, is analytically tedious or numerically unstable (Song et al., 2005). Note that the Rao-CD only requires calculation of the variability matrix. Third, and most importantly, the Rao-CD provides a much more convenient theoretical framework than the Wald-CD for establishing theoretical properties of the CD meta estimation, because we can show that it is connected to Hansen's GMM in Section 3. Using this remarkable connection, we can provide theoretical justifications for the optimal efficiency of the CD meta estimator and its estimation robustness against contaminated data.
3 Rao-CD meta estimation from parallel datasets
3.1 Definition
In this section, we present the procedure of combining Rao-type confidence distributions to derive a meta estimator for a common parameter of interest. Consider $K$ parallel datasets, each with $n_k$ observations, $k = 1, \ldots, K$. Assume these sub-datasets are independently sampled from disjoint sets of subjects, and each is processed separately by the estimating equation in (1), leading to estimators $\hat\theta_k$, $k = 1, \ldots, K$.
In a similar spirit to ideas given in Singh et al. (2005) and Liu et al. (2015), the cross-dataset independence permits multiplication of the Rao-type confidence distributions given by (4). Specifically, we consider the density of the Rao-type CD for the $k$-th sub-dataset, up to some asymptotically constant factors,
\[ h_k(\theta) \;\propto\; \phi_p\big(\hat v_k^{-1/2}\, n_k^{-1/2}\, \Psi_k(\theta)\big), \]
where $\phi_p$ is the $p$-variate normal density function with mean $\mathbf{0}$ and the identity variance matrix, and $\hat v_k$ is a consistent estimate of the variability matrix from the $k$-th sub-dataset, evaluated at $\hat\theta_k$. To proceed with the CD approach, we take the product of these confidence densities as follows,
\[ h_c(\theta) \;=\; \prod_{k=1}^{K} h_k(\theta) \;\propto\; \exp\Big\{-\tfrac{1}{2}\sum_{k=1}^{K} \Psi_k(\theta)^{T}\, \hat V_k^{-1}\, \Psi_k(\theta)\Big\}, \qquad \hat V_k = n_k\,\hat v_k. \tag{6} \]
Moreover, we define a meta estimator of $\theta$, denoted by $\tilde\theta_N$, as the maximizer of the joint confidence density in (6). We show in the next two Subsections 3.2-3.3 that this meta estimator has two important optimality properties, namely Crowder's optimality and Hansen's optimality in the context of the generalized method of moments (GMM).
3.2 Crowder’s optimality
Note that $\tilde\theta_N$ maximizes the logarithm of the joint confidence density in (6), which is obtained as a solution of the following estimating equation:
\[ \Psi_c(\theta) \;:=\; \sum_{k=1}^{K} \hat S_k^{T}\, \hat V_k^{-1}\, \Psi_k(\theta) \;=\; 0, \tag{7} \]
where $\hat S_k$ denotes the estimated sensitivity matrix of $\Psi_k$, evaluated at $\hat\theta_k$, and $\hat V_k$ is as given in (6).
Under some regularity conditions, Crowder (1987) showed that the optimal estimating function in Crowder's class of estimating functions of the form
\[ \sum_{k=1}^{K} C_k(\theta)\, \Psi_k(\theta) \]
is the one with $C_k(\theta) = S_k(\theta)^{T}\, V_k(\theta)^{-1}$, where $S_k(\theta)$ and $V_k(\theta)$ are the sensitivity and variability matrices of $\Psi_k$ with sub-dataset $k$. See also Theorem 3.13 in Song (2007). The following proposition shows that the estimating function in (7) is asymptotically equivalent to Crowder's optimal estimating function at $\theta = \theta_0$.
Proposition 1
Under regularity conditions (C1)-(C3) and (C4.0) in Appendix A, if the number of subsets $K$ grows with $N$ at a sufficiently slow rate, the norm of the difference between the estimating function in (7) and Crowder's optimal estimating function asymptotically converges to 0; that is, the difference is $o_p(1)$ as $N \to \infty$.
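To see the combined equation in action, the following Python sketch solves a sensitivity-weighted, inverse-variability-weighted equation of the form (7) by Newton iteration, using the Gaussian-mean estimating function $\Psi_k(\theta) = \sum_i (x_i - \theta)$; the function name, the choice of estimating function, and the plug-in weights are our own illustrative assumptions:

```python
import numpy as np

# For Psi_k(theta) = sum over batch k of (x_i - theta), the sensitivity of
# Psi_k is S_k = n_k and its variance is V_k = n_k * var_k, so the combined
# equation sum_k S_k V_k^{-1} Psi_k(theta) = 0 is solved by Newton steps.

def rao_cd_solve(batches, n_iter=25):
    theta = np.mean([np.mean(b) for b in batches])       # starting value
    S = [len(b) for b in batches]                        # sensitivities
    V = [len(b) * np.var(b, ddof=1) for b in batches]    # variabilities
    for _ in range(n_iter):
        g = sum(Sk / Vk * np.sum(b - theta)
                for Sk, Vk, b in zip(S, V, batches))     # combined equation
        dg = -sum(Sk / Vk * len(b)
                  for Sk, Vk, b in zip(S, V, batches))   # its derivative
        theta = theta - g / dg                           # Newton update
    return float(theta)

batches = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])]
print(rao_cd_solve(batches))  # 2.75
```

In this linear example the Newton iteration converges in a single step, and the solution is a precision-weighted average of the batch means.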
3.3 Hansen’s optimality
Since the joint confidence density in (6) is proportional to $\exp\{-Q(\theta)/2\}$, it is interesting to see that the Rao-CD estimator $\tilde\theta_N$ is equivalently obtained by minimizing the following quadratic function,
\[ Q(\theta) \;=\; \Psi(\theta)^{T}\, \hat W\, \Psi(\theta), \tag{8} \]
where $\Psi(\theta) = \big(\Psi_1(\theta)^{T}, \ldots, \Psi_K(\theta)^{T}\big)^{T}$ is an extended vector of estimating functions, each based on one sub-dataset, and $\hat W = \mathrm{diag}\big(\hat V_1^{-1}, \ldots, \hat V_K^{-1}\big)$. Here $\Psi(\theta)$ is an over-identified estimating function in the sense that its dimension $Kp$ is bigger than the dimension $p$ of $\theta$. Because of the independent sampling across the sub-datasets, the variance of $\Psi(\theta_0)$ is block-diagonal, and $\hat W^{-1}$ is a consistent estimator of this variance. According to Hansen (1982), expression (8) presents a form of GMM. Thus, it is known from the classical theory of GMM that, under some regularity conditions, our proposed meta estimator $\tilde\theta_N$ has the smallest asymptotic variance among the meta estimators of the following form:
\[ \arg\min_{\theta}\ \Psi(\theta)^{T}\, W\, \Psi(\theta), \]
where $W$ is a certain weighting matrix from, say, the class of positive semi-definite matrices. Expression (8) provides a convenient theoretical framework for the development of large-sample properties of the proposed meta estimator $\tilde\theta_N$, as many established theorems and properties of the GMM may be applied here.
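The quadratic form in (8) can be written down directly. In this hypothetical Python sketch (the names and the Gaussian-mean example are our own), the objective is evaluated from the stacked per-subset estimating functions and block-wise inverse-variance weights:

```python
import numpy as np

# GMM-style objective Q(theta) = Psi(theta)^T W Psi(theta), with W the
# block-diagonal matrix of inverse-variance blocks; by independence of the
# sub-datasets the cross blocks vanish, so Q is a sum of per-block terms.

def gmm_objective(psis, v_invs):
    """psis: list of K p-vectors Psi_k(theta);
    v_invs: list of K inverse-variance (p x p) blocks."""
    return sum(float(p @ Vi @ p) for p, Vi in zip(psis, v_invs))

def q_of_theta(theta, batches, v_invs):
    # Gaussian-mean example: Psi_k(theta) = sum over batch k of (x - theta).
    psis = [np.array([np.sum(b - theta)]) for b in batches]
    return gmm_objective(psis, v_invs)

batches = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])]
v_invs = [np.array([[1 / 3]]), np.array([[1 / 4]])]   # V_1 = 3, V_2 = 4
print(round(q_of_theta(2.75, batches, v_invs), 6))    # 6.75
```

The minimizer of this objective over theta reproduces the weighted solution of the combined estimating equation.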
On the other hand, based on the asymptotic normality of the estimators $\hat\theta_k$, the Wald-CD meta estimator takes the following form,
\[ \tilde\theta_N^{w} \;=\; \arg\min_{\theta}\ \sum_{k=1}^{K} \big(\theta - \hat\theta_k\big)^{T}\, \hat J_k\, \big(\theta - \hat\theta_k\big), \tag{9} \]
which is obtained using a product of the Wald-type confidence distributions in (5), where $\hat J_k = \hat S_k^{T}\, \hat V_k^{-1}\, \hat S_k$, with $\hat S_k$ being the estimated sensitivity matrix. Applying arguments similar to those established in the classical theory of hypothesis testing for comparing Rao's score test and Wald's test, we can show that $\tilde\theta_N$ in (8) and $\tilde\theta_N^{w}$ in (9) are indeed asymptotically equivalent under some smoothness conditions on the estimating function, such as condition (C4.2) in Appendix A. See Theorem 4 for the details.
4 Large sample properties
4.1 Consistency and asymptotic normality
We establish the consistency and asymptotic normality of the Rao-CD meta estimator $\tilde\theta_N$ in Theorems 1-3, respectively. All proofs of these theorems are given in Appendices B.1 and B.2. Without loss of generality, we assume a common true parameter $\theta_0$ across all sub-datasets throughout the rest of this paper, whenever applicable.
Theorem 1
Under regularity conditions (C1)-(C3) and (C4.0) given in Appendix A and the homogeneity of parameters across sub-datasets, the meta estimator $\tilde\theta_N$ is consistent; namely, $\tilde\theta_N \xrightarrow{p} \theta_0$ as $N \to \infty$.
Theorem 2
Under the same conditions as Theorem 1 and an additional condition (C4.1), the meta estimator $\tilde\theta_N$ is asymptotically normally distributed; namely,
\[ \sqrt{N}\,\big(\tilde\theta_N - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\ \tilde j^{-1}(\theta_0)\big), \]
where $\tilde j(\theta_0) = \lim_{N\to\infty} N^{-1}\sum_{k=1}^{K} S_k(\theta_0)^{T}\, V_k(\theta_0)^{-1}\, S_k(\theta_0)$.
Note that, by its definition, $\tilde j(\theta_0)$ aggregates information over all $K$ sub-datasets and remains positive definite in the limit. It follows from Theorem 2 that the convergence rate of $\tilde\theta_N$ is of order $N^{-1/2}$, not of order $n_k^{-1/2}$. This presents an important difference from the subsampling strategy (Mahoney, 2011; Ma et al., 2015), in which the asymptotic convergence rate of the estimators is usually of an order given by the subsample size.
In practice, the number of computing nodes in parallelization may increase, i.e., $K \to \infty$; the following Theorem 3 gives the asymptotic properties of the proposed Rao-CD meta estimator $\tilde\theta_N$ in this regime.
Theorem 3
Remark 1
In the case where each sub-dataset has the same size, i.e., $n_k \equiv n = N/K$, the conclusion of Theorem 3 follows readily from Theorems 1 and 2 under the stated growth conditions on $K$. Lin and Xi (2011) derived the asymptotic distribution of a quasi-likelihood estimator under an assumption of the form $K = O(N^{\gamma})$, with the positive constant $\gamma$ confined to a range much narrower than that given in Theorem 3.
4.2 Asymptotic efficiency
In this section, we first present the asymptotic equivalence of the two types of CD estimators, the Rao-CD and Wald-CD meta estimators, in Theorem 4. This theorem provides the theoretical basis for a fast algorithm to implement the Rao-CD meta estimator in Section 5. We then compare the asymptotic efficiency of the meta estimator $\tilde\theta_N$ and the full-data estimator $\hat\theta_N$ in Theorem 5. The proofs of Theorems 4 and 5 are given in Appendices B.4 and B.6, respectively.
Theorem 4
If conditions (C1)-(C3) and (C4.2) hold, we have $\sqrt{N}\,\|\tilde\theta_N - \tilde\theta_N^{w}\| = o_p(1)$ as $N \to \infty$, where $\tilde\theta_N^{w}$ is the Wald-CD meta estimator in (9).
Remark 2
Conditions (C4.0), (C4.1), and (C4.2) are weaker than the typical smoothness assumptions adopted in the theory of estimating functions (i.e., twice continuous differentiability). These conditions have also been considered in Pakes and Pollard (1989) and Newey and McFadden (1994), among others. In other words, they automatically hold when the estimating function is twice continuously differentiable.
Remark 3
Theorems 3 and 4 establish the estimation consistency and asymptotic normality of both the Rao-CD and Wald-CD meta estimators as $K \to \infty$. These important theoretical properties are useful for implementing the meta estimators in the MapReduce paradigm. Not surprisingly, the number of parallel datasets, $K$, cannot increase at an arbitrarily fast rate, as information attrition can affect the quality of estimation within each sub-dataset. Intuitively, ensuring the goodness of fit of each individual estimator $\hat\theta_k$ is of first importance in order to yield a desirable meta estimator. Technically, this is attributed to the fact that the estimation bias for a sub-dataset is of an order governed by the sub-dataset size, and it does not vanish through data aggregation relative to the variance of the resulting meta estimator. In other words, the proposed combination procedure improves the order of the variance to the parametric rate, whereas the order of the estimation bias remains at the rate given by the sub-dataset size. Consequently, an increase in the number of computing nodes should be controlled in such a way that the estimation bias is ignorable relative to the variance of the meta estimator. From a theoretical point of view, one direction for improvement is through debiasing methods (e.g., Firth, 1993; Cordeiro and McCullagh, 1991), which may permit $K$ to increase to infinity at a faster rate than the one obtained in this paper. From a practical point of view, however, the number of CPUs allocated is constrained by budget and available computing resources, and thus it is not necessary to let $K$ diverge at an arbitrary rate.
We now turn to the asymptotic efficiency of $\tilde\theta_N$ relative to that of $\hat\theta_N$, the estimator obtained by processing the entire data once, without the MapReduce strategy, from the following estimating equation; that is, $\hat\theta_N$ satisfies
\[ \Psi_N(\theta) \;=\; \sum_{k=1}^{K} \Psi_k(\theta) \;=\; 0. \tag{10} \]
The standard theory of estimating functions claims that, under the same conditions of Theorem 2, we have both estimation consistency and asymptotic normality,
\[ \sqrt{N}\,\big(\hat\theta_N - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\ j^{-1}(\theta_0)\big), \]
where the Godambe information $j(\theta) = s(\theta)^{T}\, v^{-1}(\theta)\, s(\theta)$, with sensitivity matrix $s(\theta) = -E\{\partial\psi(z;\theta)/\partial\theta^{T}\}$ and variability matrix $v(\theta) = \mathrm{var}\{\psi(z;\theta)\}$.
It is interesting to note that the root of equation (10), $\hat\theta_N$, may be regarded as a minimizer of the following quadratic estimation function:
\[ Q_0(\theta) \;=\; \Psi(\theta)^{T}\, \hat W_0\, \Psi(\theta), \tag{11} \]
where $\hat W_0 = (\mathbf{1}_K \mathbf{1}_K^{T}) \otimes \hat v_N^{-1}$ and $\Psi(\theta)$ is the extended vector of estimating functions defined in Section 3. The method given in (11) is quite similar to the so-called aggregated estimating equation (AEE) proposed by Lin and Xi (2011); that is,
\[ \tilde\theta_N^{\mathrm{aee}} \;=\; \Big(\sum_{k=1}^{K} \hat S_k\Big)^{-1} \sum_{k=1}^{K} \hat S_k\, \hat\theta_k, \]
where $\hat S_k$ is a consistent estimator of the sensitivity matrix of $\Psi_k$. Hence, $\tilde\theta_N^{\mathrm{aee}}$ and $\hat\theta_N$ may be different numerically under finite samples, but they have the same asymptotic distribution.
We gain two important insights by comparing expressions (8) and (11) in terms of the two types of weighting matrices, $\hat W$ versus $\hat W_0$, which lead to $\tilde\theta_N$ (asymptotically equivalent to $\tilde\theta_N^{w}$) and $\hat\theta_N$ (asymptotically equivalent to $\tilde\theta_N^{\mathrm{aee}}$), respectively. Note that $\hat W$ and $\hat W_0$ are different in general in the context of estimating functions. Because of such weighting differences, according to Hansen's theory of GMM, $\tilde\theta_N$ (or $\tilde\theta_N^{w}$) will be asymptotically more efficient than $\hat\theta_N$ (or $\tilde\theta_N^{\mathrm{aee}}$). This insight is summarized in Theorem 5 below.
Theorem 5
Under the same conditions of Theorem 2, we have the following inequality of Godambe information,
\[ \tilde j(\theta_0) \;\succeq\; j(\theta_0), \]
in the sense that $\tilde j(\theta_0) - j(\theta_0)$ is positive semi-definite, where the equality holds if and only if either the Bartlett identity, $v(\theta_0) = s(\theta_0)$, holds for the estimating function $\psi$ (e.g., $\psi$ being the score function), or there exists a homogeneous asymptotic Godambe information across all sub-datasets, i.e., $j_1(\theta_0) = \cdots = j_K(\theta_0)$.
Theorem 5 indicates that the Rao-CD meta estimator is asymptotically at least as efficient as the one-time estimator $\hat\theta_N$ when the same estimating method is applied to individual sub-datasets in the MapReduce paradigm and to the entire data once in a "God-made" computer. If the homogeneity of information matrices across individual sub-datasets is violated, $\tilde\theta_N$ will be more efficient than $\hat\theta_N$. This result is somewhat counterintuitive, but the condition always occurs in actual data analysis, where, with finite samples, one would obtain unequal empirical information matrices $\hat V_k$ and $\hat S_k$ across sub-datasets. This efficiency improvement is rooted in the fact that the weighting in (8) is optimal (Hansen, 1982), and thus better than that in (11). Similar results are also found in Zeng and Lin (2015) under random-effects models, and in Hu and Kalbfleisch (2000), who pointed out that the studentized estimating function bootstrap is second-order accurate in comparison to the first-order approximation of the estimating function bootstrap. When the estimating function is the score function, the same result has already been found in Lin and Zeng (2010) and Liu et al. (2015) for maximum likelihood estimation.
Another important property of the Rao-CD meta estimation concerns robustness against data contamination. Note that the variance-based weighting scheme in (8) automatically down-weights any data cases associated with large residual values in the estimation procedure; see Qu and Song (2004); Preisser and Qaqish (1999); Hampel et al. (2011), among others. This down-weighting mechanism makes the Rao-CD meta estimation more robust than the one-time estimator, , obtained from (10) or (11). For the Wald-CD meta estimator in (9), the weighting scheme takes approximately the form of the variance of , so the robustness of the Rao-CD meta estimator is shared by the Wald-CD meta estimator. In addition to the automatic down-weighting scheme in the Rao-CD meta estimation approach, data splitting allocates outliers to some of the sub-datasets, potentially affecting only the few ’s from those sub-datasets that contain outliers. This dilution of influential cases via the data division adds a second layer of protection beyond the down-weighting mechanism, ensuring greater robustness of the CD-based estimation and inference against contaminated data cases. In Section 7, we use simulation studies to illustrate the robustness of the Rao-CD meta estimation approach.
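The down-weighting mechanism described above can be illustrated with a minimal scalar sketch: under inverse-variance weighting, a sub-dataset whose estimating function shows inflated empirical variability (e.g., because it absorbed the outliers after data splitting) receives a smaller relative weight. The function name and the numerical values below are purely illustrative, not part of the paper's method or software.

```python
import numpy as np

def inverse_variance_weights(variabilities):
    """Normalized inverse-variance weights for K sub-datasets.

    Scalar sketch of the automatic down-weighting in a variance-based
    weighting scheme: larger empirical variability -> smaller weight.
    """
    w = 1.0 / np.asarray(variabilities, dtype=float)  # inverse variances
    return w / w.sum()                                # normalize to sum to 1
```

For instance, with hypothetical variabilities `[1.0, 1.0, 4.0]`, the third (contaminated) sub-dataset receives a quarter of the weight of each clean one.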
5 Implementation
It follows immediately from the definition of the Wald-CD meta estimator in (9) that at the Reduce-step, based on mapped sub-datasets, is given by
(12) 
This closed-form expression of in (12) only involves summary statistics , that are calculated separately in the Map-step on individual computing nodes. Clearly, (12) presents a scalable parallel calculation with no need to load the entire data into a common server, and thus saves a considerable amount of computation time. Note that the AEE estimator can also be implemented similarly in this scalable MapReduce framework.
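The Reduce-step combination can be sketched in a few lines, assuming the standard information-weighted form in which each Map task returns a per-subset estimate together with its empirical (Godambe) information matrix. This is a minimal sketch under that assumption; the function and variable names are illustrative and not taken from the paper's released software.

```python
import numpy as np

def wald_cd_combine(estimates, infos):
    """Reduce-step combination of per-subset results.

    Sketch of an information-weighted meta estimate:
        theta = (sum_k J_k)^{-1} sum_k J_k theta_k,
    where theta_k and J_k are the estimate and empirical information
    matrix computed for sub-dataset k in the Map-step.
    """
    total_info = np.sum(infos, axis=0)  # sum_k J_k
    weighted_sum = np.sum([J @ t for J, t in zip(infos, estimates)], axis=0)
    return np.linalg.solve(total_info, weighted_sum)
```

Only the pairs `(theta_k, J_k)` travel from the mappers to the reducer, which is what makes the calculation scalable: the raw data never leave their computing nodes.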
The proposed Rao-CD meta estimation in (7) may be implemented by the Newton-Raphson iterative algorithm. To do so, taking the second-order Taylor expansion of (7) around the , we have
It is interesting to note that the second factor in the second term of the above expression equals zero because of the expression of in (12). These two types of CD meta estimators are numerically very close to each other, where may be regarded as a solution obtained by a one-step Newton-Raphson update from the in (12). The approximation error between them is explicitly gauged in Theorem 4, with the theoretical order of error being , which supports the above numerical approximation. This point of view is particularly appealing for big data computation with large .
To establish statistical inference, we propose to estimate the variance of by its empirical asymptotic variance, namely, , with , where . In the MapReduce paradigm, to avoid using the entire data, we propose to approximate by the following estimate:
(13) 
where is the empirical Godambe information matrix of obtained in the Map-step on a single computing node. The following theorem provides a theoretical assessment of the approximation error incurred by the estimate in (13).
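Under the same information-weighted convention, the variance approximation in (13) uses only the per-subset empirical Godambe information matrices, so it too avoids touching the entire data. A minimal sketch, assuming each matrix is positive definite; the function name is illustrative.

```python
import numpy as np

def meta_variance(infos):
    """Approximate the asymptotic variance of the meta estimator as
    the inverse of the summed per-subset empirical information matrices,
    each computed in the Map-step on a single computing node.
    """
    total_info = np.sum(infos, axis=0)  # aggregate information
    return np.linalg.inv(total_info)    # variance approximation
```

Standard errors for Wald-type inference then come from the diagonal of this matrix, with no additional pass over the data.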
Theorem 6
Under conditions (C1)-(C3) and (C4.1), we have
(14) 
The proof of Theorem 6 is given in Appendix B.5. Theorems 4 and 6 suggest that for big data computation with very large , the two types of meta CD estimators and are numerically very close to each other in terms of both point estimation and statistical inference. It is worth pointing out, however, that the Rao-CD approach presents a much more appealing theoretical framework in which to establish and interpret theoretical properties and, more importantly, to carry out the relevant analytic justification. We have implemented the proposed Rao-CD and Wald-CD meta estimation methods in the Hadoop programming framework using the Python language. The code has been applied to conduct the simulation studies and real data analyses. The software is available for download at http://www.umich.edu/songlab/software.html#RCD.
6 Several examples
The Rao-CD approach is applicable to a wide range of important statistical models. Here we present three representative examples that are widely used in practice.
6.1 Quantile regression
Denote regression data by , where is the outcome and is a vector of regressors. Let the conditional distribution function of variable given be , and let the th quantile of be , . According to Koenker (2005), a quantile regression model takes the form
Now applying the MapReduce paradigm, one may divide the data into K sub-datasets and use the following estimating function to estimate the parameter with the sub-dataset at one computing node:
where is the indicator function. For a solution of the above equation satisfying , the standard theory of quantile regression (Koenker, 2005) establishes the following asymptotic distributions, under some regularity conditions,
where , and . It follows from (8) that the Rao-CD meta estimator is given by:
(15) 
with , independent of . Also, by (9), the Wald-CD meta estimator is obtained as
where the empirical sensitivity matrix involves estimation of the unknown density . Estimating may be tedious and unstable when is not large. From this perspective, the Rao-CD meta estimator may be numerically more stable than the Wald-CD meta estimator in cases where the density is hard to estimate. Although directly minimizing the quadratic term (15) favors numerical stability, it is infeasible in the MapReduce framework because the direct optimization in (15) requires reloading the entire data set. Alternatively, the Apache Spark platform (Zaharia et al., 2010) may be used to overcome this implementation challenge, as Spark provides more flexible management of data reloading; this is beyond the scope of this paper. When the size of each sub-dataset is large enough that the density is well estimated, we may use the one-step updating strategy given in Section 5, where a completely parallelized calculation gives rise to a fast and simple implementation. Some numerical results illustrating the advantage of this parallelized computing scheme are shown in Section 7.
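The per-subset ingredients of the quantile-regression case can be sketched directly from their definitions: the estimating function sums the regressor vector weighted by the sign information of the residuals, and the variability matrix has the familiar density-free form, which is one reason the Rao-CD weighting is numerically stable here. A minimal sketch following Koenker (2005) up to sign convention; names are illustrative.

```python
import numpy as np

def qr_estimating_function(theta, X, y, tau):
    """Quantile-regression estimating function for one sub-dataset:
        psi(theta) = sum_i x_i * (tau - 1{y_i <= x_i' theta}).
    X is (n, p), y is (n,), theta is (p,), tau in (0, 1).
    """
    below = (y <= X @ theta).astype(float)  # indicator 1{y_i <= x_i' theta}
    return X.T @ (tau - below)

def qr_variability(X, tau):
    """Empirical variability matrix tau * (1 - tau) * X'X / n.

    Note it does not involve the unknown conditional density of y,
    unlike the sensitivity matrix needed by the Wald-type weighting.
    """
    n = X.shape[0]
    return tau * (1.0 - tau) * (X.T @ X) / n
```

At a solution of the estimating equation, `qr_estimating_function` is (approximately) zero; the mappers would report each subset's root together with `qr_variability` for the Reduce-step combination.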
6.2 Generalized estimating equations
Consider longitudinal data consisting of independent realizations , with subject being observed repeatedly at times. In the literature of longitudinal analysis, the generalized estimating equation (GEE) approach, proposed by Liang and Zeger (1986), is one of the most widely used methods; it is a quasi-likelihood approach based only on the first two moments of the data distribution. Denote the first two conditional moments of given by , and , where is a known link function,