Efficient Estimation of Nonlinear Finite Population Parameters Using Nonparametrics
Currently, the high-precision estimation of nonlinear parameters such as Gini indices, low-income proportions or other measures of inequality is particularly crucial. In the present paper, we propose a general class of estimators for such parameters that take into account univariate auxiliary information assumed to be known for every unit in the population. Through a nonparametric model-assisted approach, we construct a unique system of survey weights that can be used to estimate any nonlinear parameter associated with any study variable of the survey, using a plug-in principle. Based on a rigorous functional approach and a linearization principle, the asymptotic variance of the proposed estimators is derived, and variance estimators are shown to be consistent under mild assumptions. The theory is fully detailed for penalized B-spline estimators together with suggestions for practical implementation and guidelines for choosing the smoothing parameters. The validity of the method is demonstrated on data extracted from the French Labor Force Survey. Point estimates and confidence intervals for the Gini index and the low-income proportion are derived. Theoretical and empirical results highlight the benefit of using a nonparametric approach rather than a parametric one when estimating nonlinear parameters in the presence of auxiliary information.
Keywords: auxiliary information; penalized B-splines; calibration; concentration and inequality measures; influence function; linearization; model-assisted approach; total variation distance.
The estimation of nonlinear parameters in finite populations has become a crucial problem in many recent surveys. For example, in the European Statistics on Income and Living Conditions (EU-SILC) survey, several indicators for studying social inequalities and poverty are considered; these include the Gini index, the at-risk-of-poverty rate, the quintile share ratio and the low-income proportion. Thus, deriving estimators and confidence intervals for such indicators is particularly useful. In the present paper, assuming that we have a single continuous auxiliary variable available for every unit in the population, we propose a general class of estimators that take into account the auxiliary variable, and we derive their asymptotic properties for general survey designs. The class of estimators we propose is based on a nonparametric model-assisted approach. Interestingly, the estimators can be written as a weighted sum of the sampled observations, allowing a unique weight variable that can be used to estimate any complex parameter associated with any study variable of the survey. Having a unique system of weights is very important in multipurpose surveys such as the EU-SILC survey.
The estimation of nonlinear parameters is a problem that has already been addressed in several papers such as Shao (1994) for L-estimators, Binder and Kovacevic (1995) for the Gini index and Berger and Skinner (2003) for the low-income proportion. We mention also the very recent work of Opsomer and Wang (2011). Taking auxiliary information into account for estimating means or totals is a topic that has been extensively studied in the literature; it now encompasses the model-assisted and the calibration approaches, which coincide in particular cases (Särndal, 2007). In a model-assisted setting, linear models are usually used, thus leading to the well-known generalized regression estimators (GREG). Some nonparametric models have also been considered (Breidt and Opsomer, 2009). However, to the best of our knowledge, ratios, distribution functions and quantiles are the only examples of nonlinear parameters estimated using auxiliary information.
To derive our class of estimators and their asymptotic properties, we use an approach based on the influence function developed by Deville (1999). This approach utilizes a functional interpretation of the parameter of interest and a linearization principle to derive asymptotic approximations of the estimators. In general, the precision of an estimator $\hat\Phi$ of a nonlinear finite population parameter $\Phi$ is obtained by resampling techniques or linearization approaches, and in the present paper we focus on linearization techniques. When a sample $s$ is selected from the finite population $U$ according to a sampling design $p(\cdot)$, the linearization of $\hat\Phi$ leads, under some assumptions, to the following approximation:
$$\hat\Phi - \Phi \simeq \sum_{k\in s}\frac{u_k}{\pi_k} - \sum_{k\in U} u_k, \qquad (1)$$
where $\pi_k$ denotes the first-order inclusion probability for element $k$ under the design $p(\cdot)$. The right term of (1) is the difference between the well-known Horvitz-Thompson estimator and the parameter it estimates, namely the total of the variable $u_k$ over the population $U$. Here, $u_k$ is referred to as the linearized variable of $\Phi$, and the way it is derived depends on the type of linearization method used, which could include the Taylor series (Särndal et al., 1992), estimating equations (Binder, 1983) or influence function (Deville, 1999) approaches. The artificial variable $u_k$ is used to compute the approximate variance of $\hat\Phi$ as
$$AV(\hat\Phi) = \sum_{k\in U}\sum_{l\in U}(\pi_{kl}-\pi_k\pi_l)\,\frac{u_k}{\pi_k}\,\frac{u_l}{\pi_l}, \qquad (2)$$
with $\pi_{kl}$ the joint inclusion probability for the elements $k$ and $l$.
Roughly speaking, when examining (1) and (2), we can see that, if we estimate the total $\sum_{k\in U} u_k$ of the linearized variable efficiently, we will achieve a small approximate variance and good precision for $\hat\Phi$. As stated above, it is well known that auxiliary information is useful for improving the estimation of a total in terms of efficiency and, based on a linear model, the use of a GREG estimator is the most common alternative. When estimating a total, note that the asymptotic variance of the GREG estimator depends on the residuals of the study variable on the auxiliary variable. Because linearized variables may have complicated mathematical expressions, fitting a linear model onto linearized variables may not be the most appropriate choice. This may occur even if the study and the auxiliary variables have a clear linear relationship, as illustrated in the following example. Consider a data set of size 1000 extracted from the French Labor Force Survey and consider $y_k$ (the wages of person $k$ in 2000) as the study variable and $z_k$ (the wages of person $k$ in 1999) as the auxiliary variable. We now consider the problem of estimating the Gini index. The expression of the linearized variable $u_k$, $k\in U$, for the Gini index is given in Binder and Kovacevic (1995) and recalled in equation (17). It is a complex function of the study variable $y_k$, $k\in U$. In the left (resp. right) graphic of Figure 1, the study variable $y_k$ (resp. the linearized variable $u_k$) is plotted on the $y$-axis and the auxiliary variable $z_k$ is plotted on the $x$-axis. The relationship between the study variable and the auxiliary variable is almost linear; however, the relationship between the linearized variable of the Gini index and the auxiliary information is no longer linear. The consequence of this is that we cannot increase the efficiency of estimating a Gini index if we take the auxiliary information into account through a GREG estimator. Therefore, nonparametric models should be preferred to estimate nonlinear parameters.
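As a small numerical illustration of the variance expression referred to as (2), the following sketch evaluates the double sum over the population for simple random sampling without replacement and compares it with the textbook closed form $N^2(1-n/N)S_u^2/n$. The values in `u` are made up for illustration, and the function names are hypothetical; nothing here comes from the French Labor Force Survey data.

```python
def srswor_inclusion_probabilities(N, n):
    """First-order and joint inclusion probabilities for simple random
    sampling without replacement (n units out of N)."""
    pi = [n / N] * N
    pi_joint = [[n / N if k == l else n * (n - 1) / (N * (N - 1))
                 for l in range(N)] for k in range(N)]
    return pi, pi_joint


def approximate_variance(u, pi, pi_joint):
    """Double sum: sum_k sum_l (pi_kl - pi_k pi_l) (u_k / pi_k) (u_l / pi_l)."""
    N = len(u)
    return sum((pi_joint[k][l] - pi[k] * pi[l]) * (u[k] / pi[k]) * (u[l] / pi[l])
               for k in range(N) for l in range(N))


def srswor_closed_form(u, n):
    """Closed-form variance of the Horvitz-Thompson total under SRSWOR."""
    N = len(u)
    mean = sum(u) / N
    s2 = sum((x - mean) ** 2 for x in u) / (N - 1)
    return N ** 2 * (1 - n / N) * s2 / n


# Made-up values standing in for a linearized variable (assumption).
u = [1.0, 4.0, 2.0, 8.0, 5.0]
pi, pi_joint = srswor_inclusion_probabilities(len(u), 2)
av_double_sum = approximate_variance(u, pi, pi_joint)
av_closed_form = srswor_closed_form(u, 2)
```

The two quantities agree, which shows that the variance depends on the data only through the dispersion of the linearized values: any method that replaces them by smaller residuals reduces the approximate variance.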
Recent work already employs nonparametric models to estimate totals (Breidt and Opsomer, 2000, Breidt et al., 2005 and Goga, 2005). The use of nonparametrics prevents model failure; however the improvement over parametric estimation for totals and means may not be significant enough to justify the supplemental difficulties of implementing nonparametric methodology. As illustrated above, the motivation for using nonparametrics becomes much stronger when estimating nonlinear parameters. Note that the use of nonparametric regression to estimate distribution functions and quantiles has also been studied, for example in Johnson et al. (2008); however, to our knowledge, this has not been performed for other nonlinear parameters.
We propose a novel methodology that allows for the efficient estimation of any parameter by combining the functional approach (Deville, 1999) with any of the previously suggested nonparametric methods. One issue with the functional approach is that several technical details are not provided in Deville (1999); thus it is difficult to derive rigorous proofs of asymptotic results by following this approach. In the present paper, we propose to clarify some important points and derive rigorous proofs of our asymptotic results. Most importantly, we prove that the total variation distance between finite measures is an adequate choice for the derivation of asymptotic approximations in this context. Asymptotic results are detailed at length for penalized B-spline nonparametric estimators.
The estimators under study combine two types of nonlinearity: nonlinearity due to the expression of a complex parameter and nonlinearity due to nonparametric estimation. We propose a two-step linearization procedure that provides an approximation of the nonparametric estimator via a Horvitz-Thompson estimator of a total using an artificial variable. Roughly speaking, this artificial variable corresponds to the residuals of the linearized variable on the fitted values under the model. Because the linearized variables depend on the parameter of interest, the residuals will also depend on this parameter. The consequence of this important and general property is that the nonparametric approach helps to get a unique system of weights that may lead to a gain in efficiency for different complex parameters.
The paper is structured as follows: the second section provides some background information on the nonparametric estimation of a finite population total in a general framework. In the third section, a class of nonparametric substitution estimators based on nonparametric regression is introduced. Variance approximations are derived using the influence function linearization approach (Deville, 1999) in a general nonparametric setting. We propose in the fourth section a penalized B-spline model-assisted estimator for the finite population totals which is in fact an extension to a survey sampling framework of the penalized B-spline estimator studied in Claeskens et al. (2009). We prove that the estimator is asymptotically design-unbiased and consistent. Next, we build the nonparametric penalized spline estimation for nonlinear parameters and we assess the validity of the two-step linearization technique. The fifth section defines a class of consistent variance estimators while section six contains a case study. The data set is extracted from the French Labor Force surveys of 1999 and 2000 as presented previously. Asymptotic and finite-sample properties of the regression B-spline estimators are illustrated for the simple random sampling without replacement and the stratified simple random sampling. This section also includes suggestions for practical implementation and guidelines for choosing the smoothing parameters. Finally, section seven concludes this study and the assumptions and the technical proofs together with some discussion are provided in the Appendix.
2 Nonparametric model-assisted estimation of finite population totals
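We focus on the estimation of the total $t_y=\sum_{k\in U} y_k$ of the study variable $\mathcal{Y}$ over the population $U$, taking into account the univariate auxiliary variable $\mathcal{Z}$. The values $z_1,\ldots,z_N$ of $\mathcal{Z}$ are assumed to be known for the entire population.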
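Many approaches can be used to take into account auxiliary information and thus improve on the Horvitz-Thompson estimator $\hat t_{y,HT}=\sum_{k\in s} y_k/\pi_k$, with $\pi_k$ the first-order inclusion probability of unit $k$. The goal is to derive a weighted linear estimator of the total $t_y$ such that the sample weights do not depend on the study variable values but include the values $z_k$ for all $k\in U$. The construction of the model-assisted (MA) class of estimators is based on a superpopulation model $\xi$:
$$y_k = f(z_k) + \varepsilon_k, \quad k\in U, \qquad (3)$$
where the $\varepsilon_k$ are independent random variables with mean zero and variance $v(z_k)$. If $f(z_k)$ were known for all $k\in U$, the total $t_y$ could be estimated by the generalized difference estimator (Cassel et al., 1976),
$$\hat t_{y,\mathrm{diff}} = \sum_{k\in s}\frac{y_k - f(z_k)}{\pi_k} + \sum_{k\in U} f(z_k). \qquad (4)$$
Note that $\hat t_{y,\mathrm{diff}}$ is the difference between the Horvitz-Thompson estimator and its bias under the model $\xi$, namely $\sum_{k\in s} f(z_k)/\pi_k - \sum_{k\in U} f(z_k)$. As a consequence, $\hat t_{y,\mathrm{diff}}$ is unbiased under the model and, moreover, it is unbiased under the sampling design, $E_p(\hat t_{y,\mathrm{diff}}) = t_y$. The variance of $\hat t_{y,\mathrm{diff}}$ under the sampling design is given by
$$\mathrm{Var}_p(\hat t_{y,\mathrm{diff}}) = \sum_{k\in U}\sum_{l\in U}(\pi_{kl}-\pi_k\pi_l)\,\frac{y_k - f(z_k)}{\pi_k}\,\frac{y_l - f(z_l)}{\pi_l}, \qquad (5)$$
which shows clearly that the difference estimator is more efficient than the Horvitz-Thompson estimator if $f(z_k)$ approximates $y_k$ well for all $k\in U$.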
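The design-unbiasedness and the efficiency gain of the generalized difference estimator can be checked exactly on a toy population by enumerating every possible sample. The sketch below assumes simple random sampling without replacement, a made-up population `y`, and made-up fitted values `f` that play the role of a known regression function evaluated at each unit; the helper names are hypothetical.

```python
from itertools import combinations


def difference_estimator(sample, y, f, pi):
    """Horvitz-Thompson estimator of the residual total plus the
    population total of the (assumed known) fitted values."""
    return sum((y[k] - f[k]) / pi[k] for k in sample) + sum(f)


def horvitz_thompson(sample, y, pi):
    """Horvitz-Thompson estimator of the total of y."""
    return sum(y[k] / pi[k] for k in sample)


def design_mean_and_variance(estimator, N, n):
    """Exact design expectation and variance under SRSWOR, obtained by
    enumerating all samples of size n (each has the same probability)."""
    values = [estimator(s) for s in combinations(range(N), n)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# Toy population and fitted values (both are illustrative assumptions).
y = [10.0, 12.0, 15.0, 21.0, 30.0]
f = [9.0, 13.0, 16.0, 20.0, 29.0]
N, n = len(y), 2
pi = [n / N] * N

mean_diff, var_diff = design_mean_and_variance(
    lambda s: difference_estimator(s, y, f, pi), N, n)
mean_ht, var_ht = design_mean_and_variance(
    lambda s: horvitz_thompson(s, y, pi), N, n)
```

Both estimators have design expectation equal to the true total, but the variance of the difference estimator is driven by the residuals `y[k] - f[k]` rather than by the `y[k]` themselves, so it is much smaller when the fit is good.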
In practice, we do not know the true regression function $f$; thus we use an estimator of it. Generally, this estimator is obtained using a two-step procedure: we first estimate $f$ by $\tilde f$ under the model $\xi$, and next we estimate $\tilde f$ by $\hat f$ using the sampling design. Plugging $\hat f$ into (4) yields the final estimator of $t_y$.
The linear regression function yields the generalized regression estimator (GREG) extensively studied by Särndal et al. (1992). The GREG estimator is efficient if the model fits the data well, but if the model is misspecified, the GREG estimator exhibits no improvement over the Horvitz-Thompson estimator and may even lead to a loss of efficiency. One way of guarding against model failure is to use nonparametric regression which does not require a predefined parametric mathematical expression for .
Recently, Breidt and Opsomer (2000) proposed local linear estimators and Breidt et al. (2005) and Goga (2005) used nonparametric spline regression. The unknown function is approximated by the projection of the population vector onto different basis functions, such as the basis of truncated th degree polynomials in Breidt et al. (2005) and the B-spline basis in Goga (2005). In the following, we briefly recall the definition and the main asymptotic properties of nonparametric model-assisted estimators for finite population totals (see also Breidt and Opsomer, 2009).
Let be the estimator of obtained at the population level using one of the three nonparametric methods mentioned above. Plugging into (4) results in the following nonparametric generalized difference pseudo-estimator of the finite population total:
Note that is called a pseudo-estimator because it is not feasible in practice since is unknown. This pseudo-estimator is still design-unbiased but it is model-biased because nonparametric estimators are biased for (Sarda and Vieu, 2000). Nevertheless, under supplementary assumptions (Breidt and Opsomer, 2000 and Goga, 2005), the bias under the model vanishes asymptotically to zero when the population and the sample sizes go to infinity. The unknown quantities are usually obtained by least squares methods (ordinary, weighted or penalized) and we may write
where the dimensional vector depends on the population values as well as on the projection matrix for the considered basis functions, but does not depend on The expression of depends on the chosen nonparametric method, as discussed in Breidt and Opsomer (2000), Breidt et al. (2005) and Goga (2005).
As in the parametric case, we estimate by using the sampling design,
where is the -dimensional design-based estimator of and is the sample restriction of Plugging into (6) yields the following nonparametric model-assisted estimator (NMA)
This estimator can be written as a weighted sum of the sampled observations
where the weights depend only on the sample and on the auxiliary information,
with the dimensional vector of ones, the diagonal matrix with along the diagonal and the matrix having as rows with sample restriction The estimator (10) is a nonlinear function of Horvitz-Thompson estimators, and its asymptotic variance has been obtained on a case-by-case basis. Under mild hypotheses (Breidt and Opsomer, 2000, Breidt et al., 2005 and Goga, 2005), is asymptotically design-unbiased, namely and design -consistent in the sense that
Moreover, it can be approximated by the nonparametric generalized difference estimator
Furthermore, if the asymptotic distribution of is normal , we have that the asymptotic distribution of is also normal where is obtained according to formula (5) applied to residuals This means that the NMA estimators bring an improvement over parametric methods and the Horvitz-Thompson estimator when the relation between and is not linear. In this case, the residuals will be smaller than under a parametric smoother, which explains the diminution of the design variance of NMA estimators. Nevertheless, nonparametric estimators require that the auxiliary information should be known on the whole population unlike the GREG estimator that requires only the finite population total for
The efficiency of NMA estimators depends on the choice of the smoothing parameters. Opsomer and Miller (2005) and Harms and Duchesne (2010) derive the optimal bandwidth for the local polynomial regression, while Breidt et al. (2005) circumvent the issue of the number of knots by introducing a penalty coefficient. They also give a practical method for estimating this penalty.
3 Nonparametric model-assisted estimation of nonlinear finite population parameters
3.1 Definition of the nonparametric substitution estimator
Let us consider the estimation of some nonlinear parameters by taking into account univariate auxiliary information known for all the population units.
Examples of a nonlinear parameter of interest include the ratio, the Gini coefficient and the low-income proportion. A parameter may depend on one or several variables of interest; however, the same auxiliary variable will be used to explain these variables of interest.
We aim to provide a general method for the estimation of using and considering the functional approach introduced by Deville (1999). The methodology consists in considering a discrete and finite measure where is the Dirac measure at the point and is such that there is unity mass on each point with and zero mass elsewhere. Furthermore, we write as a functional of
The nonparametric weights are provided by (11) and is estimated by
Even if these weights are derived to estimate the total they do not depend on the study variable ; thus they can be used to estimate any nonlinear parameter of interest when it can be expressed as a function of Note that is a random measure of total mass equal to
Plugging into (14) provides the following nonparametric substitution estimator for ,
We will now illustrate the computation of using the simple case of a ratio and subsequently the more intricate case of the Gini index and parameters defined by implicit equations.
a. The ratio R between two finite population totals. We write in a functional form as The nonparametric estimator of is easily obtained by replacing the measure with namely A similar estimation of using GREG weights was previously considered by Särndal et al. (1992).
b. The Gini index. The Gini index (Nygard and Sandström, 1985) is given by
where is the empirical distribution function. Again, the nonparametric estimator for is obtained by simply replacing with Hence,
c. Parameters defined by an implicit equation. Let be defined as the unique solution of an implicit estimating equation (Binder, 1983) that may be written in a functional form as We replace with and the nonparametric sample-based estimator of is the unique solution of the sample-based estimating equation An example of such a parameter is the odds-ratio which is extensively used in epidemiological studies. Goga and Ruiz-Gazen (2012) have studied the estimation of the odds-ratio by taking into account auxiliary information and nonparametric regression.
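The plug-in principle in examples a and b can be sketched in a few lines: one weight vector, computed without reference to any study variable, is reused for every parameter. In the sketch below the weights `w` are arbitrary placeholders rather than the nonparametric weights of the paper, and the Gini index is written in the form $\sum_k (2F(y_k)-1)y_k / \sum_k y_k$ with $F$ the weighted empirical distribution function. That form is an assumption on our part, since the displayed formula is not legible in the text above, and the exact finite-population convention may differ.

```python
def weighted_ratio(w, y, x):
    """Plug-in ratio: weighted total of y over weighted total of x."""
    return (sum(wk * yk for wk, yk in zip(w, y))
            / sum(wk * xk for wk, xk in zip(w, x)))


def weighted_gini(w, y):
    """Plug-in Gini index using the weighted empirical distribution
    function F_hat(t) = sum_k w_k 1{y_k <= t} / sum_k w_k (assumed form)."""
    total_w = sum(w)
    total_y = sum(wk * yk for wk, yk in zip(w, y))

    def f_hat(t):
        return sum(wk for wk, yk in zip(w, y) if yk <= t) / total_w

    return sum(wk * (2.0 * f_hat(yk) - 1.0) * yk
               for wk, yk in zip(w, y)) / total_y


# Placeholder weights and data (illustrative assumptions).
w = [2.0, 2.0, 3.0, 3.0]
incomes = [12.0, 30.0, 18.0, 45.0]
hours = [20.0, 35.0, 30.0, 40.0]

ratio_hat = weighted_ratio(w, incomes, hours)
gini_hat = weighted_gini(w, incomes)
```

Because the same `w` enters both functions, switching to another parameter or another study variable requires no new weight computation, which is the practical appeal of a unique weight system in multipurpose surveys.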
3.2 Asymptotic properties of the nonparametric substitution estimator under the sampling design
In this section, we investigate the asymptotic properties of the nonparametric estimator , using the asymptotic framework suggested by Isaki and Fuller (1982). Additionally, we make several assumptions (detailed in the Appendix) regarding the regularity of the functional and the first order inclusion probabilities of the sampling design.
The nonparametric estimator is doubly nonlinear, with nonlinearity due to the parameter and nonlinearity due to the nonparametric estimation. Our main goal is to approximate using a linear estimator (Horvitz-Thompson type) which will allow us to compute the asymptotic variance of This approximation will be accomplished in two steps: first, we will linearize and next, we will linearize the nonparametric estimator obtained in step one.
The first linearization step is a first-order expansion of with the remainder going to zero. The parameter of interest is a statistical functional defined with respect to the measure or equivalently, with respect to the probability measure (by assumption A1). Using the first-order expansion of statistical functionals as introduced by von Mises (1947) and under the assumption of Fréchet differentiability of , the remainder depends on some distance function between and an estimator of this measure (Huber, 1981). Deville (1999) uses these facts to prove the linearization of the Horvitz-Thompson substitution estimator of ; however, no details are given about the considered distance, while Goga et al. (2009) provide only minimal details. In what follows, we provide a distance between and the true which goes to zero when the sample and the population sizes go to infinity.
We consider the total variation distance for two finite and positive measures and to be defined by
with . We first prove (Lemma 1 below) that the distance between the Horvitz-Thompson estimator of and the true goes to zero. Next, we extend the result (Lemma 2 below) to the nonparametric estimator
Let represent the Horvitz-Thompson weights, namely for all and let be the estimator of using these weights. Let and for ease of notation, . Thus, for all uniformly in and
where is the sample membership indicator.
Assume (A3) and (A5) from the Appendix. Then,
Assume (A3) and (A5) from the Appendix. Assume in addition that:
() for all uniformly in .
() uniformly in
The proof is provided in the Appendix. In section 4, we prove that the nonparametric estimator of constructed using B-spline estimators satisfies the assumptions () and () from the above lemma. The results from Breidt and Opsomer (2000) may be used to prove the assumptions for local polynomial regression; however, this issue will not be pursued further here.
To provide the first order expansion of we must also define its first derivative. This derivative is referred to as the influence function and is defined as follows (Deville, 1999)
where is the Dirac measure at point . Note that the above definition is slightly different from the definition of the influence function given by Hampel (1974) in robust statistics, which is based on a probability distribution instead of a finite measure.
Let for all be the influence function computed at , namely
These quantities are referred to as the linearized variables of and serve as a tool for computing the approximative variance of They depend on the parameter of interest and they are usually unknown even for the sampled individuals. Deville (1999) provides many practical rules for computing for rather complicated parameters
Examples. The linearized variable of a ratio is
and for the Gini index, it is given by
where is the mean of lower than
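The definition of the influence function as a derivative with respect to an added point mass can be checked numerically for the ratio. The sketch below perturbs the unit mass at one population point by a small amount `h` and compares the resulting finite-difference slope with the standard closed form $u_k=(y_k-Rx_k)/t_x$. The data, the step size and the function names are illustrative assumptions, and we take $y$ as the numerator variable and $x$ as the denominator variable.

```python
def ratio_functional(mass, y, x):
    """Ratio of totals computed under a discrete measure that puts
    mass[k] on population unit k."""
    return (sum(m * yk for m, yk in zip(mass, y))
            / sum(m * xk for m, xk in zip(mass, x)))


def influence_numeric(k, y, x, h=1e-6):
    """Finite-difference approximation of the derivative of the ratio
    when a small mass h is added at the point of unit k."""
    mass = [1.0] * len(y)
    base = ratio_functional(mass, y, x)
    mass[k] += h
    return (ratio_functional(mass, y, x) - base) / h


def linearized_ratio(k, y, x):
    """Closed-form linearized variable of the ratio: (y_k - R x_k) / t_x."""
    t_x = sum(x)
    r = sum(y) / t_x
    return (y[k] - r * x[k]) / t_x


# Toy population (illustrative assumption).
y = [3.0, 5.0, 8.0, 4.0]
x = [2.0, 4.0, 5.0, 3.0]

numeric = [influence_numeric(k, y, x) for k in range(len(y))]
closed_form = [linearized_ratio(k, y, x) for k in range(len(y))]
```

The two lists agree up to the finite-difference error. For the ratio the linearized values also sum to zero over the population, so the total they define carries only the sampling error of the estimator.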
We now provide the main result of this paper. The following theorem is the first linearization step of . This proves that under broad assumptions the nonparametric estimator is approximated by the nonparametric estimator for the population total of the linearized variable. The proof is provided in the Appendix.
(First linearization step) Assume (A1)-(A3) and (A5) from the Appendix. Additionally assume () and from lemma 2. Then, the nonparametric substitution estimator fulfills
We can put in the form of an NMA estimator. Let denote Using (11), we can write
where with is given by (8) and is the sample restriction of
Remark 1: A model-based interpretation of may be given. For the nonparametric model , the linearized variable can be fitted using the auxiliary variable
where the are independent random variables with mean zero and variance The estimator of under the model denoted by , is obtained using the same nonparametric method employed for estimating under the model This implies that is the best fit of the population vector with given by (7). Furthermore, is estimated by which leads to the pseudo-estimator of However, unlike the linear case, is not an estimate of because the sample linearized variable vector is not known and we refer to it as a pseudo-estimator. Remark also that the estimator is efficient if the nonparametric model holds.
The nonparametric pseudo-estimator given by (18) is a nonlinear function of Horvitz-Thompson estimators; however, it estimates a linear parameter of interest, namely the total of This indicates that is similar to estimators used by Breidt and Opsomer (2000), Breidt et al. (2005) and Goga (2005) although it is computed for the artificial variable The second linearization step approximates by the generalized difference estimator of given by
(Second linearization step) Assume that Then,
Moreover, if the asymptotic distribution of is then the asymptotic distribution of is also
In section 4, we provide the necessary assumptions for the linearized variables and the auxiliary variable to obtain an approximation of by in a B-spline estimation context.
Remark 2. When the linearized variable is a linear combination of the study variables, the assumption from proposition 4 is reduced to assumptions on the study variables. For example, this occurs in the case of a ratio where the linearized variable is given by The error can be written as a linear combination of errors between and , respectively.
Using mild regularity assumptions on and on the sampling design, and are shown to be of order (see Fuller, 2009, for linear regression and section 4 for B-spline estimators). Thus is also of order provided that and are bounded.
Remark 3. The asymptotic variance given by theorem 3 and proposition 4 depends on the population residuals of the linearized variables under the model . For the simple case of a ratio, the relationship between and the study variables is explicit and given by . If linear models fit the data and well, then a linear model will also fit well. Nevertheless, for nonlinear parameters such as the Gini index, the relationship between and the study variable is not as simple as that for the ratio. In such situations, the use of nonparametric regression methods may provide a major improvement with respect to variance compared to parametric regression.
4 Penalized B-spline estimators
Spline functions have many attractive properties, and they are often used in practice due to their good numerical features and ease of implementation. We suppose without loss of generality that all have been normalized and lie in For a fixed the set of spline functions of order with equidistant interior knots is the set of piecewise polynomials of degree that are smoothly connected at the knots (Zhou et al., 1998),
For is the set of step functions with jumps at knots. For each fixed set of knots, is a linear space of functions of dimension . A basis for this linear space is provided by the B-spline functions (Schumaker, 1981, Dierckx, 1993) defined by
where if and zero, otherwise. For all each function has the knots with for (Zhou et al., 1998) which means that its support consists of a small, fixed, finite number of intervals between knots. Moreover, B-splines are positive functions with a total sum equal to unity:
For the same order and the same knot location, one can use the truncated power basis (Ruppert and Carroll, 2000) given by . The B-spline and the truncated power bases are equivalent in the sense that they span the same set of spline functions (Dierckx, 1993). Nevertheless, as indicated by Ruppert et al. (2003), “the truncated power bases have the practical disadvantage that they are far from orthogonal”, which leads to numerical instability especially if a large number of knots is used.
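The local support and sum-to-one properties of B-splines can be verified with a direct implementation of the Cox-de Boor recursion. The sketch below assumes the auxiliary values are scaled to $[0,1]$, uses equidistant interior knots with the boundary knots repeated `order` times, and returns the values of all basis functions at a single point. The function name and the boundary-knot convention are our own assumptions and are not taken from the paper.

```python
def bspline_basis(x, num_interior_knots, order):
    """Values at x (in [0, 1]) of the B-spline basis of the given order
    (degree = order - 1) with equidistant interior knots.
    Returns a list of length num_interior_knots + order."""
    K, m = num_interior_knots, order
    # Extended knot sequence: boundary knots repeated m times (assumption).
    knots = [0.0] * m + [(j + 1) / (K + 1) for j in range(K)] + [1.0] * m

    # Order-1 functions: indicators of the knot intervals. The right
    # end point x = 1 is assigned to the last non-empty interval.
    values = []
    for j in range(len(knots) - 1):
        left, right = knots[j], knots[j + 1]
        inside = (left <= x < right) or (x == 1.0 and left < right and right == 1.0)
        values.append(1.0 if inside else 0.0)

    # Cox-de Boor recursion, raising the order from 2 up to m.
    for r in range(2, m + 1):
        raised = []
        for j in range(len(values) - 1):
            d1 = knots[j + r - 1] - knots[j]
            d2 = knots[j + r] - knots[j + 1]
            a = (x - knots[j]) / d1 * values[j] if d1 > 0 else 0.0
            b = (knots[j + r] - x) / d2 * values[j + 1] if d2 > 0 else 0.0
            raised.append(a + b)
        values = raised
    return values


quadratic = bspline_basis(0.37, 4, 3)   # 4 interior knots, order 3
steps = bspline_basis(0.37, 4, 1)       # order 1: step functions
```

At any point, at most `order` basis functions are nonzero and their values sum to one. For order one the basis reduces to indicators of the intervals between knots, which is the step-function case mentioned above.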
4.1 Nonparametric penalized spline estimation for finite population totals
We now consider the superpopulation model given by (3). To estimate the regression function we use spline approximation and a penalized least squares criterion. We define the spline basis vector of dimension as The penalized spline estimator of is given by with as the least squares minimizer of
where represents the -th derivative with The solution of (21) is a ridge-type estimator,
where is the matrix with rows and the matrix is the squared norm applied to the th derivative of . Because the derivative of a -spline function of order may be written as a linear combination of -spline functions of order , for equidistant knots (Claeskens et al., 2009) where the matrix has elements with as the -spline function of order and as the matrix corresponding to the th order difference operator.
The amount of smoothing is controlled by The case results in an unpenalized B-spline estimator the asymptotic properties of which have been extensively studied in the literature (Agarwal and Studden, 1980, Burman, 1991, and Zhou et al., 1998, among others). The case is equivalent to fitting a th degree polynomial. The theoretical properties of penalized splines with have been studied only recently by Cardot (2000), Hall and Opsomer (2005), Kauermann et al. (2009) and Claeskens et al. (2009).
The design-based estimators of are
where is the design-based estimator of and is the matrix given by We note that may be written as in formula (8) for
Finally, the -spline NMA estimator of is as follows:
This indicates that may be written as a GREG estimator that uses the vectors as regressors of dimension with going to infinity and a ridge-type regression coefficient Furthermore, is a weighted sum of sampled values with weights expressed as in (11),
For we obtain the unpenalized B-spline estimator studied by Goga (2005) and called regression splines. The B-spline property given in (20) may be written as with the dimensional vector of ones, implying that and Using these two relations in (25) (Goga, 2005), we observe that is equal to the finite population total of the prediction
where the weights are given by,
Note the similarity with the GREG weights obtained in the case of a linear model when the variance of errors is linearly related to the auxiliary information (Särndal, 1980). We note that for a B-spline of order the estimator becomes the well-known poststratified estimator (Särndal et al., 1992).
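The poststratified special case is easy to make concrete. With step-function B-splines and no penalty, the cells are the intervals between knots, and the standard poststratified weight of a sampled unit in cell $h$ is $N_h/(\pi_k \hat N_h)$, where $N_h$ is the population count of the cell and $\hat N_h$ its Horvitz-Thompson estimate. The sketch below uses made-up auxiliary values and a single interior knot; the function names are hypothetical.

```python
def cell_index(z, knots):
    """Index of the interval between knots that contains z
    (intervals are closed on the left)."""
    return sum(1 for t in knots if z >= t)


def poststratified_weights(z_sample, pi_sample, z_population, knots):
    """Weights N_h / (pi_k * N_hat_h) for each sampled unit, where h is
    the cell of the unit. They depend on the auxiliary variable only."""
    n_cells = len(knots) + 1
    pop_counts = [0] * n_cells
    for z in z_population:
        pop_counts[cell_index(z, knots)] += 1
    est_counts = [0.0] * n_cells
    for z, p in zip(z_sample, pi_sample):
        est_counts[cell_index(z, knots)] += 1.0 / p
    weights = []
    for z, p in zip(z_sample, pi_sample):
        h = cell_index(z, knots)
        weights.append(pop_counts[h] / (p * est_counts[h]))
    return weights


# Made-up auxiliary values on [0, 1] and one interior knot (assumptions).
z_population = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.55]
z_sample = [0.1, 0.3, 0.6, 0.9]
pi_sample = [0.4] * 4
weights = poststratified_weights(z_sample, pi_sample, z_population, [0.5])
```

When every cell contains at least one sampled unit, the weights within each cell add up to the known population count of that cell, so they also add up to the population size. A total is then estimated by the weighted sum of the sampled study values, whatever the study variable.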
Based on assumptions regarding the sampling design and the variable (assumptions (A3)-(A5) from the Appendix) and assumptions regarding the distribution of and the knot number (assumptions (B1)-(B2) in the Appendix), Goga (2005) proved that the B-spline estimator for the total is asymptotically design-unbiased and consistent (equation (12)) and may be approximated by a nonparametric generalized difference estimator (equation (13)). These results are valid without supplementary assumptions regarding the smoothness of the regression function
Penalized splines using truncated polynomial basis functions
Let be the vector basis and let with be the least squares minimizer of for The solution is given by
with and the penalty matrix having zeros on the diagonal followed by ones, Note that for we obtain the same prediction as with an unpenalized B-spline estimation. This result follows from the fact that the two bases are equivalent; thus there exists a square and invertible transition matrix such that (Ruppert et al., 2003). For we have which indicates equivalency to the estimator obtained with penalized B-spline fitting given by (22) for (see Claeskens et al. (2009) for the expression of satisfying this equation).
In a design-based approach, Claeskens et al. (2005) proved that the NMA estimator