Calibration estimation in dual frame surveys

# Calibration estimation in dual frame surveys

M. Giovanna Ranalli Department of Political Sciences, Università degli Studi di Perugia, Italy, giovanna.ranalli@stat.unipg.it    Antonio Arcos Department of Statistics and Operational Research, Universidad de Granada, Spain    María del Mar Rueda Department of Statistics and Operational Research, Universidad de Granada, Spain    Annalisa Teodoro Department of Economics, Finance and Statistics, Università degli Studi di Perugia, Italy
###### Abstract

Survey statisticians make use of the available auxiliary information to improve estimates. One important example is given by calibration estimation, that seeks for new weights that are close (in some sense) to the basic design weights and that, at the same time, match benchmark constraints on available auxiliary information. Recently, multiple frame surveys have gained much attention and became largely used by statistical agencies and private organizations to decrease sampling costs or to reduce frame undercoverage errors that could occur with the use of only a single sampling frame. Much attention has been devoted to the introduction of different ways of combining estimates coming from the different frames. We will extend the calibration paradigm, developed so far for one frame surveys, to the estimation of the total of a variable of interest in dual frame surveys as a general tool to include auxiliary information, also available at different levels. In fact, calibration allows us to handle different types of auxiliary information and can be shown to encompass as a special cases some of the methods already proposed in the literature. The theoretical properties of the proposed class of estimators are derived and discussed, a set of simulation studies is conducted to compare the efficiency of the procedure in presence of different sets of auxiliary variables. Finally, the proposed methodology is applied to data from the Barometer of Culture of Andalusia survey.

Keywords: Auxiliary information, Kullback-Leibler distance, Raking ratio, Regression estimation, Survey Methodology.

## 1 Introduction

A main aim of survey statisticians is to obtain more accurate estimates, without increasing survey costs. Two popular tools to achieve this goal are the use of more than one population frame to select independent samples and the use of auxiliary information either at the design or at the estimation stage. The use of more than one list of population units is important because a common practical problem in conducting sample surveys is that frames may be incomplete or out of date, so that resulting estimates may be seriously biased. Multiple frame surveys are useful when no single frame covers the whole target population but the union of several available frames does, or when information about a subgroup of particular interest comes only from an incomplete frame. They also have other advantages. In fact, hartley1962multiple introduces dual frame surveys as a cost-saving device, showing that they can often achieve the same precision as a single-frame survey at a much reduced cost. kalton1986sampling suggest using two frames for sampling rare populations where even greater efficiencies can be obtained. Several estimators of the population total and mean have been proposed in the literature in dual frame surveys, usually classified, according to the level of frame information needed, as dual-frame and single-frame estimators.

On the other hand, the growing availability of information coming from census data, administrative registers and previous surveys provide a wide range of variables, concerning the population of interest, that are eligible to be employed as auxiliary information to increase efficiency in the estimation procedure. In this scenario, a very relevant example is given by calibration estimation that adjusts basic design weights to account for auxiliary information and meet benchmark constraints on auxiliary variables population statistics (Deville1992calibration). sarndal2007calibration provides an overview on developments in calibration estimation. In this paper, we will show how to extend calibration estimation to handle estimation from two frame surveys and how different types of auxiliary information can be easily integrated in the calibration process as benchmark constraints. Moreover, depending on the information available at the design stage, we show how to build calibration estimators under both the dual and the single frame approach. We will show that the proposed class of calibration estimators encompasses as particular cases some of the estimators already proposed in the literature. To show evidence of such connections, we will follow the minimum distance approach for calibration estimation, although using the instrumental variable approach is of course possible.

The paper is organized as follows. In Section 2 notation is introduced and those methods proposed in the literature to handle dual frame estimation are briefly reviewed. Then Section 3 illustrates the proposed class of calibration estimators by first dealing with the dual-frame approach and then moving, in Section 4, to the single-frame approach. The general form is provided and particular cases are derived according to relevant examples of auxiliary information. The theoretical properties of the proposed estimators are investigated in an asymptotic framework adapted from that of isaki1982survey. In addition, analytic and Jackknife variance estimators are proposed. Then, Section 6 reports the results of an extensive simulation study run on a set of synthetic finite populations in which the performance of the proposed class of estimators is investigated for finite size samples. Section 7 shows the application of the proposed estimation technique to data from the Barometer of Culture of Andalusia survey. Section 8 provides some conclusions and directions for future research.

## 2 Estimation in dual frame surveys

Consider a finite set of population units identified by the integers, , and let and be two sampling-frames, both can be incomplete, but it is assumed that together they cover the entire finite population. Let be the set of population units in frame and the set of population units in frame . The population of interest, , may be divided into three mutually exclusive domains, and . Because the population units in the overlap domain can be sampled in either survey or both surveys, it is convenient to create a duplicate domain , which is identical to , to denote the domain in the overlapping area coming from frame . Let , , , , , , be the number of population units in , , , , , , , respectively. It follows that , and .

Let be a variable of interest in the population and its value on unit , for . The entire set of population values is our finite population . The objective is to estimate the finite population total of , that can be written as

 Y=Ya+ηYab+(1−η)Yba+Yb, (1)

where , and , , and . Two probability samples and are drawn independently from frame and frame of sizes and , respectively. Each design induces first-order inclusion probabilities and , respectively, and sampling weights and . Units in can be divided as , where and . Similarly, , where and . Note that and are both from the same domain , but is part of the frame sample and is part of the frame sample. In this way, we have a sort of “poststratified” sample with “poststratum” sample sizes , , and . Note that and (see rao2010pseudo).

The hartley1962multiple estimator of is given by

 ^YH(η)=^Ya+η^Yab+(1−η)^Yba+^Yb, (2)

where is the Horvitz-Thompson estimator for the total of domain and similarly for the other domains. If we let

 d∘k=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩dAkif k∈saηdAkif k∈sab(1−η)dBkif k∈sbadBkif k∈sb

then . In the following, we will drop for ease of notation. Since each domain is estimated by its Horvitz-Thompson estimator, is an unbiased estimator of for a given . Since frames and are sampled independently, the variance of is given by

 V(^YH)=V(^Ya+η^Yab)+V((1−η)^Yba+^Yb), (3)

where the first component of the right hand side is computed under (the sampling design in frame ) and the second one under , and both are always understood conditional on the finite population .

Choice of a value for has attracted much attention in literature; the value of that minimizes the variance in (3) depends on unknown population variances and covariances and, when estimated from the data, it depends on the values of the variable of interest. This implies a need to recompute weights for every variable of interest , which will be inconvenient in practice for statistical agencies conducting surveys with numerous variables and lead to inconsistencies in the estimates (see lohr2009multiple, for a review).

The estimator developed by fuller1972estimators incorporates information regarding the estimation of to improve over , but has the drawback of not being a linear combination of values, unless using simple random sampling. skinner1996estimation propose a modification of the estimator proposed by fuller1972estimators for simple random sampling to handle complex designs. They introduce a pseudo maximum likelihood (PML) estimator that does not achieve optimality like the FB estimator, but it can be written as a linear combination of the observations and the same set of weights can be used for all variables of interest.

Recently, rao2010pseudo extend the Pseudo-Empirical-Likelihood approach (PEL) proposed by wu2006pseudo from one-frame surveys to dual-frame surveys following a stratification approach. They consider estimation of the population mean of ,

 ¯Y=Wa¯Ya+Wab(η)¯Yab+Wba(η)¯Yba+Wb¯Yb,

where , , and , , and again is a fixed constant to be specified. The PEL function takes the following expression:

 lD(pak,pabk,pbak,pbk) = n[Wa∑k∈sa~daklog(pak)+Wab(η)∑k∈sab~dabklog(pabk)+ (4) +Wba(η)∑k∈sba~dbaklog(pbak)+Wb∑k∈sb~dbklog(pbk)],

for all , where , , , and . The four sets of probability measures in (4) are found by maximizing the PEL function under the following normalizing constraints

 ∑k∈sapak=1,∑k∈sabpabk=1,∑k∈sbapbak=1,∑k∈sbpbk=1,

and the constraint induced by the common domain mean

 ∑k∈sabpabkyk=∑k∈sbapbakyk. (5)

The maximum PEL estimator of is then computed as

 ^¯YP=Wa^¯Ya+Wab(η)^¯Yab+Wba(η)^¯Yba+Wb^¯Yb, (6)

where , and because of constraint (5). Situations in which population domain sizes are not known are sketched and the choice of is also discussed.

When inclusion probabilities in domain are known for both frames, and not just for the frame from which the unit was selected, single-frame methods can be used that combine the observations into a single dataset and adjust the weights in the intersection domain for multiplicity. In particular, observations from frame and frame are combined and the two samples drawn independently from and are considered as a single stratified sample over the three domains , and . To adjust for multiplicity, the weights are defined as follows for all units in frame and in frame ,

Note that units in the overlap domain, which are expected to be selected a number of times given by have equal weights in frame and in frame . The estimator proposed by kalton1986sampling is essentially an Horvitz-Thompson estimator for which

 ^YS=∑k∈sd⋆kyk. (7)

Its variance is given by , where the first component of the right hand side is computed under and the second one under . If and were known, the single-frame estimator could be adjusted using raking ratio estimation (bankier1986estimators; skinner1991efficiency).

In the following section calibration estimation for dual frame surveys is introduced. We will first consider dual-frame methods and, then, to encompass situations in which auxiliary information is also in the form of inclusion probabilities for all units in both frames from both sampling design, single-frame methods will be considered as well (Section 4).

## 3 Calibration estimation: dual-frame methods

In this section, we will show how to extend calibration estimation, as discussed in one frame surveys by Deville1992calibration, to handle estimation from two frame surveys and how different types of auxiliary information can be easily integrated in the calibration process as benchmark constraints. Now, let be the value taken on unit by a vector of auxiliary variables of which we assume to know the population total . This vector of totals may pertain only , only , the entire population , or a combination of the three. We will first look at a general formulation of the problem, and then provide (relevant) examples of auxiliary vectors . Using the calibration paradigm, we wish to modify, as little as possible, basic Hartley weights to obtain new weights , for to account for auxiliary information and derive a more accurate estimation of the total . A general dual-frame calibration estimator can be defined as

 ^Y\scriptsize CAL=∑k∈sw∘kyk (8)

where is such that

 min∑k∈sG(w∘k,d∘k)s.t.∑k∈sw∘k\boldmathxk=\boldmathtx, (9)

where is a distance measure satisfying the usual conditions required in the calibration paradigm (see e.g. Deville1992calibration, Section 2). Note that is the Hartley estimator of for a given . Then , where and denotes the inverse function of . The vector is determined using

 ϕs(\boldmathλ)=∑k∈sd∘k[F(% \boldmathxk\boldmathλ)−1]\boldmathxTk,

so that .

Given the set of constraints, different calibration estimators are obtained by using different distance measures. In many instances, numerical methods are required to solve the the minimization problem in (9). However, it is well known that, if we take the Euclidean (or -statistic) type of distance function , equivalent to the linear method in deville1993generalized, we can obtain an analytic solution. In particular,

 w∘k=d∘k(1+\boldmathxk\boldmathλ) (10)

and, substituting this value in the calibration constraint in (9), we obtain . Substituting this value back into equation (10), the weights take the following form

 w∘k=d∘k[1+(\boldmathtx−^%\boldmath$t$xH)(∑k∈sd∘k\boldmathxTk\boldmathxk)−1\boldmathxTk]. (11)

In this case, estimator can be written as:

 ^Y\scriptsize CAL = ∑k∈sw∘kyk=^YH+(\boldmathtx−^\boldmathtxH)^\boldmathβ% ∘ = N∑k=1\boldmathxk^\boldmathβ∘+∑k∈sd∘k(yk−\boldmathxk^\boldmathβ∘),

where This estimator takes the form of a generalized regression type estimator for dual frame surveys and will be denoted by . These results are in line with those of calibration estimator in one frame surveys: Horvitz-Thompson estimators of and , and in the regression coefficient are here replaced by Hartley estimators.

Now, if we take as distance function the Kullback-Leibler divergence defined as

 G(w∘k,d∘k)=−d∘klog(w∘k/d∘k)+w∘k−d∘k, (12)

that is Case 4 distance examined in Deville1992calibration, then and numerical methods are required. It can be noted that maximizing the PEL function in (4) is equivalent to minimizing (12) given the same set of starting weights and set of constraints. This equivalence was already noted in one frame surveys by deville2005talk.

The calibration process induces a different final value for the weights which depends on both the distance measure used and the benchmark constraints applied. On the other hand, given a value for , the final set of weights does not depend on the values of the variables of interest and can be, therefore, used for all variables of interest. When a value for is to be computed from the sample data, then it is essential to consider proposals based on estimators of , and as the one in, e.g., skinner1996estimation so that it is the same for all variables of interest. In the following, we consider some relevant examples of the form taken by the calibration estimator according to the auxiliary information available. Then, the theoretical properties are proven in Section 3.6.

### 3.1 Na, Nb and Nab all known

Suppose that the dimension of the three sets , and is known. Then, we can build the auxiliary vector using domain membership indicator variables, i.e.

 \boldmathxk=(δk(a),δk(ab),δk(ba),δk(b)),fork=1,…,N, (13)

where if and otherwise, if and otherwise, if and otherwise and if and otherwise. In order to have final weights that can be used directly to estimate population totals as in equation (8), we will let be the vector of known totals. In this case the calibration constraints are given by

 ∑k∈saw∘k=Na,∑k∈sabw∘k=ηNab,∑k∈sbaw∘k=(1−η)Nba,∑k∈sbw∘k=Nb, (14)

and the minimization problem has an analytic solution irrespective of the distance function employed. Such solution is given by

 w∘k=⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩dAkNa/^Naif k∈saηdAkNab/^Nabif k∈sab(1−η)dBkNba/^Nbaif k∈sbadBkNb/^Nbif k∈sb, (15)

where , , and . Note that these weights provide Hájek type estimators for each domain and mirror the result provided in deville1993generalized when dealing with the calibration estimator in case auxiliary information consists of known cell counts in a frequency table. deville1993generalized denote this case as complete post-stratification, that is when all the domain sizes are known and used for calibration. Note that, given that we are estimating totals in domains using ratio type estimators, the sample size of the domains is important to avoid the introduction of possible bias in the final estimates.

### 3.2 Na, Nb known and Nab unknown

Following the terminology of deville1993generalized, we call the case treated in this section as incomplete post-stratification and we mean that not all the domain sizes are known, in particular we know only the size of frame and of frame , but we don’t known the size of the overlap domain . In this case, for , we can write the vector of auxiliary information as:

 \boldmathxk=(δk(a)+δk(ab)+δk(ba),δk(b)+δk(ab)+δk(ba)). (16)

The vector of known totals in this case is and we have the following calibration constraints

 ∑k∈saw∘k+∑k∈sabw∘k+∑k∈sbaw∘k=NA ∑k∈sbw∘k+∑k∈sabw∘k+∑k∈sbaw∘k=NB,

in which we, in some sense, calibrate on the margins. Final calibration weights are no longer independent from the distance function used and it is not possible to obtain an analytical expression unless we use the Euclidean distance. In this latter case we obtain the following estimator of :

 ^Nwab=^Nab,H^NaNB+^NbNA−^Na^Nb^Na^NB+^Nb^NA−^Na^Nb, (17)

where . We can note how the calibration procedure adjusts the Hartley estimator of accounting for auxiliary information.

rao2010pseudo also consider the case in which is unknown. However, they do not estimate it from within the maximum PEL procedure, but they first estimate it by , where and denotes variance estimates. Then, they take a pseudo-complete post stratification approach by suitably modifying the likelihood function.

### 3.3 Population totals for group membership indicators are known

Let the population be divided into mutually exclusive groups , for such that and let be the indicator variable that takes value 1 if unit and 0 otherwise, for and . Then, and . Now, consider the situation in which we know the population total of such indicator variables for each of the four domains, i.e. , , , , for . Note that and similarly for the other cases. In practice, this would mean that we know, say the number of units for each of age-sex groups in the population for each of the four domains. This amount of auxiliary information of course implies that we also know the dimension of the three sets , and considered in the Section 3.1. Indeed, that is a special case of the present one.

In this case the vector of auxiliary variables is defined for by

 \boldmathxk={(δk(a)δk(h),δk(ab)δk(h),δk(ba)δk(h),δk(b)δk(h)}h=1,…,H

and the vector of known totals is set to be . As in Section 3.1 the minimization problem has an analytic solution irrespective of the distance function employed. Such solution is given by

 (18)

where and similarly for the other size estimators. This is another case of complete post-stratification. The final estimator will be more efficient than the Hartley estimator as much as groups collect units with a similar value of the variable of interest.

When, on the other side, we only know the population total in frame and in frame , i.e. we do not know the distribution for the intersection domain , then we are again in a situation of incomplete post-stratification, like that of Section 3.2. Here,

 \boldmathxk={[δk(a)+δk(ab)+δk(ba)]δk(h),[δk(b)+δk(ab)+δk(ba)]δk(h)}h=1…,H

and . We have an analytic solution for the form of the weights only for the Euclidean distance case, but it does not take a simple tractable form as that considered in Section 3.2. A similar situation arises also when, as in the case considered later in the application (Section 7), we do not know the distribution for, say, age-sex groups, but we know only the total for age and the total for sex, in each of the two frames and . This is another example of incomplete post-stratification, that employs a form of raking (depending on the distance function employed) to obtain the final set of weights (see also examples in Section 4).

### 3.4 Na, Nb, Nab known and Xa known

Suppose that we know not only the frame sizes , and , but, also the population total of an auxiliary numerical variable correlated to the study variable and relative to frame , whose total is . In this case the vector of auxiliary variables is defined for by

 \boldmathxk=(δk(a),δk(ab),δk(ba),δk(b),[δk(a)+δk(ab)+δk(ba)]xAk)

and the calibration constraints are those in (14) plus

 ∑k∈saw∘kxAk+∑k∈sabw∘kxAk+∑k∈sbaw∘kxAk=XA. (19)

Again, it is not possible to obtain an analytic expression for the calibration weights unless we use the Euclidean distance for the Lagrange function. It can be shown that, in this case, the calibrated weights for are such that

 w∘k=dAk[Na^Na+λ(^Xa^Na−xAk)], (20)

where is the Lagrange multiplier for the last constraint in (19) given by

 λ=XA−^XA,\scriptsize H\'{a}j^S2a,x+η^S2ab,x+(1−η)^S2ba,x

where is a Hartley type estimator in which each component is estimated using the Hájek estimator, and similarly for and . Calibrated weights for and for are similar to those in (20) but with quantities referred to the appropriate domain, while weights for are the same as in (15). With such weights, the resulting calibration estimator resembles a combined regression estimator; in fact

 ^Y\scriptsize CAL=^Y\scriptsize H\'{a}j+(XA−^XA,\scriptsize H\'{a}j)^βA

where is the Hartley estimator of in which each component is estimated by its Hájek estimator, while

 ^βA=^Sa,xy+η^Sab,xy+(1−η)^Sba,xy^S2a,x+η^S2ab,x+(1−η)^S2ba,x,

with and similarly for and .

### 3.5 Other examples

The cases previously discussed are only a few examples of the very many possible ones that can be treated with calibration. The calibration approach is very flexible and can also handle both indicator and numerical variables simultaneously. Next we provide some details on how to construct the auxiliary vector and the vector of control totals for other interesting cases in practice; some of these cases will be used in the simulation study and in the application.

, , known and known.

Suppose that we know the frame sizes , and , and let the population total of an auxiliary numerical variable be available for the whole population and not only for frame as in the previous section. The auxiliary vector is thus and the calibration constraints are those in (14) plus

, , known and and known.

Suppose that we know the frame sizes , and the population total of an auxiliary numerical variable relative to frame , whose total is and the population total of another auxiliary numerical variable relative to frame , whose total is . The auxiliary vector is

 \boldmathxk=(δk(a)+δk(ab)+δk(ba),δk(b)+δk(ab)+δk(ba),[δk(a)+δk(ab)+δk(ba)]xAk,[δk(b)+δk(ab)+δk(ba)]zBk)

and the vector of known totals in this case is , which allows us to write the following calibration constraints

 ∑k∈saw∘k+∑k∈sabw∘k+∑k∈sbaw∘k=NA ∑k∈sbw∘k+∑k∈sabw∘k+∑k∈sbaw∘k=NB, ∑k∈saw∘kxAk+∑k∈sabw∘kxAk+∑k∈sbaw∘kxAk=XA ∑k∈sbw∘kzBk+∑k∈sabw∘kzBk+∑k∈sbaw∘kzBk=ZB. (21)
, , known and , known.

When we know the frame sizes , and and the population totals of the same auxiliary variable in the two frames and , the auxiliary vector is

 \boldmathxk=(δk(a),δk(ab),δk(ba),δk(b),[δk(a)+δk(ab)+δk(ba)]xk,[δk(b)+δk(ab)+δk(ba)]xk)

and the vector of known totals in this case is .

### 3.6 Asymptotic properties of ^Y\scriptsize CAL

To show the asymptotic properties of the general calibration estimator we adapt and place ourselves in the asymptotic framework of isaki1982survey, in which the dual-frame finite population and the sampling designs and are embedded into a sequence of such populations and designs indexed by , , with . We will assume therefore, that and tend to infinity and that also and tend to infinity as . We will further assume that and . In addition , where , , as . Subscript may be dropped for ease of notation, although all limiting processes are understood as . Stochastic orders and are with respect to the aforementioned sequences of designs. The constant is kept fixed over repeated sampling. In order to prove our results, we make the following technical assumptions.

###### A   1.

Let . Assume that exists; the distribution of and of , and the sampling designs are such that is consistently estimated by and is consistently estimated by .

###### A   2.

The limiting design covariance matrix of the normalized Hartley estimators,

 \boldmathΣ=[Σyy\boldmathΣxy\boldmathΣTxy\boldmathΣxx]=limN→∞nNN2[V(^YH)\boldmathC(^\boldmathtxH,^YH)\boldmathC(^\boldmathtxH,^YH)T\boldmathV(^\boldmathtxH)]

is positive defined.

###### A   3.

The normalized Hartley estimators of and are such that a central limit theorem holds:

 √nNN[∑k∈sd∘kyk−Y∑