Abstract
The Millennium Villages Project (MVP) is a tenyear integrated rural development project implemented in ten subSaharan African sites. At its conclusion we will conduct an evaluation of its causal effect on a variety of development outcomes, measured via household surveys in treatment and comparison areas. Outcomes are measured by six survey modules, with sample sizes for each demographic group determined by budget, logistics, and the group’s vulnerability. We design a sampling plan that aims to reduce effort for survey enumerators and maximize precision for all outcomes. We propose twostage sampling designs, sampling households at the first stage, followed by a second stage sample that differs across demographic groups. Twostage designs are usually constructed by simple random sampling (SRS) of households and proportional withinhousehold sampling, or probability proportional to size sampling (PPS) of households with fixed sampling within each. No measure of household size is proportional for all demographic groups, putting PPS schemes at a disadvantage. The SRS schemes have the disadvantage that multiple individuals sampled per household decreases efficiency due to intrahousehold correlation. We conduct a simulation study (using both design and modelbased survey inference) to understand these tradeoffs and recommend a sampling plan for the Millennium Villages Project. Similar design issues arise in other studies with surveys that target different demographic groups.
MVP_paper_Lancet_FINAL_March14_2015_Revision1
Design of the Millennium Villages Project Sampling Plan:
a simulation study for a multimodule survey
Shira Mitchell, Rebecca Ross, Susanna Makela, Elizabeth A. Stuart, Avi Feller, Alan M. Zaslavsky, Andrew Gelman
1 Background
The Millennium Villages Project (MVP) is an economic development project that targets rural populations across ten countries in subSaharan Africa, implementing a multisector package of interventions at a village level (Sachs and McArthur, 2005; Sanchez et al., 2007). See Mitchell et al. (2015a) for background on the project, study site selection, outcomes of interest, and a comprehensive description of the plan to evaluate its effectiveness. Mitchell et al. (2015b) describe our plan for causal inference about the MVP’s effect on a variety of development outcomes measured in different demographic groups. These outcomes will be measured via survey modules administered in both treatment and comparison villages.
A design analysis described in Mitchell et al. (2015b) was used to recommend the number of control villages and magnitude of sampling in each. Next, we must determine how to select households and individuals within households. We propose a twostage sample: households will be sampled in stage I, followed by individuals within households in stage II (Lohr, 2010; Särndal et al., 1992). In the first stage, we must decide between simple random sampling and probability proportional to size sampling of households. Because the project operates at the village level, a sampling plan that efficiently estimates outcome means per village is an efficient sampling plan for the overall causal evaluation. In this paper we conduct a simulation study to decide on a sampling plan for estimating finite population village means.
We aim to minimize the design effect, the ratio between the actual and effective sample sizes. One factor in determining the efficiency of a sampling design is the intraclass correlation, i.e. the correlation among individuals within a household. If more than one individual is sampled per household, the intraclass correlation increases the design effect, reducing the effective sample size relative to the actual sample size.
Another factor in the efficiency of a sampling design is the distribution of individuals’ sampling probabilities. Sampling probabilities can be optimized for a specific outcome, e.g. by sampling with probability approximately proportional to the outcome (Särndal et al., 1992, p.88). However, with many outcomes of interest, such tailored optimization is difficult or impossible. Therefore, a selfweighted sample design is preferred, such that all individuals are sampled with equal probability (Kish, 1992; Lohr, 2010, p.287). Such samples are representative without weighting adjustments, and unbiased point estimates can be obtained from standard statistical procedures.
Given a fixed precision, we aim to minimize time and resources for the survey enumerator teams. This includes minimizing the numbers of people surveyed (i.e. the actual sample size), but also considering the number of households visited, and the effort required to prepare a sampling frame. To conduct the first stage of sampling, a scheme that samples households with equal probability only requires a list of all households with GPS coordinates identifying their locations. However, a scheme which samples households with probability proportional to size requires some measure of household size (e.g. the total number of household members). This additional piece of information requires more effort for enumerators, especially for larger villages with many households. After either method of first stage sampling, we will conduct a demographic census in the sampled households to create the sampling frame for the second stage.
In this paper we conduct a simulation study to understand the tradeoffs between simple random sampling and probability proportional to size sampling of households in the context of the MVP evaluation. Additionally, our simulations explore designbased versus modelbased inference, a dichotomy which has implications for our the analysis of our outcome data.
2 Outcomes and survey modules
The Millennium Villages Project (MVP) defines 51 outcomes of interest, including measures of poverty alleviation, agriculture, education, gender equality, health, environmental sustainability, and infrastructure (Mitchell et al., 2015a). These outcomes are measured in six different survey modules, whose content is discussed in Section LABEL:surveys of Mitchell et al. (2015a). These modules include:

a household survey, administered to all household heads (or other knowledgeable household members) within the sampled households;

a sexspecific adult survey, administered to men and women of reproductive age (15 to 49 years) within the sampled households;

within the adultfemale survey, a birth history section, administered to women of reproductive age (15 to 49 years) both in the sampled households and in additional sampled households to reach sample size sufficient for estimating child mortality;

a nutrition survey, administered to men and women age 15 to 49 years in sampled households;

blood (malaria and anemia) testing, administered to four agesex groups in sampled households: children age 6 to 59 months, schoolaged children (5 to 14 years old), men age 15 to 49 years, women age 15 to 49 years; and

anthropometry measurements, administered among children age 6 to 59 months in sampled households.
For each module and agesex group combination, the project has budgeted a target sample size based on a combination of budget, logistics, and relative importance of different vulnerable populations and intervention beneficiaries.
3 Sampling plans considered
For the purpose of our simulation study, we consider all survey modules except for birth history and the nutrition survey. Our sampling will be performed in two phases. First, we will sample households using either simple random sampling (SRS, without replacement) or probability proportional to size sampling (PPS, with replacement), with household size defined as , the number of household members under 50 years old in household . Let be the set of (unique) sampled households. In the PPS scheme, we use to denote the set of sampled households with repeats. Let be the number of households sampled without replacement in the SRS scheme and let be the number of households sampled with replacement in the PPS scheme. We let based on the project’s previous survey rounds and budget for the final survey round.
To describe the withinhousehold sampling plans for each survey module, we use the following notation. Let be the total number of people in household that are in the target agesex group for a particular module. Let be the number of people in household that we sample and survey. For example, if considering the anthropometry module, then is the number of children under five years of age in household and is the number of those sampled for the anthropometry module. Let be the total number of people in the sampling frame (an MV1 or a control village) that are in the module’s target agesex group.
We now outline the withinhousehold sampling plans considered in our simulation study.
Adult, anthropometry, and blood modules
For each module and agesex group combination, the project has budgeted a target sample size, :

adult survey  men and women of reproductive age (15 to 49 years);

blood (malaria and anemia)  children age 6 to 59 months, schoolaged children (5 to 14 years old), men age 15 to 49 years, women age 15 to 49 years; and

anthropometry  children age 6 to 59 months.
The second stage sampling schemes we consider in this simulation study are, for a given module and agesex group:

For SRS sampling of households  combine all people in the sampled households in the target agesex group. If , then survey all. Otherwise, we consider two options:

stratify by sampling individuals from each household, where is proportional (up to rounding) to , and the constant of proportionality is determined by the total in the sampled households,

take an equalprobability systematic sample of people. We order the households randomly, and people (in the module’s target agesex group) within households randomly, so that the people within a household are listed consecutively. We then take a sample using the fractional interval method described in Särndal et al. (1992, p.77) and Appendix A. This procedure enables us to control sample sizes and spread the sample across households such that the sample size in a household is always either the ceiling or the floor of the expected sample size in that household under simple random sampling (see Appendix A). Conceptually, this is similar to stratifying on household, except that there is dependence of the samples between strata (i.e. households).


For PPS sampling of households  if , sample a fixed number of people, , per household (regardless of household size) if available.
^{1} If , take a simple random sample of households from to obtain a smaller PPS sample of households. Then sample per household if available.
Household survey
For both the SRS and PPS schemes, the household survey module is administered to the head of household in each sampled household.
3.1 Simulated data
For each survey module, we simulate one outcome measured by that module. For the household survey we use the total household consumption; for the adult male survey we use the number of days after illness began when the man first sought advice or treatment; for the adult female survey we use the number of times a woman received antenatal care during her most recent pregnancy; for malaria and anemia testing we use hemoglobin blood concentration; for the anthropometry module we use the weight for age zscore. In generating simulated data, we make the simplifying assumption that all individuals in a target agesex group have nonmissing outcomes. For example, we generate antenatal care outcomes for all women of reproductive age.
To generate data, we use the multilevel model
(1)  
For total household consumption we use a model analogous to model 1:
(2) 
We also use a model for the log total consumption (which in our data is more Normally distributed than the total consumption),
(3) 
We use the demographic information from the census and the multilevel model with estimated parameter values (from the survey data) to generate simulated populations. If when models 1, 2 or 3 are fit to past survey data, the 50% posterior interval of or contains 0, we set the parameter to 0 when simulating populations. This prevents us from using very noisy estimates of coefficients. Within each simulated population, we randomly sample according to the sampling plans described above, and estimate the finite population mean using either modelbased or designbased inference.
4 Bayesian modelbased inference
To generalize from the data to the population, both designbased and modelbased inference must take into account how the data are collected. Let denote data for the population of interest, and indicators of the observation of , where if is sampled,
We include design variables such that the data collection mechanism is ignorable with respect to this model. For example, in our SRSstratified sampling plan, the data collection mechanism is:
if s.t. and for all . Otherwise, the probability of missingness pattern is zero.
Thus, we include as design variables the household identifiers and the (e.g. the number of women per household, if the survey module targets women). Similar computations show that under the SRSsystematic sampling scheme these variables are also sufficient to satisfy missing at random. For the PPS scheme, we also will need the measure of household size used to select the households (e.g. the total number of household members under 50) (Gelman et al., 2014, p.211). For simplicity, we fit the same ignorable model for both the SRS and PPS schemes. For the anthropometry, blood, and adult survey modules we fit
(4)  
Our parameter of interest is the finite population mean , where (Gelman et al., 2014, p.205). We obtain posterior simulations of as follows: if household is sampled, we use a simulation of to generate simulated ’s. If household is not sampled, we use simulations of and to simulate a new , then generate simulated ’s.
5 Frequentist designbased inference
We use the survey package to compute designbased estimates and variances (Lumley, 2004). Though we perform our SRS schemes without replacement, we compute all variances without finite population corrections, using the Horvitz Thompson (Hajek) ratio estimator and its withreplacement variance (Lohr, 2010, p.247).
Our SRS schemes are twophase rather than twostage designs, since the sampling within a household depends on which households were sampled in the first stage (Särndal et al., 1992, p.134135). This dependence is reflected in the design weights we compute, see below. For the SRSsystematic sampling scheme, the independence assumption of twostage sampling is also violated, with the sampling in each household dependent on the sampling in other households. Our designbased analysis approximates these twophase designs with a twostage analysis. In contrast, in modelbased inference the details of the design do not matter in the analysis once we include design variables in our model (Gelman et al., 2014, p.202, 206211).
5.1 Design weights
For the SRSsystematic design, the inclusion probabilities are:
For SRSstratified, we replace with . In the simulations, instead of computing this precisely, we estimate it by randomly sampling such that . This avoids the computationally intensive loop over all such sets. Although these weights are not equal for all individuals, because the distributions of household sizes (from the MVP demographic data) have no extreme outliers, in our simulations the weights are nearly equal.
For the PPS scheme, the inclusion probabilities are:
In PPS sampling, , where is a measure of household size (Särndal et al., 1992, p.97). So the PPS weights are:
If , and , then the design is selfweighted. We take , a constant, but we cannot choose such that for all modules, since the target agesex groups differ from module to module. We chose , the number of household members under 50 years of age, because it represented a compromise between the different target agesex groups. Thus, our weights are .
6 Comparisons between sampling schemes: variances and design effects
We want to compare the PPS and SRS designs (in either the Bayesian modelbased or the designbased paradigms). In general, the two schemes will have slightly different sample sizes, making direct comparisons of variances less relevant. For the household survey module, we fix the sample sizes to be equal, and for the adult, anthropometry, and blood modules, we adjust for the differing sample sizes by computing a design effect, defined below.
The household survey module is administered to the heads of households only, not individual members. Therefore, the time cost of the household module is mostly determined by the number of households surveyed. We set up our simulations such that the number of household heads to be interviewed (i.e. sample size) is the same for the SRS and PPS sampling schemes. We first perform a PPS sampling of households. Then, we use the number of unique sampled households to obtain the number of households to sample for the SRS scheme. We then directly compare the variances in estimating , the finite population mean consumption per person.
For the remaining modules, we compute design effects. To define the design effect (often abbreviated as “deff”), we first introduce the following notation. Let be the estimator of (in our case, ) where is the sampling distribution assumed to have been used in drawing sample . Let be the sampling variance of an estimator of that assumes sampling distribution , and is the distribution with respect to which we want the variance. Let be an estimator where is the sampling distribution assumed to have been used in drawing sample . The population design effect is defined as . The estimated design effect is defined as .
In the designbased setting, we compute design effects assuming sampling withreplacement in both numerator and denominator variances. This is done in the survey package by specifying deff = `replace'.
For the modelbased simulations, we estimate the numerator of the deff with the posterior variance for from fitting a model that includes enough design variables such that the data collection mechanism is ignorable with respect to this model. This posterior variance includes an implicit finite population correction, so we compute a denominator variance that also includes such a correction:
(5)  
where .
To assess our estimated deff in the modelbased setting, we compare the posterior variance from fitting an ignorable model with respect to a sampling distribution to the designbased sampling variance of the posterior means, . The latter can be computed by simulation: we sample repeatedly from the full population using distribution , fit the ignorable model, obtain a posterior mean of , and compute the variance of these across the samples from . Fixing one finite population, in Figure 1 we create a histogram of posterior variances from fitting the ignorable model to each sample, and indicate with a vertical line the designbased variance of the posterior means, which is computed by simulation. We make the same comparison for a simple random sample (and its ignorable model with flat priors and no design variables), and include the closedform designbased estimate (5) as a vertical line, in addition to the simulationcomputed designbased estimate. See Figure 1. We see that the posterior variances appear unbiased for the designbased variances.
7 Simulation results
Our results are displayed in Appendix B, where we see that neither the SRS nor PPS sampling of households is more efficient (i.e. has a lower design effect) in general.
We see that for modules with higher target sample sizes, SRS tends to be less efficient. For example, in the under5 blood () and adult () modules the SRS scheme is less efficient. One explanation for this observation is the different numbers of people sampled per household in the SRS versus PPS schemes, which has efficiency implications due to the intrahouse correlation. In the PPS scheme, the households sampled in the first stage are larger and therefore more likely to include people in the target demographics. In contrast, in the SRS scheme, the sample is often drawn from fewer households, with more people sampled per household. Moreover, the PPS scheme only samples one person per household draw (though this can result in more than one person being sampled per household due to the withreplacement sampling at the first stage).
For modules where the target sample size is low, there are fewer people sampled per household in the SRS sampling scheme, and the intrahouse correlation does not substantially impact the design efficiency. Therefore, because SRS has nearequal individuallevel probability of sampling (see the designweights computed above), its design effect in the absence of household clustering should be close to one. In contrast, the PPS scheme does not have nearequal individuallevel sampling probabilities because the measure of household size is not proportional to the target demographic (see the designweights computed above).
The relative efficiency of SRS versus PPS is similar between designbased and modelbased simulations. In the few cases where they differ, designbased results show that SRS has higher design effects than PPS, relative to modelbased results. In general, our modelbased simulations show more variability across simulations than the designbased simulations. Comparing systematic to stratified sampling at the second stage of the SRS schemes, we see few differences except that stratified sampling tends to have higher variance across simulations.
8 Final sampling plan
As described above, the PPS scheme requires a sampling frame that includes household sizes, whereas the SRS scheme only requires a list of households. Given our results, we cannot justify the additional resources required to collect the more detailed household list for the PPS scheme. Therefore, our sampling scheme will begin with an SRS sample of households. For the second stage of sampling for the adult, anthropometry, and blood modules, we prefer the control over sample size achieved by systematic sampling (as opposed to stratified sampling).
The household and nutrition modules follow a different sampling scheme. As mentioned above, the household module is administered to all household heads (or other knowledgeable household members) within the sampled households. The nutrition module consists of a food frequency questionnaire, which takes longer to administer than other modules. We suspect that the withinhousehold correlation is very high for data on food frequency, because household members are likely to eat similar foods. (This intrahouse correlation cannot be measured from project data, because the project has always limited this module to one member per household.) For these reasons, we limit the nutrition module to one adult (age 15 to 49 years) per household.
9 Software
Acknowledgements
The authors would like to thank the following people for very valuable feedback and ideas: Jeffrey D. Sachs, Joseph K. Blitzstein, Qixuan Chen, Jennifer Hill, Macartan Humphreys, Michael Clemens, Alberto Abadie, Marc Levy, Linda Pistolesi, Keli Liu, Peng Ding, Natalie Exner, Abhishek Chakrabortty, Rachael Meager, and Natalie Bau.
All mistakes are our own.
Appendix A Properties of systematic sampling
Definition 1 (The fractional interval method of systematic sampling).
Consider a population of size consisting of people grouped into households indexed by , with people withinhousehold . Let be the desired sample size. Set . Order the households randomly, and randomly order the people in the target group within the households. Let label the people in this order:
Draw a random real number uniformly between 0 and , , and sample all people with such that
(Särndal et al., 1992, p.77)
Claim 1.
When performing the sampling scheme in Definition 1, the sample size will be .
Proof of Claim 1.
Since and , . Since , we can write . The ceiling function is monotone increasing and , so each time increases by 1, we get a different value of . Now we must show that the ’s stay in the set , i.e. those from which we are sampling. The first is such that , where . Since , we must have . The last is such that , and we know because . Then . Thus, since each maps to a unique , we’ve proven we get a sample size of exactly . ∎
Claim 2.
When performing the sampling scheme in Definition 1, the sample size within each household is always the ceiling or the floor of the expected sample size in household under simple random sampling: .
We first prove the following lemma which is used to prove the above claim:
Lemma 1.
Consider the set . The maximum of is if , if , or if , where , the “decimal part.”
Proof of Lemma 1.
If , let , and we see that because , so . Increasing by 1 increases the lefthand side of the inequality by , and since , we see this . Therefore, is the maximum.
If , let , the “decimal part.” We see that , so . Increasing by 1 (to ) increases the leftmost side of the inequality by . If , i.e. if , then . ∎
Proof of Claim 2.
Consider household , of size . Let be the last person before household in the ordering used by the systematic sampling. Then are the indices for all members of household . In order to get the number sampled in household , we consider the maximum of set (the number of people sampled up through household ) and subtract from it the maximum of set (the number of people sampled before household ). This gives, by Lemma 1, . ∎
Claim 3.
The sampling scheme in definition 1 is selfweighted.
Proof.
See below for a visual, in orange is the interval , which can overlap at most 2 intervals of length 1 (shown as overbraces, with overlaps totaling a length 1:
∎
Appendix B Survey Sampling Simulation Results
b.1 Designbased results
b.2 Modelbased Results
Footnotes
 It is possible that a household is sampled without any members of the target agesex group. Therefore, if , then the PPS scheme will result in a smaller sample size than the SRS scheme. Additionally, for the adult module (where ), if then the PPS scheme will at most sample only 300 adults, one per sampled household.
 Model 2 can be motivated by assuming that model 1 holds for individuallevel consumption (this would assume that within a household consumption is identically distributed, not taking into account agesex differences). This model implies that and that the marginal variance of is .
 We assume that all units that are sampled are observed.
References
 A Gelman, J B Carlin, H S Stern, D B Dunson, A Vehtari, and D B Rubin. Bayesian Data Analysis. Chapman & Hall/CRC texts in statistical science, third edition, 2014.
 L Kish. Weighting for unequal . Journal of Official Statistics, 8(2):183–200, 1992.
 S L Lohr. Sampling: Design and Analysis. Cengage Learning, 2 edition, 2010.
 T Lumley. Analysis of complex survey samples. Journal of Statistical Software, 9(8), April 2004.
 S Mitchell, A Gelman, R Ross, U Kim Huynh, L McClellan, M Harris, S Bari, J Chen, S OhemengDapaah, P Namakula, S E Sachs C Palm, and J D Sachs. The Millennium Villages Project: A protocol for the final evaluation. Protocol in The Lancet, 2015a. URL http://www.thelancet.com/doi/story/10.1016/html.2015.07.03.2167.
 S Mitchell, R Ross, S Makela, E A Stuart, A Feller, A M Zaslavsky, and A Gelman. Causal inference with small samples and incomplete baseline for the Millennium Villages Project. Working Paper, 2015b. URL http://www.stat.columbia.edu/~gelman/research/unpublished/MVP_paper_technical_JRSSA_short.pdf.
 R Development Core Team. The R project for statistical computing, February 2014. URL http://www.rproject.org/.
 J D Sachs and J W McArthur. The millennium project: a plan for meeting the millennium development goals. The Lancet, 365(9456):347–353, January 2005.
 P Sanchez, C Palm, J D Sachs, G Denning, R Flor, R Harawa, B Jama, T Kiflemariam, B Konecky, R Kozar, E Lelerai, A Malik, P Mutuo, A Niang, H Okoth, F Place, S E Sachs, A Said, D Siriri, A Teklehaimanot, K Wang, J Wangila, and C Zamba. The african millennium villages. Proceedings of the National Academy of Sciences, 104(43):6775–80, 2007.
 C E Särndal, B Swensson, and J Wretman. Model Assisted Survey Sampling. SpringerVerlag, New York, 1992.
 Stan Development Team. Stan: A c++ library for probability and sampling, version 1.3, February 2013. URL http://mcstan.org/.