Generating a synthetic population of individuals in households: Sample-free vs sample-based methods
We compare a sample-free method proposed by Gargiulo et al. (2010) and a sample-based method proposed by Ye et al. (2009) for generating a synthetic population, organized in households, from various statistics. We generate a reference population for a French region including 1310 municipalities and measure how well both methods approximate it from a set of statistics derived from this reference population. We also perform a sensitivity analysis. The sample-free method better fits the reference distributions of both individuals and households. It is also less data demanding, but it requires more pre-processing. The quality of the results for the sample-based method is highly dependent on the quality of the initial sample.
For two decades, the number of micro-simulation models, which simulate the evolution of large populations with an explicit representation of each individual, has been constantly increasing along with computing capabilities and the availability of longitudinal data. When implementing such an approach, the first problem is to properly initialise a large number of individuals with adequate attributes. Indeed, in most cases, exhaustive individual data are excluded from the public domain for privacy reasons. In general, only data aggregated at various levels (municipality, county, …), which guarantee this privacy, are available. Sometimes individual data are available for a sample of the population, these data also being chosen to guarantee privacy (for instance by omitting the individual's location of residence). This paper focuses on the problem of generating a virtual population that makes the best use of these data, especially when the goal is to generate both individuals and their organisation in households.
Two main methods, both requiring a sample of the population, aim at tackling this problem:
The synthetic reconstruction methods (SR) (Wilson and Pownall, 1976). These methods generally use Iterative Proportional Fitting (Deming and Stephan, 1940) and a sample of the target population to obtain the joint-distributions of interest. Many of the SR methods match either the observed and simulated household joint-distributions or the individual joint-distributions, but not both simultaneously. To circumvent this limitation, Guo and Bhat (2007), Arentze et al. (2007) and Ye et al. (2009) proposed different techniques to match both household and individual attributes. Here, we focus on the Iterative Proportional Updating developed by Ye et al. (2009).
Recently, sample-free SR methods appeared (Gargiulo et al., 2010; Barthelemy and Toint, 2013). The sample-free SR methods build households by drawing individuals from a set that initially comprises the whole population and progressively shrinks. In Barthelemy and Toint (2013), if there is no appropriate individual in the current set, the individual is picked from the already generated households, whereas in Gargiulo et al. (2010) the individuals are picked from the set only. Both approaches are illustrated on real-life examples: Barthelemy and Toint (2013) generated a synthetic population of Belgium at the municipality level, and Gargiulo et al. (2010) generated the population of two municipalities in the Auvergne region (France). These methods can be used in the usual situations where no sample is available and one must rely only on distributions of attributes (of individuals and households). Hence, they overcome a strong limitation of the previous methods. It is therefore important to assess whether this larger scope of the sample-free method implies a loss of accuracy compared with the sample-based method.
In order to compare the methods, the ideal case would be to have a population with complete data available about individuals and households. It would allow us to measure precisely the accuracy of each method, in different conditions. Unfortunately, we do not have such data. In order to put ourselves in a similar situation, we generate a virtual population and then use it as a reference to compare the selected methods as in . All the algorithms presented in this paper are implemented in JAVA on a desktop machine (PC Intel 2.83 GHz).
In the first section we formally present the two methods. In the second section we present the comparison results. Finally, we discuss our results.
Details of the chosen methods
Sample-free method
We consider a set of individuals to dispatch into a set of households in order to obtain a set of filled households. Each individual is characterised by a type drawn from a finite set of individual types (the attributes of the individual), and each household is characterised by a type drawn from a finite set of household types (the attributes of the household). The number of individuals of each type and the number of households of each type are known. Each household of a given type has a given probability of being filled by a given subset of individuals, which then forms the content of the household (Equation 1). We use this probability to iteratively fill the households with the available individuals.
The iterative algorithm used to dispatch the individuals into the households according to Equation 1 proceeds as follows. It starts with the list of individuals and the list of households, each defined by its type. It then iteratively picks a household at random and, from its type and Equation 1, derives a list of individual types. If this list of individual types is available in the current list of individuals, the filled household is added to the result and the current lists of individuals and households are updated. This operation is repeated until one of the two lists is empty or a maximum number of iterations is reached.
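The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the function name `fill_households`, the representation of the probability of Equation 1 as a dictionary `content_dist`, and the toy types are all hypothetical.

```python
import random
from collections import Counter


def fill_households(individuals, households, content_dist, max_iter=10_000, seed=0):
    """Dispatch typed individuals into typed households (illustrative sketch).

    individuals  : list of individual types, one entry per individual
    households   : list of household types, one entry per household
    content_dist : hypothetical stand-in for Equation 1, mapping a household
                   type to a list of (tuple of individual types, probability)
    """
    rng = random.Random(seed)
    pool = Counter(individuals)   # remaining individuals, counted by type
    remaining = list(households)  # households still to be filled
    filled = []
    for _ in range(max_iter):
        # Stop when one of the two lists is empty.
        if not remaining or sum(pool.values()) == 0:
            break
        idx = rng.randrange(len(remaining))
        h_type = remaining[idx]
        contents, probs = zip(*content_dist[h_type])
        members = rng.choices(contents, weights=probs, k=1)[0]
        need = Counter(members)
        # Accept the household only if every required type is still available.
        if all(pool[t] >= c for t, c in need.items()):
            pool.subtract(need)
            filled.append((h_type, members))
            remaining[idx] = remaining[-1]
            remaining.pop()
    return filled, pool, remaining


# Toy data: three adults, one child, two single households, one monoparental.
toy_dist = {
    "single": [(("adult",), 1.0)],
    "monoparental": [(("adult", "child"), 1.0)],
}
toy_filled, toy_pool, toy_left = fill_households(
    ["adult"] * 3 + ["child"], ["single", "single", "monoparental"], toy_dist
)
```

A rejected draw leaves both lists untouched, so the household can be picked again in a later iteration.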
In the case of the generation of a synthetic population, we can replace the selection of the list by the selection of the individuals one at a time by order of importance in the household. In this case Equation 2 replaces Equation 1.
The iterative algorithm associated with this probability follows the same principle as the previous one; it is simply quicker. Instead of generating the whole list of individuals in the household before checking it, one generates the members one by one and, as soon as one of them cannot be found in the current list of individuals, the iteration stops and one tries another household.
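The member-by-member variant with early stopping might look like the sketch below. It assumes a hypothetical callback `next_member_dist` that plays the role of Equation 2 and returns, for a household type and the members drawn so far, a list of (individual type or `None`, probability) pairs, where `None` means that the household is complete. These names and this encoding are illustrative assumptions.

```python
import random
from collections import Counter


def try_fill_member_by_member(h_type, pool, next_member_dist, rng):
    """Try to fill one household one member at a time.

    Returns the list of member types on success (and removes them from pool),
    or None as soon as a drawn member type is not available in pool.
    """
    members = []
    taken = Counter()
    while True:
        options = next_member_dist(h_type, tuple(members))
        choices, probs = zip(*options)
        pick = rng.choices(choices, weights=probs, k=1)[0]
        if pick is None:
            pool.subtract(taken)  # household complete: commit the members
            return members
        if pool[pick] - taken[pick] <= 0:
            return None           # early stop: pool is left unchanged
        taken[pick] += 1
        members.append(pick)


def couple_dist(h_type, members_so_far):
    # Toy distribution (assumption): a couple is two adults, then complete.
    if len(members_so_far) < 2:
        return [("adult", 1.0)]
    return [(None, 1.0)]
```

The callback must eventually return `None` with positive probability, otherwise the loop only ends when the pool runs out of a drawn type.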
In practice this stochastic approach is data driven. Indeed, the individual and household types are defined in accordance with the available data, and the complexity of extracting the distributions of Equation 2 increases with the number of individual types and the number of household types. The distributions defined in Equation 2 are called distributions for affecting individual into household, that is, for assigning individuals to households. In concrete applications, it happens that the inputs of the algorithm, including the distributions of probabilities presented in Equation 2, have to be estimated. Because of this estimation, the sample-free algorithm may fail to converge in a reasonable time with the original stopping criterion (one of the two lists being empty), which is equivalent to allowing an infinite number of "filling" trials per household. In this case, we can replace the stopping criterion by a maximal number of iterations per household and then put the remaining individuals in the remaining households using relaxed distributions for affecting individual into household.
In a perfect case where all the data are available and the time infinite, the algorithm would find a perfect solution. When the data are partial and the time constrained, it is interesting to assess how this method manages to make the best use of the available data.
The sample-based approach (General Iterative Proportional Updating)
This approach, proposed by Ye et al. (2009), starts with a sample of filled households. The purpose is to define a weight associated with each individual and each household of the sample so that the weighted sample matches the total number of individuals and of households of each type, which allows the population of filled households to be reconstructed. The method used to reach this objective is the Iterative Proportional Updating (IPU). In this algorithm, for each type of household or individual, the weights are adjusted so that the weighted sum matches the estimated constraint. Each constraint is an estimate of the total number of households or individuals of the given type in the population. This estimation is done separately for each individual and household type using a standard IPF procedure with marginal variables. When the match between the weighted sample and the constraints becomes stable, the algorithm stops. The procedure then generates a synthetic population by drawing at random the filled households of the sample with probabilities corresponding to the weights. This generation is repeated several times and one chooses the result with the best fit to the observed data.
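The weight adjustment and the final draw can be sketched as below. This is a simplified reading of IPU under stated assumptions: the sample is encoded as an incidence table with one row per sample household and one column per household or individual type, the stability test uses the average relative gap to the constraints, and all constraints are strictly positive. The function names and the toy data are hypothetical.

```python
import random


def ipu_weights(incidence, constraints, max_iter=1000, tol=1e-10):
    """Adjust one weight per sample household to match the type constraints.

    incidence[i][j] : number of units of type j carried by sample household i
                      (1 for its own household type, a count for individual types)
    constraints[j]  : target total for type j (assumed strictly positive)
    """
    weights = [1.0] * len(incidence)

    def weighted_sum(j):
        return sum(w * row[j] for w, row in zip(weights, incidence))

    def average_relative_gap():
        gaps = (abs(weighted_sum(j) - c) / c for j, c in enumerate(constraints))
        return sum(gaps) / len(constraints)

    previous = average_relative_gap()
    for _ in range(max_iter):
        for j, c in enumerate(constraints):
            s = weighted_sum(j)
            if s == 0:
                continue  # no sample household carries this type
            ratio = c / s
            for i, row in enumerate(incidence):
                if row[j] > 0:
                    weights[i] *= ratio  # rescale households carrying type j
        current = average_relative_gap()
        if abs(previous - current) < tol:
            break  # the match has become stable
        previous = current
    return weights


def draw_population(sample, weights, n_households, seed=0):
    """Draw filled households with probabilities proportional to the weights."""
    rng = random.Random(seed)
    return rng.choices(sample, weights=weights, k=n_households)


# Toy sample: columns are [household type A, household type B, person p, person q].
toy_incidence = [[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1]]
toy_constraints = [30, 20, 45, 40]
toy_weights = ipu_weights(toy_incidence, toy_constraints)
toy_population = draw_population(["h0", "h1", "h2", "h3"], toy_weights, 50)
```

In this toy case the constraints are consistent with the weights 10, 20, 5 and 15, so the adjusted weights approach those values. The zero-cell and zero-marginal treatments mentioned in the text are not covered by this sketch.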
Generating a synthetic population of reference for the comparison
Because we cannot access any population with complete data available about individuals and households, we generate a virtual population and then use it as a reference to compare the selected methods as in .
We generate this reference population from statistics about the population of Auvergne (a French region) in 1990, using the sample-free approach presented above. The Auvergne region is composed of 1310 municipalities and has 1,321,719 inhabitants gathered in 515,736 households. Table 2 presents summary statistics on the Auvergne municipalities.
|ID||Description||Level|
|1||Number of individuals grouped by age||Municipality (LAU2)|
|2||Distribution of individuals by activity status according to age||Municipality (LAU2)|
|3||Joint-distribution of households by type and size||Municipality (LAU2)|
|4||Probability of being the head of household according to age and type of household||Municipality (LAU2)|
|5||Probability of forming a couple according to the age difference between the partners (from "-16 years" to "21 years")||National level|
|6||Probability of being a child (child = lives with parent) of the household according to age and type of household||Municipality (LAU2)|
Generation of the individuals
For each municipality of the Auvergne region we generate a set of individuals with a stochastic procedure. For each individual of the age pyramid (distribution 1 in Table 3), we randomly choose an age in the bin and then we draw randomly an activity status according to the distribution 2 in Table 3.
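The generation of individuals described above can be sketched as follows. The data layout (dictionaries keyed by age bins) and the uniform draw of the age inside a bin are assumptions made for illustration; the text only states that the age is chosen at random in the bin.

```python
import random


def generate_individuals(age_pyramid, status_by_age, seed=0):
    """Generate (age, activity status) pairs for one municipality.

    age_pyramid   : dict {(age_min, age_max): number of individuals}
    status_by_age : dict {(age_min, age_max): {status: probability}}
    """
    rng = random.Random(seed)
    people = []
    for (age_min, age_max), count in age_pyramid.items():
        statuses = list(status_by_age[(age_min, age_max)])
        probs = [status_by_age[(age_min, age_max)][s] for s in statuses]
        for _ in range(count):
            age = rng.randint(age_min, age_max)  # bounds included (assumption)
            status = rng.choices(statuses, weights=probs, k=1)[0]
            people.append((age, status))
    return people


# Toy inputs with two age bins (hypothetical values).
toy_pyramid = {(0, 14): 3, (15, 59): 5}
toy_status = {
    (0, 14): {"student": 1.0},
    (15, 59): {"employed": 0.7, "unemployed": 0.3},
}
toy_people = generate_individuals(toy_pyramid, toy_status)
```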
Generation of the households
For each municipality of the Auvergne region we generate a set of households, according to the total number of individuals, with a stochastic procedure. We draw households at random according to distribution 3 in Table 3 while the sum of their capacities is below the total number of individuals, and we then determine the last household so that the total number of individuals equals the sum of the household sizes.
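A possible sketch of this household generation is given below. The text does not say how the last household is determined, so the rule used here is an assumption: when a draw would overshoot, the last household is drawn among the (type, size) pairs whose size exactly matches the remaining number of individuals, and falls back to a household of type "other" when no such pair exists. Sizes are assumed to be integers of at least 1.

```python
import random


def generate_households(n_individuals, type_size_dist, seed=0):
    """Draw (type, size) households until their sizes sum to n_individuals.

    type_size_dist : dict {(household type, size): probability}
    """
    rng = random.Random(seed)
    keys = list(type_size_dist)
    probs = [type_size_dist[k] for k in keys]
    households = []
    remaining = n_individuals
    while remaining > 0:
        h_type, size = rng.choices(keys, weights=probs, k=1)[0]
        if size > remaining:
            # Last household (assumed rule): match the remaining capacity exactly.
            exact = [k for k in keys if k[1] == remaining and type_size_dist[k] > 0]
            if exact:
                exact_probs = [type_size_dist[k] for k in exact]
                h_type, size = rng.choices(exact, weights=exact_probs, k=1)[0]
            else:
                h_type, size = "other", remaining
        households.append((h_type, size))
        remaining -= size
    return households


# Toy joint-distribution of households by type and size (hypothetical values).
toy_type_size = {
    ("single", 1): 0.3,
    ("couple without children", 2): 0.3,
    ("couple with children", 4): 0.4,
}
toy_households = generate_households(101, toy_type_size)
```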
Distributions for affecting individual into household
The ages of the children are determined according to the age of individual 1 (an individual can have a child after the age of 15 and before the age of 55) and distribution 6 in Table 3.
Couple without child
The age of the individual 2 is determined using the distribution 5 in Table 3.
Couple with child
The ages of the other individuals are determined according to the age of individual 1.
To obtain a synthetic population with households filled by individuals, we use the sample-free algorithm, approximating Equation 2 with distributions 4, 5 and 6 in Table 3. We put no constraint on the number of individuals in the age pyramid; hence the reference population does not give any advantage to the sample-free method. Figure S1 and Figure S2 show the values obtained for the individuals' and households' attributes for the Auvergne region and for Marsac-en-Livradois, a municipality drawn at random among the 1310 Auvergne municipalities. These figures show the results obtained with the reference, the sample-free and the sample-based populations.
Comparing sample-free and sample-based approaches
The attributes of individuals and households are respectively described in Table 4 and Table 5. The joint-distributions of the attributes for individuals and for households give respectively the number of individuals of each individual type and the number of households of each household type. It is important to note that the number of household types is not equal to the product of the numbers of modalities of the household attributes, because we remove from the list of household types the inconsistent combinations, for example single households with more than one individual. We do the same for the individual types (removing, for example, retired individuals aged between 0 and 5).
|85 and more|
|Family Status||Head of a single household|
|Head of a monoparental household|
|Head of a couple without children household|
|Head of a couple with children household|
|Head of a other household|
|Child of a monoparental household|
|Child of a couple with children household|
Fitting accuracy measures
We need fitting accuracy measures to evaluate the adequacy between observed and estimated distributions, for both households and individuals. The first measure is the Proportion of Good Prediction (PGP) (Equation 3); we choose this indicator because it is easy to interpret. In Equation 3 we multiply by 0.5 because, as the observed and estimated distributions have the same total, each misclassified individual or household is counted twice.
We use the χ² distance to perform a statistical test. Obviously, the modalities with a zero value in the observed distribution are not included in the computation. If we consider a distribution with a given number of modalities different from zero in the observed distribution, the χ² distance follows a χ² distribution whose number of degrees of freedom is that number of modalities minus one.
|6 and more individuals|
|Couple without children|
|Couple with children|
For more details on the fitting accuracy measures see Voas and Williamson (2001).
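The two measures can be illustrated with the short sketch below. The exact form of Equation 3 is not reproduced in this text, so the formulas used here are assumptions consistent with the description: the PGP is taken as one minus half the sum of absolute differences divided by the observed total, and the χ² distance as the sum of squared differences divided by the observed counts, restricted to modalities with a non-zero observed value.

```python
def proportion_good_prediction(observed, estimated):
    """PGP = 1 - 0.5 * sum(|obs - est|) / sum(obs) (assumed form of Equation 3)."""
    total = sum(observed)
    misfit = sum(abs(o - e) for o, e in zip(observed, estimated))
    return 1.0 - 0.5 * misfit / total


def chi2_distance(observed, estimated):
    """Return the chi-square distance and its degrees of freedom.

    Modalities with a zero observed count are skipped, and the degrees of
    freedom are the number of remaining modalities minus one.
    """
    pairs = [(o, e) for o, e in zip(observed, estimated) if o > 0]
    distance = sum((e - o) ** 2 / o for o, e in pairs)
    return distance, len(pairs) - 1


# Toy distributions over four modalities (hypothetical counts).
toy_observed = [50, 30, 20, 0]
toy_estimated = [45, 35, 20, 0]
toy_pgp = proportion_good_prediction(toy_observed, toy_estimated)  # 0.95
toy_distance, toy_dof = chi2_distance(toy_observed, toy_estimated)
```

A p-value could then be derived from the distance and the degrees of freedom with a χ² survival function, for example `scipy.stats.chi2.sf` if SciPy is available.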
To test the sample-free approach, we extract from the reference population, for each municipality, the distributions presented in Table 3. Then we use the procedure used for generating the reference population, but now with the constraints on the number of individuals from the age pyramid derived from the reference (remember that we did not have such constraints when generating the reference population). Then we fill the households with the individuals one at a time using the distributions for affecting individual into household. We limit the number of iterations to 1000 trials per household: if after 1000 trials a household is not filled, we put individuals in this household at random and we change its type to "other". We repeat the process 100 times and we choose, for each municipality, the synthetic population minimizing the distance between simulated and reference distributions for affecting individual into household.
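The trial limit and the fallback to the type "other" can be sketched as follows. The callback `draw_members`, which stands for one draw from the distributions for affecting individual into household, and the way random individuals are picked in the fallback are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def fill_with_trial_limit(h_type, size, pool, draw_members, rng, max_trials=1000):
    """Fill one household, giving up after max_trials unsuccessful draws.

    Returns (final household type, list of member types). On failure the
    household receives random remaining individuals and becomes "other".
    """
    for _ in range(max_trials):
        members = draw_members(h_type, size, rng)
        need = Counter(members)
        if all(pool[t] >= c for t, c in need.items()):
            pool.subtract(need)
            return h_type, list(members)
    # Fallback: pick random individuals among those still available.
    available = list(pool.elements())
    members = rng.sample(available, min(size, len(available)))
    pool.subtract(Counter(members))
    return "other", members


def two_adults(h_type, size, rng):
    # Toy draw (assumption): a couple always asks for two adults.
    return ("adult", "adult")
```

With a pool holding one adult and two children, a couple cannot be formed, so the household ends up with two random individuals and the type "other".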
In order to assess the robustness of the stochastic sample-free approach, we generate 10 synthetic populations per municipality, yielding 13,100 synthetic municipality populations in total. For each of them and for each of the distributions for affecting individual into household, we compute the p-value associated with the distance between the reference and estimated distributions. As we can see in Figure 1a, the algorithm is quite robust.
To validate the algorithm we compute the proportion of good predictions for each of the 13,100 synthetic populations and for each joint-distribution. We obtain an average of 99.7% of good predictions for the household distribution and 91.5% of good predictions for the individual distribution (Figure 1b). We also compute the p-value of the distance between the estimated and reference distributions for each of the synthetic populations and for each joint-distribution. Among the 13,100 synthetic populations, 100% are statistically similar to the observed one at the 95% confidence level for the household joint-distribution and 94% for the individual joint-distribution.
In order to understand the effect of the maximal number of iterations per household, we repeat the previous tests for different values of this parameter (1, 10, 100, 500, 1000, 1500 and 2000) and we compute the mean proportion of good predictions obtained for both individuals and households. We note that beyond 100 iterations the quality of the results no longer changes.
To use the IPU algorithm we need a sample of filled households and marginal variables. In order to obtain these data, we pick at random a significant sample of households from the reference population. We also extract from the reference population the two one-dimensional marginals (Size and Type distributions) that we need to build the household joint-distributions with IPF, and the three two-dimensional marginals (Age x Activity Status, Age x Family Status and Family Status x Activity Status) that we need to build the individual joint-distributions with IPF. Then we apply the IPU algorithm, using the recommendations of Ye et al. (2009) for the well-known zero-cell and zero-marginal problems, to obtain a weighted sample. With this sample we generate the synthetic population 100 times and choose the one with the lowest distance between reference and simulated individual joint-distributions.
To check the results obtained with the IPU approach, we generate 10 synthetic populations per municipality using different randomly selected samples of households. For each of these synthetic populations and for each joint-distribution we compute the proportion of good predictions (Figure 2a). We obtain an average of 98.6% of good predictions for the household distribution and 86.9% of good predictions for the individual distribution. To determine the estimation error due to the IPF procedure, we compute the proportion of good predictions between the estimated and the IPF-reference distributions. As we can see in Figure 2b, the results are improved for the household distribution but not for the individual distribution. We also compute the p-value of the distance between the estimated and observed distributions for each of the synthetic populations and for each joint-distribution. Among the 13,100 synthetic populations, 100% are statistically similar to the observed one at the 95% confidence level for the household joint-distribution and 61% for the individual joint-distribution. We obtain a similarity between the estimated and the IPF-reference distributions of 100% at the 95% confidence level for the household distribution and 64% for the individual distribution.
In order to check the sensitivity of the results to the size of the sample, we plot in Figure 2c the average proportion of good predictions of the 13,100 household and individual joint-distributions for different values of the percentage of the reference households drawn at random in the sample (5, 10, 15, 20, 25, 30, 35, 40, 45 and 50). We note that the results are always good for the household distribution, but for the individuals the results are good only from a random sample of at least 25% of the reference household population. Not surprisingly, the quality of the results globally increases with this parameter.
The sample-free method is less data demanding but it requires more data pre-processing. Indeed, this approach requires extracting the distributions for affecting individual into household from the data. The sample-free method gives a better fit between observed and simulated distributions, for both households and individuals, than the IPU approach. We can observe in Figure 3 that, for both methods, the goodness-of-fit is negatively correlated with the number of inhabitants. This observation is especially true for the IPU method because it depends on the number of individuals in the sample. Indeed, the lower the number of individuals, the higher the number of sparse cells in the individual distribution. The results obtained with the IPU approach depend on the quality of the initial sample. The execution time on a desktop machine (PC Intel 2.83 GHz) is almost the same for the sample-free method with a maximum of 100 iterations per household and for the sample-based approach with a sample of 25% of the reference households drawn at random.
To conclude, the sample-free method gives globally better results in this application on small French municipalities. These results confirm those of Barthelemy and Toint (2013), who compared their sample-free method, designed to work with data from different sources, with a sample-based method and obtained similar conclusions. Of course, these conclusions cannot be generalized to all sample-free and sample-based methods without further investigation. However, these results confirm that it is possible to accurately initialise micro-simulation (or agent-based) models using widely available data (and without any sample of households).
This publication has been funded by the Prototypical policy impacts on multifunctional activities in rural municipalities collaborative project, European Union 7th Framework Programme (ENV 2007-1), contract no. 212345. The work of the first author has been funded by the Auvergne region.
- F. Gargiulo, S. Ternes, S. Huet, and G. Deffuant. An iterative approach for generating statistically realistic populations of households. PLoS ONE, 5, 2010.
- X. Ye, K. Konduri, R. M. Pendyala, B. Sana, and P. Waddell. A methodology to match distributions of both household and person attributes in the generation of synthetic populations. In 88th Annual Meeting of the Transportation Research Board, 2009.
- A. G. Wilson and C. E. Pownall. A new representation of the urban system for modelling and for the study of micro-level interdependence. Area, 8(4):246–254, 1976.
- W. E. Deming and F. F. Stephan. On a least squares adjustment of a sample frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11:427–444, 1940.
- R. J. Beckman, K. A. Baggerly, and M. D. McKay. Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice, 30(6 PART A):415–429, 1996.
- Z. Huang and P. Williamson. A comparison of synthetic reconstruction and combinatorial optimization approaches to the creation of small-area microdata. Working paper, Department of Geography, University of Liverpool, 2002.
- J. Y. Guo and C. R. Bhat. Population synthesis for microsimulating travel behavior. Transportation Research Record: Journal of the Transportation Research Board, 2014:92–101, 2007.
- T. Arentze, H. Timmermans, and F. Hofman. Creating synthetic household populations: Problems and approach. Transportation Research Record: Journal of the Transportation Research Board, 2014:85–91, 2007.
- D. Voas and P. Williamson. An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata. International Journal of Population Geography, 6(5):349–366, 2000.
- J. Barthelemy and P. Toint. Synthetic population generation without a sample. Transportation Science, 47:266–279, 2013.
- K. Harland, A. Heppenstall, D. Smith, and M. Birkin. Creating realistic synthetic populations at varying spatial scales: A comparative critique of population synthesis techniques. Journal of Artificial Societies and Social Simulation, 15(1):1, 2012.
- D. Voas and P. Williamson. Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2):177–200, 2001.