Generating a synthetic population of individuals in households: Sample-free vs sample-based methods
Abstract
We compare a sample-free method proposed by [1] and a sample-based method proposed by [2] for generating a synthetic population, organized in households, from various statistics. We generate a reference population for a French region including 1310 municipalities and measure how well both methods approximate it from a set of statistics derived from this reference population. We also perform a sensitivity analysis. The sample-free method better fits the reference distributions of both individuals and households. It is also less data demanding, but it requires more preprocessing. The quality of the results for the sample-based method is highly dependent on the quality of the initial sample.
Introduction
For two decades, the number of microsimulation models, which simulate the evolution of large populations with an explicit representation of each individual, has been constantly increasing along with computing capabilities and the availability of longitudinal data. When implementing such an approach, the first problem is to properly initialise a large number of individuals with adequate attributes. Indeed, in most cases, exhaustive individual data are excluded from the public domain for privacy reasons. Hence, in general, only aggregated data at various levels (municipality, county,…), which guarantee this privacy, are available. Sometimes individual data are available for a sample of the population, these data also being chosen to guarantee privacy (for instance by omitting the individual's location of residence). This paper focuses on the problem of generating a virtual population that makes the best use of these data, especially when the goal is to generate both individuals and their organisation in households.
Two main methods, both requiring a sample of the population, aim at tackling this problem:

The synthetic reconstruction (SR) methods [3]. These methods generally use Iterative Proportional Fitting (IPF) [4] and a sample of the target population to obtain the joint-distributions of interest [5, 6, 7, 8, 2]. Many SR methods match either the observed and simulated household joint-distributions or the individual joint-distributions, but not both simultaneously. To circumvent this limitation, [7], [8] and [2] proposed different techniques to match both household and individual attributes. Here, we focus on the Iterative Proportional Updating (IPU) developed by [2].
Recently, sample-free SR methods appeared [1, 10]. The sample-free SR methods build households by picking individuals from a set that initially comprises the whole population and progressively shrinks. In [10], if there is no appropriate individual in the current set, the individual is picked from the already generated households, whereas in [1] the individuals are picked from the set only. Both approaches are illustrated on real-life examples: [10] generated a synthetic population of Belgium at the municipality level and [1] generated the population of two municipalities in the Auvergne region (France). These methods can be used in the usual situations where no sample is available and one must rely only on distributions of attributes (of individuals and households). Hence, they overcome a strong limitation of the previous methods. It is therefore important to assess whether this larger scope of the sample-free method implies a loss of accuracy compared with the sample-based method.
This paper aims to contribute to this assessment. To this end, we compare the sample-based IPU method proposed by [2] with the sample-free approach proposed by [1] on an example.
In order to compare the methods, the ideal case would be to have a population with complete data available about individuals and households. It would allow us to measure precisely the accuracy of each method under different conditions. Unfortunately, we do not have such data. In order to put ourselves in a similar situation, we generate a virtual population and then use it as a reference to compare the selected methods, as in [10]. All the algorithms presented in this paper are implemented in Java and run on a desktop machine (PC Intel 2.83 GHz).
In the first section we formally present the two methods. In the second section we present the comparison results. Finally, we discuss our results.
Details of the chosen methods
Sample-free method
We consider a set of individuals to dispatch into a set of households in order to obtain a set of filled households. Each individual is characterised by a type taken from a set of different individual types (attributes of the individual). Each household is characterised by a type taken from a set of different household types (attributes of the household). The number of individuals of each type and the number of households of each type are given. Each household of a given type has a probability of being filled by a given subset of individuals, which then forms the content of the household (Equation 1). We use this probability to iteratively fill the households with the available individuals.
(1) 
The iterative algorithm used to dispatch the individuals into the households according to Equation 1 is described in Algorithm 1. The algorithm starts with the list of individuals and the list of households, each defined by their types. It then iteratively picks a household at random and, from its type and Equation 1, derives a list of individual types. If this list of individual types is available in the current list of individuals, the filled household is added to the result, and the current lists of individuals and households are updated. This operation is repeated until one of the two lists is empty or a maximum number of iterations is reached.
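The loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function `fill_households`, the callable `draw_member_types` (a stand-in for the probability of Equation 1) and the toy types are all assumptions introduced here.

```python
import random
from collections import Counter

def fill_households(individuals, households, draw_member_types,
                    max_iter=10_000, rng=None):
    """Dispatch individuals (a list of types) into households (a list of types).

    draw_member_types(h_type, rng) is a hypothetical stand-in for Equation 1:
    it returns a candidate list of individual types for a household type.
    """
    rng = rng or random.Random()
    pool = Counter(individuals)     # remaining individuals, counted by type
    remaining = list(households)    # remaining household types
    filled = []
    for _ in range(max_iter):
        if not remaining or sum(pool.values()) == 0:
            break                   # one of the two lists is empty
        idx = rng.randrange(len(remaining))
        h_type = remaining[idx]
        members = draw_member_types(h_type, rng)
        needed = Counter(members)
        # Accept the household only if every drawn type is still available.
        if all(pool[t] >= k for t, k in needed.items()):
            pool.subtract(needed)
            remaining.pop(idx)
            filled.append((h_type, members))
    return filled, pool, remaining

def toy_draw(h_type, rng):
    # Toy distribution, for illustration only.
    if h_type == "single":
        return ["adult"]
    return ["adult", "adult"] + ["child"] * rng.randint(0, 2)

filled, leftover, unfilled = fill_households(
    ["adult"] * 6 + ["child"] * 3,
    ["single", "single", "couple", "couple"],
    toy_draw, rng=random.Random(0))
```

Counting the pool by type keeps the availability check cheap, since only the drawn types have to be looked up at each trial.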
In the case of the generation of a synthetic population, we can replace the selection of the whole list by the selection of the individuals one at a time, by order of importance in the household. In this case Equation 2 replaces Equation 1.
(2) 
The iterative algorithm associated with this probability is described in Algorithm 2. The principle is the same as before; it is simply quicker. Instead of generating the whole list of individuals in the household before checking it, one generates the list one member at a time, and as soon as one of its members cannot be found in the current list of individuals, the iteration stops and one tries another household.
In practice this stochastic approach is data driven. Indeed, the individual and household types are defined in accordance with the available data, and the complexity of extracting the distributions of Equation 2 increases with the number of types. The distributions defined in Equation 2 are called distributions for affecting individual into household. In concrete applications, one may need to estimate the numbers of individuals and households of each type, as well as the probability distributions presented in Equation 2. Because of this estimation, Algorithm 2 may not converge in a reasonable time with its stopping criterion, which is equivalent to an infinite number of "filling" trials by household. In this case, we can replace the stopping criterion by a maximal number of iterations by household and then put the remaining individuals in the remaining households using relaxed distributions for affecting individual into household.
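A minimal sketch of this one-at-a-time filling with early stopping is given below. The names `try_fill_one_by_one`, `draw_next_type` (a stand-in for the conditional probabilities of Equation 2) and the toy household compositions are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter

def try_fill_one_by_one(h_type, pool, draw_next_type, rng):
    """Try to fill one household, drawing its members one at a time.

    draw_next_type(h_type, members_so_far, rng) is a hypothetical stand-in for
    Equation 2: it returns the next individual type, or None when the
    household is complete. Returns the member types, or None as soon as a
    drawn type is unavailable (the pool is then left untouched).
    """
    taken = Counter()
    members = []
    while True:
        t = draw_next_type(h_type, members, rng)
        if t is None:
            break                    # household complete
        if pool[t] - taken[t] <= 0:
            return None              # early stop: try another household
        taken[t] += 1
        members.append(t)
    pool.subtract(taken)             # commit only when every member was found
    return members

def toy_next(h_type, members, rng):
    # Toy deterministic ordering (head first), for illustration only.
    order = {"single": ["adult"],
             "couple with child": ["adult", "adult", "child"]}[h_type]
    return order[len(members)] if len(members) < len(order) else None

pool_without_child = Counter(adult=2)
failed = try_fill_one_by_one("couple with child", pool_without_child,
                             toy_next, random.Random(0))
pool_with_child = Counter(adult=3, child=1)
members = try_fill_one_by_one("couple with child", pool_with_child,
                              toy_next, random.Random(0))
```

The gain over drawing the whole list first is that a failed trial is abandoned at the first unavailable member instead of after the full draw.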
In a perfect case where all the data are available and the time infinite, the algorithm would find a perfect solution. When the data are partial and the time constrained, it is interesting to assess how this method manages to make the best use of the available data.
The sample-based approach (General Iterative Proportional Updating)
This approach, proposed by [2], starts with a sample of filled households. The purpose is to define a weight associated with each individual and each household of the sample in order to match the total number of individuals of each type and of households of each type in the target population, and thereby to reconstruct it. The method used to reach this objective is the Iterative Proportional Updating (IPU). The algorithm proposed in [2] is described in Algorithm 3. In this algorithm, for each type of households or individuals, the purpose is to match the weighted sum with the estimated constraint by adjusting the weights. Each constraint is an estimation of the total number of households or individuals of the corresponding type in the population. This estimation is done separately for each individual and household type using a standard IPF procedure with marginal variables. When the match between the weighted sample and the constraints becomes stable, the algorithm stops. The procedure then generates a synthetic population by drawing at random the filled households of the sample with probabilities corresponding to the weights. This generation is repeated several times and one chooses the result that best fits the observed data.
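The weight adjustment can be sketched as below. This is a simplified reading of the IPU idea, not the algorithm of [2] verbatim: the incidence-matrix layout, the stopping rule on the average relative deviation, the function name `ipu` and the toy data are assumptions. Constraints are assumed to be strictly positive (the zero-marginal case is not handled).

```python
def ipu(incidence, constraints, n_sweeps=500, tol=1e-9):
    """Adjust one weight per sample household.

    incidence[i][j] is the contribution of sample household i to constraint j
    (1 or 0 for a household type, number of members for an individual type).
    constraints[j] is the target total for type j (assumed > 0).
    """
    n, m = len(incidence), len(constraints)
    w = [1.0] * n

    def avg_deviation(weights):
        # Average relative gap between weighted sums and constraints.
        total = 0.0
        for j in range(m):
            s = sum(weights[i] * incidence[i][j] for i in range(n))
            total += abs(s - constraints[j]) / constraints[j]
        return total / m

    prev = avg_deviation(w)
    for _ in range(n_sweeps):
        for j in range(m):
            s = sum(w[i] * incidence[i][j] for i in range(n))
            if s > 0:
                ratio = constraints[j] / s
                # Rescale only the households that contribute to type j.
                for i in range(n):
                    if incidence[i][j] > 0:
                        w[i] *= ratio
        cur = avg_deviation(w)
        if abs(prev - cur) < tol:    # the match has become stable
            break
        prev = cur
    return w

# Toy sample: columns are [household type A, household type B, adults, children].
sample = [[1, 0, 1, 0],
          [1, 0, 1, 1],
          [0, 1, 2, 1]]
targets = [30, 30, 90, 50]
weights = ipu(sample, targets)
```

Because every household contributing to a type is rescaled by the same ratio, households with the same pattern of non-zero entries always keep equal weights, so an exact match is not guaranteed for every sample.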
Generating a synthetic population of reference for the comparison
Because we cannot access any population with complete data available about individuals and households, we generate a virtual population and then use it as a reference to compare the selected methods as in [10].
We generate the reference population from statistics about the population of Auvergne (French region) in 1990, using the sample-free approach presented above. The Auvergne region is composed of 1310 municipalities and 1,321,719 inhabitants gathered in 515,736 households. Table 2 presents summary statistics on the Auvergne municipalities.
Table 2. Summary statistics on the Auvergne municipalities
Statistics  Min  Max  Average
Households  8  63,226  408.2
Individuals  26  136,180  1,011.7
Table 3. Input distributions
ID  Description  Level
1  Number of individuals grouped by age  Municipality (LAU2)
2  Distribution of individuals by activity status according to age  Municipality (LAU2)
3  Joint-distribution of households by type and size  Municipality (LAU2)
4  Probability to be the head of household according to the age and the type of household  Municipality (LAU2)
5  Probability of having a couple according to the difference of age between the partners (from "16 years" to "21 years")  National level
6  Probability to be a child (child = lives with parent) of a household according to the age and the type of household  Municipality (LAU2)
Generation of the individuals
For each municipality of the Auvergne region we generate a set of individuals with a stochastic procedure. For each individual of the age pyramid (distribution 1 in Table 3), we randomly choose an age in the corresponding bin and then randomly draw an activity status according to distribution 2 in Table 3.
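As an illustration, this step could look like the sketch below. The uniform draw of the age within its bin, the data layout (dictionaries keyed by age bin) and the name `generate_individuals` are assumptions; an open-ended bin such as "85 and more" would need an assumed upper bound.

```python
import random

def generate_individuals(age_pyramid, status_by_age, rng):
    """Generate (age, activity status) pairs for one municipality.

    age_pyramid: {(low, high): number of individuals}, ages in [low, high[.
    status_by_age: {(low, high): {status: probability}}.
    """
    people = []
    for (low, high), count in age_pyramid.items():
        statuses = list(status_by_age[(low, high)])
        probs = [status_by_age[(low, high)][s] for s in statuses]
        for _ in range(count):
            age = rng.randrange(low, high)   # assumed uniform within the bin
            status = rng.choices(statuses, weights=probs, k=1)[0]
            people.append((age, status))
    return people

# Toy inputs, for illustration only.
pyramid = {(5, 15): 3, (25, 35): 4}
status = {(5, 15): {"Student": 1.0},
          (25, 35): {"Active": 0.9, "Student": 0.1}}
people = generate_individuals(pyramid, status, random.Random(42))
```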
Generation of the households
For each municipality of the Auvergne region we generate a set of households, according to the total number of individuals, with a stochastic procedure. We draw households at random according to distribution 3 in Table 3 while the sum of their capacities is below the total number of individuals, and then we determine the last household so that the total number of individuals equals the sum of the sizes of the households.
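A possible sketch of this draw is shown below. How the last household is determined is not detailed in the text; here its size is simply reduced to the remaining number of individuals while its drawn type is kept, which is a simplifying assumption (the type may then need adjusting). The name `generate_households` and the toy distribution are also hypothetical.

```python
import random

def generate_households(n_individuals, type_size_dist, rng):
    """Draw (type, size) households until their capacities sum to n_individuals.

    type_size_dist: {(household type, size): probability}.
    """
    keys = list(type_size_dist)
    weights = [type_size_dist[k] for k in keys]
    households = []
    capacity = 0
    while capacity < n_individuals:
        h_type, size = rng.choices(keys, weights=weights, k=1)[0]
        if capacity + size > n_individuals:
            # Assumption: shrink the last household so that totals match.
            size = n_individuals - capacity
        households.append((h_type, size))
        capacity += size
    return households

# Toy joint distribution of type and size, for illustration only.
dist = {("Single", 1): 0.3,
        ("Couple without children", 2): 0.3,
        ("Couple with children", 4): 0.4}
households = generate_households(25, dist, random.Random(3))
```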
Distributions for affecting individual into household
Single

The age of the individual 1 is determined using the distribution 4 in Table 3.
Monoparental
Couple without child
Couple with child
Other

The age of the individual 1 is determined using the distribution 4 in Table 3.

The ages of the other individuals are determined according to the age of individual 1.
To obtain a synthetic population with households filled by individuals, we use Algorithm 2, where we approximate Equation 2 with the distributions 4, 5 and 6 in Table 3. We put no constraint on the number of individuals in the age pyramid, hence the reference population does not give any advantage to the sample-free method. Figure S1 and Figure S2 show the values obtained for the individuals' and households' attributes for the Auvergne region and for Marsac-en-Livradois, a municipality drawn at random among the 1310 Auvergne municipalities. These figures show the results obtained with the reference, the sample-free and the sample-based populations.
Comparing sample-free and sample-based approaches
The attributes of individuals and households are described in Table 4 and Table 5, respectively. The joint-distributions of these attributes give, respectively, the number of individuals of each individual type and the number of households of each household type. It is important to note that the number of household types is smaller than the 6 × 5 = 30 combinations of size and type in Table 5, because we remove from the list of household types the inconsistent values, for example single households with more than one individual. We do the same for the individual types (removing for example retired individuals aged between 0 and 5).
Table 4. Individual attributes
Attribute  Value
Age  [0,5[
Age  [5,15[
Age  [15,25[
Age  [25,35[
Age  [35,45[
Age  [45,55[
Age  [55,65[
Age  [65,75[
Age  [75,85[
Age  85 and more
Activity Status  Student
Activity Status  Active
Family Status  Head of a single household
Family Status  Head of a monoparental household
Family Status  Head of a couple without children household
Family Status  Head of a couple with children household
Family Status  Head of an other household
Family Status  Child of a monoparental household
Family Status  Child of a couple with children household
Family Status  Partner
Family Status  Other
Fitting accuracy measures
We need fitting accuracy measures to evaluate the adequacy between the observed and estimated distributions of both households and individuals. The first measure is the Proportion of Good Prediction (PGP) (Equation 3); we choose this first indicator for its ease of interpretation. In Equation 3 we multiply by 0.5 because, as the observed and estimated totals are equal, each misclassified individual or household is counted twice [11].
(3) 
Table 5. Household attributes
Attribute  Value
Size  1 individual
Size  2 individuals
Size  3 individuals
Size  4 individuals
Size  5 individuals
Size  6 and more individuals
Type  Single
Type  Monoparental
Type  Couple without children
Type  Couple with children
Type  Other
We use the χ² distance to perform a statistical test. Obviously the modalities with a zero value in the observed distribution are not included in the computation. If we consider the modalities different from zero in the observed distribution, the χ² distance follows a χ² distribution whose number of degrees of freedom is the number of these modalities minus one.
(4) 
For more details on the fitting accuracy measures see [12].
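The two measures can be sketched as below. The exact expressions are assumptions consistent with the description above: the PGP is taken as one minus half the summed absolute differences divided by the observed total, and the distance as the usual Pearson χ² form restricted to modalities with a non-zero observed count. The function names are hypothetical.

```python
def pgp(observed, estimated):
    """Proportion of Good Prediction between two count distributions.

    Assumes both distributions have the same total, so that each
    misclassified unit appears twice in the sum of absolute differences.
    """
    total = sum(observed)
    gap = sum(abs(o - e) for o, e in zip(observed, estimated))
    return 1.0 - 0.5 * gap / total

def chi2_distance(observed, estimated):
    """Chi-square distance and degrees of freedom.

    Modalities with a zero observed count are excluded from the computation.
    """
    pairs = [(o, e) for o, e in zip(observed, estimated) if o > 0]
    distance = sum((e - o) ** 2 / o for o, e in pairs)
    return distance, len(pairs) - 1

score = pgp([10, 20, 30], [12, 18, 30])
distance, dof = chi2_distance([10, 20, 0], [12, 18, 5])
```

A p-value would then be read from the χ² distribution with `dof` degrees of freedom, for instance with a statistics library.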
Sample-free approach
To test the sample-free approach, we extract from the reference population, for each municipality, the distributions presented in Table 3. We then apply the procedure used for generating the reference population, but now with the constraints on the number of individuals from the age pyramid derived from the reference (remember that we did not have such constraints when generating the reference population). We then fill the households with the individuals one at a time using the distributions for affecting individual into household. We limit the number of iterations to 1000 trials by household: if after 1000 trials a household is not filled, we put individuals in this household at random and we change its type to "other". We repeat the process 100 times and we choose, for each municipality, the synthetic population minimizing the distance between simulated and reference distributions for affecting individual into household.
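The trial limit and its fallback can be sketched as follows. The wrapper `fill_with_fallback`, the callable `try_fill` and the assumption that the household size is known when the fallback is applied are illustrative choices, not details given in the text.

```python
import random

def fill_with_fallback(h_type, size, pool, try_fill, rng, max_trials=1000):
    """Fill one household, falling back to a random "other" household.

    pool: list of remaining individuals (modified in place).
    try_fill(h_type, pool, rng) is a hypothetical filling attempt returning
    the members it removed from the pool, or None if the attempt failed.
    """
    for _ in range(max_trials):
        members = try_fill(h_type, pool, rng)
        if members is not None:
            return h_type, members
    # Fallback: put individuals in at random and relabel the household.
    k = min(size, len(pool))          # assumption: the size is kept
    members = [pool.pop(rng.randrange(len(pool))) for _ in range(k)]
    return "other", members

def never_succeeds(h_type, pool, rng):
    # Toy attempt that always fails, for illustration only.
    return None

def takes_last_two(h_type, pool, rng):
    # Toy attempt that always succeeds, for illustration only.
    return [pool.pop(), pool.pop()]

pool_a = list(range(10))
type_a, members_a = fill_with_fallback("couple", 3, pool_a, never_succeeds,
                                       random.Random(1), max_trials=5)
pool_b = list(range(10))
type_b, members_b = fill_with_fallback("couple", 2, pool_b, takes_last_two,
                                       random.Random(1))
```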
In order to assess the robustness of the stochastic sample-free approach, we generate 10 synthetic populations per municipality, yielding 13,100 synthetic municipality populations in total. For each of them and for each of the distributions for affecting individual into household, we compute the p-value associated with the distance between the reference and estimated distributions. As we can see in Figure 1a, the algorithm is quite robust.
To validate the algorithm we compute the proportion of good predictions for each of the 13,100 synthetic populations and for each joint-distribution. We obtain an average of 99.7% of good predictions for the household distribution and 91.5% of good predictions for the individual distribution (Figure 1b). We also compute the p-value of the distance between the estimated and reference distributions for each of the synthetic populations and for each joint-distribution. Among the 13,100 synthetic populations, 100% are statistically similar to the observed one at a 95% level of confidence for the household joint-distribution and 94% for the individual joint-distribution.
In order to understand the effect of the maximal number of iterations by household, we repeat the previous tests for different values of this parameter (1, 10, 100, 500, 1000, 1500 and 2000) and we compute the mean proportion of good predictions obtained for both individuals and households. We note that beyond 100 iterations the quality of the results no longer changes.
IPU
To use the IPU algorithm we need a sample of filled households and marginal variables. In order to obtain these data, we pick at random a significant sample of households from the reference population. We also extract from the reference population the two one-dimensional marginals (Size and Type distributions) that we need to build the household joint-distributions with IPF, and the three two-dimensional marginals (Age x Activity Status, Age x Family Status and Family Status x Activity Status) that we need to build the individual joint-distributions with IPF. Then we apply Algorithm 3, using the recommendations of [2] for the well-known zero-cell and zero-marginal problems, to obtain a weighted sample. With this sample we generate the synthetic population 100 times and choose the one with the lowest distance between reference and simulated individual joint-distributions.
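For the household case, where two one-dimensional marginals (Size and Type) are fitted, a standard two-dimensional IPF can be sketched as below. The seed table would typically hold the joint counts observed in the sample; the function name `ipf_2d`, the stopping rule and the toy numbers are assumptions made for illustration.

```python
def ipf_2d(seed, row_targets, col_targets, n_iter=100, tol=1e-10):
    """Fit a two-way table to row and column marginals by IPF.

    seed: initial table (list of rows), e.g. joint counts from the sample.
    row_targets / col_targets: marginals with equal totals.
    """
    table = [row[:] for row in seed]
    for _ in range(n_iter):
        # Scale each row to its target marginal.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [v * target / s for v in table[i]]
        # Scale each column, tracking how far it was from its target.
        max_dev = 0.0
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            max_dev = max(max_dev, abs(s - target))
            if s > 0:
                for row in table:
                    row[j] *= target / s
        if max_dev < tol:
            break
    return table

# Toy seed and marginals, for illustration only.
fitted = ipf_2d([[1.0, 2.0], [3.0, 4.0]], [30.0, 70.0], [40.0, 60.0])
```

Zero cells in the seed stay at zero under this scaling, which is the zero-cell problem mentioned above.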
To check the results obtained with the IPU approach, we generate 10 synthetic populations per municipality using different randomly selected samples of households. For each of these synthetic populations and for each joint-distribution we compute the proportion of good predictions (Figure 2a). We obtain an average of 98.6% of good predictions for the household distribution and 86.9% of good predictions for the individual distribution. To determine the error of estimation due to the IPF procedure, we compute the proportion of good predictions between the estimated and the IPF-reference distributions. As we can see in Figure 2b, the results are improved for the household distribution but not for the individual distribution. We also compute the p-value of the distance between the estimated and observed distributions for each of the synthetic populations and for each joint-distribution. Among the 13,100 synthetic populations, 100% are statistically similar to the observed one at a 95% level of confidence for the household joint-distribution and 61% for the individual joint-distribution. We obtain a similarity between the estimated and the IPF-objective distributions of 100% at a 95% level of confidence for the household distribution and 64% for the individual distribution.
In order to check the sensitivity of the results to the size of the sample, we plot, in Figure 2c, the average proportion of good predictions of the 13,100 household and individual joint-distributions for different values of the percentage of the reference households drawn at random in the sample (5, 10, 15, 20, 25, 30, 35, 40, 45 and 50). We note that the results are always good for the household distribution, but for the individuals the results are good only for random samples of at least 25% of the reference household population. Not surprisingly, the quality of the results globally increases with this parameter.
Discussion
The sample-free method is less data demanding but it requires more data preprocessing. Indeed, this approach requires extracting the distributions for affecting individual into household from the data. The sample-free method gives a better fit between observed and simulated distributions, for both households and individuals, than the IPU approach. We can observe in Figure 3 that, for both methods, the goodness-of-fit is negatively correlated with the number of inhabitants. This observation is especially true for the IPU method because it depends on the number of individuals in the sample. Indeed, the lower the number of individuals, the higher the number of sparse cells in the individual distribution. The results obtained with the IPU approach depend on the quality of the initial sample. The execution time on a desktop machine (PC Intel 2.83 GHz) is almost the same for a maximum of 100 iterations by household with the sample-free method and for a sample of 25% of the reference households drawn at random with the sample-based approach.
To conclude, the sample-free method gives globally better results in this application on small French municipalities. These results confirm those of [10], who compared their sample-free method for working with data from different sources with a sample-based method [7] and obtained similar conclusions. Of course, these conclusions cannot be generalized to all sample-free and sample-based methods without further investigation. However, these results confirm the possibility of accurately initialising microsimulation (or agent-based) models using widely available data (and without any sample of households).
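Execution times of the sample-based (IPU) and sample-free (iterative) methods
IPU sample size (% of reference households)  IPU time  Iterative: maximal iterations by household  Iterative time
5  13min  1  40min
10  24min  10  41min
15  29min  100  45min
20  38min  500  58min
25  45min  1000  66min
30  53min  1500  78min
40  74min  2000  88min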
Acknowledgements
This publication has been funded by the Prototypical policy impacts on multifunctional activities in rural municipalities collaborative project, European Union 7th Framework Programme (ENV 2007-1), contract no. 212345. The work of the first author has been funded by the Auvergne region.
References
[1] F. Gargiulo, S. Ternes, S. Huet, and G. Deffuant. An iterative approach for generating statistically realistic populations of households. PLoS ONE, 5, 2010.
[2] X. Ye, K. Konduri, R. M. Pendyala, B. Sana, and P. Waddell. A methodology to match distributions of both household and person attributes in the generation of synthetic populations. In 88th Annual Meeting of the Transportation Research Board, 2009.
[3] A. G. Wilson and C. E. Pownall. A new representation of the urban system for modelling and for the study of micro-level interdependence. Area, 8(4):246–254, 1976.
[4] W. E. Deming and F. F. Stephan. On a least squares adjustment of a sample frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11:427–444, 1940.
[5] R. J. Beckman, K. A. Baggerly, and M. D. McKay. Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice, 30(6 PART A):415–429, 1996.
[6] Z. Huang and P. Williamson. A comparison of synthetic reconstruction and combinatorial optimization approaches to the creation of small-area microdata. Working paper, Department of Geography, University of Liverpool, 2002.
[7] J. Y. Guo and C. R. Bhat. Population synthesis for microsimulating travel behavior. Transportation Research Record: Journal of the Transportation Research Board, 2014:92–101, 2007.
[8] T. Arentze, H. Timmermans, and F. Hofman. Creating synthetic household populations: Problems and approach. Transportation Research Record: Journal of the Transportation Research Board, 2014:85–91, 2007.
[9] D. Voas and P. Williamson. An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata. International Journal of Population Geography, 6(5):349–366, 2000.
[10] J. Barthelemy and P. Toint. Synthetic population generation without a sample. Transportation Science, 47:266–279, 2013.
[11] K. Harland, A. Heppenstall, D. Smith, and M. Birkin. Creating realistic synthetic populations at varying spatial scales: A comparative critique of population synthesis techniques. Journal of Artificial Societies and Social Simulation, 15(1):1, 2012.
[12] D. Voas and P. Williamson. Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2):177–200, 2001.