Quantitative Analysis of Genealogy Using Digitised Family Trees
Driven by the popularity of television shows such as Who Do You Think You Are?, many millions of users have uploaded their family trees to web projects such as WikiTree. Analysis of this corpus enables us to investigate genealogy computationally. The study of heritage in the social sciences has led to an increased understanding of ancestry and descent, but such efforts are hampered by difficult-to-access data. Genealogical research is typically a tedious process that involves trawling through sources such as birth and death certificates, wills, letters and land deeds. Decades of research have developed and examined hypotheses on population sex ratios, marriage trends, fertility, lifespan, and the frequency of twins and triplets. These can now be tested on vast datasets containing many billions of entries using machine learning tools. Here we survey the use of genealogy data mining using family trees dating back centuries and featuring profiles on nearly 7 million individuals based in over 160 countries. These data are not typically created by trained genealogists and so we verify them with reference to third party censuses. We present results on a range of aspects of population dynamics. Our approach extends the boundaries of genealogy inquiry to precise measurement of underlying human phenomena.
Keywords. Computational Genealogy; Genealogy; Data Mining; Name Trends; WikiTree
Telekom Innovation Laboratories at Ben-Gurion University
Nottingham University Business School
Genealogy is the study of family origins, bloodlines, and history. Recent advances in web technologies have encouraged millions of amateur genealogists to discover, assemble and share their family history by constructing their own online family trees. These collaborative online genealogy websites create large-scale data containing billions of entries regarding people who lived in past centuries [5, 6], and there is considerable interest in contemporary relationships too, as evidenced by the Global Family Reunion campaign. The resulting corpora consist of personal information on family members going back many generations, and they provide details for each individual such as their date and place of birth and death, and their place in their family tree. These data create unique opportunities to study many aspects of human life and of the human life cycle over the past centuries.
The data we use were provided by WikiTree, a free, collaborative worldwide family tree project created by a community of amateur genealogists. Data are available on 6.67 million people in over 160 countries (but mainly the US, UK, Germany, Canada, New Zealand and Holland) going as far back as the first century. We coded relationships between individuals as either spouse, child, parent or sibling, and where available additional personal data were attached recording sex, the year and country of birth and death, and marriage date and location. The data therefore have three main dimensions: time, location and personal characteristics. Limits were set on personal characteristics, and values falling outside them (such as an age at death of 122+) were replaced with a missing value placeholder. WikiTree validated the data using its in-house procedures, which include checking source materials and making individuals’ profiles editable only by a limited list of users. We provided additional validation by comparing lifespans in the data with those reported by third party sources [8, 9].
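The range check described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the field names, the function `age_at_death`, and the choice of `None` as the missing value placeholder are assumptions, and the only limit taken from the text is the example of an age at death of 122 or more.

```python
from typing import Optional

# Assumed plausibility limit; the text gives "an age at death of 122+"
# as its only example of an out-of-range value.
MAX_AGE_AT_DEATH = 122


def age_at_death(birth_year: Optional[int], death_year: Optional[int]) -> Optional[int]:
    """Return age at death in years, or None (the missing value placeholder)
    when a year is absent or the resulting age is implausible."""
    if birth_year is None or death_year is None:
        return None
    age = death_year - birth_year
    if age < 0 or age >= MAX_AGE_AT_DEATH:
        return None
    return age


# Hypothetical profiles, for illustration only.
profiles = [
    {"id": "A", "birth_year": 1820, "death_year": 1891},
    {"id": "B", "birth_year": 1700, "death_year": 1850},  # 150 years: implausible
    {"id": "C", "birth_year": 1900, "death_year": None},  # death year unknown
]
ages = {p["id"]: age_at_death(p["birth_year"], p["death_year"]) for p in profiles}
print(ages)
```

Replacing implausible values with a placeholder, rather than dropping the whole profile, keeps the individual available for analyses that do not depend on the affected field.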
These data allow for many analyses that would be cumbersome or impossible using traditional genealogy research. For example, one traditional study examined the lifespans of 53,000 people, reconstructing pedigrees on 1,000 living Italians back to the early 17th century using family books and parish registers up to 1924 and municipality lists thereafter. The parish registers, which record baptisms, marriages, and burials, were viewed on microfilm and researchers had to physically attend the archive at Bolzano. Assembling these data took two years. We were able to do a similar study in under a week.
Name trends. To start, we illustrate the use of the data by highlighting trends in given names alongside cultural events. Figure 1 plots, against time, the number of times each selected given name was used in a year divided by the total number of babies born in that year. The graph for ‘Wendy’, for instance, grows with the popularity of the Peter Pan story: originally staged in 1904, with the novel published in 1911 and related films and books appearing from the 1920s onwards up to Leonard Bernstein’s 1950 musical and the Disney film in 1953. Contrary to popular belief, the name Wendy existed prior to Peter Pan.
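The quantity plotted in Figure 1 can be computed along the following lines. This is a sketch under stated assumptions: the input is taken to be a list of (given name, birth year) pairs, and the function name `name_share_by_year` and the sample records are hypothetical.

```python
from collections import Counter


def name_share_by_year(births, name):
    """births: iterable of (given_name, birth_year) pairs.
    Returns {year: uses of `name` in that year / all births in that year}."""
    totals = Counter()
    hits = Counter()
    for given_name, year in births:
        totals[year] += 1
        if given_name == name:
            hits[year] += 1
    # Counter returns 0 for years in which the name never occurs.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical records, for illustration only.
births = [
    ("Mary", 1900), ("John", 1900),
    ("Wendy", 1950), ("Mary", 1950), ("John", 1950), ("Wendy", 1950),
]
shares = name_share_by_year(births, "Wendy")
print(shares)
```

Normalising by the number of births per year matters here because the number of recorded profiles varies greatly between periods, so raw counts would mostly reflect data coverage.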
Another interesting trend is the variation in children’s given names. The ratio of unique given names used per decade from the year 1000 to 2000 is shown in Figure 2 (only names that appear at least 10 times in a decade are counted; other names were ignored). The chart shows low variation during the High Middle Ages and the Victorian era, suggesting that the desire to pick an uncommon name for children is not new. The trend of naming a son after his father rises then falls through the 16th century, and throughout history there have been fewer girls named for their mother than boys named for their father (Figure 3). About 24% of twins’ names start with the same letter. The most frequent twin names between 1800 and 1900 are Mary and Martha, and John and James.
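One possible reading of the per-decade measure is sketched below. The text does not state the denominator of the ratio, so this sketch assumes it is the number of births carrying a qualifying name in that decade; the function name `unique_name_ratio` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def unique_name_ratio(births, min_count=10):
    """births: iterable of (given_name, birth_year) pairs.
    For each decade, keep only names used at least `min_count` times, then
    return (number of distinct kept names) / (births carrying a kept name).
    The denominator is an assumption; the text does not define it."""
    by_decade = defaultdict(Counter)
    for given_name, year in births:
        by_decade[year // 10 * 10][given_name] += 1
    ratios = {}
    for decade, counts in sorted(by_decade.items()):
        kept = {n: c for n, c in counts.items() if c >= min_count}
        if kept:  # skip decades in which no name reaches the threshold
            ratios[decade] = len(kept) / sum(kept.values())
    return ratios


# Hypothetical records, for illustration only.
births = (
    [("John", 1801)] * 10 + [("Mary", 1805)] * 10 + [("Zed", 1809)] * 2
    + [("John", 1812)] * 20
)
ratios = unique_name_ratio(births)
print(ratios)
```

Under this reading a lower ratio means that births are concentrated on fewer names, which corresponds to low variation in naming.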
Births and fertility. We plotted the average age at which mothers gave birth to their first and last born against time in decade sized bins (Figure 4). The graphs show an upward trend in both as women tended to wait longer before having children. We then examined the frequency of twins and triplets. Hellin’s Law states that one in every 89 human maternities is twins and one in every 7,921 (89 squared) is triplets, although a proof exists that this cannot hold as a general rule and many exceptions have been found. Nevertheless, we assessed Hellin’s Law with the WikiTree data using all births which occurred between 1800 and 1900 (where the bulk of recorded births occur) and found support that it at least approximates reality. Of 963,416 births, 10,246 were twins (1.06%) and 128 were triplets (0.013%). Twin gender ratios were almost even. Where gender data are available the sets of twins were: male-male – 3,257 (32.7%); female-female – 3,376 (33.9%); and male-female – 3,307 (33.3%).
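The comparison with Hellin’s Law is simple arithmetic on the counts reported above, and can be reproduced as follows. The sketch treats the reported twin and triplet counts as directly comparable to Hellin’s rates per maternity, which is an assumption; the variable names are illustrative.

```python
# Counts reported in the text for births between 1800 and 1900.
births = 963_416
twins = 10_246
triplets = 128

observed_twin_rate = twins / births
observed_triplet_rate = triplets / births

# Hellin's Law: twins in 1 of 89 maternities, triplets in 1 of 89**2.
hellin_twin_rate = 1 / 89
hellin_triplet_rate = 1 / 89**2

print(f"twins:    observed {observed_twin_rate:.3%}, Hellin {hellin_twin_rate:.3%}")
print(f"triplets: observed {observed_triplet_rate:.4%}, Hellin {hellin_triplet_rate:.4%}")
```

Both observed rates fall within roughly 5 to 6 percent of the values the law predicts, which is the sense in which the law approximates these data.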
The relationship between natural factors and the human sex ratio remains an active area of scientific research, and these data may be able to contribute to such efforts. Examining the gender ratios in our data, we observe a small but steady rise in the proportion of females from the Middle Ages on (Figure 5). We acknowledge the possibility, as others have, that this could be because births of males tended to be recorded more often than those of females. However, studies in developed countries have found that the human sex ratio at birth has historically varied for natural reasons.
Marriages. The age at which individuals first got married is shown in Figure 6. The general trend is for males to marry later than females in any given time period, and for the age at first marriage to increase over time. The raw data corroborate that during the medieval period it was not unknown for girls aged 12 and boys aged 14 to marry, but the trend shown in the graph is that these young ages did not represent the average.
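An average of this kind can be sketched as a simple grouped mean. The binning by decade of marriage, the record layout (sex, birth year, year of first marriage), and the function name `mean_first_marriage_age` are assumptions made for illustration; the text does not specify how Figure 6 was binned.

```python
from collections import defaultdict


def mean_first_marriage_age(records):
    """records: iterable of (sex, birth_year, first_marriage_year).
    Returns {(decade_of_marriage, sex): mean age at first marriage}."""
    cells = defaultdict(lambda: [0, 0])  # (decade, sex) -> [sum of ages, count]
    for sex, born, married in records:
        if born is None or married is None:
            continue  # skip profiles with missing years
        cell = cells[(married // 10 * 10, sex)]
        cell[0] += married - born
        cell[1] += 1
    return {key: total / n for key, (total, n) in sorted(cells.items())}


# Hypothetical records, for illustration only.
records = [
    ("M", 1800, 1825), ("M", 1802, 1829),
    ("F", 1805, 1826), ("F", 1806, 1824),
    ("F", None, 1830),  # missing birth year: ignored
]
means = mean_first_marriage_age(records)
print(means)
```

Averages of this kind smooth over the very young marriages present in the raw data, which is why such cases can exist without being visible in the plotted trend.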
Lifespan. Previous studies have found that spouses have an impact on an individual’s lifespan. We find support for this – if an individual’s spouse lives longer, then that individual tends to live longer too: the age at death in years of one partner correlates with the age at death of the other with a Pearson coefficient of . Twins also tend to have similar lifespans: the age at death of Twin 1 correlates with the age at death of Twin 2 with .
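For reference, the Pearson coefficient between paired ages at death can be computed as in the sketch below. The data are hypothetical and do not reproduce the coefficients of this study; the function name `pearson` and the handling of incomplete pairs are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical (age at death of partner 1, age at death of partner 2) pairs.
couples = [(71, 68), (54, 60), (83, 79), (66, None), (47, 52)]
# Keep only couples for which both ages are known.
complete = [(a, b) for a, b in couples if a is not None and b is not None]
r = pearson([a for a, _ in complete], [b for _, b in complete])
print(f"r = {r:.3f} over {len(complete)} couples")
```

The same function applies unchanged to twin pairs, with each pair contributing the age at death of Twin 1 and of Twin 2.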
Computational genealogy. Computational genealogy is the application of machine learning tools, graph analysis and related techniques to the analysis of high-volume ancestry data, and is an emerging branch of computational social science. The results are a new type of evidence in social science. Here we have given a brief survey of early findings, but the field opens up many more possibilities, including:
- Highlighting immigration trends, for instance by looking at surname changes between Italian and Irish immigration to the US.
- Charting lifespan alongside economic trends. Genealogical data can highlight wars, disease outbreaks and vaccinations on the charts and may be able to quantify the impact and spread of vaccinations and other health innovations. We have found, for example, a peak in births around 9 months after the end of the American Civil War, and a peak in deaths at the time of the Battle of Flodden.
- Assessing the impact of key religious events such as the Reformation (1517) on the frequency of marriages and births.
- Analysing how families moved and split geographically. In the past, did entire families (grown siblings) relocate together? Was it common for children or grandchildren to move back to an ancestral family home?
- Identifying genetic diseases by searching for patterns in age at death within the same family.
- Combining two or more sources on a famous bloodline, for instance by using the WikiTree data alongside Wikipedia.
The results of such investigations will be of interest to a range of branches of social science.
- URL http://www.wikitree.com/.
- Zerubavel, E. Ancestors and Relatives (Oxford University Press, 2012).
- Duff, W. & Johnson, C. Where is the list with all the names? Information-seeking behavior of genealogists. The American Archivist 66, 79–95 (2003).
- Cortada, W. Everyday Information, chap. Genealogy as a hobby (MIT Press).
- URL http://www.geni.com/.
- Fire, M. & Elovici, Y. Data mining of online genealogy datasets for revealing lifespan patterns in human population. ACM Transactions on Embedded Computing Systems x, x (2014).
- URL http://globalfamilyreunion.com/.
- Office for National Statistics. Mortality in England and Wales: Average life span 2010 (2010). URL http://www.ons.gov.uk/ons/rel/mortality-ageing/mortality-in-england-and-wales/average-life-span/rpt-average-life-span.html.
- Mitchell, B. et al. Heritability of life span in the Old Order Amish. American Journal of Medical Genetics 102, 346–352 (2001).
- Gögele, M. et al. Heritability analysis of life span in a semi-isolated population followed across four centuries reveals the presence of pleiotropy between life span and reproduction. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 66A, 26–37 (2011).
- Wattenberg, M. Baby names, visualization, and social data analysis. In IEEE Symposium on Information Visualization (Minneapolis, MN, USA, 2005).
- Fellman, J. & Eriksson, A. Biometric analysis of the multiple maternities in Finland 1881-1990 and in Sweden since 1751. Human Biology 65, 463–479 (1993).
- Fellman, J. & Eriksson, A. On the history of Hellin’s law. Twin Research and Human Genetics 12, 183–190 (2009).
- URL http://www.brown.edu/Departments/Italian_Studies/dweb/society/sex/sex-spouses.php.
- Lillard, L. A. & Waite, L. J. Til death do us part: Marital disruption and mortality. American Journal of Sociology 100, 1131–1156 (1995).
- Lazer, D. et al. Computational social science. Science 323, 721–723 (2009).
- James, W. H. Evidence that mammalian sex ratios at birth are partially controlled by parental hormone levels around the time of conception. Journal of Endocrinology 198, 3–15 (2008).