Data Mining of Online Genealogy Datasets for Revealing Lifespan Patterns in Human Population
Online genealogy datasets contain extensive information about millions of people and their past and present family connections. This vast amount of data can assist in identifying various patterns in human population. In this study, we present methods and algorithms which can assist in identifying variations in lifespan distributions of human population in the past centuries, in detecting social and genetic features which correlate with human lifespan, and in constructing predictive models of human lifespan based on various features which can easily be extracted from genealogy datasets.
We have evaluated the presented methods and algorithms on a large online genealogy dataset with over a million profiles and over 9 million connections, all of which were collected from the WikiTree website. Our findings indicate that significant but small positive correlations exist between the parents’ lifespan and their children’s lifespan. Additionally, we found slightly higher and significant correlations between the lifespans of spouses. We also discovered a very small positive and significant correlation between longevity and reproductive success in males, and a small and significant negative correlation between longevity and reproductive success in females. Moreover, our machine learning algorithms presented better than random classification results in predicting which people who outlive the age of 50 will also outlive the age of 80.
We believe that this study will be the first of many studies which utilize the wealth of data on human populations, existing in online genealogy datasets, to better understand factors which influence human lifespan. Understanding these factors can assist scientists in providing solutions for successful aging.
Keywords. Genealogy Data Mining, Aging, Gerontology, Human Population Lifespan, Lifespan Prediction, Data Mining, Machine Learning, WikiTree
In the last decade, Web 2.0 websites, such as Wikipedia
The family tree structure and the family members’ personal details that are stored in these genealogy websites create large-scale datasets, which contain billions of entries [10, 2] on human life and death properties. These datasets can be utilized to reveal interesting patterns regarding lifespan changes over the centuries. Additionally, these datasets can also assist in better understanding and identifying characteristics which are correlated with human lifespan changes. For example, these datasets can be explored and utilized to answer the following questions: Does having more children extend one’s lifespan? Does having long-lived ancestors prolong life? Does getting married lengthen one’s lifespan? Answering these types of questions can assist scientists in providing insights and solutions for successful aging.
In this study, we present data mining algorithms for analyzing large genealogy datasets in order to examine human population lifespan variations over a substantial length of time (see Section 3.3.1). Moreover, we introduce methods to utilize these types of datasets to identify features which correlate with human lifespan (see Section 3.3.2). Additionally, we also present Machine Learning (ML) algorithms based on features extracted from genealogy datasets, which can assist in predicting if a particular 50-year-old individual will reach the age of 80 (see Section 3.3.4).
To test and evaluate our algorithms, we developed a web crawler which crawled and parsed public profiles from the WikiTree website. WikiTree is a free, collaborative family-history website, which contains more than 5 million user-contributed profiles  of individuals who have lived in the past centuries, and many of the profiles contain personal details about each individual. Using the collected data from WikiTree, we were able to construct a dataset (referred to as the WikiTree dataset) of over a million public profiles, out of which at least 416,030 profiles were of individuals who were born in the United States (see Section 3.4).
By analyzing the WikiTree dataset, we calculated various statistics on variations of population lifespan over the last centuries, including specific statistics on the lifespan variations of the United States population (see Section 3.3 and Figures 4 and 4). As a result of this analysis, we discovered several interesting historical lifespan change patterns (see Section 5); for example, we discovered that the average lifespan of females who were born in the United States and lived beyond the age of ten increased sharply in just a half-century: from 62.66 in 1850 to 72.5 in 1900 (see Figure 4).
Using the WikiTree dataset, we constructed a social network directed multigraph which contains over 1.38 million vertices and over 9.19 million links (see Section 3.1 and Table 3). We then analyzed the social network graph and extracted 21 features, such as parents’ and grandparents’ ages of death, for each vertex in the graph (see Section 3.2). By using the extracted features and simple linear regression models, we discovered significant correlations with low coefficients of determination between the individuals’ ages of death and the ages of death of their siblings, parents, spouses, and grandparents (see Table 5). We also discovered a slightly higher significant correlation between the individual’s age of death and the age of death of his or her spouse (see Table 5). Additionally, we constructed multiple linear regression models for predicting an individual’s age of death based on various features which were extracted from the individual’s personal details. Our multiple linear regression models were highly significant, with Multiple Adjusted R-squared values of up to 0.085 (see Table 6).
Our ML classifiers have presented better than random results in predicting which individuals who outlived the age of fifty and passed the age of menopause will also outlive the age of 80 (see Section 4.3).
The remainder of the paper is structured as follows: In Section 2 we give a brief overview of previous relevant studies on characteristics which were found to be correlated with human lifespan. In this section, we also introduce several studies which used data mining algorithms similar to those used in this study. Next, in Section 3 we present the methods and algorithms we developed for studying genealogy datasets. In this section, we also describe our constructed WikiTree dataset. Then, in Section 4 we present our algorithm evaluation results on the WikiTree dataset. Lastly, in Section 5 we discuss our results and offer future research directions.
2 Related Work
The factors that influence human lifespan have been thoroughly studied over the past decades [20, 14, 13, 11, 8]. In this section we give a brief overview of recent genealogical studies that are most relevant to this study, pinpointing similar factors. Additionally, we also give a short overview of recent studies in the field of social network analysis and data mining, which used a similar methodology to the one used throughout this study.
In recent years, many studies have tried to find correlations between parents’ and children’s lifespans, as well as correlations between lifespans of parents and their number of children. In 1998, Westendorp and Kirkwood used a historical dataset from the British aristocracy to study the connection between longevity and reproductive success. They discovered that longevity was positively correlated with age at first childbirth, and negatively correlated with number of children. In 2000, Thomas et al. studied the connection between longevity and fertility using a statistical dataset of 153 countries. They concluded that “humans who invest heavily in reproduction while young will, on average, pay for this reproductive success with a shortened lifespan.” In 2001, Mitchell et al. used genealogical data of Old Order Amish members to estimate the parent-child correlations in lifespan. They also estimated the child’s age at death as a function of the parent’s age at death. They discovered significant but small correlations between parental and child ages at death.
In 2006, McArdle et al.  studied the correlation between the number of children and lifespan using genealogical data of 2,015 individuals who were members of an Old Order Amish community. In their study they discovered lifespans of fathers increased linearly with increasing number of children, while lifespans of mothers increased linearly up to 14 children but decreased with each additional child beyond 14. In 2007, Le Bourg  presented a thorough review of studies which researched the relationship between fertility and longevity under various conditions. According to Le Bourg, the review results indicated that “in natural fertility conditions longevity does not decrease when the number of children increases but, in modern populations, mortality could slightly increase when women have more than ca 5 children.” In 2011, Gögele et al.  conducted a comprehensive genealogical study with a thorough assessment of the heritability of lifespan and longevity in three villages in Italy. Their research, which included studying more than 50,000 individuals across four centuries, discovered “a general low inheritance of human lifespan, but which increases substantially when considering long-living individuals, and a common genetic background of lifespan and reproduction.”
Many studies found connections between an excess in mortality and bereavement, also known as the “widow effect.” In 1969, Parkes et al.  followed 4,486 widowers at the age of 55 for nine years. Out of these widowers, 213 died during their first six months of bereavement, 40% above the expected rate for married men of the same age. In 1996, Martikainen et al.  conducted a large scale study of 1,580,000 married Finnish individuals and also discovered excess mortality among the bereaved. In 2008, Elwert and Christakis  studied 373,189 elderly married couples in the United States. They discovered that the death of a spouse from almost all causes increased the mortality of the bereaved partner to varying degrees.
In our research we used several regression and ML techniques for lifespan prediction. In order to carry out our work, we mainly used attributes which could be extracted from genealogy datasets in order to construct the genealogy social network and extract features from the network (see Section 3). Similar techniques that involve social network analysis and regression were used by Christakis  in researching the spread of obesity, by Altshuler et al.  in predicting the individual parameters and social links of smart-phone users, and by Fire et al.  in predicting students’ final exam scores.
3 Methods and Experiments
To cope with the challenge of analyzing a huge online genealogy dataset with tens of millions of records on individuals’ personal data and their connections, we first chose to convert the dataset into a social network represented by a directed multigraph where vertices represent people and links represent connections among family members (see Section 3.1). Next, we used the constructed social network graph and extracted various features from each vertex, such as the vertex’s number of children, year of birth, and gender (see Section 3.2). We then used the extracted features to determine various statistics on the population lifespan variations over time (see Section 3.3.1). After that, we used linear regression to find the features that significantly influence human lifespan. We also constructed multi-linear regression models for lifespan prediction (see Section 3.3.2). Lastly, we used ML algorithms to construct classifiers which can predict if a person from the United States who outlives the age of fifty will also reach the age of eighty (see Section 3.3.4).
To perform our statistical calculations and to construct our predictive models, we used various datasets that were extracted from a large genealogy dataset. These datasets are defined in Table 1. Additionally, the methods and algorithms we have used throughout this study are summarized in Table 2.
3.1 Constructing the Genealogy Social Networks
We constructed the social network directed multigraph $G = \langle V, E \rangle$ from the genealogy dataset in the following manner: First, we assembled the graph vertex set $V$ by adding a new vertex for each profile in the genealogy dataset. We then defined $E$ as the multiset of links in the graph, with each link defined to be a tuple $e = (u, v, t, d)$, where $u, v \in V$; $t$ is the link type, drawn from a fixed set of family-relation types; and $d$ is the creation date of the link. For example, if the genealogy dataset contains the profiles of Queen Elizabeth II and Prince Charles, then the social network graph will contain one vertex for each of the two profiles and two links connecting them. When the genealogy dataset contained a link between a public profile and a private profile, we added a new vertex for the private profile to $V$ and added the new link to $E$.
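The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Link` tuple, the `build_multigraph` function, the profile identifiers, and the link-type labels (`"child"`, `"mother"`) are all hypothetical names chosen for the example.

```python
from collections import namedtuple

# A link mirrors the tuple described in the text: two endpoints, a link
# type, and a creation date. Field names are illustrative assumptions.
Link = namedtuple("Link", ["source", "target", "link_type", "created"])


def build_multigraph(profiles, relations):
    """Build a directed multigraph from profile ids and relation records.

    profiles  -- iterable of crawled (public) profile ids
    relations -- iterable of (source_id, target_id, link_type, created)
    Returns (vertices, links). Links are kept in a list, so parallel
    links between the same pair of vertices are preserved.
    """
    vertices = set(profiles)
    links = []
    for source, target, link_type, created in relations:
        # A link may point at a profile that was not crawled (for example
        # a private one); add a vertex for it so the link is not lost.
        vertices.add(source)
        vertices.add(target)
        links.append(Link(source, target, link_type, created))
    return vertices, links


# Hypothetical records for the example given in the text.
profiles = ["elizabeth_ii", "charles"]
relations = [
    ("elizabeth_ii", "charles", "child", "2012-01-01"),
    ("charles", "elizabeth_ii", "mother", "2012-01-01"),
    ("charles", "private_1", "child", "2012-01-02"),
]
vertices, links = build_multigraph(profiles, relations)
```

Keeping the links in a list rather than a set is what makes this a multigraph: two profiles can be connected by several typed links at once.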
3.2 Feature Extraction
After constructing the social network graph, we can extract, if possible, three types of features for each vertex: The first type is the vertex general profile features, which include basic information about the vertex, such as birth year, gender, and full name. The second type is the nuclear family features, which include information about the vertex’s children and spouses. The third and last type of features is the extended family features, which include information about the vertex’s parents, siblings, and grandparents. In this study, we extracted a total of 21 features for each vertex $v$. In the remainder of this section, we introduce and give formal definitions for each one of these features.
Full-Name($v$) - The full name of $v$.
Birth-Year($v$) - The birth year of $v$.
Death-Year($v$) - The death year of $v$.
Gender($v$) - The gender of $v$ converted to an integer, where male is set to 1, female is set to 2, and unknown gender is set to 0.
Birth-Country($v$) - The country in which $v$ was born.
Death-Country($v$) - The country in which $v$ died.
Age-of-Death($v$) - The age of death of $v$ (also referred to as the lifespan of $v$), which is calculated, if accurate dates are available, by subtracting the birth date of $v$ from the death date of $v$.
Nuclear Family Features
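Children-Number($v$) - the number of children which $v$ had. The formal Children-Number($v$) definition is:
$\text{Children-Number}(v) := |\text{Children}(v)|$,
where Children($v$) is the set of vertices linked to $v$ as its children.
Spouse-Number($v$) - the number of individuals to whom $v$ was married. The formal Spouse-Number($v$) definition is:
$\text{Spouse-Number}(v) := |\text{Spouses}(v)|$,
where Spouses($v$) is the set of vertices linked to $v$ as its spouses.
Min-Spouse-Age-of-Death($v$) - the minimum age of death of $v$’s spouses. The formal Min-Spouse-Age-of-Death($v$) definition is:
$\text{Min-Spouse-Age-of-Death}(v) := \min(\{\text{Age-of-Death}(u) \mid u \in \text{Spouses}(v)\})$,
where the function min($S$) returns the minimum value among the members of the set $S$, or 0 if $S$ is empty.
Max-Spouse-Age-of-Death($v$) - the maximum age of death of $v$’s spouses. The formal Max-Spouse-Age-of-Death($v$) definition is:
$\text{Max-Spouse-Age-of-Death}(v) := \max(\{\text{Age-of-Death}(u) \mid u \in \text{Spouses}(v)\})$,
where the function max($S$) returns the maximum value among the members of the set $S$, or 0 if $S$ is empty.
Avg-Spouse-Age-of-Death($v$) - the average age of death of $v$’s spouses. The formal Avg-Spouse-Age-of-Death($v$) definition is:
$\text{Avg-Spouse-Age-of-Death}(v) := \text{avg}(\{\text{Age-of-Death}(u) \mid u \in \text{Spouses}(v)\})$,
where the function avg($S$) returns the average value of the members of the set $S$, or 0 if $S$ is empty.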
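The spouse features above, including the fallback to 0 for an empty set, might be computed as in the following sketch. The function name `spouse_features` and the two dictionaries it takes are hypothetical. Skipping spouses whose age of death is unknown is an assumption of this example, not something the text specifies.

```python
def spouse_features(vertex, spouses, age_of_death):
    """Return the spouse-related nuclear family features for one vertex.

    spouses      -- dict: vertex -> list of spouse vertices
    age_of_death -- dict: vertex -> age of death (absent if unknown)
    """
    partner_list = spouses.get(vertex, [])
    # Assumption: spouses with an unknown age of death are ignored.
    ages = [age_of_death[s] for s in partner_list if s in age_of_death]
    return {
        "Spouse-Number": len(partner_list),
        # min/max/avg return 0 when the set of known ages is empty.
        "Min-Spouse-Age-of-Death": min(ages) if ages else 0,
        "Max-Spouse-Age-of-Death": max(ages) if ages else 0,
        "Avg-Spouse-Age-of-Death": sum(ages) / len(ages) if ages else 0,
    }
```

Because 0 doubles as the "no data" value, any later analysis has to filter such rows out rather than treat 0 as a real age.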
Extended Family Features
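Father-Age-of-Death($v$) - the age of death of $v$’s father. The formal Father-Age-of-Death($v$) definition is:
$\text{Father-Age-of-Death}(v) := \text{Age-of-Death}(\text{Father}(v))$,
where the function Father($v$) returns the father vertex of $v$, if one exists.
Mother-Age-of-Death($v$) - the age of death of $v$’s mother. The formal Mother-Age-of-Death($v$) definition is:
$\text{Mother-Age-of-Death}(v) := \text{Age-of-Death}(\text{Mother}(v))$,
where the function Mother($v$) returns the mother vertex of $v$, if one exists.
Paternal-Grandfather-Age-of-Death($v$) - $v$’s paternal grandfather’s age of death, if one exists. The formal Paternal-Grandfather-Age-of-Death($v$) definition is:
$\text{Paternal-Grandfather-Age-of-Death}(v) := \text{Age-of-Death}(\text{Father}(\text{Father}(v)))$
Maternal-Grandfather-Age-of-Death($v$) - $v$’s maternal grandfather’s age of death, if one exists. The formal Maternal-Grandfather-Age-of-Death($v$) definition is:
$\text{Maternal-Grandfather-Age-of-Death}(v) := \text{Age-of-Death}(\text{Father}(\text{Mother}(v)))$
Paternal-Grandmother-Age-of-Death($v$) - $v$’s paternal grandmother’s age of death, if one exists. The formal Paternal-Grandmother-Age-of-Death($v$) definition is:
$\text{Paternal-Grandmother-Age-of-Death}(v) := \text{Age-of-Death}(\text{Mother}(\text{Father}(v)))$
Maternal-Grandmother-Age-of-Death($v$) - $v$’s maternal grandmother’s age of death, if one exists. The formal Maternal-Grandmother-Age-of-Death($v$) definition is:
$\text{Maternal-Grandmother-Age-of-Death}(v) := \text{Age-of-Death}(\text{Mother}(\text{Mother}(v)))$
Sibling-Number($v$) - the number of brothers and sisters $v$ had. The formal Sibling-Number($v$) definition is:
$\text{Sibling-Number}(v) := |\text{Siblings}(v)|$,
where Siblings($v$) is the set of vertices linked to $v$ as its siblings.
Max-Sibling-Age-of-Death($v$) - the maximum age of death of $v$’s siblings. The formal Max-Sibling-Age-of-Death($v$) definition is:
$\text{Max-Sibling-Age-of-Death}(v) := \max(\{\text{Age-of-Death}(u) \mid u \in \text{Siblings}(v)\})$
Avg-Sibling-Age-of-Death($v$) - the average age of death of $v$’s siblings. The formal Avg-Sibling-Age-of-Death($v$) definition is:
$\text{Avg-Sibling-Age-of-Death}(v) := \text{avg}(\{\text{Age-of-Death}(u) \mid u \in \text{Siblings}(v)\})$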
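The parent and grandparent features are compositions of two lookups, "father of" and "mother of", which the sketch below makes explicit. The dictionaries `father`, `mother`, and `age_of_death`, and both function names, are hypothetical. Returning `None` for a missing relative is an assumption of this example.

```python
def age_of(vertex, age_of_death):
    """Age of death of a vertex, or None if the vertex or its age is unknown."""
    if vertex is None:
        return None
    return age_of_death.get(vertex)


def extended_family_features(v, father, mother, age_of_death):
    """Parent and grandparent age-of-death features for vertex v.

    father, mother -- dict: vertex -> parent vertex (absent if unknown)
    age_of_death   -- dict: vertex -> age of death (absent if unknown)
    """
    f, m = father.get(v), mother.get(v)
    return {
        "Father-Age-of-Death": age_of(f, age_of_death),
        "Mother-Age-of-Death": age_of(m, age_of_death),
        # Grandparents are reached by applying a parent lookup twice.
        "Paternal-Grandfather-Age-of-Death": age_of(father.get(f), age_of_death),
        "Maternal-Grandfather-Age-of-Death": age_of(father.get(m), age_of_death),
        "Paternal-Grandmother-Age-of-Death": age_of(mother.get(f), age_of_death),
        "Maternal-Grandmother-Age-of-Death": age_of(mother.get(m), age_of_death),
    }
```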
Using the features defined above, we specify the following feature sets, which will later be used to construct our multiple linear regression models and ML classifiers: (a) All-Numeric-Features - a set which contains all the features defined above that return numeric values, except the Death-Year feature; (b) Heritage-Features - a set which includes all the extended family features, together with the Birth-Year and Gender features; and (c) Nuclear-Family-Features - a set which includes all the nuclear family features, together with the Birth-Year and Gender features.
3.3 Statistical and Predictive Analysis
In this study, we used various algorithms and methods to calculate the variations in human lifespan over the past centuries, to identify which features are correlated with human lifespan and longevity, and to create predictive models which can assist in predicting human lifespan.
In the remainder of this subsection, we describe in detail each one of our methods and algorithms.
Lifespan Variations over Time
After we had extracted the features for each vertex in the graph, we could utilize these features to calculate the variations in human lifespan over an extended period of time. To perform these calculations, we created two vertices datasets. The first dataset was the All-Dataset, which included all the vertices with valid values of Age-of-Death, while the second dataset was the <Country>-Dataset which included only vertices with valid values of Age-of-Death of people who were born in a specific country - in this study, we chose to take a closer look at people born in the United States.
We utilized the All-Dataset and the United-States-Dataset to specifically look at the lifespan of people who were born in each quarter of a century between 1650 and 1900. For each quarter of a century on each dataset, we calculated the Age-of-Death distribution of those people born in the chosen quarter. For example, in the second dataset, we had a group of 22,021 people who were born in the United States between 1700 and 1724; we then calculated the percentage of this group that died at each age between 0 and 122.
Additionally, for the All-Dataset-10 and for the United-States-Dataset-10, and for each year from 1650 to 1900, we calculated both the average and median lifespans of the people who were born in each year and outlived the age of 10. We also repeated these average and median calculations for each gender, using the Male-Dataset-10, Female-Dataset-10, Male-United-States-Dataset-10, and Female-United-States-Dataset-10 datasets.
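Both calculations can be sketched as follows. The input format (an iterable of `(birth_year, age_of_death)` pairs) and all function names are assumptions made for illustration. "Outlived the age of 10" is interpreted here as an age of death of at least 10.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def quarter_start(birth_year):
    """First year of the quarter-century containing birth_year (1712 -> 1700)."""
    return birth_year - (birth_year % 25)


def age_of_death_distribution(records):
    """records: iterable of (birth_year, age_of_death).

    Returns {quarter_start: {age: percent of that quarter's group}}.
    """
    counts = defaultdict(Counter)
    for birth_year, age in records:
        counts[quarter_start(birth_year)][age] += 1
    return {
        quarter: {age: 100.0 * n / sum(c.values()) for age, n in c.items()}
        for quarter, c in counts.items()
    }


def yearly_lifespan_stats(records, min_age=10):
    """Mean and median age of death per birth year.

    Only people whose age of death is at least min_age are included
    (an assumption about how "outlived the age of 10" is applied).
    """
    by_year = defaultdict(list)
    for birth_year, age in records:
        if age >= min_age:
            by_year[birth_year].append(age)
    return {year: (mean(ages), median(ages)) for year, ages in by_year.items()}
```

Running `yearly_lifespan_stats` separately on male and female records gives the per-gender series described above.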
One of the main goals of this study was to identify features which are correlated with lifespan and with longevity. To identify features correlated with an inheritance of human lifespan, we computed for each extended family feature, which was defined in Section 3.2.3, a simple linear regression $y = \beta_0 + \beta_1 x$, where $y$ was set to be the Age-of-Death vector, and $x$ was set to be the selected feature values. For each feature we chose only vertices from the <Feature>-Dataset-10, in which both the Age-of-Death value was greater than or equal to ten and the selected feature had a valid value.
To identify if an individual’s lifespan was correlated with the lifespan of his or her spouse(s), we repeated the same process of constructing a simple linear regression between the Age-of-Death feature and the Avg-Spouse-Age-of-Death, Max-Spouse-Age-of-Death, and Min-Spouse-Age-of-Death features. However, this time we used the Married-Dataset to include only individuals who were married at least once.
To identify if longevity is correlated with reproductive success, we repeated the same process of constructing a simple linear regression between the Age-of-Death feature and the Children-Number feature. However, with Westendorp and Kirkwood’s results in mind, we used the Children-Number-Dataset-50, Male-Children-Number-Dataset-50, and Female-Children-Number-Dataset-50 datasets, which only contained vertices with an age of death of at least 50, namely after menopause.
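The simple linear regressions described in this subsection reduce to an ordinary least squares fit of Age-of-Death against one feature at a time. The sketch below is pure Python and computes only the slope, intercept, and R-squared; the P-value calculation is left out. The row format (a dictionary per vertex) and the function names are hypothetical.

```python
def paired_values(rows, feature, min_age=10):
    """Collect (feature value, Age-of-Death) pairs with both values present.

    rows -- iterable of dicts holding feature values for one vertex each
    Rows with an Age-of-Death below min_age are dropped.
    """
    xs, ys = [], []
    for row in rows:
        value, age = row.get(feature), row.get("Age-of-Death")
        if value is not None and age is not None and age >= min_age:
            xs.append(value)
            ys.append(age)
    return xs, ys


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has zero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared is undefined when y is constant; 0.0 is returned by convention here.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared
```

Passing `min_age=50` to `paired_values` reproduces the restriction used for the Children-Number datasets.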
Multiple Linear Regression
In this study, we used backward stepwise multiple linear regression to create models for predicting the Age-of-Death of individuals who had been born by 1900. We constructed these regression models by using the All-Numeric-Features, the Heritage-Features, and the Nuclear-Family-Features sets, which were defined at the end of Section 3.2. For constructing our first two models, we only used vertices from the No-Missing–Dataset-10 dataset with valid complete values, including defined gender values, for each selected features set of vertices who outlived the age of 10. Additionally, to prevent bias due to the tendency of people to get married and have children in later stages of life, for the Nuclear-Family-Features set we only used vertices from the No-Missing–Dataset-50 dataset, i.e., those who outlived the age of fifty.
We evaluated these multiple linear regression models by calculating the P-value, as well as the Multiple R-squared, Adjusted R-squared, and Residual Standard Error (RSE) values.
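For reference, the evaluation quantities named above have the following standard definitions for a model fitted to $n$ observations with $p$ predictors. The symbols $n$, $p$, $y_i$, $\hat{y}_i$ (fitted value), and $\bar{y}$ (sample mean) are notation introduced here rather than taken from the original text.

```latex
\begin{align*}
\mathrm{RSS} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, &
\mathrm{TSS} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \\
R^2 &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, &
R^2_{\mathrm{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}, \\
\mathrm{RSE} &= \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}.
\end{align*}
```

With tens of thousands of observations and at most 21 predictors, the factor $(n-1)/(n-p-1)$ is very close to 1, so the Multiple and Adjusted R-squared values are expected to be nearly identical.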
Machine Learning Algorithms
One of the major drawbacks of using online genealogy datasets is the issue of missing values. In many genealogy datasets not all the profile data is complete; many profiles contain missing values due to nonexistent data or privacy considerations . To overcome the issue of missing values and still gain predictive information from the profiles with nonexistent data, we chose to use Machine Learning algorithms, such as decision trees and Naive-Bayes algorithms, which can deal with missing values.
We evaluated various supervised learning algorithms in an attempt to predict which individuals who were born in the United States between 1650 and 1900, and outlived the age of fifty, will also outlive the age of 80. We constructed our classifiers using Weka, a popular suite of ML software, and the features defined in the United-States–Dataset-50 dataset. We used all numeric features in each dataset, except the Age-of-Death and Death-Year features. Additionally, we treated unknown gender values as missing values, instead of replacing them with 0 values. Using these datasets as training sets, we used Weka’s OneR, C4.5 (J48) decision tree, K-Nearest-Neighbors (IBk; with K=3,5), Naive-Bayes, RandomForest, and Bagging implementations of the corresponding algorithms. For each of these algorithms, most of the configurable parameters were set to their default values, except for the J48 decision tree classifier, in which the pruning option was not enabled. We evaluated each classifier using the 10-fold cross-validation method and calculated the True-Positive, False-Positive, F-Measure, and Area-Under-Curve (AUC) measures. The AUC is a standard way to compare classifier performances, in which a value of 0.5 represents a random classifier.
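To make the evaluation measures concrete, the sketch below computes the true-positive rate, false-positive rate, F-measure, and AUC for a binary task in plain Python. It does not reproduce Weka's implementation or its averaging across classes; the function names are hypothetical, and label 1 is assumed to mean "outlived the age of 80".

```python
def confusion_rates(y_true, y_pred):
    """True-positive rate, false-positive rate and F-measure (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tpr / (precision + tpr)
                 if precision + tpr else 0.0)
    return tpr, fpr, f_measure


def auc(y_true, scores):
    """AUC as the probability that a random positive example is scored
    above a random negative one, with ties counted as one half.

    This pairwise form is quadratic in the number of examples, which is
    acceptable for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that assigns the same score to everyone gets an AUC of exactly 0.5 under this definition, which is the random baseline mentioned above.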
Additionally, to obtain an indication of the usefulness of the various features, we analyzed their importance using Weka’s information gain attribute selection algorithm.
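Information gain measures how much knowing a feature's value reduces the entropy of the class label. The following sketch computes it for a discrete feature. It is not Weka's implementation; numeric features such as Birth-Year would first need to be discretized, a step that is assumed here and not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Information gain of a discrete feature with respect to class labels.

    Computed as H(labels) minus the weighted entropy of the labels
    within each group of equal feature values.
    """
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A score of 0 means the feature's values split the data into groups whose class proportions match those of the whole dataset.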
3.4 WikiTree Dataset
To test and evaluate our methods and algorithms, we chose to use information collected from the WikiTree website. This is a free and accessible collaborative family history website started by Chris Whitten , and it contains more than 5 million profiles  of individuals who primarily lived in the past. WikiTree contains many profile pages of people who lived in the previous centuries, and many of the profiles contain the following details about each individual: full name, gender, date of birth, date of death, location of birth, location of death, parents’ profiles, children’s profiles, spouses’ profiles, and siblings’ profiles. Often, in order to maintain the privacy of still-living people, the website limits access to their profile personal details . In order to maintain the integrity of WikiTree profile data, many profiles give reference to the source of the data presented in the profile, and most profiles have a profile manager who has primary responsibility for WikiTree profiles . In addition, to prevent editing of profiles by untrusted users, each WikiTree profile has an independent “Trusted List” of people who can edit and view the profile , making the data in many profiles only editable to a limited number of people.
To collect profile information from WikiTree, we developed a web crawler which crawled and parsed only public profiles from the website. Using our crawler, we downloaded and parsed 1,070,189 public profile pages. Using these profiles, we were able to construct a directed multigraph with 9,192,212 links and 1,382,752 vertices, out of which 118,590 vertices represented at least 28,011 distinct private profiles. Moreover, the constructed multigraph contained at least 416,030 vertices representing individuals born in the United States, according to their profile pages. These vertices were connected by 5,168,275 links to other vertices in the multigraph (see Figure 1, and Tables 3 and 4).
In the following subsections, we present the results obtained using the algorithms and methods described in Section 3. The results consist of three parts: First, we present the results of calculating lifespan variations over time. Second, we present the results of the simple linear regression and multi-linear regression analysis techniques which were described in Section 3.3.2. Finally, we present the results of the ML algorithms mentioned in Section 3.3.4.
4.1 Lifespan Variations over Time Results
As described in Section 3.3.1, we utilized the All-Dataset and the United-States-Dataset to compute the changes of lifespan over each quarter of a century between 1650 and 1900. Then, we used these same datasets to take a closer look at the people who had been born during this 250-year span. For each quarter of a century on each dataset, we calculated the Age-of-Death distribution of the people who were born in the chosen quarter. The results showing the lifespan variations over time for the All-Dataset are presented in Figure 1(a), and for the United-States-Dataset in Figure 1(b).
We also used the All-Dataset-10 and the United-States-Dataset-10 to calculate the average and the median lifespans for each gender, and for both genders, in each year between 1650 and 1900. The results of these calculations are presented in Figures 4 and 4.
4.2 Regression Analysis Results
Using R-project software , we ran several simple linear and multi-linear regression algorithms based on the features we defined in Section 3.2. From the regression algorithms, we generated and evaluated several prediction models in order to determine the vertices’ Age-of-Death.
Simple Linear Regression Analysis Results
As described in Section 3.3.2, we computed several simple linear regression models in order to predict linear correlations between the Age-of-Death and other features. We first identified features correlated with the inheritance of human lifespan by computing a simple regression model for each feature in the Extended Family Features set, in order to predict the Age-of-Death feature. In these calculations, we used only the vertices who outlived the age of 10 and who also had valid existing information for each vertex’s selected feature; i.e., we only used vertices which exist in the <Feature>-Dataset-10 for each selected feature.
The simple regression results revealed that positive small but significant correlations exist between most of the Extended Family Features and the vertices’ lifespans (see Table 5). These correlations have R-squared values ranging from 0.0015 to 0.05, with a very low P-value of indicating that the correlation is highly significant, where the highest R-squared values were obtained for the Avg-Sibling-Age-of-Death (R-squared=0.05) and Max-Sibling-Age-of-Death (R-squared=0.0272) features, and the lowest R-squared values were obtained for the grandparents’ lifespan features (R-squared ranging from 0.0015 to 0.0028). Additionally, we also discovered a small negative correlation between the Sibling-Number feature and the vertices’ lifespan, with a slope of -0.155, R-squared of 0.0021, and P-value of .
We then repeated the simple linear regression calculation to identify correlations between the vertices’ lifespans and their spouses’ lifespans by using the Avg-Spouse-Age-of-Death, Max-Spouse-Age-of-Death, and Min-Spouse-Age-of-Death features with the Married-Dataset. We discovered that each one of these features demonstrated a significant correlation, with a low P-value of and a maximum R-squared value of 0.0564 (see Table 5).
Lastly, to identify if longevity is correlated with reproductive success, we computed simple linear regressions between the Age-of-Death feature and the Children-Number feature on the following datasets: Children-Number-Dataset-50, Male-Children-Number-Dataset-50, and Female-Children-Number-Dataset-50. Using the simple linear regression, we obtained the following correlation results: (a) on the Children-Number-Dataset-50 dataset () the regression returned a negative slope of -0.006, with a R-squared of and a P-value of ; (b) on the Male-Children-Number-Dataset-50 dataset () the regression returned a positive slope of 0.044, with a R-squared of 0.0002 and a P-value of ; and (c) on the Female-Children-Number-Dataset-50 dataset () the regression returned a negative slope of -0.079, with a R-squared of 0.0006 and a P-value of .
Multi-Linear Regression Analysis Results
To create models which can estimate a vertex age of death based on the vertex’s features, we chose to use the backward stepwise multiple linear regression technique. By combining this technique with the various predefined features sets, we created three multiple regression models which presented Multiple R-squared values of 0.085, 0.042, and 0.025, for the All-Numeric-Features set, Heritage-Features set, and Nuclear-Family-Features sets respectively (see Table 6). We also obtained the following multiple linear regression models for each features set: For the All-Numeric-Features set, we computed the following model using data collected from 59,893 vertices who were born by 1900 and outlived the age of 10:
For the Heritage-Features set, we computed the following model using data collected from 59,893 vertices who were born by 1900 and outlived the age of 10:
For the Nuclear-Family-Features set, we computed the following model using data collected from 349,118 vertices who were born by 1900 and outlived the age of 50:
4.3 Machine Learning Algorithms Results
We evaluated various supervised learning algorithms in an attempt to predict which individuals who were born in the United States between 1650 and 1900 and outlived the age of fifty will also outlive the age of eighty. We constructed our classifiers using the United-States–Dataset-50 dataset, which contained features of 183,494 vertices who outlived the age of fifty, out of which 58,975 vertices outlived the age of 80. To better understand which features were most useful to our classification algorithms, we analyzed the various features’ importance using Weka’s information gain feature selection algorithm. For the United-States–Dataset-50 dataset, the top ten features with the highest rank retrieved from Weka’s information gain feature selection algorithm were: (a) Birth-Year (0.0058), (b) Max-Sibling-Age-of-Death (0.0047), (c) Avg-Sibling-Age-of-Death (0.0023), (d) Max-Spouse-Age-of-Death (0.0019), (e) Gender (0.0018), (f) Avg-Spouse-Age-of-Death (0.0016), (g) Min-Spouse-Age-of-Death (0.0016), (h) Father-Age-of-Death (0.0008), (i) Mother-Age-of-Death (0.0006), and (j) Paternal-Grandfather-Age-of-Death (0.0002). At the end of the list, the Maternal-Grandfather-Age-of-Death, the Children-Number, and the Sibling-Number features received an information gain score of 0.
On this dataset, the RandomForest classifier received the maximum AUC of 0.632, better than a random classifier with an AUC of 0.5, while the maximum True-Positive value of 0.976 was obtained by the Decision-Tree classifier, and the minimum False-Positive value of 0.774 was obtained by the K-Nearest-Neighbors (K=3) classifier (see Table 7). We used T-tests with a significance level of 0.05 to compare the AUC results of the RandomForest and the naive OneR classifiers. According to the T-test result, the RandomForest classifier performed significantly better in terms of AUC than the naive OneR classifier.
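For readers less familiar with the AUC measure, it can be read as the probability that a randomly chosen positive instance (here, a person who outlived 80) receives a higher classifier score than a randomly chosen negative instance, with ties counted as one half. The sketch below computes it directly from that definition; it is illustrative only, the function name `auc` is hypothetical, and the quadratic pairwise loop is meant for small examples rather than for a dataset of this size.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Under this reading, a classifier that assigns the same score to everyone obtains 0.5, which is the random baseline referred to above.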
To our knowledge, this study is the largest study to date which utilizes genealogical datasets to better understand factors that correlate with human lifespan. The algorithms and methods presented throughout this study, which were evaluated on the WikiTree dataset, reveal several interesting patterns and correlations.
Firstly, our results of lifespan variations over time, presented in Section 4.1 and in Figure 2, demonstrate how the lifespans of human population changed over the previous centuries. The lifespan graphs presented in Figure 2 show high infant and children death rates as well as local maximum values between the ages of 70 and 80; these results resemble the lifespan graphs presented in Mitchell et al. and by the UK Office of National Statistics. This resemblance supports our assumptions regarding the integrity of the WikiTree dataset, which indeed contains data on human population with largely accurate birth and death dates. However, the infant death rates presented in these graphs are not entirely accurate; according to Wegman, in 1900 the infant mortality rates in the United States were about 15%, which is higher than the values presented in our results. We assume that the main reason for this discrepancy was the lack of a uniform, formal definition of “live births,” which was not standardized until 1951. Therefore, in most of this study we used as a sample set only people who outlived the age of ten. Nevertheless, by analyzing these graphs, we can observe that over time, lifespans increased and fewer people passed away at young ages. Another observation that can be drawn from these graphs is that even in the second half of the seventeenth century, people who outlived the age of ten would likely outlive the age of sixty. Indeed, according to our median lifespan analysis, presented in Figure 3(b), the median lifespan in 1650 for people who were born in the United States and outlived the age of ten was 62.46 for males and 62.04 for females.
Secondly, our median and average population lifespan calculations, presented in Figures 3 and 4, reveal some interesting patterns. By analyzing the graphs, we can locate several years in which the average lifespans sharply decreased for both males and females. For example, for people who were born in the United States in 1800 and outlived the age of 10, the average lifespans for males and females were 66.39 and 64.45, respectively. However, for people born in the United States ten years later, in 1810, the average lifespan was reduced by around 2 years: males’ lifespans decreased to 64.31, and females’ lifespans decreased to 62.20. An additional and even more interesting recurring pattern can be identified in Figure 3(b) in which, for a specific time period, the median lifespan for males increased while the median lifespan for females suddenly decreased, or vice versa. For example, from 1650 to 1660 the male median lifespan increased from 62.46 to 66.82, while in the same period of time the female median lifespan decreased from 62.04 to 60.24. Similar patterns recur between 1770 and 1780, only this time the female average lifespan increased from 65.57 to 68.69, while the male average lifespan decreased from 66.87 to 64.79 (see Figure 2(b)). Another interesting pattern can be found between 1850 and 1900, where in just half a century the female average lifespan sharply increased from 62.66 to 72.5. We hope to discover underlying reasons for these patterns in our future research.
Thirdly, using simple linear regression algorithms, we uncovered small but significant correlations between various features and the Age-of-Death feature, which are presented in Table 5. We found small positive significant correlations between the Extended Family Features and the Age-of-Death feature. For all these correlations, R-squared values were small and ranged from 0.0015 to 0.05 with low P-values, and these may indicate that lifespan can “run in the family.” However, due to the small R-squared values in our results, we can conclude that the influence of inherited lifespan is limited and, in fact, negligible after more than one generation. Alternatively, the observed correlation could be explained by socioeconomic reasons: ancestors with long lifespans might also indicate a higher socioeconomic status, which can be passed on to their offspring. We also found significant correlations between the Avg-Spouse-Age-of-Death, Max-Spouse-Age-of-Death, and Min-Spouse-Age-of-Death features and the Age-of-Death feature, with a low P-value and a maximum R-squared value of 0.0564 (see Table 5). This indicates that correlations between the lifespans of spouses exist, supporting the claims for the existence of the “widow effect.” We hope to confirm this observation in a future study by taking a closer look at the time intervals between the deaths of married couples. Using simple linear regression models, we also identified small significant correlations between longevity and reproductive success. Namely, we discovered a negligible negative correlation between females' lifespans and their number of children (R-squared = 0.0006), and a negligible positive correlation between males' lifespans and their number of children (R-squared = 0.0002).
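Because R-squared carries no sign, the direction of each correlation discussed above comes from the slope of the fitted line. The following minimal sketch fits a simple linear regression by ordinary least squares and returns the slope, the intercept, and R-squared; it illustrates the standard formulas and is not the R code used in this study, and the function name is hypothetical.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared
```

For instance, with `x` holding Children-Number and `y` holding Age-of-Death, a negative slope together with a very small R-squared would correspond to the negligible negative correlation reported for females.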
Fourthly, using multiple linear regression models, we were able to construct models which can predict a person’s age of death using various features that were extracted from the WikiTree social network directed multigraph. Our models presented a low P-value with an Adjusted R-squared of up to 0.085 (see Table 6), indicating that the extracted features can indeed assist in predicting a person's Age-of-Death based on data which was extracted from the WikiTree dataset. However, the relatively low Adjusted R-squared values indicate that other external factors are also responsible for influencing an individual’s lifespan. We hope to test these assumptions in future studies by merging WikiTree genealogy datasets with other datasets that contain additional information about individuals’ habits and lifestyles.
Fifthly, our machine learning classifiers presented better-than-random performance, with AUCs up to 0.632 (see Table 7), in identifying which people who were born in the United States and outlived the age of 50 would also outlive the age of 80. These results support our previous claims that the data collected from genealogy datasets can be utilized to predict a person’s lifespan. Additionally, the information gain algorithm results revealed that the Max-Sibling-Age-of-Death, Avg-Sibling-Age-of-Death, and Max-Spouse-Age-of-Death features (see Section 4.3) were among the most useful features. These results also indicate that a correlation exists both between spouses’ lifespans and between siblings’ lifespans. In our future studies, we hope to use similar techniques to predict other personal attributes based on data collected from online genealogy datasets.
The study presented here is among the first of its kind and offers many future research directions to pursue. One possible research direction is to analyze not only the structured data which appear in the WikiTree profile pages, but also to use Natural Language Processing (NLP) algorithms to analyze content data which appear in these pages. Another possible research direction is to compare the results presented in this study from the WikiTree dataset to other online genealogy datasets, such as FamiLinx,
The authors would like to thank Carol Teegarden for her editing expertise and helpful advice.
- During this study, we utilized private profiles to calculate public profiles’ features, such as Children-Number() and Spouse-Number(), more accurately (see Section 3.2). In many cases, we cannot determine whether two or more private profiles in fact represent the same single profile in the genealogy dataset. Nevertheless, we can estimate the number of distinct private profiles by utilizing the private profile single link of type . Namely, due to the fact that most people have two parents, we can conclude that private profiles with a single link of type “Child” represent at least distinct profiles.
- In most online genealogy websites, a profile usually contains the following information about each individual: gender, birth and death dates, location of birth, location of death, parents’ names, spouses’ names, siblings’ names, and children’s names.
- The youngest mother on record was a 5-year-old Peruvian girl.
- 122 is the maximum confirmed human lifespan.
- We chose to use a minimum lifespan of 10 to avoid adding infant and child mortality, which might be misreported.
- For features such as Max-Sibling-Age-of-Death and Min-Spouse-Age-of-Death, which involved calculation of minimum, maximum or average, we ignored vertices with missing values, although by definition these features returned a valid value of 0.
- Y. Altshuler, N. Aharony, M. Fire, Y. Elovici, and A. S. Pentland. Incremental learning with accuracy prediction of social and individual properties from mobile-phone data. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 969–974. IEEE, 2012.
- Ancestry.com. Ancestry.com Inc. reports Q3 2012 financial results. 2013. (last accessed on November 2nd, 2013).
- A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
- N. A. Christakis and J. H. Fowler. The spread of obesity in a large social network over 32 years. New England Journal of Medicine, 357(4):370–379, 2007.
- D. Eastman. WikiTree reaches five million profiles. http://blog.eogn.com/eastmans_online_genealogy/2013/04/wikitree-reaches-five-million-profiles.html, 2013. (last accessed on November 2nd, 2013).
- F. Elwert and N. A. Christakis. The effect of widowhood on mortality by the causes of death of both spouses. American Journal of Public Health, 98(11), 2008.
- M. Fire, G. Katz, Y. Elovici, B. Shapira, and L. Rokach. Predicting student exam’s scores by analyzing social network data. In Active Media Technology, pages 584–595. Springer, 2012.
- M. Gögele, C. Pattaro, C. Fuchsberger, C. Minelli, P. P. Pramstaller, and M. Wjst. Heritability analysis of life span in a semi-isolated population followed across four centuries reveals the presence of pleiotropy between life span and reproduction. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, 66(1):26–37, 2011.
- M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11:10–18, November 2009.
- J. Knowles. MyHeritage hits 1 billion profiles and announces new features for historical research. 2012. (last accessed on November 2nd, 2013).
- É. Le Bourg. Does reproduction decrease longevity in human beings? Ageing Research Reviews, 6(2):141–149, 2007.
- P. Martikainen and T. Valkonen. Mortality after the death of a spouse: rates and causes of death in a large Finnish cohort. American Journal of Public Health, 86(8_Pt_1):1087–1093, 1996.
- P. F. McArdle, T. I. Pollin, J. R. O’Connell, J. D. Sorkin, R. Agarwala, A. A. Schäffer, E. A. Streeten, T. M. King, A. R. Shuldiner, and B. D. Mitchell. Does having children extend life span? A genealogical study of parity and longevity in the Amish. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, 61(2):190–195, 2006.
- B. D. Mitchell, W.-C. Hsueh, T. M. King, T. I. Pollin, J. Sorkin, R. Agarwala, A. A. Schäffer, and A. R. Shuldiner. Heritability of life span in the Old Order Amish. American Journal of Medical Genetics, 102(4):346–352, 2001.
- N. H. Murdock. Teenage pregnancy. Journal of the National Medical Association, 90(3):135, 1998.
- MyHeritage. MyHeritage members map. http://www.myheritage.com/myheritage-member-map, 2013. (last accessed on November 2nd, 2013).
- Office of National Statistics. Mortality in England and Wales: Average life span. http://www.ons.gov.uk/ons/dcp171776_292196.pdf, 2012. (last accessed on Nov. 13th, 2013).
- C. M. Parkes, B. Benjamin, and R. G. Fitzgerald. Broken heart: a statistical study of increased mortality among widowers. British Medical Journal, 1(5646):740, 1969.
- R Development Core Team. R: A language and environment for statistical computing, 2005.
- F. Thomas, A. Teriokhin, F. Renaud, T. De Meeûs, and J.-F. Guégan. Human longevity at the cost of reproductive success: evidence from global data. Journal of Evolutionary Biology, 13:409–414, 2000.
- M. E. Wegman. Infant mortality in the 20th century, dramatic but uneven progress. The Journal of Nutrition, 131(2):401S–408S, 2001.
- R. G. Westendorp and T. B. Kirkwood. Human longevity at the cost of reproductive success. Nature, 396(6713):743–746, 1998.
- C. R. Whitney. Jeanne Calment, world’s elder, dies at 122. New York Times, 1997.
- Wikipedia. Wikipedia:Statistics. http://en.wikipedia.org/wiki/Wikipedia:Statistics, 2013. (last accessed on November 2nd, 2013).
- WikiTree. About WikiTree. http://www.wikitree.com/wiki/About_WikiTree, 2013. (last accessed on November 2nd, 2013).
- WikiTree. Trusted list. http://www.wikitree.com/wiki/Trusted_List, 2013. (last accessed on November 2nd, 2013).
- WikiTree. WikiTree manager. http://www.wikitree.com/wiki/Profile_Manager, 2013. (last accessed on November 2nd, 2013).
- WikiTree. WikiTree privacy. http://www.wikitree.com/wiki/Privacy, 2013. (last accessed on November 2nd, 2013).