Large-scale diversity estimation through surname origin inference


The study of surnames as both linguistic and geographical markers of the past has proven valuable in several research fields spanning from biology and genetics to demography and social mobility. This article builds upon the existing literature to conceive and develop a surname origin classifier based on a data-driven typology. This enables us to explore a methodology to describe large-scale estimates of the relative diversity of social groups, especially when such data is scarcely available. We subsequently analyze the representativeness of surname origins for 15 socio-professional groups in France.

Onomastics, machine learning, diversity, representativeness, geographical origins

Antoine Mazières, Centre Marc Bloch Berlin e.V., Friedrichstraße 191, 10117 Berlin, Germany

1 Introduction

Surnames have the objective property of designating a path in the ancestry tree, up to a point in time and space where the name was first coined and made hereditary. While they are usually distant markers of a historical and geographical context, surnames still exhibit connections with present features and have thus been considered a valuable proxy in population studies. For one, surnames correlate with genetic proximity within populations jobling2001name; king2006genetic; lasker1985surnames and have been variously used to analyze human population biology lasker1980surnames, identify cohorts of ethnic minority patients in biomedical studies shah2010surname; polednak1993estimating; choi1993use, improve research in genealogy king2009s or describe the migration rates of human populations piazza1987migration. The social sciences have more recently made use of surnames to statistically and indirectly appraise the composition of populations in various situations mateos2007review; mateos2014names, including the demography of online chang2010epluribus; mislove2011understanding and research wu2014science communities, or the history of social mobility clark2014also; guell2012intergenerational.

The purpose of the present article is twofold. First, it aims to assess the possibility of building a general-purpose, worldwide surname origin classifier. Our approach combines elements which are already available in the literature, and endeavors both to enhance the quality of the learning data and to broaden the geographical breadth and universality of the surname origin typology. Second, we use this classifier to show that, despite its limitations at the individual level, it nonetheless enables simple and pertinent applications to the estimation of representation biases in origins in populations where no such data is explicitly available. We further illustrate its potential relevance for discrimination studies by comparing surname origin distributions for various sets of occupational groups and exam candidates in France.

2 Statistically inferring a surname origin

2.1 Surname origin vs. ethnicity

Our approach relies essentially on the notion of surname origin rather than ethnicity. Indeed, ethnicity is often defined weber1978economy; barth1998ethnic; tonkin2016history as a subjective feeling of membership in one or several groups or self-defined identities, composed of linguistic, national, regional and religious criteria. A quick glance at the present paper’s bibliography reveals how much the academic literature aimed at inferring information from surnames relies on ethnicity to put names and individuals into groups, and to derive subsequent analyses.

By contrast, a surname objectively corresponds to a genealogical and traditionally patrilineal path whose origin coincides with the first appearance of this socially hereditary property in the family tree. These moments vary widely from one region to another, spanning from about 5,000 years ago in China to less than a century ago in Turkey.

Over 20 generations, the unique path of a name is one among more than a million (for about double the ancestors). Thus, in a randomly mating population, i.e. without any kind of endogamy, this marker would assuredly carry extremely little information: given these figures, someone bearing a surname of a specific origin would not be more likely to exhibit characteristics found in other bearers of a surname of the same origin. However, the existence of a strong endogamy among humans (albeit probably decreasing rosenfeld2008racial) entails a correlation between surnames and the preferences that characterize this endogamy: geographical proximity, social and economic status, languages, and political, genetic, regional and religious criteria. Put simply, as a result of, say, geographical endogamy, the correlation between the geographical origins of the father and the mother of a person induces a correlation between the geographical origins of their surnames, whereby the father's name partly informs on the geographical origin of the mother. This phenomenon is likely the common cause behind the significance of the results found in the above-cited studies.
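The orders of magnitude quoted above follow from a simple count. The worked figures below assume an idealized tree in which no ancestor appears twice (in real pedigrees, lines of descent overlap):

```latex
\[
2^{20} = 1\,048\,576
\qquad\text{and}\qquad
\sum_{k=1}^{20} 2^{k} = 2^{21} - 2 = 2\,097\,150 .
\]
```

The first figure counts the distinct lines of descent reaching back 20 generations, of which the surname follows exactly one. The second counts all ancestors over those 20 generations, which is roughly double the first.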

With this in mind, ethnicity appears as a potentially uncertain detour through a context-dependent and highly subjective matter, while the reference to an origin offers a more objective description of the variations in features extracted from surnames. To speak of origins nonetheless demands that we decide how to partition the world into distinct regions. At the lowest level, to keep matters simple and comparable, we first decided to use the present-day list of countries, acknowledging that no spatial or temporal partition of the world is likely to account for the wide diversity of overlaps between territories and populations at various points in time.

2.2 Crafting the learning data

Figure 1: Clusters of surname origins
Countries marked by a star (*) are interpreted as misclassified and reassigned in the following manner: Philippines, Japan and Indonesia are assigned to the Asian cluster, Ethiopia to African. Papua New Guinea, Madagascar, Jamaica, Chad and Armenia are deleted from the dataset as they represented a very low number of initial observations.

How are we, as humans, able to form an intuition about the origin of some surnames? Someone who has never encountered the name “Toriyama” might still correctly guess its Japanese origin, for instance because of the way it sounds when pronounced, or the pattern of letter ordering. This admittedly hints at the existence of a second, closely related proxy: surnames were originally coined (and have also been modified) by speakers belonging to a given linguistic space. Some structural and recurrent linguistic properties are more likely to be found in surnames of the same origin.

Thus, we aim to create a classifier able to infer the probable origin of a surname from its spelling sufficiently well. To take a simple example, the distribution of letters in a text usually yields a good prediction of its language, assuming sufficiently many words and prior knowledge of empirical distributions for a set of languages. While it would be ambitious to expect decent precision from the single-letter distributions of surnames, the use of subsets of letters, including morphemes, appears much more promising. To define learning features, we thus decompose all surnames into various subsets of letters of size n, or “n-grams”. This eventually constitutes the feature set for the whole dataset. We then describe a given surname by its distribution over these features.
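The decomposition into n-grams can be sketched as follows. This is a minimal illustration rather than the exact feature extraction used here: it assumes n-grams are runs of consecutive letters, and the sizes in `sizes` are hypothetical, since the values of n are not stated in this section.

```python
from collections import Counter


def char_ngrams(surname, sizes=(1, 2, 3)):
    """Count the letter n-grams of a surname.

    `sizes` is an assumption made for illustration only.
    """
    name = surname.lower()
    grams = Counter()
    for n in sizes:
        # Slide a window of width n over the name.
        for i in range(len(name) - n + 1):
            grams[name[i:i + n]] += 1
    return grams


def ngram_distribution(surname, sizes=(1, 2, 3)):
    """Describe a surname by the relative frequency of each of its n-grams."""
    counts = char_ngrams(surname, sizes)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


bigrams = char_ngrams("Toriyama", sizes=(2,))
print(sorted(bigrams))
```

For “Toriyama”, the bigram features are the seven pairs of adjacent letters, among them “ya” and “ma”.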

Building a statistical model able to reproduce the above intuition at large scale for all origins means that we must first fit the model on a large and diverse set of surnames labeled with their origins, i.e. a training dataset. To gather such learning examples, previous works relied on a variety of explicitly labeled sources including census data mislove2011understanding, Olympic Games participant records leename, phone books mateos2014names or even Wikipedia data ambekar2009name.

Another study used the PubMed search engine to extract scientific bibliographical records torvik2016ethnea. We follow a similar approach since this open data source (endnote 1) enables easy reproducibility of our research and provides an extensive volume of references, with more than 25 million publications. For each record, we extracted author surnames and their affiliations when they were related to one of the 176 countries of the Natural Earth dataset (endnote 2).

We assume that surnames whose affiliation distribution is heavily peaked for a given country are more likely to originate from that country. However, PubMed data suffers from several biases, among which:

  • The increased nomadism of the scientific population, lowering the quality of the affiliation as a reliable origin.

  • The heterogeneous academic activity of countries, over-sampling the most productive ones at the expense of others.

  • The potential bias of medical publication databases in favor of Anglo-Saxon publication venues Nieminen:aa, under-sampling the rest of the world.

A first obvious step to counterbalance these biases consists in considering surname frequencies, i.e. normalizing surname occurrences in a given country by the total number of occurrences for that country. Then, in an effort to restrict our training dataset to true positives, we use a measure of statistical concentration, the Herfindahl–Hirschman Index (HHI) herfindahl1950concentration; hirschman1980national, to identify names whose presence is highly concentrated in one country only. We require an HHI of at least 0.8 as well as a maximal frequency over all countries of at least 0.0001 %. Even though this method eliminates some of the most common names, which are likely to have spread all over the world, it narrows our focus to a set of about 650k surnames which we call “core names” and which we assign to the country where their frequency is maximal.
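The selection of core names can be sketched as below. This is an illustration under stated assumptions: the HHI is computed here on a surname's per-country frequencies rescaled to sum to 1, which is one plausible reading of the text, and the function names and toy counts are hypothetical.

```python
def country_frequencies(counts_by_country, totals_by_country):
    """Normalize one surname's occurrences by each country's total occurrences."""
    return {c: n / totals_by_country[c] for c, n in counts_by_country.items()}


def hhi(frequencies):
    """Herfindahl-Hirschman Index of a surname's shares across countries."""
    total = sum(frequencies.values())
    return sum((f / total) ** 2 for f in frequencies.values())


def core_country(frequencies, min_hhi=0.8, min_freq=1e-6):
    """Return the country of maximal frequency if the name qualifies, else None.

    min_freq=1e-6 corresponds to the 0.0001 % threshold quoted in the text.
    """
    country, top = max(frequencies.items(), key=lambda kv: kv[1])
    if hhi(frequencies) >= min_hhi and top >= min_freq:
        return country
    return None


# Hypothetical counts: one concentrated name, one widespread name.
totals = {"JP": 100_000, "US": 1_000_000, "GB": 100_000}
peaked = country_frequencies({"JP": 90, "US": 40}, totals)
spread = country_frequencies({"US": 500, "GB": 60, "JP": 40}, totals)
print(core_country(peaked), core_country(spread))
```

A name whose normalized frequency is almost entirely in one country passes the filter and is assigned to that country, while a name present at similar frequencies in several countries is discarded.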

2.3 A data-driven typology of surname origins

Nonetheless, the number of these core names remains unevenly distributed across countries, partly as a result of the above-mentioned under-sampling. It goes from 163 names for Montenegro to 41k names for Spain, with an overall average of 5 145. Before training our model, we thus need to introduce coarser categories to achieve a minimal significance for each geographic area.

Keeping in mind the eventual goal of appraising over- and under-representation of origins in socio-professional groups, we conservatively decide to categorize countries into a relatively small number of world regions. To do so, we first cluster countries according to the training features. More precisely, we created a large “country / n-gram” matrix whose rows are countries and whose columns are n-grams of core names: a cell indicates the frequency of a given n-gram among the core names of a given country. We then performed hierarchical clustering on this matrix using Ward linkage ward1963hierarchical. This yields the dendrogram shown in Figure 1, from which we may extract 7 rough categories of surname origins. Concretely, we aggregate countries by following the dendrogram monotonically from the bottom to the top while avoiding merging categories that belong to strongly unrelated geographical areas. This process creates what appears to be an interpretable regionalization of the world at the cost of a very limited number of inconsistencies.
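The clustering step can be illustrated with a naive implementation of Ward's criterion. This is a sketch, not the procedure used to obtain Figure 1: it stops mechanically at a requested number of clusters, whereas the aggregation described above also takes geography into account. The country codes and feature values are made up.

```python
def ward_clusters(rows, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    rows: dict mapping a label (e.g. a country) to a feature vector
    (e.g. n-gram frequencies). Returns a list of sets of labels.
    """
    # Each cluster is (members, centroid); start with one cluster per row.
    clusters = [({label}, list(vec)) for label, vec in rows.items()]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        na, nb = len(a[0]), len(b[0])
        dist2 = sum((x - y) ** 2 for x, y in zip(a[1], b[1]))
        return na * nb / (na + nb) * dist2

    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose merge is cheapest.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]),
        )
        a, b = clusters[i], clusters[j]
        na, nb = len(a[0]), len(b[0])
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(a[1], b[1])]
        merged = (a[0] | b[0], centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [members for members, _ in clusters]


# Hypothetical "country / n-gram" rows with three n-gram frequencies each.
rows = {
    "ES": [0.30, 0.05, 0.01],
    "IT": [0.28, 0.07, 0.02],
    "PL": [0.02, 0.25, 0.20],
    "CZ": [0.03, 0.22, 0.24],
}
print(ward_clusters(rows, 2))
```

On real data with many countries and n-grams, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the more practical choice; the loop above only makes the merge criterion explicit.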

We relabel the original “surname-country” associations according to these clusters. We eventually train a classifier on this new “surname-world region” dataset, using the same learning features. Broadly, a classifier is a model (and, in practice, a function) which takes as inputs the learning features for a given observation (in our case, a surname and its letter subsets) and outputs a guessed label (in our case, an origin under the form of a world region).

The state of the art features a variety of methods such as hidden Markov models and decision trees ambekar2009name, recurrent neural networks leename or logistic regressions torvik2016ethnea. We focus on one of the most classical classifiers, Naive Bayes, which in our case yielded the best overall results among a variety of other traditional approaches available in Scikit-learn pedregosa2011scikit, the Python machine learning library we used (endnote 3). Naive Bayes is a simple classification technique which estimates the probability that an object belongs to a certain class given a set of observed features. It applies Bayes' theorem to the probabilities that surnames exhibit certain features knowing that they belong to some origin. It additionally relies on the assumption that these features are statistically independent, i.e. that the contributions of each of these features to the target probability are independent from one another, hence the “naive” qualification. In practice, we train the model on about 85% of the core name dataset and set aside the remaining 15% or so to evaluate model performance.
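To make the technique concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier over letter n-grams, with the additive smoothing value of 0.1 mentioned in endnote 3. It is not the Scikit-learn implementation used for the results; the class name, the n-gram sizes and the six training names are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(name, sizes=(1, 2, 3)):
    """List the letter n-grams of a name (sizes are an illustrative assumption)."""
    name = name.lower()
    return [name[i:i + n] for n in sizes for i in range(len(name) - n + 1)]


class TinyMultinomialNB:
    def __init__(self, alpha=0.1):
        self.alpha = alpha  # additive (Laplace/Lidstone) smoothing

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for name, label in zip(names, labels):
            self.feature_counts[label].update(ngrams(name))
        self.vocab = {g for counts in self.feature_counts.values() for g in counts}
        return self

    def log_scores(self, name):
        # n-grams never seen in training are ignored.
        grams = [g for g in ngrams(name) if g in self.vocab]
        n_total = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / n_total)  # log prior
            for g in grams:
                # "Naive" step: per-feature log likelihoods simply add up.
                score += math.log((counts[g] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Hypothetical toy training set, far smaller than the core name dataset.
names = ["Toriyama", "Nakamura", "Yamada", "Kowalski", "Nowak", "Wisniewski"]
labels = ["Asian"] * 3 + ["Slavic"] * 3
model = TinyMultinomialNB(alpha=0.1).fit(names, labels)
print(model.predict("Yamamoto"), model.predict("Kowalczyk"))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when a name has many n-grams.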

Cluster      Core names (total)  Core names (evaluation)  Precision  Recall
African                  30 748                    4 529       0.43    0.61
Arabian                  31 272                    4 596       0.52    0.72
Asian                    44 658                    6 754       0.61    0.77
CS-European             189 624                   28 668       0.81    0.71
Indian                   68 145                   10 067       0.63    0.72
N-European              216 465                   32 469       0.78    0.62
Slavic                   65 259                    9 843       0.64    0.84
Total                   646 171                   96 926
Table 1: Number of core names per cluster (totals, of which around 15% are used for the evaluation) and classifier performance for each cluster in terms of precision and recall.

Classification performance is shown in Table 1 and is expressed in terms of precision and recall, along with the corresponding set sizes. For instance, the model achieves a precision of 61% for Asian and a recall of 77%, meaning that 61% of names guessed as “Asian” belong to the Asian cluster, while 77% of names belonging to the Asian cluster are correctly guessed (recalled by the model) as “Asian”. Success differs significantly from one class to another, with very satisfying results for the Central/South European and Slavic clusters and quite moderate performance for the African cluster. How much of this error is due to the lack of academic data in certain areas, and how much to the difficulty of identifying patterns in the surnames of a specific area, is yet to be determined.
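Both measures can be recomputed from the evaluation counts reported in the confusion matrix of Table 2, as in the short sketch below (the function name is ours). Precision divides the correctly guessed names by the row total, recall by the column total.

```python
ORIGINS = ["African", "Arabian", "Asian", "CS-European", "Indian", "N-European", "Slavic"]

# Counts copied from Table 2: rows are guessed origins, columns are actual origins.
CONFUSION = [
    [2763, 165, 381, 1081, 460, 1441, 157],
    [159, 3292, 84, 577, 598, 1549, 77],
    [319, 113, 5200, 831, 716, 1147, 174],
    [258, 128, 274, 20364, 299, 3535, 324],
    [273, 487, 420, 991, 7226, 1862, 191],
    [643, 351, 315, 3254, 609, 20183, 670],
    [114, 60, 80, 1570, 159, 2752, 8250],
]


def precision_recall(matrix, k):
    """Precision and recall of class k for a (guessed x actual) count matrix."""
    correct = matrix[k][k]
    guessed_as_k = sum(matrix[k])               # row total
    actually_k = sum(row[k] for row in matrix)  # column total
    return correct / guessed_as_k, correct / actually_k


for k, origin in enumerate(ORIGINS):
    p, r = precision_recall(CONFUSION, k)
    print(f"{origin:12s} precision={p:.2f} recall={r:.2f}")
```

For the Asian cluster this gives 5200 / 8500 ≈ 0.61 for precision and 5200 / 6754 ≈ 0.77 for recall, in line with Table 1.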

Notwithstanding, since we are interested in comparing the over- and under-representation of surname origins between two socio-professional populations of a given country, we contend that this type of error does not significantly jeopardize our aim. We first postulate that classification errors for a given surname origin remain homogeneous from one dataset to the other, i.e. that the names of a given origin are globally going to be classified (and misclassified) with the same success in both datasets. In other words, irrespective of their proportion within a given dataset, we assume that all surnames of, say, Indian origin, will be as often correctly recalled by our algorithm as Indian in all datasets, i.e. 71.8% of the time (and errors will be distributed across other origins in similar proportions for all datasets). Put differently, we suppose that names which pose inference problems w.r.t. our model are roughly distributed homogeneously and are not biased across datasets (for example, if “Toriyama” is misclassified, we assume that it is no more or less present among Asian names in one dataset than in another one).

We nonetheless have to consider that classification errors vary across origins. This is shown by the confusion matrix in Table 2. Here, names of Arabian origin are guessed as Asian 2.46% of the time, while the figure is about 7.04% for names of African origin. Even if the above assumption enables us to use the same confusion matrix for all datasets, we still have to adjust guesses, knowing that the algorithm exhibits some propensity to over- or under-estimate depending on the origin. In other words, knowing that a proportion of the names which actually belong to a given origin are guessed as belonging to another origin, we correct guesses to infer back the probability of each actual origin for a given guess, $P(\text{actual} \mid \text{guessed})$. In practice, we multiply guessed numbers of surname origins by this probability, which we extract from the confusion matrix by Bayesian inference (endnote 4).

Guessed origin  Actual origin:
                 Afr.  Arab.  Asian    CSE  Indian     NE  Slavic
African          2763    165    381   1081     460   1441     157
  %              61.0   3.59   5.64   3.77    4.60   4.44    1.60
Arabian           159   3292     84    577     598   1549      77
  %              3.51   71.6   1.24   2.01    5.94   4.77    0.78
Asian             319    113   5200    831     716   1147     174
  %              7.04   2.46   77.0   2.90    7.11   3.53    1.77
CS-Eur.           258    128    274  20364     299   3535     324
  %              5.70   2.79   4.06   71.0    2.97   10.9    3.29
Indian            273    487    420    991    7226   1862     191
  %              6.03   10.6   6.22   3.46    71.8   5.73    1.94
N-Eur.            643    351    315   3254     609  20183     670
  %              14.2   7.64   4.66   11.4    6.05   62.2    6.81
Slavic            114     60     80   1570     159   2752    8250
  %              2.52   1.31   1.18   5.48    1.58   8.48    83.8
Total            4529   4596   6754  28668   10067  32469    9843
  %               100    100    100    100     100    100     100
Table 2: Confusion matrix. This matrix shows the number of names from the evaluation sets (see Tab. 1) of an actual origin (in columns) which are guessed as belonging to a given origin (in rows). The first subrow indicates total numbers, the second subrow refers to proportions within an actual origin.
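The correction of guessed counts can be sketched as follows. This is our reading of the procedure rather than the exact code behind the results: column proportions of the confusion matrix are taken as P(guessed | actual), combined with prior proportions of actual origins through Bayes' theorem, and the resulting P(actual | guessed) is used to redistribute the guessed counts. The 2 × 2 matrix, priors and counts are hypothetical.

```python
def posterior_given_guess(confusion, priors):
    """Return posterior[g][a] = P(actual = a | guessed = g).

    confusion[g][a] counts names of actual origin a guessed as g;
    priors[a] is the assumed proportion of actual origin a.
    """
    n = len(confusion)
    col_totals = [sum(confusion[g][a] for g in range(n)) for a in range(n)]
    # likelihood[g][a] = P(guessed = g | actual = a), from column proportions.
    likelihood = [[confusion[g][a] / col_totals[a] for a in range(n)] for g in range(n)]
    posterior = []
    for g in range(n):
        joint = [likelihood[g][a] * priors[a] for a in range(n)]
        evidence = sum(joint)  # P(guessed = g)
        posterior.append([j / evidence for j in joint])
    return posterior


def correct_counts(guessed_counts, posterior):
    """Redistribute guessed counts over actual origins."""
    n = len(guessed_counts)
    return [sum(guessed_counts[g] * posterior[g][a] for g in range(n)) for a in range(n)]


# Hypothetical two-origin example.
confusion = [[80, 10],
             [20, 90]]
posterior = posterior_given_guess(confusion, priors=[0.5, 0.5])
corrected = correct_counts([100, 200], posterior)
print([round(c, 2) for c in corrected])
```

Each row of `posterior` sums to 1, so the correction changes the split between origins but preserves the total number of names.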

3 Estimating origin-based discrimination in France

3.1 Datasets and estimation methodology

Figure 2: Over-/under-representation of surname origins among all datasets. Each graph shows the ratios between the target dataset and the reference dataset (Brevet) for each origin category. A logarithmic scale is used to depict equivalent over- or under-representation ratios at equal distance from the y=1 reference line.

We now illustrate the method on 15 datasets representing various areas of French society (see Table 3). Three datasets are linked to political functions (Mayors, Parliament Members and Senators), five represent various types of occupations (Pharmacists, Lawyers, Accountants, Veterinarians, Researchers), and six are lists of candidates for various state exams (Brevet, Baccalauréat, BEP, CAP, BTS, Professional Baccalauréat). The École Polytechnique dataset lists students at one of the most highly ranked engineering schools in France.

From the list of surnames of each dataset, we apply the classifier to obtain vectors of values representing the guessed distributions of surname origins according to our typology. Note that this approach works by construction at the level of groups and may not be used at the level of individuals: to take an example from a distinct context, if we know that the given name “Camille” is about 80% of the time a female name, we are not able to draw a precise conclusion on the gender of a given Camille, while we can say that a group of 100 Camille is likely to be around 80% female.

Name                       List of surnames of all …                                                       Endnote  nb. obs.
Brevet                     Candidates to Diplôme National du Brevet in 2008                                5        562,952
Baccalauréat               Candidates to the nationwide Baccalauréat (Général and Technologique) in 2008            435,645
BEP                        Candidates to Brevets d’Études Professionnelles in 2008                                  116,814
CAP                        Candidates to Certificats d’Aptitude Professionnelle in 2008                             98,364
BTS                        Candidates to Brevets de Technicien Supérieur in 2008                                    87,917
Professional Baccalauréat  Candidates to Baccalauréats Professionnels in 2008                                       80,672
Pharmacists                Pharmacists registered in their Ordre Professionnel in 2017                     6        73,422
Mayors                     Mayors of French cities (“communes”) in 2014                                    7        36,628
Parisian Lawyers           Lawyers registered in the Parisian Bar Association in 2017                      8        32,021
École Polytechnique        Students at École Polytechnique (1958-2016)                                     9        23,058
Accountants                Accountants registered in their Ordre Professionnel in 2017                     10       20,946
Veterinarians              Veterinary physicians registered in their Ordre Professionnel in 2017           11       15,710
Researchers                Researchers at Centre National de la Recherche Scientifique in 2017             12       12,657
Parliament Members         Parliament Members of Assemblée Nationale (1958-2016)                           13       8,326

Table 3: List of datasets along with the corresponding numbers of observations.

In order to show how the diversity of certain subgroups of the population, in terms of surname origins, departs from that of a common reference point, we focus on dataset-to-dataset comparisons rather than raw distributions. In other words, comparing surname origin distributions across datasets enables us to assess the extent and magnitude of the divergence in the representativeness of groups of people with a given surname origin and, more broadly, the fact that some datasets and some origins exhibit the same pattern of divergence, likely indicative of similar underlying processes.

There is no public and unbiased source of data covering the surnames of the French population with which to perform such comparisons. Therefore, we chose the Brevet dataset as a point of comparison since it represents the most widely taken exam in France and, therefore, a wide sample of people who lived in France and were generally aged 14-15 as of 2008, hence 23-24 as of 2017. As such, it is also likely to exhibit a bias towards younger people.

Simply calculating the ratio between each target dataset and Brevet yields the results shown in Figure 2, which reveals several profiles of representativeness among the datasets described in Table 3. Values higher (resp. lower) than 1 correspond to surname origins which are over-represented (resp. under-represented) compared with their presence in Brevet (logically, Brevet exhibits a flat profile where all origins have a ratio of 1). Of course, these ratios do not render the fact that some categories are significantly more populated than others: this is typically the case for “North European”, which is the most common surname origin found in these French datasets. As a result, large under- or over-representation of less populated categories may have a relatively marginal effect on the over- or under-representation of the most populated category. The graphs of Figure 2 should be read with this proviso in mind: because of their sheer presence in all datasets, the ratios for North European surnames are often grouped around 1, while other categories may vary significantly below or above 1. In other words, these ratios tend to emphasize the over- or under-representation of minority categories, rather than the strong presence of the majority category, which can prove useful in the context of discrimination studies.
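The ratio computation is simple enough to sketch directly. The counts below are invented for illustration and only three origins are shown; the function names are ours.

```python
import math


def shares(counts):
    """Turn counts per origin into proportions."""
    total = sum(counts.values())
    return {origin: n / total for origin, n in counts.items()}


def representation_ratios(target_counts, reference_counts):
    """Ratio of each origin's share in the target to its share in the reference."""
    target, reference = shares(target_counts), shares(reference_counts)
    return {origin: target[origin] / reference[origin] for origin in reference}


# Hypothetical (corrected) counts for a reference and a target dataset.
reference = {"Arabian": 80, "Asian": 30, "N-European": 890}
target = {"Arabian": 20, "Asian": 60, "N-European": 920}

ratios = representation_ratios(target, reference)
# On a logarithmic scale, a ratio of 2 and a ratio of 1/2 lie at equal
# distance from the reference line (log ratio 0).
log_ratios = {origin: math.log2(r) for origin, r in ratios.items()}
print(ratios, log_ratios)
```

In this toy case the minority origins move by a factor of 4 and 2 while the majority origin stays close to 1, which is the effect described above.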

In addition, datasets and origins can be grouped according to the similarity of their divergence profiles, using for instance a simple hierarchical clustering based on the Canberra distance. The graphs in Figure 2 have been organized according to this proximity in order to display and emphasize datasets or origins behaving in a comparable manner.
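For reference, the Canberra distance between two profiles is the sum, over origins, of the absolute difference divided by the sum of absolute values. The sketch below uses made-up ratio profiles and a nearest-neighbour lookup in place of a full hierarchical clustering.

```python
def canberra(u, v):
    """Canberra distance between two equally long numeric profiles."""
    total = 0.0
    for x, y in zip(u, v):
        denom = abs(x) + abs(y)
        if denom:  # terms where both values are 0 are skipped by convention
            total += abs(x - y) / denom
    return total


def closest(name, profiles):
    """Name of the profile nearest to `name` in Canberra distance."""
    return min(
        (other for other in profiles if other != name),
        key=lambda other: canberra(profiles[name], profiles[other]),
    )


# Hypothetical ratio profiles over three origins.
profiles = {
    "Mayors": [0.2, 0.5, 1.1],
    "Senators": [0.25, 0.5, 1.1],
    "BTS": [0.9, 1.0, 1.0],
}
print(closest("Mayors", profiles))
```

Because each term is normalized by the magnitude of the values compared, differences on small ratios weigh as much as differences on large ones, which suits profiles dominated by one large category.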

3.2 Preliminary results: observations

While we do not aim to discuss in detail the implications of any particular bias in a given dataset, we may emphasize a few trends to illustrate the interpretation of the results.

All elective political functions (Mayors, Parliament Members, Senators), together with Veterinarians, exhibit a marked over-representation of Northern European surnames. On the other hand, under-representation, when it appears in these four datasets, is much more pronounced than in other datasets. It is actually spread among the remaining origins following a comparable pattern across all four datasets, with Arabian names being the most significantly affected, closely followed by Slavic, Indian and African surnames.

State exams show an overall smoother profile, with Baccalauréat, BTS and CAP having almost exactly the same distributions while Professional Baccalauréat and BEP display slightly different configurations. Interestingly, some datasets exhibit specific over-representation peaks for a single surname origin, such as Asian for École Polytechnique and Slavic for Researchers.

Some additional patterns emerge when examining these results along origins rather than datasets. For one, the under-representation of Arabian names is constant across all datasets, with the exception of BEP. Surnames of Asian origin are generally under-represented in elective functions, while their strongest over-representation occurs for two groups related to higher education, École Polytechnique and Researchers. As noted above, North European surnames represent, in absolute numbers, the bulk of inferred origins. Their representativeness is generally close to 1, indicating no remarkable relative variation across datasets, from Brevet to Parliament Members, even though the ratio rises slightly above 1 for the last four datasets, possibly as a result of the strong under-representation of other origins.

3.3 Contribution scope and future work

There is an undefined leap between the statistical observation of representativeness and fairness, between under-representation and discrimination, and between over-representation and privilege. For instance, while we can say that discrimination often implies under-representation, the converse is not necessarily true jobard2007color, and discrimination is usually evaluated on multiple complementary dimensions delattre2013introduction, both qualitative and quantitative. Moreover, our method does not yet take into account other socio-demographic variables to control for the existence of common causes behind the results. Doing so would make it possible to say, for instance, that a given surname origin is more present in a given dataset because that origin is over-represented in a socio-demographic segment which is itself over-represented in the population of that dataset. Taking for example the most under-represented case in our results, mayors with Arabian surnames, one may conclude that it illustrates a well-known discrimination in France foroni2016discrimination; cediey2007discriminations towards people of Arabian origins in elective functions. However, it is unclear how much of this ratio may be explained by discrimination or, for instance, by an uneven geographical presence of a given group of immigrants, or descendants thereof brutel2016localisation.

The results presented here should be further examined and perhaps challenged both for their statistical significance and historical relevance. In this respect, our article simply acknowledges that the study of representativeness is part of discrimination studies, whereby methods for large-scale estimation of the former contribute to the latter. Our results could be nuanced with expertise in the demography of each socio-professional group considered here, together with in-depth knowledge of the history of colonization and immigration in France beauchemin2016trajectoires.

Finally, while we applied our methodology to datasets representative of certain groups in French society, comparisons with other contexts (countries, world regions, transnational entities) with the help of relevant surname datasets could yield fruitful insights into both our method and possible interpretations.

3.4 Concluding remarks

The aim of this article is to demonstrate the feasibility of a technique for estimating representativeness based on a combination of open data sources, in contexts where data explicitly documenting individual origins may be difficult to process. We endeavored to show that these methods can work in the absence of public data and/or data specifying distribution priors chang2010epluribus or a priori ethnic taxonomies ambekar2009name.

By making the model available to anyone and relying on open data sources, we hope to encourage further exploration and improvement of such techniques, especially in the context of discrimination studies and the discussion of the specific biases corresponding to present and future datasets.


The authors would like to thank Telmo Menezes, Mikaela Keller, Elian Carsenat, Jean-Philippe Cointet, Élise Marsicano, Fabien Jobard, Jérémy Levy and Mélanie Bourgeois for their help with this research.


This paper has been partially supported by the “Algodiv” grant (ANR-15-CE38-0001) funded by the ANR (French National Agency of Research).




  1. endnote: Using the query 1800:2020[dp] on
  2. endnote: Natural Earth Data, 1:110m Cultural Vectors,
  3. endnote: We concretely apply a multinomial naive Bayes model with an additive (Laplace/Lidstone) smoothing parameter of 0.1. A programming notebook is available to observe and reproduce all steps described here:
  4. endnote: More precisely, we compute $P(\text{actual} \mid \text{guessed})$ as $P(\text{guessed} \mid \text{actual})\,P(\text{actual}) / P(\text{guessed})$. Moreover, since the confusion matrix is computed using prior proportions of surname origins extracted from PubMed, it is likely to be based on priors which diverge very significantly from the average proportions of surname origins in the “general” French population. To compensate for the PubMed bias as much as possible, we adjust the priors of the confusion matrix so that they match a distribution guessed initially by the uncorrected classifier on the Brevet dataset. This uncorrected distribution yields respectively 4.8, 8.3, 3.1, 20.7, 3.4, 57.1 and 2.6 % for each of the origins: African, Arabian, Asian, Central SE, Indian, NE, and Slavic. We thus correct the original confusion matrix of Table 2 by making column sample sizes proportional to these figures. In other words, the confusion matrix that we eventually use exhibits a structure more similar to that of the initially guessed Brevet proportions than the PubMed ones.
  5. endnote: Source for all 2008 exams:
  6. endnote: Source: Online directory of the Ordre National des Pharmaciens,
  7. endnote: Source: French government open data repository,
  8. endnote: Source: Online directory of the Parisian Bar Association,
  9. endnote: Source: Alumni online directory of École Polytechnique,
  10. endnote: Only independent, salaried and honorary accountants. Source: Online directory of the Ordre National des Experts-Comptables,
  11. endnote: Source: Online directory of the Ordre National des Vétérinaires,
  12. endnote: Only tenured researchers. Source: CNRS Online directory,
  13. endnote: Source: French National Assembly online database,