A Machine Learning Approach to Modeling Human Migration
Abstract
Human migration is a type of human mobility, where a trip involves a person moving with the intention of changing their home location. Predicting human migration as accurately as possible is important in city planning applications, international trade, spread of infectious diseases, conservation planning, and public policy development. Traditional human mobility models, such as gravity models or the more recent radiation model, predict human mobility flows based on population and distance features only. These models have been validated on commuting flows, a different type of human mobility, and are mainly used in modeling scenarios where large amounts of prior ground truth mobility data are not available. One downside of these models is that they have a fixed form and are therefore not able to capture more complicated migration dynamics. We propose machine learning models that are able to incorporate any number of exogenous features, to predict origin/destination human migration flows. Our machine learning models outperform traditional human mobility models on a variety of evaluation metrics, both in the task of predicting migrations between US counties as well as international migrations. In general, predictive machine learning models of human migration will provide a flexible base with which to model human migration under different whatif conditions, such as potential sea level rise or population growth scenarios.
A Machine Learning Approach to Modeling Human Migration
Caleb Robinson and Bistra Dilkina Georgia Institute of Technology School of Computational Science and Engineering Atlanta, Georgia. 30332 dcrobins@gatech.edu, bdilkina@cc.gatech.edu
1 Introduction
Models of human mobility in their different forms are important for many reasons. Models of human commuting can help reduce traffic congestion and pollution, and can be used to drive land use policy and development choices (?). Models of human migration are equally important to policy makers as they can give broader estimates of how the population of an area will change in upcoming years, how labor markets might be affected (?), how infectious diseases spread (?; ?), and how international trade will change (?). Much recent research focuses on modeling human commuting flows (?; ?); however little has focused on explicitly modeling human migration.
Human mobility has been traditionally modeled with the socalled gravity model, which posits that the probability of a trip between two locations decays directly as a function of the distance between them. This model was introduced in its modern form in 1946 (?) and has been used in many applications since (?; ?; ?; ?; ?; ?). More recently, the radiation model (?) has been shown to capture long range trips better than gravity based models, and is described as ‘a universal model for mobility and migration patterns’. The radiation model posits that the probability of a trip will decay indirectly with distance and directly with the amount of intervening opportunities, a notion first proposed by Stouffer (?). The radiation model has been extended several times since being proposed in 2012 (?; ?; ?). In general, gravity models have been shown to be more capable of reproducing commuting flows, i.e. human mobility at small spatial scales (?), while radiation models have been shown to be better at reproducing migration flows, i.e. human mobility at larger spatial scales (?). Additionally, human migrations have been estimated by fitting generalized linear models derived from the gravity model (?; ?). Both gravity and radiation models are analytical models with crafted functional forms and limited input data requirements. These models are focused on explaining human migration, rely on linear relationships between independent variables, and use hand crafted features for each zone. These approaches, while useful for explaining human migration, trade predictive power for interpretability. Data sources such as the World Bank and US Census provide many zonebased features that can be algorithmically combined in a nonlinear manner by tree or neural network based models to best predict human migration.
Our key contributions are as follows:

We develop the first general machine learning formulation of the human migration prediction problem.

We develop a pipeline for training machine learning models to tackle this problem that includes procedures to deal with dataset imbalance, hyperparameter tuning, and performance evaluation.

We develop a custom loss function for training artificial neural networks that is more suitable for the migration prediction task.

We compare the performance of machine learning models to traditional models of human migration on two datasets, and show that the machine learning models outperform the traditional models in all cases.
2 Traditional Migration Models
In human mobility modeling, we are usually given zones with the goal of predicting the number of people, , that move from every zone to every other zone . Traditional mobility models, such as the radiation model (?) and the gravity model (?), break the problem of estimating into two pieces: estimating the total number of people that leave zone (also referred to as a production function), and estimating the probability of a move occurring from to . The predicted number of migrants from to would be . As a convention, the probabilities are normalized such that, given an origin , the probabilities of traveling to all other destinations sum to , i.e. . If prior information about the number of incoming and outgoing travelers per zone is known, then a constrained framework, such as the one described in Lenormand et al (?), can be used. If this information is not known, which is the case in predicting migrations, then a production function must be used to estimate the number of outgoing travelers per zone instead. A simple production function for a dataset can be found by expressing the number of outgoing migrants of a zone as a constant fraction of the population of that zone (for counties in the US, this percentage is (?)). Given prior timestep’s data on the population and the corresponding number of outgoing migrants in the current timestep for every zone , the production function is expressed as , where is the slope of the line of best fit through the pairs (, ).
The traditional models of human mobility that we include in this study are: the radiation model (?), extended radiation model (?), and gravity models with both power and exponential distance decay functions (?). The only information used by these models is: , population of a zone for both the origin and destination zones; , the distance between two zones and ; and , a metric of intervening opportunities measured as the total population of all intervening zones between and , defined as all zones whose centroid falls in the circle centered at with radius (not including zones or ). See Table 1 for a description of each model.
Model  Equation 

Radiation  
Extended Radiation  
Gravity with Power Law  
Gravity with Exponential Law 
All of these models have two parameters that must be tuned using historical data: the production function parameter, , that determines what fraction of the population of a zone will migrate away in a given year, and a model parameter, . From a socioeconomic point of view, traditional models have the advantage of being interpretable, however we will show that this interpretability comes at a cost of predictive accuracy, as machine learning models can use similar historic data to achieve better results.
3 Evaluation Methods
To evaluate how well alternative models perform, we use four metrics that compare how well a predicted migration matrix, , recreates the ground truth values, . The first two of these (, ) have been used in previous literature to evaluate human mobility models (?; ?), and the other two are standard regression metrics:
 Common Part of Commuters ()

This metric directly compares numbers of travelers between the predicted and ground truth matrices. It will be when the two matrices have no entries in common, and when they are identical. We note that this metric, as used in previous studies of commuting flows, is identical to the BrayCurtis similarity score used to compare abundance data in ecological studies (?; ?).
(1)  Common Part of Commuters Distance Variant ()

This metric measures how well a predicted migration matrix recreates trips at the same distances as the ground truth data. In this definition, is a histogram where a bin contains the number of migrants that travel between and kilometers. It will be when the two matrices do not have any migrations at the same distance, and when all fall within the same distances.
(2)  Root mean squared error ()

This is a standard regression measure that will “punish” larger errors more than small errors. This score ranges from in a perfect match, to arbitrarily large values as the predictions become worse.
(3)  Coefficient of determination ()

This score measures the goodness of fit between a set of predictions and the ground truth values. This score ranges from , in a perfect fit, to arbitrarily negative values as a fit becomes worse, and is when the predictions are equivalent to the expectation of the ground truth values.
(4)
In addition to the previous four metrics, we compare the ground truth number of incoming migrants and the predicted number of incoming migrants per zone using mean absolute error () and . The predicted number of incoming migrants for a zone, , is calculated as . We argue that it is important to explicitly measure how well each model performs at estimating the number of incoming migrants, because the number of incoming migrants to a location will be the most important measure for policy makers in that area. Incoming migrant predictions can inform population growth estimates and hence infrastructure planning and job analysis.
4 Learning Migration Models
Formally, the problem of modeling human migration is as follows: given zones, features describing each zone, , and joint features describing features between a pair of zones, , at some timestep , the objective is to predict the origin/destination migration matrix at the next timestep, , where an entry represents the estimated number of migrants relocating from zone to zone . Our goal is to estimate a function , which takes the features of zone and , as well as the joint features between them, as input, and directly outputs the estimated number of migrants that travel from to . This approach is different from how the traditional migration models work as it does not require a production function, but instead directly predicts . This formulation contains the simplifying assumption that migrant flows are static in time, meaning that they can be entirely determined by the features from the previous timestep. In reality migrant flows will be dependent on temporal features, such as long term developmental trends, however many places will not have enough data to take advantage of these patterns. With this formulation, our models can be applied more broadly to predict future migration patterns in locations that have only collected a single year of data.
Hyperparameter optimization
To fit for a given dataset, we will train two machine learning models, “extreme” gradient boosting regression (XGBoost model) (?), and a deep learning based artificial neural network model (ANN model) (?). Each of these models contains several hyperparameters that must be tuned to obtain good performance on a given learning task. Our first model, the XGBoost model, is a standard machine learning model based on gradient boosting trees (?) that often performs very well on many regression and classification tasks. One benefit of this model is that it gives a ranking of the relative feature importances (?). The parameters of the XGBoost model that we consider for hyperparameter tuning are the maximum tree depth, number of estimators, and learning rate. Our ANN model is composed of densely connected layers with ReLU activation layers^{1}^{1}1Our model is implemented in Python with the Keras library: https://keras.io/. We tune the following ANN parameters: loss function, number of layers, layer width, number of training epochs, and training minibatch size.
Dealing with zeroinflated data
We observe that migration data is heavily zeroinflated, where in any given year, most pairs of zones do not have any migrants traveling between them, i.e. for most pairs. Considering migrations between US counties (?), less than 1% of the possible pairs of counties have migrations between them. This imbalance will cause problems for machine learning models. To address this problem, when creating a training dataset we undersample “negative” samples between pairs of zones for which there are no observed migrations. This is a naïve technique that will necessarily throw out available information (?). To offset this, we introduce a hyperparameter that determines the number of “negative” examples of migrations to train with. If there are pairs of zones where there are observed migrations, “positive examples”, we include all , and an additional randomly chosen zone pairs where there are no observed migrants. This hyperparameter is included in both the XGBoost and ANN model parameter searches. We give further details on the hyperparameter tuning process in Section 5.
Custom ANN loss function
Previously we mentioned that our ANN model will consider different loss functions as a hyperparameter. Common loss functions for regression tasks include “mean squared error” (MSE), “mean absolute error” (MAE), and “mean absolute percentage error” (MAPE). Our preliminary experimental results show that MAE and MAPE loss functions perform poorly, partially due to their inability to punish large errors and deal with many zeros respectively. To contrast with the aforementioned zeroinflation problem, we observe that the distribution of migrant counts has a long tail, whereby few pairs of zones consistently experience large volumes of migrations. Considering these observations, and because the CPC metric is one of the key metrics of interest (described in detail in Section 3), we derive a new loss function based on CPC to train the ANN model with. This loss function is given as:
(5) 
Where is a migration flow entry from . The gradient update for this loss function is:
(6) 
Intuitively, this loss will ‘reward’ predictions where the number of migrants matches the ground truth, while also enforcing per link absolute error to be minimized. This loss function is not exactly equivalent to the , as during the ANN training it will only consider a single minibatch worth of samples at a time (in our case , where as the metric is a function of the entire migration matrix). Empirically this custom loss function results in better validation performance and faster training times than MSE loss.
5 Experiments
Datasets
We perform experiments comparing the performance of traditional models to the performance of our machine learning on two datasets, the USA Migration dataset and the Global Migration dataset.
The USA Migration dataset consists of yearly intracounty migrations in the USA between 3106 counties from the IRS TaxStats data (?) for the 11 years in the range from 2004 to 2014. We supplement the migration data with the following 7 percounty features (taken from the US Census estimates and calculated from the Census TIGER line maps of US county boundaries): population, land area, population density, median household income, county water area, is a coastal county, and number of neighboring counties. In addition to these 7 percounty features, we add the following betweencounty features: distance, intervening population, intervening land area, intervening number of counties, intervening population density, and intervening median income. The intervening features are calculated based on the idea of “intervening opportunities” presented in the radiation model. For any given countylevel variable, , e.g. population, the intervening amount of that variable between counties and is defined as , the sum of all in the intervening counties that fall within the circle centered at county with a radius to county (excluding and ).
The Global Migration dataset consists of decadal intercountry migration data between 207 countries from the World Bank Global Bilateral Migration Database (?). The Global Migration dataset contains 5 timesteps, one every 10 years from 1960 to 2000. In the Global Migration dataset we add the following 5 percountry features (taken from World Bank World Development Indicators data (?)): population, population density, population growth, agricultural emissions, and land area. Additionally, we include 3 betweencountry features: distance, intervening population, and intervening land area.
For each year of data in the USA Migration and Global Migration datasets we create an ‘observation’ for each pair of zones, an origin zone and destination zone (counties in the USA Migration dataset and countries in the Global Migration dataset). Each observation consists of the perzone features for both the origin zone and destination zone (population of origin, population of destination, etc.) and the betweenzone features of the origin and destination. This corresponds exactly to the of the function (described in Section 4) that we want to learn. An observation is associated with the target number of migrants, , traveling from the origin to the destination. Formally, for a given year, , number of zones, , number of perzone features, , and number of betweenzone features, , we create a matrix of observations and vector of targets , capturing the migration flows observed in year .
USA Migrations  Metrics on full matrix 


Production Function  
Gravity Model Exponential Decay  0.53 +/ 0.01  0.66 +/ 0.02  87.4 +/ 9.0  1.48 +/ 0.28  1,216 +/ 128  0.67 +/ 0.03  
Gravity Model Power Law Decay  0.56 +/ 0.01  0.78 +/ 0.02  75.7 +/ 8.0  0.86 +/ 0.26  1,129 +/ 129  0.72 +/ 0.04  
Radiation Model  0.53 +/ 0.01  0.76 +/ 0.02  47.6 +/ 5.0  0.26 +/ 0.09  1,346 +/ 148  0.80 +/ 0.02  
Extended Radiation Model  0.58 +/ 0.01  0.83 +/ 0.01  35.6 +/ 3.0  0.59 +/ 0.03  1,123 +/ 117  0.86 +/ 0.02  
XGBoost model  traditional features  0.51 +/ 0.08  0.74 +/ 0.07  28.6 +/ 5.4  0.72 +/ 0.10  1,151 +/ 249  0.86 +/ 0.04  
ANN model  traditional features  0.63 +/ 0.01  0.86 +/ 0.02  35.1 +/ 3.2  0.60 +/ 0.03  911 +/ 107  0.91 +/ 0.01  
XGBoost model  extended features  0.58 +/ 0.03  0.78 +/ 0.02  24.2 +/ 1.4  0.81 +/ 0.02  968 +/ 56  0.89 +/ 0.02  
ANN model  extended features  0.68 +/ 0.01  0.89 +/ 0.02  29.8 +/ 2.7  0.71 +/ 0.02  935 +/ 98  0.91 +/ 0.02  
No Production Function  
XGBoost model  traditional features  0.54 +/ 0.11  0.99 +/ 0.02  18.5 +/ 6.1  0.88 +/ 0.08  3,091 +/ 1,740  0.41 +/ 0.85  
ANN model  traditional features  0.63 +/ 0.02  0.88 +/ 0.06  35.3 +/ 3.5  0.60 +/ 0.04  1,188 +/ 259  0.84 +/ 0.16  
XGBoost model  extended features  0.62 +/ 0.04  0.99 +/ 0.02  13.0 +/ 1.5  0.94 +/ 0.02  2,060 +/ 622  0.76 +/ 0.28  
ANN model  extended features  0.69 +/ 0.01  0.93 +/ 0.05  28.0 +/ 3.6  0.75 +/ 0.03  909 +/ 48  0.92 +/ 0.04 
Global Migrations  Metrics on full matrix 


Production Function  
Gravity Model Exponential Decay  0.16 +/ 0.00  0.16 +/ 0.00  62,218 +/ 5,341  0.02 +/ 0.03  651,194 +/ 80,220  0.00 +/ 0.02  
Gravity Model Power Law Decay  0.16 +/ 0.00  0.15 +/ 0.00  61,523 +/ 5,278  0.05 +/ 0.00  628,678 +/ 79,474  0.03 +/ 0.03  
Radiation Model  0.16 +/ 0.00  0.14 +/ 0.00  62,173 +/ 5,277  0.02 +/ 0.00  614,483 +/ 79,378  0.04 +/ 0.02  
Extended Radiation Model  0.16 +/ 0.00  0.14 +/ 0.00  62,108 +/ 5,299  0.03 +/ 0.00  618,576 +/ 76,150  0.03 +/ 0.02  
XGBoost model  traditional features  0.18 +/ 0.01  0.14 +/ 0.01  58,377 +/ 5,141  0.14 +/ 0.01  597,478 +/ 79,178  0.10 +/ 0.01  
ANN model  traditional features  0.19 +/ 0.01  0.15 +/ 0.01  60,272 +/ 4,610  0.08 +/ 0.02  589,789 +/ 79,542  0.26 +/ 0.03  
XGBoost model  extended features  0.21 +/ 0.01  0.16 +/ 0.01  57,909 +/ 5,409  0.16 +/ 0.02  573,090 +/ 56,987  0.15 +/ 0.02  
ANN model  extended features  0.22 +/ 0.02  0.17 +/ 0.01  58,887 +/ 4,477  0.12 +/ 0.02  563,259 +/ 74,127  0.24 +/ 0.06  
No Production Function  
XGBoost model â traditional features  0.33 +/ 0.02  0.59 +/ 0.03  52,729 +/ 5,455  0.26 +/ 0.26  938,905 +/ 172,834  0.18 +/ 0.28  
ANN model â traditional features  0.33 +/ 0.01  0.37 +/ 0.04  56,005 +/ 882  0.20 +/ 0.11  537,351 +/ 44,034  0.53 +/ 0.16  
XGBoost model â extended features  0.43 +/ 0.03  0.64 +/ 0.02  47,329 +/ 5,073  0.42 +/ 0.13  577,473 +/ 77,315  0.48 +/ 0.34  
ANN model â extended features  0.40 +/ 0.02  0.43 +/ 0.02  50,921 +/ 3,556  0.33 +/ 0.13  459,841 +/ 55,479  0.52 +/ 0.30 
USA Features  Importance  Global Features  Importance 

Intervening number of counties  25.3% +/ 2.4%  Population growth of origin  19.5% +/ 16.0% 
Population of origin (trad)  15.7% +/ 1.7%  Intervening population (trad)  12.3% +/ 3.7% 
Population of destination (trad)  14.2% +/ 0.9%  Agricultural emissions of destination  10.6% +/ 5.8% 
Intervening population (trad)  6.1% +/ 1.2%  Intervening land area  8.7% +/ 5.6% 
Is destination coastal  4.3% +/ 4.6%  Population growth of destination  7.9% +/ 6.2% 
Distance between counties (trad)  3.7% +/ 0.9%  Population of destination (trad)  6.9% +/ 0.8% 
Intervening area  3.6% +/ 0.9%  Distance between counties (trad)  6.6% +/ 1.9% 
Area of destination  3.5% +/ 0.5%  Population of origin (trad)  6.1% +/ 1.6% 
Number of neighbors destination  3% +/ 1.6%  Population density of destination  5.7% +/ 4.5% 
Water area of origin  3% +/ 1.3%  Land area of origin  5.2% +/ 3.1% 
Experimental Setup
To select the hyperparameters of the models that we described in Sections 2 and 4, we consider triplets of “years” of data as training, validation, and testing sets. Specifically, for three years of data , we call the training set, the validation set, and the test set. We tune the hyperparameters of the models using a randomized grid search with sampled hyperparameters using the training and validation sets. We select the best set of hyperparameters according to the score, then use those parameters to train a model on the validation set and record its performance on the test set. We repeat this process for each triplet of years in each dataset. Our final results are reported as averages over the test set results.
A hyperparameter present in both XGBoost and ANN models is the downsampling rate, . As a preprocessing step for a given year of training data , we include all observations, where (let this number of samples be ), and choose random samples with replacement from the remaining observation (where ). This sampling process only takes place when training a model. When testing a model on the validation or test sets, the full datasets are always used. For experiments with the USA Migration dataset we consider values of in a uniform distribution of integers from to , while for experiments with the Global Migration dataset we consider the uniform distribution of integers from to , because the average percentage of nonzeros is and respectively.
For the XGBoost models, we sample the following parameters: maximum tree depth from ^{2}^{2}2 is the discrete uniform distribution., number of estimators from , and learning rate uniformly in the range from to . For the ANN models we sample the following parameters: network loss function uniformly from , number of layers uniformly from , layer width from , number of training epochs from , and training minibatch size uniformly from .
We calibrate the parameters of the traditional models in a similar manner. Every traditional model, except for the radiation model, has two parameters, and , that must be calibrated to give useful results (the radiation model only uses ). For each pair of data (that we refer to as training and testing sets respectively), we find the value of that gives the best production function on the training set. Similarly, we find the value of that maximizes the score of each traditional models on the training set. We then use these and values to run each model on the testing set, and report the results as averages over all test set results.
To directly compare how the ML models and traditional models perform under the same conditions, we perform experiments where the ML models are used with the same production functions as the traditional models. This imposes an artificial constraint on the ML models, as these models are able to directly estimate the number of migrations between two zones, without supplemental information on the number of outgoing migrants from each zone. To apply a production function, , to the predictions made by a ML model, , we create a new set of predictions, , where an entry .
Results
Tables 2 and 3 show the average results over all years of data of the ML models and traditional models in the USA Migration and Global Migration datasets respectively. From these tables we observe that the best traditional model for the USA Migration dataset is the Extended Radiation model, beating the other traditional models in all metrics. None of the traditional models are able to capture the migration dynamics in the Global Migration dataset; they all have an score near , meaning that a model which predicts the average number of migrants for every link would perform just as well. The ML models perform much better. In the case where the ML models are constrained to the same conditions as the traditional models, using traditional features and a production function, the ANN model beats all of the traditional models in 5 out of the 6 measures in the USA Migration datasets, and the XGBoost model beats all the traditional models in 4 out of the 6 metrics. Similarly, the ML models considerably outperform the traditional models in the Global Migration, outperforming them in all metrics. Considering the extended feature set results, the ML models perform even better. The ANN and XGBoost models without a production function outperform the same models with a production function in 5 out of the 6 metrics. The XGBoost model outperforms the ANN model in 3 out of the 4 metrics that evaluate the models’ per link predictions, however the ANN model performs better on the two metrics that evaluate the aggregate incoming migrant prediction performance.
These results suggest that more features than those which are used by the traditional models, are needed to accurately predict human migrations. The ability of ML models to incorporate any number of additional features is one of the key motivations for using them to obtain more accurate results. Considering this, it will be insightful to understand, which of the features are most informative to the ML models. Since feature importance analysis for ANNs is quite challenging, we report in Table 4 the top 10 most important features for both datasets (based on information gain) in the XGBoost model trained on the extended feature set, averaged over all years of data. In both datasets, the intervening population feature is in the top 4 important features which validates the intuition that intervening opportunities are important in predicting migrations. The most important feature in the USA Migration dataset is the number of intervening counties between two locations, a simpler form of the intervening population idea. In the Global Migration dataset, the population growth of the origin is the most important feature on average, with a large standard deviation. In some years this feature is very important, however in other years it is less so. Intuitively, population growth will be correlated with the amount of incoming migration. During relatively stable years, with small population growth, other features will be more predictive of migration.
In Figure 1 we show the difference between the actual and predicted numbers of incoming migrants per county for the two best traditional models, and all of the ML models without production functions. From these maps we can see that between ML models, those trained with the extended feature set perform better than those trained with only the traditional features. Specifically, without the extended features, the ML models underpredict the number of migrations to the western portion of the United States. When the extended features are taken into account, the models are able to correct for this spatial bias. Holding with the experimental results, we can see that the ANN model with extended features best captures the incoming migrant distributions per county. The ANN model is able to more accurately match the number of migrants that travel to rural areas (e.g. to the midwestern US), compared to the traditional models that consistently over estimate the numbers of migrants to rural areas. In general, these maps agree with our empirical results, that the ANN model (with the lowest average incoming migrants ) is able to best predict migrations.
6 Conclusion
With the increasing availability of high resolution socioeconomic data in countries that also record human migrant flows, it is possible to use machine learning models of human migration rather than traditional gravity or radiation models. Machine learning models offer greater levels of modeling flexibility, as they can combine many input features in nonlinear ways that can not be captured by static equations. Furthermore, machine learning models can be easily customized to the problem or country at hand.
We develop two machine learning based models for the task of predicting human migration flows, for both between counties in the US and between countries across the world. We compare these models to traditional human migration models using two sets of features and show that our models outperform the traditional models in most of the evaluation metrics.
We would like to extend this work to better explain human migration through a more complete analysis of features included in the model, and through incorporating different models. While the XGBoost model can provide a ranking of feature importances, this does not fully explain the dynamics that drive human migration. Additionally we would like to study how these migration models could be specialized to predict migrations under extreme weather events. Hurricanes and other natural disasters can displace large populations, and determining where these populations will resettle would provide an unique planning tool for policy makers. To achieve these goals, higher resolution migration data, on both spatial and temporal scales, will need to be obtained. Extreme weather events, by definition, are short lived and their effects will be better estimated and predicted at local scales.
References
 [Balcan et al. 2009] Balcan, D.; Colizza, V.; Gonçalves, B.; Hu, H.; Ramasco, J. J.; and Vespignani, A. 2009. Multiscale mobility networks and the spatial spreading of infectious diseases. Proceedings of the National Academy of Sciences 106(51):21484–21489.
 [Chen and Guestrin 2016] Chen, T., and Guestrin, C. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794. ACM.
 [Clark and Ballard 1980] Clark, G. L., and Ballard, K. P. 1980. Modeling outmigration from depressed regions: the significance of origin and destination characteristics. Environment and Planning A 12(7):799–812.
 [Cohen et al. 2008] Cohen, J. E.; Roig, M.; Reuman, D. C.; and GoGwilt, C. 2008. International migration beyond gravity: A statistical model for use in population projections. Proceedings of the National Academy of Sciences 105(40):15269–15274.
 [De Montis et al. 2010] De Montis, A.; Chessa, A.; Campagna, M.; Caschili, S.; and Deplano, G. 2010. Modeling commuting systems through a complex network analysis: A study of the italian islands of sardinia and sicily. Journal of Transport and Land Use 2(3).
 [Dennett and Wilson 2013] Dennett, A., and Wilson, A. 2013. A multilevel spatial interaction modelling framework for estimating interregional migration in europe. Environment and Planning A 45(6):1491–1507.
 [Dinkelman and Mariotti 2016] Dinkelman, T., and Mariotti, M. 2016. The longrun effects of labor migration on human capital formation in communities of origin. American Economic Journal: Applied Economics 8(4):1–35.
 [Fagiolo and Mastrorillo 2013] Fagiolo, G., and Mastrorillo, M. 2013. International migration network: Topology and modeling. Physical Review E 88(1):012812.
 [Faith, Minchin, and Belbin 1987] Faith, D. P.; Minchin, P. R.; and Belbin, L. 1987. Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69(13):57–68.
 [Friedman 2001] Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics 1189–1232.
 [Greenwood 1985] Greenwood, M. J. 1985. Human migration: Theory, models, and empirical studies*. Journal of regional Science 25(4):521–544.
 [Guo et al. 2008] Guo, X.; Yin, Y.; Dong, C.; Yang, G.; and Zhou, G. 2008. On the class imbalance problem. In Natural Computation, 2008. ICNC’08. Fourth International Conference on, volume 4, 192–201. IEEE.
 [LeCun, Bengio, and Hinton 2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.
 [Lee 1966] Lee, E. S. 1966. A theory of migration. Demography 3(1):47–57.
 [Legendre and Legendre 2012] Legendre, P., and Legendre, L. F. 2012. Numerical ecology, volume 24. Elsevier.
 [Lenormand, Bassolas, and Ramasco 2016] Lenormand, M.; Bassolas, A.; and Ramasco, J. J. 2016. Systematic comparison of trip distribution laws and models. Journal of Transport Geography 51:158–169.
 [Lenormand et al. 2012] Lenormand, M.; Huet, S.; Gargiulo, F.; and Deffuant, G. 2012. A universal model of commuting networks. PloS one 7(10):e45985.
 [Letouzé et al. 2009] Letouzé, E.; Purser, M.; Rodríguez, F.; Cummins, M.; et al. 2009. Revisiting the migrationdevelopment nexus: a gravity model approach. Human Development Research Paper 44.
 [Mason et al. 1999] Mason, L.; Baxter, J.; Bartlett, P. L.; and Frean, M. R. 1999. Boosting algorithms as gradient descent. In NIPS, 512–518.
 [Masucci et al. 2013] Masucci, A. P.; Serras, J.; Johansson, A.; and Batty, M. 2013. Gravity versus radiation models: On the importance of scale and heterogeneity in commuting flows. Physical Review E 88(2):022812.
 [Noulas et al. 2012] Noulas, A.; Scellato, S.; Lambiotte, R.; Pontil, M.; and Mascolo, C. 2012. A tale of many cities: universal patterns in human urban mobility. PloS one 7(5):e37027.
 [Özden et al. 2011] Özden, Ç.; Parsons, C. R.; Schiff, M.; and Walmsley, T. L. 2011. Where on earth is everybody? the evolution of global bilateral migration 1960–2000. The World Bank Economic Review 25(1):12–56.
 [Pierce 2015] Pierce, K. 2015. SOI migration data. A new approach: Methodological improvements for SOICâs United States population migration data, calendar years 20112012. Technical report, Statistics of Income, Internal Revenue Service.
 [Ren et al. 2014] Ren, Y.; ErcseyRavasz, M.; Wang, P.; González, M. C.; and Toroczkai, Z. 2014. Predicting commuter flows in spatial networks using a radiation model based on temporal ranges. Nature communications 5.
 [Schneider 1959] Schneider, M. 1959. Gravity models and trip distribution theory. Papers in Regional Science 5(1):51–56.
 [Simini et al. 2012] Simini, F.; González, M. C.; Maritan, A.; and Barabási, A.L. 2012. A universal model for mobility and migration patterns. Nature 484(7392):96–100.
 [Sorichetta et al. 2016] Sorichetta, A.; Bird, T. J.; Ruktanonchai, N. W.; zu ErbachSchoenberg, E.; Pezzulo, C.; Tejedor, N.; Waldock, I. C.; Sadler, J. D.; Garcia, A. J.; Sedda, L.; et al. 2016. Mapping internal connectivity through human migration in malaria endemic countries. Scientific Data 3.
 [Stouffer 1940] Stouffer, S. A. 1940. Intervening opportunities: a theory relating mobility and distance. American sociological review 5(6):845–867.
 [World Bank 2017] World Bank. 2017. http://databank.worldbank.org/data/reports.aspx.
 [Yang et al. 2014] Yang, Y.; Herrera, C.; Eagle, N.; and González, M. C. 2014. Limits of predictability in commuting flows in the absence of data for calibration. Scientific reports 4.
 [Zipf 1946] Zipf, G. K. 1946. The p 1 p 2/d hypothesis: on the intercity movement of persons. American sociological review 11(6):677–686.