MultiSeason Analysis Reveals the Spatial Structure of Disease Spread
Abstract
Understanding the dynamics of infectious disease spread in a heterogeneous population is an important factor in designing control strategies. Here, we develop a novel tensordriven multicompartment version of the classic SusceptibleInfectedRecovered (SIR) model and apply it to Internet data to reveal information about the complex spatial structure of disease spread. The model is used to analyze statelevel Google search data from the US pertaining to two viruses, Respiratory Syncytial Virus (RSV), and West Nile Virus (WNV). We fit the data with correlations of , and for RSV and WNV, respectively. Although no prior assumptions on spatial structure are made, human movement patterns in the US explain 2730% of the estimated interstate transmission rates. The transmission rates within states are correlated with known demographic indicators, such as population density and average age. Finally, we show that the patterns of disease load for subsequent seasons can be predicted using the model parameters estimated for previous seasons and as few as weeks of data from the current season. Our results are applicable to other countries and similar viruses, allowing the identification of disease spread parameters and prediction of disease load for seasonal viruses earlier in season.
1 Introduction
The spread of infectious diseases through a population of susceptible individuals varies in time, space, and according to the characteristics of the susceptible individuals and the disease[1]. Modeling this spread can provide information on the spreading mechanism of the infection and can assist in designing strategies for control of the disease. One common method to model infections is compartmental models[2, 3], which in their basic form assume a single population divided into 3 subgroups of susceptible, infected, or recovered (SIR) individuals. This model describes a global spreading mechanism where each infected individual can infect any susceptible individual. Another common model describes local spreading[4], where infected individuals only spread infection to a limited subset of susceptible individuals.
In reality, many epidemics spread through a combination of two or more spreading mechanisms; hybrid spreading[5]. Such spreading can be modeled by multiple compartments encompassing different demographic characteristics[4], various disease strains[6, 7, 8, 9, 10], and crossimmunity in an age structured model[11, 12, 13]. However, fitting data from a given epidemic to these models remains a challenge owing to a dearth of data of sufficiently high spatial and temporal resolution. This problem is exacerbated the higher the resolution of the model, since the number of model parameters grows with the number of compartments of the model. Thus, in many cases researchers introduce simplifying assumptions in order to reduce the dimensionality of the models[14, 15, 16]. In the absence of actual data, the spreading topology can generally be analyzed by numerical simulation of models using synthetic data, for example, of the patterns of human movements at the desired scale[16, 17]. It can also be analyzed through the use of theoretical tools to detect critical phenomena in networks[18, 19, 5].
Another approach to overcome a lack of epidemiological data is to use proxy data, such as mobile phone data[20] or Internet data[21]. Internet data, including search engine queries, social media postings[22, 23], or Wikipedia visits can serve as a proxy for such information and provide data at high spatial and temporal resolutions from very large populations. One demonstration of the value of these data is the recent evaluation of the childhood influenza vaccine campaign in England[23], which showed that vaccination of primary schoolage children significantly reduces influenza rates as estimated from Twitter posts and Bing queries.
In this study, we focus on modeling the spatial and temporal spread of infectious disease using a tensordriven multicompartment version of the classic SusceptibleInfectedRecovered (SIR) model. We illustrate the ability of the model to capture disease parameters by fitting it to Internet data on two common viruses, Respiratory Syncytial Virus (RSV) and West Nile Virus (WNV) in the United States (US). Our results demonstrate that the spatial and temporal dynamics of these viruses can be captured by this model with almost no apriori assumptions on the spreading topology and interaction. This allow us to characterize the spread of the virus of future seasons. The inferred model parameters are shown to be correlated with population indices such as human movement dynamics, and with demographic and environmental factors which are likely modulating transmission.
2 The model
The proposed model is a generalization of the single population SIR model[2], that describes the average evolution of an infectious disease in a single population. In the SIR model, disease spread is represented by a system of ordinary differential equations for the susceptible , infected , and recovered individuals. It accounts for only one pathogen and assumes equal probability of being infected by any individuals regardless of his position i.e. mean field interaction.
The model equations are:
(1) 
where is the infection rate and is the recovery rate. This model, Eq. (1), can be generalized to account for multiple viruses and spatial effects due to groups of subpopulations and virus strains, by transforming to a multidimensional representation in which , , , , and are tensors (Eq. (2)). These tensors act as timedependent state representation of the entire system, reflecting its various attributes or dimensions. That is, each twodimensional projection represents different subgroups e.g. viruses, spatial regions. Here, the system describes the evolution of an infectious subpopulation interacting with other subpopulations under the influence of multiple viruses. The dynamics for the systemâs state tensors evolution is represented by the following system of ordinary differential equations:
(2) 
where boldface represents a tensor. We refer to this model as a multicompartment SIR model (mcSIR). This model, and its reduction to account for multiple viruses and virus strains in one population, is presented by Levy et al.[6]. Thus, the multicompartment model is represented by a multidimensional set of equations. This representation of sub population and viruses can also be generalized to other epidemic models, such as the SIRS and the SIS[1].
In this work, we focus on the spreading mechanism in a group of subpopulations (different spatial regions) analyzed separately for each virus. Within each subpopulation the interaction between individual is well mixed i.e., of mean field type. In this case, , , and take the form of square matrices with dimension . The specific elements in these matrices, e.g. , are defined as the fraction of susceptible individuals traveling from region to at time , where , represents the fraction of susceptible individuals in subpopulation present at time . The offdiagonal elements of the matrix account for the infection rates between subpopulations and the diagonal elements represent the infection rates within each subpopulation. Note that, is not a symmetric matrix, as the probability to move from to is not assumed to be the same as moving from to . The matrix , under the assumption of distinct subpopulations, is defined as a diagonal matrix. Each element, , is the average recovery rate for individuals belonging to the th subpopulations. For our purposes, we also assume that the number of infected people in each region is the sum of all the infected people traveling to this region, and that the number of susceptible people traveling is negligible compared to the sizes of the susceptible subpopulations in the region. In this case, the model is reduced to a traditional subpopulation model[4].
The advantage of using a subpopulation model is that in contrast to situations where populations might be well mixed (a common assumption of models without spatial structure) local disease spread is taken into account. This means that most of the population is not exposed to the infection immediately upon its introduction. Rather, the disease may have to pass through many intermediate individuals before reaching all members of the population. This is mainly due to the fact that at a larger scale, individuals that are close together in space are likely to come into contact with each other more frequently than individuals that are more distant. In addition, this locally structured populations may exhibit clique behavior. Moreover, in many cases one can find a spatial epidemic wave pattern depending on the virus type and location[24, 21, 25]. This assumption, which to some extent holds true even today when people move longer distances over shorter times, may smoothout the clique behavior. Understanding the spatial nature of disease transmission has important consequences for control measures aimed at a disease per location.
In order to validate the use of subpopulation models presented in Eq. (2) to describe the observed Internet data, we tested the model on seasonal data derived from search engine queries about two commonly spread viruses in the US, RSV and WNV. In this case, the subpopulations are the 50 states. We chose these two viruses to fit to an SIR model since there is no vaccine or medicine for prevention or treatment. Moreover the symptoms are usually mild meaning people are less likely to visit a doctor or have the case reported to the national health care authorities. On the other hand, Internet search data in these cases may provide real time high resolution estimate for the disease activity.
3 Methods
3.1 Data sources
We used two Internet data sources in our analysis: Google Trends[26] and Twitter. Google Trends was used to determine the relative number (per state in the US) of queries made about each of the two viruses as described below. Data was extracted at a weekly resolution. Twitter was used to estimate human mobility patterns as described below.
3.1.1 Respiratory Syncytial Virus (RSV)
Data about RSV were extracted for the years 2013 to 2018 i.e., 5 seasons. A season starts at the beginning of October every year. The data comprised of the number of queries for the term “RSV”. A single term was sufficient in this case because of the observation[21] that this number is in high correlation with the number of people reported as infected in each state, as published by the Centers for Disease Control and Prevention (CDC). Our data included all US states excluding Hawaii.
3.1.2 West Nile Virus (WNV)
Data about WNV were extracted for the years 2014 to 2018 i.e., 4 seasons. A season starts at the beginning of April every year. Preliminary analysis of Google Trends data showed that the number of queries for the term ”West Nile” from each state during 2017 was highly correlated with the yearly total number of infected cases reported by the CDC, reaching (pvalue=). Thus, we used the number of people querying for ”West Nile” from 2014 to 2018 as a proxy for WNV disease load. Our data included all contiguous US states, excluding Hawaii and Alaska.
3.1.3 Data on human mobility patterns in the US
We collected all messages from Twitter having a GPS location in the US from October 1, 2015 through April 31, 2016. For each message we extracted an anonymized user identifier, the time of the message, and the location from where it was made. These data comprised of approximately 50 million messages. The exact GPS location of each message was mapped as its encompassing US state. From these data we created a matrix of mobility patterns, where the th cell comprised of the number of people whose location in one tweet was state and in their following tweet was state . This matrix was normalized by dividing the number of people moving from state to state by the total number of people who moved from state to any other state.
3.1.4 Demographic information
We extracted from the US census[27] variables that may affect the spreading rate of a virus within a population. These data, extracted per state, included:

Population size (number of people)

Density per square mile of land area.

Children in the following age groups: 04, 511,1214, 1417 (%)

Percentage of people in each of 6 main racial groups

Average income per household and per family (Dollars)

Poverty rate in adults and child under 18 (%)

Senior people (65 years and over) in the population (%)

Average age

Urban population (%)

Unemployment rate (%).
3.2 Parameter matching
Fitting is carried out in two steps. In the first step, a dictionary of functions is generated. Each function is the infection rate function from an instantiation of a solution to the single population SIR model (Eq. (1)) chosen from a wide range of parameter values. Specifically, we used with steps and with steps. Each function entry is intended to be proportional to the number of infected individuals per state over time sampled at a weekly resolution. Out of the dictionary, one function is chosen, such that it has the highest correlation with the data from each state over a period of one season.
In the second step, we fit the data to the multicompartment SIR (mcSIR) model presented in Eq. (2). The parameters found in the previous step are used as an initialization to the mcSIR fitting process. Specifically, the ondiagonal elements of the matrix were initialized to the values found from the single SIR model. The offdiagonal values of were initialized to uniformly distributed random values between 0 and the average found from the fit to the single population SIR. We perform a search in the parameter space for the solution of the mcSIR model which maximizes the highest geometric mean across states of the correlations between the mcSIR estimated infection rates and data acquired from Google Trends calculated over four seasons.
The use of multiple seasons provides us with more data to increase robustness to noise, under the assumption that the change in the infection rate from season to season and the recovery rate over the years can be accounted for by a single seasondependence multiplicative scaling factor .
Another assumption made in the fitting of the mcSIR model is the use of a scalar recovery rate for all seasons and all states. This assumption was made because of the observation from the single SIR fit that the recovery rates are similar for all states and seasons.
The gradient descent optimization process was limited to iterations. We repeated the optimization process times, each time with random initialization on the offdiagonal terms of the matrix. The run with highest correlation between the data and the infected rate estimated by the mcSIR model was chosen.
The single population SIR and mcSIR models are solved numerically. All analyses were performed using Matlab R2017b[28].
4 Results
The algorithm reached an average correlation (across states) of 0.70 for RSV and of 0.52 for WNV between the Google Trends data and the mcSIR model.
Figure 1 shows the distribution of the model fit () between the Google Trends data and the mcSIR estimated infection rate in each state over seasons for RSV ( state and season pairs), and seasons for WNV ( state and season pairs). For both viruses, the majority of states show high correlation between mcSIR model and the Google Trends data. For RSV (Figure 1a) the correlation distribution is centered around . Although, the data for WNV (Figure 1b) was noisier, there was still a significant number of states with a correlation higher than . We hypothesize that WNV data is noisier than RSV due to the lower incidence of the former.
4.1 Infection rate within states
After fitting the mcSIR model to the data, the infection rate matrix and the recovery rate of each virus could be examined. We note that the best value found for the recovery rate is for both RSV and WNV.
The values along the main diagonal of the matrix represent the infection rate within each subpopulation normalized by the subpopulation size[4]. Google Trends data is normalized to the maximal number of queries in each state in a given season. This, the infection rate within each state require similar normalization. Since the number of queries in a state are correlated with its population size, in order to estimate the infection rate in each state we multiply the diagonal elements of by the state’s population size.
Then, in order to explain the estimated infection rate values, we built a rank regression model where the independent variables were the known demographic variables of each state (see Methods) and the estimated infection rates in each state.
In the case of RSV, the model fit is (pvalue). Statistically significant variables are population density (slope: ) and average age (slope: ). The model for WNV reached a fit of (pvalue). Statistically significant variables are population density (slope: ), poverty (slope: ), and average age (slope: ).
Infection rates of viruses increase in areas of denser population[29] and so both viruses have a positive correlation with population density. For both viruses there is a negative correlation with average age meaning that younger people are more susceptible. RSV is known to be more severe in infants[30]. WNV is commonly spread by mosquitoes, which may explain the positive correlation with poverty, as mosquitoes could be more prevalent in such areas[31].
4.2 Interstate infection rates
The offdiagonal elements of the infection rate matrix represent the rate of infection between states. That is, these elements represent the rate at which susceptible people in one state are likely to be infected by people from another state. Thus, we hypothesized that these elements should be correlated with human mobility patterns among states. To estimate the correlation between the statelevel mobility patterns estimated from Twitter data and the values of , we normalized by the average number of infected people estimated by the mcSIR model in each state during a particular season i.e., the infection rate matrix is multiplied by a diagonal matrix of the average number of people transmitting the disease in a season in each state. This quantity is defined as the average transmission of the infection between different states.
The Spearman correlation between the movement patterns, as estimated from Twitter, and the estimate normalized infection rates for RSV was (pvalue ) and for WNV was (pvalue ). Thus, approximately a significant portion of of the variance in transmission rates of the examined viruses is explained by human mobility patterns.
Embedding the transmission rate matrix on a map of the US provides a clearer understanding of disease spread. Figure 2 shows the transmission rate matrix averaged over the first of the season for the RSV virus in 2016. The color of each state represents the week of peak infection as estimated by the model. Thus, the model predicts Florida would be the first state where the disease peaks. This also identifies Florida as the source of the virus. This is in line with our knowledge of the virus spread in the US[30]. The color of the arrows in Figure 2 represents the strength of the interaction between each two states given by the elements of the transmission matrix . Since the transmission matrix is correlated with the movements of people, the matrix elements (arrows on the map) represent the beginning of spatial virus spread in the US predicted by the model.
4.3 Characterization of the disease’s spread in the following seasons
Our results suggest that the parameters of disease spread change little from year to year. This can be observed by looking at the season dependent scaling factors, (defined in the Methods). This factor approaches a value of when there is seasonal invariance. For RSV the average value of , over seasons was , and for WNV the average value over seasons was . Thus, we sought to predict infection rates by applying the model with the estimated parameters from previous seasons, corrected using the first few weeks of data from the current season. This allow us to predict the infection rates of the entire season and the temporal location of the infection peak in each state. The latter is simply the extremal point of the model Eq. (2).
Here, we used the estimates of and from the previews seasons, together with increasing amounts of data from the 2018 season to predict the seasonal dynamics for the entire 2018 season. Specifically, we used the parameters from the previews seasons as initial parameters for the model and adjusted them using the first weeks of data from the 2018 season, by running a few iterations of the optimization algorithm.
Figures 3a and 3c show the correlation between the actual and predicted infection rates given increasing amounts of data for the current season (increasing ), for RSV and WNV, respectively. Figures 3b and 3d show the error in prediction of peak infection for RSV and WNV, respectively. Figure 3 shows that within 7 weeks from the beginning of the season it is possible to predict the progression of the entire season with reasonably high accuracy, both for RSV and WNV.
5 Discussion
Multicompartment epidemic models provide a useful tool in the understanding of disease spread. However, limits on availability of epidemiological data has made the testing and validation of such models complicated. Here, we demonstrated the utility of matching proxy data from search queries to theoretical models. This matching allows us to confirm the spatial and temporal dynamics of disease spread for two sample viruses, RSV and WNV. Additionally, this matching provides information on the parameters of disease spread. It reveals and emphasizes the importance of accounting for the spatial structural complexity of the spreading mechanism. These parameters were shown to be correlated with the demographics of people in each state, and the mobility of people across states. In addition, we show that learning the parameters of the disease spread from proxy data over multiple seasons generalizes relatively well to the next season.
Our work has drawbacks, especially, in the use of proxy data, rather than epidemiological data. Still, although less accurate, the proxy data are derived at higher spatial resolution and from a much larger, and arguably a more representative, population. The correlation between the parameters we find and the demographics of people provide an indication for the validity of this data. Therefore, this approach can be used to estimate factors influencing disease intervention and control especially in cases where epidemiological data is lacking or insufficient providing more information about the spreading mechanism.
The modeling approach we proposed makes minimal assumptions on the structure of the population, interventions, or behavior. Our approach can be used to track and model other viruses and subpopulations. Future work will focus on modeling subpopulations (for example, different age groups) within a geographic area, as well as the effects of interventions such as vaccination.
References
 [1] Keeling, M. J. & Rohani, P. Modeling infectious diseases in humans and animals (Princeton University Press, 2011).
 [2] Kermack, W. & Mckendrick, A. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A. 115 (1927).
 [3] Hamer, W. H. The Milroy lectures on epidemic disease in England: the evidence of variability and of persistency of type (Bedford Press, 1906).
 [4] Sattenspiel, L. The Geographic Spread of Infectious Diseases: Models and Applications: Models and Applications (Princeton University Press, 2009).
 [5] Zhang, C., Zhou, S., Miller, J. C., Cox, I. J. & Chain, B. M. Optimizing hybrid spreading in metapopulations. Scientific reports 5, 9924 (2015).
 [6] Levy, N., Iv, M. & YomTov, E. Modeling Influenzalike illnesses through composite compartmental models. Physica A: Statistical Mechanics and its Applications 494, 288–293 (2018).
 [7] Andreasen, V., Lin, J. & Levin, S. A. The dynamics of cocirculating influenza strains conferring partial crossimmunity. Journal of mathematical biology 35, 825–842 (1997).
 [8] Gog, J. R. & Grenfell, B. T. Dynamics and selection of manystrain pathogens. Proceedings of the National Academy of Sciences 99, 17209–17214 (2002).
 [9] Gupta, S., Ferguson, N. & Anderson, R. Chaos, persistence, and evolution of strain structure in antigenically diverse infectious agents. Science 280, 912–915 (1998).
 [10] Crépey, P., de Boer, P. T., Postma, M. J. & Pitman, R. Retrospective public health impact of a quadrivalent influenza vaccine in the United States. Influenza and other respiratory viruses 9, 39–46 (2015).
 [11] CastilloChavez, C., Hethcote, H. W., Andreasen, V., Levin, S. A. & Liu, W. M. Epidemiological models with age structure, proportionate mixing, and crossimmunity. Journal of mathematical biology 27, 233–258 (1989).
 [12] Dawes, J. & Gog, J. The onset of oscillatory dynamics in models of multiple disease strains. Journal of mathematical biology 45, 471–510 (2002).
 [13] Gog, J. & Swinton, J. A statusbased approach to multiple strain dynamics. Journal of mathematical biology 44, 169–184 (2002).
 [14] Grenfell, B., Bjørnstad, O. & Kappey, J. Travelling waves and spatial hierarchies in measles epidemics. Nature 414, 716 (2001).
 [15] Viboud, C. et al. Synchrony, waves, and spatial hierarchies in the spread of Influenza. science 312, 447–451 (2006).
 [16] Balcan, D. et al. Multiscale mobility networks and the spatial spreading of infectious diseases. Proceedings of the National Academy of Sciences pnas–0906910106 (2009).
 [17] Ferguson, N. M. et al. Strategies for containing an emerging influenza pandemic in southeast asia. Nature 437, 209 (2005).
 [18] Wang, Z. et al. Statistical physics of vaccination. Physics Reports 664, 1–113 (2016).
 [19] Keeling, M. J. & Eames, K. T. Networks and epidemic models. Journal of the Royal Society Interface 2, 295–307 (2005).
 [20] Tizzoni, M. et al. On the use of human mobility proxies for modeling epidemics. PLoS computational biology 10, e1003716 (2014).
 [21] Oren, E., Frere, J., YomTov, E. & YomTov, E. Respiratory syncytial virus tracking using internet search engine data. BMC public health 18, 445 (2018).
 [22] YomTov, E., JohanssonCox, I., Lampos, V. & Hayward, A. C. Estimating the secondary attack rate and serial interval of influenzalike illnesses using social media. Influenza and other respiratory viruses 9, 191–199 (2015).
 [23] Wagner, M., Lampos, V., YomTov, E., Pebody, R. & Cox, I. J. Estimating the population impact of a new pediatric Influenza vaccination program in England using social media content. Journal of medical Internet research 19, e416 (2017).
 [24] Chattopadhyay, I., Kiciman, E., Elliott, J. W., Shaman, J. L. & Rzhetsky, A. Conjunction of factors triggering waves of seasonal Influenza. eLife 7 (2018).
 [25] Riley, S. Largescale spatialtransmission models of infectious disease. Science 316, 1298–1301 (2007).
 [26] Google Trends. 2017, available from: https://trends.google.com/trends/.
 [27] United States Census Bureau, available from https://www.census.gov.
 [28] Matlab statistics and machine learning toolbox (2017b). The MathWorks, Natick, MA, USA.
 [29] Jones, K. E. et al. Global trends in emerging infectious diseases. Nature 451, 990 (2008).
 [30] Centers for Disease Control and Prevention, available from: https://www.cdc.gov/rsv/about/index.html.
 [31] LaDeau, S. L., Leisnham, P. T., Biehler, D. & Bodner, D. Higher mosquito production in lowincome neighborhoods of baltimore and Washington, DC: understanding ecological drivers and mosquitoborne disease risk in temperate cities. International journal of environmental research and public health 10, 1505–1526 (2013).