Deep reinforcement learning for large-scale epidemic control
Abstract
Epidemics of infectious diseases are an important threat to public health and global economies. Yet, the development of prevention strategies remains a challenging process, as epidemics are nonlinear and complex processes. For this reason, we investigate a deep reinforcement learning approach to automatically learn prevention strategies in the context of pandemic influenza. Firstly, we construct a new epidemiological metapopulation model, with patches (one for each administrative district in Great Britain), that adequately captures the infection process of pandemic influenza. Our model balances complexity and computational efficiency such that the use of reinforcement learning techniques becomes attainable. Secondly, we set up a ground truth such that we can evaluate the performance of the "Proximal Policy Optimization" algorithm to learn in a single district of this epidemiological model. Finally, we consider a large-scale problem, by conducting an experiment where we aim to learn a joint policy to control the districts in a community of 11 tightly coupled districts, for which no ground truth can be established. This experiment shows that deep reinforcement learning can be used to learn mitigation policies in complex epidemiological models with a large state space. Moreover, through this experiment, we demonstrate that there can be an advantage to consider collaboration between districts when designing prevention strategies.
1 Introduction
Epidemics of infectious diseases are an important threat to public health and global economies. The most efficient way to combat epidemics is through prevention. To develop prevention strategies and to implement them as efficiently as possible, a good understanding of the complex dynamics that underlie these epidemics is essential. To properly understand these dynamics, and to study emergency scenarios, epidemiological models are necessary. Such models enable us to make predictions and to study the effect of prevention strategies in simulation. The development of prevention strategies, which need to fulfil distinct criteria (i.a., prevalence, mortality, morbidity, cost), remains a challenging process. For this reason, we investigate a deep reinforcement learning (RL) approach to automatically learn prevention strategies in an epidemiological model. The use of model-free deep reinforcement learning is particularly interesting, as it allows us to set up a learning environment in a complex epidemiological setting (i.e., large state space and nonlinear dependencies) while imposing few assumptions on the policies to be learned. In this work, we conduct our experiments in the context of pandemic influenza, where we aim to learn optimal school closure policies to mitigate the epidemic.
Pandemic preparedness is important, as influenza pandemics have made many victims in the (recent) past [47] and the ongoing COVID-19 epidemic is yet another reminder of this fact [65]. Contrary to seasonal influenza epidemics, an influenza pandemic is caused by a newly emerging virus strain that can become pandemic by spreading rapidly among naive human hosts (i.e., human hosts with no prior immunity) worldwide [47]. This means that at the start of the pandemic no vaccine will be available and it will take several months before vaccine production can commence [59]. For this reason, learning optimal strategies of non-therapeutic intervention measures, such as school closure policies, is of great importance to mitigate pandemics [39].
To meet this objective, we consider a reinforcement learning approach. However, as state-of-the-art reinforcement learning techniques require many interactions with the environment in order to converge, our first contribution is a realistic epidemiological model that nonetheless has favourable computational performance.
Specifically, we construct a metapopulation model that consists of a set of interconnected patches, where each patch corresponds to an administrative region in Great Britain and is internally represented by an age-structured stochastic compartmental model. To conduct our experiments, we establish a Markov decision process with a state space that directly corresponds to our epidemiological model, an action space that allows us to open and close schools on a weekly basis, a transition function that follows the epidemiological model's dynamics, and a reward function that is targeted to the objective of reducing the attack rate (i.e., the proportion of the population that was infected). In this work, we will use "Proximal Policy Optimization" (PPO) [51] to learn the school closure policies.
First, we set up an experiment in an epidemiological model that covers a single administrative district. This setting enables us to specify a ground truth that allows us to empirically assess the performance of the policies learned by PPO. In this analysis, we consider different values for the basic reproductive number (Definition 1) and the population composition (i.e., proportion of adults, children, elderly, adolescents) of the district. Both parameters induce a significant change of the epidemic model’s dynamics.
Definition 1 (Basic reproductive number).
The basic reproductive number, $R_0$, is the number of infections that is, on average, generated by a single infected individual placed in an otherwise fully susceptible population.
Through these experiments, we demonstrate the potential of deep reinforcement learning algorithms to learn policies in the context of complex epidemiological models, opening the prospect to learn in even more complex stochastic models with large action spaces. In this regard, we consider a large-scale setting where we examine whether there is an advantage to consider the collaboration between districts when designing school closure policies. We conduct an experiment in our epidemiological model with 379 districts and attempt to learn a joint policy to control the districts in the Cornwall-Devon community, a set of tightly coupled districts. To this end, we assign an agent to each of the 11 districts of the Cornwall-Devon community and use a reinforcement learning approach to learn a joint policy. We compare this joint policy to a non-collaborative policy (i.e., aggregated independent learners).
2 Related work
The closing of schools is an effective way to limit the spread of an influenza pandemic [39]. For this reason, the use of school closures as a mitigation strategy has been explored in a variety of modelling studies [26, 11, 43, 14, 12, 22, 27, 25, 17], of which the work presented in [22] is the most recent and comprehensive study.
The concept of learning dynamic policies by formulating the decision problem as a Markov decision process (MDP) was first introduced in [60]. The proposed technique was used to investigate dynamic tuberculosis case-finding policies in HIV/tuberculosis co-epidemics [61]. Later, the technique was extended towards a method to include cost-effectiveness in the analysis [62], and applied to investigate mitigation policies (i.e., school closures and vaccines) in the context of pandemic influenza in a simplified epidemiological model. The work presented in [60, 62] uses a policy iteration algorithm to solve the MDP. To scale this approach to larger problem settings, we explore the use of online reinforcement learning techniques (e.g., TD-learning, policy gradient). Note that the "Deep Q-networks" algorithm was recently used to investigate culling and vaccination in farms in a simple individual-based model to delay the spread of viruses in a cattle population [48]. However, to the best of our knowledge, the work presented in this manuscript is the first attempt to use deep reinforcement learning algorithms directly on a complex metapopulation model. Furthermore, we experimentally validate the performance of these algorithms using a ground truth, in a variety of model settings (i.e., different census compositions and different $R_0$'s). This is the first validation of this kind and it demonstrates the potential of online deep reinforcement learning techniques in the context of epidemic decision making. Finally, we present a novel approach to investigate how intervention policies can be improved by enabling collaboration between different geographic districts, by formulating the setting as a multi-agent problem, and by solving it using deep multi-agent reinforcement learning algorithms.
3 Epidemiological model
We construct a metapopulation model that consists of patches, where each patch represents one administrative region in Great Britain. Great Britain consists of three countries with the following administrative regions: 325 districts in England, 22 unitary authorities in Wales and 32 council areas in Scotland. Each patch consists of a stochastic age-structured compartmental model, which we describe in subsection 3.1, and the different patches are connected via a mobility model, as detailed in subsection 3.2. In subsection 3.3 we discuss how we validate and calibrate the model. We analyse the model's computational complexity and discuss the model's performance in the Supplementary Information.
3.1 Intra-patch model
We consider a stochastic compartmental SEIR model from which we sample trajectories. We first describe the model in terms of ordinary differential equations (i.e., a deterministic representation) that we then transform to stochastic differential equations [5] to make a stochastic evaluation possible. An SEIR model divides the population into a susceptible, exposed, infected and recovered compartment, and is commonly used to model influenza epidemics [17]. More specifically, we consider an age-structured SEIR model (see Figure 1 for a visualization) with a set of disjoint age groups [17, 21]. This model is formally described by this system of ordinary differential equations (ODEs), defined for each age group $i$:
$$\frac{dS_i}{dt} = -\lambda_i(t)\,S_i(t), \quad \frac{dE_i}{dt} = \lambda_i(t)\,S_i(t) - \sigma E_i(t), \quad \frac{dI_i}{dt} = \sigma E_i(t) - \gamma I_i(t), \quad \frac{dR_i}{dt} = \gamma I_i(t) \qquad (1)$$
Every susceptible individual in age group $i$ is subject to an age-specific and time-dependent force of infection $\lambda_i(t)$:
$$\lambda_i(t) = q \sum_{j} C_{ij}(t)\,\frac{I_j(t)}{N_j} \qquad (2)$$
which depends on:
- the probability of transmission $q$ when a contact occurs;
- the time-dependent contact matrix $C(t)$, where $C_{ij}(t)$ is the average frequency of contacts that an individual in age group $i$ has with an individual in age group $j$;
- the frequency at which contacts with infected individuals (in age group $j$) occur: $I_j(t)/N_j$.
Once exposed, individuals move to the infected state according to the latency rate $\sigma$. Individuals recover from infection (i.e., get better or die) at a recovery rate $\gamma$.
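To make the dynamics concrete, the force of infection of Eq. (2) and one deterministic Euler step of the system can be sketched as follows (a minimal NumPy sketch, not the paper's implementation; here `q` is the transmission probability per contact, `C` the contact matrix, `sigma` the latency rate and `gamma` the recovery rate, all with illustrative values):

```python
import numpy as np

def force_of_infection(q, C, I, N):
    """Eq. (2): lambda_i = q * sum_j C[i, j] * I[j] / N[j]."""
    return q * C @ (I / N)

def seir_euler_step(S, E, I, R, q, C, sigma, gamma, dt):
    """One explicit Euler step of the age-structured SEIR ODEs, Eq. (1).
    Each array holds one value per age group."""
    N = S + E + I + R                     # constant per group, Eq. (3)
    lam = force_of_infection(q, C, I, N)
    dS = -lam * S
    dE = lam * S - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I
    return S + dt * dS, E + dt * dE, I + dt * dI, R + dt * dR
```

Because every outflow of one compartment is an inflow of the next, the step conserves the total population of each age group, in line with Eq. (3).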
We omit vital dynamics (i.e., births and deaths that are not caused by the epidemic) in this SEIR model, as the epidemic’s time scale is short and we therefore assume that births and deaths will have a limited influence on the epidemic process [54]. Therefore, at any time:
$$S_i(t) + E_i(t) + I_i(t) + R_i(t) = N_i \qquad (3)$$
where the total population size $N_i$ corresponds to age-specific census data. Our model considers 4 age groups: children (0-4 years), adolescents (5-18 years), adults (19-64 years) and elderly (65 years and older).
Note that the contact frequency is time-dependent, in order to model school closures: a different contact matrix is used for school term and for school holiday. Following [17], we consider conversational contacts, i.e., contacts for which physical touch is not required. As we aim to model the effectiveness of school closure interventions, we use the United Kingdom contact matrices presented in [17], which encode a contact matrix for both school term and school holiday. These contact matrices are the result of an internet-based social contact survey completed by a cohort of participants [17], and they encode the same age groups as mentioned before: children, adolescents, adults and elderly.
We defined the SEIR model in terms of a system of ODEs, which implies a deterministic evaluation of the system. However, for predictions, stochastic models are preferred, as they account for stochastic variation and allow us to quantify uncertainty [29]. In order to sample trajectories from this set of differential equations, we transform the system of ODEs into a system of stochastic differential equations (SDEs), using the transformation procedure presented in [5]. This procedure introduces stochastic noise into the system by adding a Wiener process to each transition in the ODE. We evaluate the SDE at discrete time steps using the Euler-Maruyama approximation method [5].
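The Euler-Maruyama update described above can be sketched as follows (our illustration of the general recipe in [5], not the paper's exact implementation): each transition flow of the ODE receives an independent Gaussian increment whose variance is proportional to the flow, which models demographic noise.

```python
import numpy as np

def seir_em_step(S, E, I, R, q, C, sigma, gamma, dt, rng):
    """One Euler-Maruyama step of the SEIR SDE: each transition flow
    (S->E, E->I, I->R) gets an independent Wiener increment with
    variance equal to the flow itself."""
    N = S + E + I + R
    lam = q * C @ (I / N)                    # force of infection, Eq. (2)
    rates = [lam * S, sigma * E, gamma * I]  # deterministic flows
    flows = [r * dt
             + np.sqrt(np.maximum(r, 0.0) * dt) * rng.standard_normal(S.shape)
             for r in rates]
    se, ei, ir = flows
    return S - se, E + se - ei, I + ei - ir, R + ir
```

A full implementation would additionally clip compartments at zero, since the Gaussian approximation can momentarily produce negative values; the step itself conserves the total population of each age group.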
Each compartmental model is representative of one of the 379 administrative districts, and as such it is parametrised with the census data of the respective district, i.e., population counts stratified by age group. We use the 2011 United Kingdom census data made available by NOMIS.
3.2 Inter-patch model
Our model, which comprises a set of connected SEIR patches, is inspired by the recent BBC pandemic model [32]. The BBC pandemic model was in turn motivated by the model presented in [23].
At each time step, our model checks whether a patch becomes infected. This is modulated by the patch's force of infection, which combines the infectious potential of the infected patches in the system, weighted by a mobility model that represents the commuting of adults between the different patches:
$$\lambda_p(t) = q\,S_p(t)^{\psi} \sum_{p' \in P} M_{p,p'}\,\iota_{p'}(t) \qquad (4)$$
where $P$ is the set of patches in the model, $M_{p,p'}$ is the mobility flux between patch $p$ and patch $p'$, $q$ is the probability of transmission on a contact, $S_p(t)$ is the susceptible population of adults in patch $p$ at time $t$ whose contribution is modulated by the parameter $\psi$ (range $[0,1]$), and $\iota_{p'}(t)$ is the infectious potential of patch $p'$ at time $t$. We define this infectious potential as,
$$\iota_p(t) = C_{aa}\,\frac{I_p(t)}{N_p} \qquad (5)$$
where $I_p(t)$ is the size at time $t$ of the infectious adult population in patch $p$, and $C_{aa}$ is the average number of contacts between adults, as specified in the contact matrix (see subsection 3.1).
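Assembled in code, the inter-patch force of infection looks roughly as follows. This is a sketch: the exact functional form of Eqs. (4)-(5) is our reconstruction from the textual description (mobility-weighted infectious potentials, susceptible adults damped by an exponent `psi`), so treat the formulas as illustrative.

```python
import numpy as np

def infectious_potential(C_aa, I_adults, N_adults):
    """Eq. (5), reconstructed: the infectious adult fraction of each
    patch, weighted by the average adult-adult contact frequency C_aa."""
    return C_aa * I_adults / N_adults

def patch_force_of_infection(q, psi, M, S_adults, iota):
    """Eq. (4), reconstructed: susceptible adults of the receiving patch
    (damped by psi) times the mobility-weighted infectious potential of
    all patches. M[p, p2] is the mobility flux from patch p2 into p."""
    return q * (S_adults ** psi) * (M @ iota)
```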
$M$ is a matrix based on the mobility dataset provided by NOMIS.
In general, this inter-patch model is constructed from first principles, i.e., census data, a mobility model, the number of infected individuals and the transmission potential of the virus. However, for the parameter $\psi$, which modulates the contribution of the susceptibles in the receptive patch, no such intuition is readily available, even though this parameter is commonly used in the literature [23, 18, 31]. Therefore, it is typically fitted to match the properties of the epidemic under investigation [23, 18, 31]. We will calibrate this parameter such that it can be used for a range of $R_0$ values, as detailed in the next subsection.
Given this time-dependent force of infection, we model the event that a patch becomes infected with a non-homogeneous Poisson process [58]. As the process' intensity depends on how the model (i.e., the set of all patches) evolves, we cannot sample the time at which a patch becomes infected a priori. Therefore, we determine this time of infection using the time scale transformation algorithm [13]. Details about this procedure can be found in the Supplementary Information. Following [32], we assume that a patch will become infected only once.
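The time scale transformation boils down to: draw an Exp(1) threshold and advance the cumulative intensity until it crosses that threshold. A minimal sketch (discretising the integral at the model's time steps; function and variable names are ours):

```python
import math
import random

def sample_infection_time(intensity_at, dt, t_max, rng=random):
    """Time-scale transformation for a non-homogeneous Poisson process:
    draw tau ~ Exp(1), then advance the cumulative intensity
    Lambda(t) = int_0^t lambda(u) du (left Riemann sum at the model's
    discrete steps) until Lambda >= tau. Returns the infection time, or
    None if the patch escapes infection before t_max."""
    tau = -math.log(1.0 - rng.random())   # Exp(1) threshold
    cum, t = 0.0, 0.0
    while t < t_max:
        cum += intensity_at(t) * dt
        t += dt
        if cum >= tau:
            return t
    return None
```

With a constant intensity this reduces to sampling an exponential waiting time, which gives a quick correctness check.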
By using this time scale transformation algorithm and evaluating the stochastic differential equation at discrete time steps, we produced a model with favourable performance, i.e., in our experiments we can run many simulation runs per second on a MacBook Pro (CPU: 2.3 GHz Intel Core i5). We analyse the model's computational complexity in the Supplementary Information.
3.3 Calibration and validation
Our objective is to construct a model that is representative of contemporary Great Britain with respect to population census and mobility trends. This model will be used to study school closure intervention strategies for future influenza pandemics. While in many studies [31, 23, 18] a model is created specifically to fit one epidemic case, we aim for a model that is robust with respect to different epidemic parameters, most importantly $R_0$, the basic reproduction number.
To validate our model according to these goals, we conduct two experiments. In the first experiment, we compare our patch model to an SEIR compartmental model that uses the same contact matrix and age structure, but with homogeneous spatial mixing (i.e., no spatial structure). While we do not expect our model to behave exactly like the compartmental model, as the patches and the mobility network that connects them induce different dynamics, we do observe similar trends with respect to the epidemic curve and peak day. This experiment also enables us to calibrate the $\psi$ parameter. We present a detailed description of this analysis and report the results in the Supplementary Information. In the second experiment, we show that our model is able to reproduce the trends that were observed during the 2009 influenza pandemic, commonly known as the swine-origin influenza pandemic, i.e., A(H1N1)v2009, that originated in Mexico. The 2009 influenza pandemic in Great Britain is an interesting case to validate our model for two reasons. Firstly, the pandemic occurred quite recently, so our model's census and mobility scheme are a good fit, as both datasets on which we base our census and mobility model were released in 2011. Secondly, due to the time when the virus entered Great Britain, the summer holiday started 11 weeks after the emergence of the epidemic. The timing of the holidays had a severe impact on the progress of the epidemic and resulted in an epidemic curve with two peaks. This characteristic epidemic curve enables us to demonstrate the predictive power of our age-structured contact model with support for school closures. In Figure S13, we show a set of model realisations in conjunction with the symptomatic case data, which shows that we were able to closely match the epidemic trends observed during the British pandemic in 2009 (details on this case study in the Supplementary Information).
Note that our model reports the number of infections, while the British Health Protection Agency only recorded symptomatic cases. Therefore, we scale the epidemic curve by a constant factor. The large number of asymptomatic cases produced by our model is in line with earlier serological surveys [41] and with previous modelling studies [33].
4 Learning environment
In order to apply reinforcement learning, we construct an MDP based on the epidemiological model that we introduced in the previous section. This epidemiological model consists of patches that correspond to administrative regions.
We have an agent for each patch that we attempt to control, and for each agent we have an action space that allows us to open and close schools for one week. Each agent has a predefined budget of school closure actions it can execute. Once this budget is depleted, executing a close action will default to executing an open action. We refer to the remaining budget at time $t$ as $b_t$. In the epidemiological model, when schools are closed we use a contact matrix that is representative of school holidays and when schools are open we use a contact matrix that is representative of school term (details in Section 3.1).
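The budget rule can be made precise with a small helper (a sketch; names are ours): a close action with a depleted budget silently degrades to an open action, and a successful close consumes one week of budget.

```python
OPEN, CLOSE = 0, 1

def effective_action(requested, budget):
    """Apply the school-closure budget rule: CLOSE with a depleted
    budget defaults to OPEN; a successful CLOSE consumes one week.
    Returns (executed action, remaining budget)."""
    if requested == CLOSE and budget > 0:
        return CLOSE, budget - 1
    return OPEN, budget
```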
For each patch, we consider a state space that combines the state of the SEIR model and the remaining budget of school closures $b_t$. For the SEIR model, we have 16 state variables (i.e., $4 \times 4$), as we have an SEIR model (4 state variables) for each of the four age groups. The remaining school closure budget is encoded as an integer, resulting in a combined state space of 17 variables. We refer to the state space of one patch $p$, which thus combines the epidemiological states and the budget, as $S_p$. The state space of the MDP corresponds to the aggregation of the state spaces of each patch that we attempt to control:
$$S = \prod_{p \in P_c} S_p \qquad (6)$$
where $P_c$ is the set of patches that we control.
The transition probability function stochastically determines the state of the epidemic in the next week, taking into account the school closure actions that were chosen, using the epidemiological dynamics as defined in the previous section.
To reduce the attack rate, we consider an immediate reward function that quantifies the negative loss in susceptibles over one simulated week:
$$R(s_t, a_t, s_{t+1}) = -\big(\mathcal{S}(s_t) - \mathcal{S}(s_{t+1})\big) \qquad (7)$$
where $\mathcal{S}$ is the function that determines the total number of susceptible individuals given the state of the epidemiological model.
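In code, the reward and the attack-rate metric it targets reduce to one-liners (a sketch; function names are ours, and $\mathcal{S}$ is represented by precomputed susceptible totals):

```python
def reward(susceptibles_before, susceptibles_after):
    """Eq. (7): minus the loss in susceptibles over one simulated week.
    The reward is <= 0; maximising it minimises new infections."""
    return -(susceptibles_before - susceptibles_after)

def attack_rate(susceptibles_start, susceptibles_end, population):
    """Proportion of the population infected over the whole epidemic."""
    return (susceptibles_start - susceptibles_end) / population
```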
For PPO, we have both a policy and a value network. The policy network accepts the state of the epidemiological model as input (as described above) and the output of the network contains 1 unit, which is passed through a sigmoid activation function. This output thus represents the probability of keeping the schools open in the district. Every hidden layer in the PPO network uses the hyperbolic tangent activation function. The value network has the same architecture as the policy network, with the exception that the output is not passed through an activation function. We will refer to this setting throughout this work as the single-district PPO agent.
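The described architecture amounts to a plain forward pass (a NumPy sketch; the layer sizes and initialisation are illustrative, not the tuned values): tanh hidden layers and a single sigmoid output unit over the 17-dimensional patch state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PolicyNet:
    """Minimal forward pass matching the described policy network:
    tanh hidden layers, one sigmoid output unit giving the probability
    of keeping schools open. The value network would share this shape
    but skip the final activation."""
    def __init__(self, sizes, rng):
        # sizes, e.g. [17, 32, 32, 1]: input, hidden..., output
        self.params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
                       for m, n in zip(sizes[:-1], sizes[1:])]

    def prob_open(self, state):
        h = np.asarray(state, dtype=float)
        for W, b in self.params[:-1]:
            h = np.tanh(h @ W + b)
        W, b = self.params[-1]
        return float(sigmoid((h @ W + b)[0]))  # scalar probability in (0, 1)
```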
PPO's hyperparameters are tuned (hyperparameter values in the Supplementary Information) on a single-district learning environment (i.e., the Greenwich district). To this end, we performed a hyperparameter sweep using Latin hypercube sampling [52].
We conduct two kinds of experiments: in the context of a single district and in the context of the Great Britain model that combines all 379 districts. We consider two values for the basic reproductive number $R_0$, to investigate the effect of distinct reproductive numbers: one value represents an epidemic with moderate transmission potential [19] and the other an epidemic with high transmission potential [38]. We investigate the effect of different school closure budgets $b$, expressed in weeks. The epidemic is simulated for a fixed number of weeks, chosen beforehand, to ensure that there is enough time for the epidemic to fade out after its peak. Following [6], we use a latent period of one day (i.e., $\sigma = 1$) and an infectious period of 1.8 days (i.e., $\gamma = 1/1.8$). Given the contact matrix $C$, the latency rate $\sigma$ and the recovery rate $\gamma$, we can compute the transmission probability $q$ for a given $R_0$, as specified in the Supplementary Information.
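One standard way to perform this calibration is via the next-generation matrix: scale $q$ so that the spectral radius of the next-generation matrix equals the target $R_0$. The sketch below is our rendering of that textbook construction for the force of infection of Eq. (2), not necessarily the paper's exact procedure.

```python
import numpy as np

def calibrate_q(R0, gamma, C, N):
    """For lambda_i = q * sum_j C[i, j] * I[j] / N[j], the
    next-generation matrix is K[i, j] = (q / gamma) * C[i, j] * N[i] / N[j].
    Requiring spectral radius rho(K) = R0 fixes q."""
    K = C * (N[:, None] / N[None, :])
    rho = max(abs(np.linalg.eigvals(K)))
    return R0 * gamma / rho
```

A quick sanity check: plugging the returned $q$ back into the next-generation matrix must reproduce the requested spectral radius.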
5 PPO versus ground truth
We now establish a ground truth for different population compositions, i.e., the proportion of the different age groups in a population. We will use this ground truth to empirically validate that PPO converges to the appropriate policy.
To establish this ground truth, we perform an exhaustive search over the space of possible school closure policies (details in the Supplementary Information).
In this analysis, we consider different values for the basic reproductive number and the population composition of the district, both parameters that induce a significant change of the epidemic model's dynamics. To this end, we select 10 districts that are representative of the population heterogeneity in Great Britain: one district that is representative for the average of this census distribution and a set of nine districts that is representative for the diversity in this census distribution. Details on this selection procedure can be found in the Supplementary Information.
To evaluate PPO with respect to the ground truth, we repeat the experiment for which we established a ground truth (i.e., both $R_0$ values, the 10 selected districts and the considered school closure budgets) and learn a policy using PPO, in the stochastic epidemic model. For each experimental setting (i.e., the combination of a district, an $R_0$ value, and a school closure budget $b$), we run PPO 5 times (5 trials), to assess the variance of the learning performance. Each PPO trial is run for a fixed number of episodes, where each episode simulates the epidemic over a fixed number of weeks. We show the learning curves, i.e., the total reward at the end of each episode, for the district that is representative of the average of the census distribution (i.e., the Barnsley district in England) in Figure 3; for the other settings, we report similar learning curves in the Supplementary Information.
To compare each of the learned policies to its ground truth (one for each district), we take the learned policy and apply it 1000 times in the stochastic model, which results in a distribution over model outcomes (i.e., attack rate improvement: the difference between the attack rate produced by the model and the baseline when no schools are closed). We then compare this distribution to the attack rate improvement that was recorded for the ground truth. We show these results for the setting with a school closure budget of 6 weeks in Figure 3, and for the other settings in the Supplementary Information. These results show that PPO learns a policy that matches the ground truth for all districts and all combinations of $R_0$ and $b$.
Note that for these experiments, we use the same hyperparameters for PPO that were introduced in Section 4. This demonstrates that these hyperparameters work well for different values of $R_0$ and for different census compositions (which induce a significant change in the dynamics of the epidemic model), indicating that they are adequate for different variations of the model.
In this section, we compared the ground truth (found through an exhaustive policy search) to a policy learned by PPO, a deep reinforcement learning algorithm. This allows us to empirically validate that PPO converges to the optimal policy. This experimental validation is important, as it demonstrates the potential of deep reinforcement learning algorithms to learn policies in the context of complex epidemiological models. It indicates that it is possible to learn in even more complex stochastic models with large action spaces, for which it is impossible to compute the ground truth. In Section 6, we investigate such a setting, where we aim to learn a joint policy for a set of districts, using deep multi-agent reinforcement learning.
6 Multidistrict reinforcement learning
To investigate the collaborative nature of school closure policies, we apply deep multi-agent reinforcement learning algorithms. In our model, we have 379 agents, one for each district, as agents represent the district for which they can control school closure. As the current state-of-the-art of deep multi-agent reinforcement learning algorithms is limited to settings with a modest number of agents [28], we need to partition our model into smaller groups of agents, such that deep multi-agent reinforcement learning algorithms become feasible. To this end, we analyse the mobility matrix to detect clusters of districts that represent closely connected communities (details in the Supplementary Information). Through this analysis, we identify a community of 11 districts, to which we will refer as the Cornwall-Devon community, as it comprises the Cornwall and Devon regions.
We now examine whether there is an advantage to consider the collaboration between districts when designing school closure policies. We conduct an experiment in our epidemiological model with 379 districts, and attempt to learn a joint policy to control the districts in the Cornwall-Devon community. To this end, we assign an agent to each of the 11 districts of the Cornwall-Devon community, and use a reinforcement learning approach to learn a joint policy. We compare this joint policy to a non-collaborative policy (i.e., aggregated independent learners).
We refer to the state space of one patch as $S_p$, as detailed in Section 4. The state space of the MDP corresponds to the aggregation of the state spaces of the set of patches $P_c$ that we attempt to control. In this experiment, $P_c$ corresponds to the 11 districts in the Cornwall-Devon community.
In order to learn a joint policy, we need to consider an action space that combines the actions for each district that we attempt to control. This results in a joint action space whose size is exponential in the number of agents. To approach this problem, we use a PPO super-agent that controls multiple districts simultaneously, to learn a joint policy. To this end, we use a custom policy network that takes as input the combined model state of each district (Equation 6); as a result, the input layer has $17 \times 11 = 187$ input units. In contrast to the single-district PPO that was introduced in Section 4, the output layer of the policy network of this agent has a unit for each district that we attempt to control. Again, each output unit is passed through a sigmoid activation function, and hence corresponds to the probability of closing the schools in the associated district. Similar to the single-district PPO, each hidden layer uses the hyperbolic tangent activation function. The value network has the same architecture for the input layers and hidden layers, but only has a single output unit that represents the value for the given state. We will refer to this agent as multi-district PPO.
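The factorised output head is what keeps the joint action space tractable: instead of a distribution over all $2^{11}$ joint actions, each district's action is an independent Bernoulli draw from its own sigmoid output, and the joint log-probability (which PPO's likelihood ratio needs) is the sum of the per-district terms. A sketch (names are ours):

```python
import numpy as np

def sample_joint_action(probs_close, rng):
    """Sample one action per district from the per-district sigmoid
    outputs and return the joint log-probability under the factorised
    policy. Assumes 0 < probs_close < 1 elementwise."""
    probs_close = np.asarray(probs_close, dtype=float)
    actions = (rng.random(probs_close.shape) < probs_close).astype(int)
    log_prob = float(np.sum(np.where(actions == 1,
                                     np.log(probs_close),
                                     np.log1p(-probs_close))))
    return actions, log_prob
```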
We conduct experiments for both a moderate and a high transmission potential (cf. Section 4), and we consider a school closure budget of 6 weeks, i.e., $b = 6$. We run multi-district PPO 5 times, to assess the variance of the learning signal, each time for a fixed number of episodes, and we show the learning curves in Figure 4. These learning curves demonstrate a stable and steady learning process. For one of the two transmission settings the reward curve is still increasing at the end of training, while for the other the reward curve indicates that the learning process has converged.
To investigate whether these joint policies provide a collaborative advantage, we compare them to the aggregation of single-district policies, to which we will refer as the aggregated policy. To construct this aggregated policy, we learn a distinct school closure policy for each of the 11 districts in the Cornwall-Devon community, using PPO, following the same procedure as in Section 5. To evaluate this aggregated policy, we execute the distinct policies simultaneously. For the districts that are not controlled (both for the joint and the aggregated policy), we keep the schools open at all time steps. For both transmission settings, we simulate the joint and the aggregated policy 1000 times, and we show the attack rate improvement distributions in Figure 5. These results corroborate that there is a collaborative advantage when devising school closure policies, for both transmission settings; however, the improvement is most significant in the moderate transmission setting. We conjecture that this difference is due to the fact that there is less flexibility when the transmission potential of the epidemic is higher, since there is less time to act. Although we observe an improvement when a joint policy is learned, it remains challenging to interpret deep multi-agent policies, and we discuss possible directions for future work with respect to multi-agent reinforcement learning in Section 7.
In this analysis, where we have a limited number of actions per agent, the use of multi-district PPO proved to be successful. However, the use of more advanced multi-agent reinforcement learning methods is warranted when a more complex action space is considered or a larger number of districts needs to be controlled. For this reason, we also investigated the recently introduced QMIX [50] algorithm, but the resulting learning curve proved to be quite unstable (shown in the Supplementary Information).
We conducted our experiments in the setting of school closures, and our findings are of direct relevance to the mitigation of pandemic influenza. Furthermore, our novel approach of investigating the collaborative nature of prevention strategies as a multi-agent reinforcement learning problem can be applied to other epidemiological settings, such as the ongoing COVID-19 pandemic.
7 Discussion
We demonstrate the potential of deep reinforcement learning in the context of epidemiological decision making by conducting experiments that show that PPO converges to the optimal policy. Next, we investigate and show that there is a collaborative advantage when devising school closure policies, by formulating this hypothesis as a multi-agent problem.
The work conducted in this manuscript indicates that there is the potential to use reinforcement learning in the context of complex stochastic epidemiological models. For future work, it would be interesting to investigate how well these algorithms scale to even larger state and/or action spaces. In this regard, the use of attention-based multi-agent reinforcement learning algorithms could be explored [37].
Another important concern is to scale these reinforcement learning methods to individual-based epidemiological models, as such models can easily be configured to address a variety of research scenarios, i.a., vaccine allocation, telecommuting, antiviral drug allocation. However, the computational burden associated with individual-based models complicates the use of reinforcement learning methods [64]. To this end, it would be interesting to devise methods to automatically learn a surrogate model from the individual-based model, such that the reinforcement learning agent can learn in this computationally leaner surrogate model.
While we show that deep reinforcement learning algorithms can be used to learn optimal mitigation strategies, the interpretation of such policies remains challenging [24]. This is especially the case for the multi-district setting we considered, where the states of the districts are not synchronised in time and the infection onset of the patches is highly stochastic. To this end, further research into explainable reinforcement learning, both in a single-agent and multi-agent setting, is warranted.
Acknowledgments
Pieter Libin and Timothy Verstraeten were supported by a PhD grant of the FWO (Fonds Wetenschappelijk Onderzoek - Vlaanderen). We thank Jelena Grujic (Vrije Universiteit Brussel) for her comments on the network analysis.
Supplementary Information
S1 Census data
Each compartment model is representative of one of the districts defined in the main manuscript, and as such the compartment model is parametrised with the census data of the respective district, i.e., population counts stratified by age group.
We use the 2011 United Kingdom census data made available by NOMIS.
To be compatible with our model, we need to map this census data to the age structure imposed by the Eames contact matrix: i.e., 0-4 years (children), 5-18 years (adolescents), 19-64 years (adults) and 65-90+ years (elderly). To define this mapping, we will refer to the number of individuals with the symbol $N$, subscripted with the dataset type (i.e., NOMIS or Eames) and the age group.
For the age groups 0-4 and 65-90+ we have a direct mapping:
\[ N_{\text{Eames},\,0\text{-}4} = N_{\text{NOMIS},\,0\text{-}4}, \qquad N_{\text{Eames},\,65\text{-}90+} = N_{\text{NOMIS},\,65\text{-}90+} \tag{S1} \]
However, as the contact matrix splits adolescents and adults between ages 18 and 19, while in the census data these two ages are aggregated into one band, we need to make a custom mapping. We aggregate all shared age groups and divide the common age band in two:
\[ N_{\text{Eames},\,5\text{-}18} = N_{\text{NOMIS},\,5\text{-}17} + \tfrac{1}{2} N_{\text{NOMIS},\,18\text{-}19}, \qquad N_{\text{Eames},\,19\text{-}64} = \tfrac{1}{2} N_{\text{NOMIS},\,18\text{-}19} + N_{\text{NOMIS},\,20\text{-}64} \tag{S2} \]
When restructuring the census data according to the Eames age groups, we observe clear trends over the districts with respect to the proportion of children, adolescents, adults and elderly, as shown in Figure S1. However, the histograms in Figure S1 only show the marginal distribution per age group. To reason about the joint distribution over all age groups, consider that we have a proportion for each of the age groups; we can thus represent this data as a point on a positive simplex [2], as defined in Definition 2.
Definition 2 (Unit simplex).
A unit simplex [1] with $K$ components corresponds to the set:
\[ \Delta^{K} = \left\{ x \in \mathbb{R}^{K} : x_k > 0 \text{ for } k = 1, \dots, K, \ \textstyle\sum_{k=1}^{K} x_k = 1 \right\} \tag{S3} \]
This representation enables us to reason about this data in a statistical framework, and to visualize the four-dimensional data in a three-dimensional space by using the barycentric coordinate system, as shown in Figure S2. Figure S2 shows that the census distribution exhibits a dense region with a limited number of outliers.
Note that we use the 2011 census dataset, rather than the more recent 2018 census dataset, to be fully compatible with the mobility dataset used to inform our between-patch transition model (see Section S3).
For each district, the base contact matrix is corrected to make it reciprocal, using that district’s census data.
S2 Parametrising the model with $R_0$
In the influenza modelling literature, it is common to parametrise the model in terms of a specific value of the basic reproduction number $R_0$. We will now introduce an equation that enables us to compute the transmission probability $\beta$ for a given $R_0$ value.
To this end, we need to determine the next-generation matrix, which summarizes the number of secondary infections between age groups [57], and determine the spectral radius of this matrix (see Definition 3).
Definition 3 (Spectral radius).
The spectral radius $\rho(A)$ of a matrix $A$ is the dominant eigenvalue of that matrix.
As the contact matrix is a square matrix with positive real entries, according to the Perron-Frobenius theorem, this eigenvalue exists and is unique. Note that, in a graph, the spectral radius is a measure of the graph's connectivity [36]. As our contact matrix can be seen as a graph that represents how strongly the different age groups are connected, this notion of connectivity applies here as well.
Following [16] and [17], we construct the next-generation matrix $K$ for our SEIR model:
\[ K_{ij} = \beta \, D \, C_{ij} \tag{S4} \]
where $C$ is the contact matrix and $D$ is the mean infectious period.
Given $K$, we can compute $R_0$ as:
\[ R_0 = \rho(K) = \beta \, D \, \rho(C) \tag{S5} \]
Using Equation S5, we can now compute the transmission probability $\beta$ for a given $R_0$, infectious period $D$ and contact matrix $C$.
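As an illustration, Equation S5 can be inverted to obtain $\beta$ from a given $R_0$; a minimal sketch in Python, where the function name and the contact matrix values are hypothetical placeholders (not the Eames matrix):

```python
import numpy as np

def transmission_probability(r0, contact_matrix, infectious_period):
    """Invert Equation S5: beta = R0 / (D * rho(C)).

    Assumes the next-generation matrix K = beta * D * C, so that
    R0 = beta * D * rho(C), with rho the spectral radius.
    """
    rho = np.max(np.abs(np.linalg.eigvals(contact_matrix)))
    return r0 / (infectious_period * rho)

# Placeholder 4x4 contact matrix (children, adolescents, adults, elderly).
C = np.array([[10.0, 3.0, 4.0, 1.0],
              [3.0, 12.0, 5.0, 1.0],
              [4.0, 5.0, 8.0, 2.0],
              [1.0, 1.0, 2.0, 5.0]])
beta = transmission_probability(r0=1.8, contact_matrix=C, infectious_period=1.8)
```

Plugging the resulting $\beta$ back into Equation S5 recovers the requested $R_0$.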
Note that for each district, we have a contact matrix that is corrected for reciprocity using that district's census data. Therefore, we have a distribution over $\rho(C_d)$, where $C_d$ is the contact matrix for district $d$. We would expect this distribution to be centred around $\rho(C_{\text{GB}})$, where $C_{\text{GB}}$ is the contact matrix that is corrected for reciprocity using the census data representative for Great Britain in its entirety (i.e., an aggregation of all the districts). This is confirmed in Figure S3, which shows that the median of the distribution over $\rho(C_d)$ coincides with $\rho(C_{\text{GB}})$. Furthermore, note that the contact matrix entry $C_{ij}$ denotes the average frequency of contacts that an individual in age group $i$ has with an individual in age group $j$. Figure S3 thus shows a limited variance.
S3 Between-patch model
Our model, which is comprised of a set of connected SEIR patches, is inspired by the recent BBC pandemic model [32]. The BBC pandemic model was in turn motivated by the model presented in [23].
At each time step, our model decides whether a patch becomes infected. This is modulated by the patch's force of infection, which combines the infectious potential of the infected patches in the system, weighted by a mobility model:
\[ \lambda_j(t) = \beta \left( S^A_j(t) \right)^{\varepsilon} \sum_{i \in \mathcal{P} \setminus \{j\}} M_{i,j} \, \iota_i(t) \tag{S6} \]
where $\mathcal{P}$ is the set of patches in the model, $M_{i,j}$ is the mobility flux between patch $i$ and patch $j$, $\beta$ is the probability of transmission on a contact, $S^A_j(t)$ is the susceptible population of adults in patch $j$ at time $t$, whose contribution is modulated by the parameter $\varepsilon$, and $\iota_i(t)$ is the infectious potential of patch $i$ at time $t$. We define this infectious potential as
\[ \iota_i(t) = C_{A,A} \, I^A_i(t) \tag{S7} \]
where $I^A_i(t)$ is the size at time $t$ of the infectious adult population and $C_{A,A}$ is the average number of contacts between adults.
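The force of infection can be computed for all patches at once as a vector product over the mobility matrix. A minimal sketch, where the function and variable names are hypothetical and the functional form follows the quantities described above (mobility flux, transmission probability, susceptible adults modulated by $\varepsilon$, and each patch's infectious potential):

```python
import numpy as np

def force_of_infection(M, beta, S_adults, I_adults, c_aa, eps):
    """Vectorized sketch of the patch-level force of infection.

    M[i, j] is the mobility flux from patch i to patch j;
    the infectious potential of patch i is c_aa * I_adults[i].
    Returns a vector with lambda_j(t) for every patch j.
    """
    iota = c_aa * I_adults            # infectious potential per source patch
    # For each destination j, sum M[i, j] * iota[i] over all source patches i.
    return beta * (S_adults ** eps) * (M.T @ iota)

M = np.array([[0.0, 2.0],
              [1.0, 0.0]])            # two-patch toy mobility matrix
lam = force_of_infection(M, beta=0.05, S_adults=np.array([100.0, 200.0]),
                         I_adults=np.array([3.0, 0.0]), c_aa=10.0, eps=0.5)
```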
$M$ is a matrix based on the mobility dataset provided by NOMIS.
In general, this between-patch model is constructed from first principles, i.e., census data, a mobility model, the number of infected individuals and the transmission potential of the virus. However, for the parameter $\varepsilon$ that modulates the contribution of the susceptibles in the receiving patch, no such intuition is readily available, even though this parameter is commonly used in the literature [23, 18, 31]. Therefore, this parameter is fitted to match the properties of the epidemic that is under investigation [23, 18, 31].
To validate our model, we conduct two experiments. Firstly, we compare our model to the original compartment model and perform a sensitivity analysis with respect to the parameter $\varepsilon$. Secondly, we show that our model fits the recent influenza pandemic of 2009, by choosing appropriate values for all model parameters.
Given this time-dependent force of infection, we model the event that a patch becomes infected with a non-homogeneous Poisson process [58, 53, 8]. A Poisson process can be used to model the occurrence of events with a given intensity (see Definition 5), and non-homogeneous Poisson processes generalize this concept to time-dependent intensities (see Definition 6). As the process' intensity depends on how the model (i.e., the set of all patches) evolves, we cannot sample the time at which a patch becomes infected a priori. Therefore, we determine this time of infection using the time scale transformation algorithm [13]. Firstly, we explain the generic time scale transformation algorithm (Section S3.1). Secondly, we adjust the algorithm to our setting (Section S3.2).
S3.1 Time scale transformation algorithm
The time scale transformation algorithm enables us to determine the time at which an event, modelled by a non-homogeneous Poisson process, will take place [13].
We will start by formally defining the homogeneous and non-homogeneous Poisson process. A Poisson process is a counting arrival process, defined on a sample space $\Omega$ with probability measure $P$.
Definition 4 (Arrival process).
An arrival process is a stochastic process [13],
\[ \{ N(t) : t \geq 0 \} \tag{S8} \]
such that for any $\omega \in \Omega$, the mapping $t \mapsto N(t, \omega)$ has $N(0, \omega) = 0$, is non-decreasing, increases only by integer jumps and is right-continuous.
Definition 5 (Homogeneous Poisson process).
A homogeneous Poisson process is an arrival process with independent increments, for which the number of arrivals in any interval depends only on the length of that interval [13, 10].
From this definition, we can show that for each homogeneous Poisson process $N$:
\[ P\left( N(t + s) - N(s) = n \right) = \frac{(\lambda t)^n}{n!} e^{-\lambda t} \tag{S10} \]
for some constant $\lambda > 0$, where $\lambda$ signifies the intensity (i.e., rate) of the process.
The concept of a Poisson process can be generalized to a non-homogeneous Poisson process by removing the time-homogeneity requirement:
Definition 6 (Non-homogeneous Poisson process).
A non-homogeneous Poisson process is an arrival process [13],
\[ \{ N(t) : t \geq 0 \} \tag{S11} \]
for which these axioms hold:
- for almost all $\omega \in \Omega$, $N(\cdot, \omega)$ jumps in steps of size 1;
- the number of arrivals within any interval $(t, t + s]$ is independent of the history of arrivals prior to $t$;
- $N$ has a time-dependent rate that is specified by the intensity function $\lambda(t)$, where $\lambda(t) \geq 0$.
We define the process' cumulative intensity function:
Definition 7 (Cumulative intensity function).
A non-homogeneous Poisson process with intensity function $\lambda(t)$ has a cumulative intensity function:
\[ \Lambda(t) = \int_0^t \lambda(u) \, \mathrm{d}u \tag{S12} \]
Furthermore, we can show that [46]:
\[ P\left( N(t + s) - N(t) = n \right) = \frac{\left( \Lambda(t + s) - \Lambda(t) \right)^n}{n!} \, e^{-\left( \Lambda(t + s) - \Lambda(t) \right)} \tag{S13} \]
and thus we have that $\Lambda$ is the expectation function of $N$:
\[ \mathbb{E}\left[ N(t) \right] = \Lambda(t) \tag{S14} \]
From Definition 7, it is clear that $\Lambda$ will be a non-decreasing function and at least right-continuous.
The crucial theorem that underlies the time scale transformation algorithm states that the arrival times of a non-homogeneous Poisson process can be mapped to those of a homogeneous Poisson process with rate 1 (Theorem 1). We present an example that demonstrates this theorem in Figure S4.
Theorem 1 (Mapping non-homogeneous Poisson processes).
Let $\Lambda$ be a continuous non-decreasing cumulative intensity function. Then,
\[ T_1, T_2, T_3, \dots \tag{S15} \]
are the arrival times of a non-homogeneous Poisson process with expectation function $\Lambda$ if and only if
\[ \Lambda(T_1), \Lambda(T_2), \Lambda(T_3), \dots \tag{S16} \]
are the arrival times of a homogeneous Poisson process with rate 1 [13].
The time scale transformation algorithm uses the relation in Theorem 1 to transform a homogeneous Poisson process with rate 1 into a non-homogeneous Poisson process with expectation function $\Lambda$. The homogeneous process is formed by sampling inter-arrival times from an exponential probability distribution with rate $\lambda = 1$. To make this transformation possible, a time inverse function of $\Lambda$ is required:
Definition 8 (Time inverse of $\Lambda$).
The time inverse of an expectation function $\Lambda$ for a non-homogeneous Poisson process is:
\[ \Lambda^{-1}(s) = \inf \{ t : \Lambda(t) \geq s \} \tag{S17} \]
In Algorithm 1, we formalize this procedure. At each step $i$, we obtain a sample from an exponential probability distribution with rate $\lambda = 1$, which is added to the set of samples $S$. The sum of the elements in $S$ represents the $i$-th arrival in the homogeneous Poisson process, and is transformed into the $i$-th arrival in the non-homogeneous Poisson process using the inverse time function $\Lambda^{-1}$.
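A minimal sketch of this procedure for a case where the time inverse is available in closed form; the function name is hypothetical. With $\lambda(t) = 2t$ we have $\Lambda(t) = t^2$ and $\Lambda^{-1}(s) = \sqrt{s}$:

```python
import math
import random

def nonhomogeneous_arrivals(inv_cum_intensity, horizon, rng):
    """Time scale transformation (Algorithm 1, sketch).

    Unit-rate exponential inter-arrival samples are accumulated; their
    running sum is an arrival time of the rate-1 homogeneous process,
    which Lambda^{-1} maps to an arrival of the non-homogeneous process.
    """
    arrivals, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)       # next homogeneous arrival time
        t = inv_cum_intensity(s)        # transformed arrival time
        if t > horizon:
            return arrivals
        arrivals.append(t)

rng = random.Random(42)
# lambda(t) = 2t, so Lambda(t) = t^2 and Lambda^{-1}(s) = sqrt(s);
# over [0, 10] we expect about Lambda(10) = 100 arrivals.
times = nonhomogeneous_arrivals(math.sqrt, horizon=10.0, rng=rng)
```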
S3.2 Time scale transformation algorithm to model the infection of patches
In order to use the time scale transformation algorithm (Algorithm 1) in our epidemiological model, note that the patches' internal state is updated in a discrete number of time steps. We determine a patch's intensity (Equation S6) at the end of each day. This results in a sequence of intensities between which we linearly interpolate to obtain a piecewise linear intensity function:
\[ \lambda(t) = \mathrm{lerp}\left( \lambda_{\lfloor t \rfloor}, \lambda_{\lfloor t \rfloor + 1}, t - \lfloor t \rfloor \right) \tag{S18} \]
where
\[ \mathrm{lerp}(a, b, u) = a + u \, (b - a) \tag{S19} \]
interpolates linearly between $a$ and $b$.
This piecewise linear intensity function is continuous and thus its cumulative counterpart $\Lambda$ is continuous as well. Furthermore, as $\lambda(t) \geq 0$ for all $t$, $\Lambda$ is non-decreasing.
As $\lambda(t)$ depends on the simulation state at time $t$, it is clear that we cannot evaluate this function beyond the current simulator time step. However, the definition of the time inverse (Definition 8) shows that we can use the arrival time in the homogeneous Poisson process as a threshold for the arrival time in the non-homogeneous Poisson process. We formalize this threshold-based time scale transformation algorithm in Algorithm 2. Note that this algorithm approximates the original algorithm, as we check whether the threshold is surpassed only at discrete time steps.
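A minimal sketch of the threshold-based variant (Algorithm 2); the function name is hypothetical. The exponential threshold is drawn once; at the end of each simulated day the cumulative intensity of the piecewise linear intensity function is advanced by one trapezoid and compared to the threshold:

```python
import random

def infection_day(daily_intensity, rng):
    """Threshold-based time scale transformation (Algorithm 2, sketch).

    daily_intensity[d] is the patch's force of infection at the end of
    day d. Returns the first day on which the cumulative intensity
    surpasses the exponential threshold, or None if the patch remains
    uninfected over the simulated horizon.
    """
    threshold = rng.expovariate(1.0)    # arrival time in the rate-1 process
    cumulative = 0.0
    for day in range(1, len(daily_intensity)):
        # trapezoid under the linear interpolation between two days
        cumulative += 0.5 * (daily_intensity[day - 1] + daily_intensity[day])
        if cumulative >= threshold:
            return day
    return None
```

In the full simulator this check would run online, one day at a time, since future intensities depend on the evolving model state.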
S4 Model validation
Our objective is to construct a model that is representative of contemporary Great Britain with respect to population census and mobility trends. This model is to be used to study school closure intervention strategies for future influenza pandemics. While in many studies [31, 23, 18] a model is created specifically to fit one epidemic case, we aim for a model that is robust with respect to different epidemic parameters, most importantly the basic reproduction number $R_0$.
To validate our model according to these goals, we conduct two experiments. In the first experiment, we compare our patch model to a SEIR compartment model that uses the same contact matrix and age structure. While we do not expect our model to behave exactly like the compartment model, as the patches and the mobility network that connects them induce a different dynamic, we do expect to see similar trends with respect to the epidemic curve and peak day. In the second experiment, we show that our model is able to reproduce the trends that were observed during the 2009 influenza pandemic, commonly known as the swine-origin influenza pandemic, which originated in Mexico. The 2009 influenza pandemic in Great Britain is an interesting case to validate our model for three main reasons. Firstly, the pandemic occurred quite recently, and thus our model's census and mobility scheme should be a good fit, as both datasets on which we base our census and mobility model were released in 2011. Secondly, due to the time at which the virus entered Great Britain, the summer holiday started 11 weeks after the emergence of the epidemic. The timing of the holidays had a severe impact on the progress of the epidemic and resulted in an epidemic curve with two peaks. This characteristic epidemic curve enables us to demonstrate the predictive power of our age-dependent contact model with support for school closures. Thirdly, the number of symptomatic cases that occurred in Great Britain during the 2009 pandemic was recorded meticulously and is publicly available [33].
S4.1 Comparison to the Eames SEIR compartment model
In this experiment, we compare our patch model to a simple SEIR model that encompasses the same age structure and contact matrix [17], to which we refer as Eames-SEIR from this point onwards. We consider a stochastic implementation of the Eames-SEIR model. This experimental setting is central to the reinforcement learning experiments, related to finding optimal school closure policies, that we present in the main manuscript.
Following [17] and [6], we use a latent period of one day and an infectious period of 1.8 days. We perform our experiment for a set of $R_0$ values, in steps of 0.2, within a range that is considered representative for the epidemic potential of influenza pandemics [9, 40].
Furthermore, we need to choose a value for the parameter $\varepsilon$ in the between-patch model, which modulates the contribution of susceptible adults in the receiving patch (see Section S3). This parameter is typically fitted to data; however, in this experiment and in the reinforcement learning experiments in the main manuscript, we consider a model to investigate future epidemics. Our goal is to calibrate our model such that it produces peak days that are similar to the peak days in Eames-SEIR [17], which is a prominent model for pandemic influenza that moreover generates peak days that are in agreement with earlier work [19]. Therefore, we investigate the effect of $\varepsilon$ in this setting, through a sensitivity analysis. We consider $\varepsilon$ in the interval $[0, 1]$, where the left end of the interval (i.e., $\varepsilon = 0$) signifies that the contribution of susceptible adults is ignored, and the right end of the interval (i.e., $\varepsilon = 1$) signifies that the contribution of adults is not modulated. In Figure S6, we show the results of this sensitivity analysis, together with the peak days for the Eames-SEIR model. From these results, it is clear that the different values of $\varepsilon$ form a gradient within the interval.
However, no single value of $\varepsilon$ provides a good fit for all of the considered $R_0$'s, when comparing the peak days to the Eames-SEIR model. Rather, we can discern a log-relationship between $R_0$ and the best-fitting $\varepsilon$. Based on this observation, we propose to define:
(S20)
where $c$ is a scaling factor. For this experimental setting, we find a value of $c$ that provides a good fit for all of the considered $R_0$'s, which we show in Figure S7.
Given this choice of $\varepsilon$, when we compare the epidemic trajectories of our spatial model with the Eames-SEIR model in Figure S8, we observe similar trends with respect to the shape of the trajectory distributions. The main difference is that the epidemic curves grow more slowly in our spatial model than in the Eames-SEIR model and also reach a lower peak incidence. This is expected, as we constrain mixing in our spatial model to within the districts, and thus increase the resolution of our model, which has been shown to more accurately predict peak incidence [42].
Furthermore, in Figure S9, we show the number of districts that become infected over time for different $R_0$ values. This shows that all districts get infected, and that the time it takes for all districts to reach this point depends mainly on the transmissibility of the virus strain.
We expect the attack rate to be similar for the Eames-SEIR and the spatial model. When all districts get infected, the attack rate in the spatial model is the sum of the attack rates of a set of Eames-SEIR models (i.e., one Eames-SEIR model per district). We verified that the attack rates are indeed nearly identical, as shown in Figure S10, with little variance for either of the models.
S4.2 2009 influenza pandemic in Great Britain
The virus responsible for the 2009 influenza pandemic arrived in Great Britain during the first week of May 2009 (week 19). Following this introduction, the epidemic grew for 11 weeks until the summer school holidays started, after which the epidemic showed its first peak. After the school holidays, the epidemic was rekindled and grew to a second peak. In Figure S11, we show the weekly case counts, as recorded by the British Health Protection Agency (HPA), and the time at which the school holidays take place.
To reproduce this distinctive epidemic curve, we use our original model as described in the main manuscript. We consider two free parameters: the basic reproduction number $R_0$ and the duration of the infectious period. The general consensus is that the basic reproduction number was moderate during the 2009 influenza pandemic, with estimates ranging from 1.16 to 2.07 [33, 56, 15, 20, 63, 44, 7]. We present a detailed overview of the reported basic reproduction number estimates in Table 1. For the period of infectiousness, we found estimates of 1.8, 2.5 and 3.38 days [17, 7, 56]. We present a detailed overview of the infectious period estimates in Table 2.
Table 1:
$R_0$ estimate | Source
1.22-1.58 | [20]
1.3-1.7 | [63]
1.21-1.35 | [44]
1.75 | [7]
1.87-2.07 | [15]
1.31 | [56]
1.16-1.59 | [33]
Table 2:
Infectious period (days) | Source
1.8 | [17]
2.5 | [7]
3.38 | [56]
Given these prior estimates, we parametrize our model with a basic reproduction number in the range of 1.16 to 2.07, and consider durations of infectiousness of 1.8, 2.5 and 3.38 days, respectively.
For this experiment, we found
(S21)
to be a good fit for the overall comparison. In Figure S12, we show the epidemic curves of our model for these parameters. In general, the epidemic curves that result from the shortest of the three infectious periods are insufficient to reproduce the trends of the 2009 pandemic. For the other two infectious periods, we show that for all but the highest reproduction numbers we observe an epidemic curve with two peaks. Furthermore, we observe a deeper trough in the epidemic curve when a longer infectious period is chosen.
In Figure S13, we show a set of model realisations in conjunction with the symptomatic case data, which shows that we were able to closely match the epidemic trends observed during the British pandemic in 2009. This model was configured with a basic reproductive number and infectious period chosen from the ranges above. The reproductive number is in good concordance with the general consensus that the virus responsible for the 2009 pandemic exhibited a moderate infectiousness. While the infectious period slightly differs from the value reported by [7] (i.e., 2.5 days), it lies well within the confidence bounds reported in that study (confidence interval: 1.1-4.0 days). Note that our model reports the number of infections, while the HPA only recorded symptomatic cases. Therefore, we scale the epidemic curve accordingly. While atypical, the large number of asymptomatic cases produced by our model is in line with earlier serological surveys [41] and with previous modelling studies [33].
S5 Computational complexity and performance
An analysis of the computational complexity of our model needs to consider that the model incorporates two components. On the one hand, infection of the patches is triggered via the time scale transformation algorithm (see Section S3.2). On the other hand, once infected, each patch in the system evolves independently, and we use the Euler-Maruyama approximation method to obtain samples from the stochastic differential equation that is associated with the patch. The time scale transformation algorithm samples a random threshold for each patch, which is compared to the force of infection of the associated patch. This comparison occurs at each time step at which the patch is not yet infected. Computing the force of infection in Equation S6 considers all model patches, and thus has a worst-case complexity that is linear in the number of model patches, i.e.,
\[ \mathcal{O}(|\mathcal{P}|) \tag{S22} \]
In the worst case, at each time step $t$, even if only one of the model patches is infected, we need to compute the force of infection for each uninfected patch, which has a quadratic complexity in the number of model patches, i.e.,
\[ \mathcal{O}(|\mathcal{P}|^2) \tag{S23} \]
However, as each patch only needs to be infected once, we can express the complexity in terms of the sets of infected patches $\mathcal{P}^I_t$ and uninfected patches $\mathcal{P}^U_t$. After all, at each time step, we only need to consider the force of infection of the uninfected patches, and this force of infection only takes into account the infected patches, i.e.,
\[ \mathcal{O}\left( |\mathcal{P}^U_t| \cdot |\mathcal{P}^I_t| \right) \tag{S24} \]
Since we have
\[ |\mathcal{P}^U_t| + |\mathcal{P}^I_t| = |\mathcal{P}| \tag{S25} \]
we can see that, while this expression has the same worst-case complexity as Equation S23, in practice fewer operations will be required.
For both the complexity in Equation S23 and in Equation S24, it is clear that as long as the number of patches is limited, as is the case in our model, this procedure will be computationally efficient, as it can be implemented as a vector product on vectors that all fit in memory (RAM).
When a patch is infected, at each time step it is advanced using a number of operations that is proportional to the number of compartments in the age-dependent SEIR model.
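As an illustration of this within-patch update, a minimal Euler-Maruyama step for a single-age-group SEIR system; the function name and the demographic-noise terms are illustrative assumptions, not the exact stochastic differential equation used in our model:

```python
import numpy as np

def euler_maruyama_seir_step(state, beta, sigma, gamma, dt, rng):
    """One Euler-Maruyama step for an SEIR SDE (sketch).

    state = (S, E, I, R); sigma and gamma are the exit rates of the
    exposed and infectious compartments. Each transition gets a drift
    term plus a demographic-noise term scaled by sqrt(rate).
    """
    S, E, I, R = state
    N = S + E + I + R
    rates = np.array([beta * S * I / N,   # infection: S -> E
                      sigma * E,          # becoming infectious: E -> I
                      gamma * I])         # recovery: I -> R
    dW = rng.normal(0.0, np.sqrt(dt), size=3)       # Wiener increments
    flows = rates * dt + np.sqrt(np.maximum(rates, 0.0)) * dW
    d_inf, d_lat, d_rec = flows
    return (S - d_inf, E + d_inf - d_lat, I + d_lat - d_rec, R + d_rec)

rng = np.random.default_rng(0)
state = (1000.0, 1.0, 1.0, 0.0)
state = euler_maruyama_seir_step(state, beta=0.5, sigma=1.0,
                                 gamma=1.0 / 1.8, dt=0.1, rng=rng)
```

Note that the transition flows cancel pairwise, so the total population size is conserved by construction.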
This model was implemented in Python; the performance-critical sections were either implemented using NumPy when a vector representation was possible (e.g., to compute the force of infection) [45], or JIT-compiled using Numba (e.g., to evolve the age-dependent SEIR model in a patch) [34]. This implementation performs well, sustaining a large number of simulation runs per second on a MacBook Pro.
S6 Selecting districts to establish a ground truth
To establish a ground truth, we select 10 districts that are representative of the population heterogeneity in Great Britain. To this end, we remind the reader that in Section S1 we analysed the population heterogeneity by representing the population structure on a positive simplex. We select 10 districts: one district that is representative of the average of this distribution, and a set of nine districts that is representative of the diversity in this distribution. To determine the average district, we consider the population heterogeneity distribution over all districts and determine the Aitchison mean of this distribution (Definition 9) [4]. We then select the district that is closest to the Aitchison mean according to the Aitchison distance (Definition 10), as shown in Figure S14.
Definition 9 (Aitchison mean).
The Aitchison mean of a set of compositions is the closed geometric mean: the component-wise geometric mean of the compositions, rescaled such that its components sum to one [1].
Definition 10 (Aitchison distance).
The Aitchison distance between two compositions $x, y \in \Delta^K$ is [3]:
\[ d_A(x, y) = \sqrt{ \sum_{k=1}^{K} \left( \ln \frac{x_k}{g(x)} - \ln \frac{y_k}{g(y)} \right)^2 } \]
where $g(\cdot)$ denotes the geometric mean of a composition's components.
Next, we determine the outer extreme points, as these represent the most diverse census points. To do this, we compute the convex hull of the point cloud (i.e., the smallest convex set that contains the point cloud), as shown in Figure S15.
We proceed by taking the points that belong to the surface of the convex hull, of which we make a subselection of 9 census points. As the convex hull consists of 21 points, we consider all k-combinations (with k = 9) and select the set of points that maximizes the minimum Aitchison distance between the selected points, as shown in Figure S16.
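This selection can be sketched as an exhaustive search over k-combinations; the function names below are hypothetical and the distance follows Definition 10. With 21 hull points and k = 9 there are roughly $2.9 \times 10^5$ combinations, which is feasible to enumerate:

```python
import math
from itertools import combinations

def aitchison_distance(x, y):
    """Aitchison distance between two compositions (see Definition 10)."""
    gx = math.prod(x) ** (1.0 / len(x))
    gy = math.prod(y) ** (1.0 / len(y))
    return math.sqrt(sum((math.log(a / gx) - math.log(b / gy)) ** 2
                         for a, b in zip(x, y)))

def most_diverse_subset(points, k):
    """Pick, among all k-combinations, the subset that maximizes the
    minimum pairwise Aitchison distance (hypothetical helper)."""
    best, best_score = None, -1.0
    for subset in combinations(points, k):
        score = min(aitchison_distance(a, b)
                    for a, b in combinations(subset, 2))
        if score > best_score:
            best, best_score = subset, score
    return list(best)

# Toy example: two near-identical compositions and two extreme ones.
hull_points = [(0.25, 0.25, 0.25, 0.25), (0.26, 0.24, 0.25, 0.25),
               (0.70, 0.10, 0.10, 0.10), (0.10, 0.70, 0.10, 0.10)]
chosen = most_diverse_subset(hull_points, k=3)
```

On the toy data, any subset containing both near-identical compositions has a tiny minimum pairwise distance, so the selected subset contains both extreme points.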
S7 Finding communities
To investigate the collaborative nature of school closure policies, we apply deep multi-agent reinforcement learning algorithms. In our model, we have 379 agents, one for each district, as each agent represents the district for which it can control school closure. As the current state-of-the-art in deep multi-agent reinforcement learning is limited in the number of agents it can deal with [28], we need to partition our model into smaller groups of agents, such that deep multi-agent reinforcement learning algorithms become feasible.
To this end, we consider the mobility matrix $M$ and define a directed commute graph $G$ (Definition 11).
Definition 11 (Commute graph).
For a commuting matrix $M$ that describes the mobility flux between a set of districts $\mathcal{D}$, we define a commute graph,
\[ G = (V, A) \tag{S31} \]
where $V$ is the set of vertices, with a vertex for each of the districts in $\mathcal{D}$, and $A$ is the adjacency matrix that specifies which vertices are connected:
\[ A_{ij} = \begin{cases} 1 & \text{if } M_{i,j} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{S32} \]
Each pair of connected vertices $i$ and $j$ has a weight $M_{i,j}$.
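Constructing the adjacency matrix of Definition 11 from the mobility matrix can be sketched in a few lines; the function name is hypothetical:

```python
import numpy as np

def commute_graph(M):
    """Build the directed commute graph of Definition 11 (sketch).

    M[i, j] is the commuter flux from district i to district j.
    Returns the 0/1 adjacency matrix (Equation S32) together with the
    edge weights, which are the fluxes themselves.
    """
    A = (M > 0).astype(int)             # A_ij = 1 iff M_ij > 0
    weights = np.where(A == 1, M, 0.0)  # weight of edge (i, j) is M_ij
    return A, weights

M = np.array([[0.0, 5.0],
              [0.0, 0.0]])              # toy two-district flux matrix
A, W = commute_graph(M)
```

The resulting weighted graph can then be handed to a community detection library (e.g., an implementation of the Leiden algorithm).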
To detect communities in the commute graph, we used the Leiden algorithm [55], which searches for communities that maximize the network modularity [35].
We found a partition whose robustness we demonstrated using the bootstrapping approach presented in [49].
Furthermore, by rendering this partition on top of the map of Great Britain, as shown in Figure S17, we show that districts belonging to the same community are geographically close to each other, as we would expect.
Moreover, when we overlay the NUTS2 administrative regions, we observe that the detected communities largely coincide with these regions.
We conduct our multi-agent reinforcement learning experiments in the community with 11 districts, to which we refer as the Cornwall-Devon community (see Figure S17), as it is comprised of the Cornwall and Devon NUTS2 regions.
S8 PPO learning curves ()
S9 PPO learning curves ()
S10 Comparing PPO to the ground truth (attack rate)
S11 QMIX reward curves
S12 PPO hyperparameters

- Number of local steps:
- Batch size (i.e., number of minibatches):
- Clipped surrogate objective epsilon:
- Number of epochs:
- Entropy coefficient:
- Discount factor:
- Generalized advantage estimation lambda:
- Neural network of actor and critic:
  - Number of hidden layers:
  - Number of units per hidden layer:
  - Nonlinearity: hyperbolic tangent (tanh)
- Learning rate:
- Optimizer: Adam [30]
- Gradient norm clipping threshold:

Footnotes
 https://www.nomisweb.co.uk
 We use the NOMIS WU03UK dataset that was released in 2011.
 Note that this is a proxy to the ground truth, as we use a deterministic version of the model.
 NUTS (Nomenclature of Territorial Units for Statistics) is a geocode standard constructed by Eurostat to reference the subdivisions of European countries. NUTS2 is the second level and corresponds to basic regions for the application of regional policies.
References
 (1997) The one-hour course in compositional data analysis, or compositional data analysis is simple. In Proceedings of IAMG, Vol. 97, pp. 3–35. Cited by: Definition 2, Definition 9.
 (1983) Principal component analysis of compositional data. Biometrika 70 (1), pp. 57–65. Cited by: §S1.
 (1992) On criteria for measures of compositional difference. 24 (4), pp. 365–379. Cited by: Definition 10.
 (1994) Principles of compositional data analysis. pp. 73–81. Cited by: §S6.
 (2008) Construction of equivalent stochastic differential equation models. 26 (2), pp. 274–297. Cited by: §3.1, §3.1.
 (2010) Vaccination against pandemic influenza A/H1N1v in England: a real-time economic evaluation. Vaccine 28 (12), pp. 2370–2384. Cited by: §S4.1, §4.
 (2009) Seasonal transmission potential and activity peaks of the new influenza A (H1N1): a Monte Carlo likelihood analysis based on human mobility. BMC Medicine 7 (1), pp. 45. Cited by: §S4.2, §S4.2, Table 1, Table 2.
 (2010) Fluctuation effects in metapopulation models: percolation and pandemic threshold. Journal of theoretical biology 267 (4), pp. 554–564. Cited by: §S3.
 (2009) Strategies for pandemic and seasonal influenza vaccination of schoolchildren in the United States. American journal of epidemiology 170 (6), pp. 679–686. Cited by: §S4.1.
 (2002) Introduction to probability. Vol. 1, Athena Scientific Belmont, MA. Cited by: Definition 5.
 (2011) Would school closure for the 2009 H1N1 influenza epidemic have been worth the cost?: a computational simulation of Pennsylvania. BMC Public Health 11 (1), pp. 353. Cited by: §2.
 (2016) School closure policies at municipality level for mitigating influenza spread: a model-based evaluation. BMC Infectious Diseases 16 (1), pp. 576. Cited by: §2.
 (2013) Introduction to stochastic processes. Courier Corporation. Cited by: §S3.1, §3.2, §S3, Definition 4, Definition 5, Definition 6, Theorem 1.
 (2018) The impact of regular school closure on seasonal influenza epidemics: a data-driven spatial transmission model for Belgium. 18 (1), pp. 29. Cited by: §2.
 (2009) A preliminary analysis of the epidemiology of influenza A (H1N1)v virus infection in Thailand from early outbreak data, June–July 2009. Eurosurveillance 14 (31), pp. 19292. Cited by: §S4.2, Table 1.
 (2009) The construction of nextgeneration matrices for compartmental epidemic models. Journal of the Royal Society Interface, pp. rsif20090386. Cited by: §S2.
 (2012) Measured dynamic social contact patterns explain the spread of h1n1v influenza. PLoS computational biology 8 (3), pp. e1002425. Cited by: §2, §S2, §3.1, §3.1, §S4.1, §S4.1, §S4.1, §S4.2, Table 2.
 (2010) Spatial dynamics of the 1918 influenza pandemic in england, wales and the united states. Journal of the Royal Society Interface 8 (55), pp. 233–243. Cited by: §3.2, §3.3, §S3, §S4.
 (2006) Strategies for mitigating an influenza pandemic. Nature 442 (7101), pp. 448. Cited by: §S4.1, §4.
 (2009) Pandemic potential of a strain of influenza A (H1N1): early findings. Science 324 (5934), pp. 1557–1561. Cited by: §S4.2, Table 1.
 (2012) Inferring the structure of social contacts from demographic data in the analysis of infectious diseases spread. 8 (9), pp. e1002673. Cited by: §3.1.
 (2019) School dismissal as a pandemic influenza response: when, where and for how long?. Epidemics, pp. 100348. Cited by: §2.
 (2014) Spatial transmission of 2009 pandemic influenza in the us. PLoS computational biology 10 (6), pp. e1003635. Cited by: §3.2, §3.2, §3.3, §S3, §S3, §S4.
 (2019) DARPA’s explainable artificial intelligence program. AI Magazine 40 (2), pp. 44–58. Cited by: §7.
 (2007) Effectiveness of interventions to reduce contact rates during a simulated influenza pandemic. Emerging infectious diseases 13 (4), pp. 581. Cited by: §2.
 (2010) Developing guidelines for school closure interventions to be used during a future influenza pandemic. BMC infectious diseases 10 (1), pp. 221. Cited by: §2.
 (2008) Modeling targeted layered containment of an influenza pandemic in the United States. Proceedings of the National Academy of Sciences 105 (12), pp. 4639–4644. Cited by: §2.
 (2019) A survey and critique of multi-agent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, pp. 1–48. Cited by: §6, §S7.
 (2015) Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola. Proceedings of the Royal Society B: Biological Sciences 282 (1806), pp. 20150347. Cited by: §3.1.
 (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: 5th item.
 (2019) Geographic transmission hubs of the 2009 influenza pandemic in the United States. Epidemics 26, pp. 86–94. Cited by: §3.2, §3.3, §S3, §S4.
 (2018) Contagion! The BBC Four Pandemic – the model behind the documentary. Cited by: §3.2, §3.2, §S3.
 (2012) Why was the 2009 influenza pandemic in England so small? PLoS ONE 7 (2), pp. e30223. Cited by: §3.3, §S4.2, §S4.2, Table 1, §S4.
 (2015) Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pp. 7. Cited by: §S5.
 (2008) Community structure in directed networks. Physical review letters 100 (11), pp. 118703. Cited by: §S7.
 (2011) Network science: theory and applications. John Wiley & Sons. Cited by: §S2.
 (2019) Multi-agent game abstraction via graph attention neural network. arXiv preprint arXiv:1911.10715. Cited by: §7.
 (2005) Containing pandemic influenza at the source. Science 309 (5737), pp. 1083–1087. Cited by: §4.
 (2007) Nonpharmaceutical interventions implemented by US cities during the 1918–1919 influenza pandemic. JAMA 298 (6), pp. 644–654. Cited by: §1, §2.
 (2009) Optimizing influenza vaccine distribution. Science 325 (5948), pp. 1705–1708. Cited by: §S4.1.
 (2010) Incidence of 2009 pandemic influenza A H1N1 infection in England: a cross-sectional serological study. The Lancet 375 (9720), pp. 1100–1108. Cited by: §3.3, §S4.2.
 (2014) The spatial resolution of epidemic peaks. PLoS Computational Biology 10 (4), pp. e1003561. Cited by: §S4.1.
 (2013) The cost effectiveness of pandemic influenza interventions: a pandemic severity based analysis. PLoS ONE 8 (4). Cited by: §2.
 (2010) Pros and cons of estimating the reproduction number from early epidemic growth rate of influenza A (H1N1) 2009. Theoretical Biology and Medical Modelling 7 (1), pp. 1. Cited by: §S4.2, Table 1.
 (2006) Guide to NumPy. Cited by: §S5.
 (2012) Applied stochastic system modeling. Springer Science & Business Media. Cited by: §S3.1.
 (2017) Influenza. The Lancet, pp. 697–708. Cited by: §1.
 (2019) Context matters: using reinforcement learning to develop human-readable, state-dependent outbreak response policies. Philosophical Transactions of the Royal Society B 374 (1776), pp. 20180277. Cited by: §2.
 (2017) Community structure of copper supply networks in the prehistoric Balkans: an independent evaluation of the archaeological record from the 7th to the 4th millennium BC. Journal of Complex Networks 6 (1), pp. 106–124. Cited by: §S7.
 (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 4295–4304. Cited by: §6.
 (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
 (1987) Large sample properties of simulations using latin hypercube sampling. Technometrics 29 (2), pp. 143–151. Cited by: §4.
 (2008) A simple explanation for the low impact of border control as a countermeasure to the spread of an infectious disease. Mathematical Biosciences 214 (1–2), pp. 70–72. Cited by: §S3.
 (2012) Social contact patterns and control strategies for influenza in the elderly. Mathematical Biosciences 240 (2), pp. 241–249. Cited by: §3.1.
 (2019) From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9. Cited by: §S7.
 (2010) Estimated epidemiologic parameters and morbidity associated with pandemic H1N1 influenza. Canadian Medical Association Journal 182 (2), pp. 131–136. Cited by: §S4.2, Table 1, Table 2.
 (2010) An introduction to infectious disease modelling. Oxford University Press, Oxford. Cited by: §S2.
 (2018) Characterizing the dynamics underlying global spread of epidemics. Nature communications 9 (1), pp. 218. Cited by: §3.2, §S3.
 (2003) Are we ready for pandemic influenza?. Science 302 (5650), pp. 1519–1522. Cited by: §1.
 (2011) Dynamic health policies for controlling the spread of emerging infections: influenza as an example. PLoS ONE 6 (9). Cited by: §2.
 (2013) Identifying dynamic tuberculosis casefinding policies for hiv/tb coepidemics. Proceedings of the National Academy of Sciences 110 (23), pp. 9457–9462. Cited by: §2.
 (2016) Identifying costeffective dynamic policies to control epidemics. Statistics in medicine 35 (28), pp. 5189–5209. Cited by: §2.
 (2009) The transmissibility and control of pandemic influenza A (H1N1) virus. Science 326 (5953), pp. 729–733. Cited by: §S4.2, Table 1.
 (2018) Towards sample efficient reinforcement learning. In IJCAI, pp. 5739–5743. Cited by: §7.
 (2020) A novel coronavirus from patients with pneumonia in China, 2019. New England Journal of Medicine. Cited by: §1.