Towards the Use of Neural Networks for Influenza Prediction at Multiple Spatial Resolutions

Emily L. Aiken
Harvard University
Cambridge, MA 02138
Andre T. Nguyen
University of Maryland
Booz Allen Hamilton
Columbia, MD 21044
Mauricio Santillana
Harvard Medical School
CHIP, Boston Children’s Hospital
Boston, MA 02215

We introduce the use of a Gated Recurrent Unit (GRU) for influenza prediction at the state- and city-level in the US, and experiment with the inclusion of real-time flu-related Internet search data. We find that a GRU has lower prediction error than current state-of-the-art methods for data-driven influenza prediction at time horizons of over two weeks. In contrast with other machine learning approaches, the inclusion of real-time Internet search data does not improve GRU predictions.

1 Introduction

Infectious diseases affect billions of people every year and cause considerable morbidity and mortality worldwide. Influenza alone infects 35 million people in the US annually, causing 12,000-56,000 deaths [cdcfludata]. Accurate real-time surveillance and forecasting of disease activity could help public health officials design timely interventions to mitigate outbreaks, but traditional healthcare-based surveillance systems are limited by inherent reporting delays. Data from the US Centers for Disease Control and Prevention (CDC) on Influenza-like Illness (ILI) rates, for example, are available with a delay of approximately two weeks and are frequently revised retrospectively [yangargo]. Time-series machine learning methods that provide real-time estimates of disease activity at high spatial resolution show promise for filling this temporal data gap, helping hospitals, clinics, and communities manage public health threats.

Previous computational work on improving real-time estimation and forecasting of disease activity has focused on ILI in the United States, employing methods ranging from applied machine learning and statistical modeling [yangargo; gft; santillana; argokernel; brooks; lu; wu; li; hu] to standard mechanistic epidemiological modeling [wyang1; wyang2] and network approaches [viboud; charu]. Many of these approaches explore the use of novel Internet-based data sources, including Google search information (GT) [gft], Twitter microblogs [paul2014twitter], and electronic health records [SantillanaEHR], to complement epidemiological data with Internet-based signals available in real time, producing accurate "nowcasts" of influenza incidence [yangargo; santillana; argokernel; lu; hu].

While the literature on data-driven nowcasting methods for estimating disease activity is well-developed from an epidemiological standpoint, the machine learning methods employed lag behind the state of the art. The nowcasting models introduced to date mainly use variations of regularized linear regression [yangargo; lu] or, less often, random forests or support vector machines [argokernel]. From a machine learning perspective, the problem of disease activity estimation is well suited to more sophisticated, time-series-specific model architectures, and thanks to the growing volume of recorded epidemiological data, the use of recurrent neural networks (RNNs), and more specifically their variants long short-term memory (LSTM) and gated recurrent unit (GRU) networks, is increasingly feasible.

To our knowledge, four papers to date explore neural network methods for epidemiological prediction. Wu et al. [wu] apply a CNN-GRU architecture for state-level ILI estimation, Li et al. [li] use a graph-structured RNN to account for networked regional disease spread, and Hu et al. [hu] and Lui et al. [lui] employ a fully-connected network and an LSTM, respectively, to track national ILI activity. However, these papers leave significant gaps: they do not evaluate performance with walk-forward validation (which has been shown to improve nowcasting accuracy [yangargo]); with the exception of [hu] and [lui], they do not explore digital data sources; and, most importantly, with the exception of [wu], they do not compare model performance to relevant baseline methods from the epidemiological literature.

1.1 Our Contributions

Our work bridges the gap between the state of the art in machine learning and in disease forecasting, comparing the performance of a GRU to previously established machine learning methods for real-time ILI estimation at the state and city level in the US. We find that the GRU is superior to baseline methods when there is a large reporting delay in the standard surveillance system (over two weeks). We further experiment with the inclusion of real-time Internet search-engine data from Google Trends (GT), and find that while the performance of baseline methods is improved by GT data, the GRU's performance is not. Finally, we conduct an in-depth analysis of feature importances for each model we build, as interpretability is key to effective practical use of data-driven models in public health.

2 Methods

2.1 Datasets and Preprocessing

Our state-level epidemiological dataset consists of CDC weekly ILI counts from Oct. 4, 2009 to May 14, 2017. Only the 37 states without missing data are included in our analysis. Our city-level epidemiological dataset was compiled by IMS Health based on weekly medical claims for 159 cities for the period Jan. 1, 2004 to July 20, 2010 [viboud; charu]. We extract historical flu-related search activity from Google Trends (GT) [trends] for each location for 256 keywords shown in previous work to correlate strongly with ILI incidence [lu]. Each dataset is split into training (first 50%) and test (last 50%) periods. Each time-series is normalized to the range [0, 1], where the minimum and maximum values are identified from the training period; normalization is reversed before evaluation.
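The train-period-only min-max scaling can be sketched as follows (a minimal illustration with toy numbers; the helper names are ours, not the paper's):

```python
import numpy as np

def fit_minmax(train):
    """Identify scaling bounds from the training period only."""
    return float(train.min()), float(train.max())

def normalize(series, lo, hi):
    return (series - lo) / (hi - lo)

def denormalize(scaled, lo, hi):
    """Reverse the scaling before computing evaluation metrics."""
    return scaled * (hi - lo) + lo

# toy weekly ILI series: first half is the training period
series = np.array([1.0, 3.0, 5.0, 4.0, 2.0, 6.0, 7.0, 3.0])
train, test = series[:4], series[4:]
lo, hi = fit_minmax(train)
train_s, test_s = normalize(train, lo, hi), normalize(test, lo, hi)
recovered = denormalize(test_s, lo, hi)
```

Because the bounds come from the training period alone, test-period values can fall outside [0, 1]; this avoids leaking future information into the scaling.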

2.2 Modeling

We construct four baseline models, built independently for each location in both datasets.

  • The persistence (P) model is the standard naïve baseline for time-series prediction, in which the most recently observed incidence is propagated forward as the prediction at the target horizon.

  • The linear autoregression (AR) uses a linear combination of autoregressive observations of ILI incidence in a given location to predict incidence at a given time horizon in that location. A linear autoregression incorporating synchronous Google search data, similar to the "ARGO" model presented in [yangargo], takes as features both autoregressive terms and synchronous query volumes for a set of search terms from a single location.

  • The linear network autoregression (LR) captures the spatial spread of disease, taking as features autoregressive terms from a set of regions available in the dataset. We also implement a form of the LR in which synchronous Google search query volumes from all regions are incorporated, similar to the "ARGO-net" model presented in [lu].

  • The Random Forest (RF) uses the same predictors as the LR model, but takes a nonparametric approach with a forest of 50 decision trees. As with the other baselines, we also implement a variant that incorporates GT data.
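A minimal sketch of the persistence and autoregressive baselines might look like the following (ordinary least squares stands in for the paper's L1-regularized regression, and the function names and toy series are ours):

```python
import numpy as np

def persistence_forecast(series, horizon):
    """Naive baseline: propagate the latest observation `horizon` weeks ahead."""
    return series[-1]

def ar_forecast(series, horizon, n_lags):
    """Direct autoregressive forecast: regress y[t + horizon] on the n_lags
    most recent observations. The paper's AR adds L1 regularization; plain
    least squares is used here to keep the sketch dependency-free."""
    X, y = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(series[t - n_lags + 1 : t + 1])
        y.append(series[t + horizon])
    X, y = np.asarray(X), np.asarray(y)
    X = np.hstack([X, np.ones((len(X), 1))])          # intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    latest = np.append(series[-n_lags:], 1.0)
    return float(latest @ coef)

# toy seasonal ILI-like series with a 52-week period
rng = np.random.default_rng(0)
t = np.arange(200)
ili = 5 + 3 * np.sin(2 * np.pi * t / 52) + 0.1 * rng.standard_normal(200)
print(persistence_forecast(ili, 4), ar_forecast(ili, 4, n_lags=8))
```

On a strongly seasonal series, the AR model exploits the periodic structure that the persistence model ignores, which is why persistence degrades quickly at longer horizons.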

All models use a fixed lookback window of autoregressive terms, and models that incorporate GT data use a fixed set of search terms. For models incorporating data from multiple locations, the number of included locations is selected via 4-fold cross-validation from the set {10, 20, 40}, selected independently for the epidemiological and GT time-series. Finally, the linear regressions incorporate L1 regularization with the penalty parameter chosen via 4-fold cross-validation, and the maximum depth of the random forest is likewise chosen via 4-fold cross-validation.
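The 4-fold cross-validated grid search, e.g. for choosing the number of included locations from {10, 20, 40}, could be sketched as follows (the L1-penalty and tree-depth searches follow the same pattern; contiguous folds, synthetic data, and plain least squares are our simplifying assumptions):

```python
import numpy as np

def four_fold_cv_score(X, y, fit, predict):
    """Mean RMSE over 4 contiguous folds (contiguous blocks respect the
    temporal ordering of the weekly observations)."""
    folds = np.array_split(np.arange(len(y)), 4)
    errs = []
    for idx in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[idx] = False
        model = fit(X[mask], y[mask])
        pred = predict(model, X[idx])
        errs.append(np.sqrt(np.mean((pred - y[idx]) ** 2)))
    return float(np.mean(errs))

def fit_ols(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ coef

# synthetic example: 40 candidate locations, signal in the first few
rng = np.random.default_rng(1)
n_weeks, n_locs = 300, 40
X_all = rng.standard_normal((n_weeks, n_locs))
y = X_all[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(n_weeks)
scores = {k: four_fold_cv_score(X_all[:, :k], y, fit_ols, predict_ols)
          for k in (10, 20, 40)}
best_k = min(scores, key=scores.get)
```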

We implement a small Gated Recurrent Unit neural network (GRU) with a single 5-node hidden layer. Without GT data, the GRU accepts as input autoregressive terms from all locations in the dataset and predicts incidence at the given time horizon for all locations simultaneously. When using GT data, the GRU additionally accepts synchronous Google search query volumes; we choose the queries with the highest correlation (in the training period) with ILI incidence in any location in the dataset. The GRU is trained with a mean-squared-error objective and a dropout rate of 0.3 after the hidden layer to reduce overfitting. We use a learning rate of 0.001 for stochastic gradient descent and train the model for 1,000 epochs.
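For reference, the GRU recurrence underlying such a model can be written out directly in the standard Cho et al. gating (the weight shapes and toy inputs below are our assumptions, not the paper's implementation, and some libraries swap the roles of z and 1-z):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step: update gate z, reset gate r, candidate state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x @ Wz + h @ Uz + bz)                 # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)                 # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)     # candidate state
    return (1.0 - z) * h + z * h_tilde                # interpolated new state

def gru_forward(X, params, hidden=5):
    """Run a sequence of weekly feature vectors through a 5-unit GRU layer."""
    h = np.zeros(hidden)
    for x in X:
        h = gru_step(x, h, params)
    return h  # in practice the final state feeds a linear output layer

# toy shapes: 37 state-level inputs per week, 5 hidden units, 52 weeks
rng = np.random.default_rng(0)
n_in, n_h = 37, 5
W = lambda a, b: 0.1 * rng.standard_normal((a, b))
params = (W(n_in, n_h), W(n_h, n_h), np.zeros(n_h),
          W(n_in, n_h), W(n_h, n_h), np.zeros(n_h),
          W(n_in, n_h), W(n_h, n_h), np.zeros(n_h))
h_final = gru_forward(rng.standard_normal((52, n_in)), params)
```

Because each new state is a convex combination of the previous state and a tanh candidate, the hidden activations stay bounded in (-1, 1), which is part of what makes GRUs stable over long sequences.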

2.3 Training and Evaluation

Models are trained with walk-forward validation ("dynamic training"), wherein each model is re-trained each week using all data available in that week. In addition to eliminating forward-looking bias and allowing models to use all available data, dynamic training has been shown in previous ILI-specific work to increase model accuracy [yangargo], and reflects how models would be used in real-world scenarios. Models are evaluated on the second half of each dataset based on the distribution of root mean squared error (RMSE) across all locations for four prediction horizons of 1, 2, 4, and 8 weeks. We also conduct a set of Wilcoxon signed-rank tests to assess whether the distribution of RMSE across locations differs between each machine learning method and the naïve persistence method.
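Walk-forward validation amounts to a loop that re-fits the model each week on all data observed so far; a dependency-free sketch using the persistence baseline (the synthetic seasonal series and all names are ours):

```python
import numpy as np

def walk_forward_rmse(series, horizon, fit, predict):
    """Re-fit the model each week on all data available that week, then
    score the horizon-weeks-ahead prediction against the realized value."""
    preds, truths = [], []
    start = len(series) // 2                  # evaluate on the second half
    for t in range(start, len(series) - horizon):
        history = series[: t + 1]             # everything observed by week t
        model = fit(history, horizon)         # re-trained every week
        preds.append(predict(model, history))
        truths.append(series[t + horizon])
    preds, truths = np.asarray(preds), np.asarray(truths)
    return float(np.sqrt(np.mean((preds - truths) ** 2)))

# persistence "model": nothing to fit, prediction is the latest observation
fit_p = lambda history, horizon: None
predict_p = lambda model, history: history[-1]

t = np.arange(400)
ili = 5 + 3 * np.sin(2 * np.pi * t / 52)      # noiseless seasonal toy series
rmse_1 = walk_forward_rmse(ili, 1, fit_p, predict_p)
rmse_8 = walk_forward_rmse(ili, 8, fit_p, predict_p)
```

Because only data observed by week t enters each fit, the loop has no forward-looking bias; on the seasonal toy series, persistence error grows with the horizon.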

3 Results

3.1 Accuracy

We find that in general the GRU flu predictions have significantly lower prediction errors (RMSE) than less sophisticated machine learning models for long time horizons of prediction. Specifically, as shown in Figure 1 and Table 1, the GRU demonstrates superior performance on 4- and 8-week time horizons for both datasets when only epidemiological data is used, and for an 8-week horizon when both epidemiological and GT data are incorporated. We observe a larger gap in accuracy between the GRU and the baseline methods on the city-level dataset. However, we find that, unlike baseline models, the GRU’s performance is not improved by including real-time GT data at any time horizon.

Figure 1: Summary of GRU performance in comparison to baseline models. Each violin records the distribution of prediction errors (RMSE) across locations, disaggregated by the inclusion of GT data.
                          No GT Data                                      GT Data Included
         1 week       2 weeks      4 weeks      8 weeks      1 week       2 weeks      4 weeks      8 weeks
States
  AR     141(1e-3)    3(<e-5)      9(<e-5)      0(<e-5)      143(2e-3)    23(<e-5)     8(<e-5)      0(<e-5)
  LR     340(.86)     77(3e-5)     4(<e-5)      1(<e-5)      205(.03)     26(<e-5)     3(<e-5)      0(<e-5)
  RF     306(.49)     29(<e-5)     1(<e-5)      0(<e-5)      268(.21)     16(<e-5)     1(<e-5)      0(<e-5)
  GRU    100(1e-3)    95(1e-4)     9(<e-5)      0(<e-5)      114(3e-4)    112(3e-4)    22(<e-5)     0(<e-5)
Cities
  AR     5384(.09)    4119(1e-4)   2575(<e-5)   851(<e-5)    6064(.61)    1907(<e-5)   474(<e-5)    14(<e-5)
  LR     2328(<e-5)   663(<e-5)    218(<e-5)    13(<e-5)     3996(5e-5)   816(<e-5)    117(<e-5)    6(<e-5)
  RF     5759(.30)    264(<e-5)    40(<e-5)     2(<e-5)      4159(2e-4)   365(<e-5)    29(<e-5)     1(<e-5)
  GRU    2296(<e-5)   368(<e-5)    0(<e-5)      0(<e-28)     1946(<e-5)   333(<e-5)    0(<e-5)      0(<e-5)
Table 1: Results of Wilcoxon signed-rank tests comparing the distribution of RMSE for each machine learning method with the naïve persistence model. Test statistics, which range between 0 and 352 for states and between 0 and 6360 for cities, indicate differences between methods (a small statistic signals a large difference), and P-values are included in parentheses with statistically significant results bolded.

3.2 Interpretability

For interpretability purposes, we analyze feature importances across each method. Specifically, we obtain regression coefficients for each linear regression, feature importances [breiman] for each random forest model, and saliency maps [simonyan] for each GRU prediction; examples are shown in Figure 2 in the appendix. Several results are consistent with intuitive spatial and temporal model interpretation. For the city-level dataset, the most important features in the linear regression and random forest models tend to be epidemiological lags and GT information from cities near the city of prediction. The most immediately available epidemiological information (lags 1-4) tends to be important in linear regression and random forest models that predict under short reporting delays, while information from the previous season (lags 48-52) is more important in models that predict under long reporting delays. Similarly, saliency maps indicate that GRU attention extends much further back under an eight-week reporting delay than under a one-week reporting delay.
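As a rough illustration of what a saliency map measures, here is a finite-difference approximation of the input gradient on a toy model with made-up weights (the paper's analysis applies the gradient-based method of Simonyan et al. to the trained GRU; everything below is our own sketch):

```python
import numpy as np

def saliency(f, x, eps=1e-5):
    """Approximate |df/dx_i| for every input feature by central differences."""
    sal = np.zeros_like(x)
    for i in range(len(x)):
        up, dn = x.copy(), x.copy()
        up[i] += eps
        dn[i] -= eps
        sal[i] = abs(f(up) - f(dn)) / (2 * eps)
    return sal

# toy "model": recent lags weighted heavily, distant lags weakly
w = np.array([1.0, 0.5, 0.25, 0.0])
f = lambda x: np.tanh(w @ x)
sal = saliency(f, np.zeros(4))
# at x = 0, |d tanh(w.x)/dx_i| = |w_i|, so saliency mirrors the weight profile
```

Larger saliency values indicate inputs (here, lags) to which the model's output is most sensitive, which is the sense in which saliency maps approximate "attention" over the lookback window.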

4 Discussion

Here we introduce the use of a time-series neural network approach that improves upon the predictive accuracy of previously used machine learning methods for ILI prediction in the presence of reporting delays of over two weeks. We show that the GRU achieves superior accuracy at two spatial resolutions relevant to actionable interventions, and could therefore improve real-time tracking of ILI given the reporting delays inherent to standard healthcare-based surveillance systems. Furthermore, our results using epidemiological data alone indicate that under short reporting delays, the GRU could provide highly accurate forecasts of ILI activity up to 8 weeks ahead of the most recently available epidemiological report.

We find, however, that the GRU outperforms other models only at reporting delays longer than two weeks, and that the GRU is not improved by GT data. These performance differences are consistent with the trade-off between model complexity and convergence in data-deficient scenarios: the simpler models evaluated here have fewer trainable parameters than the GRU, while the GRU's more complex, time-series-specific architecture allows it to better learn patterns embedded in historical data. The benefit of the GRU is larger when no external (GT) real-time data are available to the model, likely because the inclusion of GT data significantly increases the number of parameters. With the availability of more training data this behavior may change.

As the amount of available high spatial resolution disease-specific data grows in the field of public health, using neural network models like the one introduced here becomes increasingly feasible. Trade-offs in interpretability should be considered, however, when comparing neural networks to less complex machine learning methods. For that reason, we have presented a comprehensive feature importance analysis in this work. Note that while linear regression coefficients like the ones extracted in our analysis are highly interpretable, feature importances in a random forest model include more stochasticity and the saliency maps produced for predictions by the neural network model represent only rough approximations of model attention.

Two key limitations of this study are the tuning of the neural network model and the lack of access to real-time epidemiological data. First, the performance of neural network models is sensitive to several hyperparameters, including optimization choices, depth, width, and regularization. Due to computational limits, we adopt a simple GRU architecture with a single five-unit hidden layer and do not tune the other hyperparameters. The GRU's performance would likely improve if cross-validation were used to tune key hyperparameters. Second, we have access only to final (revised) ILI data, but as noted in the introduction these data are frequently updated with post hoc revisions for several weeks after their original release.

There is much room for further exploration of sophisticated machine learning methods for epidemiological prediction. It would be particularly impactful to explore how well the models presented here can track other infectious diseases outside of the United States. There is also room for experimentation with other neural network architectures. In particular, we adopt a network that is similar in size to those in past work on ILI prediction [wu; li; hu], but is very small compared to those used in other machine learning applications. We leave the exploration of deeper and wider architectures as future work.


  • (1) Centers for Disease Control and Prevention: Disease burden of influenza. (2018).
  • (2) Yang, S., Santillana, M., & Kou, S. C. Accurate estimation of influenza epidemics using google search data via ARGO. Proceedings of the National Academy of Sciences 112, 14473-14478 (2015).
  • (3) Ginsberg, J. et al. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (2009).
  • (4) Santillana, M. et al. What Can Digital Disease Detection Learn from (an External Revision to) Google Flu Trends? American Journal of Preventive Medicine 47, 341-347 (2014).
  • (5) Santillana, M. et al. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLOS Computational Biology 11, e1004513 (2015).
  • (6) Brooks, L. et al. Flexible Modeling of Epidemics with an Empirical Bayes Framework. PLOS Computational Biology 11, e1004382 (2015).
  • (7) Lu, F. et al. Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nature Communications 10, 147 (2019).
  • (8) Wu, Y., Yang, Y., Nishiura, H., & Saitoh, M. Deep Learning for Epidemiological Predictions. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 1085-1088 (2018).
  • (9) Li, Z. et al. A Study on Graph-Structured Recurrent Neural Networks and Sparsification with Application to Epidemic Forecasting. Optimization of Complex Systems: Theory, Models, Algorithms, and Applications, 730-739 (2019).
  • (10) Hu, H. et al. Prediction of influenza-like illness based on the improved artificial tree algorithm and artificial neural network. Scientific Reports 8, 4895 (2018).
  • (11) Lui, L. et al. LSTM Recurrent Neural Networks for Influenza Trends Prediction. International Symposium on Bioinformatics Research and Applications, 259-264 (2018).
  • (12) Yang, W., Karspeck, A., & Shaman, J. Comparison of Filtering Methods for the Modeling and Retrospective Forecasting of Influenza Epidemics. PLOS Computational Biology 10, e1003583 (2014).
  • (13) Yang, W., Lipsitch, M., & Shaman, J. Inference of seasonal and pandemic influenza transmission dynamics. Proceedings of the National Academy of Sciences 112, 2723-2728 (2015).
  • (14) Viboud, C. et al. Demonstrating the Use of High-Volume Electronic Medical Claims Data to Monitor Local and Regional Influenza Activity in the US. PLOS One 9, e0102429 (2014).
  • (15) Charu, V. et al. Human mobility and the spatial transmission of influenza in the United States. PLOS Computational Biology 13, e1005382 (2017).
  • (16) Paul M. J., Dredze M., & Broniatowski, D. Twitter improves influenza forecasting. PLOS Currents 6 (2014).
  • (17) Santillana, M. et al. Cloud-based electronic health records for real-time, region-specific influenza surveillance. Scientific Reports 6 (2016).
  • (18) Google Trends.
  • (19) Breiman, L. Classification and Regression Trees. New York: Routledge (1984).
  • (20) Simonyan, K., Vedaldi, A., & Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. International Conference on Learning Representations (2014).
  • (21) Moss, R. et al. Epidemic forecasts as a tool for public health: interpretation and (re)calibration. Australian and New Zealand Journal of Public Health 42, 69-76 (2018).

Appendix A Appendix

Figure 2: Example of interpretability analysis for the state of Pennsylvania. Similar analyses were performed for all states and cities. Feature importances are averaged over the entire prediction period. Note that the most important short-term predictors in the LR and RF are from Pennsylvania and nearby Virginia. Also note that GRU attention extends back much further for the 8-week prediction than for the 1-week prediction.