Examining Deep Learning Models with Multiple Data Sources for COVID-19 Forecasting
Abstract
The COVID-19 pandemic represents the most significant public health disaster since the 1918 influenza pandemic. During pandemics such as COVID-19, timely and reliable spatiotemporal forecasting of epidemic dynamics is crucial. Deep learning-based time series models for forecasting have recently gained popularity and have been successfully used for epidemic forecasting. Here we focus on the design and analysis of deep learning-based models for COVID-19 forecasting. We implement multiple recurrent neural network-based deep learning models and combine them using the stacking ensemble technique. To incorporate the effects of multiple factors on COVID-19 spread, we draw on multiple data sources, such as COVID-19 testing data and human mobility data, for better predictions. To overcome the sparsity of training data and to address the dynamic correlation of the disease, we propose clustering-based training for high-resolution forecasting. This helps us identify the similar trends of certain groups of regions arising from various spatiotemporal effects. We examine the proposed methods for forecasting weekly COVID-19 new confirmed cases at the county, state, and country level. A comprehensive comparison between different time series models in the COVID-19 context is conducted and analyzed. The results show that simple deep learning models can achieve comparable or better performance than more complicated models. We are currently integrating our methods into the weekly forecasts that we provide to state and federal authorities.
I Introduction
The COVID-19 pandemic is the worst outbreak we have seen since 1918; it has caused over 22 million confirmed cases globally and over 791,000 deaths in more than 200 countries.
Our contributions. Our work focuses on exploring deep learning-based methods that incorporate multiple data sources for weekly, 4-weeks-ahead forecasting of COVID-19 new confirmed cases at multiple geographical resolutions, including the country, state, and county level. In the context of COVID-19, the problem is more complicated than seasonal influenza forecasting for the following reasons: (i) very sparse training data for each region; (ii) noisy surveillance data due to heterogeneity in epidemiological context, e.g. the disease spreading timeline and testing prevalence in different regions; (iii) a system constantly in churn, with individual behavioral adaptation, policies, and disease dynamics constantly coevolving. Given these challenges, we examine different types of time series models and propose an ensemble framework that combines simple deep learning models using multiple data sources such as COVID-19 testing data and human mobility data. The multi-source data allows us to capture the above-mentioned factors more effectively. To overcome the data sparsity problem, we propose clustering-based training methods to augment the training data for each region. We group spatial regions based on trend similarity and infer a model per cluster. Among other things, this avoids overfitting due to sparse training data. As an additional benefit, it helps explicitly uncover the spatial correlation across regions by training models with similar time series. Our main contributions are summarized below:

First, we systematically examine time series-based deep learning models for COVID-19 forecasting and propose clustering-based training methods to augment sparse and noisy training data for high-resolution regions, which avoids overfitting and explicitly uncovers the similar spreading trends of certain groups of regions.

Second, we implement a stacking ensemble framework that combines multiple deep learning models and multiple data sources for better performance. Stacking is a natural way to combine multiple methods and data sources.

Third, we analyze the performance of our method and other published results in their ability to forecast weekly new confirmed cases at the country, state, and county level. The results show that our ensemble model outperforms the individual models as well as several classic machine learning and state-of-the-art deep learning models.

Finally, we conduct a comprehensive comparison among mechanistic models, statistical models, and deep learning models. The analysis shows that for COVID-19 forecasting, deep learning-based models can capture the dynamics and have better generalization capability than the mechanistic and statistical baselines. Simple deep learning models, such as simple recurrent neural networks, can achieve better performance than complex deep learning models like graph neural networks for high-resolution forecasting.
II Related Work
COVID-19 is a very active area of research, so it is impossible to cover all recent manuscripts; we only cover the most relevant papers here.
II-A COVID-19 forecasting by mechanistic methods
Mechanistic methods have been a mainstay for COVID-19 forecasting due to their capability to represent the underlying disease transmission dynamics as well as to incorporate diverse interventions. They enable counterfactual forecasting, which is important for future government interventions to control the spread. Forecasting performance depends on the assumed underlying disease model. Yang et al. [yang2020modified] use a modified susceptible(S)-exposed(E)-infected(I)-recovered(R) (SEIR) model for predicting the COVID-19 epidemic peaks and sizes in China. Anastassopoulou et al. [anastassopoulou2020data] provide estimations of the basic reproduction number and the per-day infection mortality and recovery rates using a susceptible(S)-infected(I)-dead(D)-recovered(R) (SIDR) model. Giordano et al. [giordano2020modelling] propose a new susceptible(S)-infected(I)-diagnosed(D)-ailing(A)-recognized(R)-threatened(T)-healed(H)-extinct(E) (SIDARTHE) model to help plan an effective control strategy. Yamana et al. [yamana2020projection] use a metapopulation SEIR model for US county-resolution forecasting. Chang et al. [chang2020modelling] develop an agent-based model for a fine-grained computational simulation of the ongoing COVID-19 pandemic in Australia. Kai et al. [kai2020universal] present a stochastic dynamic network-based compartmental SEIR model and an individual agent-based model to investigate the impact of universal face mask wearing on the spread of COVID-19.
II-B COVID-19 forecasting by time series models
Time series models, such as statistical models and deep learning models, are popular for their simplicity and forecasting accuracy in the epidemic domain. One big challenge is the lack of sufficient training data in the context of COVID-19 dynamics. Another challenge is that the surveillance data is extremely noisy (with noise that is hard to model) due to rapidly evolving epidemics. However, as additional data becomes available and the surveillance systems mature, these models become more promising. Harvey et al. [harvey2020time] propose a new class of time series models based on generalized logistic growth curves that reflect COVID-19 trajectories. Petropoulos et al. [petropoulos2020forecasting] produce forecasts using models from the exponential smoothing family. Ribeiro et al. [ribeiro2020short] evaluate multiple regression models and stacking-ensemble learning for forecasting COVID-19 cumulative confirmed cases one, three, and six days ahead in ten Brazilian states. Hu et al. [hu2020artificial] propose a modified autoencoder model for real-time forecasting of the size, length, and ending time of the epidemic in China. Chimmula et al. [chimmula2020time] use LSTM networks to predict COVID-19 transmission. Arora et al. [arora2020prediction] use LSTM-based models to predict positive reported cases for 32 states and union territories of India. Magri et al. [magri2020first] propose a data-driven model trained with both data and first principles. Dandekar et al. [dandekar2020neural] use neural network-aided quarantine control models to estimate the global COVID-19 spread.
II-C Deep learning-based epidemic forecasting
Recurrent neural networks (RNNs) have been demonstrated to capture the dynamic temporal behavior of a time sequence. Thus they have become a popular method in recent years for seasonal influenza-like illness (ILI) forecasting. Volkova et al. [volkova2017forecasting] build an LSTM model for short-term ILI forecasting using CDC ILI and Twitter data. Venna et al. [venna2019novel] propose an LSTM-based method that integrates the impacts of climatic factors and geographical proximity. Wu et al. [wu2018deep] construct CNNRNN-Res, combining RNNs and convolutional neural networks to fuse information from different sources. Wang et al. [wang2019defsi, wang2020tdefsi] propose TDEFSI, combining deep learning models with causal SEIR models to enable high-resolution ILI forecasting with no or little high-resolution training data. Adhikari et al. [adhikari2019epideep] propose EpiDeep for seasonal ILI forecasting by learning meaningful representations of incidence curves in a continuous feature space. Deng et al. [deng2019graph] design cola-GNN, a cross-location attention-based graph neural network for forecasting ILI. Regarding COVID-19 forecasting, Kapoor et al. [kapoor2020examining] examined a novel forecasting approach for COVID-19 daily case prediction that uses graph neural networks and mobility data. Gao et al. [gao2020stan] proposed STAN, which uses a spatiotemporal attention network. Ramchandani et al. [ramchandani2020deepcovidnet] presented DeepCOVIDNet to compute equidimensional representations of multivariate time series. These works examine their models on daily forecasting at the US state or county level.
Our work focuses on time series deep learning models for COVID-19 forecasting that yield weekly forecasts at multiple resolution scales and provide 4-weeks-ahead forecasts (equal to 28 days ahead in the context of daily forecasting). We use an ensemble model to combine multiple simple deep learning models. We show that, compared to state-of-the-art time series models, simple recurrent neural network-based models can achieve better performance. More importantly, we show that the ensemble method is an effective way to mitigate model overfitting caused by the very small and noisy training data.
III Method
III-A Problem Formulation
We formulate the COVID-19 new confirmed case forecasting problem as a regression task with time series from multiple sources as the input. We have $N$ regions in total. Each region is associated with a time series of multi-source input over a time window of length $T$. For a region $i$, at time step $t$, the multi-source input is denoted as $\mathbf{x}^i_t \in \mathbb{R}^F$, where $F$ is the number of features. We denote the training data as $\mathcal{D} = \{\mathbf{x}^i_t\}$. The objective is to predict COVID-19 new confirmed cases at a future time point $t'+h$, where $h$ refers to the horizon of the prediction. We are interested in a predictor $f$ that predicts the new confirmed case count at time $t'+h$, denoted as $\hat{y}^i_{t'+h}$, by taking $[\mathbf{x}^i_{t'-T+1}, \dots, \mathbf{x}^i_{t'}]$ as the input, where $t'$ is the most recent time of data availability:

(1)  $\hat{y}^i_{t'+h} = f([\mathbf{x}^i_{t'-T+1}, \dots, \mathbf{x}^i_{t'}]; \theta)$

where $\theta$ denotes the parameters of the predictor and $\hat{y}^i_{t'+h}$ denotes the prediction of $y^i_{t'+h}$.
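To make the formulation concrete, the sketch below builds the (input window, target) training pairs described above from a toy multi-source series. This is an illustration only; the window length, horizon, and toy data are ours, not the paper's.

```python
import numpy as np

def make_windows(series, T, h):
    """Build (input window, target) pairs from a single region's
    multi-source series of shape (weeks, features).
    Feature 0 is assumed to be new confirmed cases (the target)."""
    X, y = [], []
    for t in range(T - 1, len(series) - h):
        X.append(series[t - T + 1 : t + 1])   # window of T weeks
        y.append(series[t + h, 0])            # case count h weeks ahead
    return np.array(X), np.array(y)

# toy region: 10 weeks, 3 features (e.g. cases, deaths, mobility)
region = np.arange(30, dtype=float).reshape(10, 3)
X, y = make_windows(region, T=3, h=2)
```

Each row of `X` is one window of $T$ weeks of all features, and the matching entry of `y` is the case count $h$ weeks past the window's end.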
III-B Recurrent Neural Networks (RNNs)
For brevity, we assume a region $i$ is given and omit the subscript $i$ in this subsection. An RNN model consists of $k$ stacked RNN layers. Each RNN layer consists of $T$ cells, denoted as $c_1, \dots, c_T$. The input is $[\mathbf{x}_{t'-T+1}, \dots, \mathbf{x}_{t'}]$, and the output from the last layer is denoted as $[\mathbf{h}^k_1, \dots, \mathbf{h}^k_T]$. Let $m$ be the dimension of the hidden state in each layer. For the first layer $l=1$, cell $c_t$ works as:

(2)  $\mathbf{h}^1_t = \sigma(W^1 \mathbf{x}_t + U^1 \mathbf{h}^1_{t-1} + \mathbf{b}^1)$

where $\sigma$ is an activation function; $W^1$, $U^1$, and $\mathbf{b}^1$ are learned weights and biases; $\mathbf{h}^1_t$ is the output of $c_t$ and $\mathbf{h}^1_{t-1}$ is from $c_{t-1}$. The cell computation is similar in the layers $l > 1$, but with $\mathbf{x}_t$ replaced by $\mathbf{h}^{l-1}_t$, and with weights $W^l$, $U^l$, and $\mathbf{b}^l$. The first RNN layer takes $[\mathbf{x}_{t'-T+1}, \dots, \mathbf{x}_{t'}]$ as the input, the second layer takes $[\mathbf{h}^1_1, \dots, \mathbf{h}^1_T]$ as the input, and the rest of the layers behave in the same manner. The RNN module can be replaced by a Gated Recurrent Unit (GRU) [cho2014learning] or Long Short-term Memory (LSTM) [hochreiter1997long], which avoid the short-term memory and vanishing gradient problems of vanilla RNNs.
The output of the $k$ stacked RNN layers is fed into a fully connected layer:

(3)  $\hat{\mathbf{y}} = g(W_o \mathbf{h}^k_T + \mathbf{b}_o)$

where $d$ is the output dimension, $W_o \in \mathbb{R}^{d \times m}$, $\mathbf{b}_o \in \mathbb{R}^d$, and $g$ is a linear function.
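The computation in Eqs. (2) and (3) can be sketched as a minimal NumPy forward pass through two stacked vanilla RNN layers and a linear output layer. The weights here are random stand-ins, not trained parameters.

```python
import numpy as np

def rnn_layer(xs, W, U, b):
    """One vanilla RNN layer (Eq. 2): h_t = tanh(W x_t + U h_{t-1} + b).
    xs: (T, d_in); returns all hidden states, shape (T, m)."""
    m = U.shape[0]
    h = np.zeros(m)            # h_0 initialized to zeros
    hs = []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(0)
T, d_in, m, d_out = 4, 3, 5, 1
xs = rng.normal(size=(T, d_in))          # toy input window

# two stacked layers; the second consumes the first layer's outputs
W1, U1, b1 = rng.normal(size=(m, d_in)), rng.normal(size=(m, m)), np.zeros(m)
W2, U2, b2 = rng.normal(size=(m, m)), rng.normal(size=(m, m)), np.zeros(m)
h1 = rnn_layer(xs, W1, U1, b1)
h2 = rnn_layer(h1, W2, U2, b2)

# fully connected output layer (Eq. 3) with a linear activation g
Wo, bo = rng.normal(size=(d_out, m)), np.zeros(d_out)
y_hat = Wo @ h2[-1] + bo                 # prediction from the last hidden state
```

Swapping `rnn_layer` for a GRU or LSTM cell changes only the per-step update; the stacking and the dense output layer stay the same.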
III-C Multi-source Attention RNNs
The multi-source attention RNN model consists of $F$ $k$-stacked RNN branches, each of which encodes the time series of one feature. Assume the output of branch $j$ is $\mathbf{h}^j$, where we omit the subscript $i$ for brevity. An attention layer is used to measure the impact of the multiple sources on new confirmed cases. We assume the time series of new confirmed cases is encoded in branch $p$, and we define the attention coefficient $\alpha_j$ as the effect of feature $j$ on the target feature:

(4)  $\alpha_j = \mathrm{ReLU}(\mathbf{w}^\top [\mathbf{h}^j \,\|\, \mathbf{h}^p])$

where $\mathbf{w}$ is a learned weight vector, $\|$ denotes concatenation, and $\mathrm{ReLU}$ is the ReLU function. Then the output of the attention layer is:

(5)  $\mathbf{c} = \tanh\big(W_a \sum_{j=1}^{F} \alpha_j \mathbf{h}^j + \mathbf{b}_a\big)$

where $W_a$ and $\mathbf{b}_a$ are learned parameters and $\tanh$ is the tanh function. The output layer is a dense layer that outputs $\hat{y}$:

(6)  $\hat{y} = g(W_o \mathbf{c} + \mathbf{b}_o)$

where $W_o$ and $\mathbf{b}_o$ are learned parameters and $g$ is the linear function. In our paper, all the features have time series of the same length. However, the multi-source attention RNN model enables training with inputs whose features have time series of different lengths, which is an advantage when the availability of the various factors is heterogeneous.
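The attention mechanism can be sketched in NumPy as follows, under reconstructed forms of the equations: ReLU-scored coefficients over concatenated branch encodings, a tanh-combined context, and a linear output. Random weights stand in for trained parameters, and the branch encodings are toy vectors rather than real RNN outputs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
F, m = 4, 5                      # number of feature branches, hidden size
H = rng.normal(size=(F, m))      # encoding h^j of each feature branch
p = 0                            # branch encoding new confirmed cases (target)

# attention coefficient of feature j on the target feature,
# scored from the concatenated pair of branch encodings (assumed form)
w = rng.normal(size=2 * m)
alpha = np.array([relu(w @ np.concatenate([H[j], H[p]])) for j in range(F)])

# attention output: tanh of a weighted sum of the branch encodings
Wa, ba = rng.normal(size=(m, m)), np.zeros(m)
c = np.tanh(Wa @ (alpha @ H) + ba)

# linear dense output layer producing the prediction
Wo, bo = rng.normal(size=(1, m)), np.zeros(1)
y_hat = Wo @ c + bo
```

Because each feature has its own encoder branch, a feature with a shorter history simply yields its own encoding $\mathbf{h}^j$; the attention layer is agnostic to the branch input lengths.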
III-D Clustering-based Training
Deep learning models usually require a large amount of training data, which is not available in the context of COVID-19. In particular, for regions where the pandemic started late, there are only a few valid data points for weekly forecasting. Thus training a single model for each such region, which we call vanilla training, is highly susceptible to overfitting. One modeling strategy is to train a model for a group of selected regions, which to some extent overcomes the data sparsity problem. It is likely that groups of regions exhibit strong correlations due to various spatiotemporal effects and geographical or demographic similarity. We explore a clustering-based approach that simultaneously learns COVID-19 dynamics from the multiple regions within a cluster and infers a model per cluster. Various types of similarity metrics can be used to uncover the trend similarity, allowing for an explainable time series forecasting framework.
Generalizing the earlier problem formulation, we denote the historically available time series for a region $i$ as $\mathbf{s}^i = [y^i_1, \dots, y^i_{T_i}]$, where $T_i$ is the time span of the available surveillance data. $T_i$ increases as new data becomes available, and it varies across regions. The set of time series for the $N$ regions is denoted as $S = \{\mathbf{s}^1, \dots, \mathbf{s}^N\}$. The clustering process aims to partition $S$ into $K$ sets $\{S_1, \dots, S_K\}$.
In our work, the trend is represented as the time series of new confirmed cases, and we cluster the time series in two ways: geography-based clustering (geo-clustering) and algorithm-based clustering (alg-clustering). Geo-clustering: regions are clustered based on their geographical proximity, e.g. counties are partitioned based on their state codes for the US. We propose this method due to differences across regions with respect to their size, population density, epidemiological context, and the way policies are being implemented; thus we assume that regions belonging to the same jurisdiction have strongly related COVID-19 time series. Alg-clustering: clustering using (i) k-means [hartigan1979algorithm], which partitions observations into clusters in which each observation belongs to the cluster with the nearest mean; (ii) time series k-means (ts-kmeans) [huang2016time], which clusters time series data using smooth subspace information; (iii) k-shape [paparrizos2015k], which uses a normalized version of the cross-correlation measure in order to consider the shapes of the time series while comparing them. Note that k-means requires the time series being clustered to have the same length, while geo-clustering, ts-kmeans, and k-shape allow for clustering time series of different lengths. Alg-clustering discovers implicit correlations in the epidemic trends and does not assume any geographical knowledge. We denote the set of the above clustering methods as $\mathcal{C}$.
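The alg-clustering idea can be sketched without external dependencies: min-max-normalize each region's case curve, then run plain k-means so that regions with similar trends land in the same cluster. The farthest-point initialization below is our simplification to keep the toy example deterministic.

```python
import numpy as np

def minmax_scale(curves):
    """Scale each region's curve to [0, 1] before clustering."""
    lo = curves.min(axis=1, keepdims=True)
    hi = curves.max(axis=1, keepdims=True)
    return (curves - lo) / np.where(hi > lo, hi - lo, 1.0)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point init
    (a simplification of the referenced k-means algorithm)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])      # farthest region becomes a seed
    centers = np.array(centers)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)              # assign to nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

# toy regions: two rising curves, two falling curves
curves = np.array([[1, 2, 4, 8], [2, 4, 8, 16],
                   [16, 8, 4, 2], [8, 4, 2, 1]], dtype=float)
labels = kmeans(minmax_scale(curves), k=2)
```

After normalization, the two rising regions share one cluster and the two falling regions the other, so one model per cluster sees several training curves instead of one.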
III-E Ensemble
Ensemble learning is primarily used to improve model performance; Ren et al. [ren2016ensemble] present a comprehensive review. In this paper, we implement stacking ensembles: a separate dense neural network is trained using the predictions of the individual models as its inputs. We use leave-one-out cross-validation to train and predict for each region. For each target value, we train the ensemble model using the training samples from the same region at the other time points.
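The leave-one-out stacking scheme can be sketched as follows: for each time point of a region, a combiner is fit on the remaining points and applied to the held-out one. The paper uses a small dense network as the combiner; here a least-squares linear combiner keeps the example dependency-free.

```python
import numpy as np

def stack_loo(preds, truth):
    """Leave-one-out stacking for one region.
    preds: (weeks, n_models) base-model predictions; truth: (weeks,).
    For each week t, fit a linear combiner (with intercept) on all
    other weeks and predict week t with it."""
    out = np.empty_like(truth)
    n = len(truth)
    for t in range(n):
        mask = np.arange(n) != t
        A = np.c_[preds[mask], np.ones(mask.sum())]      # add intercept
        w, *_ = np.linalg.lstsq(A, truth[mask], rcond=None)
        out[t] = np.r_[preds[t], 1.0] @ w
    return out

# toy region: two base models with opposite constant biases
truth = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
preds = np.c_[truth + 5.0, truth - 5.0]
ens = stack_loo(preds, truth)
```

Because the truth is an exact average of the two biased predictors, the learned combination cancels the biases and recovers the held-out values.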
III-F Probabilistic Forecasting
In the epidemic forecasting domain, probabilistic forecasting is important for capturing the uncertainty of the disease dynamics and for better supporting public health decision making. We implement MC-Dropout [gal2016dropout] for each individual predictor to provide an estimate of prediction uncertainty. However, the ensemble predictions are point estimates by the definition of stacking.
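MC-Dropout can be sketched as follows: dropout is kept active at prediction time and the network is sampled repeatedly, so the spread of the samples estimates predictive uncertainty. This toy version uses a single random hidden layer, not the paper's trained models.

```python
import numpy as np

def mc_dropout_predict(x, W, b, w_out, p=0.2, samples=50, seed=0):
    """MC-Dropout sketch: sample the network `samples` times with
    dropout left on; return (mean, std) of the sampled predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(samples):
        h = np.maximum(W @ x + b, 0.0)       # ReLU hidden layer
        mask = rng.random(h.shape) >= p      # Bernoulli dropout, keep prob 1-p
        h = h * mask / (1.0 - p)             # inverted-dropout scaling
        preds.append(float(w_out @ h))
    preds = np.array(preds)
    return preds.mean(), preds.std()

rng = np.random.default_rng(1)
x = rng.normal(size=4)                        # toy input window
W, b = rng.normal(size=(8, 4)), np.zeros(8)   # random stand-in weights
w_out = rng.normal(size=8)
mean, std = mc_dropout_predict(x, W, b, w_out)
```

The sample mean serves as the point forecast and the sample spread (or its quantiles) as the uncertainty band reported for each individual predictor.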
III-G Proposed Framework
Fig. 1 shows the framework of the proposed method. It works as follows: (1) we choose a geographical scale and resolution, e.g. counties in the US; (2) we collect and process multi-source training data; (3) we cluster regions into groups based on the similarities between their time series of new confirmed cases; (4) we train multiple predictors per cluster and combine the individual predictors into an ensemble to make the final predictions.
Multiple data sources
To model the coevolution of multiple factors in COVID-19 spread, we incorporate the following data sources in our models. COVID-19 Surveillance Data [uva2020uva] and the Case Count Growth Rate (CGR) quantify case counts and case count changes in the COVID-19 time series. COVID-19 Testing Data [jhu2020covid], the Testing Rate (TR), and the Test Positive Rate (TPR) quantify the COVID-19 testing coverage in each region. The Google COVID-19 Aggregated Mobility Research Dataset [kraemer2020mapping], the Flow Reduction Rate (FRR), and the Social Distancing Index (SDI) quantify the anonymized weekly mobility flow (MF) and flow changes between and within regions. We denote the set of data sources as $\mathcal{S}$, which can be expanded by combining any new data sources. We generate the derived features by preprocessing $\mathcal{S}$. Details of the data description and feature generation are given in Section IV-A.
Multiple RNNbased models
By combining different data sources (single feature, multiple features, attention over features), RNN modules (RNN, GRU, LSTM), and training methods (vanilla, geo, k-means, ts-kmeans, k-shape), we implement multiple individual models. For the country, US state, and US county levels, the models include: RNN, GRU, and LSTM, which use vanilla training with a single feature; RNN-m, GRU-m, and LSTM-m, which use vanilla training with multiple features; and RNN-att, GRU-att, and LSTM-att, which are attention-based models using vanilla training with multiple features. For the US county level, to investigate the effect of clustering-based training, we implement additional models using the RNN module and a single feature: RNN-geo, RNN-kmeans, RNN-tskmeans, and RNN-kshape. We analyze the effects by varying the clustering method while fixing the other factors; other combinations of modules, features, and training methods are therefore omitted in this work. We denote the set of individual models as $\mathcal{M}$. Note that $\mathcal{M}$ is not limited to the models implemented in this paper; it can be expanded by adding to or improving upon any of the individual components.
Training and forecasting
Algorithm 1 presents how the proposed framework works. We first preprocess the collected data sources to generate the features based on data availability at the different resolutions. Each feature takes the form of a time series of weekly data points at a given geographical resolution. We design various models for the different resolutions. Next, each model in $\mathcal{M}$ is trained using its corresponding cluster of training data. For a region $i$, given an input, each model outputs its prediction. The outputs of the individual models in $\mathcal{M}$ are then combined using the stacking ensemble, which outputs the final prediction for region $i$ at time $t'+h$.
Multistep forecasting
For a single feature, we use a recursive forecasting approach to make multi-step forecasts: we append the most recent prediction to the input for the next-step forecast. For multiple features that include exogenous time series as input, we train a separate model for each step-ahead forecast.
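The recursive strategy can be sketched in a few lines; the `mean_model` below is a toy stand-in for a trained RNN predictor.

```python
import numpy as np

def recursive_forecast(model, history, T, steps):
    """Multi-step forecasting for the single-feature case: feed each
    prediction back in as the newest observation (recursive strategy)."""
    window = list(history[-T:])
    out = []
    for _ in range(steps):
        y = model(np.array(window))
        out.append(y)
        window = window[1:] + [y]    # slide the window forward by one step
    return out

# toy "model": predicts the mean of its window (stand-in for an RNN)
mean_model = lambda w: float(w.mean())
preds = recursive_forecast(mean_model, [10.0, 10.0, 10.0, 10.0], T=3, steps=4)
```

With a flat history the toy model keeps predicting the same value; with a real model, forecast errors compound across steps, which is why the multi-feature case trains a separate model per horizon instead.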
IV Experiment Setup
IV-A Data

COVID-19 surveillance data is obtained via the UVA COVID-19 surveillance dashboard [uva2020uva]. It contains daily confirmed case (CF) and death (DT) counts at the county/state resolution in the US and national-level data for other countries. Daily case and death counts are further aggregated to weekly counts.

Case count growth rate (CGR): Denoting the new confirmed/death case count at week $t$ as $x_t$, the CGR of week $t+1$ is computed as $(x_{t+1} + 1)/(x_t + 1)$, where we add 1 to smooth zero counts. We compute the confirmed-case CGR (CCGR) and the death CGR (DCGR).
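A sketch of the CGR computation, assuming the add-one-smoothed ratio $(x_{t+1}+1)/(x_t+1)$ (our reading of the definition, which the extraction partially lost):

```python
import numpy as np

def cgr(counts):
    """Case count growth rate with add-one smoothing so that
    zero-count weeks do not divide by zero (assumed form)."""
    x = np.asarray(counts, dtype=float)
    return (x[1:] + 1.0) / (x[:-1] + 1.0)

weekly_cases = [0, 4, 9, 9]
rates = cgr(weekly_cases)
```

A rate above 1 marks a growing week, below 1 a shrinking one; the same helper applies to death counts for DCGR.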

COVID-19 testing data is obtained via the JHU COVID-19 tracking project [jhu2020covid]. It includes fields such as positive and negative test counts at the state and country level for the US. We compute tests per 100K population (TR) and the test positive rate (TPR), i.e. positive/(positive+negative).

The Google COVID-19 Aggregated Mobility Research Dataset (MF) [kraemer2020mapping] contains anonymized relative weekly mobility flows aggregated over users within 5 km cells. Given a set of regions, the flow from region $i$ to region $j$ during week $t$ is denoted by $f^{i,j}_t$. The outgoing flow of region $i$ during week $t$ is $f^{i,\cdot}_t = \sum_j f^{i,j}_t$; similarly, the incoming flow is $f^{\cdot,i}_t = \sum_j f^{j,i}_t$. The flow data can be aggregated to the county/state/country level and covers most countries in the world. In our experiments, we work with outgoing flows, since MF is mostly symmetric; thus we drop the direction from the notation and write $f^i_t$.

Flow Reduction Rate (FRR) [adiga2020interplay] measures the impact of social distancing by comparing the levels of connectivity before and after the spread of the pandemic. Given a region $i$, we compute the average outgoing MF $\bar{f}^i$ during the pre-pandemic period (the first 6 weeks of 2020) and then compute the weekly FRR as $\mathrm{FRR}^i_t = (\bar{f}^i - f^i_t)/\bar{f}^i$.
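A sketch of the FRR computation, assuming the form $(\bar{f} - f_t)/\bar{f}$, i.e. the relative drop of weekly outgoing flow from the pre-pandemic baseline (an assumption on our part, consistent with the name "flow reduction rate"):

```python
import numpy as np

def frr(weekly_outflow, baseline_weeks=6):
    """Flow Reduction Rate: relative drop in outgoing mobility flow
    versus the mean over the first `baseline_weeks` weeks (assumed form)."""
    f = np.asarray(weekly_outflow, dtype=float)
    f_bar = f[:baseline_weeks].mean()
    return (f_bar - f) / f_bar

# toy region: mobility halves, then halves again after week 6
flows = [100, 100, 100, 100, 100, 100, 50, 25]
r = frr(flows)
```

FRR is near 0 while mobility stays at baseline and approaches 1 as outgoing flow collapses, making it a convenient weekly social-distancing signal.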

Social Distancing Index (SDI) [adiga2020interplay] quantifies the mixing or movement within a county; we consider the MF between the 5 km cells inside it. Let $P_t$ denote the normalized flow matrix of the county at week $t$. We compare $P_t$ to the uniform matrix $U$ (complete mixing) and the identity matrix $I$ (no mixing), and the SDI measures how close $P_t$ is to $I$ relative to $U$. Note that a value close to one indicates less mixing within a county, while a value close to zero indicates more mixing. For more details please refer to [adiga2020interplay].
All data sources are weekly and end on Saturday. They start with the week ending March 7th and end with the week ending August 22nd (25 weeks) at the Global, US-State, and US-County resolutions. The global dataset includes Austria, Brazil, India, Italy, Nigeria, Singapore, the United Kingdom, and the United States. A summary of each dataset is shown in Table I. We chose 2020/03/07 as the start week because commercial laboratories began testing for SARS-CoV-2 in the US on March 1st, 2020; the COVID-19 surveillance data before that date is substantially noisy. The forecasting weeks start from 2020/05/23, and we make 4-weeks-ahead forecasts each week until 2020/08/22. For example, if we use the time series from 2020/03/07 to 2020/05/16 to train models, then the forecasting weeks are 2020/05/23, 2020/05/30, 2020/06/06, and 2020/06/13. We then move one week ahead and repeat the training and forecasting.
Data set  # regions  # weeks  # features 
Global  8  25  6 
US-State  50  25  8 
US-County  2952  25  7 
IV-B Metrics
The metrics used to evaluate forecasting performance are: root mean squared error (RMSE), mean absolute percentage error (MAPE), and Pearson correlation (PCORR).

Root mean squared error (RMSE):
(7)  $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}$

Mean absolute percentage error (MAPE):
(8)  $\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$

Pearson correlation (PCORR):
(9)  $\mathrm{PCORR} = \frac{\sum_{t=1}^{n}(y_t - \bar{y})(\hat{y}_t - \bar{\hat{y}})}{\sqrt{\sum_{t=1}^{n}(y_t - \bar{y})^2}\,\sqrt{\sum_{t=1}^{n}(\hat{y}_t - \bar{\hat{y}})^2}}$
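The three metrics can be implemented directly in NumPy; these are the standard definitions (the toy values below are ours):

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean absolute percentage error, in percent (undefined for y=0)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(100.0 * np.mean(np.abs((y - yhat) / y)))

def pcorr(y, yhat):
    """Pearson correlation between truth and prediction."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.corrcoef(y, yhat)[0, 1])

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
```

RMSE weights large-count regions most heavily, MAPE is scale-free but blows up near zero counts, and PCORR checks whether the forecast tracks the trend regardless of scale, which is why the paper reports all three.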
IV-C Baselines
To serve as baselines for comparison with the individual models, we also implemented an SEIR compartmental model, several statistical time series models, and state-of-the-art deep learning models. A few deep learning models proposed recently for COVID-19 forecasting have not been peer reviewed; thus we do not consider models published within 2 months of the completion of this paper.

Naive uses the observed value of the most recent week as the future prediction.

SEIR [venkatramanan2017spatio] is an SEIR compartmental model for simulating epidemic spread. We calibrate the model parameters based on surveillance data for each region. Predictions are made by persisting the current parameter values to future time points and running simulations.

Autoregressive (AR) uses observations from previous time steps as input to a regression equation to predict the value at the next time step. We train one model per region using AR order 3.

Global Autoregression (GAR) trains one global AR model using the data available from every region. This is similar to the clustering-based methods that we propose in this paper. We train one model per resolution using AR order 3.

Vector Autoregression (VAR) is a stochastic process model used to capture the linear interdependencies among multiple time series. We train one model per resolution using AR order 3.

Autoregressive Moving Average (ARMA) [contreras2003arima] describes a weakly stationary stochastic time series in terms of two polynomials, one for the autoregression (AR) and one for the moving average (MA). We set the AR order to 3 and the MA order to 2.
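As an illustration of the AR-family baselines, an AR(3) model can be fit by ordinary least squares. This sketch is our own (not the paper's implementation) and uses a toy linear series so the one-step-ahead prediction is easy to check:

```python
import numpy as np

def fit_ar(series, p=3):
    """Fit AR(p) by ordinary least squares:
    y_t ~ c + a_1*y_{t-1} + ... + a_p*y_{t-p}."""
    y = np.asarray(series, dtype=float)
    rows = [np.r_[1.0, y[t - p:t][::-1]] for t in range(p, len(y))]
    coef, *_ = np.linalg.lstsq(np.array(rows), y[p:], rcond=None)
    return coef            # [intercept, a_1, ..., a_p]

def ar_predict(coef, history, p=3):
    """One-step-ahead prediction from the last p observations."""
    h = np.asarray(history, dtype=float)
    return float(np.r_[1.0, h[-p:][::-1]] @ coef)

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # toy linear trend
coef = fit_ar(series, p=3)
next_val = ar_predict(coef, series, p=3)
```

A GAR-style variant would simply pool the lag/target rows from every region at a resolution before the least-squares fit, which mirrors the pooling idea behind the clustering-based training above.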

CNNRNN-Res [wu2018deep] uses RNNs to capture the long-term correlation in the data and uses convolutional neural networks to fuse the time series of other regions. A residual structure is also applied in the training process. We train one model per region. We set the residual window size to 3; all other parameters are set as in the original paper.

Cola-GNN [deng2019graph] applies attention-based graph neural networks within a graph message passing framework to combine graph structures and time series features in a dynamic propagation process. We train one model per resolution. We set the RNN window size to 3; all other parameters are set as in the original paper.
IV-D Settings and Implementation Details
We set a small training window size for all RNN-based models due to the short length of the available CF and DT series. We examine weekly CF forecasting at the county and state level for the US and at the country level for 8 countries, at least one from each continent. Forecasts are made 1, 2, 3, and 4 weeks ahead at each time point, i.e. $h \in \{1, 2, 3, 4\}$. All RNN-based models consist of 2 recurrent neural network layers with 32 hidden units, 1 dense layer with 16 hidden units, and 1 dropout layer with 0.2 drop probability. We set the batch size to 32 and the number of epochs to 500. The stacking ensemble model consists of 1 dense layer with 32 hidden units and a ReLU activation function. We train the ensemble with batch size 8 and 200 epochs. The Adam optimizer with default settings and early stopping with a patience of 50 epochs are used for all model training. Geo-clustering and alg-clustering methods are applied when training county-level models, with a fixed number of clusters for the alg-clustering methods. Clustering is conducted on the training curves normalized with MinMaxScaler. Single feature means the time series of CF. For country-level forecasting, the features include CF, DT, CCGR, DCGR, MF, and FRR. For US state-level forecasting, the features include CF, DT, CCGR, DCGR, MF, FRR, TR, and TPR. CF, DT, CCGR, DCGR, MF, FRR, and SDI are included for US county-level forecasting. AR-based models and CNN-RNN-based models are trained with single-feature time series. For all models, we run 50 Monte Carlo predictions. For the SEIR method, we calibrate a weekly effective reproductive number using simulation optimization to match the new confirmed cases per 100k. We set the disease parameters as follows: mean incubation period 5.5 days, mean infectious period 5 days, delay from onset to confirmation 7 days, and a case ascertainment rate of 15% [lauer2020incubation].
V Results
V-A Forecasting Performance
We evaluate the model performance at horizons 1, 2, 3, and 4 at the county, state, and national level using RMSE, MAPE, and PCORR. To mitigate performance bias caused by our settings, we divide the individual models into several categories based on their modules, training methods, and features, and then calculate the average performance per category. Note that an individual model may belong to multiple categories. RNNs includes models that mainly consist of the RNN module; GRUs includes models that mainly consist of the GRU module; LSTMs includes models that mainly consist of the LSTM module. GNNRNNs includes models that mix CNN, RNN, and GNN modules. ARs includes autoregression-based models. Vanillas includes models in RNNs that use a single feature and vanilla training. Clusters includes models in RNNs that use a single feature and geo, k-means, ts-kmeans, or k-shape clustering training. SglFtrs includes RNN, GRU, and LSTM. MulFtrs includes RNN-m, GRU-m, LSTM-m, RNN-att, GRU-att, and LSTM-att. SEIRs includes SEIR. Naive includes Naive. ENS is the stacking ensemble of RNNs, GRUs, and LSTMs. GNNRNNs excludes cola-GNN and ARs excludes VAR for US-County forecasting due to their failure to make reasonable forecasts. For more details please refer to the Table II note.
Table II presents the numerical results. In general, we observe that (i) at the US state and county level, ENS performs best on 2-, 3-, and 4-weeks-ahead forecasting, while Naive performs best on 1-week-ahead forecasting; (ii) SEIR outperforms the others at the global level on horizons 1, 2, and 3; (iii) models with a single type of DNN module outperform those with mixed types of modules; (iv) models trained with vanilla methods outperform models trained with clustering-based methods (we investigate and explain this observation in the next two paragraphs); (v) models trained with multiple features outperform models trained with a single feature at the US state and county level.
To better understand the distribution of model performance over all regions, we select one individual method from each category without overlap and count the frequency of best performance (FRQBP) per method. Fig. 2 presents the aggregate counts over the 1-, 2-, 3-, and 4-week horizons. Note that methods with larger counts do not necessarily have better MAPE, RMSE, and PCORR performance. The observations are in general consistent with those from Table II, but with more specific observations regarding FRQBP: (vi) the best 1-week-ahead predictions are mostly achieved by the Naive method; (vii) for the US state and county level, the best 2-, 3-, and 4-weeks-ahead predictions are achieved by ENS, and the count increases with the horizon; (viii) alg-clustering-based models and models with multiple features achieve best performance more often than vanilla models; (ix) GAR and AR have larger FRQBP than the DNN models at the US county level.
Furthermore, in Fig. 3 we show the US county-level curves of weekly new confirmed cases grouped by the individual method that achieves the best RMSE performance. It is interesting to observe that different methods achieve the best performance over regions with different patterns. For example, when the curves of weekly new confirmed cases fluctuate strongly between subsequent weeks, the deep learning-based methods are able to capture the dynamics well, as opposed to the SEIR and Naive methods. The Naive and SEIR models assume a certain level of regularity in the time series, which tends to be violated in the curves where the deep learning methods perform best. LSTM, RNN-kmeans, RNN-kshape, and RNN-tskmeans are outstanding at capturing dynamics with various patterns, which shows their generalization capability for time series forecasting. However, as mentioned above, good FRQBP performance does not imply better average performance on RMSE, MAPE, and PCORR, since the latter also depends on the scales of the ground truths. AR and GAR perform well at capturing the dynamics of small case counts. The CNN-RNN-based methods do not perform well on county-level forecasting. The likely reason is that the complexity of these models is much higher than that of simple RNN-based models and grows with the number of regions; overfitting thus occurs with such a small training data size at the county level.
We want to highlight that, in order to investigate deep learning models for COVID-19 forecasting, the ensemble framework in this paper only combines DNN models. However, it could also include baselines such as SEIR and Naive, which perform very well on this task. We encourage researchers to ensemble models of various types in order to average out the forecasting errors made by any particular poor model.
           |          Global             |         US-State        |       US-County
RMSE       |   1     2     3     4       |   1     2     3     4   |   1    2    3    4
ARs        | 38067  46065  53942  57905  | 3255  3546  3822  4933  |  77   92  101  120
CNN-RNNs   | 36895  49589  62499  69172  | 3511  4253  4615  5546  | 114  138  147  149
RNNs       | 31232  34877  44838  55403  | 2200  2940  3593  4605  |  60   80   96  110
GRUs       | 31172  36503  41513  55325  | 1936  2666  3520  4507  |  58   78   96  111
LSTMs      | 28023  35252  43130  53907  | 2031  2682  3576  4483  |  60   79   97  111
Vanillas   | 26323  33337  44273  54620  | 2135  2611  3415  4162  |  65   79   95  109
Clusters   |   -      -      -      -    |   -     -     -     -   |  72   91  103  117
SglFtrs    | 26878  33513  44838  54909  | 1824  2614  3533  4610  |  56   77   97  112
MulFtrs    | 32052  36648  42604  55008  | 1559  2154  3091  4114  |  47   66   84   97
SEIRs      |  8761   9393  13879  22805  | 2310  3362  4558  4635  |  65   75   82   96
Naive      | 15427  24899  27415  29318  | 1095  1936  1969  2466  |  37   48   60   71
ENS        | 18166  23204  28150  19558  | 1261  1547  1599  2109  |  45   49   59   61
MAPE       |   1     2     3     4       |   1     2     3     4   |   1    2    3    4
ARs        |  173    167    187    195   | 2301  2571  1549  1821  | 129  119  121  127
CNN-RNNs   |   95    123    173    197   | 1833  2656  1370  1777  | 148  187  202  191
RNNs       |   82     95    105    133   | 1265  1662   772  1084  | 116  142  153  162
GRUs       |   61     68     86     94   | 1335  1870   604   834  |  93  118  131  143
LSTMs      |   43     64     71     89   | 1453  1848   650   947  |  94  119  129  143
Vanillas   |   35     52     75     91   | 1092  1733   335   533  |  84   95  100  115
Clusters   |   -      -      -      -    |   -     -     -     -   | 140  167  171  179
SglFtrs    |   37     57     86    105   |  891  1260   509   719  |  94  122  139  152
MulFtrs    |   75     87     94    112   | 1448  1839   732  1101  | 101  127  139  142
SEIRs      |   12     12     18     28   |  996  1067   555   585  | 344  331  308  292
Naive      |   20     29     38     29   |  796  1198   565   590  |  75   98   95   83
ENS        |   26     30     31     22   | 1048  1177   524   509  |  90   95   91   80
PCORR      |   1       2       3       4       |   1       2       3       4       |   1       2       3       4
ARs        | 0.8787  0.8335  0.8040  0.7995  | 0.8713  0.8257  0.7161  0.5214  | 0.7712  0.6070  0.5586  0.3062
CNN-RNNs   | 0.9016  0.8479  0.8015  0.8217  | 0.7654  0.6441  0.5195  0.3119  | 0.1828  0.0232  0.0246  0.0636
RNNs       | 0.9477  0.9167  0.8690  0.7950  | 0.9094  0.8403  0.7974  0.6129  | 0.8321  0.7103  0.6086  0.5161
GRUs       | 0.9295  0.8968  0.8719  0.7966  | 0.9426  0.9152  0.8349  0.6791  | 0.8520  0.7377  0.5819  0.4776
LSTMs      | 0.9312  0.8829  0.8329  0.8030  | 0.9218  0.8776  0.7844  0.6782  | 0.8513  0.7226  0.5655  0.4779
Vanillas   | 0.9453  0.9106  0.8447  0.7703  | 0.9301  0.9094  0.8497  0.7521  | 0.8307  0.7528  0.6350  0.5297
Clusters   |   -       -       -       -     |   -       -       -       -     | 0.8167  0.6544  0.5242  0.4146
SglFtrs    | 0.9388  0.8989  0.8306  0.7560  | 0.9392  0.9035  0.8175  0.6635  | 0.8607  0.7347  0.5752  0.4744
MulFtrs    | 0.9348  0.8988  0.8716  0.8193  | 0.9662  0.9522  0.8978  0.7882  | 0.9292  0.8656  0.7679  0.7247
SEIRs      | 0.9957  0.9954  0.9851  0.9576  | 0.5806  0.5138  0.5379  0.3622  | 0.8632  0.7997  0.7809  0.7000
Naive      | 0.9888  0.9715  0.9498  0.9300  | 0.9764  0.9563  0.9208  0.8110  | 0.9546  0.9071  0.8485  0.7748
ENS        | 0.9660  0.9397  0.9163  0.9725  | 0.9601  0.9488  0.9476  0.9072  | 0.9162  0.9166  0.8622  0.8789

RNNs: RNN, RNN-geo, RNN-m, RNN-att, RNN-kmeans, RNN-tskmeans, RNN-kshape. GRUs: GRU, GRU-m, GRU-att. LSTMs: LSTM, LSTM-m, LSTM-att. CNN-RNNs: cola-GNN, GCNRNN-Res, CNNRNN-Res. ARs: AR, ARMA, VAR, GAR. Vanillas: RNN. Clusters: RNN-geo, RNN-kmeans, RNN-tskmeans, RNN-kshape. SglFtrs: RNN, GRU, LSTM. MulFtrs: RNN-m, GRU-m, LSTM-m, RNN-att, GRU-att, LSTM-att. Naive: naive. SEIRs: SEIR. ENS is the stacking ensemble of the union of RNNs, GRUs, and LSTMs. CNN-RNNs excludes cola-GNN and ARs excludes VAR for US-county forecasting due to their failure to produce reasonable forecasts.
V-B Sensitivity Analysis and Discussion
In this section, we present a sensitivity analysis of the individual models with respect to the RNN module type, the number of features, and the clustering method.
RNN modules
We compare the RMSE performance of models with pure RNN, GRU, and LSTM modules. Fig. 4 shows the comparison between the RNN, GRU, and LSTM methods on the three resolution datasets. We observe that RNN performs best on 1-week-ahead forecasting, while GRU and LSTM outperform RNN on 3- and 4-weeks-ahead forecasting at the state and county levels. The results indicate that RNN tends to perform better than GRU and LSTM for short-term forecasting but loses its advantage for long-term forecasting.
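The per-horizon RMSE comparison above can be sketched as follows, assuming forecasts and ground truth are arranged as (regions × horizons) arrays; the array shapes and toy values are illustrative assumptions, not the paper's data.

```python
import numpy as np

def rmse_per_horizon(y_pred, y_true):
    """RMSE computed separately for each forecast horizon.

    y_pred, y_true: arrays of shape (n_regions, n_horizons).
    Returns an (n_horizons,) array: one RMSE per 1-, 2-, 3-, 4-week horizon.
    """
    return np.sqrt(np.mean((y_pred - y_true) ** 2, axis=0))

# Toy example: two regions, four horizons, errors only at horizons 1 and 2.
y_true = np.array([[100., 110., 120., 130.],
                   [200., 210., 220., 230.]])
y_pred = y_true + np.array([[ 3.,  4., 0., 0.],
                            [-3., -4., 0., 0.]])
print(rmse_per_horizon(y_pred, y_true))  # [3. 4. 0. 0.]
```

Averaging over regions first, as done here, keeps the horizons separate so that short-term and long-term accuracy can be contrasted directly.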
Number of features
In our framework, we incorporate multiple data sources to model the coevolution of multiple factors in epidemic spreading. We implement individual models either with a single feature or with multiple features. In addition, we use an attention layer to model the effect of the other features on the target feature. Fig. 5 presents the performance of GRU, GRU-m, and GRU-att on the three datasets. In general, GRU-m and GRU-att, which use multiple features, outperform the single-feature GRU in most cases, except for 1- and 2-week-ahead forecasting at the global level. Note that for global forecasting there is no testing information, which is a critical factor for revealing COVID-19 dynamics.
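To make the feature-attention idea concrete, the sketch below scores each auxiliary feature series against the target series and forms a softmax-weighted combination. In the actual models the scoring is learned during training; here a fixed cosine-similarity score stands in for the learned parameters, and the toy series are invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def feature_attention(target, features):
    """Attend over input features with respect to the target feature.

    target: (T,) target series; features: (n_features, T) auxiliary series.
    Returns an attended context series of shape (T,).
    """
    # Score each feature by cosine similarity with the target series
    # (a stand-in for a learned attention score).
    scores = np.array([
        np.dot(target, f) / (np.linalg.norm(target) * np.linalg.norm(f) + 1e-8)
        for f in features
    ])
    weights = softmax(scores)   # normalized attention weights over features
    return weights @ features   # weighted combination of feature series

cases = np.array([1., 2., 3., 4.])   # target: weekly new confirmed cases
aux = np.array([[1., 2., 3., 4.],    # e.g. a testing-count series
                [4., 3., 2., 1.]])   # e.g. a mobility-index series
ctx = feature_attention(cases, aux)
print(ctx.shape)  # (4,)
```

The feature most aligned with the target receives the largest weight, so the resulting context series leans toward informative sources such as testing data.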
Clustering method
Clustering-based training is applied in our framework to mitigate the overfitting likely to arise from the small training data size. We compare the US-county-level performance of RNN, RNN-geo, RNN-kmeans, RNN-tskmeans, and RNN-kshape. The comparison is shown in Fig. 6. In general, we observe that RNN, RNN-geo, and RNN-kshape outperform RNN-kmeans and RNN-tskmeans. RNN-geo performs best for 1- and 2-week-ahead forecasting, while RNN-kshape performs best for 3- and 4-weeks-ahead forecasting. This indicates that geo-clustering can capture near-future coevolution dynamics within a state, informed by similar local epidemiological environments, while k-shape clustering can further capture far-future dynamics, informed by other counties with similar trends.
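The clustering step that groups regions by trend similarity can be sketched as follows. This minimal version runs plain k-means on z-normalized series (the paper additionally uses k-shape and geographic grouping); the cluster count and the toy series are illustrative assumptions. A model would then be trained per cluster on the pooled series.

```python
import numpy as np

def znorm(x):
    """Z-normalize a series so clustering compares shapes, not magnitudes."""
    return (x - x.mean()) / (x.std() + 1e-8)

def kmeans_series(series, k, iters=20, seed=0):
    """Cluster regional time series; series: (n_regions, T) -> labels (n_regions,)."""
    X = np.array([znorm(s) for s in series])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each region to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Recompute centers as the mean of their assigned series.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy example: two regions with rising trends, two with falling trends.
rising  = [np.arange(4.) * s for s in (1., 2.)]
falling = [np.arange(4., 0., -1.) * s for s in (1., 2.)]
labels = kmeans_series(np.array(rising + falling), k=2)
print(labels)  # rising regions share one label, falling regions the other
```

Because the series are z-normalized first, regions with proportionally similar trends land in the same cluster regardless of their absolute case counts, which is what lets small counties borrow training signal from similar peers.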
VI Conclusion
In this work, we developed an ensemble framework that combines multiple RNN-based deep learning models using multiple data sources for COVID-19 forecasting. The multiple data sources enable better forecasting performance. To mitigate the overfitting likely to result from the noisy and small training datasets, we proposed a clustering-based training method to further improve DNN model performance. We trained stacking ensembles to combine individual deep learning models with simple architectures. We show that the ensemble in general performs best among the baseline individual models for high-resolution and long-term forecasting, such as at the US state and county levels; ensembles thus play a very important role in improving model performance for COVID-19 forecasting. A comprehensive comparison between SEIR-based, DNN-based, and AR-based methods was conducted. In the context of COVID-19, our experimental results show that different models are likely to perform best on different patterns of time series. Despite the lack of sufficient training data, DNN-based methods can capture the dynamics well and show strong generalization ability for high-resolution forecasting, as opposed to the SEIR and Naive methods. Among the DNN-based models, spatiotemporal models are more prone to overfitting in high-resolution forecasting due to their high model complexity.
Footnotes
 Source: https://covid19.who.int/ as of August 26, 2020.
 Source: https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html as of August 10, 2020.