
# Hierarchical (Deep) Echo State Networks with Uncertainty Quantification for Spatio-Temporal Forecasting

Patrick L. McDermott and Christopher K. Wikle
Department of Statistics, University of Missouri, Columbia, MO 65211 USA
plmyt7@gmail.com (corresponding author)
###### Abstract

Long-lead forecasting for spatio-temporal problems can often entail complex nonlinear dynamics that are difficult to specify a priori. Current statistical methodologies for modeling these processes are often overparameterized and thus computationally burdensome. One potential parsimonious solution to this problem is a method from the dynamical systems and engineering literature referred to as an echo state network (ESN). ESN models use so-called reservoir computing to efficiently estimate a dynamical neural network forecast model referred to as a recurrent neural network (RNN). Moreover, so-called “deep” models have recently been shown to be successful at predicting high-dimensional complex nonlinear processes. These same traits characterize many spatio-temporal processes. Here we introduce a deep ensemble ESN (D-EESN) model. Through the use of an ensemble framework, this model is able to generate forecasts that are accompanied by uncertainty estimates. After introducing the D-EESN, we then develop a hierarchical Bayesian implementation. We use a general hierarchical Bayesian framework that accommodates non-Gaussian data types and multiple levels of uncertainty. The proposed methodology is applied first to a data set simulated from a novel non-Gaussian multiscale Lorenz-96 dynamical system and then to a long-lead United States (U.S.) soil moisture forecasting application.

Keywords: Long-lead forecasting, deep modeling, echo state networks, hierarchical Bayesian, spatio-temporal

## 1 Introduction

Spatio-temporal data are ubiquitous in engineering and the sciences, and their study is important for understanding and predicting a wide variety of processes. One of the chief difficulties in modeling spatial processes that change with time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex datasets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamical spatio-temporal models that are simultaneously useful scientifically (e.g., long-lead forecasting as discussed below) and efficient computationally. Statisticians have developed some “deep” mechanistically-motivated hierarchical models that can accommodate process complexity as well as the uncertainties in predictions and inference (see the overview in Cressie and Wikle, 2011). However, these models can be expensive and are typically application specific. On the other hand, the science, engineering, and machine learning communities have developed alternative approaches for nonlinear spatio-temporal modeling, in a few cases with fairly parsimonious parameterizations. These approaches can be very flexible and sometimes can be implemented quite efficiently, but typically without formal uncertainty quantification. Here, we present a hierarchical deep statistical implementation of so-called echo state network (ESN) models for spatio-temporal processes with the goal of long-lead forecasting with uncertainty quantification.

The atmosphere is a chaotic dynamical system. Because of that, skillful weather forecasts are only possible out to about 10-14 days (e.g., Stern and Davidson, 2015). However, dynamical processes in the ocean operate on much longer time scales, and many atmospheric processes depend crucially on the ocean as a forcing mechanism. This coupling between the slowly varying ocean and the faster varying atmosphere (and associated processes) allows for the skillful prediction of some general properties of the atmospheric state many months to over a year in advance (i.e., long-lead forecasting). Although statistical models are not as skillful as deterministic numerical weather prediction models for short-to-medium-range weather forecasting, they have consistently performed as well as or better than deterministic models for long-lead forecasting (e.g., Barnston et al., 1999; Jan van Oldenborgh et al., 2005). In some cases, fairly standard linear regression or multivariate canonical correlation analysis methods can be used to generate effective long-lead forecasts (e.g., Penland and Magorian, 1993; Knaff and Landsea, 1997). However, given the inherent nonlinearity of these systems, it has consistently been shown that well-crafted nonlinear statistical methods often perform better than linear methods, at least for some spatial regions and time spans (e.g., Drosdowsky, 1994; Tangang et al., 1998; Tang et al., 2000; Berliner et al., 2000; Timmermann et al., 2001; Kondrashov et al., 2005; Kravtsov et al., 2005; Wikle and Hooten, 2010; Gladish and Wikle, 2014; Richardson, 2017). It remains an active area of research to develop nonlinear statistical models for long-lead forecasting, and there is a need for methods that are computationally efficient, skillful, and able to provide realistic uncertainty quantification.

Statistical approaches for nonlinear dynamical spatio-temporal models (DSTMs) have focused on accommodating the quadratic nonlinearity that is present in many mechanistic models of such systems (Kravtsov et al., 2005; Wikle and Hooten, 2010; Richardson, 2017). These models, at least when implemented in a way that fully accounts for uncertainty in data, process, and parameters, can be quite computationally challenging, mainly due to the very large number of parameters that must be estimated. Solutions to this challenge require reducing the dimension of the state-space, regularizing the parameter space, the incorporation of additional information (prior knowledge), and novel computational approaches (see the summary in Wikle, 2015). Parsimonious alternatives include analog methods (e.g., McDermott and Wikle, 2016; Zhao and Giannakis, 2016), individual (agent) based models (Hooten and Wikle, 2010), and increasingly, machine learning methods such as recurrent neural networks (RNNs).

RNNs are a type of artificial neural network, originally developed in the 1980s (Hopfield, 1982), that, unlike traditional feedforward neural networks, include “memory” and allow cycles that can process sequences in their hidden layers. That is, unlike feedforward neural networks, RNNs explicitly account for the dynamical structure of the data. This has made them ideal for applications in natural language processing, speech recognition, and image captioning. These methods have not been used extensively for spatio-temporal prediction, although there are notable exceptions (Dixon et al., 2017; McDermott and Wikle, 2017a). Like the quadratic nonlinear spatio-temporal DSTMs in statistics, these models have a very large number of parameters (weights), can be quite difficult to tune and train, and are computationally intensive. One way to get the advantages of an RNN within a more parsimonious parameter estimation context is through the use of so-called “reservoir computing” methods, the most common of which is the ESN (Jaeger, 2001). In this case the hidden states and inputs evolve in a “dynamical reservoir” in which the parameters (weights) that describe their evolution are drawn at random, with most (e.g., 90% or more) assumed to be zero. Then, the only parameters that are estimated are the output parameters (weights) that connect the hidden states to the output response. For example, in the context of continuous spatio-temporal responses, this is just a regression estimation problem (usually with a regularization penalty; e.g., ridge regression). See Section 3.1 for a model representation of the ESN.

Historically, the reservoir parameters in the ESN are chosen just once, with a fairly large hidden-state dimension. Although this often leads to good predictions, it provides no opportunity for uncertainty quantification. An alternative is to perform parametric bootstrap or ensemble estimation, in which multiple reservoir samples are drawn (e.g., Sheng et al., 2013; McDermott and Wikle, 2017b). This provides a measure of uncertainty quantification and allows one to choose smaller hidden-state dimensions, essentially building an ensemble of weak learners analogous to many methods in machine learning (e.g., Friedman et al., 2001). There have also been Bayesian implementations of the ESN model (e.g., Li et al., 2012; Chatzis, 2015), but none of these have been implemented with traditional Markov chain Monte Carlo (MCMC) estimation methods, as is the case here, where multiple levels of uncertainty can be accounted for. In the context of spatio-temporal forecasting, McDermott and Wikle (2017a, b) have shown that augmenting the traditional ESN with quadratic output terms (analogous to the quadratic nonlinear component in statistical DSTMs) and input embeddings (e.g., including lags of the input variables, as motivated by the Takens (1981) representation in dynamical systems) can improve forecast accuracy compared to traditional ESN models.

“Deep” models have recently shown great success in many neural network applications in machine learning (e.g., Krizhevsky et al., 2012; Goodfellow et al., 2016). As mentioned above, deep (hierarchical) statistical models have also been shown to be effective in complex spatio-temporal settings. There are challenges in training such models given the very large number of parameters (weights), so it can be advantageous to consider deep models that are also relatively parsimonious in their parameter space. It is then natural to explore the potential for deep or hierarchical ESN models. The purpose of adding additional layers in the ESN framework is to model additional temporal scales (Jaeger, 2007a). That is, deep ESN models provide a greater level of flexibility by allowing individual layers to potentially represent different time scales. The model presented here attempts to exploit the multiscale nature of deep ESN models for spatio-temporal forecasting.

Deep ESN models have been considered in the engineering literature (Jaeger, 2007a; Triefenbach et al., 2013; Antonelo et al., 2017; Ma et al., 2017), yet none of these approaches accommodate uncertainty quantification. Furthermore, these methods do not include a spatial component nor are they applied to spatio-temporal systems. Here we develop a hierarchical ESN approach for spatio-temporal prediction that explicitly accounts for uncertainty quantification. We first present an ensemble approach, extending the work of McDermott and Wikle (2017b) to the deep setting, and then consider a Bayesian implementation of the deep ESN model that more rigorously accounts for observation model uncertainty.

Section 2 describes the motivating spatio-temporal long-lead forecasting problem, in this case using Pacific sea surface temperature (SST) anomalies to forecast soil moisture anomalies over the US “corn belt” 6 months in the future. Section 3 describes the hierarchical ESN methodology, first from the ensemble perspective and then the Bayesian perspective. Section 4 presents a simulation study using a non-Gaussian multiscale Lorenz-96 system and then presents the soil moisture forecasting example. We conclude with a brief discussion in Section 5.

## 2 Motivating Problem: Long-Lead Forecasting

In the context of atmospheric and oceanic processes, long-lead forecasting refers to forecasting atmospheric, oceanic or related variables on monthly to yearly time scales. Although it is fundamentally impossible to generate skillful meteorological forecasts of atmospheric processes on time horizons of greater than about ten days to two weeks, the ocean operates on much longer time scales than the atmosphere and provides a significant amount of the forcing of atmospheric variability. Thus, this linkage between the ocean and the atmosphere can lead to skillful long-lead “forecasts” of the atmospheric state, or other processes linked to the atmospheric state (e.g., soil moisture) on time scales of months to two years or so (e.g., Philander, 1990). In this case, one cannot typically generate skillful meteorological point forecasts, but can provide general distributional forecasts that are skillful relative to naïve models such as climatology (long-term averages) or persistence (assuming current conditions persist into the future).

Historically, successful long-lead forecasting applications are typically tied to the ocean El Niño/Southern Oscillation (ENSO) phenomenon, which shows quasi-periodic variability between warmer than normal ocean states in the central and eastern tropical Pacific ocean (El Niño) and colder than normal ocean states in the central tropical Pacific (La Niña). The ENSO phenomenon accounts for the largest amount of variability in the tropical Pacific ocean and leads to worldwide impacts due to atmospheric “teleconnections” (i.e., the shifting of the warm pools in the tropical Pacific also shifts the convective clusters of precipitation that drive wave trains, which in turn influence the atmospheric circulation; e.g., locations of jet streams). These atmospheric circulation changes then affect temperature, precipitation, and many responses to those variables such as habitat conditions for ecological processes, soil moisture, severe weather, etc., as described in Philander (1990). It has been demonstrated for over two decades that skillful long-lead forecasts of sea surface temperature (SST) in the tropical Pacific (and, hence, ENSO) are possible with both deterministic and statistical models. In fact, this is one of the aforementioned situations in the atmospheric and ocean sciences where statistical forecasting is as good as, or better than, deterministic models (e.g., Barnston et al., 1999; Jan van Oldenborgh et al., 2005).

There are essentially two general approaches to the long-lead forecasting of a response to SST forcing. One approach is to generate a long-lead forecast of SST (e.g., a 6 month forecast) and then use contemporaneous relationships between the ocean state and the response of interest (e.g., say midwest soil moisture) to generate a long-lead forecast of that response. This typically requires a dynamical forecast of SST and then some regression or classification model from SST to the outcome of interest. The alternative is to model the relationship between SST and the future response at a chosen lead time (e.g., forecasting midwest soil moisture in May given SST in November). This is typically a spatio-temporal regression, where one is predicting a spatial field in the future given a spatial field at the current time. In either case, linear models have been shown to perform reasonably well in these situations (e.g., Van den Dool et al., 2003), but it is typically the case that models that include nonlinear interactions can perform more skillfully, and can produce more realistic long-lead forecast distributions (e.g., Sheffield et al., 2004; Fischer et al., 2007; McDermott and Wikle, 2016).

### 2.1 Long Lead Forecasting of Soil Moisture in the Midwest U.S. Corn Belt

Soil moisture is fundamentally important to many processes (e.g., agricultural production, hydrological runoff). In particular, the amount of soil moisture available to crops such as wheat and corn at certain critical phases of their growth cycle can make a significant impact on yield (Carleton et al., 2008). Thus, having a long-lead understanding of the plausible distribution of soil moisture over an expansive area of agricultural production can assist producers by suggesting optimal management approaches (e.g., timing of planting, nutrient supplementation, and irrigation). Given the aforementioned links between tropical Pacific SST and North American weather patterns, it is not surprising that skillful long-lead forecasts of soil moisture in major production areas of the U.S. are possible (e.g., Van den Dool et al., 2003). Indeed, the U.S. National Oceanic and Atmospheric Administration (NOAA) and National Center for Environmental Prediction (NCEP) routinely provide soil moisture outlooks (“forecasts”) that are based on a combination of deterministic and statistical (constructed analog) models (e.g., see http://www.cpc.ncep.noaa.gov/soilmst/index_jh.html), although for shorter lead times than of interest here. Recently, McDermott and Wikle (2016) showed that a Bayesian analog forecasting model could provide skillful forecasts of soil moisture anomalies over the state of Iowa in the U.S. at lead times up to 6 months.

The application here will consider the problem of forecasting soil moisture over the midwest U.S. corn belt in May given data from the previous November (i.e., a 6 month lead time). We consider May soil moisture because it is an important time period for planting corn in this region (i.e., Blackmer et al., 1989). This application corresponds to the long-lead spatio-temporal field regression approach to forecasting described above – that is, we regress a tropical Pacific SST field in November onto the soil moisture field the following May. The details associated with the data and model used for this example are given in Section 4.3.

## 3 Methodology

Suppose we are interested in forecasting the spatio-temporal process $\{Z_t(\mathbf{s}_i)\}$ for time periods $t = 1, \ldots, T$ at a discrete set of spatial locations $\{\mathbf{s}_i : i = 1, \ldots, n_y\}$. Using a chosen linear dimension reduction method, the $n_y$-dimensional data vector $\mathbf{Z}_t$ can be decomposed such that $\mathbf{Z}_t \approx \boldsymbol{\Phi}\boldsymbol{\alpha}_t$, where $\boldsymbol{\Phi}$ is an $n_y \times n_b$ matrix of spatial basis functions and $\boldsymbol{\alpha}_t$ is an $n_b$-dimensional vector of basis coefficients, indexed by time (e.g., Cressie and Wikle, 2011). Next, assume $\mathbf{x}_t$ is an $n_x$-dimensional vector of input variables that correspond to a discrete set of spatial locations, but more generally may consist of input covariates that are not indexed spatially.
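As one concrete instance of this decomposition, the basis functions can be obtained by principal components (empirical orthogonal functions). The sketch below is illustrative only and in Python/NumPy rather than the paper's R tooling; the data matrix is synthetic, and the names `Z`, `Phi`, `alpha`, `n_y`, `n_b` simply mirror the text's notation.

```python
# Illustrative sketch: spatial basis functions and coefficients via PCA
# (empirical orthogonal functions). The data matrix is synthetic.
import numpy as np

rng = np.random.default_rng(0)
T, n_y, n_b = 200, 50, 5              # times, spatial locations, basis dim

Z = rng.standard_normal((T, n_y))     # row t holds the field Z_t
Z_c = Z - Z.mean(axis=0)              # center over time

# SVD of the centered data gives the leading spatial basis functions
_, _, Vt = np.linalg.svd(Z_c, full_matrices=False)
Phi = Vt[:n_b].T                      # n_y x n_b basis-function matrix
alpha = Z_c @ Phi                     # row t holds the coefficients alpha_t

Z_hat = alpha @ Phi.T                 # truncated expansion Z_t ~ Phi alpha_t
```

The truncated expansion recovers most of the variance with far fewer coefficients ($n_b \ll n_y$), which is what makes the subsequent output-stage regression tractable.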

Below we provide a brief overview of the ensemble ESN model considered in McDermott and Wikle (2017b), followed by the multi-level (deep) extension that we call a deep ensemble echo state network (D-EESN) model. We then describe a Bayesian version of the D-EESN model that we label the BD-EESN.

### 3.1 Basic ESN Background

The quadratic ESN model outlined in McDermott and Wikle (2017b) is given by:

$$\textbf{Data stage:}\quad \mathbf{Z}_t \approx \boldsymbol{\Phi}\boldsymbol{\alpha}_t \tag{1}$$

$$\textbf{Output stage:}\quad \boldsymbol{\alpha}_t = \mathbf{V}_1\mathbf{h}_t + \mathbf{V}_2\mathbf{h}_t^2 + \boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathrm{Gau}(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I}) \tag{2}$$

$$\textbf{Hidden stage:}\quad \mathbf{h}_t = g_h\!\left(\frac{\nu}{|\lambda_W|}\,\mathbf{W}\mathbf{h}_{t-1} + \mathbf{U}\tilde{\mathbf{x}}_t\right), \tag{3}$$

where $\boldsymbol{\Phi}$ is an $n_y \times n_b$ matrix of spatial basis functions, $\boldsymbol{\alpha}_t$ is an $n_b$-vector that contains the associated basis expansion coefficients (where $n_b \ll n_y$), $\mathbf{h}_t$ is an $n_h$-dimensional vector of hidden units, $g_h(\cdot)$ is a nonlinear activation function (e.g., a sigmoidal function such as the hyperbolic tangent), $\lambda_W$ is the “spectral radius” (largest absolute eigenvalue) of $\mathbf{W}$, $\nu$ is a scaling parameter, and $\mathbf{W}$, $\mathbf{U}$, $\mathbf{V}_1$, $\mathbf{V}_2$ are weight (parameter) matrices of dimension $n_h \times n_h$, $n_h \times (m+1)n_x$, $n_b \times n_h$, and $n_b \times n_h$, respectively (defined below). Crucially, the square parameter matrix $\mathbf{W}$ can be thought of similarly to a transition matrix in a vector autoregressive (VAR) model in that it can capture transition dynamic interactions between various inputs. The scaling parameter $\nu$ helps control the amount of memory in the system and is restricted to be less than one for stability purposes (Jaeger, 2007b). In addition, $\tilde{\mathbf{x}}_t$ is an $(m+1)n_x$-vector of embeddings (lagged input values) given by:

$$\tilde{\mathbf{x}}_t = \left[\mathbf{x}'_t, \mathbf{x}'_{t-\tau}, \mathbf{x}'_{t-2\tau}, \ldots, \mathbf{x}'_{t-m\tau}\right]'. \tag{4}$$
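Constructing the embedding vector of (4) amounts to stacking lagged copies of the input. A minimal sketch, assuming the inputs are stored as rows of a `T × n_x` array; the values of `tau` and `m` below are illustrative.

```python
# Sketch: build the embedded input vector of Eq. (4) from lagged inputs.
import numpy as np

def embed(X, t, tau, m):
    """Stack x_t, x_{t-tau}, ..., x_{t-m*tau} into one vector.
    X is a (T x n_x) array of input vectors; t is 0-based and must
    satisfy t >= m * tau so all lags exist."""
    lags = [X[t - j * tau] for j in range(m + 1)]
    return np.concatenate(lags)

X = np.arange(20.0).reshape(10, 2)   # toy inputs: T = 10, n_x = 2
x_tilde = embed(X, t=6, tau=2, m=3)  # uses x_6, x_4, x_2, x_0
```

The resulting vector has length $(m+1)n_x$, matching the column dimension of $\mathbf{U}$ in (3).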

Critically, only the weight matrices $\mathbf{V}_1$ and $\mathbf{V}_2$ in the output stage (2) are estimated (usually with a ridge penalty). These weight matrices can be thought of similarly to regression parameters that weight the hidden units appropriately. In contrast, the “reservoir weights” $\mathbf{W}$ and $\mathbf{U}$ in (3) are drawn from the following distributions:

$$\mathbf{W} = [w_{k,q}]_{k,q}: \quad w_{k,q} = \gamma^{w}_{k,q}\,\mathrm{Unif}(-a_w, a_w) + (1-\gamma^{w}_{k,q})\,\delta_0, \tag{5}$$

$$\mathbf{U} = [u_{k,r}]_{k,r}: \quad u_{k,r} = \gamma^{u}_{k,r}\,\mathrm{Unif}(-a_u, a_u) + (1-\gamma^{u}_{k,r})\,\delta_0, \tag{6}$$

$$\gamma^{w}_{k,q} \sim \mathrm{Bern}(\pi_w), \qquad \gamma^{u}_{k,r} \sim \mathrm{Bern}(\pi_u), \tag{7}$$

where $\gamma^{w}_{k,q}$ and $\gamma^{u}_{k,r}$ denote indicator variables, $\delta_0$ denotes a Dirac function at zero, and $\pi_w$ (and $\pi_u$) can be thought of as the probability of including a particular weight (parameter) in the model. As is common in the machine learning literature, both $a_w$ and $a_u$ are set to small values to help prevent overfitting. Similarly, the parameters $\pi_w$ and $\pi_u$ are set to small values to create a sparse network. Both the sparseness and the randomness act as a regularization mechanism, which prevents the ESN model from overfitting to in-sample data. In summary, note that the hidden (reservoir) stage in (3) corresponds to a nonlinear stochastic transformation of the input vectors; these hidden states are then regressed, with regularization, onto the output vectors in the output stage (i.e., (2) above), which are then transformed back to the data scale in (1).
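The reservoir stage can be sketched as follows. All dimensions and hyper-parameter values (`a_w`, `a_u`, `pi_w`, `pi_u`, `nu`) are illustrative assumptions, not values from the paper, and Python/NumPy stands in for the paper's R tooling.

```python
# Sketch of the reservoir stage (3) with sparse random weights as in (5)-(7).
import numpy as np

rng = np.random.default_rng(1)
n_h, n_in = 100, 8        # hidden dim, embedded-input dim (illustrative)
a_w = a_u = 0.1           # support of the Unif(-a, a) component
pi_w = pi_u = 0.1         # inclusion probabilities (sparse reservoir)
nu = 0.35                 # memory/scaling parameter, less than one

def draw_sparse(n_rows, n_cols, a, pi):
    """Each entry is Unif(-a, a) with probability pi, else exactly zero."""
    gamma = rng.random((n_rows, n_cols)) < pi        # Bern(pi) indicators
    vals = rng.uniform(-a, a, (n_rows, n_cols))
    return np.where(gamma, vals, 0.0)

W = draw_sparse(n_h, n_h, a_w, pi_w)
U = draw_sparse(n_h, n_in, a_u, pi_u)
# spectral radius |lambda_W| (guarded against a degenerate all-zero draw)
lam_W = max(np.max(np.abs(np.linalg.eigvals(W))), 1e-8)

def hidden_step(h_prev, x_tilde):
    """One application of Eq. (3) with g_h = tanh."""
    return np.tanh((nu / lam_W) * W @ h_prev + U @ x_tilde)

h = np.zeros(n_h)
for t in range(50):                                  # run the reservoir
    h = hidden_step(h, rng.standard_normal(n_in))
```

Scaling by `nu / lam_W` fixes the effective spectral radius of the hidden-state transition at `nu`, which is what keeps the reservoir dynamics stable.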

Traditional ESN models (e.g., Jaeger, 2007b; Lukoševičius and Jaeger, 2009) do not typically include the quadratic ($\mathbf{h}^2_t$) term in the output stage, nor do they include the embeddings in the input vector, but McDermott and Wikle (2017b) found that both can be helpful when forecasting many spatio-temporal processes. In addition, traditional ESN applications typically include a “leaking rate” parameter that corresponds to a convex combination of the previous hidden state and the current reservoir estimate (e.g., Lukoševičius, 2012). We have not found this to be as useful for long-lead spatio-temporal prediction as it has been in other applications, and so omit it in our presentation for notational simplicity.

To accommodate uncertainty quantification, McDermott and Wikle (2017b) were motivated by parametric bootstrap prediction methods (e.g., Genest and Rémillard, 2008) to consider an ensemble of predictions from the quadratic ESN model. In particular, Algorithm 1 of McDermott and Wikle (2017b) provides a simple procedure that generates new forecast realizations of the response by implementing the reservoir model in (3) to obtain the hidden states in (2) through the use of simulated (sample) weight matrices from (5) and (6). These samples can then be used in a Monte Carlo sense to obtain features of the predictive distribution (e.g., means, variances, etc.). This quadratic-ensemble ESN (Q-EESN) model was shown to be quite effective in capturing the uncertainty associated with long-lead forecasting of tropical Pacific SST.
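The ensemble idea can be sketched compactly: repeatedly draw reservoir weights, fit the output stage by ridge regression, and summarize the collection of forecasts. This is a minimal sketch in that spirit, not the paper's Algorithm 1 verbatim; the data are synthetic and all hyper-parameter values are illustrative.

```python
# Sketch of an ensemble (bootstrap) ESN: each member uses its own random
# reservoir; the spread of member forecasts measures forecast uncertainty.
import numpy as np

rng = np.random.default_rng(2)
T, n_in, n_h, n_b = 120, 3, 60, 2
ridge = 0.1
X = rng.standard_normal((T, n_in))                   # toy inputs
A = np.cumsum(rng.standard_normal((T, n_b)), 0) / 10 # toy coefficients alpha_t

def run_reservoir(X, n_h, nu=0.35, a=0.1, pi=0.1):
    """Draw sparse W, U as in (5)-(7) and run the hidden stage (3)."""
    W = np.where(rng.random((n_h, n_h)) < pi, rng.uniform(-a, a, (n_h, n_h)), 0)
    U = np.where(rng.random((n_h, X.shape[1])) < pi,
                 rng.uniform(-a, a, (n_h, X.shape[1])), 0)
    lam = max(np.max(np.abs(np.linalg.eigvals(W))), 1e-8)
    H = np.zeros((X.shape[0], n_h))
    for t in range(1, X.shape[0]):
        H[t] = np.tanh((nu / lam) * W @ H[t - 1] + U @ X[t])
    return H

forecasts = []
for _ in range(30):                                  # ensemble members
    H = run_reservoir(X, n_h)
    # ridge-regression output weights fit on t = 1, ..., T-1
    V = np.linalg.solve(H[:-1].T @ H[:-1] + ridge * np.eye(n_h),
                        H[:-1].T @ A[:-1])
    forecasts.append(H[-1] @ V)                      # prediction at time T

forecasts = np.array(forecasts)
mean_fc, sd_fc = forecasts.mean(0), forecasts.std(0) # predictive summaries
```

The Monte Carlo spread across members (`sd_fc`) is what supplies the uncertainty estimate that a single fixed reservoir cannot.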

### 3.2 Deep Ensemble ESN (D-EESN)

The Q-EESN model described in Section 3.1 has multiple levels, but is not a “deep” model in the sense that it has no mechanism to link hidden layers, which might be important for processes that occur on multiple time scales. To accommodate such structure, we extend some of the deep ESN model components developed in Ma et al. (2017) and Antonelo et al. (2017) to a spatio-temporal ensemble framework in the following deep EESN (D-EESN) model. In particular, the D-EESN model with hidden layers indexed by $\ell = 1, \ldots, L$ is defined as follows for time period $t$:

$$\textbf{Data stage:}\quad \mathbf{Z}_t \approx \boldsymbol{\Phi}\boldsymbol{\alpha}_t \tag{8}$$

$$\textbf{Output stage:}\quad \boldsymbol{\alpha}_t = \mathbf{V}_1\mathbf{h}_{t,1} + \sum_{\ell=2}^{L}\mathbf{V}_\ell\, g_h(\tilde{\mathbf{h}}_{t,\ell}) + \boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathrm{Gau}(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I}), \quad \text{s.t. } \mathbf{v}'_{\ell,b}\mathbf{v}_{\ell,b} \le c_v, \tag{9}$$

$$\textbf{Hidden stage } \ell\textbf{:}\quad \mathbf{h}_{t,\ell} = g_h\!\left(\frac{\nu_\ell}{|\lambda_{W^{(\ell)}}|}\,\mathbf{W}^{(\ell)}\mathbf{h}_{t-1,\ell} + \mathbf{U}^{(\ell)}\tilde{\mathbf{h}}_{t,\ell+1}\right), \quad \ell = 1, \ldots, L-1, \tag{10}$$

$$\textbf{Reduction stage } \ell+1\textbf{:}\quad \tilde{\mathbf{h}}_{t,\ell+1} \equiv \mathcal{Q}\big(\mathbf{h}_{t,\ell+1}\big), \quad \ell = 1, \ldots, L-1, \tag{11}$$

$$\textbf{Hidden stage } L\textbf{:}\quad \mathbf{h}_{t,L} = g_h\!\left(\frac{\nu_L}{|\lambda_{W^{(L)}}|}\,\mathbf{W}^{(L)}\mathbf{h}_{t-1,L} + \mathbf{U}^{(L)}\tilde{\mathbf{x}}_t\right), \tag{12}$$

where $\mathbf{h}_{t,\ell}$ is an $n_{h,\ell}$-dimensional vector of hidden units for the $\ell$th hidden layer and each parameter matrix $\mathbf{V}_\ell$ is defined as:

$$\mathbf{V}_\ell \equiv \begin{bmatrix} \mathbf{v}'_{\ell,1} \\ \vdots \\ \mathbf{v}'_{\ell,n_b} \end{bmatrix}. \tag{13}$$

The function $\mathcal{Q}(\cdot)$ in (11) denotes a dimension reduction mapping (see below for specific examples) that takes each $n_{h,\ell+1}$-dimensional hidden-state vector to a reduced vector of dimension $\tilde{n}_{h,\ell+1} < n_{h,\ell+1}$. More concisely, this transformation is applied to the matrix of hidden states stacked over time, resulting in a transformed matrix, and then for each time period the appropriate $\tilde{\mathbf{h}}_{t,\ell+1}$ is extracted from the transformed matrix to construct (10). Through the inclusion of the $\mathbf{V}_\ell$ terms in (9), the model can potentially weight layers lower down in the hierarchy that may represent different time scales or features (i.e., Ma et al., 2017). Note that the activation function $g_h(\cdot)$ is included in (9) to place the dimension-reduced variables on a similar scale as the $\mathbf{h}_{t,1}$ variables (i.e., the hidden variables that have not gone through the dimension reduction transformation).
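A sketch of the reduction stage with PCA as the choice of $\mathcal{Q}(\cdot)$ (one of the options named later in the text): the transformation is computed once on the time-stacked hidden-state matrix, and each row of the result is the reduced vector for that time period. Dimensions are illustrative.

```python
# Sketch of reduction stage (11): Q applied to hidden states stacked over
# time; row t of the result is the reduced vector for time t.
import numpy as np

rng = np.random.default_rng(3)
T, n_h, n_tilde = 100, 80, 10        # times, hidden dim, reduced dim

H = rng.standard_normal((T, n_h))    # stand-in for the hidden states h_{t,l+1}
H_c = H - H.mean(axis=0)

# Q: project onto the leading n_tilde principal components
_, _, Vt = np.linalg.svd(H_c, full_matrices=False)
H_tilde = H_c @ Vt[:n_tilde].T       # T x n_tilde; row t is the reduced state
```

Because the projection is fit on the whole hidden-state matrix, the reduced coordinates are ordered by explained variance, so the leading columns carry the dominant reservoir features down to the next layer.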

For a given hidden layer $\ell$, $\lambda_{W^{(\ell)}}$ denotes the largest absolute eigenvalue of the square matrix $\mathbf{W}^{(\ell)}$, where, as in the basic ESN model above, each element of $\mathbf{W}^{(\ell)}$ is drawn from the following distribution:

$$w^{(\ell)}_{k_\ell,q_\ell} = \gamma^{w_\ell}_{k_\ell,q_\ell}\,\mathrm{Unif}(-a_{w_\ell}, a_{w_\ell}) + (1-\gamma^{w_\ell}_{k_\ell,q_\ell})\,\delta_0, \qquad \gamma^{w_\ell}_{k_\ell,q_\ell} \sim \mathrm{Bern}(\pi_{w_\ell}), \tag{14}$$

and each element of the parameter matrix $\mathbf{U}^{(\ell)}$ is drawn from:

$$u^{(\ell)}_{k_\ell,r_\ell} = \gamma^{u_\ell}_{k_\ell,r_\ell}\,\mathrm{Unif}(-a_{u_\ell}, a_{u_\ell}) + (1-\gamma^{u_\ell}_{k_\ell,r_\ell})\,\delta_0, \qquad \gamma^{u_\ell}_{k_\ell,r_\ell} \sim \mathrm{Bern}(\pi_{u_\ell}). \tag{15}$$

The embedded input vector $\tilde{\mathbf{x}}_t$ is defined as in (4) above. Estimation of the $\mathbf{V}_\ell$ is carried out through the use of ridge regression with ridge hyper-parameter $r_v$, where there is a one-to-one relationship between $r_v$ and the constant $c_v$ in (9). Note that, for simplicity, we do not include a quadratic output term in this model as we did in (2), but this could easily be added if the application warrants it (neither of the applications presented here benefited from the addition of quadratic output terms in the deep model setting).

The bootstrap ensemble prediction for the D-EESN model is presented in Algorithm 1 below. In particular, Algorithm 1 starts by drawing $\mathbf{W}^{(\ell)}$ and $\mathbf{U}^{(\ell)}$ for every layer in the D-EESN model. Next, these parameters are plugged in sequentially, starting with the input layer defined in (12) and ending with the hidden layer that comes directly before the output stage (i.e., (10) for $\ell = 1$). Finally, with all of the hidden states calculated, ridge regression is used to estimate the regression parameters in (9).
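A single forward pass of this kind can be sketched as follows, again with synthetic data, PCA as the reduction map, and illustrative dimensions and hyper-parameters; it shows one ensemble member's computation, from the input layer down to the ridge-regression output stage.

```python
# Sketch of one D-EESN forward pass: reservoirs run from the input layer L
# down to layer 1, with a PCA reduction between layers; the output stage is
# then fit by ridge regression on the assembled hidden states.
import numpy as np

rng = np.random.default_rng(4)
T, n_in, n_b, L = 150, 4, 3, 3
n_h, n_tilde, nu, a, pi, ridge = 50, 8, 0.35, 0.1, 0.1, 0.1

X = rng.standard_normal((T, n_in))       # toy inputs
A = rng.standard_normal((T, n_b))        # toy targets alpha_t

def reservoir(inputs, n_h):
    """Sparse random reservoir as in (14)-(15), run forward in time."""
    n_r = inputs.shape[1]
    W = np.where(rng.random((n_h, n_h)) < pi, rng.uniform(-a, a, (n_h, n_h)), 0)
    U = np.where(rng.random((n_h, n_r)) < pi, rng.uniform(-a, a, (n_h, n_r)), 0)
    lam = max(np.max(np.abs(np.linalg.eigvals(W))), 1e-8)
    H = np.zeros((T, n_h))
    for t in range(1, T):
        H[t] = np.tanh((nu / lam) * W @ H[t - 1] + U @ inputs[t])
    return H

def reduce_dim(H, k):                    # Q: PCA projection
    Hc = H - H.mean(0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:k].T

inputs, design = X, []
for lev in range(L, 0, -1):              # layers L, L-1, ..., 1
    H = reservoir(inputs, n_h)
    if lev > 1:                          # reduce, pass down, keep g_h(h~)
        inputs = reduce_dim(H, n_tilde)  # feeds the next reservoir
        design.append(np.tanh(inputs))   # and enters the output stage
    else:
        design.append(H)                 # h_{t,1} enters directly

D = np.hstack(design)                    # output-stage covariates
V = np.linalg.solve(D.T @ D + ridge * np.eye(D.shape[1]), D.T @ A)
alpha_hat = D @ V
```

Repeating this whole pass with fresh reservoir draws, as in Algorithm 1, yields the ensemble of forecasts.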

The reservoir hyperparameters are chosen by cross-validation driven by a genetic algorithm (see Section 4.1 for further details). As in McDermott and Wikle (2017b), we found that fixing these parameters, as well as the number of hidden units, for all of the layers except the first was a reasonable assumption here, since the dimensions of the hidden units in all layers besides the top layer are eventually reduced using the dimension reduction transformation.

Note that this model consists of a series of linked nonlinear stochastic transformations of the input vector that are available for prediction. In addition, the data reduction steps act similarly to the convolution step in a convolutional neural network (Krizhevsky et al., 2012): they simultaneously reduce the dimension of the hidden layers and provide a summary of the important features. Similar to how ridge regression acts to reduce redundancy in the classic ESN model, dimension reduction serves this purpose in the deep framework. Because research on deep ESNs is limited, the question of which dimension reduction method to use for $\mathcal{Q}(\cdot)$ in (11) remains open. Here, principal component analysis (PCA), Fourier basis functions, and Laplacian eigenmaps (Belkin and Niyogi, 2001) were explored as possible choices for $\mathcal{Q}(\cdot)$. Although these methods were selected to represent both linear and nonlinear dimension reduction techniques, there are certainly other choices that could be explored in future applications.

### 3.3 Bayesian Deep Ensemble ESN (BD-EESN)

The D-EESN model given in Section 3.2 is very efficient to implement, but at a potential cost of not fully accounting for all sources of uncertainty in estimation and prediction. In particular, the data stage given in (8) does not account for truncation error in the basis expansion, nor the uncertainty associated with the estimates of the regression or residual variance parameters in (9). This can easily be remedied via Bayesian estimation at these stages. To our knowledge, this is the first ESN model for spatio-temporal data to be implemented within a traditional hierarchical Bayesian framework.

We write this so-called Bayesian deep EESN (BD-EESN) model in a general form here so that it can be applied to both the traditional EESN and the D-EESN described above. In particular, let:

$$\textbf{Data stage:}\quad \mathbf{Z}_t \mid \boldsymbol{\alpha}_t \sim \mathcal{D}\big(\boldsymbol{\mu}(\boldsymbol{\alpha}_t), \boldsymbol{\Theta}\big), \tag{16}$$

$$\textbf{Output stage:}\quad \boldsymbol{\alpha}_t = \frac{1}{n_{res}}\sum_{j=1}^{n_{res}}\left[\boldsymbol{\beta}^{(j)}_1 \mathbf{h}^{(j)}_{t,1} + \sum_{\ell=2}^{L}\boldsymbol{\beta}^{(j)}_\ell\, g_h\big(\tilde{\mathbf{h}}^{(j)}_{t,\ell}\big)\right] + \boldsymbol{\eta}_t, \qquad \boldsymbol{\eta}_t \sim \mathrm{Gau}(\mathbf{0}, \sigma^2_{\eta}\mathbf{I}), \tag{17}$$

where $\mathcal{D}$ denotes an unspecified distribution; the specific choices of $\mathcal{D}$ and $\boldsymbol{\mu}(\cdot)$ for the two applications discussed here are application dependent and could be altered to accommodate other distributions. Here, $\boldsymbol{\Theta}$ is defined through a (known) spatial covariance matrix; see Section 4 for application-specific details. The regression matrices $\boldsymbol{\beta}^{(j)}_{\ell}$ in (17) are defined as follows for $j = 1, \ldots, n_{res}$ and $\ell = 1, \ldots, L$:

$$\boldsymbol{\beta}^{(j)}_\ell \equiv \begin{bmatrix} \boldsymbol{\beta}^{(j)\prime}_{\ell,1} \\ \vdots \\ \boldsymbol{\beta}^{(j)\prime}_{\ell,n_b} \end{bmatrix}. \tag{18}$$

In this notation, $\mathbf{h}^{(j)}_{t,1}$ is the $j$th sampled hidden-state vector from a dynamical reservoir generated as in the D-EESN model in Section 3.2; that is, we generate $n_{res}$ reservoir samples “off-line” using (10) from the D-EESN model. Similarly, the $\tilde{\mathbf{h}}^{(j)}_{t,\ell}$ are also sampled a priori and come from the dimension reduction stage(s) of the D-EESN model (i.e., (11) above). That is, both $\mathbf{h}^{(j)}_{t,1}$ and $\tilde{\mathbf{h}}^{(j)}_{t,\ell}$ are treated as fixed covariates for the BD-EESN model. So, the output model takes an ensemble approach in terms of generating a suite of nonlinearly, stochastically transformed input variables for a Bayesian regression, where the $\boldsymbol{\beta}^{(j)}_{\ell}$ are matrices of regression parameters as defined in (18). Note the obvious similarity between the process model in (9) for the D-EESN and the output stage of the BD-EESN in (17). If all of the terms for $\ell \ge 2$ are set to zero, the process model in (17) can be used with the traditional EESN model (or the Q-EESN model by simply adding quadratic terms).

This model is clearly over-parameterized, so the regression parameters in the BD-EESN model are given stochastic search variable selection (SSVS) priors (George and McCulloch, 1997). While any of the many priors from the Bayesian variable selection literature could be used here, we choose SSVS priors for their ability to produce efficient shrinkage. In particular, SSVS priors shrink a (large) percentage of parameters to zero (in a spike-and-slab implementation) or infinitesimally close to zero, while leaving the remaining variables unconstrained. Thus, the regression parameters in the BD-EESN model are given the following hierarchical prior distribution for $j = 1, \ldots, n_{res}$ and $\ell = 1, \ldots, L$:

$$\beta^{(j)}_{\ell,b,k_\ell} \mid \gamma^{\beta_\ell}_{b,k_\ell} \sim \gamma^{\beta_\ell}_{b,k_\ell}\,\mathrm{Gau}\big(0, \sigma^2_{\beta_\ell,0}\big) + \big(1-\gamma^{\beta_\ell}_{b,k_\ell}\big)\,\mathrm{Gau}\big(0, \sigma^2_{\beta_\ell,1}\big), \qquad \gamma^{\beta_\ell}_{b,k_\ell} \sim \mathrm{Bernoulli}(\pi_{\beta_\ell}), \tag{19}$$

where $k_\ell$ indexes the hidden units for a particular layer $\ell$, and $\pi_{\beta_\ell}$ can be thought of as the prior probability of including a particular variable in the model. Finishing the prior specification for the model, the variance parameter $\sigma^2_{\eta}$ is given an inverse-gamma prior. The hyper-parameters $\sigma^2_{\beta_\ell,0}$, $\sigma^2_{\beta_\ell,1}$, $\pi_{\beta_\ell}$, and the inverse-gamma parameters are problem-specific (see the examples in Section 4.1 below). Since the parameters here are given conjugate priors, the full-conditional distributions that make up the Gibbs sampler MCMC algorithm needed to implement this model are straightforward to sample from. A summary of the entire estimation procedure for the BD-EESN can be found in Algorithm 2.
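To make the Gibbs updates concrete, here is a minimal sketch of SSVS for a single Gaussian regression (a one-layer, one-ensemble-member simplification of (17) and (19)): the coefficient vector, the inclusion indicators, and the residual variance each have a standard conjugate full conditional. The simulated data, hyper-parameter values, and the single-layer simplification are all illustrative assumptions, not the paper's settings.

```python
# Sketch: Gibbs sampler for SSVS in a Gaussian linear model,
# y = D beta + eta, with a two-component normal mixture prior on each beta.
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 20
D = rng.standard_normal((n, p))
beta_true = np.r_[np.full(3, 2.0), np.zeros(p - 3)]   # 3 real signals
y = D @ beta_true + rng.standard_normal(n)

sig0_sq, sig1_sq, pi_b = 1.0, 1e-4, 0.2   # slab var, spike var, P(include)
a_eta, b_eta = 2.0, 1.0                   # inverse-gamma hyper-parameters
beta, gamma, sig_eta = np.zeros(p), np.zeros(p), 1.0

for it in range(500):
    # beta | gamma, sigma^2: Gaussian full conditional
    prior_prec = np.where(gamma == 1, 1 / sig0_sq, 1 / sig1_sq)
    Q = D.T @ D / sig_eta + np.diag(prior_prec)
    mu = np.linalg.solve(Q, D.T @ y / sig_eta)
    Lc = np.linalg.cholesky(Q)
    beta = mu + np.linalg.solve(Lc.T, rng.standard_normal(p))
    # gamma_k | beta_k: Bernoulli, odds from the two normal densities
    d0 = np.exp(-beta**2 / (2 * sig0_sq)) / np.sqrt(sig0_sq)
    d1 = np.exp(-beta**2 / (2 * sig1_sq)) / np.sqrt(sig1_sq)
    prob = pi_b * d0 / (pi_b * d0 + (1 - pi_b) * d1)
    gamma = (rng.random(p) < prob).astype(float)
    # sigma^2_eta | rest: inverse gamma
    resid = y - D @ beta
    sig_eta = 1 / rng.gamma(a_eta + n / 2, 1 / (b_eta + resid @ resid / 2))
```

After burn-in, the sampler keeps the three true signals in the slab (indicators near one) while the spurious coefficients are shrunk toward zero by the spike, which is the shrinkage behavior described in the text.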

## 4 Simulation and Motivating Examples

Here we describe the model setup used to implement the D-EESN and BD-EESN models on a complex simulated process and for the soil moisture long-lead forecasting problem that motivated these models.

### 4.1 Model Setup

The previously mentioned cross-validation for the D-EESN is carried out using a genetic algorithm (GA) (e.g., Sivanandam and Deepa, 2007) as implemented in the GA package (https://cran.r-project.org/web/packages/GA) for the R statistical computing environment (https://www.r-project.org). Unlike the basic ESN model, the number of hyper-parameters for the deep EESN model can increase quickly as the number of layers increases (e.g., with five layers and a relatively coarse search grid, the total number of parameter combinations in the search space quickly becomes very large), thus rendering grid-search approaches computationally burdensome at best (i.e., Ma et al., 2017). Through the use of a GA, the hyper-parameters that make up a deep ESN can be selected at a fraction of the computational cost. The GA was implemented with 40 generations and a population size of 20 for all of the applications presented below. This is the first implementation, to our knowledge, of either a traditional or deep ESN within an ensemble framework using a GA. The bounds of the parameter search space for each of the hyper-parameters in the D-EESN model can be found in Table 1.
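To illustrate the mechanics only (selection, crossover, mutation over real-valued hyper-parameters), here is a toy Python GA; it is not the R `GA` package's implementation, the objective is a stand-in for cross-validation error, and the two tuned quantities and their bounds are illustrative.

```python
# Toy real-valued GA: truncation selection, blend crossover, Gaussian
# mutation, searching over two reservoir hyper-parameters.
import numpy as np

rng = np.random.default_rng(6)
bounds = np.array([[0.05, 1.0],       # nu (memory scaling), illustrative
                   [0.01, 0.5]])      # pi_w (sparsity), illustrative

def cv_error(theta):                  # stand-in for CV error; lower is better
    nu, pi_w = theta
    return (nu - 0.35) ** 2 + (pi_w - 0.1) ** 2

pop = rng.uniform(bounds[:, 0], bounds[:, 1], (20, 2))   # population of 20
for gen in range(40):                                    # 40 generations
    fit = np.array([cv_error(th) for th in pop])
    parents = pop[np.argsort(fit)[:10]]                  # truncation selection
    kids = []
    for _ in range(10):
        i, j = rng.choice(10, 2, replace=False)
        w = rng.random(2)
        child = w * parents[i] + (1 - w) * parents[j]    # blend crossover
        child += rng.normal(0, 0.02, 2)                  # mutation
        kids.append(np.clip(child, bounds[:, 0], bounds[:, 1]))
    pop = np.vstack([parents, kids])                     # elitist replacement

best = pop[np.argmin([cv_error(th) for th in pop])]
```

The population size (20) and generation count (40) match the settings reported in the text; because the top parents are carried over each generation, the best solution found so far is never lost.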

All ensemble models are composed of 100 ensemble members; we did not find any of the applications to be overly sensitive to this choice. This number of members represents a compromise between computational efficiency and achieving consistent results (e.g., we found that using fewer than approximately 30 members produced unstable results, while values around 100 did not change the results significantly). Note that previous bootstrap ensemble ESN papers have generally used far fewer ensemble members (e.g., Sheng et al., 2013). For context, on a 2.3 GHz laptop the D-EESN algorithm defined in Algorithm 1 takes 4.3 and 17.5 seconds for a two-layer and seven-layer model, respectively, using 100 members with the Lorenz-96 application described in Section 4.2 below.
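As a rough illustration of how an ensemble of weak learners yields empirical prediction intervals, the sketch below bootstraps the training data, refits a simple learner for each member, and takes quantiles of the resulting forecasts. The learner here is a toy ridge-style regression standing in for a single ESN member; `ensemble_forecast` and `toy_learner` are hypothetical helpers, not the paper's implementation (which also redraws reservoir weights for each member).

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_forecast(fit_and_predict, inputs, outputs, new_input, n_ens=100):
    """Bootstrap ensemble: refit a weak learner on resampled data and
    collect its forecasts for a new input (hypothetical helper)."""
    T = inputs.shape[0]
    preds = []
    for _ in range(n_ens):
        idx = rng.integers(0, T, size=T)  # resample time indices with replacement
        preds.append(fit_and_predict(inputs[idx], outputs[idx], new_input))
    return np.array(preds)

def toy_learner(X, y, x_new, lam=1e-3):
    """Ridge-style linear fit; a stand-in for one ESN ensemble member."""
    coef = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return x_new @ coef

# Toy data with a known linear signal plus small noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
ens = ensemble_forecast(toy_learner, X, y, np.array([1.0, 0.0, 0.0]))

# Empirical 95% prediction interval from the ensemble quantiles.
lo, hi = np.quantile(ens, [0.025, 0.975])
```

The same quantile step applied at every space-time prediction point gives the point-wise prediction intervals discussed below.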

Regarding the previously discussed dimension reduction function for the reduction stage of the D-EESN model (i.e., (11) above), Fourier basis functions were selected for both models using cross-validation, although we note that the model was not very sensitive to this choice, with PCA producing similar results for both applications. For simplicity, we use the same dimension for each dimension reduction layer in the D-EESN model; a similar assumption was made in Ma et al. (2017). Similar to McDermott and Wikle (2017b), all of the hyper-parameters in the set are fixed. Finally, the common reduction dimension was set to 84 for all of the deep models (an even value was used to accommodate the Fourier basis functions). The model was not sensitive to this value, with values ranging between 60 and 100 producing nearly identical results for the metrics evaluated below. As previously discussed, the ensemble framework presented here aims to create an ensemble of weak learners, hence the selection of a moderate value for this dimension.

Estimation of all the Bayesian models considered here is carried out using a Gibbs sampler MCMC algorithm with 5,000 iterations, where the first 1,000 iterations are treated as burn-in. The trace plots showed no evidence of non-convergence (more details regarding convergence are given below). The specific hyper-parameters used for the Bayesian implementation can be found in Table 2. Note that more restrictive priors (in terms of regularization) are employed for the models with more regression parameters (i.e., models with more hidden layers); this improves the mixing of the MCMC algorithm and helps prevent the model from overfitting.

Both the D-EESN and BD-EESN are compared against the previously described single-layered Q-EESN model in order to investigate the added utility of using a deep framework. In addition, naïve or simple forecasting methods are often employed as baselines for difficult long-lead forecasting problems such as the soil moisture application. We consider both a climatological forecast and a linear dynamical spatio-temporal model (DSTM) as baseline models for the soil moisture application (for consistency, we also compare to the linear model in the deep Lorenz-96 simulation study described below). The linear DSTM is defined as follows:

$$\mathbf{Z}_t \sim \mathrm{Gau}\left(\boldsymbol{\Phi}\,\boldsymbol{\alpha}_t,\ \boldsymbol{\Sigma}^{(L)}\right), \qquad (20)$$
$$\boldsymbol{\alpha}_t = \mathbf{M}\,\boldsymbol{\alpha}_{t-1} + \boldsymbol{\eta}^{(L)}_t, \qquad (21)$$

where $\mathbf{M}$ is a transition matrix estimated using least squares (with associated Gaussian error terms $\boldsymbol{\eta}^{(L)}_t$) and $\boldsymbol{\Sigma}^{(L)}$ is a covariance matrix estimated empirically using the in-sample residuals from the model fit (i.e., this is essentially a two-stage least squares estimation procedure, where $\mathbf{M}$ is first estimated as previously described, followed by $\boldsymbol{\Sigma}^{(L)}$). See Chapter 7 of Cressie and Wikle (2011) for a comprehensive overview of linear DSTMs.
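A minimal sketch of the two-stage least-squares procedure, using a toy two-dimensional latent process: the transition matrix is first estimated by regressing $\boldsymbol{\alpha}_t$ on $\boldsymbol{\alpha}_{t-1}$, and the innovation covariance is then estimated empirically from the in-sample residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a low-dimensional latent series alpha_t = M alpha_{t-1} + eta_t.
M_true = np.array([[0.8, 0.1], [-0.1, 0.7]])
T, p = 500, 2
alpha = np.zeros((T, p))
for t in range(1, T):
    alpha[t] = M_true @ alpha[t - 1] + 0.1 * rng.normal(size=p)

# Stage 1: least-squares estimate of the transition matrix
# (regress alpha_t on alpha_{t-1}).
A_lag, A_cur = alpha[:-1], alpha[1:]
M_hat = np.linalg.lstsq(A_lag, A_cur, rcond=None)[0].T

# Stage 2: empirical covariance of the in-sample residuals.
resid = A_cur - A_lag @ M_hat.T
Sigma_hat = np.cov(resid.T)
```

With a stable transition matrix and a moderately long series, the least-squares estimate recovers the true dynamics closely.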

The models considered here are evaluated in terms of both out-of-sample prediction accuracy and the quality of their respective forecast distributions (i.e., uncertainty quantification). In particular, mean squared prediction error (MSPE), defined here as the average squared difference between out-of-sample realizations and forecasts averaged over space and time, is calculated for every model. The forecast distributions are evaluated using the continuous ranked probability score (CRPS). Unlike MSPE, CRPS also considers the quality of a given model’s uncertainty quantification by comparing the forecast distribution with the observed values (e.g., see Gneiting and Katzfuss, 2014). The classic CRPS formulation for Gaussian data (i.e., Gneiting et al., 2005) is used for the soil moisture application, while the deep Lorenz-96 simulation uses the closed-form expression for CRPS with log-Gaussian data from Baran and Lerch (2015). Regarding the soil moisture application, as previously noted, standard forecasting methodologies for difficult long-lead problems are often compared to simpler forecast methods such as climatological or linear forecasts. The so-called skill score (SS) is an evaluation metric that compares a forecasting method to some reference forecast (e.g., Wilks, 2001). Assuming one wants to compare a reference forecast to some model forecast in terms of MSPE, SS is defined as follows:

$$\mathrm{SS} = 1 - \frac{\mathrm{MSPE}(\text{Model})}{\mathrm{MSPE}(\text{Reference})}. \qquad (22)$$

In our application, SS is calculated for each location in the soil moisture domain by computing MSPE across time at that location. Thus, by computing a SS for each spatial location we obtain a spatial field that shows where the new model improves upon a particular reference model.
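The per-location skill-score field in (22) can be sketched as follows; `skill_score_field` is a hypothetical helper, with arrays arranged as (time, locations) and a zero-anomaly climatological forecast used as the reference.

```python
import numpy as np

def skill_score_field(obs, model_fc, ref_fc):
    """Per-location skill score SS = 1 - MSPE(model)/MSPE(reference),
    with MSPE computed across time at each spatial location.
    All arrays have shape (time, locations)."""
    mspe_model = np.mean((obs - model_fc) ** 2, axis=0)
    mspe_ref = np.mean((obs - ref_fc) ** 2, axis=0)
    return 1.0 - mspe_model / mspe_ref

rng = np.random.default_rng(2)
obs = rng.normal(size=(50, 10))                  # toy anomaly field
model = obs + 0.5 * rng.normal(size=obs.shape)   # forecast closer to truth
ref = np.zeros_like(obs)                         # zero-anomaly climatological forecast
ss = skill_score_field(obs, model, ref)          # positive where model beats reference
```

Mapping `ss` over the spatial grid produces a field like the one shown later for the soil moisture application.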

### 4.2 Simulation Study: Deep Lorenz-96 Model

The deterministic model of Lorenz (1996), often referred to as the “Lorenz-96 model,” is frequently used as a simulation model in the atmospheric science and dynamical systems literature because it incorporates the quadratic nonlinear interactions of the famous Lorenz (1963) “butterfly” model in a one-dimensional spatial setting. A multiscale extension of this model (see Wilks, 2005) has gained popularity in the atmospheric science and applied mathematics literature for its ability to represent realistic nonlinear interactions between small- and large-scale variables (Crommelin and Vanden-Eijnden, 2008; Grooms and Lee, 2015). Moreover, the Lorenz-96 model operates on multiple scales in both space and time, consisting of locations with slowly varying large-scale behavior and locations with fast-varying small-scale behavior. Thus, the multiscale Lorenz-96 model represents a very relevant example for the multiscale deep EESN methodology developed here. To make this simulation model even more realistic, we extend it by adding an additional data stage to allow for non-Gaussian data types. We will refer to this simulation model as the “deep Lorenz-96 model.” To our knowledge, this type of Lorenz-96 model is novel.

The spatial structure of the Lorenz-96 system (as shown in Figure 1) is a one-dimensional circular structure (i.e., periodic boundary conditions) where each large-scale location is evenly spaced. Furthermore, each large-scale location is associated with a set of small-scale locations (see Figure 1). Using the process model parameterization from Chorin and Lu (2015) and the inclusion of a data model, our deep Lorenz-96 model is defined as follows for time period $t$:

$$\text{Data Model: } z_{t,k} \sim \mathrm{LogGaussian}\left(|x_{t,k}|/c,\ \sigma^2_{\eta}\right),$$
$$\text{Process Model: } \frac{dx_k}{dt} = x_{k-1}\left(x_{k+1} - x_{k-2}\right) - x_k + F + \frac{h_x}{J}\sum_{j} y_{j,k}, \qquad (23)$$
$$\frac{dy_{j,k}}{dt} = \frac{1}{\epsilon_L}\left[y_{j+1,k}\left(y_{j-1,k} - y_{j+2,k}\right) - y_{j,k} + h_y x_k\right],$$

where $c$ and $F$ are user-defined parameters, $x_k$ corresponds to a large-scale location, $y_{j,k}$ corresponds to a small-scale location associated with large-scale location $k$, and $|\cdot|$ denotes the absolute value. The parameter $F$ corresponds to a forcing term and helps determine the amount of nonlinearity in the model. In this setting, $\epsilon_L$ is a “time separation parameter” such that smaller values lead to a faster-varying temporal scale for the small-scale locations. The contribution of the small-scale locations to the large-scale locations (and vice versa) is determined by $h_x$ and $h_y$, respectively. Finally, the scaling parameter $c$ helps ensure the log-Gaussian data model does not produce unrealistically large realizations.

Since the methodology presented here concerns multiscale spatio-temporal modeling, the parameters for the deep Lorenz-96 model are selected to maximize the level of multiscale behavior in the process (as can be seen from the simulated deep Lorenz-96 data shown in Figure 2). In particular, the forcing and scale-interaction parameters are chosen to promote strong interaction between scales, while we follow Chorin and Lu (2015) for the remaining settings; the variance term $\sigma^2_{\eta}$ in (23) is also fixed. An Euler solver with a small time step is used to numerically solve the Lorenz-96 equations in (23) (Fatkullin and Vanden-Eijnden, 2004). After a burn-in period, 510 periods of simulated data from the deep Lorenz-96 model are retained, with the final 75 periods held out and treated as out-of-sample realizations.
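An Euler step for the multiscale Lorenz-96 equations in (23) can be sketched as below, using periodic (circular) indexing via `np.roll`; the parameter values, grid sizes, and time step here are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def deep_l96_step(x, y, dt, F=8.0, h_x=1.0, h_y=1.0, eps=0.5):
    """One forward-Euler step of the multiscale Lorenz-96 process model.
    x: (K,) large-scale variables; y: (J, K) small-scale variables.
    All indexing is periodic; parameter values are illustrative."""
    J = y.shape[0]
    # dx_k/dt = x_{k-1}(x_{k+1} - x_{k-2}) - x_k + F + (h_x/J) sum_j y_{j,k}
    dx = (np.roll(x, 1) * (np.roll(x, -1) - np.roll(x, 2))
          - x + F + (h_x / J) * y.sum(axis=0))
    # dy_{j,k}/dt = (1/eps)[y_{j+1,k}(y_{j-1,k} - y_{j+2,k}) - y_{j,k} + h_y x_k]
    dy = (np.roll(y, -1, axis=0) * (np.roll(y, 1, axis=0) - np.roll(y, -2, axis=0))
          - y + h_y * x[None, :]) / eps
    return x + dt * dx, y + dt * dy

rng = np.random.default_rng(3)
K, J, dt = 8, 4, 0.001
x = rng.normal(size=K)
y = 0.1 * rng.normal(size=(J, K))
for _ in range(2000):          # integrate forward past initial transients
    x, y = deep_l96_step(x, y, dt)
```

A log-Gaussian draw with mean depending on `|x|` would then complete the data stage of the simulation.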

Only the data generated from the large-scale process are treated as observed here; both the large- and small-scale Lorenz-96 variables (i.e., $x_k$ and $y_{j,k}$) are treated as unobserved. Therefore, unlike the soil moisture application, the input and output of the model are the same process. The input and output are separated by three periods (i.e., the lead time is set to three periods) in order to make the problem slightly more difficult. The previously discussed embedding lag in (4) is set to the lead time (three periods) for the deep Lorenz-96 D-EESN implementations, and the number of embedding lags in (4) is selected with the GA (using cross-validation) to be three. Finishing the specification of the data models in (8) and (16), the mapping matrix is set to the identity, a log-Gaussian distribution is used for the unspecified distribution in (16), and the data-model covariance matrix is fixed a priori.
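The time-delay embedding of the inputs can be sketched as follows, where `embed_inputs` is a hypothetical helper stacking `m` lagged copies of the series spaced `tau` periods apart, as in (4).

```python
import numpy as np

def embed_inputs(series, tau=3, m=3):
    """Time-delay embedding: row t contains series[t], series[t-tau], ...,
    series[t-(m-1)*tau] stacked side by side. series has shape (T, p);
    the result has shape (T - (m-1)*tau, m*p)."""
    T, p = series.shape
    start = (m - 1) * tau
    rows = [np.hstack([series[t - i * tau] for i in range(m)])
            for t in range(start, T)]
    return np.array(rows)

z = np.arange(20, dtype=float).reshape(10, 2)   # toy 2-variable series
E = embed_inputs(z, tau=3, m=3)
```

The first usable row corresponds to time $t = (m-1)\tau$, so the embedded design matrix is shorter than the original series by $(m-1)\tau$ periods.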

The out-of-sample validation metrics in Table 3 show the seven-layer D-EESN models to be the best forecasting models for the deep Lorenz-96 data. In particular, both of the seven-layer D-EESN models perform better in terms of MSPE and CRPS than the Q-EESN or linear model. Although not shown here, D-EESN models with two to six layers all performed better than the Q-EESN model and worse than the seven-layer D-EESN model in terms of MSPE, with the MSPE decreasing monotonically as layers were added. We found that after seven layers, any gain in forecast accuracy was very minimal in comparison to the extra computation required to keep adding layers.

The Bayesian version of the seven-layer D-EESN model produces similar MSPE and CRPS values to the non-Bayesian version. Although one would expect the Bayesian model to perform similarly to the non-Bayesian version in terms of MSPE, there is potential for the Bayesian version to improve in terms of uncertainty quantification (as shown below for the soil moisture application). Despite the more heuristic nature of uncertainty quantification in an ensemble framework, the D-EESN model has strong point-wise coverage probabilities with these simulated data (that is, the empirical coverage of the 95% prediction intervals (P.I.s) for the non-Bayesian seven-layer D-EESN model is close to the nominal level). Thus, there is little the Bayesian model can provide in terms of improved uncertainty quantification here. This point is illustrated in Figure 3, which shows forecast summaries for the first six locations under the seven-layer D-EESN model. Despite the clear nonlinearity in the process, the model correctly forecasts much of the overall quasi-cyclic nature and intensity of the process, while producing uncertainty intervals that cover most of the true values.

### 4.3 Midwest Soil Moisture Long-Lead Forecasting Application

The previously mentioned soil moisture data come from the Climate Prediction Center’s (CPC) high-resolution monthly global soil moisture data set (Fan and van den Dool (2004), https://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GMSM/.w/). Spatially, the data domain covers a latitude-longitude box over the central U.S. at a resolution of 0.5°. The monthly data set begins in January 1948 and goes through December 2017. While the model is trained on monthly data from January 1948 through November 2011, only the May forecasted values from the out-of-sample period covering 2012-2017 are used to evaluate the model, given the importance of May soil moisture for planting corn in the U.S. corn belt. We follow the common practice in the atmospheric science literature of converting data into anomalies by subtracting the respective in-sample monthly means from the data.

Dimension reduction for the soil moisture data is carried out using empirical orthogonal functions (EOFs; i.e., spatio-temporal principal component analysis) (e.g., see Cressie and Wikle, 2011, Chapter 5). The reduced-dimension input is obtained using the first 15 EOFs, which account for a substantial proportion of the variation in the soil moisture data (note, the model was not sensitive to small variations in this choice). Note that other linear or nonlinear dimension reduction methods could also be employed in the framework presented here. The data-model spatial covariance matrix is calculated using the formulation from Cressie and Wikle (2011, Equation 7.6), in which the eigenvalues and eigenvectors not retained in the EOF decomposition define the residual spatial covariance, scaled by a constant. Since the data are converted into anomalies, a Gaussian distribution is used for the distribution in (16).
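EOF dimension reduction can be sketched via an SVD of the centered (time × space) data matrix; `eof_reduce` is a hypothetical helper, and the toy field below stands in for the soil moisture anomalies.

```python
import numpy as np

def eof_reduce(Z, n_eof):
    """EOF (spatio-temporal PCA) reduction via SVD of the centered data
    matrix Z (time x space). Returns the projected time series, the
    spatial EOF patterns, and the fraction of variance retained."""
    Zc = Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    coeffs = U[:, :n_eof] * s[:n_eof]       # time-varying EOF coefficients
    eofs = Vt[:n_eof]                       # spatial EOF patterns
    var_frac = (s[:n_eof] ** 2).sum() / (s ** 2).sum()
    return coeffs, eofs, var_frac

rng = np.random.default_rng(4)
# Toy field: two dominant spatial modes plus small noise.
t = np.linspace(0, 10, 200)
field = (np.outer(np.sin(t), rng.normal(size=50))
         + np.outer(np.cos(2 * t), rng.normal(size=50))
         + 0.1 * rng.normal(size=(200, 50)))
coeffs, eofs, var_frac = eof_reduce(field, n_eof=2)
```

The retained coefficients play the role of the low-dimensional inputs and outputs in the D-EESN, while the discarded eigen-pairs contribute to the residual spatial covariance as described above.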

The monthly SST data come from the extended reconstruction sea surface temperature (ERSST) data set (Smith et al. (2008), http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCDC/.ERSST) and cover the same temporal period as the soil moisture data. The spatial domain for the SST data covers much of the mid-Pacific Ocean. Similar to the soil moisture data, the SST data are converted into anomalies from their monthly means. Dimension reduction is once again carried out using EOFs by retaining the first 5 EOFs, which account for much of the variation in the data (the model was also not overly sensitive to this choice). The embedding lag for the input is set to the lead time of six periods and the number of embedding lags is selected to be three using cross-validation with the GA. Convergence of the MCMC Gibbs sampler was further assessed using the Gelman-Rubin diagnostic (Gelman and Rubin, 1992) with four chains, which did not suggest any lack of convergence.
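A minimal sketch of the Gelman-Rubin diagnostic for a single scalar parameter, computed from multiple chains; this is the classic formulation, not necessarily the exact variant used in the analysis.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter,
    from an array of shape (n_chains, n_iter); values near 1 suggest
    no lack of convergence."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(5)
# Four well-mixed chains targeting the same distribution.
chains = rng.normal(size=(4, 1000))
rhat = gelman_rubin(chains)
```

In practice this statistic is computed for every sampled parameter (or a representative subset), with values well above 1 flagging chains that have not yet mixed.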

Table 4 shows the performance of various models for out-of-sample long-lead forecasting of May soil moisture. Both the Bayesian and non-Bayesian D-EESN models perform the best in terms of MSPE, and also represent the largest improvement over the climatological forecast. Further, both D-EESN models outperform the single-layered Q-EESN model, suggesting the soil moisture application benefits from a deep framework. Notably, the two- and three-layer models appear to have similar MSPE values, indicating that two layers are likely sufficient. While the Bayesian and non-Bayesian models perform similarly in terms of forecast accuracy, the Bayesian D-EESN models perform much better in terms of CRPS than the non-Bayesian versions; in particular, the Bayesian version produces noticeably lower CRPS values. Absent the improvement allowed by the Bayesian implementation, the D-EESN would be much less able to quantify the uncertainty in the long-lead soil moisture forecasts.

Figure 4 shows the posterior predictive means and standard deviations for 2014 and 2015 over the prediction domain as given by the two-layer BD-EESN model. The model appears to mostly pick up the correct pattern of the soil moisture anomaly signal in the western and mid-central part of the spatial domain for both years, while struggling more with the upper midwest states in 2014. Regarding the critical corn belt, the model suggests an overall large amount of uncertainty in Missouri, especially in 2014, while Iowa appears to have a moderate to low amount of uncertainty for both years. As previously noted, the BD-EESN produces considerably better uncertainty metrics for the soil moisture data than the non-Bayesian D-EESN. The relative difference between these two predictive uncertainties can be seen in Figure 5, where both versions of the two-layer D-EESN model are plotted for 2015. Although the forecast means are very similar in Figure 5, the non-Bayesian D-EESN produces much smaller standard deviations across the spatial domain compared to the Bayesian version. Considering the inherent difficulty in predicting soil moisture six months into the future, the standard deviations for the non-Bayesian model appear unrealistically low. This point is confirmed by the Bayesian model producing a considerably lower CRPS value than the non-Bayesian version.

Next, the previously defined SS in (22) is shown in Figure 6 for the two-layer BD-EESN model, where a climatological forecast is used as the reference model. Values of SS greater than zero represent locations where the BD-EESN model improved upon the climatological forecast, while values less than zero indicate locations where the model did worse than the climatological forecast. Overall, the BD-EESN model outperforms the climatological forecast in the central western part of the domain, and performs worse in the east central and northern part of the domain. Critically, across much of the agriculturally important corn belt, including much of Iowa, the BD-EESN model improves upon the climatological forecast.

## 5 Discussion

Despite the large amount of research into deep models, uncertainty quantification is rarely considered. Given the demonstrated predictive ability of these methods, having the ability to quantify uncertainty with deep models is very powerful and widely applicable. Through the use of an ensemble framework, the proposed deep ESN model allows for uncertainty quantification with spatio-temporal processes. The non-Gaussian Lorenz-96 simulated example showed that the D-EESN improved upon the traditional ESN model while also producing very robust estimates of the uncertainty in the process. Furthermore, the Bayesian version of the D-EESN model provides a formal framework in which many data types can be considered and multiple levels of uncertainty can be accounted for. The results for the BD-EESN with the soil moisture data illustrated this point by considerably improving upon the uncertainty metrics produced by the D-EESN model. We were also able to show spatially where the deep models improved upon simpler forecast methods, thus giving model developers and resource managers critical information.

As discussed above, there are many more possible choices for the dimension reduction function used with the hidden units from the D-EESN model. One potential choice that has been unexplored in the literature is so-called convolution operators. Deep image classification methods have shown convolution operators to be extremely powerful for learning general (spatial) features (Krizhevsky et al., 2012). It is possible that the deep RNN framework could also benefit from such tools, especially in situations where the inputs have explicit spatial structure. More exploration into the effects of setting different dimensions for the various dimension reduction layers in the deep model is likely also needed. Investigating other dimension reduction methods, including nonlinear methods, for the SST data is another possible avenue for improving the model.

Additionally, there is flexibility regarding the prior distributions used for the regression parameters in the BD-EESN model. Ridge penalties have almost exclusively been used in the ESN literature, with little research regarding alternative regularization methods. Considering the large number of potential regression parameters in a deep ensemble framework, further exploration of regularization methods is likely warranted.

Long-lead spatio-temporal soil moisture forecasting is inherently a challenging problem. It is not uncommon to treat such difficult forecasting problems as categorical instead of continuous, that is, to re-label continuous responses with qualitative values (e.g., below average, average, above average). Unlike the RNN literature, much of the ESN literature has, to date, focused exclusively on continuous responses. Given the flexibility of the BD-EESN, such a model formulation can be developed and is the subject of future research. The soil moisture forecasts may also benefit from using other climate indices or more local variables (such as precipitation) as inputs into the model.

## References

• Antonelo et al. (2017) Antonelo, E. A., Camponogara, E., and Foss, B. (2017), “Echo State Networks for data-driven downhole pressure estimation in gas-lift oil wells,” Neural Networks, 85, 106–117.
• Baran and Lerch (2015) Baran, S. and Lerch, S. (2015), “Log-normal distribution based Ensemble Model Output Statistics models for probabilistic wind-speed forecasting,” Quarterly Journal of the Royal Meteorological Society, 141, 2289–2299.
• Barnston et al. (1999) Barnston, A. G., He, Y., and Glantz, M. H. (1999), “Predictive skill of statistical and dynamical climate models in SST forecasts during the 1997-98 El Niño episode and the 1998 La Niña onset,” Bulletin of the American Meteorological Society, 80, 217–243.
• Belkin and Niyogi (2001) Belkin, M. and Niyogi, P. (2001), “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” NIPS, 14.
• Berliner et al. (2000) Berliner, L. M., Wikle, C. K., and Cressie, N. (2000), “Long-lead prediction of Pacific SSTs via Bayesian dynamic modeling,” Journal of climate, 13, 3953–3968.
• Blackmer et al. (1989) Blackmer, A., Pottker, D., Cerrato, M., and Webb, J. (1989), “Correlations between soil nitrate concentrations in late spring and corn yields in Iowa,” Journal of Production Agriculture, 2, 103–109.
• Carleton et al. (2008) Carleton, A. M., Arnold, D. L., Travis, D. J., Curran, S., and Adegoke, J. O. (2008), “Synoptic circulation and land surface influences on convection in the Midwest US “Corn Belt” during the summers of 1999 and 2000. Part I: Composite synoptic environments,” Journal of Climate, 21, 3389–3415.
• Chatzis (2015) Chatzis, S. P. (2015), “Sparse Bayesian Recurrent Neural Networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 359–372.
• Chorin and Lu (2015) Chorin, A. J. and Lu, F. (2015), “Discrete approach to stochastic parametrization and dimension reduction in nonlinear dynamics,” Proceedings of the National Academy of Sciences, 112, 9804–9809.
• Cressie and Wikle (2011) Cressie, N. and Wikle, C. (2011), Statistics for Spatio-Temporal Data, New York: John Wiley & Sons.
• Crommelin and Vanden-Eijnden (2008) Crommelin, D. and Vanden-Eijnden, E. (2008), “Subgrid-scale parameterization with conditional Markov chains,” Journal of the Atmospheric Sciences, 65, 2661–2675.
• Dixon et al. (2017) Dixon, M. F., Polson, N. G., and Sokolov, V. O. (2017), “Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading,” arXiv preprint arXiv:1705.09851.
• Drosdowsky (1994) Drosdowsky, W. (1994), “Analog (nonlinear) forecasts of the Southern Oscillation index time series,” Weather and forecasting, 9, 78–84.
• Fan and van den Dool (2004) Fan, Y. and van den Dool, H. (2004), “Climate Prediction Center global monthly soil moisture data set at 0.5 resolution for 1948 to present,” Journal of Geophysical Research: Atmospheres, 109.
• Fatkullin and Vanden-Eijnden (2004) Fatkullin, I. and Vanden-Eijnden, E. (2004), “A computational strategy for multiscale systems with applications to Lorenz 96 model,” Journal of Computational Physics, 200, 605–638.
• Fischer et al. (2007) Fischer, E. M., Seneviratne, S. I., Vidale, P. L., Lüthi, D., and Schär, C. (2007), “Soil moisture–atmosphere interactions during the 2003 European summer heat wave,” Journal of Climate, 20, 5081–5099.
• Friedman et al. (2001) Friedman, J., Hastie, T., and Tibshirani, R. (2001), The elements of statistical learning, Springer series in statistics New York.
• Gelman and Rubin (1992) Gelman, A. and Rubin, D. B. (1992), “Inference from iterative simulation using multiple sequences,” Statistical Science, 457–472.
• Genest and Rémillard (2008) Genest, C. and Rémillard, B. (2008), “Validity of the parametric bootstrap for goodness-of-fit testing in semiparametric models,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 44.
• George and McCulloch (1997) George, E. I. and McCulloch, R. E. (1997), “Approaches for Bayesian variable selection,” Statistica sinica, 339–373.
• Gladish and Wikle (2014) Gladish, D. W. and Wikle, C. K. (2014), “Physically motivated scale interaction parameterization in reduced rank quadratic nonlinear dynamic spatio-temporal models,” Environmetrics, 25, 230–244.
• Gneiting and Katzfuss (2014) Gneiting, T. and Katzfuss, M. (2014), “Probabilistic forecasting,” Annual Review of Statistics and Its Application, 1, 125–151.
• Gneiting et al. (2005) Gneiting, T., Raftery, A. E., Westveld III, A. H., and Goldman, T. (2005), “Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation,” Monthly Weather Review, 133, 1098–1118.
• Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016), Deep learning, MIT press Cambridge.
• Grooms and Lee (2015) Grooms, I. and Lee, Y. (2015), “A framework for variational data assimilation with superparameterization,” Nonlinear Processes in Geophysics, 22.
• Hooten and Wikle (2010) Hooten, M. B. and Wikle, C. K. (2010), “Statistical agent-based models for discrete spatio-temporal systems,” Journal of the American Statistical Association, 105, 236–248.
• Hopfield (1982) Hopfield, J. J. (1982), “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the national academy of sciences, 79, 2554–2558.
• Jaeger (2001) Jaeger, H. (2001), “The “echo state” approach to analysing and training recurrent neural networks-with an erratum note,” Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148.
• Jaeger (2007a) — (2007a), “Discovering multiscale dynamical features with hierarchical echo state networks,” Tech. rep., Jacobs University Bremen.
• Jaeger (2007b) — (2007b), “Echo state network,” Scholarpedia, 2, 2330.
• Jan van Oldenborgh et al. (2005) Jan van Oldenborgh, G., Balmaseda, M. A., Ferranti, L., Stockdale, T. N., and Anderson, D. L. (2005), “Did the ECMWF seasonal forecast model outperform statistical ENSO forecast models over the last 15 years?” Journal of climate, 18, 3240–3249.
• Knaff and Landsea (1997) Knaff, J. A. and Landsea, C. W. (1997), “An El Niño–Southern Oscillation climatology and persistence (CLIPER) forecasting scheme,” Weather and forecasting, 12, 633–652.
• Kondrashov et al. (2005) Kondrashov, D., Kravtsov, S., Robertson, A. W., and Ghil, M. (2005), “A hierarchy of data-based ENSO models,” Journal of climate, 18, 4425–4444.
• Kravtsov et al. (2005) Kravtsov, S., Kondrashov, D., and Ghil, M. (2005), “Multilevel regression modeling of nonlinear processes: Derivation and applications to climatic variability,” Journal of Climate, 18, 4404–4424.
• Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012), “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105.
• Li et al. (2012) Li, D., Han, M., and Wang, J. (2012), “Chaotic time series prediction based on a novel robust echo state network,” IEEE Transactions on Neural Networks and Learning Systems, 23, 787–799.
• Lorenz (1963) Lorenz, E. N. (1963), “Deterministic nonperiodic flow,” Journal of the atmospheric sciences, 20, 130–141.
• Lorenz (1996) — (1996), “Predictability: A problem partly solved,” in Proc. Seminar on predictability, vol. 1.
• Lukoševičius (2012) Lukoševičius, M. (2012), “A practical guide to applying echo state networks,” in Neural networks: tricks of the trade, Springer, pp. 659–686.
• Lukoševičius and Jaeger (2009) Lukoševičius, M. and Jaeger, H. (2009), “Reservoir computing approaches to recurrent neural network training,” Computer Science Review, 3, 127–149.
• Ma et al. (2017) Ma, Q., Shen, L., and Cottrell, G. W. (2017), “Deep-ESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework,” arXiv preprint arXiv:1711.05255.
• McDermott and Wikle (2016) McDermott, P. L. and Wikle, C. K. (2016), “A model-based approach for analog spatio-temporal dynamic forecasting,” Environmetrics, 27, 70–82.
• McDermott and Wikle (2017a) — (2017a), “Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data,” arXiv preprint arXiv:1711.00636.
• McDermott and Wikle (2017b) — (2017b), “An Ensemble Quadratic Echo State Network for Nonlinear Spatio-Temporal Forecasting,” STAT, 6, 315–330.
• Penland and Magorian (1993) Penland, C. and Magorian, T. (1993), “Prediction of Nino 3 sea surface temperatures using linear inverse modeling,” Journal of Climate, 6, 1067–1076.
• Philander (1990) Philander, S. (1990), El Niño, La Niña, and the Southern Oscillation, Academic Press.
• Richardson (2017) Richardson, R. A. (2017), “Sparsity in nonlinear dynamic spatiotemporal models using implied advection,” Environmetrics.
• Sheffield et al. (2004) Sheffield, J., Goteti, G., Wen, F., and Wood, E. F. (2004), “A simulated soil moisture based drought analysis for the United States,” Journal of Geophysical Research: Atmospheres, 109.
• Sheng et al. (2013) Sheng, C., Zhao, J., Wang, W., and Leung, H. (2013), “Prediction intervals for a noisy nonlinear time series based on a bootstrapping reservoir computing network ensemble,” IEEE Transactions on neural networks and learning systems, 24, 1036–1048.
• Sivanandam and Deepa (2007) Sivanandam, S. and Deepa, S. (2007), Introduction to genetic algorithms, Springer Science & Business Media.
• Smith et al. (2008) Smith, T. M., Reynolds, R. W., Peterson, T. C., and Lawrimore, J. (2008), “Improvements to NOAA’s historical merged land–ocean surface temperature analysis (1880–2006),” Journal of Climate, 21, 2283–2296.
• Stern and Davidson (2015) Stern, H. and Davidson, N. E. (2015), “Trends in the skill of weather prediction at lead times of 1–14 days,” Quarterly Journal of the Royal Meteorological Society, 141, 2726–2736.
• Takens (1981) Takens, F. (1981), “Detecting strange attractors in turbulence,” Lecture notes in mathematics, 898, 366–381.
• Tang et al. (2000) Tang, B., Hsieh, W. W., Monahan, A. H., and Tangang, F. T. (2000), “Skill comparisons between neural networks and canonical correlation analysis in predicting the equatorial Pacific sea surface temperatures,” Journal of Climate, 13, 287–293.
• Tangang et al. (1998) Tangang, F. T., Tang, B., Monahan, A. H., and Hsieh, W. W. (1998), “Forecasting ENSO events: A neural network–extended EOF approach,” Journal of Climate, 11, 29–41.
• Timmermann et al. (2001) Timmermann, A., Voss, H., and Pasmanter, R. (2001), “Empirical dynamical system modeling of ENSO using nonlinear inverse techniques,” Journal of physical oceanography, 31, 1579–1598.
• Triefenbach et al. (2013) Triefenbach, F., Jalalvand, A., Demuynck, K., and Martens, J.-P. (2013), “Acoustic modeling with hierarchical reservoirs,” IEEE Transactions on Audio, Speech, and Language Processing, 21, 2439–2450.
• Van den Dool et al. (2003) Van den Dool, H., Huang, J., and Fan, Y. (2003), “Performance and analysis of the constructed analogue method applied to US soil moisture over 1981–2001,” Journal of Geophysical Research: Atmospheres, 108.
• Wikle (2015) Wikle, C. (2015), “Modern perspectives on statistics for spatio-temporal data,” Wiley Interdisciplinary Reviews: Computational Statistics, 7, 86–98.
• Wikle and Hooten (2010) Wikle, C. K. and Hooten, M. B. (2010), “A general science-based framework for dynamical spatio-temporal models,” Test, 19, 417–451.
• Wilks (2001) Wilks, D. (2001), “A skill score based on economic value for probability forecasts,” Meteorological Applications, 8, 209–219.
• Wilks (2005) Wilks, D. S. (2005), “Effects of stochastic parametrizations in the Lorenz’96 system,” Quarterly Journal of the Royal Meteorological Society, 131, 389–407.
• Zhao and Giannakis (2016) Zhao, Z. and Giannakis, D. (2016), “Analog forecasting with dynamics-adapted kernels,” Nonlinearity, 29.