# Probabilistic Forecasting using Deep Generative Models

###### Abstract

The \glsAnEn method tries to estimate the probability distribution of the future state of the atmosphere with a set of past observations that correspond to the best analogs of a deterministic \glsNWP. This model post-processing method has been successfully used to improve the forecast accuracy for several weather-related applications including air quality, and short-term wind and solar power forecasting, to name a few. In order to provide a meaningful probabilistic forecast, the \glsAnEn method requires storing a historical set of past predictions and observations in memory for a period of at least several months and spanning the seasons relevant for the prediction of interest. Although the memory and computing costs of the \glsAnEn method are less expensive than using a brute-force dynamical ensemble approach, for a large number of stations and large datasets, the amount of memory required for \glsAnEn can easily become prohibitive. Furthermore, in order to find the best analogs associated with a certain prediction produced by a \glsNWP model, the current approach requires searching over the entire dataset by applying a certain metric. This approach requires applying the metric over the entire historical dataset, which may take a substantial amount of time. In this work, we investigate an alternative way to implement the \glsAnEn method using deep generative models. By doing so, a generative model can entirely or partially replace the dataset of pairs of predictions and observations, reducing the amount of memory required to produce the probabilistic forecast by several orders of magnitude. Furthermore, the generative model can generate a meaningful set of analogs associated with a certain forecast in constant time without performing any search, saving a considerable amount of time even in the presence of huge historical datasets.

###### keywords:

Computational algorithms, Ensemble modeling, Deep Learning^{†}

^{†}journal: Journal of Computers and Geosciences\setacronymstyle

long-short \newacronymGAGAGenetic Algorithm \newacronymAnEnAnEnAnalog Ensemble \newacronymELBOELBOEvidence Lower Bound \newacronymVAEVAEVariational Autoencoder \newacronymCVAECVAEConditional Variational Autoencoder \newacronymKLKLKullback-Leibler \newacronymFLTFLTForecast Lead Time \newacronymEAEAEvolutionary Algorithm \newacronymRMSERMSERoot Mean Square Error \newacronymNWPNWPNumerical Weather Prediction \newacronymNAMNAMNorth American Mesoscale Forecast System \newacronymDOUGDOUGDynamically Optimized Unstructured Grid \newacronymOMEGAOMEGAOperational Multiscale Environment Model with Grid Adaptivity \newacronymECMWFECMWFEuropean Centre for Medium-Range Weather Forecasts \newacronymPDFPDFProbability Density Function

## 1 Introduction

Quick and accurate weather prediction is an essential and critical part for decision-making, in particular when human lives are at stake. It has been counted that more than 200 people died in 2018 because of all weather and climate events over U.S. and the damages has been estimated to be more than 90 billion dollars (NOAA 2018). However, these losses would be far higher if scientists had not predicted these events. \glsNWP model forecast is usually used as the main tool for weather prediction. However, its utility is limited as it represents only a single plausible future state of the atmosphere. In fact, imperfect initial conditions and model deficiencies can lead to model errors that grow non-linearly during the model evolution. Lorenz (1969) pointed out that, “the errors in estimating the current state of the atmosphere are due mainly to omission rather than inaccuracy”. In other words, the errors can be related to the gaps in science or computational resources and the uncertainty in input data. However, current NWP models are state-of-the-science models that include the most recent discovered science. So, reducing the effects of uncertain data is a possible option to increase the accuracy and reliability of these models.

In order to address the forecast uncertainty of deterministic models, integrated probabilistic forecasts has been widely used. As shown by Hopson (2014), the rationale for using a probabilistic forecast is four-fold: 1) the ensemble prediction mean usually has half of the error variance of a single forecast; 2) ensemble forecasts are capable to estimate the likelihood of extreme weather events; 3) ensemble forecasts can be used to generate non-Gaussian forecasts \glsPDF; and 4) ensemble spread acts as a representation of forecast uncertainty. These ensemble members can be achieved by changing different parts of the models such as different physical parametrization of the sub-grid physical processes (Foley et al. 2012), dynamic solvers (Jablonowski 2004), and initial conditions (Sperati et al. 2016). For example, different variations of an initial condition can be found by adding perturbations from a stochastic model. Deterministic NWP models realizations with these variations will generate an ensemble of predictions to capture the uncertainty for a specific weather event. Although this Monte Carlo based method is a popular and reliable way of generating probabilistic forecasting, it requires a huge amount of computational resources for solving Navier-Stokes equations for multiple times (Schmidt et al. 2017; Syrakos et al. 2012; Fant et al. 2016).

In order to reduce the computational cost of the ensemble forecast generation, another method was suggested by Delle Monache et al. (2011) which is called Analog Ensemble (AnEn) method. AnEn assumes that the model error for past predictions can be used to correct future predictions. As a result, this technique only requires one single NWP model run and an archive of previous model runs and observed values to correct the current model run (See 2 for more details). AnEn has become a very popular method for correcting forecasts due to its robustness and computational resources efficiency and it has been applied to different projects by National Center for Atmospheric Research (NCAR) in the recent years (Cervone et al. 2017). Although AnEn significantly reduced computational resources, it requires keeping the whole historical datasets of the model and corresponding observed values in the memory. The specific implementation of AnEn is provided by the Parallel Analog Ensemble package (Hu et al. 2019). This package has been developed during the course of the past 3 years and successfully applied to temperature forecasts for Eastern United States (Hu and Cervone 2019; Balasubramanian et al. 2018).

However, for newer computer architectures, the amount of memory capacity and memory bandwidth per core is diminishing. As shown in Mahapatra and Venkatrao (1999); Wulf and McKee (1995), the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory. This is mostly due to the division of the semiconductor industry into microprocessor and memory camps. Consequently, microprocessor performance has been improving at a rate of 60 percent per year, whereas the access time to DRAM has been improving at less than 10 percent per year.

As a result, we hypothesized that a similar analog method can be applied using the probability distribution function (PDF) of the historical dataset instead of using the actual data. Indeed, we applied a generative machine learning model to generate the analogous states of atmosphere based on current NWP model forecast. The application of different machine learning methods in Geoscience has been explored in other studies. For instance, Convolutional Neural Network (CNN) models have been used to analyze satellite images (Li et al. (2017)) and Recurrent Neural Network (RNN) models for air quality predictions (Athira et al. (2018), Roozitalab et al. (2019)). However, the use of generative machine learning models has been not studied for doing atmospheric probabilistic forecasts. In this study, we apply generative machine learning models for the first time as a new technique of atmospheric probabilistic forecasting. Specifically, probabilistic characteristics of Conditional Variational Autoencoder (CVAE) is utilized in this study to learn the multivariate probability distribution of historical observed dataset based on historical modeled dataset and it can be used to produce ensemble forecasts.

This paper is organized as follows: Section 2 summarizes the theory behind the Analog Ensemble method and its most important practical and computational aspects; Section 3 introduces the main concepts of deep generatives modelling, mathematical background of Variational Autoencoders and Conditional Variational Autoencoders; Section 4 explains how probabilistic forecasting can be performed using Conditional Variational Autoencoders and reports the main technical issues encountered during the process; Section 5 describes the experimental setting; Section 6 reports the results obtained by the experiments and describes the metric used to evaluate and compare the performance of the CVAE against the AnEn; Section 7 provides more insights about the results and shows the memory footprint and runtime for probabilistic forecasting using CVAE; finally, Section 8 concludes the paper describing the potential uses of producing probabilistic forecasting with deep generative models and proposing new ideas for future works.

## 2 Analog Ensemble

The Analog Ensemble (AnEn) method tries to estimate the probability distribution of the future state of the atmosphere with a set of past observations that correspond to the best analogs of a deterministic \glsNWP. In particular, \glsAnEn seek to estimate the probability distribution of the observed value of the predictand variable given a model prediction, which can be represented as in Eq. 1

(1) |

where, at a give time and location, y is the observed future value of the predictand variable, and contains the values of k predictors from the deterministic model prediction at the same location.

The \glsAnEn method generates samples of via three main steps using a history of cases, called the analog training period, in which both the \glsNWP (deterministic forecast) and the verifying observations are available. Step 1 consists in searching and selecting the best-matching historical forecasts for the current prediction, these forecasts are called analogs. An analog may come from any past date within the training period. Step 2 consists in getting each analog’s verifying observation. Step 3 consists in grouping the selected observations, creating the final ensemble prediction for the current forecast.

For Step 1, the quality of an analog is determined by a metric, which is usually represented by the following equation presented in Delle Monache et al. (2011):

(2) |

where is the current NWP deterministic forecast valid at the future time t at a station location; is an analog at the same location and with the same forecast lead time but valid at a past time ; and are the number of physical variables used in the analogs search and their weights, respectively; is the standard deviation of the time series of past forecasts of a given variable at the same location and forecast lead time; is equal to half the number of additional times over which the metric is computed; and and are the values of the forecast and the analog in a time window for a given variable.

During this whole three-step process, the entire historical dataset of past forecasts and observations needs to be kept in memory. Furthermore, the metric expressed in eq. 2 used in Step 1 has to be applied to each and every forecast of the historical dataset in order to select the best analogs. Although the memory and computing costs of the \glsAnEn method are less expensive than running a numerical weather prediction model, for a large number of stations and large datasets, the amount of memory required for \glsAnEn can easily become prohibitive.

## 3 Deep Generative Models

A generative model describes how a dataset is generated, in terms of a probabilistic model. The final goal is to sample from the generative model in order to generate new data which is consistent with the original model. A generative model must be probabilistic rather than deterministic. In fact, a generative model should not produce the same output given the same input. For this reason, a generative model should include a stochastic element that drives the generation of new samples.

In machine learning, the most common task performed by models is discrimination, either in the form of classification (e.g. is this a picture of a cat or not?) or regression (e.g. what’s the prediction for tomorrow’s temperature?). In both cases, a set of features is given as input to the model and the output is either a classification value (label) or a prediction. In generative models, the model is asked to learn how samples of a certain class look like and to produce new ones based on a stochastic value. In a more formal way, discriminative models estimate , meaning the probability of having the label given the input .

Generative models on the other hand, try to estimate : the probability of observing the input . If the dataset is labeled, a generative model estimates the distribution . This distribution represents the probability of having a set of features given a label . In other words, discriminative models attempt to estimate the probability that an observation belongs to category y. Generative models do not care about labeling observations. Instead, they try to estimate the probability of seeing the observation at all. Assuming to be able to build a perfect discriminative model to identify cats vs non-cats, the model would still have no idea how to create a picture (or features like weight and height) that looks like a cat. It can only output probabilities against existing images, as this is what it has been trained to do. We would instead need to train a generative model, which can output sets of pixels that have a high chance of belonging to the original training dataset.

### 3.1 Variational Autoencoder

One of the first papers about deep generative models was published by Kingma and Welling (2013), which laid the foundations of the well-known deep neural network called \glsVAE. The mathematical basis of Variational Autoencoders has relatively little to do with the traditional autoencoders. In fact, \glsVAE are called ”autoencoders” only because of their architecture, composed by an encoder and decoder, resembles a traditional autoencoder. In this work, we only provide a high-level view to justify the application of \glsVAE for probabilistic forecasts. A very detailed and yet understandable explanation of \glsVAE is provided by Doersch (2016).

As mentioned in Sec. 3, a deep generative model tries to approximate the true distribution of the inputs using a deep neural network.

(3) |

The distribution is expressed in eq. 3, where are the parameters of the distribution determined during training. In machine learning, it is important to find equation 4, a joint distribution between the inputs and the latent variables . The latent variables are derived by the encoder of the \glsVAE and they represent properties observable from inputs. This sort of learning is called representation learning and it is based on the idea that each point in the latent space (having less dimensions than the original one) is the representation of some high-dimensional sample, as explained in Bengio et al. (2012). In other words, eq. 4 is a distribution of input data points and their attributes.

(4) |

(5) |

which means considering all of the possible attributes in order to describe the input. This problem is not tractable, because eq. 5 cannot be solved analytically. By using the Bayes theorem, equation 6 represents an alternative expression for eq. 5:

(6) |

The goal of the \glsVAE is to find a tractable distribution that estimates .

In order to make tractable, \glsVAE makes use of an ”encoder” able to generate the distribution defined in the following equation:

(7) |

can be approximated by a neural network and it is chosen to be a multivariate Gaussian, as expressed in the following equation:

(8) |

where and are computed by the encoder using the input dataset.

generates the latent vector z from the input x and it represents the encoder of the \glsVAE. On the other hand, generates the input from the latent vector z and thus it acts as decoder. Our final goal is still to estimate .

In order to quantify the distance between and , the \glsKL divergence is used. \glsKL determines the distance between our two conditional densities, as expressed in the following equation:

(9) |

By using the Bayes theorem and rearranging a few terms, equation 9 becomes:

(10) |

Equation 10 represents the core equation of the \glsVAE. On the left-hand side, the term is the term we are trying to estimate minus the error imposed by : the approximation of the real . When the approximation of is good, the KL distance goes to zero. This first part of the equation represents the encoder of the \glsVAE. On the right-hand side, represents the decoder of the \glsVAE, whereas the DL distance represents the loss function from the Gaussian distribution expressed in eq. 8.

The left-hand side of eq. 10 is also known as \glsELBO, and because the KL divergence is always positive, it represents a lower bound for . When the encoder and decoder of the VAE are trained together, the \glsELBO is maximized; meaning that the KL distance on the left-hand side of the equation goes to zero (thus the encoding of x in z is getting better) and that on the right-hand side is maximized (thus the decoder is getting better in reconstructing x from the latent representation z).

The right-hand side of eq. 10 has two important parts of the final loss function of the \glsVAE: the decoder part takes samples from the output of the encoder to reconstruct the inputs. Maximizing this term means minimizing the reconstruction loss . The second part is straightforward to evaluate, thanks to the Gaussian nature of , becoming:

(11) |

where is the dimension of . This quantity is called KL Loss, represented as .

Summarizing, the loss function of the \glsVAE is .

Although theoretically solid, in practice this approach usually leads to a degenerate solution where ; meaning that the variational posterior does not depend on the data (x and z are basically independent). In other words, the model does not learn a good representation of the data. This problem is known as KL-vanishing and it will be covered in Sec. 4.1.1.

An alternative way to see the \glsELBO is as a regularized version of the regular autoencoder. In particular, represents the reconstruction loss of the regular autoencoder and the regularization. From this point of view, it is natural to introduce a new hyperparameter able to control the strength of the regularization, leading to the following equation:

(12) |

By changing the value of from 0 to 1 (or greater than 1) we can nullify (transforming the VAE in an AE) or force more the posterior on the latent representation. As explained by Fu et al. (2019), by cyclically assigning values between 0 and 1 to it is possible to mitigate the KL-vanishing problem.

### 3.2 Conditional Variational Autoencoder

In a \glsVAE, if the latent space is randomly sampled, the \glsVAE has no control over which the kind of data to generate. For example, in a \glsVAE trained to generate hand written digits over the MNIST dataset, there is no control over which digit should be produced.

The approach proposed in Sohn et al. (2015), called \glsCVAE, solves this problem by imposing a condition on both encoder and decoder inputs. Equation 10 is modified to include the condition in each term as follows:

(13) |

Similar to a classic \glsVAE, a \glsCVAE tries to estimate the quantity expressed in the following equation:

(14) |

which would allow us to generate samples from the distribution of the input data under a certain condition.

## 4 Proposed Approach

The main focus of this work is to generate probabilistic forecasting without maintaining the entire dataset of forecasts and observations in memory. Generative models are a great fit for learning the distribution of the original dataset and generating new data. A valid way to generate probabilistic forecasting using deep generative models is to transform the \glsAnEn method from an instance-based ( Aha et al. (1991)) to a generative-based method.

Equation 2 represents the probability distribution of the observed value of the predictand variable given a model prediction , composed by various predictors. This equation can be easily seen as equation 14 of a \glsCVAE, where the condition is the forecast coming from the deterministic \glsNWP and is the observation coming from the input data.

For a correctly and successfully trained \glsCVAE, we can perform probabilistic forecasting without keeping in memory the whole dataset and without performing any search of the best-matching analogs. This is because the \glsCVAE has been trained to generate samples that already match as best as possible the distribution of the input data, conditioned to the forecast provided as condition.

As mentioned in Sec. 1, the AnEn method is very popular for correcting forecasts produced by a numerical weather model. In this work, we focus on correcting only the wind speed by using four predictors: wind speed, wind direction (expressed in radians), 2 meter temperature and pressure.

### 4.1 CVAE Training

Given a dataset , composed by observations paired with forecasts produced by a \glsNWP model, the training phase for a \glsCVAE can be performed by passing the observations as the data to be generated by the model, and the forecasts as the conditions. In formulas, becomes .

Fig. 1 shows the structure of the CVAE during training: the variables that we want to generate, coming from the observations, are the inputs of the CVAE (e.g. T_obs represents the observed temperature). The condition is the forecast produced by the \glsNWP model, associated with the observations given as input (e.g. Ws_for stands for Wind Speed forecast). The same condition is given to the decoder during the training phase, this allows the decoder to tie a specific condition with a certain latent representation). The output of the decoder layer is the reconstructed features given in input.

#### 4.1.1 KL-Vanishing Problem and Cyclic Annealing Scheduling

Training a CVAE with traditional methods can be challenging, in particular because of a notorious issue already experienced by Bowman et al. (2015); Yang et al. (2017): the decoder ignores the latent variable, yielding what is termed the KL-vanishing or latent variable collapse problem ( Dieng et al. (2018)). As mentioned in Sec. 3.1, it is possible to mitigate the KL-vanishing problem by cyclically changing the value of during the training as explained in details by Fu et al. (2019).

The KL-vanishing problem is supposed to be related to the low quality of z at the beginning of the decoder training. As a result, the model is forced to learn an easier solution by relying only on the previous samples of x without relying on z at all.

During the training of our CVAE with fixed to 1, we observed the KL-vanish problem since the very first epochs. As mentioned by Sønderby et al. (2016) and Bowman et al. (2015), setting for the first epoch and then fix it to 1 for the rest of the epochs made a huge difference in the final results. In fact, this approach called Warm-up transforms the CVAE in a regular autoencoder for the first epoch, creating a more meaningful z than random values to be used by the final CVAE when is set to 1 (monotonic annealing schedule). The cyclic annealing scheduling proposed by Fu et al. (2019) produced even better results for our CVAE. This approach proposes to cyclic several times between and during the epochs. By doing so, the values of z is very close to at the beginning of the training, which is very meaningful.

For our CVAE, we obtained the best results by using a cyclic annealing scheduling including values of . In particular, we adopted the cyclic scheme shown in Figure 2 where only one epoch was executed by setting and 0.5; for , 2.0 and 4.0 we executed 50 epochs in each case.

### 4.2 CVAE Inference

Inference or sample generation in VAE and CVAE is performed only by the decoder. In order to generate a new data point, z-dimension random numbers sampled by a Normal distribution with mean 0 and variance 1 represent the vector z in the latent space. This vector of random numbers is given as input to the decoder along with the forecast as condition of the CVAE.

Figure 3 shows the decoder architecture able to generate analogs according to the condition ”Ws_for”.

This simple two-step method allows us to make probabilistic forecast in constant time and memory.

## 5 Experiments

As the case study in this work, we seek to correct the forecasted wind speed value using observational dataset. As a result, we feed NWP forecasted wind speed to the model as the condition layer. Delle Monache et al. (2013) found that 10-m wind speed (Ws) and direction (Wd), 2-m temperature (T), and surface pressure (P) were the most important parameters in AnEn for 10-m wind speed prediction. As a result, the same variables have been used in this study for input and output layers.

A 10*5 box in NAM dataset covering State College, PA, is chosen for running all the tests. As forecast dataset, we used data from The North American Mesoscale Forecast System (NAM) (Janjic (2003)) which is one of the major operational forecasting systems maintained by NOAA(NOAA (2019)). NAM initiates every day at 00 UTC and predicts meteorological status of the next 84 hours. The outputs are saved in hourly format for the first 36 hours and 3-hourly for the next 48 hours; in other words, 53 leading times. On the other hand, NOAA uses real observed data such as satellite, aircraft, and ground measurements data and by assimilating them into NAM forecasts, it provides another dataset called NAM Analysis (NOAA (2019)). NAM Analysis usually has data for 16 hours per day and we used this dataset as the observation dataset in this study.

Although longer periods have been suggested for probabilistic forecasts (Delle Monache et al. (2013), our analysis shows that NAM model configurations have been updated roughly every year. Moreover, consistent data are required for training machine learning models. As a result for training the model, NAM forecasts and analysis data for 365 days between 2017-06-01 and 2018-06-01 are used for training. A 7-days validation period between 2018-06-01 and 2018-06-08 has been used for tuning and analyzing the probabilistic performance of the model. Different configurations in terms of number of layers and number of neurons has been tested and a configuration with reasonable size and performance has been selected as the proposed model (CVAE). Moreover, the performance of the proposed model with bigger datasets has been evaluated and discussed in discussion section.

## 6 Results

The focus of this paper is on generating probabilistic prediction of wind speed. As a result, the following evaluations are based on the generated (for CVAE) and analogs (for AnEn) values of wind speed. The performance of the model for other output variables is discussed in Section 7. In order to evaluate the performance of the ensemble forecasts generated by the CVAE and compare the results with the AnEn ensemble analogs, we have used the metrics for reliability and probability scores in ensemble forecasting. Each of the 7 days composing the testset has predictions for the next 84 hours; the following results are based on the average of these 7 days for each leading time. It should be mentioned that the observations data (NAM analysis) do not change for the first 4 leading times. This can be confusing for the model and also do not provide meaningful information and thus, we have excluded them from the dataset for both CVAE and AnEn. The CVAE trained model uses Gaussian noise in order to produce one forecast as explained in 4.2. However, 21-members ensemble forecast need 21 forecasts. As a result, the CVAE has to be used for inference 21 times per each forecast, using new generated noise sampled from a normal distribution (N(0,1)).

### 6.1 Dispersion and Rank Histogram

In probabilistic forecasting, a reliable forecast seeks to have the same relative frequency as the probability of the forecasted value (Delle Monache et al. (2013)). This means that the reliability can be expressed as the conditional probability of the observed value given that the forecasted value actually happens (Jolliffe and Stephenson (2003)):

(15) |

In eq. 15, Y and k are the observation random variable and observed value, F and p are the forecast random variable and predicted value. In ensemble probabilistic forecasts, two criteria are usually studied in terms of reliability (Sperati et al. (2017); Cervone et al. (2017); Alessandrini et al. (2015)): Rank Histogram (RH) and Dispersion.

Dispersion is one of the simplest ways to get an insight of the reliability of an m-member ensemble forecast (Jolliffe and Stephenson (2003) and references therein). Dispersion relates that Mean Squared Error (MSE) and mean ensemble variance should be close to each other (with the ratio of m+1/m) in a reliable forecast. In this study, a bootstrap method with 1000 samples has been applied for mean dispersion and CRPS (see section 6.2) in order to consider uncertainty. Figure 4 depicts 50-stations averaged MSE (mean dispersion) for each leading time averaged during testing period for CVAE and AnEn. AnEn has lower all-leading-times averaged MSE which suggests its better prediction performance compared to CVAE. On the other hand, CVAE Mean dispersion oscillates more in different leading times whereas the mean ensemble variance (spread) does not while AnEn dispersion plot shows that MSE and spread have similar trends during leading times. However, the difference between mean dispersion and spread for both CVAE (0.05) and AnEn (0.10) are very close to each other. This suggests that both models are providing similar reliable probabilistic characteristics in terms of dispersion.

In order to get further insights on the reliability of two models, RH has been studied, too. In order to calculate RH, an observed value is ranked based on its corresponding ensemble members and the histogram after giving ranks to all the observed values is plotted (Jolliffe and Stephenson (2003)). The processes of plotting RH is as follows: in this study, we have 21 predicted values (21-members ensemble) and one observed actual value per forecast; for a total of 22 values. These values are sorted in ascending order and the rank of the observed value is found. After doing the same task for all the forecasts, the histogram of ranks is plotted. For a reliable probabilistic ensemble forecast with m members, the ensemble members will be statistically indistinguishable with a constant ratio of 1/m+1 (0.045 in this study) (Jolliffe and Stephenson (2003)).

Figure 8 shows the RH for AnEn and CVAE for the 7-days testing period considering all the stations. These plots show that although both models are having flat frequencies in some portions of the histograms, neither CVAE nor AnEn are perfect reliable forecasts. Specifically, RH results for AnEn show that there is a skew of the predictions towards lower than the observed (real) value. On the other hand, right-skew of CVAE results indicate over prediction of ensemble members. However, roughly distributed RH shows the probabilistic performance of CVAE model. Overall, the range of the frequencies suggests that both models are reliable.

It should be mentioned that Delle Monache et al. (2013) and Sperati et al. (2017) found very uniform RH for wind speed prediction using AnEn which can be related to the size of the historical dataset (Delle Monache et al. (2013)). However, for this study only one year of data was used over a period where the NAM model was not changed. Since the CVAE does not use the actual data and instead uses the characteristics of the data for forecasting, it is crucial that used data during learning process have been produced by one single model configuration. The AnEn method is not very sensitive to these configuration modifications thanks to its instance-based nature, but this is not true for the CVAE with current configuration. In fact, by using forecasts produced by different versions of NAM, the CVAE is exposed to different distributions during the training.

### 6.2 Probability Score

A number of probabilistic metrics are used to evaluate the performance of a forecast using scores (Wilks (2011)). Verification statistics are designed to assess the reliability and skill of a forecast (Jolliffe and Stephenson (2003)). The Continuous Ranked Probability Score (CRPS) compares the cumulative distribution function (CDF) of the forecast and observation for all the possible outcome values (Jolliffe and Stephenson (2003).

(16) |

In eq. 16, S represents the CRPS scoring rule, F is the cumulative distribution function, H(a), the Heaviside function, is 1 if a 0 and zero otherwise, and o is the observed value. However, for obtaining the averaged CRPS, the expectation of the above equation has to be calculated using the concept of decomposition which is explained in detail in Jolliffe and Stephenson (2003) and references therein. Figure 6 shows CRPS for CVAE and AnEn at each leading time with bootstrapping. CRPS results are lower for absolute values of AnEn at almost all the leading times compared to CVAE. However, both models are showing similar trends. Specifically, the confidence intervals of both models overlap in some of the lead times, but also do not in others. When the confidence intervals overlap, we can assume that the models are statistically similar.

## 7 Discussion

The probabilistic verification metrics of CVAE suggest that consistent and reliable predictions of wind speed can be made using the proposed method. While these metrics alone capture only some of the characteristics of good probabilistic predictions, they suggest that the method proposed can be used to generate reliable analogs.

Although the AnEn displays better performance, the trade-off is the size of the data required and the time required for making the forecasts. Recall that CVAE generates ensemble members using the compact representation learned from the datasets, and thus is much more efficient in computational terms.

As discussed in Section 4, we used wind speed as the condition for the CVAE and generated ensembles for 2-m temperature, surface pressure, wind direction, and wind speed in the decoder block. The performance of the proposed approach for probabilistic forecasting of wind speed is discussed in detail in Section 6. In order to provide a more comprehensive evaluation of the observations generated by the CVAE, in Figure 8 we shows the RH for wind direction, pressure and 2-m temperature for CVAE and AnEn.

AnEn produces more reliable wind direction values compared to CVAE, whereas CVAE has better performance for surface pressure. Moreover, the results for 2m temperature show that the CVAE is unable to capture the probabilistic behavior whereas AnEn has better performance. The reason for relatively poor performance of the CVAE for variables other than the desired variable can be related to the design of the model. In other words, only the desired variable (wind speed) is feed to the model as the condition and the model optimizes the results only for that variable. During our tests we observed that providing the whole set of forecast variables as condition during training and inference confuses the model and produces worse results than just providing the goal variable alone (wind speed).

However, the main motivation for this paper was to show the efficiency of generative machine learning models in probabilistic forecasts in terms of computations and memory footprint. Since the results confirms the probabilistic performance of CVAE, we studied its efficiency in using computational resources. Specifically, memory usage and predictions runtime has been analyzed.

Another advantage of the proposed CVAE solution is that the model needs to be trained only once. As more data becomes available, incremental training is still possible (encoder block must be saved to retrain CVAE) and useful; in fact, new cases (in particular if extreme cases) can dramatically change the data distribution and final prediction.

Moreover, the training of the model is an offline process. In other words, once the model is trained, it can be used for predictions without keeping the original data. As a result, the memory that will be used in the prediction phase is the size of the model architecture and its weights. Hence, as shown in fig. 7, the memory usage for using CVAE does not depend on the size of the historical dataset.

The size of the model will not change since the architecture (number of layers and neurons) of the model remains constant. In contrast, the memory usage for AnEn depends on the amount of the data to be kept in memory. Specifically, AnEn requires two datasets (pairs of historical forecasts and observations) to be able to find the desired ensemble members; in the case of 1 year the memory needed is about 30 MB and it can reach up to about 900 MB for 30 years. On the other hand, the size of the CVAE models for all 50 stations used in this study is less than 7 MB and it does not change regardless of the size of the historical dataset used to train the model. It should be mentioned that the size of the NAM testing data has not been considered since both models have to keep it in memory.

Using CVAE generally shows inferior results when compared to AnEn, but nevertheless within a small error margin. Given the very large advantage of CVAE in terms of memory and computational complexity, the trade-off between these advantage and the larger error become advantageous for a large class of problems.

Figure 9 shows the predicted runtime for CVAE and AnEn. The main advantage of CVAE becomes prominent with large datasets. In fact, while AnEn needs to linearly scan the entire dataset every time a new set of ensemble needs to be generated, CVAE generates them nearly instantaneously once the model has been trained. The training of the model is a computationally expensive operation, however it is performed only once. Therefore, for large datasets (in our experiments large corresponds to larger than roughly one year of data), or in cases when the analogs must be generated in real time, CVAE provides a faster and more efficient solution, with results that while are inferior of AnEn, are still very good and can be used in many problems to provide a measure of uncertainty.

## 8 Conclusion

The use of specialized hardware able to perform very efficiently only specific tasks (e.g. deep learning) seems to be the best way to face the challenges imposed by the exascale era. Furthermore, the memory bottleneck on modern computing architectures is more than ever a limiting factor, mostly because the rate of improvement in microprocessor speed exceed the rate of improvement in DRAM memory. In order to take advantages of the new specialized hardware and limit the penalty imposed by memory speed, portions of current scientific codes should be rewritten and new algorithms should be developed.

In this work, we accomplished this by transforming a well established method for probabilistic forecasting, the Analog Ensemble method, from a traditional programming style to a deep learning approach. The approach proposed in this work not only provides a more efficient way to perform probabilistic forecasting using the specialized hardware for deep learning, but also shows how general generative models can be used effectively to generate probabilistic forecasting. Both these contributions represent make this paper a unique and innovative work.

The results show that despite the better performance of the AnEn method, the Conditional Variational Autoencoder used in this work produces reliable probabilistic forecasts using a fraction of the time and memory needed by the AnEn. Thanks to the low memory requirements and the advantage of using new specialized hardware, this new approach represents a valid option for edge-computing probabilistic forecasting. Such a new possibility provides new opportunities to a large amount of safety-critical applications in conditions where power is very limited and no internet connection is available (e.g. catastrophic scenarios).

In the future, we plan to explore other generative models (e.g. Deep Belief Networks, Generative Adversarial Networks) to perform better probabilistic forecasting.

## 9 Acknowledgments

We wish to thank Dr. S. Alessandrini for providing his code to generate the confidence intervals for the verification statistics. Furthermore, we thank Summer Internships in Parallel Computational Science (SIParCS) at National Center for Atmospheric Research (NCAR) supported by National Science Foundation (NSF) as B.R. was financially supported to work on this research.

## References

## References

- NOAA (2018) NOAA, Billion-dollar weather and climate disasters, 2018.
- Lorenz (1969) E. N. Lorenz, Atmospheric Predictability as Revealed by Naturally Occurring Analogues, 1969. doi:10.1175/1520-0469(1969)26<636:APARBN>2.0.CO;2.
- Hopson (2014) T. M. Hopson, Assessing the ensemble spread–error relationship, Monthly Weather Review 142 (2014) 1125–1142.
- Foley et al. (2012) A. M. Foley, P. G. Leahy, A. Marvuglia, E. J. McKeogh, Current methods and advances in forecasting of wind power generation, Renewable Energy 37 (2012) 1–8.
- Jablonowski (2004) C. Jablonowski, Adaptive grids in weather and climate modeling, Ph.D. thesis, University of Michigan, 2004.
- Sperati et al. (2016) S. Sperati, S. Alessandrini, L. Delle Monache, An application of the ecmwf ensemble prediction system for short-term solar power forecasting, Solar Energy 133 (2016) 437–450.
- Schmidt et al. (2017) G. A. Schmidt, D. Bader, L. J. Donner, G. S. Elsaesser, J.-C. Golaz, C. Hannay, A. Molod, R. Neale, S. Saha, Practice and philosophy of climate model tuning across six us modeling centers, Geoscientific model development 10 (2017) 3207.
- Syrakos et al. (2012) A. Syrakos, G. Efthimiou, J. G. Bartzis, A. Goulas, Numerical experiments on the efficiency of local grid refinement based on truncation error estimates, Journal of Computational Physics 231 (2012) 6725–6753.
- Fant et al. (2016) C. Fant, C. A. Schlosser, K. Strzepek, The impact of climate change on wind and solar resources in southern africa, Applied Energy 161 (2016) 556–564.
- Delle Monache et al. (2011) L. Delle Monache, T. Nipen, Y. Liu, G. Roux, R. Stull, Kalman Filter and Analog Schemes to Postprocess Numerical Weather Predictions, Monthly Weather Review 139 (2011) 3554–3570.
- Cervone et al. (2017) G. Cervone, L. Clemente-Harding, S. Alessandrini, L. Delle Monache, Short-term photovoltaic power forecasting using Artificial Neural Networks and an Analog Ensemble, Renewable Energy 108 (2017) 274–286.
- Hu et al. (2019) W. Hu, G. Cervone, L. Clemente-Harding, M. Calovi, Parallel analog ensemble, 2019. doi:10.5281/zenodo.3384321.
- Hu and Cervone (2019) W. Hu, G. Cervone, Dynamically optimized unstructured grid (doug) for analog ensemble of numerical weather predictions using evolutionary algorithms, Computers & Geosciences (2019).
- Balasubramanian et al. (2018) V. Balasubramanian, M. Turilli, W. Hu, M. Lefebvre, W. Lei, R. Modrak, G. Cervone, J. Tromp, S. Jha, Harnessing the power of many: Extensible toolkit for scalable ensemble applications, in: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2018, pp. 536–545.
- Mahapatra and Venkatrao (1999) N. R. Mahapatra, B. Venkatrao, The processor-memory bottleneck: Problems and solutions, XRDS 5 (1999).
- Wulf and McKee (1995) W. A. Wulf, S. A. McKee, Hitting the memory wall: Implications of the obvious, SIGARCH Comput. Archit. News 23 (1995) 20–24.
- Li et al. (2017) T. Li, H. Shen, Q. Yuan, X. Zhang, L. Zhang, Estimating ground-level pm2. 5 by fusing satellite and station observations: A geo-intelligent deep learning approach, Geophysical Research Letters 44 (2017) 11–985.
- Athira et al. (2018) V. Athira, P. Geetha, R. Vinayakumar, K. Soman, Deepairnet: Applying recurrent networks for air quality prediction, Procedia computer science 132 (2018) 1394–1403.
- Roozitalab et al. (2019) B. Roozitalab, M. Zhou, Y. Shen, L. Castro Garcia, Forecasting the pm2.5 concentration using a deep learning approach (case study: Los angeles), in: RMACC Symposium, 2019.
- Delle Monache et al. (2011) L. Delle Monache, T. Nipen, Y. Liu, G. Roux, R. Stull, Kalman filter and analog schemes to postprocess numerical weather predictions, Monthly Weather Review 139 (2011) 3554–3570.
- Kingma and Welling (2013) D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, arXiv e-prints (2013) arXiv:1312.6114.
- Doersch (2016) C. Doersch, Tutorial on Variational Autoencoders, arXiv e-prints (2016) arXiv:1606.05908.
- Bengio et al. (2012) Y. Bengio, A. Courville, P. Vincent, Representation Learning: A Review and New Perspectives, arXiv e-prints (2012) arXiv:1206.5538.
- Fu et al. (2019) H. Fu, C. Li, X. Liu, J. Gao, A. Çelikyilmaz, L. Carin, Cyclical annealing schedule: A simple approach to mitigating KL vanishing, CoRR abs/1903.10145 (2019).
- Higgins et al. (2017) I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, A. Lerchner, beta-vae: Learning basic visual concepts with a constrained variational framework, in: ICLR, 2017.
- Kim and Mnih (2018) H. Kim, A. Mnih, Disentangling by Factorising, arXiv e-prints (2018) arXiv:1802.05983.
- Chen et al. (2018) R. T. Q. Chen, X. Li, R. Grosse, D. Duvenaud, Isolating Sources of Disentanglement in Variational Autoencoders, arXiv e-prints (2018) arXiv:1802.04942.
- Sohn et al. (2015) K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems 28, Curran Associates, Inc., 2015, pp. 3483–3491.
- Aha et al. (1991) D. W. Aha, D. Kibler, M. K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1991) 37–66.
- Bowman et al. (2015) S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, S. Bengio, Generating sentences from a continuous space, CoRR abs/1511.06349 (2015).
- Yang et al. (2017) Z. Yang, Z. Hu, R. Salakhutdinov, T. Berg-Kirkpatrick, Improved variational autoencoders for text modeling using dilated convolutions, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, JMLR.org, 2017, pp. 3881–3890.
- Dieng et al. (2018) A. B. Dieng, Y. Kim, A. M. Rush, D. M. Blei, Avoiding Latent Variable Collapse With Generative Skip Models, arXiv e-prints (2018) arXiv:1807.04863.
- Sønderby et al. (2016) C. Sønderby, T. Raiko, L. Maaløe, S. Sønderby, O. Winther, How to train deep variational autoencoders and probabilistic ladder networks, in: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.
- Delle Monache et al. (2013) L. Delle Monache, F. A. Eckel, D. L. Rife, B. Nagarajan, K. Searight, Probabilistic Weather Prediction with an Analog Ensemble, Monthly Weather Review 141 (2013) 3498–3516.
- Janjic (2003) Z. Janjic, A nonhydrostatic model based on a new approach, Meteorology and Atmospheric Physics 82 (2003) 271–285.
- NOAA (2019) NOAA, North american mesoscale forecast system, 2019.
- Jolliffe and Stephenson (2003) I. T. Jolliffe, D. B. Stephenson, Forecast Verification: A Practitionar’s Guide in Atmospheric Science, Wiley, 2003.
- Sperati et al. (2017) S. Sperati, S. Alessandrini, L. Delle Monache, Gridded probabilistic weather forecasts with an analog ensemble, Quarterly Journal of the Royal Meteorological Society 143 (2017) 2874–2885.
- Alessandrini et al. (2015) S. Alessandrini, L. D. Monache, S. Sperati, J. Nissen, A novel application of an analog ensemble for short-term wind power forecasting, Renewable Energy 76 (2015) 768 – 781.
- Wilks (2011) D. Wilks, Statistical Methods in the Atmospheric Sciences, Elsevier, 2011.