# On the generation of probabilistic forecasts from deterministic models

###### Abstract

Most of the methods that produce space weather forecasts are based on deterministic models. In order to generate a probabilistic forecast, a model needs to be run several times sampling the input parameter space, in order to generate an ensemble from which the distribution of outputs can be inferred. However, ensemble simulations are costly and often preclude the possibility of real-time forecasting.

We introduce a simple and robust method to generate uncertainties from deterministic models, that does not require ensemble simulations. The method is based on the simple consideration that a probabilistic forecast needs to be both accurate and well-calibrated (reliable). We argue that these two requirements are equally important, and we introduce the Accuracy-Reliability cost function that quantitatively measures the trade-off between accuracy and reliability. We then define the optimal uncertainties as the standard deviation of the Gaussian distribution that minimizes the cost function.

We demonstrate that this simple strategy, implemented here by means of a regularized deep neural network, produces accurate and well-calibrated forecasts, showing examples both on synthetic and real-world space weather data.

Space Weather

Center for Mathematics and Computer Science (CWI), Amsterdam, The Netherlands Laboratory for Atmospheric and Space Physics, University of Colorado, Boulder, CO, USA Space Sciences Laboratory, University of California Berkeley, Berkeley, CA, USA Department of Atmospheric and Oceanic Sciences, University of California, Los Angeles, CA, USA

Enrico Camporealee.camporeale@cwi.nl

We introduce a new method to estimate the uncertainties associated with single-point outputs generated by a deterministic model

The method ensures a trade-off between accuracy and reliability of the generated probabilistic forecasts

The method is computationally inexpensive, avoiding the costs associated with ensemble simulations, and is model independent. The only inputs needed are the observed errors between predictions and observations of the deterministic model.

## 1 Introduction

The US National Space Weather Action Plan released in October 2015 has fueled interest in so-called Operations-to-Research (O2R) activities, which are now explicitly funded by NASA and NOAA programs.
An important element of O2R is the enhancement of existing operational models and products with fundamental research. A major weakness of most of the state-of-the-art forecasting models used by national Space Weather agencies is that they are essentially deterministic. For any given set of input parameters, they output a single-point estimate, without providing information on the uncertainty associated with such an estimate.
On the other hand, the Space Weather community is gradually recognizing the importance of probabilistic forecasts, which have been the standard in meteorological weather forecast for many years.
Indeed, several probabilistic forecasting models have been proposed in the last few years, concerning solar energetic particles (Kahler and Ling, 2015; Aminalragia-Giamini et al., 2018), geomagnetic indexes (McPherron and Siscoe, 2004; Zhang and Moldwin, 2014; Riley and Love, 2016; Chandorkar et al., 2017; Gruet et al., 2018), GPS scintillation (Prikryl et al., 2012), solar flares (Gallagher et al., 2002; Barnes et al., 2007; Wheatland, 2004; Lee et al., 2012; Bloomfield et al., 2012; Papaioannou et al., 2015), solar wind speed (BussyâVirat and Ridley, 2014; Owens and Riley, 2017; Napoletano et al., 2018), and relativistic electron fluxes (Miyoshi and Kataoka, 2008), among others.

As pointed out in Murray et al. (2017), most operational space weather forecasting centers worldwide still rely on human forecasters to adjust the issued probability of a given event, based on experience. Yet, a recent verification of geomagnetic storm and X-ray flare forecasts issued by the Met Office Space Weather Operations Centre (MOSWOC), has reported that these forecasts struggle to provide a better prediction than a reference model, and tend to overforecast events (Sharpe and Murray, 2017). Moreover, comparing eleven different methods to predict flares, Barnes et al. (2016) concluded that no participating method proved substantially better than climatological forecasts, for M-class flares and above (a climatological model is one where a long-term average of the quantity of interest is taken as forecast).

There are two major approaches in producing a probabilistic model. The first way is to apply a statistical method on historical records, trying to correlate some input parameters with the forecast output. Little or no physics assumptions enter in such models (other then maybe a judicious choice of input parameters based on physics). For instance, modern machine learning algorithms, often referred to as black-box models, fall in this category (Ghahramani, 2015; Murphy, 2012; Camporeale et al., 2018a, b).
A second way of producing a probabilistic forecast is based on the use of physics-based models, which range from (almost) first-principle simulations (e.g., Luhmann et al., 2017), to semi-empirical models (e.g., Möstl et al., 2017). These white-box models are typically deterministic, meaning that they return a single solution for any given set of inputs provided. How to assign a probabilistic interpretation to such single-point estimates, in a computationally cheap way, is a challenging open problem which forms the core of a research area called non-intrusive Uncertainty Quantification (Smith, 2013). Non-intrusive refers to the fact that one employs a deterministic model (and its existing software), and performs an ensemble of simulations, without changing the underlying equations. It is then straightforward to extract a probabilistic description from the results ensemble. However, this is usually very expensive, and brings the two following problems. First, if the number of inputs is large, one encounters the infamous curse of dimensionality, namely the fact that the volume of an hypercube increases exponentially with the number of dimensions. Hence, sampling the input space with a tensorial grid (i.e. with a given number of points per dimension) quickly becomes unfeasible, because each grid point corresponds to a single run of a deterministic simulation. For this reason, sampling is often done in a Monte-Carlo fashion (or one of its modifications, such as Quasi-Monte-Carlo (Caflisch, 1998)), which is very robust but also very slow in achieving convergence. Not surprisingly, an active area of research focuses on the design of adaptive sampling algorithms that yield convergence faster than Monte-Carlo (Xiu and Karniadakis, 2002; Babuška et al., 2007; Xiu, 2010; Camporeale et al., 2017).
The second problem is that the distribution of outputs collected from the ensemble of simulations (the probabilistic forecast) is obtained by mapping, through the nonlinear simulation, the probability density that is assumed for the input parameters. Any misfit in the distribution of the inputs propagates to the distribution of outputs, producing misleading results. For this reason, an essential step of ensemble simulations is the calibration of the model (Kennedy and O’Hagan, 2001), that is the derivation of the distribution of the input parameters that is most consistent with observations. Calibration can itself be rather expensive, when it also relies on a large number of simulation runs.

In this paper we introduce a new method to derive a probabilistic forecast based on a deterministic model, that avoids the computational costs associated with collecting an ensemble, and with properly calibrating a computer simulation.
We focus on the prediction of a continuous real variable.
Hence, the method produces a probabilistic forecast in terms of a probability density function (pdf) for the quantity of interest. Moreover, we restrict ourselves to the case where such probability density is by construction Gaussian.

### 1.1 Accuracy and Reliability

This method is based on the simple consideration that a probabilistic forecast needs to be both accurate and reliable. This is in line with Gneiting et al. (2007), that have proposed to evaluate the performance of a forecast based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only. Note that in this paper we refer to calibration and reliability interchangeably, the former term typically being used in meteorological literature. Following the seminal paper by Murphy and Winkler (1992), accuracy is defined as the overall degree to which forecasts correspond to observations. It can be quantified introducing a proper scoring rule (Bröcker and Smith, 2007), whose examples are the Brier score for binary events (Brier, 1950), the Rank Probability Score for multi-category events, and its generalization for forecast of continuous variables, the Continuous Rank Probability Score (CRPS) (Hersbach, 2000; Wilks, 2011), that we will use here. Reliability is the property of a probabilistic model that measures its statistical consistency with observations. In particular, for forecasts of discrete events, the reliability measures if an event occurs on average with frequency , when it has been predicted to occur with probability . For example, consider a probabilistic, binary, meteorological model that predicts rain or no-rain. Take a large enough sample of predictions of ’70% chance of rain’. The model is said to be reliable/calibrated if approximately 70% of these predictions turned out to be true (i.e., it rained), and if this holds for all forecasted probabilities. The same concept can be extended to forecasts of a continuous scalar quantity by examining the so-called rank histogram (Anderson, 1996; Hamill, 1997, 2001) or the reliability diagram (Pinson et al., 2010). A reliability diagram represents, for any value of probability predicted for a given output, what is the actual observed frequency for that output (i.e., How many times did it rain, when 70% chance of rain was predicted? More (under-confident), less (over-confident), or exactly 70% of the time?). In the case of continuous variables, the reliability diagram is obtained with the following straightforward procedure. One collects a (large) number of pairs observations-predictions (the former being a real number, the latter a probability density). For each observation, one computes what was the probability that was assigned to the the outcome being less or equal than the observed outcome. In the case of Gaussian predictions, this is simply the cumulative distribution function , where is the observed outcome, is mean of the predicted normal distribution, and its standard deviation. Once the list of all these probabilities is computed, the empirical distribution function associated to such list represents the reliability diagram. Once plotted, the range of assigned probabilities (from 0 to 1) is on the horizontal axis, and the frequency with which events occur, for each given assigned probability, is on the vertical axis. A perfectly calibrated model results in the reliability diagram following a straight diagonal line, while over- or under-confident predictions lie respectively below or above the diagonal line.

In any decision-making scenario, reliability is as important as accuracy: a non-reliable model (either because over- or under-confident) introduces a systematic bias which is hard to account for. In summary, reliability gives a quantitative measure of how consistently trustworthy (reliable, in common language) a predictive model is.

### 1.2 Proposed strategy

Our method is very general and decoupled from any particular choice for the model that predicts the output targets, which can lie anywhere in the range from white to black-box models, as long as the quantity of interest is real and continuous. Indeed, in the following we will assume that such a model, whose details are not important, is provided. It is important to emphasize that the scope of this work is not to reduce the errors associated with the model, but to estimate the uncertainty of its output, thus generating a probabilistic forecast based on a deterministic model.
The probabilistic forecast is designed to be a Gaussian probability distribution centered around the values produced by the model. In this way, the only unknown quantity is the variance of the Gaussian distribution.
The simple strategy proposed here is to estimate this unknown variance (which is in general a function of the model inputs) by enforcing it to be a minimizer of a newly introduced cost function, which encodes a trade-off between accuracy and reliability, and that we call Accuracy-Reliability (AR) cost function.
As we will show, when interpreted as a function of the variance (or its square root, the standard deviation), for fixed errors (the difference between model output and observed values), accuracy and reliability are competing objectives. This gives rise to a two-objective optimization problem and the well-known Pareto curve (Branke et al., 2008). This curve defines a boundary on which any further optimization of one objective (e.g. a better accuracy) results in worsening of the other objective (e.g. a worse reliability).

An important consideration is that, because our method boils down to a multi-dimensional optimization problem, it could in principle be directly solved using a standard algorithm such as Newton’s or quasi-Newton’s method. This would however result in a non-smooth function between inputs and variance, and it would not be easily generalizable to unseen inputs. Therefore,
in order to be able to generalize the derivation of the optimal variance for any values of model inputs, and to ensure that the variance is to a certain degree a smooth function of the inputs, we introduce an Artificial Neural Network (ANN), that is trained on a given sample of model outputs, for which the ground truth is known (that is, the true output of interest, not the true variance, that remains a latent variable).
Hence, the method reduces to a straightforward implementation of an ANN that outputs the optimal variance that minimizes the AR. As a general strategy (and the one used in all our examples), one can use the same inputs used by the deterministic model in the neural network. However, if some prior information is known about latent variables, other inputs can be used as well.

The paper is organized as follows. Section 2 introduces the mathematical background, the Accuracy-Reliability cost function, and it explains the methodology to derive the unknown uncertainties. Section 3 demonstrate the use of our methods for synthetic data, and real-world examples relevant to space weather forecasting are presented in Sections 4 and 5.
Finally, conclusions are drawn in Section 6.

## 2 Methodology

In this Section we introduce and discuss the Continuous Rank Probability Score, which is widely used in many applications (Matheson and Winkler, 1976), and the new Reliability Cost for Gaussian forecasts.

### 2.1 Continuous Rank Probability Score

The Continuous Rank Probability Score (CRPS) is a generalization of the well-known Brier score (Wilks, 2011), used to assess the probabilistic forecast of continuous scalar variables, when the forecast is given in terms of a probability density function, or its cumulative distribution. CRPS is defined as {linenomath}

(1) |

where is the cumulative distribution (cdf) of the forecast, is the Heaviside function, and is the true (observed) value of the forecasted variable. CRPS is a negatively oriented score: it is unbounded and equal to zero for a perfect forecast with no uncertainty (deterministic). In this paper we restrict our attention to the case of probabilistic forecast in the form of Gaussian distributions. Hence, a forecast is simply given by the mean value and the variance of a Normal distribution. In this case and the CRPS can be calculated analytically (Gneiting et al., 2005) as {linenomath}

(2) |

Several interesting properties of the CRPS have been studied in the literature. Notably, its decomposition into reliability and uncertainty has been shown in Hersbach (2000). The CRPS has the same unit as the variable of interest, and it collapses to the Absolute Error for , that is when the forecast becomes deterministic. CRPS is defined for a single instance of forecast and observation, hence it is usually averaged over an ensemble of predictions of size , to obtain the score relative to a given model: . Since we are approaching the problem of variance estimation by assigning an empirical variance to predictions originally made as single-point estimates, it makes sense to minimize the CRPS as a function of only, for a fixed value of the error . By differentiating Eq.(2) with respect to , one obtains {linenomath}

(3) |

and the minimizer is found to be {linenomath}

(4) |

The CRPS penalizes under- and over-confident predictions in a non-trivial way. Indeed, for any value of the error , there are always two values of (one smaller and one larger than , that is one over- and the other under-confident) that yield the same CRPS. We show in Figure 1 the isolines of CRPS in space. The black dashed line indicates . From this Figure it is clear how a smaller error (for constant ) always results in a smaller (better) score, but the same score can be achieved by changing both the error and the standard deviation . A straightforward way of understanding how CRPS works is the following. Let us start with a prediction that has a given error and no uncertainty (i.e. a deterministic forecast, ). CRPS attributes a certain score to such prediction. Now, if we increase the prediction becomes obviously worse, hence CRPS increases, unless we simultaneously increase the uncertainty . That is, accounting for the fact that the prediction is uncertain compensates for a larger error. In this way one can move along a constant CRPS curve, until the point (on the dashed line) where an increase in error cannot be compensated any further by an increase in uncertainty. After that point, larger uncertainties must then be compensated by a decrease in the error .

### 2.2 Reliability Score for Gaussian forecast

Contrary to the CRPS, that is defined for a single pair of forecast-observation, it is clear that reliability can only be defined for a large enough ensemble of such pairs, being a statistical property of a model. In the case of normally distributed forecasts, we expect the standardized errors defined as

(5) |

calculated over a sample of predictions-observations to have a standard normal distribution with cdf . Hence, we define the Reliability Score (RS) as: {linenomath}

(6) |

where is the empirical cumulative distribution of the standardized errors , that is {linenomath}

(7) |

with . RS measures the divergence of the empirical distribution of standardized errors from a standard normal distribution. Note that, by appropriately choosing , one can always obtain an approximate standard normal distribution for , for any given distribution of the errors , as long as the number of instances for and are approximately equal. From now on we will use the convention that the set is sorted (). This does not imply that or are sorted as well. Interestingly, the integral in Eq. (6) can be calculated analytically, via expansion into a telescopic series, yielding: {linenomath}

(8) |

Differentiating now the -th term of the above summation, RS, with respect to (for fixed ), one obtains {linenomath}

(9) |

which is minimized at the value that satisfies {linenomath}

(10) |

This could have been trivially derived by realizing that by minimizing RS one obtains the distribution of standardized errors that most closely approximates a standard normal distribution, for a given number of observations . This is the distribution that mapped through divides uniformly the interval : , i.e. the set . Like CRPS, RS is negatively oriented (i.e. zero is the perfect score). It can be equal to zero only for .

### 2.3 The Accuracy-Reliability cost function

The Accuracy-Reliability cost function introduced here follows from the simple principle that the empirical variances estimated from an ensemble of errors should result in a model that is both accurate (with respect to the CRPS score), and reliable (with respect to the RS score). This gives rise to a two-objective optimization problem. It is trivial to verify that CRPS and RS cannot simultaneously attain their minimum value (for fixed errors ). Note that CRPS is a function of and , while RS is only a function of their scaled ratio . By minimizing the CRPS, for any (see Eq. 4). Obviously, a constant cannot result in a minimum also for RS, according to Eq. (10). Moreover, notice that trying to minimize RS as a function of (for fixed errors ) result in an ill-posed problem, because one can have infinite combinations of that result in the same set , therefore there is no unique solution for the variances that minimizes RS. Hence, RS can be thought of as a regularization term in the Accuracy-Reliability cost function. The simplest strategy to deal with multi-objective optimization problems is to scalarize the cost function, which we define here as {linenomath}

(11) |

We choose the scaling factor as {linenomath}

(12) |

The minimum of is , which is simply the mean of the errors, rescaled by a constant. The minimum of RS follows from Eqs. (8) and (10): {linenomath}

(13) |

Notice that is only a function of the size of the sample , and it converges to zero for . The heuristic choice in Eq. (12) is justified by the fact that the two scores might have different orders of magnitude, and therefore we rescale them in such a way that they are comparable in our cost function (11). Indeed the scaling factor ensure that the two terms would be exactly equal if both could be minimized simultaneously. We believe this to be a sensible choice, although there might be applications where one would like to weigh the two scores differently. Also, in our practical implementation, we neglect the last constant term in the definition (8) so that, for sufficiently large ,

### 2.4 Neural Network

In summary, we want to estimate the input-dependent values of the empirical variances associated to a sample of observations for which we know the errors . We do so by solving a multidimensional optimization problem in which the set of estimated minimizes the AR cost function defined in Eq. (11). This newly introduced cost function has a straightforward interpretation as the trade-off between accuracy and reliability, which are two essential but conflicting properties. In practice, we want to generate a model that is able to predict as a function of the inputs on any point of a domain. This unknown function can in general be non-linear, and we assume no a-priori information to constraint its functional form. However, we want to enforce smoothness of the unknown variance, to some degree. A very general strategy is to use a regularized Artificial Neural Network to model the dependency of as a function of the inputs. However, it is important to realize that this is not the only choice, and in case the user has some prior information on the functional form of , other strategies (such as polynomial interpolation, if the input is low-dimensional) might be better suited. For simplicity, we choose a single neural network architecture, that we use for all the tests. We use a network with 2 hidden layers, respectively with 20 and 5 neurons. The activation functions are and a symmetric saturating linear function, respectively. The third (output) layer uses a linear activation function. The dataset, composed of the inputs and the corresponding observed errors , is randomly divided into training () and validation () sets. The network is trained using a standard Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton algorithm, and the iterations are forcefully stopped when the loss function does not decrease for 10 successive iterations on the validation set. These are all standard choices when training neural networks, and we refer the reader to specific monographs (e.g., Bishop, 1995). A very attractive feature of our model is that the only inputs needed are the input parameters and the corresponding errors (used for training only). The neural network outputs the values of , by minimizing the above-introduced Accuracy-Reliability cost function, Eq. (11), where is the standard deviation, and is used to enforce its positivity. In order to limit the expressive power and avoid over-fitting, we may add a regularization term equal to the L norm of the weights to the AR cost function, multiplied by a constant factor 0.2. In other words, a term can be added to the AR cost function defined in Eq. (11), where the vector represents the Neural Network weights. This is a standard procedure to constrain the amplitude of the weights, and avoid over-fitting (because highly nonlinear functions tend to increase the regularization term) (see, e.g Carè and Camporeale, 2018). In our numerical experiments (Section 3) this regularization term was needed only for 1D cases. Finally, in order to avoid local minima due to the random initialization of the neural network weights, we train five independent networks and choose the one that yields the smallest value of the cost function.

## 3 Experiments with synthetic data

In this section we show some experiments on synthetic data, to demonstrate the ease, robustness and accuracy of the presented method to derive uncertainties.
Here, we assume to have an imperfect model that produces a forecast .
The synthetic observations are generated from a Gaussian distribution , with known variance . The stochastic nature of the synthetic data can be thought to mimic the existence of latent variables that are not included in the model. In other words, close values of the input can results in very different outputs, because of unmodeled processes. The purpose of these experiments is to show that our method is capable of recovering the functional dependence of the variance , that is, for real data, unknown.
We choose some of the datasets routinely used in machine learning literature (Kersting et al., 2007). The first three datasets are one-dimensional in , while in the fourth we will test the method on a five-dimensional space, thus showing the robustness of the proposed strategy.

G dataset: , , (Goldberg et al., 1998).

Y dataset: , , (Yuan and Wahba, 2004).

W dataset: , , (Weigend and Nix, 1994; Williams, 1996).

Examples of 200 points sampled from the G, Y and W dataset are shown in Figure 2 along with their mean function in red.

For the G, Y, and W datasets we test the case where the true mean function is used as deterministic model, and two cases where the model suffers of a systematic bias and the model output is replaced by (a multiplicative error) or (an additive error). These two cases serve also the purpose of studying the behaviour of the proposed method for non-Gaussian errors. Every model is trained on 100 points uniformly sampled in the domain.

5D dataset: , , (Genz, 1984).
Figure 3 shows the distribution of , which ranges in the interval .

The 5D dataset is obviously more challenging, hence we use 10,000 points to train the model (note that this results in fewer points per dimension, compared to the one-dimensional tests).
For all experiments we test 200 independent runs.

The results for the G dataset are shown in Figure 4. The values derived for the standard deviation , averaged over 200 independent runs are shown in black, compared to the ground truth value used to generate the data (in red). The shaded gray area represents the confidence intervals of one (dark gray) and two (light gray) standard deviations calculated over the ensemble of 200 runs. The top, middle, and bottom panel show the results when the model uses the exact mean function used to generate the data , and when the model is mis-specified by a multiplicative error (), or an additive error (), respectively. One can notice that our method is capable of recovering almost exactly the true variance (top), when the model is accurate. On the other hand, when the model is mis-specified (and the errors become non-Gaussian) the method appropriately assigns a larger uncertainty (middle and bottom panels). In particular, it is interesting that the discrepancy between the true variance and the one derived by this method is larger when the true variance is small. This is because in the regions with small (true) variance a mis-specified model (mean function) causes a larger departure from Gaussianity. Since the method is designed to assign anyway a Gaussian probability density, it necessarily results in a larger uncertainty. Nevertheless, using the AR cost function as criterion to derive the empirical variance will always results in an optimally calibrated model, meaning that ill-calibrated results are very unlikely, unless the underlying mean function is very off from the appropriate value. Figure 5 shows the reliability diagram for the three cases discussed (exact model and mis-specified models). Once again, a reliability diagram represents, for any value of probability predicted for a given output, what is the actual observed frequency for that output (calculated on a large sample). A perfectly calibrated model results in a reliability diagram following the straight diagonal line (dashed black).

Not surprisingly, when we use the exact model as our mean function (blue line), the empirical variance derived with our method result in a perfectly calibrated model that indeed follow very closely the diagonal line (dashed black).
When the model is mis-specified (red and yellow lines), the method tries to achieve a trade-off between reliability and accuracy. The resulted reliability is still very good even though not perfect. It is very interesting that the reliability diagram can be used for our method to detect a mis-specified mean function. Indeed, it is important to point out that, for the G dataset, the model with additive error is worse than the one with multiplicative error, because goes through zero in three points in the domain (hence the multiplicative error plays no role in those points).

Results for the Y datasets are shown in Figures 6 and 7, with same format as previous Figures. Conclusions are very similar, with the main difference that the Y dataset has a nonlinear true variance, which is harder to learn. Nevertheless, our method provides a good estimate of it.
The W model is the most challenging, as shown in Figures 8, 9. Here, a mis-specification of the model becomes readily evident, producing almost constant variance and large errors in the reliability diagram.

For the 5D dataset it is impractical to compare graphically the real and estimated in the 5-dimensional domain. Instead, in Figure 10 we show the probability density of the real versus predicted values of the standard deviation. Values are normalized such that the maximum value in the colormap for any value of predicted is equal to one (i.e. along vertical lines). The red line shows a perfect prediction. The colormap has been generated by 10,000,000 points, while the model has been trained with 10,000 points only.
For this case, we have used an exact mean function (equal to zero), in order to focus exclusively on the estimation of the variance.
We believe that this is an excellent result for a very challenging task, given the sparsity of the training set, that shows the robustness of the method.

## 4 Estimation of electron density in the plasmasphere (DEN2D)

In this and the next section we show applications of our method that are relevant to Space Weather.
The first example is the estimation of the electron plasma density in the plasmasphere. Chu et al. (2017) have devised a neural network (NN) model, DEN2D, that takes as inputs the time history of the SYM-H and AL geomagnetic indexes, and of (solar radio flux), and outputs the logarithm of the electron density at any location in the plasmasphere, as function of magnetic shell (L), and magnetic local time (MLT), at near-equatorial latitudes. The neural network was trained and tested on about 400,000 events generated by 4 years of THEMIS data (June 2008 to December 2012). It uses 178 input attributes and outputs the logarithmic value of the electron density.
Obviously, the NN is a deterministic model, that outputs a single value for any given combinations of inputs. Hence, this model is very well-posed for the method introduced in this paper.
Moreover, a recent study performed to evaluate the propagation of uncertainties in radiation belt ensemble simulations, has shown that the uncertainty in the electron density estimation carries most of the variance of the predicted electron fluxes (Camporeale et al., 2016). Therefore, the reduction of the uncertainty for electron density is a necessary step for developing reliable forecasts of electron fluxes.
Figure 11 shows the distribution of the error of the NN output with respect to the true (log) electron density, calculated over the whole dataset. The superimposed red line shows a Gaussian fit to the distribution, which has a slighter larger variance. It is important, however, to keep in mind that our method does not assume that the model errors are normally distributed. Indeed, the method will try to enforce that the standardized errors are Gaussian, which can be achieved even for non-Gaussian errors . This is well demonstrated by the reliability diagram, which is shown in Figure 12. Our method applied to the DEN2D model produces a probabilistic estimate of electron density that has almost perfect reliability.

Once we have trained our model to estimate the standard deviation as function of the same inputs used in the DEN2D NN, one can seek for evident relationship between and any of the inputs. This is in general non-trivial, given that the model takes 178 inputs. Indeed, the only evident correlation exists with the value of the magnetic shell. Figure 13 shows the two-dimensional histogram of and . The number of counts are normalized column-wise, that is for every value of the maximum is set equal to 1. The black dashed line follows the maximum number of counts as function of . The uncertainty of the density estimation increases with increasing , reaching a maximum for . This is consistent with the distribution of errors, when ordered as function of (Figure 3 in Chu et al. (2017)), reproduced here in Figure 14, with the same format as before. Even though the mean value remains centered around zero, the spread of the errors increases with increasing , hence resulting in larger uncertainties.
We conclude this Section by reproducing the result shown in Figure 6 of Chu et al. (2017), where the authors have applied the DEN2D model to the moderate storm of 4 February 2011. Figure 15 reproduces the estimated electron density at 6 different times, ranging from the quite time before the storm to the recovery phase after the storm. The color bar indicates the (log) electron density. Superimposed to each image we show the isolines of the standard deviation calculated with our new method. It is interesting to notice how is as dynamic as the electron density. Being derived from the DEN2D model, the uncertainty is itself dependent on the time history of geomagnetic indexes and on geographical location.

## 5 Estimation of chorus wave amplitude

Whistler-mode chorus waves play a crucial role for wave-particles interaction and particles scattering in the inner magnetosphere (Thorne, 2010; Camporeale, 2015; Camporeale and Zimbardo, 2015). The estimation of the wave amplitude is an important step in the calculation of pitch angle and energy diffusion by means of quasi-linear Fokker-Planck equations.
Recently, Agapitov et al. (2018) have presented an empirical model to estimate the chorus wave amplitude and wave normal angle distribution, derived from the statistical analysis of Cluster and Van Allen Probes VLF measurements. The model takes as inputs the magnetic local time (MLT), the magnetic latitude , the value of the L-shell, and the geomagnetic index Kp (or Dst (Agapitov et al., 2015)) providing the distribution of chorus wave amplitude and wave normal angle in the outer radiation belt (from the plasmapause to L=7) for all MLT values in the latitudinal range from -45 to 45 degrees. The model was developed in the polynomial form for chorus wave amplitude with the coefficients calculated based on Cluster STAFF-SA measurements in 2001-2011 (Agapitov et al., 2015), and with the coefficients updated making use of the combined Cluster observations and the recent Van Allen Probes VLF measurements (Agapitov et al., 2018).
In order to apply our new method to the model of Agapitov et al. (2018), we have produced an estimation of the chorus wave amplitude for the period 01-01-2015 to 12-30-2016, at one-minute resolution at the corresponding location of the Van Allen Probes spacecraft and the corresponding level of the geomagnetic activity. The ground-truth value is taken directly from the Van Allen Probes EMFISIS observations. Note that this time interval was not included in the original training of the model. This produced a total of 213,937 data points for which the model error was calculated. Since the wave amplitude can range within two orders of magnitude, the errors are in log scale.

Figure 16 shows the histogram of the model error (computed as the difference between the logarithm of predictions and observations), compared with its Gaussian fit. Similarly to the model discussed in the previous Section, this model does not yield errors that are exactly log-normal distributed. This, however, does not affect the goodness of our uncertainty estimate, in terms of accuracy and reliability.
As previously, we train our algorithm to estimate the standard deviation using the same inputs as the original model. The reliability diagram, calculated over the entire dataset, is shown in Figure 17. The largest mismatch, for a predicted probability equal to 50%, is about 7%, hence demonstrating that the model is very well calibrated.
Figure 18 shows the heat map of the standard deviation at different locations , and for different ranges of the geomagnetic index Kp (left panel: ; center panel: ; right panel: ). Not surprisingly, the largest uncertainties occur during storm-time, and in the pre-noon sector.
Finally, Figure 19 shows the two-dimensional histogram of the standard deviation , as function of the magnetic local time MLT. A column-wise normalization is applied, such that the maximum value along a constant MLT is equal to one. Consistently with the previous Figure, the largest uncertainties occur for MLT in the range 5-10.

## 6 Conclusions

The estimation of uncertainties associated with the output of deterministic models is a key element of any forecasting method. The standard approach for evaluating such uncertainties is to rely on time-consuming ensemble simulations. In this paper, we have introduced a novel methodology to estimate uncertainties that does not require running costly ensembles. The guiding principle behind our method is that the uncertainty of the output distribution, here represented by the standard deviation of a Gaussian centered around the values predicted by the deterministic model, should produce a probabilistic forecast that is both accurate and reliable (well calibrated). We have introduced a cost function that encodes the trade-off between accuracy and reliability for Gaussian distributions. The minimization of such Accuracy-Reliability cost function yields the optimal standard deviation .
The proposed method is ignorant with respect to the deterministic model it is applied to. In fact, it only requires the algebraic errors between predictions and true values, in order to be trained. A deep neural network is used to generate the unknown standard deviation for inputs other than the ones used for training.

We have shown experiments with synthetic data sets (for one and five-dimensional examples), that demonstrate how our method is able to learn the underlying functional dependence of the standard deviation, which is, in real-world problems, unknown. These experiments also show how the method deals with cases when the underlying deterministic model contains a systematic error. In this cases, the reliability diagram represents a sanity check, indicating the presence of systematic errors. Indeed, it is understood that any problem with the underlying deterministic model is ultimately reflected in the reliability diagram.

Finally we have applied the method to two recently developed models, relevant to space weather: the estimation of the electron density in the plasmasphere (Section 4), and of the chorus wave amplitude (Section 5). In both cases, we use as inputs the same inputs employed in the original model. The probabilistic forecast produced with our method show excellent reliability diagrams, pinpointing the lack of a systematic bias in the original models.

Our code is available on the website www.mlspaceweather.org and zenodo.org (doi:10.5281/zenodo.1485608)
and we encourage the space weather community to produce probabilistic forecasts based on deterministic models, using our method.
Finally, we point out that an interesting future extension to this method would be the case of multivariate outputs (in contrast to scalars). In that case, the definitions of CRPS and RS will need to account for co-variances between variables.

###### Acknowledgements.

E.C. is partially funded by NWO-Vidi grant 639.072.716.All data used to generate the results are available on the website www.mlspaceweather.org and zenodo.org (doi:10.5281/zenodo.1485608)

## References

- Agapitov et al. (2015) Agapitov, O., A. Artemyev, D. Mourenas, F. Mozer, and V. Krasnoselskikh (2015), Empirical model of lower band chorus wave distribution in the outer radiation belt, Journal of Geophysical Research: Space Physics, 120(12), 10–425.
- Agapitov et al. (2018) Agapitov, O., D. Mourenas, A. Artemyev, F. Mozer, G. Hospodarsky, J. Bonnell, and V. Krasnoselskikh (2018), Synthetic empirical chorus wave model from combined van allen probes and cluster statistics, Journal of Geophysical Research: Space Physics, 123(1), 297–314.
- Aminalragia-Giamini et al. (2018) Aminalragia-Giamini, S., I. Sandberg, C. Papadimitriou, I. A. Daglis, and P. Jiggens (2018), The virtual enhancements- solar proton event radiation (vesper) model, Journal of Space Weather and Space Climate, 8, A06.
- Anderson (1996) Anderson, J. L. (1996), A method for producing and evaluating probabilistic forecasts from ensemble model integrations, Journal of Climate, 9(7), 1518–1530.
- Babuška et al. (2007) Babuška, I., F. Nobile, and R. Tempone (2007), A stochastic collocation method for elliptic partial differential equations with random input data, SIAM Journal on Numerical Analysis, 45(3), 1005–1034.
- Barnes et al. (2007) Barnes, G., K. D. Leka, E. A. Schumer, and D. J. DellaâRose (2007), Probabilistic forecasting of solar flares from vector magnetogram data, Space Weather, 5(9), doi:10.1029/2007SW000317.
- Barnes et al. (2016) Barnes, G., K. Leka, C. Schrijver, T. Colak, R. Qahwaji, O. Ashamari, Y. Yuan, J. Zhang, R. McAteer, D. Bloomfield, et al. (2016), A comparison of flare forecasting methods. i. results from the âall-clearâ workshop, The Astrophysical Journal, 829(2), 89.
- Bishop (1995) Bishop, C. M. (1995), Neural networks for pattern recognition, Oxford university press.
- Bloomfield et al. (2012) Bloomfield, D. S., P. A. Higgins, R. J. McAteer, and P. T. Gallagher (2012), Toward reliable benchmarking of solar flare forecasting methods, The Astrophysical Journal Letters, 747(2), L41.
- Branke et al. (2008) Branke, J., K. Deb, and K. Miettinen (2008), Multiobjective optimization: Interactive and evolutionary approaches, vol. 5252, Springer Science & Business Media.
- Brier (1950) Brier, G. W. (1950), Verification of forecasts expressed in terms of probability, Monthly Weather Review, 78(1), 1–3.
- Bröcker and Smith (2007) Bröcker, J., and L. A. Smith (2007), Scoring probabilistic forecasts: The importance of being proper, Weather and Forecasting, 22(2), 382–388.
- BussyâVirat and Ridley (2014) BussyâVirat, C. D., and A. J. Ridley (2014), Predictions of the solar wind speed by the probability distribution function model, Space Weather, 12(6), 337–353, doi:10.1002/2014SW001051.
- Caflisch (1998) Caflisch, R. E. (1998), Monte carlo and quasi-monte carlo methods, Acta numerica, 7, 1–49.
- Camporeale (2015) Camporeale, E. (2015), Resonant and nonresonant whistlers-particle interaction in the radiation belts, Geophysical Research Letters, 42(9), 3114–3121.
- Camporeale and Zimbardo (2015) Camporeale, E., and G. Zimbardo (2015), Wave-particle interactions with parallel whistler waves: Nonlinear and time-dependent effects revealed by particle-in-cell simulations, Physics of Plasmas, 22(9), 092,104.
- Camporeale et al. (2016) Camporeale, E., Y. Shprits, M. Chandorkar, A. Drozdov, and S. Wing (2016), On the propagation of uncertainties in radiation belt simulations, Space Weather, 14(11), 982–992.
- Camporeale et al. (2017) Camporeale, E., A. Agnihotri, and C. Rutjes (2017), Adaptive selection of sampling points for uncertainty quantification, International Journal for Uncertainty Quantification, 7(4).
- Camporeale et al. (2018a) Camporeale, E., S. Wing, J. Johnson, C. Jackman, and R. McGranaghan (2018a), Space weather in the machine learning era: a multi-disciplinary approach, Space Weather.
- Camporeale et al. (2018b) Camporeale, E., S. Wing, and J. R. Johnson (2018b), Machine Learning Techniques for Space Weather, Elsevier.
- Carè and Camporeale (2018) Carè, A., and E. Camporeale (2018), Regression, in Machine Learning Techniques for Space Weather, pp. 71–112, Elsevier.
- Chandorkar et al. (2017) Chandorkar, M., E. Camporeale, and S. Wing (2017), Probabilistic forecasting of the disturbance storm time index: An autoregressive gaussian process approach, Space Weather, 15(8), 1004–1019, doi:10.1002/2017SW001627.
- Chu et al. (2017) Chu, X., J. Bortnik, W. Li, Q. Ma, V. Angelopoulos, and R. Thorne (2017), Erosion and refilling of the plasmasphere during a geomagnetic storm modeled by a neural network, Journal of Geophysical Research: Space Physics.
- Gallagher et al. (2002) Gallagher, P. T., Y.-J. Moon, and H. Wang (2002), Active-region monitoring and flare forecasting–i. data processing and first results, Solar Physics, 209(1), 171–183.
- Genz (1984) Genz, A. (1984), Testing multidimensional integration routines, in Proc. Of International Conference on Tools, Methods and Languages for Scientific and Engineering Computation, pp. 81–94, Elsevier North-Holland, Inc., New York, NY, USA.
- Ghahramani (2015) Ghahramani, Z. (2015), Probabilistic machine learning and artificial intelligence, Nature, 521(7553), 452.
- Gneiting et al. (2005) Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman (2005), Calibrated probabilistic forecasting using ensemble model output statistics and minimum crps estimation, Monthly Weather Review, 133(5), 1098–1118.
- Gneiting et al. (2007) Gneiting, T., F. Balabdaoui, and A. E. Raftery (2007), Probabilistic forecasts, calibration and sharpness, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2), 243–268.
- Goldberg et al. (1998) Goldberg, P. W., C. K. Williams, and C. M. Bishop (1998), Regression with input-dependent noise: A gaussian process treatment, in Advances in neural information processing systems, pp. 493–499.
- Gruet et al. (2018) Gruet, M., M. Chandorkar, A. Sicard, and E. Camporeale (2018), Multiple hours ahead forecast of the dst index using a combination of long short-term memory neural network and gaussian process, Space Weather, under review.
- Hamill (1997) Hamill, T. M. (1997), Reliability diagrams for multicategory probabilistic forecasts, Weather and forecasting, 12(4), 736–741.
- Hamill (2001) Hamill, T. M. (2001), Interpretation of rank histograms for verifying ensemble forecasts, Monthly Weather Review, 129(3), 550–560.
- Hersbach (2000) Hersbach, H. (2000), Decomposition of the continuous ranked probability score for ensemble prediction systems, Weather and Forecasting, 15(5), 559–570.
- Kahler and Ling (2015) Kahler, S. W., and A. Ling (2015), Dynamic sep event probability forecasts, Space Weather, 13(10), 665–675, doi:10.1002/2015SW001222.
- Kennedy and O’Hagan (2001) Kennedy, M. C., and A. O’Hagan (2001), Bayesian calibration of computer models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3), 425–464.
- Kersting et al. (2007) Kersting, K., C. Plagemann, P. Pfaff, and W. Burgard (2007), Most likely heteroscedastic gaussian process regression, in Proceedings of the 24th international conference on Machine learning, pp. 393–400, ACM.
- Lee et al. (2012) Lee, K., Y.-J. Moon, J.-Y. Lee, K.-S. Lee, and H. Na (2012), Solar flare occurrence rate and probability in terms of the sunspot classification supplemented with sunspot area and its changes, Solar Physics, 281(2), 639–650.
- Luhmann et al. (2017) Luhmann, J. G., M. L. Mays, D. Odstrcil, Y. Li, H. Bain, C. O. Lee, A. B. Galvin, R. A. Mewaldt, C. M. S. Cohen, R. A. Leske, D. Larson, and Y. Futaana (2017), Modeling solar energetic particle events using enlil heliosphere simulations, Space Weather, 15(7), 934–954, doi:10.1002/2017SW001617.
- Matheson and Winkler (1976) Matheson, J. E., and R. L. Winkler (1976), Scoring rules for continuous probability distributions, Management science, 22(10), 1087–1096.
- McPherron and Siscoe (2004) McPherron, R. L., and G. Siscoe (2004), Probabilistic forecasting of geomagnetic indices using solar wind air mass analysis, Space Weather, 2(1), doi:10.1029/2003SW000003.
- Miyoshi and Kataoka (2008) Miyoshi, Y., and R. Kataoka (2008), Probabilistic space weather forecast of the relativistic electron flux enhancement at geosynchronous orbit, Journal of Atmospheric and Solar-Terrestrial Physics, 70(2-4), 475–481.
- Möstl et al. (2017) Möstl, C., T. Amerstorfer, E. Palmerio, A. Isavnin, C. J. Farrugia, C. Lowder, R. M. Winslow, J. M. Donnerer, E. K. J. Kilpua, and P. D. Boakes (2017), Forward modeling of coronal mass ejection flux ropes in the inner heliosphere with 3dcore, Space Weather, 16(3), 216–229.
- Murphy and Winkler (1992) Murphy, A. H., and R. L. Winkler (1992), Diagnostic verification of probability forecasts, International Journal of Forecasting, 7(4), 435–455.
- Murphy (2012) Murphy, K. P. (2012), Machine learning: A probabilistic perspective. adaptive computation and machine learning.
- Murray et al. (2017) Murray, S. A., S. Bingham, M. Sharpe, and D. R. Jackson (2017), Flare forecasting at the met office space weather operations centre, Space Weather, 15(4), 577–588, doi:10.1002/2016SW001579.
- Napoletano et al. (2018) Napoletano, G., R. Forte, D. Del Moro, E. Pietropaolo, L. Giovannelli, and F. Berrilli (2018), A probabilistic approach to the drag-based model, arXiv preprint arXiv:1801.04201.
- Owens and Riley (2017) Owens, M. J., and P. Riley (2017), Probabilistic solar wind forecasting using large ensembles of nearâsun conditions with a simple oneâdimensional âupwindâ scheme, Space Weather, 15(11), 1461–1474, doi:10.1002/2017SW001679.
- Papaioannou et al. (2015) Papaioannou, A., A. Anastasiadis, I. Sandberg, M. Georgoulis, G. Tsiropoula, K. Tziotziou, P. Jiggens, and A. Hilgers (2015), A novel forecasting system for solar particle events and flares (forspef), in Journal of Physics: Conference Series, vol. 632, p. 012075, IOP Publishing.
- Pinson et al. (2010) Pinson, P., P. McSharry, and H. Madsen (2010), Reliability diagrams for non-parametric density forecasts of continuous variables: Accounting for serial correlation, Quarterly Journal of the Royal Meteorological Society, 136(646), 77–90.
- Prikryl et al. (2012) Prikryl, P., P. Jayachandran, S. Mushini, and I. Richardson (2012), Toward the probabilistic forecasting of high-latitude gps phase scintillation, Space Weather, 10(8).
- Riley and Love (2016) Riley, P., and J. J. Love (2016), Extreme geomagnetic storms: Probabilistic forecasts and their uncertainties, Space Weather, 15(1), 53–64, doi:10.1002/2016SW001470.
- Sharpe and Murray (2017) Sharpe, M. A., and S. A. Murray (2017), Verification of space weather forecasts issued by the met office space weather operations centre, Space Weather, 15(10), 1383–1395, doi:10.1002/2017SW001683.
- Smith (2013) Smith, R. C. (2013), Uncertainty quantification: theory, implementation, and applications, vol. 12, Siam.
- Thorne (2010) Thorne, R. M. (2010), Radiation belt dynamics: The importance of wave-particle interactions, Geophysical Research Letters, 37(22).
- Weigend and Nix (1994) Weigend, A. S., and D. A. Nix (1994), Predictions with confidence intervals (local error bars), in Proceedings of the international conference on neural information processing, pp. 847–852.
- Wheatland (2004) Wheatland, M. (2004), A bayesian approach to solar flare prediction, The Astrophysical Journal, 609(2), 1134.
- Wilks (2011) Wilks, D. S. (2011), Statistical methods in the atmospheric sciences, vol. 100, Academic press.
- Williams (1996) Williams, P. M. (1996), Using neural networks to model conditional multivariate densities, Neural Computation, 8(4), 843–854.
- Xiu (2010) Xiu, D. (2010), Numerical methods for stochastic computations: a spectral method approach, Princeton university press.
- Xiu and Karniadakis (2002) Xiu, D., and G. E. Karniadakis (2002), The wiener–askey polynomial chaos for stochastic differential equations, SIAM journal on scientific computing, 24(2), 619–644.
- Yuan and Wahba (2004) Yuan, M., and G. Wahba (2004), Doubly penalized likelihood estimator in heteroscedastic regression, Statistics & probability letters, 69(1), 11–20.
- Zhang and Moldwin (2014) Zhang, X., and M. B. Moldwin (2014), Probabilistic forecasting analysis of geomagnetic indices for southward imf events, Space Weather, 13(3), 130–140, doi:10.1002/2014SW001113.