Abstract
Due to their flexibility and predictive performance, machine-learning-based regression methods have become an important tool for predictive modeling and forecasting. However, most methods focus on estimating the conditional mean or specific quantiles of the target quantity and do not provide the full conditional distribution, which contains uncertainty information that may be crucial for decision making. In this article, we provide a general solution by transforming a conditional distribution estimation problem into a constrained multiclass classification problem, in which tools such as deep neural networks can be applied. We propose a novel joint binary cross-entropy loss function to accomplish this goal. We demonstrate its performance in various simulation studies in comparison with state-of-the-art competing methods. Additionally, our method shows improved accuracy in a probabilistic solar energy forecasting problem.
Deep Distribution Regression
Rui Li^{1}, Howard D. Bondell^{1,2} and Brian J. Reich^{1}
^{1}Department of Statistics, North Carolina State University
^{2}School of Mathematics and Statistics, University of Melbourne
Keywords— conditional distribution, deep learning, machine learning, probabilistic forecasting
1 Introduction
In recent years, a variety of machine learning methods, such as random forests, gradient boosting trees and neural networks, have gained popularity and been widely adopted. These methods are often flexible enough to uncover complex relationships in high-dimensional data without strong assumptions on the underlying data structure. Off-the-shelf software is available to put these algorithms into production [Pedregosa et al. (2011), Abadi et al. (2016) and Paszke et al. (2017)]. However, in regression and forecasting tasks, many machine learning methods only provide a point estimate, without any additional information regarding the uncertainty of the target quantity. Understanding uncertainty is often crucial in fields such as financial markets and risk analysis [Diebold et al. (1997), Timmermann (2000)], population and demographic studies [Wilson and Bell (2007)], transportation and traffic analysis [Zhu and Laptev (2017), Rodrigues and Pereira (2018)] and energy forecasting [Hong et al. (2016)]. In this article, we establish a framework that directly extends off-the-shelf machine learning algorithms to provide the full conditional distribution of the response given the covariates. Instead of estimating specific quantiles [Koenker and Hallock (2001), Taylor (2000) and Friedman (2001)] or prediction intervals [Khosravi et al. (2011), Shrestha and Solomatine (2006)], we aim to estimate the full conditional distribution directly, as other quantities can be extracted from it.
This research is motivated by the necessity of probabilistic prediction in energy forecasting. The Global Energy Forecasting Competition 2014 [Hong et al. (2016)] focused on probabilistic forecasting, due to the high demand for uncertainty estimation in energy forecasting, yet few existing methods are readily available. The solar energy forecasting track in this competition aimed to estimate the full conditional distribution of solar energy generation based on covariates such as solar radiance, temperature, time of the day, etc. This is a crucial task in actual day-to-day operation, as solar energy generation is highly weather dependent and thus volatile. To ensure stability in the electrical grid, grid operators require not only accurate point predictions of solar energy generation, but also its volatility under given weather conditions. Driven by this practical problem, we establish a method that provides comprehensive information regarding energy generation uncertainty by estimating the full conditional distribution. Our method shows superior estimation accuracy in our real data analysis compared to competing methods.
There are a number of approaches to estimate the conditional distribution of the target quantity Y given the input vector X. A major class of methods estimates the joint density f(x, y) and the marginal density f(x) through kernel density estimators [Rosenblatt (1969)] to obtain the conditional density estimate f(y | x) = f(x, y)/f(x). This approach is limited by the dimensionality of the input space X, due to having to estimate the full joint distribution of (X, Y). Several methods have been proposed to address some of these limitations in both locally parametric and nonparametric ways [Hyndman et al. (1996), Hyndman and Yao (2002), Holmes et al. (2012), Fan et al. (2009), Izbicki and Lee (2016)]. Another popular approach is to approximate the distribution of interest by a parametric distribution family or a mixture of such distributions, such as a mixture of Gaussians [Escobar and West (1995), Song et al. (2004), Rojas et al. (2005), Fahey et al. (2007)]. This approach faces the challenge of determining the number of mixture components, and its performance depends on the complexity of the true underlying distribution. Bootstrap-based aggregation provides an alternative; a good example is quantile regression forest [Meinshausen (2006)], which obtains the full conditional distribution from the empirical distribution in the aggregated tree leaves. A boosting approach to this problem was discussed in Schapire et al. (2002).
Machine learning approaches have experienced major success in classification problems. We leverage this success to build conditional distribution estimates. In this paper, we propose a two-stage method for conditional density estimation. In the first stage, we partition the response space into bins; in the second stage, the probabilities that the target variable falls into each bin given the input covariates are estimated. In this way, we transform a conditional density estimation problem into a multiclass classification task, where we can take advantage of many off-the-shelf machine learning algorithms. This framework is model-agnostic in the second stage, allowing any suitable multiclass classification method, such as deep learning, to be plugged in. To further accommodate the fact that the classes are ordered, we use a joint binary cross-entropy loss coupled with the neural network model. The design of our neural network also ensures the monotonicity of the estimated cumulative distribution function, which is not guaranteed by other ordinal classification methods such as those in Frank and Hall (2001) and Cheng et al. (2008). In addition, we explore random partitioning in the first stage followed by an ensemble approach to obtain a smoother and more stable density estimator.
The paper is organized as follows. In Section 2, we describe the model setup. In Section 2.1, we examine approaches and loss functions that can be used in the multiclass classification stage, and in Section 2.2 we examine an alternative partitioning method and model ensemble. In Section 3, we show that, given consistency of the classification model, we can achieve consistency of the density estimator. We evaluate model performance with simulation studies in Section 4, where we thoroughly examine the effect of the number of bins, different binning strategies and different loss functions. In Section 5, the method is illustrated using the aforementioned solar energy example, where we demonstrate superior performance to quantile regression forests. We conclude with a discussion in Section 6.
2 Distribution Estimation by Partitioning
Our approach is based on partitioning the range of the response variable into bins and approximating the conditional density function by a piecewise constant function. We also propose ensemble random partitioning, which allows for a smooth density. Formally, assume we are interested in the density of the response Y in the range (T_0, T_K]. This interval is partitioned by cutpoints T_0 < T_1 < ⋯ < T_K into K bins. Let B_k = (T_{k-1}, T_k] denote the kth bin, for k = 1, …, K.
Given the independent variables X = x and the partition, denote by p_k(x) = P(Y ∈ B_k | X = x), for k = 1, …, K, the conditional probability that Y belongs to the kth interval of the partition. Assume the density function in the kth bin can be approximated by the constant function p_k(x)/δ_k, for all y ∈ B_k; then the conditional density function has the form:
(1) f(y | X = x) = Σ_{k=1}^{K} 1(y ∈ B_k) p_k(x)/δ_k,
where δ_k = T_k − T_{k-1} is the size of the interval B_k.
We then estimate p_k(x) with a classification model. Let p̂_k(x) be the estimator for p_k(x) obtained from the classification model. Plugging p̂_k(x) into (1) gives the density estimator:
(2) f̂(y | X = x) = Σ_{k=1}^{K} 1(y ∈ B_k) p̂_k(x)/δ_k.
A natural approach to estimate p_k(x) is discussed in Section 2.1.
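As a concrete sketch of the estimator in (2), the following minimal NumPy code builds the piecewise-constant density from a vector of estimated bin probabilities. The function name (make_density) and the example cutpoints and probabilities are illustrative, not from the paper; any classifier could supply the probability vector.

```python
import numpy as np

def make_density(cutpoints, probs):
    """Build the piecewise-constant conditional density of Eq. (2).

    cutpoints: array of K+1 increasing cut points T_0 < ... < T_K
    probs:     array of K estimated bin probabilities p_hat_k(x)
    Returns a function y -> f_hat(y | x).
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    probs = np.asarray(probs, dtype=float)
    widths = np.diff(cutpoints)          # delta_k = T_k - T_{k-1}
    heights = probs / widths             # constant density height within each bin

    def density(y):
        # Locate the bin (T_{k-1}, T_k] containing y; zero outside (T_0, T_K].
        k = np.searchsorted(cutpoints, y, side="left") - 1
        if k < 0 or k >= len(heights):
            return 0.0
        return heights[k]

    return density

# Illustrative example: 4 equal bins on [0, 1], probabilities summing to 1.
f_hat = make_density([0.0, 0.25, 0.5, 0.75, 1.0], [0.1, 0.4, 0.4, 0.1])
```

By construction the estimate integrates to one whenever the bin probabilities do.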
2.1 Probability Estimation for Each Partitioned Bin
Below we describe two different modeling strategies for the multiclass classification task of estimating p_k(x).
2.1.1 Multinomial Loglikelihood
A natural approach to obtain estimates of the conditional probability of a response observation falling in each bin is to model the conditional distribution as a multinomial distribution. We note that this does not take into account the fact that the bins are ordered; we discuss how to deal with this in the next section. For a given observation (x_i, y_i), where y_i ∈ B_k, we let z_i represent the event y_i ∈ B_k, where z_i is the vector of zeros except for a one in the kth element. We then assume that z_i | x_i follows a Multinomial(1, p(x_i; θ)) distribution, where p(x_i; θ) = (p_1(x_i; θ), …, p_K(x_i; θ)) is the probability vector describing the conditional probability of belonging to each bin given x_i, and θ denotes the parameters in the classification model. The log-likelihood function is:
(3) ℓ(θ) = Σ_{i=1}^{n} Σ_{k=1}^{K} z_{ik} log p_k(x_i; θ).
Given an appropriate classification model p(x; θ), our goal is to maximize the log-likelihood function with respect to θ. In this work, we use deep neural networks as a flexible and robust method. Under this model, θ corresponds to the network's biases and weights. The softmax function p_k(x; θ) = exp(o_k) / Σ_{j=1}^{K} exp(o_j) is used as the activation function for the output layer of the neural network, where o = (o_1, …, o_K) is the output-layer vector prior to the softmax transformation. The softmax activation ensures that the estimator is a valid probability vector, such that p̂_k(x) ≥ 0 and Σ_{k=1}^{K} p̂_k(x) = 1.
Although we choose the neural network model, our framework is model-agnostic, and any method that is appropriate for multiclass classification can be utilized. This approach is straightforward in practice, as many off-the-shelf machine learning tools have multiclass classification algorithms implemented. However, one potential caveat is that this approach can be sensitive to the number of cutpoints K. While a larger K is required to uncover the fine details of the target distribution and reduce bias, it results in increased variance due to the smaller number of observations per partitioned bin. Additionally, this approach does not take the order of the bins into consideration. To address these concerns, we provide a modified method in Section 2.1.2.
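The multinomial stage above can be sketched in a few lines of NumPy, assuming a generic classifier that outputs raw scores (logits). The helper names (softmax, bin_onehot, multinomial_nll) are illustrative, not from the paper:

```python
import numpy as np

def softmax(o):
    """Stable softmax over the last axis: p_k = exp(o_k) / sum_j exp(o_j)."""
    o = o - o.max(axis=-1, keepdims=True)
    e = np.exp(o)
    return e / e.sum(axis=-1, keepdims=True)

def bin_onehot(y, cutpoints):
    """One-hot vectors z_i indicating which bin (T_{k-1}, T_k] each y_i falls into."""
    k = np.searchsorted(cutpoints, y, side="left") - 1
    K = len(cutpoints) - 1
    z = np.zeros((len(y), K))
    z[np.arange(len(y)), k] = 1.0
    return z

def multinomial_nll(logits, z):
    """Negative log-likelihood of Eq. (3): -sum_i sum_k z_ik log p_k(x_i)."""
    p = softmax(logits)
    return -np.sum(z * np.log(p + 1e-12))
```

In practice the logits would come from the last layer of a neural network (or any multiclass model), and the loss would be minimized by gradient descent.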
2.1.2 Joint Binary Cross Entropy Loss
Instead of utilizing multiclass classification to directly estimate the probabilities p_k(x), we reframe the problem as K − 1 binary classification tasks, with a binary classification evaluated at each of the interior cutpoints. In other words, we obtain the conditional cumulative distribution function by estimating F(T_k | x) = P(Y ≤ T_k | X = x), for k = 1, …, K − 1, jointly.
In this approach, we retain the model (i.e., a neural network) for the conditional probability that Y belongs to each bin given x, but specify the loss function in terms of the cumulative distribution function F(T_k | x; θ) = Σ_{j=1}^{k} p_j(x; θ). Because the estimates p̂_j(x) are positive, we automatically obtain the monotonicity property of F̂. For a given cutpoint T_k, the binary cross-entropy (BCE) loss is:
(4) BCE_k(θ) = −Σ_{i=1}^{n} [ 1(y_i ≤ T_k) log F(T_k | x_i; θ) + 1(y_i > T_k) log(1 − F(T_k | x_i; θ)) ].
Combining the BCEs across all cutpoints gives the joint binary cross entropy (JBCE) loss:
(5) JBCE(θ) = Σ_{k=1}^{K−1} BCE_k(θ).
Our goal becomes minimizing the JBCE loss with respect to θ. This approach takes the order of the bins into consideration, in contrast to multiclass classification, where the relationships among the bins are ignored. In scenarios where the number of cutpoints is large, the number of observations per bin can be small and the estimation of p_k(x) can be poor in the multiclass classification setting. This alternative approach thus provides an advantage in that it remains stable even for a larger number of cutpoints, as will be seen in our simulation study.
The concept is motivated by the ordinal classification approach proposed by Frank and Hall (2001), yet we provide two additional advantages over the original method. First, instead of building K − 1 independent binary classification models, we model the binary classification events jointly, which greatly reduces the computational cost. Second, our model ensures monotonicity of the estimated cumulative distribution function, while the original method by Frank and Hall (2001) does not have this guarantee.
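The JBCE loss can be sketched directly from the bin probabilities via cumulative sums, which also makes the monotonicity of the estimated CDF explicit. The function name jbce_loss and the vectorized layout are our own illustrative choices:

```python
import numpy as np

def jbce_loss(probs, y, cutpoints):
    """Joint binary cross-entropy of Eq. (5).

    probs:     (n, K) estimated bin probabilities (rows sum to 1)
    y:         (n,) observed responses
    cutpoints: length K+1 array T_0 < ... < T_K
    """
    # Monotone CDF estimates at the K-1 interior cutpoints:
    # F(T_k | x) = sum_{j<=k} p_j(x), guaranteed nondecreasing since p_j >= 0.
    F = np.cumsum(probs, axis=1)[:, :-1]                     # shape (n, K-1)
    # Binary targets 1(y_i <= T_k) at each interior cutpoint.
    targets = (y[:, None] <= np.asarray(cutpoints)[None, 1:-1]).astype(float)
    eps = 1e-12                                              # numerical guard
    bce = -(targets * np.log(F + eps) + (1 - targets) * np.log(1 - F + eps))
    return bce.sum()                                         # sum over cutpoints and observations
```

Because the same probability vector feeds every cutpoint's BCE term, the K − 1 binary tasks are fitted jointly rather than as independent models.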
2.2 Distribution Estimation with Ensemble Random Partitioning
Approximating the conditional density with a piecewise constant function can suffer from two drawbacks: (i) the results are sensitive to the choice of the cutpoints, and (ii) the density and cumulative distribution functions are not smooth. In order to address these two issues, we propose an ensemble random partitioning method. This method fits M independent density estimators, each of which has different positions of the cutpoints, and the final ensemble estimator is an average over the individual estimators. For the mth estimator, we generate K − 1 interior cutpoints uniformly on (T_0, T_K) and sort them. Let T_{(j)}^m be the jth smallest value, so that T_{(0)}^m = T_0, T_{(K)}^m = T_K and T_{(0)}^m < T_{(1)}^m < ⋯ < T_{(K)}^m. Then the kth interval of the mth partition is B_k^m = (T_{(k−1)}^m, T_{(k)}^m]. The random partition estimator is then defined as:
(6) f̂(y | X = x) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} 1(y ∈ B_k^m) p̂_k^m(x)/δ_k^m,
where p̂_k^m(x) is the estimator of P(Y ∈ B_k^m | X = x) for the mth estimator and δ_k^m is the width of B_k^m.
The density estimator is an average over M random-partitioning estimators. This estimator alleviates the need to choose the cutpoint locations and can approximate a smooth function as M → ∞. The trade-off is that the computation time increases linearly in M.
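A minimal sketch of the ensemble estimator in (6), assuming each member's bin probabilities for a fixed x have already been estimated; random_partition and ensemble_density are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_partition(T0, TK, K):
    """Draw K-1 interior cutpoints uniformly on (T0, TK), sorted, with fixed endpoints."""
    interior = np.sort(rng.uniform(T0, TK, size=K - 1))
    return np.concatenate([[T0], interior, [TK]])

def ensemble_density(y, partitions, prob_list):
    """Average of M piecewise-constant estimators, Eq. (6).

    partitions: list of M cutpoint arrays (each length K+1)
    prob_list:  list of M bin-probability vectors (each length K) for a fixed x
    """
    total = 0.0
    for cuts, probs in zip(partitions, prob_list):
        k = np.searchsorted(cuts, y, side="left") - 1
        if 0 <= k < len(probs):
            # Height of the bin containing y in this member's partition.
            total += probs[k] / (cuts[k + 1] - cuts[k])
    return total / len(partitions)
```

Averaging over many random partitions smooths out the bin-edge artifacts of any single fixed partition.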
3 Density Estimation Consistency
In this section, we study the conditions that are required for consistent density estimation. We denote the density estimator as f̂(y|x), and consider the estimator to be consistent if the integrated mean squared error tends to zero asymptotically.
Theorem 1.
Let f(y|x) be the target conditional density, and let B_1, …, B_K be equally sized, consecutive and nonoverlapping bins on the true density support. The true probability in bin B_k is p_k(x) = P(Y ∈ B_k | X = x). If the target density and classification estimator satisfy conditions (i) to (iv) below, then the density estimator f̂(y|x) is consistent.

(i) The density f(y|x) is continuously differentiable in y, and f(y|x) and its derivative in y are bounded.

(ii) As n → ∞, the number of bins K → ∞ and the bin width h → 0, with nh → ∞.

(iii) The bias of the classification estimator satisfies max_k |E[p̂_k(x)] − p_k(x)| = o(h).

(iv) The variance of the classification estimator satisfies max_k Var(p̂_k(x)) = o(h²).
The proof of Theorem 1 is given in the appendix. This theorem is agnostic to the classification model. Note that we are not aware of results showing that deep learning estimation of p_k(x) attains properties (iii) and (iv). However, for illustration purposes, we show as an example that multiclass logistic regression satisfies conditions (iii) and (iv), given a proper choice of the number of classes K.
Let h denote the bin width and assume we are interested in the target density on a finite support, so that h is proportional to 1/K. The input vector x has dimensionality d. For the logistic regression model, p_k(x; θ) = exp(xᵀθ_k) / Σ_{j=1}^{K} exp(xᵀθ_j), where each coefficient vector θ_k has dimensionality d and θ denotes the full parameter vector, of dimensionality Kd. Thus the bias of p̂_k(x) = p_k(x; θ̂) is:
(7) E[p_k(x; θ̂)] − p_k(x; θ₀) = E[∇_θ p_k(x; θ̃)ᵀ(θ̂ − θ₀)].
By the mean value theorem, θ̃ is a point between θ̂ and the true value θ₀. Based on results from He and Shao (2000), the estimator θ̂ is consistent provided the parameter dimension grows sufficiently slowly with n, so the right-hand side of (7) vanishes asymptotically. By condition (i), the density is finite everywhere, so the bin probabilities and the gradient in (7) remain controlled, and from (7) we can obtain the bias rate required by condition (iii).
The Taylor expansion of p_k(x; θ̂) around the true value θ₀ is:
(8) p_k(x; θ̂) = p_k(x; θ₀) + ∇_θ p_k(x; θ̃)ᵀ(θ̂ − θ₀).
He and Shao (2000) showed that θ̂ is asymptotically normal, provided the parameter dimension grows sufficiently slowly relative to the sample size n; here that dimension is Kd. Similar to the derivation in (7), we can then use (8) to bound the variance of p̂_k(x).
Given a sufficiently slow rate of growth for K, both the bias and variance meet conditions (iii) and (iv).
4 Simulation Study
We conduct a simulation study to examine the distribution estimation accuracy of our method compared with quantile regression forest (QRF), originally proposed by Meinshausen (2006). Quantile regression forest is increasingly used in the energy and weather forecasting field [Taillardat et al. (2016), Van der Meer et al. (2018)], due to its robustness, flexibility and mature implementations in R (quantregForest) and Python (scikit-garden).
We applied our method to data generated from a variety of distributions. The scenarios we have tested are a linear model with normally distributed errors (Model 1), mixture distributions with nonlinear mean functions (Models 2 and 3) and a skewed distribution with nonlinear mean function (Model 4). The model specification details are:

Model 1: , where

Model 2: , where

Model 3: , where

Model 4: , where
Model performances are evaluated using several proper scoring rules summarized by Gneiting and Raftery (2007). Specifically, the scoring rules that are used here are:

Continuous ranked probability score (CRPS):
(9) CRPS(F̂(·|x), y) = ∫ ( F̂(z|x) − 1(z ≥ y) )² dz. The integral is approximated by summing over evenly spaced grid points between T_0 and T_K, and normalized by the range T_K − T_0. A lower CRPS indicates better model performance.

Average quantile loss (AQTL):
(10) QTL_τ(y, q̂_τ(x)) = ( y − q̂_τ(x) ) ( τ − 1(y < q̂_τ(x)) ),
(11) AQTL = (1/99) Σ_{τ ∈ {0.01, …, 0.99}} QTL_τ(y, q̂_τ(x)),
where q̂_τ(x) is the estimated quantile function (derived from F̂) and AQTL is the quantile loss averaged over the 99 percentiles. AQTL is also the evaluation metric used in GEFCom2014 [Hong et al. (2016)]. A lower AQTL indicates better model performance.

Empirical coverage of the prediction interval .
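The scoring rules above can be sketched as follows, assuming the estimated CDF is available on an evenly spaced grid; the helper names (crps_from_cdf, aqtl) and the grid choices are illustrative:

```python
import numpy as np

def crps_from_cdf(F_grid, grid, y):
    """Approximate CRPS(F_hat, y) = integral of (F_hat(z) - 1(z >= y))^2 dz
    on an evenly spaced grid, normalized by the grid range as in the text."""
    step = np.mean(np.diff(grid))
    indicator = (grid >= y).astype(float)
    return np.sum((F_grid - indicator) ** 2) * step / (grid[-1] - grid[0])

def aqtl(quantile_fn, y, taus=np.arange(1, 100) / 100.0):
    """Average quantile (pinball) loss over the 99 percentiles tau = 0.01..0.99."""
    q = np.array([quantile_fn(t) for t in taus])
    return np.mean((y - q) * (taus - (y < q).astype(float)))
```

In practice both scores would be averaged over all test observations, with the quantile function read off from the estimated CDF.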
In all simulations, the neural network model is constructed using TensorFlow and Keras. It has three hidden layers with 100 neurons each. The exponential linear unit (ELU) activation function is used at each hidden layer and the softmax activation function is used at the output layer. Regularization via dropout is also applied to each hidden layer [Srivastava et al. (2014)]. The quantile regression forest model is built with scikit-garden. In addition, we include logistic regression as an alternative multiclass classification method to compare with the neural network model. For both the neural network and logistic regression, we evaluate performance across different numbers of partitioned bins. We also examine performance under the different objective functions for the neural network model described in Section 2.1, and evaluate the effect of ensemble random partitioning, with the ensemble model consisting of M independent density estimators as described in Section 2.2.
The simulations are run over independently generated datasets, each split into a training set and a testing set on which the evaluations are compared. The results are summarized in Figure 1.
In all tested scenarios, the neural network shows superior performance over logistic regression. Additionally, the neural network with the JBCE loss demonstrates superior performance over the quantile regression forest. With the JBCE loss, increasing the number of partitions results in a stable improvement in performance, while with the multinomial log-likelihood as the objective function, the performance is sensitive to the number of partitions and requires fine-tuning. Ensemble random partitioning provides slight improvements for the neural network approach. Note that random partitioning produces smoothed estimates that would not be achieved with fixed partitioning.
5 Application to the GEFCOM2014 Dataset
The Global Energy Forecasting Competition 2014 was an IEEE sponsored competition that focused on probabilistic forecasting [Hong et al. (2016)] and attracted hundreds of teams and individuals. It had four competition tracks: electricity load forecasting, electricity price forecasting, solar energy forecasting and wind energy forecasting. We applied our method to the solar energy forecasting dataset and compared its performance with quantile regression forest, which was widely used among the top teams.
The dataset has two years of data from three solar farms in Australia (no other spatial information is available). The solar power outputs were collected at hourly frequency, together with 12 weather forecast variables from the European Centre for Medium-Range Weather Forecasts (ECMWF), which include surface solar radiance, net solar radiation, surface thermal radiation, temperature, total cloud coverage, precipitation, total column liquid water, total column ice water, relative humidity, surface pressure, 10-meter U wind component and 10-meter V wind component. All weather forecast variables are provided as point forecasts. The solar power output was scaled by the competition organizers to the range [0, 1], so T_0 = 0 and T_K = 1. The forecasting task is to use these weather forecast variables to predict the conditional distribution of solar power output for holdout months.
We preprocessed the data to include additional variables that are commonly used in solar energy forecasting, such as indicator variables for each solar farm and the hour of the day. We also created sine and cosine transformations of the date of the year, as solar energy output shows strong seasonality. We compared our method with quantile regression forest in a rolling evaluation setting: we used the first year of data as the initial training dataset and the immediately following month as the testing set. We then appended the testing month to the training set, retrained the models, and re-evaluated performance on the next month as the new testing set. This process was repeated for 12 holdout months on which the models were evaluated. The results are included in Figure 2, which compares different model specifications with the quantile regression forest. The y-axes of the first two panels show the percent change in CRPS or AQTL relative to the quantile regression forest, evaluated for each testing month and averaged across months. Lower scores indicate better performance.
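The cyclical encoding and rolling evaluation described above can be sketched as follows; cyclical_encode and rolling_splits are illustrative helper names, and the period constant is an assumption:

```python
import numpy as np

def cyclical_encode(day_of_year, period=365.25):
    """Sine/cosine encoding of a cyclical feature such as the date of the year,
    so that day 365 and day 1 map to nearby feature values."""
    angle = 2 * np.pi * np.asarray(day_of_year) / period
    return np.sin(angle), np.cos(angle)

def rolling_splits(n_obs, initial, step):
    """Yield (train_idx, test_idx) pairs for a rolling evaluation: train on all
    data up to time t, test on the next block of `step` observations, then fold
    the test block into the training set and repeat."""
    t = initial
    while t + step <= n_obs:
        yield np.arange(t), np.arange(t, t + step)
        t += step
```

For the competition data, `initial` would correspond to the first year of hourly observations and `step` to one month.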
Our method with the JBCE loss consistently achieves a reduction in CRPS and AQTL compared to the quantile regression forest. With the multinomial log-likelihood objective function, however, the performance is sensitive to the number of cutpoints, and on this dataset a smaller number of cutpoints is preferred. This result is consistent with our simulations: models trained with the JBCE loss are less sensitive to the number of cutpoints and thus require less tuning.
In Figure 3, we show an example of the distribution pattern of solar energy output revealed by our model. We only focus on the time period during the day, since the solar energy at night is always zero and does not require forecasting. For days such as March 10th, 2014, the weather is clear, and the prediction intervals are narrow, and thus the predicted energy output has little variability. The predicted distribution on days forecasted to be sunny is often negatively skewed, as energy output cannot be higher than its maximum capacity, yet can be negatively impacted by unexpected cloudy conditions. On March 12th, 2014, wide prediction intervals indicate volatility in solar energy output. This information helps grid operators prepare for unexpected fluctuations in energy output, where point forecasts fail to do so.
We zoom into 0:00 AM UTC of March 10th and March 12th from Figure 3 and show the estimated conditional densities in Figure 4. March 10th represents a sunny day and shows a relatively concentrated density with slight negative skewness, whereas March 12th represents a day with fluctuating weather and a wide conditional density. The densities are evaluated at evenly spaced grid points in [0, 1]. Though both methods generate nonsmooth cumulative distribution functions, and thus spiky density functions, our method with the ensemble strategy shows a much smoother pattern.
6 Discussion
In this paper, we leverage the success of machine learning classification models such as neural networks to build conditional distribution estimates. Our proposed two-stage framework transforms the conditional distribution estimation problem into a multiclass classification problem, which allows us to use flexible and robust machine learning tools. To understand this framework thoroughly, we examined the effects of different partitioning methods, classification models and objective functions. We also established theoretical conditions under which our framework achieves consistent estimation. Our results show that a neural network trained with the joint binary cross-entropy loss can achieve superior performance without sensitivity to the number of partitions. In both the simulations and the solar energy generation dataset, our model obtained better performance than the popular quantile regression forest. By providing the full conditional distribution, our model can give useful insight in solar energy forecasting practice.
This research is focused on harnessing the power of machine learning methods for uncertainty estimation. We believe our research opens up opportunities for bringing in machine learning models into distribution estimation problems.
7 Acknowledgements
The authors’ work was partially supported by King Abdullah University of Science and Technology (grant number 3800.2).
8 Appendix
Following is the proof of Theorem 1. This proof parallels the proof of histogram consistency in Theorem 6.11 of Wasserman (2006), with the additional consideration of the bias and variance from the classification model. Let f denote f(y|x), f̂ denote f̂(y|x), and p̂_k denote p̂_k(x). Without loss of generality, assume the range of Y is [0, 1]. Since K denotes the number of equally sized, consecutive and nonoverlapping bins, let h = 1/K denote the bin width. For a given y in the support of f, there exists a bin B_k such that y ∈ B_k. Then we have
p_k = ∫_{B_k} f(u) du = ∫_{B_k} [ f(c_k) + f′(ũ)(u − c_k) ] du = h f(c_k) + O(h²),
where ũ is between u and c_k, and c_k is the middle point of interval B_k.
So the bias of the density estimator is the following:
Bias(f̂(y)) = E[p̂_k]/h − f(y) = (E[p̂_k] − p_k)/h + (f(c_k) − f(y)) + O(h) = b_k/h + O(h),
where b_k = E[p̂_k] − p_k is the bias of the classification model, and f(c_k) − f(y) = O(h) by the bounded derivative in condition (i).
Integrate the squared bias over the interval B_k:
∫_{B_k} Bias²(f̂(y)) dy ≤ 2 b_k²/h + O(h³).
We then integrate over the range of Y, summing over the K bins:
∫_0^1 Bias²(f̂(y)) dy ≤ 2 max_k b_k²/h² + O(h²).
If the model bias satisfies max_k |b_k| = o(h), as in condition (iii), we achieve ∫ Bias²(f̂(y)) dy → 0 as n → ∞, since h → 0 by condition (ii).
For the variance part:
Var(f̂(y)) = Var(p̂_k)/h², so ∫_0^1 Var(f̂(y)) dy ≤ max_k Var(p̂_k)/h².
If the model variance satisfies max_k Var(p̂_k) = o(h²), as in condition (iv), we achieve ∫ Var(f̂(y)) dy → 0 as n → ∞. Combining the bias and variance parts, the integrated mean squared error tends to zero, which completes the proof.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y. and Zheng, X. (2016) TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. URL: https://www.usenix.org/system/files/conference/osdi16/osdi16abadi.pdf.
 Cheng et al. (2008) Cheng, J., Wang, Z. and Pollastri, G. (2008) A neural network approach to ordinal regression. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 1279–1284. IEEE.
 Diebold et al. (1997) Diebold, F. X., Gunther, T. A. and Tay, A. (1997) Evaluating density forecasts.
 Escobar and West (1995) Escobar, M. D. and West, M. (1995) Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.
 Fahey et al. (2007) Fahey, M. T., Thane, C. W., Bramwell, G. D. and Coward, W. A. (2007) Conditional Gaussian mixture modelling for dietary pattern analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170, 149–166.
 Fan et al. (2009) Fan, J.q., Peng, L., Yao, Q.w. and Zhang, W.y. (2009) Approximating conditional density functions using dimension reduction. Acta Mathematicae Applicatae Sinica, English Series, 25, 445–456.
 Frank and Hall (2001) Frank, E. and Hall, M. (2001) A simple approach to ordinal classification. In European Conference on Machine Learning, 145–156. Springer.
 Friedman (2001) Friedman, J. H. (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232.
 Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
 He and Shao (2000) He, X. and Shao, Q.M. (2000) On parameters of increasing dimensions. Journal of Multivariate Analysis, 73, 120–135.
 Holmes et al. (2012) Holmes, M. P., Gray, A. G. and Isbell, C. L. (2012) Fast nonparametric conditional density estimation. arXiv preprint arXiv:1206.5278.
 Hong et al. (2016) Hong, T., Pinson, P., Fan, S., Zareipour, H., Troccoli, A. and Hyndman, R. J. (2016) Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. International Journal of Forecasting, 32, 896–913.
 Hyndman et al. (1996) Hyndman, R. J., Bashtannyk, D. M. and Grunwald, G. K. (1996) Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, 5, 315–336.
 Hyndman and Yao (2002) Hyndman, R. J. and Yao, Q. (2002) Nonparametric estimation and symmetry tests for conditional density functions. Journal of Nonparametric Statistics, 14, 259–278.
 Izbicki and Lee (2016) Izbicki, R. and Lee, A. B. (2016) Nonparametric conditional density estimation in a highdimensional regression setting. Journal of Computational and Graphical Statistics, 25, 1297–1316.
 Khosravi et al. (2011) Khosravi, A., Nahavandi, S., Creighton, D. and Atiya, A. F. (2011) Comprehensive review of neural networkbased prediction intervals and new advances. IEEE Transactions on Neural Networks, 22, 1341–1356.
 Koenker and Hallock (2001) Koenker, R. and Hallock, K. F. (2001) Quantile regression. Journal of economic perspectives, 15, 143–156.
 Van der Meer et al. (2018) Van der Meer, D. W., Widén, J. and Munkhammar, J. (2018) Review on probabilistic forecasting of photovoltaic power production and electricity consumption. Renewable and Sustainable Energy Reviews, 81, 1484–1512.
 Meinshausen (2006) Meinshausen, N. (2006) Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L. and Lerer, A. (2017) Automatic differentiation in PyTorch. In NIPS-W.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. et al. (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
 Rodrigues and Pereira (2018) Rodrigues, F. and Pereira, F. C. (2018) Beyond expectation: Deep joint mean and quantile regression for spatiotemporal problems. arXiv preprint arXiv:1808.08798.
 Rojas et al. (2005) Rojas, A. L., Genovese, C. R., Miller, C. J., Nichol, R. and Wasserman, L. (2005) Conditional density estimation using finite mixture models with an application to astrophysics.
 Rosenblatt (1969) Rosenblatt, M. (1969) Conditional probability density and regression estimators. In P. R. Krishnaiah (Ed.), Multivariate analysis ii, 25–31.
 Schapire et al. (2002) Schapire, R. E., Stone, P., McAllester, D., Littman, M. L. and Csirik, J. A. (2002) Modeling auction price uncertainty using boosting-based conditional density estimation. In ICML, 546–553.
 Shrestha and Solomatine (2006) Shrestha, D. L. and Solomatine, D. P. (2006) Machine learning approaches for estimation of prediction interval for the model output. Neural Networks, 19, 225–235.
 Song et al. (2004) Song, X., Yang, K. and Pavel, M. (2004) Density boosting for gaussian mixtures. In International Conference on Neural Information Processing, 508–515. Springer.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15, 1929–1958.
 Taillardat et al. (2016) Taillardat, M., Mestre, O., Zamo, M. and Naveau, P. (2016) Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Monthly Weather Review, 144, 2375–2393.
 Taylor (2000) Taylor, J. W. (2000) A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19, 299–311.
 Timmermann (2000) Timmermann, A. (2000) Density forecasting in economics and finance. Journal of Forecasting, 19, 231–234.
 Wasserman (2006) Wasserman, L. (2006) All of nonparametric statistics. Springer Science & Business Media.
 Wilson and Bell (2007) Wilson, T. and Bell, M. (2007) Probabilistic regional population forecasts: The example of Queensland, Australia. Geographical Analysis, 39, 1–25.
 Zhu and Laptev (2017) Zhu, L. and Laptev, N. (2017) Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 103–110. IEEE.