Deep Distribution Regression

Rui Li1, Howard D. Bondell1,2 and Brian J. Reich1

1Department of Statistics, North Carolina State University

2School of Mathematics and Statistics, University of Melbourne

Abstract

Due to their flexibility and predictive performance, machine-learning based regression methods have become an important tool for predictive modeling and forecasting. However, most methods focus on estimating the conditional mean or specific quantiles of the target quantity and do not provide the full conditional distribution, which contains uncertainty information that might be crucial for decision making. In this article, we provide a general solution by transforming the conditional distribution estimation problem into a constrained multi-class classification problem, to which tools such as deep neural networks can be readily applied. We propose a novel joint binary cross-entropy loss function to accomplish this goal. We demonstrate its performance in various simulation studies, comparing it with state-of-the-art competing methods. Additionally, our method shows improved accuracy in a probabilistic solar energy forecasting problem.

Keywords— conditional distribution, deep learning, machine learning, probabilistic forecasting

1 Introduction

In recent years, a variety of machine learning methods, such as random forests, gradient boosted trees and neural networks, have gained popularity and been widely adopted. These methods are often flexible enough to uncover complex relationships in high-dimensional data without strong assumptions on the underlying data structure, and off-the-shelf software is available to put these algorithms into production [Pedregosa et al. (2011), Abadi et al. (2016) and Paszke et al. (2017)]. However, in regression and forecasting tasks, many machine learning methods only provide a point estimate, without any additional information regarding the uncertainty of the target quantity. Understanding uncertainty is often crucial in fields such as financial markets and risk analysis [Diebold et al. (1997), Timmermann (2000)], population and demographic studies [Wilson and Bell (2007)], transportation and traffic analysis [Zhu and Laptev (2017), Rodrigues and Pereira (2018)] and energy forecasting [Hong et al. (2016)]. In this article, we establish a framework that directly extends off-the-shelf machine learning algorithms to provide the full conditional distribution of the response given the covariates. Instead of estimating specific quantiles [Koenker and Hallock (2001), Taylor (2000) and Friedman (2001)] or prediction intervals [Khosravi et al. (2011), Shrestha and Solomatine (2006)], we aim to estimate the full conditional distribution directly, as these other quantities can be extracted from it.

This research is motivated by the need for probabilistic prediction in energy forecasting. The Global Energy Forecasting Competition 2014 [Hong et al. (2016)] focused on probabilistic forecasting, due to the high demand for uncertainty estimation in energy forecasting, yet few existing methods are readily available. The solar energy forecasting track of this competition aimed to estimate the full conditional distribution of solar energy generation based on covariates such as solar radiance, temperature, time of day, etc. This is a crucial task in day-to-day operation, as solar energy generation is highly weather dependent and thus volatile. To ensure stability of the electrical grid, grid operators require not only accurate point predictions of solar energy generation, but also its volatility given weather conditions. Driven by this practical problem, we establish a method that provides comprehensive information about energy generation uncertainty by estimating the full conditional distribution. Our method shows superior estimation accuracy in our real data analysis compared with competing methods.

There are a number of approaches to estimating the conditional distribution of a target quantity $Y$ given an input vector $\mathbf{X}$. A major class of methods estimates the joint density $f(\mathbf{x}, y)$ and the marginal density $f(\mathbf{x})$ through kernel density estimators [Rosenblatt (1969)] to obtain the conditional density estimate $\hat{f}(y \mid \mathbf{x}) = \hat{f}(\mathbf{x}, y) / \hat{f}(\mathbf{x})$. This approach is limited by the dimensionality of the input space, since it requires estimating the full joint distribution of $(\mathbf{X}, Y)$. Several methods have been proposed to address some of these limitations in both locally parametric and non-parametric ways [Hyndman et al. (1996), Hyndman and Yao (2002), Holmes et al. (2012), Fan et al. (2009), Izbicki and Lee (2016)]. Another popular approach is to approximate the distribution of interest by a parametric distribution family or a mixture of such distributions, such as a mixture of Gaussians [Escobar and West (1995), Song et al. (2004), Rojas et al. (2005), Fahey et al. (2007)]. This approach faces the challenge of determining the number of mixture components, and its approximation performance depends on the complexity of the true underlying distribution. Bootstrap-based aggregation provides an alternative; a good example is the quantile regression forest [Meinshausen (2006)], which obtains the full conditional distribution from the empirical distribution in the aggregated tree leaves. A boosting approach to this problem was discussed in Schapire et al. (2002).

Machine learning approaches have experienced major success in classification problems. We leverage this success to build conditional distribution estimates. In this paper, we propose a two-stage method for conditional density estimation. In the first stage, we partition the response space into bins; in the second stage, the probabilities that the target variable falls into each bin, given the input covariates, are estimated. In this way, we transform a conditional density estimation problem into a multi-class classification task, where we can take advantage of many off-the-shelf machine learning algorithms. The framework is model-agnostic in the second stage, allowing any suitable multi-class classification method, such as deep learning, to be plugged in. To further accommodate the fact that the classes are ordered, we couple the neural network model with a joint binary cross-entropy loss. The design of our neural network also ensures the monotonicity of the estimated cumulative distribution function, which is not guaranteed by other ordinal classification methods such as those of Frank and Hall (2001) and Cheng et al. (2008). In addition, we explore random partitioning in the first stage, followed by an ensemble approach, to obtain a smoother and more stable density estimator.

The paper is organized as follows. In Section 2, we describe the model setup. In Section 2.1, we examine approaches and loss functions that can be used in the multi-class classification stage, and in Section 2.2 we examine an alternative partitioning method and model ensemble. In Section 3, we show that, given consistency of the classification model, we can achieve consistency of the density estimator. We evaluate model performance with simulation studies in Section 4, where we thoroughly examine the effect of the number of bins, different binning strategies, and different loss functions. In Section 5, the method is illustrated on the aforementioned solar energy example, where we demonstrate superior performance to quantile regression forests. We conclude with a discussion in Section 6.

2 Distribution Estimation by Partitioning

Our approach is based on partitioning the range of the response variable into bins and approximating the conditional density function by a piecewise-constant function. We also propose ensemble random partitioning, which allows for a smooth density. Formally, assume we are interested in the density of the response $Y$ on the range $(a, b)$. This interval is partitioned by cut-points $a = t_0 < t_1 < \dots < t_K = b$ into $K$ bins. Let $B_k = (t_{k-1}, t_k]$ denote the $k$th bin, for $k = 1, \dots, K$.

Given the covariates $\mathbf{X} = \mathbf{x}$ and the partition, denote by $p_k(\mathbf{x}) = P(Y \in B_k \mid \mathbf{X} = \mathbf{x})$, for $k = 1, \dots, K$, the conditional probability that $Y$ belongs to the $k$th interval of the partition. Assume the density function in the $k$th bin can be approximated by the constant function $p_k(\mathbf{x}) / |B_k|$, for all $y \in B_k$; then the conditional density function has the form:

$$f(y \mid \mathbf{x}) = \sum_{k=1}^{K} \frac{p_k(\mathbf{x})}{|B_k|} \, \mathbb{1}\{y \in B_k\}, \qquad (1)$$

where $|B_k| = t_k - t_{k-1}$ is the size of the interval $B_k$.

We then estimate $p_k(\mathbf{x})$ with a classification model. Let $\hat{p}_k(\mathbf{x})$ be the estimator of $p_k(\mathbf{x})$ obtained from the classification model. Plugging $\hat{p}_k(\mathbf{x})$ into (1) gives the density estimator:

$$\hat{f}(y \mid \mathbf{x}) = \sum_{k=1}^{K} \frac{\hat{p}_k(\mathbf{x})}{|B_k|} \, \mathbb{1}\{y \in B_k\}.$$
A natural approach to estimating $p_k(\mathbf{x})$ is discussed in Section 2.1.
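To make the construction concrete, here is a minimal sketch (not the paper's code) of evaluating the piecewise-constant density estimator described above from a vector of estimated bin probabilities; the function name, edges and probabilities are illustrative placeholders.

```python
import numpy as np

def piecewise_density(y, edges, probs):
    """Evaluate the piecewise-constant density implied by bin probabilities
    `probs` over the partition defined by `edges`.
    edges: array of K+1 cut-points t_0 < ... < t_K
    probs: array of K bin probabilities p_k(x) summing to 1"""
    edges = np.asarray(edges, dtype=float)
    probs = np.asarray(probs, dtype=float)
    widths = np.diff(edges)                      # |B_k| = t_k - t_{k-1}
    k = np.searchsorted(edges, y, side="right") - 1
    k = np.clip(k, 0, len(probs) - 1)            # clamp the boundary point y = t_K
    return probs[k] / widths[k]                  # p_k(x) / |B_k|

# Example: 4 equal bins on [0, 1]; y = 0.6 falls in the third bin
edges = np.linspace(0.0, 1.0, 5)
probs = np.array([0.1, 0.2, 0.3, 0.4])
density = piecewise_density(0.6, edges, probs)   # 0.3 / 0.25 = 1.2
```

Because the bin probabilities sum to one, the resulting step function integrates to one, so it is a valid density.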

2.1 Probability Estimation for Each Partitioned Bin

Below we describe two different modeling strategies for the multi-class classification task of estimating $p_k(\mathbf{x})$.

2.1.1 Multinomial Log-likelihood

A natural approach to obtain estimates of the conditional probability of a response observation falling in each bin is to model the conditional distribution as a multinomial distribution. We note that this does not take into account the fact that the bins are ordered; we discuss how to deal with this in the next section. For a given observation $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, we let $\mathbf{z}_i$ represent the event $y_i \in B_k$, where $\mathbf{z}_i$ is the vector of zeros except for a $1$ in the $k$th element. We then assume that $\mathbf{z}_i \sim \mathrm{Multinomial}(1, \mathbf{p}(\mathbf{x}_i; \theta))$, where $\mathbf{p}(\mathbf{x}_i; \theta)$ is the probability vector describing the conditional probability of $y_i$ belonging to each bin given $\mathbf{x}_i$, and $\theta$ denotes the parameters in the classification model. The log-likelihood function is:

$$\ell(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log p_k(\mathbf{x}_i; \theta).$$
Given an appropriate classification model, our goal is to maximize the log-likelihood function with respect to $\theta$. In this work, we use deep neural networks as a flexible and robust method; under this model, $\theta$ corresponds to the biases and weights of the network. The softmax function $p_k = \exp(o_k) / \sum_{j=1}^{K} \exp(o_j)$ is used as the activation function for the output layer of the neural network, where $\mathbf{o} = (o_1, \dots, o_K)$ is the output layer vector prior to the softmax transformation. The softmax activation ensures that the estimator is a valid probability vector, such that $\hat{p}_k(\mathbf{x}) \ge 0$ and $\sum_{k=1}^{K} \hat{p}_k(\mathbf{x}) = 1$.

Although we choose the neural network model, our framework is model-agnostic, and any method that is appropriate for multi-class classification can be utilized. This approach is very straightforward in practice, as many off-the-shelf machine learning tools have multi-class classification algorithms implemented. However, one potential caveat is that this approach can be sensitive to the number of bins. While a larger number of bins is required to uncover the fine details of the target distribution and reduce bias, it results in increased variance due to the smaller number of observations per partitioned bin. Additionally, this approach does not take the order of the bins into consideration. To address some of these concerns, we provide a modified method in Section 2.1.2.
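As a framework-agnostic illustration of this stage, here is a small NumPy sketch of the softmax output layer and the multinomial log-likelihood; the array names and shapes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(o):
    """Map the network's output layer o to a valid probability vector."""
    e = np.exp(o - o.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def multinomial_loglik(O, Z):
    """Multinomial log-likelihood: sum_i sum_k z_ik * log p_k(x_i).
    O: (n, K) pre-softmax network outputs; Z: (n, K) one-hot bin indicators."""
    P = softmax(O)
    return np.sum(Z * np.log(P))

# Toy check: two observations, three bins
O = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
Z = np.array([[1, 0, 0], [0, 0, 1]])
ll = multinomial_loglik(O, Z)
```

Maximizing this log-likelihood is equivalent to minimizing the categorical cross-entropy loss offered by standard deep learning libraries.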

2.1.2 Joint Binary Cross Entropy Loss

Instead of utilizing multi-class classification to directly estimate the probabilities $p_k(\mathbf{x})$, we reframe the problem as a set of binary classification tasks, with one binary classification evaluated at each of the cut-points. In other words, we obtain the conditional cumulative distribution function by estimating $F(t_k \mid \mathbf{x}) = P(Y \le t_k \mid \mathbf{x})$, for $k = 1, \dots, K-1$, jointly.

In this approach, we retain the model (i.e. neural networks) for the conditional probability that $Y$ belongs to each bin given $\mathbf{x}$, but specify the loss function in terms of the cumulative distribution function $F(t_k \mid \mathbf{x}) = \sum_{j=1}^{k} p_j(\mathbf{x})$. Because the estimates $\hat{p}_k(\mathbf{x})$ are non-negative, we automatically obtain the monotonicity property of $\hat{F}$. For a given cut-point $t_k$, the binary cross-entropy (BCE) loss is:

$$\mathrm{BCE}_k(\theta) = -\sum_{i=1}^{n} \left[ \mathbb{1}\{y_i \le t_k\} \log F(t_k \mid \mathbf{x}_i) + \mathbb{1}\{y_i > t_k\} \log\left(1 - F(t_k \mid \mathbf{x}_i)\right) \right].$$
Combining the BCE losses across all cut-points gives the joint binary cross-entropy (JBCE) loss:

$$\mathrm{JBCE}(\theta) = \sum_{k=1}^{K-1} \mathrm{BCE}_k(\theta).$$
Our goal becomes minimizing the JBCE loss with respect to $\theta$. This approach takes the order of the bins into consideration, in contrast to multi-class classification, where the relationships among the bins are ignored. In scenarios where the number of cut-points is large, the number of observations per bin can be small and the estimation of $p_k(\mathbf{x})$ can be poor in the multi-class classification setting. This alternative approach thus provides an advantage, in that it remains stable even for larger numbers of cut-points, as will be seen in our simulation study.

The concept is motivated by the ordinal classification approach proposed by Frank and Hall (2001), yet we provide two advantages over the original method. First, instead of building independent binary classification models, we model the binary classification events jointly, which greatly reduces the computational cost. Second, our model ensures monotonicity of the estimated cumulative distribution function, which the original method of Frank and Hall (2001) does not guarantee.
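A minimal NumPy sketch of the JBCE loss described above, under the assumption that the network outputs bin probabilities `P` with rows summing to one; the cumulative sums give the CDF values at the interior cut-points, so monotonicity holds by construction. Names and shapes are illustrative.

```python
import numpy as np

def jbce_loss(P, y, cuts):
    """Joint binary cross-entropy loss.
    P: (n, K) estimated bin probabilities (rows sum to 1)
    y: (n,) responses; cuts: the K-1 interior cut-points t_1 < ... < t_{K-1}
    F(t_k | x_i) is the cumulative sum of bin probabilities, which is
    monotone automatically since each p_k >= 0."""
    F = np.cumsum(P, axis=1)[:, :-1]             # (n, K-1): F(t_k | x_i)
    below = (y[:, None] <= np.asarray(cuts)[None, :]).astype(float)
    eps = 1e-12                                  # guard against log(0)
    bce = below * np.log(F + eps) + (1 - below) * np.log(1 - F + eps)
    return -bce.sum()                            # sum over cut-points and observations

# Toy example: 4 equal bins on [0, 1], interior cut-points 0.25, 0.5, 0.75
P = np.array([[0.1, 0.2, 0.3, 0.4]])
loss = jbce_loss(P, np.array([0.6]), cuts=[0.25, 0.5, 0.75])
```

In a deep learning framework, this function would be registered as a custom loss on the softmax output layer, so the same network can be trained with either objective.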

2.2 Distribution Estimation with Ensemble Random Partitioning

Approximating the conditional density with a piecewise-constant function can suffer from two drawbacks: (1) the results are sensitive to the choice of the cut-points, and (2) the density and cumulative distribution functions are not smooth. To address these two issues, we propose an ensemble random partitioning method. This method fits $M$ independent density estimators, each with different positions of the cut-points, and the final ensemble estimator is an average over the individual estimators. For the $m$th estimator, we generate random interior cut-points uniformly on $(a, b)$ and sort them, so that $a = t_{m,0} < t_{m,1} < \dots < t_{m,K} = b$. The $k$th interval of the $m$th partition is then $B_{mk} = (t_{m,k-1}, t_{m,k}]$. The random partition estimator is defined as:

$$\hat{f}(y \mid \mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{K} \frac{\hat{p}_{mk}(\mathbf{x})}{|B_{mk}|} \, \mathbb{1}\{y \in B_{mk}\},$$

where $\hat{p}_{mk}(\mathbf{x})$ is the estimator of $P(Y \in B_{mk} \mid \mathbf{X} = \mathbf{x})$ for the $m$th estimator.

The density estimator is an average over $M$ random partitioning estimators. This estimator alleviates the need to choose cut-point locations and can approximate a smooth function as $M \to \infty$. The trade-off is that the computation time increases linearly in $M$.
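The random partitioning and averaging steps can be sketched as follows; here the uniform bin probabilities stand in for fitted classifiers, and all names are illustrative placeholders rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_partition(a, b, K, rng):
    """Draw K-1 interior cut-points uniformly on (a, b) and return
    the sorted edges a = t_0 < t_1 < ... < t_K = b."""
    interior = np.sort(rng.uniform(a, b, size=K - 1))
    return np.concatenate(([a], interior, [b]))

def ensemble_density(y, partitions, prob_list):
    """Average the M piecewise-constant estimators over random partitions."""
    total = 0.0
    for edges, probs in zip(partitions, prob_list):
        widths = np.diff(edges)
        k = min(np.searchsorted(edges, y, side="right") - 1, len(probs) - 1)
        total += probs[k] / widths[k]
    return total / len(partitions)

# Sketch: M = 3 random partitions of [0, 1] with K = 4 bins each,
# using uniform bin probabilities as placeholders for fitted models
M, K = 3, 4
partitions = [random_partition(0.0, 1.0, K, rng) for _ in range(M)]
prob_list = [np.full(K, 1.0 / K) for _ in range(M)]
f_hat = ensemble_density(0.5, partitions, prob_list)
```

Because the bin edges differ across ensemble members, the discontinuities of the individual step functions no longer line up, which is what smooths the averaged estimate.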

3 Density Estimation Consistency

In this section, we study the conditions required for consistent density estimation. We denote the density estimator as $\hat{f}(y \mid \mathbf{x})$, and consider the estimator to be consistent if the integrated mean squared error converges to zero asymptotically.

Theorem 1.

Let $f(y \mid \mathbf{x})$ be the target conditional density, and let the partition consist of equally sized, consecutive and non-overlapping bins on the true density support. The true probability in the $k$th bin is $p_k(\mathbf{x})$. If the target density and the classification estimator satisfy conditions (i) to (iv) below, then the density estimator is consistent.

  (i) , and .

  (ii) As , and .

  (iii) .

  (iv) .

The proof of Theorem 1 is given in the appendix. The theorem is agnostic to the classification model. Note that we are not aware of results showing that deep learning estimation of $p_k(\mathbf{x})$ attains properties (iii) and (iv). However, for illustration, we show as an example that multi-class logistic regression satisfies conditions (iii) and (iv), given a proper choice of the number of classes.

Let denote the bin width and assume we are interested in the target density on a finite support, such that . The input vector has the dimensionality . For the logistic regression model, , where the coefficient has dimensionality and let denote the full parameter with dimensionality . Thus the bias for is:


By the mean value theorem, is a point between and the true value . Based on results from He and Shao (2000), given , we have , thus , and . By condition , the density is finite everywhere, so we have and . Thus from , we can obtain .

The Taylor expansion of is :


He and Shao (2000) showed that is asymptotically normal with , if , where in this case . Similar to the derivation in (7), we can obtain , and thus .

Given , we have that both bias and variance meet conditions (iii) and (iv).

4 Simulation Study

We conduct a simulation study to examine the distribution estimation accuracy of our method compared with quantile regression forest (QRF), originally proposed by Meinshausen (2006). Quantile regression forest is increasingly used in the energy and weather forecasting field [Taillardat et al. (2016), Van der Meer et al. (2018)], due to its robustness, flexibility and mature implementations in R (quantregForest) and Python (scikit-garden).

We applied our method to data generated from a variety of distributions. The tested scenarios are a linear model with normally distributed errors (Model 1), mixture distributions with nonlinear mean functions (Models 2 and 3), and a skewed distribution with a nonlinear mean function (Model 4). The model specifications are:

  1. Model 1: , where

  2. Model 2: , where

  3. Model 3: , where

  4. Model 4: , where

Model performances are evaluated using several proper scoring rules summarized by Gneiting and Raftery (2007). Specifically, the scoring rules that are used here are:

  1. Continuous ranked probability score (CRPS):

$$\mathrm{CRPS}(\hat{F}, y) = \int_{-\infty}^{\infty} \left( \hat{F}(t) - \mathbb{1}\{t \ge y\} \right)^2 dt$$

    The integral is approximated by summing over evenly spaced grid points between the lower and upper bounds of the support, and normalized by the range. A lower CRPS indicates better model performance.

  2. Average quantile loss (AQTL):

$$\mathrm{QTL}_\tau(\hat{q}_\tau, y) = \left( y - \hat{q}_\tau \right) \left( \tau - \mathbb{1}\{y < \hat{q}_\tau\} \right)$$

    Here $\hat{q}_\tau$ is the estimated quantile function (derived from $\hat{F}$), and AQTL is the quantile loss averaged over the percentiles. AQTL is also the evaluation metric used in GEFCom2014 [Hong et al. (2016)]. Lower AQTL indicates better model performance.

  3. Empirical coverage of the central prediction interval.
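The first two scoring rules can be sketched numerically as below; `F_hat` and `q_hat` are hypothetical callables standing in for a fitted CDF and quantile function, and the grid approximation mirrors the description above.

```python
import numpy as np

def crps(F_hat, y, grid):
    """CRPS via a grid approximation of the integral of
    (F_hat(t) - 1{t >= y})^2, normalized by the grid's range."""
    Fg = np.array([F_hat(t) for t in grid])
    step = (grid[-1] - grid[0]) / (len(grid) - 1)   # grid spacing
    return np.sum((Fg - (grid >= y)) ** 2) * step / (grid[-1] - grid[0])

def avg_quantile_loss(q_hat, y, taus):
    """Quantile (pinball) loss rho_tau(y - q_tau) averaged over percentiles."""
    losses = [(y - q_hat(t)) * (t - (y < q_hat(t))) for t in taus]
    return float(np.mean(losses))

# Sanity check: a point-mass forecast located exactly at y is perfect
F_point = lambda t: float(t >= 0.5)
grid = np.linspace(0.0, 1.0, 1001)
score = crps(F_point, 0.5, grid)
```

Both rules are strictly proper, so a forecaster minimizes them in expectation only by reporting the true conditional distribution.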

In all simulations, the neural network model is constructed using TensorFlow and Keras. It has three hidden layers with 100 neurons each. The exponential linear unit (ELU) activation function is used at each hidden layer and the softmax activation function is used at the output layer. Regularization via dropout [Srivastava et al. (2014)] is also applied at each hidden layer. The quantile regression forest model is built with scikit-garden. In addition, we include logistic regression as an alternative multi-class classification method to compare with the neural network model. For both the neural network and logistic regression, we evaluated performance across different numbers of partitioned bins. We also examine performance with different objective functions for the neural network model, as described in Section 2.1. The effect of ensemble random partitioning is also evaluated, with the ensemble model consisting of independent density estimators as described in Section 2.2.

The simulations are run over independent datasets, each split into a training set and a testing set on which the evaluations are compared. The results are summarized in Figure 1.

In all tested scenarios, the neural network shows superior performance over logistic regression. Additionally, the neural network with the JBCE loss demonstrates superior performance over the quantile regression forest. With the JBCE loss, increasing the number of partitions results in stable improvement in performance, while with the multinomial log-likelihood objective, performance is sensitive to the number of partitions and requires fine-tuning. Ensemble random partitioning provides slight improvements for the neural network approach. Note that random partitioning produces smoothed estimates that would not be achieved with a fixed partition.

Figure 1: Model Comparison in Simulations

multinomial logistic regression: Logistic regression as the multi-class classification model, with an evenly spaced fixed partition. multinomial ensemble NN: Ensemble model consisting of neural network models trained with the multinomial log-likelihood objective. multinomial NN: Neural network model trained with the multinomial log-likelihood objective, with an evenly spaced fixed partition. JBCE ensemble NN: Ensemble model consisting of neural network models trained with the JBCE loss. JBCE NN: Neural network model trained with the JBCE loss, with an evenly spaced fixed partition. CRPS: Continuous ranked probability score. AQTL: Average quantile loss.

5 Application to the GEFCOM2014 Dataset

The Global Energy Forecasting Competition 2014 was an IEEE sponsored competition that focused on probabilistic forecasting [Hong et al. (2016)] and attracted hundreds of teams and individuals. It had four competition tracks: electricity load forecasting, electricity price forecasting, solar energy forecasting and wind energy forecasting. We applied our method to the solar energy forecasting dataset and compared its performance with quantile regression forest, which was widely used among the top teams.

The dataset contains two years of data from three solar farms in Australia (no other spatial information is available). The solar power outputs were collected at an hourly frequency, together with 12 weather forecast variables from the European Centre for Medium-range Weather Forecasts (ECMWF): surface solar radiance, net solar radiation, surface thermal radiation, temperature, total cloud coverage, precipitation, total column liquid water, total column ice water, relative humidity, surface pressure, and the 10-meter U and V wind components. All weather forecast variables are provided as point forecasts. The solar power output was scaled by the competition organizers to the range $[0, 1]$, so $a = 0$ and $b = 1$. The forecasting task is to use these weather forecast variables to predict the conditional distribution of solar power output for holdout months.

We preprocessed the data to include additional variables that are commonly used in solar energy forecasting, such as indicator variables for each solar farm and hour of the day. We also created sine and cosine transformations of the date of the year, as solar energy output shows strong seasonality. We conducted experiments to compare our method with quantile regression forest in a rolling evaluation setting: we used the first year of data as the initial training set and the immediately following month as the testing set for evaluation. We then appended the testing month to the training set, retrained the models, and re-evaluated performance on the next month as the new testing set. This process was repeated over 12 holdout months on which the models were evaluated. The results are shown in Figure 2. We compared different model specifications with the quantile regression forest. The y-axes of the first two panels show the percent change in CRPS or AQTL relative to the quantile regression forest, evaluated for each testing month and averaged across months. Lower scores indicate better performance.
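The rolling scheme described above can be sketched as a short loop; `monthly_chunks` and `fit_and_score` are hypothetical placeholders for the data and the model pipeline, not objects from the paper.

```python
def rolling_evaluation(monthly_chunks, n_initial, fit_and_score):
    """Train on all data seen so far, score on the next month, then fold
    that month into the training set before the next round."""
    train = list(monthly_chunks[:n_initial])      # e.g. the first year of data
    scores = []
    for test_month in monthly_chunks[n_initial:]:
        scores.append(fit_and_score(train, test_month))
        train.append(test_month)                  # append test month, retrain next round
    return scores

# Usage with 24 months of placeholder data and a dummy scorer that
# just reports the current training-set size
months = [f"month_{i}" for i in range(24)]
scores = rolling_evaluation(months, 12, lambda tr, te: len(tr))
```

This mimics operational forecasting, where each month's model only ever sees data available before that month.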

Figure 2: Model Comparison with the Solar Energy Generation Datasets

multinomial ensemble NN: Ensemble model consisting of neural network models trained with the multinomial log-likelihood objective. multinomial NN: Neural network model trained with the multinomial log-likelihood objective, with an evenly spaced fixed partition. JBCE ensemble NN: Ensemble model consisting of neural network models trained with the JBCE loss. JBCE NN: Neural network model trained with the JBCE loss, with an evenly spaced fixed partition. CRPS: Continuous ranked probability score. AQTL: Average quantile loss. QRF: Quantile regression forest.

Our method with the JBCE loss consistently achieves a reduction in CRPS and AQTL compared with the quantile regression forest. However, with the multinomial log-likelihood objective, performance is sensitive to the number of cut-points, and on this dataset a smaller number of cut-points is preferred. This result is consistent with our simulations: models trained with the JBCE loss are less sensitive to the number of cut-points and thus require less tuning.

In Figure 3, we show an example of the distribution pattern of solar energy output revealed by our model. We focus on daytime hours, since solar energy at night is always zero and does not require forecasting. On days such as March 10th, 2014, the weather is clear and the prediction intervals are narrow, so the predicted energy output has little variability. The predicted distribution on days forecasted to be sunny is often negatively skewed, as energy output cannot exceed maximum capacity yet can be reduced by unexpectedly cloudy conditions. On March 12th, 2014, wide prediction intervals indicate volatility in solar energy output. This information helps grid operators prepare for unexpected fluctuations in energy output, where point forecasts cannot.

We zoom into March 10th and March 12th at 0:00 UTC of Figure 3 and show the estimated conditional densities in Figure 4. March 10th represents a sunny day and shows a relatively concentrated density with slight negative skewness, whereas March 12th represents a day with fluctuating weather and a wide conditional density. The densities are evaluated at evenly spaced grid points in $[0, 1]$. Though both methods generate non-smooth cumulative distribution functions, and thus spiky density functions, our method with the ensemble strategy shows a much smoother pattern.

Figure 3: Solar Power Prediction with Intervals

The solar power predictions for solar farm 1 are plotted for the first five days of March 2014. The predictions are from the ensemble model consisting of neural network models trained with the JBCE loss.

Figure 4: Solar Power Conditional Density

JBCE ensemble NN: Ensemble model consisting of neural network models trained with the JBCE loss. The densities are evaluated at evenly spaced grid points in $[0, 1]$. For a given grid point, the density is obtained by differencing the estimated cumulative distribution function over adjacent grid points, where the differencing step is the grid width.

6 Discussion

In this paper, we leverage the success of machine learning classification models, such as neural networks, to build conditional distribution estimates. Our proposed two-stage framework transforms the conditional distribution estimation problem into a multi-class classification problem, which allows us to use flexible and robust machine learning tools. To thoroughly understand this framework, we examined the effect of different partitioning methods, classification models and objective functions. We also established theoretical foundations under which our framework achieves consistent estimation. Our results reveal that a neural network trained with the joint binary cross-entropy loss can achieve superior performance without sensitivity to the number of partitions. In both the simulations and the solar energy generation dataset, our model obtains better performance than the popular quantile regression forest. By providing the full conditional distribution, our model can give useful insight in solar energy forecasting practice.

This research focuses on harnessing the power of machine learning methods for uncertainty estimation. We believe it opens up opportunities for bringing machine learning models into distribution estimation problems.

7 Acknowledgements

The authors’ work was partially supported by King Abdullah University of Science and Technology (grant number 3800.2).

8 Appendix

The following is the proof of Theorem 1. The proof parallels the proof of histogram consistency in Theorem 6.11 of Wasserman (2006), with the addition of the bias and variance from the classification model. Let denote , denote and denote . Without loss of generality, assume the range of is . Since denotes the number of equally sized, consecutive and non-overlapping bins, let denote the bin width, so that . For a given in the support of , , such that . Then we have

where is between and , and is the midpoint of the interval .

So the bias of the density estimator is:

Integrating over the interval :

Where .

We then integrate over the range of .

If the model bias , we can achieve as .

For the variance part:

If the model variance , we can achieve as .


  • Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y. and Zheng, X. (2016) TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. URL: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
  • Cheng et al. (2008) Cheng, J., Wang, Z. and Pollastri, G. (2008) A neural network approach to ordinal regression. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 1279–1284. IEEE.
  • Diebold et al. (1997) Diebold, F. X., Gunther, T. A. and Tay, A. (1997) Evaluating density forecasts.
  • Escobar and West (1995) Escobar, M. D. and West, M. (1995) Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.
  • Fahey et al. (2007) Fahey, M. T., Thane, C. W., Bramwell, G. D. and Coward, W. A. (2007) Conditional Gaussian mixture modelling for dietary pattern analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170, 149–166.
  • Fan et al. (2009) Fan, J.-q., Peng, L., Yao, Q.-w. and Zhang, W.-y. (2009) Approximating conditional density functions using dimension reduction. Acta Mathematicae Applicatae Sinica, English Series, 25, 445–456.
  • Frank and Hall (2001) Frank, E. and Hall, M. (2001) A simple approach to ordinal classification. In European Conference on Machine Learning, 145–156. Springer.
  • Friedman (2001) Friedman, J. H. (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
  • He and Shao (2000) He, X. and Shao, Q.-M. (2000) On parameters of increasing dimensions. Journal of Multivariate Analysis, 73, 120–135.
  • Holmes et al. (2012) Holmes, M. P., Gray, A. G. and Isbell, C. L. (2012) Fast nonparametric conditional density estimation. arXiv preprint arXiv:1206.5278.
  • Hong et al. (2016) Hong, T., Pinson, P., Fan, S., Zareipour, H., Troccoli, A. and Hyndman, R. J. (2016) Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. International Journal of Forecasting, 32, 896–913.
  • Hyndman et al. (1996) Hyndman, R. J., Bashtannyk, D. M. and Grunwald, G. K. (1996) Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, 5, 315–336.
  • Hyndman and Yao (2002) Hyndman, R. J. and Yao, Q. (2002) Nonparametric estimation and symmetry tests for conditional density functions. Journal of Nonparametric Statistics, 14, 259–278.
  • Izbicki and Lee (2016) Izbicki, R. and Lee, A. B. (2016) Nonparametric conditional density estimation in a high-dimensional regression setting. Journal of Computational and Graphical Statistics, 25, 1297–1316.
  • Khosravi et al. (2011) Khosravi, A., Nahavandi, S., Creighton, D. and Atiya, A. F. (2011) Comprehensive review of neural network-based prediction intervals and new advances. IEEE Transactions on Neural Networks, 22, 1341–1356.
  • Koenker and Hallock (2001) Koenker, R. and Hallock, K. F. (2001) Quantile regression. Journal of Economic Perspectives, 15, 143–156.
  • Van der Meer et al. (2018) Van der Meer, D. W., Widén, J. and Munkhammar, J. (2018) Review on probabilistic forecasting of photovoltaic power production and electricity consumption. Renewable and Sustainable Energy Reviews, 81, 1484–1512.
  • Meinshausen (2006) Meinshausen, N. (2006) Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L. and Lerer, A. (2017) Automatic differentiation in PyTorch. In NIPS-W.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. et al. (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  • Rodrigues and Pereira (2018) Rodrigues, F. and Pereira, F. C. (2018) Beyond expectation: Deep joint mean and quantile regression for spatio-temporal problems. arXiv preprint arXiv:1808.08798.
  • Rojas et al. (2005) Rojas, A. L., Genovese, C. R., Miller, C. J., Nichol, R. and Wasserman, L. (2005) Conditional density estimation using finite mixture models with an application to astrophysics.
  • Rosenblatt (1969) Rosenblatt, M. (1969) Conditional probability density and regression estimators. In P. R. Krishnaiah (Ed.), Multivariate Analysis II, 25–31.
  • Schapire et al. (2002) Schapire, R. E., Stone, P., McAllester, D., Littman, M. L. and Csirik, J. A. (2002) Modeling auction price uncertainty using boosting-based conditional density estimation. In ICML, 546–553.
  • Shrestha and Solomatine (2006) Shrestha, D. L. and Solomatine, D. P. (2006) Machine learning approaches for estimation of prediction interval for the model output. Neural Networks, 19, 225–235.
  • Song et al. (2004) Song, X., Yang, K. and Pavel, M. (2004) Density boosting for gaussian mixtures. In International Conference on Neural Information Processing, 508–515. Springer.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15, 1929–1958.
  • Taillardat et al. (2016) Taillardat, M., Mestre, O., Zamo, M. and Naveau, P. (2016) Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Monthly Weather Review, 144, 2375–2393.
  • Taylor (2000) Taylor, J. W. (2000) A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19, 299–311.
  • Timmermann (2000) Timmermann, A. (2000) Density forecasting in economics and finance. Journal of Forecasting, 19, 231–234.
  • Wasserman (2006) Wasserman, L. (2006) All of nonparametric statistics. Springer Science & Business Media.
  • Wilson and Bell (2007) Wilson, T. and Bell, M. (2007) Probabilistic regional population forecasts: The example of Queensland, Australia. Geographical Analysis, 39, 1–25.
  • Zhu and Laptev (2017) Zhu, L. and Laptev, N. (2017) Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 103–110. IEEE.