Conditional Density Estimation Tools in Python and R
with Applications to Photometric Redshifts and LikelihoodFree Cosmological Inference
Abstract
It is well known in astronomy that propagating nonGaussian prediction uncertainty in photometric redshift estimates is key to reducing bias in downstream cosmological analyses. Similarly, likelihoodfree inference approaches, which are beginning to emerge as a tool for cosmological analysis, require the full uncertainty landscape of the parameters of interest given observed data. However, most machine learning (ML) based methods with opensource software target point prediction or classification, and hence fall short in quantifying uncertainty in complex regression and parameter inference settings such as the applications mentioned above. As an alternative to methods that focus on predicting the response (or parameters) from features , we provide nonparametric conditional density estimation (CDE) tools for approximating and validating the entire probability density given training data for and . This density approach offers a more nuanced accounting of uncertainty in situations with, e.g., nonstandard error distributions and multimodal or heteroskedastic response variables that are often present in astronomical data sets. As there is no onesizefitsall CDE method, and the ultimate choice of model depends on the application and the training sample size, the goal of this work is to provide a comprehensive range of statistical tools and opensource software for nonparametric CDE and method assessment which can accommodate different types of settings — involving, e.g., mixedtype input from multiple sources, functional data, and image covariates — and which in addition can easily be fit to the problem at hand. Specifically, we introduce CDE software packages in Python and R based on four ML prediction methods adapted and optimized for CDE: NNKCDE, RFCDE, FlexCode, and DeepCDE. Furthermore, we present the cdetools package, which includes functions for computing a CDE loss function for model selection and tuning of parameters, together with diagnostic functions for computing posterior quantiles and coverage probabilities. We provide sample code in Python and R as well as examples of applications to photometric redshift estimation and likelihoodfree cosmology via CDE.
1 Introduction
Machine learning (ML) has seen a surge in popularity in almost all fields of astronomy that involve massive amounts of complex data (Ball & Brunner, 2010; Way et al., 2012; Ivezić et al., 2014; Ntampaka et al., 2019). Most ML methods target prediction and classification, whose primary goal is to return a point estimate of an unknown response variable (e.g., a supernova’s type, or a galaxy’s mass or age) given observed features or covariates (e.g., a supernova’s light curve, or a galaxy’s spectrum, respectively), often falling short in quantifying nontrivial uncertainty in (e.g., degeneracies in the mapping from to ). Neglecting to propagate these uncertainties through complex analyses may lead to imprecise or even inaccurate inferences of physical parameters.
Consider the following two examples of problems where uncertainty quantification can be impactful:

Photometric redshift estimation. In photometric redshift (photo) estimation, one attempts to constrain a galaxy’s Doppler shift or redshift () after observing the shifted spectrum using a handful of broadband filters, and sometimes additional variables such as morphology and environment. In a prediction setting, the response could be the galaxy’s redshift but could also include other galaxy properties () such as the galaxy’s mass or age; the covariates would represent the collection of directly observable inputs such as photometric magnitudes and colors used to predict or more generically a multivariate response . The use of photo posterior estimates — that is, estimates of the probability density functions (PDFs) for individual galaxies — is crucial for cosmological inference from photometric galaxy surveys, as of currently cataloged galaxies are observed solely via photometry, a percentage that will only grow in the coming decade as the Large Synoptic Survey Telescope (LSST) begins gathering data (Ivezić et al., 2019). Though the benefits of using posteriors over have yet to be fully exploited (Viironen et al., 2018), it is thoroughly established that downstream cosmological analysis can be much improved by properly propagating photo estimate uncertainties via probability density functions (PDFs) rather than simple point predictions of (Mandelbaum et al., 2008; Wittman, 2009; Sheldon et al., 2012; Carrasco Kind & Brunner, 2013; Graff et al., 2014; Schmidt et al., 2019).

Forwardmodeled observables in cosmology. Some cosmological probes, such as the type Ia supernova (SN Ia) distanceredshift relationship, have observables that are straightforward to simulate in spite of an intractable likelihood. LikelihoodFree Inference (LFI) methods allow for parameter inference in settings where the likelihood function, which relates the observable data to the parameters of interest , is too complex to work with directly, but one is able to simulate from a stochastic forward model at fixed parameter settings . The most common approach to LFI or simulationbased parameter inference is Approximate Bayesian Computation (ABC), whose many variations (see Beaumont et al. 2002 and Sisson et al. 2018 for a review) repeatedly draw from the simulator and compare the output with observed data in real time to arrive at a set of plausible parameter values consistent with . With computationally intensive simulations, however, a classical ABC approach may not be practical. Furthermore, there is also the additional challenge in ABC of how to tell which ABC estimate is closest to the “true posterior,” and similarly how to choose “approximately sufficient” summary statistics in the analysis, if the underlying distribution of data and parameters is unknown.
From a statistical perspective, the right tool for quantifying the uncertainty about once is observed is the conditional density . In a prediction context such as for photo problems, conditional density estimation (CDE) of the probability density functions for the redshift of individual galaxies given photometric data provides a more nuanced accounting of uncertainty than a point estimate or prediction interval, especially in the presence of heteroskedastic or multimodal responses. Similarly, CDE and LFI/ABC can be seamlessly combined into a trainingbased approach that has gained recent attention in machine learning and astronomy with the growing popularity of Deep Learning; for example, neural density estimation in LFI via mixture density networks and masked autoregressive flows (Papamakarios & Murray, 2016; Lueckmann et al., 2017, 2019; Papamakarios et al., 2019; Greenberg et al., 2019; Alsing et al., 2018, 2019) and LFI via CDE using Gaussian copulas (Li et al., 2017; Chen & Gutmann, 2019) and random forests (Marin et al., 2016). More concretely, in LFI settings with a cosmological forward model, CDE can be used to approximate the posterior probability distribution of cosmological parameters given observed data ; see Section 4.2 for an example. Though some recent work has performed LFI via CDE using Gaussian copulas (Li et al., 2017; Chen & Gutmann, 2019) and random forests (Marin et al., 2016), we follow Izbicki et al. (2014, 2019) to combine ABC and CDE to estimate the uncertainty about the unknown parameters — with our notation the “response” — given observed data by directly applying CDE trainingbased techniques (Section 2) and loss functions (Section 3.1) to simulated data .
Data in astronomy present a challenge to estimating conditional densities, due to both the complexity of the underlying physical processes and the complicated observational noise. Precision cosmology, for example, requires combining data from different scientific probes, producing complicated likelihood functions in a highdimensional parameter space. Recent examples of such multipleprobe studies include Krause et al. (2017), Joudaki et al. (2017), Aghanim et al. (2018), van Uitert et al. (2018), and Abbott et al. (2019). These measurements are all subject to different forms of systematic uncertainties, already difficult to quantify, complicating the problem of constructing probability models that jointly describe these disparate data and their interrelationships. In such situations, CDE methods that target a variety of settings and nonstandard data (images, correlation functions, mixed data types) become especially valuable. However, for any given data type, there is no onesizefitsall CDE method. For example, deep neural networks often perform well in settings with large amounts of representative training data but in applications with smaller training samples one may need a different tool. There is also additional value in models that are interpretable and easy to fit to the data at hand.
The goal of this paper is to provide statistical tools and opensource software for nonparametric CDE and method assessment appropriate for challenging data in a variety of inference scenarios. To our knowledge, there are no other publicly available implementations of nontrivial machine learning algorithms for uncertainty quantification via CDE that apply to the same range of data settings and that use the same syntax as existing ML code for prediction. More specifically, in our paper and associated Github repositories we provide documentation and software in Python and R for four flexible machinelearning methods for CDE (optimized with respect to the CDE loss in Section 2.2):^{1}^{1}1All of these methods, except for DeepCDE, have occurred in previously published or archived papers. Hence, we only briefly review the methods in this paper and instead focus on software usage, algorithmic aspects, and the settings under which each method can be applied.

NNKCDE —a simple and easily interpretable nearestneighbor kernel CDE method (Section 2.1); it computes a kernel density estimate of using the nearest neighbors of the evaluation point . The model has only two tuning parameters which we choose by minimizing the CDE loss on validation data.

RFCDE —a randomforest CDE method (Section 2.2) that partitions the covariate space so as to minimize the CDE loss; like NNKCDE, it computes a kernel density estimate of but with nearest neighbor weightings defined by the location of the evaluation point relative to the leaves in the random forest. RFCDE is well suited for sparse and mixedtype features and does not require the user to specify distances or similarities between data points. Additionally we provide fRFCDE, a random forest implementation that can accommodate functional features by partitioning in the continuous domain of such features.

FlexCode —a CDE method that uses a basis expansion for the univariate response and poses CDE as a series of univariate regression problems (Section 2.3). The main advantage of this method is its flexibility as any regression method can be applied towards CDE, enabling us to tailor our choice of regression method to the intrinsic structure and type of data at hand. Here we focus specifically on FlexCode, based on XGBoost (Chen & Guestrin, 2016), due to its good performance in, e.g., photo estimation (Schmidt et al., 2019) and scalability to large data sets; we refer to this version of FlexCode as FlexZBoost.

DeepCDE —a fully nonparametric neuralnetworkbased CDE method (Section 2.4). DeepCDE combines the advantages of basis expansions for the univariate response with the flexibility of neural network architectures for, e.g., image covariates and timeseries data by adding a linear output layer of basis expansion coefficients that are computed by minimizing the CDE loss.
Our software makes uncertainty quantification straightforward for users of standard opensource machine learning Python packages; NNKCDE, RFCDE, FlexCode share the sklearn (Pedregosa et al., 2011) API (with fit and predict methods), which also makes our methods compatible with sklearn wrapper functions for, e.g., crossvalidation and model ensembling. DeepCDE has implementations for both Tensorflow (Abadi et al., 2015) and Pytorch (Paszke et al., 2017), two of the most widely used deep learning frameworks. Finally, FlexCode is essentially a plugin method where the user can add a regression function of choice.
Performing method assessments for estimated probability density is challenging as indeed we never observe the true conditional probability density, merely samples from the true probability density (observations). Whereas loss functions such as the rootmeansquared error (RMSE) are typically used in regression problems, they are not appropriate for the task of uncertainty quantification of estimated probability densities. In addition to the CDE methods above, we provide cdetools for method assessment and model selection. The package cdetools provides functions for CDE loss functions (defined by Equation 4 in Section 3.1) for parameter tuning and relative comparison of probability density estimates; cdetools also includes diagnostic tools (Section 3.2) such as Highest Posterior Density and Probability Integral Transform values for checking how well the final density estimates fit the data in terms of coverage probabilities.
As mentioned, our goal is not to compare the general performance of the CDE methods above. Indeed, each CDE approach has particular usage for different settings of response dimensionality, feature types, and computational requirements. Table 1 lists some properties of each method.
The organization of the paper is as follows: In Section 2, we introduce tools for conditional density estimation (NNKCDE, RFCDE/fRFCDE, FlexCode, DeepCDE). In Section 3, we describe tools for model selection and diagnostics. Then, in Section 4, we illustrate our CDE and method assessment tools in three different applications: photo estimation, likelihoodfree cosmological inference and spec estimation. Python and R code usage examples can be found in Appendix A.
2 Tools for Conditional Density Estimation
We start by reviewing the conditional density estimators. In all cases, we leverage or modify powerful MLbased regression or classification methods so as to implement CDE. Tuning parameters are chosen by minimizing the CDE empirical loss in Equation 5 via either crossvalidation or the use of validation data.
2.1 Nnkcde
NearestNeighbors Kernel CDE (NNKCDE; Izbicki et al. 2017, Freeman et al. 2017) consists of a kernel density estimate of using only the points closest to the evaluation point in covariate space:
(1) 
where is a normalized kernel (e.g., a Gaussian function) with bandwidth , is a distance metric, and is the index of the nearest neighbor of . It is a smoother version of the histogram estimator proposed by Cunha et al. (2009), because it approximates the density with a smooth continuous function rather than by binning.
We provide the NNKCDE^{2}^{2}2https://github.com/tpospisi/nnkcde software in both Python and R (Pospisil & Dalmasso, 2019a), with examples in Section A.1. NNKCDE can accommodate any number of training examples. However, selecting tuning parameters scales quadratically in , the number of nearest neighbors, which prohibits using . Our implementation (Izbicki et al. 2019, Appendix D) is more computationally efficient than standard nearestneighbor kernel smoothers. For instance, we are able to efficiently evaluate the loss function in Equation 5 on large validation samples by expressing the first integral in terms of convolutions of the kernel function; these have a fasttocompute closed solution for a Gaussian kernel. We also choose the two tuning parameters and by crossvalidation (CV) rather than by hand.
2.2 Rfcde
Introduced by Pospisil & Lee (2018), RFCDE is an extension of random forests (Breiman, 2001) to conditional density estimation and multivariate responses. RFCDE inherits the advantages of random forests in that it can handle mixedtyped data. It has good performance while remaining relatively interpretable. For ease of use in the statistics and astronomy communities, we provide RFCDE^{3}^{3}3https://github.com/tpospisi/RFCDE in both Python and R, which call a common C++ implementation of the core training functions that can easily be wrapped in other languages (Pospisil & Dalmasso, 2019b).
The main departure from the standard random forest algorithm is our criterion for covariate space partitioning decisions. In regression contexts, the splitting variable and split point are typically chosen so as to minimize the meansquared error loss. In classification contexts, the splits are typically chosen so as to minimize a classification error. Existing random forest density estimation methods such as quantile regression forests by Meinshausen (2006) and the TPZ algorithm by Carrasco Kind & Brunner (2013) use the same tree structure as regression and classification random forests, respectively. RFCDE, however, builds trees that minimize the CDE loss (see Equation 5), allowing the forest to adapt to structure in the conditional density as opposed to adapting to structure in the conditional mean, overcoming some of the limitations of the usual regression approach for data with heteroskedasticity and multimodality; see Pospisil & Lee (2018) for some examples and comparisons.
Another unique feature of RFCDE is that it can handle multivariate responses with joint densities by applying a weighted kernel smoother to . This added feature enables analysis of complex conditional distributions that describe relationships between multiple responses and covariates, or equivalently between multiple parameters and observables in an LFI setting. Like quantile regression forests, the RFCDE algorithm takes advantage of the fact that random forests can be viewed as a form of adaptive nearestneighbor method with the aggregated tree structures determining a weighting scheme. This weighting scheme can then be used to estimate the conditional density , as well as the conditional mean and quantiles, as in quantile regression forests (but for CDEoptimized trees). RFCDE computes the latter density by a weighted kernel density estimate (KDE) in using training points near the evaluation point . These distances are effectively defined by how often a training point belongs to the same leaf node as in the forest (see Equation 1 in Pospisil & Lee 2018 for details).
Despite the increased complexity of our CDE trees, RFCDE still scales to large data sets. While we use kernel density estimates for predictions on new observations, we do not use KDE when evaluating splits, as the calculations of the loss would then require evaluating pairwise distances between all training points. For fast computations, we instead take advantage of orthogonal series to compute density estimates for the partitioning of the covariate space. The orthogonal series estimator for the CDE loss (Equation 5) reduces to a formula that is much faster to evaluate for each split than KDE. For multivariate responses, RFCDE extends the same partitioning procedure by using a tensor basis for where . Similarly, RFCDE extends the density estimates on new to the multivariate case through the use of multivariate kernel density estimators (Epanechnikov, 1969). In both the univariate and multivariate cases, bandwidth selection can be handled by either plugin estimators or by tuning using a density estimation loss.
Remark: Estimating a CDE loss is an inherently harder task than calculating the mean squared error (MSE) in regression. As a consequence, RFCDE might not provide meaningful tree splits in settings with a large number of noisy covariates. In such settings, one may benefit from combining a regular random forest tree structure (optimized for regression) with a weighted kernel density estimator (for the density calculation). See Section 4.3 for an application to functional data along with software implementation.^{4}^{4}4Available at https://github.com/Mr8ND/cdetools_applications/spec_z/
2.2.1 fRFCDE
RFCDE also has a functional variant — fRFCDE (Pospisil & Lee, 2019) — which can efficiently handle functional inputs that vary over a continuum. The spectral energy distribution (SED) of a galaxy, which plots energy versus continuous wavelength of light, is an example of functional data. Another example of functional data is the shear correlation function of weak lensing, which measures the mean product of the shear at two points as a function of a continuous distance between those points. Treating functional covariates (like spectra, correlation functions or images) as unordered multivariate vectors on a grid suffers from a curse of dimensionality. As the resolution of the grid becomes finer the dimensionality of the data increases but little additional information is added, due to high correlation between nearby grid points. fRFCDE adapts to this setting by partitioning the domain of each functional covariate (or curve) into intervals, and passing the mean values of the function in each interval as input to RFCDE. Feature selection is then effectively done over regions of the domain rather than over single points. More specifically, the partitioning in fRFCDE is governed by the parameter of a Poisson process, with each functional covariate entering as a highdimensional vector . Starting with the first element of the vector, we group the first elements together. We then repeat the procedure sequentially until we have assigned all elements into a group; this effectively partitions the function domain into disjoint intervals . The function mean values of each interval are finally treated as new inputs to a standard (vectorial) RFCDE tree. The splitting is done independently for each tree in the forest. Other steps of fRFCDE, such as the computation of variable importance, also proceed as in (vectorial) RFCDE but with the averaged values of a region as inputs. As a result, fRFCDE has the capability of identifying the functional inputs and the regions in the input domain that are important for estimating the response .
The fRFCDE method is as scalable as standard random forests, accommodating training sets on the order of . As the examples in Section 4.3 show, we can obtain substantial gains with a functional approach, both in terms of statistical performance (that is, CDE loss) and computational time. In addition, the change in code syntax is minimal, as one only needs to pass the parameter as the flambda argument during the forest initialization. Examples in Python and R are provided in Appendix A.2.1.
2.3 FlexCode
Introduced by Izbicki & Lee (2017), FlexCode^{5}^{5}5R implementation: https://github.com/rizbicki/FlexCoDE, Python implementation:https://github.com/tpospisi/FlexCode reframes CDE as a regression using an orthogonal series representation of the conditional density. As an example, may represent an orthonormal basis for (e.g., squareintegrable) functions of , like a Fourier or wavelet basis. The key idea of FlexCode is to express the unknown density as a basis expansion:
(2) 
By the orthogonality property of the basis, the (unknown) expansion coefficients are then just conditional means. More precisely, . Thus, we can estimate these coefficients using a training set of data by regressing the transformed response variables on predictors for every basis function , where the number of basis functions is chosen by minimizing a CDE loss function on validation data. The estimated density, , may contain small spurious bumps induced by the Fourier approximation and may not integrate to one. We remove such artifacts as described in Izbicki & Lee (2016) by applying a thresholding parameter chosen via crossvalidation. FlexCode turns a challenging density estimation problem into a simpler regression problem, where we can choose any regression method that fits the problem at hand.
To provide a concrete example, for highdimensional (such as galaxy spectra) we can use methods such as sparse or spectral series (“manifold”) regression methods (Tibshirani, 1996; Ravikumar et al., 2009; Lee & Izbicki, 2016); see Section 4.3 for an example with FlexCodeSeries. For multiprobe studies with mixed data types, we can use random forests regression (Breiman, 2001). On the other hand, largescale photometric galaxy surveys such as LSST require methods that can work with data from millions, if not billions, of galaxies. Schmidt et al. (2019) present the results of an initial study of the LSST Dark Energy Science Collaboration (LSSTDESC). Their initial data challenge (“Photo DC 1”) compares the CDEs of a dozen photo codes run on simulations of LSST galaxy photometry catalogues in the presence of complete, correct, and representative training data. FlexZBoost — a version of FlexCode based on the scalable gradient boosting regression technique by Chen & Guestrin (2016) — was entered into the data challenge because of the method’s ability to scale to massive data. In the DC1 analysis, FlexZBoost was among the the strongest performing codes according to established performance metrics of such PDFs and was the only code to beat the experimental control under a more discriminating metric, the CDE loss.
For massive surveys such as LSST, FlexCode also has another advantage compared to other CDE methods, namely its compact, lossless storage format. Juric et al. (2017) establishes that LSST has allocated 100 floating point numbers to quantify the redshift of each galaxy. As is shown in Schmidt et al. (2019), the myriad methods for deriving photo PDFs yield radically different results, motivating a desire to store the results of more than one algorithm in the absence of an obvious best choice. For the photo PDFs of most codes, one may need to seek a clever storage parameterization to meet LSST’s constraints (Carrasco Kind & Brunner, 2014; Malz et al., 2018), but FlexCode is virtually immune to this restriction. Since FlexCode relies on a basis expansion, one only needs to store coefficients per target density for a lossless compression of the estimated PDF with no need for binning. Indeed, for DC1, we can with FlexZBoost reconstruct our estimate at any resolution from estimates of the first 35 coefficients in a Fourier basis expansion. In other words, FlexZBoost enables the creation and storage of highresolution photo catalogs for several billion galaxies at no added cost.
Our public implementation of FlexCode— available in both Python (Pospisil et al., 2019) and R (Izbicki & Pospisil, 2019), respectively — crossvalidates over regression tuning parameters (such as the number of nearest neighbors in FlexCodekNN) and computes the FlexCode coefficients in parallel for further time savings. The computational complexity of FlexCode will be the same as parallel individual regressions. In particular, the scalability of FlexCode is determined by the underlying regression method. A scalable method like XGBoost leads to scalable FlexCode fitting. In the Python version of the code, the user can choose between the following regression methods: XGBoost for FlexZBoost but also nearest neighbors, lasso (Tibshirani, 1996) and random forests regression (Breiman, 2001). In the R version, the following choices are available: XGBoost, nearest neighbors, lasso, random forests, NadarayaWatson kernel smoothing (Nadaraya, 1964; Watson, 1964), sparse additive models (Ravikumar et al., 2009), and spectral series (“manifold”) regression (Lee & Izbicki, 2016). In both implementations, the user may also use a custom regression method; we illustrate how this can be done with a vignette in both packages. For the Python implementation, the user can add any custom regression method following the sklearn API, i.e., with fit and predict methods. Examples in both languages are presented in Appendix A.3.
2.4 DeepCDE
Recently, neural networks have reemerged as a powerful tool for prediction settings where large amounts of representative training data are available; see LeCun et al. (2015) and Goodfellow et al. (2016) for a full review. Neural networks for CDE, such as Mixture Density Networks (MDNs; Bishop 1994) and variational methods (Tang & Salakhutdinov, 2013; Sohn et al., 2015), usually assume a Gaussian or some other parametric form of the conditional density. More recently, MDNs have also been used for photometric redshift estimation (D’Isanto & Polsterer, 2018; Pasquet et al., 2019) and for direct estimation of likelihoods and posteriors in cosmological parameter inference (see Alsing et al. 2019 and references within).
Our DeepCDE^{6}^{6}6https://github.com/tpospisi/DeepCDE (Dalmasso & Pospisil, 2019) takes a different, fully nonparametric approach and is based on the orthogonal series representation in FlexCode, given in Equation 2. However, rather than relying on regression methods to estimate the expansion coefficients in Equation 2, DeepCDE computes the coefficients jointly with a neural network that minimizes the CDE loss in Equation 4. Indeed, one can show that for an orthogonal basis, the problem of minimizing this CDE loss is (asymptotically) equivalent to finding the best basis coefficients in FlexCode under mean squared error loss for the individual regressions; see Appendix C for a proof. The value of this result is that DeepCDE with a CDE loss directly connects prediction with uncertainty quantification, implying that one can leverage the stateoftheart deep architectures for an application at hand toward uncertainty quantification for the same prediction setting.
From a neural network architecture perspective, DeepCDE only adds a linear output layer of coefficients for a series expansion of the density according to
(3) 
where is an orthogonal basis for functions of . Like FlexCode, we normalize and remove spurious bumps from the final density estimates according to the procedure in Section 2.2 of Izbicki & Lee (2016).
The greatest benefit of DeepCDE is that it can be implemented with both convolutional and recurrent neural network architectures, extending to both image and sequential data. For most deep architectures, adding a linear layer represents a small modification, and a negligible increase in the number of parameters. For instance, with the AlexNet architecture (Krizhevsky et al., 2012), one of the smallest convolutional networks, adding a final layer with 30 coefficients for a cosine basis only adds extra parameters. This represents a increase over the million already existing parameters, and hence a neglible increase in training and prediction time. Moreover, the CDE loss for DeepCDE is especially easy to evaluate; see Appendix C for details.
We provide both TensorFlow and Pytorch implementations of DeepCDE (Dalmasso & Pospisil, 2019). We also include examples that shows how one can easily build DeepCDE on top of AlexNet (Krizhevsky et al., 2012); in this case, for the task of estimating the probability distribution of the correct orientation of a color image . Note that Fischer et al. (2015) use AlexNet for the corresponding regression task of predicting the orientation for a color image (without uncertainty quantification).
3 Tools for Method Assessment
After fitting CDEs, it is important to assess the quality of our models of uncertainty. For instance, after computing photo PDF estimates for some galaxies, one may ask whether these estimates accurately quantify the true uncertainty distributions . Similarly, in the LFI task, a key question is whether an estimate of the posterior distribution, of the cosmological parameters is close enough to the true posterior given the observations .
We present two methodassessment tools suitable to different situations, which are complementary and can be performed simultaneously, and provide public implementations in the cdetools^{7}^{7}7https://github.com/tpospisi/cdetools package in both Python and R (Dalmasso et al., 2019). First, we describe a CDE loss function that directly provides relative comparisons between conditional density estimators, such as the methods presented in Section 2 or, equivalently, between a set of models (for the same method) with different tuning parameters. Second, we describe visual diagnostic tools, such as Probability Integral Transforms (PIT) and Highest Probability Density (HPD) plots, that can provide insights on the absolute fit quality of a given estimator to observed data.
3.1 CDE loss
Here we briefly review the CDE loss from Izbicki & Lee (2016) for assessing conditional density estimators and discuss it in the context of the cosmology LFI case.
The goal of a loss function is to provide relative comparisons between different estimators, so that it is easy to directly choose the best fitted model among a list of candidates. Given an estimate of , we define the CDE loss by
(4) 
where is the marginal distribution of the covariates . Our CDE loss is similar to the integrated squared error (ISE) loss for density estimation except weighted by the marginal distribution of the covariates. The weighting by the marginal emphasizes that errors in the estimation of for unlikely covariates are less important. The CDE loss can be estimated (up to a constant) by
(5) 
where represents our validation data, i.e., a heldout set not used to construct .
In our implementation, the function cde_loss returns the estimated CDE loss as well as an estimate of the standard deviation or the standard error (SE) of the empirical loss.
3.1.1 CDE loss for LFI
Observational cosmology presents a challenge for the loss function’s marginalization over all possible data ; as we have only one instantiation of the universe, the LFI setting requires a modified loss function for the posterior distribution of the cosmological parameters at a single instance of data . The CDE loss we would ideally aim to minimize is the integrated squared error (ISE) loss
(6) 
for the conditional density at , which is difficult to estimate with only one observation . Hence, Izbicki et al. (2019) approximate the above loss with a surrogate loss
(7) 
where is the marginal distribution of restricted to an neighborhood of . As , the surrogate loss approaches the ISE for the true posterior. The choice of serves a biasvariance tradeoff, as smaller results in less bias in the estimate of the loss, while restricting the neighborhood to fewer validation data points at the same time leads to higher variance.
In an LFI setting, we typically create training and validation data for our estimators by first simulating pairs using a stochastic forward model, then applying an ABC algorithm that rejects all data outside an neighborhood, and finally splitting the ABC sample into training and validation sets. As an ABC sample includes cosmological parameters that are likely to have generated the observation , we effectively estimate the surrogate loss by just computing the empirical CDE loss in Equation 5 at the observed data with distance cutoff . In Section 4.2, we compute the loss on forwardsimulated cosmological data for some different ABC cutoffs but with translated to acceptance rates ( or no ABC cutoff is equivalent to accepting all data in the simulated ABC sample, and the smaller is, the lower the proportion of sample points that fall within the specified neighborhood).
3.2 PIT and HPD values
The CDE loss function is a relative measure of performance that cannot address absolute goodnessoffit. To quantify overall goodnessoffit, we examine how well calibrated an ensemble of conditional density estimators are on average.
Given a true probability density of a variable conditioned on data , an estimated probability density cannot be wellcalibrated unless . Built on the same logic, the probability integral transform (PIT; Polsterer et al. 2016)
(8) 
assesses the calibration quality of an ensemble of CDEs for scalar representing the cumulative distribution function (CDF) of evaluated at . A statistically selfconsistent population of densities has a uniform distribution of PIT values, and deviations from uniformity indicate inaccuracies of the estimated PDFs. Overly broad CDEs manifest as underrepresentation of the lowest and highest PIT values, whereas overly narrow CDEs manifest as overrepresentation of the lowest and highest values.
However, PIT values do not easily generalize to multiple responses. For instance, for a bivariate response , the quantity is not in general uniformly distributed (Genest & Rivest, 2001). The highest probability density (HPD; Izbicki et al. 2017, Appendix A)
(9) 
can be seen as a PIT equivalent for multivariate , and is easy to compute in practice. The HPD value is based on the definition of the highest density region (HDR) of a random variable ; that is, the subset of the sample space of where all points in the region have a probability above a certain value. The HDR of can be seen as an interval estimate of y when is observed. The set is the smallest HDR containing the point , and the HPD value in Equation 9 is the probability or “confidence level” of such a region (Hyndman, 1996). Similar to PIT values, HPD values for validation data follow a distribution if the CDEs are well calibrated on the population level. In our implementation the functions pit_coverage(cde_test, y_grid, y_test) and hpd_coverage(cde_test, y_grid, y_test), respectively, calculate PIT and HPD values for CDE estimates cde_test using the grid y_grid with observed values y_test.
The PIT and HPD are not without their limitations, however, as demonstrated in the control case of Schmidt et al. (2019) and Figure 1 of Section 4.1 here. Because the PIT and HPD values can be uniformly distributed even if is not well estimated, they must be used in conjunction with loss functions for method assessment. A popular way of visualizing PIT and HPD diagnostics is through coverage plots, sometimes referred to as probabilityprobability plots or PP plots of the empirical of the test data as a function of the theoretical corresponding to the PIT or HPD.
Figure 1 (c)(d) contains example coverage plots for the PIT and HPD values of photo estimators, where the empirical (observed) coverage is close to the theoretical (ideal) coverage.
4 Examples
We demonstrate^{8}^{8}8Code for these examples is publicly available at https://github.com/Mr8ND/cdetools_applications. the breadth of our CDE methods in three different astronomical use cases.

Photo estimation: univariate response with multivariate input. This is the standard prediction setting in which all four methods apply. In Section 4.1, we apply the methods to the Teddy photometric redshift data (Beck et al., 2017), and illustrate the need for loss functions to properly assess PDF estimates of redshift given photometric colors .

Likelihoodfree cosmological inference: multivariate response. For multiple response components we want to model the often complicated dependencies between these components; this is in contrast to approaches which model each component separately, implicitly introducing an assumption of conditional independence. In Section 4.2, we use an example of LFI for simulated weak lensing shear data to show how NNKCDE and RFCDE can capture more challenging bivariate distributions with curved structures; in this toy example represents cosmological parameters in the CDMmodel, and represents (coarsely binned) weak lensing shear correlation functions.

Spec estimation: functional input. Standard prediction methods, such as random forests, do not typically fare well with functional covariates, simply treating them as unordered vectorial data and ignoring the functional structure. However, often there are substantial benefits to explicitly taking advantage of that structure as in fRFCDE. In Section 4.3, we compare the performance of a vectorial implementation of RFCDE with fRFCDE and FlexCodeSeries for a spectroscopic sample from the Sloan Digital Sky Survey (SDSS; Alam et al. 2015), where the input is a highresolution spectrum of a galaxy, and the response the galaxy’s redshift .
4.1 Photo estimation: Univariate response with multivariate input
Here we estimate photo posterior PDFs using representative training and test data (Samples A and B, respectively) from the Teddy catalog by Beck et al. (2017)^{9}^{9}9Data available at https://github.com/COINtoolbox/photoz_catalogues, comprised of 74309 training and 74557 test observations. Each observation has five features (covariates): the magnitude of the band and the pairwise differences (colors) , , , and . For the purposes of this paper, we do not examine the problem of disparity in colorspace coverage, which is a topic by itself.
We fit NNKCDE, RFCDE, DeepCDE, and FlexZBoost to the data. In addition we fit an RFCDE model, “RFCDELimited,” restricted to the first three of the five covariates, as a toy model which fails to extract some information from the covariates, allowing us to showcase the difference between coverage diagnostics and the CDE loss function. For DeepCDE we use a threelayer deep neural network with linear layers and reLu activations with neurons per layers, trained for epochs with Adam (Kingma & Ba, 2014). As our goal is to showcase its applicability we do not optimize the neural architecture (e.g., number of layers, number of neurons per layer, activation functions) nor the learning parameters (i.e., number of epochs, learning rate, momentum). To illustrate our validation methods, we also include the marginal distribution as an estimate of individual photo distributions . This estimate will be the same regardless of .
Figure 2 presents the estimated CDE loss of Equation 5 with estimated standard error (SE), as well as the storage space for each galaxy’s photo CDE. Figure 1 shows that the different models, including the clearly misspecified “Marginal” model, achieve comparable performance on goodnessoffit diagnostics. However, Figure 2 shows the discriminatory power of the CDE loss function, which distinguishes the methods from one another. This emphasizes the need for method comparison through loss functions in addition to goodnessoffit diagnostics. We note that the ranking in CDE loss also correlates roughly with the quality of the point estimates, as shown in Appendix B.
4.2 Likelihoodfree cosmological inference: Multivariate response
To showcase the ability to target joint distributions, we apply NNKCDE and RFCDE to the problem of estimating multivariate posteriors of the cosmological parameters in a likelihoodfree setting via ABCCDE (Izbicki et al., 2019). We also use this example to illustrate the usage of the CDE surrogate loss.
ABC is a method for parameter inference in settings where the likelihood is not tractable but we have a forward model that can simulate data under fixed parameter settings . The simplest form of ABC is the ABC rejection sampling algorithm, where a set of parameters is first drawn from a prior distribution. The simulated data is accepted with tolerance if for some distance metric (e.g., the Euclidean distance) that measures the discrepancy between the simulated data and observed data . The outcome of the ABC rejection algorithm for small enough is a sample of parameter values approximately distributed according to the desired posterior distribution .
The basic idea of ABCCDE is to improve the ABC estimate — and hence reduce the number of required simulations — by using the ABC sample/output as input to a CDE method tuned with the CDE loss in Equation 3.1.1. Hence, in ABCCDE, our CDE method can be seen as a postadjustment method: it returns an estimate which we evaluate at the point to obtain a more accurate approximation of .
In this example, we consider the problem of cosmological parameter inference via cosmic shear, caused by weak gravitational lensing inducing a distortion in images of distant galaxies. The size and direction of the distortion is directly related to the size and shape of the matter distribution along the line of sight, which varies across the universe. We use shear correlation functions to constrain the dark matter density and matter power spectrum normalization parameters of the CDM cosmological model, which predicts the properties and evolution of the large scale structure of the matter distribution. For further background see Hoekstra & Jain (2008), Munshi et al. (2008) and Mandelbaum (2017). Here we use the GalSim^{10}^{10}10https://github.com/GalSimdevelopers/GalSim toolkit (Rowe et al., 2015) to generate simplified galaxy shears distributed according to a Gaussian random field determined by . The binned shear correlation functions serve as our input data or summary statistics . For the inference, we assume uniform priors and and fix , , .
The top row of Figure 3 shows the estimated bivariate posterior distribution of from ABC rejection sampling only at varying acceptance rates (20%, 50%, and 100%). An acceptance rate of 100% just returns the (uniform) ABC prior distribution, whereas the ABC posteriors for smaller acceptance rates (that is, smaller values of ) concentrate around the parameter degeneracy curve (shown as a dashed line) on which the data are indistinguishable. The second and third rows show the estimated posteriors when we, respectively, improve the initial ABC estimate by applying NNKCDE and RFCDE tuned with our CDE surrogate loss. A result that is apparent from the figure is that NNKCDE and RFCDE fitted with a CDE loss are able to capture the degeneracy curve at a larger acceptance rate, that is, for a smaller number of simulations, than when using ABC only.
CDE Loss (with SE) ()  

Method \Acceptance Rate  20%  50%  100% 
ABC  0.686 (0.009)  0.392 (0.004)  0.227 (0.001) 
NNKCDE  1.652 (0.022)  1.199 (0.016)  0.844 (0.010) 
RFCDE  4.129 (0.063)  3.698 (0.064)  2.817 (0.055) 
We also have a direct measure of performance via the CDE loss and can adapt to different types of data by leveraging different CDE codes. Table 2 shows how the methods compare in terms of the surrogate loss. While decreasing the acceptance rate benefits all methods, it is clear that CDEbased approaches have better performance in all cases.
4.3 Spec estimation: Functional input
In this example, we compare CDE methods in the context of spectroscopic redshift determination using 2812 highresolution spectra from the Sloan Digital Sky Survey (SDSS) Data Release 6 (preprocessed with the cuts of Richards et al. 2009), corresponding to covariates of flux measurements at wavelengths. The highresolution spectra can be seen as functional inputs on a continuum of wavelength values. Spectroscopic redshifts (or redshifts predicted from spectra ) tend to be both accurate and precise, so the density is wellapproximated by a delta function at the true redshift. For the purposes of illustrating the use of the CDE codes, we define a noisified redshift , where are independent and identically distributed variables drawn from a normal distribution and is the true redshift of galaxy provided by SDSS. Thus the conditional density of this example is a Gaussian distribution with mean and variance 0.02.
We compare fRFCDE, a “Functional” adaptation of RFCDE, with a standard “Vector” implementation of RFCDE, which treats functional data as a vector. For completeness, we also compare against standard regression random forest combined with KDE, as well as FlexCodeSpec, an extension of FlexCode with a Spectral Series regression (Richards et al., 2009; Freeman et al., 2009; Lee & Izbicki, 2016) for determining the expansion coefficients. We train on 2000 galaxies and test on the remaining galaxies.The RFCDE and the fRFCDE trees are both trained with , , and bandwidths chosen by plugin estimators. We use for the fRFCDE rate parameter of the Poisson process that defines the variable groupings.
Method  Train Time (in sec)  CDE Loss (with SE) 

Functional RFCDE  24.89  3.38 (0.155) 
Vector RFCDE  41.60  2.52 (0.104) 
Regression RF + KDE  50.63  3.42 (0.120) 
FlexCodeSpec  4017.93  3.53 (0.136) 
Table 3 contains the CDE loss and train time for both the vectorbased and functionalbased RFCDE models on the SDSS data, as well as for FlexCodeSpec. We obtain substantial gains when incorporating functional features both in terms of CDE loss and computational time. The computational gains are attributed to requiring fewer searches for each split point as the default value of is reduced. As anticipated, vectorbased RFCDE underperforms with functional data due to the tree splits on CDE loss struggling to pick up signals in the data and returning almost the same conditional density regardless of the input. In contrast, splitting on mean squared error is an easier task and we include the results of regular regression random forest and KDE, which are comparable to the results of fRFCDE. Whereas random forests rely on variable selection (either individual variables as in vectorial RFCDE or grouped together as in fRFCDE) for dimension reduction, FlexCodeSpec is based on the Spectral Series mechanism for dimension reduction that finds (potentially) nonlinear, sparse structure in the data distribution (Izbicki & Lee, 2016). Both types of dimension reduction are reasonable for spec estimation: Particular wavelength locations or regions in the galaxy SED could carry information of the galaxy’s true redshift; RFCDE and fRFCDE are effective in finding such locations by variable selection. Spectral Series on the other hand are able to recover lowdimensional manifold structure in the entire data ensemble; as Richards et al. 2009 show, the main direction of variation in SDSS spectra is directly related to the spectroscopic redshift.
5 Conclusions
This paper presents statistical tools and software for uncertainty quantification in complex regression and parameter inference tasks. Given a set of features with associated response , our methods extend the usual point prediction task of classification and regression to nonparametric estimation of the entire (conditional) probability density . The described CDE methods are meant to handle a range of different data settings, and we have provided examples of code usage in the contexts of photo estimation, likelihoodfree inference for cosmology analysis, and specz estimation. In addition, we have provided tools that can be used for CDE method assessment and for choosing tuning parameters of the CDE methods in a principled way.
Our software is composed of four packages for CDE, NNKCDE, RFCDE, FlexCode and DeepCDE, each using a different machine learning algorithm, and one for model assessment, cdetools. All packages are implemented in both Python and R, with no dependence on proprietary software, making them compatible with any operating system supporting either language (e.g., Windows, MacOS, Ubuntu and Unixbased OS). The code is provided in publicly available Github repositories, documented using Python and R function documentation standards, and equipped with Travis CI^{11}^{11}11https://docs.travisci.com/ automatic bugtracking. Our software makes uncertainty quantification straightforward for those used to standard opensource machine learning Python packages. NNKCDE, RFCDE, FlexCode share the sklearn API (with fit and predict methods), which also makes our methods compatible with sklearn wrapper functions (for e.g., crossvalidation and model ensembling). DeepCDE has implementations for both Tensorflow and Pytorch, two of the most widely used deep learning frameworks, and can easily be combined with most existing network architectures.
Acknowledgements
ND and ABL are partially supported by the National Science Foundation under Grant Nos. DMS1521786. RI is supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (grant number 306943/20174) and Fundação de Amparo à Pesquisa do Estado de São Paulo (grant number 2017/033638). AIM acknowledges support from the Max Planck Society and the Alexander von Humboldt Foundation in the framework of the Max PlanckHumboldt Research Award endowed by the Federal Ministry of Education and Research. During the completion of this work, AIM was advised by David W. Hogg and supported in part by National Science Foundation grant AST1517237.
References
 Abadi et al. (2015) Abadi M., et al., 2015, TensorFlow: LargeScale Machine Learning on Heterogeneous Systems, http://tensorflow.org/
 Abbott et al. (2019) Abbott T. M. C., et al., 2019, Phys. Rev. Lett., 122, 171301
 Aghanim et al. (2018) Aghanim N., et al., 2018, arXiv preprint arXiv:1807.06209
 Alam et al. (2015) Alam S., et al., 2015, The Astrophysical Journal Supplement Series, 219, 12
 Alsing et al. (2018) Alsing J., Wandelt B., Feeney S., 2018, Monthly Notices of the Royal Astronomical Society, 477, 2874
 Alsing et al. (2019) Alsing J., Charnock T., Feeney S., Wandelt B., 2019, arXiv preprint arXiv:1903.00007
 Ball & Brunner (2010) Ball N. M., Brunner R. J., 2010, International Journal of Modern Physics D, 19, 1049
 Beaumont et al. (2002) Beaumont M. A., Zhang W., Balding D. J., 2002, Genetics, 162, 2025
 Beck et al. (2017) Beck R., et al., 2017, Monthly Notices of the Royal Astronomical Society, 468, 4323
 Bishop (1994) Bishop C. M., 1994, Technical report, Mixture density networks. Citeseer
 Breiman (2001) Breiman L., 2001, Machine learning, 45, 5
 Carrasco Kind & Brunner (2013) Carrasco Kind M., Brunner R. J., 2013, Monthly Notices of the Royal Astronomical Society, 432, 1483
 Carrasco Kind & Brunner (2014) Carrasco Kind M., Brunner R. J., 2014, Mon Not R Astron Soc, 441, 3550
 Chen & Guestrin (2016) Chen T., Guestrin C., 2016, in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. pp 785–794, doi:10.1145/2939672.2939785
 Chen & Gutmann (2019) Chen Y., Gutmann M. U., 2019, in Chaudhuri K., Sugiyama M., eds, Proceedings of Machine Learning Research Vol. 89, Proceedings of Machine Learning Research. PMLR, pp 1584–1592, http://proceedings.mlr.press/v89/chen19d.html
 Cunha et al. (2009) Cunha C. E., Lima M., Oyaizu H., Frieman J., Lin H., 2009, Monthly Notices of the Royal Astronomical Society, 396, 2379
 D’Isanto & Polsterer (2018) D’Isanto A., Polsterer K., 2018, Astronomy & Astrophysics, 609, A111
 Dahlen et al. (2013) Dahlen T., et al., 2013, The Astrophysical Journal, 775, 93
 Dalmasso & Pospisil (2019) Dalmasso N., Pospisil T., 2019, tpospisi/DeepCDE 0.1, doi:10.5281/zenodo.3364862, https://doi.org/10.5281/zenodo.3364862
 Dalmasso et al. (2019) Dalmasso N., Pospisil T., Izbicki R., 2019, tpospisi/cdetools 0.0.1, doi:10.5281/zenodo.3364810, https://doi.org/10.5281/zenodo.3364810
 Epanechnikov (1969) Epanechnikov V. A., 1969, Theory of Probability & Its Applications, 14, 153
 Fischer et al. (2015) Fischer P., Dosovitskiy A., Brox T., 2015, in GCPR. , doi:10.1007/9783319249476_30
 Freeman et al. (2009) Freeman P. E., Newman J. A., Lee A. B., Richards J. W., Schafer C. M., 2009, Monthly Notices of the Royal Astronomical Society, 398, 2012? 2021
 Freeman et al. (2017) Freeman P. E., Izbicki R., Lee A. B., 2017, Monthly Notices of the Royal Astronomical Society, 468, 4556
 Genest & Rivest (2001) Genest C., Rivest L.P., 2001, Statistics & Probability Letters, 53, 391
 Goodfellow et al. (2016) Goodfellow I., Bengio Y., Courville A., 2016, Deep learning. MIT press
 Graff et al. (2014) Graff P., Feroz F., Hobson M. P., Lasenby A., 2014, Monthly Notices of the Royal Astronomical Society, 441, 1741
 Greenberg et al. (2019) Greenberg D. S., Nonnenmacher M., Macke J. H., 2019, in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 915 June 2019, Long Beach, California, USA. pp 2404–2414, %****␣CDE_tools.tex␣Line␣1150␣****http://proceedings.mlr.press/v97/greenberg19a.html
 Hoekstra & Jain (2008) Hoekstra H., Jain B., 2008, Annual Review of Nuclear and Particle Science, 58, 99
 Hyndman (1996) Hyndman R. J., 1996, The American Statistician, 50, 120
 Ivezić et al. (2014) Ivezić Ž., Connolly A. J., VanderPlas J. T., Gray A., 2014, Statistics, data mining, and machine learning in astronomy: a practical Python guide for the analysis of survey data. Vol. 1, Princeton University Press
 Ivezić et al. (2019) Ivezić Ž., et al., 2019, The Astrophysics Journal, 873, 111
 Izbicki & Lee (2016) Izbicki R., Lee A. B., 2016, Journal of Computational and Graphical Statistics, 25, 1297
 Izbicki & Lee (2017) Izbicki R., Lee A. B., 2017, Electronic Journal of Statistics, 11, 2800
 Izbicki & Pospisil (2019) Izbicki R., Pospisil T., 2019, rizbicki/FlexCode v5.9beta.3, doi:10.5281/zenodo.3366065, https://doi.org/10.5281/zenodo.3366065
 Izbicki et al. (2014) Izbicki R., Lee A., Schafer C., 2014, in Artificial Intelligence and Statistics. pp 420–429
 Izbicki et al. (2017) Izbicki R., Lee A. B., Freeman P. E., 2017, The Annals of Applied Statistics, 11, 698
 Izbicki et al. (2019) Izbicki R., Lee A. B., Pospisil T., 2019, Journal of Computational and Graphical Statistics, pp 1–20
 Joudaki et al. (2017) Joudaki S., et al., 2017, Monthly Notices of the Royal Astronomical Society, 474, 4894
 Juric et al. (2017) Juric M., et al., 2017, Data Products Definition Document, https://docushare.lsstcorp.org/docushare/dsweb/Get/LSE163/
 Kingma & Ba (2014) Kingma D. P., Ba J., 2014, CoRR, abs/1412.6980
 Krause et al. (2017) Krause E., et al., 2017, arXiv preprint arXiv:1706.09359
 Krizhevsky et al. (2012) Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 1. NIPS’12. Curran Associates Inc., USA, pp 1097–1105, http://dl.acm.org/citation.cfm?id=2999134.2999257
 LeCun et al. (2015) LeCun Y., Bengio Y., Hinton G., 2015, nature, 521, 436
 Lee & Izbicki (2016) Lee A. B., Izbicki R., 2016, Electronic Journal of Statistics, 10, 423
 Li et al. (2017) Li J., Nott D., Fan Y., Sisson S., 2017, Computational Statistics & Data Analysis, 106, 77
 Lueckmann et al. (2017) Lueckmann J.M., Gonçalves P. J., Bassetto G., Öcal K., Nonnenmacher M., Macke J. H., 2017, in Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Curran Associates Inc., USA, pp 1289–1299, %****␣CDE_tools.tex␣Line␣1250␣****http://dl.acm.org/citation.cfm?id=3294771.3294894
 Lueckmann et al. (2019) Lueckmann J.M., Bassetto G., Karaletsos T., Macke J. H., 2019, in Symposium on Advances in Approximate Bayesian Inference. pp 32–53
 Malz et al. (2018) Malz A. I., Marshall P. J., DeRose J., Graham M. L., Schmidt S. J., Wechsler R., Collaboration) L. D. E. S., 2018, AJ, 156, 35
 Mandelbaum (2017) Mandelbaum R., 2017, arXiv preprint:1710.03235
 Mandelbaum et al. (2008) Mandelbaum R., et al., 2008, Monthly Notices of the Royal Astronomical Society, 386, 781/806
 Marin et al. (2016) Marin J.M., Raynal L., Pudlo P., Ribatet M., Robert C., 2016, Bioinformatics (Oxford, England), 35
 Meinshausen (2006) Meinshausen N., 2006, Journal of Machine Learning Research, 7, 983
 Munshi et al. (2008) Munshi D., Valageas P., Van Waerbeke L., Heavens A., 2008, Physics Reports, 462, 67
 Nadaraya (1964) Nadaraya E. A., 1964, Theory of Probability & Its Applications, 9, 141
 Ntampaka et al. (2019) Ntampaka M., et al., 2019, arXiv preprint arXiv:1902.10159
 Papamakarios & Murray (2016) Papamakarios G., Murray I., 2016, in Lee D. D., Sugiyama M., Luxburg U. V., Guyon I., Garnett R., eds, Advances in Neural Information Processing Systems 29. Curran Associates, Inc., pp 1028–1036
 Papamakarios et al. (2019) Papamakarios G., Sterratt D., Murray I., 2019, in The 22nd International Conference on Artificial Intelligence and Statistics. pp 837–848
 Pasquet et al. (2019) Pasquet J., Bertin E., Treyer M., Arnouts S., Fouchez D., 2019, Astronomy & Astrophysics, 621, A26
 Paszke et al. (2017) Paszke A., et al., 2017
 Pedregosa et al. (2011) Pedregosa F., et al., 2011, J. Mach. Learn. Res., 12, 2825
 Polsterer et al. (2016) Polsterer K. L., D’Isanto A., Gieseke F., 2016, Uncertain Photometric Redshifts
 Pospisil & Dalmasso (2019a) Pospisil T., Dalmasso N., 2019a, tpospisi/NNKCDE 0.3, doi:10.5281/zenodo.3364858, https://doi.org/10.5281/zenodo.3364858
 Pospisil & Dalmasso (2019b) Pospisil T., Dalmasso N., 2019b, tpospisi/RFCDE 0.3.2, doi:10.5281/zenodo.3364856, https://doi.org/10.5281/zenodo.3364856
 Pospisil & Lee (2018) Pospisil T., Lee A. B., 2018, arXiv preprint arXiv:1804.05753
 Pospisil & Lee (2019) Pospisil T., Lee A. B., 2019, (f)RFCDE: Random Forests for Conditional Density Estimation and Functional Data (arXiv:1906.07177)
 Pospisil et al. (2019) Pospisil T., Dalmasso N., Inacio M., 2019, tpospisi/FlexCode 0.1.5, doi:10.5281/zenodo.3364860, https://doi.org/10.5281/zenodo.3364860
 Ravikumar et al. (2009) Ravikumar P., Lafferty J., Liu H., Wasserman L., 2009, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 1009
 Richards et al. (2009) Richards J. W., Freeman P. E., Lee A. B., Schafer C. M., 2009, Astrophysical Journal, 691, 32
 Rowe et al. (2015) Rowe B., et al., 2015, Astronomy and Computing, 10, 121
 Schmidt et al. (2019) Schmidt et al. i. p., 2019, Under internal review by LSSTDESC
 Sheldon et al. (2012) Sheldon E. S., Cunha C. E., Mandelbaum R., Brinkmann J., Weaver B. A., 2012, The Astrophysical Journal Supplement Series, 201, 32
 Sisson et al. (2018) Sisson S. A., Fan Y., Beaumont M., 2018, Handbook of approximate Bayesian computation. Chapman and Hall/CRC, doi:10.1201/9781315117195
 Sohn et al. (2015) Sohn K., Lee H., Yan X., 2015, in Cortes C., Lawrence N. D., Lee D. D., Sugiyama M., Garnett R., eds, , Advances in Neural Information Processing Systems 28. Curran Associates, Inc., pp 3483–3491, http://dl.acm.org/citation.cfm?id=2969442.2969628
 Tang & Salakhutdinov (2013) Tang Y., Salakhutdinov R. R., 2013, in Burges C. J. C., Bottou L., Welling M., Ghahramani Z., Weinberger K. Q., eds, , Advances in Neural Information Processing Systems 26. Curran Associates, Inc., pp 530–538, http://dl.acm.org/citation.cfm?id=2999611.2999671
 Tibshirani (1996) Tibshirani R., 1996, Journal of the Royal Statistical Society: Series B (Methodological), 58, 267
 Viironen et al. (2018) Viironen K., et al., 2018, A&A, 614, A129
 Watson (1964) Watson G. S., 1964, Sankhyā: The Indian Journal of Statistics, Series A, pp 359–372
 Way et al. (2012) Way M. J., Scargle J. D., Ali K. M., Srivastava A. N., 2012, Advances in machine learning and data mining for astronomy. Chapman and Hall/CRC
 Wittman (2009) Wittman D., 2009, The Astrophysical Journal Letters, 700, L174
 van Uitert et al. (2018) van Uitert E., et al., 2018, Monthly Notices of the Royal Astronomical Society, 476, 4662
Appendix A Examples of code usage
In this section we provide examples of code usage in Python and R. To save space, we have included the DeepCDE examples directly in the code release^{12}^{12}12https://github.com/tpospisi/DeepCDE/tree/master/model_examples (Dalmasso & Pospisil, 2019) rather than in this section.
a.1 Nnkcde
{Verbatim}[commandchars=
{},codes=]
import numpy as np
import nnkcde
# Fit the model model = nnkcde.NNKCDE() model.fit(x_train, y_train)
# Tune parameters: bandwidth, number of neighbors k # Pick best loss on validation data k_choices = [5, 10, 100, 1000] bandwith_choices = [0.01, 0.05, 0.1] model.tune(x_validation, y_validation, k_choices, bandwidth_choices)
# Predict new densities on grid y_grid = np.linspace(y_min, y_max, n_grid) cde_test = model.predict(x_test, y_grid)
{Verbatim}[commandchars=
{},codes=]
library(NNKCDE)
# Fit the model model < NNKCDE::NNKCDE$new(x_train, y_train)
# Tune parameters: bandwidth, number of neighbors k # Pick best loss on validation data k_choices < c(5, 10, 100, 1000) bandwith_choices < c(0.01, 0.05, 0.1) model$tune(x_validation, y_validation, k_grid = k_choices, h_grid = bandwidth_choices)
# Predict new densities on grid y_grid < seq(y_min, y_max, length.out = n_grid) cde_test < model$predict(x_test, y_grid)
a.2 Rfcde
{Verbatim}[commandchars=
{},codes=]
import numpy as np
import rfcde
# Parameters n_trees = 1000 # Number of trees in the forest mtry = 4 # Number of variables to potentially split at in each node node_size = 20 # Smallest node size n_basis = 15 # Number of basis functions bandwidth = 0.2 # Kernel bandwith  used for prediction only
# Fit the model forest = rfcde.RFCDE(n_trees=n_trees, mtry=mtry, node_size=node_size, n_basis=n_basis) forest.train(x_train, y_train)
# Predict new densities on grid y_grid = np.linspace(0, 1, n_grid) cde_test = forest.predict(x_test, y_grid, bandwidth)
# Predict conditional means for CDEoptimized forest cond_mean_test = forest.predict_mean(x_test)
# Predict quantiles (i.e., quantile regression) for CDEoptimized forest alpha_quantile = .90 quant_test = forest.predict_quantile(x_test, alpha_quantile)
{Verbatim}[commandchars=
{},codes=]
library(RFCDE)
# Parameters n_trees < 1000 # Number of trees in the forest mtry < 4 # Number of variables to potentially split at in each node node_size < 20 # Smallest node size n_basis < 15 # Number of basis functions bandwidth < 0.2 # Kernel bandwith  used for prediction only
# Fit the model forest < RFCDE::RFCDE(x_train, y_train, n_trees = n_trees, mtry = mtry, node_size = node_size, n_basis = n_basis)
# Predict new densities on grid y_grid < seq(0, 1, length.out = n_grid) cde_test < predict(forest, x_test, y_grid, response = ’CDE’, bandwidth = bandwidth)
# Predict conditional means for CDEoptimized forest cond_mean_test < predict(forest, x_test, y_grid, response = ’mean’, bandwidth = bandwidth)
# Predict quantiles (i.e., quantile regression) for CDEoptimized forest alpha_quantile < .90 quant_test < predict(forest, x_test, y_grid, response = ’quantile’, quantile = alpha_quantile, bandwidth = bandwidth)
a.2.1 fRFCDE
{Verbatim}[commandchars=
{},codes=]
import numpy as np
import rfcde
# Parameters n_trees = 1000 # Number of trees in the forest mtry = 4 # Number of variables to potentially split at in each node node_size = 20 # Smallest node size n_basis = 15 # Number of basis functions bandwidth = 0.2 # Kernel bandwith  used for prediction only lambda_param = 10 # Poisson Process parameter
# Fit the model functional_forest = rfcde.RFCDE(n_trees=n_trees, mtry=mtry, node_size=node_size, n_basis=n_basis) functional_forest.train(x_train, y_train, flamba=lambda_param)
# … Same as RFCDE for prediction …
{Verbatim}[commandchars=
{},codes=]
library(RFCDE)
# Parameters n_trees < 1000 # Number of trees in the forest mtry < 4 # Number of variables to potentially split at in each node node_size < 20 # Smallest node size n_basis < 15 # Number of basis functions bandwidth < 0.2 # Kernel bandwith  used for prediction only lambda_param < 10 # Poisson Process parameter
# Fit the model functional_forest < RFCDE::RFCDE(x_train, y_train, n_trees = n_trees, mtry = mtry, node_size = node_size, n_basis = n_basis, flambda = lambda_param)
# … Same as RFCDE for prediction …
a.3 FlexCode
{Verbatim}[commandchars=
{},codes=]
import numpy as np
import flexcode
# Select regression method from flexcode.regression_models import NN
# Parameters basis_system = "cosine" # Basis system max_basis = 31 # Maximum number of basis. If the model is not tuned, # max_basis is set as number of basis
# Regression Parameters # If a list is passed for any parameter automatic 5fold CV is used to # determine the best parameter combination. params = {"k": [5, 10, 15, 20]} # A dictionary with methodspecific regression parameters.
# Parameterize model model = flexcode.FlexCodeModel(NN, max_basis, basis_system, regression_params=params)
# Fit model  this will also choose the optimal number of neighbors ‘k‘ model.fit(x_train, y_train)
# Tune model  Select the best number of basis model.tune(x_validation, y_validation)
# Predict new densities on grid cde_test, y_grid = model.predict(x_test, n_grid=n_grid)
{Verbatim}[commandchars=
{},codes=]
library(FlexCoDE)
# Parameters basis_system < "Cosine" # Basis system max_basis < 31 # Maximum number of basis.
# Fit and tune FlexCode via KNN regression using 4 cores fit < FlexCoDE::fitFlexCoDE(x_train, y_train, x_validation, y_validation, nIMax = max_basis, regressionFunction.extra = list(nCores = 4), regressionFunction = FlexCoDE::regressionFunction.NN, system = basis_system)
# Predict new densities on a grid n_points_grid < 500 # Number of points in the grid cde_test < predict(fit, x_test, B = n_points_grid)
## Shortcut for FlexZBoost
# Fit and tune FlexZBoost fit < FlexCoDE::FlexZBoost(x_train, y_train, x_validation, y_validation, nIMax = max_basis, system = basis_system)
# Predict new densities on a grid n_points_grid < 500 # Number of points in the grid cde_test < predict(fit, x_test, B = n_points_grid)
a.4 cdetools
{Verbatim}[commandchars=
{},codes=]
import cdetools
cde_test # numpy matrix of conditional density evaluations on a grid #  each row is an observation, each column is a grid point y_grid # the grid at which cde_test is evaluated y_true # the observed y values
# Calculate the cde_loss cde_loss(cde_test, y_grid, y_test)
# Calculate coverage values cdf_coverage(cde_test, y_grid, y_test) # PITvalues hpd_coverage(cde_test, y_grid, y_test) # HPDvalues
{Verbatim}[commandchars=
{},codes=]
library(cde_tools)
cde_test # matrix of conditional density evaluations on a grid #  each row is an observation, each column is a grid point y_grid # the grid at which cde_test is evaluated y_true # the observed y values
# Calculate the cde_loss cde_loss(cde_test, y_grid, y_test)
# Calculate coverage values cdf_coverage(cde_test, y_grid, y_test) # PITvalues hpd_coverage(cde_test, y_grid, y_test) # HPDvalues
Appendix B Point Estimate PhotoZ Redshifts (Univariate Response with Multivariate Data)
Plots of point estimates can sometimes be a useful qualitative diagnostic of photo code performance in that they may quickly identify a bad model. For example, the marginal model would always lead to the same point estimate and hence a horizontal line in the plot. Figure 4 shows point photo predictions versus the true redshift for the photo codes in Example 4.1.
The second moment of the distribution and outlier fraction (OLF) are also commonly used diagnostic tools for photo estimators. Table 4 lists the , and OLF point estimate metrics from Dahlen et al. (2013). These metrics are defined as follows:
(B1)  
(B2)  
(B3) 
where indicates the spectroscopic redshift and is the difference between the predicted and spectroscopic redshift. According to the point estimate metrics, RFCDE and DeepCDE perform the best on the Teddy data, followed by FlexZBoost and NNKCDE. These results are consistent with the CDE loss rankings in Figure 2.
Method  OLF (%)  

Marginal  0.0049  0.0740  1.018 
RFCDELimited  0.0010  0.0205  0.604 
NNKCDE  0.0008  0.0180  0.335 
FlexZBoost  0.0008  0.0169  0.362 
DeepCDE  0.0007  0.0164  0.359 
RFCDE  0.0007  0.0169  0.345 
Appendix C Loss Function and Model Assumption Equivalence
In this section we connect the DeepCDE loss in Equation 4 to the FlexCode model setup in Equation 2.