Deep Nonparametric Estimation of Discrete Conditional Distributions via Smoothed Dyadic Partitioning

Wesley Tansey (tansey@cs.utexas.edu)
Department of Computer Science, University of Texas at Austin

Karl Pichotta (kpich@cs.utexas.edu)
Department of Computer Science, University of Texas at Austin

James G. Scott (james.scott@mccombs.utexas.edu)
Department of Information, Risk, and Operations Management; Department of Statistics and Data Sciences, University of Texas at Austin

Abstract

We present an approach to deep estimation of discrete conditional probability distributions. Such models have several applications, including generative modeling of audio, image, and video data. Our approach combines two main techniques: dyadic partitioning and graph-based smoothing of the discrete space. By recursively decomposing each dimension into a series of binary splits and smoothing over the resulting distribution using graph-based trend filtering, we impose a strict structure on the model and achieve much higher sample efficiency. We demonstrate the advantages of our model through a series of benchmarks on both synthetic and real-world datasets, in some cases reducing the error by nearly half in comparison to other popular methods in the literature. All of our models are implemented in TensorFlow and publicly available online.
Recently there has been a flurry of interest in using deep learning methods to estimate conditional probability distributions. The applications of such models cover a wide variety of scientific areas, from cosmology (Ravanbakhsh et al., 2016) to health care (Ranganath et al., 2016; Ng et al., 2017). A subset of this area deals specifically with discrete conditional distributions, where an explicit estimate of the likelihood is desired, as opposed to simply the ability to sample from the distribution, as with GAN-based models (Goodfellow et al., 2014). Deep learning models that output discrete probability distributions have achieved state-of-the-art results in text-to-speech synthesis (van den Oord et al., 2016a), image generation (van den Oord et al., 2016b;c;d; Gulrajani et al., 2016; Salimans et al., 2017), image super-resolution (Dahl et al., 2017), image colorization (Deshpande et al., 2016), and EHR survival modeling (Ranganath et al., 2016). Methodological improvements to deep discrete conditional probability estimation (CPE) therefore have the potential to make substantial improvements in many fields of interest in machine learning.
In this paper we focus on the specific form of the deep CPE model used when estimating low-dimensional data such as audio waveforms (1d) or pixels (3d). Typically, this is the output layer of a deep neural network and represents either the logits of a multinomial distribution or the parameters of a mixture model, such as a Gaussian mixture model (also known as a mixture density network (Bishop, 1994)). Previous work (van den Oord et al., 2016a;b) has found empirically that using a multinomial model often outperforms GMMs on discrete data. Methods to improve performance over the naive multinomial model often involve sophisticated compression of the space into a smaller number of bins (van den Oord et al., 2016a) or hand-crafting a mixture model to better suit the marginal distribution of the data (Salimans et al., 2017).
We propose an alternative model, Smoothed Dyadic Partitions (SDP), as a drop-in replacement for these conventional, widely used CPE approaches. SDP performs a dyadic decomposition of the discrete space, transforming the probability mass estimation into a series of left/right split estimations. During training, SDP locally smooths the area around the target value in the discrete space using a graph-based smoothing technique. These two techniques seek to leverage the inherent spatial structure in the discrete distribution to enable points to borrow statistical strength from their nearby neighbors and consequently improve the estimation of the latent conditional distribution.
SDP outperforms both multinomial and mixture models on synthetic and real-world datasets. These empirical results show that SDP is unique in its ability to provide a flexible fit that smooths well without suffering from strong inductive bias, particularly at the boundaries of the space (e.g. pixel intensities 0 and 255 in an image problem). Our SDP design also involves specific attention to efficient implementation on GPUs. These design choices, together with a local neighborhood sampling scheme for evaluating the regularizer, ensure that the approach can scale to large (finely discretized) domains in low-dimensional space.
The remainder of this paper is organized as follows. Section 2 outlines our dyadic partitioning strategy. Section 3 details our approach to smoothing the underlying discrete probability space using graph-based trend filtering. Section 4 presents our experimental analysis of SDP and our benchmarks confirming its strong performance. Section 5 provides a discussion of related work and the limitations of our model. Finally, Section 6 gives concluding remarks.
Our model relies on representing the target discrete distribution using a tree rather than a flat grid. Tree-based models for distribution estimation have a long history of success in machine learning. This includes seminal work using kd-trees for nonparametric density estimation (Gray & Moore, 2003) and hierarchical softmax for neural language models (Morin & Bengio, 2005). More recent work includes, for instance, spatial discrete distribution estimation of background radiation (Tansey et al., 2016). We draw inspiration from these past works in the design of our SDP model.
Rather than outputting the logits of a multinomial distribution directly, we instead create a balanced binary tree with its root node in the center of the discrete space. From the root, we recursively partition the space into a series of half-spaces, resulting in $m - 1$ internal nodes for a discrete space of size $m$. The deep learning model then outputs the splitting probabilities for every node $i$, parameterized as the logits $\eta_i(x)$ in a series of independent binary classification tasks,

$$\gamma_i(x) = p(y > c_i \mid x, i) = \frac{1}{1 + e^{-\eta_i(x)}} \, , \qquad (1)$$

where $c_i$ is the center of the $i$th node.
Figure 1 presents an illustration of the dyadic partitioning (DP) approach. For ease of exposition, we denote the conditional probability of $y$ being greater than the node value as simply $\gamma_i$ for a given node $i$. For a target value of $y$ with some training example $(x, y)$, we calculate the log probability during training as $\log p(y \mid x) = \sum_{i \in \mathrm{Path}(y)} \log p_i$, where $p_i = \gamma_i$ if the path to $y$ branches right at node $i$ and $p_i = 1 - \gamma_i$ otherwise. The training objective for the model is then the sum of the log probabilities of the training data.
There are several computational advantages to using this particular structure compared to a multinomial. For large spaces, multinomial models typically require some form of negative sampling (Mikolov et al., 2013; Jean et al., 2014) at training time to remain efficient. In the DP model, however, every split is conditionally independent of the rest of the tree and there is no partition function to estimate. Instead, we only require the path from the root to the target node to estimate the probability of a given training example. Using a balanced tree also guarantees that every path has a fixed length of $\log_2 m$, making vectorization on a GPU straightforward. Finally, because each node is only dealing with splitting its local region, the resulting computations are much more numerically stable and rarely result in very small or large log-probabilities, a problem that often plagues both multinomial and mixture model approaches.
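As a concrete sketch of this path-based likelihood, the following NumPy snippet computes the log probability of a bin under a balanced dyadic tree stored in heap order. The heap layout and function names are our own illustration, not necessarily the paper's TensorFlow implementation:

```python
import numpy as np

def dyadic_log_prob(logits, y, k):
    """Log-probability of bin y under a balanced dyadic tree over k = 2^d bins.

    `logits` holds one split logit per internal node in heap order
    (node i has children 2i+1 and 2i+2); sigmoid(logit) is the
    probability of taking the right (upper) half-space.
    """
    depth = int(np.log2(k))
    log_p, node, lo, hi = 0.0, 0, 0, k  # current node and its bin range [lo, hi)
    for _ in range(depth):
        mid = (lo + hi) // 2
        p_right = 1.0 / (1.0 + np.exp(-logits[node]))
        if y >= mid:           # target lies in the right half-space
            log_p += np.log(p_right)
            node, lo = 2 * node + 2, mid
        else:                  # target lies in the left half-space
            log_p += np.log(1.0 - p_right)
            node, hi = 2 * node + 1, mid
    return log_p

# With all logits at zero, every split is 50/50, so each of the k bins
# receives probability (1/2)^log2(k) = 1/k.
logits = np.zeros(2 ** 4 - 1)  # k = 16 bins -> 15 internal nodes
print(np.exp(dyadic_log_prob(logits, 5, 16)))  # 0.0625 = 1/16
```

Note that each factor along the path is a single sigmoid, so no partition function is ever computed and the per-bin probabilities sum to one by construction.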
We extend the DP approach to multidimensional distributions in a manner similar to a balanced kd-tree. We enumerate the splits in the tree in a breadth-first fashion, alternating dimensions at each level of the tree. This has two distinct advantages over a depth-first approach of enumerating the first dimension before proceeding to the next dimension. The breadth-first approach means that all nodes close in Euclidean space will share more coarse-grained parents. This makes training more efficient by imposing a more principled notion of structure on the discrete space. It also improves computational efficiency for the local smoothing strategy described in Section 3, as nearby values have heavily overlapping paths; this results in a well-utilized GPU cache when training.
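The breadth-first, dimension-alternating enumeration can be sketched as follows. This is an illustrative scheme assuming every dimension has the same power-of-two size; the paper's exact node ordering may differ:

```python
from collections import deque

def enumerate_splits(shape):
    """Breadth-first enumeration of kd-style splits over a grid.

    Each level of the tree splits along one dimension, cycling through
    dimensions level by level so that spatially close cells share
    coarse-grained ancestors. Assumes all dimensions are equal powers of two.
    Yields (level, dim, region), where region is a tuple of (lo, hi) ranges.
    """
    d = len(shape)
    root = tuple((0, s) for s in shape)
    queue = deque([(0, root)])
    while queue:
        level, region = queue.popleft()
        dim = level % d                      # alternate split dimension per level
        lo, hi = region[dim]
        if hi - lo <= 1:
            continue                         # single cell along every axis: leaf
        mid = (lo + hi) // 2
        yield level, dim, region
        left = region[:dim] + ((lo, mid),) + region[dim + 1:]
        right = region[:dim] + ((mid, hi),) + region[dim + 1:]
        queue.append((level + 1, left))
        queue.append((level + 1, right))

# A 4x4 grid yields 1 + 2 + 4 + 8 = 15 internal splits.
print(len(list(enumerate_splits((4, 4)))))  # 15
```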
The DP approach described above imposes a spatial structure on the discrete space. In the example from Figure 1, a training example is likely to increase the probability of the neighbors lying on the same side of each split along its path, since the shared internal nodes all shift probability mass in their direction. However, it will clearly decrease the likelihood of a neighbor on the opposite side of a high-level split, since the shared ancestor shifts probability away from that neighbor while the remaining nodes on its path are left unchanged. This imbalance in updates is likely to lead to jagged estimates of the underlying conditional distribution. To address this issue, we incorporate a smoothing regularizer into SDP that spreads out the probability mass to nearby neighbors as a function of distance in the underlying discrete space, rather than only their specific DP paths.
Trend filtering (Kim et al., 2009; Tibshirani et al., 2014) is a method for performing adaptive smoothing over discrete spatial lattices. In the univariate case with a Gaussian loss, the solution to the trend filtering minimization problem results in a piecewise polynomial fit similar to a spline with adaptive knot placement. The order of the polynomial is a hyperparameter chosen to minimize some objective criterion such as AIC or BIC; in SDP, we use validation error. Recent methods (Wang et al., 2016) extend trend filtering to arbitrary graphs and theoretical results show that trend filtering has strong minimax rates (Sadhanala et al., 2016) and is optimally spatially adaptive for univariate discrete spaces (Guntuboyina et al., 2017).
To smooth the conditional distributions, we employ a graph-based trend filtering penalty applied to the conditional log-probabilities output by our model. This yields a regularized loss function for the $j$th training sample,

$$\ell(x_j, y_j) = -\log p(y_j \mid x_j) + \lambda \left\lVert \Delta^{(k+1)} \, \mathrm{vec}\!\left(q(x_j)\right) \right\rVert_1 \, , \qquad (2)$$

where $\Delta^{(k+1)}$ is the sparse order-$k$ graph trend filtering penalty matrix as defined in (Wang et al., 2016) and vec is a function mapping the $d$-dimensional discrete conditional distribution of log-probabilities $q(x_j)$ over $y$ to a vector. The term $\lambda$ is a hyperparameter that controls the tradeoff between the fit to the training data and the smoothness of the underlying distribution. As we show in Section 4, as the size of the dataset increases, the best SDP setting will drift towards smaller values of $\lambda$. Thus, in small-sample regimes SDP relies on trend filtering to smooth out the underlying space, whereas in large-sample regimes it converges to the pure DP model.
A naive implementation of the trend filtering regularizer would require evaluating all the nodes in the discrete space. This would remove many of the computational performance advantages of the DP model described in Section 2. To ensure that SDP scales to large spaces, we smooth only over a local neighborhood around the target value. Specifically, for a given $y$, we smooth over all nodes in the hypercube of radius $r$ centered at $y$. The resulting regularization loss is then only over this subset of the space,
$$r(x_j, y_j) = \lambda \left\lVert \widetilde{\Delta}^{(k+1)} \, N_r\!\left(y_j, q(x_j)\right) \right\rVert_1 \, . \qquad (3)$$

In (3), $\widetilde{\Delta}^{(k+1)}$ is the graph trend filtering matrix for a discrete grid graph of size $(2r+1)^d$ and $N_r$ is the neighborhood selection function that returns the vector of local conditional logits to smooth. Figure 1 provides an illustration of the local sampling for a neighborhood radius of size 2 and a target label of 4.
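A minimal 1d illustration of this local penalty, using a dense first-order difference matrix for clarity (a real implementation would use a sparse matrix and share work across the overlapping DP paths):

```python
import numpy as np

def first_order_diff_matrix(n):
    """First-order difference (trend filtering) matrix for a 1d chain graph."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    return D

def local_tf_penalty(log_probs, y, radius, lam):
    """lambda * ||D q||_1 over the radius-r neighborhood of target bin y."""
    lo = max(0, y - radius)
    hi = min(len(log_probs), y + radius + 1)
    q = log_probs[lo:hi]                  # local conditional log-probabilities
    D = first_order_diff_matrix(len(q))
    return lam * np.abs(D @ q).sum()

# A perfectly flat distribution incurs zero penalty; a jagged one does not.
flat = np.log(np.full(100, 1 / 100))
print(local_tf_penalty(flat, 50, 5, 0.1))  # 0.0
```

Only the bins inside the window enter the penalty, which is what lets the regularizer scale independently of the total grid size.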
By only needing to compute the values of a local neighborhood, the SDP model regains its computational efficiency. For instance, in the case of a neighborhood radius of size 5 in a 3d scenario where each dimension is of size 64, the full smoothing model would have to calculate all $64^3 = 262{,}144$ output DP nodes. The local smoother, on the other hand, only needs paths of size 24 for the $11^3 = 1{,}331$ local logits, for an upper bound of $31{,}944$ nodes. Even though this is already a sharp reduction (roughly $8\times$), most of the local neighborhood will have highly overlapping paths and thus the average number of nodes sampled is much lower than the upper bound.
We evaluate SDP on a series of benchmarks against real and synthetic data. First, we show how the dyadic partitioning is affected by the trend filtering with different neighborhood sizes. We then compare SDP against approaches found in the recent literature and highlight the particular pathologies of each method. Finally, we measure the performance of each method on real datasets of one-, two-, and three-dimensional discrete conditional target distributions.
As noted in Section 3, the local smoothing strategy of SDP introduces a new hyperparameter $r$ to tune. To evaluate the effect of different choices of neighborhood size we consider a toy example of performing marginal density estimation on a large 1000-bin 1d grid. We use a piecewise-linear ground truth function to parameterize the logits of the true distribution:
$$f(z) = a_k z + b_k \quad \text{for } z \in [t_k, t_{k+1}), \quad k = 1, \dots, K, \qquad (4)$$

for a fixed set of knots $t_k$ and segment coefficients $(a_k, b_k)$.
We then standardize the logits and draw 5000 samples from the corresponding multinomial.
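This setup can be sketched as follows; the breakpoints of the piecewise-linear logit function here are illustrative placeholders, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_samples = 1000, 5000

# Illustrative piecewise-linear logits (the paper's exact breakpoints differ).
grid = np.linspace(0, 1, n_bins)
logits = np.interp(grid, [0.0, 0.2, 0.5, 0.8, 1.0], [0.0, 3.0, -1.0, 2.0, 0.0])
logits = (logits - logits.mean()) / logits.std()   # standardize the logits

p_true = np.exp(logits) / np.exp(logits).sum()     # softmax -> true distribution
draws = rng.choice(n_bins, size=n_samples, p=p_true)

# The empirical MLE is the normalized histogram of the draws.
p_mle = np.bincount(draws, minlength=n_bins) / n_samples

# Total variation distance between the estimate and the truth.
tv = 0.5 * np.abs(p_true - p_mle).sum()
print(f"TV(empirical MLE, truth) = {tv:.3f}")
```

With 5000 samples spread over 1000 bins, the empirical MLE has a large TV error, which is the gap the smoothed estimators aim to close.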
As a baseline, we consider an unsmoothed dyadic partitioning model which simply performs unregularized maximum likelihood estimation (MLE). We then evaluate SDP with five different neighborhood radius sizes: 1, 3, 5, 10, and 25. We fix the other settings to use first-order ($k = 1$) trend filtering with the penalty weight $\lambda$ held fixed. All models are fit via Adam with fixed learning rate, momentum, and minibatch settings. We run the experiment for 50K steps and plot the total variation (TV) error between the true distribution and the estimated distribution in Figure 2.
The baseline unsmoothed method (gray solid line) quickly reaches a low plateau in TV error (gray dashed line), much lower than that of the empirical MLE (red dashed line). However, as the model continues to train it begins to overfit and will eventually converge to the empirical MLE if training is allowed to continue. The smoothing in SDP acts as a regularizer which prevents this overfitting, as is seen in the case of the larger radii of size 10 and 25. These models converge nearly as quickly as the unsmoothed model, but both reach a better TV error and do not begin to overfit.
When the neighborhood size is small, as in the case of the radii of size 1 and 3, the smoothing can substantially slow down learning. On the other hand, wall-clock time for SDP scales linearly with the neighborhood size. This creates a clear tradeoff for SDP: smaller neighborhoods are computationally more efficient, but may require many more samples to converge. Fortunately, a radius of 25 is still only 5% of the total size of the grid, yielding a considerable wall-clock speedup over the full trend filtering penalty. We found that neighborhoods larger than 25 did not yield any additional sample efficiency nor asymptotic performance benefits on this experiment. It may also be possible to employ a hybrid approach of fitting an unsmoothed model initially, then smoothing later, though we have not explored such an approach.
We next create a synthetic benchmark to evaluate the sample efficiency and systematic pathologies of both our method and other methods used in the recent literature. Our task is a variant on the well-known MNIST classification problem, with the twist that rather than mapping each digit to a latent class, each digit is mapped to a latent discrete distribution. For each sample image, we generate a label $y$ by first mapping the digit to its corresponding distribution and then sampling $y$ as a draw from that distribution, resulting in a training set of $(X, y)$ values where $X$ is an image (whose digit class is not explicitly known by the model) and $y$ is an integer.
We compare six methods:

- Multinomial (MN): a simple multinomial model with no knowledge of the structure of the underlying space.
- Gaussian Mixture Model (GMM): a $K$-component GMM, or mixture density network (MDN) (Bishop, 1994). For multidimensional data, we use a Cholesky parameterization of the covariance matrix.
- Logistic Mixture Model (LMM): a $K$-component mixture of logistic distributions, implemented using the CDF method of PixelCNN++ (Salimans et al., 2017).
- Unsmoothed Dyadic Partitions (UDP): our dyadic partitioning model with no smoothing.
- Smoothed Multinomial (SMN): a multinomial model where structure of the space is added by applying a trend filtering penalty on the logits.
- Smoothed Dyadic Partitions (SDP): our model with dyadic partitioning and a local smoothing window.
The first three methods have all been used in recent works in the literature. The UDP and SMN models serve as ablations, each evaluating the effectiveness of SDP with one of its two components removed.
We consider two different ground truth distribution classes, both one dimensional. The first uses a 3-component GMM where component means and standard deviations are sampled uniformly at random. The model is then discretized by evaluating the PDF at an evenly spaced (zero-indexed) 128-bin grid. The resulting distribution always has modes that fall far away from the boundaries at 0 and 127.
This does not reflect the typical nature of real discrete data, however, which often exhibits spikes near the boundaries. To address these cases, we generate a second set of experiments where the ground truth is drawn from a mixture model of the following form:
$$p(y) = \pi_1 \, \mathrm{Exp}(y; \lambda_1) + \pi_2 \, \mathrm{Exp}(127 - y; \lambda_2) + \pi_3 \, \mathcal{N}(y; \mu, \sigma^2) \, , \qquad (5)$$

where Exp is the exponential distribution. We sample $\lambda_1$ and $\lambda_2$ uniformly at random and sample $\mu$ and $\sigma$ as in the GMM, then discretize this model following the same procedure used for the GMM. This creates an edge-biased distribution, where a smooth mode exists somewhere in the middle of the space, but at the boundaries the probability mass increases exponentially, similar to the observed marginal subpixel intensity in the CIFAR dataset (Salimans et al., 2017).
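A sketch of such an edge-biased ground truth; the specific rates, mixture weights, and normalization details are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def edge_biased_pmf(n_bins, lam1, lam2, mu, sigma, weights):
    """Discretized edge-biased mixture: exponential spikes at both
    boundaries plus a Gaussian mode in the interior.

    Parameter values are illustrative, not the paper's exact settings.
    """
    y = np.arange(n_bins, dtype=float)
    left = lam1 * np.exp(-lam1 * y)                  # spike at bin 0
    right = lam2 * np.exp(-lam2 * (n_bins - 1 - y))  # spike at the top bin
    gauss = np.exp(-0.5 * ((y - mu) / sigma) ** 2)   # smooth interior mode
    w1, w2, w3 = weights
    pmf = w1 * left + w2 * right + w3 * gauss / gauss.sum()
    return pmf / pmf.sum()                           # renormalize after discretization

pmf = edge_biased_pmf(128, lam1=0.5, lam2=0.5, mu=64.0, sigma=10.0,
                      weights=(0.2, 0.2, 0.6))
print(round(pmf.sum(), 6))  # 1.0
```

The resulting PMF has mass concentrated at bins 0 and 127 with a smooth bump around the interior mode, which is the shape that trips up symmetric mixture components.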
For both distributions, we evaluate all six models on sample sizes of 500, 1K, 3K, 5K, 10K, 15K, 30K, and 60K. The base network architecture for each model uses two convolution layers of size 32 and 64 with max pooling, followed by a dense hidden layer; all layers use ReLU activations and dropout. All models are trained for 100K steps using Adam with fixed learning rate and batch size, with a portion of the training samples held out as a validation set; validation every 100 steps is used to save the best model and prevent overfitting (i.e. the overfitting problems noted above are not an issue here). For the GMM and LMM models, we evaluated over a grid of component counts $K$. For the smoothed models, we fixed the neighborhood radius at 5 and evaluated over the trend filtering order $k$ and penalty weight $\lambda$. All hyperparameters were selected using the validation set performance.
Figure 3 shows the results, measured in terms of total variation distance from the true distribution, averaged across ten independent trials. For the GMM distribution (Figure 3a), the GMM model is well-specified and consequently performs very well. In the low-sample GMM regime, the SDP model is competitive with the GMM model, despite the fact that the GMM matches the parametric form of the ground truth. As previously noted, however, most data sets do not follow such an ideal form; for example, previous work (van den Oord et al., 2016a;b;c) has noted that a multinomial model often outperforms a GMM model. If the GMM distribution were reflective of real data, we would not expect the multinomial model to outperform it.
The edge-biased results in Figure 3b may be of more practical interest, as the design of this experiment is directly motivated by the real marginal subpixel intensity distributions seen in natural images. In the edge-biased scenario the multinomial model does in fact outperform the GMM model. However, SDP is clearly the best model here, with much stronger performance across all sample sizes. Interestingly, the LMM model performs very poorly, despite its design also being inspired by modeling pixel intensities. To better understand the performance of each of the models on the edge-biased dataset, we generated example conditional distributions when each model is trained with 3K samples.
Figure 4 shows plots of each model's estimate of the conditional distribution of the label for a single example image, with the ground truth shown in gray. The multinomial model (Figure 4a) treats every value as independent and results in a jagged reconstruction, especially in the tails where the variance is particularly high. The GMM model (Figure 4b) provides a smooth estimate that captures the middle mode well, but it drastically underestimates the tails because of the symmetric assumption of the model components. Conversely, the LMM (Figure 4c) produces large spikes at the two boundaries. This is due to the formulation of the model, in which each boundary bin absorbs the total component mass lying beyond that edge of the space. This is an intentional bias in the model designed to better match CIFAR pixel intensities, which also have spikes at the boundaries. But it is quite a strong bias, as it effectively results in a two-point-inflated smooth model with a nontrivial bias towards the boundaries. Finally, the UDP and SMN models (Figures 4d and 4e) result in slightly better fits than the simple multinomial model, but the combination of the two in the SDP model (Figure 4f) results in a smooth fit that is able to estimate the tails well.
In both distributions, we observe that the dyadic partitioning and local smoothing are jointly beneficial. Both unsmoothed DP and smoothed MN outperform a simple multinomial, but combining them in the SDP model is superior to both. As the sample size grows, the SDP model converges to the UDP model in performance. This is unsurprising, as increased data results in a decreased advantage to smoothing. Indeed, as we show in Figure 5, the average chosen value of $\lambda$ (i.e. the amount of smoothing) decreases as the sample size grows.
As a final validation of our method, we compile a benchmark of real-world datasets with discrete conditional distributions as a target output (or where the target variable is discrete). We use seven datasets from the UCI database; three are one-dimensional targets, three are two-dimensional, and one is three-dimensional. Every model uses a network architecture of three hidden layers of sizes 256, 128, and 64 with ReLU activation, weight decay, and dropout. All models were trained with Adam and a decaying learning rate; we dynamically schedule the decay by reducing the rate after the current model has failed to improve for 10 epochs. Training stops after 1000 epochs or once the learning rate falls below its minimum value. All results are averages using 10-fold cross-validation and we use 20% of the training data in each trial as a validation set. For all datasets, we select hyperparameter settings as in Section 4. We plot the marginal distributions of each real dataset in Figure 7.
We also evaluate on a pixel prediction task for both MNIST and CIFAR-10, where we sample a patch of the image and must predict the pixel located at a fixed offset relative to the origin of the patch. For both image datasets, we consider 3 different training sample sizes (500, 5K, and 50K). Every model uses a network architecture of two convolution layers of size 32 and 64, with max pooling, followed by three dense hidden layers of size 1024, 128, and 32; all layers use ReLU activation, weight decay, and dropout. Other training details are identical to the UCI datasets, with the exception that we only perform a single trial on the CIFAR datasets due to computational constraints. Similarly, we reduce the resolution of the CIFAR data to keep training tractable. Plots of the marginal distributions of all our datasets are available in the supplementary material.
| Dataset | Grid Size | Samples | MN log-prob | MN RMSE | GMM log-prob | GMM RMSE | LMM log-prob | LMM RMSE | SDP log-prob | SDP RMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abalone | 29 | 4177 | 822.83 | 2.17 | 907.78 | 2.42 | 857.23 | 2.30 | 851.88 | 2.31 |
| AutoMPG | 377 | 392 | 177.81 | 37.34 | 187.02 | 31.59 | 186.67 | 32.49 | 160.31 | 30.84 |
| Housing | 451 | 506 | 297.59 | 69.20 | 247.23 | 40.81 | 240.20 | 39.53 | 246.44 | 36.18 |
| MNIST500 | 256 | 500 | 1416.69 | 80.90 | 2588.79 | 92.78 | 1658.78 | 81.57 | 1466.65 | 86.68 |
| MNIST5K | 256 | 5000 | 1229.77 | 64.10 | 2096.24 | 94.16 | 1231.01 | 64.29 | 1224.42 | 63.06 |
| MNIST50K | 256 | 50000 | 1173.80 | 58.82 | 2365.57 | 94.24 | 1191.11 | 60.41 | 1161.69 | 57.16 |
| Students | - | 395 | 209.07 | 5.27 | 219.67 | 5.44 | 209.43 | 5.26 | 200.76 | 5.18 |
| Energy | - | 768 | 323.10 | 6.21 | 492.49 | 10.89 | 437.90 | 14.24 | 279.01 | 4.17 |
| Parkinsons | - | 5875 | 1941.91 | 6.42 | 3969.63 | 10.91 | 3633.29 | 13.53 | 3530.22 | 14.93 |
| Concrete | - | 103 | 115.88 | 21.34 | 107.46 | 18.91 | 108.63 | 21.07 | 102.34 | 18.09 |
| CIFAR500 | - | 500 | 9980.57 | 26.35 | 9177.59 | 24.69 | 9109.57 | 26.00 | 8519.21 | 25.81 |
| CIFAR5K | - | 5000 | 9688.08 | 26.26 | 9106.04 | 23.01 | 9213.49 | 26.11 | 7504.35 | 14.89 |
| CIFAR50K | - | 50000 | 8409.60 | 22.66 | 9099.49 | 23.02 | 9214.51 | 26.08 | 6796.39 | 13.42 |
Table 1 presents the results on all the candidate datasets. We measure performance both in log-probability of the specific observed point and in root mean squared error (RMSE), as the discrete space has a natural measurement of distance. In general, the SDP model performs very well in cases where the size of the discrete space dominates the sample size. On the datasets where this is not the case (Abalone and Parkinsons), the multinomial model has sufficient data to model the space well. The LMM model outperforms in terms of log-probability on the Housing dataset, likely due to the large peaks at the boundaries in the data (Figure 7c in the supplement). The CIFAR dataset also has substantial peaks, specifically at the corners, resulting in the LMM outperforming the multinomial model, as has been demonstrated previously in PixelCNN++ (Salimans et al., 2017). However, the additional structure in the dataset is much better modeled via the SDP model, which has nearly half the RMSE of the other methods.
As our experiments demonstrate, the SDP model outperforms several alternative models commonly used in the literature. In the one-dimensional case, many other models have been proposed, ranging from more flexible parametric component distributions for mixture models (Carreau & Bengio, 2009) to quantile regression (Taylor, 2000; Lee & Yang, 2006). Extending these models to higher dimensions is nontrivial, making them unsuitable for use in many of our target applications. Furthermore, even in the 1d case, it is often unclear a priori which parametric components should be included in a mixture model, and simply adding a large number may result in overfitting. A quantile regression model would also suffer from the same overfitting issues as the unsmoothed DP model in Section 4, as it does not impose any smoothness explicitly. From a computational perspective, quantile regression would also require all nodes to be calculated at every iteration and would not scale well to large (finely discretized) 1d spaces.
There have also been other multidimensional models, notably the line of work on neural autoregressive models such as NADE (Uria et al., 2016), RNADE (Uria et al., 2013), and MADE (Germain et al., 2015), as well as variational autoencoders (Kingma & Welling, 2013) such as DRAW (Gregor et al., 2015). We see such models as complementary rather than competitive approaches to SDP. For instance, one could modify the outputs of MADE to be a separate discrete distribution for each dimension rather than a single likelihood. This would also address the main scalability issue of our model: currently SDP requires $m^d - 1$ output nodes for a space of $m^d$ possible values. In the low-dimensional problems explored in this paper this was not a problem, but it quickly exceeds the memory of a GPU once one moves beyond three or four dimensions.
Our choice of local trend filtering for smoothing introduces three new hyperparameters: the neighborhood radius size ($r$), the order of trend filtering ($k$), and the penalty weight ($\lambda$). The model appears to function well with a fairly small neighborhood, on the order of 5% of the underlying space, and even with just a fixed choice of a radius of 5 we performed well across all our real-world datasets. The choice of $k$ is also less important, as both linear and quadratic trend filtering tend to drastically improve on the unsmoothed model. Computationally, the lower-order penalty matrix is somewhat more efficient to use, though we have only optimized our 1d model for GPUs. The main parameter of interest is the choice of $\lambda$. In all of our experiments we chose to enumerate each value and fix it for the entire training procedure. In practice, one may wish to anneal $\lambda$ upward as training proceeds so that the model can quickly converge to a jagged-but-good solution and then focus on smoothing.
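For reference, the order-$k$ penalty matrices can be built recursively from the plain first-difference operator, following the standard trend filtering construction of $\Delta^{(k+1)}$ from $\Delta^{(k)}$; this sketch uses dense matrices for clarity:

```python
import numpy as np

def diff_matrix(n, order):
    """Trend filtering penalty matrix for a 1d chain of n bins.

    Built recursively: Delta^(k+1) = D * Delta^(k), where D is the plain
    first-difference operator of the appropriate size. order=0 gives first
    differences (piecewise-constant fits), order=1 gives second differences
    (piecewise-linear fits), and so on.
    """
    def first_diff(m):
        D = np.zeros((m - 1, m))
        for i in range(m - 1):
            D[i, i], D[i, i + 1] = -1.0, 1.0
        return D

    Delta = first_diff(n)
    for _ in range(order):
        Delta = first_diff(Delta.shape[0]) @ Delta
    return Delta

# A perfect line has zero first-order penalty (no changes in slope)
# but a nonzero zeroth-order penalty (it is not piecewise constant).
line = np.arange(6, dtype=float)
print(np.abs(diff_matrix(6, 1) @ line).sum())  # 0.0
print(np.abs(diff_matrix(6, 0) @ line).sum())  # 5.0
```

Each additional order makes the matrix one row shorter and one band wider, which is why the lower-order penalties are cheaper to apply.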
Given the computational overhead of smoothing during training (around 3 to 4 times the wall-clock time of the other models in our experiments), one may be tempted to smooth the logits post-training, but there are several problems with such an approach. First, it would be unclear what inferential principle to use to determine the best fit. Second, given some ad hoc approach to choosing $\lambda$, smoothing the logits post-training would impose a separate degree of smoothness on each sample's conditional distribution. By using a single $\lambda$ for all training examples, we are in effect imposing the same degree of smoothness on all distributions, in a way that constrains the degrees of freedom of the model explicitly across all classes. Evaluating such a model on test data would also require generating the region (full or local) to smooth and then solving the full solution path each time, substantially impacting performance.
Finally, our experiments have all been conducted on relatively small problems with fairly simple models. Computational constraints prevented us from running extensive experiments with more sophisticated methods like PixelCNN++, which requires a multi-GPU machine and multiple days of training for a single model. It is our hope that other researchers with access to more substantial resources will evaluate SDP as an option in their larger-scale systems in the future.
We have presented SDP, a method for deep conditional estimation of discrete probability distributions. By dividing the discrete space into a series of half-spaces, SDP transforms the distribution estimation task into a hierarchical classification task, which overcomes many of the disadvantages of simple multinomial modeling. These dyadic partitions are then smoothed using graph-based trend filtering on the resulting logit probabilities in a local region around the target label at training time. The combination of dyadic partitioning and logit smoothing was shown to have an additive effect in total variation error reduction on synthetic datasets. The benchmark results on both real and synthetic datasets suggest that SDP is a powerful new method that may substantially improve the performance of generative models for audio, images, video, and other discrete data.
References
 Bishop (1994) Bishop, Christopher M. Mixture density networks. 1994.
 Carreau & Bengio (2009) Carreau, Julie and Bengio, Yoshua. A hybrid Pareto mixture for conditional asymmetric fat-tailed distributions. IEEE Transactions on Neural Networks, 20(7):1087–1101, 2009.
 Dahl et al. (2017) Dahl, Ryan, Norouzi, Mohammad, and Shlens, Jonathon. Pixel recursive super resolution. arXiv preprint arXiv:1702.00783, 2017.
 Deshpande et al. (2016) Deshpande, Aditya, Lu, Jiajun, Yeh, Mao-Chuang, and Forsyth, David. Learning diverse image colorization. arXiv preprint arXiv:1612.01958, 2016.
 Germain et al. (2015) Germain, Mathieu, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. MADE: Masked autoencoder for distribution estimation. In ICML, pp. 881–889, 2015.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Gray & Moore (2003) Gray, Alexander G and Moore, Andrew W. Nonparametric density estimation: Toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 203–211. SIAM, 2003.
 Gregor et al. (2015) Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 Gulrajani et al. (2016) Gulrajani, Ishaan, Kumar, Kundan, Ahmed, Faruk, Taiga, Adrien Ali, Visin, Francesco, Vazquez, David, and Courville, Aaron. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 Guntuboyina et al. (2017) Guntuboyina, Adityanand, Lieu, Donovan, Chatterjee, Sabyasachi, and Sen, Bodhisattva. Spatial adaptation in trend filtering. arXiv preprint arXiv:1702.05113, 2017.
 Jean et al. (2014) Jean, Sébastien, Cho, Kyunghyun, Memisevic, Roland, and Bengio, Yoshua. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
 Kim et al. (2009) Kim, Seung-Jean, Koh, Kwangmoo, Boyd, Stephen, and Gorinevsky, Dimitry. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Autoencoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Lee & Yang (2006) Lee, Tae-Hwy and Yang, Yang. Bagging binary and quantile predictors for time series. Journal of Econometrics, 135(1):465–497, 2006.
 Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
 Morin & Bengio (2005) Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246–252. Citeseer, 2005.
 Ng et al. (2017) Ng, Nathan, Gabriel, Rodney A., McAuley, Julian, Elkan, Charles, and Lipton, Zachary C. Predicting surgery duration with neural heteroscedastic regression. arXiv preprint arXiv:1702.05386, 2017.
 Ranganath et al. (2016) Ranganath, Rajesh, Perotte, Adler, Elhadad, Noémie, and Blei, David. Deep survival analysis. arXiv preprint arXiv:1608.02158, 2016.
 Ravanbakhsh et al. (2016) Ravanbakhsh, Siamak, Lanusse, Francois, Mandelbaum, Rachel, Schneider, Jeff, and Poczos, Barnabas. Enabling dark energy science with deep generative models of galaxy images. arXiv preprint arXiv:1609.05796, 2016.
 Sadhanala et al. (2016) Sadhanala, Veeranjaneyulu, Wang, YuXiang, and Tibshirani, Ryan J. Total variation classes beyond 1d: Minimax rates, and the limitations of linear smoothers. In Advances in Neural Information Processing Systems, pp. 3513–3521, 2016.
 Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 Tansey et al. (2016) Tansey, Wesley, Athey, Alex, Reinhart, Alex, and Scott, James G. Multiscale spatial density smoothing: an application to largescale radiological survey and anomaly detection. Journal of the American Statistical Association, 2016.
 Taylor (2000) Taylor, James W. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299–311, 2000.
 Tibshirani et al. (2014) Tibshirani, Ryan J et al. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
 Uria et al. (2013) Uria, Benigno, Murray, Iain, and Larochelle, Hugo. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pp. 2175–2183, 2013.
 Uria et al. (2016) Uria, Benigno, Côté, Marc-Alexandre, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.
 van den Oord et al. (2016a) van den Oord, Aaron, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, and Kavukcuoglu, Koray. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
 van den Oord et al. (2016b) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016b.
 van den Oord et al. (2016c) van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016c.
 van den Oord et al. (2016d) van den Oord, Aaron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016d.
 Wang et al. (2016) Wang, YuXiang, Sharpnack, James, Smola, Alex, and Tibshirani, Ryan J. Trend filtering on graphs. Journal of Machine Learning Research, 17(105):1–41, 2016.
Below are the marginal distributions of the real-world datasets used for benchmarking in the experiments section. All of the datasets exhibit some degree of spatial structure, making this a difficult task for a multinomial model without a large sample size. Several of the datasets also contain spikes along boundaries that present problems for Gaussian mixture models. For example, the Housing data contains a large spike at the largest value; MNIST contains a large spike at its smallest value; the Student Performance data contains a ridge line along the upper boundary (i.e., students scoring zero on one of the two tests); the Concrete data contains several high-probability points along the top of the (X,Z) and (Y,Z) border; and the CIFAR data contains spikes at the upper-left and lower-right corners of each 2D view.