The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables

Luca Ambrogioni^{1}, Umut Güçlü^{1}, Marcel van Gerven^{1} and Eric Maris^{1}

1 Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands

* l.ambrogioni@donders.ru.nl

## Abstract

This paper introduces the kernel mixture network, a new method for nonparametric estimation of conditional probability densities using neural networks. We model arbitrarily complex conditional densities as linear combinations of a family of kernel functions centered at a subset of training points. The weights are determined by the outer layer of a deep neural network, trained by minimizing the negative log likelihood. This generalizes the popular quantized softmax approach, which can be seen as a kernel mixture network with square and non-overlapping kernels. We test the performance of our method on two important applications, namely Bayesian filtering and generative modeling. In the Bayesian filtering example, we show that the method can be used to filter complex nonlinear and non-Gaussian signals defined on manifolds. The resulting kernel mixture network filter outperforms both the quantized softmax filter and the extended Kalman filter in terms of model likelihood. Finally, our experiments on generative models show that, given the same architecture, the kernel mixture network leads to higher test set likelihood, less overfitting and more diversified and realistic generated samples than the quantized softmax approach.

## 1 Introduction

Almost all probabilistic machine learning problems can be interpreted in terms of conditional density estimation (CDE). For example, in Bayesian filtering the aim is to estimate the probability density of the current state of a dynamical system given a series of indirect and noise-corrupted past observations [1]. Another popular example is generative modeling, where realistic synthetic data such as images and sounds are generated by expressing their joint distribution as a product of univariate conditional distributions [2, 3]. Deep neural networks (DNNs) are powerful tools for approximating extremely complex functional dependencies. It is therefore natural to use DNNs for solving CDE problems. When the conditional distribution is defined over a finite set, this can be done using a softmax distribution in the outer layer. Two main strategies have been used in the continuous case: mixture density networks and quantized softmax networks. In the mixture approach the conditional density is approximated as a convex mixture of simple density functions whose parameters are determined by the output of a DNN [4, 5, 6]. Conversely, in a quantized softmax network, the continuous variable is discretized into a finite number of bins that are then treated as discrete classes [7]. This strategy requires a careful design of the quantization scheme and results in very sparse gradients, since every training example provides a strong error signal for a single bin only. In other words, the softmax network does not exploit the topology of the real numbers in the backpropagation of the error signals. Despite these shortcomings, the quantized approach is often preferred over mixture density networks [7, 2]. The main reason for this choice is that the softmax distribution can approximate arbitrarily complex conditional densities, as it does not make any parametric assumptions.

Fortunately, in the statistics literature there are several examples of truly nonparametric density estimation methods that do not rely on a quantization scheme and can therefore exploit the continuous nature of random variables [8, 9]. Drawing inspiration from these techniques, we introduce a new CDE method, the kernel mixture network (KMN), which combines the flexibility of the quantized approach with the benefits of a continuous modeling of conditional densities. The main idea is to model the conditional density as a linear combination of kernel functions centered on a subset of training points, where the weights are determined by the output of a DNN. While we will focus on conditional densities defined over either the real numbers or the unit circle, the kernel functions can potentially be defined over arbitrary manifolds and even discrete topological objects such as graphs.

We validate the KMN on two important applications: Bayesian filtering and generative modeling. In the Bayesian filtering example, we show that the KMN can be used together with a CNN architecture to construct new powerful approximate Bayesian filters that can be applied to arbitrarily nonlinear dynamical systems and recover complex non-Gaussian posterior densities. To the best of our knowledge we are the first to introduce this family of CNN-based Bayesian filters, which is closely related to the recently introduced ConvNet smoother [10]. As a second application, we use the KMN for constructing a probabilistic generative model, based on LSTM units, that learns the conditional density of each principal component of the training set given all the higher variance components. We use this generative network to generate realistic images of human faces. The generative approach, which we named LSTM-PCA, is itself novel and is an alternative to more computationally expensive methods such as pixelRNN [3]. In our analysis we show that, given the same LSTM-PCA architecture, the KMN approach significantly outperforms the quantized softmax approach in terms of model likelihood as well as realism and variability of the generated images.

### 1.1 Related work

The KMN is related to several other deep learning methods. The output of a KMN is expressed as a linear combination of kernel functions centered at the training points. This feature is shared with kernel methods such as Gaussian process regression and support vector machines. Kernel methods have recently been combined with DNNs. These hybrid approaches exploit the representational power of DNNs in order to construct complex kernel functions [11, 12]. Our method differs from these approaches in that we do not learn the kernel functions. Instead we use a DNN in order to determine the mixing weights of a fixed set of kernels. Furthermore, our kernels do not need to be positive semi-definite and the KMN can be trained using standard gradient descent, without resorting to stochastic variational learning.

The functional form of the KMN output is similar to radial basis function (RBF) networks [13]. However, in contrast to most RBF methods, the KMN method 1) does not use radial activation functions in the hidden layers, 2) does not require training of the center points, 3) is not limited to radially symmetric kernel functions, 4) uses the negative log likelihood as loss function and 5) uses a whole family of kernels for each center point instead of a single kernel.

The KMN is complementary to the recently introduced geometric deep learning methods, which generalize convolutional architectures to arbitrary input manifolds [14, 15]. In fact, the output of a KMN can be defined on any arbitrary output manifold by appropriately choosing the kernel functions. Therefore, geometric deep learning and KMN can be combined for constructing deep convolutional mappings between manifolds.

Our Bayesian filtering approach can also be interpreted as a new application of ε-free approximate Bayesian inference [16]. It differs from other deep filtering methods because it approximates the posterior distribution from a series of synthetic samples instead of relying on a variational Bayes scheme [17, 18].

We formulate our generative model by modifying existing approaches that rely on autoregressive models such as CharRNN [19], PixelCNN [3], PixelRNN [20], WaveNet [2] and ByteNet [21]. These generative models use either recurrent or convolutional neural networks to model the conditional densities of individual variables given all the previous variables by using the quantized softmax loss function. The procedure can be implemented in a relatively simple manner for one-dimensional sequential data with an inherent ordering. On the other hand, applying these methods to data that lack an inherent ordering is rather complex. For example, PixelRNN and PixelCNN resort to the use of multiple streams that process images horizontally and vertically, as well as complicated masking, to generate an RGB image pixel by pixel while ensuring valid conditional densities. We propose a simple solution to the problem of generating RGB images with autoregressive models that relies on principal component analysis. Our LSTM-PCA approach automatically exploits the heterogeneous variance of the components and also enables faster training and sampling by only considering the high variance components.

## 2 Background

A CDE problem consists of estimating the probability density $p(y \mid \mathbf{x})$ of a random variable $y$ from a set of predictive variables $\mathbf{x}$. Specifically, we need to construct a function $f$ that maps the values of the predictive variables into the space of probability densities over the possible values of $y$, such that

$$
f(y; \mathbf{x}) \approx p(y \mid \mathbf{x}). \tag{1}
$$

In machine learning, the function $f$ is usually part of a large parametric family whose parameters have to be learned from a training set consisting of many $(\mathbf{x}, y)$ pairs. In the following we will review some well known methods for CDE.

### 2.1 Mixture density and quantized softmax networks

A possible approach is to model $p(y \mid \mathbf{x})$ as a convex mixture of simple parameterized probability densities in which both the mixing coefficients and the parameters are determined by the output of a DNN [4]. While the mixture density network has the advantage of exploiting the continuous nature of the random variable, it enforces very strong constraints on the form of the resulting conditional density. A more flexible alternative is to quantize the range of $y$ into a finite number of intervals whose probability can be modeled independently using a softmax distribution [7]. The resulting probability density is a piecewise constant function:

$$
p(y \mid \mathbf{x}) \approx \sum_{k} \phi_k(\mathbf{x}; \mathbf{w})\, u_k(y), \tag{2}
$$

where $\phi_k(\mathbf{x}; \mathbf{w})$ denotes the activation of the $k$-th unit of the outer layer of a DNN (parametrized by a set of weights $\mathbf{w}$) and $u_k(y)$ is the uniform probability density function on the $k$-th interval.

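Given a training set of $N$ independently sampled training pairs $(\mathbf{x}_n, y_n)$, the weights of the neural network can be estimated by minimizing the negative log likelihood:

$$
\mathcal{L}(\mathbf{w}) = -\sum_{n=1}^{N} \log p(y_n \mid \mathbf{x}_n; \mathbf{w}), \tag{3}
$$

where the index $n$ denotes the $n$-th training pair.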
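As an illustration of the piecewise constant density and of its negative log likelihood, the following minimal sketch evaluates both in plain Python. The bin edges, the logits and the helper names (`softmax`, `quantized_density`, `negative_log_likelihood`) are hypothetical choices made for this example; in practice the logits would be produced by the outer layer of a DNN.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def quantized_density(y, logits, edges):
    """Piecewise constant density: softmax mass of the bin containing y,
    divided by the width of that bin (uniform density within the bin).

    edges: increasing bin edges, len(logits) + 1 values (hypothetical layout).
    """
    probs = softmax(logits)
    last = len(probs) - 1
    for k, p in enumerate(probs):
        lo, hi = edges[k], edges[k + 1]
        # Bins are half-open, except the last one which includes its right edge.
        if lo <= y < hi or (k == last and y == hi):
            return p / (hi - lo)
    return 0.0  # y falls outside the quantized range

def negative_log_likelihood(batch_logits, targets, edges):
    """Sum of -log density over the training pairs.
    Raises ValueError if a target lies outside the quantized range."""
    return -sum(
        math.log(quantized_density(y, logits, edges))
        for logits, y in zip(batch_logits, targets)
    )

edges = [0.0, 0.5, 1.0, 2.0]  # three bins of unequal width
loss = negative_log_likelihood(
    [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]], [0.25, 1.5], edges
)
```

Note that each target only touches the probability of the single bin that contains it, which is the source of the sparse error signal discussed in the introduction.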

### 2.2 Kernel density estimation

Kernel density estimation is one of the oldest and best known methods for estimating a probability density from a series of samples without relying on a parametric model [8, 9]. The estimation is performed by centering a symmetric and normalized kernel function on each of the sample points:

$$
p(y) \approx \frac{\sum_{j} a_j\, K(y, y_j)}{\sum_{j} a_j}, \tag{4}
$$

where the positive-valued weights $a_j$ determine the importance of each sample point $y_j$. The kernel function $K$ is usually chosen a priori. A common choice is the Gaussian kernel

$$
K(y, y_j; \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y - y_j)^2}{2\sigma^2}\right), \tag{5}
$$

where the parameter $\sigma$, usually referred to as bandwidth, regulates the width of the kernel. The model parameters, weights and bandwidth can be optimized using frequentist or Bayesian methods [22, 23]. In order to avoid the bandwidth selection problem, we reframe the estimation as follows:

$$
p(y) \approx \frac{\sum_{j}\sum_{k} a_{j,k}\, K(y, y_j; \sigma_k)}{\sum_{j}\sum_{k} a_{j,k}}, \tag{6}
$$

where, instead of optimizing the bandwidth parameters $\sigma_k$, we keep them fixed and extend the model by placing a whole family of kernel functions on each sample point. We refer to this method as kernel mixture density estimation. More generally, we can apply kernel mixture density estimation for any given family of kernel functions $K_k(y, y_j)$. These functions can be defined on sets other than $\mathbb{R}$. For example, the following von Mises kernel can be used to estimate densities on a circle:

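$$
K(\theta, \theta_j; \kappa) = \frac{\exp\!\left(\kappa \cos(\theta - \theta_j)\right)}{2\pi I_0(\kappa)}, \tag{7}
$$

where $I_0$ is the modified Bessel function of order $0$ and the parameter $\kappa$ controls the concentration of the kernel around its center.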
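Kernel mixture density estimation can be sketched in a few lines of plain Python. The example below is an illustration under stated assumptions rather than a reference implementation: the centers, scales and weights are made-up values, the function names are hypothetical, the mixture is normalized by the total weight so that it integrates to one, and the Bessel function is evaluated with a truncated power series that is only adequate for moderate concentration values.

```python
import math

def gaussian_kernel(y, center, sigma):
    """Normalized Gaussian kernel on the real line with bandwidth sigma."""
    z = (y - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bessel_i0(x, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power
    series sum_m ((x/2)^2)^m / (m!)^2 (adequate for moderate x)."""
    total, term = 0.0, 1.0
    for m in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((m + 1) ** 2)
    return total

def von_mises_kernel(theta, center, kappa):
    """Normalized von Mises kernel on the circle with concentration kappa."""
    return math.exp(kappa * math.cos(theta - center)) / (
        2.0 * math.pi * bessel_i0(kappa)
    )

def kernel_mixture_density(y, centers, scales, weights, kernel):
    """Weighted mixture of a family of kernels placed on every center.

    weights[j][k] >= 0 is the weight of the kernel with scales[k] on centers[j];
    dividing by the total weight makes the mixture integrate to one.
    """
    num, den = 0.0, 0.0
    for j, c in enumerate(centers):
        for k, s in enumerate(scales):
            num += weights[j][k] * kernel(y, c, s)
            den += weights[j][k]
    return num / den

# Hypothetical example: two centers, two scales per center.
weights = [[1.0, 2.0], [0.5, 0.0]]
real_density = kernel_mixture_density(0.3, [-1.0, 1.0], [0.5, 1.0], weights, gaussian_kernel)
circ_density = kernel_mixture_density(0.3, [0.5, 3.0], [1.0, 4.0], weights, von_mises_kernel)
```

Switching from the real line to the circle only requires passing a different kernel function; the mixing code is unchanged.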

Traditional methods for estimating conditional densities using kernel density estimation involve the independent estimation of the joint and marginal densities, $p(y, \mathbf{x})$ and $p(\mathbf{x})$ respectively [24]. Unfortunately, in machine learning applications, this approach is often infeasible as it requires the proper estimation of probability densities in high-dimensional spaces.

## 3 The kernel mixture network

The main idea of this paper is to generalize the quantized softmax network using the kernel mixture density estimation approach. From Eq. 2 we can see that the conditional density function of a quantized softmax network is a special case of the expression in Eq. 4 where the kernels are the rectangular functions $u_k(y)$ and the weights are given by $\phi_k(\mathbf{x}; \mathbf{w})$. It is therefore natural to extend this expression to a more general family of kernel functions that can exploit the topology of continuous random variables. We introduce the KMN by using a model of the form given in Eq. 6 in which the weights are determined by the output of a DNN:

$$
p(y \mid \mathbf{x}) \approx \frac{\sum_{j}\sum_{k} \phi_{j,k}(\mathbf{x}; \mathbf{w})\, K_k(y, y_j)}{\sum_{j}\sum_{k} \phi_{j,k}(\mathbf{x}; \mathbf{w})}. \tag{8}
$$

In order to ensure that all weights are non-negative, we assume that all the output nodes of the network have non-negative activation functions. The normalization by the sum of the activations ensures that the mixture is a valid probability density.

The set of center points $\{y_j\}$ is composed of all the values assumed by the variable $y$ in the training set. Consequently, the range of the variable does not need to be known a priori. In other words, the dimensionality of the functional space spanned by the KMN depends on the number of training points. This feature is shared with other nonparametric regression methods such as Gaussian process regression [25]. However, using as many kernels as (a multiple of) the number of training points can be impractical in big datasets. Therefore, we subsample the center points by recursively removing each center point that is closer than a constant $\delta$ to its predecessor.
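One possible reading of this subsampling step is sketched below. It assumes that the center points are first sorted and that the "predecessor" of a point is the last center that was retained; both are interpretations made for this example, as are the function name `subsample_centers` and the threshold value.

```python
def subsample_centers(values, delta):
    """Keep a sorted subset of values in which consecutive centers are at
    least delta apart (assumed reading of the subsampling rule)."""
    centers = []
    for v in sorted(values):
        # Retain v only if it is far enough from the last retained center.
        if not centers or v - centers[-1] >= delta:
            centers.append(v)
    return centers

# Hypothetical training targets and threshold.
centers = subsample_centers([0.5, 0.0, 0.21, 0.05, 0.2], delta=0.1)
```

With this rule the number of centers is bounded by the range of the training targets divided by the threshold, independently of the size of the training set.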

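The loss function of the KMN can be obtained by plugging the model given in Eq. 8 into the expression for the negative log likelihood given in Eq. 3:

$$
\mathcal{L}(\mathbf{w}) = -\sum_{n=1}^{N} \log \frac{\sum_{j}\sum_{k} \phi_{j,k}(\mathbf{x}_n; \mathbf{w})\, K_k(y_n, y_j)}{\sum_{j}\sum_{k} \phi_{j,k}(\mathbf{x}_n; \mathbf{w})}. \tag{9}
$$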

This cost function can be optimized with respect to the neural network parameters using standard back-propagation techniques.
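The evaluation of this loss can be sketched as follows, assuming Gaussian kernels and non-negative output activations that are normalized by their sum. The activations are passed in as plain nested lists standing in for the output of a DNN; the function names and the numerical values are hypothetical, and an actual implementation would compute the same quantity inside an automatic differentiation framework so that it can be minimized by back-propagation.

```python
import math

def gaussian_kernel(y, center, sigma):
    """Normalized Gaussian kernel on the real line with bandwidth sigma."""
    z = (y - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kmn_negative_log_likelihood(activations, targets, centers, sigmas):
    """Negative log likelihood of a kernel mixture network.

    activations[n][j][k] >= 0 is the (stand-in) network output for training
    pair n, center j and bandwidth k; targets[n] is the observed value.
    """
    total = 0.0
    for phi, y in zip(activations, targets):
        num = sum(
            phi[j][k] * gaussian_kernel(y, c, s)
            for j, c in enumerate(centers)
            for k, s in enumerate(sigmas)
        )
        den = sum(sum(row) for row in phi)  # normalizes the mixing weights
        total -= math.log(num / den)
    return total

# Hypothetical setup: two centers, two bandwidths, two training pairs.
centers = [-1.0, 1.0]
sigmas = [0.5, 1.0]
activations = [
    [[2.0, 0.1], [0.0, 0.3]],
    [[0.0, 0.2], [1.5, 0.4]],
]
targets = [-0.8, 1.1]
loss = kmn_negative_log_likelihood(activations, targets, centers, sigmas)
```

Because every kernel overlaps with its neighbors, a single training pair contributes to the weights of many centers at once, in contrast to the single-bin error signal of the quantized softmax loss.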

Note that we have complete freedom in choosing the kernel functions as long as they define valid probability densities. This property can be used for estimating conditional densities on arbitrary manifolds or even on discrete objects such as graphs. Importantly, the kernel itself does not need to be differentiable since the gradients are computed with respect to the weights.

## 4 Experiments

In this section we validate the performance of the KMN on a Bayesian filtering problem and on a generative modeling problem. We compare the performance of the kernel mixture approximate Bayesian filter with both its quantized analog and the extended Kalman filter (EKF), the most widely used method for nonlinear filtering [26]. In the generative modeling application we use the LSTM-PCA approach for generating grayscale and color images of human faces.

### 4.1 Applications to Bayesian filtering

Bayesian filtering is a special case of Bayesian inference where the current state of a latent time series has to be estimated from a series of past indirect measurements [1]. The structure of a filtering problem allows for a convenient recursive reformulation in terms of the Bayesian filtering equations [1]. Unfortunately these equations are very often intractable. In these cases, the solution has to be approximated using methods such as the EKF. Here, we introduce the use of KMN to estimate the density from a large set of simulated samples drawn from the prior distribution. Specifically, we sample latent time series from the prior distribution and, subsequently, synthetic observations from the likelihood. This approach is a probabilistic extension of our recently introduced ConvNet smoother [10] and is an application of the general framework of ε-free approximate Bayesian inference [16].
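The construction of such a simulated training set can be sketched as follows. This is only an illustration of the sampling scheme: the Gaussian random walk prior, the noise levels, the series length and the function name `simulate_training_pair` are hypothetical stand-ins and do not correspond to the dynamical models used in the experiments below.

```python
import random

def simulate_training_pair(n_steps, obs_sd, rng):
    """Draw a latent time series from the prior, then noisy observations
    from the likelihood.

    The Gaussian random walk prior is a hypothetical stand-in model.
    """
    latent, state = [], 0.0
    for _ in range(n_steps):
        state += rng.gauss(0.0, 0.1)  # one step of the stand-in prior dynamics
        latent.append(state)
    observations = [x + rng.gauss(0.0, obs_sd) for x in latent]
    return observations, latent

rng = random.Random(0)
dataset = [simulate_training_pair(50, 0.5, rng) for _ in range(100)]

# Filtering pairs: observations up to time t as input, latent state at t as target.
inputs_targets = [
    (obs[: t + 1], lat[t]) for obs, lat in dataset for t in range(len(lat))
]
```

A network trained on such pairs only ever sees samples from the generative model, so neither the prior nor the likelihood needs to be available in closed form.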

In our first simulation, we generated a latent time series by integrating the following stochastic oscillator equation:

(10) |

The stochastic dynamical process was discretized using the Euler-Maruyama scheme with integration step equal to seconds. The parameters of the dynamical model were: , , and . We simulated training time series and validation time series. Noisy observations were generated by adding Gaussian white noise to the time series. We trained a deep CNN to estimate the probability of the latent state given the past noisy observations. The details of the architecture are the same as used in [10]. The only difference is that here the output layer determined the weights of the kernel mixture. We used rectified quadratic units as activation functions on the final layer. As kernels we used Gaussian functions with standard deviations ranging from to in steps of .

Figure 1 shows the resulting marginal posterior filter densities of an example trial. As we would expect from a filtering problem, the posterior density becomes tighter as time progresses since the filter can use increasingly more data. Importantly, the KMN can model very non-Gaussian distributions, in this case recovering the skewness of the posterior distribution.

In Fig. 2 we compare the performance of the KMN filter with the quantized CNN filter (same architecture, bin size equal to ) and the more conventional EKF. Panel A shows the validation set loss (negative log likelihood) of the kernel mixture and the quantized networks as a function of the number of training iterations. The KMN converges faster than its quantized alternative and reaches a better local minimum. The scatter plots in panels B and C show the comparisons of the likelihood of the trained KMN with the likelihood of respectively the trained quantized CNN and the EKF. Our KMN outperforms both alternative methods in almost all the validation trials.

In our second simulation we give an example of KMN conditional density estimation on a manifold other than $\mathbb{R}$. In particular, we use the KMN Bayesian filter for estimating the phase of an anharmonic wave with random nonlinear waveform from noisy measurements. The phase is a circular variable that can be parametrized with an angle $\theta \in [0, 2\pi)$. Our latent state was a phase with uniform random initial value and fixed linear growth: . The indirect measurements were obtained as follows:

(11) |

where the noise term is Gaussian white noise (sd = 2). The random function is given by

(12) |

where the Taylor coefficients and were sampled from truncated t distributions (df = 3, from to ) and the coefficients and were sampled from t distributions (df = 3). Note that the likelihood function of would be very challenging to obtain in closed form; however, this is not a problem for our approach (see [10] for more details). In order to exploit the circularity of the phase variable we used the von Mises kernels given in Eq. 7, with scale ranging from to in steps of .

Figure 3 shows the posterior filter density for an example trial. As expected, the filter is initially very uncertain, but it quickly converges and accurately tracks the underlying phase. Note that the resulting conditional densities are defined over a circle, since the phase $0$ is equivalent to the phase $2\pi$.

### 4.2 Applications to generative modeling

We performed two face generation experiments to compare the performance of the KMN with a quantized softmax approach. In the first experiment we trained two networks to address the problem of grayscale face generation. In the second experiment we tackled the more complex case of colored face generation. The networks shared the same LSTM-PCA architecture, only differing in their last layers. Specifically, they had two 512-unit LSTM layers followed by either a rectified linear layer (in the case of KMN) or a softmax layer (in the case of quantized softmax).

We first performed PCA on the aligned face images in the CelebA dataset after resizing them to pixels and determined the principal components of the faces explaining 90% of the variance in the data, corresponding to 95 and 125 components in the cases of grayscale faces and color faces respectively. The task of the networks was to predict the loading of each principal component given all the loadings of the higher variance components. For the softmax model we quantized the components to 256 bins with equal widths. We used the training split of the CelebA dataset to train the models, the test split to test them and the validation split to monitor the loss during training. We trained all the models with the Adam optimizer [27] for 100 epochs.
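The two data preparation steps described above (selecting the components that explain a given fraction of the variance, and quantizing a loading into equal-width bins for the softmax model) can be sketched as follows. The helper names, the toy variances and the quantization range are hypothetical; in particular, the range over which the loadings were quantized is not stated here and is treated as an input.

```python
def n_components_for_variance(variances, fraction=0.9):
    """Smallest number of highest-variance components whose cumulative
    variance reaches the requested fraction of the total variance."""
    ordered = sorted(variances, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0.0
    for count, v in enumerate(ordered, start=1):
        running += v
        if running >= threshold:
            return count
    return len(ordered)

def quantize_equal_width(value, low, high, n_bins=256):
    """Index of the equal-width bin on [low, high] that contains value;
    values outside the range are clipped to the first or last bin."""
    if high <= low:
        raise ValueError("high must be larger than low")
    index = int((value - low) / (high - low) * n_bins)
    return min(max(index, 0), n_bins - 1)

# Toy component variances (eigenvalues of the data covariance), for illustration.
n_kept = n_components_for_variance([5.0, 3.0, 1.0, 1.0], fraction=0.9)
bin_index = quantize_equal_width(0.0, low=-1.0, high=1.0)
```

The retained components, ordered by decreasing variance, then define the sequence over which each loading is predicted from all the preceding ones.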

In both the grayscale and the color face generation experiments, the validation loss during training indicates that the softmax models overfitted the training set after approximately 50 epochs, whereas the KMN models seemed to continue learning without overfitting by the 100th epoch. Furthermore, we observed large differences between the negative log likelihoods of the softmax model and the KMN model on the test set in both experiments in favor of the KMN model (Figures 4 and 5, bottom panel). In order to further evaluate the performance of the two methods we generated face images (from epoch 50 for softmax, i.e. before overfitting, and from epoch 100 for KMN). Both the grayscale and the color faces generated by the KMN models appeared more realistic. Specifically, they were sharper and less blurry and had fewer artifacts compared to those generated by the softmax models (Figures 4 and 5, top panel). Furthermore, the KMN-generated faces were visibly more diverse than those generated by the softmax models.

## 5 Discussion

We introduced a new method for the nonparametric estimation of conditional probability density functions using neural networks. The KMN combines the flexibility of the popular quantized softmax approach with the regularizing properties of kernel density estimation methods. We showed that the KMN can be used for constructing Bayesian filters that track very complex probability densities on manifolds. Furthermore, we used the KMN for generating images of human faces using the newly introduced LSTM-PCA network architecture. We showed that, given the same architecture, the KMN approach is less likely to overfit than the quantized softmax approach, generating images that are more realistic and more diversified.

Note that the KMN can be used together with several other popular generative methods, such as PixelCNN [3], PixelRNN [20] and WaveNet [2]. From our simulations, it is likely that the KMN approach will substantially improve the quality and diversity of the samples generated using these methods. The KMN approach may also be used together with models that estimate probability distributions over discrete graphs. This can be done by using diffusion kernels over the graph that encode the average distance between nodes [28]. For example, image recognition techniques can exploit the semantic similarities between classes as provided by lexical databases such as WordNet [29]. Also note that, while we have been using the KMN solely for density estimation, the method can easily be used for reconstructing arbitrary scalar functions on manifolds simply by using a different loss function, such as the mean squared loss.

## References

- 1. S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
- 2. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
- 3. A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” Advances in Neural Information Processing Systems, 2016.
- 4. C. M. Bishop, “Mixture density networks,” Technical Report NCRG/94/004, 1994.
- 5. L. Theis and M. Bethge, “Generative image modeling using spatial LSTMs,” Advances in Neural Information Processing Systems, 2015.
- 6. T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications,” arXiv preprint arXiv:1701.05517, 2017.
- 7. A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
- 8. M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” The Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956.
- 9. E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
- 10. L. Ambrogioni, U. Güçlü, E. Maris, and M. van Gerven, “Estimating nonlinear dynamics with the ConvNet smoother,” arXiv preprint arXiv:1702.05243, 2017.
- 11. A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 370–378, 2016.
- 12. A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing, “Stochastic variational deep kernel learning,” Advances in Neural Information Processing Systems, pp. 2586–2594, 2016.
- 13. M. D. Buhmann, Radial Basis Functions: Theory and Implementations, vol. 12. Cambridge University Press, 2003.
- 14. F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” arXiv preprint arXiv:1611.08402, 2016.
- 15. M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” arXiv preprint arXiv:1611.08097, 2016.
- 16. G. Papamakarios and I. Murray, “Fast ε-free inference of simulation models with Bayesian conditional density estimation,” Advances in Neural Information Processing Systems, 2016.
- 17. R. G. Krishnan, U. Shalit, and D. Sontag, “Deep Kalman filters,” arXiv preprint arXiv:1511.05121, 2015.
- 18. E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski, “Black box variational inference for state space models,” arXiv preprint arXiv:1511.07367, 2015.
- 19. A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
- 20. A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
- 21. N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, “Neural machine translation in linear time,” arXiv preprint arXiv:1610.10099, 2016.
- 22. B. A. Turlach, Bandwidth Selection in Kernel Density Estimation: A Review. Université catholique de Louvain, 1993.
- 23. X. Zhang, M. L. King, and R. J. Hyndman, “A Bayesian approach to bandwidth selection for multivariate kernel density estimation,” Computational Statistics & Data Analysis, vol. 50, no. 11, pp. 3009–3031, 2006.
- 24. J. G. De Gooijer and D. Zerom, “On conditional density estimation,” Statistica Neerlandica, vol. 57, no. 2, pp. 159–176, 2003.
- 25. C. E. Rasmussen, Gaussian Processes for Machine Learning. The MIT Press, 2006.
- 26. H. W. Sorenson, Kalman Filtering: Theory and Application. IEEE, 1985.
- 27. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- 28. R. I. Kondor and J. Lafferty, “Diffusion kernels on graphs and other discrete input spaces,” International Conference on Machine Learning, vol. 2, pp. 315–322, 2002.
- 29. G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.