FFT-Based Fast Bandwidth Selector for Multivariate Kernel Density Estimation [1]

[1] The up-to-date R source codes are included as a supplemental material. One can replicate all the figures and all the data shown in the tables.

Artur Gramacki (a.gramacki@issi.uz.zgora.pl), Institute of Control and Computation Engineering, University of Zielona Góra, ul. Licealna 9, 65-417 Zielona Góra, Poland
Jarosław Gramacki (j.gramacki@ck.uz.zgora.pl), Computer Center, University of Zielona Góra, ul. Licealna 9, 65-417 Zielona Góra, Poland
Abstract

The performance of multivariate kernel density estimation (KDE) depends strongly on the choice of bandwidth matrix. The high computational cost of its estimation strongly motivates the development of fast and accurate methods. One such method is based on the Fast Fourier Transform. However, the currently available implementation works very well only for univariate KDE, and its multivariate extension suffers from a serious limitation: it can accurately operate only with diagonal bandwidth matrices. A more general solution is presented here, in which the above-mentioned limitation is relaxed. Moreover, the presented solution can easily be adapted to the efficient computation of integrated density derivative functionals involving an arbitrary derivative order. Consequently, bandwidth selection for kernel density derivative estimation is also supported. The practical usability of the new solution is demonstrated by comprehensive numerical simulations.

keywords:
multivariate kernel density estimation, density derivative functionals, bandwidth selection, Fast Fourier Transform, nonparametric estimation
journal: Computational Statistics & Data Analysis

1 Introduction

Kernel density estimation (KDE) is a very important statistical technique with many practical applications. It has been applied successfully to both univariate and multivariate problems. There exists extensive literature on this issue, including several classical monographs, see Silverman:1986 (), Scott:1992 () and Wand-1995 ().

A general form of the d-dimensional multivariate kernel density estimator is

\hat{f}(x; H) = n^{-1} \sum_{i=1}^{n} K_H(x - X_i)    (1)

where

K_H(u) = |H|^{-1/2} K(H^{-1/2} u)    (2)

and H is the bandwidth or smoothing matrix, d is the problem dimensionality, x = (x_1, x_2, \ldots, x_d)^T, and X_i = (X_{i1}, X_{i2}, \ldots, X_{id})^T, i = 1, 2, \ldots, n, is a sequence of independent identically distributed (iid) d-variate random variables drawn from a (usually unknown) density function f. Here K and K_H are the unscaled and scaled kernels, respectively. In most cases the kernel K has the form of a standard multivariate normal density.
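To fix ideas, the estimator of Eqs. (1)-(2) with a Gaussian kernel can be sketched in a few lines. The snippet below is an illustrative NumPy sketch (the paper's own supplemental code is in R); all names are chosen for this example only.

```python
import numpy as np

def kde(x_eval, X, H):
    """Evaluate the multivariate KDE of Eq. (1) at the rows of x_eval.

    X : (n, d) data matrix; H : (d, d) symmetric positive definite
    bandwidth matrix; the kernel K is the standard d-variate normal,
    so K_H(u) = |H|^{-1/2} K(H^{-1/2} u) is the N(0, H) density at u.
    """
    n, d = X.shape
    Hinv = np.linalg.inv(H)
    norm = 1.0 / (n * np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H)))
    out = np.empty(len(x_eval))
    for k, x in enumerate(x_eval):
        U = x - X                                   # (n, d) differences x - X_i
        q = np.einsum('ij,jk,ik->i', U, Hinv, U)    # quadratic forms u^T H^{-1} u
        out[k] = norm * np.exp(-0.5 * q).sum()
    return out
```

Note that this direct evaluation already exhibits the O(mn) cost discussed below.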

The univariate kernel density estimator for a random sample X_1, X_2, \ldots, X_n drawn from a common and usually unknown density function f is given by

\hat{f}(x; h) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i)    (3)

where

K_h(u) = h^{-1} K(u/h)    (4)

and h is the bandwidth, which is a positive real number. The scaled (K_h) and unscaled (K) kernels are related as in Eq. (4). Note that the notation used in Eq. (1) is not a direct extension of the univariate notation in Eq. (3), since in the one-dimensional case H reduces to h^2, so we are dealing with 'squared bandwidths' here.

It seems that both uni- and multivariate KDE techniques have reached maturity, and recent developments in this field are primarily focused on computational problems. There are two main computational problems related to KDE: (a) the fast evaluation of the kernel density estimate \hat{f}, and (b) the fast estimation (under certain criteria) of the optimal bandwidth matrix (or scalar in the univariate case). As for the first problem, a number of methods have been proposed; see for example Raykar:2010 () for a comprehensive review. The second problem has received relatively less attention in the literature. An attempt at using the Message Passing Interface (MPI) was presented in Lukasik:2007 (). In Raykar:2006 () the authors give an ε-approximation algorithm, where the constant ε controls the desired arbitrary accuracy. Other techniques, such as the use of Graphics Processing Units (GPUs), have also been employed (Andrzejewski:2013, ). In this paper we are concerned with fast estimation of the optimal bandwidth and are interested in the multivariate case only. However, our results can also be easily adapted to the univariate case.

It is obvious from Eq. (1) that the naive direct evaluation of the KDE at m evaluation points for n data points requires mn kernel evaluations. Evaluation points can of course be the same as data points, and then the computational complexity is O(n^2), making the evaluation very expensive, especially for large datasets and higher dimensions.

As for finding the optimal bandwidth, the computational problems are even more evident. Typically, completing all the required computations for this task requires some form of numerical optimization. Usually, the computational complexity of evaluating a typical objective function is O(n^2). During the optimization process the objective function must be evaluated many times (often more than a hundred or so), making the problem of finding the optimal bandwidth very expensive, even for moderate data dimensionalities and sizes.

In this paper we are concerned with an FFT-based method that was originally described in Wand-1994a (). In (Wand-1995, Appendix D) an interesting illustrative toy example is presented. From now on this method will be called Wand's algorithm. It can be used for KDE evaluation as well as for the bandwidth selection problem, and it works very well for the univariate case given in Eq. (3). Unfortunately, its multivariate extension does not support unconstrained bandwidth matrices (that is, H ∈ F, where F is the set of all symmetric, positive definite d × d matrices). The method supports only the more restricted constrained bandwidth matrices (that is, H ∈ D, where D ⊂ F is the set of all positive definite diagonal matrices of the form H = diag(h_1^2, \ldots, h_d^2)). This limitation was successfully overcome by the authors, and the main results are presented in Gramacki:2016 (). In this paper we extend those results to the problem of fast (FFT-based) estimation of unconstrained bandwidth matrices.

To the best of our knowledge, our paper is the first where this problem is presented and successfully solved using an FFT-based approach. In this work we use the excellent results presented in Chacon:2015 (), particularly those which significantly simplify the computation of integrated density derivative functionals (IDDFs) involving an arbitrary derivative order (for details see Section 5). IDDFs are crucial elements in almost every modern bandwidth selection algorithm.

The remainder of the paper is organized as follows: in Section 2, based on a simple example, we demonstrate the problem. In Section 3 we give an overview of the most popular and most frequently used bandwidth selectors. In Section 4 we give details of a complete FFT-based algorithm for fast estimation of unconstrained bandwidth matrices. To make the presentation of our algorithm clear, we describe it on the basis of one of the simplest bandwidth selection algorithms, namely least squares cross validation (LSCV). In Section 5 we extend our results to the IDDFs. In Section 6 we give results of numerical experiments based on both synthetic and real datasets. In Section 7 we conclude the paper.

2 Problem demonstration

As was mentioned in Section 1, Wand's algorithm does not support unconstrained bandwidth matrices, which considerably limits its practical usability. In this short demonstration we use a sample dataset, Unicef, presented in more detail in Section 6.2. In Fig. 1(a) we show the reference density, where the bandwidth was obtained by a direct (i.e., non-FFT-based) implementation of the LSCV algorithm, briefly presented in Section 3. After numerical minimization of the resulting objective function we get the sought bandwidth. In Fig. 1(b) one can see the behavior of Wand's original algorithm (i.e., FFT-based) when the minimization of the objective function proceeds over F. The density is totally corrupted. In Fig. 1(c) we show the reference density, where the bandwidth was obtained by a direct (i.e., non-FFT-based) implementation of the LSCV algorithm when the minimization of the objective function proceeds over D. Finally, in Fig. 1(d) we show the behavior of Wand's original algorithm when H ∈ D. Figures 1(c) and 1(d) are practically identical, which confirms that the original version of Wand's algorithm is adequate only for constrained bandwidth matrices. Some minor differences between Figs. 1(c) and 1(d) are due to the binning of the original input data, but they are of no practical relevance.


(a)

(b)

(c)

(d)
Figure 1: Demonstration of the behavior of Wand's original algorithm. (a) the reference density; (b) the behavior of Wand's original algorithm when the minimization of the objective function proceeds over F — the density is totally corrupted; (c) the reference density, where the bandwidth was obtained by a direct (i.e., non-FFT-based) implementation of the LSCV algorithm when the minimization of the objective function proceeds over D; (d) the behavior of Wand's original algorithm when H ∈ D. Figures (c) and (d) are in fact almost identical.

The estimated bandwidth matrices used to plot densities shown in Figs. 1(a)–1(d) are as follows:

(5)

It is easy to notice that in this particular example the off-diagonal entries of the FFT-based estimate are positive, while the 'true' entries should be negative, as in the reference bandwidth. In the context of this example, this means that the individual kernels in Eq. (1) used for computing the density are (incorrectly) 'rotated' by about 90 degrees, as can be seen in Fig. 2. The kernels generated by the reference bandwidth correctly follow the north-west dataset orientation, while the corrupted bandwidth incorrectly generates north-east oriented kernels.


(a)

(b)
Figure 2: Visualization of kernels used for computing the density for the sample dataset Unicef (marked as small filled circles): (a) kernels generated by the reference bandwidth, (b) kernels generated by the corrupted bandwidth.

3 Bandwidth selectors

The accuracy of kernel density estimators depends very strongly on the bandwidth. In the univariate case the bandwidth is a scalar which controls the amount of smoothing. In the multivariate case the bandwidth is a matrix which controls both the amount and the orientation of smoothing. This matrix can be defined at various levels of complexity. The simplest case is when a positive constant scalar multiplies the identity matrix, that is, H = h^2 I_d with h > 0. Another level of sophistication is H = diag(h_1^2, \ldots, h_d^2). These two forms are often called constrained. In the most general form the bandwidth is unconstrained, that is, H ∈ F. A very important problem to solve before evaluating Eq. (1) is finding the optimal (under certain criteria) bandwidth, and many original methods have been developed so far, most of them being automatic or data-driven bandwidth selectors.

The problem of selecting the scalar bandwidth in univariate kernel density estimation is quite well understood. A number of methods exist that combine good theoretical properties with strong practical performance. See for example Jones:1996a (), Jones:1996b () and Wand-1995 (), where one can find a comprehensive history of these selectors. Many of these univariate selectors can be extended to the multivariate case in a relatively straightforward fashion if H is constrained (see Wand-1994b (), Sain:1994 ()). However, if H is unconstrained, such a generalization is not so easy. A comprehensive analysis of unconstrained bandwidth selectors was made mainly in the following works: Duong:2004 (), Duong:2005b (), Duong:2005a (), Chacon:2010 (), Chacon:2011 (), Chacon:2013 () and Chacon:2015 (). They provide references to all the main bandwidth selectors, so we do not reproduce them here and recommend that readers interested in details consult these references.

Three major types of bandwidth selectors are: (a) methods which use very simple, easy-to-compute mathematical formulas; they were developed to cover a wide range of situations, but do not guarantee being close enough to the optimal (under certain criteria) bandwidth; they are often called rules-of-thumb (see Silverman:1986 () and Scott:1992 ()); (b) methods based on cross-validation (CV) ideas and more precise mathematical arguments (see Rudemo:1982 () and Bowman:1984 ()); they require much more computational effort but, in return, give bandwidths which are more accurate for a wider class of density functions; the three classical variants of the CV methods are least squares cross validation (LSCV), sometimes called unbiased cross validation (UCV), see Scott:1987 (), biased cross validation (BCV), see Scott:1987 (), and smoothed cross validation (SCV), see Hall:1992 (); (c) methods based on plugging in estimates of some unknown quantities that appear in formulas for the asymptotically optimal bandwidth; they are often called plug-in (PI), see Park:1990 () and Sheather:1991 ().

All CV-like selectors are based on estimating the MISE or AMISE error criteria and then minimizing an objective function. In the case of the LSCV method this function is defined as follows:

\mathrm{LSCV}(H) = \int \hat{f}(x; H)^2 \, dx - 2 n^{-1} \sum_{i=1}^{n} \hat{f}_{-i}(X_i; H)    (6)

where

\hat{f}_{-i}(x; H) = (n-1)^{-1} \sum_{j=1, j \neq i}^{n} K_H(x - X_j)    (7)

is the leave-one-out estimator of f(x). Then, the LSCV objective function can be expressed as follows (for details, see Wand-1995 ()):

\mathrm{LSCV}(H) = n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} (K_H * K_H)(X_i - X_j) - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} K_H(X_i - X_j)    (8)

where * denotes the convolution operator. For most practical implementations normal kernels are used, i.e., K = φ, and then we obviously have K_H * K_H = K_{2H}. The LSCV bandwidth matrix is the minimizer of LSCV(H), that is,

\hat{H}_{\mathrm{LSCV}} = \arg\min_{H \in F} \mathrm{LSCV}(H)    (9)
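For concreteness, the direct O(n^2) evaluation of the LSCV criterion of Eq. (8) with Gaussian kernels, using the identity K_H * K_H = K_{2H}, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's R implementation; function names are chosen for this example only.

```python
import numpy as np

def mvn_pdf(U, H):
    """N(0, H) density evaluated at each row of U."""
    d = U.shape[1]
    Hinv = np.linalg.inv(H)
    q = np.einsum('ij,jk,ik->i', U, Hinv, U)
    return np.exp(-0.5 * q) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H))

def lscv(H, X):
    """Direct O(n^2) evaluation of the LSCV criterion of Eq. (8)
    for the Gaussian kernel, where K_H * K_H = K_{2H}."""
    n, d = X.shape
    U = (X[:, None, :] - X[None, :, :]).reshape(-1, d)  # all pairwise differences
    term1 = mvn_pdf(U, 2.0 * H).sum() / n**2            # convolution part
    KU = mvn_pdf(U, H).reshape(n, n)
    np.fill_diagonal(KU, 0.0)                           # drop the i == j terms
    term2 = 2.0 * KU.sum() / (n * (n - 1))              # leave-one-out part
    return term1 - term2
```

Minimizing this function over symmetric positive definite H (e.g., with a Nelder-Mead search, as done in the paper) yields the LSCV bandwidth of Eq. (9).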

In this paper the FFT-based algorithm is presented in detail on the basis of the LSCV method, as it is very popular among practitioners, mainly due to its intuitive motivation. However, this variant of the CV selector has some serious drawbacks recalled by many authors, such as many local minima in the objective function and high variability (in the sense that for different datasets from the same distribution it will typically give considerably different answers). Extending our results to other CV selectors, as well as to the PI selector, is also possible; see Section 5 for details.

4 FFT-based algorithm

Preliminary work on using the FFT for the univariate kernel density estimation defined in Eq. (3) was done in Silverman:1982 (). Based on this idea, in Wand-1994a () the author proposed a more universal method for both the uni- and multivariate cases and also noted the possibility of using the FFT algorithm for the computation of kernel functional estimates, as these are particularly important in bandwidth selection for KDE. Based on this note and using the results given in Gramacki:2016 () and in Chacon:2015 (), in this paper we present a fast method for optimal unconstrained bandwidth selection. Below we present a complete procedure for the LSCV bandwidth selector where the FFT-based approach is used.

From the viewpoint of the main subject of this paper, Eq. (8) has to be rewritten in a slightly different form. Our goal is to remove the unwanted condition j ≠ i in the second double summation. For sufficiently large n (several tens in practical applications) it is safe to assume n − 1 ≈ n. Under this assumption, we can write the objective function in the following form

\mathrm{LSCV}(H) \approx n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} T_H(X_i - X_j) + 2 n^{-1} K_H(0)    (10)

where

T_H = K_H * K_H - 2 K_H    (11)

Now we are interested in fast computation of the following part of Eq. (10):

\psi(H) = \sum_{i=1}^{n} \sum_{j=1}^{n} T_H(X_i - X_j)    (12)

The computational complexity of Eq. (12) is clearly O(n^2). Thus, its fast and accurate computation plays a crucial role in the bandwidth selection problem.

It is also easy to notice that Eq. (12) is a zero-th order derivative functional of the general r-th order form given by

\psi_r(H) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \mathrm{D}^{\otimes r} T_H \right)(X_i - X_j)    (13)

Here r is the derivative order (an even number), D is the gradient operator, and the notation with the ⊗ symbol is explained in detail in Section 5, where the above-mentioned generalization is presented.

In the first step, multivariate linear binning of the input random variables X_1, \ldots, X_n is required. Binning is a kind of data discretization, as described in (Wand-1994a, Section 3), and can be computed using a fast algorithm obtained by extending the 'integer division' idea of Fan:1994 (). The binned approximation of Eq. (12) is

\tilde{\psi}(H) = \sum_{\mathbf{j}} \sum_{\mathbf{l}} c_{\mathbf{j}} c_{\mathbf{l}} T_H(g_{\mathbf{j}} - g_{\mathbf{l}})    (14)

where g_j are equally spaced grid points and c_j are grid counts. Grid counts are obtained by assigning certain weights to the grid points, based on neighbouring observations. In other words, each grid point is accompanied by a corresponding grid count.
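As an illustration of the linear binning step, the one-dimensional case can be sketched as follows; the multivariate version applies the same weight splitting per coordinate and multiplies the per-axis weights. This is a simplified NumPy sketch (not the paper's R code) which assumes all observations lie strictly inside the grid range.

```python
import numpy as np

def linbin(x, grid):
    """1-D linear binning: each observation splits its unit weight
    between the two neighbouring grid points, in proportion to its
    distance from them. Assumes grid is equally spaced and every x
    lies strictly inside [grid[0], grid[-1])."""
    g0 = grid[0]
    delta = grid[1] - grid[0]               # binwidth, cf. Eq. (16)
    counts = np.zeros(len(grid))
    pos = (x - g0) / delta                  # fractional grid position
    left = np.floor(pos).astype(int)        # index of left neighbour
    frac = pos - left                       # distance to left neighbour
    np.add.at(counts, left, 1.0 - frac)     # weight to the left point
    np.add.at(counts, left + 1, frac)       # weight to the right point
    return counts
```

The grid counts sum to the sample size n, and points lying exactly on a grid node contribute their whole weight to that node.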

The following notation is used (taken from Wand-1994a ()): for k = 1, \ldots, d, let g_{k,1} < g_{k,2} < \cdots < g_{k,M_k} be an equally spaced grid in the k-th coordinate direction containing the k-th coordinates of the grid points. Here M_k is a positive integer representing the grid size in direction k. Let

g_{\mathbf{j}} = (g_{1,j_1}, g_{2,j_2}, \ldots, g_{d,j_d}), \quad 1 \le j_k \le M_k    (15)

denote the grid point indexed by \mathbf{j} = (j_1, \ldots, j_d), and let the k-th binwidth be denoted by

\delta_k = \frac{g_{k,M_k} - g_{k,1}}{M_k - 1}    (16)

In the second step, the summation inside the brackets in Eq. (14) is rewritten so that it takes the form of a convolution

\tilde{\psi}(H) = \sum_{\mathbf{j}} c_{\mathbf{j}} \left( \sum_{\mathbf{l}} c_{\mathbf{l}} \, k_{\mathbf{j}-\mathbf{l}} \right)    (17)

where

k_{\mathbf{i}} = T_H(i_1 \delta_1, i_2 \delta_2, \ldots, i_d \delta_d)    (18)

In the third step, we compute the convolution between c and k using the FFT algorithm, which requires only O(M \log M) operations (where M = M_1 \cdots M_d), compared with the O(M^2) operations required for direct computation of Eq. (14). To compute the convolution between c and k, they must first be reshaped (zero-padded) according to precise rules which are described in detail in Gramacki:2016 (). Here, for simplicity, only the two-dimensional variant is presented, as the extension to higher dimensions is straightforward. We have

(19)

and

(20)

where in (20) the entry corresponding to the zero offset is placed appropriately relative to the zero blocks. The sizes of the zero matrices are chosen so that, after reshaping, c and k both have the same dimensions P_1 × P_2 (highly composite integers; typically powers of 2). The sizes P_1 and P_2 are computed according to the following equation

(21)

Now, to evaluate the summations inside the brackets in Eq. (14), we can apply the discrete convolution theorem; that is, we perform the following operations:

S = \mathcal{F}^{-1}\left( \mathcal{F}(C) \odot \mathcal{F}(K) \right)    (22)

where \mathcal{F} stands for the Fourier transform and \mathcal{F}^{-1} is its inverse. The sought convolution corresponds to a subset of S in Eq. (22) divided by the product P_1 P_2 (the so-called normalization), that is,

(23)

where, in the two-dimensional case, the subset consists of the appropriate ranges of rows and columns of the matrix S.

In the fourth step, to complete the calculation of Eq. (14), the resulting d-dimensional array s needs to be multiplied element-wise by the corresponding grid counts and summed to obtain \tilde{\psi}(H), that is,

\tilde{\psi}(H) = \sum_{\mathbf{j}} \left( c \odot s \right)_{\mathbf{j}}    (24)

where ⊙ denotes element-wise multiplication and the sum runs over all grid points. Finally, the sought LSCV(H) in Eq. (10) can be easily and effectively calculated.
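Steps two to four above can be sketched compactly: zero-pad the grid counts c and the kernel weights k to a common power-of-two size, multiply their FFTs, and extract the valid part of the linear convolution. The following is a minimal two-dimensional NumPy illustration with edge handling simplified relative to Eqs. (19)-(23); it is a sketch, not the paper's R implementation.

```python
import numpy as np

def conv_fft(c, k):
    """Compute s_j = sum_l c_l * k_(j-l) on a 2-D grid via the
    discrete convolution theorem (third step of the algorithm).

    c : (M1, M2) grid counts;
    k : (2*L1+1, 2*L2+1) kernel weights at offsets -L1..L1, -L2..L2.
    """
    M1, M2 = c.shape
    L1, L2 = (k.shape[0] - 1) // 2, (k.shape[1] - 1) // 2
    # zero-pad both arrays to a common highly composite size
    # (here simply the next power of two large enough to avoid wrap-around)
    P1 = int(2 ** np.ceil(np.log2(M1 + 2 * L1 + 1)))
    P2 = int(2 ** np.ceil(np.log2(M2 + 2 * L2 + 1)))
    S = np.real(np.fft.ifft2(np.fft.fft2(c, (P1, P2)) *
                             np.fft.fft2(k, (P1, P2))))
    # the valid part of the linear convolution starts at offset (L1, L2)
    return S[L1:L1 + M1, L2:L2 + M2]
```

Multiplying the returned array element-wise by c and summing all entries then gives the binned functional of Eq. (24).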

In practical implementations, the sum limits M_k can additionally be shrunk to some smaller values L_k, which significantly reduces the computational burden (see Section 6.3 for numerical results). In most cases the kernel K is the multivariate normal density function and, as such, an effective support can be defined, i.e., the region outside which the values of K are practically negligible. Now Eq. (14) can be rewritten as

(25)

We propose to calculate L_k using the following formula (for k = 1, \ldots, d):

L_k = \min\left( \left\lceil \frac{\tau \sqrt{\lambda_{\max}(H)}}{\delta_k} \right\rceil, M_k - 1 \right)    (26)

where \lambda_{\max}(H) is the largest eigenvalue of H and \delta_k is the mesh size from Eq. (16). After some empirical tests we found a suitable value of the cut-off constant \tau for a standard two-dimensional normal kernel. Such a value of \tau guarantees that \tilde{\psi}(H) calculated by either (14) or (25) differs very little. Finally, we can calculate the sizes of the matrices in (19) and (20) according to the following equation

(27)

5 Notes on FFT-based algorithm for integrated density derivative functionals

We start this section with a short introduction to the notation used in the discussion that follows. This notation was introduced in Chacon:2010 () and then used by those authors in consecutive papers; see the References. Let f be a real d-variate function, and let its first derivative (gradient) vector be defined as Df = (\partial f / \partial x_1, \ldots, \partial f / \partial x_d)^T. Then the r-th derivative of f is defined to be the vector D^{\otimes r} f. According to this notation, D^{\otimes r} denotes the r-th Kronecker power of the operator D, formally understood as the r-fold product D \otimes \cdots \otimes D. Naturally, D^{\otimes 0} f = f and D^{\otimes 1} f = Df. Accordingly, all the second-order partial derivatives can be organized into the Hessian matrix Hf; the Hessian operator can be formally written as D D^T, and Hf is of course a matrix of size d \times d. Then the equivalent vectorized form is D^{\otimes 2} f = \mathrm{vec}(Hf), where vec denotes the operator which concatenates the columns of a matrix into a single vector, see Henderson:1979 (). For r \ge 3 it is not clear how to organize the set containing all the partial derivatives of order r in a matrix-like manner, so the notation presented above seems very convenient, clear and useful.
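As a small concrete illustration of this notation for r = 2, consider the (hypothetical, not taken from the paper) function f(x, y) = x^2 y. Its Hessian and vectorized second derivative D^{\otimes 2} f = vec(Hf) can be written down directly:

```python
import numpy as np

def d2_f(x, y):
    """D^{(x)2} f = vec(Hessian) for the example function f(x, y) = x**2 * y.

    The partials are f_xx = 2y, f_xy = f_yx = 2x, f_yy = 0; vec() stacks
    the columns of the Hessian, giving a vector of length d^2 = 4."""
    hess = np.array([[2.0 * y, 2.0 * x],
                     [2.0 * x, 0.0]])
    return hess.reshape(-1, order='F')   # column-wise stacking = vec()
```

For general r the analogous object is a vector of length d^r, which is exactly the growth that makes the direct approach expensive.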

Many (if not most) modern bandwidth selection algorithms involve computing Eq. (13) for a given even number r, which can vary depending on the concrete algorithm. This is usually the most time- and resource-consuming part of these algorithms. Thus, in this section, we give a detailed explanation of how the FFT-based algorithm can speed up the computation of Eq. (13).

It is easy to notice that two computational problems occur here. The first is how to efficiently calculate the r-th order partial derivatives of the kernel function K. The second is how to efficiently calculate the double sums. Promising algebraic solutions to these two problems have recently been developed in Chacon:2015 (), where some efficient recursive algorithms were proposed. Our FFT-based solution can be seen as a useful extension of these developments.

Below we expand the results from Section 4, where the LSCV bandwidth selector was analysed. This selector involves Eq. (13) in its simplest form, that is, for r = 0 (zero-th order derivative); see Eq. (12). Hence, the FFT-based calculation of Eq. (12) is in fact a straightforward extension of our solution presented in Gramacki:2016 ().

However, extending our FFT-based solution to the more general case r > 0 is not so easy. The problem comes from the fact that the r-th derivative D^{\otimes r} f is the set of all its partial derivatives of order r. Here D^{\otimes r} f denotes the vector containing all the r-th partial derivatives of f, and its length is d^r. The vector-like arrangement (instead of, for example, an r-fold tensor array or a multivariate matrix) is preferred here as it significantly simplifies multivariate analysis. However, the length d^r quickly leads to computational difficulties for the FFT-based algorithm. Fortunately, we can use the excellent results presented in (Chacon:2015, Section 6.3), where the authors show how one can rewrite the three most often used algorithms for bandwidth selection (i.e., PI, UCV, SCV) in such a way that, instead of directly using Eq. (13), they utilize V-statistics of the following general form

V = n^{-m} \sum_{i_1=1}^{n} \cdots \sum_{i_m=1}^{n} \gamma(X_{i_1}, \ldots, X_{i_m})    (28)

where \gamma is a symmetric kernel function. V is called a V-statistic of degree m.

In the context of bandwidth selectors, the following V-statistic of degree 2, based on higher-order derivatives of the Gaussian density function, has a special meaning

(29)

where \eta is a scalar function (and this fact is crucial here!) defined as

(30)

Here A and B are symmetric matrices. Based on the above, we can also define

(31)

Hence, what is required in the context of the paper's main subject is an FFT-based implementation of Eq. (29). The key to its efficient implementation is a fast algorithm to compute the \eta functions. The most time-consuming part of the formula for \eta is the computation of the r-th derivative of the multivariate Gaussian density function. Fortunately, in Chacon:2015 () the authors give a complete and efficient algorithm for this task. Now, as \eta is a scalar function, the FFT-based implementation is a straightforward replication of the four-step procedure presented in Section 4 (see also the supplemental material for the up-to-date R source codes).

The algorithm for computing \eta has been implemented in the ks R package (ks, ) starting from version 1.10.0. This package contains, among others, the function Qr.cumulant{ks}, which is a very efficient implementation of V-statistics of the following form

(32)

We use this function in our numerical experiments: we compare the implementation of Eq. (32) from the ks R package with our FFT-based implementation, as such a comparison is a good way to assess the advantages of the latter. The results are presented in Section 6.4.

Coming back to the LSCV FFT-based algorithm presented in detail in Section 4, we can rewrite the appropriate equations using the \eta functions. Thus, Eq. (12) can be rewritten as

(33)

and accordingly Eq. (18) becomes

(34)

The LSCV criterion of Eq. (8) can be rewritten in the most general case, supporting bandwidth selection for kernel density derivative estimation involving an arbitrary order r, as

(35)

For r = 0, Eqs. (8) and (35) are obviously equivalent. Finally, the LSCV criterion from Eq. (10) can be rewritten in the same way

(36)

and accordingly, for r = 0, Eqs. (10) and (36) are also equivalent. Now, the FFT-based implementation of (35) and (36) is only a straightforward application of the results from Section 4.

Last but not least, we have not implemented the complete procedures for the PI and SCV bandwidth selectors (covering both kernel density estimation and kernel density derivative estimation), as this would go beyond the scope of the paper. But we can surely expect that our FFT-based solution would improve their performance as well. The mathematical formulas for the PI and SCV selectors shown in Section 6.3 of Chacon:2015 () involve double sums of the same \eta-type functions, so it is obvious that our solution can be easily adapted there and, in consequence, the selectors should work faster than their non-FFT counterparts.

6 Experimental results

This section is divided into four parts. The first part reports a simulation study based on synthetic data (two-dimensional mixtures of normal densities). The advantage of using such target densities is that we can compute the exact Integrated Squared Error (ISE)

\mathrm{ISE}(H) = \int \left( \hat{f}(x; H) - f(x) \right)^2 dx    (37)

between the resulting kernel density estimate and the target density. It has been proven that the ISE for any normal mixture target density has an explicit form; see for example Duong:2004 ().
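The explicit form follows from the Gaussian product identity \int \phi_A(x - a) \phi_B(x - b) \, dx = \phi_{A+B}(a - b): expanding the square in Eq. (37) for a Gaussian-kernel KDE and a normal mixture target leaves only such pairwise terms. The following NumPy sketch (illustrative only; names and interfaces are hypothetical) shows the resulting closed form.

```python
import numpy as np

def mvn_at(u, S):
    """N(0, S) density evaluated at the vector u."""
    d = len(u)
    Si = np.linalg.inv(S)
    return np.exp(-0.5 * u @ Si @ u) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(S))

def ise_kde_vs_mixture(X, H, w, mus, Sigmas):
    """Exact ISE of Eq. (37) between a Gaussian-kernel KDE (data X,
    bandwidth H) and the normal mixture sum_k w_k N(mus[k], Sigmas[k]),
    using  int phi_A(x-a) phi_B(x-b) dx = phi_{A+B}(a-b)."""
    n = len(X)
    # int fhat^2 : kernel-kernel cross terms
    t1 = sum(mvn_at(xi - xj, 2.0 * H) for xi in X for xj in X) / n**2
    # int fhat*f : kernel-component cross terms
    t2 = sum(wk * mvn_at(xi - mk, H + Sk)
             for xi in X
             for wk, mk, Sk in zip(w, mus, Sigmas)) / n
    # int f^2 : component-component cross terms
    t3 = sum(wk * wl * mvn_at(mk - ml, Sk + Sl)
             for wk, mk, Sk in zip(w, mus, Sigmas)
             for wl, ml, Sl in zip(w, mus, Sigmas))
    return t1 - 2.0 * t2 + t3
```

This closed form is what allows the simulation study below to report exact ISE values rather than numerical approximations.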

The second part reports a simulation study based on two real datasets. Here, the most convenient way to compare the results is to use contour plots. We also moved from the general multivariate case to the bivariate case, as the results and the target densities can then be easily visualized on two-dimensional plots.

The third part reports speed results, comparing the computational times needed to estimate the optimal bandwidth matrices with the FFT-based and the non-FFT-based (direct) algorithms. The usability of reducing M_k to L_k is also analyzed (see Eqs. (25) and (26)).

Finally, the fourth part reports speed results comparing the computational times needed for the V-statistics computation defined in Eq. (32). We compare the results returned by the Qr.cumulant{ks} R function and our FFT-based implementation.

All the calculations were conducted in the R environment. Minimization of the objective function was carried out using the optim{stats} R function. The Nelder-Mead method was used with the default scaling parameters, that is, the reflection factor α = 1.0, the contraction factor β = 0.5 and the expansion factor γ = 2.0. This method was chosen as it is robust and works reasonably well for nondifferentiable functions. A disadvantage of the method is that it is relatively slow. Some numerical problems are also reported below.

6.1 Synthetic data

The target densities analyzed here were taken from Chacon:2009 (), as they cover a very wide range of density shapes. We preserve their original names and numbering. The shapes are shown in Fig. 3.

Figure 3: Contour plots for 12 target densities (normal mixtures).

We took several sample sizes and grid sizes (for simplicity, equal in each direction). For each combination of sample size and grid size we computed the ISE error, and these computations were repeated 50 times. In each repetition a different random sample was drawn from the target density. Then classical boxplots were drawn. We did not run separate simulations for M_k and L_k (see Eqs. (25) and (26)), as the results are practically the same.

Our goal was to check two things: (a) whether, in general, the FFT-based algorithm gives correct results (compared with a reference direct implementation based on Eq. (10)), and (b) how the binning operation may influence the final results. In Figs. 4 and 5 we present results for two sample sizes. Looking at the boxplots we can see that the FFT-based solution is fully comparable to the direct solution. The ISE errors differ slightly, but from a practical point of view the fluctuations can be neglected.

Figure 4: Boxplots of the ISE errors for the smaller sample size. 'no FFT' is the reference boxplot calculated without the FFT, using the direct formula (10). Fxx denotes the boxplot where grid size xx was used.
Figure 5: Boxplots of the ISE errors for the larger sample size. 'no FFT' is the reference boxplot calculated without the FFT, using the direct formula (10). Fxx denotes the boxplot where grid size xx was used.
Figure 6: ISE errors for Model 8 and for each of 50 experiment repetitions. The sample size is .

However, during the practical experiments, problems of a numerical nature were observed. First we describe the problem, and then we try to give a plausible explanation. While preparing Figs. 4 and 5, only 'good' results were used, while 'bad' results were discarded. The 'bad' ones are those for which the derived ISE errors turned out to be extremely large, i.e., many thousands or more. In Fig. 6 we show every ISE error for Model 8 (Asymmetric Bimodal) and for each of the 50 experiment replications. As can easily be noticed, only the direct computation (based on Eq. (10), labeled 'no FFT') and the computations for the grid size equal to 50 yield all ISE errors at an acceptable level. Unfortunately, after binning the data, a number of failed optimizations occur, especially for smaller grids. In Table 1 we give the exact numbers of optimization failures (that is, cases with excessive ISE errors) for selected sample sizes, grid sizes and model numbers (see Fig. 3). We can see, as expected, the pattern that a larger sample size generates a larger number of optimization failures. Another pattern is that increasing the grid size decreases the number of optimization failures. Moreover, some models are stiffer than others; it seems, e.g., that Model 3 is the most fragile and, in fact, the FFT-based approach is unacceptable there. The problem with this particular model is caused by a specific data concentration, where most of the probability mass is concentrated in a very small area. Accordingly, in this case a denser gridding is required.

grid size #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
20 0 4 49 0 0 0 0 0 0 1 1 3
30 0 0 21 0 0 0 0 0 0 0 0 2
40 0 0 5 0 0 0 1 0 0 0 1 2
50 0 0 1 0 0 0 0 0 0 0 0 1
150 0 0 0 0 0 0 0 0 0 0 0 0
grid size #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
20 0 44 50 0 6 0 3 1 0 5 0 10
30 0 0 48 0 0 0 6 3 0 0 0 8
40 0 0 36 0 0 0 5 1 0 0 0 5
50 0 0 15 0 0 0 4 0 0 0 0 3
150 0 0 1 0 0 0 0 0 0 0 0 0
grid size #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
20 0 50 50 48 50 0 29 8 0 50 19 50
30 0 47 50 4 14 0 22 8 0 25 2 48
40 0 0 50 0 1 0 38 11 0 2 0 36
50 0 0 50 0 0 0 21 9 0 0 0 30
150 0 0 4 0 0 0 1 0 0 0 1 2
Table 1: Number of abnormally large ISE errors for particular models from Fig. 3 (the three blocks correspond to the three sample sizes). The criterion for classifying a particular ISE error as excessive is the same as described in the text.

An obvious workaround for the above-mentioned numerical problems is to increase the grid size. Some suggestions on the selection of grid sizes can be found in Gonzalez:1996 () and are similar to our results. According to the results given in Table 1, we can say that moderately large grid sizes should be adequate in most practical applications.

It is also important to note that direct minimization (that is, without the FFT-based approach) is much more robust, in the sense that no optimization problems such as those shown above occur. The explanation for this phenomenon is that binning makes the data highly discretized, even if there are no repeated values. This may result in a nondifferentiable objective function, which is much harder for optimization algorithms and causes problems with finding the global minimum. Hence, more research is needed to develop new algorithms, or improve existing ones, that are more robust in the area of bandwidth estimation.

6.2 Real data

In this section we analyze two real datasets. The first one is the well-known Old Faithful Geyser dataset as investigated in Azzalini:1990 () (and many others). It consists of pairs of waiting times between eruptions and durations of the eruptions for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The data frame consists of 272 observations on 2 variables. The second dataset is the Unicef dataset available in the ks R package (ks, ). It contains the number of deaths of children under 5 years per 1000 live births and the average life expectancy (in years) at birth for 73 countries with a GNI (Gross National Income) of less than 1000 US dollars per annum per capita. The data frame consists of 73 observations on 2 variables; each observation corresponds to a country.

Here, the ISE criterion does not have a closed form (as opposed to the normal mixture densities used in Section 6.1), so the only sensible way to evaluate our FFT-based solution is to use contour plots. Before processing, all duplicates were discarded, as cross-validation methods are not well behaved in the presence of ties: with duplicate observations, the procedure tends to choose bandwidths that are too small. We did not run separate simulations for and (see Eqs. (25) and (26)) as the results are practically the same for .

First we analyze how the binning procedure affects the accuracy of evaluating the objective function . In Figs. 7(a) and 7(b) we show densities of the Unicef and Old Faithful datasets, respectively. The optimal bandwidth was calculated based on the exact solution of the objective function given in Eq. (10); in other words, no binning was used here. In Figs. 7(c) and 7(d) we can observe how the binning influences the resulting densities; now the calculations were based on Eq. (4). As one can observe, even a moderate grid size (here ) is enough, and the plots differ only slightly from Figs. 7(a) and 7(b). After application of the FFT-based approach (this time the calculations were also based on Eq. (4)), the resulting contour plots presented in Figs. 7(e) and 7(f) are identical to those generated without the FFT-based support. This confirms that the FFT-based procedure finds the correct bandwidth.
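The FFT step itself amounts to a discrete convolution of the bin counts with kernel weights sampled on the grid. A minimal one-dimensional Python sketch follows; the power-of-two padding scheme and the function name are our assumptions, not the paper's actual R code:

```python
import numpy as np

def fft_kde_on_grid(counts, kernel_weights):
    """Density values on the grid: convolve bin counts with centered kernel weights via FFT."""
    m, L = counts.size, kernel_weights.size   # L odd, kernel centered at index L // 2
    P = 2 ** int(np.ceil(np.log2(m + L)))     # zero-pad to a power of two; avoids wrap-around
    conv = np.fft.irfft(np.fft.rfft(counts, P) * np.fft.rfft(kernel_weights, P), P)
    half = L // 2
    return conv[half:half + m]                # drop the padding, recentre on the grid
```

The result matches a direct `np.convolve(counts, kernel_weights, mode='same')`, but the FFT route costs O(P log P) instead of O(mL), which is what makes the grid-based evaluation of the objective function fast.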


Figure 7: The effect of the binning and FFT procedures applied to the Unicef and Old Faithful datasets. (a), (b) – the optimal bandwidth based on Eq. (10), that is, when no binning is used; (c), (d) – the optimal bandwidth based on Eq. (4), that is, when binning is used and FFT is not used (to make the effect of binning easier to see, the plots from (a) and (b), drawn with dotted lines, are added); (e), (f) – the optimal bandwidth based on Eq. (4) when FFT is used.

6.3 Speed comparisons

In this section we analyze how our FFT-based approach reduces the total computational times. Three different implementations are considered. The first is based on direct computation of the double summation in Eq. (12). This implementation is highly vectorized (no explicit for loops; for details see the supplementary material). We do not analyze a pure for-loop-based implementation, as it is extremely slow and, as such, of no practical use, especially for large (e.g., in the thousands). We call this implementation direct. The second implementation utilizes the FFT and is based on Eq. (4), where precalculation of the kernel values (see Eq. (18)) is vectorized. We call this implementation fft-M. Finally, the third implementation utilizes Eq. (25), that is, a modified version of Eq. (4) where the sum limits are replaced by some smaller values . We call this implementation fft-L.

In this experiment we do not actually find the minimizer of Eq. (9), because the different methods may require a different number of evaluations of the objective function under minimization. Instead, we measure the execution times needed to calculate the functionals defined in Eqs. (12), (4) and (25). The time needed for binning is also included in the results (for the fft-M and fft-L methods); binning is a necessary step and as such should not be neglected.
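For reference, the direct implementation evaluates the double sum by materializing all pairwise differences at once. A hedged Python sketch of this kind of vectorization (variable and function names are our own; the paper's direct code is in R and uses the kernel from Eq. (12)):

```python
import numpy as np

def direct_double_sum(X, H):
    """Vectorized double sum of Gaussian kernel values K_H(X_i - X_j) over all pairs."""
    n, d = X.shape
    H_inv, H_det = np.linalg.inv(H), np.linalg.det(H)
    diff = X[:, None, :] - X[None, :, :]                    # (n, n, d) pairwise differences
    quad = np.einsum('ijk,kl,ijl->ij', diff, H_inv, diff)   # Mahalanobis quadratic forms
    norm = np.sqrt((2.0 * np.pi) ** d * H_det)
    return np.exp(-0.5 * quad).sum() / norm                 # O(n^2) time and memory
```

The quadratic cost in both time and memory is exactly what limits this approach for large samples, as the timing results below confirm.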

To reduce the number of variants, all experiments were performed only for two-dimensional datasets. Additionally, in this experiment the statistical structure of the dataset is not very important, so the distribution was used, varying only its size . Using other distributions (i.e., those shown in Fig. 3) would not change the performance of the fft-M implementation but could slightly change the performance of the fft-L implementation, because different and values may be assigned (see Eq. (26)) and, consequently, different and values will be generated (see Eq. (27)).

We took sample sizes from to , incrementing the sequence by . Grid sizes were taken from to , incrementing the sequence by (for simplicity, the grids are equal in each direction, that is, ). For the fft-M and fft-L implementations, each combination of sample and grid sizes was used. The computations were repeated 50 times and the mean time was calculated. For the direct implementation, 50 repetitions were computed for each sample size and the mean time was likewise calculated.

In Fig. 8 we show the results for the direct, fft-M and fft-L implementations. We can see that for small grid sizes (roughly up to ) the calculation times depend more on the sample size than they do for grid sizes of about and bigger. Starting from grid sizes of about , the computational times become almost constant, and this behavior is very attractive from the practical point of view. One could expect that the FFT-based implementations should not depend on (and this is true), but for those small grid sizes the main portion of the computational time comes from the binning step and not from the FFT calculations. As a consequence, we observe a linear dependence. Recalling that the binning is computed using an algorithm, the observed linearity is obvious.

Figure 8: Speed comparison results for the fft-M, fft-L and direct implementations. Lines marked with open circles () are for the fft-M implementation; lines marked with filled circles () are for the fft-L implementation. The lower-right plot gives results for the linear binning operation.

Moreover, the fft-L implementation is always faster than its fft-M equivalent, and the difference becomes bigger as the grid size increases. At the same time, we cannot see any significant accuracy degradation, so the use of instead of is highly recommended in practical applications.

We can also see an interesting behavior for the grid sizes , and : the computational times decrease as the sample size increases. The explanation of this phenomenon is simple if we look carefully at Eq. (26) and check the values of that are calculated. Results for the three selected grid sizes are shown in Table 2. For example, for the grid size we can see that for sample sizes , and are both equal to . Then, for sample sizes , and are equal to and , respectively. Finally, for the sample sizes , and are both equal to . The values of directly affect the FFT computation time (see Eq. (22)), which causes the three ‘levels’ in Fig. 8 for the grid size .
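This behavior is easy to reproduce if, as we assume in this sketch, the padded FFT length is taken as the smallest power of two accommodating the grid size plus the kernel's effective support (a Python illustration of a padding rule in the spirit of Eq. (27); the exact rule in our R code may differ):

```python
import math

def fft_pad_size(m, L):
    """Smallest power of two >= m + L: grid size plus kernel support, no circular wrap-around."""
    return 2 ** math.ceil(math.log2(m + L))
```

As the sample size grows, the bandwidth shrinks and so does the kernel support L measured in grid cells, so m + L can drop below a power-of-two boundary; for instance, with a hypothetical m = 200, fft_pad_size(200, 80) gives 512 while fft_pad_size(200, 50) gives 256. Such jumps produce the step-like ‘levels’ visible in Fig. 8.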

The last two plots in Fig. 8 confirm the computational complexity of the direct implementation and the computational complexity of the binning operation. From a practical point of view, the usefulness of the direct implementation is questionable, especially for large data sizes.

sample size
grid size 200 400 600 800 1000 1200 1400 1600 1800 2000
512/512 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256
512/512 512/512 256/512 256/512 256/256 256/256 256/256 256/256 256/256 256/256
512/512 512/512 512/512 512/512 512/512 256/512 256/512 256/512 256/512 256/512
sample size
grid size 2200 2400 2600 2800 3000 3200 3400 3600 3800 4000
256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256
256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256
256/512 256/512 256/256 256/256 256/256 256/256 256/256 256/256 256/256 256/256
Table 2: Values of and (shown as / pairs) calculated according to Eq. (27) for some selected grid and sample sizes.

In Table 3 we show the values of calculated according to Eq. (26) for some selected grid sizes and sample sizes . As expected, for a given grid size the corresponding values are roughly constant, independent of the sample size.

grid size ()
sample size 30 60 110 140 160 200
200
600
1000
1400
1800
2200
2600
3000
3400
3800
Table 3: Values of calculated according to Eq. (26) for some selected values of the grid and sample sizes, where . Expressions in parentheses denote the and determined for the given grid and sample sizes.

6.4 V-statistics speed comparisons

In this section we compare our FFT-based approach with some recent developments presented in Chacon:2015 (). We have implemented the V-statistic given in Eq. (32) using our FFT-based approach and compare it with the Qr.cumulant{ks} R function. Note that the Qr.cumulant{ks} function (and other dependent functions; see the supplementary material) is not exported in the R-namespace sense and is accessible only through the triple-colon operator pkg:::name. To be safe, such functions should be treated as auxiliary ones; nevertheless, they appear to work fine and return correct results.

In Eq. (32) two issues should be considered separately: first, how to efficiently calculate the entity , and second, how to efficiently calculate the double sums there. The first issue has been successfully solved in the ks package, with an efficient implementation based on the dmvnorm.deriv.recursive{ks} function. Our implementation of (and consequently the FFT-based implementation of Eq. (29)) is based on this function. As for the second issue, the authors of Chacon:2015 () also propose a method for fast calculation of the double sums, but numerical simulations show that our FFT-based implementation is generally faster. To confirm this, we compared the performance of Qr.cumulant{ks} with our implementation (see the supplementary material for the up-to-date R source codes). We call the two implementations Qr-ks and Qr-FFT, respectively.
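In outline, the FFT route evaluates such a double sum on the binned counts: a sum of the form sum over i,j of g(X_i - X_j) is approximated by the quadratic form c'(g*c), where g*c is a convolution computed by FFT. A one-dimensional Python sketch (function names are our own; the actual implementation is in the supplementary R code, and g is assumed symmetric and sampled at centered grid offsets):

```python
import numpy as np

def double_sum_via_fft(counts, g_on_grid):
    """Approximate sum_{i,j} g(X_i - X_j) from bin counts; g sampled at centered grid offsets."""
    m, L = counts.size, g_on_grid.size        # L odd, g_on_grid[L // 2] = g(0)
    P = 2 ** int(np.ceil(np.log2(m + L)))     # power-of-two zero padding
    conv = np.fft.irfft(np.fft.rfft(counts, P) * np.fft.rfft(g_on_grid, P), P)
    half = L // 2
    return float(counts @ conv[half:half + m])   # quadratic form c' (g * c)
```

Because the cost is driven by the grid size rather than the sample size, this is what makes the Qr-FFT timings below nearly independent of the number of observations.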

We took sample sizes from to , incrementing the sequence by . Grid sizes were taken to be and (for simplicity, the grids are equal in each direction, that is, ). Derivative orders were taken to be . The computations were repeated 50 times and the mean time was calculated. To reduce the number of variants, all experiments were performed only for two-dimensional datasets; additionally, since the statistical structure of the dataset is not very important here, the distribution was used, varying only its size .

In Fig. 9 we show the results for both the Qr-ks and the Qr-FFT implementations. As can be seen, the Qr-FFT implementation usually outperforms the Qr-ks implementation, especially for bigger ; the Qr-ks implementation is faster only for smaller . Importantly, the Qr-FFT implementation is practically independent of the sample size. Some small departures from this behavior (visible mainly for the grid size equal to and for smaller derivative orders) are due to the binning, which is the required data preprocessing step. As binning takes relatively little time, its influence on the total computational time can be neglected for bigger and .

Figure 9: Speed comparison results for the Qr-ks and the Qr-FFT implementations. Lines marked with open symbols (like for example ) are for the Qr-FFT implementation, lines marked with filled symbols (like for example ) are for the Qr-ks implementation.

7 Conclusion

Although nonparametric kernel density estimation is nowadays a standard technique in exploratory data analysis, some problems remain not fully solved. These include, among other things, the question of how to compute the bandwidth fast, especially in the multivariate unconstrained case. In this paper we have presented a satisfactory FFT-based algorithm which is fast, accurate and covers unconstrained bandwidth matrices. We have outlined the complete procedure and performed comprehensive simulation studies.

Our main contributions in the paper are: (a) drawing attention to a real problem involving the FFT in the field of bandwidth selection; (b) improving the existing FFT-based algorithm by proposing the new reshaping shown in Eqs. (19) and (20), which plays a crucial role in adapting the FFT-based algorithm to support both constrained and unconstrained bandwidth matrices; and (c) pointing out how our algorithm can be used for fast computation of integrated density derivative functionals involving an arbitrary derivative order, which is extremely important in implementing almost all modern bandwidth selection algorithms, not only the LSCV one used in the paper. Preliminary computer simulations prove the practical usability of our approach.

In Section 6.1 we reported some numerical problems which remain a challenging open issue.

Acknowledgments

The authors would like to thank anonymous reviewers whose comments greatly helped improve the quality of the manuscript.

References

  • (1) B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986.
  • (2) D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, Inc., 1992.
  • (3) M. P. Wand, M. C. Jones, Kernel Smoothing, Chapman & Hall, 1995.
  • (4) V. C. Raykar, R. Duraiswami, L. H. Zhao, Fast computation of kernel estimators, Journal of Computational and Graphical Statistics 19 (1) (2010) 205–220.
  • (5) S. Łukasik, Parallel computing of kernel density estimates with mpi, Lect. Notes in Comput. Sci. 4489 (2007) 726–734.
  • (6) V. C. Raykar, R. Duraiswami, Very fast optimal bandwidth selection for univariate kernel density estimation, Tech. Rep. CS-TR-4774/UMIACS-TR-2005-73, Dept. of Computer Science, University of Maryland, College Park (2006).
  • (7) W. Andrzejewski, A. Gramacki, J. Gramacki, Graphics processing units in acceleration of bandwidth selection for kernel density estimation, International Journal of Applied Mathematics and Computer Science 23 (4) (2013) 869–885.
  • (8) M. P. Wand, Fast computation of multivariate kernel estimators, Journal of Computational and Graphical Statistics 3 (4) (1994) 433–445.
  • (9) A. Gramacki, J. Gramacki, FFT-based fast computation of multivariate kernel estimators with unconstrained bandwidth matrices, Journal of Computational and Graphical Statistics, DOI:10.1080/10618600.2016.11829188 [in press].
  • (10) J. E. Chacón, T. Duong, Efficient recursive algorithms for functionals based on higher order derivatives of the multivariate gaussian density, Statistics and Computing 25 (2015) 959–974.
  • (11) M. C. Jones, J. S. Marron, S. J. Sheather, A brief survey of bandwidth selection for density estimation, Journal of the American Statistical Association 91 (433) (1996) 401–407.
  • (12) M. C. Jones, J. S. Marron, S. J. Sheather, Progress in data-based bandwidth selection for kernel density estimation, Computational Statistics 11 (3) (1996) 337–381.
  • (13) M. P. Wand, M. C. Jones, Multivariate plug-in bandwidth selection, Computational Statistics 9 (2) (1994) 97–116.
  • (14) S. R. Sain, K. A. Baggerly, D. W. Scott, Cross-validation of multivariate densities, Journal of the American Statistical Association 89 (427) (1994) 807–817.
  • (15) T. Duong, Bandwidth selectors for multivariate kernel density estimation, Ph.D. thesis, University of Western Australia, School of Mathematics and Statistics (2004).
  • (16) T. Duong, M. L. Hazelton, Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation, Journal of Multivariate Analysis 93 (2005) 417–433.
  • (17) T. Duong, M. L. Hazelton, Cross-validation bandwidth matrices for multivariate kernel density estimation, Scandinavian Journal of Statistics 32 (2005) 485–506.
  • (18) J. E. Chacón, T. Duong, Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices, Test 19 (2010) 375–398.
  • (19) J. E. Chacón, T. Duong, Unconstrained pilot selectors for smoothed cross validation, Australian & New Zealand Journal of Statistics 53 (2011) 331–351.
  • (20) J. E. Chacón, T. Duong, Data-driven density estimation, with applications to nonparametric clustering and bump hunting, Electronic Journal of Statistics 7 (2013) 499–532.
  • (21) M. Rudemo, Empirical choice of histograms and kernel density estimators, Scandinavian Journal of Statistics 9 (2) (1982) 65–78.
  • (22) A. W. Bowman, An alternative method of cross-validation for the smoothing of density estimates, Biometrika 71 (2) (1984) 353–360.
  • (23) D. W. Scott, G. R. Terrell, Biased and unbiased cross-validation in density estimation, Journal of the American Statistical Association 82 (1987) 1131–1146.
  • (24) P. Hall, J. S. Marron, B. U. Park, Smoothed cross validation, Probability Theory and Related Fields 92 (1992) 1–20.
  • (25) B. U. Park, J. S. Marron, Comparison of data-driven bandwidth selectors, Journal of the American Statistical Association 85 (409) (1990) 66–72.
  • (26) S. J. Sheather, M. C. Jones, A reliable data-based bandwidth selection method for kernel density estimation, Journal of the Royal Statistical Society, Series B 53 (3) (1991) 683–690.
  • (27) B. W. Silverman, Kernel density estimation using the fast Fourier transform. Algorithm AS 176, Applied Statistics 31 (1982) 93–99.
  • (28) J. Fan, J. S. Marron, Fast implementations of nonparametric curve estimators, Journal of Computational and Graphical Statistics 3 (1994) 35–56.
  • (29) H. V. Henderson, S. R. Searle, Vec and vech operators for matrices, with some uses in jacobians and multivariate statistics, The Canadian Journal of Statistics 7 (1) (1979) 65–81.
  • (30) T. Duong, Kernel Smoothing, R package version 1.10.3 (2015).
    URL http://CRAN.R-project.org/package=ks
  • (31) J. E. Chacón, Data-driven choice of the smoothing parametrization for kernel density estimators, The Canadian Journal of Statistics 37 (2009) 249–265.
  • (32) W. González-Manteiga, C. Sánchez-Sellero, M. Wand, Accuracy of binned kernel functional approximations, Computational Statistics & Data Analysis 22 (1996) 1–16.
  • (33) A. Azzalini, A. W. Bowman, A look at some data on the old faithful geyser, Applied Statistics 39 (1990) 357–365.