Indirect Cross-validation for Density Estimation
A new method of bandwidth selection for kernel density estimators is proposed. The method, termed indirect cross-validation, or ICV, makes use of so-called selection kernels. Least squares cross-validation (LSCV) is used to select the bandwidth of a selection-kernel estimator, and this bandwidth is appropriately rescaled for use in a Gaussian kernel estimator. The proposed selection kernels are linear combinations of two Gaussian kernels, and need not be unimodal or positive. Theory is developed showing that the relative error of ICV bandwidths can converge to 0 at a rate of , which is substantially better than the rate of LSCV. Interestingly, the selection kernels that are best for purposes of bandwidth selection are very poor if used to actually estimate the density function. This property appears to be part of the larger and well-documented paradox to the effect that “the harder the estimation problem, the better cross-validation performs.” The ICV method uniformly outperforms LSCV in a simulation study, a real data example, and a simulated example in which bandwidths are chosen locally.
KEY WORDS: Kernel density estimation; Bandwidth selection; Cross-validation; Local cross-validation.
Let be a random sample from an unknown density . A kernel density estimator of is
where is a smoothing parameter, also known as the bandwidth, and is the kernel, which is generally chosen to be a unimodal probability density function that is symmetric about zero and has finite variance. A popular choice for is the Gaussian kernel: . To distinguish between estimators with different kernels, we shall refer to estimator (1) with given kernel as a -kernel estimator. Choosing an appropriate bandwidth is vital for the good performance of a kernel estimate. This paper is concerned with a new method of data-driven bandwidth selection that we call indirect cross-validation (ICV).
Many data-driven methods of bandwidth selection have been proposed. The two most widely used are least squares cross-validation, proposed independently by ?) and ?), and the ?) plug-in method. Plug-in produces more stable bandwidths than does cross-validation, and hence is the currently more popular method. Nonetheless, an argument can be made for cross-validation since it requires fewer assumptions than plug-in and works well when the density is difficult to estimate; see ?). A survey of bandwidth selection methods is given by ?).
A number of modifications of LSCV has been proposed in an attempt to improve its performance. These include the biased cross-validation method of ?), a method of ?), the trimmed cross-validation of ?), the modified cross-validation of ?), and the method of ?) based on kernel contrasts. The ICV method is similar in spirit to one-sided cross-validation (OSCV), which is another modification of cross-validation proposed in the regression context by ?). As in OSCV, ICV initially chooses the bandwidth of an -kernel estimator using least squares cross-validation. Multiplying the bandwidth chosen at this initial stage by a known constant results in a bandwidth, call it , that is appropriate for use in a Gaussian kernel estimator.
A popular means of judging a kernel estimator is the mean integrated squared error, i.e., , where
Letting be the bandwidth that minimizes when the kernel is Gaussian, we will show that the mean squared error of as an estimator of converges to 0 at a faster rate than that of the ordinary LSCV bandwidth. We also describe an unexpected bonus associated with ICV, namely that, unlike LSCV, it is robust to rounded data. A fairly extensive simulation study and two data analyses confirm that ICV performs better than ordinary cross-validation in finite samples.
2 Description of indirect cross-validation
We begin with some notation and definitions that will be used subsequently. For an arbitrary function , define
The LSCV criterion is given by
where, for , denotes a kernel estimator using all the original observations except for . When uses kernel , can be written as
It is well known that is an unbiased estimator of , and hence the minimizer of with respect to is denoted .
2.1 The basic method
Our aim is to choose the bandwidth of a second order kernel estimator. A second order kernel integrates to 1, has first moment 0, and finite, nonzero second moment. In principle our method can be used to choose the bandwidth of any second order kernel estimator, but in this article we restrict attention to , the Gaussian kernel. It is well known that a -kernel estimator has asymptotic mean integrated squared error (MISE) within 5% of the minimum among all positive, second order kernel estimators.
Indirect cross-validation may be described as follows:
Select the bandwidth of an -kernel estimator using least squares cross-val- idation, and call this bandwidth . The kernel is a second order kernel that is a linear combination of two Gaussian kernels, and will be discussed in detail in Section 2.2.
Assuming that the underlying density has second derivative which is continuous and square integrable, the bandwidths and that asymptotically minimize the of - and -kernel estimators, respectively, are related as follows:
Define the indirect cross-validation bandwidth by . Importantly, the constant depends on no unknown parameters. Expression (3) and existing cross-validation theory suggest that will at least converge to 1 in probability, where is the minimizer of for the -kernel estimator.
Henceforth, we let denote the bandwidth that minimizes with . Theory of ?) and ?) shows that the relative error converges to 0 at the rather disappointing rate of . In contrast, we will show that can converge to 0 at the rate . Kernels that are sufficient for this result are discussed next.
2.2 Selection kernels
We consider the family of kernels , where, for all ,
Note that the Gaussian kernel is a special case of (4) when or . Each member of is symmetric about 0 and such that . It follows that kernels in are second order, with the exception of those for which .
The family can be partitioned into three families: , and . The first of these is . Each kernel in has a negative dip centered at . For fixed, the smaller is, the more extreme the dip; and for fixed , the larger is, the more extreme the dip. The kernels in are ones that “cut-out-the-middle.”
The second family is . Kernels in are densities which can be unimodal or bimodal. Note that the Gaussian kernel is a member of this family. The third sub-family is , each member of which has negative tails. Examples of kernels in are shown in Figure 1.
Kernels in and are not of the type usually used for estimating . Nonetheless, a worthwhile question is “why not use for both cross-validation and estimation of ?” One could then bypass the step of rescaling and simply estimate by an -kernel estimator with bandwidth . The ironic answer to this question is that the kernels in that are best for cross-validation purposes are very inefficient for estimating . Indeed, it turns out that an -kernel estimator based on a sequence of ICV-optimal kernels has that does not converge to 0 faster than . In contrast, the of the best -kernel estimator tends to 0 like . These facts fit with other cross-validation paradoxes, which include the fact that LSCV outperforms other methods when the density is highly structured, ?), the improved performance of cross-validation in multivariate density estimation, ?), and its improvement when the true density is not smooth, ?). One could paraphrase these phenomena as follows: “The more difficult the function is to estimate, the better cross-validation seems to perform.” In our work, we have in essence made the function more difficult to estimate by using an inefficient kernel . More details on the of -kernel estimators may be found in ?).
3 Large sample theory
The theory presented in this section provides the underpinning for our methodology. We first state a theorem on the asymptotic distribution of , and then derive asymptotically optimal choices for the parameters and of the selection kernel.
3.1 Asymptotic mean squared error of the ICV bandwidth
Classical theory of ?) and ?) entails that the bias of an LSCV bandwidth is asymptotically negligible in comparison to its standard deviation. We will show that the variance of an ICV bandwidth can converge to 0 at a faster rate than that of an LSCV bandwidth. This comes at the expense of a squared bias that is not negligible. However, we will show how to select and (the parameters of the selection kernel) so that the variance and squared bias are balanced and the resulting mean squared error tends to 0 at a faster rate than does that of the LSCV bandwidth. The optimal rate of convergence of the relative error is , a substantial improvement over the infamous rate for LSCV.
Before stating our main result concerning the asymptotic distribution of , we define some notation:
Note that to simplify notation, we have suppressed the fact that , and depend on the parameters and . An outline of the proof of the following theorem is given in the Appendix.
Theorem. Assume that and its first five derivatives are continuous and bounded and that exists and is Lipschitz continuous. Suppose also that
for any sequence of random variables such that , a.s. Then, if and is fixed,
as and , where converges in distribution to a standard normal random variable,
Theorem 4.1 of ?) on asymptotic normality of LSCV bandwidths is not immediately applicable to our setting for at least three reasons: the kernel is not positive, it does not have compact support, and, most importantly, it changes with via the parameter .
The assumption of six derivatives for is required for a precise quantification of the asymptotic bias of . Our proof of asymptotic normality of only requires that be four times differentiable, which coincides with the conditions of Theorem 4.1 in ?).
The asymptotic bias is positive, implying that the ICV bandwidth tends to be larger than the optimal bandwidth. This is consistent with our experience in numerous simulations.
In the next section we apply the results of our theorem to determine asymptotically optimal choices for and .
3.2 Minimizing asymptotic mean squared error
The corresponding asymptotically optimal mean squared error is
which confirms our previous claim that the relative error of converges to 0 at the rate . The corresponding rates for LSCV and the Sheather-Jones plug-in rule are and , respectively.
Because is not confounded with in , we may determine a single optimal value of that is independent of . The function of is minimized at . Furthermore, small choices of lead to an arbitrarily large increase in mean squared error, while the MSE at is only about 1.33 times that at the minimum.
Our theory to this point applies to kernels in , i.e., kernels with negative tails. ?) has developed similar theory for the case where , which corresponds to , i.e., kernels that apply negative weights to the smallest spacings in the LSCV criterion. Interestingly, the same optimal rate of results from letting . However, when the optimal values of are used in the respective cases ( and ), the limiting ratio of optimum mean squared errors is , with yielding the smaller error. Our simulation studies confirm that using with large does lead to more accurate estimation of the optimal bandwidth.
4 Practical choice of and
In order to have an idea of how good choices of and vary with and , we determined the minimizers of the asymptotic mean squared error of for various sample sizes and densities. In doing so, we considered a single expression for the asymptotic mean squared error that is valid for either large or small values of . Furthermore, we use a slightly enhanced version of the asymptotic bias of . The first order bias of is , or , where
Now, the term is of smaller order asymptotically than and hence was deleted in the theory of Section 3. Here we retain , and hence the that minimizes the mean squared error depends on both and .
We considered the following five normal mixtures defined in the article by ?):
|Skewed unimodal density:|
|Separated bimodal density:|
|Skewed bimodal density:||.|
These choices for provide a fairly representative range of density shapes. It is worth noting that the asymptotically optimal (expression (8)) is free of location and scale. We may thus choose a single representative of a location-scale family when investigating the effect of . The following remarks summarize our findings about and .
For each , the optimal value of () is larger (smaller) for the unimodal densities than for the bimodal ones.
All of the MSE-optimal and correspond to kernels from , the family of negative-tailed kernels.
For each density, the optimal decreases monotonically with . Recall from Section 3.2 that the asymptotically optimal is . For each unimodal density, the optimal is within 13.5% of 2.42 at , and for each bimodal density is within 18% of 2.42 when is 20,000.
In practice it would be desirable to have choices of and that would adapt to the and at hand. However, attempting to estimate optimal values of and is potentially as difficult as the bandwidth selection problem itself. We have built a practical purpose model for and by using polynomial regression. The independent variable was and the dependent variables were the MSE-optimal values of and for the five densities defined above. Using a sixth degree polynomial for and a quadratic for , we arrived at the following models for and :
To the extent that unimodal densities are more prevalent than multimodal densities in practice, these model values are biased towards bimodal cases. Our extensive experience shows that the penalty for using good bimodal choices for and when in fact the density is unimodal, is an increase in the upward bias of . Our implementation of ICV, however, guards against oversmoothing by using an objective upper bound on the bandwidth, as we explain in detail in Section 7. We thus feel confident in recommending model (11) for choosing and in practice, at least until a better method is proposed. Indeed, this model is what we used to choose and in the simulation study reported upon in Section 7.
5 Robustness of ICV to data rounding
?, p.52) showed that if the data are rounded to such an extent that the number of pairs for which is above a threshold, then approaches as approaches zero. This threshold is for the Gaussian kernel. ?) showed that for data with ties, the behavior of as is determined by the balance between and . In particular, is and when and , respectively. The former condition holds necessarily if is nonnegative and has its maximum at . This means that all the traditional kernels have the problem of choosing when the data are rounded.
Recall that selection kernels (4) are not restricted to be nonnegative. It turns out that there exist and such that will hold. We say that selection kernels satisfying this condition are robust to rounding. It can be verified that the negative-tailed selection kernels with are robust to rounding when
Interestingly, the boundary separating robust from nonrobust kernels almost coincides with the pairs defined by that model.
6 Local ICV
A local version of cross-validation for density estimation was proposed and analyzed independently by ?) and ?). A local method allows the bandwidth to vary with , which is desirable when the smoothness of the underlying density varies sufficiently with . ?) proposed a different method of local smoothing that is a hybrid of plug-in and cross-validation methods. Here we propose that ICV be performed locally. The method parallels that of ?) and ?), with the main difference being that each local bandwidth is chosen by ICV rather than LSCV. We suggest using the smallest local minimizer of the ICV curve, since ICV does not have LSCV’s tendency to undersmooth.
Let be a kernel estimate that employs a kernel in the class , and define, at the point , a local ICV curve by
The quantity determines the degree to which the cross-validation is local, with a very large choice of corresponding to global ICV. Let be the minimizer of with respect to . Then the bandwidth of a Gaussian kernel estimator at the point is taken to be . The constant is defined by (3), and choice of and in the selection kernel will be discussed in Section 8.
Local LSCV can be criticized on the grounds that, at any , it promises to be even more unstable than global LSCV since it (effectively) uses only a fraction of the observations. Because of its much greater stability, ICV seems to be a much more feasible method of local bandwidth selection than does LSCV. We provide evidence of this stability by example in Section 8.
7 Simulation study
The primary goal of our simulation study is to compare ICV with ordinary LSCV. However, we will also include the Sheather-Jones plug-in method in the study. We considered the four sample sizes , 250, 500 and 5000, and sampled from each of the five densities listed in Section 4. For each combination of density and sample size, 1000 replications were performed. Here we give only a synopsis of our results. The reader is referred to ?) for a much more detailed account of what we observed.
Let denote the minimizer of for a Gaussian kernel estimator. For each replication, we computed , , and . The definition of is , where is the oversmoothed bandwidth of ?). Since tends to be biased upwards, this is a convenient means of limiting the bias. In all cases the parameters and in the selection kernel were chosen according to model (11). For any random variable defined in each replication of our simulation, we denote the average of over all replications (with and fixed) by . Our main conclusions may be summarized as follows.
The ratio ranged between 0.04 and 0.70 in the sixteen settings excluding the skewed bimodal density. For the skewed bimodal, the ratio was 0.84, 1.27, 1.09, and 0.40 at the respective sample sizes 100, 250, 500 and 5000. The fact that this ratio was larger than 1 in two cases was a result of ICV’s bias, since the sample standard deviation of the ICV bandwidth was smaller than that for the LSCV bandwidth in all twenty settings.
The ratio was smaller than 1 for every combination of density and sample size. For the two “large bias” cases mentioned in the previous remark the ratio was 0.92.
The ratio was smaller than 1 in six of the twenty cases considered. Among the other fourteen cases, the ratio was between 1.00 and 1.15, exceeding 1.07 just twice.
Despite the fact that the LSCV bandwidth is asymptotically normally distributed (see ?)), its distribution in finite samples tends to be skewed to the left. In contrast, our simulations show that the ICV bandwidth distribution is nearly symmetric.
In this Section we illustrate the use of ICV with two examples, one involving credit scores from Fannie Mae and the other simulated data. The first example is provided to compare the ICV, LSCV, and Sheather-Jones plug-in methods for choosing a global bandwidth. The second example illustrates the benefit of applying ICV locally.
8.1 Mortgage defaulters
In this example we analyze the credit scores of Fannie Mae clients who defaulted on their loans. The mortgages considered were purchased in “bulk” lots by Fannie Mae from primary banking institutions. The data set was taken from the website http://www.dataminingbook.com associated with ?).
In Figure 3 we have plotted an unsmoothed frequency histogram and the LSCV, ICV and Sheather-Jones plug-in density estimates for the credit scores. The class interval size in the unsmoothed histogram was chosen to be 1, which is equal to the accuracy to which the data have been reported. It turns out that the LSCV curve tends to when , but has a local minimum at about 2.84. Using results in a severely undersmoothed estimate. Both the Sheather-Jones plug-in and ICV density estimates show a single mode around 675 and look similar, with the ICV estimate being somewhat smoother.
Interestingly, a high percentage of the defaulters have credit scores less than 620, which many lenders consider the minimum score that qualifies for a loan; see ?).
8.2 Local ICV: simulated example
For this example we took five samples of size from the kurtotic unimodal density defined in ?). First, we note that even the bandwidth that minimizes results in a density estimate that is much too wiggly in the tails. On the other hand, using local versions of either ICV or LSCV resulted in much better density estimates, with local ICV producing in each case a visually better estimate than that produced by local LSCV.
For the local LSCV and ICV methods we considered four values of ranging from 0.05 to 0.3. A selection kernel with and was used in local ICV. This choice performs well for global bandwidth selection when the density is unimodal, and hence seems reasonable for local bandwidth selection since locally the density should have relatively few features. For a given , the local ICV and LSCV bandwidths were found for , and were interpolated at other using a spline. Average squared error (ASE) was used to measure closeness of a local density estimate to the true density :
Figure 4 shows results for one of the five samples. Estimates corresponding to the smallest and the largest values of are provided. The local ICV method performed similarly well for all values of considered, whereas all the local LSCV estimates were very unsmooth, albeit with some improvement in smoothness as increased.
A widely held view is that kernel choice is not terribly important when it comes to estimation of the underlying curve. In this paper we have shown that kernel choice can have a dramatic effect on the properties of cross-validation. Cross-validating kernel estimates that use Gaussian or other traditional kernels results in highly variable bandwidths, a result that has been well-known since at least 1987. We have shown that certain kernels with low efficiency for estimating can produce cross-validation bandwidths whose relative error converges to 0 at a faster rate than that of Gaussian-kernel cross-validation bandwidths.
The kernels we have studied have the form , where is the standard normal density and and are positive constants. The interesting selection kernels in this class are of two types: unimodal, negative-tailed kernels and “cut-out the middle kernels,” i.e., bimodal kernels that go negative between the modes. Both types of kernels yield the rate improvement mentioned in the previous paragraph. However, the best negative-tailed kernels yield bandwidths with smaller asymptotic mean squared error than do the best “cut-out-the-middle” kernels.
A model for choosing the selection kernel parameters has been developed. Use of this model makes our method completely automatic. A simulation study and examples reveal that use of this method leads to improved performance relative to ordinary LSCV.
To date we have considered only selection kernels that are a linear combination of two normal densities. It is entirely possible that another class of kernels would work even better. In particular, a question of at least theoretical interest is whether or not the convergence rate of for the relative bandwidth error can be improved upon.
Here we outline the proof of our theorem in Section 3. A much more detailed proof is available from the authors.
We start by writing
where is between and , and so
Using condition (5) we may write the last equation as
Defining and , we have
Using the central limit theorem of ?), it can be verified that
Computation of the first two moments of reveals that
At this point we need the first two moments of . A fact that will be used frequently from this point on is that , . Using our assumptions on the smoothness of , Taylor series expansions, symmetry of about 0 and ,
Recalling the definition of from (10), we have
Let denote the MISE of an -kernel estimator with bandwidth . Then , implying that
Using a second order approximation to and a first order approximation to , we then have
Substitution of this expression for into (14) and using the facts , and , it follows that . Later in the proof we will see that this last result implies that the first order bias of is due only to the difference .
Tedious but straightforward calculations show that , where is as defined in Section 3.1. It is worth noting that , where and . One would expect from Theorem 4.1 of ?) that the factor would appear in . Indeed it does implicitly, since as . Our point is that, when , the part of depending on is negligible in terms of its effect on and also .
To complete the proof write
Applying the same approximation of that led to (15), and the analogous one for , we have
It is easily verified that, as , , and , and hence
The proof is now complete upon combining all the previous results.
The authors are grateful to David Scott and George Terrell for providing valuable insight about cross-validation, and to three referees and an associate editor, whose comments led to a much improved final version of our paper. The research of Savchuk and Hart was supported in part by NSF Grant DMS-0604801.
- Ahmad and Ran (2004 Ahmad, I. A. and I. S. Ran (2004). Kernel contrasts: a data-based method of choosing smoothing parameters in nonparametric density estimation. J. Nonparametr. Stat. 16(5), 671–707.
- Bowman (1984 Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2), 353–360.
- Chiu (1991a Chiu, S.-T. (1991a). Bandwidth selection for kernel density estimation. Ann. Statist. 19(4), 1883–1905.
- Chiu (1991b Chiu, S.-T. (1991b). The effect of discretization error on bandwith selection for kernel density estimation. Biometrika 78(2), 436–441.
- Desmond (2008 Desmond, M. (2008). Lipstick on a pig. Forbes.
- Fan, Hall, Martin, and Patil (1996 Fan, J., P. Hall, M. A. Martin, and P. Patil (1996). On local smoothing of nonparametric curve estimators. J. Amer. Statist. Assoc. 91(433), 258–266.
- Feluch and Koronacki (1992 Feluch, W. and J. Koronacki (1992). A note on modified cross-validation in density estimation. Comput. Statist. Data Anal. 13(2), 143–151.
- Hall (1983 Hall, P. (1983). Large sample optimality of least squares cross-validation in density estimation. Ann. Statist. 11(4), 1156–1174.
- Hall (1984 Hall, P. (1984). Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal. 14(1), 1–16.
- Hall and Marron (1987 Hall, P. and J. S. Marron (1987). Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation. Probab. Theory Related Fields 74(4), 567–581.
- Hall and Schucany (1989 Hall, P. and W. R. Schucany (1989). A local cross-validation algorithm. Statist. Probab. Lett. 8(2), 109–117.
- Hart and Yi (1998 Hart, J. D. and S. Yi (1998). One-sided cross-validation. J. Amer. Statist. Assoc. 93(442), 620–631.
- Jones, Marron, and Sheather (1996 Jones, M. C., J. S. Marron, and S. J. Sheather (1996). A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc. 91(433), 401–407.
- Loader (1999 Loader, C. R. (1999). Bandwidth selection: classical or plug-in? Ann. Statist. 27(2), 415–438.
- Marron and Wand (1992 Marron, J. S. and M. P. Wand (1992). Exact mean integrated squared error. Ann. Statist. 20(2), 712–736.
- Mielniczuk, Sarda, and Vieu (1989 Mielniczuk, J., P. Sarda, and P. Vieu (1989). Local data-driven bandwidth choice for density estimation. J. Statist. Plann. Inference 23(1), 53–69.
- Rudemo (1982 Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Statist. 9(2), 65–78.
- Sain, Baggerly, and Scott (1994 Sain, S. R., K. A. Baggerly, and D. W. Scott (1994). Cross-validation of multivariate densities. J. Amer. Statist. Assoc. 89(427), 807–817.
- Savchuk (2009 Savchuk, O. (2009). Choosing a kernel for cross-validation. PhD thesis, Texas A&M University.
- Savchuk, Hart, and Sheather (2008 Savchuk, O. Y., J. D. Hart, and S. J. Sheather (2008). An empirical study of indirect cross-validation. Festschrift for Tom Hettmansperger. IMS Lecture Notes-Monograph Series. Submitted.
- Scott and Terrell (1987 Scott, D. W. and G. R. Terrell (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82(400), 1131–1146.
- Sheather and Jones (1991 Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53(3), 683–690.
- Shmueli, Patel, and Bruce (2006 Shmueli, G., N. R. Patel, and P. C. Bruce (2006). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. New York: Wiley.
- Silverman (1986 Silverman, B. W. (1986). Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. London: Chapman & Hall.
- Stute (1992 Stute, W. (1992). Modified cross-validation in density estimation. J. Statist. Plann. Inference 30(3), 293–305.
- Terrell (1990 Terrell, G. R. (1990). The maximal smoothing principle in density estimation. J. Amer. Statist. Assoc. 85(410), 470–477.
- van Es (1992 van Es, B. (1992). Asymptotics for least squares cross-validation bandwidths in nonsmooth cases. Ann. Statist. 20(3), 1647–1657.