Asymptotic analysis of the jittering kernel density estimator
Jittering estimators are nonparametric function estimators for mixed data. They extend arbitrary estimators from the continuous setting by adding random noise to discrete variables. We give an in-depth analysis of the jittering kernel density estimator, which reveals several appealing properties. The estimator is strongly consistent, asymptotically normal, and unbiased for discrete variables. It converges at minimax-optimal rates, which are established as a by-product of our analysis. To understand the effect of adding noise, we further study its asymptotic efficiency and finite sample bias in the univariate discrete case. Simulations show that the estimator is competitive on finite samples. The analysis suggests that similar properties can be expected for other jittering estimators.
Keywords: density, discrete, jittering, kernel, minimax, mixed data
T. Nagler
Multivariate density estimation is a central field in nonparametric statistics. Yet many popular methods have a significant drawback in applications: they can only be applied to continuous data. Some estimators have been specifically designed to allow for mixed continuous and discrete data (Ahmad and Cerrito, 1994, Li and Racine, 2003, Hall et al., 1983, Efromovich, 2011), but the number is small compared to the methods available in a purely continuous framework.
A common trick among practitioners is to make the discrete variables continuous by adding a small amount of noise. The noisy data is continuous and the usual nonparametric estimators apply. But the addition of random noise can introduce bias, so this procedure generally lacks justification. Nagler (2017) showed that adding noise still allows for valid estimates when the noise comes from a certain class of distributions. Then any nonparametric density estimator can be used in the mixed data setting. The resulting estimators are called jittering estimators.
Jittering estimators have so far been neglected in academic research, likely due to the widespread concern that jittering causes a loss in efficiency. The main objective of this article is to demonstrate that this concern is usually unjustified. To this end, we give an in-depth analysis of a simple instance from the class of jittering estimators: the jittering kernel density estimator, which is the jittering analog of the classical kernel density estimator (Parzen, 1962, Rosenblatt, 1956, Wand, 1992). We shall show that it maintains all the properties expected from a good nonparametric density estimator:
It is asymptotically normal and asymptotically unbiased for discrete variables (Theorem 1).
It is strongly and uniformly consistent (Theorem 2).
It is relatively efficient, even fully efficient in specific cases (Section 4.1).
Although the focus is on only one instance of the class of jittering estimators, we can expect that others have similar properties.
The remainder of this article is organized as follows. Section 2 introduces the jittering estimator and some assumptions. Section 3 gives a comprehensive asymptotic analysis, which is complemented by a study of the asymptotic efficiency and finite sample bias in the univariate discrete setting (Section 4). Section 5 establishes minimax-optimal rates for density estimation in a nonparametric mixed data model. Section 6 demonstrates that the estimator is also competitive on finite samples; Section 7 offers conclusions. Proofs of all theorems are deferred to Appendix A.
2 The estimator
Suppose that is a random vector with discrete component and continuous component . We explicitly allow for the cases where , (all variables are discrete) and , (all variables are continuous). Our goal is to estimate the density of based on ‘observations’ , , which are iid random vectors having the same distribution as . In this context, is the density with respect to the product of the counting and Lebesgue measures, i.e.,
Let be a real-valued function, called kernel, and abbreviate for any , . The classical kernel density estimator is defined as
where are called bandwidth parameters and control the amount of smoothing. The above definition of the estimator is simplified to ease our exposition: we use only one parameter () for smoothing all components of and one parameter () for smoothing the components of . In practice, one would use a single parameter for each variable or even a bandwidth matrix (see, e.g., Scott, 2008).
The estimator only works for continuous random vectors. To make it applicable to mixed data, we make all discrete variables continuous by adding noise. Let , , be iid random vectors independent from , . Suppose further that the components of are iid with density . The jittering kernel density estimator is defined as the classical kernel density estimator applied to , :
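The construction can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: it assumes Uniform(-0.5, 0.5) noise, a product Epanechnikov kernel, and one bandwidth per block of variables, and the function names `jitter` and `jkde` are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def jitter(Z, rng):
    """Add iid Uniform(-0.5, 0.5) noise to every discrete component (assumed noise density)."""
    return Z + rng.uniform(-0.5, 0.5, size=Z.shape)

def jkde(z, x, Z_jit, X, b, h):
    """Product-kernel estimate at (z, x) from jittered discrete data (n x p) and continuous data (n x q)."""
    n, p = Z_jit.shape
    q = X.shape[1]
    kz = np.prod(epanechnikov((z - Z_jit) / b), axis=1)  # discrete block, bandwidth b
    kx = np.prod(epanechnikov((x - X) / h), axis=1)      # continuous block, bandwidth h
    return np.sum(kz * kx) / (n * b**p * h**q)

rng = np.random.default_rng(0)
n = 20_000
Z = rng.binomial(3, 0.4, size=(n, 1)).astype(float)  # one discrete variable
X = rng.standard_normal(size=(n, 1))                 # one continuous variable, independent of Z
Z_jit = jitter(Z, rng)

estimate = jkde(np.array([1.0]), np.array([0.0]), Z_jit, X, b=0.5, h=0.3)
# True density at (1, 0): Binomial(3, 0.4) pmf at 1 times the standard normal pdf at 0.
truth = 3 * 0.4 * 0.6**2 / np.sqrt(2 * np.pi)
print(f"estimate = {estimate:.4f}, truth = {truth:.4f}")
```

Because the product over an empty axis equals one, the same function also covers the purely discrete and purely continuous cases when the corresponding data block has zero columns.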
To facilitate our analysis, the following conditions are imposed on the kernel function:
is a continuous function satisfying .
There is , , such that for ,
A kernel function satisfying K2 is called -th order kernel (see, e.g., Marron, 1994). ∎
We further assume that the noise density belongs to the class , as defined in Nagler (2017):
We say that for some , if
is an absolutely continuous probability density function,
for all ,
for all .
The density of is given by
The class ensures that is well-behaved. The most important properties are summarized in the following result (see Nagler, 2017, Propositions 2 and 3).
Suppose the components of are iid with density . Then the joint density of satisfies for all , and such that ,
The first equality implies that we can equivalently estimate instead of . This is convenient because is the density of a purely continuous random vector. The second equality states that all derivatives w.r.t. vanish, which makes estimation even easier.
The estimator is similar to the estimators of Ahmad and Cerrito (1994) and Li and Racine (2003). The difference lies in the kernel function for discrete data. The estimators of Ahmad and Cerrito (1994) and Li and Racine (2003) use a deterministic kernel function which is defined on the integers. In contrast, the jittering kernel density estimator (2) uses a random kernel defined on a compact subset of where randomness is induced by .
3 Asymptotic analysis in the general setting
3.1 Asymptotic distribution
We first study the asymptotic distribution of the jittering kernel density estimator. To motivate our first theorem, we recall a classical result from kernel density estimation in the purely continuous setting (e.g., Wand, 1992). If is the density of a continuous random vector , sufficiently smooth, , and , , then
Recall that is nothing else than applied to , . Lemma 1 showed that , the density of , has vanishing derivatives with respect to . We can thus expect the first sum in the bias term in (3) to vanish asymptotically. In fact, it becomes exactly zero when . The following result improves upon the properties implied by (3) by taking these considerations into account.
is times continuously differentiable with respect to .
and hold with .
and as .
There is , such that for all .
Under assumptions A1-A5, it holds for any ,
where and . If further ,
The assumptions in Theorem 1 differ from those usually made in the continuous framework. There are no assumptions on the smoothness of with respect to , because its local behavior is controlled by . Further, is not required to vanish asymptotically, but should be less than for large . This is sufficient to ensure that there is no bias with respect to . Further decreasing does not change the bias, but inflates the variance. ∎
The asymptotic variance does not depend on the noise density or on its two class parameters (and neither does the asymptotic bias). Intuitively, we would expect an increase in the estimator’s variance because we are adding random noise. Apparently this effect is dominated by the sampling variability in the original data and is asymptotically negligible. So there should be no benefit from averaging over multiple jitters (at least asymptotically). This is in contrast to empirical processes of jittered data (Genest et al., 2017).
3.2 Asymptotically optimal bandwidths
A standard tool for studying optimal bandwidths is the asymptotic mean squared error,
Under the assumptions of Theorem 1, we get
For , it is easy to check that the bandwidth minimizing the AMSE satisfies . This is well-known as the optimal rate for the classical kernel density estimator when . The AMSE further suggests that it is optimal to choose as large as possible. The largest allowed by A5 is . Asymptotically, this is the optimal bandwidth. We shall see shortly that this choice means that we are not smoothing the discrete variables at all. This is not unreasonable: in contrast to the continuous case, smoothing discrete variables is not necessary for consistent nonparametric estimation (for a discussion, see, Simar et al., 2011).
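The rate of the optimal bandwidth for the continuous variables can be checked numerically. The sketch below assumes the generic AMSE shape of a squared bias of order h^(2l) plus a variance of order 1/(n h^q), with the discrete bandwidth held fixed and absorbed into the variance constant. The constants `c_bias` and `c_var` are hypothetical placeholders, not values from the analysis above.

```python
import numpy as np

def amse(h, n, ell, q, c_bias, c_var):
    """Generic AMSE shape: squared bias c_bias^2 * h^(2*ell) plus variance c_var / (n * h^q)."""
    return c_bias**2 * h ** (2 * ell) + c_var / (n * h**q)

def h_opt(n, ell, q, c_bias, c_var):
    """Closed-form minimizer, obtained by setting the derivative of amse in h to zero."""
    return (q * c_var / (2 * ell * c_bias**2 * n)) ** (1.0 / (2 * ell + q))

n, ell, q = 1_000, 2, 1
c_bias, c_var = 0.5, 0.2  # hypothetical constants, chosen only for illustration

# Compare the closed form with a brute-force grid search.
grid = np.linspace(0.01, 1.0, 100_000)
h_grid = grid[np.argmin(amse(grid, n, ell, q, c_bias, c_var))]
h_closed = h_opt(n, ell, q, c_bias, c_var)
print(h_grid, h_closed)

# With ell = 2 and q = 1 the rate is n^(-1/5): 32 times more data halves the bandwidth.
print(h_opt(32 * n, ell, q, c_bias, c_var) / h_closed)
```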
On finite samples can be too small. If , the estimator can be written as
Indeed, the estimator neglects all observations where and, thus, does not smooth with respect to the discrete variables. This also means that if for all . Theorem 1 implicitly assumes that is large enough to provide sufficiently many observations with . This is guaranteed asymptotically whenever ), but often demands sample sizes much larger than what is common.
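A small numerical illustration of this behavior follows, assuming Uniform(-0.5, 0.5) noise, the Epanechnikov kernel, and a discrete bandwidth of 0.5. These choices are assumptions made for the sketch, and the helper `weights` is hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

rng = np.random.default_rng(1)
n, b = 500, 0.5
Z = rng.integers(0, 4, size=n)               # observed values in {0, 1, 2, 3}
Z_jit = Z + rng.uniform(-0.5, 0.5, size=n)   # jittered data

def weights(z):
    """Kernel weight of every observation for the evaluation point z."""
    return epanechnikov((z - Z_jit) / b) / b

w = weights(2)
contributors = np.unique(Z[w > 0])    # values of Z_i that receive positive weight at z = 2
estimate_unseen = weights(10).mean()  # no observation equals 10

print(contributors, estimate_unseen)
```

Only observations with the same discrete value as the evaluation point contribute, and the estimate at a value that never occurs in the sample is exactly zero.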
Theorem 1 implies pointwise consistency of the jittering kernel density estimator, but assumption A1 is more strict than necessary. The following result weakens this assumption and additionally establishes strong uniform consistency.
The th derivative of exists and is uniformly Lipschitz on .
Suppose that assumptions , – hold. Then, for all ,
If there are such that for all , the rates of convergence in Theorem 2 do not involve , the dimension of the discrete variables. So adding more discrete variables does not change the convergence rate of the estimator. In particular, there is no cost for recoding unordered categorical variables into several binary variables. ∎
4 A closer look at the univariate discrete setting
The jittering kernel density estimator handles continuous variables just like the classical kernel density estimator. How it smooths discrete variables is less obvious. To gain a better understanding, we study its asymptotic efficiency and finite sample bias when there is only one discrete variable (, ).
4.1 Asymptotic efficiency
For convenience, set . The expectation and variance in Theorem 1 become
The most efficient point estimator for a discrete probability is the sample frequency . It satisfies
The asymptotic relative efficiency (ARE) of relative to is defined as
where denotes the leading term of an asymptotic expansion of the variance. The ARE is interpreted as follows: If the estimator is used with observations, then one needs observations to obtain the same accuracy with . If the ARE is less than one, then needs fewer observations, i.e., is more efficient than . If the ARE is greater than one, it is the other way around. If it is exactly one, the two estimators are equally efficient.
Straightforward calculations yield
The relative efficiency depends on three quantities:
It is increasing in and the most efficient choice is , which corresponds to the uniform error density on . On the other hand, the relative efficiency approaches 0 for or .
It is decreasing in , which is the roughness of the kernel . The ‘least rough’ kernel is the uniform kernel, i.e., , for which . But this kernel is rather unpopular in practice. A more widely used kernel is the Epanechnikov kernel, , for which .
It is decreasing in . The worst case is that , for which the ARE is zero. For a variable, noise, and the Epanechnikov kernel, we get .
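The three monotonicity statements can be explored numerically. The closed form in the sketch below is reconstructed here under the assumptions of Uniform(-0.5, 0.5) noise and a discrete bandwidth of at most 0.5: the variance of the sample frequency divided by the variance of the jittering estimate. It should be read as an illustration consistent with the statements above, not as a verbatim copy of the displayed formula. The function names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def are_closed_form(p, roughness, b):
    """Reconstructed ARE: (1 - p) / (roughness / b - p), with roughness = integral of K^2."""
    return (1.0 - p) / (roughness / b - p)

def are_monte_carlo(p, b, n=200, reps=5_000, seed=0):
    """Variance of the sample frequency over variance of the jittering estimate of P(Z = 1)."""
    rng = np.random.default_rng(seed)
    Z = rng.binomial(1, p, size=(reps, n))
    Z_jit = Z + rng.uniform(-0.5, 0.5, size=(reps, n))
    freq = (Z == 1).mean(axis=1)
    jk = (epanechnikov((1 - Z_jit) / b) / b).mean(axis=1)
    return freq.var() / jk.var()

print(are_closed_form(0.5, 0.6, 0.5))  # Epanechnikov kernel, roughness 3/5
print(are_closed_form(0.5, 0.5, 0.5))  # uniform kernel, roughness 1/2
print(are_monte_carlo(0.5, 0.5))       # simulation with the Epanechnikov kernel
```

The reconstructed expression decreases in the roughness and in the probability, increases in the bandwidth, and vanishes when the probability equals one, in line with the list above.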
Suppose is the uniform density on (for which ), , and is the uniform kernel (for which ). Then, the two estimators are equally efficient. In fact, the estimator becomes
which is exactly the sample frequency estimator . ∎
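The equivalence is easy to verify on simulated data. The sketch below uses Uniform(-0.5, 0.5) noise, the uniform kernel on [-1, 1], and a bandwidth of 0.5. The Poisson sample is an arbitrary choice for illustration.

```python
import numpy as np

def uniform_kernel(u):
    """Uniform kernel: 1/2 on [-1, 1], zero elsewhere."""
    return np.where(np.abs(np.asarray(u, dtype=float)) <= 1.0, 0.5, 0.0)

rng = np.random.default_rng(2)
n, b = 1_000, 0.5
Z = rng.poisson(2.0, size=n)                 # arbitrary discrete sample
Z_jit = Z + rng.uniform(-0.5, 0.5, size=n)   # jittered data

support = np.arange(0, Z.max() + 1)
jkde_vals = np.array([np.mean(uniform_kernel((z - Z_jit) / b) / b) for z in support])
freq_vals = np.array([np.mean(Z == z) for z in support])

print(np.max(np.abs(jkde_vals - freq_vals)))
```

Each observation equal to the evaluation point contributes exactly 1/n, and every other observation contributes zero (up to ties at the kernel boundary, which occur with probability zero).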
4.2 Finite sample bias
Assuming , Theorem 1 shows that is unbiased in a purely discrete setting. On small samples, it is often necessary to choose a larger bandwidth (see Section 3.2). When , the estimator is usually biased.
Suppose that and satisfies –. Then,
To interpret the bias, it is helpful to focus on a simple case first.
Suppose that and is a symmetric function satisfying –. Then for all ,
The operator is known as the second order central difference operator (e.g., Monahan, 2011). It is commonly used as a numerical approximation of the second order derivative of real-valued functions, which is
We can interpret as a discrete analogue to the second order derivative of a real-valued function. In this aspect, the discrete setting is similar to the continuous one (where the bias of is proportional to the second order derivative).
The parameter is called the step size and determines how local the derivative approximation is. The bias of is a weighted sum of such ‘derivatives’ for several values of . The bandwidth limits the maximal step size and thereby controls the locality of the bias. Although not universally true, smaller values of typically correspond to a smaller bias. A simple counterexample is when for all , where the bias is zero for all . There are also situations where decreasing leads to a larger bias. This phenomenon also exists in the continuous setting, but is disguised by asymptotic approximations. When as in Theorem 1, the estimator is unbiased.
The bias in Lemma 2 can be interpreted similarly. But is replaced by a weighted approximation of the derivative. If or are asymmetric, different weights will be assigned to the ‘forward derivative’ and the ‘backward derivative’ .
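The interpretation of the bias as a weighted sum of central differences can be made concrete. The sketch below assumes Uniform(-0.5, 0.5) noise and the Epanechnikov kernel; under these assumptions the weight attached to step size k is the kernel mass between (k - 0.5)/b and (k + 0.5)/b. The weights are derived here for illustration and may be normalized differently from the lemma above. All function names are hypothetical.

```python
import numpy as np

def epan_cdf(u):
    """CDF of the Epanechnikov kernel 0.75 * (1 - u^2) on [-1, 1]."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + 0.75 * u - 0.25 * u**3

def step_weights(b, k_max):
    """Weight w_k given to observations k steps away from the evaluation point, k = 1..k_max."""
    k = np.arange(1, k_max + 1)
    return epan_cdf((k + 0.5) / b) - epan_cdf((k - 0.5) / b)

def central_difference(pmf, z, k):
    """Second order central difference p(z + k) - 2 p(z) + p(z - k) with step size k."""
    return pmf(z + k) - 2.0 * pmf(z) + pmf(z - k)

def bias(pmf, z, b):
    """Bias of the jittering estimate of p(z) as a weighted sum of central differences."""
    k_max = int(np.ceil(b + 0.5))  # weights vanish for larger step sizes
    w = step_weights(b, k_max)
    return sum(w[k - 1] * central_difference(pmf, z, k) for k in range(1, k_max + 1))

probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # hypothetical pmf on {0, ..., 4}

def pmf(z):
    return probs[z] if 0 <= z < len(probs) else 0.0

print(bias(pmf, 2, 0.5))             # bandwidth 0.5: no bias
print(bias(pmf, 2, 1.0))             # larger bandwidth: the peak at z = 2 is smoothed down
print(bias(lambda z: 0.1, 2, 1.0))   # locally constant pmf: no bias for any bandwidth
```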
5 Minimax rate optimality
The maximum risk associated with a class of densities and a (semi-) distance is defined as
We consider two semi-distances that relate to pointwise and uniform consistency of , respectively:
For , we shall consider all bounded density functions whose continuous part belongs to a Hölder class. For , we use the multi-index notations , and denote the partial derivatives of with respect to as
For and , , , the class is defined as all functions such that for all with ,
is a probability density on ,
exists for all and
If and , contains all densities on . If and , it is a Hölder class on . ∎
The following result establishes convergence rates of the jittering kernel density estimator with respect to the maximum risk.
We shall see that the rates in Theorem 3 (i)–(iii) are optimal in a minimax sense. The minimax risk is defined as
where the infimum is taken over all possible estimators of . In our context, an ‘estimator’ is any measurable function of , .
A sequence of positive real numbers is called
an upper bound on the minimax rate if there is such that
a lower bound on the minimax rate if there is such that
a minimax-optimal rate of convergence if both (i) and (ii) hold.
In a purely continuous setting, optimal rates have long been established (Stone, 1980, 1983, Ibragimov and Khas’minskii, 1983). To the best of the author’s knowledge, there are no results on optimal rates in the mixed data setting.
To show that a rate is minimax-optimal, we have to check that it is both an upper and lower bound on the minimax rate. Theorem 3 already gives us an upper bound, since, for any estimator ,
Lower bounds on the minimax rate can be deduced easily by considering subsets of for which lower bounds are known (see Section A.4).
Let and . The minimax-optimal rate of convergence associated with the class and distance satisfies
, for , ,
, for , , ,
, for , , ,
Theorem 4 only provides an interval for the optimal rate in case (iv). Minimax analysis for this setting is surprisingly hard; see Han et al. (2015) for minimax rates with respect to the distance. The interval is quite narrow, differing only by a factor of size . The exact rate, however, remains an open problem. ∎
6 Simulation experiments
The jittering kernel density estimator has appealing asymptotic properties. This may come as a surprise: since we are adding noise to the data, we could expect that the data become less informative and uncertainty increases. We complement our asymptotic arguments with a small numerical experiment that illustrates the small sample performance of the estimator. Because of its wide use and close resemblance to our approach, we will use the estimator of Li and Racine (2003) as a benchmark.
We use the following setup:
We compare three estimators:
liracine: the estimator of Li and Racine (2003), which serves as the benchmark.
jkde: the jittering kernel density estimator with noise density , for which .
jkde2: the jittering kernel density estimator with noise density (as in Nagler, 2017, Example 3), for which , .
Contrary to (2), we use one bandwidth parameter for each variable. Both estimators use likelihood cross-validation for bandwidth selection.
We estimate the density of a vector , where for all , for all . For the sake of simplicity, all variables are simulated independently.
Results are based on simulated data sets with sample sizes .
As a performance measure we use the root average square error (RASE) computed over a grid in . More specifically, we use , , and
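The performance measure can be computed as follows. This sketch assumes the usual definition of the root average square error, namely the square root of the mean squared difference between estimated and true density over the evaluation grid. The function name `rase` and the example values are hypothetical.

```python
import numpy as np

def rase(f_hat, f_true):
    """Root average square error between estimated and true density values on a common grid."""
    f_hat = np.asarray(f_hat, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    return np.sqrt(np.mean((f_hat - f_true) ** 2))

# Hypothetical density values on a grid of three evaluation points.
example = rase([0.1, 0.2, 0.3], [0.1, 0.2, 0.5])
print(example)
```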
Figure 1 shows the estimators’ performance for various values of , and . Each estimator is represented by two boxes, where the left box corresponds to and the right box to . The choice of noise density seems to be of minor importance: jkde and jkde2 give almost identical results. Compared to liracine, the two jittering estimators show only subtle differences. The two jittering estimators are more accurate in all scenarios with , and less accurate when . This is related to our observation from Section 4.1 that the efficiency is worse when is large. The relative performance of the three estimators is consistent across the two sample sizes under consideration. Overall, the jittering estimators are competitive with the benchmark estimator liracine. We found no evidence that adding artificial noise negatively affects the accuracy of the estimates. This confirms what was suggested by the estimator’s asymptotic properties.
7 Conclusion
This article gave an in-depth analysis of the behavior of the jittering kernel density estimator. It was shown to have appealing large-sample properties and to perform well on small samples.
Although our focus was on a particular instance of the class of jittering estimators, we also learned something about the class as a whole. Adding noise to discrete variables does not have a negative impact on estimation accuracy. This is true for both large samples (as confirmed by our asymptotic analysis) and small samples (as illustrated by simulations). More specifically, it allows for estimators that are optimal in terms of convergence rates and efficiency. It is likely that these findings generalize to more sophisticated density estimators or estimators of functionals of the density, such as regression functions.
https://github.com/tnagler/cctools: an R package implementing the jittering kernel density estimator and likelihood cross-validation for the bandwidths.
https://gist.github.com/tnagler/786465cee2c774a844ff1846e7cdacd8: code for the simulation study in Section 6.
This work was partially supported by the German Research Foundation (DFG grant CZ 86/5-1).
Appendix A Proofs
A.1 Proof of Theorem 1
We first calculate the bias term. Using a change of variables, we get
Since , it holds for all and that . Furthermore, is zero outside of . Hence, for ,
Recall the derivative notation from (7). An -th order Taylor expansion of yields that
for some , where the second equality is due to K2. The second sum is because all terms are bounded by A1 and K1. In summary,
For the variance, we get
The second term in square brackets has already been calculated for the bias. Using similar arguments, we can show