Uniform Convergence Rate of the Kernel Density Estimator
Adaptive to Intrinsic Dimension


Jisu Kim                        Jaehyeok Shin                        Alessandro Rinaldo                        Larry Wasserman

Carnegie Mellon University


Abstract

We derive concentration inequalities for the supremum norm of the difference between a kernel density estimator (KDE) and its point-wise expectation that hold uniformly over the choice of the bandwidth and under weaker conditions on the kernel than those previously used in the literature. The derived bounds are adaptive to the intrinsic dimension of the underlying distribution. For instance, when the data-generating distribution has a Lebesgue density, our bound implies the same convergence rate as the ones known in the literature. However, when the underlying distribution is supported on a lower-dimensional set, our bounds depend explicitly on the intrinsic dimension of the support. Analogous bounds are derived for the derivatives of the KDE of any order. Our results are generally applicable but are especially useful for problems in geometric inference and topological data analysis, including level set estimation, density-based clustering, modal clustering and mode hunting, ridge estimation and persistent homology.

1 Introduction

Density estimation (see, e.g. Rao, 1983) is a classical and fundamental problem in non-parametric statistics that, especially in recent years, has also become a key step in many geometric inferential tasks. Among the many existing methods for density estimation, kernel density estimators (KDEs) are especially popular because of their conceptual simplicity and nice theoretical properties. A KDE is simply the Lebesgue density of the distribution obtained by convolving the empirical measure induced by the sample with an appropriate function, called a kernel (Parzen, 1962; Wand and Jones, 1994). Formally, let $X_1, \ldots, X_n$ be an independent and identically distributed sample from an unknown Borel probability distribution $P$ on $\mathbb{R}^d$. For a given kernel $K$, where $K$ is an appropriate function on $\mathbb{R}^d$ (often a density), and bandwidth $h > 0$, the corresponding KDE is the random Lebesgue density function defined as

$$\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right), \qquad x \in \mathbb{R}^d. \qquad (1)$$
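As an illustration of the definition in (1), the following minimal Python sketch (our own, assuming a Gaussian kernel; neither the function names nor the kernel choice come from the paper) evaluates the KDE at a set of query points.

    import numpy as np

    def gaussian_kernel(u):
        # standard Gaussian kernel on R^d, evaluated row-wise on an (m, d) array
        d = u.shape[1]
        return np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

    def kde(x, data, h, kernel=gaussian_kernel):
        # kernel density estimator of equation (1):
        # p_hat_h(x) = (1 / (n * h^d)) * sum_i K((x - X_i) / h)
        n, d = data.shape
        x = np.atleast_2d(x)
        return np.array([kernel((xi - data) / h).sum() for xi in x]) / (n * h ** d)

    # usage: 500 points from a standard normal in R^2, evaluated at two query points
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 2))
    print(kde(np.array([[0.0, 0.0], [1.0, 1.0]]), X, h=0.3))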

The point-wise expectation of the KDE is the function

$$p_h(x) = \mathbb{E}\left[\hat{p}_h(x)\right] = \frac{1}{h^d} \int_{\mathbb{R}^d} K\!\left(\frac{x - y}{h}\right) dP(y),$$

and can be regarded as a smoothed version of the density of $P$, if such a density exists. In fact, interestingly, both $\hat{p}_h$ and $p_h$ are Lebesgue probability densities for any choice of $h > 0$, regardless of whether $P$ admits a Lebesgue density. What is more, $p_h$ is oftentimes able to capture important topological properties of the underlying distribution or of its support (see, e.g. Fasy et al., 2014). For instance, if a data-generating distribution consists of two point masses, it has no Lebesgue density, but the pointwise mean of the KDE with a Gaussian kernel is the density of a mixture of two Gaussian distributions whose mean parameters are the two point masses. Although $P$ is quite different from the distribution with density $p_h$, for practical purposes, one may in fact rely on $p_h$.
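As a worked instance of the two-point-mass example (our own notation; the standard Gaussian kernel is assumed), suppose $P = \tfrac{1}{2}\delta_{a} + \tfrac{1}{2}\delta_{b}$ for two points $a, b \in \mathbb{R}^d$ and $K(u) = (2\pi)^{-d/2} e^{-\|u\|^2/2}$. Then

$$p_h(x) = \frac{1}{h^d}\int_{\mathbb{R}^d} K\!\left(\frac{x - y}{h}\right) dP(y) = \frac{1}{2}\,\frac{e^{-\|x-a\|^2/(2h^2)}}{(2\pi)^{d/2} h^d} + \frac{1}{2}\,\frac{e^{-\|x-b\|^2/(2h^2)}}{(2\pi)^{d/2} h^d},$$

which is the density of an equal-weight mixture of $N(a, h^2 I_d)$ and $N(b, h^2 I_d)$, even though $P$ itself has no Lebesgue density.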

Though seemingly contrived, the previous example illustrates a general phenomenon encountered in many geometric inference problems, namely that using $p_h$ as a target for inference leads not only to well-defined statistical tasks but also to faster or even dimension-independent rates. Results of this form, which require a uniform control over $\sup_{x} |\hat{p}_h(x) - p_h(x)|$, are plentiful in the literature on density-based clustering (Rinaldo and Wasserman, 2010; Wang et al., 2017), modal clustering and mode hunting (Chacón et al., 2015; Azizyan et al., 2015), mean-shift clustering (Arias-Castro et al., 2016), ridge estimation (Chen et al., 2015a, b) and inference for density level sets (Chen et al., 2017), cluster density trees (Balakrishnan et al., 2013; Kim et al., 2016) and persistence diagrams (Fasy et al., 2014; Chazal et al., 2014).

Asymptotic and finite-sample bounds on $\sup_{x} |\hat{p}_h(x) - p_h(x)|$ under the existence of a Lebesgue density have been well studied in the fixed-bandwidth case (Rao, 1983; Giné and Guillou, 2002; Sriperumbudur and Steinwart, 2012).

Bounds for KDEs that are uniform not only in $x$ but also in the choice of the bandwidth have received relatively less attention, although such bounds are important for understanding the consistency of KDEs with adaptive bandwidths, which may depend on the location or on the random sample. Einmahl et al. (2005) showed that $\sup_{x} |\hat{p}_h(x) - p_h(x)|$ converges almost surely at the rate $\sqrt{\left(\log(1/h) \vee \log\log n\right) / (n h^d)}$, uniformly over suitable ranges of the bandwidth, for regular kernels and bounded Lebesgue densities. Jiang (2017) provided a finite-sample counterpart of this bound, uniform over $h$, and extended it to densities supported on a manifold.

The main goal of this paper is to extend existing uniform bounds on KDEs by weakening the conditions on the kernel and by making them adaptive to the intrinsic dimension of the underlying distribution, which is allowed to be supported on lower-dimensional sets, such as manifolds. Specifically, define the volume dimension $d_{\mathrm{vol}}$ to be a nonnegative number satisfying

$$\sup_{x \in \mathbb{R}^d,\, r > 0} \frac{P\left(B(x, r)\right)}{r^{d_{\mathrm{vol}}}} < \infty, \qquad (2)$$

where $B(x, r)$ denotes the ball of radius $r$ centered at $x$.

We show that, if the kernel $K$ satisfies mild regularity conditions, then with probability at least $1 - \delta$,

$$\sup_{x \in \mathbb{R}^d,\, h \ge l} \left|\hat{p}_h(x) - p_h(x)\right| \le C \sqrt{\frac{\log(1/l) + \log(1/\delta)}{n\, l^{2d - d_{\mathrm{vol}}}}}, \qquad (3)$$

where $l > 0$ is a lower bound on the allowed bandwidths and $C$ is a constant which depends neither on $n$ nor on $l$. If the distribution has a bounded Lebesgue density, then $d_{\mathrm{vol}} = d$, so our result matches the previous results in the literature in terms of the order of convergence. For a density supported on an $m$-dimensional manifold, $d_{\mathrm{vol}} = m$. Thus, if the KDE is defined with the correct normalizing factor $h^m$ instead of $h^d$, our rate also recovers the ones in the literature on densities supported on manifolds.

We make the following contributions:

  1. We derive high-probability finite-sample bounds for $\sup_{x} |\hat{p}_h(x) - p_h(x)|$, uniformly over the choice of $h \ge l$, for a given lower bound $l$ possibly depending on $n$.

  2. We derive rates of consistency adaptive to the intrinsic dimension of the distribution under conditions that are, to the best of our knowledge, weaker than the ones existing in the literature.

  3. We also obtain analogous bounds for all higher-order derivatives of $\hat{p}_h$ and $p_h$.

The closest results to the ones we present are by Jiang (2017), who relies on relative VC bounds to derive finite-sample bounds on $\sup_{x} |\hat{p}_h(x) - p_h(x)|$ for a special class of kernels and under the assumption that $P$ has a well-behaved support. Our analysis relies instead on more sophisticated techniques rooted in empirical process theory, as outlined in Sriperumbudur and Steinwart (2012), and is applicable to a broader class of kernels. In addition, our conditions on the support of $P$ are more general.

2 Uniform convergence of the Kernel Density Estimator

We first characterize the intrinsic dimension of the distribution by the rate of growth of the probability it assigns to balls, i.e. we define the volume dimension $d_{\mathrm{vol}}$ to be a nonnegative number satisfying

$$\sup_{x \in \mathbb{R}^d,\, r > 0} \frac{P\left(B(x, r)\right)}{r^{d_{\mathrm{vol}}}} < \infty. \qquad (4)$$

It can be easily shown that, once such a $d_{\mathrm{vol}}$ exists, it cannot be greater than the dimension $d$ of the ambient space $\mathbb{R}^d$.

The condition on the distribution in (4) is general enough to cover most of the usual conditions on $P$ in the uniform KDE convergence literature. For instance, when $P$ has a bounded density $p$ with respect to the $d$-dimensional Lebesgue measure $\lambda_d$, then the probability of the ball $B(x, r)$ is bounded as

$$P\left(B(x, r)\right) \le \|p\|_\infty\, \lambda_d\left(B(x, r)\right) = \|p\|_\infty\, \omega_d\, r^d, \qquad (5)$$

where $\omega_d$ is the volume of the unit ball in $\mathbb{R}^d$, and hence $d_{\mathrm{vol}} = d$. If $P$ is supported on an $m$-dimensional manifold with positive reach and has a bounded density with respect to the uniform measure on the manifold, it can be shown that $d_{\mathrm{vol}} = m$ (see, e.g. Chazal, 2013, Proposition 1.1).
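At the other extreme, a simple consequence of condition (4) as stated is that a distribution with an atom has volume dimension zero: if $P(\{x_0\}) = q > 0$ for some $x_0$, then $P\left(B(x_0, r)\right) \ge q$ for every $r > 0$, so a bound of the form $P\left(B(x_0, r)\right) \le C r^{\nu}$ can hold for all $r > 0$ only when $\nu = 0$, and hence $d_{\mathrm{vol}} = 0$. This covers, in particular, the two-point-mass example from the introduction.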

To obtain a uniform convergence bound for the kernel density estimator, we first rewrite

$$\sup_{x \in \mathbb{R}^d,\, h \ge l} \left|\hat{p}_h(x) - p_h(x)\right|$$

as a supremum over a function class. Formally, for $x \in \mathbb{R}^d$ and $h > 0$, let $K_{x,h}(\cdot) := K\!\left(\frac{x - \cdot}{h}\right)$. Define

$$\mathcal{K}_l := \left\{ K_{x,h} : x \in \mathbb{R}^d,\ h \ge l \right\}$$

to be a class of unnormalized kernel functions centered at each element of $\mathbb{R}^d$ with bandwidth greater than or equal to $l$, and let

$$\mathcal{F}_l := \left\{ \frac{1}{h^d} K_{x,h} : x \in \mathbb{R}^d,\ h \ge l \right\}$$

be the corresponding class of normalized kernel functions. Then $\sup_{x,\, h \ge l} |\hat{p}_h(x) - p_h(x)|$ can be rewritten as the supremum of an empirical process indexed by $\mathcal{F}_l$, that is,

$$\sup_{x \in \mathbb{R}^d,\, h \ge l} \left|\hat{p}_h(x) - p_h(x)\right| = \sup_{f \in \mathcal{F}_l} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E}_P\left[f(X)\right] \right|, \qquad (6)$$

since taking $f = h^{-d} K_{x,h} \in \mathcal{F}_l$ gives $\frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_P[f(X)] = \hat{p}_h(x) - p_h(x)$.

To get a bound on (6), the function class $\mathcal{F}_l$, or equivalently $\mathcal{K}_l$, should not be too large. One common approach is to assume that the class is a uniformly bounded VC-class, whose complexity is characterized by how many balls are required to cover the entire function class (Giné and Guillou, 1999; Sriperumbudur and Steinwart, 2012).

Assumption 1.

Let $K$ be a kernel function with $\|K\|_\infty < \infty$. We assume that

$$\mathcal{K} := \left\{ K\!\left(\frac{x - \cdot}{h}\right) : x \in \mathbb{R}^d,\ h > 0 \right\}$$

is a uniformly bounded VC-class with dimension $\nu$, i.e. there exist positive numbers $A$ and $\nu$ such that, for every probability measure $Q$ on $\mathbb{R}^d$ and for every $\epsilon \in (0, \|K\|_\infty)$, the covering number satisfies

$$\mathcal{N}\left(\mathcal{K}, L_2(Q), \epsilon\right) \le \left(\frac{A \|K\|_\infty}{\epsilon}\right)^{\nu},$$

where the covering number $\mathcal{N}\left(\mathcal{K}, L_2(Q), \epsilon\right)$ is defined as the minimal number of open balls of radius $\epsilon$ with respect to the $L_2(Q)$ distance, with centers in $\mathcal{K}$, needed to cover $\mathcal{K}$.

We also impose an integrability condition on the kernel $K$:

$$\int_0^{\|K\|_\infty} t \left( \sup\left\{ \|y\| : K(y) \ge t \right\} \right)^{d_{\mathrm{vol}}} dt < \infty. \qquad (7)$$
Remark 1.

It is important to note that the integrability condition in (7) is weak and can be satisfied under kernel conditions commonly used in the literature. For instance, if the kernel function decays at a polynomial rate strictly faster than $d_{\mathrm{vol}}/2$ (which is at most $d/2$) as $\|y\| \to \infty$, that is, if

$$K(y) \le C_K \|y\|^{-\beta} \quad \text{for some } \beta > d_{\mathrm{vol}}/2$$

for all $y$ with $\|y\|$ large enough, then the integrability condition (7) is satisfied. Also, if the kernel function is spherically symmetric, that is, if there exists $k : [0, \infty) \to [0, \infty)$ with $K(y) = k(\|y\|)$, then the integrability condition (7) is satisfied provided $\int_0^\infty r^{d_{\mathrm{vol}} - 1} k(r)^2\, dr < \infty$. Kernels with bounded support also satisfy condition (7). In particular, most commonly used kernels, including the uniform, Epanechnikov, and Gaussian kernels, satisfy the integrability condition.
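For reference, minimal Python forms of the bounded-support kernels named in the remark are sketched below (our own illustrative code, written to be usable with the KDE sketch after equation (1); normalizing constants, which depend on the dimension, are omitted, and the Gaussian kernel was given earlier).

    import numpy as np

    def uniform_kernel(u):
        # uniform (boxcar) kernel on the unit Euclidean ball, up to its normalizing constant
        return (np.sum(u ** 2, axis=1) <= 1.0).astype(float)

    def epanechnikov_kernel(u):
        # Epanechnikov kernel (1 - ||u||^2)_+ , up to its normalizing constant
        return np.clip(1.0 - np.sum(u ** 2, axis=1), 0.0, None)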

We combine Talagrand's inequality with a VC-type bound to control (6), generalizing the approach in Sriperumbudur and Steinwart (2012, Theorem 3.1). The following version of Talagrand's inequality is from Bousquet (2002, Theorem 2.3), as simplified in Steinwart and Christmann (2008, Theorem 7.5).

Proposition 2.

(Bousquet, 2002, Theorem 2.3; Steinwart and Christmann, 2008, Theorem 7.5 and Theorem A.9.1)

Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $X_1, \ldots, X_n$ be i.i.d. from $P$. Let $\mathcal{F}$ be a class of functions from $\Omega$ to $\mathbb{R}$ that is separable in $L_\infty(\Omega)$. Suppose all functions in $\mathcal{F}$ are $P$-measurable, and there exist constants $B > 0$ and $\sigma > 0$ such that

$$\mathbb{E}_P[f] = 0, \qquad \mathbb{E}_P[f^2] \le \sigma^2, \qquad \|f\|_\infty \le B$$

for all $f \in \mathcal{F}$. Let

$$Z := \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n f(X_i).$$

Then for any $\tau > 0$,

$$P\left( Z \ge \mathbb{E}[Z] + \sqrt{\frac{2\tau\left(\sigma^2 + 2 B\, \mathbb{E}[Z]\right)}{n}} + \frac{2\tau B}{3n} \right) \le e^{-\tau}.$$
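As a sanity check on the roles of $\sigma^2$ and $B$ (a sketch based on the statement of Proposition 2 as given above, not taken from the paper), consider a single centered function, in which case Proposition 2 reduces to a Bernstein-type inequality. If $\mathcal{F} = \{f\}$ with $\mathbb{E}_P[f] = 0$, $\mathbb{E}_P[f^2] \le \sigma^2$ and $\|f\|_\infty \le B$, then $Z = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and $\mathbb{E}[Z] = 0$, so for any $\tau > 0$

$$P\left( \frac{1}{n}\sum_{i=1}^n f(X_i) \ge \sqrt{\frac{2\tau\sigma^2}{n}} + \frac{2\tau B}{3n} \right) \le e^{-\tau},$$

which is Bernstein's inequality up to constants.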

By using Talagrand's inequality, $\sup_{f \in \mathcal{F}_l} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_P[f(X)] \right|$ can be upper bounded in terms of $B$, $\sigma^2$, $n$, and the expectation

$$\mathbb{E}\left[ \sup_{f \in \mathcal{F}_l} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_P[f(X)] \right| \right]. \qquad (8)$$

To bound the last expectation (8), we use the uniformly bounded VC class assumption on the kernel. The following bound on the expected suprema of empirical processes of VC classes of functions is from Giné and Guillou (2001, Proposition 2.1).

Proposition 3.

(Giné and Guillou, 2001, Proposition 2.1; Sriperumbudur and Steinwart, 2012, Theorem A.2)

Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $X_1, \ldots, X_n$ be i.i.d. from $P$. Let $\mathcal{F}$ be a class of functions from $\Omega$ to $\mathbb{R}$ that is a uniformly bounded VC-class with dimension $\nu$, i.e. there exist positive numbers $A$ and $B$ such that, for all $f \in \mathcal{F}$, $\|f\|_\infty \le B$, and the covering number satisfies

$$\mathcal{N}\left(\mathcal{F}, L_2(Q), \epsilon\right) \le \left(\frac{A B}{\epsilon}\right)^{\nu}$$

for every probability measure $Q$ on $\Omega$ and for every $\epsilon \in (0, B)$. Let $\sigma$ be a positive number such that $\mathbb{E}_P[f^2] \le \sigma^2 \le B^2$ for all $f \in \mathcal{F}$. Then there exists a universal constant $C$ not depending on any parameters such that

$$\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \left( f(X_i) - \mathbb{E}_P[f(X)] \right) \right| \right] \le C \left( \nu B \log\frac{A B}{\sigma} + \sqrt{\nu\, n\, \sigma^2 \log\frac{A B}{\sigma}} \right).$$

By applying Proposition 2 and Proposition 3 to $\mathcal{F}_l$, it can be shown that the upper bound of

$$\sup_{x \in \mathbb{R}^d,\, h \ge l} \left|\hat{p}_h(x) - p_h(x)\right|$$

can be written as a function of $B$ and $\sigma^2$. When the lower bound $l$ of the bandwidth interval is not too small, the terms involving $\sigma^2$ are dominant. Hence, to get a good upper bound with respect to both $n$ and $l$, it is important to get a tight upper bound for $\mathbb{E}_P\!\left[ K^2\!\left(\frac{x - X}{h}\right) \right]$. Under the existence of a Lebesgue density $p$ of $P$, it can be shown that

$$\mathbb{E}_P\!\left[ K^2\!\left(\frac{x - X}{h}\right) \right] = \int_{\mathbb{R}^d} K^2\!\left(\frac{x - y}{h}\right) p(y)\, dy \le \|p\|_\infty\, h^d \int_{\mathbb{R}^d} K^2(u)\, du$$

by a change of variables (see, e.g., the proof of Proposition A.5 in Sriperumbudur and Steinwart (2012)).

For general distributions, the change of variables is no longer applicable. However, we can still bound $\mathbb{E}_P\!\left[ K^2\!\left(\frac{x - X}{h}\right) \right]$ in terms of the volume dimension $d_{\mathrm{vol}}$.

Lemma 4.

Let $P$ be a probability distribution on $\mathbb{R}^d$ and let $x \in \mathbb{R}^d$ and $h > 0$. For any kernel $K$ satisfying the integrability condition (7), the expectation of the square of the kernel is upper bounded as

$$\mathbb{E}_P\!\left[ K^2\!\left(\frac{x - X}{h}\right) \right] \le C_{K, d_{\mathrm{vol}}}\, h^{d_{\mathrm{vol}}},$$

where $C_{K, d_{\mathrm{vol}}}$ is a constant depending only on $K$ and $d_{\mathrm{vol}}$.
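The scaling in Lemma 4 can be illustrated numerically. The sketch below (our own, with a Gaussian kernel) Monte Carlo-estimates $\mathbb{E}_P\!\left[K^2\!\left(\frac{x - X}{h}\right)\right]$ for data uniform on the unit circle in $\mathbb{R}^2$ (volume dimension 1) and uniform on a square (volume dimension 2); the fitted log-log slopes in $h$ come out close to 1 and 2, respectively.

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_kernel(u):
        d = u.shape[1]
        return np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

    def mean_sq_kernel(x, data, h):
        # Monte Carlo estimate of E_P[ K((x - X)/h)^2 ]
        return np.mean(gaussian_kernel((x - data) / h) ** 2)

    n = 200_000
    theta = rng.uniform(0, 2 * np.pi, n)
    circle = np.column_stack([np.cos(theta), np.sin(theta)])   # intrinsic dimension 1
    square = rng.uniform(-2, 2, size=(n, 2))                   # intrinsic dimension 2

    x = np.array([1.0, 0.0])       # a point lying on the circle and inside the square
    hs = np.array([0.1, 0.2, 0.4, 0.8])
    for name, data in [("circle", circle), ("square", square)]:
        vals = np.array([mean_sq_kernel(x, data, h) for h in hs])
        slope = np.polyfit(np.log(hs), np.log(vals), 1)[0]
        print(name, "log-log slope in h:", round(slope, 2))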

2.1 Uniformity on a ray of bandwidths

In this subsection, we build a uniform convergence bound for the kernel density estimator which is uniform over a ray of bandwidths $[l, \infty)$.

We first discuss sufficient conditions for Assumption 1, which requires that the function class

$$\mathcal{K} = \left\{ K\!\left(\frac{x - \cdot}{h}\right) : x \in \mathbb{R}^d,\ h > 0 \right\}$$

is not too complex. Since $\mathcal{K}$ is contained in the larger class of all rescaled translates of $K$, it is sufficient to impose the uniformly bounded VC-class condition on that larger function class. This is implied by the corresponding kernel conditions in Giné et al. (2004) and in Giné and Guillou (2001). In particular, the condition is satisfied when $K(x) = \phi(q(x))$, where $q$ is a polynomial and $\phi$ is a bounded real function of bounded variation, as in Nolan and Pollard (1987).
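For instance (our own illustrative check; the decomposition below is not taken from the paper), the Gaussian kernel fits this structure:

$$K(x) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\|x\|^2\right) = \phi\left(q(x)\right), \qquad q(x) = \sum_{j=1}^d x_j^2,$$

with $q$ a polynomial and $\phi(t) = (2\pi)^{-d/2} e^{-t/2}$ for $t \ge 0$ (extended as a constant for $t < 0$), which is bounded and monotone, hence of bounded variation.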

Under Assumption 1, we derive our main concentration inequality for $\sup_{x \in \mathbb{R}^d,\, h \ge l} |\hat{p}_h(x) - p_h(x)|$.

Theorem 5.

Let $P$ be a probability distribution and let $K$ be a kernel function satisfying Assumption 1. Then, with probability at least $1 - \delta$,

(9)

where $C$ is a constant depending only on the kernel, the quantities appearing in Assumption 1, and the volume dimension $d_{\mathrm{vol}}$; in particular, $C$ does not depend on $n$ or $l$.

When $\delta$ is fixed and $n \to \infty$, the two dominating terms in (9) are of order $\frac{\log(1/l)}{n l^{d}}$ and $\sqrt{\frac{\log(1/l)}{n l^{2d - d_{\mathrm{vol}}}}}$. If $l$ does not go to $0$ too fast, then the second term dominates the upper bound in (9), as in the following corollary.

Corollary 6.

Let $P$ be a probability distribution and let $K$ be a kernel function satisfying Assumption 1. Suppose

Then, with probability at least $1 - \delta$,

where $C$ is a constant depending only on the kernel, the quantities appearing in Assumption 1, and the volume dimension $d_{\mathrm{vol}}$, and in particular not on $n$ or $l$.

2.2 Fixed bandwidth

In this subsection, we study a uniform convergence bound for the kernel density estimator with a fixed bandwidth $h > 0$. We are interested in a high-probability bound on

$$\sup_{x \in \mathbb{R}^d} \left|\hat{p}_h(x) - p_h(x)\right|.$$

Of course, it can be bounded by the results in the previous subsection, because

$$\sup_{x \in \mathbb{R}^d} \left|\hat{p}_h(x) - p_h(x)\right| \le \sup_{x \in \mathbb{R}^d,\, h' \ge h} \left|\hat{p}_{h'}(x) - p_{h'}(x)\right|. \qquad (10)$$

Therefore, the convergence bounds uniform over a ray of bandwidths in Theorem 5 and Corollary 6 are applicable to the fixed-bandwidth case.

When the support of $P$ is bounded, that is, when there exists $R > 0$ such that $\mathrm{supp}(P) \subset B(0, R)$, then, for the kernel density estimator with an $L$-Lipschitz continuous kernel and a fixed bandwidth, we can derive a uniform convergence bound without the finite VC condition of Giné and Guillou (2001) and Giné et al. (2004), based on the following lemma.

Lemma 7.

Suppose there exists $R > 0$ with $\mathrm{supp}(P) \subset B(0, R)$, and let the kernel $K$ be $L$-Lipschitz continuous. Then for all $\epsilon > 0$, the supremum of the $\epsilon$-covering number over all probability measures $Q$ is upper bounded as

Corollary 8.

Suppose there exists $R > 0$ with $\mathrm{supp}(P) \subset B(0, R)$. Let $K$ be an $L$-Lipschitz continuous kernel function satisfying the integrability condition (7). If

then, with probability at least $1 - \delta$,

where $C$ is a constant not depending on $n$ or $h$.

3 Uniform convergence of the Derivatives of the Kernel Density Estimator

In this section, we build an analogous uniform convergence bound for the derivatives of the kernel density estimator. For a nonnegative integer vector $\alpha = (\alpha_1, \ldots, \alpha_d)$, define $|\alpha| = \sum_{i=1}^d \alpha_i$ and

$$\partial^{\alpha} = \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}.$$

For the operator $\partial^{\alpha}$ to be well defined on $p_h$ and to interchange with integration, we need the following smoothness condition on the kernel $K$.

Assumption 2.

For a given multi-index $\alpha$, let $K$ be a kernel function satisfying the following: the partial derivative $\partial^{\alpha} K$ exists and $\|\partial^{\alpha} K\|_\infty < \infty$.

Under Assumption 2, Leibniz's rule is applicable and $\partial^{\alpha} \hat{p}_h(x)$ can be written as

$$\partial^{\alpha} \hat{p}_h(x) = \frac{1}{n h^{d + |\alpha|}} \sum_{i=1}^n \left(\partial^{\alpha} K\right)\!\left(\frac{x - X_i}{h}\right) = \frac{1}{n h^{d}} \sum_{i=1}^n \partial^{\alpha}_x K_{x,h}(X_i),$$

where $K_{x,h}$ is as we defined it in Section 2. Analogously to Section 2, let

$$\mathcal{K}^{\alpha}_l := \left\{ \left(\partial^{\alpha} K\right)\!\left(\frac{x - \cdot}{h}\right) : x \in \mathbb{R}^d,\ h \ge l \right\}$$

be a class of unnormalized derivative kernel functions centered at each element of $\mathbb{R}^d$ with bandwidth greater than or equal to $l$, and let

$$\mathcal{F}^{\alpha}_l := \left\{ \frac{1}{h^{d + |\alpha|}} \left(\partial^{\alpha} K\right)\!\left(\frac{x - \cdot}{h}\right) : x \in \mathbb{R}^d,\ h \ge l \right\}$$

be the corresponding class of normalized functions. Then, similarly to (6), $\sup_{x,\, h \ge l} \left|\partial^{\alpha}\hat{p}_h(x) - \partial^{\alpha}p_h(x)\right|$ can be rewritten as

$$\sup_{x \in \mathbb{R}^d,\, h \ge l} \left| \partial^{\alpha}\hat{p}_h(x) - \partial^{\alpha}p_h(x) \right| = \sup_{f \in \mathcal{F}^{\alpha}_l} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_P\left[f(X)\right] \right|. \qquad (11)$$
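As a concrete instance of the derivative estimator above, the following minimal sketch (our own, for $d = 1$, $\alpha = 1$, and the Gaussian kernel, whose derivative is $K'(u) = -u\, K(u)$) computes the derivative KDE in one dimension.

    import numpy as np

    def gaussian_kernel_deriv(u):
        # first derivative of the standard Gaussian kernel: K'(u) = -u * K(u)
        return -u * np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

    def kde_derivative(x, data, h):
        # derivative KDE in one dimension (d = 1, |alpha| = 1):
        # d/dx p_hat_h(x) = (1 / (n * h^2)) * sum_i K'((x - X_i) / h)
        x = np.atleast_1d(x)
        n = data.shape[0]
        return np.array([gaussian_kernel_deriv((xi - data) / h).sum() for xi in x]) / (n * h ** 2)

    # usage: the derivative of a standard normal density is roughly recovered
    rng = np.random.default_rng(1)
    X = rng.standard_normal(2000)
    print(kde_derivative([-1.0, 0.0, 1.0], X, h=0.3))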

To have a uniform supremum bound on (11), the function class $\mathcal{F}^{\alpha}_l$ should not be too complex. As in the kernel density estimator case, we assume that the corresponding class of derivative kernels is a uniformly bounded VC-class.

Assumption 3.

Let $K$ be a kernel function with $\|\partial^{\alpha} K\|_\infty < \infty$. We assume that

$$\mathcal{K}^{\alpha} := \left\{ \left(\partial^{\alpha} K\right)\!\left(\frac{x - \cdot}{h}\right) : x \in \mathbb{R}^d,\ h > 0 \right\}$$

is a uniformly bounded VC-class with dimension $\nu$, i.e. there exist positive numbers $A$ and $\nu$ such that, for every probability measure $Q$ on $\mathbb{R}^d$ and for every $\epsilon \in (0, \|\partial^{\alpha} K\|_\infty)$, the covering number satisfies

$$\mathcal{N}\left(\mathcal{K}^{\alpha}, L_2(Q), \epsilon\right) \le \left(\frac{A\, \|\partial^{\alpha} K\|_\infty}{\epsilon}\right)^{\nu}.$$

We also impose an integrability condition on the derivative of the kernel $K$:

$$\int_0^{\|\partial^{\alpha} K\|_\infty} t \left( \sup\left\{ \|y\| : \left|\partial^{\alpha} K(y)\right| \ge t \right\} \right)^{d_{\mathrm{vol}}} dt < \infty. \qquad (12)$$

Again, to get a good upper bound of

$$\sup_{x \in \mathbb{R}^d,\, h \ge l} \left| \partial^{\alpha}\hat{p}_h(x) - \partial^{\alpha}p_h(x) \right|,$$

getting a tight upper bound for $\mathbb{E}_P\!\left[ \left(\partial^{\alpha} K\right)^2\!\left(\frac{x - X}{h}\right) \right]$ is important.

Under the integrability condition (12), we can bound $\mathbb{E}_P\!\left[ \left(\partial^{\alpha} K\right)^2\!\left(\frac{x - X}{h}\right) \right]$ in terms of the volume dimension $d_{\mathrm{vol}}$ as follows, which is analogous to Lemma 4.

Lemma 9.

Let $P$ be a probability distribution on $\mathbb{R}^d$ and let $x \in \mathbb{R}^d$ and $h > 0$. For any kernel $K$ satisfying the integrability condition (12), the expectation of the square of the derivative of the kernel is upper bounded as

$$\mathbb{E}_P\!\left[ \left(\partial^{\alpha} K\right)^2\!\left(\frac{x - X}{h}\right) \right] \le C_{\partial^{\alpha} K, d_{\mathrm{vol}}}\, h^{d_{\mathrm{vol}}},$$

where $C_{\partial^{\alpha} K, d_{\mathrm{vol}}}$ is a constant depending only on $\partial^{\alpha} K$ and $d_{\mathrm{vol}}$.

To bound (11) with high probability, we combine Talagrand's inequality and the VC-type bound with Lemma 9. The following theorem provides a high-probability upper bound for (11), which is analogous to Theorem 5.

Theorem 10.

Let $P$ be a probability distribution and let $K$ be a kernel function satisfying Assumptions 2 and 3. Then, with probability at least $1 - \delta$,

(13)

where $C$ is a constant depending only on the kernel, the quantities appearing in Assumption 3, and the volume dimension $d_{\mathrm{vol}}$; in particular, $C$ does not depend on $n$ or $l$.

When $l$ does not go to $0$ too fast, the second term dominates the upper bound in (13), as in the following corollary, which is analogous to Corollary 6.

Corollary 11.

Let $P$ be a probability distribution and let $K$ be a kernel function satisfying Assumptions 2 and 3. Suppose

Then, with probability at least $1 - \delta$,

where $C$ is a constant depending only on the kernel, the quantities appearing in Assumption 3, and the volume dimension $d_{\mathrm{vol}}$, and in particular not on $n$ or $l$.

Now we consider the case when the bandwidth is fixed at a value $h > 0$. We are interested in a high-probability bound on

$$\sup_{x \in \mathbb{R}^d} \left| \partial^{\alpha}\hat{p}_h(x) - \partial^{\alpha}p_h(x) \right|.$$

Of course, it can be bounded by the results in the previous subsection, because

$$\sup_{x \in \mathbb{R}^d} \left| \partial^{\alpha}\hat{p}_h(x) - \partial^{\alpha}p_h(x) \right| \le \sup_{x \in \mathbb{R}^d,\, h' \ge h} \left| \partial^{\alpha}\hat{p}_{h'}(x) - \partial^{\alpha}p_{h'}(x) \right|.$$

Therefore, the convergence bounds uniform over a ray of bandwidths in Theorem 10 and Corollary 11 are applicable to the fixed-bandwidth case.

When the support of $P$ is bounded, that is, when there exists $R > 0$ such that $\mathrm{supp}(P) \subset B(0, R)$, then, for a kernel whose derivative $\partial^{\alpha} K$ is $L$-Lipschitz continuous and a fixed bandwidth, we can derive a uniform convergence bound without the finite VC condition of Giné and Guillou (2001) and Giné et al. (2004), based on the following lemma.

Lemma 12.

Suppose there exists $R > 0$ with $\mathrm{supp}(P) \subset B(0, R)$. Also, suppose that $\partial^{\alpha} K$ is $L$-Lipschitz, i.e.

$$\left| \partial^{\alpha} K(y) - \partial^{\alpha} K(y') \right| \le L \left\| y - y' \right\| \quad \text{for all } y, y' \in \mathbb{R}^d.$$

Then for all $\epsilon > 0$, the supremum of the $\epsilon$-covering number over all probability measures $Q$ is upper bounded as

Corollary 13.

Suppose there exists $R > 0$ with $\mathrm{supp}(P) \subset B(0, R)$. Let $K$ be a kernel function whose derivative $\partial^{\alpha} K$ is $L$-Lipschitz continuous and which satisfies the integrability condition (12). If

then, with probability at least $1 - \delta$,

where $C$ is a constant not depending on $n$ or $h$.

References

  • Arias-Castro et al. [2016] Ery Arias-Castro, David Mason, and Bruno Pelletier. On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. The Journal of Machine Learning Research, 17(1):1487–1514, 2016.
  • Azizyan et al. [2015] Martin Azizyan, Yen-Chi Chen, Aarti Singh, and Larry Wasserman. Risk bounds for mode clustering. arXiv preprint arXiv:1505.00482, 2015.
  • Balakrishnan et al. [2013] Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679–2687, 2013.
  • Bousquet [2002] O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris, Ser. I, 334:495–500, 2002.
  • Chacón et al. [2015] José E Chacón et al. A population background for nonparametric density-based clustering. Statistical Science, 30(4):518–532, 2015.
  • Chazal [2013] Frédéric Chazal. An upper bound for the volume of geodesic balls in submanifolds of Euclidean spaces. 2013.
  • Chazal et al. [2014] Frédéric Chazal, Brittany T Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Robust topological inference: Distance to a measure and kernel distance. arXiv preprint arXiv:1412.7197, 2014.
  • Chen et al. [2015a] Yen-Chi Chen, Christopher R Genovese, Shirley Ho, and Larry Wasserman. Optimal ridge detection using coverage risk. In Advances in Neural Information Processing Systems, pages 316–324, 2015a.
  • Chen et al. [2015b] Yen-Chi Chen, Christopher R Genovese, Larry Wasserman, et al. Asymptotic theory for density ridges. The Annals of Statistics, 43(5):1896–1928, 2015b.
  • Chen et al. [2017] Yen-Chi Chen, Christopher R Genovese, and Larry Wasserman. Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association, 112(520):1684–1696, 2017.
  • Einmahl et al. [2005] Uwe Einmahl, David M Mason, et al. Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33(3):1380–1403, 2005.
  • Fasy et al. [2014] Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, Aarti Singh, et al. Confidence sets for persistence diagrams. The Annals of Statistics, 42(6):2301–2339, 2014.
  • Giné and Guillou [1999] Evarist Giné and Armelle Guillou. Laws of the iterated logarithm for censored data. Ann. Probab., 27(4):2042–2067, 10 1999. doi: 10.1214/aop/1022874828. URL https://doi.org/10.1214/aop/1022874828.
  • Giné and Guillou [2001] Evarist Giné and Armelle Guillou. On consistency of kernel density estimators for randomly censored data: rates holding uniformly over adaptive intervals. Annales de l’Institut Henri Poincare (B) Probability and Statistics, 37(4):503 – 522, 2001. ISSN 0246-0203. doi: https://doi.org/10.1016/S0246-0203(01)01081-0. URL http://www.sciencedirect.com/science/article/pii/S0246020301010810.
  • Giné and Guillou [2002] Evarist Giné and Armelle Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l’Institut Henri Poincare (B) Probability and Statistics, volume 38, pages 907–921. Elsevier, 2002.
  • Giné et al. [2004] Evarist Giné, Vladimir Koltchinskii, and Joel Zinn. Weighted uniform consistency of kernel density estimators. Ann. Probab., 32(3B):2570–2605, 07 2004. doi: 10.1214/009117904000000063. URL https://doi.org/10.1214/009117904000000063.
  • Jiang [2017] Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017.
  • Kim et al. [2016] Jisu Kim, Yen-Chi Chen, Sivaraman Balakrishnan, Alessandro Rinaldo, and Larry Wasserman. Statistical inference for cluster trees. In Advances in Neural Information Processing Systems 29, pages 1839–1847. 2016.
  • Nolan and Pollard [1987] Deborah Nolan and David Pollard. U-processes: Rates of convergence. Ann. Statist., 15(2):780–799, 06 1987. doi: 10.1214/aos/1176350374. URL https://doi.org/10.1214/aos/1176350374.
  • Parzen [1962] Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
  • Rao [1983] BLS Prakasa Rao. Nonparametric functional estimation. Academic press, 1983.
  • Rinaldo and Wasserman [2010] Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.
  • Sriperumbudur and Steinwart [2012] Bharath Sriperumbudur and Ingo Steinwart. Consistency and rates for clustering with dbscan. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1090–1098, La Palma, Canary Islands, 21–23 Apr 2012. PMLR. URL http://proceedings.mlr.press/v22/sriperumbudur12.html.
  • Steinwart and Christmann [2008] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387772413.
  • Wand and Jones [1994] Matt P Wand and M Chris Jones. Kernel smoothing. Chapman and Hall/CRC, 1994.
  • Wang et al. [2017] Daren Wang, Xinyang Lu, and Alessandro Rinaldo. Optimal rates for cluster tree estimation using kernel density estimators. arXiv preprint arXiv:1706.03113, 2017.

Supplementary Material

Appendix A Uniform convergence on a function class

As we have seen in (6) in Section 2, a uniform bound on the kernel density estimator boils down to uniformly bounding an empirical process over the function class $\mathcal{F}_l$. In this section, we derive a uniform convergence bound for a more general class of functions. Let $\mathcal{F}$ be a class of functions from $\Omega$ to $\mathbb{R}$, and consider the random variable

$$Z := \sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}_P\left[f(X)\right] \right|. \qquad (14)$$
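As a concrete instance of (14) (our own illustration, assuming SciPy is available), taking $\mathcal{F}$ to be the class of indicators of half-lines makes $Z$ the classical Kolmogorov–Smirnov statistic:

    import numpy as np
    from scipy.stats import norm

    # Example: with F = { 1_{(-inf, t]} : t in R }, the supremum in (14) is the
    # Kolmogorov-Smirnov distance between the empirical and the true CDF.
    rng = np.random.default_rng(0)
    X = np.sort(rng.standard_normal(1000))
    F = norm.cdf(X)                                   # true CDF at the sorted sample
    ecdf_right = np.arange(1, X.size + 1) / X.size    # empirical CDF at each sample point
    ecdf_left = np.arange(0, X.size) / X.size         # empirical CDF just before each point
    Z = np.max(np.maximum(np.abs(ecdf_right - F), np.abs(ecdf_left - F)))
    print("Z in (14) for the half-line indicator class:", round(Z, 4))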

As discussed in Section 2, we combine Talagrand's inequality (Proposition 2) and the VC-type bound (Proposition 3) to bound (14), generalizing the approach in Sriperumbudur and Steinwart [2012, Theorem 3.1].

Theorem 14.

Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $X_1, \ldots, X_n$ be i.i.d. from $P$. Let $\mathcal{F}$ be a class of functions from $\Omega$ to $\mathbb{R}$ that is a uniformly bounded VC-class with dimension $\nu$, i.e. there exist positive numbers $A$ and $B$ such that, for all $f \in \mathcal{F}$, $\|f\|_\infty \le B$, and for every probability measure $Q$ on $\Omega$ and for every $\epsilon \in (0, B)$, the covering number satisfies

$$\mathcal{N}\left(\mathcal{F}, L_2(Q), \epsilon\right) \le \left(\frac{A B}{\epsilon}\right)^{\nu}.$$

Let $\sigma > 0$ be such that $\mathbb{E}_P[f^2] \le \sigma^2 \le B^2$ for all $f \in \mathcal{F}$. Then there exists a universal constant $C$ not depending on any parameters such that $Z$ in (14) is upper bounded, with probability at least $1 - \delta$, by

Proof of Theorem 14.

Define the centered class $\tilde{\mathcal{F}} := \left\{ f - \mathbb{E}_P[f] : f \in \mathcal{F} \right\}$. Then it is immediate to check that, for all $\tilde{f} \in \tilde{\mathcal{F}}$,

(15)

Now, $Z$ is expanded as

Hence, from (15), applying Proposition 2 to the expansion above gives the probabilistic bound on $Z$ as

(16)

It thus remains to bound the expectation term $\mathbb{E}[Z]$. Let