Generalized Mode and Ridge Estimation
The generalized density is a product of a density function and a weight function. For example, the average local brightness of an astronomical image is the probability of finding a galaxy times the mean brightness of the galaxy. We propose a method for studying the geometric structure of generalized densities. In particular, we show how to find the modes and ridges of a generalized density function using a modification of the mean shift algorithm and its variant, subspace constrained mean shift. Our method can be used to perform clustering and to calculate a measure of connectivity between clusters. We establish consistency and rates of convergence for our estimator and apply the methods to data from two astronomical problems.
Consider a random sample of the form , where is a random sample from a smooth density and each is a scalar random variable. The generalized density function (GDF), also known as an intensity function, is where . A kernel estimate of the GDF is
where is a kernel function and is the smoothing bandwidth. For simplicity, we henceforth assume that the function is a Gaussian kernel. The generalized density function and estimator are of interest in problems where the additional information in the ’s measure the importance, relevance, or intensity of the corresponding points. (In the language of point processes, the ’s are the points and the ’s are the marks.) Two common cases include:
represents a covariate associated with each point, and the generalized density weights the points according to the value of the covariate. For example, in galactic astronomy, the ’s might represent a galaxy’s location and the ’s the galaxy’s mass. Astronomers are interested in the “mass-density” , which describes the distribution of galaxy mass.
represents the measurement precision for each observation, so the generalized density weights the points according to how precisely they are measured.
This paper focuses on estimating the modes and ridges of the GDF because (i) they are often features of direct scientific interest, (ii) they provide useful and descriptive summaries of the GDF’ structure, and (iii) they can be used as inputs to clustering.
where is the eigenvalues of , the Hessian matrix of , and is a matrix with being eigenvectors of corresponding to eigenvalue .
The problem of estimating modes and ridges for density functions has been considered in the literature.ma For instance, [16, 14, 1] develop estimators for local modes, and  develop estimators for density ridges and establish the asymptotic properties of these estimators. The analogous problem for generalized density functions has not yet been considered.
2.1 Weighted Mean Shift
If we keep iterating, we end up at a mode of the kernel density estimator .
We define the weighted mean shift, which generates a path starting from a point by successive updates of the form
This is directly analogous to the ordinary mean-shift update but puts additional weight on the point . The path generated from each point eventually converges to a local mode of the generalized density estimate , an element of the set . Later we will see that this is a consistent estimator of .
Note that we use the fact that for the Gaussian kernel. After rearrangement,
is the mean shift vector that is always pointing toward the direction of gradient. Thus, the update rule (5) is to move to which follows the gradient ascent. By Morse theory, for a smooth function with non-degenerate Hessian matrix, the gradient ascent path will converge to one of the local modes. Accordingly, the update rule (5) ends up at one of the local modes.
2.2 Weight Subspace Constrained Mean Shift
 proposes the subspace-constrained mean-shift algorithm which can be used to find the ridges of . An analogous modification also works in the weighted case to find the ridges of the generalized density function. This update rule, which we call the weighted subspace constrained mean shift algorithm, is given by
where is the mean shift vector defined in (8) and with being the eigenvector corresponding to the -th eigenvalue (first is the largest) of the estimated Hessian matrix . The Hessian matrix is
where is the identity matrix. In practice, we can ignore the factor since it is just a scaling. This algorithm will push every point along a ‘projected gradient path’ until it arrives a point on .
3.1 Mode Clustering
A common application of the mean shift algorithm is to perform a type of clustering called mode clustering . The clusters are defined as the sets of points whose mean-shift paths converge to the same local mode. Using the weighted mean-shift algorithm yields a mode clustering based on the generalized density estimate.
Let be a smooth intensity function. For any point , we define a gradient path as follows
We denote as the destination of . By Morse theory [10, 17], must be one of the local modes of except the case is in a set of Lebesque measure (including saddle points and local minimums). Let be the collection of local modes of . We define the cluster of by
In practice, we cluster data points by their destination of the weighted mean shift (5).
3.2 Connectivity Measure for Clusters
Some clusters are fairly isolated while other are close together. Here we show how to measure how close clusters are by defining a notion of connectivity. The idea is that the mean shift iterations can be thought of particles moving according to a diffusion (Markov chain).
We define a diffusion as follows. The probability of jumping from to is
This defines a diffusion and we denote be the random variable of the above diffusion. i .e. Now
which is the update rule (5). The same result holds for non-weighted mean shift. Thus, the mean shift can be viewed as a expectation for a certain diffusion.
Let be the local modes of . Motivated by (13), we define
where . Note that we impute the weight for each local mode by an estimate of . We also define each mode to be an absorbing state. Namely, the transition probability to/from each local mode to itself is .
Let be a transition matrix with states such that the first states are the estimated local modes and the latter states are the data points. Then the transition matrix can be factorized by
and is the identity matrix.
Then the matrix is the absorbing matrix; that is, the absorbing probability from to the local mode is . We define the connectivity for the two clusters corresponding to local modes as
where is the data points belonging to cluster .
The interpretation of is as follows. is the average hitting probability from points in cluster that end up at mode first, and vice versa. Connectivity will be large when two clusters are close and the boundary between them has high density. If we think of the (hard) cluster assignments as class labels, the connectivity is analogous to the mis-classification rate between class and class .
4 Statistical Analysis
The modes and ridges of are estimates of the modes and ridges of . In this section we study the statistical properties of these estimators. Let be the set of bounded, times continuously differentiable functions. Let be the max norm of a vector or a matrix . For a smooth function , we define the following operators. A vector of non-negative integers is called a multi-index with and corresponding derivative operator
where is often written as . Namely, each is a -th order partial derivative of . For , define the following norms associated with derivatives
Note that for two functions , is a norm that measures the differences between to the -th order differentiation.
The random variable is uniformly bounded by a constant .
The function and is a Morse function (the Hessian matrix is non-singular at all critical values).
The kernel function and is symmetric, non-negative and
for all .
The kernel function satisfies condition of . That is, there exists some such that for all , where is the covering number for a semi-metric space and
Assumptions (A1-2) are mild regularity conditions. Condition (K1) is common for a kernel function and (K2) is the smoothness condition for the kernel function. In particular, the Gaussian kernel and any smooth kernel with compact support satisfies (K1-2).
4.1 Risk of
Define the mean integrated square errors (MISE) by
for . Note that as we obtain the usual MISE for . This is just an extension to higher order derivatives.
Assume (A1-2) and (K1). Then
 proves the above Theorem for usual kernel density estimation. We omit the proof as it is similar to their proof.
We also have the following uniform bound.
For a smooth function , let be defined as the above. Assume (A1-2) and (K1-2). Then
4.2 Mode Estimation and Ridge Recovery
In this section we assume that the density is supported on a compact subset . For two sets , the Hausdorff distance is given by
where and .
Let be the collection of local modes of and respectively. Assume (A1-2), (K1-2) and
there exists such that where is the first eigenvalue to the Hessian matrix of .
When is sufficiently small,
The proof of Theorem 3 is in the supplementary material. Here we present an outline for the proof.
Proof Outline. Let be the estimated local modes. Note that the number of estimated modes might be different from the number of true modes. However, by assumption (M) and the fact is sufficiently small, we have and . This follows from the bounds on eigenvalues of Hessian matrix and the estimated gradient. Moreover, we cannot place two estimated modes near any true mode since there is always a saddle point between two (estimated) modes and this cannot happen due to assumption (M) on the first eigenvalue. After rearranging the indices, each is close to for all . Note that so that Taylor’s theorem implies
where is a point between . Thus, . By similar technique for proving rates of convergence of the kernel density estimator, we have . And is uniformly bounded when is small. We conclude
The above theorem shows that every local mode of the intensity function will be estimated by the weighted mean shift algorithm as long as is sufficiently small. By Theorem 2, is converging to so that this result eventually holds. The rate comes from the bias-variance decomposition. Note that the variance is since the modes are defined via the gradient and this is the variance in estimating gradients.
Let be the ridges of and respectively. Assume (A1-2), (K1-2) and the following two conditions:
There exists and such that and and for all ,
There exists such that
where is defined in (R1).
When is sufficiently small,
Proof Outline. The proof is essentially the same as in . In particular, our condition (A1-2) together with (R1) implies the conditions in . We assumed an additional condition (R2) so that, when is sufficiently small, the ridge estimator . This is because and its first, second derivatives will be bounded near with high probability and thus the eigenvalue and projected gradient for the estimator will be similar to the truth. Hence, by Theorem 6 and 7 in , the result follows.
We apply our method to two astronomy problems. The detailed description of the data can be found in the supplementary material.
The first problem analyzes data from the Sloan Digit Sky Survey (SDSS), where each data point is a galaxy at a specific location. The original data set is three-dimensional, but to ease visualization, we take a thin slice of the data to transform it into a two-dimensional problem. For each galaxy, we also have information on the galaxy’s total luminosity, which can be transformed into a proxy for the galaxy’s mass. We weight each galaxy by the mass and estimate the mass-density field and find the modes (red crosses) and ridges (blue curves). In cosmology, the modes correspond to galaxy clusters and the ridges correspond to filaments. Figure 2 shows the results of our analysis.
The second problem is to identify galaxies in a single image containing several galaxies. The data set is a noisy image of four overlapping galaxies. We first transform the image into gray scale and convolve with a Gaussian kernel with bandwidth to reconstruct the GDF. We focus only the region with intensity larger than . We apply the weighted mean shift algorithm and perform clustering to segment the data and construct the connectivity measure for clusters.
Figure 3 shows the results. We successfully detect the four galaxies and the weighted mean shift clustering classifies regions belonging to each galaxy. The connectivity matrix seems to capture the level of “interaction” among the four galaxies. For instance, clusters and have high connectivity, reflecting their large overlap. In contrast, clusters and have only moderate connectivity.
In this paper, we generalized the mode and ridge estimation from densities to generalized densities that can account for weighting or “marks.” We have established convergence rate for estimating modes and ridges in this case. Future work will be focused on constructing confidence sets for the estimated modes and ridges.
Appendix A Proof for Theorem 3
Proof. The proof consists of two steps. At the first step, we prove that the number of are the same and each element of correspond to a close element in . The second step is to prove the rate of convergence.
(this is what we mean is sufficiently small) Note that by Weyl’s theorem (Theorem 4.3.1 in ), (21) implies that the eigenvalue difference between and its estimator is bounded by . Thus, eigenvalues of within is upper bounded by . We will use this fact later.
Step 1. Let be the estimated local modes. Note that the number of estimated modes might be different from the number of true modes. However, by assumption (M) and (21), we have . This can be proved by contradiction since any local mode of must have negative first eigenvalue and zero gradient. Equation (21) and assumption (M) force these points to be within .
Moreover, we have . Note that we cannot place two estimated modes near any true mode since there is always a saddle point between two (estimated) modes and this cannot happen due to the assumption (M) on the first eigenvalue (saddle points have positive first eigenvalue).
Step 2. This part of proof is similar to ; however, our problem is simpler than theirs since we only need to find the rate of convergence while they prove the limiting distributions. By similar technique, we can prove the limiting distribution as well.
After rearranging the indices, each is close to for all . Note that so that Taylor’s theorem implies
where is a point between . Thus,
The next step is to prove
We bound by the following:
We first bound . Note that the matrix norm is the largest absoluted eigenvalue; thus, all we need to do is to bound the eigenvalues of . Since is a point between , . Consequently, all eigenvalues of must be less or equal to by assumption (M) and (21). Therefore, the eigenvalues of must be bounded by . This gives the bound on the norm.
Now we find the rate of . Since ,
is the bias of . By assumptions (A1-2) and (K1), the bias is at rate by the same way for find the rate of bias of the kernel density estimator. Similarly, the covariance matrix is at rate .
Hence, by multidimensional Chebeshev’s inequality,
which implies . Putting altogether, . By repeating step 2 for each mode, we conclude the result.
Appendix B Proof for Theorem 4
Note that they prove that the set is close to at rate
as is sufficiently small.
Here we prove that our additional assumption (R2) implies that . Let . Then we have . Thus,
where are two constants that is independent of . The existence of comes from the fact that and are uniformly bounded for all . Note that since is a projection matrix, its max norm is uniformly bounded by . The first term can be bounded by Davis-Kahan’s theorem  at rate and the second term is at rate . Accordingly, as is small, any point satisfies . Thus, so that .
Appendix C Preprocessing and Description of The Data
c.1 The SDSS Data
The data in the Sloan Digit Sky Survey (SDSS) can be found in http://www.sdss3.org/. We use the Main Sample Galaxies (MSGs) spectroscopic data in the data release 9. Each observation is a galaxy conatining the following four features:
RA = right ascension (i.e., longitude, in degrees [0 to 360])
DEC = declination (i.e., latitude, in degrees [-90 to 90])
z = spectroscopic redshift (precisely estimated proxies for distance)
r = Petrosian magnitude, r band (for completeness only)
The first three features (RA, DEC, z) relates to the spatial position of the galaxy and the Petrosian magnitude is a luminosity measure associated with the mass.
We select a thin slice of the universe to analyze:
The mass of a galaxy is given by the following formula (http://www.sdss.org/DR7/tutorials/getdata/index.html):
where is the Petrosian magnitude and is the redshift.
We use the MASS as and the spatial location to conduct our analysis. Note that since we pick a thin slice, we can neglect the dimension of .
c.2 The Galaxy Merger
The galaxy merger image is taken from the galaxy zoo (http://quenchtalk.galaxyzoo.org/#/subjects/AGS000007y) with label ‘AGS000007y’.
For the image data, the statistical model is given by
where ’s are IID mean noises and is from an uniform grid. Since is on an uniform grid, the density function is constant, the function is the same as the GDF. Then the estimator
is an estimate to . And all the analysis for GDF can be applied to the image data.
- C. Abraham, G. Biau, and B. Cadre. Simple estimation of the mode of a multivariate density. The Canadian Journal of Statistics, 2003.
- J. Chacón, T. Duong, and M. Wand. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, 2011.
- Y. Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(8):790–799, 1995.
- D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603 –619, may 2002. ISSN 0162-8828. doi: 10.1109/34.1000236.
- D. Eberly. Ridges in Image and Data Analysis. Springer, 1996.
- U. Einmahl and D. M. Mason. Uniform in bandwidth consistency for kernel-type function estimators. The Annals of Statistics, 2005.
- K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21:32–40, 1975.
- C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. arXiv:1212.5156v1, 2012.
- E. Gine and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l’Institut Henri Poincare (B) Probability and Statistics, 2002.
- M. A. Guest. Morse theory in the 1990’s. arXiv:math/0104155v1, 2001.
- R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge, second edition, 2013.
- J. Li, S. Ray, and B. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8):1687–1723, 2007.
- V. Luxburg. A tutorial on spectral clustering. Statistical and Computing, 2007.
- E. Mammen, J. Marron, and N. Fisher. Some asymptotics for multimodality tests based on kernel density estimates. Probability Theory and Related Fields, 1992.
- U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 2011.
- J. P. Romano. On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 1988.
- T. Sousbie. The persistent cosmic web and its filamentary structure – i. theory and implementation. Mon. Not. R. Astron. Soc., 2011.