Generalized Mode and Ridge Estimation

# Generalized Mode and Ridge Estimation

## Abstract

The generalized density is a product of a density function and a weight function. For example, the average local brightness of an astronomical image is the probability of finding a galaxy times the mean brightness of the galaxy. We propose a method for studying the geometric structure of generalized densities. In particular, we show how to find the modes and ridges of a generalized density function using a modification of the mean shift algorithm and its variant, subspace constrained mean shift. Our method can be used to perform clustering and to calculate a measure of connectivity between clusters. We establish consistency and rates of convergence for our estimator and apply the methods to data from two astronomical problems.

\nipsfinalcopy

## 1 Introduction

Consider a random sample of the form , where is a random sample from a smooth density and each is a scalar random variable. The generalized density function (GDF), also known as an intensity function, is where . A kernel estimate of the GDF is

 ˆfn(x)=1nhdn∑i=1YiK(x−Xih), (1)

where is a kernel function and is the smoothing bandwidth. For simplicity, we henceforth assume that the function is a Gaussian kernel. The generalized density function and estimator are of interest in problems where the additional information in the ’s measure the importance, relevance, or intensity of the corresponding points. (In the language of point processes, the ’s are the points and the ’s are the marks.) Two common cases include:

• represents a covariate associated with each point, and the generalized density weights the points according to the value of the covariate. For example, in galactic astronomy, the ’s might represent a galaxy’s location and the ’s the galaxy’s mass. Astronomers are interested in the “mass-density” , which describes the distribution of galaxy mass.

• represents the measurement precision for each observation, so the generalized density weights the points according to how precisely they are measured.

This paper focuses on estimating the modes and ridges of the GDF because (i) they are often features of direct scientific interest, (ii) they provide useful and descriptive summaries of the GDF’ structure, and (iii) they can be used as inputs to clustering.

Given a smooth function , the modes (local maximums) and ridges [5, 15, 8] are defined as

 M=Mode(f) ={x∈Rd:∇f(x)=0,λ1(x)<0} (2) R=Ridge(f) ={x∈Rd:V(x)V(x)T∇f(x)=0,λ2(x)<0},

where is the eigenvalues of , the Hessian matrix of , and is a matrix with being eigenvectors of corresponding to eigenvalue .

The problem of estimating modes and ridges for density functions has been considered in the literature.ma For instance, [16, 14, 1] develop estimators for local modes, and [8] develop estimators for density ridges and establish the asymptotic properties of these estimators. The analogous problem for generalized density functions has not yet been considered.

We estimate the modes and ridges by

 ˆMn=Mode(ˆfn),ˆRn=Ridge(ˆfn). (3)

It is well-known the local modes and ridges from the kernel density estimator can be estimated efficiently by the mean shift algorithm [7, 3, 4]. Here, we present a modification of the mean shift algorithm that can find the modes and ridges in .

## 2 Methods

### 2.1 Weighted Mean Shift

Before we proceed to our method, we first review the usual mean shift algorithm. Given data and an initial point , the mean shift algorithm [7, 3, 4] updates to

 x⟵∑ni=1XiK(x−Xih)∑ni=1K(x−Xih). (4)

If we keep iterating, we end up at a mode of the kernel density estimator .

We define the weighted mean shift, which generates a path starting from a point by successive updates of the form

 x⟵∑ni=1YiXiK(x−Xih)∑ni=1YiK(x−Xih). (5)

This is directly analogous to the ordinary mean-shift update but puts additional weight on the point . The path generated from each point eventually converges to a local mode of the generalized density estimate , an element of the set . Later we will see that this is a consistent estimator of .

Now we derive the relation between (1) and (5). The gradient of is

 ∇ˆfn(x) =1nhdn∑i=1Yi∇K(x−Xih) (6) =1nhd+2n∑i=1Yi(Xi−x)K(x−Xih) =1nhd+2(n∑i=1YiXiK(x−Xih))−xnhd+2(n∑i=1YiK(x−Xih)).

Note that we use the fact that for the Gaussian kernel. After rearrangement,

 x+m(x)=∑ni=1YiXiK(x−Xih)∑ni=1YiK(x−Xih), (7)

where

 m(x)=nhd+2∑ni=1YiK(x−Xih)×∇ˆfn(x) (8)

is the mean shift vector that is always pointing toward the direction of gradient. Thus, the update rule (5) is to move to which follows the gradient ascent. By Morse theory, for a smooth function with non-degenerate Hessian matrix, the gradient ascent path will converge to one of the local modes. Accordingly, the update rule (5) ends up at one of the local modes.

### 2.2 Weight Subspace Constrained Mean Shift

[15] proposes the subspace-constrained mean-shift algorithm which can be used to find the ridges of . An analogous modification also works in the weighted case to find the ridges of the generalized density function. This update rule, which we call the weighted subspace constrained mean shift algorithm, is given by

 x⟵x+V(x)V(x)Tm(x), (9)

where is the mean shift vector defined in (8) and with being the eigenvector corresponding to the -th eigenvalue (first is the largest) of the estimated Hessian matrix . The Hessian matrix is

 ∇∇ˆfn(x)=1nhd+4n∑i=1((Xi−x)(Xi−x)T−h2Id)YiK(x−Xih), (10)

where is the identity matrix. In practice, we can ignore the factor since it is just a scaling. This algorithm will push every point along a ‘projected gradient path’ until it arrives a point on .

## 3 Applications

### 3.1 Mode Clustering

A common application of the mean shift algorithm is to perform a type of clustering called mode clustering [12]. The clusters are defined as the sets of points whose mean-shift paths converge to the same local mode. Using the weighted mean-shift algorithm yields a mode clustering based on the generalized density estimate.

Let be a smooth intensity function. For any point , we define a gradient path as follows

 ϕx:R+↦Rd,ϕx(0)=x,ϕ′x(t)=∇f(ϕx(t)). (11)

We denote as the destination of . By Morse theory [10, 17], must be one of the local modes of except the case is in a set of Lebesque measure (including saddle points and local minimums). Let be the collection of local modes of . We define the cluster of by

 Cj={x∈Rd:dest(x)=Mj}. (12)

In practice, we cluster data points by their destination of the weighted mean shift (5).

### 3.2 Connectivity Measure for Clusters

Some clusters are fairly isolated while other are close together. Here we show how to measure how close clusters are by defining a notion of connectivity. The idea is that the mean shift iterations can be thought of particles moving according to a diffusion (Markov chain).

We define a diffusion as follows. The probability of jumping from to is

 P(x→Xi)=YiK(x−Xih)∑ni=1YiK(x−Xih). (13)

This defines a diffusion and we denote be the random variable of the above diffusion. i .e. Now

 E(Q(x)|X1,Y1,⋯,Xn,Yn)=∑ni=1XiYiK(x−Xih)∑ni=1YiK(x−Xih) (14)

which is the update rule (5). The same result holds for non-weighted mean shift. Thus, the mean shift can be viewed as a expectation for a certain diffusion.

Let be the local modes of . Motivated by (13), we define

 P(Xi→Xj) =YjK(x−Xjh)∑ni=1YiK(x−Xih)+∑kj=1WjK(x−ˆMjh) (15) P(Xi→ˆMj) =WjK(x−Mjh)∑ni=1YiK(x−ˆMjh)+∑kj=1WjK(x−ˆMjh),

where . Note that we impute the weight for each local mode by an estimate of . We also define each mode to be an absorbing state. Namely, the transition probability to/from each local mode to itself is .

Let be a transition matrix with states such that the first states are the estimated local modes and the latter states are the data points. Then the transition matrix can be factorized by

 P=[Ik0ST],Tij=P(Xi→Xj),Sij=P(Xi→ˆMj), (16)

and is the identity matrix.

Then the matrix is the absorbing matrix; that is, the absorbing probability from to the local mode is . We define the connectivity for the two clusters corresponding to local modes as

 Ωij=12⎛⎝∑Xl∈DiYlAlj∑Xl∈DiYl+∑Xl∈DjYlAli∑Xl∈DjYl⎞⎠, (17)

where is the data points belonging to cluster .

The interpretation of is as follows. is the average hitting probability from points in cluster that end up at mode first, and vice versa. Connectivity will be large when two clusters are close and the boundary between them has high density. If we think of the (hard) cluster assignments as class labels, the connectivity is analogous to the mis-classification rate between class and class .

## 4 Statistical Analysis

The modes and ridges of are estimates of the modes and ridges of . In this section we study the statistical properties of these estimators. Let be the set of bounded, times continuously differentiable functions. Let be the max norm of a vector or a matrix . For a smooth function , we define the following operators. A vector of non-negative integers is called a multi-index with and corresponding derivative operator

 Dα=∂α1∂xα11⋯∂αd∂xαdd,

where is often written as . Namely, each is a -th order partial derivative of . For , define the following norms associated with derivatives

 (18)

Note that for two functions , is a norm that measures the differences between to the -th order differentiation.

General Assumptions.

• The random variable is uniformly bounded by a constant .

• The function and is a Morse function (the Hessian matrix is non-singular at all critical values).

• The kernel function and is symmetric, non-negative and

 ∫x2K(α)(x)dx<∞,∫(K(α)(x))2dx<∞

for all .

• The kernel function satisfies condition of [9]. That is, there exists some such that for all , where is the covering number for a semi-metric space and

 K={u↦K(α)(x−uh):x∈Rd,h>0,|α|=0,1,2,3}.

Assumptions (A1-2) are mild regularity conditions. Condition (K1) is common for a kernel function and (K2) is the smoothness condition for the kernel function. In particular, the Gaussian kernel and any smooth kernel with compact support satisfies (K1-2).

### 4.1 Risk of ˆfn

Define the mean integrated square errors (MISE) by

 MISEk(ˆfn)=E∫∑|α|=k∣∣∣ˆf(α)n(x)−f(α)(x)∣∣∣2dx, (19)

for . Note that as we obtain the usual MISE for . This is just an extension to higher order derivatives.

###### Theorem 1

Assume (A1-2) and (K1). Then

 MISEk(ˆfn) =O(h4)+O(1nhd+2k),k=0,1,2,3.

[2] proves the above Theorem for usual kernel density estimation. We omit the proof as it is similar to their proof.

We also have the following uniform bound.

###### Theorem 2

For a smooth function , let be defined as the above. Assume (A1-2) and (K1-2). Then

 ∥ˆfn−f∥∗∞,k=O(h2)+OP(√lognnhd+2k),k=0,1,2,3.

The proof is essentially the same as [9, 6] by noting that the random variable is bounded by . Similar results in kernel density estimation can be seen in [9, 6].

### 4.2 Mode Estimation and Ridge Recovery

In this section we assume that the density is supported on a compact subset . For two sets , the Hausdorff distance is given by

 dH(A,B)=inf{r:A⊂B⊕r,B⊂A⊕r}, (20)

where and .

###### Theorem 3

Let be the collection of local modes of and respectively. Assume (A1-2), (K1-2) and

• there exists such that where is the first eigenvalue to the Hessian matrix of .

When is sufficiently small,

 dH(ˆMn,M)=O(h2)+OP(√1nhd+2).

The proof of Theorem 3 is in the supplementary material. Here we present an outline for the proof.

Proof Outline. Let be the estimated local modes. Note that the number of estimated modes might be different from the number of true modes. However, by assumption (M) and the fact is sufficiently small, we have and . This follows from the bounds on eigenvalues of Hessian matrix and the estimated gradient. Moreover, we cannot place two estimated modes near any true mode since there is always a saddle point between two (estimated) modes and this cannot happen due to assumption (M) on the first eigenvalue. After rearranging the indices, each is close to for all . Note that so that Taylor’s theorem implies

 ∇ˆfn(Mi)=∇ˆfn(Mi)−∇ˆfn(ˆMi)=∇∇ˆfn(M∗i)(Mi−ˆMi),

where is a point between . Thus, . By similar technique for proving rates of convergence of the kernel density estimator, we have . And is uniformly bounded when is small. We conclude

The above theorem shows that every local mode of the intensity function will be estimated by the weighted mean shift algorithm as long as is sufficiently small. By Theorem 2, is converging to so that this result eventually holds. The rate comes from the bias-variance decomposition. Note that the variance is since the modes are defined via the gradient and this is the variance in estimating gradients.

###### Theorem 4

Let be the ridges of and respectively. Assume (A1-2), (K1-2) and the following two conditions:

• There exists and such that and and for all ,

 λ1(x)>b2,λ2(x)
• There exists such that

 {x:∥V(x)V(x)T∇f(x)∥max≤G0,λ2(x)

where is defined in (R1).

When is sufficiently small,

 dH(ˆRn,R)=O(h2)+OP(√lognnhd+4).

Proof Outline. The proof is essentially the same as in [8]. In particular, our condition (A1-2) together with (R1) implies the conditions in [8]. We assumed an additional condition (R2) so that, when is sufficiently small, the ridge estimator . This is because and its first, second derivatives will be bounded near with high probability and thus the eigenvalue and projected gradient for the estimator will be similar to the truth. Hence, by Theorem 6 and 7 in [8], the result follows.

## 5 Applications

We apply our method to two astronomy problems. The detailed description of the data can be found in the supplementary material.

The first problem analyzes data from the Sloan Digit Sky Survey (SDSS), where each data point is a galaxy at a specific location. The original data set is three-dimensional, but to ease visualization, we take a thin slice of the data to transform it into a two-dimensional problem. For each galaxy, we also have information on the galaxy’s total luminosity, which can be transformed into a proxy for the galaxy’s mass. We weight each galaxy by the mass and estimate the mass-density field and find the modes (red crosses) and ridges (blue curves). In cosmology, the modes correspond to galaxy clusters and the ridges correspond to filaments. Figure 2 shows the results of our analysis.

The second problem is to identify galaxies in a single image containing several galaxies. The data set is a noisy image of four overlapping galaxies. We first transform the image into gray scale and convolve with a Gaussian kernel with bandwidth to reconstruct the GDF. We focus only the region with intensity larger than . We apply the weighted mean shift algorithm and perform clustering to segment the data and construct the connectivity measure for clusters.

Figure 3 shows the results. We successfully detect the four galaxies and the weighted mean shift clustering classifies regions belonging to each galaxy. The connectivity matrix seems to capture the level of “interaction” among the four galaxies. For instance, clusters and have high connectivity, reflecting their large overlap. In contrast, clusters and have only moderate connectivity.

## 6 Conclusion

In this paper, we generalized the mode and ridge estimation from densities to generalized densities that can account for weighting or “marks.” We have established convergence rate for estimating modes and ridges in this case. Future work will be focused on constructing confidence sets for the estimated modes and ridges.

## Appendix A Proof for Theorem 3

Proof. The proof consists of two steps. At the first step, we prove that the number of are the same and each element of correspond to a close element in . The second step is to prove the rate of convergence.

We assume

 ∥ˆfn−f∥∗∞,2≤min{λ12d,λ02}. (21)

(this is what we mean is sufficiently small) Note that by Weyl’s theorem (Theorem 4.3.1 in [11]), (21) implies that the eigenvalue difference between and its estimator is bounded by . Thus, eigenvalues of within is upper bounded by . We will use this fact later.

Step 1. Let be the estimated local modes. Note that the number of estimated modes might be different from the number of true modes. However, by assumption (M) and (21), we have . This can be proved by contradiction since any local mode of must have negative first eigenvalue and zero gradient. Equation (21) and assumption (M) force these points to be within .

Moreover, we have . Note that we cannot place two estimated modes near any true mode since there is always a saddle point between two (estimated) modes and this cannot happen due to the assumption (M) on the first eigenvalue (saddle points have positive first eigenvalue).

Step 2. This part of proof is similar to [16]; however, our problem is simpler than theirs since we only need to find the rate of convergence while they prove the limiting distributions. By similar technique, we can prove the limiting distribution as well.

After rearranging the indices, each is close to for all . Note that so that Taylor’s theorem implies

 ∇ˆfn(Mi)=∇ˆfn(Mi)−∇ˆfn(ˆMi)=∇∇ˆfn(M∗i)(Mi−ˆMi),

where is a point between . Thus,

 Mi−ˆMi=(∇∇ˆfn(M∗i))−1∇ˆfn(Mi).

The next step is to prove

 E(∥Mi−ˆMi∥2) =O(h2) Var(∥Mi−ˆMi∥2) =O(1nhd+2).

We bound by the following:

 Mi−ˆMi =(∇∇ˆfn(M∗i))−1∇ˆfn(Mi) ≤∥∥∥(∇∇ˆfn(M∗i))−1∥∥∥2∥∥∇ˆfn(Mi)∥∥2.

We first bound . Note that the matrix norm is the largest absoluted eigenvalue; thus, all we need to do is to bound the eigenvalues of . Since is a point between , . Consequently, all eigenvalues of must be less or equal to by assumption (M) and (21). Therefore, the eigenvalues of must be bounded by . This gives the bound on the norm.

Now we find the rate of . Since ,

 E(∇ˆfn(Mi))=E(∇ˆfn(Mi))−∇f(Mi)

is the bias of . By assumptions (A1-2) and (K1), the bias is at rate by the same way for find the rate of bias of the kernel density estimator. Similarly, the covariance matrix is at rate .

Hence, by multidimensional Chebeshev’s inequality,

 ∇ˆfn(Mi)−E(∇ˆfn(Mi))=OP(√1nhd+2)

which implies . Putting altogether, . By repeating step 2 for each mode, we conclude the result.

## Appendix B Proof for Theorem 4

Proof. The proof is essentially the same as in proving Theorem 6 in [8]. In particular, our condition (A1-2) together with (R1) implies all the conditions in [8].

Note that they prove that the set is close to at rate

 dH(~R,R)=O(∥ˆfn−f∥∗∞,2)

as is sufficiently small.

Here we prove that our additional assumption (R2) implies that . Let . Then we have . Thus,

 ∥V(x) V(x)T∇f(x)∥max =∥V(x)V(x)T∇f(x)−ˆVn(x)ˆVn(x)T∇ˆfn(x)∥max ≤∥(V(x)V(x)T−ˆVn(x)ˆVn(x)T)∇f(x)∥max+∥ˆVn(x)ˆVn(x)T(∇f(x)−∇ˆfn(x))∥max ≤∥V(x)V(x)T−ˆVn(x)ˆVn(x)T∥maxC1+∥∇f(x)−∇ˆfn(x)∥maxC2

where are two constants that is independent of . The existence of comes from the fact that and are uniformly bounded for all . Note that since is a projection matrix, its max norm is uniformly bounded by . The first term can be bounded by Davis-Kahan’s theorem [13] at rate and the second term is at rate . Accordingly, as is small, any point satisfies . Thus, so that .

Now by Theorem 6 in [8],

 dH(ˆR,R)=O(∥ˆfn−f∥∗∞,2)

and by Theorem 2, we conclude the result.

## Appendix C Preprocessing and Description of The Data

### c.1 The SDSS Data

The data in the Sloan Digit Sky Survey (SDSS) can be found in http://www.sdss3.org/. We use the Main Sample Galaxies (MSGs) spectroscopic data in the data release 9. Each observation is a galaxy conatining the following four features:

• RA = right ascension (i.e., longitude, in degrees [0 to 360])

• DEC = declination (i.e., latitude, in degrees [-90 to 90])

• z = spectroscopic redshift (precisely estimated proxies for distance)

• r = Petrosian magnitude, r band (for completeness only)

The first three features (RA, DEC, z) relates to the spatial position of the galaxy and the Petrosian magnitude is a luminosity measure associated with the mass.

We select a thin slice of the universe to analyze:

 RA ∈[155,185] DEC ∈[35,65] z ∈[0.110,0.115].

The mass of a galaxy is given by the following formula (http://www.sdss.org/DR7/tutorials/getdata/index.html):

 MASS=r−5log(4.28×108×z),

where is the Petrosian magnitude and is the redshift.

We use the MASS as and the spatial location to conduct our analysis. Note that since we pick a thin slice, we can neglect the dimension of .

### c.2 The Galaxy Merger

The galaxy merger image is taken from the galaxy zoo (http://quenchtalk.galaxyzoo.org/#/subjects/AGS000007y) with label ‘AGS000007y’.

For the image data, the statistical model is given by

 Yi=g(Xi)+ϵi,i=1,⋯,n,

where ’s are IID mean noises and is from an uniform grid. Since is on an uniform grid, the density function is constant, the function is the same as the GDF. Then the estimator

 ˆgn(x)=1nhdn∑i=1YiK(x−Xih)

is an estimate to . And all the analysis for GDF can be applied to the image data.

### References

1. C. Abraham, G. Biau, and B. Cadre. Simple estimation of the mode of a multivariate density. The Canadian Journal of Statistics, 2003.
2. J. Chacón, T. Duong, and M. Wand. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, 2011.
3. Y. Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(8):790–799, 1995.
4. D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603 –619, may 2002. ISSN 0162-8828. doi: 10.1109/34.1000236.
5. D. Eberly. Ridges in Image and Data Analysis. Springer, 1996.
6. U. Einmahl and D. M. Mason. Uniform in bandwidth consistency for kernel-type function estimators. The Annals of Statistics, 2005.
7. K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21:32–40, 1975.
8. C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. arXiv:1212.5156v1, 2012.
9. E. Gine and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l’Institut Henri Poincare (B) Probability and Statistics, 2002.
10. M. A. Guest. Morse theory in the 1990’s. arXiv:math/0104155v1, 2001.
11. R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge, second edition, 2013.
12. J. Li, S. Ray, and B. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8):1687–1723, 2007.
13. V. Luxburg. A tutorial on spectral clustering. Statistical and Computing, 2007.
14. E. Mammen, J. Marron, and N. Fisher. Some asymptotics for multimodality tests based on kernel density estimates. Probability Theory and Related Fields, 1992.
15. U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 2011.
16. J. P. Romano. On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 1988.
17. T. Sousbie. The persistent cosmic web and its filamentary structure – i. theory and implementation. Mon. Not. R. Astron. Soc., 2011.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters