Fast Landmark Subspace Clustering

# Fast Landmark Subspace Clustering

Dept. of Mathematics, University of Minnesota,
Minneapolis, MN 55455, USA
{wang1591,lerman}@umn.edu
###### Abstract.

Kernel methods obtain superb performance in terms of accuracy for various machine learning tasks since they can effectively extract nonlinear relations. However, their time complexity can be rather large especially for clustering tasks. In this paper we define a general class of kernels that can be easily approximated by randomization. These kernels appear in various applications, in particular, traditional spectral clustering, landmark-based spectral clustering and landmark-based subspace clustering. We show that for data points from clusters with landmarks, the randomization procedure results in an algorithm of complexity . Furthermore, we bound the error between the original clustering scheme and its randomization. To illustrate the power of this framework, we propose a new fast landmark subspace (FLS) clustering algorithm. Experiments over synthetic and real datasets demonstrate the superior performance of FLS in accelerating subspace clustering with marginal sacrifice of accuracy.

## 1. Introduction

Kernel-based learning algorithms have been highly successful for various machine learning tasks. Such algorithms typically contain two steps, first nonlinearly mapping the input data into a high-dimensional feature space and then applying linear learning algorithms on .

When designing kernel methods, it is of primary importance to make it applicable for large datasets. In other words, kernel methods need to scale linearly w.r.t. the number of data points. Unfortunately, most kernel methods require computation related to the kernel matrix. It scales poorly since the number of operations of merely obtaining the full kernel matrix is . Furthermore, many clustering algorithms require an eigen-decomposition of the kernel matrix.

Motivated by this scalability problem, we define a general class of kernels and study the properties of fast randomization approximations. This is a flexible framework that allows applications in different scenarios. In the setting of subspace clustering, we propose a landmark subspace clustering algorithm and show that it is fast with marginal sacrifice in accuracy.

### 1.1. Related Works

For building scalable kernel classification machines, Rahimi and Recht [DBLP:conf/nips/RahimiR07] proposed the approximation of kernels by Fourier random features and justified it by the Bochner’s theorem of harmonic analysis. This technique was further analyzed and refined in other classification situations [41466, DBLP:conf/nips/RahimiR08, AISTATS2012_KarK12]. It has recently been shown to have comparable performance with deep learning [DBLP:journals/corr/LuMLGGBFCKPS14] in both scalability and accuracy. However, Hamid et al. [DBLP:conf/icml/HamidXGD14] observed that such random features for polynomial kernels contain redundent information, leading to rank-deficient matrices. Although they are able to rectify it by dimension reduction, their work indicates that applying random features as suggested directly by the Bochner’s theorem may not be efficient.

In the territory of fast spectral clustering, the Nystrm method in [DBLP:conf/cvpr/LiLKL11, DBLP:conf/icml/KumarMT09] provides a low rank approximation of the kernel matrix by sampling over its columns. The error analysis of this approximation by special sampling schemes can be found in [DBLP:conf/alt/ChoromanskaJKMM13, DBLP:journals/jmlr/DrineasM05, DBLP:journals/jmlr/KumarMT12]. Different from the Nystrm method, the landmark-based methods [DBLP:conf/aaai/ChenC11, DBLP:conf/kdd/YanHJ09, DBLP:conf/icdm/GanSN13] first generate representative landmarks to summarize the underlying geometry of a dataset and then create new columns from these landmarks instead of sampling the original columns. Comparing with generating random Fourier features or subsampling the original columns, the landmark-based methods need less columns and thus are more efficient for spectral clustering.

### 1.2. Contributions of This Paper

First of all, we propose a randomization procedure that applies to a relatively large class of kernels, in particular, larger than classes treated by previous works (see Section 1.1). Second of all, we develop a kernel-based algorithm for fast approximated subspace clustering for a dataset of points with clusters. Given an approximation by landmarks, the algorithm requires running time. Third of all, in the context of kernel approximations via Fourier random features and landmarks, this is the first work that provides a bound of the -error between the original and approximated eigenvectors used in the clustering procedure. This bound is independent of . We remark that Rahimi and Recht [DBLP:conf/nips/RahimiR07] only provided a bound of the difference between the corresponding entries of the original and approximated kernels.

### 1.3. Organization of This Paper

Section 2 defines a set of kernels that fit our approximation scheme and describes a randomized approximation procedure for such kernels. Section 3 introduces the fast landmark subspace (FLS) clustering algorithm. Section 4 shows that good approximation errors for both the kernel matrix and the eigenvectors of the normalized kernel matrix can be obtained by changing the number of landmarks with no dependence on the number of points. This analysis applies not only to FLS but also to other clustering algorithms associated with the different kernels introduced in Section 2. In Section 5, we compare the proposed FLS with state-of-the-art fast subspace clustering algorithms over various datasets.

## 2. Kernels

### 2.1. Construction of Kernels

In this section we introduce a class of kernels. Let and be two measurable spaces and be a bounded continuous map:

 f:X×Y→R,

which is -integrable w.r.t. to the product measure . This map induces an embedding

 (1) ϕ:x∈X↦f(x,⋅)∈L2(Y,dμY).

Moreover, if we define on as follows:

 (2) k(x1,x2)=∫y∈Yf(x1,y)f(x2,y)dμY(y),

then the Fubini’s theorem and [fasshauer_positive_2011, Page 6] imply that is a Mercer’s kernel. Indeed, for any , the Mercer’s condition is satisfied:

 ∫x1,x2∈Xu(x1) k(x1,x2)u(x2)dμX(x1)dμX(x2) =∫y∈Y(∫x∈Xf(x,y)u(x)dμX(x))2dμY(y)≥0.

We now show that different types of kernels can be constructed in this way.

#### Example I

The first example shows that different kernels can be obtained by varying the measure . Let and be the Lebesgue measure. Let and be the product of the Gaussian measure and the uniform measure . If we define

 (3) f(x,(w,t))=√2cos(wTx+t),

then the corresponding kernel defined in (2) is the usual Gaussian kernel with variance . This formulation is a special case of the following Bochner’s theorem (see also [DBLP:conf/nips/RahimiR07]) from harmonic analysis.

###### Theorem 2.1 (Bochner [Mr1038803]).

A continuous shift-invariant kernel on is positive-definite if and only if is the Fourier transform of a non-negative measure.

The implication of Theorem 2.1 is that every positive-definite shift-invariant kernel on can be written as the expectation (integral)

 Eμ×U([0,2π])(f(x1,(w,t))f(x2,(w,t))),

where is as in (3) for some non-negative measure . In other words, the set of kernels defined in (2) by choosing different measures encompass all positive-definite shift-invariant kernels on .

#### Example II

The second example shows that this new formulation of kernels provides a theoretical foundation for the landmark-based methods [DBLP:conf/aaai/ChenC11, DBLP:conf/kdd/YanHJ09, DBLP:conf/icdm/GanSN13]. Indeed, let be , be the Lebesgue measure and be the probability measure from which the observed dataset is sampled. If we define

 (4) f(x,y)=(2πσ2)−k/2e−∥x−y∥22/(2σ2),

then the corresponding kernel on defined in (2) is the kernel implicitly used in landmark-based methods. Thus the theoretical analysis in Section 4 applies to these methods. We note that if one picks as above and be the Lebesgue measure, one obtains the Gaussian kernel on .

### 2.2. Alternative Interpretation

The basic interpretation of this construction of kernels is that can be taken as an infinite dimensional representation of in and be the corresponding inner product. Alternatively speaking, each point provides a similarity score for all pairs . The kernel is an average of all such scores provided by the measurable space . This alternative interpretation provides an insight on the choice of .

In Example II of Section 2, we mentioned that the integration of (4) over the underlying distribution of a given dataset leads to landmark-based kernels and the integration over the Lebesgue measure leads to the usual Gaussian kernels. It is evident that landmark-based kernels are more efficient since they use the average scores of landmarks, which summarize the underlying geometry and thus provide more relevant similarity scores than points randomly chosen from the whole .

### 2.3. Randomized Approximation

In this section we present a randomized mapping to approximate . We explicitly embed each into an Euclidean space using a randomized mapping: where the inner product in approximates the kernel function. That is,

 k(x1,x2)=⟨ϕ(x1),ϕ(x2)⟩μY≈⟨ψ(x1),ψ(x2)⟩.

Let . Then for all by definition. We randomly draw points according to the probability measure and define a random map:

 (5) ψ:x∈X→1√D[ψy1(x),…,ψyD(x)]T∈RD.

## 3. Fast Subspace Clustering

### 3.1. Subspace Kernels

Different kernels can be obtained by taking a suitable measurable space in (2). In this section we show how this idea can be applied to the subspace clustering task, leading to a fast subspace clustering algorithm. Let the measurable space be the sphere (after normalizing the data) and the measurable space be the Grassmannian of -dimensional subspaces with a measure . We define the function

 f:(x,L)∈X×Y↦exp(−dist(x,L)2/σ2)∈R,

and the subspace kernel

 (6) k(x1,x2)=∫L∈Yf(x1,L)f(x2,L)dμY(L).

We explain why this can be a good kernel for spectral subspace clustering in Section 2.2 below and further illustrate its power in the numerical section.

### 3.2. Choice of μY

Lemma 3.1 shows that the uniform measure on the Grassmannian is not a good choice. Indeed, the resulting kernel only has distance information between points and thus is a shift-invariant kernel in Example I of Section 2, which is not capable to detect the linear structures.

###### Lemma 3.1.

If is the uniform measure on , then is a function depending only on the Euclidean distance between and .

Section 2.2 provides an answer for how to pick for subspace kernels (6) given a subspace clustering task. Indeed, we note that the true underlying subspaces provide more relevant scores for the clustering task. A true subspace assigns to a pair if both of them belong to this subspace and a small score if otherwise. Therefore, a good choice of is a measure supported on a neighborhood of the underlying subspaces of a given dataset in .

#### Landmark Subspace

Empirically the true subspaces are unknown and can not be explicitly specified. Therefore the best we can do is to generate a set of subspaces that are close to the true subspaces in order to have a good approximation This set of subspaces is generated as follows. We first pick landmark points for the dataset by random sampling or -means. For each landmark point, a local best fit flat is found by selecting an optimal neighborhood radius according to [DBLP:journals/corr/abs-1010-3460]. These local flats approximate the true subspaces and are called landmark subspaces.

### 3.3. The FLS Algorithm

With the kernel approximation in Section 2.3, we can easily define the approximated normalized kernel matrix for a given dataset . Let

 ^W=ψ(X)Tψ(X)=[ψ(x1)⋯ψ(xn)]T[ψ(x1)⋯ψ(xn)]

and the degree matrix be a diagonal matrix with the diagonal entries . Then we define

 ^Ln,D=^D−1/2^W^D−1/2.

The clustering is based on the top eigenvectors of . At first glance, there seems no reason to consider since its construction requires operations. The essential idea of our randomized procedure is to provide a random low-rank decomposition of the normalized kernel matrix. The task of obtaining the top eigenvectors of is transformed to finding the top singular vectors of since they are the same by definition. This can be done in . We formulate the algorithm as follows:

## 4. Theoretical Analysis

The main goal of the theoretical analysis is to bound the approximation errors of the kernel matrix entries and eigenvectors for the normalized kernel matrix. We show that the errors depend only on . In other words, the number of landmarks does not have to increase as the dataset size increase in order to achieve a fixed level of approximation precision. This rigorously justifies that the cost of our algorithm is indeed linear in without a hidden factor of in .

### 4.1. Uniform Convergence

Hoeffding’s inequality guarantees the pointwise exponentially fast convergence in probability of to . That is,

 Pr[|ψ(x1)Tψ(x2)−k(x1,x2)|≥ϵ]≤2exp(−Dϵ2/4).

Recall that is the function defining the kernel . If and its derivatives in are bounded by a constant , we have the following stronger result that asserts the uniform convergence over a compact subset of X. Theorem 4.1 directly generalizes a similar argument in Claim 1 of [DBLP:conf/nips/RahimiR07] for more general manifolds and the new class of kernels. For the completeness, we include its proof in the appendix.

###### Theorem 4.1.

Let be a compact subset of . Then for the mapping defined in (5),

 Pr[supx,y∈M|ψ(x)Tψ(y)−k(x,y)|≥ϵ]≤Cd,fexp(−Dϵ2/(16d+8))/ϵ2,

where is a constant depending on and and the covering number of . Furthermore, with any constant probability when .

We denote the kernel matrix by . Given a dataset , we recall that the approximation is given by

 ^W=ψ(X)Tψ(X)=[ψ(x1)⋯ψ(xn)]T[ψ(x1)⋯ψ(xn)].

Theorem 4.1 states that the entrywise approximation quality of by is independent of n. This immediately indicates a low-rank approximation of as long as .

### 4.2. Convergence of Eigenvectors

In many cases such as spectral clustering, the stability of eigenvectors is desired for a matrix approximation. For simplicity, we assume there are two clusters and consider the stability of the second largest eigenvector (the stability of the first eigenvector is trivial). The reasoning for top eigenvectors is similar. Given a dataset , we recall that the normalized kernel matrix is defined as

 Ln=D−1/2WD−1/2,

where is diagonal with , and its approximation defined as

 ^Ln,D=^D−1/2^W^D−1/2,

where . In this section, we show the convergence of the second largest eigenvector of to that of uniformly over as . The spectral convergence related to and follows similarly. In the following analysis, we make the assumption below on the kernel :

###### Assumption 4.2.

is bounded from below and above by constants () on the domain from which the dataset is sampled.

We remark that this is also an essential assumption in proving the consistency of spectral clustering in [MR2396807] and Gaussian kernels satisfy this assumption over any compact set in .

Theorem 4.1 implies that entrywisely converges to as . Therefore, the second largest eigenvector of converges to the second largest eigenvector of as for each fixed . In the following, we show that this convergence is uniform over . That is, the approximation error depends only on , even as .

The main idea is as follows. We first show that is a small pertubation of in operator -norm in Lemma 4.3. Then we show that the first nonzero eigenvalue of has strictly positive distance from the rest spectrum for all in Lemma 4.4. This implies that a small perturbation of does not significantly change the corresponding eigenvector .

We now make the above arguments precise. To begin with, we assume the following probabilistic model. Suppose in are i.i.d. sampled from the unit ball according to some probability measure .

We denote , and . It is easy to see that

 ^Ln,D=Ln−H1−H2−H3.

The first task is to show that the -norms are small uniformly over . We note that . Theorem 4.1 implies that given any and any probability with where is a constant depending only on , the entrywise error with probability . The first task is to show that are small uniformly over . This is formulated as follows and proved in the Appendix.

###### Lemma 4.3.

If , , where is a constant depending only on , then , and with probability , for all .

Lemma 4.3 shows that is a small perturbation of in -norm uniformly over . Now we analyze further the relation between their spectrums. Let and be the spectrum and the second largest eigenvalue of . Lemma 4.4 below shows that is sufficiently separated from the rest spectrum of .

###### Lemma 4.4.

Let be a constant depending on the generating probability of the dataset and the kernel . For any constant probability , there exists such that

 dist(λ1,n,{0}∪σ(Ln)∖{λ1,n})≥c,∀n≥Np.

Lemma 4.4 together with the Davis-Kahan theorem [MR0264450] (see also [variantDK, Theorem 1]) indicates that and are robust (not mixed with other eigenvectors) under a small pertubation of . Indeed, we have

 ∥^v1,n−v1,n∥2≤∥H1+H2+H3∥2c≤Cδ,

for some constant and . Thus Theorem 4.5 below follows.

###### Theorem 4.5.

For any and , there are constants and such that if and , then

 ∥^v1,n−v1,n∥2≤Cδ,

with probability at least . Here depends on the gap of top eigenvalues and the lower and upper bounds of the kernel on the compact domain.

If there are clusters, we need to consider first nonzero eigenvectors and their perturbations . One can use similar arguments as above to prove the robustness of . We note that the approximation error of eigenvectors is controlled by the number of landmarks .

## 5. Experiment

We compare FLS with the following algorithms: Local Best-fit Flats (LBF) [DBLP:journals/corr/abs-1010-3460], Sparse Subspace clustering (SSC) [DBLP:journals/pami/ElhamifarV13], Scalable Sparse Subspace Clustering (SSSC) [DBLP:conf/cvpr/00010Y13]. LBF and SSSC are algorithms with the state-of-the-art performance in speed with reasonable high accuracy. Their comparison with other subspace clustering algorithms can be found in [DBLP:journals/corr/abs-1010-3460, DBLP:conf/cvpr/00010Y13]. Yet, there is no direct comparison between them. SSC is a popular but slow algorithm, which is included as a contrast for accuracy. We report the time cost in seconds and measure the accuracy of these algorithms by the ratio of correctly-clustered points with outliers excluded. That is,

 rate=# of correctly-clustered inliers# of % total inliers×100%.

When we fail to obtain a result due to the exceeding of memory limits, we report it as N/A.

### 5.1. Synthetic Data

FLS was tested and compared with other algorithms for artificial data with various subspace dimensions and levels of outliers. We use the notation to denote the model of subspaces in of dimensions . Given each model, we repeatly generate different sample sets according to the code in [synthetic_web]. More precisely, for each subspace in the model, 250 points are first created by drawing uniformly at random from the unit disk of that subspace and then corrupted by Gaussian noise (e.g., from on ). Then the whole sample set is further corrupted by adding or uniformly distributed outliers in a cube of side-length determined by the maximal distance of the former generated data to the origin. For each model, we repeat the experiment on 10 sample sets and the average accuracy (running time) are reported in Table 1 and 2 for outlier levels and respectively.

### 5.2. MNIST Dataset

We test the four algorithms on the MNIST dataset of handwritten digits. We work on the training set of images of the digits through . We pick images of a combination of several digits and apply PCA to reduce the dimension to . We choose a fixed subspace dimension and the correct number of clusters. The clustering rate is reported in Table 3. We note that stands for in the table.

### 5.3. Other Datasets

We also tried the four algorithms on the Extended Yale B (ExtendYB) [GeBeKr01], the penDigits dataset from UCI database. The Extended Yale B dataset contains 2414 front-face images from 38 persons. In the experiment, we randomly select 8 persons, crop their images to and further reduce the dimension by projecting to the first principal components. As suggest in [DBLP:journals/corr/abs-1010-3460, page 12], we apply a crude whitening process to the projected data (removing the top two principal components) before applying FLS and LBF. We report SSC and SSSC without such whitening since they worked better when all directions are included. The penDigits dataset contains 7494 data points of 16 features, each of which represents a digit from 0 to 9. The results are reported in Table 4.

## 6. Conclusion

In this paper we introduce an efficient framework to approximate a large class of kernels and consequently obtain faster clustering algorithms. In the context of clustering, this framework proposes a novel procedure for fast subspace clustering. Its complexity is for points with clusters and landmarks. Our theoretical analysis establishes such complexity for clustering algorithms associated with our framework. More importantly it bounds the -error between the original and approximated eigenvectors of kernel matrices, which applies to FLS, Fourier random features and other landmark-based methods.

## Appendix A Proof of Lemma 3.1

Let and be two pairs on such that the angles . Then there exists an orthogonal transformation such that and . Thus,

 k(x′1,x′2) =∫L∈Yf(Rx1,L)f(Rx2,L)dμY(L) =∫L∈Yf(x1,R−1L)f(x2,R−1L)dμY(L) =∫L∈Yf(x1,L)f(x2,L)dμY(RL) =∫L∈Yf(x1,L)f(x2,L)dμY(L)=k(x1,x2).

We use the fact that is uniform in the fourth equality. This equality means that . We conclude the proof by noting that and uniquely determine each other for any pair on .

## Appendix B Proof of Theorem 4.1

We denote . Let , where . Then

 EL2f≤E∥∇(ψ(x1)ψ(x2))∥2≤C4f.

The first inequality follows from the same argument in [DBLP:conf/nips/RahimiR07]. By Markov’s inequality,

 (7) P(Lf≥ϵ2r)≤⎛⎝2rC2fϵ⎞⎠2.

On the other hand, if is a -dimensional manifold with diameter , then has dimension , diameter and -net covering number . If be the vertices of the -net, then by Hoeffding’s inequality

 (8) P[∪Ti=1{|h(x(i)1,x(i)2)|≥ϵ/2}]≤2Texp(−Dϵ2/8).

Equations (7), (8) imply that

 P[supM×M|h(x1,x2)|≤ϵ]≥1−2Texp(−Dϵ2/8)−⎛⎝2rC2fϵ⎞⎠2.

If we pick such that the second and third terms on the right-hand side are equal and simplifying the expression, then

 P[supM×M|h(x1,x2)|≤ϵ]≥1−Cd,fexp(−Dϵ2/(16d+8))/ϵ2,

where is a constant depending on and the bound of and its derivatives.

## Appendix C Proof of Lemma 4.3

We first bound the -norms of each factor in .

 (9) ∥D−1/2∥2 =maxiD−1/2ii=(miniDii)−1/2≤(nl)−1/2, ∥^D−1/2∥2 =(mini^Dii)−1/2≤(n(1−δ)l)−1/2, ∥W∥2 ≤nmaxi,j|Wij|≤nu,∥^W∥2≤nu+nδl ∥W−^W∥2 ≤nmaxi,j|Wij−^Wij|≤nδl, ∥D−1/2−^D−1/2∥2 =maxi|D−1/2ii−^D−1/2ii|≤maxi|Dii−^Dii|2(n(1−δ)l)3/2 ≤12δ(1−δ)−3/2(nl)−1/2.

Here is a hint for the last inequality of (9). The function is a convex decreasing function over . Therefore, . Then the last inequality in (9) follows from the fact that

 mini{min{Dii,^Dii}}≥n(1−δ)l.

The upper bounds of follow immediately from (9) and the inequality for any matrices .

## Appendix D Proof of Lemma 4.4

Let the linear operator on be defined as follows:

 T(f)(x)=∫y∈B(0,1)k(x,y)f(y)/√d(x)d(y)dP(y),

where the degree function . Since is a compact operator, its eigenvalue accumulation point is . It is easy to see that is an eigenvalue of with the eigenvector . Luxburg et al. [MR2396807] showed that the eigenvalues and eigenvectors of converge to those of with a convergence rate . Since has all its eigenvalues in , so is . We summerize this in Lemma D.1

###### Lemma D.1.

has all eigenvalues between . is its eigenvalue. The only accumulation point of the eigenvalues is .

Let and be the spectrum and the second largest eigenvalue of respectively. For simplicity, we assume () is a simple eigenvalue. Lemma D.1 implies the distance

 dist(λ1,{0}∪σ(U)∖{λ1})=2c>0

for some . Theorem 15 in [MR2396807] implies that for any constant probability , there exists such that if , the

 dist(λ1,n,{0}∪σ(Ln)∖{λ1,n})≥c

with probability at least .

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters