Fast Kernel -means Clustering Using Incomplete Cholesky Factorization
Kernel-based clustering algorithm can identify and capture the non-linear structure in datasets, and thereby it can achieve better performance than linear clustering. However, computing and storing the entire kernel matrix occupy so large memory that it is difficult for kernel-based clustering to deal with large-scale datasets. In this paper, we employ incomplete Cholesky factorization to accelerate kernel clustering and save memory space. The key idea of the proposed kernel -means clustering using incomplete Cholesky factorization is that we approximate the entire kernel matrix by the product of a low-rank matrix and its transposition. Then linear -means clustering is applied to columns of the transpose of the low-rank matrix. We show both analytically and empirically that the performance of the proposed algorithm is similar to that of the kernel -means clustering algorithm, but our method can deal with large-scale datasets.
keywords:kernel -mean, kernel clustering, incomplete Cholesky factorization
Clustering analysis is a classical unsupervised learning method. The aim of clustering is to partition samples into several groups. One advantage of clustering is that it is suitable for processing multi-class datasets. It has been applied in various fields, including image segmentation Liew and Yan (2003), anomaly detection Chandola et al. (2009), gene sequence analysis Fu et al. (2012), market research Kim and Ahn (2008), etc.
Clustering has been well studied in recent years, and various clustering algorithms have been proposed one after another Corpet (1988); Ester et al. (1996); Elhamifar and Vidal (2009). -means clustering MacQueen (1967) is one of the most popular clustering method since it is simple and efficient in dealing with linear-separable datasets. The target of the -means algorithm is to minimize the Euclidean distance between samples and the clustering centers. The computational complexity of -means is very low and it is suitable to deal with large-scale datasets. However, -means clustering will not achieve satisfactory results when dataset is nonlinear-separable, that is, the dataset cannot be well partitioned into different clusters by hyperplane. To address this deficiency, Schölkopf et al. introduced kernel into -means clustering, and proposed kernel -means clustering Schölkopf et al. (1998). Kernel trick is an effective method to map nonlinear-separable dataset in low-dimensional space to linear-separable dataset in higher-dimensional feature space. By using nonlinear mapping , the Euclidean distance between samples in -means is replaced by the kernel distance defined by
where and are two samples, is the kernel function and . However, the full kernel matrix is required for kernel -means, because it needs to compute the kernel distance between samples and cluster centers which are a linear combination of all the samples in the feature space. If the number of samples is very large, computing and storing consume a lot of memory. Therefore, kernel -means is unsuitable for clustering large-scale datasets.
In this paper, we address this challenge by using low-rank approximation version instead of the full kernel matrix . The low-rank approximation version is generated by incomplete Cholesky factorization (ICF) Zhou (2016). It iteratively chooses samples one by one into a basic subset by minimizing the trace of error, i.e. , and finally constructs the rank- matrix , where is the number of elements in the , . Then, -means clustering is applied on to obtain the final cluster results. We show that approximation error of the kernel -means clustering using ICF algorithm reduces exponentially as increases when the eigenvalues of decay exponentially. Moreover, with regard to the Zhou’s ICF Zhou (2016), we show that: (a) for symmetric positive semidefinite (SPSD) matrix , ICF can obtain a rank- approximation of after iterations; (b) let as the rank of the SPSD matrix , then after iterations, ICF can successfully sample linearly independent columns from without breakdown; (c) the approximation error, , decreases exponentially with the increases of when the eigenvalues of decay exponentially sufficiently fast; (d) the iterations of ICF should be set no more than the rank of , i.e. ; (e) setting the iterations of ICF as can guarantee the error , where is a fully small positive number. These claims can ensure the convergence of ICF, which was not discussed in the previous studies. The experimental results illustrate that the accuracy of the proposed algorithm is similar as the kernel -means using entire kernel matrix, but our algorithm can greatly reduce the running time and can process large-scale datasets.
The rest of the paper is organized as follows. Section 2 describes the related work on large-scale kernel -means clustering. Section 3 outlines the kernel -means clustering and the ICF algorithm. Section 4 discusses the convergence of ICF. We present the kernel -means clustering using ICF and its analysis in section 5. Section 6 summarizes the results of the experimental results, and section 7 concludes the study.
2 Related work
Kernel -means clustering can achieve higher clustering quality than -means. However, the computational complexity of the kernel -means clustering is high, mainly because the computation and storage of the kernel matrix consume much time and memory. Many algorithms have been proposed in the literature to improve the ability of kernel -means to deal with large-scale datasets (see Table 1). Chitta et al. (2011) proposed the approximate kernel -means algorithm. It approximates the cluster centers by the randomly selected samples instead of all the samples to avoid compute the full kernel matrix. The computational complex of this algorithm is , where is the number of samples in the dataset, is the number of selected samples and is the number of iterations. Wang et al. (2019) proposes a scalable kernel -means clustering algorithm which uses Nyström approximation to sample features, and then the truncated singular values decomposition (SVD) is used to reduce the number of features to . The computational complex of it is . By employing random Fourier features (RFF) and SVD, Chitta et al. (2012) proposes efficient kernel -means clustering to improve the efficiency of kernel clustering. Its computational complex is . These algorithms all focus on approximating the kernel matrix and more efficient than the standard kernel -means algorithm whose computational complexity is .
|-means Jain (2010)|
|Kernel -means Girolami (2002)|
|Approximate kernel -means Chitta et al. (2011)|
|Efficient kernel -means Chitta et al. (2012)|
|Scalable kernel -meansWang et al. (2019)|
Incomplete Cholesky factorization is another effective kernel matrix approximate method. Comparing with data independent method RFF, the data dependent methods ICF has better generalization performance Yang et al. (2012). Comparing with Nyström methods, ICF has three advantages: (a) ICF does not need to compute and store the full kernel matrix, which is not true for some Nyström methods, such as leverage scores sampling Gittens and Mahoney (2016), Farahat schemes Farahat et al. (2011) and landmark point sampling with -means Zhang et al. (2008); (b) the result of ICF is deterministic, in contrast, some random-based Nyström method randomly selects a subset of training samples as the basis functions and RFF randomly samples vectors from a distribution to form the basis functions; (c) ICF does not sample the same column of kernel matrix twice, whereas most non-deterministic Nyström methods require sampling with replacement Patel et al. (2016). ICF has been successfully applied in several kernel-based algorithms to improve computational efficiency or enhance sparsity, for example, Zhou Zhou (2016)obtains the sparse solution of least square support vector machine (LSSVM) by using ICF, Chen et al. Chen and Zhou (2018) introduce ICF into robust LSSVM and enables it to classify and regress large-scale datasets with noise, Frederix et al. Frederix and Barel (2013) propose a sparse spectral clustering method with ICF, etc. ICF with different pivot selection rules are presented, see Bach and Jordan (2002); Harbrecht et al. (2012); Patel et al. (2016). Recently, Zhou proposed an improved ICF with a new pivot selection rule Zhou (2016). One advantage of this method is that in each iteration, only the diagonal elements of error matrix are required, not the whole matrix. In the sequel, when we refer to ICF, we mean Zhou’s method.
In this section, we first describe the kernel -means clustering, and then describe incomplete Cholesky factorization method.
3.1 Kernel -means clustering
Let be the input dataset consisting of samples, where . The dataset can be participated into clusters: . be the number of samples in cluster . The objective of kernel -means clustering is to minimize the sum of kernel distances between each sample and the center of the cluster to which the sample belongs, that is, to minimize the following optimization problem:
where is the mapping to project the samples to high-dimensional feature space. In fact, we usually do not need to know what the mapping is, because they often appear in the form of inner product . We denote it as , i.e. , where is called the kernel function. The factor is introduced only for the purpose of normalization.
Let be the kernel matrix with , and be the full eigenvalue decomposition of . Denote be the columns of the , then the problem (1) is equivalent to the following optimization problem Wang et al. (2019):
Problem (2) can be solved by -mean clustering algorithm. However, the entire kernel matrix and its eigenvalue decomposition must be computed and stored in advance, which cost and running time, respectively. Therefore, the total computational complexity of solving optimization problem (2) is , where is the number of iterations of -means algorithm. When dataset contains points greater than a few ten thousands, the computational cost of this method is very high. The goal of this paper is to reduce both the computational complexity and the memory requirements of kernel -means clustering.
3.2 Incomplete Cholesky factorization
For a positive semi-definite matrix , the incomplete Cholesky factorization constructs a low-rank approximate matrix for , where is the column/row indices of , is a subset of , which contains indices of the selected columns, denotes a sub-matrix of composed by the selected columns whose indices are in , and is a square sub-matrix of composed by the selected rows whose indices are in . The subset is generated as follows. ICF first sets as an empty set, and then iteratively joins the index corresponding to the largest diagonal entry of error matrix into to minimize until the termination condition is satisfied, where and are the and the error matrix in the -th iteration, respectively. Minimizing also implies minimizing trace norm, and the upper bounds of and , because for positive semi-definite matrix . is updated by the following theorem Zhou (2016).
Denote . Let be the selected index into and . Set and . If , then with , where , being the -th row of and . Furthermore, , .
4 The convergence of ICF
In this section, we discuss the convergence of ICF, which is not discussed in previous literature. Firstly, we will prove that ICF produces a rank approximate matrix after iterations. Secondly, we show that ICF can generate a rank matrix in iterations without breakdown. At last, the convergence of ICF will be verified.
For symmetric positive semidefinite (SPSD) matrix , ICF generates a rank approximate matrix for after iterations.
We use inductive method to prove this theorem. First, we set the number of iteration , then as is a SPSD matrix, therefore the rank of is 1.
Next, we assume that the rank of is , and then prove that the rank of is .
Suppose that ICF can select the -th element into , that is, there exists , then , , and .
Set be the Cholesky factorization of and be the Cholesky factorization of , where
Therefore, is a full rank matrix. The ranks of and are both , hence the rank of is . The theorem is proven. ∎
Theorem 2 indicates that the columns of are linear independent, that is, the selected columns from corresponding to the are linear independent. Next, we will show that ICF cannot breakdown before selecting linear independent columns from , where .
Set the rank of SPSD matrix as . ICF can sample linear independent columns from after iterations.
Theorem 2 has proven that ICF can generate a rank approximate matrix for after iterations. Next, we use reduction to absurdity to prove that ICF does not breakdown before iterations.
Assume for any , after iterations, , then for , . Set , then is not a full rank matrix as in (3).
Because is a rank SPSD matrix, has eigenvalue decomposition , where is a diagonal matrix, is a column orthogonal matrix. Denote as a sub-matrix of . It is comprised by the rows of with row indices corresponding to . Then has the decomposition . is not a full rank matrix, because is not a full rank matrix. Therefore, the row of can be represented linearly by the first rows. This conclusion is held for any in , hence every row in can be linearly represented by , and the rank of is at most . This contradicts that is a rank matrix. Therefore, the assume is not invalid, and there exists at least one satisfying after iterations, where and . In other words, ICF cannot breakdown before iterations. The theorem is proven. ∎
The number of iterations in ICF should be set no more than the rank of .
Assume there exists satisfying , then , and the ranks of and are both . Therefore, the rank of is , which contradicts that is a rank matrix. So the assumption is invalid. The number of iterations should be set no more than . ∎
The approximation error decreases monotonously as the number of iterations increases.
From Theorem 1, we obtain , . Therefore,
where and for . The total error .
When increases, increases. Hence decreases. Moreover, the number of elements in declines as increases. Therefore, the total approximation error decreases monotonously with the increasing of . ∎
Theorem 5 only shows that the approximation error decreases as increases, and when , then . However, it does not give the decline rate. The following theorem indicates that the error decreases exponentially when the eigenvalues of decay exponentially sufficiently fast.
Denote be the rank- approximation of , be the -th largest eigenvalue of matrix . Assume
for some uniformly in . Then, satisfies , where is a constant.
Because is the (incomplete) Cholesky factorization of , we have
In the above formula, the first inequality holds according to Harbrecht et al. (2012). Denote , then
The approximation error is bounded by
This implies . The theorem is proven. ∎
In order to further verify the Theorem 5 and 6, we conducted experiments on datasets USPS and MNIST by using ICF. The eigenvalues of are decay exponentially for these two datasets. We applied Gaussian kernel function in experiments, , where is the parameter of the kernel function, which were set as and for these two datasets, respectively. Fig. 1 gives the experimental results. It shows that the approximation error decreases exponentially as the increasing of iterations . Theorem 6 and the experimental results in Fig. 1 show that ICF is a exponential convergence algorithm. As the approximation error drops rapidly at first, in fact, it is sufficient to set as a few hundred number to get satisfy results, where .
5 Kernel -means clustering using incomplete Cholesky factorization
The runtime complexity of kernel -means clustering is very high, which leads to kernel -means algorithms running slowly and can not deal with large-scale datasets. The main reason is that standard kernel -means algorithm needs to compute entire kernel matrix. In this section, we apply incomplete Cholesky factorization method to obtain the low rank approximation of the kernel matrix. The new algorithm avoids computing the entire kernel matrix, reducing the computational complexity and the storage space of kernel -means clustering, but it can achieve comparable clustering performance as standard kernel -means using the entire kernel matrix.
ICF algorithm outputs a matrix so that kernel matrix . When has the eigenvalue decomposition , . Therefore, the problem (2) is equivalent to the following optimization problem:
where denotes the columns of . The idea of the kernel -means clustering using incomplete Cholesky factorization is that run Algorithm 1 first to obtain matrix , and then run -means clustering algorithm using the columns of as the input data to get the clustering results. Algorithm 2 lists the new algorithm in detail.
In the following, we bound the difference between the solutions of (2) and (5). Firstly, we give the equivalent form of (2). Let is a indicator matrix which has one non-zero element per row. When the -th sample belongs to the -th cluster, then , where denotes the number of samples in cluster . Note that is an identity matrix and is a symmetric matrix. Then,
where is a identity matrix. Therefore, (2) is equivalent to
The proof is completed by using (4). ∎
Theorem 7 indicates that the approximation error of the kernel -means clustering using ICF reduces as increases. The rate of decline is exponential.
Computational Complexity. The Algorithm 2 only consists of two steps. The first step is to perform ICF algorithm, the complexity of which is Zhou (2016). The second step is to run -means clustering on matrix , which takes time, where is the number of iterations required for convergence. Hence, the total computational complexity of Algorithm 2 is . By comparison, directly solving (2) by using entire kernel matrix takes time. Therefore, our ICF-based method greatly reduces the computational complexity.
In order to measure the performance of the new algorithm, we compare our proposed algorithm with kernel -means clustering and some of its improved algorithms in terms of clustering accuracy and time consumption. The first set of experiments was carried on three -dimensional synthetic datasets to show that the new algorithm can cluster non-linear data points well. The second set of experiments performs on several real-world datasets. The experimental results on medium-sized datasets demonstrate that the proposed algorithm is not only faster than the kernel -means clustering, but also can obtain as good performance as the kernel -means algorithm in terms of clustering accuracy. For large-sized real-world datasets, the full kernel matrix is infeasible, we only compare the performance of the proposed algorithm with improved -means algorithms. Gaussian kernel function was used for all the datasets. All algorithms were implemented in MATLAB and run on a 2.40 GHz processor with 8 GB RAM.
6.1 Synthetic datasets experiments
In order to show the clustering effect of the proposed algorithm, we generate three datasets named Ring, Parabolic and Zigzag, which cannot be clustered well by -means algorithm. Each dataset contains two clusters, and each cluster contains data points. The number of sampled data points is set as in ICF for all the datasets. The parameter in the Gaussian kernel function is set as , and for Ring, Parabolic and Zigzag datasets, respectively. Fig. 2 gives the experimental results. Fig. 2 illustrates that the new algorithm can cluster data points well even only using points. The clustering accuracy for each dataset is . This validates the performance of the new algorithm very well.
6.2 Real-world datasets experiments
In order to evaluate the performance of the proposed algorithm, we compare it with several state-of-art kernel clustering algorithms on four real-world datasets in terms of clustering accuracy and time.
We use two medium-size dataset and two large-size datasets to evaluate the performance of the algorithms. These datasets can be downloaded from LIBSVM website
PenDigits:The pen-based recognition of handwritten digits dataset contains training samples and test samples from classes. We combine them to form a dataset containing samples. Each sample is represented by a -dimensional vector.
Satimage: This dataset contains 4,435 training points and 2,000 test points with classes. We combine them to form a dataset containing points. The dimension of the data is 36.
Shuttle: This is a dataset with classes containing data points. Each points has features.
Mnist: This is a handwritten digits dataset containing data points. Each point is described by a vector of dimensions and assigned to one of classes, each class representing a digit.
We compare our algorithm with the kernel -means algorithms to verify that they achieve similar clustering accuracy. We also compare the proposed algorithm with improved kernel -means algorithms, in which full kernel matrix need not be computed. The comparison algorithms are listed as follows:
Kernel: The kernel -means algorithm Schölkopf et al. (1998) proposed by Schölkopf et al. This method requires to calculate the entire kernel matrix. The code has been included in the Matlab package.
KernelChol: This algorithm calculates the entire kernel matrix first, then complete Cholesky factorization is used to decompose the entire kernel matrix. Finally, the -means clustering algorithm is adopted on the rows of the decomposed matrix to obtain the clustering results.
Approx: The approximate kernel -means algorithm Chitta et al. (2011), which employs a randomly selected subset of the data to compute the cluster centers.
RFF: The random fourier feature (RFF) kernel -means clustering algorithm is proposed in Chitta et al. (2012). This algorithm applying RFF method to approximate the full kernel matrix, and then the -means clustering is used to the points in the transformed space.
Nyström: The entire kernel matrix is approximated as by Nyström method in this algorithm, where is randomly sampled from . Then -means clustering is applied on the rows of .
ICF: The kernel -means clustering using ICF is proposed by this paper.
We use the Gaussian kernel function for all the algorithms. The kernel parameters are set as , , and for PenDigits, Satimage, Shuttle and Mnist datasets, respectively. The number of elements in is denoted by “subsetsize”, which is varied from to . For approximate kernel -means, RFF kernel -means and Nyström kernel -means algorithms, “subsetsize” is the size of randomly selected subset, the number of Fourier components and the number of elements in , respectively. The error bound in ICF is set as for all the datasets. The maximum number of iterations are set as for -means and kernel -means. Each experimental result is the average of 10 independent experiments.
We compare all the algorithms mentioned in section 6.2.2 on PenDigits and Satimage datasets. However, for other two datasets, the full kernel matrix is infeasible, therefore we only compare the improved kernel -means algorithms, in which full kernel matrix does not need to be computed. Figs. 3- 5 show the clustering accuracy and running time of the algorithms on all datasets.
The accuracy of the kernel -means using ICF increases with the subsetsize increasing. That is because the larger the subsetsize is, the smaller the approximate error of ICF is.
For PenDigits and Satimage datasets, the proposed algorithm has the same accuracy as kernel -means with full kernel matrix, when subsetsizes are larger than and , respectively. However, Figs. 5(a) and 5(b) illustrate that the clustering time of the new algorithm is great less than the kernel -means clustering algorithms with full kernel matrix.
Fig. 3 indicates that the accuracy of ICF-based algorithm is better than the Nyström-based kernel -means algorithm, RFF-based kernel -means algorithm and approximate kernel -means algorithm.
The new algorithm can achieve good clustering accuracy when subsetsize is larger than a very small constant. For PenDigits, Satimage, Shuttle and Mnist datasets, the subsetsizes are just set as 25, 50, 50 and 50 to achieve good clustering accuracy. Fig. 5 shows that the running time of the four algorithms (ICF, Approx, RFF and Nystrom) is similar, while the clustering accuracy of the new algorithm is better than the other three algorithms.
The variances of Nyström-based algorithm, RFF-based algorithm and Approximate kernel -means are greater than the proposed algorithm. That is because these three methods are all based on the idea of randomly sampling.
In terms of running time, kernel -means using complete Cholesky factorization algorithm is slower than standard kernel -means due to complete Cholesky factorization consuming much time. However, our ICF-based algorithm greatly reduces running time, and faster than the standard kernel -means. This verifies the effectiveness of our method.
Compared with three improved kernel -means algorithms, the running time of the new algorithm increases slightly faster. However, when the subsetsize is small, the running time of these four algorithms is similar, while the clustering accuracy of our algorithm is better than that of the other three algorithms.
We have proposed a fast kernel -means clustering algorithm, which uses incomplete Cholesky factorization and -means clustering to obtain a good approximation of kernel -means clustering. We have analyzed the convergence of ICF algorithm and shown that the ICF is exponentially convergent if the eigenvalues of the kernel matrix exponentially decrease. We also have bounded the approximate error between ICF-based kernel -means algorithm and kernel -means clustering algorithm, and shown that the approximate error decreases exponentially. The experimental results illustrate that the proposed method is able to yield similar clustering accuracy as the kernel -means using entire kernel matrix, while the running time and the storage space are greatly reduced. In the future, we plan to investigate and research the minimum size of sampled subset required to yield similar accuracy as the kernel -means with entire kernel matrix.
This work is supported by the National Natural Science Foundation of China (NNSFC) [No. 61772020].
- journal: Journal of LaTeX Templates
- https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/
- Kernel independent component analysis. Journal of machine learning research 3, pp. 1–48. Cited by: §2.
- Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 15. Cited by: §1.
- Sparse algorithm for robust lssvm in primal space. Neurocomputing 275, pp. 2880–2891. Cited by: §2.
- Approximate kernel k-means: solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 895–903. Cited by: Table 1, §2, item 3.
- Efficient kernel clustering using random fourier features. In 2012 IEEE 12th International Conference on Data Mining, pp. 161–170. Cited by: Table 1, §2, item 4.
- Multiple sequence alignment with hierarchical clustering. Nucleic acids research 16 (22), pp. 10881–10890. Cited by: §1.
- Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 2790–2797. Cited by: §1.
- A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: §1.
- A novel greedy algorithm for nyström approximation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 269–277. Cited by: §2.
- Sparse spectral clustering method based on the incomplete cholesky decomposition. Journal of Computational and Applied Mathematics 237, pp. 145–161. Cited by: §2.
- CD-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics 28 (23), pp. 3150–3152. Cited by: §1.
- Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks 13 (3), pp. 780–784. Cited by: Table 1.
- Revisiting the nyström method for improved large-scale machine learning. Journal of Machine Learning Research 17, pp. 1–65. Cited by: §2.
- On the low-rank approximation by the pivoted cholesky decomposition. Applied numerical mathematics 62 (4), pp. 428–440. Cited by: §2, §4.
- Data clustering: 50 years beyond k-means. Pattern recognition letters 31 (8), pp. 651–666. Cited by: Table 1.
- A recommender system using ga k-means clustering in an online shopping market. Expert systems with applications 34 (2), pp. 1200–1209. Cited by: §1.
- An adaptive spatial fuzzy clustering algorithm for 3-d mr image segmentation. IEEE transactions on medical imaging 22 (9), pp. 1063–1075. Cited by: §1.
- Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: §1.
- Deterministic column sampling for low-rank matrix approximation: nyström vs. incomplete cholesky decomposition. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 594–602. Cited by: §2.
- Nonlinear component analysis as a kernel eigenvalue problem. Neural computation. Cited by: §1, item 1.
- Scalable kernel k-means clustering with nyström approximation: relative-error bounds. Journal of Machine Learning Research 20. Cited by: Table 1, §2, §3.1.
- Nyström method vs random fourier features: a theoretical and empirical comparison. In Advances in neural information processing systems, pp. 476–484. Cited by: §2.
- Improved nyström low-rank approximation and error analysis. In Proceedings of the 25th international conference on Machine learning. ACM, pp. 1232–1239. Cited by: §2.
- Sparse LSSVM in primal using Cholesky factorization for large scale problems. IEEE Transactions on Networks and Learning Systems 27 (4), pp. 783–795. Cited by: §1, §2, §3.2, §3.2, §5.