A High Performance Implementation of Spectral Clustering on CPUGPU Platforms
Abstract
Spectral clustering is one of the most popular graph clustering algorithms, which achieves the best performance for many scientific and engineering applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. In this paper, we present a fast implementation of the spectral clustering algorithm on a CPUGPU heterogeneous platform. Our implementation takes advantage of the computational power of the multicore CPU and the massive multithreading and SIMD capabilities of GPUs. Given the input as data points in high dimensional space, we propose a parallel scheme to build a sparse similarity graph represented in a standard sparse representation format. Then we compute the smallest eigenvectors of the Laplacian matrix by utilizing the reverse communication interfaces of ARPACK software and cuSPARSE library, where is typically very large. Moreover, we implement a very fast parallelized means algorithm on GPUs. Our implementation is shown to be significantly faster compared to the best known Matlab and Python implementations for each step. In addition, our algorithm scales to problems with a very large number of clusters.
CPUGPU platform; spectral clustering; sparse similarity graph;reverse communication interface; kmeans clustering
1 Introduction
Spectral clustering algorithm has recently gained popularity in handling many graph clustering tasks such as those reported in [van2013community, jin2015, craddock2012whole]. Compared to traditional clustering algorithms, such as kmeans clustering and hierarchical clustering, spectral clustering has a very well formulated mathematical framework and is able to discover nonconvex regions which may not be detected by other clustering algorithms. Moreover, spectral clustering can be conveniently implemented by linear algebra operations using popular scientific software environments such as Matlab and Python. Most of the available software implementations are built upon CPUoptimized Basic Linear Algebra Subprograms (BLAS), usually accelerated using multithread programming. However, such implementations scale poorly as the problem size or the number of clusters grow very large. Recent results show that GPU accelerated BLAS significantly outperforms multithreaded BLAS libraries such as the Intel MKL package, LAPACK and Goto BLAS [eddelbuettel2010benchmarking, cullinan2013computing]. Moreover, hybrid computing environments, which collaboratively combine the computational advantages of GPUs and CPUs, further boost the overall performance and are able to achieve very high performance on problems whose sizes grow up to the capacity of CPU memory [lee2014boosting, agulleiro2012hybrid, wu2014optimized, wuachieving, lihybrid, akimova2012parallel]. In this paper, we present a hybrid implementation of the spectral clustering algorithm which significantly outperforms the known implementations, most of which are purely based on multicore CPUs.
There have been reported efforts on parallelizing the spectral clustering algorithm. Zheng et al. [zheng2008parallelization] presented both CUDA and OpenMP implementations of spectral clustering. However, the implementation was targeted for a much smaller data size than the work in this paper, and moreover, their implementation achieve a relatively limited speedup. Matam et al. [matam2011gpu] implemented a special case of spectral clustering, namely the spectral bisection algorithm, which was shown to achieve high speedups compared to Matlab and Intel MKL implementations. Chen et al. [Chen11, song2008parallel] implemented the spectral clustering algorithm on a distributed environment using Message Passing Interface (MPI), which is targeted for problems whose sizes that could not fit in the memory of a single machine. Tsironis and Sozio [tsironis2013accurate] proposed an implementation of spectral clustering based on MapReduce. Both implementations were targeted for clusters, and involve frequent data communications which will clearly constrain the overall performance.
In this paper, we present a hybrid implementation of spectral clustering on a CPUGPU heterogeneous platform which significantly outperforms all the best implementations we are aware of, which are based on existing parallel platforms. We highlight the main contributions of our paper as follows:

Our algorithm is the first work to comprehensively explore the hybrid implementation of spectral clustering algorithm on CPUGPU platforms.

Our implementation makes use of sparse representation of the corresponding graphs and can handle extremely large input sizes and generate a very large number of clusters.

The hybrid implementation is highly efficient and is shown to make a very good use of available resources.

Our experimental results show superior performance relative to the common scientific software implementations on multicore CPUs.
The rest of the paper is organized as follows. Section II gives an overview of the spectral clustering algorithm, while describing the important steps in some detail. Section III describes the operating environment and the necessary software dependencies. Section IV provides a description of our parallel implementation, while Section V evaluates the performance of our algorithm with a comparison with Matlab and Python implementations on both synthetic and realworld datasets. The codes are available on https://github.com/yujumd/fastsc.
2 Overview of Spectral Clustering Algorithm
Spectral clustering was first introduced in 1973 to study the graph partition problem [donath1973lower]. Later, the algorithm was extended in [shi2000normalized, ng2002spectral], and generalized to a wide range of applications, such as computational biology [pentney2005spectral, higham2007spectral], medical image analysis [jin2015, craddock2012whole], social networks [white2005spectral, mishra2007clustering] and information retrieval [chifu2015word, mcfee2014analyzing]. A standard procedure of the spectral clustering algorithm to compute clusters is described next [von2007tutorial],

Step 1: Given a set of data points and some similarity measure , construct a sparse similarity matrix that captures the significant similarities between the pairs of points.

Step 2: Compute the normalized graph Laplacian matrix as where is the unnormalized graph Laplacian matrix defined as and is the diagonal matrix with each element .

Step 3: Compute the eigenvectors of the normalized graph Laplacian matrix corresponding to the smallest nonzero eigenvalues.

Step 4: Apply the means clustering algorithm on the rows of the matrix whose columns are the eigenvectors to obtain the final clusters.
Given the similarity graph defined by the similarity matrix , the basic idea behind spectral clustering is to partition the graph into partitions such that some measure of the cut between the partitions is minimized. The traditional graph cut is defined as follows:
(1) 
(2) 
To ensure that the each partition represents a meaningful cluster of reasonable size, two alternative cut measures are often used, namely RatioCut and normalized cut Ncut. Note that we use below as the number of nodes in and as the sum of the degrees of all the nodes in .
(3) 
(4) 
In our implementation, we focus on the problem of minimizing the Ncut which has an equivalent algebraic formulation as defined next.
(5) 
That is, we need to determine a matrix whose columns are indicator vectors, which minimizes the objective function introduced above.
Since this problem is NPhard, we relax the discrete constraints on are removed, thereby allowing to be any matrix in . Note that there is no theoretical guarantee on the quality of the solution of the relaxed problem compared to the exact solution of the discrete version. It turns out that the relaxed problem is a wellknown trace minimization problem, which can be exactly solved by taking as the eigenvectors with the smallest eigenvalues of the matrix or equivalently the generalized eigenvectors corresponding to the smallest eigenvalues of . The kmeans clustering is then applied on the rows of to obtain the desired clustering.
The algorithm described above begins with a set of dimensional data points and builds the similarity graph explicitly from the pairwise similarity metric. The similarity graph is usually stored in a sparse matrix representation, which often reduces the memory requirement and computational cost to linear instead of quadratic. For the general graph clustering whose input is specified as a graph, our spectral clustering algorithm starts directly in Step 2. Otherwise, we build our sparse graph representation from the given set of data points.
3 Environment Setup
3.1 The Heterogeneous System
The CPUGPU heterogeneous system used in our implementation is specified in Table 1.
CPU Model  Intel Xeon E52690 

CPU Cores  8 
DRAM Size  128GB 
GPU Model  Tesla K20c 
Device Memory Size  5GB GDDR5 
SMs and SPs  13 and 192 
Compute Capability  3.5 
CUDA SDK  7.5 
PCIe Bus  PCIe x16 Gen2 
The CPU and the GPU communicate through the PCIe bus whose theoretical peak bandwidth is 8 GB/s. The cost of data communication can be quite significant for largescale problems. To achieve the best overall performance, our implementation leverages the GPU to compute the most computationally expensive part while minimizing the data transfer between the host and the device.
3.2 CUDA Platform
CUDA is a generalpurpose multithreaded programming model that leverages the large number of GPU cores to solve complex data parallel problems. The CUDA programming model assumes a heterogeneous system with a host CPU and several GPUs as coprocessors. Each GPU has an array of Streaming Multiprocessors (SM), each of which has a number of Streaming Processors (SP) that execute instructions concurrently. The parallel computation on GPU is invoked by calling customized kernel functions using thousands of threads. The kernel function is executed by blocks of threads independently. Each block of threads can be scheduled on any Streaming Multiprocessors (SP) as shown in Figure 1. The kernel function takes as parameters the number of blocks and the number of threads within a block.
In addition, NVIDIA provides efficient BLAS libraries for both sparse
3.3 ARPACK Software
ARPACK is a software package designed to solve largescale eigenvalue problems [lehoucq1997arpack]. ARPACK is reliable and achieves high accuracy, and is widely used in modern scientific software environments. It contains highly optimized Fortran subroutines that are able to solve symmetric, nonsymmetric and generalized eigenproblems. ARPACK is based on the Implicitly Restarted Arnoldi Method (IRAM) with nontrivial numerical optimization techniques [lehoucq1996deflation, sorensen1992implicit]. In our implementation, we adopt ARPACK++
3.4 OpenBLAS
OpenBLAS
4 Implementation
4.1 Data Preprocessing
Given the dimensional data points, the preprocessing step constructs the similarity matrix from the data points. The clustering problem is reformulated as a graph clustering where the graph is represented by the similarity matrix.
As mentioned before, the similarity matrix is usually constructed to be sparse, which reduces the memory requirement and enables high computational efficiency. The sparsity patterns of the similarity matrices are highly dependent on the specific application. The following are several common ways to construct a sparse similarity matrix [von2007tutorial].

threshold graph: The similarity graph is constructed where data points are connected if their similarity measure is above the threshold .

distance graph: The similarity matrix is construct by only connecting data points that are within a spatial distance .

nearestneighbor graph: The similarity graph is constructed where two data points and are connected only if either is among the most similar data points of , or is among the most similar data points of . Note that the parameter is unrelated to the number of clusters used in the next section.
The notion of the similarity measure between data points also varies depending on the application. Typical measures are the following.

Cosine Similarity Measure
(6) 
Cross Correlation
(7) 
Exponential decay function
(8)
Although the sparse patterns and similarity measures are different depending on the application, the general construction of the similarity matrix can be accelerated under the CUDA programming model regardless of the preprocessing used. Here we provide a parallel implementation for a specific sparsity pattern and similarity measure.
We consider the input data as a matrix where is the number of data points and is the dimension of each data point. The goal is to construct a sparse matrix representation of the similarity graph using the distance graph structure and cross correlation as the similarity measure. We assume the neighborhood information is given by a list , which contains all pairs of indices of data points that are within distance. The number of such pairs is the number of edges in the graph. The procedure for constructing the sparse similarity matrix represented in Coordinate Format (COO) format is described in Algorithm 1.
Algorithm 1 Construction of Sparse Similarity Matrix 

1. Transfer the input data and edge lists from CPU to GPU. 
2. Initialize length vectors and on GPU. 
3. Initialize length vector on GPU. 
4. Execute kernel function compute_average where each thread computes 
5. Execute kernel function update_data where each thread updates one row of data and compute 
6. Execute kernel function compute_similarity where each thread computes the similarity between the pair of data points in . 
7. The edge list and the vector form the sparse graph represented in the Coordinate Format (COO) format. 
The above procedure is highly data parallel and easy to implement under the CUDA programming model. In general, there are two sparse matrix representations that we use in our work.

Coordinate Format (COO): this format is the simplest sparse matrix representation. Essentially, COO uses tuples to represent all the nonzero entries. This can be done through three separate length arrays that respectively store the row indices, column indices, and the corresponding nonzero matrix values.

Compressed Sparse Row Format (CSR): this consists of three arrays, one containing the nonzero values, the second containing the column indices of the corresponding nonzero values, and the third contains the prefix sums of the number of nonzero entries of the rows.
Other sparse formats such as Compressed Sparse Column Format (CSC), Block Compressed Sparse Row Format (BSR) are also supported in our implementation.
4.2 Parallel Eigensolvers
Given the similarity graph represented in some sparse format and the desired number of clusters , this step computes the eigenvectors corresponding to the smallest eigenvalues of normalized Laplacian where is the sparse matrix and is the diagonal matrix with each element . We assume that are all positive, otherwise the isolated nodes can be removed from the graph. The eigenvectors corresponding to the smallest eigenvalues of the normalized Laplacian are exactly the eigenvectors corresponding to the largest eigenvalues of . Since computing the largest eigenvalues results in better numerical stability and convergent behavior, we focus our attention on computing the eigenvectors corresponding to the largest eigenvalues of .
The sparse matrix multiplication can easily be computed as follows:
(9) 
The corresponding computation is data parallel and has complexity O(). We assume that the sparse similarity matrix initially resides in the device memory, represented in COO format. The parallel computation is described in Algorithm 2. Note that the will be transformed to the CSR format to perform the sparse matrixvector multiplication at the next step.
Algorithm 2 Parallel Computation of 
1. Initialize a nlength vector with 1.0 for all elements. 
2. Compute the vector where each element by calling cusparseDcsrmv in cuSPARSE library 
3. Execute the kernel function ScaleElements where each thread i processes one item in COO format and scales the element value by the inverse of . 
4. Compress the row indices through the cuSPARSE interface cusparseXcoo2csr. 
5. The compressed row indices, the column indices and the updated element value form the CSR representation of 
An important feature of the ARPACK software is the reverse communication interfaces, which facilitate the process of solving largescale eigenvalue problems. The reverse communication interfaces are CPUbased interfaces that encapsulate implicitly restarted Arnoldi/Lanczos method, which is an iterative method to obtain the required eigenvalues and corresponding eigenvectors. For each iteration, the interface provides a length vector used as input and the output of sparse matrixvector multiplication is provided back to the interface. ARPACK interfaces combine the optimized Fortran routines and CPUbased BLAS library OpenBLAS, which is one of the most efficient CPUbased BLAS library. ARPACK provides the flexibility in choosing any matrix representation format and the function to obtain the results of matrixvector multiplication. In our implementation, the matrixvector multiplication is performed on the GPU. For each iteration, the input vector is transferred from the CPU to the GPU and the output vector is transfered back to the interface. The detailed implementation is shown in Algorithm 3.
Algorithm 3 Parallel Eigensolver 

1. Initialize the object Prob with parameters. 
2. While !Prob.converge() 
Prob.TakeStep(). 
Transfer the data located at Prob.GetVector() from host to device. 
Call cusparseDcsrmv to perform matrixvector multiplication on device. 
Transfer the result from device to host and put it at the location addressed by Prob.PutVector(). 
3. Compute the eigenvectors by Prob.FindEigenvectors(). 
The object Prob is initialized as the eigenvalue problem for the symmetric real matrix with the largestmagnitude eigenvalues. TakeStep() is an interface that performs the necessary matrix operations based on the multithreaded OpenBLAS library. For each iteration, the multiplication of sparse matrix and dense vector is computed on the GPU where 1) the sparse matrix is reside on GPU; 2) the input vector, whose location is indicated by Prob.GetVector(), is transferred from CPU to GPU; 3) the result is transfered back from GPU to CPU to the position Prob.PutVector(). After the object Prob reaches convergence, the eigenvectors are computed by Prob.FindEigenvectors().
The complexity of Algorithm 3. largely depends on the interfaces TakeStep() and FindEigenvectors(). Both routines depend on the number of Arnoldi/Lanczos vectors, which is usually set as . TakeStep() involves the eigenvalue decomposition and iteratively QR factorization of matrix, as well as a few dense matrixvector multiplication. Therefore the complexity for TakeStep() is at least . Moreover, the general complexity for sparse matrixvector multiplication is . The number of iteration depends on the initial vector and properties of the matrix. The complexity FindEigenvectors() is . Hence the overall complexity is,
(10) 
As far as we know, the procedures described in Algorithm 3 are currently the most efficient and convenient way to solve general eigenvalue problems for largescale matrices. We leverage the existing software ARPACK on CPU to perform the complex eigensolver procedures and the GPU to perform the expensive matrix computations. Results in Section V. will show that the data communication overhead is negligible compared to the overall computational cost and the overall implementation is very efficient compared to other software that relies on CPUbased sparse matrixvector multiplication.
4.3 Parallel kmeans clustering
The kmeans clustering algorithm is an iterative algorithm to partition the input data points into k clusters whose objective function is to minimize the sum of squared distances between each point and its representative. In spectral clustering, the kmeans algorithm is used to cluster the rows of the matrix consisting of the eigenvectors. Each such row can in fact be viewed as
a reduced dimension representation of the original data point. There are several GPUbased implementations of the kmeans clustering such as [zechner2009accelerating, wu2011efficient]. However, none of these implementations seem
to be efficient for largescale problems, especially when k is very large. Our implementation is a revised version
from an opensource project
We assume that the lowdimensional representation initially resides in the CPU memory where is the number of data points and is the desired number of clusters. The implementation is described in Algorithm 4.
Algorithm 4 Parallel Kmeans Algorithm 
1. Transfer the data from the CPU to the GPU. 
2. Randomly select points as the centroids of the k clusters stored in 
3. While (the centroids change) do 
Compute the pairwise distances between data points and the centroids. 
Update the new label of each data point. 
Compute the new centroids of the clusters. 
4. Transfer the labeling result from GPU to CPU. 
Step 2 is the most common way to initialize the centroids. However, we use a more effective initialization strategy, referred to as the kmeans++ initialization, which has been shown to converge faster and achieve better results than the traditional kmeans algorithm [arthur2007k]. This initialization is simple to implement in parallel using basic routines in CUDA Thrust library, as described in Algorithm 5,
Algorithm 5 Parallel kmeans++ Initialization 
1. Pick the initial data point uniformly at random from 1 to n. 
2. Initialize the nlength vector Dist where each element is the shortest distance between the data point and the current centroids. 
3. for = 2 to k 
Compute the nlength vector such that 
Choose the centroid as the data point with probability 
Compute the vector newDist such that each element as the distance between the data point and the new centroid 
Update Dist 
Step 3 in Algorithm 4 is the main loop that iteratively updates the labels of the data points and the corresponding centers of the clusters until convergence (or the maximum number of iterations is reached). Given the data points and centroids , the pairwise distance matrix is computed as follows.
(11) 
After expanding the right hand side, the distance matrix can be expressed as
(12) 
Hence, we compute two additional vectors and ,
(13) 
(14) 
The matrix can be initialized as the sum of the corresponding elements in and
(15) 
The pairwise distance matrix is then computed by level3 BLAS function provided in the cuBLAS library.
(16) 
For each data point, the new label is updated by as the index of centroid which has the minimum distance to the data point. Meanwhile, a global variable is maintained to record the number of label changes during the update.
The new centroids are updated as the mean value of all the data points sharing the same label. To identify the points in each cluster, we sort the data points according to their new labels. Each GPU thread will then independently work on a consecutive portion of the sorted data points where most of these points share the same label.
The entire workflow of our implementation is summarized in Figure 2.
5 Evaluation
5.1 Datasets
We evaluate our parallel implementation on several realworld and synthetic datasets. The Diffusion Tensor Imaging (DTI) dataset is given as a set of data points, each of which is characterized by a 90dimensional array. The other datasets are specified by an undirected graph data where the edges are given by an edge list. The problem sizes and the numbers of clusters generated are shown in Table 2. A brief description of each dataset is given next.

DTI: The Diffusion Tensor Imaging(DTI) dataset is the brain image data of a subject chosen from a publicly accessible medical dataset provided by Nathan Kline Institute (NKI). The dataset captures the diffusion of the water molecules in the brain tissues, which can be used to deduce information about the fiber connectivity in the human brain. After preprocessing steps [jin2015], the input data consists of data points, each of which represents a 2mm2mm2mm brain voxel. The entire data points constitute the brain volume. Each data point is characterized by a 90dimensional array representing the connectivity strength of the voxel to 90 brain regions (representing a segmentation of the grey matter). The task is to cluster the voxels that share similar connectivity profiles. To facilitate the construction of the similarity matrix, an edge list is provided which contains all pair of voxels that are within 4 millimeter distance.

FB: This dataset is a dataset collected by a Facebook application. It contains the graph where each node represents an anonymous user and edges exist between users that share similar political interests[snapnets].

DBLP: This dataset consists of a comprehensive coauthorship network in a computer science bibliography. The nodes represent the authors. Authors are connected if they coauthored at least one publication[snapnets]. The dataset contains more than 5000 communities. Here we set the number of clusters to 500 for experimental purposes.

Syn200: The synthetic dataset is randomly generated by the stochastic block model [karrer2011stochastic]. The stochastic block model assumes that the data points are partitioned into disjoint subsets, . A symmetric matrix is provided to model the intercommunity edge probability. The synthetic sparse graph is randomly generated such that two nodes are connected with probability if they are within the same cluster and if they are in different clusters.
Dataset  Nodes  Edges  Clusters 

DTI  142541  3992290  500 
FB  4039  88234  10 
DBLP  317080  1049866  500 
Syn200  20000  773388  200 
5.2 Environment and Software
The computing environment is a heterogeneous CPUGPU platform with CPU and GPU specifics shown in Table 1. The software and packages used are as follows,

Matlab: Matlab is a highlevel language that provides interactive programing environment, which is widely used by scientists and engineers. The version of Matlab used for our implementation is 2015a. The sparse matrix representation and operations are the builtin functions. The kmeans clustering is the function in Statistical and Machine Learning toolbox.

Python: Python software packages, such as Numpy, Scipy and sklearn, are popular tools to perform scientific computations. The version of Python binary for our implementation is 2.7.11. The sparse representation and functions to solve the eigenvalue problems are from Scipy package. The kmeans clustering function is from sklearn.cluster module. The module versions are Numpy1.10.4, Scipy0.16.1 and sklearn0.17 respectively.
Linear algebra and numeric functions are by default multithreaded in Matlab on multicore and multiprocessor machines
5.3 Performance Analysis
We measure the running time of our spectral clustering algorithm on the three components separately: 1) computation of the similarity matrix; 2) sparse matrix eigensolver; and 3) the kmeans clustering algorithm. For the CUDA implementation, we measure the time costs that include both the computational time as well as the extra time for library initialization time and data communication. Specifically, we evaluate the performance of each of the following components:

Computation of the similarity matrix:

initialize CUDA libraries.

transfer data and edge list from CPU to GPU.

construct the similarity matrix.


Sparse matrix eigensolver:

data communication between CPU and GPU;

computation of the eigenvectors;

transfer of the eigenvectors from CPU to GPU.


Kmeans clustering:

perform the kmeans clustering;

tranfer the clustering result from GPU to CPU.

Time/s  CUDA  Matlab  Python 

Compute Similarity Matrix  0.0331  221.249  220.880 
Sparse Eigensolver  475.442  603.165  3281.973 
Kmeans Clustering  5.407  1785.17  2154.7818 
It is clear that our CUDA implementation significantly outperforms the currently fastest known Matlab and Python implementations at each step. Since the computation of the similarity matrix is highly parallel, the CUDA implementation achieves linear speedups by taking advantage of the GPU with thousands of threads computing the cross correlation coefficients concurrently. For the Matlab and Python implementations, the results are based on the serial implementation which loops over the edge list and computes the correlation coefficient explicitly using the builtin function. We also tested an alternative implementation which takes advantage of vectorization techniques that recast the loopbased operation into matrix and vector operations. The optimized Matlab and Python implementationd take and trespectively to compute the similarity matrix.
Both Matlab and Python packages utilize the reverse communication interfaces of ARPACK to compute the eigenvectors of largescale symmetric matrix, and hence all of the three implementations share similar procedures and interfaces. The basic difference is related to the function to compute the sparse matrixvector multiplication. Our CUDA implementation utilizes the GPU and the cuSPARSE library to compute the multiplication while Matlab and Python utilize their builtin routines. Since the GPU performs significantly better than the CPU on BLAS operations [cullinan2013computing], the CUDA implementation achieves better performance than Matlab and Python even with the communication overhead. However, since the time complexity of implicitly restarted Lanczos method is approximately , the time spent on the reverse communication interfaces scales relatively poorly, which may become the most computationally expensive part when is large.
As for the kmeans clustering algorithm, our CUDA implementation achieves more than 300x speedup over the Matlab and Python implementations. The running time of this step depends on the centroid initialization. The CUDA and Python implementations utilize the kmeans++ initialization, which leads to fewer number of iterations in general than Matlab. Moreover, in the CUDA implementation, the process of transforming the computation of the pairwise distance matrix to the BLAS operations significantly accelerates the running time of the algorithm.
The performance results for the graph datasets (FB, Syn200, dblp) are shown in Table 4 through Table 6 and Figure 4 through Figure 6. Similar to the previous results, our CUDA implementation achieves the best performance among the three implementations at each step. However, the speedup ratio depends on the specific problem size.
Time/s  CUDA  Matlab  Python 

Sparse Eigensolver  0.0216  0.1027  0.0851 
Kmeans Clustering  0.007251  0.0205  0.0259 
The FB dataset contains a very small graph with 4039 nodes and involves very few clusters . Because the number of clusters is small, the most expensive computation of sparse eigensolver is the sparse matrixvector multiplication. Therefore for this step,the CUDA implementation achieves around 5x speedup over the other implementations. For the kmeans clustering step, the CUDA implementation shows only a minor speedup by a factor of around 4x.
Time/s  CUDA  Matlab  Python 

Sparse Eigensolver  4.1153  6.9531  18.915 
Kmeans Clustering  0.02478  38.3728  2.4719 
The Syn200 dataset contains a mediumsized synthetic graph with 200 clusters. The CUDA implementation achieves a slight improvement in computing the eigenvectors since the performance is mainly constrained by the CPUbased routines. For the of kmeans clustering step, the CUDA implementation achieves over 100x speedup.
Time/s  CUDA  Matlab  Python 

Sparse Eigensolver  682.643  1885.2303  9338.31 
Kmeans Clustering  1.79456  1012.92  719.686 
The dblp dataset contains a largescale graph with 500 clusters. Both Matlab and Python implementations perform poorly for such a problem size. Our CUDA implementation achieve around 3x speedup in sparse eigensolver in spite of the fact that the performance is still constrained by the CPUbased interfaces. In the kmeans clustering step, the CUDA implementation achieves over 400x speedup.
Table 7 shows a comparison between data communication time and computation time for the CUDA implementation on each of our four datasets. The data communication time includes 1) input data transfered from CPU to GPU; 2) data communication between CPU and GPU during the execution of the eigensolver stage; 3) output results that are transferred from GPU to CPU. Given that the bandwidth remains constant during the execution of the algorithm, the time complexity of data communication is depending on the sparsity ratio of the similarity matrix and the number of Arnoldi iterations n; the time complexity of computation is . Therefore we expect the data communication time to be less than the computational time as in fact illustrated in the Table 7, especially for largescale problems.
Time/s  Communication  Computation 

DTI  2.248  475.213 
FB  0.002131  0.02635 
DBLP  2.731  680.31 
Syn200  0.0741  3.8201 
In conclusion, our CUDA implementation always achieves better performance than Matlab and Python implementations for each step. The speedup ratio largely depends on the specific problem size. Our traget applications involve problems with a large number of clusters. Our implementation achieves significant speedups for the steps of computing the similarity matrix and the kmeans clustering due to the massive computational power of GPU. Moreover, we always achieve some speedups for the sparse eigensolver step by accelerating the computations involving matrixvector multiplications.
6 Conclusion
We presented a high performance implementation of the spectral clustering algorithm on CPUGPU platforms. Our implementation leverages the GPU to accelerate highly parallel computations and Basic Linear Algebra Subprograms (BLAS) operations. We focused on the acceleration of the three major steps of the spectral clustering algorithm: 1) construction of the similarity matrix; 2) computation of eigenvectors for largescale similarity matrices; 3) kmeans clustering algorithm. We believe that we are the first to accelerate the largescale eigenvector computation by combining the interfaces of traditional CPUbased software packages ARPACK and GPUbased CUDA library. Such a combination achieves good speedups compared to other CPUbased software. We deploy a smart seeding strategy and utilize BLAS operations to implement the fast kmeans clustering algorithm. Our implementation is shown to achieve significant speedup compared to Matlab and Python software packages, especially for largescale problems.
Acknowledgment
We gratefully acknowledge funding provided by The University of Maryland/Mpowering the State through the Center for Healthrelated Informatics and Bioimaging (CHIB) and the NVIDIA Research Excellence Center at the University of Maryland.
Footnotes
 http://docs.nvidia.com/cuda/cusparse/
 http://docs.nvidia.com/cuda/cublas/
 http://reuter.mit.edu/software/arpackpatch/
 http://www.openblas.net/
 https://github.com/bryancatanzaro/kmeans
 http://www.mathworks.com/discovery/matlabmulticore.html