Nearest Neighbor based Clustering Algorithm for Large Data Sets
Clustering is an unsupervised learning technique in which data or objects are grouped into sets based on some similarity measure. Most clustering algorithms assume that main memory is unbounded and can hold the entire set of patterns. In reality, many applications give rise to sets of patterns that are too large to fit in main memory. When the data set is too large, much of the data is stored in secondary memory, and input/output operations (I/Os) to and from disk become the major bottleneck in designing efficient clustering algorithms for large data sets. Different design techniques have been used to build clustering algorithms for large data sets. External memory algorithms are one such class: they exploit the hierarchical memory structure of computers by incorporating locality of reference directly into the algorithm. This paper contributes to the design of clustering algorithms in the external memory model (proposed by Aggarwal and Vitter in 1988) to make the algorithms scalable. We show that the Shared Near Neighbors (SNN) algorithm is not very I/O efficient, since its I/O complexity is the same as its computational complexity. We redesign the algorithm in the external memory model and reduce its I/O complexity, while the computational complexity remains the same. We substantiate the theoretical analysis by implementing both the proposed algorithm and its traditional counterpart with the STXXL library and comparing their performance.
Keywords: Clustering of Large data sets, Nearest Neighbor clustering, Shared Near Neighbors clustering, External memory clustering algorithms
1 Introduction
Clustering is an unsupervised learning technique in which data or objects are grouped into sets based on some similarity measure. The data points within a group are similar, and points in different groups are dissimilar. There are a few typical requirements for a good clustering technique in data mining [16, 9]: versatility, the ability to discover clusters with different shapes, a minimum number of input parameters, robustness to noise, insensitivity to the order of the input data, scalability to high dimensionality, and scalability to large data sets.
1.1 Clustering of Large Data Sets
The performance of an algorithm should not degrade as the data size increases. Most clustering algorithms are designed for small data sets and fail to meet the last requirement, i.e., scalability to large data sets. Many scientific, engineering and business applications frequently produce very large data sets. The definition of “large” varies with changes in technology, mainly the memory and the computational speed of computers. A data set that is large in today’s computing environment may not be considered large a few years from now. However, data sizes are increasing much faster than the technology to handle them. The majority of clustering algorithms are not designed to handle large data sets. A few approaches have been proposed in the literature to handle large data sets, e.g., decomposition and incremental approaches [10, 12]. Parallel implementation is also used to handle large data sets.
A few algorithms in the literature also use preprocessing steps such as summarization, incremental processing, approximation and distribution to cluster large data sets efficiently. With the help of these preprocessing steps, they store a summary of the data set and generate the clusters by considering only the summary. Examples include Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH), CLARANS, Clustering Using Representatives (CURE), and scalable k-means++. The BIRCH and CLARANS algorithms are suitable when the clusters are convex or spherical and of uniform size. However, they compromise on quality when clusters have different sizes or non-spherical shapes. These algorithms also use random sampling and randomized search, which degrade the quality of the clustering because not all data points are considered [18, 7].
In traditional algorithm design, it is assumed that main memory is infinite and allows uniform and random access to all its locations. In reality, present-day computers have multiple levels of memory, and accessing data from each level has its own cost and performance characteristics. If the data is too large to fit in main memory, it has to be stored on the disk of the machine. Disk access is millions of times slower than main memory access. Most clustering algorithms assume that main memory is large enough for the data set; for large data sets, this is not a realistic assumption. So for large data, the usual computational cost may not be an appropriate performance metric, and the number of input/outputs (I/Os) can be a more appropriate performance measure. Different design techniques are used to build algorithms for large data sets. External memory algorithms are one such class of algorithms, which exploit the hierarchical memory structure of computers by incorporating locality of reference directly into the algorithm. The external memory model was introduced by Aggarwal and Vitter in 1988. The Input/Output model (I/O model) views the computer as consisting of a processor, internal memory (RAM), and external memory (disk). The external memory is considered unlimited in size and is divided into blocks of consecutive data items. Transfer of a block of data between disk and RAM is called an I/O.
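As a rough illustration of why the I/O model counts block transfers rather than operations, the sketch below compares a sequential scan of contiguously stored items with the worst case in which every access touches a different block. The function names `scan_ios` and `random_access_ios` and the parameters `n_items` and `block_items` are illustrative assumptions, not notation from the paper.

```python
import math


def scan_ios(n_items: int, block_items: int) -> int:
    """Block transfers needed to read n_items stored contiguously on disk."""
    return math.ceil(n_items / block_items)


def random_access_ios(n_accesses: int) -> int:
    """Worst case with no locality: each access brings in a different block."""
    return n_accesses


# Hypothetical sizes chosen only for illustration.
n_items = 1_000_000
block_items = 1024
sequential = scan_ios(n_items, block_items)
unblocked = random_access_ios(n_items)
print(sequential, unblocked)
```

With these assumed sizes the scan needs roughly a thousandth of the transfers of the unblocked access pattern, which is the kind of gap that locality of reference is meant to exploit.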
1.2 Contribution of the paper
Shared Near Neighbor (SNN) is a technique in which the similarity of two points is defined by the number of common neighbors the two points share. The main advantage of shared near neighbor based clustering algorithms is that the number of clusters need not be specified; it is generated automatically by the algorithm. Document clustering and temporal time series clustering are a few examples where the SNN clustering technique is used. In this paper, an SNN based clustering algorithm is designed in the external memory model to make it scalable. The computational as well as the I/O complexity of the SNN algorithm is . This is the reason why the SNN algorithm is not I/O efficient and hence unsuitable for large data sets. The traditional SNN algorithm is redesigned in the external memory model to make it I/O efficient. We show that the I/O complexity of the proposed algorithm is which is a factor improvement over the traditional SNN algorithm. The computational complexity remains the same. Both the traditional and the proposed algorithms are implemented, and the performance of the proposed algorithm is compared with that of its traditional counterpart. The proposed algorithm outperforms the in-core algorithm, as expected from the theoretical results.
1.3 Organization of the paper
This paper is organized as follows: Section 2 describes the proposed scalable shared near neighbors based clustering algorithm and its I/O analysis. Section 3 contains the experimental results and observations. The concluding remarks and future work are given in Section 4.
2 Proposed Scalable Clustering Algorithm based on SNN
Shared Near Neighbor (SNN) is a technique in which the similarity of two points is defined by the number of neighbors the two points share. It can efficiently generate clusters of different sizes and shapes.
The SNN algorithm takes two input parameters: $k$ (the size of the nearest neighbors list) and the similarity threshold. The performance of the algorithm depends on these parameters. An analytical process for finding the most appropriate values of the input parameters has been proposed in the literature.
2.1 Traditional Shared Near Neighbors Algorithm
The SNN algorithm has two steps. In the first step, the k-nearest neighbors of all the points are calculated. The distance between two points can be calculated using any distance measure. The k-nearest neighbors of a point are arranged in ascending order of distance. Since each point is its own zeroth neighbor, the first entry of each neighborhood row is the index of the point itself. In the second step, the shared near neighbors of each data point are calculated. Assume that $i$ and $j$, with $i < j$, are any two points that have at least a threshold number (the similarity threshold) of matching neighbors and that belong to each other’s neighborhood lists. Then the bigger index, $j$, is replaced by the smaller index, $i$. That is, since $i$ and $j$ are similar, $j$ is labeled as $i$. I/O complexity of this algorithm can be analyzed as follows: For the first step I/O complexity is and for second step number of I/Os is . Hence overall complexity of the traditional algorithms is which is same as computational complexity.
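The two steps above can be sketched in memory as follows. This is a minimal illustration and not the authors' implementation: the function names `knn_lists` and `snn_labels`, the Euclidean distance, the index-based tie-breaking, and the choice to count the points themselves among the matching neighbors are all assumptions made for the example.

```python
import math


def knn_lists(points, k):
    """For each point, return [its own index] + indices of its k nearest neighbors."""
    n = len(points)
    rows = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (math.dist(points[i], points[j]), j),  # ties by index
        )
        rows.append([i] + others[:k])  # the point is its own zeroth neighbor
    return rows


def snn_labels(rows, threshold):
    """Label points so that similar points carry the smaller of their labels."""
    n = len(rows)
    labels = list(range(n))  # initially every point is its own cluster
    neighbor_sets = [set(r) for r in rows]
    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbor_sets[i] and i in neighbor_sets[j]
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            if mutual and shared >= threshold:
                old = max(labels[i], labels[j])
                new = min(labels[i], labels[j])
                if old != new:
                    # Replace the bigger label by the smaller one everywhere.
                    labels = [new if lab == old else lab for lab in labels]
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(snn_labels(knn_lists(points, k=2), threshold=2))  # [0, 0, 0, 3, 3, 3]
```

Both loops touch every pair of points, which is why the access pattern matters once the neighbor rows no longer fit in main memory.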
2.2 Proposed Scalable Shared Near Neighbors Algorithm
The traditional algorithm is not very I/O efficient and hence not suitable for large data sets. In this section, we redesign the traditional algorithm in the external memory model to make it I/O efficient and hence scalable to large data sets. The computational steps of the proposed algorithm are the same as those of the traditional algorithm, but the data access pattern is modified to make the algorithm scalable.
2.2.1 Computation of k-Nearest Neighbor Matrix:
First step of the algorithm is I/O efficient generation of the k-nearest neighbor matrix. Assume that the dataset is partitioned into blocks each of size . Here is a parameter to be fixed depending on the available main memory. Read any two blocks and into main memory and calculate the distance between each pair of points in the main memory. Store the distance in a temporary vector called “dist” of size and corresponding points index in matrix block of size . After computation of the k-nearest neighbor of the block , write the matrix block into the external memory. Repeat the process for times. This process will generate matrix and the procedure is described in Algorithm 1. The main memory contains 2-blocks of size and 2-blocks of size . Hence .
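The block-wise access pattern described above can be imitated in memory as in the sketch below, where list slices stand in for block reads and appending rows stands in for writing a matrix block to disk. The function name `blocked_knn` and the parameter `block_points` (number of points per block) are assumptions for illustration; this is not the paper's Algorithm 1, and the Euclidean distance is an arbitrary choice.

```python
import heapq
import math


def blocked_knn(points, k, block_points):
    """Build k-nearest-neighbor rows while holding only two blocks at a time."""
    n = len(points)
    starts = range(0, n, block_points)
    rows = []
    for a in starts:
        block_a = points[a:a + block_points]  # simulated read of the current block
        best = [[] for _ in block_a]  # per point: up to k (distance, index) pairs
        for b in starts:
            block_b = points[b:b + block_points]  # simulated read of the other block
            for ia, p in enumerate(block_a):
                candidates = [
                    (math.dist(p, q), b + ib)
                    for ib, q in enumerate(block_b)
                    if b + ib != a + ia  # skip the point itself
                ]
                # Keep only the k closest seen so far, in ascending order.
                best[ia] = heapq.nsmallest(k, best[ia] + candidates)
        for ia, nearest in enumerate(best):  # simulated write of the matrix block
            rows.append([a + ia] + [j for _, j in nearest])
    return rows


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(blocked_knn(points, k=2, block_points=2))
```

The result does not depend on `block_points`; only the number of simulated block reads changes, which is the quantity the external memory model measures.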
2.2.2 The clustering step:
In the first step of the algorithm matrix of size is generated which is the input to the next phase of the algorithm. Assume that the matrix is divided into blocks of size each. Also assume that label table of size is divided into blocks of size each. Read any two blocks and where and also read two blocks of label table, and into the main memory. Then find all possible pair points satisfying the SNN similarity criteria of and blocks. In this way the label of all the points of the block is calculated. Repeat the process for times. This process will generate cluster labels. The procedure is described in Algorithm 2.
Here the main memory contains 2-blocks of size and 2-blocks of size . Hence . The transfer of blocks between main memory and disk is shown in Figure 1.
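A simplified sketch of the block-pair traversal in the clustering step is given below. It is not the paper's Algorithm 2: the label table is replaced by an in-memory union-find array that always keeps the smallest index as the representative, and the names `blocked_snn_labels` and `block_rows` are assumptions. Only the order in which pairs of neighbor-matrix blocks are visited is meant to mirror the text.

```python
def blocked_snn_labels(rows, threshold, block_rows):
    """Cluster labels from neighbor rows, visiting one pair of blocks at a time."""
    n = len(rows)
    parent = list(range(n))  # in-memory stand-in for the label table (assumption)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    starts = list(range(0, n, block_rows))
    for a in starts:
        block_a = [set(r) for r in rows[a:a + block_rows]]  # simulated read
        for b in starts:
            if b < a:
                continue  # visit each unordered pair of blocks once
            block_b = [set(r) for r in rows[b:b + block_rows]]  # simulated read
            for ia, na in enumerate(block_a):
                i = a + ia
                for ib, nb in enumerate(block_b):
                    j = b + ib
                    if j <= i:
                        continue
                    # SNN criterion: mutual neighbors with enough shared entries.
                    if j in na and i in nb and len(na & nb) >= threshold:
                        ri, rj = find(i), find(j)
                        if ri != rj:
                            parent[max(ri, rj)] = min(ri, rj)
    return [find(x) for x in range(n)]


rows = [[0, 1, 2], [1, 0, 2], [2, 0, 1], [3, 4, 5], [4, 3, 5], [5, 3, 4]]
print(blocked_snn_labels(rows, threshold=2, block_rows=2))  # [0, 0, 0, 3, 3, 3]
```

Because the larger representative is always attached to the smaller one, every point ends up labeled with the smallest index in its cluster, matching the labeling rule of the traditional algorithm.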
2.3 I/O Analysis
Traditional algorithm takes number of I/Os. I/O complexity of the proposed algorithm is described here.
2.3.1 I/O complexity of Algorithm 1
The main memory contains 2-blocks of size and 2-blocks of size . Hence , i.e., . Total number of I/Os required to generate the matrix is =
2.3.2 I/O complexity of Algorithm 2
Here , i.e., . Total number of I/Os required by the algorithm is= .
So the total number of I/Os incurred by the two phases of the algorithm is . The dimension is a constant, so ignoring the constant term , the I/O complexity of the algorithm is which is a factor improvement over the traditional algorithm.
3 Experimental Results
3.1 Performance of the Proposed Algorithm
Several external memory software libraries have been developed, among them STXXL, LEDA-SM and TPIE. STXXL is used in our implementation. STXXL is an implementation, or adaptation, of the C++ Standard Template Library (STL) for external memory computations. Both the traditional and the proposed algorithms are implemented with STXXL.
Since both algorithms follow exactly the same computational steps, the computational complexity remains the same and both generate the same set of clusters. Hence the quality analysis is omitted. Our main focus is on analyzing and reducing the I/O complexity of the algorithm.
The data sets are generated randomly. The dimension of the data is and the size of the data set varies from . The main memory size is restricted to 1 MB only and the Hard disk size is GB. The algorithm is implemented on Ubuntu system with a GHz CPU(Intel Core 2 Duo) and GB main memory.
For ease of implementation, the algorithmic block size () is set the same as the disk block size (). When the block size is set to 8 KB and the available main memory is restricted to 1 MB, the total number of reads or writes goes beyond 500 for 10 data points in the case of the traditional algorithm, while in the proposed algorithm the number of reads or writes is less than even for data points. Similar results were obtained for the total number of I/Os and the total data read and written. The total number of I/Os for the traditional algorithm exceeds for data points, while it is less than for data points in the case of the proposed algorithm. The in-core algorithm fails to give a result after days for points. The panels of Figure 2 illustrate the number of reads, writes, data R/Ws and I/Os respectively.
3.2 Effect of Main Memory Size on the Performance of the Proposed Algorithm
The proposed algorithm is run with different main memory sizes to study the effect of main memory on the performance of the algorithm. The main memory is restricted to 1 MB, 4 MB, 16 MB and 128 MB.
It is clear from the graphs that the number of I/Os decreases as the main memory increases. On close inspection, when the number of data points is and main memory size is 128 MB, the line denoting the total number of I/Os is very close to the x-axis. A similar effect of main memory size can be seen in the other graphs as well. This substantiates the theoretical I/O analysis, since the number of I/Os depends on the available main memory size. The panels of Figure 3 illustrate the number of reads, writes and I/Os respectively.
4 Conclusion
This paper contributes to the field of big data clustering by redesigning an existing algorithm in the external memory model. The shared near neighbors (SNN) algorithm has been designed in the external memory model. It is shown that the I/O complexity of the proposed algorithm is which is a factor improvement over the traditional SNN algorithm. Both algorithms produce the same set of clusters. Both algorithms are implemented in STXXL to compare the performance of the proposed algorithm with that of the in-core algorithm. The design technique can be used to adapt various existing algorithms to large data. Without theoretical analysis, it is often difficult to say which clustering algorithm will perform better for data sets of different sizes. One direction for our future work is therefore to analyze the I/O complexity of the best known clustering algorithms in the literature and to design them in the external memory model to make them suitable for massive data sets.
References
- [1] Abello, J., Pardalos, P.M., Resende, M.G.: Handbook of massive data sets. Springer (2002)
- [2] Aggarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Communications of the ACM 31(9), 1116–1127 (1988)
- [3] Arge, L., Procopiuc, O., Vitter, J.: Implementing I/O-efficient data structures using TPIE. In: Algorithms – ESA 2002, pp. 88–100. Springer (2002)
- [4] Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proceedings of the VLDB Endowment 5(7), 622–633 (2012)
- [5] Crauser, A., Mehlhorn, K.: LEDA-SM: Extending LEDA to secondary memory. In: Algorithm Engineering, pp. 228–242. Springer (1999)
- [6] Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for XXL data sets. Software: Practice and Experience 38(6), 589–637 (2008)
- [7] El-Sharkawi, M.E., El-Zawawy, M.A.: Algorithm for spatial clustering with obstacles. In: International Conference on Intelligent Computing and Information Systems (ICICIS’02) (2002)
- [8] Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record. vol. 27, pp. 73–84. ACM (1998)
- [9] Han, J., Kamber, M.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann (2006)
- [10] Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: A review. ACM Computing Surveys (CSUR) 31(3), 264–323 (1999)
- [11] Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers C-22(11), 1025–1034 (1973)
- [12] Judd, D., McKinley, P.K., Jain, A.K.: Large-scale parallel data clustering. In: Proceedings of the 13th International Conference on Pattern Recognition. vol. 4, pp. 488–493. IEEE (1996)
- [13] Moreira, G., Santos, M.Y., Moura-Pires, J.: SNN Input Parameters: how are they related? In: International Conference on Parallel and Distributed Systems (ICPADS). pp. 492–497. IEEE (2013)
- [14] Musser, D.R., Derge, G.J., Saini, A.: STL tutorial and reference guide: C++ programming with the standard template library. Addison-Wesley Professional (2009)
- [15] Ng, R.T., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14(5), 1003–1016 (2002)
- [16] Zaïane, O.R., Foss, A., Lee, C.H., Wang, W.: On data clustering analysis: Scalability, constraints, and validation. In: Advances in Knowledge Discovery and Data Mining, pp. 28–39. Springer (2002)
- [17] Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD Record. vol. 25, pp. 103–114. ACM (1996)
- [18] Zhang, X.: Contributions to Large Scale Data Clustering and Streaming with Affinity Propagation. Application to Autonomic Grids. Ph.D. thesis, Université Paris-Sud (2010)