Nonparametric Power-law Data Clustering
Abstract
Automatically determining the number of clusters from the distribution of a dataset has always been a great challenge for clustering algorithms. Several approaches have been proposed to address this issue, including the recent promising work that incorporates Bayesian nonparametrics into the k-means clustering procedure. This approach is simple to implement and theoretically solid, and it also provides a feasible way to perform inference on large-scale datasets. However, several problems remain unsolved in this pioneering work, including its applicability to power-law data, a mechanism for merging centers to avoid overfitting, and the clustering-order problem. To address these issues, the Pitman-Yor Process based k-means (namely pyp-means) is proposed in this paper. Taking advantage of the Pitman-Yor Process, pyp-means treats clusters differently by dynamically and adaptively changing the threshold, which guarantees power-law clustering results. In addition, a center agglomeration procedure is integrated into the implementation to merge small but close clusters and thereby adaptively determine the cluster number. Together with further discussion of the clustering order, a convergence proof, complexity analysis and an extension to spectral clustering, our approach is compared with traditional clustering algorithms and variational inference methods. The advantages and properties of pyp-means are validated by experiments on both synthetic and real-world datasets.
I Introduction
Power-law data is ubiquitous in the real world. Examples include social networks on Facebook, topics in web forums and citations among published papers. This kind of data differs from traditional data in the large fluctuations that occur in the tails of its distributions. Detecting power-law phenomena and characterizing the structure of such data have received increasing attention in recent years. Clustering is an essential technique for learning data structure owing to its capability of grouping data collections automatically. However, the key challenge of applying clustering to power-law data lies in the difficulty of inferring the cluster number, as well as determining the cluster sizes.
In previous decades, various clustering methods have been proposed for dealing with different kinds of data. However, most of them, including classic k-means [1][2][3], Mixture Models [4][5], Spectral Clustering [6][7] and Mean Shift [8][9], assume the cluster number to be prior information provided by users, whose value is usually unknown to the user. A few initial approaches [10][11][12][13] have been proposed to handle this unknown-cluster-number problem. However, most of them address the problem through model selection criteria, which leads to a dilemma in the choice of criterion.
Bayesian nonparametric learning, a fast-growing research topic in recent years, can be utilized as an effective approach to the parameter selection issue. Its core idea is to place the required parameters, e.g., the cluster number, under a hyper-distribution and employ inference methods to learn the posterior probability of the latent variables given the observations. It has demonstrated significant contributions to parameter inference. However, it often suffers from the difficulty of designing learning schemes based on conjugacy assumptions, as well as the computational complexity induced by inference and sampling. To address this problem, a method called "dp-means" [14][15] has been proposed to bridge classic k-means clustering and the nonparametric Dirichlet Process Gaussian Mixture Model (DPGMM). Taking advantage of the asymptotic zero-covariance property of Gaussian Mixture Models, dp-means naturally introduces a fixed threshold to determine whether a data point should belong to an existing cluster or a new cluster should be created for it. It provides a unified view that combines Bayesian nonparametric methods and hard clustering algorithms to address learning on large-scale datasets.
However, several issues remain unsolved in this promising pioneering work. (i) The method is not designed for power-law data: a global threshold for all clusters may result in clusters of similar sizes. (ii) A mechanism to merge close centers is needed: the algorithm may produce many small clusters, some of which should be merged if they are close enough. (iii) The clustering order influences the result, and strategies to address this issue deserve more discussion.
This paper proposes a novel clustering approach, Pitman-Yor Process means (pyp-means), for clustering power-law data. A modified Pitman-Yor Process [16][17] is first proposed to approximate the power-law data structure under hard partitioning. Unlike the fixed threshold of dp-means, the modified Pitman-Yor Process introduces an adaptive threshold whose value changes with the cluster number during clustering: the larger the cluster number, the smaller the threshold. In this way, pyp-means establishes a proper connection between the cluster number and the threshold setting.
A center agglomeration procedure is also proposed to adaptively determine the cluster number. To address the issue that the clustering procedure may produce isolated small clusters that are not far from each other, we check the distance between each pair of cluster centers and combine two clusters if their distance is smaller than a given value. This prevents the clustering result from overfitting the power-law distribution while disregarding the real data distribution.
The heuristic "furthest first" strategy is discussed further here to address the data order issue. We prove that once the cluster number stops increasing, any ordering of the remaining data points results in the same clustering.
The convergence of pyp-means is proved and the complexity of the algorithm is analyzed. We further extend our method to spectral clustering to demonstrate the effectiveness of our work.
The contributions of our work are summarized as follows:

we extend the newly proposed dp-means to a modified Pitman-Yor process based k-means algorithm to address power-law data; this generalization is able to cluster both power-law datasets and normal datasets.

we integrate a center agglomeration procedure into the main implementation to overcome the overfitting problem.

we introduce a heuristic "furthest first" strategy to address the data order issue during the clustering procedure.

we prove the convergence of pyp-means and analyze the complexity of the algorithm. We also extend the method to spectral clustering, expanding our approach to multiple clustering algorithms.
The remainder of the paper is organized as follows. Section II introduces related work on Gaussian Mixture Models, their recent derivatives, and the Pitman-Yor Process. The modified Pitman-Yor Process for power-law data is presented in Section III. Section IV then discusses our proposed approach pyp-means in detail, including the main implementation, the strategy for the data order issue, and the center agglomeration procedure for avoiding overfitting. Further discussion of pyp-means' convergence proof, complexity analysis and its extension to spectral clustering can be found in Section V. Section VI presents experimental results on different datasets to demonstrate the effectiveness of our work. The last section concludes the paper.
II Background
We briefly introduce the relation between Gaussian Mixture Models (and their derivatives) and k-means clustering [4], and then the Pitman-Yor process.
II-A Gaussian Mixture Models and their derivatives toward k-means
Gaussian Mixture Models (GMM) treat a dataset as a collection of samples from several Gaussian distributions. The likelihood of a data point in the dataset can be calculated as:
(1)   p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
Here K is the number of components, \pi_k denotes the proportion of component k, and \mathcal{N}(x_i \mid \mu_k, \Sigma_k) is x_i's Gaussian likelihood under component k.
A local maximum of a Gaussian Mixture Model can be reached by applying Expectation Maximization, iterating the following equation:
(2)   \gamma(z_{ik}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
Here \gamma(z_{ik}) denotes the probability of assigning data point x_i to cluster k. As a result, GMM is regarded as a kind of soft clustering [14], which differs from traditional hard clustering (e.g., k-means) in that it assigns a single point to multiple clusters with probabilities.
Actually, the "asymptotic" link (i.e., the zero-variance limit) between GMM and k-means clustering is a well-known result [4][15]. More specifically, if the covariance matrices of all mixture components in the GMM are assumed to be \sigma^2 I, the likelihood becomes
(3)   p(x_i) = \sum_{k=1}^{K} \pi_k \, (2\pi\sigma^2)^{-D/2} \exp\!\left( -\frac{\|x_i - \mu_k\|^2}{2\sigma^2} \right)
while \gamma(z_{ik})'s calculation changes to:
(4)   \gamma(z_{ik}) = \frac{\pi_k \exp\!\left( -\|x_i - \mu_k\|^2 / 2\sigma^2 \right)}{\sum_{j=1}^{K} \pi_j \exp\!\left( -\|x_i - \mu_j\|^2 / 2\sigma^2 \right)}
Consider the case \sigma \to 0: the smallest term \|x_i - \mu_j\|^2 dominates the denominator of Eq. (4). Thus, \gamma(z_{ik}) becomes:
(5)   \gamma(z_{ik}) \to \begin{cases} 1, & k = \arg\min_j \|x_i - \mu_j\|^2 \\ 0, & \text{otherwise} \end{cases}
In this case, GMM degenerates into k-means, which assigns each point to its nearest cluster with probability 1.
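The hardening of soft assignments in this zero-variance limit can be checked numerically. The sketch below (a hedged illustration, not the paper's code) computes equal-weight GMM responsibilities under a shared covariance \sigma^2 I and shows them collapsing to the k-means assignment as \sigma shrinks:

```python
import numpy as np

def responsibilities(x, centers, sigma):
    """Soft GMM responsibilities with equal mixing weights and shared
    covariance sigma^2 * I; as sigma -> 0 these harden into the
    nearest-center (k-means) assignment."""
    d2 = np.array([np.sum((x - c) ** 2) for c in centers])
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

With a large sigma both clusters receive appreciable probability; with a tiny sigma virtually all mass lands on the nearest center.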
Both GMM and k-means suffer from the selection of the cluster number. To address this problem, the Dirichlet Process [16][17] has recently been introduced into k-means. Taking advantage of the Dirichlet Process, a distance threshold can be generated to prevent a data point from being assigned to a cluster if its distance exceeds the threshold. If a data point cannot be assigned to any cluster, a new cluster is created with that point as its center.
More specifically, a Dirichlet Process can be denoted DP(\alpha, G_0), with concentration hyperparameter \alpha and base distribution G_0 = \mathcal{N}(0, \rho I). The hyperparameter \alpha can be further written as a function of \sigma and a threshold parameter \lambda, and the base measure is used to generate the Gaussian distributions of new clusters. Gibbs sampling is employed for inference in the above process. The probability used in Gibbs sampling can be written as
(6)   P(z_i = c \mid z_{-i}, x_i) = \frac{1}{Z} \begin{cases} n_{-i,c} \, \mathcal{N}(x_i \mid \mu_c, \sigma^2 I), & c \text{ an existing cluster} \\ \alpha \, \mathcal{N}(x_i \mid 0, (\rho + \sigma^2) I), & c \text{ a new cluster} \end{cases}
where c indexes the existing clusters, n_{-i,c} represents the size of the c-th cluster excluding data point x_i, and Z is the normalizing constant.
As \sigma \to 0, the allocated label for data point x_i becomes:
(7)   z_i = \begin{cases} \arg\min_c \|x_i - \mu_c\|^2, & \min_c \|x_i - \mu_c\|^2 \le \lambda \\ K + 1 \ (\text{new cluster}), & \text{otherwise} \end{cases}
Once Gibbs sampling assigns a new label to x_i, a new cluster is generated with a Gaussian distribution drawn from the base measure. Within finitely many steps, a local minimum is achieved, and all data points are assigned to clusters whose centers lie within the given threshold distance.
Taking advantage of the Dirichlet Process, this work, known as dp-means, successfully combines prior information (the hyperparameter \alpha) and the local information of each component (its Gaussian distribution).
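As a concrete illustration, the following sketch implements the hard-assignment rule just described, with a squared-distance threshold lam playing the role of \lambda. The initialization and stopping details are our own simplifications, not the original authors' exact procedure:

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Sketch of a dp-means-style loop: assign each point to its nearest
    center unless the squared distance exceeds lam, in which case the
    point seeds a new cluster; then recompute centers as cluster means."""
    centers = [X[0].copy()]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center, or a new one beyond the threshold.
        labels = []
        for x in X:
            d2 = [np.sum((x - c) ** 2) for c in centers]
            j = int(np.argmin(d2))
            if d2[j] > lam:
                centers.append(x.copy())
                j = len(centers) - 1
            labels.append(j)
        labels = np.array(labels)
        # Update step: recompute each center as the mean of its points.
        new_centers = [X[labels == j].mean(axis=0) for j in range(len(centers))]
        if all(np.allclose(a, b) for a, b in zip(centers, new_centers)):
            break
        centers = new_centers
    return np.array(centers), labels
```

On two well-separated blobs with a moderate threshold, this discovers two clusters without being told the cluster number.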
Dp-means treats each component equally: one universal threshold is set for all clusters. As a result, the clustering result tends to contain clusters of similar sizes. However, clusters in real-world datasets usually vary greatly in size; they typically obey power-law distributions. It would therefore be desirable for the method to generate more reasonable clustering results that satisfy a power-law distribution.
Though a remarkable work, it calls for a systematic study of its unsolved problems, including the growing cluster number, the clustering order, complexity analysis, etc. Other practical issues, including power-law data approximation and the adjustment of the threshold parameter, also need further investigation.
II-B Pitman-Yor Process
The Pitman-Yor process (py-process) [18][19] is a generalization of the Dirichlet Process. In the py-process, a discount parameter d is added to increase the probability of generating a new class. Owing to the tuning effect of the discount parameter, it is a suitable model for depicting power-law data. The py-process degenerates to the classic Dirichlet Process when d is set to 0.
A Pólya urn scheme is used here to explain the py-process' generative paradigm from a technical perspective. In this scheme, objects of interest are represented as colored balls contained in an urn. At the beginning, the urn is empty and all balls are uncolored. We pick the first ball, paint it a certain color and put it into the urn. In each following step, we pick one ball, color it and put it into the urn. The color of the ball is allocated according to the following probability:
(8)   P(c_i = k \mid c_1, \dots, c_{i-1}) = \begin{cases} \frac{n_{-i,k} - d}{i - 1 + \theta}, & k \text{ an existing color} \\ \frac{\theta + dK}{i - 1 + \theta}, & k \text{ a new color} \end{cases}
where c_i denotes the color assigned to the i-th ball picked, n_{-i,k} denotes the number of balls of color k excluding ball i, i - 1 is the total number of balls excluding ball i, and K is the current number of colors.
The process continues until all balls are painted and put into the urn. Given fixed cluster sizes, the joint probability is unchanged under reordering, a property referred to as "exchangeability".
The py-process preserves the Dirichlet Process' "rich get richer" property during the color assignment: the more balls there are of a certain color, the greater the probability that a new ball will be painted that color. Thanks to the discount parameter d, the probability of generating a new color in the py-process is greater than in the DP. It can be proven that the py-process assigns colors to data points in a power-law fashion. Therefore, incorporating the py-process into clustering algorithms is a promising way to address power-law data.
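The urn scheme can be simulated directly. The sketch below assumes the standard Pitman-Yor predictive probabilities with discount d and strength \theta; running it with d > 0 typically yields many more, and more unevenly sized, colors than d = 0:

```python
import random

def pitman_yor_urn(n, theta, d, seed=0):
    """Simulate the Pitman-Yor urn: ball i joins an existing color k with
    probability (n_k - d) / (i + theta), or a new color with probability
    (theta + d * K) / (i + theta), where K is the current color count."""
    rng = random.Random(seed)
    counts = []                       # counts[k] = number of balls of color k
    for i in range(n):                # i balls already placed
        u = rng.uniform(0, i + theta)
        for k, nk in enumerate(counts):
            u -= nk - d               # mass of existing color k
            if u < 0:
                counts[k] += 1
                break
        else:
            counts.append(1)          # new color (mass theta + d * K)
    return counts
```

With n = 1000 balls, a run at d = 0.5 produces on the order of n^d colors, versus roughly \theta \log n colors at d = 0, illustrating the heavier tail.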
III Modified Pitman-Yor Process for Hard Clustering
Power-law data [20], also described as data with heavy-tailed behavior, represents the case where the frequency or size of data clusters follows a power-law distribution, i.e., many small clusters appear. In real life, a wide range of data obeys power laws, including the frequencies of words in languages, the populations of cities and the intensities of earthquakes. In most situations, such findings in power-law data would be considered noise or defects. However, they are at the same time among the most interesting parts of the whole observation.
In our scenario, the data within one cluster are regarded as the same, and we assume the cluster sizes follow a power-law distribution. In contrast to ordinary data, clustering such data encounters more difficulties, such as trivial cluster discovery, cluster number determination and the related imbalance problem.
A general form of the density function of power-law data is:
(9)   f(x) = L(x) \, x^{-\alpha}
where L(x) is a slowly varying function, i.e., any function satisfying \lim_{x \to \infty} L(tx)/L(x) = 1 for every constant t > 0, and \alpha is the decay exponent.
Faced with these difficulties in power-law data clustering, traditional clustering methods tend to group the small clusters into major clusters or simply treat them as noisy data points. This is improper, since these trivial clusters may still be important to the overall data structure. Many soft clustering methods, including the py-process, have been put forward to mine this kind of data effectively. They have achieved good results; however, most of them still suffer from implementation complexity and demanding requirements. To the best of our knowledge, little work has been done on the hard clustering scenario, nor on an equivalence connecting it with classic k-means clustering.
Following the core idea of the py-process, i.e., increasing the probability of generating a new cluster, we revise the concentration parameter of dp-means to depend on the current cluster number.
More specifically, during each ball's coloring in the Pólya urn scheme, the color is allocated according to the paradigm below:
(10) 
It is straightforward to see that the balls remain exchangeable, which is fundamental for the power-law approximation.
Proposition 1.
The revised allocation paradigm in Eq. (10) keeps the exchangeability property, i.e., the joint probability of a dataset is not affected by its order, given fixed cluster sizes.
Proof.
Assume the cluster number is K and the cluster sizes are n_1, \dots, n_K. The joint probability of the data points is then
(11) 
The expression is determined only by these variables, not by the order of the data points, so exchangeability is preserved. ∎
In our revision, when the parameter is fixed at 1, the method reduces to the standard dp-means.
IV py-process means
Benefiting from the revised color allocation paradigm (Eq. (10)), we extend the existing dp-means algorithm into a further generalization, named pyp-means. The induction strategy is similar to that of dp-means: employing the same setting on both the finite and infinite Gaussian Mixture Models, the parameters are modified as in our revised Pitman-Yor process approximation.
This leads to the probability of assigning data point x_i to an existing cluster:
(12) 
and the probability of assigning it to a new cluster:
(13) 
As \sigma \to 0, the dominating term is the minimal distance \|x_i - \mu_c\|^2, leading to the cluster allocation paradigm:
(14) 
Here the cluster number is constrained to avoid the minimal-distance degeneracy. By shrinking the threshold in accordance with the cluster number, more hidden clusters can be discovered, and the resulting clusters are more compact than under a larger threshold.
IV-A Main implementation
One stage of our main implementation of pyp-means clustering is quite similar to dp-means. However, differences arise in the threshold, which fluctuates during the clustering procedure, and in a stepwise/adaptive density checking procedure.
A definition concerning the reach of cluster centers is first made to clarify the notation.
Definition 1.
A data point lying within the threshold ball of a center is said to be in reach of that center (in-data), while a point outside the ball is said to be out of reach of it (out-data).
Under Def. 1, dp-means produces results in which every data point is in reach of some center.
Figure 1. depicts the process of our implementation.
The whole implementation consists of three procedures: data partition, center recalculation and center agglomeration. The data partition procedure shares similarities with the existing dp-means and divides the data into in-data and out-data. The in-data are clustered in the usual k-means way, while the out-data are clustered adaptively to determine the clusters, as discussed in detail later. The center recalculation procedure is the same as the corresponding step of k-means. The center agglomeration procedure avoids generating too many trivial clusters; its details are also discussed later.
The detailed implementation of our proposed pyp-means is shown in Algorithm 1.
The objective function is defined as the sum of all intra-cluster distances (the cost term) plus a penalty term:
(15) 
Since the cost term decreases as the cluster number grows while the penalty term increases with it, the minimum we seek is a tradeoff between the two.
IV-B Reclustering the out-data
The order in which the data are clustered affects performance in pyp-means, which suffers from the same problem as [16]. We discuss this problem here according to the two stages of our main implementation: in-data clustering and reclustering of the out-data.
When clustering in-data points, any order of these data points results in the same clustering, as in classic k-means clustering.
On the contrary, the clustering order of out-data points affects the sequential determination of centers, and different newly generated centers surely affect the data assignments. To handle this, a heuristic search method called "furthest first" is employed. At the start of reclustering, we choose the data point whose shortest distance to all existing centers is the largest and set it as the new cluster center. We then remove this data point from the out-data and recursively continue reclustering.
One benefit of "furthest first" is that we can rule out the generation of new clusters once the cluster number stops increasing: any order of the remaining data points then leaves the clustering unchanged. This saves computational cost in determining the centers.
Proposition 2.
During each iteration of our method, once one remaining out-data point fails to become a new cluster center, none of the following data points will become new centers either.
Proof.
Assume the remaining out-data points, visited in furthest-first order, have nearest-center distances d_1 \ge d_2 \ge \cdots. If the first point does not become a new center, then d_1 is below the current threshold, and the threshold stays fixed since the cluster number no longer changes. In the spirit of the Apriori rule, every remaining point, whose distance is at most d_1, then also falls below the threshold and belongs to an existing cluster.
Thus, under our selection order no new cluster centers are generated. ∎
We formalize our reclustering procedure on the remaining dataset as Algorithm 2.
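A minimal sketch of the reclustering loop in Algorithm 2, under the assumption that the threshold is supplied as a callable of the current cluster count (the paper's exact threshold formula is not reproduced here):

```python
import numpy as np

def furthest_first_recluster(out_data, centers, threshold_fn):
    """'Furthest first' heuristic: repeatedly promote the out-data point
    whose nearest-center distance is largest; once that distance falls
    below the cluster-count-dependent threshold, all remaining points can
    be assigned in arbitrary order (cf. Proposition 2).
    threshold_fn(k) is an assumed callable giving the squared-distance
    threshold when k clusters exist."""
    centers = [c.copy() for c in centers]
    remaining = list(out_data)
    while remaining:
        # nearest-center distance for every remaining out-data point
        d = [min(np.linalg.norm(x - c) for c in centers) for x in remaining]
        i = int(np.argmax(d))
        if d[i] ** 2 <= threshold_fn(len(centers)):
            break  # no further point can seed a new cluster
        centers.append(remaining.pop(i).copy())
    return centers
```

A constant `threshold_fn` recovers dp-means-style behavior; a decreasing one mimics the pyp-means adaptive threshold.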
IV-C Center agglomeration procedure
In dp-means, a new cluster is generated whenever new out-data are encountered; however, a cluster never disappears, even if it moves much closer to another cluster. This can cause an overfitting problem in which one dense cluster is divided into two parts. Moreover, this particular kind of overfitting yields an improperly small threshold when determining the valid cluster number. Thus, we need to determine the cluster number adaptively.
From the following proposition, we can evaluate the change in the objective value when two clusters are combined into one. The condition for combining clusters can then be well established.
Assume \lambda and d are the predefined parameters, k is the current cluster number, and the \mu_i and n_i are the corresponding cluster centers and cluster sizes. We have the following proposition:
Proposition 3.
If two clusters satisfy the condition in Eq. (18) below, then combining them reduces the value of the objective function (Eq. (15)).
Proof.
Assume \mu_i and \mu_j are the centers of the two close clusters, and let \tilde{\mu} denote the center of their combination. The two cluster centers then satisfy the following relation with the new combined center:
(16) 
We first show that combining the two clusters results in an increase in the cost term,
(17) 
Because the cluster number is reduced by one, the cluster-number penalty term drops accordingly. Thus, if the following condition is satisfied,
(18) 
the objective function decreases under the combination of the two clusters. ∎
The following simple prototype illustrates the idea more clearly. Consider four data points to be clustered, where two nearby points have previously been clustered as individual clusters. According to the agglomeration procedure, we combine these two clusters, since the distance between their centers satisfies the condition. Without this procedure, the two clusters would remain separate, leading to an unsatisfactory result.
In our detailed implementation, we run this agglomeration procedure each time after the cluster centers are recalculated. By checking whether any pair of cluster centers satisfies the condition, we effectively prevent the situation above.
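To make the procedure concrete, the sketch below uses a dp-means-style penalty \lambda k; the merge test then follows from a standard identity for the cost increase of merging two clusters, which we assume matches the spirit (not necessarily the exact form) of the condition in Eq. (18):

```python
import numpy as np

def should_merge(mu_i, n_i, mu_j, n_j, lam):
    """Merging clusters with centers mu_i, mu_j and sizes n_i, n_j raises
    the within-cluster cost by exactly
    (n_i * n_j / (n_i + n_j)) * ||mu_i - mu_j||^2  (a Ward-style identity),
    while a dp-means-style penalty lam * k drops by lam. Merge whenever
    the cost increase is smaller than the penalty saved."""
    increase = (n_i * n_j / (n_i + n_j)) * np.sum((mu_i - mu_j) ** 2)
    return increase < lam

def agglomerate(centers, sizes, lam):
    """Greedily merge qualifying pairs until no pair qualifies."""
    centers = [np.asarray(c, float) for c in centers]
    sizes = list(sizes)
    merged = True
    while merged and len(centers) > 1:
        merged = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                if should_merge(centers[i], sizes[i], centers[j], sizes[j], lam):
                    total = sizes[i] + sizes[j]
                    # size-weighted mean becomes the combined center
                    centers[i] = (sizes[i] * centers[i] + sizes[j] * centers[j]) / total
                    sizes[i] = total
                    del centers[j]; del sizes[j]
                    merged = True
                    break
            if merged:
                break
    return centers, sizes
```

Two clusters 0.1 apart are merged under \lambda = 1, while a far-away third cluster is left alone.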
V Further Discussion
The work is extended here with further discussion, including convergence analysis, complexity analysis, and a possible extension to spectral clustering.
V-A Convergence analysis
Guaranteeing a local minimum within finitely many steps is vital for pyp-means. We approach this goal by first showing that the objective function (Eq. (15)) strictly decreases during each iteration.
Proposition 4.
The objective function (Eq. (15)) strictly decreases during each iteration of pyp-means.
Proof.
Each iteration is divided into three stages: partitioning the data points, updating the cluster centers, and the agglomeration procedure.
In the partitioning stage, the distance between an in-data point and its newly assigned cluster center does not increase, as confirmed by [21]. Out-data points that are set as new cluster centers have their cost contribution shrink to the penalty value. Moreover, as the cluster number increases, the penalty value per new cluster decreases, which reduces the objective function further.
In the center-updating stage, the mean is always the optimal representative with the least cost value.
In the agglomeration procedure, the objective function strictly decreases, as shown in Section IV-C.
Thus, the objective function is strictly decreasing. ∎
Employing an idea similar to [21], the convergence property of pyp-means is easily obtained.
Theorem 1.
pypmeans converges to a partial optimal solution of the objective function (Eq. 15) in a finite number of iterations.
Proof.
Since there are finitely many data points, there are only finitely many possible partitions of them.
Assume our claim is not true. Then, because the partitions are finite in number, some partition must occur at two different iterations, say t1 and t2 with t1 < t2.
According to Proposition 4, our objective function strictly decreases as the iterations proceed, so the objective value at iteration t2 is strictly smaller than that at iteration t1. But identical partitions yield identical objective values, which is a contradiction. Thus the assumption fails and we obtain our conclusion. ∎
By Theorem 1, our procedure is guaranteed to reach a local minimum within finitely many steps.
V-B Complexity analysis
Our proposed pyp-means is scalable in the number of data points n and the final cluster number k. The computational complexity can be analyzed as follows, considering the three major computational steps of one iteration.

Partitioning the data points. After initialization of the centers, this step mainly consists of two procedures.

For in-data of size n1, this process is the same as in k-means clustering: it simply compares each point's distances to all cluster centers. Thus, the complexity of this step is O(n1 k).

For out-data of size n2, the reclustering process involves a sort operation, whose best achievable complexity is O(n2 log n2). In the worst case, where every out-data point becomes a new center, the cost would be O(n2^2). Together with the label assignment process, the complexity is O(n2^2 + n2 k).


Updating cluster centers. Given the partition matrix, updating the cluster centers amounts to computing the mean of the data points in each cluster. Thus, over the k clusters, the computational complexity of this step is O(n).

Agglomeration procedure. This procedure needs to check all possible pairs of clusters, so a complexity of O(k^2) is needed.
Assuming the clustering process needs t iterations to converge, the total computational complexity of the algorithm is on the order of O(t(nk + n2^2 + k^2)). Since the out-data are usually only a small subset of the whole dataset, the algorithm is computationally feasible. However, if the threshold is set to a small value, leading to a larger k, the computational cost becomes heavy.
V-C Pitman-Yor spectral clustering
Our work can also be transplanted into the spectral clustering framework. We first show that our objective function (Eq. (15)) in pyp-means is equivalent to the trace optimization problem:
(19) 
where K is the kernel matrix.
The detailed proof is quite similar to the one in [16] and is omitted here to avoid duplication.
The classical characterization of orthonormal matrices in spectral theory states that when the indicator matrix is chosen to be the top eigenvectors, the objective function in Eq. (19) reaches its maximum for a fixed number of clusters. For a flexible cluster number, as in our problem, the objective in Eq. (19) is maximized when the indicator matrix is chosen to be the matrix of eigenvectors with nonnegative eigenvalues of the correspondingly adjusted matrix.
In particular, we determine the integer number of clusters through an adaptive measure connecting the threshold to the eigenvalues of the similarity matrix, i.e.:
(20) 
where \lambda and d are the predefined parameters, the eigenvalues of the kernel matrix K are sorted in decreasing order, and \sigma_k denotes the k-th largest eigenvalue.
After obtaining the relaxed cluster indicator matrix, we cluster its rows as data points using k-means, following the standard spectral clustering method, and take the corresponding result as the final clustering result.
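A small sketch of the spectral step, under our reading that the adaptive rule amounts to keeping the kernel eigenvectors whose eigenvalues exceed a threshold (the exact form of Eq. (20) is not reproduced here):

```python
import numpy as np

def py_spectral_embed(K, lam):
    """Hedged sketch of the Pitman-Yor spectral step: keep the eigenvectors
    of the kernel matrix K whose eigenvalues exceed the threshold lam, and
    return both the implied cluster number and the row embedding that a
    subsequent k-means pass would cluster."""
    vals, vecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    keep = vals > lam
    k = int(keep.sum())
    Y = vecs[:, keep]                # n x k relaxed indicator matrix
    return k, Y
```

On a block-diagonal kernel with two blocks, this selects k = 2 and a two-column embedding.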
Vi Experiments
The experimental evaluation is conducted on three types of datasets: synthetic datasets, UCI benchmark datasets [22] and the US communities crime dataset [23]. All datasets are preprocessed by normalizing each feature into the interval [0, 1]. Furthermore, the clustering process of each algorithm is repeated 50 times for each setting, and the average is taken as the final result. All experiments were run on a computer with an Intel Xeon(R) 2.53 GHz CPU under Microsoft Windows 7, with the algorithms coded in Matlab.
VI-A Experimental Setting
For a thorough comparison, our proposed pyp-means is compared with three baseline algorithms: k-means clustering, dp-means and Dirichlet Process variational learning (V.L.).
Parameters of these algorithms are set accordingly. In k-means clustering, the predefined cluster number is set to the true number for synthetic data, and we use a random initialization strategy for the starting partition. dp-means and pyp-means use the same parameter setting, described later. For V.L., we use the variational inference procedure [24] for learning, and the related parameters are determined by cross-validation.
VI-B Performance Metrics
Validating clustering results is always a nontrivial task. Given the true labels of the synthetic data, we employ accuracy to measure the effectiveness of the proposed methods, defined as follows:
(21)   \mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(c_i)\big)
where n is the data size, y_i and c_i denote the true label and the obtained label of point i; \delta is the Dirac function, with \delta(x, y) = 1 if x = y and 0 otherwise; map(·) is a permutation function that maps each cluster label to a category label, and the optimal matching can be found by the Hungarian algorithm [25].
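The metric can be computed as follows, using SciPy's Hungarian-algorithm solver to find the optimal label permutation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: find the label permutation maximizing agreement
    via the Hungarian algorithm, then score as ordinary accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: w[p, t] = #points with predicted label p, true label t.
    w = np.zeros((D, D), dtype=int)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    row, col = linear_sum_assignment(-w)   # negate to maximize matches
    return w[row, col].sum() / len(y_true)
```

Note that a prediction that merely swaps the two label names still scores 1.0, since accuracy is computed after the optimal relabeling.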
Besides ACC, the NMI (normalized mutual information) is also used for the synthetic data, i.e.,
(22)   \mathrm{NMI} = \frac{\sum_{i,j} n_{ij} \log \frac{n \, n_{ij}}{n_i \hat{n}_j}}{\sqrt{\left( \sum_i n_i \log \frac{n_i}{n} \right) \left( \sum_j \hat{n}_j \log \frac{\hat{n}_j}{n} \right)}}
where n_{ij} is the number of agreements between cluster i and cluster j, n_i is the number of data points in cluster i, \hat{n}_j is the number of data points in cluster j, and n is the total number of data points in the dataset.
VI-C Synthetic Dataset
To stay focused, the synthetic datasets are constructed to exhibit power-law behavior. We investigate multiple aspects of our method, including clustering accuracy and NMI performance, the relationship between the threshold and the number of discovered clusters, running time, etc.
VI-C1 Synthetic data generation
The synthetic data are generated by the same algorithm as in [26]. The power-law property is obtained by assigning more data points (about 200) to each of the first few clusters while keeping the others at about 30. The cluster number varies from 3 to 150 to cover larger cases. Each cluster is distributed according to a 3-dimensional Gaussian distribution whose center is drawn uniformly at random.
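A hedged re-creation of this generation scheme; the covariance scale, the number of large clusters and the center distribution are our assumptions where the text leaves them unspecified:

```python
import numpy as np

def make_powerlaw_blobs(n_clusters, dim=3, big=200, small=30, n_big=3,
                        sigma=0.05, seed=0):
    """Generate power-law-sized Gaussian blobs: the first n_big clusters get
    `big` points each, the rest get `small`, each drawn from an isotropic
    Gaussian around a uniformly random center in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for k in range(n_clusters):
        size = big if k < n_big else small
        center = rng.uniform(0.0, 1.0, dim)
        X.append(rng.normal(center, sigma, (size, dim)))
        y.append(np.full(size, k))
    return np.vstack(X), np.concatenate(y)
```

With 5 clusters this yields 3 x 200 + 2 x 30 = 660 labeled points in 3 dimensions.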
VI-C2 Practical parameter setting
We employ the method in [16] to set \lambda's value. We first roughly estimate the cluster number and initialize a center as the global mean. We then iteratively select the data point with the largest distance (defined as the smallest distance to all existing centers) as a newly generated center. The maximum such distance when the estimated cluster number is reached is taken as the value of \lambda in our experiments. The value of d is set experimentally; a detailed discussion of its determination is given later.
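One plausible reading of this heuristic as code; the initialization from the global mean and the use of squared distances are assumptions on our part:

```python
import numpy as np

def estimate_lambda(X, k_hat):
    """Threshold-setting heuristic in the spirit of [16]: start from the
    global mean, greedily add the point farthest (in squared distance) from
    all current centers, and after k_hat additions return the last such
    farthest distance as the threshold lambda."""
    centers = [X.mean(axis=0)]
    lam = 0.0
    for _ in range(k_hat):
        # squared distance from every point to its nearest current center
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        i = int(np.argmax(d))
        lam = float(d[i])
        centers.append(X[i])
    return lam
```

A larger rough estimate k_hat produces a smaller lambda, and hence a finer clustering.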
VI-C3 Simulation results
Figure 3 shows the results on the synthetic dataset. The experiments cover cluster numbers from 3 to 150, and the corresponding NMI and accuracy scores are recorded. From Figure 3, it is easy to see that both dp-means and pyp-means obtain satisfactory results when the cluster number is small. However, when more clusters are generated, dp-means falls below 0.8 in NMI and 70 in accuracy, while pyp-means achieves much better performance in both NMI and accuracy. Note also that our pyp-means outperforms k-means clustering in most cases, even though the latter is given the true cluster number.
VI-C4 Parameter learning
In this part, the threshold parameter is varied over a range of values. The "discovery rate" is employed to quantify the number of clusters we have uncovered. By default, we fix d in pyp-means. The detailed results are shown in Figure 4. From this figure, we find that a smaller threshold results in a larger discovered cluster number. This is quite reasonable, as a smaller threshold leads to smaller cluster sizes and thus a larger cluster number. Also, our proposed pyp-means discovers a relatively accurate cluster number when the true number is less than 75, whereas dp-means only performs well in the 10-cluster case.
VI-C5 Cluster number learning
This experiment shows the corresponding cluster numbers discovered under the previous parameter setting. Since k-means always takes the true cluster number as prior information, we compare the discovery rates of the other three methods. As the cluster number increases, all of the discovery rates slowly decrease. However, our pypmeans performs better than the existing dpmeans when facing a large number of clusters.
VI-C6 Running time test
The running time of our method is tested to validate our complexity analysis, both against the other methods and across our own parameter settings. From Figure 6 we can see that pypmeans runs in approximately the same time as dpmeans. Even in the large-scale case the running time remains tolerable, while the running time of variational learning grows exponentially.
VI-D Real World Datasets
Two kinds of real-world datasets are used here to further validate pypmeans's performance: the UCI benchmark datasets and the US Communities and Crime dataset.
VI-D1 UCI benchmark datasets
Two types of UCI benchmark datasets are selected for our study: power-law datasets, and “normal” datasets with a few clusters or equally sized clusters. We do this to show that our method not only performs better on the power-law datasets but also obtains satisfactory results on the “normal” ones.
Since power-law datasets are limited, we manually construct some by removing data points from certain clusters in some datasets. A maximum likelihood estimate of the power-law exponent [20] is used to characterize the power-law behavior:

(23)  $\hat{\alpha} = 1 + K\left[\sum_{i=1}^{K} \ln \frac{n_i}{n_{\min}}\right]^{-1}$

Here $n_i$ denotes the size of the $i$-th cluster, $n_{\min}$ the smallest cluster size, and $K$ the cluster number.
The smaller $\hat{\alpha}$, the stronger the power-law tendency.
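The estimator in (23) is a few lines of code; this sketch follows the continuous MLE of Clauset et al. [20], with the function name our own:

```python
import numpy as np

def alpha_mle(sizes):
    """Continuous MLE of the power-law exponent:
    alpha = 1 + K / sum_i log(n_i / n_min), over the K cluster sizes n_i.
    (Diverges when all sizes are equal, i.e. no power-law tail at all.)"""
    sizes = np.asarray(sizes, dtype=float)
    return 1.0 + len(sizes) / np.sum(np.log(sizes / sizes.min()))
```

A more skewed size distribution yields a smaller exponent, matching the remark above.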
The details of the UCI benchmark datasets we use are shown in the following table:
type    dataset    size  dimension  clusters  α̂ (Eq. 23)
normal  wine        178     13         3        6.02
        satellite  6435     36         7        3.15
        statlog    2310     19         7
P. L.   yeast      1484      8        10        1.38
        vowel       349     10        11        1.33
        shape       160     17         9        1.49
        pendigits  7494     16        10        1.63
        pageblock  5473     11         5        1.49
        glass       214      9         6        1.94
We tune the threshold parameter λ experimentally to obtain better performance, and the discount parameter is left at its default value. Table II gives the detailed outcomes of the UCI benchmark experiments.
dataset  criterion  pypmeans  dpmeans  kmeans  V. L. 

wine  NMI  0.8126  0.7815  0.8349  0.4288 
ACC  82.04  80.12  94.94  62.92  
C.N.  2.90  2.82  3  3  
satellite  NMI  0.5953  0.5683  0.6125  0.3122 
ACC  66.74  66.58  67.19  34.93  
C.N.  5.96  5.26  6  4  
statlog  NMI  0.6537  0.6570  0.6128  0.4823 
ACC  55.69  55.97  59.91  31.17  
C.N.  6.82  6.62  7  6  
yeast  NMI  0.2476  0.1768  0.2711  0.1063 
ACC  41.3329  36.2278  36.8706  33.6927  
C.N.  9  6.04  10  8  
vowel  NMI  0.4479  0.4125  0.4357  0.3938 
ACC  28.9914  27.5645  29.5645  31.2321  
C.N.  11.90  9.04  11  11  
shape  NMI  0.7405  0.7204  0.6593  0.4279 
ACC  64.00  61.36  63.05  32.50  
C.N.  15.90  14.90  9  9  
pendigits  NMI  0.6903  0.6622  0.6834  0.7024 
ACC  66.86  64.65  69.96  62.57  
C.N.  10.76  8.88  10  10  
pageblock  NMI  0.1817  0.1807  0.1484  0.2622 
ACC  67.70  66.90  44.23  73.91  
C.N.  6.02  5.32  5  5  
glass  NMI  0.3875  0.3784  0.3077  0.2865 
ACC  49.50  48.37  42.88  45.37  
C.N.  5.94  5.00  6  6 
Here C.N. denotes the cluster number each method has produced.
From the results, we can see that on the normal datasets our pypmeans performs better than, or at least as well as, dpmeans and k-means clustering in most cases. This is mainly because the cluster number is usually small for this kind of dataset, so the discount parameter plays little role in the process.
On the power-law UCI datasets, pypmeans achieves better results than dpmeans. The ability to automatically adjust the threshold plays a vital role here. Although our method loses on some datasets, the overall results still validate its value.
VI-D2 Communities and Crime dataset
We also conduct experiments on the Communities and Crime dataset [23].
This dataset combines socioeconomic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.
The dataset consists of nearly 100 attributes covering many aspects of the communities; clearly unrelated attributes are excluded. The class label is the total number of violent crimes per population. In this experiment, since the crime rate is a continuous variable ranging in [0, 1], we manually discretize the values into a certain number of intervals and obtain the corresponding labels.
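The discretization step can be sketched as below; equal-width binning of [0, 1] is our assumption about how the intervals were formed:

```python
import numpy as np

def discretize_rates(rates, n_bins=16):
    """Cut continuous crime rates in [0, 1] into n_bins equal-width
    intervals and return each value's interval index (0 .. n_bins-1)."""
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior bin edges
    return np.digitize(rates, interior, right=True)
```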
Figure 7 depicts the histogram of the counts in each interval for the 16-interval case.
The figure clearly shows that the data follows a power-law distribution.
To avoid the “curse of dimensionality”, we apply the feature selection technique of [27] in advance and select the 10 most correlated features.
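As a rough illustration of this preprocessing step, a correlation-based filter can be sketched as follows. Note that [27] proposes an information-theoretic criterion; the absolute Pearson correlation used here is a simpler stand-in of our own choosing:

```python
import numpy as np

def top_correlated_features(X, y, k=10):
    """Keep the k features with the largest absolute Pearson correlation
    to the label (a simple stand-in for the filter method of [27])."""
    corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                      for j in range(X.shape[1])])
    idx = np.argsort(-corrs)[:k]
    return X[:, idx], idx
```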
The parameter λ is also tuned to better match the true cluster labels. The tuned parameters and the related results are shown in Table III.
λ  criterion  pypmeans  dpmeans  kmeans  V. L.  

6.57  NMI  0.2291  0.2075  0.2284  0.0862 
ACC  11.10  12.14  9.92  10.34  
C. N.  53.8  34.9  51.0  13.0  
11.01  NMI  0.1794  0.1680  0.1721  0.0639 
ACC  16.52  22.23  15.27  18.66  
C. N.  23.5  10.6  21.0  11.0  
12.51  NMI  0.1800  0.1715  0.1746  0.0896 
ACC  24.30  27.17  22.60  34.32  
C. N.  11.7  7.7  11.0  9.0  
17.11  NMI  0.1737  0.1536  0.1706  0.1119 
ACC  40.06  48.13  35.33  59.46  
C. N.  5.7  3.6  6.0  6.0 
From the results, we can see that pypmeans achieves better performance in both the NMI score and the cluster number prediction.
VII Conclusion
A novel modified Pitman-Yor Process based method is proposed here to address the power-law data clustering problem. With the discount parameter of the PY process slightly adjusted, power-law data can be well depicted. We also introduce a center agglomeration procedure, leading to an adaptive way of determining the number of clusters. Further, we extend our work to the spectral clustering case to address more sophisticated situations.
Some other issues are also discussed, including the convergence and complexity analysis, as well as practical matters such as a reliable data clustering order. All of these strengthen the soundness and real-world applicability of the method.
References
 [1] S. Lloyd, “Least squares quantization in pcm,” Information Theory, IEEE Transactions on, vol. 28, no. 2, pp. 129–137, 1982.
 [2] J. Hartigan and M. Wong, “Algorithm as 136: A kmeans clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
 [3] J. Bezdek, “A convergence theorem for the fuzzy isodata clustering algorithms,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 1, pp. 1–8, 1980.
 [4] C. Bishop, Pattern Recognition and Machine Learning. Springer New York, 2006, vol. 4.
 [5] M. Figueiredo and A. Jain, “Unsupervised learning of finite mixture models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 3, pp. 381–396, 2002.
 [6] J. Shi and J. Malik, “Normalized cuts and image segmentation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 22, no. 8, pp. 888–905, 2000.
 [7] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
 [8] Y. Cheng, “Mean shift, mode seeking, and clustering,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 17, no. 8, pp. 790–799, 1995.
 [9] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 5, pp. 603–619, 2002.
 [10] H. Bischof, A. Leonardis, and A. Selb, “Mdl principle for robust vector quantisation,” Pattern Analysis & Applications, vol. 2, no. 1, pp. 59–72, 1999.
 [11] C. Fraley and A. Raftery, “How many clusters? which clustering method? answers via modelbased cluster analysis,” The computer journal, vol. 41, no. 8, pp. 578–588, 1998.
 [12] G. Hamerly and C. Elkan, “Learning the k in kmeans,” in In Neural Information Processing Systems. MIT Press, 2003, p. 2003.
 [13] C. Sugar and G. James, “Finding the number of clusters in a dataset,” Journal of the American Statistical Association, vol. 98, no. 463, pp. 750–763, 2003.
 [14] R. Nock and F. Nielsen, “On weighting clustering,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, no. 8, pp. 1223–1235, 2006.
 [15] S. Roweis and Z. Ghahramani, “A unifying review of linear gaussian models,” Neural computation, vol. 11, no. 2, pp. 305–345, 1999.
 [16] M. Jordan and B. Kulis, “Revisiting kmeans: New algorithms via bayesian nonparametrics.”
 [17] B. Kulis and M. Jordan, “Revisiting kmeans: New algorithms via bayesian nonparametrics,” Arxiv preprint arXiv:1111.0352, 2011.
 [18] J. Pitman and M. Yor, “The twoparameter poissondirichlet distribution derived from a stable subordinator,” The Annals of Probability, vol. 25, no. 2, pp. 855–900, 1997.
 [19] H. Ishwaran and L. James, “Gibbs sampling methods for stickbreaking priors,” Journal of the American Statistical Association, vol. 96, no. 453, pp. 161–173, 2001.
 [20] A. Clauset, C. R. Shalizi, and M. E. J. Newman, “Powerlaw distributions in empirical data,” SIAM Rev., vol. 51, no. 4, pp. 661–703, Nov. 2009. [Online]. Available: http://dx.doi.org/10.1137/070710111
 [21] S. Selim and M. Ismail, “Kmeanstype algorithms: a generalized convergence theorem and characterization of local optimality,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 1, pp. 81–87, 1984.
 [22] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml
 [23] M. Redmond and A. Baveja, “A datadriven software tool for enabling cooperative information sharing among police departments,” European Journal of Operational Research, vol. 141, no. 3, pp. 660 – 678, 2002.
 [24] D. Blei and M. Jordan, “Variational inference for dirichlet process mixtures,” Bayesian Analysis, vol. 1, no. 1, pp. 121–144, 2006.
 [25] C. Papadimitriou and K. Steiglitz, Combinatorial optimization: algorithms and complexity. Dover Pubns, 1998.
 [26] J.-S. Zhang and Y.-W. Leung, “Improved possibilistic c-means clustering algorithms,” Fuzzy Systems, IEEE Transactions on, vol. 12, no. 2, pp. 209–217, Apr. 2004.
 [27] G. Brown, A. Pocock, M.J. Zhao, and M. Luján, “Conditional likelihood maximisation: A unifying framework for information theoretic feature selection,” J. Mach. Learn. Res., vol. 13, pp. 27–66, Mar. 2012.