Graph-based Learning
with Unbalanced Clusters
Abstract
Graph construction is a crucial step in spectral clustering (SC) and graph-based semi-supervised learning (SSL). Spectral methods applied to standard graphs such as full-RBF graphs, ε-graphs and k-NN graphs can perform poorly in the presence of proximal and unbalanced data. This is because spectral methods based on minimizing RatioCut or normalized cut on these graphs tend to put more importance on balancing cluster sizes than on reducing cut values. We propose a novel graph construction technique and show that the RatioCut solution on this new graph is able to handle proximal and unbalanced data. Our method is based on adaptively modulating the node degrees in a k-NN graph, which tends to sparsify neighborhoods in low-density regions. Our method adapts to data with varying levels of unbalancedness and can naturally be used for small cluster detection. We justify our ideas through limit cut analysis. Unsupervised and semi-supervised experiments on synthetic and real data sets demonstrate the superiority of our method.
Keywords:
Adaptive graph sparsification, small cluster detection
1 Introduction and Motivation
Graph-based approaches are popular tools for unsupervised clustering and semi-supervised learning (SSL). In these approaches, a graph representing the data set is first constructed. Then a graph-based learning algorithm such as spectral clustering (SC) [4] or an SSL algorithm [5, 6] is applied to the graph. Of the two steps, graph construction has been identified as critical [5, 7, 2, 8, 9]. Effective graph construction strategies turn out to be even more important in the presence of unbalanced and proximal data. Unbalanced data arises routinely in many applications, including multi-mode (class) clustering and SSL tasks. The focus of this paper is on graph construction for spectral methods; we refer to [10] for model-based approaches.
Common graph construction methods include the ε-graph, the fully connected RBF-weighted (full-RBF) graph and the k-nearest neighbor (k-NN) graph. The ε-graph links two nodes x_i and x_j if ||x_i - x_j|| <= ε. The full-RBF graph links every pair of nodes with RBF weights w_ij = exp(-||x_i - x_j||^2 / (2σ^2)), which is in fact a soft threshold (σ serves a role similar to ε). The k-NN graph links x_i and x_j if x_j is among the k closest neighbors of x_i or vice versa. It is the most widely recommended method [7, 5] due to its relative robustness to outliers. In [8] the authors propose the b-matching graph. This method is intended to eliminate some of the spurious edges of the k-NN graph and lead to better performance [9].
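As a concrete reference, the three standard constructions described above can be sketched as follows. This is a minimal NumPy sketch with our own function names; the dense distance matrix is for illustration only, not an efficient implementation.

```python
import numpy as np

def pairwise_dists(X):
    # Euclidean distance matrix for the rows of X.
    d2 = np.sum(X**2, axis=1, keepdims=True)
    D2 = d2 + d2.T - 2 * X @ X.T
    return np.sqrt(np.maximum(D2, 0.0))

def knn_graph(X, k):
    # Symmetric k-NN adjacency: i ~ j if j is among i's k closest, or vice versa.
    D = pairwise_dists(X)
    n = len(X)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]  # skip the point itself (distance 0)
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)  # the "or vice versa" symmetrization

def eps_graph(X, eps):
    # epsilon-graph: hard threshold on pairwise distance.
    D = pairwise_dists(X)
    A = (D <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def full_rbf_graph(X, sigma):
    # Fully connected graph with RBF weights (a soft threshold on distance).
    D = pairwise_dists(X)
    W = np.exp(-D**2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    return W
```

Note how the RBF weight decays smoothly with distance while the ε-graph cuts off sharply, which is the sense in which σ "serves similarly" to ε.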
However, for unbalanced and proximal data clusters, SC and graph-based SSL algorithms can perform poorly on these conventional graphs. This poor performance is a result of minimizing the RatioCut objective on these graphs: for unbalanced and proximal data clusters, the RatioCut objective tends to put more importance on balancing cluster sizes than on reducing cut values, which sometimes leads to cuts that are not meaningful. In Section 2 we investigate the fundamental reasons that lead to poor results. We then outline a novel graph construction strategy, whereby the RatioCut objective on the new graph is able to handle varying levels of proximal and unbalanced data. Our rank-modulated degree (RMD) graph construction method, described in detail in Section 3, is based on modulating the degrees of a k-NN graph. The impact of this strategy is that it asymptotically yields more edges per node in high-density regions and a sparsification near density valleys. We explore the theoretical basis for these results in Section 4. In Section 5 we present several experiments on synthetic and real datasets and show significant improvements in SC and SSL results over conventional graph constructions.
2 Proximal & Unbalanced Data Clusters
In this section we will investigate some of the reasons that lead to poor SC and SSL performance for conventional graph constructions in the presence of proximal and unbalanced data. We draw upon existing results to justify our reasoning.
Let G = (V, E) be the graph constructed from n data samples x_1, ..., x_n drawn i.i.d. from some underlying density f, where x_i ∈ R^d. Let (S_1, S_2) be a 2-partition of the nodes separated by a hypersurface S. The simple cut is defined as:

Cut(S_1, S_2) = Σ_{x_i ∈ S_1, x_j ∈ S_2} w_ij,    (1)

where w_ij is the weight of edge (x_i, x_j). Spectral clustering techniques are based on minimizing RatioCut:

RatioCut(S_1, S_2) = Cut(S_1, S_2) (1/|S_1| + 1/|S_2|),    (2)

where |S_i| denotes the number of nodes in S_i. A variant of RatioCut is the so-called normalized cut (NCut). Our discussion of RatioCut also extends to NCut, so we will not discuss NCut from here on. Note that RatioCut augments the simple Cut with a balancing term, which desensitizes partitions to outliers.
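The two objectives of Eqs. (1) and (2) can be computed directly from a weight matrix. The minimal sketch below (our own implementation, not from the paper) also illustrates the balancing term at work: on a path graph both candidate cuts sever a single unit edge, yet RatioCut prefers the balanced split.

```python
import numpy as np

def cut_value(W, S1):
    # Plain cut (Eq. (1)): total weight of edges crossing the partition.
    mask = np.zeros(len(W), dtype=bool)
    mask[S1] = True
    return W[mask][:, ~mask].sum()

def ratio_cut(W, S1):
    # Eq. (2): Cut(S1, S2) * (1/|S1| + 1/|S2|).
    # The balancing term penalizes partitions with a very small side.
    n = len(W)
    s1 = len(S1)
    return cut_value(W, S1) * (1.0 / s1 + 1.0 / (n - s1))
```

For a 4-node path 0-1-2-3 with unit weights, Cut({0}) = Cut({0,1}) = 1, but RatioCut({0}) = 4/3 while RatioCut({0,1}) = 1, so the balanced cut wins even though both cuts are equally cheap — exactly the pressure toward balance discussed in the text.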
Unbalanced Proximal Gaussian Mixture: By means of an example, we will argue that minimizing RatioCut on conventional graphs has fundamental drawbacks for clustering proximal and unbalanced datasets.
For our illustrative experiment we consider n data samples drawn i.i.d. from a proximal and unbalanced 2-D Gaussian mixture density,

f(x) = α_1 N(x; μ_1, Σ_1) + α_2 N(x; μ_2, Σ_2),    (3)

where α_1 = 0.9, α_2 = 0.1, μ_1 = [4.5; 0], μ_2 = [0; 0], as shown in Fig. 1. We examine different graph constructions including full-RBF, (RBF) k-NN and the ε-graph. Note that these graph constructions are parameterized by σ, k and ε respectively. Our SC results here are depicted for reasonable choices of these parameters.
A balanced cut in this case is approximately a vertical line halfway between the two component centers; a cut at the density valley is approximately a vertical line through the valley separating the components. For SC to seek a cut at the valley, the RatioCut must achieve its minimum there.

The rescaled simple Cut curve in (c),(e) shows that the Cut value is relatively large at the valley, due to the fact that the density valley is "shallow." Fig. 1(c) shows RatioCut values for RBF k-NN for large and small k. Large k (unweighted k-NN behaves similarly) achieves its minimum at the balanced position, while small k pulls down the RatioCut near the boundaries and turns out to be vulnerable to outliers. (e) shows that full-RBF (the ε-graph behaves similarly) with large σ tends to smooth out the curve and is insensitive to the location of the valley, while small σ appears to be vulnerable to outliers. In contrast our method, RMD, is able to reject outliers and achieves its minimum RatioCut close to the valley position.
Graph Partitioning, Cut Values, and Cluster Sizes: By varying the mixture proportions in Eq. (3) we can vary the sizes of the unbalanced clusters; varying the means has the effect of varying the proximity of the clusters. For a given density, let s_v be the locus of points corresponding to the density valley, and s_b any hypersurface that asymptotically results in two balanced partitions (in Fig. 1 these are vertical lines). Now for a graph G, the surfaces s_v and s_b describe two different partitions: one unbalanced but respecting the inherent clustering of the data, and the other balanced but not respecting the underlying data clusters. We denote by (S_1^v, S_2^v) the partition resulting from a cut associated with s_v and by (S_1^b, S_2^b) the partition resulting from a cut associated with s_b (data samples situated exactly on s_v or s_b are randomly assigned). The Cut-ratio q is defined as the ratio of the Cut values corresponding to the two partitions, and β denotes the fractional size of the smaller, unbalanced partition, namely,

q = Cut(S_1^v, S_2^v) / Cut(S_1^b, S_2^b),    β = min(|S_1^v|, |S_2^v|) / n.    (4)
Now we examine the condition under which the natural unbalanced partition has a smaller RatioCut value than the balanced partition. This requires that

Cut(S_1^v, S_2^v) (1/(βn) + 1/((1-β)n)) <= Cut(S_1^b, S_2^b) (1/(n/2) + 1/(n/2)),  i.e.,  q <= 4β(1-β),    (5)

where we have substituted for q and β from Eq. (4). A plot of the Cut-ratio for different unbalanced proportions is shown in Fig. 2.

Consequently, Fig. 2 and Eq. (5) point to a fundamental aspect of RatioCut for datasets with unbalanced and proximal clusters. If the tuple (β, q) lies above the curve, the RatioCut value is smaller for balanced partitions than for partitioning at the density valley (note that the required Cut-ratio can be as small as 4β(1-β), which vanishes as β → 0).
Why do conventional graphs fail? This is best explained by the limit-cut analysis results for k-NN, ε-graph and full-RBF graphs [2, 11]. For appropriately chosen parameters k, ε and σ respectively, as the number of samples n → ∞, the Cut-ratio q_n and the unbalanced cluster size β_n converge (with high probability) to

q_n → (∫_{s_v} f^ρ(s) ds) / (∫_{s_b} f^ρ(s) ds),    β_n → min(μ(C_1), μ(C_2)),    (6)

where μ(C_1), μ(C_2) are the volumes (probabilities) of the sets C_1 and C_2 under the density f, and ρ is a constant that depends on the specific graph construction. While standard graph construction methods do account for the underlying density f, this by itself is insufficient for proximal and unbalanced clusters. For the Gaussian mixture case (Eq. (3)) it follows from Eq. (6) that q can be relatively large for an appropriate choice of the component parameters and a fixed level of unbalancedness β. Note that β is predominantly controlled through the mixture proportions α_i. Eq. (5) and Fig. 2 assert that in this case RatioCut has a smaller value for balanced partitions even when the density valley cut s_v is the natural choice.
Parameter tuning: It is possible that the parameters k, ε and σ can be tuned to account for unbalancedness. However, large values of k, ε and σ tend to smooth the underlying distribution (see Fig. 1) and increase the Cut-ratio, which worsens the problem. In contrast, decreasing k, ε and σ below well-understood acceptable thresholds (see [2, 11]) leads to disconnected graphs and sensitivity to outliers (this is also seen in Fig. 1). While changing parameters can globally modify the graph topology, this provides poor control over the Cut-ratio. For instance, increasing (decreasing) k results in a k-NN graph with a uniformly larger (smaller) number of neighbors for all nodes and uniformly larger (smaller) Cut values for any cut, leading to poor control of the Cut-ratio.
Controlling Cut-Ratio through Graph Sparsification: From the above discussion it is clear that we need to directly control the Cut-ratio. We do so by adaptively sparsifying graph neighborhoods. Neighborhoods of nodes in plausible low-density regions are sparsified and those in high-density regions are "densified." By controlling this sparsification/densification, the Cut-ratio is controlled and adapted to varying degrees of unbalancedness and proximity. Comparisons between standard constructions and our RMD graph for the Gaussian mixture of Eq. (3) are shown in Fig. 1. As seen there, our method sparsifies low-density regions, in contrast to the other methods.
3 RMD Graphs: Main Steps
Given n data samples x_1, ..., x_n in R^d, our rank-modulated degree (RMD) graph-based learning involves the following steps:
(1) Rank Computation: The rank of every point x is calculated:

R(x) = (1/n) Σ_{i=1}^{n} 1{G(x_i) >= G(x)},    (7)

where 1{·} denotes the indicator function and G(·) is a statistic that orders points by how central they are, with low-density points receiving low rank. Ideally we would like G to reflect the underlying density f of the data. Since f is unknown, we need to employ some surrogate statistic. While many choices are possible, the statistic in this paper is based on nearest-neighbor distances. Such rank-based statistics have been employed for high-dimensional anomaly detection [18, 19]. More choices for G and a robust procedure for computing R are described in Sec. 3.1. The rank R(x) is a normalized ordering of all points based on G, ranges in [0, 1], and indicates how extreme the sample point x is among all the points.
(2) RMD Graph Construction: Connect each point x to its deg(x) closest neighbors. The number of neighbors deg(x) for point x is modulated as follows:

deg(x) = k (λ + 2(1-λ) R(x)),    (8)

where λ ∈ [0, 1] is a scalar parameter that will be optimized later. Here k is the average degree and λ controls the minimum degree λk. It is not difficult to see that R(x) converges (in distribution) to a uniform measure on the unit interval regardless of the underlying density if G is bijective. This implies that the expected value E[R(x)] converges to 0.5. Consequently, the average degree across all samples is k. Furthermore, the above modulation scheme can be thought of as modulating the degree of each node around a nominal value equal to k. The remaining issue is to optimize over the scalar parameter λ, which is described in Step (4).
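Under the stated properties (uniform ranks give average degree k, and the minimum degree is λk at R(x) = 0), the modulation rule of Eq. (8) can be sketched as follows. This is our own minimal helper; rounding to integer degrees is an implementation choice not prescribed by the text.

```python
import numpy as np

def rmd_degrees(ranks, k, lam):
    # deg(x) = k * (lam + 2 * (1 - lam) * R(x))   -- sketch of Eq. (8).
    # With R uniform on [0, 1]: E[deg] = k (the nominal average degree);
    # the minimum degree lam * k is attained at R(x) = 0 (low density);
    # the maximum degree (2 - lam) * k is attained at R(x) = 1.
    deg = k * (lam + 2.0 * (1.0 - lam) * np.asarray(ranks))
    return np.maximum(np.round(deg).astype(int), 1)  # at least one neighbor
```

Note that λ = 1 removes the modulation entirely (every node gets degree k, i.e., an ordinary k-NN graph), while small λ stretches the gap between low- and high-density nodes.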
(3) Graph-based Learning: The third step involves using the RMD graph in a graph-based clustering or SSL algorithm. Spectral clustering algorithms based on RatioCut for 2-class and multi-class clustering are now well established. For SSL we employ Gaussian Random Fields (GRF) and Graph Transduction via Alternating Minimization (GTAM). These approaches all involve minimizing f^T L f plus some constraints or penalties, where f is the cluster indicator or classification (labeling) function and L is the graph Laplacian matrix. This has been shown to be equivalent to minimizing RatioCut (NCut) for the unnormalized (normalized) Laplacian [13]. We refer readers to [7, 5, 6] for details.
(4) Optimization over λ: Our final step is to optimize over λ. Our main assumption is that we have prior knowledge that the smallest cluster is at least of size δn. We consider the 2-cluster case first. The 2-partitions resulting from spectral clustering algorithms are now parameterized by λ: (S_1(λ), S_2(λ)). We then minimize the Cut value over all admissible λ such that the smaller cluster is no smaller than the threshold:

min_λ Cut(S_1(λ), S_2(λ))   s.t.   min(|S_1(λ)|, |S_2(λ)|) >= δn.    (9)

δ sets the threshold on the minimum cluster size; clusters smaller than δn are viewed as outliers and discarded. Algorithms for K-partition clustering and SSL can be extended in a similar manner by optimizing suitable objective functions in place of the 2-partition cut value. Note that a similar optimization step can also be applied to select the best k and σ for traditional graph constructions as well. We will employ this strategy for the purpose of comparison on real data sets in Sec. 5.2.
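The selection step of Eq. (9) can be sketched abstractly: given candidate partitions indexed by λ (each produced by running spectral clustering on the corresponding graph), keep the admissible partition with the smallest plain cut. The interface below is our own (`candidates` and `W_of` are hypothetical names), not the paper's implementation.

```python
import numpy as np

def plain_cut(W, labels):
    # Total weight of edges crossing a 0/1-labeled partition.
    m = np.asarray(labels).astype(bool)
    return W[m][:, ~m].sum()

def select_lambda(candidates, W_of, delta):
    # Sketch of Eq. (9): `candidates` maps each lambda to a 0/1 label
    # vector; `W_of(lam)` returns the graph used for that lambda.
    # Partitions whose smaller side holds fewer than delta * n points
    # are treated as outlier cuts and skipped.
    best_lam, best_cut = None, np.inf
    for lam, labels in candidates.items():
        n = len(labels)
        small = min(np.sum(labels), n - np.sum(labels))
        if small < delta * n:
            continue  # violates the minimum-cluster-size constraint
        c = plain_cut(W_of(lam), labels)
        if c < best_cut:
            best_lam, best_cut = lam, c
    return best_lam, best_cut
```

The role of δ is visible directly: with a large δ a cheap but tiny (outlier-like) cut is rejected in favor of a sizable one, while a small δ lets a genuinely small cluster through.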
3.1 Rank Computation
The missing component in our RMD method is the specification of the statistic G. We choose the statistic in Eq. (7) based on nearest-neighbor distances. Specifically,

G(x) = (1/(k - l + 1)) Σ_{j=l}^{k} D^(j)(x),    (10)

where D^(j)(x) denotes the distance from x to its j-th nearest neighbor, so that G(x) is the average of x's l-th through k-th nearest-neighbor distances. Other choices for G are listed below.
(1) ε-Neighborhood: G(x) is the number of neighbors within an ε-ball of x.
(2) k-Nearest Neighborhood: G(x) is the distance from x to its k-th nearest neighbor.
Empirically (and theoretically) we have observed that the averaged nearest-neighbor distance of Eq. (10) leads to better performance and robustness. To reduce variance during rank computation we adopt a U-statistic resampling technique [14] with B resamplings.
Ustatistic Resampling For Rank Computation:
Given data points,
(a) Randomly split the data into two equal parts D_1 and D_2.
(b) Points in D_1 are used to calculate G(x) for x ∈ D_2 according to Eq. (10), and vice versa.
(c) Ranks of points in D_2 are computed by Eq. (7) within D_2, and similarly for D_1.
(d) Re-split the data and repeat the above steps B times. Let R_j(x) be the rank of x obtained from the j-th resampling. We then use the average as the final rank:

R(x) = (1/B) Σ_{j=1}^{B} R_j(x).    (11)
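The resampling procedure above can be sketched as follows. This is a minimal sketch with our own function names; the statistic indices l, k and the number of resamplings B are illustrative defaults, and ranks are oriented so that low-density points receive low rank (and hence, via Eq. (8), low degree).

```python
import numpy as np

def nn_statistic(x, ref, l, k):
    # G(x): average of the l-th through k-th nearest-neighbor distances
    # of x among the reference points (a distance-based density surrogate).
    d = np.sort(np.linalg.norm(ref - x, axis=1))
    return d[l - 1:k].mean()

def ustat_ranks(X, l=2, k=4, B=4, seed=0):
    # U-statistic resampling: split the data into two halves, compute G
    # for each half against the other half, rank within each half, and
    # average the per-split ranks over B random splits.
    rng = np.random.default_rng(seed)
    n = len(X)
    ranks = np.zeros(n)
    for _ in range(B):
        perm = rng.permutation(n)
        half1, half2 = perm[:n // 2], perm[n // 2:]
        for idx, ref_idx in ((half1, half2), (half2, half1)):
            G = np.array([nn_statistic(X[i], X[ref_idx], l, k) for i in idx])
            # rank = fraction of same-half points at least as spread out,
            # so outliers (large G) end up near rank 0
            ranks[idx] += np.array([(G >= g).mean() for g in G])
    return ranks / B
```

On a dense blob plus one distant outlier, the outlier's rank lands near 0 (marking it for the minimum degree λk), while points inside the blob spread over the rest of [0, 1].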
Properties of the Ranked Data:
(1) High/Low Density Indicator: The value of R(x) is a direct indicator of whether x lies in a high- or low-density region (Fig. 3).
(2) Smoothness: Asymptotically, R(x) is an integral of the pdf (see Thm. 4.1 in Sec. 4). It is smooth and uniformly distributed in [0, 1]. This makes it appropriate for modulating the degrees with control of the minimum, maximum and average degree.
(3) Precision: We do not need the rank estimates to be precise for every point; the resulting cuts typically depend on the relative ordering of the ranks of nearby points rather than their exact values.
3.2 Salient Properties of RMD Graphs
Our scheme addresses the following issues:
(1) Captures density valleys: The monotonicity of deg(x) in R(x) immediately implies that nodes in low-/high-density areas have fewer/more edges, thus reducing the Cut-ratio of Fig. 2 and ensuring that the RatioCut has low values at density valleys.
(2) Robustness against outliers: The minimum degree of nodes in the RMD graph is λk, even for distant outliers. Furthermore, λ is chosen by the optimization step (see Eq. (9)), and so is robust to outliers, as shown in Fig. 1(c), where the RatioCut curve of the RMD graph (black) goes up near the boundaries, guaranteeing that the valley minimum is the global minimum.
(3) Adapts to unbalanced clusters: The optimization problem of Eq. (9) leads to sizable clusters that can be unbalanced. The reason is that small values of λ emphasize the Cut value over the balancing term. This has the effect of preferring smaller Cut values with possibly unbalanced partitions over balanced partitions with larger Cut values. This effect is magnified because smaller λ leads to sparser connections in low-density areas. Since the balancing term is not impacted, varying λ from 1 to 0 moves the partition from the relatively balanced position toward the density valley (see also Thm. 4.2 in Sec. 4). Practically, λ provides the flexibility to optimize the trade-off between the simple Cut and the cluster size. The cluster-size threshold δ in the optimization step (Eq. (9)) constrains clusters to not be too small, thus avoiding outliers. We can also iterate over δ to find possibly different valley cuts of different sizes. This procedure can sometimes be used for size-constrained clustering [15]. We will demonstrate some of these ideas in Sec. 5.3.
4 Analysis
The proofs of the theorems appear in the Appendix. Assume the data set is drawn i.i.d. from a density f in R^d, where f has compact support C. Let G = (V, E) be the RMD graph. Given a separating hyperplane S, denote by C_1, C_2 the two subsets of C split by S, and by c_d the volume of the unit ball in R^d.

First we show the asymptotic consistency of the rank R(x) of a point x. The limit of R(x), denoted p(x), is the complement of the (probability) volume of the level set of f containing x. Note that p(x) exactly follows the shape of f, and always ranges in [0, 1] no matter how f scales.
Theorem 4.1
Assume the density f satisfies some regularity conditions. For a proper choice of the parameters of G, as n → ∞, we have, in probability,

R(x) → p(x) = ∫_{{y : f(y) <= f(x)}} f(y) dy.    (12)
Next we study the RatioCut induced on the unweighted RMD graph (the NCut case is similar). The limit cut expression on the RMD graph involves an additional adjustable term which varies with the density. This implies that Cut values in high-density areas can be significantly more expensive than in low-density areas. Notice that this effect becomes stronger as λ varies from 1 to 0, which means the minimum will be attained at even lower-density areas. For technical simplicity, we assume the RMD graph ideally connects each point x to its deg(x) closest neighbors.
Theorem 4.2
Compared to the limit expression for the k-NN graph ([2]), there is an additional density-dependent term here. To see its impact, suppose λ is small. For s near modes, p(s) ≈ 1 and this extra term is nearly 2 - λ ≈ 2; for s passing through valleys, p(s) ≈ 0 and this term is nearly λ ≈ 0. So graph-cut values near modes are penalized more than those near valleys.
5 Simulations
Many of the examples in this section focus on unbalanced datasets. Unbalanced data is obtained by sampling the data set in an unbalanced way. Some general simulation parameters are:
(1) In the U-statistic rank calculation (Sec. 3.1), we fix the number of resamplings B.
(2) All error-rate results are averaged over 20 trials.
Other parameters will be specified below.
5.1 Multi-Cluster, Complex-Shaped Data
Consider a data set composed of one small Gaussian cluster and two moon-shaped proximal clusters, shown in Fig. 4, with the sample split among the rightmost small cluster and the two moons. In this example, for the purpose of illustration, we did not optimize λ or any of the other parameters. We fix λ, and choose k, ε and σ based on the average k-NN distance. On k-NN and b-matching graphs SC fails for two reasons: (1) SC cuts at balanced positions and cannot detect the rightmost small cluster; (2) SC cannot recognize the long winding low-density region between the two moons because there are too many spurious edges and the Cut value along the curve is large. SC fails on the ε-graph (similarly on full-RBF) because the outlier point forms a singleton cluster, and it also cannot recognize the low-density curve. The RMD graph significantly sparsifies the graph in low-density regions, enabling SC to cut along the winding valley, detect the small cluster, and remain robust to outliers. Naturally, these results depend on the choices of k, ε and σ. However, our choices represent the best-case scenarios for these methods and we did not see any significant improvement by varying these parameters.
5.2 Real DataSets
We focus on unbalanced settings and consider several real data sets. We construct k-NN, b-matching, full-RBF and RMD graphs, all combined with RBF weights, but do not include the ε-graph because of its overall poor performance. For fairness of comparison, we vary not only λ for RMD but also k and σ for all graphs, under the optimization step of Sec. 3. For example, the result for the RBF k-NN graph is chosen by optimizing:

min_{k,σ} Cut(S_1(k, σ), S_2(k, σ))   s.t.   min(|S_1(k, σ)|, |S_2(k, σ)|) >= δn,    (14)

where (S_1(k, σ), S_2(k, σ)) denotes the RatioCut partition obtained on the RBF k-NN graph with nearest-neighbor parameter k and RBF parameter σ. The optimization problem is non-convex but involves a search over a small number of parameters, which we discretized in our experiments. We varied k over a small grid. For the RBF parameter σ it has been suggested that it should be on the same scale as the average k-NN distance [6]; this suggested a discretization of σ around that scale. We discretized λ in uniform steps. Notice that for λ = 1, the RMD graph is identical to the k-NN graph. δ is set identically for all methods. We assume meaningful clusters contain at least a fixed fraction of the total number of points n. We set the GTAM parameter as in [9] for the SSL applications. For each SSL run, 20 randomly labeled samples are chosen, with at least one sample from each class.
Varying Unbalancedness: We start with a comparison on 8vs9 of the 256-dimensional USPS digit data set. We keep the total sample size at 750 and vary the unbalancedness, i.e., the proportion of points drawn from the two clusters. Fig. 5 shows that as the unbalancedness increases, performance degrades severely on traditional graphs, while our method adapts the graph-based learning algorithms to different levels of unbalancedness very well.
Table 1: Error rates (%) for spectral clustering.

| Error Rates (%) | USPS 8vs9 | USPS 1,8,3,9 | SatImg 4vs3 | SatImg 3,4,5 | SatImg 1,4,7 | OptDigit 9vs8 | OptDigit 6vs8 | OptDigit 1,4,8,9 | LetterRec 6vs7 | LetterRec 6,7,8 |
| RBF k-NN        | 16.67 | 13.21 | 12.80 | 18.94 | 25.33 | 9.67  | 10.76 | 26.76 | 4.89 | 37.72 |
| RBF b-matching  | 17.33 | 12.75 | 12.73 | 18.86 | 25.67 | 10.11 | 11.44 | 28.53 | 5.13 | 38.33 |
| full-RBF        | 19.87 | 16.56 | 18.59 | 21.33 | 34.69 | 11.61 | 15.47 | 36.22 | 7.45 | 35.98 |
| RBF RMD         | 4.80  | 9.18  | 7.87  | 15.26 | 19.72 | 5.43  | 6.67  | 21.35 | 2.92 | 28.68 |
Table 2: Error rates (%) for SSL (GRF and GTAM).

|      | Error Rates (%) | USPS 8vs6 | USPS 1,8,3,9 | SatImg 4vs3 | SatImg 1,4,7 | OptDigit 6vs8 | OptDigit 8vs9 | OptDigit 6,1,8 | LetterRec 6vs7 | LetterRec 6,7,8 |
| GRF  | RBF k-NN        | 5.70  | 13.29 | 14.64 | 16.68 | 5.68  | 7.57 | 7.53  | 7.67  | 28.33 |
|      | RBF b-matching  | 6.02  | 13.06 | 13.89 | 16.22 | 5.95  | 7.85 | 7.92  | 7.82  | 29.21 |
|      | full-RBF        | 15.41 | 12.37 | 14.22 | 17.58 | 5.62  | 9.28 | 7.74  | 11.52 | 28.91 |
|      | RBF RMD         | 1.08  | 10.24 | 9.74  | 15.04 | 2.07  | 2.30 | 5.82  | 5.23  | 27.24 |
| GTAM | RBF k-NN        | 4.11  | 10.88 | 26.63 | 20.68 | 11.76 | 5.74 | 12.68 | 19.45 | 27.66 |
|      | RBF b-matching  | 3.96  | 10.83 | 27.03 | 20.83 | 12.48 | 5.65 | 12.28 | 18.85 | 28.01 |
|      | full-RBF        | 16.98 | 11.28 | 18.82 | 21.16 | 13.59 | 7.73 | 13.09 | 18.66 | 30.28 |
|      | RBF RMD         | 1.22  | 9.13  | 18.68 | 19.24 | 5.81  | 3.12 | 10.73 | 15.67 | 25.19 |
Other Real Data Sets: We apply SC and SSL algorithms to several other real data sets, including USPS, the waveform database generator (21-dim), Statlog landsat satellite images (36-dim), letter recognition images (16-dim) and optical recognition of handwritten digits (64-dim) [16]. We fix 150/600, 200/400/600 and 200/300/400/500 samples for the 2-, 3- and 4-class cases, with the corresponding orders of class indices listed in Tables 1 and 2. The tables show that even when k and σ for the RBF k-NN (b-matching) and full-RBF graphs are optimized for best performance, the RMD graph still consistently outperforms the other methods.
5.3 Applications to Small Cluster Detection
We illustrate how our method can be used to find small clusters. This type of problem arises in community detection in large real networks, where graph-based approaches are popular but small-community detection is difficult [17].
Our synthetic dataset, depicted in Fig. 6, has one large and two small proximal Gaussian components along the x-axis: f(x) = Σ_{i=1}^{3} α_i N(x; μ_i, Σ_i), where μ_1 = [0.7; 0], μ_2 = [4.5; 0], μ_3 = [9.7; 0].
Fig. 7(a) shows a plot of Cut values at different cut positions, averaged over 20 Monte Carlo runs. We note that the cut-value plot resembles the underlying density. Both density valleys are at unbalanced positions. The rightmost cluster is smaller than the left cluster, but has a deeper valley.

To apply our method we vary the cluster-size threshold δ in Eq. (9) and plot the optimal Cut value against δ, as shown in Fig. 7(b). For large δ, since the proportion of data samples in the smaller clusters is less than 30%, the optimal cut is bounded away from both valleys. As δ is decreased, the optimal cut is attained at the left valley. An interesting phenomenon is that the curve flattens out over a range of δ, corresponding to the fact that the cut value is minimized at the same position for every δ in that range. This flattening can happen only at valleys, since valleys represent a "local" minimum for the optimization step of Eq. (9) under the constraint imposed by δ. Consequently, small clusters can be detected from the flat spots. When δ is decreased further, the best cut is attained near the deeper right valley, and again the curve flattens out, revealing another small cluster.
5.4 Comments on RMD Method
Tuning Parameters: We first describe the parameters involved in our RMD method. We have already pointed out that λ is optimized over, and so does not count as a tuning parameter. This leaves the parameters δ and the neighborhood size used in the statistic G. As we pointed out in Sec. 3, the choice of δ is based on our prior knowledge or desire to find clusters that are sizable, say 5% to 10% of the data. This leaves a single tuning parameter, namely the neighborhood size in G, to which our method appears to be relatively insensitive. Note that unlike k and σ, which are used for graph construction, this parameter is primarily used to relatively order data points by whether they belong to high-density or low-density regions. In most situations we have encountered, this ranking does not substantially change; it is rarely the case that an empirically low-ranked data point should have a high rank (i.e., lie in a high-density region). Similar results have been observed in the context of high-dimensional anomaly detection [18, 19].
Time Complexity: The overall cost is dominated by the U-statistic rank computation and the RMD graph construction. For our choice of the number of resamplings B, the aggregate complexity is on the same order as constructing a k-NN graph.
6 Conclusions
We have demonstrated that spectral clustering and graph-based semi-supervised learning algorithms can fail on conventional graph constructions for unbalanced and proximal data clusters. We proposed a systematic procedure for graph construction (the RMD graph), based on adaptive sparsification and densification of the neighborhoods of k-NN graphs. Our method effectively incorporates density, maintains robustness to outliers, and adapts to different degrees of unbalancedness. We presented an optimization framework for graph-based approaches, which selects the sizable clusters separated by the smallest cut value. By constraining the smallest cluster size we can detect multiple small clusters and generate different meaningful cuts. Our simulations demonstrate significant performance improvements over existing methods on synthetic and real datasets. The ability to detect small clusters (Fig. 7) indicates that our idea may be useful in other applications such as community detection in large real networks, where graph-based approaches are popular but small-community detection is difficult [17].
References
 [1] Hagen, L., Kahng, A.: New Spectral Methods for Ratio Cut Partitioning and Clustering. In: IEEE Trans. on ComputerAided Design (1992)
 [2] M. Maier, U. von Luxburg, M. Hein: Influence of Graph Construction on Graph-based Clustering. In: NIPS (2008)
 [3] M. Maier, U. von Luxburg, M. Hein: Supplementary material to: Influence of Graph Construction on Graph-based Clustering. (2008)
 [4] J. Shi and J. Malik: Normalized Cuts and Image Segmentation. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 888–905. Vol. 22, No. 8 (2000)
 [5] Zhu, X.: SemiSupervised Learning Literature Survey. (2008)
 [6] Wang, J., Jebara, T., Chang, S.F.: Graph Transduction via Alternating Minimization. In: ICML (2008)
 [7] Luxburg, U. von: A tutorial on spectral clustering. In: Statistics and Computing, vol. 17, no. 4, pp. 395–416 (2007)
 [8] Jebara, T., Shchogolev, V.: BMatching for Spectral Clustering. In: ECML (2006)
 [9] Jebara, T., Wang, J., Chang, S.F.: Graph Construction and bMatching for SemiSupervised Learning. In: ICML (2009)
 [10] Fraley, C., Raftery, A.: ModelBased Clustering, Discriminant Analysis, and Density Estimation. In: Journal of the American Statistical Association. MIT Press (2002)
 [11] Narayanan, H., Belkin, M., Niyogi, P.: On the relation between low density separation, spectral clustering and graph cuts. In: NIPS (2006)
 [12] Huang, B., Jebara, T.: Loopy Belief Propagation for Bipartite Maximum Weight bMatching. In: AISTATS (2007)
 [13] Chung, F.: Spectral graph theory. In: American Mathematical Society. (1996)
 [14] Koroljuk, V., Borovskich, Y.: Theory of Ustatistics (Mathematics and Its Applications). Kluwer Academic Publishers Group. (1994)
 [15] Hoppner, F., Klawonn, F.: Clustering with Size Constraints. In: Computational Intelligence Paradigms. (2008)
 [16] Frank, A., Asuncion, A.: UCI Machine Learning Repository, http://archive.ics.uci.edu/ml
 [17] Shah, D., Zaman, T.: Community Detection in Networks: The LeaderFollower Algorithm. In: NIPS (2010)
 [18] Zhao, M., Saligrama, V.: Anomaly Detection with Score functions based on Nearest Neighbor Graphs, NIPS 2009
 [19] Saligrama, V., Zhao, M.: Local Anomaly Detection, AISTATS 2012
Appendix: Proofs of Theorems
For ease of development, we divide the data points into two parts D_1 and D_2 of equal size n. D_1 is used to generate the statistic G for points in D_2, and D_2 for points in D_1. The rank of a point u ∈ D_2 is then computed within D_2:

R(u) = (1/n) Σ_{x_i ∈ D_2} 1{G(x_i) >= G(u)}.    (15)

We provide the proof for a statistic of the following form:

G(u) = Σ_{j=l}^{k} ω_j D^(j)(u),    (16)

where D^(j)(u) denotes the distance from u to its j-th nearest neighbor among the points in D_1 and the ω_j are weights. Practically we can omit the weights, as in Eq. (10). The proofs for the first and second statistics can be found in [18].
Proof of Theorem 1:
Proof
The proof involves two steps:

The expectation of the empirical rank is shown to converge to p(u) as n → ∞.

The empirical rank is shown to concentrate around its expectation as n → ∞.
The first step is shown through Lemma 2. For the second step, notice that the rank R(u) = (1/n) Σ_i Y_i, where Y_i = 1{G(x_i) >= G(u)} is independent across different i's and bounded in [0, 1]. By Hoeffding's inequality, we have:

P(|R(u) - E[R(u)]| > ε) <= 2 exp(-2nε²).    (17)
Combining these two steps finishes the proof.
Proof of Theorem 2:
Proof
We only present a brief outline of the proof. We want to establish the convergence of the cut term and the balancing terms respectively, that is:

(18)  
(19) 

where the left-hand sides are the discrete (empirical) versions of the right-hand-side limits.
Eq. (18) is established in two steps. First we show that the LHS cut term converges to its expectation by making use of McDiarmid's inequality. Second we show that this expectation converges to the RHS of Eq. (18). This is the most intricate part and we state it as a separate result in Lemma 1.
Lemma 1
Given the assumptions of Theorem 2,
(20) 
where .
Proof
The proof is similar to [3] and we provide an outline here. The first trick is to define a cut function for a fixed point , whose expectation is easier to compute:
(21) 
Similarly, we can define for . The expectation of and can be related:
(22) 
Then the value of can be computed as,
(23) 
where D^(k) is the distance of the point to its k-th nearest neighbor. The value of D^(k) is a random variable and can be characterized by its CDF. Combining Eq. (22) we can write down the whole expected cut value
(24)  
(25) 
To simplify the expression, we use to denote
(26) 
Under general assumptions, as n tends to infinity, the random variable D^(k) concentrates sharply around its mean. Furthermore, as n → ∞, this mean tends to zero, with the speed of convergence given by:
(27) 
So the inner integral in the cut value can be approximated by , which implies,
(28) 
The next trick is to decompose the integral over into two orthogonal directions, i.e., the direction along the hyperplane and its normal direction (We use to denote the unit normal vector):
(29) 
When , the integral region of will be empty: . On the other hand, when is close to , we have the approximation :
(30)  
(31)  
(32) 
This term is the volume of a (d-1)-dimensional spherical cap at a given distance from the center. Through direct computation we obtain:
(33) 
Combining the above steps and plugging in the approximation of Eq. (27), we finish the proof.
Lemma 2
By choosing the parameters of G properly, as n → ∞, it follows that,
Proof
Take expectation with respect to :
(34)  
(35)  
(36) 
The last equality holds due to the i.i.d. symmetry of the samples. We fix both quantities and temporarily discard the remaining randomness. It follows:
(37) 
To check McDiarmid’s requirements, we replace with . It is easily verified that ,
(38) 
where the constant involves the diameter of the support. Notice that despite the fact that the arguments are random vectors, we can still apply McDiarmid's inequality, because by the form of G, it is a function of i.i.d. random variables (the distances from the fixed point to the samples). Therefore, by McDiarmid's inequality, we have:
(39) 
Rewrite the above inequality as:
(40) 
It can be shown that the same inequality holds for , or . Now we take expectation with respect to :
(41) 
Divide the support into two parts: one containing points whose density is relatively far from the boundary value, and one containing points whose density is close to it. We show that for the first part the above exponential term converges to 0, while the second part has very small measure. By Lemma 3 we have:
(42) 
where O(·) denotes the big-O notation. Applying a union bound we have:
(43) 
Now consider points in the first part. It can be verified that the required conditions hold, and for the exponential term in Eq. (40) we have:
(44) 
For the remaining part, by the regularity assumption, its measure is small. Combining the two cases into Eq. (41) we obtain the upper bound:
(45)  
(46)  
(47)  