Unsupervised Learning of Mixture Models with a Uniform Background Component
Abstract
Gaussian Mixture Models are among the most studied and mature models in unsupervised learning. However, outliers are often present in the data and can influence the cluster estimation. In this paper, we study a new model that assumes the data comes from a mixture of a number of Gaussians together with a uniform “background” component assumed to contain outliers and other uninteresting observations. We develop a novel method based on robust loss minimization that performs well in clustering such a GMM with a uniform background. We give theoretical guarantees that our clustering algorithm obtains the best clustering results with high probability. In addition, we show that the result of our algorithm does not depend on initialization and is not affected by local optima, and that parameter tuning is easy. Through numerical simulations, we demonstrate that our algorithm achieves high accuracy and obtains the best clustering results given a large enough sample size. Finally, experimental comparisons with typical clustering methods on real datasets demonstrate the potential of our algorithm in real applications.
Department of Statistics
Florida State University
Tallahassee, FL 32306-4330, USA
Adrian Barbu abarbu@stat.fsu.edu
Department of Statistics
Florida State University
Tallahassee, FL 32306-4330, USA
Keywords: Gaussian Mixture Models, Clustering, Outliers, Loss Minimization
1 Introduction
Over the past several decades, mixture models have become central to many clustering problems. Among the various mixture models, Gaussian Mixture Models (GMM) are the best known and most studied. As a fundamental model for describing numerous natural and artificial phenomena, GMMs have been studied with many different types of methods. In these applications, the data samples are assumed to originate from various sources, each of which can be approximately fit by a Gaussian model.
Research on GMM has advanced swiftly with the advent of the information era. In 1977, EM was formalized by Dempster et al. (1977), marking the beginning of modern clustering algorithms for GMM. Dasgupta and Schulman (2000) built a framework for two-step EM with theoretical convergence guarantees. Since then, multiple algorithms have been proposed to improve the theoretical bounds and loosen the separation condition. Vempala and Wang (2004) showed improved theoretical results using spectral projection methods. Feldman et al. (2006) proposed PAC learning of GMM that makes no assumptions about the separation between the means of the Gaussians. Later, Kannan et al. (2008) found another spectral method that is applicable not only to GMM but also to mixtures of log-concave distributions. Kalai et al. (2010) proposed a polynomial-time algorithm for the case of two Gaussians with provably minimal assumptions on the Gaussians and polynomial data requirements.
The algorithms above are based on GMM or other families of distributions, and are known as distribution models. Other clustering methods do not require specific distributional assumptions about the data. Instead, they measure similarity in different ways and perform clustering based on that measure. There is, however, no universally accepted definition of the term “clustering”, and clustering algorithms can be categorized differently from different points of view. K-means clustering (Hartigan and Wong (1979); Lloyd (1982); Kanungo et al. (2002)) and its variations are probably among the most popular and widely used clustering algorithms. Hierarchical Clustering (Johnson (1967); Day and Edelsbrunner (1984)) builds a hierarchy of clusters using different distance metrics. Both are typical distance-based clustering algorithms.
DBSCAN (Ester et al. (1996)) is a representative of density models. Given a set of points, it clusters the points that have many nearby neighbors and marks as outliers the points that are not reachable from any other point. Because of these properties, DBSCAN can obtain clusters with arbitrary shapes. Major variants of DBSCAN are l-DBSCAN (Viswanath and Pinkesh (2006)), ST-DBSCAN (Birant and Kut (2007)), C-DBSCAN (Ruiz et al. (2007)) and P-DBSCAN (Kisilevich et al. (2010)).
Spectral Clustering (Shi and Malik (2000); Ng et al. (2002)) uses the eigenvectors of a similarity matrix for dimension reduction of the data before clustering. Though these methods may also be applied to GMM and other mixture models, as shown in our paper, their performance may be no better than that of clustering algorithms that specialize in particular data distributions.
The study of the convergence of most GMM clustering algorithms is closely tied to the initial values of the GMM parameters. Many methods, including EM, get stuck in local optima when the initialization is not close enough to the true means. This is why a good initialization is of great significance for many clustering algorithms. Several more recent methods try to overcome this drawback by providing good initializations. K-means++ (Arthur and Vassilvitskii (2007)) chooses the initial centers in a fast and simple way and achieves theoretical guarantees that K-means cannot. Karami and Johansson (2014) present a hybrid clustering method based on DBSCAN that automatically specifies appropriate parameter values.
In this paper we are interested in GMM corrupted by outliers. In this direction, some recent papers have focused on GMM with a small proportion of noise or outliers. For example, Melchior and Goulding (2016) present a version of EM that can deal with noisy and incomplete GMM data samples.
However, in many real clustering problems such as object detection, objects from the desired categories are a minority, while the majority of the images are highly variable and cannot be clustered in any particular way. Moreover, when designing algorithms for GMM, prior knowledge or a reasonable estimate of the number of clusters is of great significance. In real image problems, however, the total number of object clusters is very large, on the order of thousands, and we are actually interested in only a few of them. This issue can be addressed by semi-supervised learning, since assigning a label to a single example from a cluster makes it clear that the cluster is of importance to us.
All of these aspects motivate us to introduce our model: Gaussian Mixture Models with a Uniform Background Component.
1.1 Our Contributions
In this paper, our Gaussian Mixture Model with Uniform Background (GMMUB) is composed of a Gaussian Mixture Model (GMM) (which we call positives) together with another mixture component which is uniform in a large domain (called negatives). Usually, the negatives dominate the data with a large mixture proportion, as illustrated in Figure 1.
| Algorithm | Convergence Rate | Computation Time | Theoretical Guarantee | Assumptions and Conditions |
|---|---|---|---|---|
| K-means with cluster shifting (Pakhira (2014)) | | | | |
| EM for GMMUB (Melchior and Goulding (2016)) | | | | GMMUB |
| Batch K-means (Bottou and Bengio (1995)) | | | | |
| K-means++ (Arthur and Vassilvitskii (2007)) | | | | |
| Hierarchical Clustering (Carlsson and Mémoli (2010)) | | | | Finite Metric Space |
| Spectral Clustering (von Luxburg et al. (2008)) | | | | General Assumptions |
| DBSCAN (Sriperumbudur and Steinwart (2012)) | | | | Hölder Continuous Assumption |
| Stochastic K-means (Tang and Monteleoni (2016)) | | | | Geometric Assumptions |
| EM (Balakrishnan et al. (2017)) | | | | Initialization close enough to MLE |
| EM for GMM (Balakrishnan et al. (2017)) | | | | GMM with initializations close enough to MLE |
| CRLM (ours) | | | | GMM+uniform, separation and coverage |
Table 1 compares various clustering methods and some of their variations. For each method, it shows the computation time, whether the method has theoretical guarantees of convergence to the true parameters, the rate of convergence to the true parameters, and the assumptions the algorithm makes about the data. The rates are expressed in terms of the number of iteration steps. Our algorithm is called CRLM, and we will see that it enjoys a fast convergence rate and an acceptable computational complexity.
The K-means++ method enjoys certain theoretical guarantees, since the value of the potential function it attains is, in expectation, within a factor of $O(\log k)$ of the optimal value. However, the actual rate of convergence of the estimated parameters to the true model parameters is not clear. As for Hierarchical Clustering (Carlsson and Mémoli (2010)), its stability and convergence are established by measuring the Gromov-Hausdorff distance. Still, the actual rate of convergence remains unclear. Hence, batch K-means, Hierarchical Clustering and K-means++ are labeled as clustering methods with theoretical guarantees but without a convergence rate.
Our model is somewhat similar to that of Melchior and Goulding (2016), who modified the EM algorithm to be applicable to GMM with missing data or uniform backgrounds. Our algorithm differs from EM: it does not depend on initialization, and it has a strong theoretical guarantee under certain conditions.
We introduce a novel clustering method that finds the positive clusters as local minima of a robust loss function, thereby extracting them from the uniform background. This robust loss function is zero beyond a certain distance from the center of a candidate cluster. In this respect, it is similar to the negative of a kernel density function whose kernel is a truncated quadratic. Thanks to this property, even when the majority of the data comes from the uniform background, our algorithm is still able to cluster all the positives correctly with high probability under certain assumptions of separation and concentration. Another feature of the algorithm is that it does not rely on a well-chosen initialization. Moreover, the loss minimization in our algorithm is simple and computationally efficient and, unlike gradient descent or EM based methods, avoids being trapped in local optima.
We conduct experiments on simulated data and real data. The simulation results indicate that when the assumptions are met, our algorithm performs better than other clustering methods such as Kmeans, Spectral Clustering, Hierarchical Clustering, etc. Furthermore, experiments on real data indicate that our algorithm remains applicable and powerful on real data applications when most of the assumptions are met.
2 Formulation and Algorithm
The problem we address is to cluster a set of unlabeled training examples drawn from a mixture of isotropic Gaussians, each with its own mixture weight, plus a “negative” mixture component consisting of samples drawn uniformly from inside a large ball. An example is shown in Figure 1.
2.1 Robust Loss Functions
We will use the following robust loss function
where we fix . Observe that the loss function is zero outside a ball of radius . A graph of the loss for different values of is given in Figure 2, left.
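As an illustration, a truncated quadratic loss with these properties can be sketched as follows. The exact functional form, the name `truncated_quadratic_loss`, the normalization by the dimension and the default value `G=4.0` are all assumptions made for this sketch and should not be read as the paper's definition; the sketch only reproduces the property that the loss vanishes outside a ball around the candidate center.

```python
import numpy as np


def truncated_quadratic_loss(x, mu, sigma, G=4.0):
    """Assumed form: min(||x - mu||^2 / (d * sigma^2) - G, 0) for each row of x."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    d = x.shape[1]
    # Squared Euclidean distance of every point to the candidate center.
    sq_dist = np.sum((x - mu) ** 2, axis=1)
    # Quadratic inside the ball, clipped to zero outside of it.
    return np.minimum(sq_dist / (d * sigma ** 2) - G, 0.0)
```

Under this assumed form, the loss equals `-G` at the center and is exactly zero at distance `sigma * sqrt(d * G)` and beyond, so distant background points contribute nothing to the cost of a candidate cluster.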
2.2 Finding one Cluster By Loss Minimization and One Step Mean Shift
The goal is to find the cluster parameters by minimizing the cost function:
(1) 
For that, the cost function is computed with center at each training example and . The pair of minimum loss is then used as the initialization for one step of the mean shift algorithm. The algorithm is described in detail in Algorithm 1.
2.3 Finding Multiple Clusters
To find multiple clusters, the one cluster finding algorithm is called repeatedly, after each call eliminating the detected cluster points.
The first cluster found by CRLM can be regarded as the cluster with the largest clusterability, in the sense of minimizing the robust loss function. It is akin to the cluster whose points have the smallest distances to one another. Unlike methods that update the means of all clusters at the same time, CRLM finds the means of different clusters in different iterations. Another notable feature of CRLM is that it leaves all points that are somewhat noisy to the background cluster. This is a key reason why it can cluster the GMMUB model correctly with high probability.
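The procedure of Sections 2.2 and 2.3 can be sketched in a few lines. This is only an illustrative reading of the text, not the authors' Algorithm 1 or Algorithm 2: the loss form `min(||x - c||^2 / (d * sigma^2) - G, 0)`, the function names `find_one_cluster` and `find_clusters`, the default `G=4.0`, and the rule that cluster members are the points inside the ball where the loss is nonzero are all assumptions.

```python
import numpy as np


def find_one_cluster(X, sigma, G=4.0):
    """Pick the data point with minimum cost as center, then do one mean shift step."""
    n, d = X.shape
    # Squared distances between every pair of points (clipped at 0 for round-off).
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    # Cost of centering the assumed truncated quadratic loss at each training example.
    cost = np.minimum(sq_dist / (d * sigma ** 2) - G, 0.0).sum(axis=1)
    best = int(np.argmin(cost))
    # One mean shift step: average the points inside the ball where the loss is nonzero.
    radius_sq = G * d * sigma ** 2
    center = X[sq_dist[best] < radius_sq].mean(axis=0)
    members = np.sum((X - center) ** 2, axis=1) < radius_sq
    return center, members


def find_clusters(X, k, sigma, G=4.0):
    """Call the one-cluster search k times, removing each detected cluster."""
    labels = np.zeros(len(X), dtype=int)  # label 0 marks the background
    remaining = np.arange(len(X))
    centers = []
    for j in range(1, k + 1):
        if remaining.size == 0:
            break
        center, members = find_one_cluster(X[remaining], sigma, G)
        labels[remaining[members]] = j
        centers.append(center)
        remaining = remaining[~members]
    return np.array(centers), labels
```

Because every candidate center is a data point and the update is a single averaging step, the sketch has no random initialization and no iterative optimization that could stall in a poor local optimum, which mirrors the properties claimed for the method.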
3 Main Results
First, we will set up the notation used in this paper and the main assumptions used in the derivation of our main theorems.
3.1 Notations
In the rest of the paper we will use the following terms:

$n$: the number of observations
$k$: the number of positive clusters
$d$: the dimension of the observations
a bound for the norm of the observations to be clustered
the true mixture weight of each positive cluster
the true mean and standard deviation of each positive cluster
the estimated mean and standard deviation of each positive cluster
the points contained in each positive cluster
the indices for each positive cluster
a large initial standard deviation for clustering
a constant in the loss function, fixed throughout this paper

3.2 Assumptions
The following Separation and Concentration Conditions will be used in the proof of our main theorem. These conditions are illustrated in Figure 3. We will later show that these conditions happen with high probability.
C1: Separation Condition Between Positives and Negatives: There are no negative points at a distance less than from any positive point.
C2: Concentration Condition for Positives: For any positive cluster with true mean and covariance matrix we have
To get an overall probability guarantee for C1 and C2, we have the Proposition 1 below, based on the following assumptions:
A1: Large assumption
A2: Separation Assumption Between Positive Clusters
A3: Lower Bound Assumption for
Proposition 1
Given observations from a GMMUB of isotropic Gaussians with mixture weights , true means and variances respectively, and uniform distribution within a ball of radius , with weight . If A1 is satisfied, then C1 and C2 hold with probability at least
Proof Based on Lemma 15 in the Appendix, C2 holds with probability at least . This is mainly because for large the norm for is mostly concentrated around , as illustrated in Figure 4.
If A1 is satisfied then let be any negative point. Based on Lemma 14 ,
Then, for any positive point , and for any negative point , we have:
Therefore, C1 and C2 hold with probability at least if A1 holds.
From C2, two other important results that will be useful for the proof of the main theorem have been derived in Lemma 16 in the Appendix.
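The concentration phenomenon behind C2 is easy to check numerically. The sketch below relies on the standard fact that, for an isotropic Gaussian with standard deviation `sigma` in `d` dimensions, the distance to the mean concentrates around `sigma * sqrt(d)`; the function name `relative_norm_spread`, the sample size and the seed are illustrative assumptions.

```python
import numpy as np


def relative_norm_spread(d, sigma=1.0, n_samples=2000, seed=0):
    """Return (mean norm / (sigma * sqrt(d)), std of norm / mean norm)."""
    rng = np.random.default_rng(seed)
    # Centered isotropic Gaussian samples, one per row.
    x = rng.normal(0.0, sigma, size=(n_samples, d))
    norms = np.linalg.norm(x, axis=1)
    return norms.mean() / (sigma * np.sqrt(d)), norms.std() / norms.mean()
```

For a large dimension such as `d=1000` the first value is very close to 1 and the second is small, whereas for `d=2` the norms are widely spread. This is why a concentration condition of this kind becomes easier to satisfy as the dimension grows.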
3.3 Theoretical Guarantees
We start by giving theoretical guarantees for OCRLM, assuming there is only one Gaussian cluster.
Proposition 2
Let be observations sampled from a mixture of a Gaussian with weight and a uniform distribution inside the ball of radius centered at . If for a given , C1 and C2 are satisfied and
then with probability at least OCRLM will cluster all the observations correctly, where
(2) 
The proof of this proposition is given in the Appendix. This proposition assumes that conditions C1 and C2 are satisfied but the following theorem replaces these conditions with assumptions A1 and A3.
Theorem 3
Let be observations sampled from a mixture of a Gaussian with weight and a uniform distribution inside the ball of radius centered at . If A1 and A3 are satisfied and for a given
then OCRLM will cluster all observations correctly with probability at least
where has been defined in Eq. (2).
Proof Based on Prop 2, when C1 and C2 are satisfied, OCRLM clusters all observations correctly with probability at least .
Using Prop 1, it is clear that the probability that both C1 and C2 hold is at least when A1 and A3 are satisfied.
Hence, OCRLM correctly classifies all observations with probability at least
When there is only one positive cluster, OCRLM is run once to find all the positive points.
When the dimension of the data and the number of observations are large enough, the probability in Theorem 3 approaches 1.
To generalize Theorem 3 to multiple positive clusters, we need statement 1 of Lemma 16 to obtain a separation condition between pairs of positive clusters.
Similarly to the proof of Proposition 2, after each iteration in which OCRLM is applied to find a positive cluster, we have Lemma 4 below. Based on it, a theorem for finding multiple positive clusters follows.
Lemma 4
Let be observations sampled from a mixture of isotropic GMM with means , covariance matrix , weight and uniform distribution within radius , with weight . . If ,
then on each iteration of the loop in CRLM, OCRLM finds exactly the points of one positive cluster with probability at least
where
(3) 
Proof Based on Proposition 1, C1 and C2 hold with probability at least .
Then, similar to proof of Prop 2 in appendix, we have:
Based on Proposition 2, on any iteration of loop, since is a true positive point of certain positive cluster satisfying with probability at least .
Based on Lemma 16, , . If is a true positive point satisfying , based on 16, rightly finds out exactly all the points of one positive cluster.
Theorem 5
Let be observations sampled from a mixture of isotropic GMM with means , covariance matrix , weight and uniform distribution within radius , with weight . . If ,
then CRLM correctly clusters all the positives with probability at least
where
where has been defined in Eq. (3)
Proof
We have already shown that C1 and C2 hold with probability at least
According to Lemma 4, if C1 and C2 hold, each iteration of the loop in CRLM correctly finds all the points of one positive cluster with probability at least , and all the points found are removed before the next iteration.
Hence, after the kth iteration, CRLM has found all the points of k positive clusters by running OCRLM k times and removing the points found each time. It therefore clusters all the positive points correctly with probability at least
3.4 Convergence Rate Analysis
To get a probability bound for the convergence rate, we need Lemma 12 in the Appendix.
Based on the result , we have the following Corollary 6 for convergence rate of CRLM.
Corollary 6
Let be observations sampled from a mixture of isotropic GMM with means , covariance matrix , weight and uniform distribution within radius , with weight . . If , A1A3 hold,
denote as the estimated mean of th positive cluster by CRLM, as the true number of positive points in cluster . Then:
, with probability at least
converges to with convergence rate .
Proof Based on Proposition 1, C1 and C2 hold with probability at least
Based on Theorem 5 and Lemma 16, when C1 and C2 hold, with probability at least , , where is the sample mean for th positive cluster.
It suffices to show that converges to at convergence rate with a certain probability.
By Lemma 12, taking , note that , we have:
Binomial , so Hoeffding’s inequality yields the bound
With probability at least , .
Hence, if CRLM clusters all the positives correctly, then for a given , with probability at least , converges to with convergence rate . For all , we have that converges to with convergence rate with probability at least .
And the convergence rate of CRLM is with probability at least
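For reference, the form of Hoeffding's inequality that applies to a binomial count is recalled below. Here $N$ denotes a Binomial$(n, \pi)$ random variable, such as the number of points drawn from one mixture component of weight $\pi$ among $n$ observations. These symbols are chosen for this note and are assumptions about the notation, not a restatement of the bound used in the proof.

```latex
\Pr\bigl(N \le n\pi - t\bigr) \le \exp\!\left(-\frac{2t^{2}}{n}\right),
\qquad
\Pr\bigl(|N - n\pi| \ge t\bigr) \le 2\exp\!\left(-\frac{2t^{2}}{n}\right),
\qquad t > 0 .
```

For example, taking $t = n\pi/2$ in the one-sided bound shows that $N \ge n\pi/2$ with probability at least $1 - \exp(-n\pi^{2}/2)$, so the number of points in a cluster grows linearly in $n$ with overwhelming probability.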
3.5 Computational Complexity
In each iteration of CRLM, OCRLM is run once. Hence, the complexity of CRLM is just k times the complexity of OCRLM. For OCRLM, in the worst case, it suffices to calculate , hence the computational complexity for CRLM is . Generally speaking, , and the computational complexity of CRLM is therefore
4 Experiments
In this section, we conduct experiments to compare our method with other clustering algorithms. First, we conduct experiments on synthetic data, analyze the effect of the tuning parameter on the clustering results, and present a method to find the number of clusters $k$. Finally, we perform experiments on several kinds of real image data to show the value of our algorithm in real applications.
The algorithms involved in the comparison of the experiments are: Kmeans, DBSCAN, Complete Linkage Clustering (CL), EM, Spectral Clustering (SC) and CRLM.
In the experiments, for K-means we employ K-means++. For hierarchical clustering, we use Complete Linkage Clustering based on Euclidean distance. For EM, the results in Table 2 use EM for GMM, while in the other experiments we derive the updates for GMMUB by adding a likelihood term for the uniform background and choose the best result over different initializations.
For Spectral Clustering, we use the method of Ng et al. (2002).
4.1 Simulation Experiments
In this section we perform experiments on data coming from a GMMUB that satisfies C1-C2 and A1-A3.
4.1.1 Convergence Plots
In the simulation, we generate our data from the model proposed at the beginning of Section 2, with all the assumptions of Section 3 satisfied. To sample from a uniform distribution within a $d$-dimensional closed ball, we employ a standard method proposed by Muller (1959).
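A common way to draw uniform samples from a ball builds on Muller's idea of normalizing Gaussian vectors to obtain uniform directions, and then scales each direction by a radius distributed as `R * U**(1/d)` with `U` uniform on [0, 1]. Whether the experiments used exactly this variant is an assumption, and the function name `sample_uniform_ball` is hypothetical.

```python
import numpy as np


def sample_uniform_ball(n, d, radius, rng=None):
    """Draw n points uniformly from the closed d-dimensional ball of the given radius."""
    rng = np.random.default_rng() if rng is None else rng
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    g = rng.normal(size=(n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    # Volume grows like r**d, so the radius is radius * U**(1/d).
    r = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return directions * r
```

For instance, `sample_uniform_ball(1000, 50, 100.0, np.random.default_rng(0))` returns a `(1000, 50)` array. In high dimension almost all of these points lie close to the boundary of the ball, which is what keeps the negatives away from the positive clusters.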
In the experiments, the mixture proportion of each positive cluster is set to 0.01. We make . For the radius of the uniform ball, guided by our theoretical probability bounds, we set it large enough that all the bounds are close to 1. We generate simulated data with different d and k and compare our method against several standard clustering methods.
To compare the convergence rate of different algorithms, we record the criterion , where is the true mean of the th positive cluster and is its estimated mean under a given clustering algorithm. The estimated mean for positive cluster j is calculated by averaging the data samples to which the algorithm assigns the same label. The “Supervised” results are generated with the true labels. We draw log-log plots with as the x-axis and as the y-axis and obtain the following comparison of convergence plots.
From these results, we see that the results of CRLM always converge to those obtained with the supervised true labels, which eventually reach the convergence rate . DBSCAN, K-means and Complete Linkage Clustering also converge to those results in some experiments.
4.1.2 Stability of Clustering Results with Respect to
In our algorithm, two parameters to be tuned are , and , estimated number of positive clusters. In this part, we will discuss the selection of and . The and in Algorithm 2 are actually estimated values for true values.
In terms of , in practice we just set . In fact, if the data has exactly the structure of a GMMUB and is sufficiently large, the selection of is very flexible, and the flexibility increases with increasing value of . To measure the impact of different on the clustering results, we introduce three external-criterion measures of the quality of a clustering algorithm for two or more clusters: Rand Index, F-measure and Purity (Sokolova and Lapalme (2009)). Among these measures, the F-measure is the most significant for our purpose, since it measures the clustering accuracy much better than the Rand Index on imbalanced datasets. In the following experiments, when there is a mixture of positive clusters and one negative component, we use one confusion matrix per positive cluster, in which the positives are the data points of that positive cluster.
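A minimal sketch of such confusion-matrix measures is given below, treating one positive cluster as the positive class and everything else as negative. It assumes that the predicted labels have already been matched to the true labels, and it reads the Rand Index as the fraction of correctly assigned points, which is an interpretation rather than something stated explicitly in the text; the function names are hypothetical.

```python
import numpy as np


def one_vs_rest_scores(true_labels, pred_labels, positive):
    """Accuracy and F-measure with the cluster `positive` as the positive class."""
    t = np.asarray(true_labels) == positive
    p = np.asarray(pred_labels) == positive
    tp = int(np.sum(t & p))
    fp = int(np.sum(~t & p))
    fn = int(np.sum(t & ~p))
    tn = int(np.sum(~t & ~p))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f_measure = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f_measure


def purity(true_labels, pred_labels):
    """Fraction of points that belong to the majority true class of their cluster."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        _, counts = np.unique(true_labels[pred_labels == c], return_counts=True)
        total += counts.max()
    return total / len(true_labels)
```

With 10 positives and 990 negatives, assigning every point to the background gives an accuracy of 0.99 and a purity of 0.99 but an F-measure of 0, which illustrates why the F-measure is the more informative measure on imbalanced data.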
When , we perform clustering with different selections of . We keep that when , . We obtain the experimental upper and lower bounds between which a high F-measure is obtained. Figure 6 shows the experimental result when .
From this plot, as increases, all the measures first rise to 1 and then fall, provided an adequate number of data samples is available. For parameter tuning, it is easy to find the interval where the maximum of all the measures is attained, even in this case. In our GMMUB model, we focus on correctly clustering all the positives. Hence, the experimental upper and lower bounds for delimit the interval where a high F-measure is obtained. Experimental and theoretical bounds for different dimensions are shown in Figure 7. The darker gray region is the region of the theoretical guarantee, which is a subregion of the lighter gray region of practical choices of that give a high F-measure.
For the theoretical bounds, one lower bound comes from A3 and three upper bounds come from A1, the weight condition of Theorem 3 and the probability guarantee of Theorem 3; they are labeled A3 bound, A1 bound, Weight bound and Thm 3 bound, respectively. The Weight bound comes from the constraint on , and the Thm 3 bound is derived by requiring a high probability in the result of Theorem 3. Our theoretical guarantee for is the area between the curves of the A3 bound and the Thm 3 bound. It also shows a lower bound for . With increasing and , the area becomes larger. However, the experimental bounds show that CRLM allows a much more flexible choice of .
Figure 8 shows all the bounds for different , with fixed and . With the choice of based on Lemma 12, the theoretical region and the experimental region shrink as increases. For the Thm 3 bound, needs to be large enough to make the probability close to 1 for a given .
When , the results are similar. With a trivial case, given , , the results are in Figure 9. Adding A2 bound and replacing Thm 3 Bound with Thm 5 Bound are the major differences from case of one positive cluster.
To get theoretical upper and lower bounds, we need a rough estimate of for each positive cluster; a precise estimate is obtained from CRLM. Hence, we simulated a GMMUB with , , , with C1-C2 and A1-A3 satisfied. We calculate the distances between the samples and take a certain proportion of the shortest distances. We obtain the histogram in Figure 10, which estimates well.
4.1.3 Estimate of $k$
To estimate the number of positive clusters $k$, some relevant methods include the elbow method (Thorndike (1953)), X-means (Pelleg et al. (2000)) and the silhouette method (Lletí et al. (2004)). For our synthetic data, any of these methods can be used to estimate the total number of clusters, that is, the positive clusters plus the one negative component. However, one advantage of our algorithm is that a simple way to estimate this number follows naturally and directly from the algorithm itself. Suppose that D is sufficiently large and the parameter is fine-tuned. Run Algorithm 2 with larger than the true value and record the number of samples that are clustered together in each iteration. Stop at the first iteration in which this number becomes 1; the number of iterations completed before it is the estimated number of positive clusters.
We make experiments on synthetic data with . In Figure 11, we run the procedure to get an estimated value for k with different number of observations.
This method of estimating k is efficient, and the estimate converges to the actual value of k. Moreover, the dimension does not have much impact on the number of observations needed to find a good estimate of k. These experimental results rely on a sufficiently large value of and on C1-C2. When either of those two conditions is violated, the stopping criterion for estimating k should be adjusted to a positive integer larger than 1.
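The stopping rule of this section can be written as a small helper that takes the sizes of the clusters found in successive iterations. The function name `estimate_num_clusters` and the parameter `stop_size` are hypothetical; `stop_size=1` corresponds to the basic rule, and a larger value corresponds to the relaxed stopping criterion mentioned above.

```python
def estimate_num_clusters(cluster_sizes, stop_size=1):
    """Count the iterations completed before the first cluster of size <= stop_size."""
    for iteration, size in enumerate(cluster_sizes):
        if size <= stop_size:
            # Iterations 0 .. iteration-1 produced genuine clusters.
            return iteration
    # The stopping size was never reached: every iteration found a cluster.
    return len(cluster_sizes)
```

For example, if the successive iterations return clusters of sizes `[50, 40, 1, 1]`, the helper returns 2, since the third iteration only isolates a single point.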
4.2 Real Data
To show the application potential of CRLM, we also run experiments on real datasets from two different sources. The original datasets are sets of images from different classes. We apply various clustering algorithms to those images and measure the performance by F-measure and Rand Index against the external true labels.
4.2.1 Kimia 216 Dataset And 1070 Shape Database
The Kimia 216 dataset contains 18 classes, each consisting of 12 binary (black and white) shape images. It contains shape silhouettes for birds, bones, brick, camels, car, children, classic cards, elephants, faces, forks, fountains, glasses, hammers, hearts, keys, rays, turtles and a miscellaneous class. Most of the images in the Kimia 216 dataset are also in the 1070 Shape Database. Figure 12 shows the images in the Kimia 216 dataset.
We applied simple preprocessing to the original data. We first resize each image to 256×256 pixels and vectorize it. After that, we perform PCA and run the different clustering methods on the result. Generally speaking, the Kimia 216 dataset can be fitted by a GMM with 18 Gaussian clusters. Table 2 shows the clustering results, measured by the average Rand Index.
| Methods | K-means | DBSCAN | Complete Link | EM | Ours (CRLM) |
|---|---|---|---|---|---|
| Average Rand Index (%) | 67.99 | 69.91 | 60.19 | 18.06 | 71.30 |
By this measure, our method ranks first. On the other hand, the results of all the listed clustering methods are far from satisfactory. Although a GMM can be a good model for the Kimia 216 dataset, the differences among certain groups may not be very distinct. Besides, the similarities between observations within each group may not be easy to measure by a simple distance or density. Figure 13 shows the observations clustered out by Algorithm 2 after each iteration. In terms of the similarity defined by Algorithm 2, the samples clustered in the first 10 iterations can be regarded as the top 10 clusters, those that are best separated from other clusters and have the smallest inner distances.
For the 1070 Shape Database, we first conducted a similar clustering analysis with the different methods. However, the results were not as expected. The 1070 Shape Database contains 1070 shape images in 66 classes, but the original class labels are not assigned properly and clearly. On the one hand, the numbers of images in the different classes are not balanced: the largest class has over 50 images, while each of the minor classes has only one. On the other hand, some classes are so similar to each other that they are prone to be clustered as a single cluster. For example, there are several different classes for some types of airplane and bunny, which can easily be clustered as two clusters.
Because of these drawbacks of the labels of the 1070 Shape Database, we reassign the labels to make the data similar to our theoretical data structure, a GMM with a uniform background. We first run the clustering with the original labels to obtain cluster labels and clustering measurements. Then we pick the top 6 clusters that are clustered with high accuracy and label them as 6 positive clusters. Figure 14 shows the six clusters selected as positive clusters. The rest of the data samples are labeled as the negative uniform background. With this reassignment of the ground-truth labels of the 1070 Shape Database, the clustering accuracy of the various clustering algorithms can be greatly improved. We take 10 samples from each positive cluster and run two experiments with different numbers of negatives added. In the first experiment, we take 60 negative samples, one from each of the 60 original clusters other than the 6 positive clusters. In the second experiment, we add 60 more, different, negative samples from those 60 original clusters. The comparison results are in Table 3.
| | K-means | DBSCAN | CL | EM | Ours (CRLM) |
|---|---|---|---|---|---|
| 60 positives, 60 negative samples | | | | | |
| Rand Index (%) | 51.25 | 89.17 | 45.00 | 22.50 | 99.17 |
| F-measure (%) | 55.06 | 68.37 | 41.10 | 18.96 | 99.17 |
| 60 positives, 120 negative samples | | | | | |
| Rand Index (%) | 65.86 | 66.11 | 55 | 46.67 | 67.78 |
| F-measure (%) | 47.77 | 73.86 | 37.43 | 21.39 | 54.95 |
When there are only 60 negatives from 60 different classes, our loss-based approach (CRLM) outperforms the other methods evaluated in terms of accuracy and F-measure. The clustering results indicate that our method obtains high clustering accuracy when the data follows the assumptions used for obtaining the theoretical guarantees. As more negatives are added to the data, it is probable that A1-A3 are violated or that the negatives are no longer uniformly distributed. Consequently, the method has a decreased accuracy. In this case, although our method still has the highest clustering accuracy, it is outperformed by DBSCAN in terms of F-measure.
4.2.2 IMDB-WIKI Face Dataset and ImageNet Dataset
The IMDB-WIKI Face Dataset (Rothe et al. (2015)) contains over 500k face images. ImageNet (Russakovsky et al. (2015)) is an image dataset organized according to the WordNet hierarchy and contains a huge number of image classes. To preprocess the images from the IMDB-WIKI and ImageNet datasets, we load a pretrained CNN, VGG-verydeep-16 (Simonyan and Zisserman (2014)), to extract 4096 features for each image.
The training data is constructed in the following way. For positives, we take a subset of IMDBWiki and labeled them as one positive clusters. On the other hand, negatives are picked randomly from a subset of ILSVRC 2012 with 1000 classes and 6001200 images in each class. We do clustering analysis to cluster the positives out of the negatives and the results are presented in table 4. For DBSCAN, the classes of output labels can be larger than 2. Here, we set the class with largest number of observations with label 1 and the rest of the observations are labeled as another class. There are two ways of mapping from the output labels to the true labels. The Rand Index and Fmeasure are calculated by taking the maximum of those from these two ways.
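The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: it assumes that the reported accuracy is the fraction of correctly labeled observations, that the F-measure is computed for the positive class, and that the maximum over the two mappings is taken separately for each measure. The function names `collapse_to_binary`, `accuracy_and_f`, and `best_over_mappings` are hypothetical.

```python
from collections import Counter


def collapse_to_binary(labels):
    """Give label 1 to the most frequent cluster and label 0 to all others."""
    largest = Counter(labels).most_common(1)[0][0]
    return [1 if lab == largest else 0 for lab in labels]


def accuracy_and_f(true, pred):
    """Accuracy and F-measure of class 1 for two binary label lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    acc = sum(t == p for t, p in zip(true, pred)) / len(true)
    f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f


def best_over_mappings(true, pred):
    """Score both output-to-truth mappings and keep the best of each measure."""
    flipped = [1 - p for p in pred]
    acc_a, f_a = accuracy_and_f(true, pred)
    acc_b, f_b = accuracy_and_f(true, flipped)
    return max(acc_a, acc_b), max(f_a, f_b)


# Toy example: 3 positives, 4 negatives, and a multi-label clustering output.
true_labels = [1, 1, 1, 0, 0, 0, 0]
raw_output = [2, 2, 2, -1, 5, 5, 2]
binary_output = collapse_to_binary(raw_output)
acc, f = best_over_mappings(true_labels, binary_output)
print(binary_output, acc, f)
```

In this toy example the largest output cluster happens to contain all the positives plus one negative, so the unflipped mapping gives the higher scores.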
                  K-means   DBSCAN   CL      EM      SC      Ours (CRLM)
,
Rand Index (%)    78.20     89.40    62.40   57.20   55.31   99.60
F-measure (%)     85.79     90.27    76.15   62.23   59.60   99.66
,
Rand Index (%)    66.20     94.90    67.90   55.40   54.40   99.80
F-measure (%)     61.05     91.86    46.92   43.02   42.15   99.67
,
Rand Index (%)    71.25     95.85    83.95   53.15   54.55   99.70
F-measure (%)     31.43     86.39    26.33   26.88   26.38   98.99
,
Rand Index (%)    75.40     96.72    93.48   52.86   51.84   99.76
F-measure (%)     13.77     76.91    11.38   12.01   12.29   98.02
,
Rand Index (%)    74.88     97.09    96.68   52.44   51.46   99.86
F-measure (%)     7.43      53.21    5.84    6.27    6.66    97.70
,
Accuracy (%)      75.76     98.90    98.74   51.35   50.45   99.88
F-measure (%)     2.57      54.44    1.99    2.19    2.34    94.16
At first glance, the results show that our method achieves the best accuracy. However, note that as the number of negatives increases, a high Rand Index can be achieved simply by placing all the positives and negatives in one single cluster. Hence, the F-measure and the Rand Index should be compared together to obtain a reasonably reliable assessment of a clustering algorithm. CRLM outperforms the other methods, with a high F-measure and an accuracy close to 1, which is consistent with the fact that the construction of the training dataset satisfies the conditions of GMMUB. When many more negatives are added to the training dataset, the clustering accuracy of CRLM does not decrease, but its decreasing F-measure shows that its clustering performance becomes poorer. Among the other clustering algorithms, DBSCAN performs best, with acceptable accuracy and F-measure when n is small. EM and Spectral Clustering give similar results: both appear to split the dataset roughly in half, mixing positives and negatives.
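To make the single-cluster effect concrete, the following worked computation uses hypothetical counts, $n_+ = 100$ positives and $n_- = 10000$ negatives, which are not taken from the experiments. It assumes that accuracy is the fraction of correctly labeled observations, that the F-measure is computed for the positive class, and that the better of the two label mappings is reported for each measure.

```latex
\begin{align*}
\text{all points labeled negative:}\quad
  & \mathrm{Acc} = \frac{n_-}{n_+ + n_-}, \qquad F = 0, \\
\text{all points labeled positive:}\quad
  & \mathrm{Acc} = \frac{n_+}{n_+ + n_-}, \qquad
    F = \frac{2 n_+}{2 n_+ + n_-}, \\
n_+ = 100,\ n_- = 10000:\quad
  & \max \mathrm{Acc} = \frac{10000}{10100} \approx 99.01\%, \qquad
    \max F = \frac{200}{10200} \approx 1.96\% .
\end{align*}
```

Under these assumptions, a degenerate clustering reaches an accuracy above 99% while its F-measure stays below 2%, which is why the two measures are read together.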
5 Conclusion and Future Work
In this paper, we propose a novel method (CRLM) based on robust loss minimization for clustering Gaussian mixtures together with an extra mixture component that is a uniform distribution. The basic assumptions for our algorithm are: 1. Isotropic Gaussians for the foreground (positive) clusters. 2. Large radius for the background samples. 3. Sufficient separation between any two positive clusters. Unlike some other clustering methods, our algorithm enjoys strong theoretical guarantees that it finds the correct clusters with high probability, and it does not depend on initialization. Moreover, it can work with a predefined number of clusters or it can estimate the number of clusters.
In the synthetic data experiments, we generate data from a GMM with a uniform background that satisfies all the assumptions and in which the positives are a minority. The clustering results show that CRLM is robust, since it finds all the positives as long as there are enough positive samples. Also, we do analysis on robustness of CRLM with regards to and estimation of . For the real data analysis, we experiment on the original datasets and on subsets constructed to have a structure similar to a GMM with a uniform background. The real data results indicate that CRLM generally outperforms other classic clustering methods.
However, CRLM still has some drawbacks, which point to potential future work. On the one hand, the effectiveness of CRLM rests on its assumptions, which can be difficult to satisfy in practice. On the other hand, the results of CRLM and the other clustering methods on large-scale image datasets are far from satisfactory. Hence, our future work has two aspects. First, we can modify and improve our algorithm to make it applicable to more general cases. Second, some labeled image data can be added to the training data, so that semi-supervised learning can be used to improve our results on real image data.
Acknowledgment
This work was supported by DARPA ARO W911NF-16-1-0579.
Appendix
The proof sketch for Proposition 2 is based on distances. Algorithm 1 returns a cluster center and a distance. On the one hand, we need to show that, with high probability, the ball defined by this center and distance does not include any negative points, which can be derived directly from the separation condition. On the other hand, we need to show that, with high probability, the ball centered at the output of Algorithm 1 covers all the positive points.
There are several technical lemmas and propositions that will be useful for the proofs.
Proposition 7
If is a uniform sample inside the ball of radius centered at , then the pdf of is
Therefore .
Proof We have the CDF
By taking the derivative of the CDF, we obtain the pdf.
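For reference, the computation behind Proposition 7 can be written out as follows. The notation here is an assumption, since the symbols are not recoverable from this passage: $x$ is uniform in the ball of radius $R$ centered at $c$ in $\mathbb{R}^d$, and $r = \|x - c\|$.

```latex
\begin{align*}
F(r) &= P(\|x - c\| \le r)
      = \frac{\mathrm{vol}\, B(c, r)}{\mathrm{vol}\, B(c, R)}
      = \left(\frac{r}{R}\right)^{d}, \qquad 0 \le r \le R, \\
f(r) &= F'(r) = \frac{d\, r^{d-1}}{R^{d}}, \qquad 0 \le r \le R, \\
E[r] &= \int_0^R r \cdot \frac{d\, r^{d-1}}{R^{d}}\, dr = \frac{d}{d+1}\, R .
\end{align*}
```

The ratio of volumes depends only on $(r/R)^d$ because the volume of a $d$-dimensional ball scales as the $d$-th power of its radius.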
Corollary 8
If is a uniform sample inside the ball of radius centered at , then the pdf of the random variable is
(4) 
and the expected value of L is .
Since, from proposition 1,
Proposition 9
For an isotropic Gaussian random variable the pdf of is i.e.
Thus .
Proof We have the CDF
where is the area of the unit ball in . By taking the derivative of the CDF, we obtain the pdf.
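The two distance distributions above can be checked numerically. The sketch below is only an illustration under assumed notation: dimension `d`, ball radius `radius`, Gaussian standard deviation `sigma`, both distributions centered at the origin. It compares Monte Carlo averages with two standard closed forms, $E\|x\| = dR/(d+1)$ for the uniform ball and $E\|x\|^2 = d\sigma^2$ for the isotropic Gaussian. The function names and parameter values are hypothetical.

```python
import math
import random


def sample_uniform_ball(d, radius, rng):
    """Draw one point uniformly from the d-dimensional ball of given radius."""
    # A normalized Gaussian vector gives a uniformly random direction.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    # The distance has CDF (r / radius)^d, so invert it with U^(1/d).
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [v * scale for v in g]


def sample_isotropic_gaussian(d, sigma, rng):
    """Draw one point from N(0, sigma^2 I_d)."""
    return [rng.gauss(0.0, sigma) for _ in range(d)]


def mean_norm(points):
    """Average Euclidean norm of a list of points."""
    return sum(math.sqrt(sum(v * v for v in p)) for p in points) / len(points)


def mean_sq_norm(points):
    """Average squared Euclidean norm of a list of points."""
    return sum(sum(v * v for v in p) for p in points) / len(points)


rng = random.Random(0)  # fixed seed so the run is repeatable
d, radius, sigma, n = 5, 3.0, 0.5, 20000

ball = [sample_uniform_ball(d, radius, rng) for _ in range(n)]
gauss = [sample_isotropic_gaussian(d, sigma, rng) for _ in range(n)]

print("uniform ball: empirical", mean_norm(ball), "closed form", d * radius / (d + 1))
print("gaussian:     empirical", mean_sq_norm(gauss), "closed form", d * sigma ** 2)
```

With a large radius, the background distances concentrate near the boundary of the ball as the dimension grows, while the Gaussian distances concentrate near $\sigma\sqrt{d}$, which is the kind of separation the distance-based argument relies on.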
Corollary 10
For an isotropic Gaussian random variable the pdf of is
(5) 
and .
Proof Using and Proposition 9 we get that:
(6) 
so
(7) 
Then .
Corollary 11
Suppose , then the random variable has .
Lemma 12
If and then
Proof The pdf of follows and therefore it follows .
From Lemma 2 in Inglot and Ledwina (2006), when , we have:
Lemma 13
Let . Then if we have
Proof Taking in Lemma 12, we have: