Active Clustering with Model-Based Uncertainty Reduction
Abstract
Semi-supervised clustering seeks to augment traditional clustering methods by incorporating side information provided via human expertise in order to increase the semantic meaningfulness of the resulting clusters. However, most current methods are passive in the sense that the side information is provided beforehand and selected randomly. This may require a large number of constraints, some of which could be redundant, unnecessary, or even detrimental to the clustering results. Thus, in order to scale such semi-supervised algorithms to larger problems, it is desirable to pursue an active clustering method—i.e. an algorithm that maximizes the effectiveness of the available human labor by only requesting human input where it will have the greatest impact. Here, we propose a novel online framework for active semi-supervised spectral clustering that selects pairwise constraints as clustering proceeds, based on the principle of uncertainty reduction. Using a first-order Taylor expansion, we decompose the expected uncertainty reduction problem into a gradient and a step-scale, computed via an application of matrix perturbation theory and cluster-assignment entropy, respectively. The resulting model is used to estimate the uncertainty-reduction potential of each sample in the dataset. We then present the human user with pairwise queries with respect to only the best candidate sample. We evaluate our method using three different image datasets (faces, leaves and dogs), a set of common UCI machine learning datasets and a gene dataset. The results validate our decomposition formulation and show that our method is consistently superior to existing state-of-the-art techniques, as well as being robust to noise and to unknown numbers of clusters.
active clustering, semi-supervised clustering, image clustering, uncertainty reduction
1 Introduction
Semi-supervised clustering plays a crucial role in machine learning and computer vision for its ability to enforce top-down structure while clustering [1, 2, 3, 4, 5, 6]. In these methods, the user is allowed to provide external semantic knowledge—generally in the form of constraints on individual pairs of elements in the data—as side information to the clustering process. These efforts have shown that, when the constraints are selected well [7], incorporating pairwise constraints can significantly improve the clustering results.
In computer vision, there are a variety of domains in which semi-supervised clustering has the potential to be a powerful tool, including, for example, facial recognition and plant categorization [11]. First, in surveillance videos, there is significant demand for automated grouping of faces and actions: for instance, recognizing that the same person appears at two different times or in two different places, or that someone performs a particular action in a particular location [12]. These tasks may be problematic for traditional supervised recognition strategies due to difficulty in obtaining training data—expecting humans to label a large set of strangers’ faces or categorize every possible action that might occur in a video is not realistic. However, a human probably can reliably determine whether two face images are of the same person [11] or two recorded actions are similar, making it quite feasible to obtain pairwise constraints in these contexts.
The problem of plant identification is similar in that even untrained non-expert humans [13] (for instance, on a low-cost crowdsourcing tool such as Amazon’s Mechanical Turk [14]) can probably generally determine if two plants are the same species, even if only an expert could actually provide a semantic label for each of those images. Thus, non-expert labor, in conjunction with semi-supervised clustering, can reduce a large set of uncategorized images into a small set of clusters, which can then be quickly labeled by an expert. The same pattern holds true in a variety of other visual domains, such as identifying animals or specific classes of man-made objects, as well as non-visual tasks such as document clustering [15].
However, even when using relatively inexpensive human labor, any attempt to apply semi-supervised clustering methods to large-scale problems must still consider the cost of obtaining large numbers of pairwise constraints. As the number of possible constraints is quadratically related to the number of data elements, the number of possible user queries rapidly approaches a point where only a very small proportion of all constraints can feasibly be queried. Simply querying random constraint pairs from this space will likely generate a large amount of redundant information, and lead to very slow (and expensive) improvement in the clustering results. Worse, Davidson et al. [7] demonstrated that poorly chosen constraints can in some circumstances lead to worse performance than no constraints at all.
To overcome these problems, our community has begun exploring active constraint selection methods [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], which allow semi-supervised clustering algorithms to intelligently select constraints based on the structure of the data and/or intermediate clustering results. These active clustering methods can be divided into two categories: sample-based and sample-pair-based.
The sample-based methods first select samples of interest, then query pairwise constraints based on the selected sample [16, 18, 19]. Basu et al. [16] propose offline (i.e., not based on intermediate clustering results) active k-means clustering based on a two-stage process that first explores the problem space and performs user queries to initialize and grow sets of samples with known cluster assignments, and then extracts a large constraint set from the known sample sets and does semi-supervised clustering. Mallapragada et al. [18] present another active k-means method based on a min-max criterion, which also utilizes an initial “exploration” phase to determine the basic cluster structure. We have also previously proposed two different sample-based active clustering methods [21, 19]. This paper represents an improvement and extension of these works.
By contrast, the sample-pair-based methods [22, 23, 24, 25, 26] directly seek pair constraints to query. Hoi et al. [23] provide a min-max framework to identify the most informative pairs for non-parametric kernel learning and provide encouraging results. However, the complexity of that method (which requires the solution of an approximate semidefinite programming (SDP) problem) is high, limiting both the size of the data and the number of constraints that can be processed. Xu et al. [22] and Wang and Davidson [24] both propose active spectral clustering methods, but both of them are designed for two-class problems, and poorly suited to the multi-class case. Most recently, Biswas and Jacobs [11] propose a method that seeks pair constraints that maximize the expected change in the clustering result. This proves to be a meaningful and useful criterion, but the proposed method requires recomputing potential clustering results many times for each sample pair selected, and is thus slow.
Both types of current approaches suffer from drawbacks: most current sample-based methods are offline algorithms that select all of their constraints in a single selection phase before clustering, and thus cannot incorporate information from actual clustering results into their decisions. Most pair-based methods are online, but have very high computational complexity due to the nature of the pair selection problem (i.e. the need to rank candidate pairs at every iteration), and thus have severely limited scalability.
In this paper, we overcome the limitations of existing methods and propose a novel sample-based active spectral clustering framework using certain-sample sets that performs efficient and effective sample-based constraint selection in an online iterative manner (certain-sample sets are sets containing samples with known pairwise relationships to all other items in the certain-sample sets). In each iteration of the algorithm, we find the sample that will yield the greatest predicted reduction in clustering uncertainty, and generate pairwise queries based on that sample to pass to the human user and update the certain-sample sets for clustering in the next iteration. Usefully, under our framework the number of clusters need not be known at the outset of clustering, but can instead be discovered naturally via human interaction as clustering proceeds (more details in Section 3).
In our framework, we refer to the sample that will yield the greatest expected uncertainty reduction as the most informative sample, and our active clustering algorithm revolves around identifying and querying this sample in each iteration. In order to estimate the uncertainty reduction for each sample, we propose a novel approximated first-order model which decomposes expected uncertainty reduction into two components: a gradient and a step-scale factor. To estimate the gradient, we adopt matrix perturbation theory to approximate the first-order derivative of the eigenvectors of the current similarity matrix with respect to the current sample. For the step-scale factor we use one of two entropy-based models of the current cluster assignment ambiguity of the sample. We describe our framework and uncertainty reduction formulation fully in Section 3.
We compare our method with baseline and state-of-the-art active clustering techniques on three image datasets (face images [9], leaf images [8] and dog images [10]), a set of common UCI machine learning datasets [28] and a gene dataset [29]. Sample images from each set can be seen in Figure 1. Our results (see Section 7) show that given the same number of pairs queried, our method performs significantly better than existing state-of-the-art techniques.
2 Background and Related Work
What is clustering uncertainty? Clustering methods are ultimately built on the relationships between pairs of samples. Thus, for any clustering method, if our data perfectly reflects the “true” relationship between each sample pair, then the method should always achieve the same perfect result. In practice, however, data (and distance/similarity metrics) are imperfect and noisy—the relationship between some pairs of samples may be clear, but for others it is highly ambiguous. Moreover, some samples may have predominantly clear relationships to other samples in the data, while others may have predominantly ambiguous relationships. Since our goal in clustering is to make a decision about the assignment of samples to a cluster, despite the inevitable ambiguity, we can view the overall sample-relationship ambiguity in the data as the uncertainty of our clustering result.
We then posit that the advantage of semi-supervised clustering is that it eliminates some amount of uncertainty, by removing all ambiguity from pair relationships on which we have a constraint. It thus follows that the goal of active clustering should be to choose constraints that maximally reduce the total sample-assignment uncertainty. In order to achieve this, however, we must somehow measure (or at least estimate) the uncertainty contribution of each sample or sample pair in order to choose the one that we expect to yield the greatest reduction. In this paper, we propose a novel first-order model that combines matrix perturbation theory with a notion of local entropy to estimate the uncertainty contribution of each candidate sample; see Section 3.2 for details.
Why sample-based uncertainty reduction? There are two main reasons for proposing a sample-based approach rather than a sample-pair-based one. First, an uncertain pair may be uncertain either because it contains one uncertain sample or because it contains two uncertain samples. In the latter case, because the constraint between these samples will not extrapolate well beyond them, it yields limited information. Second, because the number of candidate pairs grows quadratically with the number of samples, pair selection has an inherently higher complexity, which limits the scalability of a pair-based approach.
Relation to active learning. Active query selection has previously seen extensive use in the field of active learning [30, 31]. Huang et al. [32] and Jain and Kapoor [33], for example, both offer methods similar to ours in that they select and query uncertain samples. However, in active learning algorithms the oracle (the human) needs to know the class label of the queried data point. This approach is not applicable to many semi-supervised clustering problems, where the oracle can only give reliable feedback about the relationship between pairs of samples (such as the many examples we offered in Section 1). Though we implicitly label queried samples by comparing them to a set of exemplar samples representing each cluster, we do so strictly via pairwise queries.
Additionally, for the sake of comparison we begin our experiments with an exploration phase that identifies at least one member of each cluster (thus allowing us to treat the clusters we are learning as “classes” as far as the active learning algorithms are concerned), but in real data this may not be a reliable option. There may simply be too many clusters to fully explore them initially, new clusters may appear as additional data is acquired, or certain clusters may be rare and thus not be encountered for some time. In all of these cases, our active clustering framework can adapt by simply increasing the number of clusters. In contrast, most active learning methods must be initialized with at least one sample of each class in the data, and do not allow online modification of the class structure.
3 Active Clustering Framework With Certain-Sample Sets
Recall that “certain-sample sets” are sets such that any two samples in the same certain-sample set are constrained to reside in the same cluster, and any two samples from different certain-sample sets are guaranteed to be from different clusters. In the ground truth used in our experiments, each class corresponds to a specific certain-sample set. In our framework, we use the concept of certain-sample sets to translate a sample selection into a set of pairwise constraint queries.
Given the data set $\mathcal{X} = \{x_1, \dots, x_N\}$, denote the corresponding pairwise similarity matrix $W$ (i.e. the non-negative symmetric matrix consisting of all $w_{ij}$, where $w_{ij}$ is the similarity between samples $x_i$ and $x_j$). Similarity is computed in some appropriate, problem-specific manner.
Here, we also denote the set of certain-sample sets $\mathcal{Z} = \{Z_1, \dots, Z_M\}$, where $Z_s$ is a certain-sample set such that must-link$(x_i, x_j)$ holds for all $x_i, x_j \in Z_s$ and cannot-link$(x_i, x_j)$ holds for all $x_i \in Z_s$, $x_j \in Z_{s'}$ with $s \ne s'$, and define a sample set $\mathcal{O} = \bigcup_{s} Z_s$ containing all current certain samples. Our semantic constraint information is contained in the set $\mathcal{Q}$, which consists of all the available pairwise constraints. Each of these constraints may be either “must-link” (indicating that two samples belong in the same semantic grouping/certain-sample set) or “cannot-link” (indicating that they do not). To initialize the algorithm, we randomly select a single sample $x_r$ such that $\mathcal{Z} = \{Z_1\}$ with $Z_1 = \{x_r\}$, $\mathcal{O} = \{x_r\}$ and $\mathcal{Q} = \emptyset$. As $\mathcal{Z}$, $\mathcal{O}$ and $\mathcal{Q}$ change over time, we use the notation $\mathcal{Z}^t$, $\mathcal{O}^t$ and $\mathcal{Q}^t$ to indicate each of these and other values at the $t$-th iteration.
Assuming we begin with no pairwise constraints, if the number of clusters in the problem is not known, we set the initial cluster number $K$ to a small value (it will grow as new certain-sample sets are discovered); otherwise we set it to the given number. We then propose the following algorithm (outlined in Figure 2; more details for each step can be found in Sections 3.1–3.3):

Initialization: randomly choose a single sample $x_r$, assign it to the first certain-sample set $Z_1$ and initialize the pairwise constraint set $\mathcal{Q}$ as the empty set.

Constrained Spectral Clustering: cluster all samples into $K$ groups using the raw data plus the current pairwise constraint set $\mathcal{Q}^t$.

Informative Sample Selection: choose the most informative sample based on our uncertainty reduction model.

Pairwise Queries: present a series of pairwise queries on the chosen sample to the oracle until we have enough information to assign the sample to a certain-sample set (or create a new certain-sample set for the chosen sample).

Repeat: steps 2–4 until the oracle is satisfied with the clustering result or the query budget is reached.
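The steps above can be sketched as a minimal loop. This is an illustrative skeleton, not our actual implementation: the informative-sample selection is replaced by a random stand-in, the constrained clustering step is elided with a comment, and `active_loop`, `oracle` and all parameter names are our own.

```python
import numpy as np

def active_loop(W, oracle, budget, seed=0):
    """Skeleton of the certain-sample-set loop. `oracle(i, j)` answers
    must-link queries; sample selection here is a random stand-in for
    the uncertainty-reduction model described later."""
    rng = np.random.default_rng(seed)
    W = W.astype(float).copy()
    n = W.shape[0]
    certain_sets = [[int(rng.integers(n))]]      # initialization (step 1)
    queries = 0
    while queries < budget:
        placed = {i for Z in certain_sets for i in Z}
        pool = [i for i in range(n) if i not in placed]
        if not pool:
            break                                # every sample is certain
        # (constrained spectral clustering would run here each iteration)
        x = pool[int(rng.integers(len(pool)))]   # stand-in for step 3
        home = None
        for Z in certain_sets:                   # step 4: pairwise queries
            queries += 1
            if oracle(x, Z[0]):                  # must-link found
                home = Z
                break
            W[x, Z[0]] = W[Z[0], x] = 0.0        # record cannot-link
            if queries >= budget:
                break
        if home is not None:
            home.append(x)
            W[x, home[0]] = W[home[0], x] = 1.0  # record must-link
        elif queries < budget:
            certain_sets.append([x])             # new cluster discovered
    return certain_sets
```

With a consistent oracle, the loop partitions the data into one certain-sample set per ground-truth cluster, discovering new clusters whenever every existing representative answers cannot-link.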
It should be noted that, aside from the ability to collect maximally useful constraint information from the human, this algorithm has one other significant advantage: the number of clusters in the problem need not be known at the outset of clustering, but can instead be discovered naturally via human interaction as the algorithm proceeds. Whenever the queried pairwise constraints result in the creation of a new certain-sample set, we increment $K$ to account for it. This allows the algorithm to naturally overcome a problem faced not just by other active clustering (and active learning) methods, but by clustering methods in general, which typically require a parameter controlling either the size or number of clusters to generate. This is particularly useful in the image clustering domain, where the true number of output clusters (e.g. the number of unique faces in a dataset) is unlikely to be initially available in any real-world application. We have conducted experiments to evaluate this method of model selection; the results, which are encouraging, are presented in Section 7.5.
Recalling the steps of our framework, from here we proceed iteratively through the three main computational steps: clustering with pairwise constraints, informative sample selection and querying pairwise constraints. We now describe them.
3.1 Spectral clustering with pairwise constraints
Spectral clustering is a well-known unsupervised clustering method [34]. Given the symmetric similarity matrix $W$, denote the Laplacian matrix as $L = D - W$, where $D$ is the degree matrix such that $D_{ii} = \sum_{j} w_{ij}$ and $D_{ij} = 0$ if $i \ne j$. Spectral clustering partitions the $N$ samples into $K$ groups by performing k-means on the first $K$ eigenvectors of $L$. The eigenvectors can be found via:
(1) $L v_k = \lambda_k v_k, \quad k = 1, \dots, K$
where $v_1, \dots, v_K$ are the eigenvectors associated with the $K$ smallest eigenvalues $\lambda_1 \le \dots \le \lambda_K$.
To incorporate pairwise constraints into spectral clustering, we adopt a simple and effective method called spectral learning [35]. Whenever we obtain new pairwise constraints, we directly modify the current similarity matrix $W^t$, producing a new matrix $\bar{W}^t$. Specifically, the new affinity matrix is determined via:

Set $\bar{W}^t = W^t$.

For each pair of must-linked samples $(x_i, x_j)$, assign the values $\bar{w}_{ij}^t = \bar{w}_{ji}^t = 1$.

For each pair of cannot-linked samples $(x_i, x_j)$, assign the values $\bar{w}_{ij}^t = \bar{w}_{ji}^t = 0$.
We then obtain the new Laplacian matrix $\bar{L}^t$ and proceed with the standard spectral clustering procedure.
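A minimal sketch of this constraint-incorporation step, assuming the unnormalized Laplacian $L = D - W$ (the spectral learning reference [35] admits other Laplacian variants) and NumPy's symmetric eigensolver; the function name is ours:

```python
import numpy as np

def spectral_learning_step(W, must_links, cannot_links, k):
    """Edit the affinity matrix per the rules above, then return the
    edited matrix and the first k eigenvectors of L = D - W."""
    Wb = W.astype(float).copy()           # step 1: copy the current W
    for i, j in must_links:               # step 2: must-link -> similarity 1
        Wb[i, j] = Wb[j, i] = 1.0
    for i, j in cannot_links:             # step 3: cannot-link -> similarity 0
        Wb[i, j] = Wb[j, i] = 0.0
    L = np.diag(Wb.sum(axis=1)) - Wb      # Laplacian of the edited matrix
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    return Wb, vecs[:, :k]                # smallest-k eigenvectors for k-means
```

The returned embedding would then be fed to k-means as in standard spectral clustering.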
3.2 Informative sample selection
In this section, we formulate the problem of finding the most informative sample as one of uncertainty reduction. We ultimately develop and discuss a model for this uncertainty reduction in Section 4.
Define the uncertainty of the dataset in the $t$-th iteration to be conditioned on the current updated similarity matrix $\bar{W}^t$ and the current certain-sample set $\mathcal{O}^t$. Thus the uncertainty can be expressed as $U(\mathcal{X} \mid \bar{W}^t, \mathcal{O}^t)$. Therefore our objective function for sample selection is as follows:
(2) $x^{t*} = \arg\max_{x_i \in \mathcal{X} \setminus \mathcal{O}^t} \left[ U(\mathcal{X} \mid \bar{W}^t, \mathcal{O}^t) - U(\mathcal{X} \mid \bar{W}^{t+1}, \mathcal{O}^t \cup \{x_i\}) \right]$
To the best of our knowledge, there is no direct way of computing uncertainty on the data. In order to optimize this objective function, we observe that querying pairs to make a chosen sample “certain” removes ambiguity in the clustering solution and thus reduces the uncertainty of the dataset as a whole. The expected change in the clustering solution that results from making the chosen sample “certain” can therefore be treated as the uncertainty contribution of that sample.
Thus, we seek samples that will have the greatest impact on the clustering solution. One strategy for finding these constraints (employed in Biswas and Jacobs [11], though with sample pairs rather than samples) is to estimate the likely value of a constraint (i.e. cannot- or must-link) and simulate the effect that constraint will have on the clustering solution. However, this approach is both unrealistic (if the answers given by the oracle could be effectively predicted, the oracle would not be needed) and computationally expensive (in the worst case requiring a simulated clustering operation for each possible constraint at each iteration of the active clusterer).
Thus, we adopt a more indirect method of estimating the impact of a sample query, based on matrix perturbation theory and the local entropy of the selected sample. We present the details of our method in Section 4.
3.3 Sample-based pairwise constraint queries
Before presenting our model for informative sample selection, we briefly describe how we use the selected sample. Because our active selection system is sample-based and our constraints pair-based, once we have selected the most informative sample we must then generate a set of pairwise queries related to that sample. Our goal with these queries is to obtain enough information to add the sample to the correct certain-sample set. We generate these queries as follows.
First, for each certain-sample set $Z_s$, choose the single sample $z_s \in Z_s$ that is most similar to the selected sample $x^{t*}$ and record this sample.
Second, since there are $M$ certain-sample sets, we will have recorded $M$ samples and their similarity values. We sort these samples based on their corresponding similarity, then, in order of descending similarity, query the oracle for the relation between the selected sample $x^{t*}$ and each $z_s$ until we find a must-link connection. We then add $x^{t*}$ into the certain-sample set containing that $z_s$. If all of the relations are cannot-link, we create a new certain-sample set and add $x^{t*}$ to it. This new certain-sample set is then added to $\mathcal{Z}$. Regardless, $\mathcal{O}$ is correspondingly updated by adding $x^{t*}$. If the value of $M$ after querying is greater than $K$, we also update $K$ to reflect the newly discovered ground-truth cluster. Since the relation between the new sample and all certain-sample sets in $\mathcal{Z}$ is known, we can now generate new pairwise constraints between the selected sample and all samples in $\mathcal{O}$ without submitting any further queries to the human.
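The query procedure above can be sketched as follows; `query_sample`, `oracle` and the return values are illustrative names, and the bookkeeping for $K$ and the constraint set is omitted:

```python
import numpy as np

def query_sample(x, W, certain_sets, oracle):
    """Pick the closest member of each certain-sample set, sort by
    similarity to x, then query the oracle in descending order until a
    must-link is found (or create a new set if none is)."""
    reps = []
    for idx, Z in enumerate(certain_sets):
        best = max(Z, key=lambda s: W[x, s])   # closest member of set Z
        reps.append((W[x, best], best, idx))
    reps.sort(reverse=True)                    # most similar set first
    n_queries = 0
    for _, rep, idx in reps:
        n_queries += 1
        if oracle(x, rep):                     # must-link found
            certain_sets[idx].append(x)
            return idx, n_queries
    certain_sets.append([x])                   # all cannot-link: new set
    return len(certain_sets) - 1, n_queries
```

Sorting by similarity means that, when the similarity matrix is informative, the must-link answer tends to arrive after very few queries.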
4 Uncertainty Reduction Model for Informative Sample Selection
As described in Section 3.1, we use spectral learning [35] as our clustering algorithm. In spectral learning [35], the clustering result arises from the values of the first $K$ eigenvectors of the current similarity matrix. Therefore, the impact of a sample query on the clustering result can be approximately measured by estimating its impact on $V^t = \{v_1^t, \dots, v_K^t\}$ (the first $K$ eigenvectors):
(3) $\Delta U^t(x_i) \approx \sum_{k=1}^{K} \left\| \Delta v_k^t(x_i) \right\|$
In order to measure $\Delta v_k^t(x_i)$, based on a first-order Taylor expansion, we decompose the change in the eigenvectors into a gradient and a step-scale factor:
(4) $\Delta v_k^t(x_i) \approx \frac{\partial v_k^t}{\partial A^t(x_i)} \cdot \Delta A^t(x_i)$
where $A^t(x_i)$ represents the assignment-ambiguity of $x_i$, and $\Delta A^t(x_i)$ represents the reduction in this ambiguity after querying $x_i$. $\frac{\partial v_k^t}{\partial A^t(x_i)}$ is the first-order derivative of the eigenvectors with respect to this ambiguity reduction. We describe how to estimate this gradient and ambiguity reduction in Sections 4.1 and 4.2, respectively.
4.1 Estimating the gradient of the uncertainty reduction
In order to solve (4) we must first evaluate $\frac{\partial v_k^t}{\partial A^t(x_i)}$. We know that in spectral learning (Section 3.1) the information obtained from the oracle queries is expressed via changes in the similarity values for the queried point contained in $\bar{W}^t$. Given this, changes in ambiguity are always mediated by changes in $\bar{W}^t$, so we can approximate via
(5) $\frac{\partial v_k^t}{\partial A^t(x_i)} \approx \frac{\partial v_k^t}{\partial \bar{W}^t(x_i)}$
where $\partial \bar{W}^t(x_i)$ represents an incremental change in the similarity values of sample $x_i$.
Thus, we must begin by computing $\frac{\partial v_k^t}{\partial w_{ij}^t}$ for each $x_j$, for which we propose a method based on matrix perturbation theory [36]. First note that the graph Laplacian at iteration $t$ can be fully reconstructed from the eigenvectors and corresponding eigenvalues via $L^t = \sum_{m=1}^{N} \lambda_m^t v_m^t (v_m^t)^\top$. Then, given a small constant change in a similarity value $w_{ij}^t$, the first-order change of the eigenvector $v_k^t$ can be calculated as:
(6) $\frac{\partial v_k^t}{\partial w_{ij}^t} = \sum_{m \ne k} \frac{(v_m^t)^\top \frac{\partial L^t}{\partial w_{ij}^t} v_k^t}{\lambda_k^t - \lambda_m^t} \, v_m^t$
Note that $\frac{\partial L^t}{\partial w_{ij}^t} = (e_i - e_j)(e_i - e_j)^\top$, where $e_i$ is the length-$N$ indicator vector of index $i$.
For the chosen sample $x_i$ we take $M$ samples $\{z_1, \dots, z_M\}$, one sampled from each certain-sample set $Z_s \in \mathcal{Z}^t$. If we decide to query the oracle for $x_i$, the relation of $x_i$ to each sample in $\mathcal{O}^t$ will become known, and the corresponding similarity values $w_{i z_s}^t$ in $\bar{W}^t$ will be updated during spectral learning. Therefore, to estimate the influence of sample $x_i$ on the gradient of the eigenvectors, we can simply sum the influences of the relevant $w_{i z_s}^t$ values based on Eq. 6. We thus define our approximate model for the derivative of uncertainty reduction as:
(7) $\frac{\partial v_k^t}{\partial A^t(x_i)} \approx \sum_{s=1}^{M} \left\| \frac{\partial v_k^t}{\partial w_{i z_s}^t} \right\|$
Note that we operate only over the subset of certain samples $\{z_1, \dots, z_M\}$ in order to both save on computation and avoid redundancy. We could simply use the entirety of $\mathcal{O}^t$ in place of $\{z_1, \dots, z_M\}$, but this would likely distort the results. Intuitively, the effect of a must-link constraint is to shift the eigenspace representations of the two constrained samples together. The samples in a certain-sample set should thus have very similar eigenspace representations, so we expect additional constraints between them and $x_i$ to have diminishing returns.
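Eq. 6 can be sketched directly, assuming the unnormalized Laplacian so that $\partial L / \partial w_{ij} = (e_i - e_j)(e_i - e_j)^\top$; the formula also assumes non-degenerate eigenvalues, since the denominator $\lambda_k - \lambda_m$ must not vanish. The function name is ours:

```python
import numpy as np

def eigvec_derivative(L, k, i, j):
    """First-order change of eigenvector v_k of L under a unit symmetric
    increase of w_ij (Eq. 6), using dL/dw_ij = (e_i - e_j)(e_i - e_j)^T.
    Assumes the eigenvalues of L are distinct."""
    vals, vecs = np.linalg.eigh(L)             # ascending eigenvalues
    n = L.shape[0]
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0                     # the vector e_i - e_j
    vk = vecs[:, k]
    dv = np.zeros(n)
    for m in range(n):
        if m == k:
            continue
        # v_m^T (dL/dw_ij) v_k = (v_m^T e)(e^T v_k) since dL is rank one
        coef = (vecs[:, m] @ e) * (e @ vk) / (vals[k] - vals[m])
        dv += coef * vecs[:, m]
    return dv
```

A finite-difference check against a small symmetric perturbation of $W$ confirms the first-order prediction.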
4.2 Estimating the step scale factor for uncertainty reduction
The second component of our uncertainty reduction estimation is $\Delta A^t(x_i)$: the change in the ambiguity of sample $x_i$ as a result of querying that sample. This component serves as the step-scale factor for the gradient $\frac{\partial v_k^t}{\partial A^t(x_i)}$. According to the assumptions in Section 3.3, after a sample is queried the ambiguity resulting from that sample is reduced to 0. This leads to the conclusion that
(8) $\Delta A^t(x_i) = A^t(x_i) - 0 = A^t(x_i)$
Therefore, the problem of estimating the change in ambiguity of a sample reduces to the problem of estimating the current ambiguity of that sample. While this problem still cannot be solved precisely, we present two reasonable heuristics for estimating the ambiguity of a sample. Both are based on the concept of entropy—specifically, the entropy over probability distributions of local cluster labels (an uncertainty estimation strategy that has shown good results in active learning [30]).
Non-parametric structure model for cluster probability. First, consider the current clustering result $C^t = \{c_1^t, \dots, c_K^t\}$, where $c_j^t$ is a cluster and $K$ is the number of clusters. We can then define a simple non-parametric model based on the similarity matrix $\bar{W}^t$ for determining the probability of $x_i$ belonging to cluster $c_j^t$:
(9) $p(c_j^t \mid x_i) = \frac{\sum_{x_l \in c_j^t} w_{il}^t}{\sum_{l \ne i} w_{il}^t}$
Because only local nearest neighbors have large similarity values in relation to a given sample, we can use the $m$ nearest neighbors ($m$-NN) of each point to efficiently approximate the entropy. These neighbors need only be computed once, so this ambiguity estimation process is fast and scalable. In our experiments, we use a small fixed value of $m$.
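A sketch of this non-parametric ambiguity estimate (Eqs. 9 and 12 combined), assuming a precomputed similarity matrix and hard cluster labels; function and parameter names are ours:

```python
import numpy as np

def knn_cluster_entropy(W, labels, m=10):
    """For each sample: the cluster distribution is the similarity mass
    of its m nearest neighbors per cluster (Eq. 9 restricted to the
    neighborhood), and ambiguity is the entropy of that distribution."""
    n = W.shape[0]
    K = labels.max() + 1
    A = np.zeros(n)
    for i in range(n):
        order = np.argsort(W[i])[::-1]            # most similar first
        nbrs = [j for j in order if j != i][:m]   # m nearest neighbors
        mass = np.zeros(K)
        for j in nbrs:
            mass[labels[j]] += W[i, j]            # similarity mass per cluster
        p = mass / mass.sum() if mass.sum() > 0 else np.full(K, 1.0 / K)
        A[i] = -(p[p > 0] * np.log(p[p > 0])).sum()   # entropy (Eq. 12)
    return A
```

A sample equally similar to two clusters receives the maximal two-cluster ambiguity $\log 2$, while a sample deep inside one cluster receives ambiguity near zero.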
Parametric model for cluster probability. Alternately, we can simply use the eigenspace representation $y_i^t$ of each sample $x_i$ produced by the most recent semi-supervised spectral clustering operation to compute a probabilistic clustering solution. We elect to learn a mixture model (MM) on the embedded eigenspace of the current similarity matrix for this purpose:
(10) $p(y_i^t) = \sum_{j=1}^{K} \pi_j \, p(y_i^t \mid \theta_j)$
where $\pi_j$ are the mixing weights and $\theta_j$ are the component parameters. Then, the probability of each data point given each cluster is computed via:
(11) $p(c_j^t \mid y_i^t) = \frac{\pi_j \, p(y_i^t \mid \theta_j)}{\sum_{l=1}^{K} \pi_l \, p(y_i^t \mid \theta_l)}$
In our experiments, we assume a Gaussian distribution for each component, yielding a Gaussian Mixture Model (GMM).
Entropy-based ambiguity model. Whether using the parametric or non-parametric cluster probability model, the ambiguity of sample $x_i$ can be defined, based on entropy, as:
(12) $A^t(x_i) = -\sum_{j=1}^{K} p(c_j^t \mid x_i) \log p(c_j^t \mid x_i)$
We then use this value to approximately represent $\Delta A^t(x_i)$. In combination with the approximate uncertainty gradient computed as in Section 4.1, this allows us to evaluate (4) and effectively estimate the uncertainty reduction for every point $x_i$. From there, solving our sample selection objective (2) is a simple operation. In Figure 3, we show a qualitative example on a small subset of the dog dataset [10]. In the top-left, five ground-truth clusters are shown with their dog-breed labels. The other three panes show clustering with increasing numbers of constraints selected via our method. Notice how clustering initially (with 0 pair constraints) emphasizes dog image appearance and violates many breed boundaries, whereas with 30 and 45 constraints the clusters are increasingly correct.
5 Complexity Analysis
At each iteration, we must select a query sample from among $O(N)$ possibilities, applying our uncertainty-reduction estimation model to each potential sample. Computing the gradient component of the uncertainty model takes $O(MKN)$ time for each sample, where $M$ is the number of certain-sample sets and $K$ is the number of clusters/eigenvectors. $M \le N$, so the complexity of the uncertainty gradient evaluation at each iteration is $O(MKN^2)$. Computing all the step-scale factors costs $O(mN)$ (where $m$ is the number of nearest neighbors) if the non-parametric method is used, or $O(KN)$ for the parametric method. Both $m$ and $K$ are much smaller than $N$, so regardless the total complexity of the active selection process at each iteration is $O(MKN^2)$.
In order to reduce this cost, we adopt a slight approximation. In general, the samples with the largest uncertainty reduction will have both a large step scale and a large gradient. With this in mind, we first compute the step scale for each sample (this is cheaper than computing the gradient, particularly if the non-parametric model is used), then only compute the gradient for the $N_g$ samples with the largest step scales. Assuming $N_g \ll N$, this yields an overall complexity of $O(MKNN_g)$. Note that all results for our method shown in this paper were obtained using this fast approximation. Also note that for large data, the cost of the method will generally be dominated by the spectral clustering itself, which is $O(N^3)$ in the worst case (though potentially significantly cheaper, possibly even linear in $N$ [37, 38], depending on the eigendecomposition method used and the sparseness of the similarity matrix).
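The fast approximation can be sketched in a few lines; `fast_select`, `gradient_fn` and `top` are illustrative names:

```python
import numpy as np

def fast_select(step_scales, gradient_fn, top=20):
    """Rank all samples by the cheap step-scale term, evaluate the
    expensive gradient only for the `top` best candidates, and return
    the candidate maximizing the product of the two terms."""
    idx = np.argsort(step_scales)[::-1][:top]   # largest step scales first
    scores = {i: step_scales[i] * gradient_fn(i) for i in idx}
    return max(scores, key=scores.get)
```

This trades a small risk of missing a high-gradient, low-entropy sample for a large reduction in gradient evaluations per iteration.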
6 Experimental Setup
6.1 Data
We evaluate our proposed active framework and selection measures on three image datasets (leaves, dogs and faces—see Figure 1), one gene dataset [29] and five UCI machine learning datasets [28]. We seek to demonstrate that our method is generally workable for different types of data/applications with a wide range of cluster numbers.
Face dataset: all face images are extracted from a face dataset called PubFig [9], which is a large, real-world face dataset consisting of 58,797 images of 200 people collected from the Internet. Unlike most other existing face datasets, these images are taken in completely uncontrolled settings with non-cooperative subjects. Thus, there is large variation in pose, lighting, expression, scene, camera, and imaging conditions. We use two subsets: Face1 (500 images from 50 different people) and Face2 (200 images from 20 different people).
Leaf dataset: all leaf images are iPhone photographs of leaves against a monochrome background, acquired through the Leafsnap app [8]. We use the same subset (1042 images from 62 species) as in [11]. The feature representations and resulting similarity matrices for the leaf and face datasets are all from [11].
Dog dataset: all dog images are from the Stanford Dogs dataset [10], which contains 20,580 images of 120 breeds of dogs. We extract a subset containing 400 images from 20 different breeds (dog400) and compute the features used in [39]. Affinity is measured via a kernel.
Dataset  Size  Dim.  No. Classes 

Balance  625  4  3 
BUPA Liver Disorders  345  6  2 
Diabetes  768  8  2 
Sonar  208  60  2 
Wine  178  13  3 
Cho’s gene  307  100  5 
6.2 Evaluation protocols
We evaluate all cluster solutions via two commonly used cluster evaluation metrics: the Jaccard Coefficient [40] and V-measure [41].
The Jaccard Coefficient is defined by $\mathrm{Jaccard} = \frac{SS}{SS + SD + DS}$, where:

SS: represents the total number of pairs that are assigned to the same cluster in both the clustering results and the groundtruth.

SD: represents the total number of pairs that are assigned to the same cluster in the clustering results, but to different clusters in the groundtruth.

DS: represents the total number of pairs that are assigned to different clusters in the clustering results, but to the same cluster in the groundtruth.
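A direct implementation of the Jaccard Coefficient from these definitions; pairs assigned to different clusters in both the result and the ground truth are ignored, as the formula implies:

```python
from itertools import combinations

def jaccard_coefficient(pred, truth):
    """Pairwise Jaccard coefficient SS / (SS + SD + DS), matching the
    SS/SD/DS definitions above."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            ss += 1                  # same cluster in both
        elif same_pred:
            sd += 1                  # same in result, different in truth
        elif same_true:
            ds += 1                  # different in result, same in truth
    return ss / (ss + sd + ds)
```

For example, with result [0, 0, 1] against ground truth [0, 0, 0], one pair is SS and two are DS, giving a coefficient of 1/3.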
V-measure is an alternate metric for determining cluster correspondence between a set of ground-truth classes $C$ and a set of clusters $G$, which defines entropy-based measures for the completeness and homogeneity of the clustering results, and computes the weighted harmonic mean of the two. The homogeneity $h$ is:
(13) $h = 1 - \frac{H(C \mid G)}{H(C)}$ (with $h = 1$ when $H(C) = 0$)
where
(14) $H(C \mid G) = -\sum_{j} \sum_{i} \frac{a_{ij}}{N} \log \frac{a_{ij}}{\sum_{i} a_{ij}}$
(15) $H(C) = -\sum_{i} \frac{\sum_{j} a_{ij}}{N} \log \frac{\sum_{j} a_{ij}}{N}$
The completeness $c$ is:
(16) $c = 1 - \frac{H(G \mid C)}{H(G)}$ (with $c = 1$ when $H(G) = 0$)
where
(17) $H(G \mid C) = -\sum_{i} \sum_{j} \frac{a_{ij}}{N} \log \frac{a_{ij}}{\sum_{j} a_{ij}}$
(18) $H(G) = -\sum_{j} \frac{\sum_{i} a_{ij}}{N} \log \frac{\sum_{i} a_{ij}}{N}$
is the number of data samples that are members of class and elements of cluster . The final Vmeasure for a clustering result is then equal to the harmonic mean of homogeneity and completeness:
(19) 
In our case, we weight both measures equally, setting to yield a single accuracy measure.
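The V-measure computation can be sketched directly from a class-cluster contingency table (a minimal illustration in Python; function names are ours):

```python
import numpy as np

def v_measure(truth, pred, beta=1.0):
    """V-measure: weighted harmonic mean of homogeneity and completeness."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    classes, clusters = np.unique(truth), np.unique(pred)
    n = len(truth)
    # a[c, k]: number of samples of class c assigned to cluster k
    a = np.array([[np.sum((truth == c) & (pred == k)) for k in clusters]
                  for c in classes], dtype=float)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_c = entropy(a.sum(axis=1) / n)  # H(C), class entropy
    h_k = entropy(a.sum(axis=0) / n)  # H(K), cluster entropy
    # Conditional entropies H(C|K) and H(K|C), weighted by cluster/class mass
    h_c_k = sum(entropy(a[:, k] / a[:, k].sum()) * a[:, k].sum() / n
                for k in range(a.shape[1]))
    h_k_c = sum(entropy(a[c, :] / a[c, :].sum()) * a[c, :].sum() / n
                for c in range(a.shape[0]))

    hom = 1.0 if h_c == 0 else 1.0 - h_c_k / h_c
    com = 1.0 if h_k == 0 else 1.0 - h_k_c / h_k
    if hom + com == 0:
        return 0.0
    return (1 + beta) * hom * com / (beta * hom + com)
```

Because it only depends on the contingency counts, the score is invariant to permutations of cluster labels.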
6.3 Baseline and state-of-the-art methods
To evaluate our active clustering framework and proposed active constraint selection strategies, we test the following set of methods, including several variations on our own proposed method, a baseline, and multiple state-of-the-art active clustering and learning techniques. From this point forward we refer to our proposed method as Uncertainty Reducing Active Spectral Clustering (URASC). The variants of URASC are:

URASC+N: Proposed model for uncertainty reducing active clustering with gradient and non-parametric step-scale estimation.

URASC+P: Proposed model for uncertainty reducing active clustering with gradient and parametric step-scale estimation.

URASC-GO: Our model without step-scale estimation; only the gradient estimation for each sample is used.

URASC-NO: Our model without gradient estimation; only the non-parametric step-scale is used.

URASC-PO: Our model without gradient estimation; only the parametric step-scale is used.
Our baselines and comparison methods include state-of-the-art pair-based active clustering methods and two active learning methods:

Random: A baseline in which pair constraints are randomly sampled from the available pool and fed to the spectral learning algorithm.

ActiveHACC [11]: An active hierarchical clustering method that seeks pairs that maximize the expected change in the clustering.

CAC1 [25]: An active hierarchical clustering method that heuristically seeks constraints between large nearby clusters.

FFQS [16]: An offline active $k$-means clustering method that uses certain-sample sets to guide constraint selection (as in our method), but selects samples to query either through a farthest-first strategy or at random.

ASC [24]: A binary-only pair-based active spectral clustering method that queries pairs that will yield the maximum reduction in expected pair value error.

QUIRE [32]: A binary-only active learning method that computes sample uncertainty based on the informativeness and representativeness of each sample. We use our certain-sample set framework to generate the requested sample labels from pairwise queries.

pKNN+AL [33]: A min-max-based multi-class active learning method. Again, we use our framework to translate sample label requests into pairwise constraint queries.
7 Results
We run our method and its variants on all of the listed datasets and compare against baselines and competing state-of-the-art techniques.
7.1 Variant methods and baseline
In Figure 4, we compare our parametric and non-parametric methods, as well as the three "partial" URASC procedures, on three image sets and two UCI sets at varying numbers of constraints. We show results in terms of both Jaccard coefficient and V-measure, and observe similar patterns for each. In all cases, our parametric and non-parametric methods perform similarly, with the non-parametric variant holding a modest lead at most, but not all, constraint counts. More importantly, our methods consistently (and in many cases dramatically) outperform the random baseline, particularly as the number of constraints increases. Our methods always show notable improvement as more constraints are provided, in contrast to the random baseline, which, at best, yields minor improvement. Even on the relatively simple Wine dataset, it is clear that randomly selected constraints yield little new information.
Finally, we note that our "complete" methods consistently meet or exceed the performance of the corresponding partial methods. Neither the step-scale-only methods nor the gradient-only method consistently yields better results, but in every case the combined method performs at least on par with the better of the two, and in some cases significantly better than either (see the Sonar results in particular). These results validate the theoretical conception of our method, showing that the combination of gradient and step-scale is indeed the correct way to represent the active selection problem, and that our method's performance is driven by the combined information of both terms.
7.2 Comparison to state-of-the-art active learning methods
We next compare our methods to two active learning methods, as representatives of other pair-based techniques (Figure 5). Here we test on three binary UCI datasets in order to provide a reasonable evaluation of the QUIRE method, which is binary-only.
At least one (and usually both) of our methods outperforms both QUIRE and pKNN+AL in most cases, only definitively losing out at the very low constraint level on the Sonar dataset. As with the random baseline before, the gap between our methods and the competition generally increases with the number of constraints. These results suggest that simply plugging active learning methods into a clustering setting is suboptimal; we can achieve better results by formulating a clustering-specific uncertainty reduction objective.
Also notable is the fact that, between the two active learning methods, QUIRE is clearly superior (at least on problems where it is applicable). This is significant because, like our method, QUIRE seeks to measure the global impact of a given constraint, while pKNN+AL only models local uncertainty reduction. This lends further support to the idea that the effect of a given query should be considered within the context of the entire clustering problem, not just in terms of local statistics.
7.3 Comparison to state-of-the-art active clustering methods
Finally, we test our methods against existing active clustering techniques (as well as the random baseline) and present the results in Figure 6. Not all methods appear in all charts because ASC [24] is applicable only to binary data. Once again, our methods present a clear overall advantage over competing algorithms, and in many cases both our parametric and non-parametric methods far exceed the performance of all others (most dramatically on the Dog dataset).
The only method that comes close to matching our general performance is ActiveHACC, which also seeks to estimate the expected change in the clustering resulting from each potential query. However, this method is much more expensive than ours (due to running a large number of simulated clustering operations for every constraint selection) and fails on the Dog dataset. ASC is also somewhat competitive with our methods, but its binary nature greatly limits its usefulness for solving real-world semi-supervised clustering problems.
Between our two methods, there is still no clear winner, though the non-parametric approach appears to be more reliable, given the relative failure of the parametric approach on the Leaf and Diabetes sets.
7.4 Comparison with noisy input
Our previous experiments are all based on the assumption that the oracle reliably returns a correct ground-truth response every time it is queried. Previous works in active clustering have also relied on this assumption [16, 17, 18, 22, 24, 25, 26]. Obviously, this is not, as a general rule, realistic: human oracles may make errors, and in some problems the ground truth itself may be ambiguous and subjective. Specifically, for the face and leaf datasets used here, Amazon Mechanical Turk experiments [9, 11] have shown that human error is about 1.2% on face queries and 1.9% on leaf queries.
Thus, in order to evaluate our active clustering method in a more realistic setting, we performed a set of experiments with a simulated 2% query error rate on the Face2 and Dog datasets. We plot the results in Figure 7. We find that, while improvement is noticeably slower and noisier than in the previous experiments, our algorithms still demonstrate a significant overall advantage over other active or passive clustering techniques. These results also further emphasize the importance of active query selection in general, as with noise added the net effect of the random queries is actually negative.
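A noisy oracle of this kind can be simulated by flipping the ground-truth pairwise answer with a small probability. The sketch below illustrates the idea under our own naming; it is not the authors' actual experiment code:

```python
import random

def make_noisy_oracle(labels, error_rate=0.02, seed=0):
    """Return a pairwise oracle answering 'same cluster?' queries,
    giving the wrong answer with probability error_rate."""
    rng = random.Random(seed)

    def query(i, j):
        answer = labels[i] == labels[j]  # true must-link/cannot-link response
        if rng.random() < error_rate:
            answer = not answer          # simulated human error
        return answer

    return query
```

With `error_rate=0.02`, roughly one in fifty constraints fed to the clustering algorithm is incorrect, matching the setting described above.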
7.5 Comparison with unknown numbers of clusters
Since one advantage of our method is its ability to dynamically discover the number of clusters based on query results, we analyze how this approach affects performance over time. We thus run our method on the Face1 (50 ground-truth clusters) and Leaf (62 ground-truth clusters) datasets, with the number of clusters initially set to 2 and increasing as new certain-sample sets are discovered. Our results are shown in Figure 8. The results are promising: the unknown-cluster-count results are initially much lower (as expected) but converge over time toward the known-count results as the cluster structure is discovered. On both datasets tested, the two eventually become indistinguishable.
8 Conclusion
In this paper, we present a novel sample-based online active spectral clustering framework that actively selects pairwise constraint queries with the goal of minimizing the uncertainty of the clustering problem. To estimate uncertainty reduction, we use a first-order Taylor expansion to decompose it into a gradient (estimated via matrix perturbation theory) and a step-scale (based on one of two models of local label entropy). We then use pairwise queries to disambiguate the sample with the largest estimated uncertainty reduction. Our experimental results validate this decomposed model of uncertainty and support our theoretical conception of the problem, while demonstrating performance significantly superior to existing state-of-the-art algorithms. Moreover, our experiments show that our method is robust to noise in the query responses and functions well even when the number of clusters is initially unknown.
One avenue of future research involves reducing the computational burden of the active selection process by adjusting the algorithm to select multiple query samples at each iteration. The naive approach to this problem, selecting the most uncertain points, may yield highly redundant information, so a more nuanced technique is necessary. With this adjustment, this active spectral clustering method could become a powerful tool for use in large-scale online problems, particularly in the increasingly popular crowdsourcing domain.
Acknowledgements
We are grateful for the support in part provided through the following grants: NSF CAREER IIS-0845282, ARO YIP W911NF-11-1-0090, DARPA Mind's Eye W911NF-10-2-0062, DARPA CSSG D11AP00245, and NPS N00244-11-1-0022. Findings are those of the authors and do not reflect the views of the funding agencies.
References
 [1] S. Basu, M. Bilenko, and R. Mooney, “A probabilistic framework for semi-supervised clustering,” in SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2004, pp. 59–68.
 [2] Z. Li and J. Liu, “Constrained clustering by spectral kernel learning,” in International Conference on Computer Vision. IEEE, 2009, pp. 421–427.
 [3] Z. Lu and M. Carreira-Perpiñán, “Constrained spectral clustering through affinity propagation,” in Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
 [4] E. Xing, A. Ng, M. Jordan, and S. Russell, “Distance metric learning with application to clustering with side-information,” Advances in Neural Information Processing Systems, pp. 521–528, 2003.
 [5] S. Hoi, R. Jin, and M. Lyu, “Learning nonparametric kernel matrices from pairwise constraints,” in International Conference on Machine Learning. ACM, 2007, pp. 361–368.
 [6] L. Chen and C. Zhang, “Semi-supervised variable weighting for clustering,” in SIAM International Conference on Data Mining, 2011.
 [7] I. Davidson, K. Wagstaff, and S. Basu, “Measuring constraintset utility for partitional clustering algorithms,” in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer, 2006.
 [8] Leafsnap, http://leafsnap.com/.
 [9] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and simile classifiers for face verification,” in International Conference on Computer Vision. IEEE, 2009, pp. 365–372.
 [10] A. Khosla, N. Jayadevaprakash, B. Yao, and F. Li, “Novel dataset for fine-grained image categorization,” in First Workshop on Fine-Grained Visual Categorization, Conference on Computer Vision and Pattern Recognition, 2011.
 [11] A. Biswas and D. Jacobs, “Active image clustering with pairwise constraints from humans,” International Journal of Computer Vision, 2014.
 [12] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in Conference on Computer Vision and Pattern Recognition, 2013.
 [13] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, and J. V. Soares, “Leafsnap: A computer vision system for automatic plant species identification,” in European Conference on Computer Vision. Springer, 2012, pp. 502–516.
 [14] M. Buhrmester, T. Kwang, and S. D. Gosling, “Amazon’s Mechanical Turk: a new source of inexpensive, yet high-quality, data?” Perspectives on Psychological Science, vol. 6, no. 1, pp. 3–5, 2011.
 [15] R. Huang and W. Lam, “An active learning framework for semi-supervised document clustering with language modeling,” Data & Knowledge Engineering, vol. 68, no. 1, pp. 49–67, 2009.
 [16] S. Basu, A. Banerjee, and R. Mooney, “Active semi-supervision for pairwise constrained clustering,” in International Conference on Data Mining, 2004.
 [17] Y. Fu, B. Li, X. Zhu, and C. Zhang, “Do they belong to the same class: active learning by querying pairwise label homogeneity,” in International Conference on Information and Knowledge Management. ACM, 2011, pp. 2161–2164.
 [18] P. Mallapragada, R. Jin, and A. Jain, “Active query selection for semi-supervised clustering,” in International Conference on Pattern Recognition. IEEE, 2008, pp. 1–4.
 [19] C. Xiong, D. M. Johnson, and J. J. Corso, “Online active constraint selection for semi-supervised clustering,” in European Conference on Artificial Intelligence, Active and Incremental Learning Workshop, 2012.
 [20] C. Xiong, D. M. Johnson, and J. J. Corso, “Spectral active clustering via purification of the k-nearest neighbor graph,” in European Conference on Data Mining, 2012.
 [21] C. Xiong, D. M. Johnson, and J. J. Corso, “Uncertainty reduction for active image clustering via a hybrid global-local uncertainty model,” in AAAI Conference on Artificial Intelligence (Late-Breaking Developments), 2013.
 [22] Q. Xu, M. Desjardins, and K. Wagstaff, “Active constrained clustering by examining spectral eigenvectors,” in Discovery Science. Springer, 2005, pp. 294–307.
 [23] S. Hoi and R. Jin, “Active kernel learning,” in International Conference on Machine Learning. ACM, 2008, pp. 400–407.
 [24] X. Wang and I. Davidson, “Active Spectral Clustering,” in International Conference on Data Mining, 2010.
 [25] A. Biswas and D. Jacobs, “Large scale image clustering with active pairwise constraints,” in International Conference on Machine Learning 2011 Workshop on Combining Learning Strategies to Reduce Label Cost, 2011.
 [26] F. Wauthier, N. Jojic, and M. Jordan, “Active spectral clustering via iterative uncertainty reduction,” in SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 1339–1347.
 [27] S. Xiong, J. Azimi, and X. Z. Fern, “Active learning of constraints for semi-supervised clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. PrePrints, p. 1, 2013.
 [28] K. Bache and M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
 [29] R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, D. Lockhart et al., “A genome-wide transcriptional analysis of the mitotic cell cycle,” Molecular Cell, vol. 2, no. 1, pp. 65–73, 1998.
 [30] B. Settles, “Active learning literature survey,” University of Wisconsin, Madison, 2010.
 [31] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang, “Agnostic active learning without constraints,” Arxiv preprint arXiv:1006.2588, 2010.
 [32] S. Huang, R. Jin, and Z. Zhou, “Active learning by querying informative and representative examples.” Advances in Neural Information Processing Systems, 2010.
 [33] P. Jain and A. Kapoor, “Active learning for large multiclass problems,” in Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 762–769.
 [34] A. Y. Ng, M. I. Jordan, Y. Weiss et al., “On spectral clustering: Analysis and an algorithm,” Advances in Neural Information Processing Systems, vol. 2, pp. 849–856, 2002.
 [35] S. D. Kamvar, D. Klein, and C. D. Manning, “Spectral learning,” in International Joint Conference on Artificial Intelligence, 2003.
 [36] G. Stewart and J. Sun, Matrix perturbation theory. Academic press New York, 1990, vol. 175.
 [37] X. Chen and D. Cai, “Large scale spectral clustering with landmark-based representation,” in AAAI Conference on Artificial Intelligence, 2011.
 [38] D. Yan, L. Huang, and M. I. Jordan, “Fast approximate spectral clustering,” in SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 907–916.
 [39] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
 [40] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.
 [41] A. Rosenberg and J. Hirschberg, “V-measure: A conditional entropy-based external cluster evaluation measure,” in Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 410–420.