Semi-Supervised Active Clustering with Weak Oracles

Taewan Kim and Joydeep Ghosh
Department of Electrical and Computer Engineering,
The University of Texas at Austin, USA
{twankim,jghosh}@utexas.edu
Abstract

Semi-supervised active clustering (SSAC) utilizes the knowledge of a domain expert to cluster data points by interactively making pairwise “same-cluster” queries. However, it is impractical to ask human oracles to answer every pairwise query. In this paper, we study the influence of allowing “not-sure” answers from a weak oracle and propose algorithms to efficiently handle the resulting uncertainties. Different types of model assumptions are analyzed to cover realistic scenarios of oracle abstention. In the first model, the random-weak oracle, an oracle randomly abstains with a certain probability. We also propose two distance-weak oracle models which simulate the case of getting confused based on the distance between the two points in a pairwise query. For each weak oracle model, we show that a small query complexity is adequate for effective k-means clustering with high probability. Sufficient conditions for the guarantee include the -margin property of the data and the existence of a point close to each cluster center. Furthermore, we provide a sample complexity with a reduced effect of the clusters' margin and only a logarithmic dependency on the data dimension. Our results allow a significantly smaller number of same-cluster queries if the margin between the clusters is tight. Experimental results on synthetic data show the effective performance of our approach in overcoming uncertainties.

1 Introduction

Clustering is one of the most popular approaches for extracting meaningful insights from unlabeled data. However, clustering is also very challenging for a wide variety of reasons Jain et al. (1999). First, finding the optimal solution of even the simple k-means objective is known to be NP-hard Davidson and Ravi (2005); Mahajan et al. (2009); Vattani (2009); Reyzin (2012). Second, the quality of a clustering is difficult to evaluate without context.

Semi-supervised clustering is one way to overcome these problems by providing a small amount of additional knowledge related to the task. Various kinds of supervision can help unsupervised clustering: labeled samples, pairwise must-link or cannot-link constraints on elements, and split/merge requests Basu et al. (2002, 2004a); Balcan and Blum (2008). Since domain experts have a clear understanding of the nature of their datasets, generating a small amount of supervised information should not be a challenging task for them. For example, a few pairs of samples can be drawn from a large collection of unlabeled animal images, and a participant can decide whether each pair must be in the same cluster or not.

Assumptions on the characteristics of a dataset itself can also assist a clustering problem. Constraints related to a margin, or a distance between different clusters, are widely used in theoretical works. Although these are strong assumptions, a margin ensures the existence of an optimal clustering, which coincides with a human expert’s judgment.

The semi-supervised active clustering (SSAC) framework proposed by Ashtiani et al. (2016) combines both margin property and pairwise constraints in the active query setting. A domain expert can help clustering by answering same-cluster queries, which ask whether two samples belong to the same cluster or not. By using an algorithm with two phases, it was shown that the oracle’s clustering can be recovered in polynomial time with high probability. However, their formulation of the same-cluster query has only two choices of answers, yes or no. This might be impractical as a domain expert can also encounter ambiguous situations which are difficult to respond to in a short time.

Therefore, we provide an SSAC framework that can also make good use of weak supervision by allowing a “not-sure” response. We first analyze the effect of a weak oracle with random behavior and show that the oracle's clustering can still be discovered with active same-cluster queries. Then several types of weak oracles are defined, and a minor assumption is shown to ensure the recovery of the oracle's clustering with high probability.

1.1 Our Contributions

We provide novel and efficient semi-supervised active clustering algorithms for the center-based clustering task, which can discover the inherent clustering of an imperfect oracle. Our work is motivated by the SSAC algorithm Ashtiani et al. (2016) and the following question: “Is it possible to perform a clustering task efficiently even with a non-ideal domain expert?”. We answer this question by formulating different types of weak oracles and proving that the SSAC algorithm can still work well under uncertainties by using properly modified binary search schemes.

The SSAC algorithm is composed of two phases, estimation of a cluster center and then of the cluster radius. Both phases are affected by not-sure answers, and each phase is analyzed to ensure a good estimate. Non-trivial strategies are developed by utilizing the characteristics of weak oracles. Our paper combines the findings from both phases and provides a unified probabilistic guarantee for the success of the entire algorithm.

Two realistic weak oracle models are introduced in the paper. First, if an oracle answers “not-sure” randomly with at most some fixed probability, we prove that reasonably increased sampling and query sizes still lead to a successful approximation of the true cluster centers and radii. Our result generalizes the SSAC setting in which a query has no not-sure option.

Next, we suggest practical weak-oracle model assumptions based on reasonable cases that may lead to ambiguity in answering a same-cluster query. In particular, we consider two scenarios: (i) the distance between two points from different clusters is too small, and (ii) the distance between two points within the same cluster is too large. If there exists at least one cluster element close enough to the center, an oracle's clustering can be recovered with high probability. This close point is identified from a good approximation of the cluster center and removes the uncertainty in estimating the radius of a cluster. In fact, this practical strategy is based on the idea of exploiting the deterministic behavior of distance-weak oracles, and our assumption on the existence of points close to the center is very natural. Two different distance-based weak oracles are considered, and our algorithm can resolve both types.

Query complexity is obtained by utilizing a matrix concentration inequality Tropp (2012), which relies on the -margin property. Our new theoretical result shows that the SSAC algorithm requires fewer samples than the bound proved by Ashtiani et al. (2016) when the margin between clusters is tight and the dimension of the data is .

Finally, experimental results on synthetic data show the effective performance of our approach in overcoming uncertainties. In particular, our weak oracle model with random behavior is simulated with known ground truth, and the algorithm successfully deals with not-sure answers.

Remark 1.

Proofs for theoretical results are deferred to Appendix B with additional analyses.

1.2 Related Work

Semi-supervised clustering ideas have been actively studied since the 2000s Basu et al. (2002, 2004a, 2004b); Kulis et al. (2009). Basu et al. (2002) used seeding, i.e. given cluster assignments on a subset of the data, as a form of supervision. Later, a similar form was considered by Ashtiani and Ben-David (2015). They mapped data to a proper representation space based on the clustering of a small random sample and applied k-means in the new space.

One of the most popular forms of supervision is pairwise constraints, i.e. must-link/cannot-link knowledge. Basu et al. (2004a) introduced these pairwise constraints into a clustering objective function and formulated the problem based on Hidden Markov Random Fields. Then, a probabilistic framework with pairwise constraints was introduced Basu et al. (2004b), which was generalized by Kulis et al. (2009) to weighted kernel k-means and graph clustering problems. Our work uses similar same-cluster queries, but interactively queries the oracle.

Active semi-supervised clustering frameworks were investigated in earlier works. Cohn et al. (2003) proposed an iterative solution to the clustering problem using active reactions from users, but provided no theoretical guarantees on the result. Basu et al. (2004a) also suggested an active semi-supervised clustering algorithm similar to our approach with an additional step of finding good pairs from the dataset, which improves the initial guess of clusters. Our result differs from their work as we consider uncertainties in queries and provide different types of probabilistic guarantees.

Mazumdar and Saha (2016) also tackle a clustering problem with the support of oracles and side information. The distance between points in our work can be viewed as one example of side information. However, the main motivation differs from ours: we focus on not-sure answers, whereas they consider incorrect answers.

In this paper, we assume that the problem satisfies a center-based clustering framework. Balcan and Liang (2016) studied algorithms to deal with perturbations of center-based objectives, including -center proximity. On the other hand, our work relies on the -margin property of the data, and perturbation resilience is provided in the form of a high-probability guarantee of success.

The work most related to this paper is Ashtiani et al. (2016), which first introduced the SSAC framework. They presented a probabilistic guarantee of recovering an oracle's clustering together with a proof of the hardness of the problem. Instead of revisiting the NP-hardness they proved, this paper focuses on the performance of the SSAC algorithm with weak oracles to deal with practical uncertainty issues.

2 Problem Setting

2.1 Background

The SSAC framework was originally developed based on two important assumptions: a center-based clustering and a -margin property Ashtiani et al. (2016). For the purpose of theoretical analysis, the domain of the data is assumed to be a Euclidean space, and each center of a clustering is defined as the mean of the elements in the corresponding cluster, i.e. . Then, an optimal solution of the k-means clustering satisfies the conditions for center-based clustering (this holds for all Bregman divergences; Banerjee et al. (2005)).

Definition 1 (Center-based clustering).

Let with . A clustering is a center-based clustering of with clusters, if there exists a set of centers satisfying the following condition: and where is a distance measure.

Also, a -margin property ensures the existence of an optimal clustering. Figure 1 visually depicts the -margin property to help illustrate its characteristics.

Definition 2 (-margin property - Clusterability).

Let be a center-based clustering of with clusters and corresponding centers . satisfies the -margin property if the following condition is true:

Figure 1: Visual representation of the -margin property.

2.2 Problem Formulation

A semi-supervised clustering algorithm is applied to data satisfying the -margin property with respect to the oracle's clustering , and the algorithm is supported by a weak oracle that answers weak same-cluster queries.

Definition 3 (Weak Same-cluster Query).

A weak same-cluster query asks whether two data points belong to the same cluster and receives one of three responses (yes, no, or not-sure) from an oracle with a clustering .

In our framework, the cluster-assignment process uses weak same-cluster queries and therefore only depends on pairwise information provided by weak oracles. In short, points with known cluster assignments from different clusters are used to determine the assignment of a given point. If an oracle outputs a yes or no answer for at least  of the pairwise weak queries, we can perfectly discover the cluster assignment of the point. Also, a single yes answer among the weak same-cluster queries directly gives the cluster the point belongs to. See Appendix B.1 for the detailed pairwise cluster-assignment process. The term “cluster-assignment query” is also used instead of “weak pairwise cluster-assignment query” throughout the paper.
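The following is a minimal Python sketch of this pairwise cluster-assignment logic. It assumes a hypothetical query function weak_query(x, y) that returns 'yes', 'no', or 'not_sure'; it illustrates the rule described above rather than reproducing the paper's implementation.

def assign_cluster(x, representatives, weak_query):
    """Infer the cluster of x from weak same-cluster queries against one
    representative point per already-discovered cluster.
    representatives: list of (cluster_index, representative_point) pairs.
    Returns a cluster index, 'new' if x matches no known cluster, or None
    if not-sure answers leave the assignment undecided."""
    saw_not_sure = False
    for idx, rep in representatives:
        answer = weak_query(x, rep)
        if answer == 'yes':        # a single yes answer pins down the cluster
            return idx
        if answer == 'not_sure':
            saw_not_sure = True
    if saw_not_sure:
        return None                # the assignment cannot be decided
    return 'new'                   # all answers were no: x starts a new cluster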

Definition 4 (Weak Pairwise Cluster-assignment Query).

A weak pairwise cluster-assignment query identifies the cluster index of a given data point by asking weak same-cluster queries , where . One of the responses  is inferred from an oracle with .  is a permutation defined on  which is determined during the assignment process.

3 SSAC with Random-Weak Oracles

3.1 Random-Weak Oracle

One way of modeling the performance of weak oracles is to define a maximum probability of answering not-sure. We call this a random-weak oracle, which is a natural assumption and a mathematical abstraction for theoretical studies, analogous to the binary erasure channel in information theory. This fundamental assumption is meaningful as domain experts can make mistakes or encounter hard samples with a certain frequency. In addition, this model covers realistic scenarios in which signals can be lost or answers may not be received. For example, if a time limit is imposed on answering each query to increase the speed of the algorithm, even a perfect domain expert can miss answering a same-cluster query. This situation is well captured by the random-weak oracle model by replacing the role of the not-sure option with the event of missing an answer.

Definition 5 (Random-Weak Oracle).

An oracle is said to be random-weak with a parameter , if with probability at most for given two points .
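As an illustration only, the sketch below simulates a random-weak oracle on top of ground-truth labels; the name q for the abstention probability and the index-based interface are our own choices for the example.

import numpy as np

def make_random_weak_oracle(labels, q, seed=0):
    """Return a weak same-cluster query that abstains ('not_sure') with
    probability q and otherwise answers truthfully from the labels."""
    rng = np.random.default_rng(seed)

    def weak_query(i, j):
        if rng.random() < q:       # the oracle abstains
            return 'not_sure'
        return 'yes' if labels[i] == labels[j] else 'no'

    return weak_query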

Input:  Dataset , an oracle for weak query , target number of clusters , sampling numbers , and a parameter .
1:  
2:  for  to  do
3:     - Phase 1:
4:            // Draw samples from
5:     for  do
6:               // Pairwise cluster-assignment query
7:     end for
8:     ,
9:     - Phase 2:
10:            // Increasing order of
11:     Select BinarySearch algorithm based on the type of a weak oracle BinarySearch()      // Same-cluster query
12:     
13:  end for
Output:  A clustering of the set
Algorithm 1 SSAC for Weak Oracles
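The Python sketch below restates the two-phase structure of Algorithm 1 under our own simplifying assumptions: index-based queries, Euclidean distance, one representative point per discovered cluster, and a caller-supplied find_boundary routine standing in for Algorithm 2 or 3. The parameter name eta for the Phase 1 sample size is illustrative, and the sketch is not the reference implementation.

import numpy as np

def ssac_weak(X, weak_query, k, eta, find_boundary, seed=0):
    """Two-phase SSAC loop: Phase 1 estimates a cluster center from weak
    cluster-assignment queries; Phase 2 estimates the extent of the cluster
    with a weak-oracle binary search over points sorted by distance to the
    empirical center. find_boundary returns the index (in the sorted order)
    of the farthest point still in the cluster, or -1 on failure."""
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))          # indices not yet clustered
    clusters, reps = [], []                  # reps[c] = one known member of cluster c

    for c in range(k):
        if not remaining:
            break
        # Phase 1: draw eta samples and keep those assigned to the new cluster c.
        sample = rng.choice(remaining, size=min(eta, len(remaining)), replace=False)
        assigned = []
        for s in sample:
            answers = [weak_query(int(s), r) for r in reps]
            if 'yes' in answers or 'not_sure' in answers:
                continue                     # s is in a known cluster, or undecided
            assigned.append(int(s))          # all answers were no: s belongs to cluster c
        if not assigned:
            break                            # Phase 1 failed for this cluster
        mu = X[assigned].mean(axis=0)        # empirical center of cluster c

        # Phase 2: sort remaining points by distance to mu and find the boundary.
        order = sorted(remaining, key=lambda j: float(np.linalg.norm(X[j] - mu)))
        last = find_boundary(X, order, mu, assigned, weak_query)
        members = order[:last + 1] if last >= 0 else list(assigned)

        clusters.append(members)
        reps.append(assigned[0])
        remaining = [j for j in remaining if j not in set(members)]
    return clusters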

Two parts of the SSAC algorithm must be reconsidered to analyze the influence of not-sure answers from the oracle. First, the number of sampled elements for cluster-assignment queries must be sufficient to accurately approximate the cluster center. Intuitively, more samples or queries are required if our semi-supervision has a chance of failure. The second step of the algorithm estimates a radius from the sample mean to recover the oracle's cluster based on distances. A binary search technique plays an important role in keeping the query complexity logarithmic. However, weak oracles can cause failures in intermediate search steps. Therefore, we provide a simple extension of binary search with repeated weak same-cluster queries, Algorithm 2, to mitigate the effect of uncertainties in queries. Our first main result shows the perfect recovery of the oracle's clustering under the random-weak model.

Input:  Sorted dataset in increasing order of , an oracle for weak query , target cluster , set of assignment-known points , empirical mean , and a sampling number .
1:  Standard binary search algorithm with the following rules
2:  - Search():
3:  Sample points from .
4:  Weak same-cluster query , for all
5:  if  is in cluster  then Set left bound index as
6:  else if  is not in Cluster  then Set right bound index as
7:  else if not-sure based on queries then Return fail      // See Appendix C to handle failure
8:  end if
9:  - Stop:  Found the smallest index such that is not in
Output:  
Algorithm 2 Random-Weak BinarySearch
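A sketch of the repeated-query rule used in each step of Algorithm 2, written so it can be plugged into the ssac_weak sketch above as find_boundary. The repetition count beta is an illustrative stand-in for the sampling number in the listing, and the failure handling of Appendix C is reduced here to an exception; mu is kept only so the signature matches the distance-weak variant in Section 4.

import numpy as np

def random_weak_find_boundary(X, order, mu, known_members, weak_query, beta=3, seed=0):
    """Binary search over `order` (points sorted by distance to the empirical
    mean) for the farthest point still in the target cluster. Each search step
    asks up to beta weak same-cluster queries against assignment-known members,
    so it fails only if every answer is not-sure."""
    rng = np.random.default_rng(seed)

    def in_cluster(j):
        refs = rng.choice(known_members, size=min(beta, len(known_members)), replace=False)
        for r in refs:
            answer = weak_query(j, int(r))
            if answer != 'not_sure':
                return answer == 'yes'
        raise RuntimeError('all queries were not-sure')  # see Appendix C for handling

    if not in_cluster(order[0]):
        return -1                            # even the closest point is outside the cluster
    lo, hi = 0, len(order) - 1               # invariant: order[lo] is in the cluster
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if in_cluster(order[mid]):
            lo = mid                         # the boundary lies at or beyond mid
        else:
            hi = mid - 1                     # the boundary lies before mid
    return lo                                # index of the farthest in-cluster point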
Theorem 1.

For given data and a distance metric , let be a center-based clustering with the -margin property. Let and . If the sampling parameters satisfy and , then the combination of Algorithms 1 and 2 outputs the oracle's clustering with probability at least .

To prove Theorem 1, we first show that a good approximate cluster center can be obtained with high probability, which leads to a simple recovery of the points within the radius. Then, the probability of success in the binary search steps is evaluated. Refer to Appendix B for detailed proofs, the query complexity, and the runtime.

Remark 2.

The sampling number in Theorem 1 generalizes the result of Ashtiani et al. (2016). If queries are not weak, i.e. , we can achieve the sampling complexity , and their bound can be recovered by using a dimension-independent concentration inequality. Section 3.2 explains the advantage of our approach.

Remark 3.

The numbers of samples in Theorem 1 depend on and , both of which are unknown in real settings. Although there is no explicit way to calculate the margin ,  can be approximated from the ratio of  answers given by an oracle. Proper sampling parameters  can also be obtained through trial and error.

Remark 4.

Since our algorithm utilizes only pairwise feedback from oracles, it subsumes a wide range of general and practical assumptions on oracles. A key motivation of our weak oracle models is uncertainty caused by obscure characteristics in a pair of samples. Therefore, even if some answers to same-cluster queries in a cluster-assignment step are not-sure, the remaining answers are not necessarily not-sure as well. Also, some answers can provide hints for discovering the cluster assignment of a given point in practice. However, our theoretical analysis of the random-weak oracle model provides a worst-case baseline over such realistic situations, and more practical models reflecting this motivation are covered in Section 4.

3.2 Comparison to Dimension Independent Result

The sampling number provided by Ashtiani et al. (2016) is , which is required to guarantee a good approximation of a cluster center with high probability. Their result is founded on a dimension-independent concentration inequality Ashtiani and Ghodsi (2015). However,  can be extremely large if the margin between clusters is tight, i.e.  for some small . Our result reduces the influence of  by using the Vector Hoeffding's Inequality (see Theorem 5 in Appendix A) to obtain a sample complexity of  when the oracle is not weak. In particular, if the dimension of the data is , our approach ensures a smaller query complexity.

4 SSAC with Distance-Weak Oracles

In the previous section, an oracle is assumed to behave arbitrarily when answering weak same-cluster queries. One advantage of such an assumption is its wide coverage of different realistic situations. However, it is more reasonable to evaluate the performance of domain experts by reflecting the range of their knowledge or the inherent ambiguity of the given pair of samples. The cause of a not-sure answer to a same-cluster query can be traced to the distance between the elements in a feature space. Two cases of indefinite answers are considered in this work: (i) points from different clusters are too close, and (ii) points within the same cluster are too far apart. The first situation happens often in the real world. For instance, distinguishing wolves from dogs is not an easy task if a sample like a Siberian Husky is described only by visual features. The second case is also reasonable, because it might be difficult to compare the characteristics of two points within the same cluster if they have quite dissimilar features.
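For illustration, these two confusing cases can be simulated as follows; the threshold names eta_close and nu_far are hypothetical stand-ins for the constants that appear in Definition 6 below.

import numpy as np

def make_local_distance_weak_oracle(X, labels, eta_close, nu_far):
    """Simulate a local distance-weak oracle: it abstains when (i) two points
    from different clusters are closer than eta_close, or (ii) two points from
    the same cluster are farther apart than nu_far; otherwise it answers
    truthfully from the ground-truth labels."""
    def weak_query(i, j):
        dist = float(np.linalg.norm(X[i] - X[j]))
        same = labels[i] == labels[j]
        if (not same and dist < eta_close) or (same and dist > nu_far):
            return 'not_sure'
        return 'yes' if same else 'no'
    return weak_query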

Input:  Sorted dataset in increasing order of , a distance-weak oracle for weak query , target cluster , set of assignment-known points , and empirical mean .
1:  Select a point and use it for same-cluster queries
2:  - Search():
3:  if  then Set left bound index as
4:  else Set right bound index as       // or not-sure
5:  end if
6:  - Stop:  Found the smallest index such that is not in
Output:  
Algorithm 3 Distance-Weak BinarySearch
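The key step of Algorithm 3 is to route every same-cluster query through the assignment-known point closest to the empirical mean, so that a distance-weak oracle never needs to abstain. A sketch compatible with the ssac_weak loop above is given below; treating a (should-not-happen) not-sure answer as “not in the cluster” mirrors the comment in the listing.

import numpy as np

def distance_weak_find_boundary(X, order, mu, known_members, weak_query):
    """Binary search over `order` (points sorted by distance to the empirical
    mean mu) for the farthest point still in the target cluster, asking every
    same-cluster query against the known member closest to mu."""
    anchor = min(known_members, key=lambda m: float(np.linalg.norm(X[m] - mu)))

    def in_cluster(j):
        # A not-sure answer is treated conservatively as "not in the cluster".
        return weak_query(j, anchor) == 'yes'

    if not in_cluster(order[0]):
        return -1
    lo, hi = 0, len(order) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if in_cluster(order[mid]):
            lo = mid
        else:
            hi = mid - 1
    return lo                                # index of the farthest in-cluster point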

4.1 Local Distance-Weak Oracle

We define the first distance-sensitive weak-oracle model, the local distance-weak oracle, in a formal way that covers the two vague situations described before. Conditions (a) and (b) in Definition 6 are the mathematical expressions of cases (i) and (ii), respectively. These confusing cases for a local distance-weak oracle are visually depicted in Figure 2.

Figure 2: Visual representation of the two not-sure cases for the local distance-weak oracle. (Left) Two points from different clusters are too close. (Right) Two points from the same cluster are too far apart.
Definition 6 (Local Distance-Weak Oracle).

An oracle having a clustering for data is said to be local distance-weak with parameters and , if for any given two points satisfying one of the following conditions:

  • , where

  • , where

One way to overcome local distance-weakness is to include at least one good point in each query. If one of the points in the query is close enough to the center of a cluster, a local distance-weak oracle does not get confused when answering. This situation is realistic because one representative data sample of a cluster can serve as a good baseline when comparing against other elements. The next theorem is founded on this intuition, and we show that a modified version of SSAC succeeds if at least one representative sample per cluster is suitable for the weak oracle. In the proof, we first show the effect of a point close to the center on weak queries. Then the possibility of obtaining a close empirical mean is established by defining good sets and calculating a data-driven probability of failure from them. Last, an assignment-known point is identified to remove the uncertainty of the same-cluster queries used in the binary search step.

Theorem 2.

For given data and a distance metric , let be a center-based clustering with the -margin property. Let , , , , and . If a cluster contains at least one point satisfying for all , then the combination of Algorithms 1 and 3 outputs the oracle's clustering with probability at least by asking weak same-cluster queries to a local distance-weak oracle.

4.2 Global Distance-Weak Oracle

A global distance-weak oracle fails to answer depending on the distance of each point to its respective cluster center. In this case, both elements should be within the covered range of the oracle if they do not belong to the same cluster. This represents an oracle that is weaker when one of the points lies outside its knowledge. We assume that the characteristic of distance-weakness within the same cluster, i.e. the second condition of the local distance-weak oracle, is preserved.

Figure 3: Visual representation of the two not-sure cases for the global distance-weak oracle. The red box indicates the difference from the local distance-weak oracle.
Definition 7 (Global Distance-Weak Oracle).

An oracle having a clustering for data is said to be global distance-weak with parameter , if for any given two points satisfying one of the following conditions:

  • or where

  • , where

The difficulty of a global distance-weak oracle compared to the local distance-weak model is the increased ambiguity in distinguishing elements from different clusters. Nevertheless, once we get a good estimate of the center, one good point can still be found to support same-cluster queries in the binary search step. Therefore, Algorithms 1 and 3 can guarantee the recovery of the oracle's clustering with high probability when using a global distance-weak oracle.

Theorem 3.

For given data and a distance metric , let be a center-based clustering with the -margin property. Let , , , and . If a cluster contains at least one point satisfying for all , then the combination of Algorithms 1 and 3 outputs the oracle's clustering with probability at least , by asking weak same-cluster queries to a global distance-weak oracle.

Remark 5.

Our novel approach of using the point closest to the estimated center lets the binary search steps avoid simple repetitive sampling. In fact, this practical strategy is based on the idea of exploiting the deterministic behavior of distance-weak oracles.

Remark 6.

Although different binary search algorithms are developed for each weak oracle model, it is possible to unify Algorithms 2 and 3. First, process a same-cluster query using , the point closest to . Then, if  gives a not-sure answer, more queries can be posed to the weak oracle with additional samples from . In fact, this unified binary search algorithm strengthens the coverage of our approach because it can handle both random-weak and distance-weak oracles at once. (See Appendix C for the detailed algorithm.)
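A sketch of the unified decision rule described in this remark: query first with the known member closest to the empirical mean, and fall back to additional assignment-known samples only if the oracle abstains. The fallback count beta and the helper name are illustrative only.

import numpy as np

def unified_in_cluster(j, mu, X, known_members, weak_query, beta=3, seed=0):
    """Decide whether point j belongs to the target cluster. Ask first against
    the member closest to the empirical mean mu; if the oracle abstains, retry
    with up to beta additional assignment-known members."""
    rng = np.random.default_rng(seed)
    anchor = min(known_members, key=lambda m: float(np.linalg.norm(X[m] - mu)))
    answer = weak_query(j, anchor)
    if answer != 'not_sure':
        return answer == 'yes'
    extra = rng.choice(known_members, size=min(beta, len(known_members)), replace=False)
    for m in extra:
        answer = weak_query(j, int(m))
        if answer != 'not_sure':
            return answer == 'yes'
    return False                             # still undecided: treat as outside the cluster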

5 Experimental Results

In practice, simulating active queries with a domain expert and evaluating probabilistic results is not easy, as one can “game” the system. Therefore, simple cases on synthetic data are simulated where the true cluster assignments are known and the oracle follows the random-weak model. (The source code is available online: https://github.com/twankim/weaksemi)

5.1 Data Generation

For the simulated dataset, the points of each cluster are generated from an isotropic Gaussian distribution. We assume that there exists a ground-truth oracle's clustering, and the goal is to recover it when labels are partially provided via weak same-cluster queries. Various parameters are considered in generating clusters: the number of samples , the dimension of the data , the number of clusters , and the standard deviation  of each Gaussian distribution. For visual representation, 2-dimensional data points are considered, and the other parameters are set to , , and . The data points satisfy the -margin property with condition .
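One way to generate such data is sketched below, using scikit-learn's make_blobs to draw isotropic Gaussian clusters; the specific values of n, d, k, and the standard deviation are examples only and need not match the settings used in the experiments.

import numpy as np
from sklearn.datasets import make_blobs

# Example values for n (samples), d (dimension), k (clusters), and the cluster
# standard deviation; adjust the standard deviation to control the margin.
n, d, k, std = 600, 2, 3, 0.5
X, y_true = make_blobs(n_samples=n, n_features=d, centers=k,
                       cluster_std=std, random_state=0)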

5.2 Evaluation

Each round of the evaluation is composed of experiments with different parameter settings of ;  is the probability of a successful response. The unified binary search algorithm is used, which handles uncertainty by regarding ‘not-sure’ as ‘in different clusters’; hence  is fixed as . The parameters are varied as  and  in each round, and  rounds are repeated.

Two evaluation metrics are considered:  is the ratio of correctly recovered data points averaged over  points, and  is the total number of failures that occurred during cluster assignment. For the evaluation, the best permutation of the cluster labels is determined based on the distances between the estimated centers and the true centers. Formal definitions of the evaluation metrics are stated below.  represents the indicator function, and  and  represent the true and estimated cluster labels, respectively. As a similar number of points is generated per cluster, a mean accuracy averaged over clusters is not considered.
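A sketch of the accuracy metric under the best label permutation is given below. Here the matching between estimated and true cluster labels is computed with the Hungarian algorithm on the center-to-center distance matrix, which is one concrete way to realize the center-distance-based relabeling described above; the helper name permuted_accuracy is our own.

import numpy as np
from scipy.optimize import linear_sum_assignment

def permuted_accuracy(y_true, y_pred, centers_true, centers_pred):
    """Accuracy after relabeling the predicted clusters by matching estimated
    centers to true centers with minimum total distance."""
    cost = np.linalg.norm(centers_true[:, None, :] - centers_pred[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)      # rows: true ids, cols: predicted ids
    relabel = {int(c): int(r) for r, c in zip(rows, cols)}
    y_mapped = np.array([relabel.get(int(p), -1) for p in y_pred])
    return float(np.mean(y_mapped == np.asarray(y_true)))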

5.3 Results

Figure 4: Separable case with a narrow margin. (a) Accuracy of the SSAC algorithm, averaged over experiments (x-axis: number of samples; y-axis: accuracy). (b) Number of cluster-assignment failures of the SSAC algorithm, summed over experiments.

To focus on scenarios with narrow margins,  and  are chosen. Figure 4 shows  in percentage and  for different parameter pairs . The accuracy of recovering the oracle's clustering increases as  increases. This shows the importance of a sufficient number of samples for succeeding in clustering even with the uncertainties caused by a weak oracle. In fact, even a small number of samples is sufficient in practice.

Failures of the SSAC algorithm can happen as it is a probabilistic algorithm. When  is very small, the possibility of failure increases because we have only a few chances to ask cluster-assignment queries. For example, if , only  points are sampled. Then, if all 6 cluster-assignment queries fail, Phase 1 fails. This leads to the recovery of fewer than  clusters because the SSAC algorithm repeats Phase 1 and Phase 2  times. However, such situations rarely occur if  is large enough. Also, failures in the binary search can happen, but we observed that only 2 out of 5000 rounds suffered from them with .

Figure 5: Non-separable case. (a) Accuracy of the SSAC algorithm, averaged over experiments (x-axis: number of samples; y-axis: accuracy). (b) Number of cluster-assignment failures of the SSAC algorithm, summed over experiments.

Results for the non-separable case,  and , are also provided in Figure 5. Even though this case is not covered by our theoretical guarantees, our algorithm still produces a reasonable clustering. See Appendix D for additional results on different settings and scatter plots of the clusterings.

6 Conclusion and Future Work

This paper presents approaches for utilizing a weak oracle in the semi-supervised active clustering (SSAC) framework. Specifically, we suggest two different types of domain experts that can answer “not-sure” to a same-cluster query. First, we consider a random-weak oracle that does not know the answer with at most some fixed probability. Second, two distance-based weak oracle models are considered to simulate realistic situations. For both of these models, probabilistic guarantees on discovering the oracle's clustering, with a small dependency on the margin, are provided based on our devised binary search algorithms. In the distance-based models, a single element close enough to the cluster center is able to mitigate ambiguous supervision. As our weak-oracle assumptions are designed to reflect practical scenarios, application to real-world clustering tasks with actual domain experts would be an interesting research direction. Another future direction is an extension of the framework to accommodate other distance functions or metric learning approaches.

References

  • Ashtiani and Ben-David [2015] Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: a statistical framework. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 82–91. AUAI Press, 2015.
  • Ashtiani and Ghodsi [2015] Hassan Ashtiani and Ali Ghodsi. A dimension-independent generalization bound for kernel supervised principal component analysis. In Proceedings of The 1st International Workshop on “Feature Extraction: Modern Questions and Challenges”, NIPS, pages 19–29, 2015.
  • Ashtiani et al. [2016] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances In Neural Information Processing Systems, pages 3216–3224, 2016.
  • Balcan and Blum [2008] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In International Conference on Algorithmic Learning Theory, pages 316–328. Springer, 2008.
  • Balcan and Liang [2016] Maria Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.
  • Banerjee et al. [2005] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
  • Basu et al. [2002] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.
  • Basu et al. [2004a] Sugato Basu, Arindam Banerjee, and Raymond J Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM international conference on data mining, pages 333–344. SIAM, 2004a.
  • Basu et al. [2004b] Sugato Basu, Mikhail Bilenko, and Raymond J Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 59–68. ACM, 2004b.
  • Cohn et al. [2003] David Cohn, Rich Caruana, and Andrew McCallum. Semi-supervised clustering with user feedback. Constrained Clustering: Advances in Algorithms, Theory, and Applications, 4(1):17–32, 2003.
  • Davidson and Ravi [2005] Ian Davidson and SS Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In Proceedings of the 2005 SIAM international conference on data mining, pages 138–149. SIAM, 2005.
  • Jain et al. [1999] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.
  • Kulis et al. [2009] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1–22, 2009.
  • Mahajan et al. [2009] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation, pages 274–285. Springer, 2009.
  • Mazumdar and Saha [2016] Arya Mazumdar and Barna Saha. Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839, 2016.
  • Reyzin [2012] Lev Reyzin. Data stability in clustering: A closer look. In International Conference on Algorithmic Learning Theory, pages 184–198. Springer, 2012.
  • Tropp [2012] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • Vattani [2009] Andrea Vattani. The hardness of k-means clustering in the plane. Manuscript, accessible at http://cseweb.ucsd.edu/~avattani/papers/kmeans_hardness.pdf, 617, 2009.

Appendix A Concentration Inequality for Random Vectors

To achieve high-probability guarantees, we apply the Vector Hoeffding's inequality. The proof of Theorem 5 uses a transpose dilation technique on the matrix Hoeffding result for symmetric matrices Tropp [2012].

Lemma 4 (Matrix Hoeffding’s Inequality Tropp [2012]).

Let be i.i.d. random, symmetric matrices with dimension , and let be fixed symmetric matrices. Assume that each random matrix satisfies,

Then, for all ,

Definition 8 (Transpose Dilation).

Given a matrix , transpose dilation of is defined as a function :

A well-known property of the transpose dilation is that it preserves the spectral information of the input matrix Tropp [2012], i.e. . In short, for each singular value  of , there exist two corresponding eigenvalues  and  of .
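For concreteness, the standard dilation construction used by Tropp [2012] and the spectral property invoked here can be written as follows (restated from that reference; A denotes an arbitrary d_1 x d_2 matrix):

\mathcal{D}(A) \;=\;
\begin{bmatrix}
  0        & A \\
  A^{\top} & 0
\end{bmatrix}
\in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)},
\qquad
\lambda_{\max}\bigl(\mathcal{D}(A)\bigr) \;=\; \|A\|,

so that each singular value \sigma of A gives rise to the pair of eigenvalues +\sigma and -\sigma of \mathcal{D}(A).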

Theorem 5 (Vector Hoeffding’s Inequality).

Let be random vectors with dimension , and be a sequence of positive values. Assume that each random vector satisfies:

Then, for any ,

Proof.

The overall proof is motivated by the dilation technique introduced by Tropp [2012], which lets us apply concentration inequalities for symmetric random matrices to random vectors.

Let , where is a transpose dilation defined in Definition 8. By the definition of transpose dilation, , and . Combining these with the fact that preserves spectral information gives,

This equality indicates that the -norm of the sum of vectors can be transformed into the spectral norm, or the largest eigenvalue, of the sum of matrices constructed by transpose dilation.

Now let’s bound the square of the random matrix .

This gives

Since a random vector is bounded as , we can say that for any , where represents a identity matrix. Finally, we can define the following constant:

Therefore, by applying directly to the matrix Hoeffding’s inequality, we have,

Appendix B Proofs and Supplementary Analyses

In this section, proofs for theoretical results on both random-weak oracles and distance-weak oracles are provided. Also, supplementary analyses like query complexities and feasible ranges of parameters for distance-weak oracles are presented.

First, we state Lemma 6 which assists theoretical results by introducing a characteristic of points close enough to the cluster center.

Lemma 6 (Lemma 5 of Ashtiani et al. [2016]).

For given data and a distance metric , let be a center-based clustering with the -margin property, and be the set of centers (mean of each cluster) of . Let be a point close to the center such that , where . Then if holds, points in the cluster are closer to than the points of other clusters, i.e.,

B.1 Proofs for Random-Weak Oracles

Analysis on Weak Pairwise Cluster-assignment Query
A single weak pairwise cluster-assignment query is composed of weak same-cluster queries on different pairs , where  is a given point and  is an assignment-known point from each cluster . Therefore, if the oracle outputs a yes or no answer for at least  of the weak queries, we can perfectly discover the cluster assignment of . This probability is lower bounded by  as . So, we can conclude that the probability of receiving a not-sure answer for a given  on a cluster-assignment query is at most .

Also, if only  clusters have been defined so far during the process, the cluster assignment of  can be identified if the weak oracle gives yes or no answers for all  same-cluster queries. In detail, if one yes answer is provided among the  weak queries,  can be assigned to the corresponding cluster, and the case of all no answers is handled by assigning a new cluster to . Then the probability of failure in identifying a cluster assignment is at most  in this case. Accordingly, we use  as an upper bound on the failure probability of a cluster-assignment query to consider the worst case in the further analysis of the sampling complexity.

Lemma 7.

For given data and a distance metric , let be a center-based clustering with the -margin property, and be the set of centers (mean of each cluster) of . Define and as in Algorithm 1 with . If the number of samples, including those with a not-sure cluster assignment, is at least , then the probability of is at most , where .

Proof.

Let without loss of generality (). Let be i.i.d. random vectors having values with probability for any . represents a point randomly sampled from the cluster . Also, let be i.i.d. random variables having with probability and with probability , which are independent of ’s. Note that indicate the cluster-assignment queries where an oracle succeeds in the assignment with probability . Then a sample mean using only assignment-known data points from samples can be represented as follows:

Now, define a new random vector for any . Then, , and its norm is bounded as by definition. By combining for the chance of having not-sure samples, we can achieve an upper bound of the probability of the sample mean being not close to the true mean.

The fourth equality holds as and are independent, and ’s are i.i.d. random vectors. Then the first inequality follows by applying Theorem 5, the Vector Hoeffding's inequality. As , we can conclude that if the number of samples from the cluster, including not-sure ones, is at least , then .

Also, the last equation is a decreasing function of and therefore can be upper bounded by replacing with the lower bound of the cluster-assignment success probability. This concludes the proof because we showed that , and the sufficient number of samples including not-sure for the guarantee becomes . ∎

The sampling number stated in Lemma 7 is a generalized version of the original same-cluster query case. If queries are not weak, i.e. , and the target probability is , one can achieve the sampling complexity , which is required to obtain a close empirical mean using a perfect oracle.

Theorem 1.

For given data and a distance metric , let be a center-based clustering with the -margin property. Let and . If the sampling parameters satisfy and , then the combination of Algorithms 1 and 2 outputs the oracle's clustering with probability at least .

Proof.

For , Phase 1 of Algorithm 1 samples  points from the set . Let  be the cluster corresponding to the sample set . Then, at least  cluster-assignment queries, including not-sure outcomes, are processed for cluster  by the pigeonhole principle. Let us elaborate on this claim. If we sample  data points, there exist both assignment-known points and not-sure ones. By matching the not-sure data to each  proportionally to , it can be concluded that at least  points are sampled from one class with a  chance of failure. Then Lemmas 6 and 7 ensure that the sample mean constructed by the algorithm satisfies the property  for all  and  with probability at least .

In Phase 2, a binary search algorithm can estimate the radius of a cluster from  with  same-cluster queries if the oracle is perfect. However, the binary search fails if at least one search step receives a not-sure output from the weak oracle. The worst-case probability of a failure in a search step can be calculated as . By applying the union bound and using , we can conclude that Algorithm 2 recovers the correct  with probability at least  if . Note that  and the condition becomes . This shows the generality of our result, as the same-cluster query case with a perfect oracle requires only one query per search step.

By combining the above two results, we can say that the output of each iteration in Algorithm 1 is a perfect recovery of  with probability at least . Again, the union bound concludes the proof as the iteration runs  times, i.e. the SSAC algorithm with the modified binary search recovers the clustering of the oracle with probability at least . ∎

The sufficient same-cluster query complexity and the running time (excluding queries) for random-weak oracles can be calculated based on Theorem 1.

Corollary 8.

Let the setting be as in Theorem 1, and let the parameters  and  be set to their minimum sufficient values. Then the query and computational complexities for the combination of Algorithms 1 and 2 are as follows:

  • -weak same-cluster queries:

  • Running time excluding queries:

Proof.

For each iteration with given sampling parameters , Phase 1 requires  weak same-cluster queries, and Phase 2 takes  queries. Also, the distance calculation and sorting in Phase 2 can be done in  and  time, respectively, per iteration. ∎

B.2 Proofs for Distance-Weak Oracles

Before proving the results on distance-weak oracles, we state additional bounds on pairs of data points. The proof of Proposition 9 follows directly from the definition and the triangle inequality.

Proposition 9.

If a clustering of data satisfies the -margin property and has a maximum radius , the following conditions hold:

  • , for all

  • , for all

Proof.

For from different clusters,

Similarly, , which gives (a).

Also, if are from the same cluster ,

which proves (b). ∎

These inequalities imply feasible ranges of the parameters  and  for the quality of distance-weak oracles,  and . Now let us prove our main theoretical results on distance-weak oracles.

Theorem 2.

For given data and a distance metric , let be a center-based clustering with the -margin property. Let , , , , and . If a cluster contains at least one point satisfying for all , then the combination of Algorithms 1 and 3 outputs the oracle's clustering with probability at least by asking weak same-cluster queries to a local distance-weak oracle.

Proof.

First, we show that the local distance-weak oracle always gives a yes or no answer if a given weak query includes a point  located close enough to the center . Let  be a data point satisfying , and suppose the oracle is local distance-weak. If a weak query contains , then,

Moreover, if , then,

Therefore, the two sufficient conditions for a local distance-weak oracle stated in Definition 6 are violated. Hence, any weak same-cluster query including  can be answered by the oracle without any uncertainty. Note that the additional margin of  is not used at this point, which gives a higher chance of estimating a good empirical mean.

Let’s define for each cluster as a set of data points close to center ,

We know that if