Semi-Supervised Active Clustering with Weak Oracles
Abstract
Semi-supervised active clustering (SSAC) utilizes the knowledge of a domain expert to cluster data points by interactively making pairwise "same-cluster" queries. However, it is impractical to ask human oracles to answer every pairwise query. In this paper, we study the influence of allowing "not-sure" answers from a weak oracle and propose algorithms to efficiently handle such uncertainties. Different types of model assumptions are analyzed to cover realistic scenarios of oracle abstention. In the first model, the random-weak oracle, an oracle randomly abstains with a certain probability. We also propose two distance-weak oracle models which simulate the case of getting confused based on the distance between the two points in a pairwise query. For each weak oracle model, we show that a small query complexity is adequate for effective k-means clustering with high probability. Sufficient conditions for the guarantee include a margin property of the data and the existence of a point close to each cluster center. Furthermore, we provide a sample complexity with a reduced effect of the clusters' margin and only a logarithmic dependency on the data dimension. Our results allow a significantly smaller number of same-cluster queries if the margin of the clusters is tight. Experimental results on synthetic data show the effective performance of our approach in overcoming uncertainties.
1 Introduction
Clustering is one of the most popular approaches for extracting meaningful insights from unlabeled data. However, clustering is also very challenging for a wide variety of reasons Jain et al. (1999). Finding the optimal solution of even the simple k-means objective is known to be NP-hard Davidson and Ravi (2005); Mahajan et al. (2009); Vattani (2009); Reyzin (2012). Moreover, the quality of a clustering algorithm is difficult to evaluate without context.
Semi-supervised clustering is one way to overcome these problems by providing a small amount of additional knowledge related to the task. Various kinds of supervision can help unsupervised clustering: labeled samples, pairwise must-link or cannot-link constraints on elements, and split/merge requests Basu et al. (2002, 2004a); Balcan and Blum (2008). As domain experts have a clear understanding of the nature of their datasets, generating a small amount of supervised information should not be a challenging task for them. For example, a few pairs of samples among the large number of unlabeled animal images can be provided, and a participant can decide whether each pair must be in the same cluster or not.
Assumptions on the characteristics of a dataset itself can also assist a clustering problem. Constraints related to a margin, or a distance between different clusters, are widely used in theoretical works. Although these are strong assumptions, a margin ensures the existence of an optimal clustering, which coincides with a human expert’s judgment.
The semi-supervised active clustering (SSAC) framework proposed by Ashtiani et al. (2016) combines both the margin property and pairwise constraints in the active query setting. A domain expert can help clustering by answering same-cluster queries, which ask whether two samples belong to the same cluster or not. Using an algorithm with two phases, it was shown that the oracle's clustering can be recovered in polynomial time with high probability. However, their formulation of the same-cluster query has only two possible answers, yes or no. This can be impractical, as a domain expert may encounter ambiguous situations which are difficult to resolve in a short time.
Therefore, we provide a SSAC framework that can also make good use of weak supervision by allowing a "not-sure" response. We first analyze the effect of a weak oracle with random behavior and establish the possibility of discovering the oracle's clustering with active same-cluster queries. Then several types of weak oracles are defined, and a minor assumption is shown to ensure the recovery of the oracle's clustering with high probability.
1.1 Our Contributions
We provide novel and efficient semi-supervised active clustering algorithms for the center-based clustering task, which can discover the inherent clustering of an imperfect oracle. Our work is motivated by the SSAC algorithm Ashtiani et al. (2016) and the following question: "Is it possible to perform a clustering task efficiently even with a non-ideal domain expert?". We answer this question by formulating different types of weak oracles and prove that the SSAC algorithm can still work well under uncertainties by using properly modified binary search schemes.
The SSAC algorithm is composed of two phases: estimation of a cluster center, and then of the cluster radius. Both phases are affected by not-sure answers, and each phase is analyzed to ensure a good estimate. Non-trivial strategies are developed by utilizing the characteristics of weak oracles. Our paper combines the findings from both phases and provides a unified probabilistic guarantee for the success of the entire algorithm.
Two realistic weak oracle models are introduced in the paper. First, if an oracle answers "not-sure" randomly with at most some fixed probability, we prove that reasonably increased sampling and query sizes can lead to a successful approximation of the true cluster centers and radii. Our result generalizes the SSAC setting without a not-sure option in a query.
Next, we suggest practical weak-oracle model assumptions based on reasonable cases that may lead to ambiguity in answering a same-cluster query. In particular, we consider two scenarios: (i) the distance between two points from different clusters is too small, and (ii) the distance between two points within the same cluster is too large. If there exists at least one cluster element close enough to the center, an oracle's clustering can be recovered with high probability. This close point is identified from a good approximation of the cluster center and removes the uncertainty in estimating the radius of a cluster. In fact, this practical strategy is based on the idea of exploiting the deterministic behavior of distance-weak oracles, and our assumption on the existence of points close to the center is very natural. Two different distance-based weak oracles are considered, and our algorithm can resolve both types.
The query complexity is obtained by utilizing a matrix concentration inequality Tropp (2012), which relies on the margin property. Our new theoretical result shows that the SSAC algorithm requires fewer samples than the bound proved by Ashtiani et al. (2016) when the margin between clusters is tight, at the cost of only a logarithmic dependence on the data dimension.
Finally, experimental results on synthetic data show the effective performance of our approach in overcoming uncertainties. In particular, our weak oracle model with random behavior is simulated with known ground truth, and the algorithm successfully deals with not-sure answers.
Remark 1.
Proofs for theoretical results are deferred to Appendix B with additional analyses.
1.2 Related Work
Semi-supervised clustering ideas have been actively studied since the 2000s Basu et al. (2002, 2004a, 2004b); Kulis et al. (2009). Basu et al. (2002) used seeding, i.e., given cluster assignments on a subset of the data, as a form of supervision. Later, a similar form was considered by Ashtiani and Ben-David (2015). They mapped data to a proper representation space based on the clustering of small random samples and applied k-means in the new space.
One of the most popular forms of supervision is pairwise constraints, i.e., must-link/cannot-link knowledge. Basu et al. (2004a) introduced the application of these pairwise constraints in a clustering objective function, formulated in terms of Hidden Markov Random Fields. Then a probabilistic framework with pairwise constraints was introduced Basu et al. (2004b), which was generalized by Kulis et al. (2009) to weighted kernel k-means and graph clustering problems. Our work uses similar same-cluster queries, but queries the oracle interactively.
Active semi-supervised clustering frameworks were also investigated in earlier works. Cohn et al. (2003) proposed an iterative solution to the clustering problem using active feedback from users, but provided no theoretical guarantees on the result. Basu et al. (2004a) also suggested an active semi-supervised clustering algorithm similar to our approach, with an additional step of finding good pairs from the dataset to improve the initial guess of clusters. Our result differs from their work as we consider uncertainties in queries and provide different types of probabilistic guarantees.
Mazumdar and Saha (2016) also tackle a clustering problem with the support of oracles and side information. The distance between points in our work can be seen as one example of side information. However, their main motivation is different from ours: we focus on not-sure answers, whereas they consider incorrect answers.
In this paper, we assume that the problem satisfies a center-based clustering framework. Balcan and Liang (2016) studied algorithms that deal with perturbations of center-based objectives, including center proximity. In contrast, our work relies on the margin property of the data, and resilience to perturbation is provided in the form of a high-probability guarantee of success.
The most closely related work is Ashtiani et al. (2016), which first introduced the SSAC framework. They established the probability of recovering an oracle's clustering, with an additional proof of the hardness of the problem. Instead of analyzing the NP-hardness proved there, this paper focuses on the performance of the SSAC algorithm with weak oracles to deal with practical uncertainty issues.
2 Problem Setting
2.1 Background
The SSAC framework was originally developed based on two important assumptions: a center-based clustering and a margin property Ashtiani et al. (2016). For the purpose of theoretical analysis, the domain of data is assumed to be the Euclidean space, and each center of a clustering is defined as the mean of the elements in the corresponding cluster, i.e., $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$. Then, an optimal solution of the k-means clustering satisfies the conditions for center-based clustering (this holds for all Bregman divergences; Banerjee et al. (2005)).
Definition 1 (Center-based Clustering).
Let $\mathcal{X} \subset \mathbb{R}^d$ with $|\mathcal{X}| = n$. A clustering $\mathcal{C} = \{C_1, \dots, C_k\}$ is a center-based clustering of $\mathcal{X}$ with $k$ clusters if there exists a set of centers $\{\mu_1, \dots, \mu_k\}$ satisfying the following condition: for every $x \in \mathcal{X}$, $x \in C_i$ implies $d(x, \mu_i) \le d(x, \mu_j)$ for all $j \ne i$, where $d$ is a distance measure.
Also, the margin property ensures the existence of an optimal clustering. Figure 1 visually depicts the margin property to aid understanding.
Definition 2 ($\gamma$-margin Property / Clusterability).
Let $\mathcal{C}$ be a center-based clustering of $\mathcal{X}$ with $k$ clusters and corresponding centers $\{\mu_1, \dots, \mu_k\}$. $\mathcal{C}$ satisfies the $\gamma$-margin property ($\gamma > 1$) if the following condition is true: $\gamma \, d(x, \mu_i) < d(y, \mu_i)$ for all $i$, all $x \in C_i$, and all $y \in \mathcal{X} \setminus C_i$.
2.2 Problem Formulation
A semi-supervised clustering algorithm is applied to data satisfying the $\gamma$-margin property with respect to the oracle's clustering, and it is supported by a weak oracle that receives weak same-cluster queries.
Definition 3 (Weak Same-cluster Query).
A weak same-cluster query $Q(x, x')$ asks whether two data points $x, x' \in \mathcal{X}$ belong to the same cluster, and receives one of three responses (yes, no, or not-sure) from an oracle with a clustering $\mathcal{C}$.
In our framework, the cluster-assignment process uses weak same-cluster queries and therefore only depends on the pairwise information provided by weak oracles. In short, points with known cluster assignments from different clusters are used to determine the assignment of a given point. If an oracle outputs a yes or no answer for at least $k-1$ of the pairwise weak queries, we can perfectly discover the cluster assignment of the point. Also, one yes answer among the weak same-cluster queries directly gives the cluster the point belongs to. See Appendix B.1 for the detailed pairwise cluster-assignment process. The term "cluster-assignment query" is also used instead of "weak pairwise cluster-assignment query" throughout the paper.
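The following minimal Python sketch illustrates this pairwise cluster-assignment logic (the function and parameter names are ours, not from the paper's released code; `oracle` is a stand-in for a weak same-cluster query returning 'yes', 'no', or 'notsure'):

```python
def weak_cluster_assign(x, reps, oracle, k):
    """Infer the cluster index of point x via weak same-cluster queries.

    reps:   one assignment-known representative per discovered cluster,
            as a list of (cluster_index, point) pairs.
    oracle: oracle(a, b) -> 'yes' | 'no' | 'notsure'
    k:      total number of clusters.
    Returns a cluster index, 'new', or None (assignment failed).
    """
    answers = [(idx, oracle(x, p)) for idx, p in reps]
    for idx, ans in answers:
        if ans == 'yes':              # one 'yes' pins down the cluster
            return idx
    unsure = [idx for idx, ans in answers if ans == 'notsure']
    if not unsure:
        return 'new'                  # definite 'no' to every known cluster
    # With all k clusters discovered, k-1 definite 'no's identify the
    # single abstaining cluster by elimination.
    if len(reps) == k and len(unsure) == 1:
        return unsure[0]
    return None                       # not enough definite answers
```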
Definition 4 (Weak Pairwise Cluster-assignment Query).
A weak pairwise cluster-assignment query identifies the cluster index of a given data point $x$ by asking weak same-cluster queries $Q(x, x_i)$, where each $x_i$ is an assignment-known point from a distinct cluster. One of the possible responses (a cluster index or not-sure) is inferred from an oracle with clustering $\mathcal{C}$. A permutation of the cluster indices is determined during the assignment process accordingly.
3 SSAC with Random-Weak Oracles
3.1 Random-Weak Oracle
One way of modeling the performance of weak oracles is to bound the probability of answering not-sure. We call this a random-weak oracle, which is a natural assumption and a mathematical abstraction for theoretical study, analogous to a binary erasure channel in information theory. This fundamental assumption is meaningful, as domain experts can make mistakes or encounter hard samples with a certain frequency. In addition, this model covers realistic scenarios where there is a chance of losing signals or not receiving answers. For example, if a time limit for answering a query is imposed to increase the speed of an algorithm, even a perfect domain expert can miss answering a same-cluster query. This situation is well captured by the random-weak oracle model by replacing the role of the not-sure option with the event of missing an answer.
Definition 5 (Random-Weak Oracle).
An oracle is said to be random-weak with a parameter $q \in [0, 1)$ if, for any two given points $x, x' \in \mathcal{X}$, $\Pr[Q(x, x') = \text{not-sure}] \le q$.
Two parts of the SSAC algorithm must be reconsidered to analyze the influence of not-sure answers from the oracle. First, the number of sampled elements for cluster-assignment queries must be sufficient to accurately approximate the cluster center. Intuitively, more samples or queries are required if our semi-supervision has a chance of failure. The second step of the algorithm estimates a radius from the sample mean to recover the oracle's cluster based on distances. A binary search technique plays an important role in keeping the query complexity logarithmic. However, weak oracles can cause a failure in an intermediate search step. Therefore, we provide a simple extension of binary search with repeated weak same-cluster queries, Algorithm 2, to mitigate the effect of uncertainties in queries. Our first main result shows the perfect recovery of the oracle's clustering under the random-weak model.
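As a concrete illustration of the random-weak model and the repeated-query extension of binary search, here is a minimal Python sketch (our own simulation: `q` is the abstention probability from Definition 5, and the repetition count `t` is a parameter of the modified search; the search assumes that points sorted by distance to the estimated center form a cluster-membership prefix, as guaranteed when the center estimate is good):

```python
import random

def random_weak_oracle(labels, q):
    """Simulate a random-weak oracle over ground-truth labels: it
    abstains with probability q, otherwise answers truthfully."""
    def query(i, j):
        if random.random() < q:
            return 'notsure'
        return 'yes' if labels[i] == labels[j] else 'no'
    return query

def repeated_query(query, i, j, t):
    """Ask the same weak query up to t times and return the first
    definite answer; all t attempts abstain with probability <= q**t."""
    for _ in range(t):
        ans = query(i, j)
        if ans != 'notsure':
            return ans
    return 'notsure'

def estimate_radius_index(sorted_idx, anchor, query, t):
    """Binary-search the last point (in order of distance from the
    estimated center) that the oracle places in the same cluster as
    the assignment-known anchor point."""
    lo, hi = 0, len(sorted_idx) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        ans = repeated_query(query, anchor, sorted_idx[mid], t)
        if ans == 'yes':
            lo = mid
        else:   # 'no', or persistent 'notsure' treated conservatively
            hi = mid - 1
    return lo
```

Treating a persistent not-sure as "different cluster" matches the choice made in the experiments of Section 5.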
Theorem 1.
To prove Theorem 1, we first show that a good approximate cluster center can be obtained with high probability, which leads to a simple recovery of the points within the radius. Then, the probability of success in the binary search steps is evaluated. Refer to Appendix B for detailed proofs, the query complexity, and the runtime.
Remark 2.
Remark 3.
The number of samples in Theorem 1 depends on the margin and the oracle parameter, both of which are unknown in real settings. Although there is no explicit way to calculate the margin, the abstention parameter can be approximated from the ratio of not-sure answers given by the oracle. Proper sampling parameters can also be obtained through trial and error.
Remark 4.
Since our algorithm utilizes only pairwise feedback from oracles, it subsumes a vast range of general and practical assumptions on oracles. A key motivation for our weak oracle models is uncertainty caused by obscure characteristics of a pair of samples. Therefore, even if some answers to same-cluster queries in a cluster-assignment step are not-sure, the remaining answers are not necessarily not-sure as well. Also, some answers can provide hints to discover the cluster assignment of a given point in practice. Nevertheless, our theoretical analysis of the random-weak oracle model provides a baseline for such realistic situations, and more practical models for this motivation are covered in Section 4.
3.2 Comparison to the Dimension-Independent Result
The sample size provided by Ashtiani et al. (2016), which is required to guarantee a good approximation of a cluster center with high probability, is founded on a dimension-independent concentration inequality Ashtiani and Ghodsi (2015). However, that sample size can be extremely large if the margin between clusters is tight, i.e., $\gamma$ is close to 1. Our result decreases the influence of the margin by using the Vector Hoeffding's Inequality (see Theorem 5 in Appendix A) to obtain a reduced sample complexity when the oracle is not weak. In particular, since the dependence on the dimension of the data is only logarithmic, our approach ensures a smaller query complexity in this regime.
4 SSAC with Distance-Weak Oracles
In the previous section, the oracle is assumed to behave arbitrarily when abstaining from weak same-cluster queries. One advantage of such an assumption is its wide coverage of different realistic situations. However, it is more reasonable to evaluate the performance of domain experts by reflecting their range of knowledge and the inherent ambiguity of a given pair of samples. The cause of a not-sure answer to a same-cluster query can be investigated based on the distance between the elements in a feature space. Two cases of indefinite answers are considered in this work: (i) points from different clusters are too close, and (ii) points within the same cluster are too far apart. The first situation is common in the real world. For instance, distinguishing wolves from dogs is not an easy task if a sample like a Siberian Husky is presented through visual features. The second case is also plausible, because it might be difficult to compare the characteristics of two points within the same cluster if they have quite dissimilar features.
4.1 Local Distance-Weak Oracle
We formally define the first distance-sensitive weak-oracle model, the local distance-weak oracle, to capture the two vague situations described above. Conditions (a) and (b) in Definition 6 are mathematical expressions of the two cases (i) and (ii), respectively. These confusing cases for a local distance-weak oracle are visually depicted in Figure 2 for better explanation, and a simulation sketch is given after the definition.
Definition 6 (Local Distance-Weak Oracle).
An oracle having a clustering $\mathcal{C}$ for data $\mathcal{X}$ is said to be local distance-weak with given weakness parameters if $Q(x, x') = \text{not-sure}$ for any two given points $x, x' \in \mathcal{X}$ satisfying one of the following conditions:

(a) $x$ and $x'$ belong to different clusters, but the distance $d(x, x')$ is below a closeness threshold;

(b) $x$ and $x'$ belong to the same cluster, but the distance $d(x, x')$ exceeds a farness threshold.
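A minimal simulation of Definition 6, where `close_thresh` and `far_thresh` are hypothetical stand-ins for the weakness parameters (the threshold forms are ours; the paper's exact parameterization is not reproduced here):

```python
import numpy as np

def local_distance_weak_oracle(X, labels, close_thresh, far_thresh):
    """Simulate a local distance-weak oracle over ground-truth labels."""
    def query(i, j):
        dist = np.linalg.norm(X[i] - X[j])
        same = labels[i] == labels[j]
        if not same and dist <= close_thresh:
            return 'notsure'   # (a) different clusters but too close
        if same and dist >= far_thresh:
            return 'notsure'   # (b) same cluster but too far apart
        return 'yes' if same else 'no'
    return query
```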
One way to overcome local distance-weakness is to include at least one good point in a query. If one of the points in the query is close enough to the center of a cluster, a local distance-weak oracle does not get confused in answering. This situation is realistic because a representative data sample of a cluster can serve as a good baseline when comparing against other elements. The next theorem is founded on this intuition, and we show that a modified version of SSAC succeeds if at least one representative sample per cluster is suitable for the weak oracle. In the proof, we first show the effect of a point close to the center on weak queries. Then the possibility of obtaining a close empirical mean is established by defining good sets and calculating a data-driven probability of failure from them. Finally, an assignment-known point is identified to remove the uncertainty of the same-cluster queries used in the binary search step.
Theorem 2.
For given data $\mathcal{X}$ and a distance metric $d$, let $\mathcal{C}$ be a center-based clustering with the $\gamma$-margin property. Suppose each cluster contains at least one point sufficiently close to its center. Then, for appropriately chosen sampling and search parameters, the combination of Algorithms 1 and 3 outputs the oracle's clustering with probability at least $1 - \delta$ by asking weak same-cluster queries to a local distance-weak oracle.
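A small sketch of the good-point strategy underlying Theorem 2 (the names are ours): after Phase 1 produces an estimated center, the sampled point closest to it is chosen as the anchor for all Phase 2 queries, since a point close enough to the true center is answered deterministically by a distance-weak oracle.

```python
import numpy as np

def pick_anchor(samples, sample_idx, mu_hat):
    """Return the index (into the dataset) of the sampled point
    closest to the estimated center mu_hat."""
    dists = np.linalg.norm(samples - mu_hat, axis=1)
    return sample_idx[int(np.argmin(dists))]
```

The anchor replaces the estimated center itself in the binary search queries, since same-cluster queries must be asked between actual data points.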
4.2 Global Distance-Weak Oracle
A global distance-weak oracle fails to answer depending on the distance of each point to its respective cluster center. In this case, both elements $x$ and $x'$ should be within the covered range of the oracle if they do not belong to the same cluster. This represents an oracle that is weaker when one of the points is outside its knowledge. We also preserve the characteristic of distance-weakness within the same cluster, i.e., the second condition of the local distance-weak oracle.
Definition 7 (Global Distance-Weak Oracle).
An oracle having a clustering $\mathcal{C}$ for data $\mathcal{X}$ is said to be global distance-weak with a given weakness parameter if $Q(x, x') = \text{not-sure}$ for any two given points $x, x' \in \mathcal{X}$ satisfying one of the following conditions:

(a) $x$ and $x'$ belong to different clusters, and at least one of them lies farther from its own cluster center than the oracle's covered range;

(b) $x$ and $x'$ belong to the same cluster, but the distance $d(x, x')$ exceeds a farness threshold, as in Definition 6.
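A corresponding simulation of Definition 7 (again with hypothetical parameter names: `rho` for the oracle's covered range around each center, `far_thresh` for the within-cluster threshold shared with the local model):

```python
import numpy as np

def global_distance_weak_oracle(X, labels, centers, rho, far_thresh):
    """Simulate a global distance-weak oracle over ground-truth labels."""
    def query(i, j):
        same = labels[i] == labels[j]
        d_i = np.linalg.norm(X[i] - centers[labels[i]])
        d_j = np.linalg.norm(X[j] - centers[labels[j]])
        if not same and max(d_i, d_j) > rho:
            return 'notsure'   # (a) a point lies outside the covered range
        if same and np.linalg.norm(X[i] - X[j]) >= far_thresh:
            return 'notsure'   # (b) within-cluster pair too far apart
        return 'yes' if same else 'no'
    return query
```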
Compared to the local distance-weak model, a global distance-weak oracle suffers from increased ambiguity in distinguishing elements from different clusters. Nevertheless, once we obtain a good estimate of the center, one good point can still be found to support the same-cluster queries in the binary search step. Therefore, Algorithms 1 and 3 can guarantee recovery of the oracle's clustering with high probability when using a global distance-weak oracle.
Theorem 3.
For given data $\mathcal{X}$ and a distance metric $d$, let $\mathcal{C}$ be a center-based clustering with the $\gamma$-margin property. Suppose each cluster contains at least one point sufficiently close to its center. Then, for appropriately chosen sampling and search parameters, the combination of Algorithms 1 and 3 outputs the oracle's clustering with probability at least $1 - \delta$ by asking weak same-cluster queries to a global distance-weak oracle.
Remark 5.
Our novel approach, using the closest point from the estimated center, allows the binary search steps to avoid simple repetitive sampling. In fact, this practical strategy is based on the idea of exploiting the deterministic behavior of distance-weak oracles.
Remark 6.
Although different binary search algorithms are developed for each weak oracle model, it is possible to unify Algorithms 2 and 3. First, process a same-cluster query using the point closest to the estimated center. Then, more queries can be posed to the weak oracle with additional samples if it gives a not-sure answer. In fact, this unified binary search algorithm strengthens the coverage of our approach because it can handle both random and distance-weak oracles at once (see Appendix C for the detailed algorithm).
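A sketch of one step of this unified search (our own rendering, not the exact Appendix C procedure): the anchor point is tried first, and assignment-known samples serve as fallbacks when the oracle abstains.

```python
def unified_query(query, anchor, j, known_samples, t):
    """Ask about point j using the anchor first; on a not-sure answer,
    retry with up to t additional assignment-known sample indices."""
    ans = query(anchor, j)
    if ans != 'notsure':
        return ans
    for i in known_samples[:t]:
        ans = query(i, j)
        if ans != 'notsure':
            return ans
    return 'notsure'
```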
5 Experimental Results
In practice, simulating active queries with a domain expert and evaluating probabilistic results is not easy, as one can "game" the system. Therefore, simple cases on synthetic data are simulated where the true cluster assignments are known, and the oracle follows the random-weak model (the source code is available online at https://github.com/twankim/weaksemi).
5.1 Data Generation
For the simulated dataset, the points of each cluster are generated from an isotropic Gaussian distribution. We assume that there exists a ground-truth oracle clustering, and the goal is to recover it when labels are partially provided via weak same-cluster queries. Several parameters control the generated clusters: the number of samples, the dimension of the data, the number of clusters, and the standard deviation of each Gaussian distribution. For visual representation, 2-dimensional data points are considered, and the remaining parameters are set to fixed values. The generated data points are verified to satisfy the $\gamma$-margin condition.
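A minimal data-generation sketch matching this setup (the `spread` knob and the seed are our own choices for producing well-separated isotropic Gaussians; the empirical margin check follows Definition 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clusters(n, d, k, std, spread=10.0):
    """Generate n points from k isotropic Gaussians in R^d with
    uniformly drawn centers; roughly equal cluster sizes."""
    centers = rng.uniform(-spread, spread, size=(k, d))
    labels = rng.integers(0, k, size=n)
    X = centers[labels] + std * rng.standard_normal((n, d))
    return X, labels, centers

def empirical_gamma(X, labels, centers):
    """Largest gamma such that gamma * d(x, mu_i) < d(y, mu_i) holds
    for every cluster i, x in C_i, and y outside C_i (using the
    generating centers as a proxy for the cluster means)."""
    gammas = []
    for i, mu in enumerate(centers):
        dist = np.linalg.norm(X - mu, axis=1)
        inside, outside = dist[labels == i], dist[labels != i]
        gammas.append(outside.min() / inside.max())
    return min(gammas)
```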
5.2 Evaluation
Each round of the evaluation is composed of experiments with different settings of the oracle parameter, i.e., the probability of a successful response. The unified binary search algorithm is used, which handles uncertainty by regarding "not-sure" as "in different clusters"; hence the repetition parameter is fixed. The remaining parameters are varied in each round, and the rounds are repeated.
Two evaluation metrics are considered: the accuracy, i.e., the ratio of correctly recovered data points, and the total number of failures that occurred at cluster assignments. The best permutation of the cluster labels is selected based on the distances between the estimated centers and the true centers. The formal definitions of the evaluation metrics use the indicator function over the true and estimated cluster labels. As a similar number of points is generated per cluster, a mean accuracy averaged over clusters is not considered.
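A sketch of the accuracy computation (Hungarian matching on center distances stands in for the paper's best-permutation search; the `scipy` usage is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_best_perm(true_centers, est_centers, y_true, y_pred):
    """Match estimated clusters to true clusters by minimizing total
    center-to-center distance, then compute point-level accuracy."""
    cost = np.linalg.norm(
        true_centers[:, None, :] - est_centers[None, :, :], axis=2)
    row, col = linear_sum_assignment(cost)
    perm = {int(c): int(r) for r, c in zip(row, col)}
    relabeled = np.array([perm[int(p)] for p in y_pred])
    return float(np.mean(relabeled == y_true))
```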
5.3 Results
To focus on scenarios with narrow margins, small margin parameters are chosen. Figure 4 shows the accuracy in percentage and the number of failures for different parameter pairs. The accuracy of recovering the oracle's clustering increases as the number of samples increases. This shows the importance of having enough samples to succeed in clustering even with the uncertainties caused by a weak oracle. In fact, even a small number of samples is sufficient in practice.
Failures of the SSAC algorithm can happen, as it is a probabilistic algorithm. When the sample size is very small, the possibility of failure increases, as there are only a few chances to ask cluster-assignment queries. For example, in the smallest setting only 6 points are sampled; if all 6 cluster-assignment queries fail, Phase 1 fails. This leads to the recovery of fewer than $k$ clusters, because the SSAC algorithm repeats Phase 1 and Phase 2 $k$ times. However, such situations rarely occur if the sample size is large enough. Also, failure in the binary search can happen, but we observed that only 2 out of 5000 rounds suffered from it.
6 Conclusion and Future Work
This paper presents approaches for utilizing a weak oracle in the semi-supervised active clustering (SSAC) framework. Specifically, we suggest two different types of domain experts that can output the answer "not-sure" to a same-cluster query. First, we consider a random-weak oracle that does not know the answer with at most some fixed probability. Second, two distance-based weak oracle models are considered to simulate realistic situations. For both of these models, probabilistic guarantees on discovering the oracle's clustering, with a small dependency on the margin, are provided based on our devised binary search algorithms. In the distance-based models, a single element close enough to the cluster center is able to mitigate ambiguous supervision. As our weak-oracle assumptions are designed to reflect practical scenarios, application to real-world clustering tasks with actual domain experts would be an interesting research topic. Another future direction is an extension of the framework to accommodate other distance functions or metric learning approaches.
References
 Ashtiani and Ben-David [2015] Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: a statistical framework. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 82–91. AUAI Press, 2015.
 Ashtiani and Ghodsi [2015] Hassan Ashtiani and Ali Ghodsi. A dimension-independent generalization bound for kernel supervised principal component analysis. In Proceedings of the 1st International Workshop on "Feature Extraction: Modern Questions and Challenges", NIPS, pages 19–29, 2015.
 Ashtiani et al. [2016] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems, pages 3216–3224, 2016.
 Balcan and Blum [2008] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In International Conference on Algorithmic Learning Theory, pages 316–328. Springer, 2008.
 Balcan and Liang [2016] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.
 Banerjee et al. [2005] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
 Basu et al. [2002] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.
 Basu et al. [2004a] Sugato Basu, Arindam Banerjee, and Raymond J Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 333–344. SIAM, 2004a.
 Basu et al. [2004b] Sugato Basu, Mikhail Bilenko, and Raymond J Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68. ACM, 2004b.
 Cohn et al. [2003] David Cohn, Rich Caruana, and Andrew McCallum. Semi-supervised clustering with user feedback. Constrained Clustering: Advances in Algorithms, Theory, and Applications, 4(1):17–32, 2003.
 Davidson and Ravi [2005] Ian Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In Proceedings of the 2005 SIAM International Conference on Data Mining, pages 138–149. SIAM, 2005.
 Jain et al. [1999] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
 Kulis et al. [2009] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. Machine Learning, 74(1):1–22, 2009.
 Mahajan et al. [2009] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation, pages 274–285. Springer, 2009.
 Mazumdar and Saha [2016] Arya Mazumdar and Barna Saha. Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839, 2016.
 Reyzin [2012] Lev Reyzin. Data stability in clustering: A closer look. In International Conference on Algorithmic Learning Theory, pages 184–198. Springer, 2012.
 Tropp [2012] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 Vattani [2009] Andrea Vattani. The hardness of k-means clustering in the plane. Manuscript, accessible at http://cseweb.ucsd.edu/~avattani/papers/kmeans_hardness.pdf, 2009.
Appendix A Concentration Inequality for Random Vectors
To achieve high-probability guarantees, we apply the Vector Hoeffding's Inequality. The proof of Theorem 5 uses a transpose dilation technique on the matrix Hoeffding bound for symmetric matrices Tropp [2012].
Lemma 4 (Matrix Hoeffding’s Inequality Tropp [2012]).
Let $X_1, \dots, X_m$ be independent, random, symmetric matrices with dimension $d$, and let $A_1, \dots, A_m$ be fixed symmetric matrices. Assume that each random matrix satisfies
\[
\mathbb{E}[X_i] = 0 \quad \text{and} \quad X_i^2 \preceq A_i^2 \ \text{almost surely}.
\]
Then, for all $t \ge 0$,
\[
\Pr\Big[\lambda_{\max}\Big(\sum_{i=1}^m X_i\Big) \ge t\Big] \le d \cdot e^{-t^2/8\sigma^2},
\qquad \text{where} \quad \sigma^2 = \Big\|\sum_{i=1}^m A_i^2\Big\|.
\]
Definition 8 (Transpose Dilation).
Given a matrix $A \in \mathbb{R}^{d_1 \times d_2}$, the transpose dilation of $A$ is defined as a function $\mathcal{D}: \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^{(d_1+d_2) \times (d_1+d_2)}$:
\[
\mathcal{D}(A) = \begin{bmatrix} 0 & A \\ A^\top & 0 \end{bmatrix}.
\]
A well-known property of the transpose dilation is that it preserves the spectral information of the input matrix Tropp [2012], i.e., $\|\mathcal{D}(A)\| = \|A\|$. In short, for each singular value $\sigma_i$ of $A$, there exist two corresponding eigenvalues $+\sigma_i$ and $-\sigma_i$ of $\mathcal{D}(A)$.
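As a concrete illustration (our own example), take the $1 \times 2$ matrix $A = \begin{pmatrix} 3 & 4 \end{pmatrix}$. Then
\[
\mathcal{D}(A) = \begin{pmatrix} 0 & 3 & 4 \\ 3 & 0 & 0 \\ 4 & 0 & 0 \end{pmatrix},
\]
whose eigenvalues are $\{+5, -5, 0\}$, matching the single singular value $\sigma_1(A) = \sqrt{3^2 + 4^2} = 5$; in particular, $\lambda_{\max}(\mathcal{D}(A)) = \|A\| = 5$.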
Theorem 5 (Vector Hoeffding’s Inequality).
Let $x_1, \dots, x_m$ be independent random vectors with dimension $d$, and let $c_1, \dots, c_m$ be a sequence of positive values. Assume that each random vector satisfies
\[
\mathbb{E}[x_i] = 0 \quad \text{and} \quad \|x_i\| \le c_i \ \text{almost surely}.
\]
Then, for any $t \ge 0$,
\[
\Pr\Big[\Big\|\sum_{i=1}^m x_i\Big\| \ge t\Big] \le (d+1) \cdot e^{-t^2/8\sigma^2},
\qquad \text{where} \quad \sigma^2 = \sum_{i=1}^m c_i^2.
\]
Proof.
The overall proof is motivated by the dilation technique introduced by Tropp [2012], which lets us apply concentration inequalities for symmetric random matrices to random vectors.
Let $Y_i = \mathcal{D}(x_i)$, where $\mathcal{D}$ is the transpose dilation defined in Definition 8. By the definition of transpose dilation, $\mathbb{E}[Y_i] = \mathcal{D}(\mathbb{E}[x_i]) = 0$ and $\sum_i Y_i = \mathcal{D}(\sum_i x_i)$. Combining these with the fact that $\mathcal{D}$ preserves spectral information gives
\[
\Big\|\sum_{i=1}^m x_i\Big\| = \lambda_{\max}\Big(\sum_{i=1}^m Y_i\Big).
\]
This equality indicates that the norm of the sum of the vectors can be transformed into the spectral norm, i.e., the largest eigenvalue, of the sum of the matrices constructed by transpose dilation.
Now let's bound the square of the random matrix $Y_i$:
\[
Y_i^2 = \begin{bmatrix} x_i x_i^\top & 0 \\ 0 & x_i^\top x_i \end{bmatrix}.
\]
This gives $Y_i^2 \preceq \|x_i\|^2 I_{d+1}$.
Since each random vector is bounded as $\|x_i\| \le c_i$, we can say that $Y_i^2 \preceq c_i^2 I_{d+1}$ for any $i$, where $I_{d+1}$ represents the $(d+1) \times (d+1)$ identity matrix. Finally, we can define the following constant:
\[
\sigma^2 = \Big\|\sum_{i=1}^m c_i^2 I_{d+1}\Big\| = \sum_{i=1}^m c_i^2.
\]
Therefore, by applying the matrix Hoeffding's inequality directly, we have
\[
\Pr\Big[\Big\|\sum_{i=1}^m x_i\Big\| \ge t\Big]
= \Pr\Big[\lambda_{\max}\Big(\sum_{i=1}^m Y_i\Big) \ge t\Big]
\le (d+1) \, e^{-t^2/8\sigma^2}.
\]
∎
Appendix B Proofs and Supplementary Analyses
In this section, proofs of the theoretical results on both random-weak oracles and distance-weak oracles are provided. Supplementary analyses, such as query complexities and feasible ranges of parameters for distance-weak oracles, are also presented.
First, we state Lemma 6, which assists the theoretical results by introducing a characteristic of points close enough to the cluster center.
Lemma 6 (Lemma 5 of Ashtiani et al. [2016]).
For given data $\mathcal{X}$ and a distance metric $d$, let $\mathcal{C}$ be a center-based clustering with the $\gamma$-margin property, and let $\{\mu_i\}$ be the set of centers (the mean of each cluster) of $\mathcal{C}$. Let $p$ be a point sufficiently close to the center $\mu_i$. Then, under the margin condition, the points in cluster $C_i$ are closer to $p$ than the points of other clusters are, i.e., $d(x, p) < d(y, p)$ for all $x \in C_i$ and $y \in \mathcal{X} \setminus C_i$.
B.1 Proofs for Random-Weak Oracles
Analysis of the Weak Pairwise Cluster-assignment Query
A single weak pairwise cluster-assignment query is composed of weak same-cluster queries on distinct pairs $(x, x_i)$, where $x$ is a given point and $x_i$ is an assignment-known point from each cluster $C_i$. Therefore, if an oracle outputs a yes or no answer for at least $k-1$ of these weak queries, we can perfectly discover the cluster assignment of $x$. Assuming abstentions occur independently across queries, this probability is lower bounded by $(1-q)^k + kq(1-q)^{k-1}$, since each query receives a definite answer with probability at least $1-q$. So, we can conclude that the probability of getting a not-sure outcome for a given $x$ on a cluster-assignment query is at most $1 - (1-q)^k - kq(1-q)^{k-1}$.
Also, if only $k' < k$ clusters have been defined during the process, the cluster assignment of $x$ can be identified if the weak oracle gives yes or no answers to all same-cluster queries. In detail, if one yes answer is provided among the weak queries, $x$ can be assigned to the corresponding cluster, and all-no answers are handled by assigning a new cluster to $x$. The probability of failure in identifying a cluster assignment is then at most $1 - (1-q)^{k'}$ in this case. Accordingly, we use $1 - (1-q)^k$ as an upper bound for the failure probability of a cluster-assignment query, to consider the worst case in the further analysis of the sampling complexity.
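As a numerical illustration of these bounds (our own example, assuming each of the $k$ queries abstains independently with probability at most $q$), take $k = 5$ and $q = 0.2$:
\[
\Pr[\text{all } k \text{ answered}] \ge (1-q)^k \approx 0.328,
\qquad
\Pr[\text{at least } k-1 \text{ answered}] \ge (1-q)^k + kq(1-q)^{k-1} \approx 0.737.
\]
Hence the worst-case failure bound $1 - (1-q)^k \approx 0.672$ is quite loose compared to the refined bound of roughly $0.263$ that is available once all $k$ clusters have been discovered.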
Lemma 7.
For given data $\mathcal{X}$ and a distance metric $d$, let $\mathcal{C}$ be a center-based clustering with the $\gamma$-margin property, and let $\{\mu_i\}$ be the set of centers (the mean of each cluster) of $\mathcal{C}$. Define the empirical mean $\mu_i'$ as in Algorithm 1. If the number of samples, including those with not-sure cluster assignments, is large enough (with the required size depending on the abstention parameter $q$, the number of clusters $k$, and the dimension $d$), then $\mu_i'$ is close to $\mu_i$ with high probability.
Proof.
Let $i = 1$ without loss of generality. Let $x_1, x_2, \dots$ be i.i.d. random vectors representing points randomly sampled from the cluster $C_1$. Also, let $w_1, w_2, \dots$ be i.i.d. indicator variables, independent of the $x_j$'s, taking the value 1 with the cluster-assignment success probability and 0 otherwise; $w_j = 1$ indicates a cluster-assignment query for which the oracle succeeds in the assignment. Then the sample mean using only the assignment-known data points among the samples can be represented as a weighted average of the terms $w_j x_j$.
Now, define a new centered random vector from each term of this average. Its mean is zero, and its norm is bounded by construction. Accounting for the chance of having not-sure samples, we can derive an upper bound on the probability that the sample mean is not close to the true mean.
The fourth equality holds because the indicators and the sampled vectors are independent, and the sampled vectors are i.i.d. The first inequality then follows by applying Theorem 5, the Vector Hoeffding's Inequality. We can conclude that if the number of samples from the cluster, including not-sure ones, is at least the stated bound, then the sample mean is close to the true mean with the desired probability.
Also, the resulting expression is a decreasing function of the assignment success probability and can therefore be upper bounded by replacing that probability with the lower bound on the cluster-assignment success probability. This concludes the proof, as the sufficient number of samples, including not-sure ones, follows from the bound above. ∎
The sample size stated in Lemma 7 is a generalized version of the original same-cluster query case. If queries are not weak, i.e., $q = 0$, one recovers the sampling complexity required to obtain a close empirical mean using a perfect oracle.
Theorem 1.
Proof.
For each of the $k$ iterations, Phase 1 of Algorithm 1 samples points from the set $\mathcal{X}$. Let $C_i$ be the cluster corresponding to the sample set. Then, by the pigeonhole principle, at least a proportional number of cluster-assignment queries, including not-sure outcomes, are processed for cluster $C_i$. Let's elaborate on this claim. Among the sampled data, there exist both assignment-known points and not-sure ones. By allocating the not-sure data to each cluster proportionally, it can be concluded that at least the required number of points is sampled from one class, up to the chance of failure. Then Lemmas 6 and 7 ensure that the sample mean constructed by the algorithm satisfies the required closeness property for all clusters with high probability.
In Phase 2, a binary search algorithm can estimate the radius of a cluster from the estimated center with one same-cluster query per step if the oracle is perfect. However, the binary search fails if at least one search step receives only not-sure outputs from the weak oracle. With $t$ repeated queries per step, the worst-case probability of failure in a single search step is at most $q^t$. By applying the union bound over the search steps, we can conclude that Algorithm 2 recovers the correct radius with the desired probability if $t$ is chosen large enough. Note that the case $t = 1$ reduces to a single query per step, which shows the generality of our result, as the same-cluster query setting with a perfect oracle requires only one query per search.
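To make the repetition argument concrete (a hedged reconstruction, with $t$ denoting the number of repeated queries per search step and $n_i$ the number of candidate points), a search step fails only if all $t$ repetitions abstain, so
\[
\Pr[\text{step fails}] \le q^{t},
\qquad
\Pr[\text{search fails}] \le \lceil \log_2 n_i \rceil \, q^{t}
\]
by the union bound over the at most $\lceil \log_2 n_i \rceil$ steps; taking $t$ logarithmic in the number of steps and in the inverse target failure probability suffices.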
By combining the above two results, we can say that the output of each iteration of Algorithm 1 is a perfect recovery of the corresponding cluster with high probability. Again, the union bound concludes the proof, as the iteration runs $k$ times; i.e., the SSAC algorithm with the modified binary search recovers the clustering of the oracle with probability at least $1 - \delta$. ∎
Sufficient complexities of same-cluster queries, and the running time excluding queries, for random-weak oracles can be calculated based on Theorem 1.
Corollary 8.
Proof.
For each iteration with the given sampling parameters, Phase 1 requires a bounded number of weak same-cluster queries, and Phase 2 takes $O(t \log n)$ queries for $t$ repetitions per binary search step. Also, the distance calculations and sorting in Phase 2 can be done in $O(nd)$ and $O(n \log n)$ time, respectively, per iteration. ∎
B.2 Proofs for Distance-Weak Oracles
Before proving the results on distance-weak oracles, we state additional bounds on pairs of data points. The proof of Proposition 9 is straightforward, using the definitions and the triangle inequality.
Proposition 9.
If a clustering $\mathcal{C}$ of data $\mathcal{X}$ satisfies the $\gamma$-margin property and has a maximum cluster radius $r$, the following conditions hold:

(a) $d(x, y) > (\gamma - 1) \max\{d(x, \mu_i), d(y, \mu_j)\}$, for all $x \in C_i$ and $y \in C_j$ with $i \ne j$;

(b) $d(x, y) \le 2r$, for all $x, y \in C_i$.
Proof.
For $x \in C_i$ and $y \in C_j$ from different clusters,
\[
d(x, y) \ge d(y, \mu_i) - d(x, \mu_i) > \gamma \, d(x, \mu_i) - d(x, \mu_i) = (\gamma - 1) \, d(x, \mu_i).
\]
Similarly, $d(x, y) > (\gamma - 1) \, d(y, \mu_j)$, which gives (a).
Also, if $x, y$ are from the same cluster $C_i$,
\[
d(x, y) \le d(x, \mu_i) + d(y, \mu_i) \le 2r,
\]
which proves (b). ∎
These inequalities imply feasible ranges for the threshold parameters that govern the quality of distance-weak oracles. Now let's prove our main theoretical results on distance-weak oracles.
Theorem 2.
For given data $\mathcal{X}$ and a distance metric $d$, let $\mathcal{C}$ be a center-based clustering with the $\gamma$-margin property. Suppose each cluster contains at least one point sufficiently close to its center. Then, for appropriately chosen sampling and search parameters, the combination of Algorithms 1 and 3 outputs the oracle's clustering with probability at least $1 - \delta$ by asking weak same-cluster queries to a local distance-weak oracle.
Proof.
First, we show that a local distance-weak oracle always gives a yes or no answer if a given weak query includes a point located close enough to the center. Let $p \in C_i$ be a data point satisfying the closeness condition of Theorem 2, and suppose the oracle is local distance-weak. If a weak query contains $p$ together with a point from the same cluster, the triangle inequality keeps their distance below the within-cluster farness threshold. Moreover, if the other point belongs to a different cluster, the margin property keeps their distance above the between-cluster closeness threshold. Therefore, the two sufficient conditions for a local distance-weak oracle stated in Definition 6 are both violated. Hence, any weak same-cluster query including $p$ can be answered by the oracle without any uncertainty. Note that an additional margin remains unused at this point, which gives a higher chance of estimating a good empirical mean.
Let's define, for each cluster $C_i$, a good set $G_i$ of data points close to the center $\mu_i$:
We know that if