A Quantum AnnealingBased Approach to Extreme Clustering
Abstract
Clustering, or grouping, dataset elements based on similarity can be used not only to classify a dataset into a few categories, but also to approximate it by a relatively large number of representative elements. In the latter scenario, referred to as extreme clustering, datasets are enormous and the number of representative clusters is large. We have devised a distributed method that can efficiently solve extreme clustering problems using quantum annealing. We prove that this method yields optimal clustering assignments under a separability assumption, and show that the generated clustering assignments are of comparable quality to those of assignments generated by common clustering algorithms, yet can be obtained a full order of magnitude faster.
T. Jaschek, M. Bucyk, and J. S. Oberoi
1 Introduction
Traditionally, clustering approaches have been developed and customized for tasks where the resultant number of clusters is not particularly high. In such cases, algorithms such as means++ [arthur2007k], BIRCH [birch], DBSCAN [ester1996density], and spectral clustering produce highquality solutions in a reasonably short amount of time. This is because these traditional algorithms scale well with respect to the dataset cardinality . However, in most cases, the computational complexity of these algorithms, in terms of the number of clusters, is either exponential or higherorder polynomial. Another common issue is that some of the algorithms require vast amounts of memory.
The demand for clustering algorithms capable of solving problems with larger values of is continually increasing. Presentday examples involve deciphering the content of billions of web pages by grouping them into millions of labelled categories [nayak2014clustering, de2015parallel], identifying similarities among billions of images using nearestneighbour detection [wang2013duplicate, liu2007clustering, woodley2018parallel]. This domain of clustering, where and are both substantially large, is referred to as extreme clustering [kobren2017hierarchical]. Although there is great value in perfecting this type of clustering, very little effort towards this end has been made by the machine learning community. Our algorithm is, in fact, such an effort. Its output is a clustering tree, which can be used to generated multiple clustering assignments (or “levels”) with varying degrees of accuracy (i.e., coarseness or fineness) of the approximation. Generating such a tree is not uncommon for clustering algorithms. Consider, for example, hierarchical clustering algorithms which generate binary clustering trees. Clustering trees are useful tools for dealing with realworld data visualization problems. Our algorithm, the Big Data Visualization Tool, or BiDViT, provides this functionality.
BiDViT employs a novel approach to clustering problems, which is based on the maximum weighted independent set (MWIS) problem in a graph induced by the original dataset and a parameter we call the radius of interest or neighbourhood parameter, which determines a relation of proximity. The use of such a parameter has been successfully employed in densitybased spatial clustering of applications with noise (DBSCAN) [ester1996density]. The MWIS problem can be transformed into a quadratic unconstrained binary optimization (QUBO) problem, the formulation accepted by a quantum annealer. An alternative way to address the underlying problem is to use a heuristic algorithm to approximate solutions to the MWIS problem. Quantum annealing and simulated annealing have been applied in centroidbased clustering [kumar2018quantum, merendino2013simulated] and in densitybased clustering [kurihara2014quantum]. However, the approaches studied are not capable of addressing problems in the extreme clustering domain.
We prove that, under a separability assumption on the ground truth clustering assignment of the original dataset, our method identifies the ground truth labels when parameters are selected that are within the bounds determined by that assumption. We provide runtime and solution quality values for both versions of our algorithm, with respect to internal evaluation schemes such as the Calinski–Harabasz and the Davies–Bouldin scores. Our results suggest that BiDViT yields clustering assignments of a quality comparable to that of assignments generated by common clustering algorithms, yet does so a full order of magnitude faster.
2 The Coarsening Method
Our algorithm is based on a combinatorial clustering method we call coarsening. The key idea behind coarsening is to approximate a set by a subset such that, for any point , there exists a such that , for some parameter . In this case, we say that is dense in and call the radius of interest. This concept is not restricted to subsets of Euclidean spaces and can be generalized to an arbitrary metric space . For example, our coarsening method can be used for clustering assignments on finite subsets of Riemannian manifolds with respect to their geodesic distance, for instance, in clustering GPS data on the surface of the Earth when analyzing population density. In what follows, we assume that is a dataset consisting of dimensional data points, equipped with a metric . Finding an arbitrary dense subset of does not necessarily yield a helpful approximation. For example, itself is always dense in . However, enforcing the additional constraint that any two points in the subset must be separated by a distance of at least yields moreinteresting approximations, often leading to a reduction in the number of data points (one of our primary objectives). We call such a set separated. Fig. 1 shows a point cloud and an dense, separated subset. The theorem that follows shows that a maximal separated set of is necessarily dense in . Let denote the open metric ball with respect to , with centre and radius .
Let be a maximal separated subset of in the sense of set inclusion. Then the following properties must be satisfied.

We have the inclusion .

For every , it holds that .

The sets for are pairwise disjoint.
In particular, is a minimal dense subset of .
Note that is equivalent to being dense in and that, in combination with , is equivalent to being a minimal with respect to this property. To prove , let be a maximal separated subset of and assume, in contradiction, that is not dense in . Then we could find such that , for every . Hence, would be separated, which is in contradiction to the maximality of . To prove ), we fix a point . Since is separated, for any and, thus, is not dense in . Property ) follows from the triangle inequality.\qed
Note that a maximal separated subset does not refer to an separated subset with fewer than or equally as many elements as all other separated subsets but, rather, to an separated subset that is no longer separated when a single data point is added. Contrary to Sec. 2, a minimal dense subset does not need to be separated. Consider the set , and let be the Euclidean distance on . Then, is dense in but not separated. Also note that an separated subset is not necessarily an coreset, which is a weighted subset whose weighted means cost approximates the means cost of the original set with up to an accuracy of [har2004coresets, balcan2013distributed].
In the following, we assume that is equipped with a weight function . We call the weight of and gather all weights in a weight vector . It will be clear from the context whether we refer to a weight function or a weight vector. The weight of a set is given by . We have already argued that maximal separated subsets yield reasonable approximations. However, such subsets are not unique. We are thus interested in finding an optimal one, that is, one that captures most of the weight of the original dataset. In other words, we are interested in solving the optimization problem
(P0) 
If we impose unit weights, the solution set to this optimization problem will consist of the maximal separated subsets of with a maximum number of elements among all such subsets. The term “maximal” refers to set inclusion and the “maximum” refers to set cardinality. Since for all , a solution to creftype P0 will always be a maximal separated subset and, therefore, by Sec. 2, dense. In Sec. 3.6, we show that this problem is equivalent to solving an MWIS problem for a weighted graph , depending solely on the dataset , the Euclidean metric , and the radius of interest . Thus, the computational task of finding a maximal separated subset of maximum weight is NPhard [lucas2014ising, karp1972reducibility].
Every subset gives rise to a clustering assignment . This assignment is given by
(1) 
Data points that are equidistant to multiple representative points are assigned to only one of them, uniformly at random. Typically, larger values of result in smaller cardinalities of . The following corollary summarizes properties of when is separated, and can be readily verified.
Let be the clustering assignment generated from a maximal separated set . Then, the following properties are satisfied:

The clusters in are nonempty and pairwise disjoint.

The cluster diameter is uniformly bounded by , i.e., .

For all , it holds that .
Notice that these properties are not satisfied by every clustering assignment, for example, the ones generated by means clustering. They are desirable in specific applications, such as image quantization, where a tight bound on the absolute approximation error is desired. However, they are undesirable if the ground truth clusters have diameters larger than . More details on the clustering assignment are provided in Sec. 3.
One could argue that prior to identifying a maximum weighted independent set and using it to generate a clustering assignment, a dataset should be normalized. However, normalization is a transformation that would result in chunks not being defined by metric balls, but rather by ellipsoids. In particular, such a transformation would change the metric . We assume that the metric already is the best indicator of proximity. In general, one can apply any homeomorphism to a dataset , apply our clustering algorithm to the set , and obtain a clustering assignment by applying to the individual clusters.
A common assumption in the clustering literature is separability—not to be mistaken with separability—of the dataset with respect to a clustering . The dataset is called separable with respect to a clustering if
(2) 
that is, if the maximum intracluster distances are strictly smaller than the minimum intercluster distances. The following theorem shows that, if is chosen correctly, our coarsening method yields the clustering assignment .
Let be separable with respect to a clustering . Then, for any
(3) 
our coarsening methods yields the correct clustering assignment.
To simplify our notation, we denote the lower and upper bounds of the interval in (3) by and , respectively. By the separability assumption, this interval is nonempty. One can see that, for any admissible choice of , any two points from different clusters are separated. Indeed, for and , it holds that . Furthermore, if a point in a cluster is selected, then no other point in the same cluster can be selected, as . Therefore, every solution to creftype P0 is a union of exactly one point from each cluster. Using the separability of with respect to , we can see that the clustering assignment induced by creftype 1 is coincident with . \qed
In practice, the separability assumption is rarely satisfied, and it is challenging to select as above (as this assumes some knowledge about the clustering assignment). However, Sec. 2 shows that our coarsening method is of research value, and can potentially yield optimal clustering assignments.
We have developed two methods, which we refer to as the heuristic method and the quantum method, to address the NPhard task of solving creftype P0. The heuristic method loosens the condition of having a maximum weight; it can be seen as a greedy approach to creftype P0. In contrast, the quantum method explores all different maximal separated subsets simultaneously, yielding one that has maximum weight. The quantum method is based on the formulation of a QUBO problem, which can be solved efficiently using a quantum annealer like the DWave 2000Q [DWave] or a digital annealer such as the one developed by Fujitsu [Fujitsu].
3 The Algorithm
Let denote a dataset of dimensional data points. Note that, mathematically speaking, a dataset is not a set but rather a multiset, that is, repetitions are allowed. Our algorithm consists of two parts: data partitioning and data coarsening, the latter of which can be further subdivided into chunk coarsening and chunk collapsing.
3.1 Data Partitioning
In general, the computational complexity of distancebased clustering methods is proportional to the square of the dataset cardinality, as all pairwise distances must be computed. This bottleneck can be overcome by dividing the dataset and employing distributed approaches [balcan2013distributed, malkomes2015fast, har2004coresets], yielding a result different from the one we would obtain when applying clustering methods on the entire dataset. However, its slight imprecision results in a significant computational speedup.
A partition of is a collection of nonempty disjoint sets such that . Elements of partitions are typically referred to as blocks, parts, or cells; however, we refer to them as chunks. The partitioning is intended to be homogeneous: every extracted chunk has an equal number of data points (there might be minor differences when the cardinality of the chunk to be divided is odd). The upper bound on the number of points desired in a chunk is referred to as the maximum chunk cardinality . To determine , one should take into account the number of available processors, their data handling capacity, or, in the case of a quantum annealer, the number of fully connected qubits.
To break the data into chunks, we employ a modified version of the wellknown “median cut” algorithm, which is frequently used in colour quantization. First, we select an axis of maximum variance. We then bisect the dataset along the selected axis, say , at the median of in such a way as to obtain two data chunks and whose cardinalities differ by at most one (in the case where is odd) and which satisfy and . We cannot simply assign and , as these sets might differ drastically in cardinality. For example, when , this assignment would imply that and .
By using and in the role of , this process can be repeated iteratively, until the number of data points in the chunk to be divided is less than or equal to the maximum chunk cardinality , yielding a binary tree of data chunks. After iterations, this leaves us with chunks such that where the union is disjoint. Fig. 1 provides a visualization.
3.2 Chunk Coarsening
The goal of a data coarsening step is, for each chunk, to find representative data points such that their union can replace the original point cloud, while maintaining the original data distribution as accurately as possible.
Let be a chunk and be the radius of interest. In what follows, we assume that all the data points are pairwise different. Practically, this can be achieved by removing duplicates and cumulatively incrementing the weight of the representative point we wish to keep by the weight of the discarded duplicates. The radius of interest induces a weighted graph , where is the vertex set, the edge set is given by the relation defined by if and only if for all , and the weight function is the restriction of to . For each data point , we denote its weight by .
For each data point , we introduce a binary decision variable that encodes whether is used in a possible set . Furthermore, we define the neighbourhood matrix (or similarity matrix) of the graph by if , and otherwise. Problem creftype P0 can then be posed as a quadratically constrained quadratic program (QCQP) given by
(P1) 
Here, the inner summation of the constraint does not need to run over all indices, due to the symmetry of . The matrix form of creftype P1 is given by maximizing subject to the constraint , where is the upper triangular matrix of having all zeroes along the diagonal. As explained in Sec. 3.6, creftype P1 is equivalent to the NPhard MWIS problem for , and thus is computationally intractable for large problem sizes. Note that creftype P1 can be written as the 0–1 integer linear program (ILP)
(P2) 
We present two methods we have devised to address creftype P1.
3.2.1 The Heuristic Method
We wish to emphasize that the heuristic method does not provide us with a solution to creftype P1. Rather, the aim of this method is to obtain an separated subset with a high—but not necessarily the maximum—weight . The seeking of approximate solutions to the MWIS problem is a wellstudied subject [balaji2009approximating, hifi1997genetic, kako2009approximation]. Typically, researchers employ greedy algorithms, LPbased algorithms (using the relaxation of creftype P2), or semidefinite programming (SDP) algorithms; see [kako2009approximation] for an analysis.
We employ a classic greedy algorithm due to its simplicity and low computational complexity. In each step, we add the data point that locally is the best choice in the sense that the ratio of the weight of its neighbourhood to its own weight is as small as possible. Prior to the execution of the step, we remove the point and its neighbours from the set of candidates. Pseudocode of the greedy algorithm is provided in Algorithm 1. Before we state a theoretical result on the approximation ratio of this algorithm, we define the weighted degree of a vertex in a weighted graph and the weighted average degree of as and respectively, where is the neighbourhood of vertex [kako2009approximation].
Algorithm 1 has an approximation ratio of , i.e.,
(4) 
for any solution to creftype P0 and any output of the algorithm. Moreover, the bound in (4) is tight.
A proof is given in [kako2009approximation, Thm. 6]. \qed
3.2.2 The Quantum Method
In contrast to the heuristic method, the QUBO approach provides an actual (i.e., nonapproximate) solution to creftype P1. We reformulate the problem by transforming the QCQP into a QUBO problem.
Using the Lagrangian penalty method, we incorporate the constraint into the objective function by adding a penalty term. For a sufficiently large penalty multiplier , the solution set of creftype P1 is equivalent to that of
(P3) 
One can show that, for every solution to creftype P3 satisfies the separation constraint [abbott2018hybrid, Thm. 1]. Instead, we use individual penalty terms , as this may lead to a QUBO problem with much smaller coefficients, which results in improved performance when solving the problem using a quantum annealer. Expressing creftype P3 as a minimization, instead of a maximization, problem and using matrix notation yields the problem
(P4) 
where if , if and , and otherwise. Solutions to creftype P4 can be approximated using heuristics such as simulated annealing [nolte2000note], path relinking [lu2010hybrid], tabu search [lu2010hybrid], and parallel tempering [zhu2016borealis]. Before solving creftype P4, it is advisable to reduce its size and difficulty by making use of logical implications among the coefficients [glover2018logical]. This involves fixing every variable that corresponds to a node that has no neighbours to one, as it necessarily is included in an dense subset.
The following theorem show that creftype P1 is equivalent to creftype P4 for a suitable choice of , for .
Let for all . Then, for any solution to creftype P4, the corresponding set is separated. In particular, the solution sets of creftype P1 and creftype P4 coincide.
We generalize the proof of [abbott2018hybrid, Thm. 1] and show that every solution to creftype P4 satisfies the separation constraint . Assuming, in contradiction, that the opposite were to be the case, we could find a solution and indices and such that and . Let denote the th standard unit vector, and let . Then,
(5)  
(6) 
where , defined by , orders the index accordingly. This technicality is necessary, as we defined only for . As , we have , and thus
(7) 
Therefore, as , it holds that , which is absurd, as, by assumption, is a solution to creftype P4.
We now show that the solution sets of creftype P1 and creftype P4 coincide. Note that creftype P1 is equivalent to the optimization problem
(P5) 
Let and be solutions to creftype P5 and creftype P4, respectively. We denote the objective functions by and , where is the matrix defined by for , and zero otherwise, and the term denotes the Hadamard product of the matrices and , given by elementwise multiplication. Then, as for , by the observation above, both and satisfy the separation constraint. Since and are coordinatewise nonnegative and for , it holds that
(8) 
thus, if satisfies the separation constraint, then . Using this observation, and that and minimize and , respectively, we have
(9) 
Hence, the inequalities in creftype 9 must actually be equalities; thus, the solution sets of the optimization problems coincide. \qed
Problem creftype P4 can be transformed to an Ising spin model by mapping to . This form is desirable because the ground state of the Hamiltonian of an Ising spin model can be determined efficiently with a quantum annealer.
3.3 Chunk Collapsing
Having identified a maximal separated subset , we collapse the vertices into , meaning we update the weight of each according to the weights of all that satisfy . We aim to assign each to a unique by generating a socalled Voronoi decomposition (depicted in Fig. 1) of each chunk , which is a partition, where each point is assigned to the closest point within a subset . More precisely, we define the sets as in creftype 1, for each . By construction, contains all vertices that will be collapsed into , in particular, itself. We then assign the coarsened chunk a new weight function defined by
(10) 
In practice, to prevent having very large values for the individual weights, one might wish to add a linear or logarithmic scaling to this weight assignment. In our experiments, we did not add such a scaling.
3.4 Iterative Implementation of BiDViT
BiDViT repeats the procedure of data partitioning, chunk coarsening, and chunk collapsing with an increasing radius of interest, until the entire dataset collapses to a single data point. We call these iterations BiDViT levels. The increase of between BiDViT levels is realized by multiplying by a constant factor, denoted by and specified by the user. In our implementation we have introduced a node
class that has three attributes: coordinates
, weight
, and parents
. We initialize BiDViT by creating a node_list
containing the nodes corresponding to the weighted dataset (if no weights are provided then the weights are assumed to be the multiplicity of the data points). After each iteration, we remove the nodes that collapsed into representative nodes from the node_list
and keep only the remaining representative nodes. However, we append the removed nodes to the parents of the representative node. The final node_list
is a data tree, that is, it consists of only one node, and we can move upwards in the hierarchy by accessing its parents (and their parents and so on); see Fig. 2. Two leaves of the data tree share a label with respect to a specific BiDViT level, say , if they have collapsed into centroids which, possibly after multiple iterations, have collapsed into the same centroid at the th level of the tree. For the sake of reproducibility, we provide pseudocode (see Algorithm 2).
It is worth noting that, at each level, instead of proceeding with the identified representative data points, one can use the cluster centroids, allowing moreaccurate data coarsening and label assignment.
3.5 Complexity Analysis
Our analysis shows that every interation of the heuristic version of BiDViT has a computational complexity of . Note that .
The order of complexity of the partitioning procedure is . To see this, note that there are at most partitioning stages and in the th stage we split chunks , where . Let denote the number of data points in chunk . Finding the dimension of maximum variance has a complexity of and determining the median of this dimension can be achieved in via the “median of medians” algorithm. Having computed the median, one can construct two chunks of equal size in linear time. Since , a partitioning step is . Any division of a chunk is independent of the other chunks at a given stage; thus, this procedure can benefit from distributed computing.
The order of complexity for the collapsing process is , as computing the neighbourhood matrix of a chunk is and the heuristic selection procedure is . The number of chunks is bounded from above by . This yields a complexity of . As data coarsening in each chunk is independent, with parallel processors available the complexity reduces to .
3.6 Relation to the MWIS Problem
The process of identifying a maximal separated set of maximum weight is equivalent to solving the MWIS problem for the weighted graph . Let be a weighted graph. A set of vertices is called independent in if no two of its vertices are adjacent or, equivalently, if is a clique in the complement graph. This corresponds to the separation constraint mentioned earlier, where two vertices are adjacent whenever they are less than a distance of apart. The MWIS problem can be expressed as
(P6) 
and is NPcomplete for a general weighted graph [karp1972reducibility], yet, for specific graphs, there exist polynomialtime algorithms [mandal2006maximum, kohler2016linear]. Note that the QUBO formulation of the MWIS problem in [abbott2018hybrid, hernandez2016novel] is related to the one in creftype P4.
If all weights are positive, a maximum weighted independent set is necessarily a maximal independent set. A maximal independent set is a dominating set, that is, a subset of such that every is adjacent to some . This corresponds to our observation that every maximal separated subset is dense.
4 Results
The datasets used to demonstrate the efficiency and robustness of our approach are the MNIST dataset of handwritten digits [lecun2010mnist], a twodimensional version of MNIST obtained by using SNE [maaten2008visualizing], two synthetic grid datasets, and a dataset called Covertype containing data on forests in Colorado [Dua:2017]. The synthetic grid datasets are the unions of 100 samples (in the 2D case) and 1000 samples (in the 3D case) drawn from with means and a variance of for in the 2D case and the natural extension in the 3D case. Dataset statistics are provided in Table 1. In addition to our technical experiments, explained in the following sections, a practical application of BiDViT for image qunatization is illustrated in Fig. 7. All experiments were performed using a 2.5 GHz Intel Core i7 processor and 16 GB of RAM.
4.1 LowRange Clustering Domain
Although BiDViT has been specifically designed for extreme clustering, it yields accurate assignments for low values of . Fig. 3 shows the clustering assignment of BiDViT on the 2D grid dataset and on MNIST. The results are obtained by manually selecting a BiDViT level. In the grid dataset, every cluster is identified correctly. In the MNIST dataset, all clusters are recognized, except one. However, as our algorithm is based on metric balls, and some datasets might not conform to such categorization, there are datasets for which it cannot accurately assign clusters. This is true for most clustering algorithms, as they are able to recognize only specific shapes.
4.2 Extreme Clustering Capability
To evaluate the performance of BiDViT on highdimensional datasets in the extreme clustering range, we used the Calinski–Harabasz score [calinski1974dendrite] and the Davies–Bouldin score [davies1979cluster]. These clustering metrics are internal evaluation schemes, that is, their values depend solely on the clustered data, not requiring the ground truth label assignment for the dataset. Such schemes must be viewed as heuristic methods: their optimal values do not guarantee optimal clusters but provide a reasonable measure of clustering quality. Detailed analyses have been conducted on the advantages and shortcomings of internal clustering measures [liu2010understanding, jain2008innovation]. In the extreme clustering scenario, where the objective is to obtain an accurate approximation of the entire dataset instead of categorizing its elements, no true labels are given and thus external evaluation schemes (ones based on the distance to a ground truth clustering assignment) do not qualify as success measures.
Let denote a total of detected clusters within a dataset with data points. The Calinski–Harabasz score of a clustering is defined as a weighted ratio of the intercluster squared deviations to the sum of the intracluster squared deviations. More precisely, is given by
(11) 
where , for are the cluster centroids, and is their mean. High values of are indicative of a high clustering quality. The Davies–Bouldin score is the average maximum value of the ratios of the pairwise sums of the intracluster deviation to the intercluster deviation. The score is defined as
(12) 
where Low values of indicate accurate clustering.
Fig. 4 shows and of clustering assignments obtained with BiDViT and Mini Batch means clustering [sculley2010web] for different values of on the Covertype dataset. Due to their high computational complexity with respect to , many common clustering algorithms could not be applied. Remarkably, values are quite similar, indicating that the cluster assignments generated by BiDViT are of comparable quality even though the runtime of our algorithm is significantly shorter. For , our algorithm outperforms the others for lower values of , and is comparable for large values. One explanation for the slightly weaker performance of BiDViT with respect to is that BiDViT aims to minimize the nonsquared distances, whereas rewards clustering methods that minimize squared distances. Similarly, this explains BiDViT’s advantage for .
4.3 Runtime Comparison
In our experiments, we observed that, with respect to the total runtime, even the heuristic version of BiDViT restricted to a single core outperforms common clustering methods in the extreme clustering domain. Fig. 5 shows the runtime required by different clustering algorithms for the Covertype dataset. For the implementation of methods other than BiDViT, we used the publicly available sklearn.clustering module for Python. To generate the plots, we ran the entire BiDViT procedure, then applied classical algorithms for the same values of . The results suggest that, in the extreme clustering domain, the runtime of BiDViT is an order of magnitude faster than that of the agglomerative methods against which it was compared, and multiple orders of magnitude faster than that of means and Mini Batch means clustering. The dataset cardinality was restricted to 20,000 points to obtain results for other methods, whereas BiDViT is capable of handling the entire dataset comprising 581,000 points.
Name  Description  Cardinality  Dimension 

MNIST  handwritten images  K  
MNIST2D  SNE of the above  K  
Covertype  forest data  K  
grid2D  synthetically generated  K  
grid3D  synthetically generated  K 
Algorithm  Runtime on Dataset (seconds)  

specified parameters  Covertype  grid3D 
PERCHC  
PERCHC  
PERCHC  –  
BiDViT (heuristic)  
BiDViT (heuristic)  
We then compared the runtime of BiDViT to PERCH (“Purity Enhancing Rotations for Cluster Hierarchies”), a hierarchical algorithm for extreme clustering [kobren2017hierarchical], to our knowledge the only other algorithm designed to solve extreme clustering problems. We restricted both algorithms to using a single core. Table 1 shows that BiDViT performs an order of magnitude faster than PERCH. However, they solve somewhat different problems: whereas BiDViT aims to gradually coarsen a dataset by finding separated, dense subsets, PERCH maximizes the dendrogram purity, a measure of the clustering tree’s consistency [kobren2017hierarchical]. The clustering tree generated by PERCH is binary and thus enormous, allowing for much finer incremental distinctions between clustering assignments. In contrast, the tree generated by BiDViT is more compact, as multiple data points can collapse into the same representative point. When comparing dendrogram purities, we expect PERCH to outperform BiDViT; when comparing Davies–Bouldin scores at a given level, we expect the opposite. We did not test these hypotheses, as dendrogram purity is an external evaluation scheme, that is, it requires a clustering assignment to use for comparison, which is not available in unsupervised machine learning.
4.4 Results for the Quantum Version of BiDViT
We tested a prototype of BiDViT on a DWave 2000Q quantum annealer, a machine that has 2048 qubits and 5600 couplers. According to DWave Systems, the computer uses 128,000 Josephson junctions and was the most complex superconducting integrated circuit built to date when introduced in January of 2017 [DWave].
Before solving the QUBO problems, we applied preprocessing techniques, reducing their size and difficulty [glover2018logical]. This proved effective and eliminated a great many variables. In most cases, we observed a size reduction of over 60%.
For the quantum version of BiDViT, we observed higherquality solutions and a significant speedup for BiDViT, when compared to common clustering methods. Both observations are based on results shown in Fig. 6.
However, the heuristic version of BiDViT and the common clustering algorithms were executed on a classical device that has a limited computational capacity, whereas the DWave 2000Q is a highly specialized device. Running these algorithms on a highperformance computer might lead to an equivalent degree of speedup.
5 Conclusion
We have developed an efficient algorithm capable of performing extreme clustering. Our complexity analysis and numerical experiments show that if the dataset cardinality and the desired number of clusters are both large, the runtime of BiDViT is at least an order of magnitude faster than that of classical algorithms, while yielding a solution of comparable quality. With advances in quantum annealing hardware, one can expect further speedups in our algorithm and size of dataset that can be processed.
Independent of BiDViT, our coarsening method, based on identifying an dense, separated subset, is valuable in its own right—it is a novel approach to clustering which is not limited solely to the extreme clustering domain.
Further investigation of our coarsening approach is justified, as we have identified a domain for the radius of interest (in Sec. 2) such that, under a separability assumption, every solution to creftype P0 (i.e., every maximum weighted separated subset) yields the optimal clustering assignment.
Acknowledgements
We thank Saeid Allahdadian, Nick Condé, Daniel Crawford, and Austin Wallace for contributing to an earlier version of the algorithm. We thank Maliheh Aramon, Pooja Pandey, and Brad Woods for helpful discussions on optimization theory. The implementation of the QUBO preprocessing techniques was performed jointly with Brad Woods and Nick Condé. Inderpreet Singh contributed to the figure on image quantization. Victoria Wong assisted with graphical editing of two figures. Partial funding for this work was provided by the Mitacs Accelarate internship initiative.
References
Supplementary Information to the Paper, “A Quantum AnnealingBased Approach to Extreme Clustering”
Appendix A: An Alternative Coarsening Method
In certain situations, a user might not want the approximating set to be separated but instead might be interested in finding an dense subset with a minimum number of elements, or, more generally, the minimum cost for some cost function . Finding such a set can be realized in a very similar way to the quantum method of BiDViT. In fact, the only modifications needed would be to Sec. 3.2 in the paper, where we introduce the concept of chunk coarsening.
Let , and let and , for , be defined as in Sec. 3.2. Analogously to the weight vector , we define a cost vector by for each . The problem of finding an dense subset of minimum cost can then be expressed as follows:
(P7) 
The constraints in creftype P7 enforce the condition that for each solution (corresponding to a subset), every point in is represented by at least one of the points from the selected subset. The subset will not necessarily be separated, but it will be dense.
In the same way that finding an separated subset of maximum weight corresponds to the MWIS problem, finding an dense subset of minimum cost corresponds to the minimum weighted dominating set (MWDS) problem, which is equivalent to a weighted version of the minimal set covering (MSC) problem. Consider a set and subsets and , where is some set of indices, such that . The MSC problem then consists of finding a subset such that the property is satisfied, and is of minimum cardinality with respect to this property. For example, if , , , and , then the solution to the MSC problem is given by , as none of the subsets cover , but the union does. The general MSC problem is known to be NPhard [karp1972reducibility]. By defining for in the above setting, one can see that we have solved a weighted version of the MSC problem.
To transform creftype P7 into a QUBO problem, we convert the inequality constraints to equality constraints by adding integer slack variables. Note that the th constraint is satisfied if and only if there exists some such that . In fact, given that , we can see that the must satisfy the bounds
(13) 
Thus, by dualizing the equality constraints, creftype P7 can be expressed as a QUBO problem
(P8) 
We will now describe how substituting a binary encoding for each of the , for , in creftype P8 yields the desired QUBO formulation. For each , the possible states of can be encoded by binary variables , where . The encoding has the form
(14) 
where are fixed coefficients that depend solely on the bounds of creftype 13. If we were to select for , then, if , could assume states that do not satisfy these bounds. We can avoid this situation by manipulating the coefficient of the final bit such that . This may lead to a situation where there are multiple valid encodings for the same integer, but it will always hold that
(15) 
where for . Substituting the binary encoding into creftype P8 yields the following QUBO formulation:
(P9) 
One can show that the solution set of this QUBO problem is equivalent to the one for creftype P7 for . We have not investigated whether this bound is sharp. Note that our QUBO formulation is similar to the one described in [lucas2014ising], but uses a different encoding.
The number of binary variables in the QUBO formulation of this problem depends on the binary encoding of . If the vertex degree in is uniformly bounded from above by a constant , then each can be encoded with fewer than binary variables. Therefore, the number of variables in the QUBO polynomial will be at most . In the worst case, that is, when there is vertex that is a neighbour of every other vertex, the polynomial would still comprise fewer than variables.