A Quantum Annealing-Based Approach to Extreme Clustering

# A Quantum Annealing-Based Approach to Extreme Clustering

Tim Jaschek    Marko Bucyk 11QB Information Technologies (1QBit), Vancouver, BC, Canada 12Dept. of Mathematics, University of British Columbia, Vancouver, BC, Canada 2    and Jaspreet S. Oberoi 11QB Information Technologies (1QBit), Vancouver, BC, Canada 111QB Information Technologies (1QBit), Vancouver, BC, Canada 13School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada 3
###### Abstract

Clustering, or grouping, dataset elements based on similarity can be used not only to classify a dataset into a few categories, but also to approximate it by a relatively large number of representative elements. In the latter scenario, referred to as extreme clustering, datasets are enormous and the number of representative clusters is large. We have devised a distributed method that can efficiently solve extreme clustering problems using quantum annealing. We prove that this method yields optimal clustering assignments under a separability assumption, and show that the generated clustering assignments are of comparable quality to those of assignments generated by common clustering algorithms, yet can be obtained a full order of magnitude faster.

extreme clustering, distributed computing, quantum computing, maximum weighted independent set, unsupervised learning
\tocauthor

T. Jaschek, M. Bucyk, and J. S. Oberoi

footnotetext: T. Jaschek and J. S. Oberoi have contributed equally to this research. Address correspondence to: tim.jaschek@1qbit.com

## 1 Introduction

Traditionally, clustering approaches have been developed and customized for tasks where the resultant number of clusters is not particularly high. In such cases, algorithms such as -means++ [arthur2007k], BIRCH [birch], DBSCAN [ester1996density], and spectral clustering produce high-quality solutions in a reasonably short amount of time. This is because these traditional algorithms scale well with respect to the dataset cardinality . However, in most cases, the computational complexity of these algorithms, in terms of the number of clusters, is either exponential or higher-order polynomial. Another common issue is that some of the algorithms require vast amounts of memory.

The demand for clustering algorithms capable of solving problems with larger values of is continually increasing. Present-day examples involve deciphering the content of billions of web pages by grouping them into millions of labelled categories [nayak2014clustering, de2015parallel], identifying similarities among billions of images using nearest-neighbour detection [wang2013duplicate, liu2007clustering, woodley2018parallel]. This domain of clustering, where and are both substantially large, is referred to as extreme clustering [kobren2017hierarchical]. Although there is great value in perfecting this type of clustering, very little effort towards this end has been made by the machine learning community. Our algorithm is, in fact, such an effort. Its output is a clustering tree, which can be used to generated multiple clustering assignments (or “levels”) with varying degrees of accuracy (i.e., coarseness or fineness) of the approximation. Generating such a tree is not uncommon for clustering algorithms. Consider, for example, hierarchical clustering algorithms which generate binary clustering trees. Clustering trees are useful tools for dealing with real-world data visualization problems. Our algorithm, the Big Data Visualization Tool, or BiDViT, provides this functionality.

BiDViT employs a novel approach to clustering problems, which is based on the maximum weighted independent set (MWIS) problem in a graph induced by the original dataset and a parameter we call the radius of interest or neighbourhood parameter, which determines a relation of proximity. The use of such a parameter has been successfully employed in density-based spatial clustering of applications with noise (DBSCAN) [ester1996density]. The MWIS problem can be transformed into a quadratic unconstrained binary optimization (QUBO) problem, the formulation accepted by a quantum annealer. An alternative way to address the underlying problem is to use a heuristic algorithm to approximate solutions to the MWIS problem. Quantum annealing and simulated annealing have been applied in centroid-based clustering [kumar2018quantum, merendino2013simulated] and in density-based clustering [kurihara2014quantum]. However, the approaches studied are not capable of addressing problems in the extreme clustering domain.

We prove that, under a separability assumption on the ground truth clustering assignment of the original dataset, our method identifies the ground truth labels when parameters are selected that are within the bounds determined by that assumption. We provide runtime and solution quality values for both versions of our algorithm, with respect to internal evaluation schemes such as the Calinski–Harabasz and the Davies–Bouldin scores. Our results suggest that BiDViT yields clustering assignments of a quality comparable to that of assignments generated by common clustering algorithms, yet does so a full order of magnitude faster.

## 2 The Coarsening Method

Our algorithm is based on a combinatorial clustering method we call coarsening. The key idea behind coarsening is to approximate a set by a subset such that, for any point , there exists a such that , for some parameter . In this case, we say that is -dense in and call the radius of interest. This concept is not restricted to subsets of Euclidean spaces and can be generalized to an arbitrary metric space . For example, our coarsening method can be used for clustering assignments on finite subsets of Riemannian manifolds with respect to their geodesic distance, for instance, in clustering GPS data on the surface of the Earth when analyzing population density. In what follows, we assume that is a dataset consisting of -dimensional data points, equipped with a metric . Finding an arbitrary -dense subset of does not necessarily yield a helpful approximation. For example, itself is always -dense in . However, enforcing the additional constraint that any two points in the subset must be separated by a distance of at least yields more-interesting approximations, often leading to a reduction in the number of data points (one of our primary objectives). We call such a set -separated. Fig. 1 shows a point cloud and an -dense, -separated subset. The theorem that follows shows that a maximal -separated set of is necessarily -dense in . Let denote the open metric ball with respect to , with centre and radius .

{theorem}

Let be a maximal -separated subset of in the sense of set inclusion. Then the following properties must be satisfied.

• We have the inclusion .

• For every , it holds that .

• The sets for are pairwise disjoint.

In particular, is a minimal -dense subset of .

{proof}

Note that is equivalent to being -dense in and that, in combination with , is equivalent to being a minimal with respect to this property. To prove , let be a maximal -separated subset of and assume, in contradiction, that is not -dense in . Then we could find such that , for every . Hence, would be -separated, which is in contradiction to the maximality of . To prove ), we fix a point . Since is -separated, for any and, thus, is not -dense in . Property ) follows from the triangle inequality.\qed

Note that a maximal -separated subset does not refer to an -separated subset with fewer than or equally as many elements as all other -separated subsets but, rather, to an -separated subset that is no longer -separated when a single data point is added. Contrary to  Sec. 2, a minimal -dense subset does not need to be -separated. Consider the set , and let be the Euclidean distance on . Then, is -dense in but not -separated. Also note that an -separated subset is not necessarily an -coreset, which is a weighted subset whose weighted -means cost approximates the -means cost of the original set with up to an accuracy of [har2004coresets, balcan2013distributed].

In the following, we assume that is equipped with a weight function . We call the weight of and gather all weights in a weight vector . It will be clear from the context whether we refer to a weight function or a weight vector. The weight of a set is given by . We have already argued that maximal -separated subsets yield reasonable approximations. However, such subsets are not unique. We are thus interested in finding an optimal one, that is, one that captures most of the weight of the original dataset. In other words, we are interested in solving the optimization problem

 maximizeS⊆Xω(S)subject toS is ε-separated. (P0)

If we impose unit weights, the solution set to this optimization problem will consist of the maximal -separated subsets of with a maximum number of elements among all such subsets. The term “maximal” refers to set inclusion and the “maximum” refers to set cardinality. Since for all , a solution to creftype P0 will always be a maximal -separated subset and, therefore, by Sec. 2, -dense. In Sec. 3.6, we show that this problem is equivalent to solving an MWIS problem for a weighted graph , depending solely on the dataset , the Euclidean metric , and the radius of interest . Thus, the computational task of finding a maximal -separated subset of maximum weight is NP-hard [lucas2014ising, karp1972reducibility].

Every subset gives rise to a clustering assignment . This assignment is given by

 Cx={y∈X:d(x,y)≤d(x′,y) for all x′∈U}. (1)

Data points that are equidistant to multiple representative points are assigned to only one of them, uniformly at random. Typically, larger values of result in smaller cardinalities of . The following corollary summarizes properties of when is -separated, and can be readily verified.

{corollary}

Let be the clustering assignment generated from a maximal -separated set . Then, the following properties are satisfied:

• The clusters in are non-empty and pairwise disjoint.

• The cluster diameter is uniformly bounded by , i.e., .

• For all , it holds that .

Notice that these properties are not satisfied by every clustering assignment, for example, the ones generated by -means clustering. They are desirable in specific applications, such as image quantization, where a tight bound on the absolute approximation error is desired. However, they are undesirable if the ground truth clusters have diameters larger than . More details on the clustering assignment are provided in Sec. 3.

One could argue that prior to identifying a maximum weighted independent set and using it to generate a clustering assignment, a dataset should be normalized. However, normalization is a transformation that would result in chunks not being defined by metric balls, but rather by ellipsoids. In particular, such a transformation would change the metric . We assume that the metric already is the best indicator of proximity. In general, one can apply any homeomorphism to a dataset , apply our clustering algorithm to the set , and obtain a clustering assignment by applying to the individual clusters.

A common assumption in the clustering literature is separability—not to be mistaken with -separability—of the dataset with respect to a clustering . The dataset is called separable with respect to a clustering if

 maxx,y∈Ci1≤i≤kd(x,y)

that is, if the maximum intra-cluster distances are strictly smaller than the minimum inter-cluster distances. The following theorem shows that, if is chosen correctly, our coarsening method yields the clustering assignment .

{theorem}

Let be separable with respect to a clustering . Then, for any

 ε∈⎛⎜ ⎜⎝maxx,y∈Ci1≤i≤kd(x,y),minx∈Ci,y∈Cj1≤i≠j≤kd(x,y)⎤⎥ ⎥⎦, (3)

our coarsening methods yields the correct clustering assignment.

{proof}

To simplify our notation, we denote the lower and upper bounds of the interval in (3) by and , respectively. By the separability assumption, this interval is non-empty. One can see that, for any admissible choice of , any two points from different clusters are -separated. Indeed, for and , it holds that . Furthermore, if a point in a cluster is selected, then no other point in the same cluster can be selected, as . Therefore, every solution to creftype P0 is a union of exactly one point from each cluster. Using the separability of with respect to , we can see that the clustering assignment induced by creftype 1 is coincident with . \qed

In practice, the separability assumption is rarely satisfied, and it is challenging to select as above (as this assumes some knowledge about the clustering assignment). However, Sec. 2 shows that our coarsening method is of research value, and can potentially yield optimal clustering assignments.

We have developed two methods, which we refer to as the heuristic method and the quantum method, to address the NP-hard task of solving creftype P0. The heuristic method loosens the condition of having a maximum weight; it can be seen as a greedy approach to creftype P0. In contrast, the quantum method explores all different maximal -separated subsets simultaneously, yielding one that has maximum weight. The quantum method is based on the formulation of a QUBO problem, which can be solved efficiently using a quantum annealer like the D-Wave 2000Q [DWave] or a digital annealer such as the one developed by Fujitsu [Fujitsu].

## 3 The Algorithm

Let denote a dataset of -dimensional data points. Note that, mathematically speaking, a dataset is not a set but rather a multiset, that is, repetitions are allowed. Our algorithm consists of two parts: data partitioning and data coarsening, the latter of which can be further subdivided into chunk coarsening and chunk collapsing.

### 3.1 Data Partitioning

In general, the computational complexity of distance-based clustering methods is proportional to the square of the dataset cardinality, as all pairwise distances must be computed. This bottleneck can be overcome by dividing the dataset and employing distributed approaches [balcan2013distributed, malkomes2015fast, har2004coresets], yielding a result different from the one we would obtain when applying clustering methods on the entire dataset. However, its slight imprecision results in a significant computational speed-up.

A partition of is a collection of non-empty disjoint sets such that . Elements of partitions are typically referred to as blocks, parts, or cells; however, we refer to them as chunks. The partitioning is intended to be homogeneous: every extracted chunk has an equal number of data points (there might be minor differences when the cardinality of the chunk to be divided is odd). The upper bound on the number of points desired in a chunk is referred to as the maximum chunk cardinality . To determine , one should take into account the number of available processors, their data handling capacity, or, in the case of a quantum annealer, the number of fully connected qubits.

To break the data into chunks, we employ a modified version of the well-known “median cut” algorithm, which is frequently used in colour quantization. First, we select an axis of maximum variance. We then bisect the dataset along the selected axis, say , at the median of in such a way as to obtain two data chunks and whose cardinalities differ by at most one (in the case where is odd) and which satisfy and . We cannot simply assign and , as these sets might differ drastically in cardinality. For example, when , this assignment would imply that and .

By using and in the role of , this process can be repeated iteratively, until the number of data points in the chunk to be divided is less than or equal to the maximum chunk cardinality , yielding a binary tree of data chunks. After iterations, this leaves us with chunks such that where the union is disjoint. Fig. 1 provides a visualization.

### 3.2 Chunk Coarsening

The goal of a data coarsening step is, for each chunk, to find representative data points such that their union can replace the original point cloud, while maintaining the original data distribution as accurately as possible.

Let be a chunk and be the radius of interest. In what follows, we assume that all the data points are pairwise different. Practically, this can be achieved by removing duplicates and cumulatively incrementing the weight of the representative point we wish to keep by the weight of the discarded duplicates. The radius of interest  induces a weighted graph , where is the vertex set, the edge set is given by the relation  defined by if and only if for all , and the weight function is the restriction of to . For each data point , we denote its weight by .

For each data point , we introduce a binary decision variable that encodes whether is used in a possible set . Furthermore, we define the neighbourhood matrix (or similarity matrix) of the graph by if , and otherwise. Problem creftype P0 can then be posed as a quadratically constrained quadratic program (QCQP) given by

 maximizes∈{0,1}nn∑i=1siwisubject ton∑i=1∑j>isiN(ε)ijsj=0. (P1)

Here, the inner summation of the constraint does not need to run over all indices, due to the symmetry of . The matrix form of creftype P1 is given by maximizing subject to the constraint , where is the upper triangular matrix of having all zeroes along the diagonal. As explained in Sec. 3.6, creftype P1 is equivalent to the NP-hard MWIS problem for , and thus is computationally intractable for large problem sizes. Note that creftype P1 can be written as the 0–1 integer linear program (ILP)

 maximizes∈{0,1}nn∑i=1siwisubject tosi+sj≤1,for i,j such that ¯¯¯¯¯N(ε)ij=1. (P2)

We present two methods we have devised to address creftype P1.

#### 3.2.1 The Heuristic Method

We wish to emphasize that the heuristic method does not provide us with a solution to creftype P1. Rather, the aim of this method is to obtain an -separated subset with a high—but not necessarily the maximum—weight . The seeking of approximate solutions to the MWIS problem is a well-studied subject [balaji2009approximating, hifi1997genetic, kako2009approximation]. Typically, researchers employ greedy algorithms, LP-based algorithms (using the relaxation of creftype P2), or semi-definite programming (SDP) algorithms; see [kako2009approximation] for an analysis.

We employ a classic greedy algorithm due to its simplicity and low computational complexity. In each step, we add the data point that locally is the best choice in the sense that the ratio of the weight of its neighbourhood to its own weight is as small as possible. Prior to the execution of the step, we remove the point and its neighbours from the set of candidates. Pseudocode of the greedy algorithm is provided in Algorithm 1. Before we state a theoretical result on the approximation ratio of this algorithm, we define the weighted degree of a vertex in a weighted graph and the weighted average degree of as and respectively, where is the neighbourhood of vertex [kako2009approximation].

{theorem}

Algorithm 1 has an approximation ratio of , i.e.,

 ω(S)≤(¯¯¯¯¯¯¯¯¯¯¯degw(G)+1)−1ω(S∗), (4)

for any solution to creftype P0 and any output of the algorithm. Moreover, the bound in (4) is tight.

{proof}

A proof is given in [kako2009approximation, Thm. 6]. \qed

#### 3.2.2 The Quantum Method

In contrast to the heuristic method, the QUBO approach provides an actual (i.e., non-approximate) solution to creftype P1. We reformulate the problem by transforming the QCQP into a QUBO problem.

Using the Lagrangian penalty method, we incorporate the constraint into the objective function by adding a penalty term. For a sufficiently large penalty multiplier , the solution set of creftype P1 is equivalent to that of

 maximizes∈{0,1}nn∑i=1siwi−λn∑i=1∑j>isiN(ε)ijsj. (P3)

One can show that, for every solution to creftype P3 satisfies the separation constraint [abbott2018hybrid, Thm. 1]. Instead, we use individual penalty terms , as this may lead to a QUBO problem with much smaller coefficients, which results in improved performance when solving the problem using a quantum annealer. Expressing creftype P3 as a minimization, instead of a maximization, problem and using matrix notation yields the problem

 minimizes∈{0,1}nsTQs, (P4)

where if , if and , and otherwise. Solutions to creftype P4 can be approximated using heuristics such as simulated annealing [nolte2000note], path relinking [lu2010hybrid], tabu search [lu2010hybrid], and parallel tempering [zhu2016borealis]. Before solving creftype P4, it is advisable to reduce its size and difficulty by making use of logical implications among the coefficients [glover2018logical]. This involves fixing every variable that corresponds to a node that has no neighbours to one, as it necessarily is included in an -dense subset.

The following theorem show that creftype P1 is equivalent to creftype P4 for a suitable choice of , for .

{theorem}

Let for all . Then, for any solution to creftype P4, the corresponding set is -separated. In particular, the solution sets of creftype P1 and creftype P4 coincide.

{proof}

We generalize the proof of [abbott2018hybrid, Thm. 1] and show that every solution to creftype P4 satisfies the separation constraint . Assuming, in contradiction, that the opposite were to be the case, we could find a solution and indices and such that and . Let denote the -th standard unit vector, and let . Then,

 vTQv =sTQs−n∑j>ksjQkj−n∑i

where , defined by , orders the index accordingly. This technicality is necessary, as we defined only for . As , we have , and thus

 vTQv≤sTQs−λkℓ+wk. (7)

Therefore, as , it holds that , which is absurd, as, by assumption, is a solution to creftype P4.

We now show that the solution sets of creftype P1 and creftype P4 coincide. Note that creftype P1 is equivalent to the optimization problem

 minimizes∈{0,1}n, −sTwsubject tosT¯¯¯¯¯N(ε)s=0. (P5)

Let and be solutions to creftype P5 and creftype P4, respectively. We denote the objective functions by and , where is the matrix defined by for , and zero otherwise, and the term denotes the Hadamard product of the matrices and , given by element-wise multiplication. Then, as for , by the observation above, both and satisfy the separation constraint. Since and are coordinate-wise non-negative and for , it holds that

 sT¯¯¯¯¯N(ε)s=0⇔sT(Λ∘¯¯¯¯¯N(ε))s=0, (8)

thus, if satisfies the separation constraint, then . Using this observation, and that and minimize and , respectively, we have

 p1(s1)≤p1(s2)=p2(s2)≤p2(s1)=p1(s1). (9)

Hence, the inequalities in creftype 9 must actually be equalities; thus, the solution sets of the optimization problems coincide. \qed

Problem creftype P4 can be transformed to an Ising spin model by mapping to . This form is desirable because the ground state of the Hamiltonian of an Ising spin model can be determined efficiently with a quantum annealer.

### 3.3 Chunk Collapsing

Having identified a maximal -separated subset , we collapse the vertices into , meaning we update the weight of each according to the weights of all that satisfy . We aim to assign each to a unique by generating a so-called Voronoi decomposition (depicted in Fig. 1) of each chunk , which is a partition, where each point is assigned to the closest point within a subset . More precisely, we define the sets as in creftype 1, for each . By construction, contains all vertices that will be collapsed into , in particular, itself. We then assign the coarsened chunk a new weight function defined by

 wS(x)=ω(Cx)=∑y∈Cxw(y). (10)

In practice, to prevent having very large values for the individual weights, one might wish to add a linear or logarithmic scaling to this weight assignment. In our experiments, we did not add such a scaling.

### 3.4 Iterative Implementation of BiDViT

BiDViT repeats the procedure of data partitioning, chunk coarsening, and chunk collapsing with an increasing radius of interest, until the entire dataset collapses to a single data point. We call these iterations BiDViT levels. The increase of between BiDViT levels is realized by multiplying by a constant factor, denoted by and specified by the user. In our implementation we have introduced a node class that has three attributes: coordinates, weight, and parents. We initialize BiDViT by creating a node_list containing the nodes corresponding to the weighted dataset (if no weights are provided then the weights are assumed to be the multiplicity of the data points). After each iteration, we remove the nodes that collapsed into representative nodes from the node_list and keep only the remaining representative nodes. However, we append the removed nodes to the parents of the representative node. The final node_list is a data tree, that is, it consists of only one node, and we can move upwards in the hierarchy by accessing its parents (and their parents and so on); see Fig. 2. Two leaves of the data tree share a label with respect to a specific BiDViT level, say , if they have collapsed into centroids which, possibly after multiple iterations, have collapsed into the same centroid at the -th level of the tree. For the sake of reproducibility, we provide pseudocode (see Algorithm 2).

It is worth noting that, at each level, instead of proceeding with the identified representative data points, one can use the cluster centroids, allowing more-accurate data coarsening and label assignment.

### 3.5 Complexity Analysis

Our analysis shows that every interation of the heuristic version of BiDViT has a computational complexity of . Note that .

The order of complexity of the partitioning procedure is . To see this, note that there are at most partitioning stages and in the -th stage we split chunks , where . Let denote the number of data points in chunk . Finding the dimension of maximum variance has a complexity of and determining the median of this dimension can be achieved in via the “median of medians” algorithm. Having computed the median, one can construct two chunks of equal size in linear time. Since , a partitioning step is . Any division of a chunk is independent of the other chunks at a given stage; thus, this procedure can benefit from distributed computing.

The order of complexity for the collapsing process is , as computing the neighbourhood matrix of a chunk is and the heuristic selection procedure is . The number of chunks is bounded from above by . This yields a complexity of . As data coarsening in each chunk is independent, with parallel processors available the complexity reduces to .

### 3.6 Relation to the MWIS Problem

The process of identifying a maximal -separated set of maximum weight is equivalent to solving the MWIS problem for the weighted graph . Let be a weighted graph. A set of vertices is called independent in if no two of its vertices are adjacent or, equivalently, if is a clique in the complement graph. This corresponds to the separation constraint mentioned earlier, where two vertices are adjacent whenever they are less than a distance of apart. The MWIS problem can be expressed as

 maximizeS⊆Vω(S)subject toS is independent, (P6)

and is NP-complete for a general weighted graph [karp1972reducibility], yet, for specific graphs, there exist polynomial-time algorithms [mandal2006maximum, kohler2016linear]. Note that the QUBO formulation of the MWIS problem in [abbott2018hybrid, hernandez2016novel] is related to the one in creftype P4.

If all weights are positive, a maximum weighted independent set is necessarily a maximal independent set. A maximal independent set is a dominating set, that is, a subset of such that every is adjacent to some . This corresponds to our observation that every maximal -separated subset is -dense.

## 4 Results

The datasets used to demonstrate the efficiency and robustness of our approach are the MNIST dataset of handwritten digits [lecun2010mnist], a two-dimensional version of MNIST obtained by using -SNE [maaten2008visualizing], two synthetic grid datasets, and a dataset called Covertype containing data on forests in Colorado [Dua:2017]. The synthetic grid datasets are the unions of 100 samples (in the 2D case) and 1000 samples (in the 3D case) drawn from with means and a variance of for in the 2D case and the natural extension in the 3D case. Dataset statistics are provided in Table 1. In addition to our technical experiments, explained in the following sections, a practical application of BiDViT for image qunatization is illustrated in Fig. 7. All experiments were performed using a 2.5 GHz Intel Core i7 processor and 16 GB of RAM.

### 4.1 Low-Range Clustering Domain

Although BiDViT has been specifically designed for extreme clustering, it yields accurate assignments for low values of . Fig. 3 shows the clustering assignment of BiDViT on the 2D grid dataset and on MNIST. The results are obtained by manually selecting a BiDViT level. In the grid dataset, every cluster is identified correctly. In the MNIST dataset, all clusters are recognized, except one. However, as our algorithm is based on metric balls, and some datasets might not conform to such categorization, there are datasets for which it cannot accurately assign clusters. This is true for most clustering algorithms, as they are able to recognize only specific shapes.

### 4.2 Extreme Clustering Capability

To evaluate the performance of BiDViT on high-dimensional datasets in the extreme clustering range, we used the Calinski–Harabasz score [calinski1974dendrite] and the Davies–Bouldin score [davies1979cluster]. These clustering metrics are internal evaluation schemes, that is, their values depend solely on the clustered data, not requiring the ground truth label assignment for the dataset. Such schemes must be viewed as heuristic methods: their optimal values do not guarantee optimal clusters but provide a reasonable measure of clustering quality. Detailed analyses have been conducted on the advantages and shortcomings of internal clustering measures [liu2010understanding, jain2008innovation]. In the extreme clustering scenario, where the objective is to obtain an accurate approximation of the entire dataset instead of categorizing its elements, no true labels are given and thus external evaluation schemes (ones based on the distance to a ground truth clustering assignment) do not qualify as success measures.

Let denote a total of detected clusters within a dataset with data points. The Calinski–Harabasz score of a clustering is defined as a weighted ratio of the inter-cluster squared deviations to the sum of the intra-cluster squared deviations. More precisely, is given by

 SCH(C1,…,Cnc)=(n−1nc−1)∑nck=1|Ck|∥ck−c∥22∑nck=1∑x∈Ck∥x−ck∥22, (11)

where , for are the cluster centroids, and is their mean. High values of are indicative of a high clustering quality. The Davies–Bouldin score is the average maximum value of the ratios of the pairwise sums of the intra-cluster deviation to the inter-cluster deviation. The score is defined as

 SDB(C1,…,Cnc)=1ncnc∑k=1maxj≠kSk+Sj∥ck−cj∥2, (12)

where Low values of indicate accurate clustering.

Fig. 4 shows and of clustering assignments obtained with BiDViT and Mini Batch -means clustering [sculley2010web] for different values of on the Covertype dataset. Due to their high computational complexity with respect to , many common clustering algorithms could not be applied. Remarkably, values are quite similar, indicating that the cluster assignments generated by BiDViT are of comparable quality even though the runtime of our algorithm is significantly shorter. For , our algorithm outperforms the others for lower values of , and is comparable for large values. One explanation for the slightly weaker performance of BiDViT with respect to is that BiDViT aims to minimize the non-squared distances, whereas rewards clustering methods that minimize squared distances. Similarly, this explains BiDViT’s advantage for .

### 4.3 Runtime Comparison

In our experiments, we observed that, with respect to the total runtime, even the heuristic version of BiDViT restricted to a single core outperforms common clustering methods in the extreme clustering domain. Fig. 5 shows the runtime required by different clustering algorithms for the Covertype dataset. For the implementation of methods other than BiDViT, we used the publicly available sklearn.clustering module for Python. To generate the plots, we ran the entire BiDViT procedure, then applied classical algorithms for the same values of . The results suggest that, in the extreme clustering domain, the runtime of BiDViT is an order of magnitude faster than that of the agglomerative methods against which it was compared, and multiple orders of magnitude faster than that of -means and Mini Batch -means clustering. The dataset cardinality was restricted to 20,000 points to obtain results for other methods, whereas BiDViT is capable of handling the entire dataset comprising 581,000 points.

We then compared the runtime of BiDViT to PERCH (“Purity Enhancing Rotations for Cluster Hierarchies”), a hierarchical algorithm for extreme clustering [kobren2017hierarchical], to our knowledge the only other algorithm designed to solve extreme clustering problems. We restricted both algorithms to using a single core. Table 1 shows that BiDViT performs an order of magnitude faster than PERCH. However, they solve somewhat different problems: whereas BiDViT aims to gradually coarsen a dataset by finding -separated, -dense subsets, PERCH maximizes the dendrogram purity, a measure of the clustering tree’s consistency [kobren2017hierarchical]. The clustering tree generated by PERCH is binary and thus enormous, allowing for much finer incremental distinctions between clustering assignments. In contrast, the tree generated by BiDViT is more compact, as multiple data points can collapse into the same representative point. When comparing dendrogram purities, we expect PERCH to outperform BiDViT; when comparing Davies–Bouldin scores at a given level, we expect the opposite. We did not test these hypotheses, as dendrogram purity is an external evaluation scheme, that is, it requires a clustering assignment to use for comparison, which is not available in unsupervised machine learning.

### 4.4 Results for the Quantum Version of BiDViT

We tested a prototype of BiDViT on a D-Wave 2000Q quantum annealer, a machine that has 2048 qubits and 5600 couplers. According to D-Wave Systems, the computer uses 128,000 Josephson junctions and was the most complex superconducting integrated circuit built to date when introduced in January of 2017 [DWave].

Before solving the QUBO problems, we applied preprocessing techniques, reducing their size and difficulty [glover2018logical]. This proved effective and eliminated a great many variables. In most cases, we observed a size reduction of over 60%.

For the quantum version of BiDViT, we observed higher-quality solutions and a significant speed-up for BiDViT, when compared to common clustering methods. Both observations are based on results shown in Fig. 6.

However, the heuristic version of BiDViT and the common clustering algorithms were executed on a classical device that has a limited computational capacity, whereas the D-Wave 2000Q is a highly specialized device. Running these algorithms on a high-performance computer might lead to an equivalent degree of speed-up.

## 5 Conclusion

We have developed an efficient algorithm capable of performing extreme clustering. Our complexity analysis and numerical experiments show that if the dataset cardinality and the desired number of clusters are both large, the runtime of BiDViT is at least an order of magnitude faster than that of classical algorithms, while yielding a solution of comparable quality. With advances in quantum annealing hardware, one can expect further speed-ups in our algorithm and size of dataset that can be processed.

Independent of BiDViT, our coarsening method, based on identifying an -dense, -separated subset, is valuable in its own right—it is a novel approach to clustering which is not limited solely to the extreme clustering domain.

Further investigation of our coarsening approach is justified, as we have identified a domain for the radius of interest (in Sec. 2) such that, under a separability assumption, every solution to creftype P0 (i.e., every maximum weighted -separated subset) yields the optimal clustering assignment.

## Acknowledgements

We thank Saeid Allahdadian, Nick Condé, Daniel Crawford, and Austin Wallace for contributing to an earlier version of the algorithm. We thank Maliheh Aramon, Pooja Pandey, and Brad Woods for helpful discussions on optimization theory. The implementation of the QUBO preprocessing techniques was performed jointly with Brad Woods and Nick Condé. Inderpreet Singh contributed to the figure on image quantization. Victoria Wong assisted with graphical editing of two figures. Partial funding for this work was provided by the Mitacs Accelarate internship initiative.

## Supplementary Information to the Paper, “A Quantum Annealing-Based Approach to Extreme Clustering”

### Appendix A: An Alternative Coarsening Method

In certain situations, a user might not want the approximating set to be -separated but instead might be interested in finding an -dense subset with a minimum number of elements, or, more generally, the minimum cost for some cost function . Finding such a set can be realized in a very similar way to the quantum method of BiDViT. In fact, the only modifications needed would be to Sec. 3.2 in the paper, where we introduce the concept of chunk coarsening.

Let , and let and , for , be defined as in Sec. 3.2. Analogously to the weight vector , we define a cost vector by for each . The problem of finding an -dense subset of minimum cost can then be expressed as follows:

 minimizes∈{0,1}nn∑i=1sicisubject ton∑j=1N(ε)ijsj≥1,i=1,…n. (P7)

The constraints in creftype P7 enforce the condition that for each solution (corresponding to a subset), every point in is represented by at least one of the points from the selected subset. The subset will not necessarily be -separated, but it will be -dense.

In the same way that finding an -separated subset of maximum weight corresponds to the MWIS problem, finding an -dense subset of minimum cost corresponds to the minimum weighted dominating set (MWDS) problem, which is equivalent to a weighted version of the minimal set covering (MSC) problem. Consider a set and subsets and , where is some set of indices, such that . The MSC problem then consists of finding a subset such that the property is satisfied, and is of minimum cardinality with respect to this property. For example, if , , , and , then the solution to the MSC problem is given by , as none of the subsets cover , but the union does. The general MSC problem is known to be NP-hard [karp1972reducibility]. By defining for in the above setting, one can see that we have solved a weighted version of the MSC problem.

To transform creftype P7 into a QUBO problem, we convert the inequality constraints to equality constraints by adding integer slack variables. Note that the -th constraint is satisfied if and only if there exists some such that . In fact, given that , we can see that the must satisfy the bounds

 0≤ξi≤(n∑j=1N(ε)ij)−1,fori=1,…,n. (13)

Thus, by dualizing the equality constraints, creftype P7 can be expressed as a QUBO problem

 minimizes∈{0,1}n0≤ξ≤(N(ε)1)−1n∑i=1sici+λn∑i=1[(n∑j=1N(ε)ijsj)−1−ξi]2. (P8)

We will now describe how substituting a binary encoding for each of the , for , in creftype P8 yields the desired QUBO formulation. For each , the possible states of can be encoded by binary variables , where . The encoding has the form

 ξi=⌊mi⌋∑k=0b(i)kγ(i)k,for i=1,…,n, (14)

where are fixed coefficients that depend solely on the bounds of creftype 13. If we were to select for , then, if , could assume states that do not satisfy these bounds. We can avoid this situation by manipulating the coefficient of the final bit such that . This may lead to a situation where there are multiple valid encodings for the same integer, but it will always hold that

 0≤⌊mi⌋∑k=0b(i)kγ(i)k≤(n∑j=1N(ε)ij)−1, (15)

where for . Substituting the binary encoding into creftype P8 yields the following QUBO formulation:

 minimizeb(i)∈{0,1}⌊mi⌋+1s∈{0,1}nn∑i=1sici+λ⎡⎣(n∑j=1N(ε)ijsj)−1−⌊mi⌋∑k=0b(i)kγ(i)k⎤⎦2. (P9)

One can show that the solution set of this QUBO problem is equivalent to the one for creftype P7 for . We have not investigated whether this bound is sharp. Note that our QUBO formulation is similar to the one described in [lucas2014ising], but uses a different encoding.

The number of binary variables in the QUBO formulation of this problem depends on the binary encoding of . If the vertex degree in is uniformly bounded from above by a constant , then each can be encoded with fewer than binary variables. Therefore, the number of variables in the QUBO polynomial will be at most . In the worst case, that is, when there is vertex that is a neighbour of every other vertex, the polynomial would still comprise fewer than variables.

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters