Technical Report: KNN Joins Using a Hybrid Approach: Exploiting CPU/GPU Workload Characteristics
Abstract
This paper studies finding the nearest neighbors (KNN) of all points in a dataset. Typical solutions to KNN searches use indexing to prune the search, which reduces the number of candidate points that may be within the set of the nearest points of each query point. In high dimensionality, index searches degrade, making the KNN selfjoin a prohibitively expensive operation in some scenarios. Furthermore, there are a significant number of distance calculations needed to determine which points are nearest to each query point. To address these challenges, we propose a hybrid CPU/GPU approach. Since the CPU and GPU are considerably different architectures that are best exploited using different algorithms, we advocate for splitting the work between both architectures based on the characteristic workloads defined by the query points in the dataset. As such, we assign dense regions to the GPU, and sparse regions to the CPU to most efficiently exploit the relative strengths of each architecture. Critically, we find that the relative performance gains over the reference implementation across four realworld datasets are a function of the data properties (size, dimensionality, distribution), and number of neighbors, .
figurec
I Introduction
The performance of dataintensive computations such as nearest neighbor (KNN) searches are often limited by the memory bottleneck. The high aggregate memory bandwidth of graphics processing units (GPUs) (e.g., 900 GiB/s on the Nvidia Volta [1]) results in roughly an orderofmagnitude increase in memory bandwidth over the CPU. Therefore, GPUs are wellsuited for dataintensive workloads. While the hostGPU interconnect is a bottleneck, new interconnects such as NVLink [2] will ameliorate this problem.
We study the KNN selfjoin problem, which is outlined as follows: given a database of points, find all of the nearest neighbors of each point. We focus on the selfjoin because it is a common scenario in scientific data processing workflows (e.g., within an astronomy catalog, find the closest five objects of all objects within a feature space [3]). KNN searches are used in many applications, such as the kmeans [4], and Chameleon [5] clustering algorithms. Consequently, KNN searches have been well studied [6, 7, 8], including algorithms designed for the GPU [9]. GPU algorithms typically fall into three classes in descending order of prevalence: those minimally involving the host (e.g., onGPU KNN searches [9]); those that closely integrate an algorithm between the host and GPU to concurrently exploit both resources (e.g., hybrid CPU/GPU index searches [10]); and those that split the work between both resources that use algorithms and optimizations tailored to their specific architectures. Algorithms of type and are most prevalent. However, type above is the least common, as it requires splitting the total workload as a function of the architecture, algorithm, and workload characteristics. We focus on above — a hybrid CPU/GPU approach that assigns query points to the CPU or GPU to find their respective KNN.
An advantage of our proposed approach is that each query point (or unit of work), can be executed on the architecture most favorable to its characteristic workload, thus improving performance over CPU or GPUonly approaches. Our algorithm, HybridKNNJoin, leverages an efficient distance similarity join algorithm for the GPU to process high density regions, and a parallel CPU KNN algorithm for processing low density regions. A key advantage of our algorithm is that new advances in CPU or GPUonly approaches can be substituted into the hybrid framework to further improve performance. In this context, we make the following contributions:

We propose a hybrid CPU/GPU approach for solving the KNN selfjoin problem that combines a distance similarity join for the GPU with a multicore CPU KNN algorithm.

We present a method for dividing the query points between CPU and GPU architectures as a function of data dependent properties (e.g., distribution, dimensionality, and size).

The GPU component of our HybridKNNJoin algorithm solves the KNN problem using range queries. We show how to select a query distance, , such that the GPU join task is likely to find at least neighbors for each query point.

We optimize task granularity on the GPU by ensuring that resources remain saturated despite changing workloads based on the total workload assigned to the GPU.

We achieve performance gains over the parallel reference implementation on four realworld datasets.
Ii Background
Related categories of work include: KNN algorithms, range queries and join operations, and indexing techniques for the CPU and GPU. We give an overview of each below, but note that due to the extensive range of literature, we cannot possibly discuss all of these areas in a sufficient level of detail.
KNN Searches and Join Algorithms – KNN searches are a fundamental unsupervised algorithm for machine leaning and data mining. Consequently, there have been many works on optimizing the KNN search and join [6, 7, 11, 12, 13, 14, 8, 9] and we only describe a sample of the literature below.
An Rtree is used to find the KNN in [6] that uses a branchandbound recursive algorithm that first gets an estimate of the KNN and then performs backtracking on subtrees to find the exact neighbors. Backtracking in treebased solutions is used to ensure that at least nearest neighbors are found, such as in the GPU KNN algorithm [9].
While the ELSH [12] algorithm performs range queries within a search distance, and is not designed for KNN, it can be used to find the nearest neighbor by constructing several data structures corresponding to increasing search radii, querying them in ascending order by distance, and selecting the first nonempty result [15].
The Approximate Nearest Neighbors (ANN) algorithm can be used to efficiently find both the approximate and the exact neighbors [7]. Approximate solutions are motivated by prohibitively expensive highdimensional searches, and therefore, an approximate solution may reach a tradeoff between tolerable incorrectness and computation time. Related to ANN is the Fast Library for ANN (FLANN) [8], which uses several algorithms, and in particular, achieves good performance using a randomized kdforest, where the trees are searched in parallel. While FLANN outperforms ANN for one scenario in [8], the comparison was between a parallel (FLANN) and sequential algorithm (ANN). Since ANN is an optimized algorithm, we parallelize and compare our work to ANN, and as we will discuss later, we incorporate it into HybridKNNJoin.
Distributed Memory KNN Searches – Distributedmemory approaches have been proposed to improve the performance of KNN searches. For instance, MapReduce [16] implementations for KNN joins [17, 18] have been proposed. The authors in [17] optimize the mapping function to prune distance calculations, which reduces the cost of the shuffling operation and computation. The authors in [18] propose exact and approximate KNN join solutions, where they show that in their approximate solution, only a linear number of reducers are needed, which is needed to achieve good scalability. In contrast to scaling out the computation using distributed memory, we scale up the computation using the GPU.
Indexing Techniques – Central to our approach is using an appropriate index for the architecture. Indexes for the CPU have been designed to be workefficient, such as indextrees (e.g., kdtrees [19], quadtrees [20], and Rtrees [21]), and they are constructed as a function of the data distribution. In contrast, there are dataoblivious methods, such as statically partitioned grids [22].
With the proliferation of general purpose computing on graphics processing units (GPGPU) there has been debate whether the community should use the treebased approaches on the GPU, or use dataoblivious methods. The disadvantage of indextrees is that they contain many branch instructions, which can reduce the parallel efficiency of the GPU due to the SIMD architecture. An Rtree for the GPU was presented in [23], which was optimized to reduce thread divergence. Later, the same authors showed that it is better to perform the tree traversal on the CPU and perform the scanning of the leaf nodes on the GPU [10]. While the approach in [10] requires both the CPU and GPU, it demonstrates that the GPU can be best leveraged through the use of regularized instructions, and thus low thread divergence. Since dataoblivious approaches may be better suited to the GPU, we use a nonhierarchical indexing technique for our GPU join operation.
Range Queries and Joins – Our hybrid approach uses range queries on the GPU to perform KNN searches. A join operation with a distance predicate can be implemented as several range queries. The multicore CPU join algorithm [24] uses a nonmaterialized grid, and exploits the data distribution to efficiently perform a similarity join over a search distance, , and the algorithm was shown to outperform the ELSH [12], and LSS [25] algorithms. A GPU selfjoin was presented in [22] that was shown to be efficient on lowdimensional data, and then the same authors advanced several optimizations for targeting high dimensional data in [26]. These works were shown to generally outperform the work of [24]. We leverage some of the optimizations in these previous selfjoin works [22, 26] as they are effective for executing range queries that can be used to solve KNN searches on the GPU.
Iii Problem Statement
The KNN selfjoin is outlined as follows. Let be a database of dimensional points (or feature vectors) denoted as , where . For each point in the database, , we find its nearest neighbors, excluding the point itself. To compute the distance between two points, and , we use the Euclidean distance as follows: , where denotes the point’s coordinate in dimension . We assume an inmemory scenario where the entire database fits within the global memory of a GPU, and the entire result set (the nearest neighbors of each point) fits within main memory on the host; however, the entire result set may exceed GPU global memory capacity. The KNN selfjoin is denoted as . However, the KNN selfjoin problem and optimizations are also directly applicable to the case where there are two datasets and that are joined, .
Iv Recap of Previous SelfJoin Work
The GPU component of HybridKNNJoin leverages two previous works [22, 26]. The performance of join operations and range queries are a function of dimensionality. In low dimensionality, points are typically found in close proximity of each other at small distances; however, at higher dimensionality, the hypervolume increases, which causes the points to be located at larger distances from each other. Therefore, in low dimensionality, after searching an index, there are many candidate neighbors to filter to determine which ones are within . In high dimensionality, index searches become more exhaustive, and search performance may approximate that of a brute force search. This wellknown problem is known as the curse of dimensionality [27].
Gowanlock & Karsin presented an efficient GPU selfjoin implementation [22], that was limited to dimensions. The authors modified an efficient indexing scheme and used a batching scheme from [28], in addition to advancing a technique to reduce the number of duplicate computations. Later, Gowanlock & Karsin implemented a selfjoin that targeted the challenges of high dimensionality [26]. This included a point coordinate reordering strategy to optimize the discriminatory power of the index, an approach that decreased the number of indexed dimensions, and a method used to decrease the number of point comparisons, which was particularly useful when indexing fewer than dimensions. Both low and highdimensional approaches were shown to outperform the stateoftheart multicore parallel approach of [24] across many experimental scenarios. In what follows, we give an overview of the selfjoin components from [22, 26], that we utilize to efficiently solve the KNNjoin.
Iva Indexing Technique
We use a gridbased indexing scheme for the GPU from previous work, and present a brief overview below (see [28, 22] for more details). To summarize, a grid with cells of length is utilized. To accommodate highdimensional search spaces, the index only stores nonempty grid cells. If the entire bounding hypervolume were to be partitioned into grid cells, the number of cells may exceed the memory capacity of the GPU. The nonempty cell ids are stored in a lookup array . The points that are found within each grid cell are stored in as minimum and maximum ranges in a point lookup array . This lookup array contains the index values of the points in the database . A range query search around a query point proceeds as follows: the cell of the query point is computed from the point’s coordinates; the adjacent cell id ranges of the point in each dimension are obtained (e.g., in 2D there are 9 total grid cells that may contain points within ); the linear cell coordinate is computed from an adjacent dimensional cell, and this linearized id is searched for in using a binary search to determine if the cell is nonempty; assuming a nonempty cell is found, the cell is located in , which contains minimum and maximum ranges in a point lookup array ; these ranges are used to find the points in , and distance calculations are performed between the query point and all of the points in the cell; the results of those points within of the query point are obtained, and the search proceeds to the next adjacent cell in step . The space complexity of the index is . This small fraction of the total memory capacity of the GPU allows for larger datasets and result set sizes to be processed.
IvB Batching Scheme
We give a brief overview of the GPU batching scheme in [22] and [28]. The size of the total result set for a join operation, which contains the neighbors of each point within a distance , can be significantly larger than the size of the dataset, . Thus, to process large datasets or values of , a batching scheme is needed to incrementally process the join, by querying a fraction of , returning the result, and then querying another fraction of , and so on, until all points have been processed. Since multiple batches are needed, we select a number of batches to execute by first computing the join on a fraction of the points in , and return the total number of points within (a single integer), which yields an estimate, , of the total result set size. Given a buffer size of (the size of a buffer to store the result set of a batch), we compute the total number of batches to be . When executing the selfjoin, we never have a buffer overflow, and do not need to rely on failurerestart strategies that can waste computation. Furthermore, we use 3 CUDA steams, corresponding to a minimum of , which overlaps the execution of the batches to exploit the bidirectional bandwidth over PCIe, and allow for concurrent hostside tasks with GPU computation. This hides communication and hostside operation overheads. In our experiments, we use (each CUDA stream is assigned a buffer of size for the result set).
IvC Index Dimensionality Reduction
In [26], a technique is proposed to index dimensions of the data, i.e., if and , then the 6 dimensional data is indexed and projected in 3 dimensions. By indexing dimensions, the index search for nearby points is less expensive, but results in larger candidate sets to filter, as the search is less selective. When filtering the candidate set, the distance is computed using the dimensional coordinate of each point, and the correctness of the result is not compromised. This tradeoff is particularly important in high dimensions, as index searches become more exhaustive. Without this optimization, the selfjoin would have been prohibitive on the high dimensional datasets in [26].
IvD Reordering Data by Variance
To maximize the discriminatory power of the index, it is important to consider the distribution of the data in each dimension. As an example, if there are dimensions, and the data in , , and is uniformly distributed within the ranges [0, 1], [0, 0.01], and [0.2, 0.6], respectively, then the dimensions that yield the largest discriminatory power are and . Thus, if we index instead of dimensions (Section IVC), then we should index on dimensions and , and leave unindexed, as it provides little discriminatory power (all of the points are found within a small region). We use the approach in [26] that reorders the data by variance in each dimension, such that we select dimensions to index that optimize the efficacy of index searches. Similarly, one could perform a principle component analysis, or use a histogrambased approach [24] to optimize index search performance. All of these techniques exploit the statistical properties of the data to improve performance, which is necessary when processing high dimensional data. We refer to this optimization as reorder.
IvE Short Circuiting the Distance Calculation
As the distance between points is being computed, the computation will quit early if the distance exceeds . This optimization is important in high dimensions, as the number of operations needed to compute the distance between points increases with dimensionality [26]. We refer to this optimization as shortc.
V HybridKNNJoin and Optimizations
Va Overview: Splitting Work Between Architectures
As discussed in Section I, we focus on a hybrid CPU/GPU approach that performs the KNN search that maps to be processed by the CPU or GPU.
A range query finds all points, , within a search distance, , of a query point. Thus, to construct a KNNjoin using a range query, there are several facets of the problem to consider. The search distance is required to ensure that the nearest points from a query point are found. For a given search that returns neighbors, the distances between points are compared to determine which of the points are nearest to the query point. However, while a range query will return all points within , there is no guarantee that all (or any) of the points will have neighbors. In principle, the selection of could be large such that all points have at least nearest neighbors; however, this would lead to significant computational overhead, as some points in the dataset may find a large fraction of the entire dataset necessitating a significant number of distance calculations.
Figure 1 shows an example of a spatially partitioned space with query points shown in red. In Figure 1 (a), there are many nearby neighbors; therefore, given a value of , there are a significant number of distance calculations and filtering needed to find the nearest neighbors. However, in Figure 1 (b), the query point is located in a sparse region. Thus, a large range query would be needed to find at least neighbors. Spatially partitioning the data using a grid in Figure 1 (a) is reasonable, as it is likely neighbors will be found by checking adjacent cells (e.g., assume ). In contrast, in Figure 1 (b), the grid is not effective. Had a grid been used, the adjacent cells would not contain any nearby points. In this case, a dataaware index (e.g., kdtree [19] partitioning shown in Figure 1 (b)) is better suited to finding data in sparse regions. Furthermore, as there are fewer points nearby the query point in Figure 1 (b), there is a low degree of candidate point filtering overhead.
Given this illustrative example, the GPU and associated indexing scheme in Section IVA is good for processing the scenario in Figure 1 (a) due to the large amount of filtering overhead needed (the massive parallelism of the GPU is wellsuited to distance calculations), and low index search overhead; whereas the scenario in Figure 1 (b) is good for finding the KNN on the CPU due to the low degree of filtering overhead and associated dataaware indexing scheme for low density regions.
Algorithm designs for the CPU and GPU are often at crosspurposes with each other. To minimize the response time, or to be timeefficient, CPU algorithms are often designed to be workefficient, whereas an algorithm for the GPU may be less workefficient, but be more timeefficient than an algorithm for the CPU. For example, the CPU range queries that are needed for the KNN search often use treebased indexes to reduce the number of distance comparisons between nearby points, at the expense of many branch instructions needed for tree traversals. This is workefficient, as the number of distance calculations are minimized. In contrast, the GPU can be less workefficient, performing many more distance calculations than the CPU, but use an index with fewer branch conditions and lower discriminatory power than the CPU. For a given query point, the GPU may be more timeefficient than the CPU despite its less workefficient range query implementation. Therefore, the motivation for splitting the work between CPU and GPU is based on the suitability of each architecture to find the KNN of a given query point.
VB Hybrid KNNJoin
As discussed above, we split the work between the CPU and GPU to exploit their respective architectures. The GPU is good at concurrently processing large batches of queries where the kernel has a regularized instruction flow and can exploit the high memory bandwidth and massive parallelism afforded by the architecture. The CPU is much better at processing irregular instruction flows, and thus, is wellsuited to treebased indexes that are comprised of many branch instructions in comparison to the GPU. We denote the query points assigned to the CPU and GPU as and respectively, where .
Regarding the CPU KNN algorithm that we employ, we use the publicly available^{1}^{1}1The publicly available ANN algorithm can be found here: http://www.cs.umd.edu/~mount/ANN/. Approximate Nearest Neighbors (ANN) implementation for the CPU [7] that uses a kdtree index. The algorithm is efficient for both approximate and exact solutions to the KNN problem, and we execute the algorithm such that we obtain the exact nearest neighbors. We refer to the approach that we use in HybridKNNJoin as ExactANN. We parallelize the algorithm using a sharednothing MPI implementation. We simply have each process rank construct its own index of , and then each rank finds the nearest neighbors of a fraction of the points in , as assigned to the ranks in a round robin fashion.
Our GPU approach uses a single range query distance to find a number of nearby points that may satisfy the KNN query for the points in . However, unlike treebased approaches that perform backtracking [9] such that they guarantee neighbors are returned, we allow for the scenario where some of the searches of the points in fail (i.e., they find nearest neighbors). Otherwise, by allowing each query point in to use a different or expanding search radius, the performance of the GPU kernel will degrade due to divergent execution pathways. We refer to the GPU component of HybridKNNJoin as GPUJoin.
As will be elaborated below, query points that fail the KNN search on the GPU would be better suited for execution on the CPU. Consequently, we assign these failed GPU query points to the CPU for subsequent execution by ExactANN.
VC GPUJoin: Selection of the Search Distance
The input parameter to a KNN search is ; however, a range query needs an search distance. While analytically defining a good value of may be feasible in low dimensionality on datasets with wellknown distributions, realworld and highdimensional datasets often have data distributions that make an analytical approach intractable. Thus, we use a lightweight empirical technique to find a good value of to use for the range query. We motivate the problem of selecting below.
VC1 Motivating Example – KNN Failure and the Search Distance
Consider a search distance, , that on average finds neighbors per . As this is an average distance, some points may find a large number of neighbors, and some points may find few or none. Consider that if we use a search radius for all range queries, the total result set size , will be . Consider a dataset with points (thus executing all of the query points on the GPU), and . We select the range query distance such that the total result set size is . We examine the fraction of that has found at least neighbors using a pathological scenario outlined as follows: we allow a fraction of points to only find themselves (thus finding no neighbors); and, another fraction of points finds neighbors and some number of extra neighbors (e.g., the range query returns 10 instead of neighbors for a subset of ).
Figure 2 shows the fraction of the points in the dataset that have at least neighbors. In the example, ideally, we find exactly neighbors for each point; therefore there are no extra neighbors found per point (shown as the 0 extra neighbors bar in Figure 2). Thus, in this idealized case, we solve the KNN search by a single range query for each point. This is unrealistic, given that some range queries will find more than neighbors for a subset of the points.
If a fraction of points find neighbors and 1 extra neighbor per point (e.g., they all return 6 neighbors), then the total fraction of the dataset that satisfies the KNN query is only 80%. This scenario is also optimistic, as some range queries may return a much larger number of neighbors. If each point having at least neighbors finds 20 extra neighbors, then the fraction of the dataset that has found at least neighbors is only 20%. Thus, if we select , then we are unlikely to solve a large fraction of . Consequently, we advocate for selecting a range query distance , such that we satisfy finding KNN for a large fraction of .
VC2 Empirically Selecting the Search Distance
We rely on the execution of two GPU kernels (unrelated to the previous work in Section IV) that sample the dataset to determine a good value of , which are described as follows.
First, we simply sample , and compute the mean distance between points, denoted as . Next, we define a number of bins, , that store the frequency of the distances between pairs of points that fall within the distance bin, where the width of each bin is . We then select a fraction of the total points in the dataset and compute the distance between each of these points and every other point in , and store the distances in the respective bin, where any distance is not stored. We use this distance cutoff because performing a range query using a distance of will return a large fraction of the total dataset (far more than any reasonable value of ). We compute the cumulative number of points in each bin. Let denote the distance bins, where . Each stores: its distance range denoted as , where , and ; the number of points found within its distance range , denoted as ; , and the cumulative number of points in the bin (including bins with points at lower distances), denoted as , where .
This yields a relationship between the search distance and the average number of neighbors. This is similar to the procedure used to create a distance diagram that can be used in other contexts, such as selecting a value of to be used in the DBSCAN clustering algorithm [29].
We denote the value of corresponding to the query distance that yields cumulative neighbors as , i.e., , where .
We select a range query distance that considers the following: increasing increases the probability of finding at least neighbors with a range query, and thus increases the number of query points that can be successfully processed by the GPU instead of the CPU; and increasing too much can be computationally expensive, as many points may be found within which causes a significant amount of filtering overhead. We parameterize the selection of as follows using a parameter that increases as a function of the cumulative neighbors, . Let , where . Thus, if , then .
We use a gridbased index for the GPU (Section IVA) where the grid cell length is the same as the search radius. We assume that we want to select a search distance such that on average the number of neighbors found by a search with are located within a single cell. Therefore, we select , such that the distance is circumscribed within a cell (this holds for any dimension concerning an sphere and cube). Assuming that the data is uniformly distributed, this means that there is a high probability of a range query finding at least neighbors within a cell, as was derived experimentally to find the average cumulative distance that finds at least neighbors. Figure 3 shows a 2D example of the range query circumscribed within a cell, and the final value of .
As we will see in the next section, we discuss how we split the work between the CPU and GPU, where cells containing a sufficient number of points are assigned to the GPU. As we increase , then we create more opportunities to successfully solve the KNN query on the GPU for a larger fraction of . Thus, impacts whether the CPU or GPU processes a given query.
VD Dividing the Work Between the CPU and GPU
The GPU should execute range queries for points in dense regions, and the CPU should perform the KNN search in sparse regions (Figure 1). To determine if a point should be executed on the CPU or GPU, we use the grid index of length that will be sent to the GPU (described in Sections IVA and VC2).
We approximate the lower bound on the minimum number of points needed to be within a cell such that we assign a query point to the GPU. We find which cell a point falls within (this has already been computed when constructing the index). We use the distance described above (see Figure 3). Assuming the points are uniformly distributed within the cell, and the query point is located at the center of the cell, the minimum number of points needed to have at least neighbors requires considering the ratio of the volume of the cube to the sphere as the volume of the circumscribing cube is larger than the sphere. Therefore, at minimum, points in the cell are probabilistically needed to ensure that at least neighbors may be found within . The minimum number of points needed in a cell, , are as follows:
(1) 
Note the following: when indexing dimensions (Section IVC), then the dimensionality becomes in Equation 1; and, this is the lower bound on because the range query distance was defined to be . Thus, while we require points within a cell, the range query of is larger and has a greater chance of returning more neighbors than a search with .
We define a point threshold using . Let . Given a query point found in cell with points, we assign the query point to the GPU as follows: . The parameter allows us to select query points to send to the GPU if they are located in dense regions. If , then the expected number of neighbors within needs to be at least . If , then only expected neighbors are needed. The points assigned to the CPU, , are simply the points not assigned to the GPU, .
VE GPU KNN Failure and Point Reassignment to the CPU
Each query point in may not find at least neighbors within the range query, as are only probabilistically likely to have at least neighbors. For instance, a query point may be located on the edge of the grid, where there are fewer adjacent cells containing neighbors. Furthermore, since the points in realworld datasets are unlikely to be uniformly distributed, it is possible that many of the points within a cell will be outside of . Since we do not have a clairvoyant method of knowing whether each query point assigned to the GPU will satisfy the KNN query, if a point does not find at least neighbors, the points are reassigned to the CPU to be executed by ExactANN after it has completed processing . These query points that failed finding neighbors should have been initially assigned to CPU; however, the method outlined in Section VD is intended to be a lightweight method for splitting the work between CPU and GPU resources. More sophisticated methods could be employed at the expense of additional initial computation costs; however, probabilistic approaches cannot guarantee finding at least neighbors. Figure 4 shows an example where and the query point (red) only finds 1 neighbor. In this case, the red query point is assigned to the CPU for subsequent execution.
Depending on the dataset, the reassignment of failed GPU query points to the CPU may never occur. Also, and can be selected such that we are able to assign the GPU query points that have a very high probability of finding neighbors.
VF Large Workloads and the Division of Work
HybridKNNJoin is intended to be used when there is a large amount of work (e.g., large datasets, , high dimensionality). Small workloads should be directly executed on the CPU. Given the choice of and parameters (Sections VC2 and VD), the work may be biased towards executing on the GPU. Consider a large workload where the number of query points assigned to the GPU is much larger than the CPU, . This means that the GPU has far more work to execute than the CPUs. To avoid idle CPU cores, we set a minimum number of queries assigned to the CPU so that CPU cores are not idle for the majority of the execution. We set a parameter in the range [0, 1], that denotes the minimum fraction of the work assigned to the CPU, . If the values of and yield , then query points are reassigned from to , and they are those found within cells with the least number of points in , and are thus likely to be those with the smallest amount of work.
Observe that in the case where , we do not force a minimum number of queries to be executed on the GPU. This is because the total workload is likely to be small (Section VD); therefore, the GPU should not be utilized. Setting a lower limit on the number of query points assigned to the CPU means that fewer points in are likely to fail execution on the GPU (Section VE), as the points reassigned to are those that are found in lower density regions.
VG GPU: Optimizing Task Granularity
In the selfjoin work that we leverage [22, 26], a single thread is assigned to each point in the dataset, where the thread finds all points within of its assigned point. This approach was reasonable because the total number of threads is . When executing the range queries across multiple batches (Section IVB) the number of threads per batch is typically sufficient to saturate the GPU’s resources. However, when performing HybridKNNJoin, ; therefore, if we assign one thread per point in , then the GPU’s resources may be underutilized. Furthermore, the GPU hides high memory latency by performing fast context switching between resident threads, and needs at least a factor of a few more threads than cores to saturate the resources.
We divide the work of the distance calculations between multiple threads as assigned to each point in . The key idea is to increase the granularity of the parallel tasks. Figure 5 shows an example of using multiple threads per query point. The query point (red) is shown in the middle cell and an adjacent cell (dashed blue outline), where the distances between the query point and the six points are computed. Instead of using one thread per point, this example shows three threads each computing the distances between two points.
We describe two approaches for assigning threads to points and discuss the benefits and disadvantages of each.

Static number of threads per point (tstatic) – Assign a static number of threads to each point for performing the distance calculations. One thread per point, which can saturate the resources when is large. A benefit of tstatic is that the number of threads per point can be selected to reduce intrawarp thread divergence. For instance, if 32 threads per point are selected, then a warp will compute the distance between a given point and its neighboring points. There should be low divergence because each thread is taking nearly the same execution pathway. The drawbacks are: if the number of threads that are selected is too high, then there is overhead of launching the threads; and many of the query points may not require a large number of threads as there are too few neighbors (these threads will have little work).

Dynamic number of threads per point (tdynamic) – Assign a lower bound on the total number of threads to use at each kernel invocation. These threads are then evenly divided and assigned to the query points to perform the distance calculations. The advantage of this approach is that the GPU is guaranteed to execute at least a particular number of threads. The drawback of the approach is that a dynamic number of threads per point may not evenly divide into the warps like the static partitioning strategy, thus increasing divergence in the kernel. Similarly to tstatic, since the threads are evenly divided amongst the points in , then the approach may be excessive for points residing in sparser regions.
VH Algorithm Overview
We outline HybridKNNJoin as follows in Algorithm 1. We begin by getting the process rank and importing the data on lines 2–3. We use an MPI implementation and have 1 master GPU rank and several CPU ranks which begin their primary execution on lines 5 and 17, respectively. The GPU rank initalizes the result set (line 6), and then reorders the data by variance (Section IVD) on line 7. Next, we select the value of to be used for GPUJoin (Section VC) on line 8, and then construct the index as a function of , and the number of indexed dimensions (Sections IVA and IVC) on line 9. Next, we split the work between CPU and GPU (Section VD) on line 10. Using the batch estimator, the number of batches is computed on line 11 (the batch estimator determines the total number of batches so that GPUJoin can process result sets larger than global memory, see Section IVB).
The algorithm loops over all of the batches (line 12). At each iteration, the GPU join kernel is executed (line 13), which computes the result set for a single batch. On line 14 the result of the join operation is filtered (the result is in the form of key/value pairs which are filtered to reduce duplicate keys), and checked to ensure that each point in contains at least neighbors. On line 15, after all of the batches have been computed, those query points executed on the GPU that have neighbors are assigned to the query set, , to be executed by the CPU.
While the GPU is computing , the CPU processes are computing beginning on line 18. Any query points in are executed on lines 19–20.
We describe the GPU join kernel here, but refer the reader to [22] for more detail. To adapt the selfjoin kernel [22] for the purposes of HybridKNNJoin, we make two minor modifications. First, we add a query set, as we do not want to compare all points to each other, as range queries are only needed for those points in . Second, we allow multiple threads to process an individual point (Section VG). In the GPU join kernel shown in Algorithm 1, the result set is initialized (line 24), and then the global thread id is computed (line 25). Next, the query point assigned to the thread is stored (line 26), and a loop iterates over all adjacent cells (lines 27–28). The point assigned to the thread is compared to all points in the adjacent cells, where a result is stored when a point is found to be within of the query point (lines 29–30). The result is stored as key/value pairs, where the key is the query point id, and the results are both the point id within of the key, and the distance between the points.
If more than one thread is used to compute the distance between a query point and points in neighboring cells, then each thread only computes a fraction of the total points in the cell on line 29 (illustrated in Figure 5).
In the pseudocode, we omitted the minor synchronization tasks between the GPU and CPU. In particular, the CPU processes cannot begin execution of ExactANN until the work has been split between CPU and GPU (line 10), and the CPU processes cannot return until after has been generated (line 15), and processed (lines 19–20).
Some of the tasks in Algorithm 1 execute concurrently. First, multiple batches are used to incrementally perform the join (Section IVB). We use 3 CUDA streams (and threads on the host) to overlap data transfers to and from the GPU and to filter the result set of the join operation (lines 12–14). There is one GPU process (to execute lines 5–15), and several CPU processes that execute ExactANN on lines 17–20.
Vi Experimental Evaluation
Via Datasets
We utilize realworld datasets in the evaluation. We summarize the data characteristics in Table I. All datasets were obtained from the UCI ML repository [30]. The datasets are as follows: Color Histogram, CHist, 32D image features; Supersymmetry Particles, SuSy, 18D – properties of 5 million particles from the Large Hadron Collider [31]; ; Song Prediction Dataset, Songs, 90D – features of songs, with 415,345 points [32]; and Free Music Archive, FMA, 518D – features of songs with 106,574 tracks (points) [33].
Dataset  Dataset  
SuSy  18  Songs  515,345  90  
CHist  68,040  32  FMA  106,574  518 
The datasets have been selected because they encompass several application scenarios and KNN workloads. The major data characteristics that impact the amount of work required of any KNN search is the dimensionality (), number of data points (), and data distribution. These datasets span a range of these properties, and provide a good testbed for comparison between methods. We note that HybridKNNJoin is designed for large workloads that can concurrently exploit both multicore CPUs and the GPU; however, we also test our approach on smaller workloads (CHist and FMA are small, where the latter is of high dimensionality).
ViB Experimental Methodology
All HybridKNNJoin CPU code is written in C++, compiled using the GNU compiler (v. 5.4.0) with the O3 compiler flag. The GPU code is written in CUDA v. 9. We use OpenMPI v. 3.1.1 for parallelizing hostside tasks. Our platform consists of an NVIDIA GP100 GPU with 16 GiB of global memory, and has 2 E52620 v4 2.1 GHz CPUs, with 16 total physical cores. We configure HybridKNNJoin to use 16 processes ranks in total (1 for GPUJoin, and 15 for ExactANN), corresponding to the number of physical cores. The GPUJoin kernel uses 256 threads per block, and is executed with 64bit floats. In all experiments, we exclude the time needed to load the dataset or construct the index (we will elaborate on this in Section VIC). The response time of the main operation (performing the KNN search on the CPU and GPU) is measured after the indexes have been constructed by the ExactANN process ranks. With the exception of loading the dataset and indexing, all other components of the algorithm are accounted for in the response time measurements. All response time measurements reported are averaged over 3 trials. Optimizations, parameters and algorithms are summarized in Table II.
Optimization/ Parameter/ Algorithm  Description 
The dimensionality of the data.  
Number of indexed dimensions in the GPUJoin indexing scheme, where .  
reorder  Reorders data by variance to improve index pruning power. 
shortc  Short circuits the distance calculation when the running total distance exceeds . 
Parameter in the range [0, 1]. Increasing increases the size of in the grid. A larger value assigns more points to be processed by the GPU.  
Parameter in the range [0, 1] that defines a threshold number of points that are needed within a cell containing a query point such that the point is assigned to the GPU. A larger value increases the density of points needed for GPU execution.  
Parameter in the range [0, 1] that defines the minimum number of query points assigned to the CPU.  
tstatic  A static number of threads assigned to each query point on the GPU for distance calculations. 
tdynamic  A minimum total number of threads are used per kernel invocation which are assigned to each query point on the GPU for distance calculations. 
ExactANN  Parallel CPU component of HybridKNNJoin. 
GPUJoin  GPU component of HybridKNNJoin. 
HybridKNNJoin  The proposed CPU/GPU approach. 
RefImpl  Parallel CPUonly reference implementation used to compare against HybridKNNJoin. 
GPUJoinLinear  Selfjoin GPU brute force implementation. 
ViC RefImpl: CPUOnly Parallel Nearest Neighbor Reference Implementation
As discussed in Section VB, we parallelize ANN for the CPU [7], but obtain the exact neighbors. We compare HybridKNNJoin to the parallelized ANN without the GPU, denoted as RefImpl. It is similar to the parallel ExactANN algorithm in Section VB, but executed with an additional process rank, and the query set contains all of (not ). There is no communication between processes and results are written to main memory using MPI shared memory support. We let denote the MPI ranks, where , where is the total number of MPI ranks. Thus, rank is assigned point if , and each rank finds the KNN for points (assuming evenly divides ). The round robin distribution of points to ranks yields nearideal load balancing. We execute RefImpl on 16 ranks/processes.
As described above for HybridKNNJoin, we do not include the time to index because parallel index construction is not the focus of ANN [7]. We only consider the time to perform the KNN search after index construction.
Figure 6 plots the speedup of RefImpl on SuSy and FMA, which are the lowest and highest dimensional datasets that we consider. RefImpl achieves speedups between 10.04 (FMA), and 12.26 (SuSy) on our 16 core platform. This is a reasonable level of scalability given that RefImpl performs many memory operations, particularly when traversing its kdtree index, which may limit scalability due to the memory bottleneck.
ViD GPUJoinLinear: Brute Force SelfJoin Lower Bound
Index performance degrades in high dimensionality. To assess GPUJoin index efficacy, our brute force algorithm assigns one thread to each query point which then performs a linear search to find all other points in of . We only include the kernel execution time, and the time to allocate buffers on the host and device, but exclude the time to filter the neighbors of each point, and the time to return results to the host. By not returning result set batches to the host, we can execute a single kernel, which yields a lower bound response time. In principle, finding all of the neighbors of each point can then be used to find the KNN for any value of ; however, in practice a brute force approach can simply limit the result set size by only storing results within (as is the case for GPUJoin).
Figure 7 shows the brute force response time on three of the datasets for three values of (the values are representative of those used in GPUJoin when we compare our approach to the reference implementation). The figure shows that performance is independent of , as all points are compared to each other.
ViE Results
ViE1 GPU Kernel Task Granularity
GPUJoin uses a number of threads to process each point (Section VG). The number of batches, , needed to compute the join across all determines the number of query points executed per batch. We compare tstatic and tdynamic performance using values of that saturate GPU resources. On small datasets the response time can be dominated by overheads, and we cannot observe large differences in kernel performance. These overheads include: hostGPU data transfers, pinned memory allocation, selecting , and reorder. These overheads (non KNN selfjoin tasks) are amortized on large workloads.
Dataset  K  tstatic (Threads)  tdynamic (Min. Threads)  
1  8  32  
SuSy  1  756.72  238.47  624.42  756.95  280.99  529.93 
CHist  10  1.44  1.08  1.33  1.10  1.39  3.45 
Songs  1  1022.04  1023.25  1025.65  1023.91  1022.55  1025.69 
FMA  10  20.03  12.29  12.28  12.36  12.28  13.30 
From Table III, we find that on the Songs dataset, the performance across all kernel configurations is consistent, which shows that even when assigning one thread per query point, the GPU resources are saturated. However, using tstatic and assigning one thread leads to poor performance on the other datasets (SuSy, CHist, and FMA). tstatic with 8 threads per query point outperforms the other configurations, achieving a speedup on SuSy over tdynamic with threads of 1.18. Furthermore, on the Songs and FMA datasets, 8 threads per point is within 1% of the response time of the best kernel configurations. tstatic generally outperforms tdynamic because in the latter kernel, a query point can span multiple warps, and there is more thread divergence. Using 8 threads per query point reduces divergence, as a fixed number of query points (32/8=4) occupy a warp. In all that follows, we configure HybridKNNJoin to use tstatic with 8 threads per point. However, we note that on Songs and FMA, using tdynamic would not significantly degrade performance.
ViE2 Workload Division
We begin by examining the effect of the and parameters and leave unconstrained (). Recall that when splitting the work between the CPU and GPU, increases the grid cell size, such that more points fall within each cell and this enables more query points to be sent to the GPU (Section VC), and increases the density of nearby points needed to compute a query point on the GPU (Section VD). In all that follows, we use the reorder and shortc optimizations. Using GPUJoin, we index dimensions on all datasets, thus reducing the number of indexed dimensions compared to the data dimensionality.
Figure 8 plots the response time to find the nearest neighbors shown vs. for a range of values on all datasets. We select such that the overheads of using the GPU and preprocessing steps are mostly amortized (as discussed in Section VIE1). On the SuSy, CHist, and FMA datasets in Figure 8 (a), (b), (d), we observe that performance significantly degrades with . Increasing increases the GPUJoin value, which increases the search distance and amount of work needed to find neighboring points. On the SuSy, CHist, and Songs datasets, Figure 8 (a), (b), (c), the best value of is in the range [0.6, 1.0] (we omit lower values, as the curves overlap). The exception is on the FMA dataset (Figure 8 (d)), where yields the best performance. This dataset has dimensions. Therefore, a low value of sends more query points to the GPU, which is effective at computing the large number floating point operations (FLOP) needed for data of high dimensionality (computing the distance between a pair of points requires FLOP).
The best value of on Songs. A substantial fraction of fail to find at least neighbors using GPUJoin, and they are sent to the CPU for subsequent processing (Section VE). These failures are wasted work; therefore, as increases on the Songs dataset, a larger fraction of find at least neighbors, and there is less wasted work.
We examine the effects of the parameter. On large workloads, the GPU is likely to be assigned the vast majority of the query points, subsequently leaving the CPU cores idle. Thus, the parameter increases to balance the work between ExactANN, and GPUJoin. Increasing can degrade performance, as query points that would be best executed on the GPU are forced to execute on the CPU.
Figure 9 plots the response time vs. for a range of values ( means that all points execute using ExactANN, and means that there is not minimum fraction of queries executed by ExactANN). We examine the SuSy and Songs datasets because they show opposite trends in terms of and . Figure 9 (a) on SuSy shows that and yields the best performance. In contrast, we find that on the Songs dataset in Figure 9 (b), and yields the best performance, which is is in contrast to the other datasets. On the Songs dataset, a small fraction of the query points should be forced to execute on the CPU; also, a large decreases the number of query failures on the GPU (Section VE).
Parameter selection summary– Our examination of the , , and parameters demonstrates the challenges of understanding the amount of work required to find the KNN of each point in high dimensions. There is not a universal set of parameter values that can be selected to achieve good performance on all datasets. We motivate the use of some parameters as follows.
• – Controls the amount of work executed by GPUJoin, which increases with and decreases the fraction of queries that fail to find at least nearest neighbors. Since is selected to obtain at least neighbors on average when , increasing may inadvertently decrease GPUJoin performance. To compute a good set of parameters, a small and large value of should be considered.
• – Controls the threshold density of points needed to execute a query point using GPUJoin. Increasing decreases . The GPU should process dense data regions, and high dimensionality datasets (requiring many FLOP). To enable these scenarios, high and low values of should be considered.
• – Forces a minimum fraction of points to be processed by ExactANN. If , then on large workloads the majority of the points may be sent to the GPU, and the CPU will be underutilized. Also, in cases where the CPU is unsuited to the workload, it is best that the value of be low. impacts load balancing between GPUJoin and ExactANN.
We perform a grid search over the parameter space. Given the above observations, we only select two values for each of the and parameters. We will show that a single value of can be selected and then we can analytically determine a better value of as a function of the CPU and GPU queries.
We test , , and , and permute each set, thus testing a total of 4 sets of parameter values. While some of this information has been shown in Figures 8 and 9, Table IV highlights the total response time when using these permutations of parameters.
SuSy  CHist  Songs  FMA  
0.0  0.0  165.13  0.93  1064.36  14.65 
0.0  0.8  165.64  0.90  1065.49  23.48 
1.0  0.0  1746.46  6.14  747.66  57.01 
1.0  0.8  1724.61  6.28  748.49  55.72 
When executing HybridKNNJoin, we also record two values that will be used to optimize . Let and be the average time needed to find the KNN of a point in (ExactANN) and in (GPUJoin), respectively. The and values exclude all times associated with overheads and preprocessing steps, and are simply obtained when executing HybridKNNJoin (e.g., on the scenarios in Table IV).
We arbitrarily select ; however, this may yield a large load imbalance between GPUJoin and ExactANN. To achieve good load balancing, we want to select such that both components complete their respective workloads at roughly the same time. Assuming we have executed HybridKNNJoin and have obtained and , we model the value of needed to achieve good load balancing, and denote it as , as follows. First, observe that:
(2) 
(3) 
To achieve good load balancing, we select such that:
(4) 
(5) 
Equation 6 yields the value of needed to achieve good load balancing between the CPU and GPU components of HybridKNNJoin, assuming and have been obtained for an arbitrarily selected value. Equation 6 makes two implicit assumptions: while we only consider the number of queries in that find at least neighbors in the calculation of , assumes that no queries fail to find at least neighbors; and the average time to solve a query ( and derived from an arbitrary selection of ) will be the same after selecting . This is an unlikely scenario, as ExactANN or GPUJoin may be assigned a large fraction of queries that lead to more or less work on average than the original assignment of queries used to derive and . Thus, may not achieve nearideal load balancing in practice. An alternative to deriving from analytical arguments is performing a grid search on (like and ), but this would increase the search space, and thus preprocessing.
Initial Time (s) with  (s)  (s)  Time (s) with  Speedup using  
SuSy  1  0.0  0.0  165.13  2.948  5.474  0.650  131.31  1.26 
CHist  10  0.0  0.0  0.93  1.160  1.188  0.506  0.90  1.03 
Songs  1  1.0  0.8  748.49  2.610  4.624  0.151  462.87  1.62 
FMA  10  0.0  0.0  14.65  2.126  1.487  0.412  12.30  1.19 
Table V shows the parameters that yield good performance from Table IV, and the calculation of . Table V also compares the total response time using the arbitrarily selected and . Using improves performance, achieving a speedup over using of up to 1.61 on the Songs dataset. Therefore, the performance of HybridKNNJoin is strongly dependent on the load balancing properties of . In Table V on CHist, the speedup is 1.03, as the workload is too small to utilize the GPU (and little performance gain is achieved from using ). We will show that larger values of benefit from utilizing the GPU.
Selecting the best set of parameters using the above grid search and subsequent analytical selection of is computationally expensive. Table VI shows the same experiment in Table IV, but we only process a fraction of the queries, i.e., , where is the fraction of processed queries. We can recover the best parameters shown in Table IV on a lower computational budget by only processing a fraction of the dataset (compare bold face values in Tables IV and VI). We sample 1% of the queries for SuSy, Songs, and 3% for CHist, and FMA, where the latter two datasets have the fewest number of points, and thus need a larger sample size because of the overheads (discussed in Section VIE1). Thus, we can select good parameters using an inexpensive approach for a given dataset and value of .
Comparing Tables IV and VI, we can see that the response times in the latter table on SuSy and Songs are not 1% of the former table (or 3% of CHist and FMA). This is due to the initial overheads that are not amortized when processing a fraction of the dataset. However, when using larger values of when sampling the dataset to find good parameter values, we find that much more of this overhead is amortized. Low values of are more difficult to quantify parameters for due to these overheads, as will be shown in Section VIE3.
SuSy  CHist  Songs  FMA  
0.0  0.0  30.05  0.24  21.98  5.09 
0.0  0.8  30.00  0.23  21.85  4.83 
1.0  0.0  48.51  0.96  18.43  10.35 
1.0  0.8  48.71  0.99  18.56  10.52 
ViE3 Comparison with the Reference Implementation
We compare the performance of HybridKNNJoin to RefImpl. Recall that the major difference between ExactANN and RefImpl is that the latter executes using an additional processes. We showed in Section VIE2 that we can select good parameter values with a low computational budget. We begin by determining good parameter values across a range of values of . We compute for selected values of by sampling the datasets using , as described in Section VIE2. We could derive good parameter values without sampling the dataset (set ), and obtain more accurate values of at the expense of increased overhead. However, we aim to estimate the parameter values without expensive parameter selection preprocessing. We find that good parameter values may be weakly dependent on ; therefore, the parameter selection step can be computed with low overhead and be used in all executions of HybridKNNJoin for different values of .
Figure 10 plots the value of vs. for each dataset. We execute HybridKNNJoin on the datasets and denote the following execution parameters as (, , , ): SuSy (0, 0, 0.5, 0.02), CHist (0, 0, 0.5, 0.5), Songs (1, 0.8, 0.5, 0.02), FMA (0, 0, 0.5, 0.5). Note that like the procedure outlined in Section VIE2, we select and then compute a good value for . On the SuSy, CHist, and FMA datasets, we use and (these values yield good performance in Table IV). Figure 10 shows that on SuSy, CHist, and FMA, the value of is consistent when . This shows that on these datasets and range of values, the performance of ExactANN and GPUJoin degrades at the same rate with increasing . increases with on Songs, implying that as the workload increases, a larger fraction of the dataset should be processed by ExactANN to achieve good load balancing. We used on SuSy and Songs, and on CHist and FMA, because the latter datasets are smaller and the response time may be dominated by overheads (as described above).
Figure 11, plots the response time vs. across all datasets and compares HybridKNNJoin to RefImpl and GPUJoinLinear. We find that HybridKNNJoin outperforms RefImpl across roughly all values of on each dataset. However, there is a large difference in the relative speedup between approaches across the datasets. As expected, the lower the value of , the greater the performance gain, as a larger fraction of the dataset is processed by the GPU. For example, SuSy has , and the speedup over RefImpl ranges between 1.25 and 1.35 (Figure 11 (a)). In contrast, Songs has , and the speedup ranges between 1.61 and 2.56 (Figure 11 (c)). GPUJoinLinear is substantially slower than HybridKNNJoin and RefImpl. We configured RefImpl with the derived value (Section VC) corresponding to the median in the plots on each dataset. However, the performance of GPUJoinLinear is independent of (Figure 7).
Vii Discussion and Conclusions
In this paper, we demonstrate a hybrid CPU/GPU approach to the KNN selfjoin problem. However, we depart from traditional heterogeneous approaches, which include: a producerconsumer model where the computation is split between the CPU and GPU; or using a work queue to assign query points to the CPU or GPU. Instead, we split the work between the CPU and GPU components based on the characteristics of the workload. This allows us to assign query points to each architecture as a function of which architecture (and associated algorithm) is best suited to the workload. We find that some datasets are more efficiently executed on the CPU or GPU components of HybridKNNJoin.
To exploit both architectures using the ExactANN and GPUJoin components of HybridKNNJoin, we quantify the workload based on data density and load balancing using several parameters. Due to associated problems with high dimensional data, the parameters that lead to good performance need to be quantified experimentally, and cannot be entirely derived analytically. We find that we can recover these parameters using a small grid search using a fraction of the points in the dataset, and thus on a low computational budget. Therefore, the HybridKNNJoin parameters that impact performance (, , and ) can be first computed as a function of on each dataset, and then the parameters can be used for all future executions. The best parameters were independent of with the exception of one dataset (Songs).
Since the parameters split the dataset into two sets of query points to be executed by the CPU and GPU based on the workload, our hybrid approach is able to accommodate new algorithmic advances proposed by the community.
Future work includes a distributedmemory implementation, and applying the lessons learned from dataintensive workload characterization to other data analysis problems.
References
 [1] “Nvidia Volta,” http://images.nvidia.com/content/voltaarchitecture/pdf/voltaarchitecturewhitepaper.pdf, accessed: 20180917.
 [2] D. Foley and J. Danskin, “UltraPerformance Pascal GPU and NVLink Interconnect,” IEEE Micro, vol. 37, no. 2, pp. 7–17, 2017.
 [3] Y. Zhang, H. Ma, N. Peng, Y. Zhao, and X.b. Wu, “Estimating Photometric Redshifts of Quasars via the knearest Neighbor Approach Based on Large Survey Databases,” The Astronomical Journal, vol. 146, p. 22, 2013.
 [4] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A kmeans clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
 [5] G. Karypis, E.H. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” Computer, vol. 32, pp. 68–75, 1999.
 [6] N. Roussopoulos, S. Kelley, and F. Vincent, “Nearest neighbor queries,” in Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, 1995, pp. 71–79.
 [7] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” Journal of the ACM, vol. 45, no. 6, pp. 891–923, 1998.
 [8] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 2227–2240, 2014.
 [9] M. Nam, J. Kim, and B. Nam, “Parallel Tree Traversal for Nearest Neighbor Query on the GPU,” in 45th Intl. Conf. on Parallel Processing, 2016, pp. 113–122.
 [10] J. Kim and B. Nam, “Coprocessing heterogeneous parallel index for multidimensional datasets,” Journal of Parallel and Distributed Computing, vol. 113, pp. 195 – 203, 2018.
 [11] C. Xia, H. Lu, B. C. Ooi, and J. Hu, “Gorder: an efficient method for KNN join processing,” in Proc. of the Intl. Conf. on Very Large Data Bases, 2004, pp. 756–767.
 [12] A. Andoni and P. Indyk, “Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions,” in IEEE Symposium on Foundations of Computer Science, 2006, pp. 459–468.
 [13] C. Yu, B. Cui, S. Wang, and J. Su, “Efficient indexbased knn join processing for highdimensional data,” Information and Software Technology, vol. 49, no. 4, pp. 332–344, 2007.
 [14] B. Yao, F. Li, and P. Kumar, “K nearest neighbor queries and knnjoins in large relational databases (almost) for free,” in IEEE 26th Intl. Conf. on Data Engineering, 2010, pp. 4–15.
 [15] A. Andoni and P. Indyk, “ELSH 0.1 User Manual,” http://www.mit.edu/~andoni/LSH/manual.pdf, 2005.
 [16] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” CACM, vol. 51, no. 1, pp. 107–113, 2008.
 [17] W. Lu, Y. Shen, S. Chen, and B. C. Ooi, “Efficient Processing of K Nearest Neighbor Joins Using MapReduce,” Proc. VLDB Endow., vol. 5, no. 10, pp. 1016–1027, 2012.
 [18] C. Zhang, F. Li, and J. Jestes, “Efficient parallel kNN joins for large data in MapReduce,” in Proc. of the 15th Intl. Conf. on Extending Database Technology, 2012, pp. 38–49.
 [19] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” CACM, vol. 18, no. 9, pp. 509–517, 1975.
 [20] R. A. Finkel and J. L. Bentley, “Quad trees a data structure for retrieval on composite keys,” Acta informatica, vol. 4, no. 1, pp. 1–9, 1974.
 [21] A. Guttman, “Rtrees: a dynamic index structure for spatial searching,” in Proc. of ACM Intl. Conf. on Management of Data, 1984, pp. 47–57.
 [22] M. Gowanlock and B. Karsin, “GPU Accelerated SelfJoin for the Distance Similarity Metric,” in Proc. of the 2018 IEEE Intl. Parallel and Distributed Processing Symposium Workshops, 2018, pp. 477–486.
 [23] J. Kim, W.K. Jeong, and B. Nam, “Exploiting massive parallelism for indexing multidimensional datasets on the gpu,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 8, pp. 2258–2271, 2015.
 [24] D. V. Kalashnikov, “SuperEGO: fast multidimensional similarity join,” The VLDB Journal, vol. 22, no. 4, pp. 561–585, 2013.
 [25] M. D. Lieberman, J. Sankaranarayanan, and H. Samet, “A fast similarity join algorithm using graphics processing units,” in IEEE 24th Intl. Conf. on Data Engineering, 2008, pp. 1111–1120.
 [26] M. Gowanlock and B. Karsin, “GPU Accelerated Similarity SelfJoin for MultiDimensional Data,” in Technical Report, 2018. [Online]. Available: https://arxiv.org/abs/1809.09930
 [27] R. E. Bellman, Adaptive control processes: a guided tour. Princeton university press, 1961.
 [28] M. Gowanlock, C. M. Rude, D. M. Blair, J. D. Li, and V. Pankratius, “Clustering Throughput Optimization on the GPU,” in Proc. of the IEEE Intl. Parallel and Distributed Processing Symposium, 2017, pp. 832–841.
 [29] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A densitybased algorithm for discovering clusters in large spatial databases with noise,” in Proc. of the 2nd KDD, 1996, pp. 226–231.
 [30] M. Lichman, “UCI machine learning repository,” 2013.
 [31] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles in highenergy physics with deep learning,” Nature Communications, vol. 5, p. 4308, 2014.
 [32] T. BertinMahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in Proc. of the Intl. Conf. on Music Information Retrieval, 2011.
 [33] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” arXiv:1612.01840, 2016.