# Solving -Nearest Neighbor Problem on Multiple Graphics Processors

## Abstract

A recommendation system is a software system to predict customers’ unknown preferences from known preferences. In a recommendation system, customers’ preferences are encoded into vectors, and finding the nearest vectors to each vector is an essential part. This vector-searching part of the problem is called a -nearest neighbor problem. We give an effective algorithm to solve this problem on multiple graphics processor units (GPUs).

Our algorithm consists of two parts: an -body problem and a partial sort. For a algorithm of the -body problem, we applied the idea of a known algorithm for the -body problem in physics, although another trick is need to overcome the problem of small sized shared memory. For the partial sort, we give a novel GPU algorithm which is effective for small . In our partial sort algorithm, a heap is accessed in parallel by threads with a low cost of synchronization. Both of these two parts of our algorithm utilize maximal power of coalesced memory access, so that a full bandwidth is achieved.

By an experiment, we show that when the size of the problem is large, an implementation of the algorithm on two GPUs runs more than 330 times faster than a single core implementation on a latest CPU. We also show that our algorithm scales well with respect to the number of GPUs.

## 1Introduction

A recommendation system is a software system which utilizes known customers’ preferences to predict unknown preferences. It is widely used in Internet-based retail shops and other service providers, such as Amazon.com [2] for example. In a recommendation system, customers’ preferences or buying patterns for items are encoded into vectors and finding nearest vectors is an essential part of its computation. This vector-finding part is called a -nearest neighbor problem. We give an effective GPU algorithm to solve this problem.

Generally a recommendation system deals with large samples and large dimensions, as personitem for example. In such a case, the dimensionality reduction method such as singular value decomposition or latent Dirichlet allocation has been widely used [4]. As the result of the reduction, the problem becomes the -nearest neighbor search for a moderate dimension. However, the effect of the sample size is and it is a computational burden. Therefore some approximation has been considered to be necessary [7]. This paper indicates that strict computation in practical time is possible. Our target size for is to , for the dimension after reduction is to .

The -nearest neighbor problem is defined as follows: when a set of vectors , distance function and an integer is given, find nearest vectors to each . We propose an effective and scalable algorithm to solve it on multiple Graphics Processor Units (GPUs). Our algorithm is implemented in CUDA [1], which is extension of C language provided by NVIDIA.

A GPU is a powerful commodity processor. Although a GPU is originally designed for processing of graphics, the movement of the GPGPU (General Purpose computing on GPU) has arisen as an expected breakthrough for a large scale numerical computation. The typical characteristic of the GPGPU is highly massive parallelism. A GPU has hundres of cores, and to extract its power, it is necessary to run tens of thousand of threads per unit. Because of that property, a GPU consumes large energy as a unit, but it is energy effective per FLOPS.

The algorithm of the -nearest neighbor problem is fundamentally a combination of -body problem and partial sorting. Nyland et al. [8] showed an effective algorithm for -body problem on CUDA. Because dealing with high dimensional vectors, we give some trick in addition to the known -body algorithm. About sorting, [9] showed an effective algorithm, but we have employed another algorithm because we have to sort many arrays at once and we only need to have top element not fully sorted data.

Garcia et al. [?] showed a GPU algorithm to compute the -nearest neighbor problem with respect to Kullback-Leibler divergence. Their algorithm mainly uses a texture memory, which in effect, works as a cache memory. Its performance largely depends on the cache-hit ratio, and for a large data, it is likely that a cache miss occurs frequently. On the other hand, our algorithm utilizes maximal power of coalesced memory access, so that such loss as a cache miss never happens. Moreover, our algorithm is effective even for a symmetric distance function and for multiple GPUs.

The rest of this paper is organized as follows. In Sect. Section 2, outline of CUDA’s programming model is explained. In Sect. Section 3, we define the problem formally. We give overview of the algorithm in Sect. Section 4. In Sect. Section 5 and Section 6, we explain the detail of each step of the algorithm. In Sect. Section 7, we show the result of experiment. We conclude in Sect. Section 8.

## 2Programming model of CUDA

In this section, programming model of CUDA is briefly explained. For more details of CUDA, refer to [10].

**Thread model.** NVIDIA’s recent graphics processor contains hundreds of *stream processors* (SPs). An SP is like a core in a CPU; it can compute simultaneously. For example, GTX280 has 240 SPs. With such many SPs and very low cost of context switch, a GPU performs well for tens of thousands of threads. Threads are divided into thread blocks. Each thread block can contain at most 1024 threads. A function to synchronize threads in a block is provided, while there is no such function to synchronize thread blocks. The only way to synchronize thread blocks is to bring back the control to the CPU.

**Hierarchal memories.** Before running a GPU, the CPU must explicitly copy data to the GPU’s memory. The memory in GPU to share the data with CPU is called *global memory*. A thread block is also a unit to share data. Each thread block has a memory to share only in the thread block. It is called *shared memory*. The access to the global memory is relatively slow, and usually copying necessary data to shared memory is better for performance. Although global memory has some gigabytes, shared memory has only 16KB for each thread block. Each thread also has a local memory which is called a *register*. The access to a register is fast, but its size is also limited.

**Coalesced memory access.** In CUDA, for example, if successive 16 threads are accessing the successive 128 bytes in global memory at the same time, the memory access is coalesced. When a memory access is coalesced, it is done in only one fetch while otherwise access by 16 threads takes 16 fetches. Hence, effective utilization of coalesced memory access affects very much the total performance of an application. The detailed condition about when memory access can be coalesced is explained in [10].

## 3Description of the problem

The -nearest neighbor problem is described as follows.

Suppose that a set of vectors and distance function is given. Then output the nearest vectors to each .

In other words, for each , find a subset of indices such that

and

The distance function is arbitrary. Although we use the word “distance”, it does not necessarily need to satisfy the axiom of distance. We assume that is *cumulatively computable*. It means can be computed step by step by referring to each coordinate values. In other words, it is computed with a function and some initial value by and .

In this paper, we only discuss the case when is symmetric: i.e. . In the symmetric case, we can omit the half of distance calculations, and consequently, balancing of the workload becomes more difficult. The algorithm explained in this paper is easily modified for non-symmetric distance function.

## 4Overview of the algorithm

Since we have assumed that is symmetric, we only compute for . For an explanation, we depict the whole problem as a square where the point stands for the computation of . The distances to compute is represented by upper right triangle of the square.

Because of the limitation of the number of threads which can be run at once on GPU, the problem is divided into a first level blocks. We call each of them a *grid* (Fig. Figure 1). Each grid is processed in a GPU at once. A grid can be divided row-wisely into *blocks*, each of which is computed in a thread block. We denote the size of each side of a grid by . It means the grid stands for the region . Similarly we denote the size of a block (i.e. the number of rows in a block) by (Fig. Figure 2). is determined depending on so that the problem can be devided effectively, while is fixed according to the capability of CUDA.

To balance the workload, we assign GPUs as in Figure 3. In other words, the -th row of grids is assigned to -th GPU when or , where means the number of GPUs available. Here note that although it is enough to compute the upper-right part of the problem, each GPU virtually compute the mirror side of the assigned part (see also Figure 4)

To keep the -nearest vectors, we use a heap structure. The heap has at most elements and is in descending order, so that the -th smallest element can be found in . Moreover, each GPU keeps their own heaps to avoid a costly synchronization (Fig. Figure 4). It means each GPU has heaps which stores the nearest elements computed by itself. At the last phase, the heaps of different GPUs are merged in CPU.

Thus the outline of the algorithm is shown in Fig. ?. In this algorithm, the calclation of distances is explained in Sect. Section 5, and how to push the distances to the heaps is described in Sect. Section 6.

## 5Phase 1: calculation of distances

Basically the framework of the process to compute the distances of vectors is the same as the algorithm of -body problem written in [8]. A grid is row-wisely devided into blocks, and each block is assigned to a thread block. Each thread corresponds to a row. A block first copies a fixed number (which we denote by ) of columns to the shared memory. Then compute the distances.

However, in our problem, since the dimension is large, it is not possible to copy all the coordinate data to the shared memory even for a small . Hence, a thread iteratively reads a fixed number of coordinate values of corresponding vectors. In other words, if is expressed as , then are read in -th iteration (Fig. Figure 5). If a vector is expressed by single precision numbers, must be a multiple of 32 to utilize full power of coalesced memory accesses.

The algorithm to calculate the distaces for a given grid is shown in Fig. ?. Here, the arguments , , , and are given as re-indexed so that this procedure can calculate for the assigned grid. The index for the block is expressed by , and each block has threads. Each thread is indexed by

## 6Phase 2: taking smallest elements

In the second phase, each thread block is assigned to each row. The smallest distances are computed by parallel processing of threads in the block. If the number of thread in a block is denoted by , each thread read distances skipping , so that memory access is coalesced. A thread check if the element is smaller than current -th largest element in the heap, and store it in the local buffer if so. This is because is relatively small than and it is likely that only a few elements is stored in the local buffer. Because of this mechanism, the waiting time is shortened even though when pushing to the heap, the threads must be synchronized.

The algorithm is shown in Fig. ?. Here, the index for the block and thread is denoted by and respectively, and is thread-local array and its size is .

## 7Experiment

We experimented our algorithm on two GTX280’s and one GTX280. For a comparison, we also implemented CPU version and experimented it on Intel i7 920 (2.67GHz). GTX280 is one of the latest NVIDIA’s graphics chips. The algorithm experimented on the CPU is a simple one: it calculates each and pushes it to the corresponding heaps. Note that although Intel i7 has four cores with hyperthreading capability, we only worked on serial algorithm, i.e. it only uses one core.

The distance employed here is Hellinger distance, which often used in the context of statistics. Hellinger distance for two vectors and is defined as:

The result of the experiment for various is shown in Table Table 1. The other parameters are set as and ; and the data is generated randomly. It shows that for a large problem, our algorithm work well from the viewpoint of parallelism of GPUs. Moreover, it also tells the GPUs substantially outperforms the CPU; for a large problem, two GPU implementation is more than 330 times faster than the CPU.

10000 | 20000 | 40000 | 80000 | |
---|---|---|---|---|

GTX280 (a) | 1.8 | 5.7 | 17.7 | 68.6 |

GTX280 (b) | 2.7 | 8.6 | 34.1 | 131.8 |

i7 920 (CPU) (c) | 354.2 | 1419.0 | 5680.7 | 22756.9 |

(c)/(a) | 196.7 | 248.9 | 320.9 | 331.7 |

(c)/(b) | 131.1 | 173.3 | 166.5 | 172.6 |

(b)/(a) | 1.50 | 1.51 | 1.92 | 1.92 |

## 8Conclusion

We introduced an effective algorithm for -nearest neighbor problem which works on multiple GPUs. By an experiment, we have shown that it runs more than 330 times faster than an implementation on a single core of an up-to-date CPU. We have also shown that the algorithm is effective from the viewpoint of parallelism of GPUs. That is because 1) there is no synchronization between GPUs until the very end of the process and 2) the workload is well balanced.

Our algorithm includes simultaneous partial sort of multiple arrays. It minimizes the inter-thread synchronization utilizing the fact that if , most of the data are discarded. About this part of algorithm, we have achieved a good performance but still there is a room for improvement because it uses arrays in a local scope which are stored in a slow global memory in effect. To improve the performance of the simultaneous partial sort is our ongoing work, and we believe this problem alone is also important because it can be applied to other problems.

## Acknowledgment

The authors would like to thank Khan Vo Duc of NVIDIA for giving us a helpful advice about an early version of this paper.

### References

**CUDA Zone.**

NVIDIA: http://www.nvidia.com/object/cuda_home.html**http://www.amazon.com**

Amazon.com.**http://www.netflix.com**

Netflix.**Fast online SVD revisions for lightweight recommender systems.**

Brand, M.: In: In SIAM International Conference on Data Mining. (2003)**Latent dirichlet allocation.**

Blei, D.M., Ng, A.Y., Jordan, M.I., Lafferty, J.: Journal of Machine Learning Research**3**(2003) 2003**Google news personalization: scalable online collaborative filtering.**

Das, A.S., Datar, M., Garg, A., Rajaram, S.: In: WWW ’07: Proceedings of the 16th international conference on World Wide Web, New York, NY, USA, ACM (2007) 271–280**Approximate nearest neighbors: towards removing the curse of dimensionality.**

Indyk, P., Miller, W.: In: In proceedings of the 1998 Symposyum on Theory of Computing. (1998)**Fast -body simulation with CUDA.**

Nyland, L., Harris, M., Prins, J.: In: GPU Gems III.**A practical quicksort algorithm for graphics processors.**

Cederman, D., Tsigas, P.: In: ESA ’08: Proceedings of the 16th annual European symposium on Algorithms, Berlin, Heidelberg, Springer-Verlag (2008) 246–258**CUDA 2.1 programming guide.**

NVIDIA: http://www.nvidia.com/object/cuda_develop.html (2008)