Massively Parallel Construction of the Cell Graph

The research was funded by the National Science Center, decision DEC-2012/07/D/ST6/02483.
Motion planning is an important and well-studied field of robotics. A typical approach to finding a route is to construct a cell graph representing a scene and then to find a path in such a graph. In this paper we present and analyze parallel algorithms for constructing the cell graph on a SIMD-like GPU processor.
Additionally, we present a new implementation of the dictionary data type on a GPU device. In contrast to hash tables, which are common in GPU algorithms, it uses a search tree in which all values are kept in the leaves. With such a structure we can efficiently perform dictionary operations on a set of long vectors over a limited alphabet.
Motion planning is a common task in robotics and artificial intelligence. One of the aims is to find a path which can be traversed by a rigid body (e.g. a robot) to get to the destination point while avoiding collisions with obstacles. Dobrowolski [3] considered the problem of motion planning in space (i.e. rotations about the origin in the Euclidean 3-space). He presented algorithms for constructing the cell graph, i.e. a graph representation of the configuration space. It is worth noting that these algorithms work for any space, not only for the space of rotations. Although the algorithms presented by Dobrowolski proved to be significantly faster than the naive approach, their running time was not acceptable for complicated scenes. As the main contribution of this paper, we present and analyze parallel extensions of these algorithms for GPU processors.
One of the parallel algorithms introduced in Section 3 uses a variation of a binary search tree, in which all elements are kept in the leaves. There are several known implementations of a binary search tree on the GPGPU (see for example [7]). Our implementation allows for an efficient execution of dictionary operations on a set of long vectors over an alphabet of constant size. As a dictionary is a fundamental data type, widely used in many applications, we believe that our solution may be interesting and important on its own.
1.1 Definitions and basic properties
Let n be a positive integer. By [n] we denote the set {0, 1, ..., n-1}. By Σ^n we denote the set of all vectors of length n over the alphabet Σ. The i-th coordinate (for binary vectors called the i-th bit) of a vector v is denoted by v[i]. The coordinates are indexed in zero-based convention, i.e. v = v[0] v[1] ... v[n-1]. For i, j such that 0 ≤ i ≤ j < n, by v[i..j] we denote the segment v[i] v[i+1] ... v[j].
The Hamming distance of two binary vectors u, v ∈ {0,1}^n, denoted by H(u, v), is the number of positions i such that u[i] ≠ v[i]. Observe that H is a metric function, so it satisfies the triangle inequality: H(u, v) ≤ H(u, w) + H(w, v). From this it follows that: |H(u, w) − H(v, w)| ≤ H(u, v).
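As an illustration, the bound above can be checked directly. The following minimal Python sketch (the names `hamming`, `u`, `v`, `w` are ours, chosen for illustration) demonstrates it:

```python
def hamming(u, v):
    """Hamming distance: number of positions where the vectors differ."""
    return sum(a != b for a, b in zip(u, v))

# The triangle inequality yields a cheap lower bound on H(u, v):
# |H(u, w) - H(v, w)| <= H(u, v), for any reference vector w.
u = [0, 1, 1, 0, 1]
v = [0, 1, 0, 0, 1]
w = [1, 1, 1, 1, 1]
assert abs(hamming(u, w) - hamming(v, w)) <= hamming(u, v)
```

This lower bound is exactly what the heuristic algorithm of Section 3.1 uses to discard non-neighboring pairs without a full comparison.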
2 Problem of Cell Graph Construction for Motion Planning
In this section we describe the notion of the cell graph in motion planning. Although we use a very simple example, similar methods can be (and actually are) used in much more complicated settings (see for example [2, 5]).
2.1 Cells and vectors
Let us consider a system of inequalities (constraints) describing the boundaries (see Figure 1) that partition the space into a number of pairwise disjoint regions, called cells. We say that two cells are neighboring (a robot can move directly from one to the other) if their boundaries share some arc (one point is not enough). Our task is to identify the cells and say which of them are neighbors. Then an obstacle-free route for a robot can be determined using a graph algorithm (see for example [1]). As the scenes (i.e. the space with the arrangement of obstacles) in real-life applications tend to be very complicated, an efficient construction of the cell graph is a crucial part of this approach.
We shall represent each point p by an m-element binary vector v(p). The i-th bit of v(p) is 1 iff p satisfies the i-th inequality (see Figure 1). Observe that such a representation identifies all points belonging to the same cell, so it can be seen as a representation of this cell. The neighboring cells are exactly the cells whose representatives differ on exactly one position, which means that their Hamming distance is 1. There are several known approaches to parallel computation of the Hamming distance in various settings [9, 12, 4]. However, none of them benefits from the specific properties of our task.
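To make the representation concrete, here is a small hypothetical Python sketch: a 2D scene with three constraints, where each point is mapped to the vector of constraint outcomes. The scene and all names are illustrative only, not the paper's Figure 1:

```python
# Hypothetical 2D scene: each constraint is a predicate on a point,
# and a point's cell is the vector of constraint outcomes.
constraints = [
    lambda p: p[0] > 0,                 # right of the y-axis
    lambda p: p[1] > 0,                 # above the x-axis
    lambda p: p[0]**2 + p[1]**2 < 4,    # inside a circle of radius 2
]

def cell_vector(point):
    """Map a point to its m-element binary cell representation."""
    return tuple(int(c(point)) for c in constraints)

assert cell_vector((1, 1)) == (1, 1, 1)
# Crossing one boundary flips exactly one bit: these cells are neighbors.
assert cell_vector((-1, 1)) == (0, 1, 1)
```

All points of one cell yield the same vector, and moving across a single boundary changes exactly one bit, which is why neighborhood reduces to Hamming distance 1.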
Observe that the theoretical bound for the number of different cells in a scene with m constraints is 2^m. However, due to the arrangement of constraints, the actual number of non-empty cells is usually much smaller. For example, some binary vectors do not represent any point of the scene in Figure 1. This is the reason why detecting all cells for a given scene is a hard task. However, we are usually satisfied with approximate solutions generated by a randomized procedure called a sample generator. This procedure generates (and possibly accumulates) some sequence of random points. The simplest one is the so-called Shoemake's method [13], in which the random points are generated uniformly.
2.2 Constructing the cell graph
Let S = {v_0, v_1, ..., v_{n-1}} be a set of binary vectors, each of length m. These vectors represent the cells and are generated by a sample generator. The cell graph G(S) for S is the graph with vertex set S, in which the edges are all pairs {v_i, v_j} (for i ≠ j) such that H(v_i, v_j) = 1.
In this paper we are interested in solving the problem of constructing the cell graph, i.e. finding the edge set of G(S) for the given set S. Since the degree of each vertex is at most m, the graph has at most nm/2 edges. Moreover, the total number of bits in S is nm. Thus the lower bound for the complexity of any algorithm constructing the cell graph is Ω(nm).
It is also worth mentioning that many sample generators used in applications are not perfect and may output some vector more than once (so S is in fact a multiset). Observe that in this case the number of pairs (v_i, v_j) such that H(v_i, v_j) = 1 can increase to Θ(n²). A desired property of any algorithm constructing the cell graph is to be able to deal with such a situation and output only unique pairs of neighboring vectors, without increasing the complexity.
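A naive reference construction, deduplicating the multiset before comparing all pairs, can be sketched in Python as follows (function names are ours, for illustration):

```python
from itertools import combinations

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def cell_graph_naive(vectors):
    """O(n^2 * m) construction; deduplicates the (multi)set first so
    repeated samples do not inflate the edge set."""
    unique = sorted(set(vectors))
    return [(u, v) for u, v in combinations(unique, 2) if hamming(u, v) == 1]

cells = [(0, 0), (0, 1), (1, 1), (0, 1)]   # (0, 1) sampled twice
edges = cell_graph_naive(cells)
assert len(edges) == 2   # (0,0)-(1,1) has distance 2; duplicates ignored
```

The parallel algorithms in Section 3 improve on exactly this quadratic pairwise scheme.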
When comparing the complexities of the algorithms we will assume that m ≤ n. This is justified, since in most practical applications n is about a few million, while m is about a few hundred.
3 Parallel algorithms
The parallel algorithms presented in this section are inspired by the sequential algorithms for the problem presented by Dobrowolski [3].
3.1 Heuristic algorithm
In the naive approach we compare all pairs of vectors in total time O(n²m). We improve this method by choosing a small constant k and computing the distance between each of k reference vectors w_0, ..., w_{k-1} and each vector in S. Then, for each pair of vectors we compare them with each other to determine if their Hamming distance is 1. We are able to discard some pairs faster, using the inequality |H(u, w_i) − H(v, w_i)| ≤ H(u, v) and the previously computed distances to the vectors w_i. Observe that for k = 0 this algorithm reduces to the naive one.
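A sequential Python sketch of this pruning idea follows. How the reference vectors are chosen is our assumption (here a random sample from the input); the paper does not specify it, and the pruning is correct for any choice:

```python
from itertools import combinations
from random import sample, seed

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def cell_graph_heuristic(vectors, k=3):
    """Precompute distances to k reference vectors; a pair (u, v) can be
    a neighbor only if |H(u, w_i) - H(v, w_i)| <= 1 for every w_i."""
    seed(0)  # deterministic choice of reference vectors for this sketch
    refs = sample(vectors, min(k, len(vectors)))
    dist = {v: [hamming(v, w) for w in refs] for v in vectors}
    edges = []
    for u, v in combinations(vectors, 2):
        if any(abs(du - dv) > 1 for du, dv in zip(dist[u], dist[v])):
            continue  # discarded without a full comparison
        if hamming(u, v) == 1:
            edges.append((u, v))
    return edges

vectors = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (0, 1, 1)]
edges = cell_graph_heuristic(vectors, k=2)
assert len(edges) == 3
```

Since the bound is a consequence of the triangle inequality, the filter can never discard a true neighboring pair; it only saves full comparisons.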
Each pair of vectors (u, v) is considered by one block of p threads. Each thread from this block considers the pair of corresponding segments of the vectors u and v. For simplicity, assume that p divides m (otherwise the last thread of the block works on shorter segments). Each thread in the block compares an (m/p)-element segment of u with the corresponding segment of v. More specifically, the j-th thread (for j ∈ [p]) of the block operates on the segments u[jm/p .. (j+1)m/p − 1] and v[jm/p .. (j+1)m/p − 1].
First we compute H(v, w_i) for all v ∈ S and i ∈ [k] and store the values in shared memory. Each thread writes the number of positions on which its segments differ to a single cell of a shared vector. Then all those values are summed in parallel. Algorithm 1 shows the pseudo-code of this step.
The remaining part, shown in Algorithm 2, is analogous. The difference is that we may stop as soon as we discover that the partial distance is greater than 1.
The worst-case time complexity of this algorithm is O(knm + n²m), while the additional space complexity is O(kn), as we need to store the values of H(v, w_i). Since k is chosen to be a constant, the time complexity and the additional space complexity are O(n²m) and O(n), respectively. Observe that the choice of k strongly affects the constants in the bounds for the complexity. However, the experiments show that even if k is small, the effect on execution time may be significant.
3.2 Tree-based algorithm
The main drawback of the previous approach is that it is not aware of the structure of the constructed cell graph. Thus Dobrowolski [3] presented an optimized algorithm, based on a different approach. This algorithm first constructs an auxiliary binary tree, storing all vectors in S. Using this tree we can determine if a particular vector is in S in time O(m).
In the parallel version of this algorithm, to improve memory accesses (see Section 4) we use a 2^c-ary tree T for a constant c ≥ 1. Let c be fixed and suppose for simplicity that c divides m (otherwise the last segment of each vector is considered in a slightly different way). For a vector v ∈ {0,1}^m, let enc(v) denote the vector in [2^c]^{m/c} such that for every i ∈ [m/c] the segment v[ic .. (i+1)c − 1] is the binary encoding of enc(v)[i]. By enc(S) we denote the set {enc(v) : v ∈ S}. We can see enc(S) as the representation of the sequences in S. The tree T has m/c + 1 levels. The i-th level of T corresponds to the i-th coordinate of the encoded vectors. Each node contains 2^c pointers to nodes of the next level, each corresponding to a different element of [2^c] (see Figure 2 for an example). If a particular child does not exist, then there are no vectors with the particular prefix. If x is a node of T and y is its child node corresponding to the value a, then we say that y is an a-child of x.
The first step of our algorithm is sorting the vectors in S. As these vectors are binary, the sorting can clearly be done in O(nm) time, using the radix sort algorithm. During this step we also remove all duplicates in S.
Then we proceed to constructing the search tree T. Each level of T is constructed in parallel, with synchronization of threads after finishing each level. For each node x and a ∈ [2^c] we introduce the set S(x, a). Let x be a node on level i. The set S(x, a) consists of the vectors v ∈ S such that: i) the prefix enc(v)[0 .. i−1] is represented by the path from the root to x (with just a little abuse of notation we assume that for every vector the empty prefix satisfies this condition at the root), and ii) enc(v)[i] = a. This means that the vectors from S(x, a) are exactly the ones whose search path begins with the path from the root to the a-child of x. Observe that since the set S (and thus enc(S)) is sorted, each set S(x, a) can be represented by just two indices – of the first and of the last vector from this set. Algorithm 3 shows the pseudo-code for this step. Observe that the computational complexity of Algorithm 3 is O(nm) (recall that c is a constant).
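A sequential Python sketch of this construction follows. Dictionaries stand in for the per-node pointer arrays (the actual structure keeps 2^c pointers per node), and each node stores the two-index range described above; the function and field names are ours:

```python
def build_tree(sorted_vecs):
    """Level-by-level construction over a sorted, deduplicated list of
    vectors already encoded as tuples of symbols from [2^c]. Each node
    stores the (first, last) index range of vectors sharing its prefix."""
    root = {"range": (0, len(sorted_vecs) - 1), "children": {}}
    level_nodes = [root]
    depth = len(sorted_vecs[0])
    for level in range(depth):
        next_level = []
        for node in level_nodes:            # in the GPU version, these
            first, last = node["range"]     # nodes are split in parallel
            i = first
            while i <= last:
                sym = sorted_vecs[i][level]
                j = i
                while j <= last and sorted_vecs[j][level] == sym:
                    j += 1                  # the run sharing this symbol
                child = {"range": (i, j - 1), "children": {}}
                node["children"][sym] = child
                next_level.append(child)
                i = j
        level_nodes = next_level
    return root

def contains(root, vec):
    """Membership test: O(m/c) pointer follows along the encoded vector."""
    node = root
    for sym in vec:
        if sym not in node["children"]:
            return False
        node = node["children"][sym]
    return True

# Three encoded vectors (c = 2, so symbols come from [4]), sorted.
root = build_tree([(0, 1), (0, 2), (3, 1)])
assert root["children"][0]["range"] == (0, 1)
assert contains(root, (3, 1)) and not contains(root, (3, 2))
```

Because the input is sorted, each S(x, a) is a contiguous run, which is what makes the two-index representation possible.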
After constructing the search tree, we can proceed to the main step – identifying neighbors. For every vector v in S and every possible neighbor v' of v we check if v' ∈ S (in fact we check if enc(v') is stored in T). Again, we do it in parallel. For every vector v, each bit of v is considered by a separate thread. Observe that each bit of v corresponds to a single potential neighbor of v (obtained by flipping this bit). Thus each thread checks if this potential neighbor exists. Algorithm 4 shows the pseudo-code of this procedure.
Observe that we do not have to keep the vector v' explicitly. At each step we need only one segment of enc(v') (corresponding to the current level of T), which can be found in constant time. The time complexity of the searching procedure is O(nm²) and so is the complexity of the whole algorithm. The space complexity of the algorithm is determined by the size of the search tree, which is O(nm).
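The neighbor-identification step can be sketched sequentially in Python. For brevity a hash set replaces the search tree as the membership structure (the paper's tree has the same O(m) lookup cost and additionally avoids materializing v'); the function name is ours:

```python
def neighbors(vectors):
    """For each v, flip each bit in turn and test membership of the
    resulting potential neighbor; each flip maps to one GPU thread."""
    present = set(vectors)
    edges = set()
    for v in present:
        for i in range(len(v)):
            u = v[:i] + (1 - v[i],) + v[i + 1:]   # flip bit i
            if u in present:
                edges.add(tuple(sorted((u, v))))  # store each edge once
    return edges

edges = neighbors([(0, 0), (0, 1), (1, 1)])
assert len(edges) == 2
```

Storing each edge in canonical order deduplicates the pair found from both endpoints, mirroring the requirement that only unique neighboring pairs are output.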
Recall that during the sorting step we remove all duplicates. Thus this algorithm is robust in the sense that it does not assume that all input vectors are distinct, and the same complexity bound holds even if S is a multiset.
4 GPU Implementation Issues
In this section we discuss the implementation details of the parallel algorithms described in the previous section. We shall omit an introduction to the computational model of GPGPU. Readers who are not familiar with GPGPU programming should refer to the CUDA C literature [11, 8]. There are several limitations of GPU devices which are important from the algorithmic point of view. We are interested in algorithms which are able to: (1) use coalesced memory access, (2) maximize multiprocessor occupancy, (3) hide memory latency.
4.1 Heuristic algorithm
In order to achieve high processor occupancy we need to define a number of blocks that is at least three or four times higher than the number of streaming multiprocessors. Memory latency may be hidden if there is a sufficient number of warps assigned to the same processor and memory access is interspersed with computations.
Algorithms 1 and 2 contain two nested loops iterating over an array of results (it is an upper-triangular square array with zeros on the main diagonal). Using blocks as the parallel computation units in the outer loop and threads in the inner one gives us a fair number of blocks and threads, achieving good occupancy and hiding memory latency. Each thread reads parts of two vectors into registers and then performs the comparison. Thus a significant number of computational instructions are executed between reads and writes.
Coalesced memory access is automatic if each vector is stored as a contiguous array of bytes. The partial results of the comparison of two vectors in Algorithm 2 (one array for each block) may be stored in shared memory and added up in parallel by the threads of this block using the classical parallel reduction pattern.
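The reduction pattern mentioned above can be sketched as follows. This is a sequential Python simulation of the parallel steps: on a GPU each iteration of the inner loop would be executed by one thread, with a barrier between strides:

```python
def tree_reduce(vals):
    """Classical parallel reduction: halve the active range each step;
    at stride s, position i accumulates vals[i + s] (one thread per i)."""
    n = len(vals)            # assumed to be a power of two, as on a GPU
    stride = n // 2
    while stride > 0:
        for i in range(stride):          # these adds run in parallel
            vals[i] += vals[i + stride]
        stride //= 2                     # barrier between strides
    return vals[0]

assert tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```

With m partial Hamming-distance counts in shared memory, this sums them in O(log m) parallel steps instead of a sequential pass.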
4.2 Tree-based algorithm
Tree construction in Algorithm 3 requires a synchronization after each level. Such a global synchronization can only be achieved by finishing a kernel and launching a new one. The number of threads in each kernel execution is equal to the number of tree nodes in the previous level times 2^c. All threads run independently and their division into blocks may be set arbitrarily in order to achieve the best processor occupancy.
Algorithm 4 again contains two nested loops. The outer one is executed for each input vector and the inner one iterates over its coordinates. Similarly as in Algorithm 2, assigning the outer loop to blocks and the inner one to threads gives good parallelism properties. Each thread performs tree searching and reads random memory locations. Coalesced memory reads are thus not possible. However, threads may still benefit from the global memory cache, since many threads performing independent searches may read the same byte from memory.
5 Experimental Results and Discussion
In order to evaluate our parallel algorithms we utilized the sample generator developed by Dobrowolski [3] and some real-life scenes. The experiment was performed on a professional computation server (Intel Xeon E5-2620 2GHz, 15MB cache, 6 cores, 32GB RAM) equipped with an NVIDIA Tesla K40 computational unit (2880 cores, 12GB memory).
In all parallel algorithms there are parameters which may influence their performance, i.e. the number of blocks and threads and the value of k in the heuristic algorithm. According to NVIDIA white papers, due to the complexity of the parallel processing model, the only way to find optimal values of these parameters for different devices and environments is to perform experiments. In the case of the value of k (for the heuristic algorithm) our tests indicated that the optimal value for the CPU is 3, while for the K40 it is 5. The rest of the experiments for the heuristic algorithm were performed with these settings. An analysis of the size of the kernel grid for the parallel tree-based algorithm (divided into three stages: sorting, building and searching) is presented in Figure 3. The total processing time was minimal for 4096 blocks of 32 threads.
Figure 4 (left) presents the evaluation of the processing time in the three stages of the tree-based algorithm: sorting, tree building, and neighbor searching. We can clearly see that the searching time grows faster than the building time, which is related to the additional factor of m in the complexity bound. However, for 44,000 vectors it is still smaller than the building part, due to high constants in the latter. Experiments show that the sorting stage does not influence the total processing time by more than 30% in the case of bigger input sets.
On the right of Figure 4 a comparison of several solutions is presented. Let us first analyze the heuristic methods. The sequential solution for the optimal value of k (equal to 3) is significantly faster than the solution with k set to 0, which corresponds to the naive solution. Similarly, the parallel version with k set to the optimal value (5) is much faster than the naive one. Both parallel and sequential procedures show similar growth of processing time for increasing size of the input data set. This shows that the algorithm scales well. The best performance is achieved by the parallel tree-based procedure. The sequential version behaves similarly but is significantly (more than two orders of magnitude) slower.
6 Conclusion and further research directions
We presented two important parallel algorithms for the construction of the cell graph in the motion planning problem. Our experiments show that the parallel solution based on a search tree is much faster than its sequential counterpart. This is also one of the rare efficient search tree structures for GPU processors. We showed that creating and searching such a structure can be efficient also on SIMD-like processors, which were so far identified with vector processing. This was possible due to proper tree node construction and the memory caching available in modern devices.
As a modification of the tree-based algorithm, Dobrowolski [3] presented an algorithm constructing the cell graph in time O(nm). Using an auxiliary data structure, the searching step can be performed in O(nm) time. Unfortunately, the experiments on real data (see Section 5) show that constructing the tree takes the majority of the execution time. Moreover, as this improved searching procedure requires a lot of synchronization, it may actually lead to a worse execution time. A very natural research direction is to design a scalable parallel algorithm constructing the cell graph for a given set of vectors in time O(nm).
As mentioned before, cell graphs are used in motion planning. A path in the cell graph corresponding to a given scene is equivalent to some approximate solution of the motion planning problem. There are several approaches to traversing large graphs using GPGPU (see [10, 6]). An interesting problem is to design such an algorithm for cell graphs, taking their specific structure into consideration.
- [1] P. C. Chen and Y. K. Hwang. SANDROS: a dynamic graph search algorithm for motion planning. In ICRA, 1998.
- [2] J. Canny. A new algebraic method for robot motion planning and real geometry. In Proc. of SFCS 1987, pages 39–48, 1987.
- [3] P. Dobrowolski. An efficient method of constructing cell graph in arrangement of quadrics in 3-dimensional space of rotations. In Proc. of SII 2013, pages 671–676. IEEE, 2013.
- [4] S. Grabowski and K. Fredriksson. Bit-parallel string matching under Hamming distance in worst case time. Inf. Proc. Letters, 105:182–187, 2008.
- [5] Y. K. Hwang and N. Ahuja. Gross motion planning – a survey. ACM Comput. Surv., 24:219–291, Sept. 1992.
- [6] K. Kaczmarski, P. Przymus, and P. Rzążewski. Improving high-performance GPU graph traversal with compression. In New Trends in Database and Information Systems II, pages 201–214. Springer, 2015.
- [7] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In Proc. SIGMOD 2010, pages 339–350, 2010.
- [8] D. B. Kirk and W.-m. W. Hwu. Programming massively parallel processors: a hands-on approach. Newnes, 2012.
- [9] Y. Liu, L. Guo, J. Li, M. Ren, and K. Li. Parallel algorithms for approximate string matching with mismatches on CUDA. In IPDPS 2012, pages 2414–2422, 2012.
- [10] D. Merrill, M. Garland, and A. S. Grimshaw. Scalable GPU graph traversal. In PPOPP, pages 117–128. ACM, 2012.
- [11] NVIDIA Corporation. NVIDIA CUDA C Best Practices Guide 6.5, 2014. docs.nvidia.com/cuda/cuda-c-best-practices-guide.
- [12] J. Pan, C. Lauterbach, and D. Manocha. Efficient nearest-neighbor computation for GPU-based motion planning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2243–2248, 2010.
- [13] K. Shoemake. Uniform random rotations. In D. Kirk, editor, Graphics Gems III, pages 124–132. Academic Press Professional, Inc., 1992.