Solving The Longest Overlap Region Problem for Noncoding DNA Sequences with GPU

Solving The Longest Overlap Region Problem for Noncoding DNA Sequences with GPU


Early hardware limitations of GPU (lack of synchronization primitives and limited memory caching mechanisms) can make GPU-based computation inefficient. Now Bio-technologies bring more chances to Bioinformatics and Biological Engineering. Our paper introduces a way to solve the longest overlap region of non-coding DNA sequences on using the Compute Unified Device Architecture (CUDA) platform Intel(R) Core(TM) i3- 3110m quad-core. Compared to standard CPU implementation, CUDA performance proves the method of the longest overlap region recognition of noncoding DNA is an efficient approach to high-performance bioinformatics applications. Studies show the fact that efficiency of GPU performance is more than 20 times speedup than that of CPU serial implementation. We believe our method gives a cost-efficient solution to the bioinformatics community for solving longest overlap region recognition problem and other related fields.

keywords: CUDA, GPUs, RMQ, suffix array, DC3, LCP, Noncoding DNA,

I Introduction

In the early years, many science researcher held that a large amount of non-coding DNA had not biological functions and viewed non-coding DNA as ¡±junk DNA¡±. But, recent studies show many non-coding DNA are functional and beneficial to human beings. For example, many biological scientists found the RNA sequences of telomeres have special function which was ignored by previous research. Sequence alignment is an important problem in computational biology and sequence comparison is an important tool for researchers in molecular biology [1]. Computational recognition of genes is one of challenges in the analysis of newly sequenced genomes filed, which is fundamental for modern functional genomics [2].

In recent years, modern multi-core and many-core architectures are revolutionary high-performance computing (HPC). Now CPU microprocessors, based on a single central processing, promote the performance of computer application to make floating point arithmetic achieved 11 times per second single chip, the era of many-core processor has begun [3]. The emergence of many-core architectures, such as compute unified device architecture (CUDA)-enabled GPUs and other accelerator technologies, these emerging technologies open more opportunities to optimize many biological algorithms and to provide industries with more advanced and powerful computing hardware.

Due to exponentially growing of DNA bases and awareness of function of non-coding DNA, computational recognition of genes became a time-consuming and challenging job. The development of bioinformatics need to have a better performance. Several efforts have been made to optimize the biological algorithm and reach more accurate results. Life science have emerged as a primary application area for use of GPU computing. Now a method to recognize the longest overlap region of DNA is on two different platforms: multi-core( CPUs) and many-core(GPU). GPU performance grows faster than development of CPU in many fields of bioinformatics. In this paper, we present a modified and efficient parallel algorithm with CUDA to solve the longest overlap region problem. Based on DC3 algorithm which is used to construct suffix array and RMQ algorithm, we change series programming into computation of the problem in parallel. Here DC3 algorithm is more than 20 times speedup than that of CPU serial implementation and the fact shows solving the longest overlap region with CUDA is a perfect method.

Ii general introduction of GPU and CUDA programming model

Recently, more and more application developers pay much attention to GPUs. The new products of GPUs are dramatically increasing programmability and generality but still providing developers with huge memory bandwidth and powerful computational ability [4]. In order to meet growing requirements of programming developers, many industries have been engaging into designing processors such like NVIDIA used a large amount of processors cores to built programming processor [5]. Culminating in NVIDIA¡¯s first GPU in 1999. After one year when NVIDIA have been developing GPU terms, many software and computer games were not only filed which made remarkable breakthroughs with technology. The General Purpose GPU (GPGPU) movement had dawned [6].

CUDA is a parallel computing platform and programming model invented by NVIDIA. [3] [6]. Since yielded in 2007, hundreds of millions of computers equipped with CUDA processing capability of GPU is widely used. GPU parallel technology is now widely used in various fields, GPU high-speed parallel processing capabilities could be used in digital image processing algorithms, discrete simulation, general computing, greatly improving computational speed, because GPU performance characteristics can be applied to numerical calculation and matrix processing [7].

Fig. 1: GPU CPU architecture

CUDA programming model is very well suited to parallel capabilities of GPUs. CUDA devices collect a plenty of data parallelism to accelerate data process. Two control circuits are relatively simple to GPU, and demand for Cache is small, so you can put most of transistors in the cell. of GPU is ALU. Illustrate in Figure.1 .

It is a common CUDA programming model that sequential host program processes organize an application and parallel device can process at least one host programme. Typically, parts of host programme are implemented on CPU and GPU take responsibility of the others parallelized parts. Threads are rationally divided into each thread block by programmer; kernel that we used thus is made up with a grid which consist of at least one thread block. A cluster of threads synchronously work through barrier synchronization in a thread block which shared a same memory space [8].

Iii LCP and related algorithm

In this paper, we present the longest overlap region problem that is to compute the longest common prefix(LCP) non-coding DNA sequences and its Pos array in suffix arrays. Given a reference sequence . For i 1 , 2 , 3..n, every S(i,n) is a suffix of S, also every S(1,i) is the prefix of S. For convenience, we define LCP(S,i,j) as the longest common prefix between SA[i,n]  (SA is the suffix array of S) and SA[j,n].

Suffix array is a space-efficient data structure that guarantee accurate and quick searching of a subsequence from pattern. Pos of all suffixes of a sequences are stored basically in Suffix array [9]. Compared with suffix trees, We naturally viewed Suffix arrays as a more efficient and convenient method data structure. This structure can be served as an array of integers representing the start position of every lexicographically ordered suffix of a string. In the table, we can use SA of S or SAS to indicate suffix array of S. It is easy to find the rules of suffix array of S that an array SA[0..n] which contains a permutation of the integers 0..n so, we can make a conclusion that , meanwhile we can give a rule: SA[m] = k iff S[k..l] is the suffix of S [10] [11]. Given example , we can build a sort suffix table of S in the Table 1

    S A T T G C T A C
INDEX 0 1 2 3 4 5 6 7
   SA 6 0 7 4 3 5 2 1

Traditional method build suffix array can be in O(n*log(n)). In this paper, we give a linear-time algorithm called DC3 which is a special case of a another cover sample. in this situation, the sample must meet the requirements: 1 : The sample itself can be sorted efficiently. 2 : Sorted sequence of all suffixed is helped by of the sorted sample. sorting suffixes in DC3 method is beginning at location in a difference cover sample modulo 3, after the process utilize index to find all suffixes [10]. It takes the following 2=3-recursive divide-and-conquer approach :
Definition :

A set  is difference cover module v if


A v-periodic sample C of [0,n] with the period D, that is,


is a difference cover sample if D is a difference cover modulo v.

For k¡Ê[0,..3),    define :


the set of cover position is ,

Step 1 :   Built suffix array of the suffixes starting at position i (i mod ) and sort sample suffixed

So we can see to be the set of the sample of positions. We can easily find , and . For , make a string

whose characters are triples . Let be the concatenation of R1 and R2. , refer to table 1, and we can get . For i C, let denote rank of in the sample set of . For , is undefined. illustrate in Table.2 [9]

    S A T T G C T A C
     i 0 1 2 3 4 5 6 7
5 4 2 3 1

STEP 2 search Non-sample Suffixes
Given each Nonsample suffix with the pair ,. It is so obvious that : for , . For example, , because

STEP 3 merge those of the two segments
Combining two segments into one string should be obey the rules :
Let with ,


: . Finally we can obtain sorted sequence in Table 3.

    S A T T G C T A C
     i 0 1 2 3 4 5 6 7
2 8 7 5 4 6 1 3

After construction of suffix array, we should find the longest overlap prefix of two sequences we assigned. The lcp between two suffixes is the minimum of the lcps of all pairs of adjacent suffixes between them on the Pos array [9]. That is :


Range-Minimum-Query-Problem is to preprocess an array so that the position of the minimum element between two specified indices can be obtained efficiently [12] [13]. First we give a definition to the Range Minimum Query (RMQ) :

Given an array S[1…n] of elements from a totally ordered set (with order relation ””, returns the index of a smallest element in A[i, j], i.e., =. Illustrate in table.

So LCP problem can be successfully transformed into RMQ problem. The method was first proposed and presented by Bender and Farach-Colton. we present in this paper that Berkman and Vishkin algorithm which is combined with 1RMQ is other case of RMQ problem. Disadvantages of the algorithm is waste of the large space, but it maintains efficiency of query process in to complete.

    S A T T G C T A C
INDEX 0 1 2 3 4 5 6 7
   SA 6 0 7 4 3 5 2 1
   lcp 0 1 0 1 0 0 1 1

This algorithm can be divided into two parts :

STEP 1 RMQ problem can be converted to LCA problem(least common ancestor): first build cartesian tree store the input sequence of A , build the complexity of the cartesian tree is O (n).

STEP 2 Transform LCA problems into 1RMQ : based on DFS search of the tree by Euler path (Euler Tour), set up three arrays E, L and R. The E and L size are , elements of E is label value for each node of the cartesian tree (is actually index of A), elements of L is corresponding to the depth of the node of euler path. R stores the positions of the each node queried for the first time.

If want to query LCA(u,v), it is actually equivalent to solving RMQ problem of the array of L. Due to do RMQ can be transformed into RMQ, it also can use the 1RMQ algorithm in time.

Iv accelerate contruction of lcp table and solve largest overlap region problem with cuda

Sorting is a common problem in bioinformatics, like quick sort. Merge sort etc. Quick sort algorithm not only guarantee the computing process on hardware which is shortage of atomic operations but also utilize geometry shaders, however is generally slightly slower and quick sort is testified as efficiency algorithm. In DC3 algorithm, the construction process of the sample suffix array and the non-sample suffix both use radix sort to built array. To many a certain keys problem solving by comparison-base algorithm, Radix sort is relatively an efficient and low-cost method. This sorting algorithm contains b processes which process j-th digits of the keys in order from the least to the most digit. Previously, many compute scientists have developed many variant algorithm, especially using bitonic sort (Govindaraju et al. 2006, Zachmann 2006). In our paper, we present a efficient radix sort to implement DC3 [14] [15].

This algorithm presents each chunk of subsequence is sorted by radix sort. a grid of thread blocks cooperatively sort chunks in a parallel way. Due to a lack of enough size of shared memory, it makes difficult for single multiprocessors to allocate more space to chunks of input array. Given a string , divide S into every single chunk like , , ...illustrate in Figure.2.

Fig. 2: chunks of divided S

Taka a number which has m-bits keys as an example, it need to take m steps to finish radix sort algorithm. Thus every step of process requires us to do scan the bit table. It is notable that split primitive is significant for all parallel process to achieve an efficient and satisfied performance. After that we should organize each sort keys which are gained by split into a certain keys list, and each value represent the least significant bit that is either ”1” or ”0”. But there are some tricks to produce new reversed the table list that all ”0” sorted key are replaced by ”1” and vice versa. Finally we need to formulate a rule to compute the destination address of each input number. For example, , let array L store the least significant bit b, so . Define E is a array, and for , . And we can obtain the . Scan E to obtain F. Illustrate in Figure.3.

Fig. 3: scan process

. Process illustrate in the Figure.3. Using array F[n-1] and E[n-1] to obtain totalFalse, . Final index of each element in S can be computed by formula : . All the process follow Figure.4 [16].

Fig. 4: DC3 process
1:  while each  do
2:     ;
3:     ;
4:     ;
6:     ;
7:      record i_th bit of each element
8:     ;
9:     ;
10:     ;
11:     ;
12:     ;
13:     for all  , and, do
14:        scan the 1s table to calculate prefix sum of each element
15:        ;
16:        if  then
17:           ;
18:        else
19:           ;
20:        end if
21:        ;
22:     end for
23:     ;
24:     ;
26:     ;
27:     ;
28:     ;
29:     array d store the index of the sorted S
30:     ;
31:     ;
32:     ;
33:  end while

When all assigned thread blocks chunks finished, each chunk of sorted numbers should be repeatedly combined into a new sorted list and each new sorted list is allocated to a coalescent chunk until all chunks finish merge sort process. The pseudocode is depicted in Alg.1 in the kernel function cudaDC3RS. In the kernel function id is a private register with respect to a thread respectively, which denote as thread index.

Before the process depicted as Alg.1, we should divide whole elements of S into sample suffix and Nonsample suffix. here we just present a single process of sample suffix. Let S to be sample suffix. Follow the Alg.1 all single block of subsequence are organized into a string array S ,so that the exclusive thread can be mapped into S[id]. Then each subsequence is executed respectively so as to the indexed structure enable all threads to string matching simultaneously by means of radix sort algorithm supported by GPU.

V result

In this section we compare the sequential performance with parallel performance of CUDA implementation. Both algorithms are implemented using Microsoft Visual Studio 2010 combined with CUDA version 5.5 for the parallel implementation. The proposed algorithms are carried out in a Intel(R) Core(TM) i3-3110K quad-core running at 2.40GHZ with 2.0GB RAM. The used CUDA driver and runtime version are both 5.0, NVIDIA GeForce 610M GPU(kepler architecture) which has a total of 48 streaming multiprocessors operating at a clock rate of 900 MHZ.

Noncoding DNA sequences from NCBI Nucleotide are to evaluate the performance of the previously described algorithms. The reference sequences corresponds to 3 groups: Homo sapiens chromosome Y noncoding region downstream from the DAZ gene of which accession is AH011747.1 GI: 21929706, Caenorhabditis elegans Bristol N2 genomic chromosome of which accession is BX284603.4 GI: 449020129 and Homo sapiens isolate NA19204 noncoding region T1419 genomic sequence of which accession is GQ846167.1 GI: 260538176. Several groups of query sequences were used to experimental test, each one consists of a different number of query sequences, ranging from 512 to 2097152 query sequences. It is worth noticing that the programmer writes a kernel and organize its execution in a grid of thread blocks, Each block is assigned to a Streaming Multiprocessor (SM). Once assigned it cannot migrate to another SM. Each SM splits its own blocks into Warps (currently with a maximum size of 32 threads). All the threads in a warp executes concurrently on the resources of the SM. The actual execution of a thread is performed by the CUDA Cores contained in the SM. It is a very important process for mapping noncoding DNA sequences into the grid. Illustrate in Table.5.

Accept for implementation of finding the longest overlap region of noncoding DNA on CUDA, CPU programme of the algorithm also is executed for sake of evaluating the different efficiency between the series programme and parallel programme. And the series parts execute on Intel(R) Core(TM) i3-3110M CPU. The results is illustrated in Figure.5. The result shows DC3 radix sort algorithm on GPU in O(n) time to construct suffix array compared with execution on CPU. And it is easy to find that the performance on GPU is not better than CPU execution when the number of the noncoding DNA sequences are lower than 65536. After more the number of noncoding DNA sequences, performance improvement on the GPU of algorithm is much satisfied. It also observed that efficiency of performance is growing with significantly growing the numbers of sequences. Illustrate in Figure.6. Unlike what happened in CPU implementations, tile optimization partition local data into shared memory, due to the achieved performance in accordance with memory accesses, reduce the access of global memory.

        TYPE / NUMBER 256 1024 4096 16384 65536 262144 1048576
        G1CPU 0.001 0.0062 0.047 1.425 5.325 19.375 98.253
       0.002 0.015 0.085 0.741 2.012 4.524 9.1012
       0.002 0.014 0.072 0.531 1.842 3.254 8.335
        G2CPU 0.001 0.0068 0.0051 1.615 5.827 22.941 98.453
       0.002 0.016 0.087 0.6981 1.962 5.014 8.232
       0.002 0.014 0.088 0.513 1.901 3.044 7.963
        G3CPU 0.001 0.0067 0.049 1.577 5.553 21.531 95.616
       0.002 0.019 0.087 0.720 2.232 5.271 7.816
       0.002 0.015 0.081 0.681 1.822 3.116 6.735
TABLE V: DC3 radix sort based on CPU and GPU
        TYPE / SPEED RATIO 256 1024 4096 16384 65536 262144 1048576
       -0.5 -0.75 -0.47 2.25 2.61 4.28 10.79
       -0.48 -0.67 -0.31 2.32 2.89 5.96 11.79
       -0.5 -0.71 -0.49 2.75 2.96 4.57 11.85
       -0.48 -0.61 -0.28 3.15 3.07 7.52 12.37
       -0.5 -0.69 -0.47 2.25 2.39 4.08 12.24
       -0.48 -0.67 -0.31 2.32 3.04 6.91 14.20
TABLE VI: Speedup rating of different execution

From the obtained results, it can be observed that runtime of CPU were consistently higher than implemented DC3 radix sort, it shows solving the longest overlap region problem with respect to suffix array construction using GPU even is efficient method to high performance bioinformatics applications.

Vi conclusion

This paper presents a new method to solve problem of the longest overlap region of Noncoding DNA sequence using GPU hardware and modified algorithm. Related experiment were thoroughly compared using two different executional ways: multicore (i3-3110M) and many-core (NVIDIA GeForce GTX610M GPU).

These observations reveal that massive data applied to the longest overlap problem is time-consuming job. Nevertheless, parallel programming on GPU achieves the improvement in rate accelerating making solving DNA sequence more precise and more faster so that can meet our requirement in more important projects.

Fig. 5: performance of CPU and kernel execution of GPU
Fig. 6: average speedup ratio between GPU and CPU


  1. P. Bernaola-Galván, I. Grosse, P. Carpena, J. L. Oliver, R. Román-Roldán, and H. E. Stanley, “Finding borders between coding and noncoding dna regions by an entropic segmentation method,” Physical Review Letters, vol. 85, no. 6, p. 1342, 2000.
  2. H. Shi, B. Schmidt, W. Liu, and W. Muller-Wittig, “Accelerating error correction in high-throughput short-read dna sequencing data with cuda,” in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on.   IEEE, 2009, pp. 1–8.
  3. C. Nvidia, “Compute unified device architecture programming guide,” 2007.
  4. G. Liao, Q. Sun, L. Ma, and Z. Qin, “Gpu accelerated multiple deoxyribose nucleic acid sequence parallel matching,” arXiv preprint arXiv:1303.3692, 2013.
  5. S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded gpu using cuda,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming.   ACM, 2008, pp. 73–82.
  6. W. H. Wen-mei, GPU Computing Gems Emerald Edition.   Elsevier, 2011.
  7. X. Sierra-Canto, F. Madera-Ramirez, and V. Uc-Cetina, “Parallel training of a back-propagation neural network using cuda,” in Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on.   IEEE, 2010, pp. 307–312.
  8. N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore gpus,” in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on.   IEEE, 2009, pp. 1–10.
  9. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park, “Linear-time longest-common-prefix computation in suffix arrays and its applications,” in Combinatorial Pattern Matching.   Springer, 2001, pp. 181–192.
  10. J. Kärkkäinen, P. Sanders, and S. Burkhardt, “Linear work suffix array construction,” Journal of the ACM (JACM), vol. 53, no. 6, pp. 918–936, 2006.
  11. S. J. Puglisi and A. Turpin, “Space-time tradeoffs for longest-common-prefix array computation,” in Algorithms and Computation.   Springer, 2008, pp. 124–135.
  12. D. Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology.   Cambridge University Press, 1997.
  13. J. Yang, Y. Xu, and Y. Shang, “An efficient parallel algorithm for longest common subsequence problem on gpus,” in Proceedings of the world congress on engineering, WCE, vol. 1, 2010.
  14. A. Greb and G. Zachmann, “Gpu-abisort: Optimal parallel sorting on stream architectures,” in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International.   IEEE, 2006, pp. 10–pp.
  15. M. Farach, “Optimal suffix tree construction with large alphabets,” in Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on.   IEEE, 1997, pp. 137–143.
  16. H. Nguyen, “Gpu gem 3,” 2007.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description