A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings

A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings

Abstract

Indexing of very large collections of strings such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growing of data that are processed routinely, such as in bioinformatics, requires a large amount of main memory, and this fact has motivated the development of algorithms, to compute the BWT, that work almost entirely in external memory.

On the other hand, the related problem of computing the Longest Common Prefix (LCP) array is often instrumental in several algorithms on collection of strings, such as those that compute the suffix-prefix overlap among strings, which is an essential step for many genome assembly algorithms.

The best current lightweight approach to compute BWT and LCP array on a set of strings, each one characters long, has I/O complexity that is (where is the size of the alphabet), thus it is not optimal.

In this paper we propose a novel approach to build BWT and LCP array (simultaneously) with I/O complexity, where is the length of longest substring that appears at least twice in the input strings.

1 Introduction

In this paper we address the problem of costructing in external memory the Burrows-Wheeler Transform (BWT) and the Longest Common Prefix (LCP) array for a large collection of strings. An efficient indexing of very large collections of strings is strongly motivated by the widespread use of Next-Generation Sequencing (NGS) technologies that are producing everyday collections of data that fill several terabytes of secondary storage, that has to be processed by sofware applications. Common applications in metagenomics require indexing of collections of strings (reads) that are sampled from several genomes, where those genomes amount to billions of base pairs. For example, over 500 gigabases of data have been analyzed to start a catalogue of the human gut microbiome [20].

The Burrows-Wheeler Transform (BWT) [6] is a reversible transformation of a text that was originally designed for text compression; it is used for example in the BZIP2 program. The BWT of a text is a permutation of its symbols and is strictly related to the Suffix Array of . In fact, the symbol of the BWT is the symbol preceding the smallest suffix of according to the lexicographical sorting of the suffixes of . The Burrows-Wheeler Transform has gained importance beyond its initial purpose, and has become the basis for self-indexing structures such as the FM-index [8], which allows to efficiently perform important tasks such as searching a pattern in a text [8, 12, 21]. The generalization of the BWT (and the FM-index) to a collection of strings has been introduced in [16, 17].

An entire generation of recent bioinformatics tools heavily rely on the notion of BWT. For example, representing the reference genome with its FM-index is the basis of the most widely used aligners, such as Bowtie [10], BWA [13, 14] and SOAP2 [15].

Still, to attack some other fundamental bioinformatics problems, such as genome assembly, an all-against-all comparison among the input strings is needed, especially to find all prefix-suffix matches (or overlaps) between reads in the context of the Overlap Layout Consensus (OLC) approach based on string graph [18]. This fact justifies to search for extremely time and space efficient algorithms to compute the BWT on a collection of strings [11, 25, 7, 2]. For example, SGA (String Graph Assembler) [23] is a de novo genome assembler that builds a string graph from the FM-index of the collection of input reads. In a preliminary version of SGA [22], the authors estimated, for human sequencing data at a 20x coverage, the need of 700Gbytes of RAM in order to build the suffix array, using the construction algorithm in [19], and the FM-index.

Another technical device that is used to tackle the genome assembly in the OLC approach is the Longest Common Prefix (LCP) array of a collection of strings, which is instrumental to compute (among others) the prefix-suffix matches in the collection. The huge amount of available biological data has stimulated the development of the first efficient external-memory algorithms (called, BCR and BCRext) to construct the BWT of a collection of strings  [1]. Similarly, a lightweight approach to the construction of the LCP array has been investigated [3]. Towards an external memory genome assembler, LSG [4, 5] is founded upon BCRext and builds in external memory the string graph of a set of strings. In that approach, external memory algorithms to compute the BWT and the LCP array [2, 3] are fundamental.

Still, the construction of the BWT (and LCP array) of a huge collection of strings is a challenging task. A simple approach is constructing the BWT from the Suffix Array, but it is prohibitive for massive datasets. A first attempt to solve this problem [24] partitions the input collection into batches, computes the BWT for each batch and then merges the results.

In this paper we present a new lightweight (external-memory) approach to compute the BWT and the LCP array of a collection of strings, which is alternative to BCRext [1]. The algorithm BCRext is proposed together with BCR and both are designed to work on huge collections of strings (the experimental analysis is on billions of 100-long strings). Those algorithms are lightweight because, on a collection of strings of length , BCR uses only RAM space and CPU time, where is the time taken to sort integers. The same complexity holds for the lightweight LCP algorithm given in [3]. Though the use of the RAM is negligible for DNA data, the overall I/O volume is . Clearly, a main question is if it is possible to achieve the optimal I/O complexity. Both BCR and BCRext build the BWT with a column-wise approach, where at each step the elements preceding the suffixes of length of each read are inserted in the correct positions of the partial BWT that considers only suffixes shorter than . Moreover, both algorithms are described as a succession of sequential scans, where the partial BWTs are read from and and written to external files, thus obtaining a small main memory footprint.

Compared to BCRext, our algorithm uses an I/O volume that is , where is the maximum length of any substring appearing at least twice in the input collection. Clearly . Compared with BCR, our approach does not require an explicit sort of a generic set, but it is mainly based on the simple idea of building partial BWTs, each one for the set of suffixes of a given length , then merging those partial BWTs to obtain the complete BWT by using an approach similar to the one proposed in [9], where the construction of a multi-string BWT is proposed with the main goal of merging BWTs for distinct genomic sequences.

2 Preliminaries

Let be a finite alphabet where (called sentinel), and where specifies the lexicographic ordering over alphabet . We consider a collection of strings, where each string consists of symbols over the alphabet and is terminated by the symbol $. The symbol of string is denoted by and the substring of is denoted by . In order to simplify the presentation, we assume that all the strings in have the same length . The suffix and prefix of of length are the substrings (denoted by ) and (denoted by ) respectively. Then the -suffix and -prefix of a string is the suffix and prefix with length , respectively. The lexicographic ordering among strings in is defined in the usual way. Though we use the same sentinel to terminate strings, we can easily distinguish the same suffix of different strings by assuming an implicit ordering of the sentinels that is induced by the ordering of the input strings. More precisely, we assume that given , with , then the sentinel of precedes the sentinel of .

Given the lexicographic ordering of the suffixes of , the Suffix Array is the -long array where the element is equal to if and only if the element of is the -suffix of string . The Burrows-Wheeler Transform (BWT) of is the -long array where if , then is the first symbol of the -suffix of if , otherwise . In other words consists of the symbols preceding the ordered suffixes of . The Longest Common Prefix (LCP) array of is the -long array such that is the length of the longest prefix shared by suffixes and . Conventionally, .

Now, we give the definition of interleave of a generic set of arrays, that will be used extensively in the following.

Definition 1.

Given arrays , then an array is an interleave of if is the result of merging the arrays such that: (i) there is a 1-to-1 function from the set to the set , (ii) for each , and (iii) for each .

By denoting with the total length of the arrays, the interleave is a -long array giving a fusion of which preserves the relative order of the elements in each one of the arrays. As a consequence, for each with , the element of corresponds to the occurrence in of an element of . This fact allows to encode the function as a -long array such that if and only if is an element of . Given , it is possible to reconstruct by considering that is equal to where is the number of values equal to in the interval ; this number will be called rank at position . In the following, we will refer to vector as interleave-encoding (or simply encoding). Algorithm 1 shows how to reconstruct an interleave from its encoding (the array is used to store the rank values), and can also be used to simulate a scan of by means of its encoding .

1 for  to  do
2      ;
3      
4for  to  do
5      ;
6       ;
7       ;
8      
Algorithm 1 Reconstruct the interleave from the encoding

3 The lightweight algorithm for BWT and LCP array

Let and () be -long arrays such that is the symbol preceding the smallest -suffix of and is the smallest -suffix of . It is easy to see that the BWT is an interleave of the arrays , since the ordering of symbols in () is preserved in , i.e. is stable w.r.t. each array . This fact is a direct consequence of the definition of and . For the same reason, the lexicographic ordering of all suffixes of is an interleave of the arrays . Let be the encoding of the interleave of arrays giving the BWT , and let be the encoding of the interleave of arrays giving . Then it is possible to show that .

Our algorithm for building the BWT and the LCP array, differently from [1], consists of two distinct phases: in the first phase the arrays are computed, while the second phase determines (which is equal to ) thus allowing to reconstruct as an interleave of . Indeed, BCRext [1] computes the BWT of the collection incrementally via iterations. At each iteration , with , the algorithm computes a partial BWT that is the BWT for the ordered collection of suffixes of length at most , that is for the lexicographic ordering of . This approach requires that, at each iteration , the symbols preceding the -suffixes of must be inserted at their correct positions into , that is each iteration simulates the insertion of the -suffixes in the ordered collection of . Updating the partial BWT in external memory, the process requires a sequential visit of the file containing the basic information of the partial . Thus the I/O volume at each iteration is at least (since there are suffixes for each length between to ). Consequently the total I/O volume for computing is at least . More precisely, the BCRext algorithm in [1] that uses less RAM, requires at each iteration an additional I/O volume given by , due to a process of ordering special arrays used to save RAM space. Our algorithm instead consists of a first phase that has I/O volume and time complexity and produces the arrays (see procedure Partition-suffixes), and a second phase which computes by implicitly merging the arrays into the interleave of the overall ordered set of all suffixes (see procedure Merge-suffixes). As described in Section 5, the procedure does not need to compute explicitly the arrays and the interleave . Inspired by [9], we perform this step by a number of iterations, where the length of the longest substring that has at least two occurrences in . Thus the merging operation takes fewer iterations than BCRext (the latter requires ).

4 The Procedure Partition-suffixes

The input set is preprocessed in order to have a fast access to its symbols, and -long arrays are obtained. More in detail, the element () is the symbol of the string , that is . In other words is the symbol preceding the -suffix of . The procedure Partition-suffixes (see Algorithm 2) takes in input the arrays and computes the arrays by using -long arrays (), where if and only if the -suffix of the input string is the element of . Notice that the symbol precedes the -suffix , that is . In particular, contains the sequence of indexes and contains the sequence of the last symbols of the input strings (i.e. the symbols before the sentinels).

In order to specify the structure of the Procedure Partition-suffixes, given a symbol of the alphabet , we define the -projection operation over the array that consists in taking from only the entries such that . In other words is the vector that projects the entries of corresponding to strings whose -suffix is preceded by the symbol . Then the following Lemma directly follows from definition of .

Lemma 1.

Given the array , the sequence of indexes of strings, whose -suffix starts with symbol and ordered w.r.t. the -suffix, is equal to vector .

As a main consequence of the above Lemma the array can be simply obtained from as the concatenation where is the lexicographic order of symbols of alphabet . Notice that the -projection of , is computed by listing the positions of such that . Indeed, lists the symbols precedings the ordered -suffixes.

The procedure Partition-suffixes computes arrays in iterations. At each iteration , arrays and are computed from arrays and . The array is stored in lists , where is the -projection of . In the following, the arrays are treated as lists which can be stored in external files.

The basic procedure to compute from and is the following. First, is sequentially read and, for each position , is appended to the list , where . At this point, is given by the concatenation of lists . After computing , the vector can be obtained. Indeed, assuming that the element in the ordered list of -suffixes is the suffix of string (that is, ) the symbol preceding such suffix is and is directly obtained by accessing position of vector (recall that has been computed in the preprocessing phase). More precisely, is sequentially read and, for each position , if then . Due to a random access, array it is assumed to be kept in RAM with a space cost of .

Input : The arrays .
Output : The arrays .
1 for  to  do
2      ;
3       ;
4      
5for  to  do
6      foreach  do
7             empty list;
8            
9       empty list;
10       empty list;
11       for  to  do
12            ;
13             Append to ;
14            
15      for  to  do
16            Append to ;
17            
18      for  to  do
19            Append to ;
20            
Algorithm 2 Partition-suffixes

5 The procedure Merge-suffixes

The second step of our algorithm computes the encoding of the interleave of the arrays , giving the lexicographic ordering of all suffixes of and (at the same time) computes the LCP array. Recall that is equal to the encoding of the interleave of the arrays giving the BWT . This section is devoted to describe how to compute from which it is easy to obtain the BWT as explained in Algorithm 1, while the description of the approach to obtain the LCP array is postponed until Section 5.1.

Before entering into the details, we need some definitions.

Definition 2.

Let and be two generic suffixes of , with length respectively and . Then, given an integer , (and we say that p-precedes ) iff one of the following conditions hold: (1) is lexicographically strictly smaller than , (2) and , (3) , and .

Definition 3.

Given the arrays , the -interleave () is the interleave such that is the smallest suffix in the -ordering of all the suffixes of .

It is immediate to verify that (that is, the suffixes sorted according to the relation) is equal to , hence . Therefore, our approach is to determine by iteratively computing by increasing values of , starting from . Observe that lists the suffixes in the same order given by the concatenation of arrays and the encoding is trivially given by s, followed by s, …, followed by values equal to .

Definition 4.

Let be the -interleave of , and let be a position. Then, the -segment of in is the maximal interval such that and all suffixes in have the same -prefix. Positions and are called respectively begin and end position of the segment, and the common -prefix is denoted by .

It is immediate to observe that the set of all the -segments of a -interleave form a partition of its positions . Observe that, by definition, a suffix smaller than belongs to a -segment having . In other words, such suffix is the unique element of the -segment.

Before describing the approach, the computation of from is explained. Let ( and ) be the -long array such that is the symbol of the suffix . In particular, is the sentinel $ if the suffix is smaller than . Moreover, let be the interleave of the arrays such that . In other words, is the symbol of the suffix .

Lemma 2.

Let be a -segment of . Then, is a permutation of defined by the permutation of the indexes producing the stable ordering of the symbols in , such that the suffix of is the suffix of in position .

Proof.

First we prove that is a permutation of . Let us denote with the -prefix common to suffixes in , and let be a position in . Given a position , by definition, the -prefix of is strictly smaller than . Then, the -prefix of is strictly smaller than the -prefix of . In the same way, given a position by definition, the -prefix of is strictly greater than . Then, the -prefix of is strictly greater than the -prefix of . Hence, the set of the suffixes of before and the set of the suffixes after are equal (respectively) to the set of the suffixes of before and to the set of the suffixes after , thus deriving that for the suffix is equal to for some in , completing the proof of the first part.

Furthermore, all suffixes in share the common -prefix , and therefore their -order can be determined by ordering their symbols. More specifically, the suffix () is the suffix in , where is the rank of its character in the stable order of . ∎

Given the suffix in position of , such that is in the -segment , the Lemma 2 allows to compute its position on . Let be the number of symbols of that are strictly smaller than and let be the number of symbols of which are equal to . Then, the rank of suffix in is , thus deriving that its position in is . It is possible to notice that the positions on are partioned into -segments (referred as induced by the -segment of ), where is the number of distinct non-$ symbols in plus the number of symbols $ in . Observe that the first -segments have width , while the width of the last -segments can be computed as follows. Let be the ordered set of the distinct non-$ symbols in . Then, the width of () is equal to the number of occurrences of the symbol in . From what described above, it derives that the -segments on form a partition of its positions that is a refinement of the partition formed by the -segments on .

Now, we describe a simple procedure (see Algorithm 3) to compute from the -segment of . The procedure uses (initially empty) lists . Each position is considered from to and each suffix is appended to the list such that is the symbol of . Afterwards, each list of the sequence is read sequentially, and each suffix in position of the list is put in the position of , where is given by the total size of the lists (which have been previously read) plus the position of . Observe that contains the rank of in the -ordering of the suffixes of the -segment.

1 empty lists;
2 for  to  do
3      ;
4       Append to ;
5      
6;
7 for  to  do
8      for  to  do
9            ;
10             ;
11             ;
12            
Algorithm 3 Compute -segment on from a -segment

Algorithm 3 can be easily modified in order to produce also the -segments induced by the -segment . Observe that the first -segments have width , while for the last -segments it is easy to prove that where is the total size of the first nonempty lists . The entire interleave is obtained by computing for each distinct -segment of .

At this point, it is immediate to extend the definition of -segment from to its encoding , and to see that the Algorithm 3 can be slightly modified to compute from the -segment of (see Algorithm 4).

1 empty lists;
2 for  to  do
3      ;
4       Append to ;
5      
6;
7 for  to  do
8      for  to  do
9            ;
10             ;
11             ;
12            
Algorithm 4 Compute -segment on from a -segment

Based on Algorithm 4 we designed the iterative procedure Merge-suffixes (see Algorithm 6) to compute the encoding starting from the encoding that can be easily obtained as explained before. Recall that is the encoding of the interleave of the arrays giving the BWT of the input set . The iteration of the procedure computes from , by scanning the array , and is detailed in Algorithm 5. Precisely, the procedure, for each -segment , computes the portion of . We point out that it is not actually necessary to reconstruct the interleave from the arrays , since its encoding is , and therefore a scan of allows also to simulate a scan of (see Algorithm 1).

1 empty lists;
2 for  to  do
3      ;
4      
5;
6 for  to  do
7      if pick_up_start = true then
8             ;
9             ;
10            
11      ;
12       ;
13       ;
14       Append to ;
15       if  is the end position of a -segment then
16             ;
17             ;
18             for  to  do
19                  for  to  do
20                        ;
21                         ;
22                         if  and  then
23                               ;
24                              
25                        else
26                               ;
27                              
28                        ;
29                        
30             empty lists;
31            
Algorithm 5 Compute from

The conditions at line 5 of Algorithm 5 and at line 6 of Algorithm 6 are checked by using an auxilary binary array storing the -segments. More specifically, is true iff is the end position of some -segment. The array is sufficient to reconstruct the set of all -segments since they form a partition of positions , and it is read sequentially with the other arrays. For the sake of brevity the computation of (of each iteration ) is omitted.

Observe that, under the assumption that the input set does not contain duplicates, all the -segments of the encoding have width equal to . Moreover, after iterations, where is the length of the longest common substring of two strings in , (1) the encoding is equal to and (2) each with is identical to . Those two facts are a consequence of the following two observations: (i) the length of the longest common prefix between two strings is equal to the length of the longest common substring in , if all the -prefixes of the suffixes are distinct, (ii) the order relation does not effect the ordering given by , that is .

Algorithm 4 computes also the LCP array whose description is in the following Section 5.1. Section 5.2 is devoted to describe how to compute the arrays used by iteration .

Input : The arrays
Output : The encoding .
1 for  to  do
2      for  to  do
3            ; ;
4            
5Compute lists for ;
6 ;
7 while there exists some -segment on which is wider than  do
8      Compute from ;
9       Compute lists for ;
10      
11Output ;
Algorithm 6 Merge-suffixes

5.1 Computing the LCP array

The LCP array is obtained by exploiting Proposition 3 which easily follows from the definition of -segment.

Proposition 3.

Let be a position on the LCP array . Then is the largest such that is the start of a -segment (of ) and is not the start of a -segment (of ).

Proof.

Notice that, since the -segments are a refinement of the -segments, then there can be only one such . Let and be respectively the and the lexicographically smallest suffix of . Assume initially that is the start of a -segment, but not of a -segment. Since is not a start of a -segment, then and belong to the same -segment hence, by definition of segment, they share the same -prefix. Since is the start of a -segment, then and cannot belong to the same -segment, hence they do not share the same -prefix. Thus, . Assume now that , that is and share a common -prefix, but not a -prefix. Again, by definition of segment, and belong to the same -segment but not to the same -segment. ∎

At this point, let be the -long array such that is the length of the longest common prefix between the -prefix of suffix and the -prefix of suffix . The array is clearly equal to the LCP array of the input set .

Each iteration (see Algorithm 4) of our procedure computes from and the array is set to all s before starting the iterations. The following invariant, which directly implies its correctness, is maintained.

Lemma 4.

At the end of iteration , iff is not the start position of any -segment.

Proof.

We will prove the lemma by induction on . Before the first iteration, the array is set to all s, therefore we only have to consider the general case. Observe that (at the beginning of iteration ), given a -segment , we have for . Then, the procedure (see Algorithm 4) sets to the array in all positions of the induced -segments different from their start positions (line 5), completing the proof. ∎

5.2 Computing the arrays

In this section we describe how to compute the arrays used by iteration . Recall that is the -long array such that is the symbol of the smallest -suffix of ( is a sentinel $ if the suffix is smaller than ). The following proposition establishes a recursive definition of .

Lemma 5.

Let and be respectively the sorted -suffixes and -suffixes of the set . Let and be respectively the -suffix and the -suffix of a generic input string . Then the symbol of is the symbol of .

Since the suffixes and can have different positions in and , the list is a permutation of . Still, Algorithm 7 exploits the construction of to quickly compute . Notice that, for , is the result of sorting whereas for , is a sequence of sentinels Therefore the arrays can be trivially computed.

Input : The lists on alphabet , an integer with , and all .
Output : The lists for each
1 for  to  do
2       empty list;
3       for  to  do
4             empty list;
5            
6      for  to  do
7            Append to ;
8            
9      for  to  do
10            Append to ;
11            
Algorithm 7 Compute all lists for any given .

In order to prove the correctness of Algorithm 7 we need to show that the permutation over indexes of induced by the lexicographic ordering of , is the correct permutation of to obtain . Indeed, observe that is the permutation that relates positions of indexes of strings in to their positions in . More precisely, given a string of , such that its -suffix is in position of list , then if , it means that the -suffix is of the string is in position of list .

The above observation is a consequence of the fact that in order to get the lexicographic ordering of </