# A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings

## Abstract

Indexing of very large collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and various in-memory algorithms have been proposed for this problem. The rapid growth of the data that are routinely processed, for instance in bioinformatics, requires a large amount of main memory, and this fact has motivated the development of algorithms that compute the BWT almost entirely in external memory.

On the other hand, the related problem of computing the Longest Common Prefix (LCP) array is often instrumental in several algorithms on collections of strings, such as those that compute the suffix-prefix overlaps among strings, which is an essential step of many genome assembly algorithms.

The best current lightweight approach to compute the BWT and the LCP array on a set of $m$ strings, each one $k$ characters long, has I/O complexity $O(mk^2 \log \sigma)$ (where $\sigma$ is the size of the alphabet), thus it is not optimal.

In this paper we propose a novel approach to build the BWT and the LCP array (simultaneously) with $O(mk\ell \log \sigma)$ I/O complexity, where $\ell$ is the length of the longest substring that appears at least twice in the input strings.

## 1 Introduction

In this paper we address the problem of constructing in external memory the Burrows-Wheeler Transform (BWT) and the Longest Common Prefix (LCP) array of a large collection of strings. An efficient indexing of very large collections of strings is strongly motivated by the widespread use of Next-Generation Sequencing (NGS) technologies, which produce every day collections of data that fill several terabytes of secondary storage and have to be processed by software applications. Common applications in metagenomics require indexing collections of strings (reads) that are sampled from several genomes, where those genomes amount to billions of base pairs. For example, over 500 gigabases of data have been analyzed to start a catalogue of the human gut microbiome [20].

The Burrows-Wheeler Transform (BWT) [6] is a reversible transformation of a text that was originally designed for text compression; it is used for example in the BZIP2 program. The BWT of a text $T$ is a permutation of its symbols and is strictly related to the Suffix Array of $T$. In fact, the $i$-th symbol of the BWT is the symbol preceding the $i$-th smallest suffix of $T$, according to the lexicographic sorting of the suffixes of $T$. The Burrows-Wheeler Transform has gained importance beyond its initial purpose, and has become the basis of self-indexing structures such as the FM-index [8], which makes it possible to efficiently perform important tasks such as searching a pattern in a text [8, 12, 21]. The generalization of the BWT (and the FM-index) to a collection of strings has been introduced in [16, 17].
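As a concrete illustration of this definition, the transform of a single text can be computed by sorting its suffixes. The following is a minimal sketch (the function name and the use of `$` as sentinel are ours; the naive suffix sorting is for clarity only and is of course unsuitable for the data sizes discussed here):

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Compute the BWT of `text` from its (naively built) suffix array.

    The i-th symbol of the BWT is the symbol preceding the i-th smallest
    suffix of text + sentinel; the sentinel is assumed smaller than
    every symbol of the text, which makes the transform reversible.
    """
    s = text + sentinel
    # Naive suffix sorting: fine for an illustration, not for terabytes.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # s[i - 1] wraps around to the sentinel when i == 0.
    return "".join(s[i - 1] for i in suffix_array)
```

For example, `bwt("banana")` returns the classical `"annb$aa"`: each output symbol precedes one suffix of `banana$` in lexicographic order.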

An entire generation of recent bioinformatics tools heavily relies on the notion of BWT. For example, representing the reference genome with its FM-index is the basis of the most widely used aligners, such as Bowtie [10], BWA [13, 14] and SOAP2 [15].

Still, to attack some other fundamental bioinformatics problems, such as genome assembly, an all-against-all comparison among the input strings is needed, especially to find all prefix-suffix matches (or overlaps) between reads in the context of the Overlap Layout Consensus (OLC) approach based on string graphs [18]. This fact justifies the search for extremely time- and space-efficient algorithms to compute the BWT of a collection of strings [11, 25, 7, 2]. For example, SGA (String Graph Assembler) [23] is a de novo genome assembler that builds a string graph from the FM-index of the collection of input reads. In a preliminary version of SGA [22], the authors estimated, for human sequencing data at 20x coverage, the need for 700 GB of RAM in order to build the suffix array, using the construction algorithm in [19], and the FM-index.

Another technical device used to tackle genome assembly in the OLC approach is the Longest Common Prefix (LCP) array of a collection of strings, which is instrumental to compute (among others) the prefix-suffix matches in the collection. The huge amount of available biological data has stimulated the development of the first efficient external-memory algorithms (called BCR and BCRext) to construct the BWT of a collection of strings [1]. Similarly, a lightweight approach to the construction of the LCP array has been investigated [3]. Moving towards an external-memory genome assembler, LSG [4, 5] is founded upon BCRext and builds in external memory the string graph of a set of strings. In that approach, external-memory algorithms to compute the BWT and the LCP array [2, 3] are fundamental.

Still, the construction of the BWT (and the LCP array) of a huge collection of strings is a challenging task. A simple approach is to construct the BWT from the Suffix Array, but this is prohibitive for massive datasets. A first attempt to solve this problem [24] partitions the input collection into batches, computes the BWT of each batch and then merges the results.

In this paper we present a new lightweight (external-memory) approach to compute the BWT and the LCP array of a collection of strings, which is an alternative to BCRext [1]. The algorithm BCRext was proposed together with BCR, and both are designed to work on huge collections of strings (the experimental analysis is on billions of 100-long strings). Those algorithms are lightweight because, on a collection of $m$ strings of length $k$, BCR uses only $O(m \log (mk))$ bits of RAM and $O(k\,\mathrm{sort}(m))$ CPU time, where $\mathrm{sort}(m)$ is the time taken to sort $m$ integers. The same complexity holds for the lightweight LCP algorithm given in [3]. Though the use of the RAM is negligible for DNA data, the overall I/O volume is $O(mk^2 \log \sigma)$. Clearly, a main question is whether it is possible to achieve the optimal I/O complexity. Both BCR and BCRext build the BWT with a column-wise approach, where at each step $j$ the symbols preceding the $j$-suffixes (the suffixes of length $j$) of each read are inserted in the correct positions of the partial BWT that considers only the suffixes shorter than $j$. Moreover, both algorithms are described as a succession of sequential scans, where the partial BWTs are read from and written to external files, thus obtaining a small main memory footprint.
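A minimal in-memory sketch of this column-wise idea follows. BCR and BCRext stream the partial BWT through external files; here, purely for illustration, it is a Python list updated with binary-search insertions, and all names are ours:

```python
import bisect

def columnwise_bwt(strings):
    """Column-wise construction of a multi-string BWT, in the spirit of
    BCR: at step j, the symbol preceding each j-suffix is inserted at
    the position that the j-suffix occupies among all suffixes inserted
    so far. Ties between equal suffixes are broken by string index,
    which mimics giving each string a distinct end-of-string sentinel.
    """
    keys, out = [], []
    for j in range(max(len(s) for s in strings) + 1):
        for i, s in enumerate(strings):
            if j > len(s):
                continue
            suffix = s[len(s) - j:] + "$"               # the j-suffix
            prev = s[len(s) - j - 1] if j < len(s) else "$"
            pos = bisect.bisect_left(keys, (suffix, i))
            keys.insert(pos, (suffix, i))
            out.insert(pos, prev)                       # partial BWT
    return "".join(out)
```

For instance, `columnwise_bwt(["ab", "aa"])` produces `"baa$$a"`, the symbols preceding the six suffixes of `ab$` and `aa$` in lexicographic order.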

Compared to BCRext, our algorithm uses an I/O volume that is $O(mk\ell \log \sigma)$, where $\ell$ is the maximum length of a substring appearing at least twice in the input collection. Clearly $\ell \le k$. Compared with BCR, our approach does not require an explicit sort of a generic set, but is mainly based on the simple idea of building partial BWTs, one for the set of suffixes of each given length $j$, and then merging those partial BWTs to obtain the complete BWT, using an approach similar to the one proposed in [9], where a multi-string BWT is constructed with the main goal of merging the BWTs of distinct genomic sequences.
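The idea of building one partial BWT per suffix length and then merging can be sketched as follows; the heap-based merge merely stands in for the I/O-efficient merging in the style of [9], and all names are ours:

```python
import heapq

def bwt_by_merging(strings):
    """Build one partial BWT per suffix length j -- the symbols
    preceding the j-suffixes, listed in the lexicographic order of
    those j-suffixes -- then merge the partial BWTs into the complete
    multi-string BWT. Ties between equal suffixes are broken by string
    index, mimicking distinct sentinels.
    """
    partial = []
    for j in range(max(len(s) for s in strings) + 1):
        bucket = []
        for i, s in enumerate(strings):
            if j <= len(s):
                suffix = s[len(s) - j:] + "$"
                prev = s[len(s) - j - 1] if j < len(s) else "$"
                bucket.append(((suffix, i), prev))
        bucket.sort()                      # the j-th partial BWT
        partial.append(bucket)
    # k-way merge of the sorted partial lists, keyed by the suffixes.
    return "".join(prev for _, prev in heapq.merge(*partial))
```

The result coincides with the column-wise construction; the point of the paper is to obtain the merge without ever comparing whole suffixes, which is where the $\ell$ factor in the I/O volume comes from.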

## 2 Preliminaries

###### Lemma 2.

Let $B$ be a $k$-segment of $B_k$, that is, a maximal interval of $B_k$ whose suffixes share the same $k$-prefix. Then, the portion of $B_{k+1}$ corresponding to $B$ is a permutation of $B$ defined by the permutation $\pi$ of the indexes producing the stable ordering of the symbols in position $k+1$ of the suffixes of $B$, such that the suffix in position $i$ of $B_k$ is the suffix in position $\pi(i)$ of $B_{k+1}$.

###### Proof.

First we prove that the portion of $B_{k+1}$ corresponding to $B$ is a permutation of $B$. Let us denote with $w$ the $k$-prefix common to the suffixes in $B$, and let $p$ be a position in $B$. Given a position $q$ preceding $B$, by definition, the $k$-prefix of the suffix in position $q$ is strictly smaller than $w$. Then, the $(k+1)$-prefix of the suffix in position $q$ is strictly smaller than the $(k+1)$-prefix of the suffix in position $p$. In the same way, given a position $q$ following $B$, by definition, the $k$-prefix of the suffix in position $q$ is strictly greater than $w$. Then, the $(k+1)$-prefix of the suffix in position $q$ is strictly greater than the $(k+1)$-prefix of the suffix in position $p$. Hence, the set of the suffixes of $B_{k+1}$ before $B$ and the set of the suffixes after $B$ are equal (respectively) to the set of the suffixes of $B_k$ before $B$ and to the set of the suffixes after $B$, thus deriving that for each position $i$ of $B$ the suffix of $B_{k+1}$ in position $i$ is equal to the suffix of $B_k$ in some position $i'$ of $B$, completing the proof of the first part.

Furthermore, all suffixes in $B$ share the common $k$-prefix $w$, and therefore their $(k+1)$-order can be determined by ordering their symbols in position $k+1$. More specifically, the suffix in position $i$ of $B$ is the suffix in position $\pi(i)$ of $B_{k+1}$, where $\pi(i)$ is given by the rank of its character in position $k+1$ within the stable ordering of the characters in position $k+1$ of the suffixes of $B$. ∎
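One way to read the refinement behind the lemma is the following sketch, where the order by $k$-prefix is turned into the order by $(k+1)$-prefix one $k$-segment at a time (an illustrative reading under our notation, not the paper's implementation):

```python
def refine(order, k):
    """Turn a list of suffixes sorted by k-prefix into the same list
    sorted by (k+1)-prefix. Each k-segment (maximal run sharing the
    same k-prefix) is stably re-sorted by the symbol in position k+1;
    suffixes shorter than k+1 sort first, as if padded with a sentinel
    smaller than every alphabet symbol.
    """
    out, i = [], 0
    while i < len(order):
        j = i
        while j < len(order) and order[j][:k] == order[i][:k]:
            j += 1                       # order[i:j] is one k-segment
        # Python's sorted() is stable, so ties keep their k-order;
        # s[k:k+1] is '' (smallest) when the suffix is too short.
        out.extend(sorted(order[i:j], key=lambda s: s[k:k + 1]))
        i = j
    return out
```

Starting from any initial order (every order is a valid $0$-prefix order, since all $0$-prefixes are empty) and iterating `refine` for $k = 0, 1, \dots$ yields the fully sorted list of suffixes.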

###### Lemma 5.

Let $L_j$ and $L_{j+1}$ be respectively the sorted $j$-suffixes and the sorted $(j+1)$-suffixes of the set $S$. Let $\alpha$ and $\beta$ be respectively the $j$-suffix and the $(j+1)$-suffix of a generic input string $s$. Then the symbol of $s$ preceding $\alpha$ is the first symbol of $\beta$.

Since the suffixes $\alpha$ and $\beta$ can have different positions in $L_j$ and $L_{j+1}$, the list of string indexes associated with $L_{j+1}$ is a permutation of the one associated with $L_j$. Still, Algorithm 7 exploits the construction of $L_{j+1}$ to quickly compute $B_{j+1}$. Notice that, for $j = 1$, $L_1$ is the result of sorting the last symbols of the input strings, whereas for $j = 0$, $L_0$ is a sequence of sentinels. Therefore those arrays can be trivially computed.

In order to prove the correctness of Algorithm 7 we need to show that the permutation $\pi$ over the indexes of $L_j$, induced by the lexicographic ordering of the $(j+1)$-suffixes, is the correct permutation of $L_j$ to obtain $L_{j+1}$. Indeed, observe that $\pi$ is the permutation that relates the positions of the strings' suffixes in $L_j$ to their positions in $L_{j+1}$. More precisely, given a string $s$ of $S$ such that its $j$-suffix is in position $p$ of the list $L_j$, if $\pi(p) = q$, then the $(j+1)$-suffix of the string $s$ is in position $q$ of the list $L_{j+1}$.

The above observation is a consequence of the fact that, in order to get the lexicographic ordering of the $(j+1)$-suffixes, it suffices to stably sort the $j$-suffixes of $L_j$ according to their preceding symbol.
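This stable-sort fact can be sketched directly. Here $L_j$ is represented by the list of indexes of the strings, ordered by their $j$-suffixes, and all function and variable names are ours:

```python
def next_order(strings, order_j, j):
    """Given the indexes of the strings listed by the lexicographic
    order of their j-suffixes (L_j), return the analogous list for the
    (j+1)-suffixes (L_{j+1}), together with the permutation pi, where
    pi[p] = q means: the string whose j-suffix is in position p of L_j
    has its (j+1)-suffix in position q of L_{j+1}.

    Key fact: a (j+1)-suffix is a j-suffix preceded by one symbol, so
    a stable sort of L_j by that preceding symbol yields L_{j+1}
    (strings are assumed longer than j, as for equal-length reads).
    """
    positions = sorted(range(len(order_j)),
                       key=lambda p: strings[order_j[p]][-j - 1])
    pi = {p: q for q, p in enumerate(positions)}
    return [order_j[p] for p in positions], pi
```

Stability of the sort is essential: among suffixes extended by the same symbol, the relative order inherited from $L_j$ is exactly their order in $L_{j+1}$.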