Faster algorithms for 1-mappability of a sequence

Mai Alzamel Department of Informatics, King’s College London, London, UK
[mai.alzamel,panagiotis.charalampopoulos,costas.iliopoulos,solon.pissis]@kcl.ac.uk
Panagiotis Charalampopoulos Department of Informatics, King’s College London, London, UK
Costas S. Iliopoulos Department of Informatics, King’s College London, London, UK
Solon P. Pissis Department of Informatics, King’s College London, London, UK
Jakub Radoszewski Institute of Informatics, University of Warsaw, Warsaw, Poland
Wing-Kin Sung Department of Computer Science, National University of Singapore, Singapore
ksung@comp.nus.edu.sg
Abstract

In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n / log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if m = Ω(log_σ n), where σ is the alphabet size.

1 Introduction

The focus of this work is directly motivated by the well-known and challenging application of genome re-sequencing—the assembly of a genome directed by a reference sequence. New developments in sequencing technologies [15] allow whole-genome sequencing to be turned into a routine procedure, creating sequencing data in massive amounts. Short sequences, known as reads, are produced in huge amounts (tens of gigabytes); and in order to determine the part of the genome from which a read was derived, it must be mapped (aligned) back to some reference sequence that consists of a few gigabases. A wide variety of short-read alignment techniques and tools have been published in the past years to address the challenge of efficiently mapping tens of millions of reads to a genome, focusing on different aspects of the procedure: speed, sensitivity, and accuracy [8]. These tools allow for a small number of errors in the alignment.

The k-mappability problem was first introduced in the context of genome analysis in [5] (and in some sense earlier in [3]), where a heuristic algorithm was proposed to approximate the solution. The aim from a biological perspective is to compute the mappability of each region of a genome sequence; i.e. for every factor of a given length of the sequence, we are asked to count how many other times it occurs in the genome with up to a given number of errors. This is particularly useful in the application of genome re-sequencing. By computing the mappability of the reference genome, we can then assemble the genome of an individual with greater confidence by first mapping the segments of the DNA that correspond to regions with low mappability. Interestingly, it has been shown that genome mappability varies greatly between species and gene classes [5].

Formally, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other length-m factors of x that are at Hamming distance at most k from y.

Example 1.

Consider the string x = aabaaabbbb and m = 3. The following table shows the k-mappability counts for k = 0 and k = 1.

position       0    1    2    3    4    5    6    7
factor         aab  aba  baa  aaa  aab  abb  bbb  bbb
0-mappability  1    0    0    0    1    0    1    1
1-mappability  3    2    1    4    3    5    2    2

For instance, consider the position 0. The 0-mappability is 1, as the factor aab occurs also at position 4. The 1-mappability at this position is 3 due to the occurrence of aab at position 4 and occurrences of two factors at Hamming distance 1 from aab: aaa at position 3 and abb at position 5.
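
The counts of Example 1 can be reproduced with a brute-force check. The sketch below is purely illustrative: it runs in O(n^2 m) time, far from the bounds discussed in this paper.

```python
# Brute-force k-mappability: for each length-m factor of x, count the
# OTHER length-m factors of x at Hamming distance at most k.
def mappability(x, m, k):
    n = len(x)
    counts = []
    for i in range(n - m + 1):
        c = 0
        for j in range(n - m + 1):
            if j == i:
                continue
            dist = sum(a != b for a, b in zip(x[i:i+m], x[j:j+m]))
            if dist <= k:
                c += 1
        counts.append(c)
    return counts
```

Running it on x = aabaaabbbb with m = 3 reproduces both rows of the table above.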

The 0-mappability problem can be solved in O(n) time with the well-known LCP data structure [7]. For k = 1, to the best of our knowledge, the fastest known algorithm is by Manzini [14]. This solution runs in O(mn log n / log log n) time and O(n) space and works only for strings over a fixed-sized alphabet. Since the problem for k = 0 can be solved in O(n) time, one may focus on counting, for each length-m factor y of x, the number of other factors of x that are at Hamming distance exactly 1, instead of at most 1, from y.

Our contributions. Here we make the following threefold contribution:

(a)

We present an algorithm that, given a string x of length n over a fixed-sized alphabet and a positive integer m, solves the 1-mappability problem in O(n · min{m, log^2 n}) time and O(n) space, thus improving on the algorithm of [14] that requires O(mn log n / log log n) time and O(n) space.

(b)

We present an algorithm to solve the 1-mappability problem in O(mn) time and O(n) space that works for strings over an integer alphabet.

(c)

We present an algorithm that, given a string x of length n over an integer alphabet of size σ, with the letters of x being independent and identically distributed random variables, uniformly distributed over the alphabet, and a positive integer m, solves the 1-mappability problem for m ≥ 3 log_σ n + 3 in average-case time and space O(n).

The paper is organised as follows. In Section 2, we provide basic definitions and notation as well as a description of the algorithmic tools we use to design our algorithms. In Sections 3 and 4, we provide the average-case and the worst-case algorithms, respectively. We conclude with some final remarks in Section 5.

2 Preliminaries

We begin with some basic definitions and notation. Let x = x[0]x[1]⋯x[n−1] be a string of length |x| = n over a finite ordered alphabet Σ of size |Σ| = σ. We also consider the case of strings over an integer alphabet, where each letter is replaced by its rank in such a way that the resulting string consists of integers in the range {0, …, n−1}.

For two positions i and j on x, we denote by x[i..j] = x[i]⋯x[j] the factor (sometimes called substring) of x that starts at position i and ends at position j (it is of length j − i + 1 if i ≤ j). By ε we denote the empty string of length 0. We recall that a prefix of x is a factor that starts at position 0 (x[0..j]) and a suffix of x is a factor that ends at position n − 1 (x[i..n−1]). We denote the reverse string of x by rev(x), i.e. rev(x) = x[n−1]x[n−2]⋯x[0].

Let y be a string of length m with 0 < m ≤ n. We say that there exists an occurrence of y in x, or, more simply, that y occurs in x, when y is a factor of x. Every occurrence of y can be characterised by a starting position in x. Thus we say that y occurs at the starting position i in x when y = x[i..i+m−1].

The Hamming distance between two strings x and y of the same length is defined as d_H(x, y) = |{i : x[i] ≠ y[i]}|. If |x| ≠ |y|, we set d_H(x, y) = ∞.

The computational problem in scope can be formally stated as follows.

1-mappability

Input: A string x of length n and an integer m, 0 < m ≤ n
Output: An integer array A of size n − m + 1 such that A[i] stores the number of factors x[j..j+m−1], j ≠ i, of x that are at Hamming distance at most 1 from x[i..i+m−1]

2.1 Suffix array and suffix tree

Let x be a string of length n. We denote by SA the suffix array of x. SA is an integer array of size n storing the starting positions of all (lexicographically) sorted non-empty suffixes of x, i.e. for all 1 ≤ r < n we have x[SA[r−1]..n−1] < x[SA[r]..n−1] [13]. Let lcp(r, s) denote the length of the longest common prefix between x[r..n−1] and x[s..n−1] for positions r, s on x. We denote by LCP the longest common prefix array of x, defined by LCP[r] = lcp(SA[r−1], SA[r]) for all 1 ≤ r < n, and LCP[0] = 0. The inverse iSA of the array SA is defined by iSA[SA[r]] = r, for all 0 ≤ r < n. It is known that SA, iSA, and LCP of a string of length n, over an integer alphabet, can be computed in time and space O(n) [16, 7]. It is then known that a range minimum query (RMQ) data structure over the LCP array, which can be constructed in O(n) time and O(n) space [4], can answer lcp-queries in O(1) time per query [13]. A symmetric construction on rev(x) can answer the so-called longest common suffix (lcs) queries in the same complexity. The lcp and lcs queries are also known as longest common extension (LCE) queries.
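
The suffix array toolbox above can be sketched as follows. This is a simplified stand-in: the suffix array is built by naive sorting rather than in O(n) time, the LCP array is computed by a standard linear-time scan over the text, and LCE queries take a minimum over an LCP range instead of using an O(1)-time RMQ structure.

```python
# Illustrative (not linear-time) versions of SA, iSA, LCP and LCE queries.
def suffix_array(x):
    # Naive O(n^2 log n) construction; stands in for the O(n) algorithm.
    return sorted(range(len(x)), key=lambda i: x[i:])

def lcp_and_isa(x, sa):
    n = len(x)
    isa = [0] * n
    for r, i in enumerate(sa):
        isa[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):  # linear-time LCP computation from SA and iSA
        if isa[i] > 0:
            j = sa[isa[i] - 1]
            while i + h < n and j + h < n and x[i + h] == x[j + h]:
                h += 1
            lcp[isa[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp, isa

def lce(x, sa, isa, lcp, i, j):
    # Length of the longest common prefix of x[i..] and x[j..];
    # an RMQ structure over LCP would answer this in O(1) time.
    if i == j:
        return len(x) - i
    a, b = sorted((isa[i], isa[j]))
    return min(lcp[a + 1:b + 1])
```

A symmetric construction on rev(x) yields lcs (longest common suffix) queries.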

The suffix tree T(x) of string x is a compact trie representing all suffixes of x. The nodes of the trie which become nodes of the suffix tree are called explicit nodes, while the other nodes are called implicit. Each edge of the suffix tree can be viewed as an upward maximal path of implicit nodes starting with an explicit node. Moreover, each node belongs to a unique path of that kind. Thus, each node of the trie can be represented in the suffix tree by the edge it belongs to and an index within the corresponding path. The label of an edge is its first letter. We let L(v) denote the path-label of a node v, i.e., the concatenation of the edge labels along the path from the root to v. We say that v is path-labelled L(v). Additionally, D(v) = |L(v)| is used to denote the string-depth of node v. Node v is a terminal node if its path-label is a suffix of x, that is, L(v) = x[i..n−1] for some 0 ≤ i < n; here v is also labelled with index i. It should be clear that each factor of x is uniquely represented by either an explicit or an implicit node of T(x). In standard suffix tree implementations, we assume that each node of the suffix tree is able to access its parent. Once T(x) is constructed, it can be traversed in a depth-first manner to compute D(v) for each node v.

It is known that the suffix tree of a string of length n, over an integer alphabet, can be computed in time and space O(n) [6]. For integer alphabets, in order to access the children of an explicit node by the first letter of their edge label, perfect hashing [9] can be used.

3 Efficient Average-Case Algorithm

In this section we assume that x is a string over an integer alphabet. Recall that d_H(u, v) denotes the Hamming distance between two strings u and v of the same length.

Fact 1 (Folklore).

Given two strings u and v, each of length m, we have that if d_H(u, v) ≤ k, then u and v share at least one factor of length ⌊m/(k+1)⌋ occurring at the same position in both.

Fact 2.

Given a string x and any two positions i ≠ j on x, we have that if d_H(x[i..i+m−1], x[j..j+m−1]) = 1, then x[i..i+m−1] and x[j..j+m−1] have at least one common factor of length L = ⌊m/3⌋ starting at positions p and q of x, such that p − i = q − j and p ≡ 0 (mod L) or q ≡ 0 (mod L).

Proof.

It should be clear that every factor of x of length m fully contains at least two factors of length L starting at positions equal to 0 (mod L). Then, if x[i..i+m−1] and x[j..j+m−1] are at Hamming distance 1, analogously to Fact 1, at least one of the factors of length L that are fully contained in x[i..i+m−1] occurs at the corresponding position in x[j..j+m−1]; otherwise we would have a Hamming distance greater than 1. ∎
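
Fact 2 can be checked empirically. The sketch below tests, for two length-m windows of x, whether they share an aligned factor of length L = ⌊m/3⌋ starting at a position divisible by L in at least one of the windows; by Fact 2 this must succeed whenever the windows are at Hamming distance 1.

```python
# Empirical check of Fact 2 (illustrative; the string and the parameters
# in the usage below are our own small example).
def shares_aligned_block(x, i, j, m):
    L = m // 3
    for off in range(0, m - L + 1):
        p, q = i + off, j + off
        if (p % L == 0 or q % L == 0) and x[p:p+L] == x[q:q+L]:
            return True
    return False
```

Exhaustively checking all window pairs at Hamming distance exactly 1 in a small string confirms the fact.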

We first initialize an array A of size n − m + 1, with 0 in all positions; for all i, A[i] will eventually store the number of factors of x that are at Hamming distance at most 1 from x[i..i+m−1]. We apply Fact 2 by implicitly splitting the string x into ⌊n/L⌋ blocks of length L = ⌊m/3⌋ (the suffix of x of length n mod L is not taken as a block), starting at the positions of x that are equal to 0 (mod L). In order to find all pairs of length-m factors that are at Hamming distance at most 1 from each other, we can find all the exact matches of every block and try to extend each of them to the left and to the right, allowing at most one mismatch. However, we need to tackle some technical details to correctly update our counters and avoid double counting.

We start by constructing the SA and LCP arrays for x and rev(x) in O(n) time. We also construct RMQ data structures over the LCP arrays for answering LCE queries in constant time per query. By exploiting the LCP array information, we can then find in O(n) time all maximal sets of indices such that the longest common prefix between any two of the suffixes starting at these indices is at least L and at least one of them is the starting position of some block.

Then for each such set, denoted by S, we have to do the following procedure for each index i ∈ S such that i ≡ 0 (mod L).

For every other j ∈ S, we try to extend the match by asking two LCE queries in each direction. That is, we ask an lcs query to find the positions p_1 and q_1 of the first mismatch to the left, and then a second lcs query to find the second mismatch (positions p_2 and q_2, respectively). A symmetric procedure with lcp queries computes the mismatches (r_1, s_1) and (r_2, s_2) to the right, as shown in Figure 1. We omit here some technical details with regards to reaching the start or end of x.

Now we are interested in the positions a such that p_2 < a ≤ p_1 and a + m − 1 < r_1, and the positions a′ such that p_1 < a′ and r_1 ≤ a′ + m − 1 < r_2; these are the starting positions of length-m windows containing exactly one of the identified mismatches. Each such position a (resp. a′) implies that d_H(x[a..a+m−1], x[b..b+m−1]) = 1, where b = a + j − i (resp. b = a′ + j − i). Henceforth, we only consider positions of the first type; the second type is handled symmetrically.

Note that if d_H(x[a..a+m−1], x[b..b+m−1]) = 1, we will identify the unordered pair {a, b} based on the described approach f times, where f is the total number of full blocks contained in x[a..a+m−1] and in x[b..b+m−1] after the mismatch position. It is not hard to compute the number f in O(1) time based on the starting positions i and j as well as a and b each time we identify {a, b}. To avoid double counting, we then increment the A[a] and A[b] counters by 1/f.

By EXT_{i,j} we denote the time required to process a pair of elements i, j of a set S such that at least one of them, i or j, equals 0 (mod L).

Lemma 1.

The time EXT_{i,j} is O(m).

Proof.

Given i, j ∈ S, with at least one of them equal to 0 (mod L), we can find the pairs of positions that satisfy the inequalities discussed above in O(1) time. They are a subset of {(i − t, j − t) : 0 ≤ t ≤ m − L}. For each such pair we can compute f and increment A[a] and A[b] accordingly in O(1) time. The total time to process all pairs for given i, j is thus O(m). ∎

Theorem 1.

Given a string x of length n over an integer alphabet of size σ, with the letters of x being independent and identically distributed random variables, uniformly distributed over the alphabet, the 1-mappability problem can be solved in average-case O(n) time and O(n) space if m ≥ 3 log_σ n + 3.

Proof.

The time and space required for constructing the SA and LCP tables for x and rev(x) and the RMQ data structures over the LCP tables is O(n).

Let B denote the number of blocks over x, and let L denote the block length. We set

 L = ⌊m/3⌋,  B = ⌊n/L⌋

to apply Fact 2. Recall that by S we denote a maximal set of indices of the LCP table such that the length of the longest common prefix between any two suffixes starting at these indices is at least L and at least one of them is the starting position of some block. Processing all such sets requires time

 O(EXT_{i,j} · Occ)

where EXT_{i,j} is the time required to process a pair of elements of a set S; and Occ is the sum, over all sets S, of the cardinality of S multiplied by the number of the elements of S that are equal to 0 (mod L). By Lemma 1 we have that EXT_{i,j} = O(m). Additionally, by the stated assumption on the string x, the expected value of Occ is no more than B · n/σ^L. Hence, the algorithm on average requires time

 O(n + m·B·n/σ^L).

Assuming that m ≥ 4, we have the following:

 m·B·n/σ^L = m·⌊n/⌊m/3⌋⌋·n/σ^{⌊m/3⌋} ≤ m·(n/(m/3 − 1))·n/σ^{m/3 − 1} ≤ 12·n²/n^{(log σ/log n)·(m/3 − 1)} = 12·n^{2 − (m−3)·log σ/(3·log n)}.

Consequently, in the case when

 m ≥ 3·(log n / log σ) + 3

the algorithm requires O(n) time on average. The extra space usage is O(n). ∎
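
To get a feeling for the condition of Theorem 1, the snippet below (the function name is ours, for illustration only) evaluates the smallest m satisfying m ≥ 3 log_σ n + 3.

```python
# Smallest factor length m for which the average-case O(n) bound applies.
import math

def average_case_threshold(n, sigma):
    """Smallest integer m satisfying m >= 3*log(n)/log(sigma) + 3."""
    return math.ceil(3 * math.log(n) / math.log(sigma) + 3)
```

For a DNA-like alphabet (σ = 4) and n = 10^6 this gives m = 33, so the linear average-case bound already applies to fairly short factors; smaller alphabets require longer factors.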

4 Efficient Worst-Case Algorithms

4.1 O(mn)-time and O(n)-space algorithm

In this section we assume that x is a string over an integer alphabet. The main idea is that we want to first find all pairs of factors that have a mismatch in the first position, then in the second, and so on.

Let us fix j, 0 ≤ j ≤ m − 1. In order to identify the pairs with a mismatch in the (j+1)-th position (i.e. with the mismatch at offset j), we do the following. For every i, we find the explicit or implicit node u_{i,j} in T(x) that represents x[i..i+j−1] and the node v_{i,j} in T(rev(x)) that represents rev(x)[n−i−m..n−i−j−2], that is, the reverse of x[i+j+1..i+m−1]. In each such node u_{i,j}, we create a set R(u_{i,j}) (if it has not already been created) and insert the triple (v_{i,j}, x[i+j], i).

When we have done this for all possible starting positions i, we group the triples in each set by the node variable (i.e., the first component of the triples). For each such group in R(u) we count the number of triples with each letter of the alphabet and increment the array A accordingly. More precisely, if R(u) contains t triples that correspond to the same node v, among which t_a correspond to the letter a, then for each such triple (v, a, i) we increment A[i] by t − t_a; we subtract t_a to avoid counting equal factors of x. Before we proceed with the computations for the next index j, we delete all the sets R(u). We formalize this algorithm, denoted by 1-Map, in the pseudocode presented below and provide an example.

1-Map(x, n, m)
  T(x) ← SuffixTree(x)
  T(rev(x)) ← SuffixTree(rev(x))
  for string-depth j ← 0 to m − 1 do
    for i ← 0 to n − m do
      u_{i,j} ← Node(x[i..i+j−1])
      v_{i,j} ← Node(rev(x)[n−i−m..n−i−j−2])
      Insert (v_{i,j}, x[i+j], i) to R(u_{i,j})
    for every node u of string-depth j in T(x) do
      Group triples in R(u) by the node variable
      for each group corresponding to a node v in R(u) do
        Count the number of triples with each letter
        Update A accordingly for each triple
      Delete R(u)

Example 2.

Suppose we have R(u) = {(v, a, i_1), (v, a, i_2), (v, a, i_3), (v, a, i_4), (v, a, i_5), (v, a, i_6), (v, b, i_7), (v, b, i_8), (v, c, i_9)}, for some distinct positions i_1, …, i_9. We then increment A[i_1], A[i_2], A[i_3], A[i_4], A[i_5], and A[i_6] by 3; A[i_7] and A[i_8] by 7; and A[i_9] by 8.

We now analyze the time complexity of this algorithm. The algorithm iterates j from 0 to m − 1. In the j-th iteration, we need to compute u_{i,j} and v_{i,j} for all i. When j = 0, u_{i,0} for every i is the root of T(x) and we can find v_{i,0} for all i naïvely in O(nm) time. For j > 0, v_{i,j} can be found in O(1) time from v_{i,j−1} by moving one letter up in T(rev(x)) for all i, while u_{i,j} can be obtained from u_{i,j−1} by going down in T(x) based on the letter x[i+j−1]. We then include the triple (v_{i,j}, x[i+j], i) in R(u_{i,j}).

This requires in total O(nm) randomized time due to perfect hashing [9], which allows going down from a node in T(x) (or in T(rev(x))) based on a letter in O(1) randomized time. We can actually avoid this randomization, as queries for a particular child of a node are asked in our solution in a somewhat off-line fashion: we use them only to compute v_{i,0} (n − m + 1 times) and u_{i,j} (from u_{i,j−1}).

Observation 1.

For an integer alphabet, one can answer q off-line queries asking for a child of an explicit or implicit node labelled with the letter a in O(n + q) (deterministic) time.

Proof.

A query for an implicit node is answered in O(1) time, as there is only one outgoing edge to check. All the remaining queries can be sorted lexicographically as pairs (node, letter) using radix sort. We can also assume that the children of every explicit node of the suffix tree are ordered by letter (otherwise we also radix sort them). Finally, all the queries related to a node can be answered in one go by iterating through the children list of the node once. ∎

Lastly, we use bucket sort to group the triples in each R(u) according to the node variable (recall that the nodes are represented by the edge and the index within the edge) and update the counters in O(n) time in total per iteration of j (using a global array indexed by the letters of the alphabet, which is zeroed again after each group has been processed). Overall the algorithm requires O(nm) time.

The suffix trees require O(n) space and we delete the sets R(u) after the j-th iteration; the space complexity of the algorithm is thus O(n). We obtain the following result.
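
The counting scheme of 1-Map can be prototyped with hash maps standing in for the suffix-tree nodes u_{i,j} and v_{i,j}: for each mismatch position j, factors are grouped by the pair (prefix before j, suffix after j), and a triple carrying letter a in a group of t triples, t_a of which carry a, contributes t − t_a. Note that this sketch does not achieve the O(nm) bound of the suffix-tree solution, since hashing length-Θ(m) slices costs Θ(m) each.

```python
# Hash-based sketch of the 1-Map counting scheme (distance at most 1).
from collections import Counter, defaultdict

def one_mappability(x, m):
    n = len(x)
    N = n - m + 1
    A = [0] * N
    # Pairs at Hamming distance 0 (exact duplicates).
    exact = Counter(x[i:i+m] for i in range(N))
    for i in range(N):
        A[i] += exact[x[i:i+m]] - 1
    # Pairs mismatching exactly at offset j: group by (prefix, suffix).
    for j in range(m):
        groups = defaultdict(list)
        for i in range(N):
            groups[(x[i:i+j], x[i+j+1:i+m])].append(i)
        for idxs in groups.values():
            letters = Counter(x[i+j] for i in idxs)
            t = len(idxs)
            for i in idxs:
                A[i] += t - letters[x[i+j]]  # partners with a different letter
    return A
```

Each pair at distance exactly 1 is counted once, at its unique mismatch offset; equal factors are excluded by the t − t_a subtraction.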

Theorem 2.

Given a string x of length n over an integer alphabet and a positive integer m, we can solve the 1-mappability problem in O(mn) time and O(n) space.

4.2 O(n log^2 n)-time and O(n)-space algorithm

In this section we assume that x is a length-n string over an ordered alphabet. Consider two factors of x represented by nodes z_1 and z_2 in T(x); the first observation we make is that the first mismatch between the two factors is determined by the labels of the distinct outgoing edges of the lowest common ancestor of z_1 and z_2 that lie on the paths from the root to z_1 and to z_2. For 1-mappability we require that what follows this mismatch is an exact match.

Definition 1.

Let T be a rooted tree. For each non-leaf node u of T, the heavy edge (u, v) is an edge for which the subtree of T rooted at v has the maximal number of leaves (in case of several such subtrees, we fix one of them). The heavy path of a node u is a maximal path of heavy edges that passes through u (it may contain 0 edges). The heavy path of T is the heavy path of the root of T.
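
A minimal sketch of Definition 1, computing the heavy child of every node from leaf counts; the encoding of the tree as a {node: children} mapping is our own choice for illustration.

```python
# Heavy edges by leaf counts: for each internal node, the heavy child is
# the child whose subtree has the most leaves (ties broken arbitrarily).
def heavy_child(children, root):
    leaves = {}
    def count(u):
        kids = children.get(u, [])
        leaves[u] = sum(count(c) for c in kids) if kids else 1
        return leaves[u]
    count(root)
    return {u: max(cs, key=lambda c: leaves[c])
            for u, cs in children.items() if cs}
```

Following heavy children from the root traces the heavy path of the tree; every root-to-node path intersects O(log n) heavy paths, which is the property the analysis in this section relies on.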

Consider the suffix tree T(x) and a node v in it. We say that an (explicit or implicit) node u is a level ancestor of v at string-depth d if D(u) = d and L(u) is a prefix of L(v). The heavy paths of T(x) can be used to compute level ancestors of nodes in O(log n) time. However, a more efficient data structure is known.

Lemma 2 ([2]).

After O(n)-time preprocessing on T(x), level ancestor queries on nodes of T(x) can be answered in O(log log n) time per query.

Definition 2.

Given a string x and a factor y of x, we denote by I_y the range in the SA of x that represents the suffixes of x that have y as a prefix.

Every node v in T(x) corresponds to an SA range I_{L(v)}, which we also denote by I_v. We can precompute I_v for all explicit nodes v in T(x) in O(n) time while performing a depth-first traversal of the tree as follows. For a non-terminal node v with children u_1, …, u_c, where I_{u_1} = [ℓ_1, r_1], …, I_{u_c} = [ℓ_c, r_c], we set I_v = [min{ℓ_1, …, ℓ_c}, max{r_1, …, r_c}]. If v is a terminal node (with children u_1, …, u_c), representing the suffix x[i..n−1], we set I_v = [min{iSA[i], ℓ_1, …, ℓ_c}, max{iSA[i], r_1, …, r_c}]. When a considered node is implicit, say along an edge (u, v), then its range is I_v.
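
As an alternative to precomputing the ranges, I_y can be located by binary search over the suffix array, which is the classical O(|y| log n)-time approach; the sketch below uses naive string slicing for the comparisons.

```python
# Locate the SA range of the suffixes having y as a prefix.
def sa_range(x, sa, y):
    """Return [l, r) such that sa[l:r] are exactly the suffixes with prefix y."""
    def first_index(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = x[sa[mid]:sa[mid] + len(y)]
            # strict=False: first suffix with prefix >= y (lower bound);
            # strict=True:  first suffix with prefix >  y (upper bound).
            if pref < y or (strict and pref == y):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first_index(False), first_index(True)
```

For x = banana, the range of "ana" covers the two suffixes ana and anana.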

Our algorithm relies heavily on the following auxiliary lemmas.

Lemma 3.

Consider a node z in T(x) with D(z) = m, and let 0 ≤ ℓ < m. Let w be the node such that L(w) = L(z)[ℓ..m−1]. Given the SA and the iSA of x, w can be computed in O(log log n) time after O(n)-time preprocessing.

Proof.

The SA range of the node z is I_z = [s, e]; SA[s] corresponds to the suffix x[SA[s]..n−1]. By removing the first ℓ letters, the suffix becomes x[SA[s]+ℓ..n−1]. The corresponding SA value is iSA[SA[s]+ℓ].

Let v be the (terminal) node of T(x) such that L(v) = x[SA[s]+ℓ..n−1]. The sought node w is the ancestor of v located at string-depth m − ℓ. It can be computed in O(log log n) time using the level ancestor data structure of Lemma 2. ∎

Lemma 4.

Let u and v be two nodes in T(x). We denote L(u) by y_1 and L(v) by y_2. We further denote by t the node, if any, such that L(t) = y_1 y_2. Given the SA and the iSA of x, as well as I_u and I_v, t (together with I_t) can be located in O(log n) time after O(n)-time preprocessing.

Proof.

We can compute I_t in O(log n) time using the SA and the iSA of x [10, 11]; we can then locate t in O(log log n) time using the level ancestor data structure of Lemma 2. ∎

We are now ready to present an algorithm for 1-mappability that requires O(n log^2 n) time and O(n) space. The first step is to build T(x). We then make every node of string-depth m explicit in T(x) and initialize a counter Count(z) = 0 for each such node z. For each explicit node, the corresponding SA range is also stored. We also identify the node with path-label x[i..i+m−1] for each position i; this can be done in O(n) time in total during a depth-first traversal.

PerformCount(T, m)
  HP ← HeavyPath(T)
  for each side tree S_i attached to a node u on HP with D(u) < m do
    Let e be the edge that connects S_i to HP
    c ← the label of e
    d ← the label of the heavy edge outgoing from u
    for each node z in S_i with D(z) = m do
      w ← the node with path-label suf(L(z), D(u) + 1)
      for each label c' ≠ c of an outgoing edge of u do
        t ← the node with path-label L(u) c' L(w)
        Count(z) ← Count(z) + |I_t|
      z' ← the node with path-label L(u) d L(w)
      Count(z') ← Count(z') + |I_z|
    PerformCount(S_i, m − D(u))

We then call PerformCount(T(x), m), which does the following (inspect also the pseudocode above and Figure 2). At first, the heavy path HP of T(x) is computed. Initially, we want to identify the pairs of factors of x of length m at Hamming distance 1 that have their mismatch in the labels of the edges outgoing from a node in HP. Given a node u in HP, with D(u) < m, for every side tree S_i attached to it (say by an edge with label c), we find all nodes of S_i with string-depth m. For every such node z, we use Lemma 3 to obtain the node w; that is, L(w) = suf(L(z), D(u) + 1). We then use Lemma 4 to compute I_t, where t is the node with path-label L(u) c' L(w), for all letters c' ≠ c such that there is an outgoing edge from u with label c', and increment Count(z) by |I_t|. Let the heavy edge from u have label d; we also increment Count(z'), where z' is the node with path-label L(u) d L(w), by |I_z| while processing node z.

This procedure then recurs on each of the side trees; i.e. for side tree S_i, attached to node u, it calls PerformCount(S_i, m − D(u)). Finally, we construct the array A from the array Count while performing one more depth-first traversal.

On the recursive calls of PerformCount in each of the side trees (e.g. S_i) attached to HP, we first compute the heavy paths (in O(|S_i|) time for S_i) and then consider each node of string-depth m of S_i at most once; as above, we process each such node in O(log n) time due to Lemmas 3 and 4. As there are at most n nodes of string-depth m, we do O(n log n) work in total. This is also the case as we go deeper in the tree. Since the number of leaves of the trees we are dealing with at least halves in each step, there are at most O(log n) such steps. Hence, each node of string-depth m will be considered O(log n) times and every time we will do O(log n) work for it. The overall time complexity of the algorithm is thus O(n log^2 n). The space complexity is clearly O(n). By applying Theorem 2 for short factors we obtain the following result.

Theorem 3.

Given a string x of length n over a fixed-sized alphabet and a positive integer m, we can solve the 1-mappability problem in O(n · min{m, log^2 n}) time and O(n) space.

5 Final Remarks

The natural next aim is to either extend the presented solutions to work for arbitrary k without increasing the time and space complexities dramatically or develop fundamentally new algorithms if this is not possible. In fact, we already know that the fast average-case algorithm presented in Section 3 can be generalized to work for arbitrary k in linear time. This adds, however, a multiplicative factor depending on k to the condition on the value of m. An interesting generalization of this problem would be to consider the edit distance model instead of the Hamming distance model; i.e. apart from mismatches to also allow for letter insertions and deletions.

Furthermore, a practical extension of the aforementioned problem is the following. Given reads from a particular sequencing machine, the basic strategy for genome re-sequencing is to map a seed of each read in the genome and then try and extend this match [1]. In practice, a seed could be, for example, a prefix of the read, as the accuracy is higher in the prefix of the read. It is reasonable to allow for a few errors when matching the seed to the reference genome to account for sequencing errors and genetic variation. Hence a closely-related problem to genome mappability that arises naturally from this application is the following: What is the minimal value of m that forces a given fraction of the starting positions in the reference genome to have k-mappability equal to 0?

A standard implementation of the algorithm presented in Section 3, when applied to a large sequence of length n, requires internal memory that is a large constant multiple of n bytes. Such memory requirements are a significant hurdle to the mappability computation for large datasets on standard workstations. Another direction of practical interest is thus to devise efficient algorithms for the problems of 0-mappability and 1-mappability in the External Memory model of computation. Efficient algorithms for computing the suffix array and the longest common prefix array in this model are already known and shown to perform well in practical terms (see [12], for example). Since the average-case algorithm in Section 3 scans the longest common prefix array from left to right sequentially, it would be interesting to see whether it can be implemented efficiently in external memory.

References

• [1] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403–410, 1990.
• [2] Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algor., 3(2):19, 2007.
• [3] Pavlos Antoniou, Jacqueline W. Daykin, Costas S. Iliopoulos, Derrick Kourie, Laurent Mouchard, and Solon P. Pissis. Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome. In ITAB, pages 1–4. IEEE Computer Society, 2009.
• [4] Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In LATIN, volume 1776 of LNCS, pages 88–94. Springer-Verlag, 2000.
• [5] Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David Knowles, Emanuele Raineri, Roderic Guigó, and Paolo Ribeca. Fast computation and applications of genome mappability. PLoS ONE, 7(1), 2012.
• [6] Martin Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137–143. IEEE Computer Society, 1997.
• [7] Johannes Fischer. Inducing the LCP-array. In WADS, volume 6844 of LNCS, pages 374–385. Springer-Verlag, 2011.
• [8] Nuno A. Fonseca, Johan Rung, Alvis Brazma, and John C. Marioni. Tools for mapping high-throughput sequencing data. Bioinformatics, 28(24):3169–3177, 2012.
• [9] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.
• [10] Dan Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.
• [11] Trinh N. D. Huynh, Wing-Kai Hon, Tak-Wah Lam, and Wing-Kin Sung. Approximate string matching using compressed suffix arrays. Theor. Comput. Sci., 352(1):240–249, March 2006.
• [12] Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi, and Bella Zhukova. Engineering external memory induced suffix sorting. In ALENEX, pages 98–108. SIAM, 2017.
• [13] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993.
• [14] Giovanni Manzini. Longest common prefix with mismatches. In SPIRE, volume 9309 of LNCS, pages 299–310. Springer, 2015.
• [15] Michael L. Metzker. Sequencing technologies – the next generation. Nat. Rev. Genet., 11(1):31–46, 2010.
• [16] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In DCC, pages 193–202. IEEE, 2009.