Optimal alignments of longest common subsequences and their path properties

University of Tartu, Liivi 2-513, 50409 Tartu, Estonia.
Georgia Tech, School of Mathematics, Atlanta, GA 30332-0160, USA.
University of Plymouth, School of Computation and Mathematics, Plymouth, PL4 8AA, Devon, UK.

Received December 2011; revised November 2012.
Abstract

We investigate the behavior of optimal alignment paths for homologous (related) and independent random sequences. An alignment of two finite sequences is optimal if it corresponds to a longest common subsequence (LCS). We prove the existence of lowest and highest optimal alignments and study their differences. A large difference between the extremal alignments implies a large variety of optimal alignments. We present several simulations indicating that for homologous sequences (those having a common ancestor) the distance between the extremal alignments is typically of much smaller order than for independent sequences. In particular, the simulations suggest that for homologous sequences the distance between the extremal alignments grows logarithmically. The main theoretical results of the paper prove that (under some assumptions) this is indeed the case. The paper suggests that the properties of the optimal alignment paths characterize the relatedness of the sequences.

Bernoulli 20(3), 2014, 1292–1343. DOI: 10.3150/13-BEJ522.


Jüri Lember (jyril@ut.ee), Heinrich Matzinger (matzing@math.gatech.edu) and Anna Vollmer (anna-lisa.vollmer@plymouth.ac.uk)

Keywords: longest common subsequence; optimal alignments; homologous sequences

1 Introduction

Let $\mathcal{A}$ be a finite alphabet. In everything that follows, $X = X_1 \cdots X_n$ and $Y = Y_1 \cdots Y_n$ are two strings of length $n$. A common subsequence of $X$ and $Y$ is a sequence that is a subsequence of $X$ and at the same time of $Y$. We denote by $L_n$ the length of the longest common subsequence (LCS) of $X$ and $Y$. The LCS is a special case of a sequence alignment, a very important tool in computational biology used for the comparison of DNA and protein sequences (see, e.g., watermanintrocompbio (), Vingron (), Watermanphase (), Durbin (), china ()). Alignments are also used in computational linguistics, speech recognition and so on. In all these applications, two strings with a relatively long LCS are deemed related. Hence, to distinguish related pairs of strings from unrelated ones via the length of the LCS (or another similar optimality measure), it is important to have some information about the (asymptotic) distribution of $L_n$. Unfortunately, although studied for a relatively long time, not much is known about the statistical behavior of $L_n$ even when the sequences $X$ and $Y$ are both i.i.d. and independent of each other. Using subadditivity, it is easy to see that there exists a constant $\gamma^*$ such that

$$\lim_{n\to\infty} \frac{L_n}{n} = \gamma^* \quad\text{a.s. and in } L_1 \tag{1}$$

(see, e.g., Alexander (), Vingron (), rate ()). Referring to the celebrated paper of Chvatal and Sankoff Sankoff1 (), the constant $\gamma^*$ is called the Chvatal–Sankoff constant; its value is unknown even in cases as simple as i.i.d. Bernoulli sequences. In this case, the value of $\gamma^*$ obviously depends on the Bernoulli parameter $p$. When $p = 0.5$, the various bounds indicate that $\gamma^* \approx 0.81$ steele86 (), kiwi (), Baeza1999 (). For smaller $p$, $\gamma^*$ is even bigger. Hence, a common subsequence of two independent Bernoulli sequences typically makes up a large part of the total length; if the sequences are related, the LCS is even longer. Just as for the mean of $L_n$, not much is known about the variance of $L_n$. In Sankoff1 (), it was conjectured that for Bernoulli parameter $p = 0.5$, the variance is of order $o(n^{2/3})$. Using an Efron–Stein type of inequality, Steele steele86 () proved that $\operatorname{Var}[L_n] = O(n)$. In Waterman-estimation (), Waterman conjectured that $\operatorname{Var}[L_n]$ grows linearly. In a series of papers, Matzinger and others prove the Waterman conjecture for different models BonettoLCS (), periodicLCS (), LCIS (), notsym ().
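Although the asymptotics of $L_n$ are hard, $L_n$ itself is easy to compute. The following minimal sketch (ours, not from the paper; the function name `lcs_length` is hypothetical) implements the classical quadratic-time dynamic program:

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y via the
    classical O(len(x) * len(y)) dynamic program."""
    n, m = len(x), len(y)
    # L[i][j] = LCS length of the prefixes x[:i] and y[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcs_length("ATAGCGT", "CAACATG"))  # prints 4, cf. the example below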

Because so little is known about these asymptotics, it is rather difficult to build any statistical test based on $L_n$ or any other global optimality criterion. The situation is better for local alignments (see, e.g., Watermanphase (), Waterman-estimation ()), because for these alignments approximate $p$-values have recently been calculated siegmund (), hansen ().

In the present paper, we propose another approach: instead of studying the length of the LCS, we investigate the properties and behavior of the optimal alignments themselves. Namely, even for moderate $n$, the LCS is hardly unique. Every LCS corresponds to an optimal alignment (though not necessarily vice versa), so, in general, we have several optimal alignments. The differences can be of a local nature, meaning that the optimal alignments do not vary much, or they can be of a global nature. We conjecture that the variation of the optimal alignments characterizes the relatedness or homology of the sequences. To measure the differences between the various optimal alignments, we consider so-called extremal alignments and study their differences.

Example. Let us consider a practical example to give an insight into what follows. Let $X = $ ATAGCGT and $Y = $ CAACATG. There are two longest common subsequences, AACG and AACT; thus the LCS length is 4. To every longest common subsequence correspond two optimal alignments. These optimal alignments can be presented as follows:

First, two alignments correspond to the optimal subsequence AACG; the last two correspond to AACT. In the following, we shall often consider an alignment as the set of pairs $(i_k, j_k)$, $k = 1, \ldots, L$, where $X_{i_k} = Y_{j_k}$ for every $k$. With this notation, each of the four optimal alignments above is written as such a set of index pairs. We now represent every alignment as a two-dimensional plot. For the alignments in our example, the two-dimensional plots are as follows:

Putting all four alignments into one graph, we see that in some regions all alignments coincide, but in other regions they vary:

In the picture above, the alignment corresponding to AACG lies above all others. This alignment will be called the highest alignment. Similarly, the alignment corresponding to AACT lies below all others. This alignment will be called the lowest alignment. The highest and lowest alignments will be called extremal alignments.
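To make the example concrete, the following brute-force sketch (ours, not from the paper; the function name `optimal_alignments` is hypothetical) enumerates every optimal alignment of two short strings by backtracking over the dynamic-programming table. Applied to the example above, it prints each optimal alignment together with its common subsequence, from which the highest and lowest alignments can be read off.

```python
def optimal_alignments(x, y):
    """Enumerate all optimal alignments of x and y as tuples of 1-based
    index pairs (i, j) with x[i-1] == y[j-1]. Worst-case exponential;
    intended for short strings only."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))

    def back(i, j):
        # all optimal alignments of the prefixes x[:i] and y[:j]
        if L[i][j] == 0:
            return {()}
        out = set()
        if x[i - 1] == y[j - 1] and L[i][j] == L[i - 1][j - 1] + 1:
            out |= {a + ((i, j),) for a in back(i - 1, j - 1)}
        if L[i - 1][j] == L[i][j]:
            out |= back(i - 1, j)
        if L[i][j - 1] == L[i][j]:
            out |= back(i, j - 1)
        return out

    return sorted(back(n, m))

for a in optimal_alignments("ATAGCGT", "CAACATG"):
    print(a, "".join("ATAGCGT"[i - 1] for i, _ in a))
```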

Thus, the highest (lowest) alignment is the optimal alignment that lies above (below) all other optimal alignments in the two-dimensional representation. For big $n$, we usually join the dots in the two-dimensional representation by lines. Then, to every alignment corresponds a curve. We shall call this curve the alignment graph (when it is obvious from the context, we skip "graph"). Figure 1 shows the extremal alignments of two independent i.i.d. four-letter sequences (with uniform marginal distributions). It is visible that the extremal alignments are rather far from each other; in particular, the maximal vertical and horizontal distances are relatively big.

Figure 1: The extremal alignments of two independent i.i.d. four-letter sequences.

We call the sequences $X$ and $Y$ unrelated if they are independent. There are many ways to model related sequences; the model in the present paper is based on the assumption that there exists a common ancestor from which both sequences $X$ and $Y$ are obtained by independent random mutations and deletions. Sequences with a common ancestor are called homologous, and detecting homology of given sequences is one of the major tasks of modern computational molecular biology china (). In this paper, we shall call homologous sequences related.

More precisely, we consider an $\mathcal{A}$-valued i.i.d. process that will be referred to as the common ancestor or the ancestor process. Each letter has a probability to mutate, according to a transition matrix that does not depend on the letter's position. The mutations of the letters are assumed to be independent. After the mutations, some letters of the mutated process disappear. The disappearance is modeled via a deletion process that is assumed to be an i.i.d. Bernoulli sequence with parameter $\delta$: if the $i$th deletion variable equals one, then the $i$th (possibly mutated) letter disappears. In such a way, a random sequence $X$ is obtained. The sequence $Y$ is obtained similarly: the ancestor process is the same, but the mutations and deletions (with the same probabilities) are independent of the ones used to generate the $X$-sequence. The formal definition is given in Section 4.1.
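A minimal simulation sketch of this generative model (ours, not from the paper): the transition matrix is specialized, for illustration only, to uniform substitution among the remaining letters, and the names `p_mut` and `delta` are our assumptions.

```python
import random

def homologous_pair(m, alphabet="ACGT", p_mut=0.1, delta=0.1, seed=None):
    """Generate a homologous pair (X, Y): a common i.i.d. ancestor of
    length m, whose letters mutate independently with probability p_mut
    (here, to a uniformly chosen different letter) and are then deleted
    independently with probability delta. The mutations and deletions
    for X are independent of those for Y."""
    rng = random.Random(seed)
    ancestor = [rng.choice(alphabet) for _ in range(m)]

    def descend(z):
        out = []
        for a in z:
            if rng.random() < p_mut:   # independent mutation
                a = rng.choice([c for c in alphabet if c != a])
            if rng.random() >= delta:  # letter survives deletion
                out.append(a)
        return "".join(out)

    return descend(ancestor), descend(ancestor)

x, y = homologous_pair(1000, seed=0)
print(len(x), len(y))  # random lengths, each close to 1000 * (1 - delta)
```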

Figure 2 presents a typical picture of the extremal alignments of two related four-letter sequences (with uniform marginal distributions) of length 949. The sequences in Figure 2 thus have the same marginal distribution as the ones in Figure 1, but they are no longer independent. Clearly, the extremal alignments are close to each other; in particular, the maximal vertical and horizontal distances are much smaller than those in Figure 1.

Figure 2: The extremal alignments of two related four-letter sequences.

Figures 1 and 2, as well as many other similar simulations (see VollmerReport ()), clearly indicate that for related sequences the differences between optimal alignments are of a local nature, whilst for independent sequences they vary much more. This motivates us to find a way to quantify the non-uniqueness and to use the obtained characteristic as a measure of relatedness. To that end, we measure the differences between the extremal alignments in several ways: by the maximal vertical and horizontal distances and by Hausdorff's distance (see Section 1.1.2 for the formal definition of Hausdorff's distance). The simulations in Section 7 show that for independent sequences the growth of both of them is almost linear; for related sequences, however, it is logarithmic. Under some assumptions, the latter is confirmed by the main theoretical results about related sequences, Theorems 1.1, 1.2 and 1.3. More specifically, Theorem 1.1 states that under an assumption that never holds for independent sequences, there exist universal constants $C$ and $\beta > 1$ so that, for $n$ big enough,

$$\mathbb{P}\bigl(h_R(\underline{\pi}, \overline{\pi}) > C\ln n\bigr) \le n^{-\beta}.$$

Here, $h_R$ stands for a slight modification of Hausdorff's distance between the extremal alignments $\underline{\pi}$ and $\overline{\pi}$, which we shall call the restricted Hausdorff distance. We conjecture that the result also holds for the (full) Hausdorff distance, denoted by $h$. Note that since $\sum_n n^{-\beta} < \infty$, the Borel–Cantelli lemma applied to the inequality above yields

$$\limsup_{n\to\infty} \frac{h_R(\underline{\pi}, \overline{\pi})}{\ln n} \le C \quad\text{a.s.},$$

that is, the ratio $h_R(\underline{\pi}, \overline{\pi})/\ln n$ is eventually bounded above by $C$, a.s. Theorem 1.2 states the similar result with the maximal vertical distance instead of Hausdorff's distance.

Theorem 1.3 considers sequences with random lengths. The expected length of both sequences is $n$; the randomness comes from the fact that, instead of fixing the lengths of the two sequences, we fix the length of the common ancestor process. In a sense, this situation is more realistic, since in practice the sequences are hardly ever of exactly the same length; however, when they are related, the common ancestor must be the same for both sequences. It turns out that in the case of random lengths the statements of Theorems 1.1 and 1.2 hold with the (full) Hausdorff distance $h$ instead of the restricted Hausdorff distance $h_R$. More precisely, Theorem 1.3 states that under the same assumptions as in Theorem 1.1, there exist universal constants $C$ and $\beta > 1$ so that

$$\mathbb{P}\bigl(h(\underline{\pi}, \overline{\pi}) > C\ln n\bigr) \le n^{-\beta},$$

where $h(\underline{\pi}, \overline{\pi})$ stands for the (full) Hausdorff distance between the extremal alignments of the random-length sequences.

Another measure could be the length of the biggest non-uniqueness stretch, that is, the (horizontal) distance between consecutive points where the extremal alignments meet. The simulations in Section 7 show that the length of the biggest non-uniqueness stretch behaves similarly: the growth is almost linear for independent and logarithmic for related sequences. The latter has not been proven formally in this paper, but we conjecture that it can be done using arguments similar to those in the proofs of Theorems 1.1, 1.2 and 1.3.

1.1 The organization of the paper and the main results

1.1.1 Preliminary results

The paper is organized as follows. In Section 2, the necessary notation is introduced, and the extremal alignments are formally defined and proven to exist (Proposition 2.1). Some properties of the extremal alignments are also proven. The section also provides some combinatorial bounds needed later.

Section 3 considers the case where $X$ and $Y$ are independent. The main result of the section is Theorem 3.1, which states that for independent sequences the Chvatal–Sankoff constant $\gamma^*$ satisfies the inequality

(2)

where

$$h(x) := -x \ln x - (1 - x) \ln(1 - x), \tag{3}$$

that is, $h$ is the binary entropy function. The corresponding equality has two solutions; hence, as a byproduct, (2) gives upper and lower bounds for the unknown $\gamma^*$. These bounds need not be the best possible, but they are easy to compute and universal in the sense that they hold for any independent model. For example, taking the distributions of $X_1$ and $Y_1$ uniform over an alphabet with $K$ letters (so that $\mathbb{P}(X_1 = Y_1) = 1/K$), we obtain the upper bounds for the unknown $\gamma^*$ shown in Table 1. In the last row of the table, estimates of $\gamma^*$ obtained via simulations are reported. It is interesting to note that, independently of $K$, the upper bound overestimates $\gamma^*$ by about the same amount. We also obtain lower bounds, but for these models the lower bounds are very close to zero and therefore not informative.
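Estimates such as those in the last row of Table 1 can be reproduced along the following lines (our sketch; the paper does not describe its simulation setup, and the finite-$n$ ratio $L_n/n$ slightly underestimates $\gamma^*$, since $\mathbb{E}L_n/n$ increases to $\gamma^*$).

```python
import random

def lcs_length(x, y):
    # quadratic-time LCS dynamic program, row by row to save memory
    m = len(y)
    prev = [0] * (m + 1)
    for xi in x:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            cur[j] = prev[j - 1] + 1 if xi == y[j - 1] else max(prev[j], cur[j - 1])
        prev = cur
    return prev[m]

def estimate_gamma(K, n=1000, reps=3, seed=0):
    """Monte Carlo estimate of the Chvatal-Sankoff constant for two
    independent uniform i.i.d. sequences over a K-letter alphabet."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.randrange(K) for _ in range(n)]
        y = [rng.randrange(K) for _ in range(n)]
        total += lcs_length(x, y) / n
    return total / reps

for K in (2, 3, 4):
    print(K, round(estimate_gamma(K), 3))
```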

In Section 4, preliminary results for related (homologous) sequences are presented. In Section 4.1, the formal definition of related sequences is given. Our definition of relatedness is based on the existence of a common ancestor; hence the model captures homology in a natural way. In our model, the related sequences $X$ and $Y$ both consist of i.i.d. random variables, but the two sequences are, in general, not independent. Independence is a special case of the model, so that all results for related sequences automatically hold for independent ones. It is also important to note that (unless the sequences are independent) the two-dimensional process of pairs is not stationary, hence also not ergodic. Therefore, for related sequences, ergodic theorems cannot be applied automatically. In particular, Kingman's subadditive ergodic theorem can no longer be applied to prove the convergence (1). This convergence, as well as the corresponding large deviation bound, is proven in Section 4.2. Since we often consider sequences of unequal lengths, instead of (1) we prove a more general convergence (Proposition 4.1): for any $a, b > 0$,

$$\frac{L(\lfloor an \rfloor, \lfloor bn \rfloor)}{n} \to \gamma(a,b) \quad\text{a.s. and in } L_1. \tag{4}$$

Here $\gamma(a,b)$ is a constant; we shall denote $\gamma := \gamma(1,1)$. From (4), it follows that for any $a, b > 0$,

$$\frac{\mathbb{E}L(\lfloor an \rfloor, \lfloor bn \rfloor)}{n} \to \gamma(a,b).$$

If the sequences are independent, then $\gamma = \gamma^*$, the Chvatal–Sankoff constant. Corollary 4.1 postulates the corresponding large deviation result, stating that for every $\epsilon > 0$ there exists a $c > 0$ such that, for every $n$ big enough,

$$\mathbb{P}\biggl(\biggl|\frac{L(\lfloor an \rfloor, \lfloor bn \rfloor)}{n} - \gamma(a,b)\biggr| > \epsilon\biggr) \le \mathrm{e}^{-cn}. \tag{5}$$

In the Appendix, it is proven that the constants involved are strictly positive whenever two letters coincide with positive probability (Lemma A.1). That result, together with (5) (obviously (4) follows from (5)), forms the basic theoretical toolkit for proving the main results of the paper, Theorems 1.1, 1.2 and 1.3.


Table 1: Upper bounds for the Chvatal–Sankoff constant $\gamma^*$ via inequality (2)

$K$ (alphabet size)    2         3         4         5         6         7         8
Upper bound            0.866595  0.786473  0.729705  0.686117  0.650983  0.621719  0.596756
Simulated estimate     0.81      0.72      0.66      0.61      0.57      0.54      0.52

1.1.2 (Restricted) Hausdorff’s distance and the main results

Definition of (restricted) Hausdorff’s distance

We are interested in measuring the distance between the lowest and highest alignments. One possible measure would be the maximal vertical or horizontal distance (provided they are suitably defined). However, those distances need not match the intuitive meaning of closeness of alignments. For example, the following two alignments have a relatively large maximal vertical distance (3), though they are intuitively rather close:

(6)

To overcome this problem, we also measure the distance between two alignments in terms of Hausdorff's distance. More precisely, let $\pi_1$, $\pi_2$ be two alignments, both represented as sets of two-dimensional points. The Hausdorff distance between $\pi_1$ and $\pi_2$ is

$$h(\pi_1, \pi_2) := \max\Bigl\{\sup_{u \in \pi_1} \inf_{v \in \pi_2} d(u, v),\; \sup_{v \in \pi_2} \inf_{u \in \pi_1} d(u, v)\Bigr\},$$

where $d$ is a distance on $\mathbb{R}^2$. In our case, we take $d$ to be the maximum distance (but one can also consider the usual Euclidean metric). We remark that Hausdorff's distance is defined for any kind of sets. For the alignments in (6), the Hausdorff distance is obviously 1 (if $d$ were Euclidean, the Hausdorff distance would be $\sqrt{2}$).
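A minimal sketch (ours) of this computation for finite point sets, with the maximum metric as the default; `hausdorff` and the illustrative point sets are our own stand-ins, not the alignments of display (6):

```python
def hausdorff(pi1, pi2, d=lambda u, v: max(abs(u[0] - v[0]), abs(u[1] - v[1]))):
    """Hausdorff distance between two finite point sets, by default with
    respect to the maximum (l-infinity) metric on R^2."""
    def directed(a, b):
        return max(min(d(u, v) for v in b) for u in a)
    return max(directed(pi1, pi2), directed(pi2, pi1))

# two diagonal point sets shifted by one step in each coordinate
pi1 = [(i, i) for i in range(1, 6)]
pi2 = [(i + 1, i + 1) for i in range(1, 6)]
print(hausdorff(pi1, pi2))  # 1 under the maximum metric
```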

Now let, for every $n$, an integer $k_n$ be fixed, and define the subset $\pi_1(k_n) \subset \pi_1$ consisting of those elements of $\pi_1$ whose first coordinate is at least $k_n$ away from the boundary; the subset $\pi_2(k_n) \subset \pi_2$ is defined similarly. Formally, thus,

$$\pi_s(k_n) := \{(i, j) \in \pi_s : k_n \le i \le n - k_n\}, \qquad s = 1, 2.$$

The restricted Hausdorff distance between $\pi_1$ and $\pi_2$ is defined as follows:

$$h_R(\pi_1, \pi_2) := \max\Bigl\{\sup_{u \in \pi_1(k_n)} \inf_{v \in \pi_2} d(u, v),\; \sup_{v \in \pi_2(k_n)} \inf_{u \in \pi_1} d(u, v)\Bigr\},$$

where $d$ is a distance on $\mathbb{R}^2$. Clearly $h_R(\pi_1, \pi_2) \le h(\pi_1, \pi_2)$. Since in our case $\pi_1$ and $\pi_2$ are alignments, so that different points have different first coordinates, the definition of $h_R$ can be (somewhat loosely) interpreted as leaving a fraction of both alignments out when taking the maximum in Hausdorff's distance. We shall consider the case $k_n = o(n)$; hence, the proportion of points left out decreases as $n$ grows.
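Under this reading of the definition (our assumption, since the original displays are not fully preserved), the restriction applies only to the outer suprema; a sketch:

```python
def restricted_hausdorff(pi1, pi2, n, k,
                         d=lambda u, v: max(abs(u[0] - v[0]), abs(u[1] - v[1]))):
    """Restricted Hausdorff distance: the outer maxima run only over the
    points whose first coordinate lies in [k, n - k], while the inner
    minima still run over the full alignments."""
    def directed(a, b):
        core = [u for u in a if k <= u[0] <= n - k]
        if not core:  # nothing survives the restriction
            return 0.0
        return max(min(d(u, v) for v in b) for u in core)
    return max(directed(pi1, pi2), directed(pi2, pi1))

pts1 = [(i, i) for i in range(1, 11)]
pts2 = [(1, 5)] + [(i, i) for i in range(2, 11)]
print(restricted_hausdorff(pts1, pts2, n=10, k=2))  # boundary deviation at i=1 is ignored
```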

Sequences with fixed length

We now state our main theorems for sequences of fixed length. Recall the definition of the binary entropy function $h(\cdot)$ from (3). Let

(7)

Here $p^*$ is the conditional probability that two letters coincide given that no deletion occurs or, in other words, given that they have the same common ancestor (see Section 4.1 for the formal definition).

In the following theorems, $h_R(\pi_1, \pi_2)$ stands for the restricted Hausdorff distance between the alignments $\pi_1$ and $\pi_2$, both represented as sets of two-dimensional points. Recall that Hausdorff's distance can be defined with the help of any metric on $\mathbb{R}^2$. In the following, we shall consider both the maximum and the $l_1$ norms; throughout the paper, $\|\cdot\|_\infty$ and $\|\cdot\|_1$ stand for the maximum and $l_1$ norms, respectively.

Theorem 1.1

Let $X$ and $Y$ be related, and let $\underline{\pi}, \overline{\pi}$ be the (two-dimensional representations of the) lowest and highest alignments of $X$ and $Y$. Assume

(8)

Then there exist positive constants $C$ and $\beta > 1$ such that, for $n$ big enough,

$$\mathbb{P}\bigl(h_R(\underline{\pi}, \overline{\pi}) > C\ln n\bigr) \le n^{-\beta}, \tag{9}$$

where $h_R$ is defined with $k_n$ of order $\ln n$ and Hausdorff's distance is defined using the maximum norm. If $h_R$ is defined with respect to the $l_1$ norm, then (9) holds with $C$ replaced by $2C$.

For independent sequences, the relevant probability reduces to $\mathbb{P}(X_1 = Y_1)$, and a comparison with inequality (2) (Theorem 3.1) shows that the condition (8) fails. This does not necessarily mean that (9) fails to hold in this case, but based on our simulations in Section 7 we conjecture that this is indeed so.

In Theorem 1.1, we used the two-dimensional representation of alignments, so an alignment was identified with a finite set of points. In the alignment graph, these points are joined by lines. We consider the highest and lowest alignment graphs, and we are interested in the maximal vertical (horizontal) distance between these two piecewise linear curves. This maximum is called the vertical (horizontal) distance between the lowest and highest alignment graphs. The following theorem is stated in terms of the vertical distance; clearly, the same result holds for the horizontal distance as well. In the theorem, we again use the symbols $\underline{\pi}$ and $\overline{\pi}$, but now they stand for the extremal alignment graphs rather than for the alignments as sets of points. Since an alignment and the corresponding alignment graph are very closely related, we hope that the notation is not too ambiguous and that the difference will be clear from the context.

Theorem 1.2

Let $X$ and $Y$ be related, and let $\underline{\pi}, \overline{\pi}$ be the lowest and highest alignment graphs of $X$ and $Y$. Assume (8). Then, for $n$ big enough,

$$\mathbb{P}\Bigl(\sup_{x}\bigl(\overline{\pi}(x) - \underline{\pi}(x)\bigr) > C\ln n\Bigr) \le n^{-\beta}, \tag{10}$$

where the constants $C$ and $\beta$ are the same as in Theorem 1.1.

Hence, Theorems 1.1 and 1.2 state that when the left-hand side of (8) is sufficiently bigger than the right-hand side, the distance between the extremal alignments (whether measured with the restricted Hausdorff metric or via alignment graphs) grows no faster than logarithmically in $n$. Clearly, this gap is bigger the more related $X$ and $Y$ are. Hence, inequality (8) measures the degree of relatedness: if it is big enough, the distance between the extremal alignments grows (at most) logarithmically. Theorem 3.1 states that for independent sequences (8) fails, so the assumptions of Theorems 1.1 and 1.2 hold for related sequences only.

The fact that the distances between the extremal alignments are measured with respect to the restricted Hausdorff distance, that is, with a small fraction of the alignment points left out, is admittedly a bit disappointing. Technically, this is due to the requirement that both sequences have exactly the same length. As we shall see, this is not the case when the lengths of the sequences are random. Moreover, as the simulations in Section 7 also suggest, we believe that the results of Theorems 1.1 and 1.2 hold when $h_R$ is replaced by the (full) Hausdorff distance $h$, with the supremum taken over the whole alignments.

Theorems 1.1 and 1.2 are proven in Section 5. The proof is based on the observation that under (8) the probability that stretches of the sequences of length of order $\ln n$ contain no related pairs is exponentially small (Lemmas 5.1 and 5.2). Section 5.2 studies the location of the related pairs in the two-dimensional representation. It turns out that, with high probability, the gaps between them are no longer than $C\ln n$, where $C$ is a suitably big constant. Applying these properties together with Lemma 5.2, we obtain that no optimal alignment, including the extremal ones, can be far from the related points, since otherwise it would contain a long piece without any related pair, contradicting Lemma 5.2. This argument is formalized in Lemmas 5.3 and 5.4 in Section 5.3. The formal proofs of Theorems 1.1 and 1.2 are given in Sections 5.4 and 5.5, respectively.

The sequences with random length

Recall how the related sequences are defined: there is a common ancestor process consisting of $\mathcal{A}$-valued i.i.d. random variables. Every letter has a probability to mutate according to a transition matrix that does not depend on the letter's position, and the mutations are independent of each other. After the mutations, some of the letters disappear: to every letter there is an associated Bernoulli random variable with parameter $\delta$, and when it equals one, the corresponding (mutated) letter disappears. The deletions are independent, and the remaining letters form the sequence $X$. The sequence $Y$ is defined in the same way: every ancestor letter undergoes another random mutation (independent of all other mutations, including the ones used to define the $X$-sequence) and independent i.i.d. deletions with the same probability. For a more detailed definition, see Section 4.1.

When dealing with sequences of random lengths, we consider exactly $m$ ancestors. Hence, after the deletions, the lengths of the obtained $X$- and $Y$-sequences are random, each with expected length $m(1 - \delta)$, and we choose $m$ so that the expected length of both sequences is $n$. For simplicity, $m$ is assumed to be an integer. Thus, we shall consider sequences $X$ and $Y$ of random lengths. It turns out that mathematically this case is somewhat easier, so that the counterpart of Theorems 1.1 and 1.2 holds with the full Hausdorff distance $h$ instead of $h_R$.

Theorem 1.3

Let $X$ and $Y$ be related sequences of random lengths, and let $\underline{\pi}, \overline{\pi}$ be the (two-dimensional representations of the) highest and lowest alignments. Assume (8). Then there exist constants $C$ and $\beta > 1$ so that

$$\mathbb{P}\bigl(h(\underline{\pi}, \overline{\pi}) > C\ln n\bigr) \le n^{-\beta}, \tag{11}$$

where $h$ is the Hausdorff distance with respect to the maximum norm. If $h$ is defined with respect to the $l_1$ norm, then (11) holds with $C$ replaced by $2C$.

From the proofs, it is easy to see that the random-length analogue of Theorem 1.2 holds as well. Theorem 1.3 is proven in Section 6.

Finally, in Section 7, simulations concerning the speed of growth of the distances are presented. The simulations clearly indicate that for related sequences the growth of the Hausdorff and vertical distances is of order $\ln n$; hence the simulations fully confirm the main results of the paper.

We would like to mention that, to the best of our knowledge, the idea of considering the extremal alignments as a characterization of homology has not been exploited before, although optimal and sub-optimal alignments have received some attention macrounilargefluct (), barder (). Therefore the present paper, as a first step, does not aim to minimize the assumptions or to propose any ready-made tests. These are issues for further research. In a follow-up paper hirmo (), we apply some extremal-alignment-based characteristics to real DNA sequences and compare the results with standard sequence-alignment tools like BLAST.

2 Preliminaries

Let $X = X_1 \cdots X_n$ and $Y = Y_1 \cdots Y_m$ be two sequences of lengths $n$ and $m$ over the finite alphabet $\mathcal{A}$. Let there exist two subsets of indices $\{i_1, \ldots, i_k\}$ and $\{j_1, \ldots, j_k\}$ satisfying $i_1 < \cdots < i_k$, $j_1 < \cdots < j_k$ and $X_{i_l} = Y_{j_l}$, $l = 1, \ldots, k$. Then $X_{i_1} \cdots X_{i_k}$ is a common subsequence of $X$ and $Y$, and the pairs

$$(i_1, j_1), \ldots, (i_k, j_k) \tag{12}$$

are (the two-dimensional representation of) the corresponding alignment. Let $L(n,m)$ be the biggest $k$ such that there exist such subsets of indices. A longest common subsequence is any common subsequence of length $L(n,m)$, and any alignment corresponding to a longest common subsequence is called optimal. In the following, we shall often consider the case where, for some constants $a, b > 0$, the sequences have lengths $\lfloor an \rfloor$ and $\lfloor bn \rfloor$. Let us denote

$$L_n := L(n, n).$$

Thus $L_n$ is the length of the longest common subsequence when both sequences are of equal length $n$. The random variable $L_n$ is the main object of interest.

Extremal alignments: Definition and properties

We now formally define the highest (optimal) alignment corresponding to $L(n,m)$. Let

$$\{\pi^t : t \in T\}$$

be the set of all optimal alignments. Hence, every $\pi^t$ has length $L := L(n,m)$, and $T$ is an index set whose elements are identified with the optimal alignments. For every $t \in T$, we write $\pi^t = \bigl((i^t_1, j^t_1), \ldots, (i^t_L, j^t_L)\bigr)$, where $i^t_1 < \cdots < i^t_L$ and $j^t_1 < \cdots < j^t_L$. We define

$$\bar{j}_L := \max_{t \in T} j^t_L.$$

There might be many alignments $t$ such that $j^t_L = \bar{j}_L$. Among such alignments, take the one whose first coordinate is minimal. Formally, $\bar{i}_L := \min\{i^t_L : t \in T, j^t_L = \bar{j}_L\}$. After fixing $(\bar{i}_L, \bar{j}_L)$, we take $\bar{j}_{L-1}$ as the biggest $j^t_{L-1}$ such that the corresponding $i^t_{L-1}$ is smaller than $\bar{i}_L$. Formally,

$$\bar{j}_{L-1} := \max\{j^t_{L-1} : t \in T, i^t_{L-1} < \bar{i}_L\}.$$

There might be several $t$'s such that the corresponding $j^t_{L-1}$ is $\bar{j}_{L-1}$. Amongst them, we choose the one with minimal first coordinate. Thus,

$$\bar{i}_{L-1} := \min\{i^t_{L-1} : t \in T, i^t_{L-1} < \bar{i}_L, j^t_{L-1} = \bar{j}_{L-1}\}.$$

Proceeding in this manner, we obtain an alignment. We call this construction the highest alignment procedure. We now prove that the procedure can be repeated $L$ times, that is, that the obtained alignment is optimal.
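For short strings, the procedure can be implemented directly; the following sketch (ours, not from the paper) works backwards over the dynamic-programming table, at each step choosing the pair with maximal second coordinate and, among ties, minimal first coordinate, then recursing on the remaining prefixes. Proposition 2.1 below guarantees that this greedy choice always completes to an optimal alignment, and, by the Remark following (24), the lowest alignment is obtained by running the same procedure with the sequences swapped.

```python
def highest_alignment(x, y):
    """Construct the highest optimal alignment of x and y, returned as
    1-based index pairs (i, j) with x[i-1] == y[j-1]."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    pairs = []
    ni, mj, l = n, m, L[n][m]
    while l > 0:
        # candidate l-th pairs inside the current prefix: matching pairs
        # that extend an optimal alignment of length l - 1
        cand = [(i, j) for i in range(1, ni + 1) for j in range(1, mj + 1)
                if x[i - 1] == y[j - 1] and L[i - 1][j - 1] == l - 1]
        j_best = max(j for _, j in cand)           # highest: maximal j ...
        i_best = min(i for i, j in cand if j == j_best)  # ... then minimal i
        pairs.append((i_best, j_best))
        ni, mj, l = i_best - 1, j_best - 1, l - 1
    return pairs[::-1]

def lowest_alignment(x, y):
    # lowest alignment of (x, y) = highest alignment of (y, x), transposed
    return [(i, j) for j, i in highest_alignment(y, x)]

print(highest_alignment("ATAGCGT", "CAACATG"))  # ends in AACG's pairs
print(lowest_alignment("ATAGCGT", "CAACATG"))   # ends in AACT's pairs
```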

Proposition 2.1

The highest alignment procedure produces an optimal alignment

$$\bar{\pi} = \bigl((\bar{i}_1, \bar{j}_1), \ldots, (\bar{i}_L, \bar{j}_L)\bigr),$$

where the pairs can be obtained as follows:

(13)
Proof. Clearly, the pair $(\bar{i}_L, \bar{j}_L)$ is the last pair of an optimal alignment, that is, there exists $t \in T$ such that $(i^t_L, j^t_L) = (\bar{i}_L, \bar{j}_L)$; so (13) holds for the last pair. Similarly, there exists a $t \in T$ such that $(i^t_{L-1}, j^t_{L-1}) = (\bar{i}_{L-1}, \bar{j}_{L-1})$. To see that the procedure is consistent, we have to show that $\bar{i}_{L-1} < \bar{i}_L$ and $\bar{j}_{L-1} < \bar{j}_L$. Note that $\bar{j}_{L-1}$ cannot exceed $\bar{j}_L$, since otherwise appending $(\bar{i}_L, \bar{j}_L)$ to the corresponding alignment would yield an alignment of length $L + 1$. Suppose $\bar{j}_{L-1} = \bar{j}_L$. Since $\bar{i}_{L-1} < \bar{i}_L$ by construction, the pair $(\bar{i}_{L-1}, \bar{j}_L)$ would then be the last pair of an optimal alignment, contradicting the minimality of $\bar{i}_L$. Hence, $\bar{j}_{L-1} < \bar{j}_L$. Let us now prove that (13) holds for the second-to-last pair. If this were not the case, there would exist an optimal alignment whose second-to-last pair lies strictly above $(\bar{i}_{L-1}, \bar{j}_{L-1})$ while remaining compatible with $(\bar{i}_L, \bar{j}_L)$; but as we saw, those inequalities would give an alignment of length $L + 1$. This concludes the proof of (13) for the second-to-last pair. For the remaining pairs, proceed similarly.

Figure 3: An example of the highest alignment.

Figure 3 shows an example of a highest alignment. The solid lines mark the aligned pairs (the upper index is dropped from the notation). If an aligned letter of $X$ equaled one of the letters of $Y$ lying just above its partner, then, as shown by the dashed line, it could be aligned with that letter, which contradicts the highest alignment procedure. Thus, every aligned letter of $X$ in the highest alignment is different from all the letters of $Y$ right after and before its partner. This observation is formulated as statements (15) and (16) in the following corollary. Similarly, if an aligned letter of $Y$ equaled a letter of $X$ lying just before its partner, then, as shown by the dashed line, it could be aligned with that letter, again contradicting the highest alignment procedure. Thus, in the highest alignment, all letters of $X$ right after and before an aligned letter must differ from its partner. This observation is formulated as statements (17) and (18) in the following corollary. In the highest alignment, typically, not all letters are aligned. If a letter of $X$ before the first aligned pair is not aligned, then it must differ from the letters of $Y$ below the first aligned pair, since otherwise they could be aligned (as shown by the dashed line), contradicting optimality. These observations are statements (19) and (20) in the following corollary. Similarly, if a letter of $Y$ is not aligned, it must differ from the corresponding letters of $X$; these observations are statements (21) and (22) in the following corollary.

Corollary 2.1

The highest alignment has the following properties:

(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
Proof. The equalities (14) are obvious. Suppose that for some aligned pair of the highest alignment there existed an index violating one of the inequalities in (15). Then the pairs

would correspond to an optimal alignment, say $\pi'$, satisfying

Thus, (15) holds. The same argument proves (16), (17) and (18). If one of the inequalities in (19)–(22) were not fulfilled, then it would be possible to align one more pair without disturbing the already existing aligned pairs. This contradicts optimality.

One can also think of the left-most or north-west alignment. It could be defined as an alignment $\pi^{\mathrm{w}} = \bigl((i^{\mathrm{w}}_1, j^{\mathrm{w}}_1), \ldots, (i^{\mathrm{w}}_L, j^{\mathrm{w}}_L)\bigr)$, in which the first pair is chosen with minimal first coordinate (and, among ties, minimal second coordinate), and, for every subsequent pair, the choice is made analogously. Here the superscript "$\mathrm{w}$" stands for west. By the analogue of Proposition 2.1,

(23)

Using (13) and (23), it is easy to see that the left-most and highest alignments actually coincide. Indeed, by (13) and (23), the coordinates of the two alignments can be compared pairwise. If, for some pair, the alignments differed, then, by the definitions, both of the corresponding inequalities would have to be strict. To see this, suppose one of them held with equality; this would mean that there exists an optimal alignment whose corresponding pair witnesses the equality, which, in turn, forces the other inequality to be strict as well. The same argument applies with the roles of the coordinates exchanged. These strict inequalities, however, would imply the existence of an alignment of length $L + 1$, a contradiction.

The lowest (right-most) alignment is defined similarly:

(24)
Remark.

Note that the left-most alignment of $X$ and $Y$ corresponds, after swapping the coordinates, to the lowest alignment of $Y$ and $X$, implying that the highest alignment of $X$ and $Y$ equals the lowest alignment of $Y$ and $X$. Thus, the lowest alignment between $X$ and $Y$ can be defined as the highest alignment between $Y$ and $X$.

Combinatorics

Another way to study an alignment of $X$ and $Y$ is to present it as a strictly increasing mapping

(25)

Notation (25) means the following: there exist $k$ and a mapping $\pi$, defined on a $k$-element set of indices $\{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}$ and taking values in $\{1, \ldots, m\}$, such that $X_i = Y_{\pi(i)}$ for every $i$ in the domain and $\pi$ is strictly increasing: $\pi(i) < \pi(i')$ if $i < i'$. The length of $\pi$ is denoted by $|\pi| := k$. In the notation of the previous sections, thus, $\pi(i_l) = j_l$, $l = 1, \ldots, k$.

Consider now the case $m = n$, that is, both sequences are of length $n$. Let then $\{\pi : |\pi| = k\}$ be the set of all alignments of length $k$. Fix constants $0 < c_1 \le c_2 \le 1$ and let

(26)

denote the set of those alignments whose length is not smaller than $c_1 n$ and not bigger than $c_2 n$. In the subsequent sections, we shall show that there exists a constant (depending on the model) so that, for $n$ big enough, all optimal alignments belong to this set with high probability. Thus, in a sense, the set contains all alignments of interest. We are interested in bounding the size of that set. For that, we use the binary entropy function

$$h(x) = -x \ln x - (1 - x) \ln(1 - x).$$

The number of alignments of length $k$ equals $\binom{n}{k}^2$. Since

$$\binom{n}{k} \le \exp\bigl(n h(k/n)\bigr) \tag{27}$$

and since $h$ is decreasing on $[1/2, 1]$, for every $k$ such that

$$\frac{1}{2} \le c_1 \le \frac{k}{n}, \tag{28}$$

it holds that

$$\binom{n}{k} \le \exp\bigl(n h(c_1)\bigr).$$

Hence, the number of alignments in the set (26) can be bounded as follows:

$$\bigl|\{\pi : c_1 n \le |\pi| \le c_2 n\}\bigr| \le n \exp\bigl(2 n h(c_1)\bigr). \tag{29}$$

Let us now consider the more general case where the sequences have different lengths, $n$ and $m$. The set of alignments of interest is defined analogously. Instead of (28), we assume $k$ to satisfy

(30)

Then the same entropy argument yields the analogous bound on the number of alignments in this set.
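The binomial–entropy estimate (27) behind these counting bounds is easy to check numerically; a small sketch (ours):

```python
from math import comb, exp, log

def H(x):
    """Binary entropy with natural logarithm, with h(0) = h(1) = 0."""
    return 0.0 if x in (0.0, 1.0) else -x * log(x) - (1 - x) * log(1 - x)

n = 200
for k in (20, 60, 100, 140, 180):
    lhs = comb(n, k)          # number of index subsets of size k
    rhs = exp(n * H(k / n))   # entropy upper bound, cf. (27)
    assert lhs <= rhs
    print(k, f"{lhs:.3e} <= {rhs:.3e}")
```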

3 Independent sequences

In this section only, let $X$ and $Y$ be two independent i.i.d. sequences over the alphabet $\mathcal{A}$; recall that $L_n$ denotes the length of their longest common subsequence. By Kingman's subadditive ergodic theorem, there exists a constant $\gamma^*$ so that

$$\frac{L_n}{n} \to \gamma^* \quad\text{a.s. and in } L_1.$$

The constant $\gamma^*$ is often called the Chvatal–Sankoff constant. In the Appendix, it will be shown that when two independent letters coincide with positive probability, then $\gamma^* > 0$ (Lemma A.1).

Note that $L_n$ is a function of $2n$ i.i.d. random variables. Clearly, changing one of the variables changes the value of $L_n$ by at most one, so that by McDiarmid's inequality (see, e.g., rate ()), for every $\Delta > 0$,

$$\mathbb{P}\bigl(|L_n - \mathbb{E}L_n| \ge \Delta\bigr) \le 2\exp\biggl(-\frac{\Delta^2}{n}\biggr). \tag{31}$$

Take so big that . Then
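A quick empirical illustration of the concentration expressed by (31) (our sketch; (31) itself is reconstructed above from McDiarmid's inequality with $2n$ coordinates of unit effect):

```python
import random
import statistics

def lcs_length(x, y):
    # quadratic-time LCS dynamic program, row by row to save memory
    m = len(y)
    prev = [0] * (m + 1)
    for xi in x:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            cur[j] = prev[j - 1] + 1 if xi == y[j - 1] else max(prev[j], cur[j - 1])
        prev = cur
    return prev[m]

rng = random.Random(1)
n, reps = 300, 200
vals = [lcs_length([rng.randrange(2) for _ in range(n)],
                   [rng.randrange(2) for _ in range(n)]) for _ in range(reps)]
print("mean of L_n :", statistics.mean(vals))
print("stdev of L_n:", statistics.pstdev(vals))  # well below sqrt(2n), cf. (31)
```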