Probabilistic Arithmetic Automata and their Applications

Parts of this article have been published in conference proceedings [29, 47, 23, 48, 50]. An extended version of one of these articles [50] has been submitted to a journal. A preprint is available on arXiv [49].

Tobias Marschall Bioinformatics for High-Throughput Technologies, Computer Science 11, TU Dortmund, 44221 Dortmund, Germany    Inke Herms Genome Informatics, Faculty of Technology, Bielefeld University, 33501 Bielefeld, Germany    Hans-Michael Kaltenbach Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland    Sven Rahmann
Abstract

We present probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two different algorithms to exactly calculate the distribution of the results obtained by such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. Here, we present five different applications, namely (1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting time and clump size under HMM background models; (2) exact analysis of window-based pattern matching algorithms; (3) sensitivity of filtration seeds used to detect candidate sequence alignments; (4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and (5) read length statistics of 454 sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework.

While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method for pattern statistics: We introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. We show how to transform a finite automaton into a DAA and then into the corresponding PAA.

1 Introduction

In many applications, processes can be modeled as chains of operations working on operands that are drawn probabilistically. As an example, let us consider a simple dice game. Suppose you have a bag containing three dice, a 6-faced, a 12-faced, and a 20-faced die. Now a die is drawn from the bag, rolled, and put back. This procedure is repeated $n$ times. In the end one may, for example, be interested in the distribution of the maximum number observed. Many variants can be thought of, for instance, we might start with a value of 0 and each die might be associated with an operation, e.g. the spots seen on the 6-faced die might be subtracted from the current value and the spots on the 12-faced and 20-faced dice might be added. In addition to the distribution of values after $n$ rolls, we can ask for the distribution of the waiting time for reaching a value above a given threshold.

The goal of this article is to establish a general formal framework, referred to as probabilistic arithmetic automata (PAAs), to directly model such systems and answer the posed questions. We emphasize that we are not interested in simulation studies or approximations to these distributions, but in an exact computation up to machine accuracy. Further, we show that problems from diverse applications, especially from computational biology, can be conveniently solved with PAAs in a unified way, whereas they are so far treated heterogeneously in the literature. This article is a substantially revised and augmented version of several extended abstracts that introduced the PAA framework [47] and outlined some of the applications presented here [29, 23, 48, 50].

Let us give an overview of the application domains of PAAs considered in this paper. We begin with the field of pattern matching statistics. Biological sequence analysis is often concerned with the search for structure in long strings like DNA, RNA or amino acid sequences. Frequently, “search for structure” means to look for patterns that occur very often. An important point in this process is to sensibly define a notion of “very often”. One option is to consult the statistical significance of an event: Suppose we have found a certain pattern $k$ times in a given sequence. What is the probability of observing $k$ or more matches just by chance? The answer to this question depends on the null model used, i.e. the notion of “by chance”. It turns out that the PAA framework paves the way to using quite general null models; finite-memory text models as used in this article comprise i.i.d. models, Markovian models of arbitrary order and character-emitting hidden Markov models (HMMs).

The PAA framework can also be applied to the exact analysis of algorithms. Traditionally, best case, average case, and worst case behavior of algorithms are considered. In contrast, we construct PAAs to compute the whole exact distribution of costs of arbitrary window-based pattern matching algorithms like Horspool’s or Sunday’s algorithm. For these algorithms, we present (perhaps surprising) exemplary results on short patterns and moderate text lengths.

Another application arises when searching for a (biological) query sequence, such as a DNA or protein sequence, in a comprehensive database. The goal is to quickly retrieve all sufficiently similar sequences. Heuristic methods use so-called alignment seeds in order to first detect candidate sequences which are then investigated more carefully. To evaluate the quality of such a seed, one computes its sensitivity or hitting probability, i.e. the fraction of all desired target sequences it hits. Similarly, we ask which fraction of non-related sequences is hit by a seed by chance, as this quantity directly translates into unnecessary subsequent alignment work. The stochasticity arises from directly modelling alignments of similar (or non-similar) sequences rather than modelling the sequences themselves. We use the PAA framework to compute the match distribution and, in particular, the sensitivity of certain filtration seeds under finite-memory null models.

Next, we investigate protein identification by mass spectrometric analysis. If one can describe the typical length and mass of peptide fragments measured by the mass spectrometer, one can define a reliable comparison of so-called peptide mass fingerprints, based on an underlying null model. In this article, we compute the joint length-mass distribution of random peptide fragments. Furthermore, we calculate the occurrence probability of fragments in a given mass range, i.e., the probability that cleaving a random protein of a given length yields at least one fragment within this mass range. This probability aids the interpretation of mass spectra as it gives the significance of a measured peak. Again, our framework permits the use of arbitrary finite-memory protein models.

The final application concerns DNA sequencing: The task of determining a DNA sequence has seen great technological progress over the last decades. For a particular sequencing technology (“454 sequencing”), the length of a sequenced DNA fragment depends on the order of its characters. We compute the exact length distribution of such fragments. With this distribution we can specify the technology settings that yield the longest reads on average if statistical properties of the genome under consideration are known, improving the sequencing performance by up to 10%.

Most of the mentioned applications have, in some form or another, previously been discussed in the literature, but never been identified as instances of the same abstract scheme. The contribution of this article is twofold. On the one hand, we introduce a generic framework unifying the view on the presented applications. On the other hand, we show that, in all considered applications, not only can known results be reproduced using PAAs, but new results are obtained as well. Since the range of applications is quite diverse, we give further references to relevant literature in each section separately.

Organization of the Article.

In the first part of this article we set up the PAA framework. Specifically, we formally introduce PAAs in Section 2 and give generic algorithms in Section 3. We consider waiting time problems on PAAs in Section 4. In Section 5, deterministic arithmetic automata and finite-memory text models are defined and shown to be a convenient means of specifying a PAA.

Having the framework in place, we explore several application domains. Section 6 deals with the statistics of patterns on random texts. In Section 7, PAAs are employed for the analysis of window-based pattern matching algorithms. We cover the field of alignment seed statistics in Section 8. Then, in Section 9, we apply PAAs to mass statistics of fragments resulting from enzymatic digestion. The optimization of read lengths in 454 sequencing is discussed in Section 10. We conclude the article with a summarizing discussion in Section 11.

The relations between this article and preliminary versions published in conference proceedings are as follows. Sections 2 and 3 and parts of Section 6 are based on [47]. Section 5 is based on [50]. Section 6.4 contains material from [48]. Sections 7, 8, and 9 are based on [50], [23], and [29], respectively.

An extended version of [50] has been submitted for consideration in a special issue of a journal; a preprint is available as [49]. There, the analysis of pattern matching algorithms is further generalized and applied to more algorithms. Here, we include a basic example in Section 7 to further illustrate the utility of PAAs.

Notation and Conventions.

The natural numbers (without zero) are denoted $\mathbb{N}$; to include zero, we write $\mathbb{N}_0$. Throughout the article, $\Sigma$ is a finite alphabet; as usual, $\Sigma^*$ denotes the set of all finite strings over $\Sigma$. Indices are zero-based, i.e. $s = s[0] \ldots s[n-1]$ for $s \in \Sigma^n$. Substrings, prefixes, and suffixes are written $s[i..j]$, $s[..j]$, and $s[i..]$, respectively. All stochastic processes considered in this article are discrete. Therefore, appropriate probability spaces can always be constructed; we do not clutter notation by stating them explicitly. By $\mathbb{P}$, we refer to a probability measure; $\mathcal{L}(X)$ denotes the distribution of the random variable $X$. Iverson brackets are written $[\![A]\!]$, i.e. $[\![A]\!] = 1$ if the statement $A$ is true and $[\![A]\!] = 0$ otherwise.

2 Probabilistic Arithmetic Automata

In this section, we define probabilistic arithmetic automata (PAAs) in order to formalize chains of operations with probabilistic operands. PAAs can be interpreted as generalized Markov Additive Processes (MAP) [18, 19] in the discrete case.

Definition 2.1 (Probabilistic Arithmetic Automaton).

A probabilistic arithmetic automaton $\mathcal{P}$ is a tuple

$\mathcal{P} = \big(Q, q_0, T, V, v_0, E, \mu = (\mu_q)_{q \in Q}, \theta = (\theta_q)_{q \in Q}\big),$

where

  • $Q$ is a finite set of states,

  • $q_0 \in Q$ is called the start state,

  • $T: Q \times Q \to [0, 1]$ is a transition function with $\sum_{q' \in Q} T(q, q') = 1$ for all $q \in Q$, i.e. $\big(T(q, q')\big)_{q, q' \in Q}$ is a stochastic matrix,

  • $V$ is a set called value set,

  • $v_0 \in V$ is called the start value,

  • $E$ is a finite set called emission set,

  • each $\mu_q: E \to [0, 1]$ is an emission distribution associated with state $q$,

  • each $\theta_q: V \times E \to V$ is an operation associated with state $q$.

We attach the following semantics: At first, the automaton is in its start state $q_0$, as for a classical deterministic finite automaton (DFA). In a DFA, the transitions are triggered by input symbols. In a PAA, the transitions are purely probabilistic; $T(q, q')$ gives the chance of going from state $q$ to state $q'$. Note that the tuple $(Q, q_0, T)$ defines a Markov chain on state set $Q$ with transition matrix $T$, where the initial distribution is the Dirac distribution assigning probability 1 to $q_0$.

While going from state to state, a PAA performs a chain of calculations on a set of values $V$. It starts with value $v_0$. Whenever a state transition is made, the entered state, say state $q$, generates an emission from $E$ according to the distribution $\mu_q$. The current value and this emission are then subject to the operation $\theta_q$, resulting in the next value from the value set $V$. Notice that the Markov chain $(Q, q_0, T)$, together with the emission set $E$ and the distributions $(\mu_q)_{q \in Q}$, defines a hidden Markov model (HMM). In the context of HMMs, however, the focus usually rests on the sequence of emissions, whereas we are interested in the value resulting from a chain of operations on these emissions.

Figure 1: Illustration of a PAA for the dice example. Each of the three dice is represented by a state (circles). Emission distributions are depicted as gray boxes. Each arrow stands for a possible state transition, each transition having a probability of $1/3$. The state’s operations are given in triangles next to the state. For the start state, emission distribution and operation are not drawn.

By introducing PAAs, we emphasize that many applications can naturally be modelled as a chain of operations whose result is of interest. When compared to Markov chains, PAAs do not offer an increase in expressive power. In fact, from a theoretical point of view, every PAA might be seen as a Markov chain on the state space $Q \times V$. Thus, we advocate PAAs not because of their expressive power but for their merits as a modelling technique. As we shall see, the framework lends itself to many applications and often allows simple and intuitive problem formulations. Before we formalize the introduced semantics in Definition 2.3, let us come back to the dice example from Section 1.

Example 2.2 (Dice).

We model each of the three dice as a PAA state. All transition probabilities equal $1/3$ and the emissions have uniform distributions over the number of faces of the respective dice. If we are interested in the maximum, each state’s operation is the maximum. In general, we can associate individual operations with each state, for instance: “sum the numbers from the 12- and 20-faced dice and subtract the numbers seen on the 6-faced die”. The value set is then the set of integers $\mathbb{Z}$, the start value is naturally $v_0 = 0$. The corresponding PAA is illustrated in Figure 1.
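To make the example concrete, the following Python snippet encodes the components of the dice PAA as plain data structures (the variable names and representation are ours, chosen for illustration only; the article does not prescribe any implementation).

from fractions import Fraction

# States: the three dice plus a separate start state (its emission and
# operation are never used, since no transition leads back to it).
states = ["start", "d6", "d12", "d20"]

# Every state moves to each die with probability 1/3.
transitions = {
    q: {"d6": Fraction(1, 3), "d12": Fraction(1, 3), "d20": Fraction(1, 3)}
    for q in states
}

# Emission distributions: uniform over the faces of the respective die.
emissions = {
    "d6":  {e: Fraction(1, 6)  for e in range(1, 7)},
    "d12": {e: Fraction(1, 12) for e in range(1, 13)},
    "d20": {e: Fraction(1, 20) for e in range(1, 21)},
}

# Operations: subtract the 6-faced die's result, add the other two.
operations = {
    "d6":  lambda v, e: v - e,
    "d12": lambda v, e: v + e,
    "d20": lambda v, e: v + e,
}

start_state, start_value = "start", 0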

Definition 2.3 (Stochastic processes induced by a PAA).

For a given PAA $\mathcal{P}$, we denote its state process by $(Y_n^{\mathcal{P}})_{n \in \mathbb{N}_0}$. It is defined to be a Markov chain with $Y_0^{\mathcal{P}} \equiv q_0$ and

$\mathbb{P}\big(Y_n^{\mathcal{P}} = q_n \,\big|\, Y_{n-1}^{\mathcal{P}} = q_{n-1}, \ldots, Y_0^{\mathcal{P}} = q_0\big) = T(q_{n-1}, q_n)$    (1)

for all $n \in \mathbb{N}$ and $q_0, \ldots, q_n \in Q$. Further, we define the emission process $(Z_n^{\mathcal{P}})_{n \in \mathbb{N}_0}$ through

$\mathbb{P}\big(Z_n^{\mathcal{P}} = e \,\big|\, Y_n^{\mathcal{P}} = q,\; Y_{n-1}^{\mathcal{P}}, Z_{n-1}^{\mathcal{P}}, \ldots, Y_0^{\mathcal{P}}, Z_0^{\mathcal{P}}\big) = \mu_q(e),$    (2)

i.e. the current emission depends solely on the current state. We use $(Y_n^{\mathcal{P}})$ and $(Z_n^{\mathcal{P}})$ to define the process of values $(V_n^{\mathcal{P}})_{n \in \mathbb{N}_0}$ resulting from the performed operations:

$V_0^{\mathcal{P}} :\equiv v_0, \qquad V_n^{\mathcal{P}} := \theta_{Y_n^{\mathcal{P}}}\big(V_{n-1}^{\mathcal{P}}, Z_n^{\mathcal{P}}\big) \quad \text{for } n \in \mathbb{N}.$    (3)

If the considered PAA is clear from the context, we omit the superscript $\mathcal{P}$ and write $Y_n$, $Z_n$, and $V_n$, respectively.

3 Computing the State-Value Distributions of PAAs

We describe two algorithms to compute the distribution of resulting values. In other words, we seek to calculate the distribution of the random variable $V_n$ for a given $n \in \mathbb{N}_0$. The idea is to compute the joint distribution of $(Y_n, V_n)$ and then to derive the sought distribution by marginalization:

$\mathbb{P}(V_n = v) = \sum_{q \in Q} \mathbb{P}(Y_n = q, V_n = v).$    (4)

For the sake of shorter notation, we define $f_n(q, v) := \mathbb{P}(Y_n = q, V_n = v)$ for $n \in \mathbb{N}_0$, $q \in Q$, $v \in V$.

A slight complication arises when $V$ is infinite. However, for each $n$, the range of $V_n$ is finite, as it is a function of the states and emissions up to time $n$, and these are finite sets. We define $\mathcal{V}_n := \{v \in V : \mathbb{P}(V_n = v) > 0\}$ and $\mathcal{V} := \bigcup_{m \le n} \mathcal{V}_m$. Clearly, $\mathcal{V}$ is finite. Therefore all actual computations are on finite sets. As we will see, in many applications, $|\mathcal{V}_n|$ grows only polynomially (even linearly) with $n$. In the following, we shall understand $V$ as the appropriate union of sets, namely $\mathcal{V}$. Running times of algorithms are given in terms of $|V|$.

3.1 Basic Algorithm

We discuss an algorithm to compute the joint distribution $f_n$ of states and values. A basic recurrence relation follows from Definitions 2.1 and 2.3.

Lemma 3.1 (State-value recurrence).

For a given PAA, the state-value distribution $f_n$ can be computed by

$f_0(q, v) = [\![q = q_0]\!] \cdot [\![v = v_0]\!]$    (5)

and

$f_{n+1}(q', v') = \sum_{q \in Q} \sum_{e \in E} \sum_{v \in \theta_{q'}^{-1}(v', e)} f_n(q, v) \cdot T(q, q') \cdot \mu_{q'}(e),$    (6)

where $\theta_{q'}^{-1}(v', e) := \{v \in V : \theta_{q'}(v, e) = v'\}$ denotes the inverse image set of $v'$ under $\theta_{q'}(\cdot, e)$.

Proof.

Equation (5) follows directly from (1) and (3). Let us verify Equation (6):

We further evaluate the conditional probability appearing in this expression:

where the first marked equality holds because of (2) and (3), and the second follows from the fact that $(Y_n)$ is a Markov chain. ∎

We start with the distribution $f_0$ and calculate the subsequent distributions by applying Equation (6) until we obtain the desired $f_n$. A straightforward implementation of Equation (6) results in a pull-strategy; that means each entry in the table representing $f_{n+1}$ is calculated by “pulling over” the required probabilities from table $f_n$. Note that this approach makes it necessary to calculate the inverse image sets $\theta_{q'}^{-1}(v', e)$ in a preprocessing step. In order to avoid this, we may implement a push-strategy, meaning that we iterate over all entries in $f_n$ rather than $f_{n+1}$ and “push” the encountered summands over to the appropriate places in table $f_{n+1}$; in effect, we just change the order of summation. Algorithm 1 shows the push-strategy in detail.

0:  Input: PAA $(Q, q_0, T, V, v_0, E, \mu, \theta)$, number of steps $n$, space for two tables of size $|Q| \times |V|$
0:  Output: the joint state-value distribution $f_n$
1:  for  $k = 0$ to $n - 1$  do
2:     initialize $f_{k+1}(q, v) \leftarrow 0$ for all $q \in Q$, $v \in V$
3:     for all  $q \in Q$ and $v \in V$  do
4:        for all  $q' \in Q$ and $e \in E$  do
5:           $v' \leftarrow \theta_{q'}(v, e)$
6:           $f_{k+1}(q', v') \leftarrow f_{k+1}(q', v') + f_k(q, v) \cdot T(q, q') \cdot \mu_{q'}(e)$
7:        end for
8:     end for
9:  end for
10:  return  $f_n$
Algorithm 1 PaaDist
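As an illustration, here is a small Python implementation of the push-strategy (a sketch under our own data layout, not the article's reference code): a PAA is given by dictionaries mapping each state to its successor probabilities, emission distribution, and operation, as in the dice snippet above.

from collections import defaultdict

def paa_value_distribution(transitions, emissions, operations,
                           start_state, start_value, n):
    # f_0: all probability mass on the pair (start state, start value).
    current = {(start_state, start_value): 1}
    for _ in range(n):
        nxt = defaultdict(int)
        # Push each entry of f_k over to the appropriate entries of f_{k+1}.
        for (q, v), p in current.items():
            for q2, t in transitions[q].items():
                for e, pe in emissions[q2].items():
                    nxt[(q2, operations[q2](v, e))] += p * t * pe
        current = dict(nxt)
    # Marginalize over states to obtain the value distribution (Equation (4)).
    value_dist = defaultdict(int)
    for (q, v), p in current.items():
        value_dist[v] += p
    return dict(value_dist)

# Example: exact distribution of the dice game's value after three draws.
# dist = paa_value_distribution(transitions, emissions, operations, "start", 0, 3)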

In the course of the computation, we have to store two distributions, $f_k$ and $f_{k+1}$, at a time. Once $f_{k+1}$ is calculated, $f_k$ can be discarded. Since the table at time $k$ has a size of $|Q| \cdot |V|$, the total space consumption is $O(|Q| \cdot |V|)$. Computing $f_{k+1}$ from $f_k$ takes $O(|Q|^2 \cdot |V| \cdot |E|)$ time, as can be seen from Algorithm 1. We arrive at the following lemma.

Lemma 3.2.

Given a PAA $\mathcal{P}$, the distribution of values $\mathcal{L}(V_n)$ can be computed in $O(n \cdot |Q|^2 \cdot |V| \cdot |E|)$ time and $O(|Q| \cdot |V|)$ space.

3.2 Doubling Technique

If $n$ is large, executing the above algorithm may be slow. In this section, we present an alternative algorithm that may be favorable for large $n$. To derive this algorithm, we consider the conditional probability

$U_k(q, v, q', v') := \mathbb{P}\big(Y_{m+k} = q',\, V_{m+k} = v' \,\big|\, Y_m = q,\, V_m = v\big).$    (7)

Note that $U_k$ does not depend on $m$, because transition as well as emission probabilities do not change over “time” (a property called homogeneity). Once $U_n$ is known, we can simply read off the desired distribution $\mathcal{L}(V_n)$:

$\mathbb{P}(V_n = v) = \sum_{q \in Q} U_n(q_0, v_0, q, v).$    (8)

The following lemma shows how $U_n$ can be computed.

Lemma 3.3.

Let $\mathcal{P}$ be a PAA and $(Y_n)$ and $(V_n)$ its state and value process, respectively. Then,

$U_1(q, v, q', v') = T(q, q') \cdot \sum_{e \in E :\, \theta_{q'}(v, e) = v'} \mu_{q'}(e)$    (9)

and, for all $k, l \in \mathbb{N}$,

$U_{k+l}(q, v, q', v') = \sum_{q'' \in Q} \sum_{v'' \in V} U_k(q, v, q'', v'') \cdot U_l(q'', v'', q', v').$    (10)

Using these recurrences, the distribution of values $\mathcal{L}(V_n)$ can be computed in $O\big(\log n \cdot |Q|^3 \cdot |V|^3\big)$ time and $O\big(|Q|^2 \cdot |V|^2\big)$ space.

Proof.

Equation (9) follows from Definition 2.3, while Equation (10) follows from the Chapman-Kolmogorov Equation for homogeneous Markov chains when the PAA is seen as a Markov chain with state space $Q \times V$. Computing $U_{k+l}$ from $U_k$ and $U_l$ takes $O(|Q|^3 \cdot |V|^3)$ time, as follows from Equation (10). On the other hand, one step suffices to obtain $U_{2k}$ from $U_k$. Thus, we can compute all $U_{2^i}$ with $2^i \le n$ in $O(\log n)$ steps, which in turn can be combined into $U_n$ in at most $\log_2 n$ additional steps. ∎

We note that the doubling technique is asymptotically faster if $n$ is large while $|Q|$, $|V|$, and $|E|$ are fixed.
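The following sketch illustrates the doubling idea in Python, treating the PAA as a Markov chain on the joint space of states and values and raising its one-step matrix to the $n$-th power by repeated squaring (function and variable names are ours; the value set is assumed to be finite and closed under all operations).

import numpy as np

def paa_dist_by_doubling(transitions, emissions, operations, states, values,
                         start_state, start_value, n):
    # Enumerate the joint space Q x V and build the one-step matrix U_1.
    joint = [(q, v) for q in states for v in values]
    index = {sv: i for i, sv in enumerate(joint)}
    U = np.zeros((len(joint), len(joint)))
    for (q, v), i in index.items():
        for q2, t in transitions.get(q, {}).items():
            for e, pe in emissions[q2].items():
                j = index[(q2, operations[q2](v, e))]
                U[i, j] += float(t) * float(pe)
    # U^n via binary exponentiation, i.e. O(log n) matrix squarings.
    Un = np.linalg.matrix_power(U, n)
    start = np.zeros(len(joint))
    start[index[(start_state, start_value)]] = 1.0
    joint_dist = start @ Un
    # Marginalize over the states to obtain the value distribution.
    value_dist = {}
    for (q, v), i in index.items():
        value_dist[v] = value_dist.get(v, 0.0) + joint_dist[i]
    return value_dist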

4 Waiting Times

Besides calculating the distribution of values after a fixed number of steps, we can ask for the distribution of the number of steps needed to reach a certain value or a certain state. Such waiting time problems play an important role in many applications. We discuss examples in Sections 6 and 10. A classical treatment of waiting time problems is given in [21]. Applications to occurrence problems in texts are reviewed in [59].

Definition 4.1 (Waiting time for a value).

The waiting time for a set of target values $\mathcal{T} \subseteq V$ is the random variable $W_{\mathcal{T}} := \min\{n \in \mathbb{N}_0 : V_n \in \mathcal{T}\}$ if this set is not empty, and defined as infinity otherwise.

While $\mathbb{P}(W_{\mathcal{T}} = n)$ may be nonzero for all $n \in \mathbb{N}_0$, we are frequently only interested in the distribution up to a fixed time $n_{\max}$. Then, of course, the exact values of $\mathbb{P}(W_{\mathcal{T}} = n)$ are unknown for $n > n_{\max}$, but their total probability is known, and $n_{\max}$ is typically chosen such that this total probability remains below a desired threshold.

Lemma 4.2.

Let $\mathcal{P}$ be a PAA and $\mathcal{T} \subseteq V$ a set of target values. Then, the probabilities $\mathbb{P}(W_{\mathcal{T}} = n)$ for all $n \le n_{\max}$ can be computed in $O(n_{\max} \cdot |Q|^2 \cdot |V| \cdot |E|)$ time and $O(|Q| \cdot |V|)$ space. Alternatively, this can be done using $O(\log n_{\max} \cdot |Q|^3 \cdot |V|^3)$ time and $O(|Q|^2 \cdot |V|^2)$ space.

Proof.

We construct a modified PAA by defining a new value set $V' := (V \setminus \mathcal{T}) \cup \{\top\}$, assuming (without loss of generality) that $\top \notin V$, and new operations

$\theta'_q(v, e) := \begin{cases} \top & \text{if } v = \top \text{ or } \theta_q(v, e) \in \mathcal{T}, \\ \theta_q(v, e) & \text{otherwise}, \end{cases}$

for all $q \in Q$. Let $(V'_n)$ be the modified value process. Using the modified PAA, the probability of waiting time $n$ can be expressed as

$\mathbb{P}(W_{\mathcal{T}} = n) = \mathbb{P}(V'_n = \top) - \mathbb{P}(V'_{n-1} = \top).$

Runtime and space bounds follow from Lemmas 3.2 and 3.3. ∎
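A small Python sketch of this construction (assuming, as above, a dictionary-based PAA and a start value that is not already a target value; all names are ours):

def waiting_time_distribution(transitions, emissions, operations,
                              start_state, start_value, targets, n_max):
    absorbed = object()   # plays the role of the aggregation value in the proof
    def make_op(q):
        def op(v, e):
            if v is absorbed:
                return absorbed
            w = operations[q](v, e)
            return absorbed if w in targets else w
        return op
    mod_ops = {q: make_op(q) for q in operations}
    dist = {(start_state, start_value): 1}
    prev_mass, result = 0, []
    for n in range(1, n_max + 1):
        nxt = {}
        for (q, v), p in dist.items():
            for q2, t in transitions[q].items():
                for e, pe in emissions[q2].items():
                    key = (q2, mod_ops[q2](v, e))
                    nxt[key] = nxt.get(key, 0) + p * t * pe
        dist = nxt
        mass = sum(p for (q, v), p in dist.items() if v is absorbed)
        result.append((n, mass - prev_mass))   # P(waiting time = n)
        prev_mass = mass
    return result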

Besides waiting for a set of values, we may also wait for a set of states. Since a PAA’s state process does not depend on emission and value processes, the remainder of this section solely concerns the Markov chain $(Q, q_0, T)$, which is part of the PAA $\mathcal{P}$. Waiting times in Markov chains are a well-studied topic with particular interest in pattern occurrences [59] and queuing theory [13]. For completeness and implementation purposes within the PAA framework, we briefly restate the construction here.

Definition 4.3 (Waiting time for a state).

The waiting time for a set of target states $\mathcal{S} \subseteq Q$ is the random variable $W_{\mathcal{S}} := \min\{n \in \mathbb{N}_0 : Y_n \in \mathcal{S}\}$ if this set is not empty and defined as infinity otherwise.

Lemma 4.4.

Let $\mathcal{P}$ be a PAA, $\rho$ be a probability distribution on $Q$, and $\mathcal{S} \subseteq Q$ be a set of target states. Consider the Markov chain with transition matrix $T$ and initial distribution $\rho$, let $(Y'_n)$ be its state process, and let $W_{\mathcal{S}}$ be the waiting time for states $\mathcal{S}$. Then $\mathbb{P}(W_{\mathcal{S}} = n)$ for all $n \le n_{\max}$ can be computed in $O(n_{\max} \cdot |Q|^2)$ time and $O(|Q|)$ space, or in $O(\log n_{\max} \cdot |Q|^3)$ time and $O(|Q|^2)$ space using the doubling technique. If $\rho$ is the Dirac distribution on $q_0$, then $(Y'_n)$ coincides in distribution with the PAA’s state process $(Y_n)$.

Proof.

As in the proof of Lemma 4.2, we introduce an aggregation state to replace the states in $\mathcal{S}$ and an absorbing state to “flush” the aggregation state. Then the waiting time probabilities can be read off as before. ∎

When $\rho$ is the Dirac distribution on $q_0$, the above lemma yields the waiting time for the first event of reaching one of the states in $\mathcal{S}$. We further consider the waiting time of a return event, i.e. the number of steps between two consecutive visits to $\mathcal{S}$.

If the Markov chain is aperiodic and irreducible, it has a unique stationary state distribution, against which the PAA state distribution converges exponentially fast. We can then use Lemma 4.4 to compute the asymptotic distribution of this return time

by choosing $\rho$ in Lemma 4.4 as the stationary distribution restricted to $\mathcal{S}$ (renormalized to total probability 1).

5 PAAs Based on Random Sequences

We discuss the construction of PAAs modelling the deterministic processing of random sequences. That is, we assume we are given a mechanism that processes sequences character by character and deterministically computes a value for a given string. A pattern matching algorithm that computes the number of matches in a given sequence might serve as an example. In this section, we ask for the distribution of resulting values when a deterministic computation is applied to random strings. Therefore we first define text models and deterministic arithmetic automata to represent random texts and deterministic computations, respectively, and then combine both into a PAA.

5.1 Random Text Models

Given an alphabet $\Sigma$, a random text is a stochastic process $(S_n)_{n \in \mathbb{N}_0}$, where each $S_n$ takes values in $\Sigma$. A text model is a probability measure assigning probabilities to (sets of) strings. It is given by (consistently) specifying the probabilities $\mathbb{P}(S_0 \ldots S_{n-1} = s)$ for all $n \in \mathbb{N}$ and $s \in \Sigma^n$. We only consider finite-memory models in this article, which are formalized in the following definition.

Definition 5.1 (Finite-memory text model).

A finite-memory text model is a tuple $(\mathcal{C}, c_0, \Sigma, \varphi)$, where $\mathcal{C}$ is a finite state space (called context space), $c_0 \in \mathcal{C}$ a start context, $\Sigma$ an alphabet, and $\varphi: \mathcal{C} \times \Sigma \times \mathcal{C} \to [0, 1]$ a function with $\sum_{\sigma \in \Sigma} \sum_{c' \in \mathcal{C}} \varphi(c, \sigma, c') = 1$ for all $c \in \mathcal{C}$. The random variable giving the text model state after $n$ steps is denoted $C_n$, with $C_0 :\equiv c_0$. A probability measure is now induced by stipulating

$\mathbb{P}\big(S_n = \sigma,\, C_{n+1} = c' \,\big|\, C_n = c,\; C_{n-1}, \ldots, C_0,\; S_{n-1}, \ldots, S_0\big) = \varphi(c, \sigma, c')$

for all $n \in \mathbb{N}_0$, $\sigma \in \Sigma$, and $c, c' \in \mathcal{C}$.

The idea is that the model given by $\varphi$ generates a random text by moving from context to context and emitting a character at each transition, where $\varphi(c, \sigma, c')$ is the probability of moving from context $c$ to context $c'$ and thereby generating the letter $\sigma$.

Note that the probability $\mathbb{P}(S_0 \ldots S_{n-1} = s)$ is obtained by marginalization over all context sequences that generate $s$. This can be done efficiently, using the decomposition of the following lemma.

Lemma 5.2.

Let $(\mathcal{C}, c_0, \Sigma, \varphi)$ be a finite-memory text model. Then,

$\mathbb{P}\big(S_0 \ldots S_n = s\sigma,\; C_{n+1} = c'\big) = \sum_{c \in \mathcal{C}} \mathbb{P}\big(S_0 \ldots S_{n-1} = s,\; C_n = c\big) \cdot \varphi(c, \sigma, c')$

for all $n \in \mathbb{N}_0$, $s \in \Sigma^n$, $\sigma \in \Sigma$, and $c' \in \mathcal{C}$.

Proof.

We have

Renaming the context variables accordingly yields the claimed result. ∎
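In code, the marginalization of Lemma 5.2 is a simple forward recursion over contexts. The following Python sketch (data layout and names are our own) computes the probability of a string under a finite-memory text model; the one-context example at the end corresponds to an i.i.d. model.

def text_probability(model, s):
    contexts, c0, phi = model          # phi[(c, sigma)] maps c' to phi(c, sigma, c')
    forward = {c0: 1.0}                # joint probability of the prefix and current context
    for sigma in s:
        nxt = {}
        for c, p in forward.items():
            for c2, q in phi.get((c, sigma), {}).items():
                nxt[c2] = nxt.get(c2, 0.0) + p * q
        forward = nxt
    return sum(forward.values())       # marginalize over the final context

# An i.i.d. model over {'a', 'b'} with P(a) = 0.7 needs a single (empty) context:
iid = ({"eps"}, "eps", {("eps", "a"): {"eps": 0.7}, ("eps", "b"): {"eps": 0.3}})
# text_probability(iid, "aab")  ->  0.7 * 0.7 * 0.3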

Similar text models are used in [36], where they are called probability transducers. In the following, we refer to a finite-memory text model simply as a text model, as all text models considered in this article are special cases of Definition 5.1.

For an i.i.d. model, we set $\mathcal{C} = \{\varepsilon\}$ and $\varphi(\varepsilon, \sigma, \varepsilon) = p_\sigma$ for each $\sigma \in \Sigma$, where $p_\sigma$ is the occurrence probability of letter $\sigma$ (and $\varepsilon$ may be interpreted as an empty context). For a Markovian text model of order $r$, the distribution of the next character depends only on the $r$ preceding characters (fewer at the beginning); thus we set $\mathcal{C} = \bigcup_{k=0}^{r} \Sigma^k$. The conditional follow-up probabilities are given by $\varphi(c, \sigma, c') = \mathbb{P}(\sigma \mid c) \cdot [\![c' \text{ is the suffix of } c\sigma \text{ of length } \min\{|c\sigma|, r\}]\!]$, where $\mathbb{P}(\sigma \mid c)$ denotes the Markovian probability of observing $\sigma$ after context $c$.

This notion of text models also covers variable order Markov chains as introduced in [62], which can be converted into equivalent models of fixed order. Text models as defined above have the same expressive power as character-emitting HMMs; that is, they allow the construction of the same probability distributions. For a given HMM, we can construct an equivalent text model by using the same state space (contexts) and setting $\varphi(c, \sigma, c') := t(c, c') \cdot \mu_{c'}(\sigma)$, where $t$ and $\mu_{c'}$ are the HMM’s transition function and emission distribution attached to state $c'$, respectively. When, on the other hand, a text model $(\mathcal{C}, c_0, \Sigma, \varphi)$ is given, we construct an equivalent HMM by using $\mathcal{C} \times \Sigma$ as state space and setting

$t\big((c, \sigma), (c', \sigma')\big) := \varphi(c, \sigma', c')$

and

$\mu_{(c', \sigma')}(\sigma) := [\![\sigma = \sigma']\!],$

i.e. each HMM state deterministically emits the character it encodes.
5.2 Deterministic Arithmetic Automata (DAAs)

In order to model deterministic calculations on sequences, we define a deterministic counterpart to PAAs.

Definition 5.3 (Deterministic Arithmetic Automaton, DAA).

A deterministic arithmetic automaton is a tuple

$\mathcal{D} = \big(Q, q_0, \Sigma, \delta, V, v_0, E, (e_q)_{q \in Q}, (\theta_q)_{q \in Q}\big),$

where $Q$ is a finite set of states, $q_0 \in Q$ is the start state, $\Sigma$ is a finite alphabet, $\delta: Q \times \Sigma \to Q$ is called transition function, $V$ is a set of values, $v_0 \in V$ is called the start value, $E$ is a finite set of emissions, $e_q \in E$ is the emission associated to state $q$, and $\theta_q: V \times E \to V$ is a binary operation associated to state $q$. Further, we define the associated joint transition function

$\hat\delta: (Q \times V) \times \Sigma \to Q \times V, \qquad \hat\delta\big((q, v), \sigma\big) := \Big(\delta(q, \sigma),\; \theta_{\delta(q, \sigma)}\big(v, e_{\delta(q, \sigma)}\big)\Big).$

We extend the definitions of $\delta$ and $\hat\delta$ inductively to $\Sigma^*$ in their second argument by setting $\delta(q, \varepsilon) := q$, $\delta(q, x\sigma) := \delta\big(\delta(q, x), \sigma\big)$, and analogously for $\hat\delta$.

When $\hat\delta\big((q_0, v_0), x\big) = (q, v)$ for some $q \in Q$ and $v \in V$, we say that $\mathcal{D}$ computes value $v$ for input $x$ and define $\mathrm{value}_{\mathcal{D}}(x) := v$.

Informally, a DAA starts with the state-value pair $(q_0, v_0)$ and reads a sequence of symbols from $\Sigma$. Being in state $q$ with value $v$, upon reading $\sigma \in \Sigma$, the DAA performs a state transition to $q' := \delta(q, \sigma)$ and updates the value to $\theta_{q'}(v, e_{q'})$ using the operation and emission of the new state $q'$.

For each state $q$, the emission $e_q$ is fixed and could be dropped from the definition of DAAs. In fact, one could also dispense with values and operations entirely and define a DFA over the state space $Q \times V$, performing the same computations as a DAA. However, we intentionally include values, operations, and emissions to emphasize the connection to PAAs.
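As a concrete illustration, the following Python sketch runs a DAA on an input string; the example automaton counts occurrences of the pattern "ab" over the alphabet {a, b}, emitting 1 exactly when a match has just been completed and adding emissions to the value (the encoding and names are ours, chosen only for illustration).

def daa_compute(delta, eta, ops, q0, v0, s):
    # value_D(s): follow the joint transition function character by character.
    q, v = q0, v0
    for sigma in s:
        q = delta[(q, sigma)]
        v = ops[q](v, eta[q])
    return v

# Example DAA: state 0 = no useful suffix, state 1 = suffix "a",
# state 2 = a match of "ab" has just been completed.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 0}
eta = {0: 0, 1: 0, 2: 1}                       # emit 1 only on a completed match
ops = {q: (lambda v, e: v + e) for q in (0, 1, 2)}

# daa_compute(delta, eta, ops, 0, 0, "ababb") returns 2.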

5.3 Constructing PAAs from DAAs and Text Models

We now formally state how to convert a DAA into a (restricted) PAA, where each emission distribution is deterministic (assigning probability $1$ to a particular emission), given a text model.

Lemma 5.4 (DAA $\times$ text model $\to$ PAA).

Let $(\mathcal{C}, c_0, \Sigma, \varphi)$ be a text model and $\mathcal{D} = \big(Q^{\mathcal{D}}, q_0^{\mathcal{D}}, \Sigma, \delta, V, v_0, E, (e_q), (\theta_q)\big)$ be a DAA. Then, define

  • a state space $Q := \mathcal{C} \times Q^{\mathcal{D}}$,

  • a start state $q_0 := (c_0, q_0^{\mathcal{D}})$,

  • transition probabilities

    $T\big((c, q), (c', q')\big) := \sum_{\sigma \in \Sigma :\, \delta(q, \sigma) = q'} \varphi(c, \sigma, c'),$    (11)
  • (deterministic) emission probability vectors

    $\mu_{(c, q)}(e) := [\![e = e_q]\!]$ for all $(c, q) \in Q$,

  • operations $\theta_{(c, q)} := \theta_q$ for all $(c, q) \in Q$.

Then, $\mathcal{P} := (Q, q_0, T, V, v_0, E, \mu, \theta)$ is a PAA with

$\mathcal{L}\big(V_n^{\mathcal{P}}\big) = \mathcal{L}\big(\mathrm{value}_{\mathcal{D}}(S_0 \ldots S_{n-1})\big)$

for all $n \in \mathbb{N}_0$, where $(S_i)_{i \in \mathbb{N}_0}$ is a random text according to the text model $(\mathcal{C}, c_0, \Sigma, \varphi)$.

Proof.

$\mathcal{P}$ is a PAA by Definition 2.1. As in Section 3, we define $f_n\big((c, q), v\big) := \mathbb{P}\big(Y_n = (c, q), V_n = v\big)$. To prove the claimed equality of distributions, we show that

$f_n\big((c, q), v\big) = \mathbb{P}\Big(C_n = c,\; \hat\delta\big((q_0^{\mathcal{D}}, v_0),\, S_0 \ldots S_{n-1}\big) = (q, v)\Big)$    (12)

for all $n \in \mathbb{N}_0$, $c \in \mathcal{C}$, $q \in Q^{\mathcal{D}}$, and $v \in V$. For $n = 0$, Equation (12) is correct by the definitions of PAAs, DAAs and text models. For $n > 0$ we prove it inductively. Assume (12) to be correct for all indices $m$ with $m \le n$.

(13)
(14)
(15)
(16)
(17)
(18)
(19)

In the above derivation, step (13)$\to$(14) follows from (6). Step (14)$\to$(15) follows from the definitions of $\hat\delta$ and $\mathrm{value}_{\mathcal{D}}$. Step (15)$\to$(16) uses the definitions of $T$ and $\mu$ in Lemma 5.4. Step (16)$\to$(17) uses the induction assumption. Step (17)$\to$(18) uses Lemma 5.2. The final step (18)$\to$(19) follows by combining the four Iverson brackets, summed over the intermediate variables, into a single Iverson bracket. ∎

Remark 5.5.

In the above lemma, states having zero probability of being reached from the start state $(c_0, q_0^{\mathcal{D}})$ may be omitted from $Q$ and $T$.
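The construction of Lemma 5.4 is mechanical. The following Python sketch builds the PAA components from a DAA and a text model in the dictionary encodings used in the earlier sketches (all names and the data layout are ours); the result can be fed directly into the push-strategy implementation of Section 3 to obtain, for example, the distribution of pattern occurrence counts in a random text.

def paa_from_daa(contexts, c0, phi, delta, eta, ops, q0, v0, alphabet):
    # PAA states are pairs (context, DAA state); transition (c,q) -> (c2,q2)
    # collects every character sigma with delta[(q, sigma)] == q2 that the
    # text model can emit while moving from context c to context c2.
    daa_states = {q for (q, _) in delta} | set(delta.values())
    paa_states = [(c, q) for c in contexts for q in daa_states]
    T = {s: {} for s in paa_states}
    for (c, q) in paa_states:
        for sigma in alphabet:
            q2 = delta[(q, sigma)]
            for c2, p in phi.get((c, sigma), {}).items():
                T[(c, q)][(c2, q2)] = T[(c, q)].get((c2, q2), 0) + p
    # Emissions are deterministic and, like the operations, inherited from the DAA state.
    emissions = {(c, q): {eta[q]: 1} for (c, q) in paa_states}
    operations = {(c, q): ops[q] for (c, q) in paa_states}
    return paa_states, (c0, q0), T, emissions, operations, v0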

Lemma 5.6 (PAA from DAA; Construction time and space).

  1. For a PAA constructed according to Lemma 5.4, the value distribution $\mathcal{L}(V_n)$, or the joint state-value distribution, can be computed with $O\big(n \cdot |\mathcal{C}|^2 \cdot |Q^{\mathcal{D}}| \cdot |\Sigma| \cdot |V|\big)$ operations using $O\big(|\mathcal{C}| \cdot |Q^{\mathcal{D}}| \cdot |V|\big)$ space. The same statement holds for computing the waiting time distribution up to time $n$.

  2. If for all $c \in \mathcal{C}$ and $\sigma \in \Sigma$, there exists at most one $c' \in \mathcal{C}$ such that $\varphi(c, \sigma, c') > 0$, then the time is bounded by $O\big(n \cdot |\mathcal{C}| \cdot |Q^{\mathcal{D}}| \cdot |\Sigma| \cdot |V|\big)$.

  3. Using the doubling technique, the distributions can be computed in $O\big(\log n \cdot (|\mathcal{C}| \cdot |Q^{\mathcal{D}}| \cdot |V|)^3\big)$ time and $O\big((|\mathcal{C}| \cdot |Q^{\mathcal{D}}| \cdot |V|)^2\big)$ space.

Proof.

  1. From Lemma 3.2, we obtain bounds for time and space complexity of $O(n \cdot |Q|^2 \cdot |V| \cdot |E|)$ and $O(|Q| \cdot |V|)$, respectively. By construction, $|Q| = |\mathcal{C}| \cdot |Q^{\mathcal{D}}|$. Recall that Lemma 3.2 is based on Algorithm 1. The loops in lines 1 and 3 together account for a factor of