Optimal mean-based algorithms for trace reconstruction

In the (deletion-channel) trace reconstruction problem, there is an unknown $n$-bit source string $x$. An algorithm is given access to independent traces of $x$, where a trace is formed by deleting each bit of $x$ independently with probability $\delta$. The goal of the algorithm is to recover $x$ exactly (with high probability), while minimizing samples (number of traces) and running time.

Previously, the best known algorithm for the trace reconstruction problem was due to Holenstein et al. [HMPW08]; it uses $\exp(\tilde{O}(n^{1/2}))$ samples and running time for any fixed deletion rate $\delta < 1$. It is also what we call a “mean-based algorithm”, meaning that it only uses the empirical means of the individual bits of the traces. Holenstein et al. also gave a lower bound, showing that any mean-based algorithm must use at least $n^{\tilde{\Omega}(\log n)}$ samples.

In this paper we improve both of these results, obtaining matching upper and lower bounds for mean-based trace reconstruction. For any constant deletion rate $0 < \delta < 1$, we give a mean-based algorithm that uses $\exp(O(n^{1/3}))$ time and traces; we also prove that any mean-based algorithm must use at least $\exp(\Omega(n^{1/3}))$ traces. In fact, we obtain matching upper and lower bounds even for $\delta$ subconstant and $\rho := 1-\delta$ subconstant: when $(\log^3 n)/n \ll \delta \le 1/2$ the bound is $\exp(\Theta((\delta n)^{1/3}))$, and when $1/\sqrt{n} \ll \rho \le 1/2$ the bound is $\exp(\Theta((n/\rho)^{1/3}))$.

Our proofs involve estimates for the maxima of Littlewood polynomials on complex disks. We show that these techniques can also be used to perform trace reconstruction with random insertions and bit-flips in addition to deletions. We also find a surprising result: for deletion probabilities $\delta > 1/2$, the presence of insertions can actually help with trace reconstruction.

1 Introduction

Consider a setting in which a string $x$ of length $n$ over an alphabet $\Sigma$ is passed through a deletion channel that independently deletes each coordinate of $x$ with probability $\delta$. The resulting string, of length somewhere between $0$ and $n$, is referred to as a trace of $x$, or as a received string; the original string $x$ is referred to as the source string. The trace reconstruction problem is the task of reconstructing $x$ (with high probability) given access to independent traces of $x$. This is a natural and well-studied problem, dating back to the early 2000’s [Lev01b, Lev01a, BKKM04], with some combinatorial variants dating back even to the early 1970’s [Kal73]. However, perhaps surprisingly, much remains to be discovered both about the information-theoretic and algorithmic complexity of this problem. Indeed, in a 2009 survey [Mit09, Section 7], Mitzenmacher wrote that “the study of [trace reconstruction] is still in its infancy”.
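To fix ideas, here is a minimal Python sketch (ours, purely illustrative and not part of the paper) of how traces are generated by the deletion channel; the function name is our own.

```python
import random

def deletion_trace(x, delta, rng=random):
    """One pass of x through the deletion channel: each coordinate is
    deleted independently with probability delta; the surviving
    coordinates are concatenated in their original order."""
    return [b for b in x if rng.random() >= delta]

# Example: five traces of a random 20-bit source string at deletion rate 0.4.
x = [random.randint(0, 1) for _ in range(20)]
for _ in range(5):
    t = deletion_trace(x, 0.4)
    print(len(t), "".join(map(str, t)))
```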

Before discussing previous work, we briefly explain why one can assume a binary alphabet without loss of generality. In the case of a general alphabet $\Sigma$, drawing a small number of traces will with high probability reveal the entire alphabet of symbols that are present in $x$. For each symbol $\sigma \in \Sigma$ we may consider the binary string $x^{(\sigma)}$ whose $i$-th character is $1$ iff $x_i = \sigma$; a trace of $x$ is easily converted into a trace of $x^{(\sigma)}$, so the trace reconstruction problem for $x$ can be solved by solving the binary trace reconstruction problem for each $x^{(\sigma)}$ and combining the results in the obvious way. For this reason, our work (and most previous work) focuses on the case of a binary alphabet.
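The reduction just described is mechanical; the following small Python sketch (again ours, with hypothetical helper names) shows how a single trace over a general alphabet projects to one binary trace per symbol.

```python
def binary_projections(trace, alphabet):
    """For each symbol s, produce the binary trace whose i-th character is 1
    iff the i-th character of the given trace equals s.  Since the deletion
    channel acts on positions only, a trace of x yields a trace of each
    binary string x^(s) simultaneously."""
    return {s: [1 if c == s else 0 for c in trace] for s in alphabet}

# Example: one trace of the source string "bacca" in which one symbol was deleted.
print(binary_projections(list("baca"), alphabet={"a", "b", "c"}))
```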

1.1 Prior work

As described in [Mit09], the trace reconstruction problem can arise in several natural domains, including sensor networks and biology. However, the apparent difficulty of the problem means that there is not too much published work, at least on the “worst-case” trace reconstruction problem (“worst-case” in the sense that the source string $x$ may be any element of $\{0,1\}^n$). Because of this, several prior authors have considered an “average-case” version of the problem in which the source string $x$ is assumed to be uniformly random over $\{0,1\}^n$ and the algorithm is required to succeed with high probability over the random draw of the traces and over the uniform random choice of $x$. This average-case problem seems to have first been studied by Batu et al. [BKKM04], who showed that a simple efficient algorithm which they call Bitwise Majority Alignment succeeds with high probability for sufficiently small deletion rates $\delta = O(1/\log n)$ using only $O(\log n)$ traces. Subsequent work of Kannan and McGregor [KM05] gave an algorithm for random $x$ that can handle both deletions and insertions (both at rates $O(1/\log^2 n)$) as well as bit-flips (with constant probability bounded away from $1/2$) using $O(\log n)$ traces. Viswanathan and Swaminathan [VS08] sharpened this result by improving the deletion and insertion rates that can be handled to $O(1/\log n)$. Finally, [HMPW08] gave a $\mathrm{poly}(n)$-time, $\mathrm{poly}(n)$-trace algorithm for random $x$ that succeeds with high probability for any deletion rate $\delta$ that is at most some sufficiently small absolute constant.

Several researchers have considered, from an information-theoretic rather than algorithmic perspective, various reconstruction problems that are closely related to the (worst-case) trace reconstruction problem. Kalashnik [Kal73] showed that any $n$-bit string is uniquely specified by its $k$-deck, which is the multiset of all its length-$k$ subsequences, when $k \ge \lceil n/2 \rceil$; this result was later reproved by Manvel et al. [MMS91]. Scott [Sco97] subsequently showed that $k = O(\sqrt{n \log n})$ suffices for reconstruction from the $k$-deck for any $n$, and simultaneously and independently Krasikov and Roditty [KR97] showed that $k = O(\sqrt{n})$ suffices. (McGregor et al. observed in [MPV14] that the result of [Sco97] yields an information-theoretic algorithm using $2^{\tilde{O}(\sqrt{n})}$ traces for any deletion rate $\delta < 1$, but did not discuss the running time of such an algorithm.) On the other side, successively larger lower bounds on the value of $k$ that suffices for reconstruction of an arbitrary string from its $k$-deck were given by Manvel et al. [MMS91] and Choffrut and Karhumäki [CK97], culminating in a lower bound of $2^{\Omega(\sqrt{\log n})}$ due to Dudík and Schulman [DS03].

Surprisingly few algorithms have been given for the worst-case trace reconstruction problem as defined in the first paragraph of this paper. Batu et al. [BKKM04] showed that a variation of their Bitwise Majority Alignment algorithm succeeds efficiently using $O(n \log n)$ traces if the deletion rate $\delta$ is quite low, at most $O(n^{-(1/2 + \epsilon)})$. Holenstein et al. [HMPW08] gave a “mean-based” algorithm (we explain precisely what is meant by such an algorithm later) that runs in time $\exp(\tilde{O}(n^{1/2}))$ and uses $\exp(\tilde{O}(n^{1/2}))$ traces for any deletion rate $\delta$ that is bounded away from 1 by a constant; this is the prior work that is most relevant to our main positive result. [HMPW08] also gave a lower bound showing that for any $\delta$ bounded away from 0 by a constant, at least $n^{\tilde{\Omega}(\log n)}$ traces are required for any mean-based algorithm. Since the result of [HMPW08], several researchers (such as [Mos13]) have raised the question of finding (potentially inefficient) algorithms which have a better sample complexity; however, no progress had been made until this work.

One may also ask (as was done in the “open questions” of [Mit09, Section 7]) for trace reconstruction for more general channels, such as those that allow deletions, insertions, and bit-flips. The only work we are aware of along these lines is that of Andoni et al. [ADHR12], which gives results for trace reconstruction for average-case words in the presence of insertions, deletions, and substitutions on a tree.

1.2 Our results

Theorem 1.1 (Deletion channel positive result).

There is an algorithm for the trace reconstruction problem which, for any constant deletion rate $0 < \delta < 1$, uses $\exp(O(n^{1/3}))$ traces and running time.

Theorem 1.1 significantly improves the running time and sample complexity of the [HMPW08] algorithm, which is $\exp(\tilde{O}(n^{1/2}))$ for fixed constant $\delta$. Furthermore, we can actually extend Theorem 1.1 to the case of subconstant $\delta$ or subconstant $\rho = 1-\delta$; see Theorem 1.3 below.

The algorithm of Theorem 1.1 is a “mean-based” algorithm, meaning that it uses only the empirical mean of the trace vectors it receives. We prove an essentially matching lower bound for such algorithms:

Theorem 1.2 (Deletion channel negative result).

For any constant deletion rate $0 < \delta < 1$, every mean-based algorithm must use at least $\exp(\Omega(n^{1/3}))$ traces.

As mentioned, we can also treat subconstant $\delta$ and subconstant $\rho = 1-\delta$:

Theorem 1.3 (Deletion channel general matching bounds).

The matching bounds in Theorems 1.1 and 1.2 extend as follows: For $(\log^3 n)/n \le \delta \le 1/2$, the matching bound is $\exp(\Theta((\delta n)^{1/3}))$ (and for any smaller $\delta$ we have a $\mathrm{poly}(n)$ upper bound). Writing $\rho := 1-\delta$ for the “retention” probability, for $1/\sqrt{n} \le \rho \le 1/2$ the matching bound is $\exp(\Theta((n/\rho)^{1/3}))$.

For simplicity in the main portion of the paper we consider only the deletion channel and prove the above results. In Appendix A we consider a more general channel that allows for deletions, insertions, and bit-flips, and prove the following result, which extends Theorem 1.1 to that more general channel and includes Theorem 1.1 as a special case.

Theorem 1.4 (General channel positive result).

Let be the general channel described in Section A.1 with deletion probability , insertion probability , and bit-flip probability . Define

Then there is an algorithm for -channel trace reconstruction using samples and running time bounded by

Since some slight technical and notational unwieldiness is incurred by dealing with the more general channel, we defer the proof of Theorem 1.4 to Appendix A; however, we note here that the main core of the proof is unchanged from the deletion-only case. We additionally note that, as discussed in Appendix A, a curious aspect of the upper bound given by Theorem 1.4 is that having a constant insertion rate can make it possible to perform trace reconstruction even when the deletion rate is much higher than what Theorem 1.3 can handle in the absence of insertions. A possible intuitive explanation for this is that having random insertions could serve to “smooth out” worst-case instances that are problematic for a deletion-only model.

1.3 Independent and concurrent work

At the time of writing, we have been informed [Per] that Fedor Nazarov and Yuval Peres have independently obtained results that are substantially similar to Theorems 1.1 and 1.2. Also, Elchanan Mossel has informed us [Mos] that around 2008, Mark Braverman, Avinatan Hassidim and Elchanan Mossel had independently proven (unpublished) superpolynomial lower bounds for mean-based algorithms.

1.4 Our techniques

For simplicity of discussion, we restrict our focus in this section to the question of upper bounding the sample complexity of trace reconstruction for the deletion channel, where every bit gets deleted independently with probability $\delta$. (As discussed above, generalizing the results to channels which also allow for insertions and flips is essentially a technical exercise that does not require substantially new ideas.) As we discuss in Section 3.2, an efficient algorithm follows easily from a sample complexity upper bound via the observation that the minimization problem whose solution yields a sample complexity upper bound extends to a slightly larger convex set, and thus one can use convex (in fact, linear) programming to get an algorithmic result. Hence the technical meat of the argument lies in upper bounding the sample complexity.

The key enabling idea for our work is to take an analytic view on the combinatorial process defined by the deletion channel. More precisely, consider two distinct source strings $x \neq y$. A necessary (and sufficient) condition to upper bound the sample complexity of trace reconstruction is to lower bound the statistical distance between the two distributions of traces of $x$ versus traces of $y$ (let us write $\mathcal{D}_x$ and $\mathcal{D}_y$ to denote these two distributions). Since analyzing the statistical distance between the distributions $\mathcal{D}_x$ and $\mathcal{D}_y$ turns out to be a difficult task, we approach it by considering a limited class of statistical tests.

In [HMPW08] the authors consider “mean-based” algorithms; such algorithms correspond to statistical tests that only use $1$-bit marginals of the distribution of the received string. More precisely, for any position $0 \le j < n$, consider the expected value of the $j$-th coordinate of a trace drawn from $\mathcal{D}_x$ and the corresponding expectation for $\mathcal{D}_y$ (with traces suitably padded to length $n$). The difference between these two quantities is a lower bound on (twice) the statistical distance between $\mathcal{D}_x$ and $\mathcal{D}_y$.

Let us define the vector $d = d(x,y) \in \mathbb{R}^n$ whose $j$-th coordinate $d_j$ is the difference just described: the expected value of the $j$-th coordinate of a trace of $x$, minus the corresponding expectation for $y$.

In this terminology, giving a sample complexity upper bound on mean-based algorithms corresponds to showing a lower bound on $\|d\|_\infty$. A central idea in this paper is to analyze $\|d\|_\infty$ by studying the $z$-transform of the vector $d$. More precisely, for $z \in \mathbb{C}$, we consider the polynomial $\sum_{j=0}^{n-1} d_j z^j$. Elementary complex analysis can be used to show that $\|d\|_\infty$ and $\sup_{|z|=1} \big|\sum_{j=0}^{n-1} d_j z^j\big|$ agree up to a factor of at most $n$.

Thus, for our purposes, it suffices to study $\sup_{|z|=1} \big|\sum_{j=0}^{n-1} d_j z^j\big|$. By analyzing the deletion channel and observing that this expression is itself a polynomial in a suitably transformed variable, we are able to characterize this supremum as the supremum of a certain polynomial (induced by $x$ and $y$) on a certain disk in the complex plane. Thus giving a sample complexity upper bound amounts to lower bounding this supremum across all polynomials induced by distinct $x \neq y$ (essentially, across a class of polynomials closely related to Littlewood polynomials: those polynomials with all coefficients in $\{-1, 0, 1\}$). The technical heart of our sample complexity upper bound is in establishing such a lower bound. Finally, similar ideas and arguments are used to lower bound the sample complexity of mean-based algorithms, by upper bounding the minimum of this supremum across all polynomials induced by distinct $x \neq y$.

2 Preliminaries and terminology

Throughout this paper we will use two slightly nonstandard notational conventions. Bits will be written as $\pm 1$ rather than $0/1$, and strings will be indexed starting from $0$ rather than $1$. Thus the source string will be denoted $x = (x_0, x_1, \dots, x_{n-1}) \in \{-1,1\}^n$; this is the unknown string that the reconstruction algorithm is trying to recover.

We will write $\mathcal{C}$ for the channel through which $x$ is transmitted. In the main body of the paper our main focus will be on the deletion channel with deletion rate $\delta$, in which each bit of $x$ is independently deleted with probability $\delta$. We will also often consider $\rho := 1 - \delta$, the retention probability of each coordinate. In Appendix A we will see that a more general channel that also involves insertions and bit-flips can be handled in a similar way.

We will use boldface to denote random variables. We typically write $\mathbf{y} \sim \mathcal{C}(x)$ to denote that $\mathbf{y}$ is a random trace (or received string, or sample), obtained by passing $x$ through the channel $\mathcal{C}$. Notice the slight inconvenience that the length of $\mathbf{y}$ is a random variable (for the deletion channel this length is always between 0 and $n$); we denote this length by $|\mathbf{y}|$.

We define a trace reconstruction algorithm for channel $\mathcal{C}$ to be an algorithm with the following property: for any unknown source string $x \in \{-1,1\}^n$, when given access to independent strings each distributed according to $\mathcal{C}(x)$, it outputs $x$ with probability at least (say) $2/3$. The sample complexity of the trace reconstruction algorithm is the number of draws from $\mathcal{C}(x)$ that it uses (in the worst case across all $x$ and all draws from $\mathcal{C}(x)$). We are also interested in the algorithm’s (worst-case) running time.

As mentioned earlier we will use basic complex analysis. The following notation will be useful:

Notation 2.1.

We write $D(c, r)$ for the closed complex disk of radius $r$ centered at $c$; i.e., $D(c, r) = \{z \in \mathbb{C} : |z - c| \le r\}$. We write $\partial D(c, r)$ for the boundary of this disk; thus, e.g., $\partial D(0,1)$ is the complex unit circle.

3 Mean traces

We now come to a key definition, that of the mean trace. For now we restrict our focus to $\mathcal{C}$ being the deletion channel (we consider a more general channel in Appendix A).

Although a random trace does not have a fixed length, we can simply define the mean trace of a source string $x \in \{-1,1\}^n$ to be

$\mu(x) := \mathbf{E}_{\mathbf{y} \sim \mathcal{C}(x)}\big[(\mathbf{y}_0, \dots, \mathbf{y}_{|\mathbf{y}|-1}, 0, 0, \dots, 0)\big] \in [-1,1]^n$,   (1)

where $\mathbf{y}$ is padded with zeros so as to be of length exactly $n$. Here “0” has a natural interpretation as a “uniformly random bit” (indeed, a trace reconstruction algorithm could always pad deletion-channel traces with random bits by itself, and this would not change the definition of the mean trace $\mu(x)$).
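For intuition, the following Python sketch (ours; not part of the formal development) computes the mean trace of equation (1) exactly for a small string by enumerating all retention patterns, making the zero-padding convention explicit.

```python
from itertools import product

def exact_mean_trace(x, delta):
    """Exactly compute the mean trace mu(x) of equation (1) for small n by
    summing over all 2^n retention patterns.  Bits of x are +/-1 and each
    trace is padded with zeros up to length n."""
    n = len(x)
    mu = [0.0] * n
    for pattern in product([0, 1], repeat=n):        # 1 = retained, 0 = deleted
        k = sum(pattern)
        p = (1 - delta) ** k * delta ** (n - k)      # probability of this pattern
        trace = [x[i] for i in range(n) if pattern[i]]
        for j, bit in enumerate(trace):              # positions past len(trace) stay 0
            mu[j] += p * bit
    return mu

print([round(v, 4) for v in exact_mean_trace([1, -1, -1, 1, 1, -1], delta=0.3)])
```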

The following is immediate:

Proposition 3.1.

Viewing the domain of $\mu$ as the real vector space $\mathbb{R}^n$, $\mu$ is a (real-)linear function of $x$; that is, each coordinate $\mu(x)_j$ can be written as $\sum_{i=0}^{n-1} c_{ij} x_i$ for some constants $c_{ij}$ (depending on the channel but not on $x$).

3.1 The mean-based (deletion-channel) trace reconstruction model

One of the most basic things that a trace reconstruction algorithm can do is calculate an empirical estimate of the mean trace. A simple Chernoff/union bound shows that, with $O(\log(n)/\epsilon^2)$ samples and $O(n\log(n)/\epsilon^2)$ time, an algorithm can compute an estimator $\widehat{\mu}$ satisfying $\|\widehat{\mu} - \mu(x)\|_\infty \le \epsilon$ with very high probability. The algorithm might then proceed to base its reconstruction solely on $\widehat{\mu}$, without relying on further traces. We call such algorithms “mean-based trace reconstruction algorithms” (Holenstein et al. [HMPW08] called them algorithms based on “summary statistics”). We give a formal definition:

Definition 3.2.

An algorithm in the mean-based (deletion-channel) trace reconstruction model works as follows. Given an unknown source string $x \in \{-1,1\}^n$, the algorithm first specifies an accuracy parameter $\epsilon > 0$. The algorithm is then given an estimate $\widehat{\mu}$ of the mean trace satisfying

$\|\widehat{\mu} - \mu(x)\|_\infty \le \epsilon$.   (2)

We define the “cost” of this portion of the algorithm to be $1/\epsilon^2$. Having been given $\widehat{\mu}$, the algorithm has no further access to the channel, but may do further “postprocessing” computation involving $\widehat{\mu}$. The algorithm should end by outputting $x$.

From the above discussion, we see that an algorithm in the mean-based trace reconstruction model with cost $C$ and postprocessing time $T$ may be converted into a normal trace reconstruction algorithm using $O(C \log n)$ samples and $O(C \cdot n \log n) + T$ time.
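The first stage of such a conversion is simply empirical averaging of padded traces; a minimal Python sketch of this estimator (ours, with illustrative parameter choices) is below.

```python
import random

def estimate_mean_trace(x, delta, num_traces, rng=random):
    """Average of num_traces independent traces of x (bits +/-1), each padded
    with zeros to length n; converges to mu(x) as num_traces grows."""
    n = len(x)
    acc = [0.0] * n
    for _ in range(num_traces):
        trace = [b for b in x if rng.random() >= delta]
        for j, bit in enumerate(trace):
            acc[j] += bit
    return [v / num_traces for v in acc]

x = [1, -1, -1, 1, 1, -1, 1, -1]
print([round(v, 2) for v in estimate_mean_trace(x, delta=0.3, num_traces=20000)])
```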

3.2 The complexity of mean-based (deletion-channel) trace reconstruction

As discussed in [HMPW08], the sample complexity of mean-based trace reconstruction is essentially determined by the minimum distance between the mean traces $\mu(x)$ and $\mu(y)$ of two distinct source strings $x \neq y \in \{-1,1\}^n$. Furthermore, one can get an upper bound on the time complexity of mean-based trace reconstruction if a certain “fractional relaxation” of this minimum mean trace distance is large. We state these observations from [HMPW08] here, using slightly different notation.

Definition 3.3.

Given and , we define:

In both cases, the equality on the right uses Proposition 3.1.

It’s easy to see that in the mean-based trace reconstruction model, it is information-theoretically possible for an algorithm to succeed if and only if its cost exceeds (a constant multiple of) the reciprocal of the square of the minimum mean-trace distance. Thus characterizing the sample complexity of mean-based trace reconstruction essentially amounts to analyzing this minimum distance. For example, to establish our lower bound Theorem 1.2, it suffices to prove that the minimum mean-trace distance is at most $\exp(-\Omega(n^{1/3}))$ for constant $\delta$.
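As a concrete illustration of this quantity (ours, feasible only for very small $n$, and in no way the algorithm of this paper), the following brute-force Python sketch computes the minimum mean-trace distance by exhaustive enumeration.

```python
from itertools import product

def exact_mean_trace(x, delta):
    """mu(x), computed by summing over all 2^n retention patterns."""
    n = len(x)
    mu = [0.0] * n
    for pattern in product([0, 1], repeat=n):
        k = sum(pattern)
        p = (1 - delta) ** k * delta ** (n - k)
        trace = [x[i] for i in range(n) if pattern[i]]
        for j, bit in enumerate(trace):
            mu[j] += p * bit
    return mu

def min_mean_trace_distance(n, delta):
    """min over distinct x, y in {-1,1}^n of ||mu(x) - mu(y)||_infinity."""
    strings = list(product([-1, 1], repeat=n))
    mus = [exact_mean_trace(list(s), delta) for s in strings]
    best = float("inf")
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            best = min(best, max(abs(a - b) for a, b in zip(mus[i], mus[j])))
    return best

for n in range(2, 8):
    print(n, round(min_mean_trace_distance(n, delta=0.5), 6))
```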

Furthermore, as observed in [HMPW08], given an $\epsilon$-accurate estimate $\widehat{\mu}$ of the mean trace $\mu(x)$, as well as the ability to compute the linear map $\mu(\cdot)$ on any input (or even to estimate it to $\epsilon$-accuracy), one can recover $x$ exactly in $\mathrm{poly}(n)$ time by solving a sequence of linear programs. (If the algorithm “knows” $\delta$ it can efficiently compute $\mu(\cdot)$ exactly. But even if it doesn’t “know” $\delta$, it can estimate $\delta$ to sufficient accuracy so that $\mu(\cdot)$ can be estimated to the necessary accuracy, with no significant algorithmic slowdown.) Thus to establish our Theorem 1.1, it suffices to prove that the fractional relaxation of the minimum mean-trace distance is at least $\exp(-O(n^{1/3}))$ for constant $\delta$.

3.3 Reduction to complex analysis

Our next important definition is of a polynomial that encodes the components of the mean trace $\mu(x)$ in its coefficients — a kind of generating function for the channel. We think of its argument $z$ as a complex number.

Definition 3.4.

Given the deletion rate $\delta$ and $x \in \{-1,1\}^n$, we define the deletion-channel polynomial of $x$ to be

$\sum_{j=0}^{n-1} \mu(x)_j\, z^j,$

a polynomial of degree less than $n$. We extend this definition to all $x \in \mathbb{R}^n$ using the linearity of $\mu$.

We now make the step to elementary complex analysis, by relating the size of a mean trace difference to the maximum modulus of the associated deletion-channel polynomial on the unit complex circle (or equivalently, the unit complex disk, by the Maximum Modulus Principle):

Proposition 3.5.

For any $x, y \in \mathbb{R}^n$, we have

$\frac{1}{n}\, \max_{z \in \partial D(0,1)} \Big|\sum_{j=0}^{n-1} \big(\mu(x)_j - \mu(y)_j\big) z^j\Big| \;\le\; \|\mu(x) - \mu(y)\|_\infty \;\le\; \max_{z \in \partial D(0,1)} \Big|\sum_{j=0}^{n-1} \big(\mu(x)_j - \mu(y)_j\big) z^j\Big|.$

Proof.

Recall that $\mu(x) - \mu(y)$ is the length-$n$ vector of coefficients of the polynomial appearing above. The lower bound above is immediate from the triangle inequality. For the upper bound, we use

$\|\mu(x)-\mu(y)\|_\infty \;\le\; \|\mu(x)-\mu(y)\|_2 \;=\; \Big(\tfrac{1}{2\pi}\int_0^{2\pi} \Big|\sum_{j=0}^{n-1}\big(\mu(x)_j-\mu(y)_j\big)e^{\mathrm{i} j\theta}\Big|^2\, d\theta\Big)^{1/2} \;\le\; \max_{z \in \partial D(0,1)} \Big|\sum_{j=0}^{n-1}\big(\mu(x)_j-\mu(y)_j\big)z^j\Big|.$

Here the first inequality is Cauchy–Schwarz, the equality is an elementary fact about complex polynomials (or Fourier series), and the final inequality is obvious. ∎
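The two inequalities of Proposition 3.5 are easy to check numerically; the following short Python sketch (ours) does so for a random coefficient vector standing in for $\mu(x) - \mu(y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
a = rng.uniform(-1, 1, size=n)          # stands in for the vector mu(x) - mu(y)

# Evaluate A(z) = sum_j a_j z^j on a fine grid of the unit circle.
z = np.exp(1j * np.linspace(0, 2 * np.pi, 4096, endpoint=False))
max_on_circle = np.abs(np.polyval(a[::-1], z)).max()   # polyval wants highest degree first
inf_norm = np.abs(a).max()

print(max_on_circle / n <= inf_norm <= max_on_circle)  # expected: True
```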

Let us reconsider Definition 3.3. As a factor of $n$ is negligible compared to the bounds we will prove (which are of the shape $\exp(\pm\Theta(n^{1/3}))$), we may as well analyze the maximum modulus of the deletion-channel polynomial over the unit circle, rather than $\|\mu(x)-\mu(y)\|_\infty$, in the definitions of both quantities from Definition 3.3. We therefore take a closer look at the deletion-channel polynomial.

4 The deletion-channel polynomial

In this section we compute the deletion-channel polynomial. When the deletion channel is applied to some source string $x = (x_0, \dots, x_{n-1})$, each bit $x_i$ is either deleted with probability $\delta$ or else is transmitted at some position $j \le i$ in the received string $\mathbf{y}$. Let us introduce (non-independent) random variables $\mathbf{J}_0, \dots, \mathbf{J}_{n-1}$, where $\mathbf{J}_i = *$ if $x_i$ is deleted and otherwise $\mathbf{J}_i$ is the position in $\mathbf{y}$ at which $x_i$ is transmitted. We thus have

$\sum_{j=0}^{n-1} \mu(x)_j z^j \;=\; \text{“}\mathbf{E}\text{”}\Big[\sum_{i=0}^{n-1} x_i\, z^{\mathbf{J}_i}\Big].$

Here we put the expectation in quotation marks because the expression should count $0$ whenever $\mathbf{J}_i = *$. Observing that $\Pr[\mathbf{J}_i \neq *]$ equals the retention probability $\rho$, if we define the conditional random variable

$\mathbf{J}'_i = (\mathbf{J}_i \mid \mathbf{J}_i \neq *)$

(so $\mathbf{J}'_i$ is a $\{0, 1, \dots, i\}$-valued random variable), then we have

$\sum_{j=0}^{n-1} \mu(x)_j z^j \;=\; \rho \sum_{i=0}^{n-1} x_i\, \mathbf{E}\big[z^{\mathbf{J}'_i}\big].$   (3)

Observing that $\mathbf{J}'_i$ is distributed as $\mathrm{Bin}(i, \rho)$, and letting $\mathbf{b}_1, \dots, \mathbf{b}_i$ denote independent Bernoulli random variables with “success” probability $\rho$, we easily compute

$\mathbf{E}\big[z^{\mathbf{J}'_i}\big] \;=\; \mathbf{E}\big[z^{\mathbf{b}_1 + \cdots + \mathbf{b}_i}\big] \;=\; \big(\mathbf{E}[z^{\mathbf{b}_1}]\big)^i \;=\; (\delta + \rho z)^i.$

Denoting

$w = w(z) := \delta + \rho z,$

we conclude that

$\sum_{j=0}^{n-1} \mu(x)_j z^j \;=\; \rho \sum_{i=0}^{n-1} x_i\, w^i.$
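As a sanity check on this identity (ours, not part of the paper), the following Python sketch computes $\mu(x)$ by exhaustive enumeration and compares the two sides at an arbitrary complex point $z$.

```python
import numpy as np
from itertools import product

def exact_mean_trace(x, delta):
    """mu(x) for +/-1 bits, by summing over all 2^n retention patterns."""
    n = len(x)
    mu = np.zeros(n)
    for pattern in product([0, 1], repeat=n):
        k = sum(pattern)
        p = (1 - delta) ** k * delta ** (n - k)
        trace = [x[i] for i in range(n) if pattern[i]]
        mu[: len(trace)] += p * np.array(trace)
    return mu

x = [1, -1, 1, 1, -1, -1, 1]
delta = 0.35
rho = 1 - delta
mu = exact_mean_trace(x, delta)

z = 0.4 + 0.7j                       # an arbitrary test point
lhs = sum(mu[j] * z ** j for j in range(len(x)))
w = delta + rho * z
rhs = rho * sum(x[i] * w ** i for i in range(len(x)))
print(abs(lhs - rhs))                # ~1e-15: both sides agree
```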

As $z$ ranges over the unit circle $\partial D(0,1)$, $w$ ranges over the radius-$\rho$ circle $\partial D(\delta, \rho)$. Recalling Definition 3.3 and Proposition 3.5, we are led to consider the following two quantities (note that by the Maximum Modulus Principle, these quantities are unchanged whether the supremum is taken over $\partial D(\delta, \rho)$ or over the full disk $D(\delta, \rho)$):

By the Maximum Modulus Principle, both and are nondecreasing functions of . It’s also easy to see that both are nonincreasing functions of their second argument for all (for , consider replacing by ) and observe that for all ). It thus follows that

Our main technical theorems are the following:

Theorem 4.1.

There is a universal constant such that:

for ,
for ,
Theorem 4.2.

There is a universal constant such that:

for ,
for ,

By Definition 3.3, Proposition 3.5, and the discussion at the end of Section 3.2, we have that Theorem 4.2 implies both Theorem 1.2 and the more general sample complexity lower bound in Theorem 1.3. Regarding the algorithmic upper bounds in Theorems 1.1 and 1.3, again from Definition 3.3 and Proposition 3.5 we get that

Thus the upper bounds Theorems 1.1 and 1.3 likewise follow from Theorem 4.1 and the discussion at the end of Section 3.2. (Note that if , we can always pay the bound for the larger value , which is .)

5 Proof of Theorem 4.1

We will need the following:

Theorem 5.1.

([BE97], Corollary 3.2, special case.) Let $p$ be a polynomial with constant coefficient $1$ and all other coefficients bounded by $1$ in modulus. Fix any $0 < a \le 1$, and let $A$ be the arc $\{e^{\mathrm{i}\theta} : -a \le \theta \le a\}$ of the unit circle. Then $\max_{z \in A} |p(z)| \ge e^{-c_1/a}$ for some universal constant $c_1 > 0$.

We remark that for any angle $\theta_0$, Theorem 5.1 holds for the arc $\{e^{\mathrm{i}\theta} : \theta_0 - a \le \theta \le \theta_0 + a\}$ with no change in the constant $c_1$. This is immediate by applying the theorem to $p(e^{\mathrm{i}\theta_0} z)$.

Proof of Theorem 4.1.

Fix (else the hypotheses are vacuous) and . We call Case I when , and we call Case II when . Select

In Case I we have , and in Case II we have .

Let , where is a polynomial with constant coefficient  and all other coefficients bounded by  in modulus. We need to show

(4)

In Case I, the ray intersects at a unique point, call it . In Case II, the same ray intersects twice (this uses ); call the point of larger modulus . In either case, consider the triangle formed in the complex plane by the points , , and ; it has some acute angle  at  and an angle of  at . By the Law of Sines,

(The last inequality used in Case II.) Writing , Theorem 5.1 (and the subsequent remark) implies that

(5)

Thus

(the last inequality again using in Case II). Substituting in the value of  yields (4). ∎

5.1 An improved version

Although we don’t need it for our application, we can actually provide a stronger version of the results in the previous section that is also self-contained — i.e., it does not rely on Borwein and Erdélyi’s Theorem 5.1. We used that theorem to establish (5); but more strongly than (5), we can show there exists an arc such that

where the left-hand side here denotes the geometric mean of along . (Of course, this is at most the max of along .) To keep the parameters simpler, we will assume (this is the more interesting parameter regime anyway, and it is sufficient to yield our Theorem 1.1). Our alternate arc will be

where is the larger real radius such that . We remark that still , by virtue of , and it is not hard to show that the endpoint of , call it , again satisfies . Thus instead of using Theorem 5.1 as a black box, we could have completed our proof of Theorem 4.1 using the following:

Theorem 5.2.

Let be a polynomial with constant coefficient  and all other coefficients in . Fix any , , and let be the arc . Then .

Our proof will require one standard fact from the theory of “Mahler measures”:

Fact 5.3.

Let $p$ be a complex polynomial and let $C$ be a circle in the complex plane with center $z_0$. Then the geometric mean of $|p|$ over $C$ is at least $|p(z_0)|$.

Proof.

By a linear transformation we may assume $C$ is the unit circle $\partial D(0,1)$ (so $z_0 = 0$). Express $p(z) = c \prod_j (z - r_j)$, where the $r_j$’s are the roots of $p$. Then the geometric mean of $|p|$ over $\partial D(0,1)$ — known as $p$’s Mahler measure, see e.g. [Smy08] — is exactly equal to $|c| \prod_j \max(1, |r_j|)$. (Since the geometric mean is multiplicative, this statement follows immediately from the elementary fact that the geometric mean of $|z - r|$ over $z \in \partial D(0,1)$ is $\max(1, |r|)$.) But clearly we have $|c| \prod_j \max(1, |r_j|) \ge |c| \prod_j |r_j| = |p(0)|$. ∎
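Fact 5.3 is easy to confirm numerically; the following Python sketch (ours) compares the geometric mean of $|p|$ over the unit circle with the Mahler-measure formula and with $|p(0)|$ for a random polynomial.

```python
import numpy as np

rng = np.random.default_rng(1)
coeffs = rng.standard_normal(8)            # a random degree-7 polynomial (highest degree first)

# Geometric mean of |p| over the unit circle, by numerical integration.
theta = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
vals = np.abs(np.polyval(coeffs, np.exp(1j * theta)))
geometric_mean = np.exp(np.mean(np.log(vals)))

# Mahler measure formula: |leading coefficient| * prod of max(1, |root|).
roots = np.roots(coeffs)
mahler = abs(coeffs[0]) * np.prod(np.maximum(1.0, np.abs(roots)))

center_value = abs(np.polyval(coeffs, 0.0))
print(round(geometric_mean, 4), round(mahler, 4), round(center_value, 4))
# geometric mean over the circle is approximately the Mahler measure, and both are >= |p(0)|
```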

We can now establish Theorem 5.2:

Proof of Theorem 5.2.

Using the bounds on ’s coefficients we have:

(6)
(7)

Let us apply Fact 5.3 with , writing for the complementary arc to  in . We get

(8)

And by (6) we have

(9)

where the second inequality is because the points only have larger than the points in , and the third inequality is because increasing the radius of from to only increases the value of for points on . But now for , the point has and hence

Thus

(10)

the last integral being known. (One can get a much easier integral, with a slightly worse constant, by lower-bounding .) Combining (8), (9), (10) yields the theorem. ∎

6 Proof of Theorem 4.2

The key ingredient is the following theorem from [BEK99]. (Recall that a Littlewood polynomial has all nonzero coefficients either $-1$ or $1$.)

Theorem 6.1 ([BEK99], Theorem 3.3).

For all $n$ there is a nonzero Littlewood polynomial $p_n$ of degree at most $n$ satisfying $|p_n(x)| \le \exp(-c\sqrt{n})$ for all real $x \in [0, 1]$. Here $c > 0$ is a universal constant.
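The phenomenon in Theorem 6.1 can already be seen at tiny degrees by brute force; the following Python sketch (ours, an illustration only and in no way the construction of [BEK99]) searches over all polynomials with coefficients in $\{-1, 0, 1\}$ of a given degree and reports the one with the smallest supremum on $[0,1]$.

```python
import numpy as np
from itertools import product

def smallest_sup_on_01(degree, grid=np.linspace(0.0, 1.0, 201)):
    """Brute-force search over nonzero polynomials with coefficients in
    {-1, 0, 1} and the given degree bound, minimizing the sup of |p| over
    [0,1] (the sup is approximated on a grid)."""
    best_val, best_coeffs = np.inf, None
    for coeffs in product([-1, 0, 1], repeat=degree + 1):
        if not any(coeffs):
            continue
        val = np.abs(np.polyval(coeffs, grid)).max()
        if val < best_val:
            best_val, best_coeffs = val, coeffs
    return best_val, best_coeffs

for d in (4, 6, 8):
    val, coeffs = smallest_sup_on_01(d)
    print(d, round(float(val), 5), coeffs)
```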

By a simple use of the Hadamard Three-Circle Theorem and Maximum Modulus Principle, Borwein and Erdélyi proved in [BE97] that the polynomials in Theorem 6.1 establish tightness of their Theorem 5.1 (up to the constant $c_1$). We quote a result that appears within their proof:

Theorem 6.2 ([BE97], in the first proof of Theorem 3.3 in the “special case”, p. 11).

There are universal constants such that the following holds: For all there exists an integer such that , where is the nonzero Littlewood polynomial from Theorem 6.1.

Remark 6.3.

Actually, Borwein and Erdélyi proved this with an elliptical disk in place of , where has foci at and  and major axis . It is easy to see that , so we wrote in Theorem 6.2 for simplicity and because it loses almost nothing.

We can now prove Theorem 4.2. We state here a slightly more precise version:

Theorem 6.4.

Using the notation , and the notation for an unspecified universal constant , we have

provided . Here are universal constants.

Proof of Theorem 4.2.

With to be specified later, select

where is a universal constant to be specified later. Assuming is sufficiently large we get that , where is as in Theorem 6.2. Applying that theorem, we obtain

(11)

Here the inequality holds in Case I by assuming large enough, and in Case II by taking large enough. Now define