Synchronization Strings: Codes for Insertions and Deletions Approaching the Singleton Bound (Footnote 1: Supported in part by the National Science Foundation through grants CCF-1527110 and CCF-1618280.)

Bernhard Haeupler
Carnegie Mellon University
haeupler@cs.cmu.edu
   Amirbehshad Shahrasbi
Carnegie Mellon University
shahrasbi@cs.cmu.edu
Abstract

We introduce synchronization strings, which provide a novel way of efficiently dealing with synchronization errors, i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to deal with than the more commonly considered half-errors, i.e., symbol corruptions and erasures. For every ε > 0, synchronization strings allow one to index a sequence with an alphabet of size polynomial in 1/ε such that one can efficiently transform k synchronization errors into (1 + ε)k half-errors. This powerful new technique has many applications. In this paper, we focus on designing insdel codes, i.e., error correcting block codes (ECCs) for insertion-deletion channels.

While ECCs for both half-errors and synchronization errors have been intensely studied, the latter has largely resisted progress. As Mitzenmacher puts it in his 2009 survey [22]: “Channels with synchronization errors …are simply not adequately understood by current theory. Given the near-complete knowledge we have for channels with erasures and errors … our lack of understanding about channels with synchronization errors is truly remarkable.” Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed, and only since 2016 are there constructions of constant rate insdel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regimes these codes are polynomially far from the optimal rate-distance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago.

A straightforward application of our synchronization string based indexing method gives a simple black-box construction which transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs into the realm of insdel codes. Most notably, for the complete noise spectrum we obtain efficient “near-MDS” insdel codes which get arbitrarily close to the optimal rate-distance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0, 1) and ε > 0 we give insdel codes achieving a rate of 1 − δ − ε over a constant size alphabet that efficiently correct a δ fraction of insertions or deletions.


1 Introduction

Since the fundamental works of Shannon, Hamming, and others, the field of coding theory has advanced our understanding of how to efficiently correct symbol corruptions and erasures. The practical and theoretical impact of error correcting codes on technology and engineering as well as mathematics, theoretical computer science, and other fields is hard to overestimate. The problem of coding for timing errors such as the closely related insertion and deletion errors, however, while also studied intensely since the 60s, has largely resisted such progress and impact so far. An expert panel [8] in 1963 concluded: “There has been one glaring hole in [Shannon’s] theory; viz., uncertainties in timing, which I will propose to call time noise, have not been encompassed …. Our thesis here today is that the synchronization problem is not a mere engineering detail, but a fundamental communication problem as basic as detection itself!” However, as noted in a comprehensive survey [21] in 2010: “Unfortunately, although it has early and often been conjectured that error-correcting codes capable of correcting timing errors could improve the overall performance of communication systems, they are quite challenging to design, which partly explains why a large collection of synchronization techniques not based on coding were developed and implemented over the years.” Or, as Mitzenmacher puts it in his survey [22]: “Channels with synchronization errors, including both insertions and deletions as well as more general timing errors, are simply not adequately understood by current theory. Given the near-complete knowledge we have for channels with erasures and errors …our lack of understanding about channels with synchronization errors is truly remarkable.” We, too, believe that the current lack of good codes and general understanding of how to handle synchronization errors is the reason why systems today still spend significant resources and efforts on keeping very tight controls on synchronization while other noise is handled more efficiently using coding techniques. We are convinced that a better theoretical understanding together with practical code constructions will eventually lead to systems which naturally and more efficiently use coding techniques to address synchronization and noise issues jointly. In addition, we feel that better understanding the combinatorial structure underlying (codes for) insertions and deletions will have impact on other parts of mathematics and theoretical computer science.

In this paper, we introduce synchronization strings, a new combinatorial structure which allows efficient synchronization and indexing of streams under insertions and deletions. Synchronization strings and our indexing abstraction provide a powerful and novel way to deal with synchronization issues. They make progress on the issues raised above and have applications in a large variety of settings and problems. We already found applications to channel simulations, synchronization sequences [21], interactive coding schemes [4, 17, 15, 7, 6, 5], edit distance tree codes [2], and error correcting codes for insertions and deletions, and suspect there will be many more. In this paper we focus on the last application, namely, designing efficient error correcting block codes over large alphabets for worst-case insertion-deletion channels.

The knowledge on efficient error correcting block codes for insertions and deletions, also called insdel codes, severely lags behind what is known for codes for Hamming errors. While Levenshtein [18] introduced and pushed the study of such codes already in the 60s, it took until 1999 for Schulman and Zuckerman [25] to construct the first insdel codes with constant rate, constant distance, and constant alphabet size. Very recent work of Guruswami et al. [13, 10] in 2015 and 2016 gave the first constant rate insdel codes for asymptotically large noise rates, via list decoding. These codes are however still polynomially far from optimal in their rate or decodable distance respectively. In particular, for asymptotically small ε > 0, they achieve a rate polynomial in ε for a relative distance of 1 − ε, or a relative distance polynomial in ε for a rate of 1 − ε (see Section 1.5 for a more detailed discussion of related work).

This paper essentially closes this line of work by designing efficient “near-MDS” insdel codes which approach the optimal rate-distance trade-off given by the Singleton bound. We prove that for any δ ∈ (0, 1) and any constant ε > 0, there is an efficient insdel code over a constant size alphabet with block length n and rate 1 − δ − ε which can be uniquely and efficiently decoded from any δn insertions and deletions. The code construction takes polynomial time, and encoding and decoding can be done in linear and quadratic time, respectively. More formally, let us define the edit distance of two given strings as the minimum number of insertions and deletions required to convert one of them to the other.

Theorem 1.1.

For any δ ∈ (0, 1) and ε > 0 there exists an encoding map E : Σ^k → Σ^n and a decoding map D : Σ^* → Σ^k such that, if ED(E(m), x) ≤ δn, then D(x) = m. Further, k/n ≥ 1 − δ − ε, |Σ| = f(ε) is independent of n, and E and D are explicit and can be computed in linear and quadratic time in n, respectively.

We obtain this code via a black-box construction which transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This transformation, which is a straightforward application of our new synchronization string based indexing method, is so simple that it can be summarized in one sentence:

For any efficient length-n ECC with alphabet bit size Ω((log(1/ε))/ε), attaching to every codeword, symbol by symbol, a random or suitable pseudorandom string over an alphabet of bit size O(log(1/ε)) results in an efficient insdel code with a rate and decodable distance that change by at most ε.

Far beyond just implying Theorem 1.1, this allows us to instantly transfer much of the highly developed understanding for regular ECCs into the realm of insdel codes.

Theorem 1.1 is obtained by using the “near-MDS” expander codes of Guruswami and Indyk [9] as a base ECC. These codes generalize the linear time codes of Spielman [27] and can be encoded and decoded in linear time. Our simple encoding strategy, as outlined above, introduces essentially no additional computational complexity during encoding. Our quadratic time decoding algorithm, however, is slower than the linear time decoding of the base codes from [9], but still rather fast. In particular, a quadratic time decoding for an insdel code is generally very good given that, in contrast to Hamming codes, even computing the distance between the received and the sent/decoded string is an edit distance computation. Edit distance computations usually do not run in sub-quadratic time, which is not surprising given the recent SETH-conditional lower bounds [1]. For the settings of insertion-only and deletion-only errors we furthermore achieve analogs of Theorem 1.1 with linear decoding complexities.

In terms of the dependence of the alphabet bit size on the parameter ε, which characterizes how close a code is to achieving an optimal rate/distance pair summing to one, our transformation seems to inherently produce an alphabet bit size that is near linear in 1/ε. However, the same is true for the state-of-the-art linear-time base ECCs [9], which also have an alphabet bit size polynomial in 1/ε. Existentially, it is known that an alphabet bit size logarithmic in 1/ε is necessary and sufficient, and ECCs based on algebraic geometry [29] achieving such a bound up to constants are known, but their encoding and decoding complexities are higher.

1.1 High-level Overview, Intuition and Overall Organization

While extremely powerful, the concept and idea behind synchronization strings are easily demonstrated. In this section, we explain the high-level approach taken and provide intuition for the formal definitions and proofs to follow. This section also explains the overall organization of the rest of the paper.

1.1.1 Synchronization Errors and Half-Errors

Consider a stream of n symbols over a large but constant size alphabet Σ in which some constant fraction δ of symbols is corrupted.

There are two basic types of corruptions we will consider: half-errors and synchronization errors. Half-errors consist of erasures, that is, a symbol being replaced with a special “?” symbol indicating the erasure, and symbol corruptions, in which a symbol is replaced with any other symbol in Σ. The wording half-error comes from the realization that, when it comes to code distances, erasures are half as bad as symbol corruptions. An erasure is thus counted as one half-error while a symbol corruption counts as two half-errors (see Section 2 for more details). Synchronization errors consist of deletions, that is, a symbol being removed without replacement, and insertions, where a new symbol from Σ is added anywhere.

It is clear that synchronization errors are strictly more general and harsher than half-errors. In particular, any symbol corruption, worth two half-errors, can also be achieved via a deletion followed by an insertion. Any erasure can furthermore be interpreted as a deletion together with the often very helpful extra information where this deletion took place. This makes synchronization errors at least as hard as half-errors. The real problem that synchronization errors bring with them however is that they cause sending and receiving parties to become “out of synch”. This easily changes how received symbols are interpreted and makes designing codes or other systems tolerant to synchronization errors an inherently difficult and significantly less well understood problem.

1.1.2 Indexing and Synchronization Strings: Reducing Synchronization Errors to Half-Errors

There is a simple folklore strategy, which we call indexing, that avoids these synchronization problems: simply enhance any element with a time stamp or element count. More precisely, consecutively number the elements and attach this position count or index to each stream element. Now, if we deal with only deletions, it is clear that the position of any deletion is easily identified via a missing index, thus transforming it into an erasure. Insertions can be handled similarly by treating any stream index which is received more than once as erased. If both insertions and deletions are allowed, one might still have elements with a spoofed or incorrectly received index position, caused by a deletion of an indexed symbol which is then replaced by a different symbol with the same index; this, however, requires two insdel errors. Generally, this trivial indexing strategy can be seen to successfully transform any k synchronization errors into at most k half-errors.
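To make the trivial indexing strategy concrete, here is a minimal Python sketch (our illustration; the helper names are not from the paper) that attaches indices, applies a deletion, and recovers the affected position as an erasure, with indices received more than once likewise treated as erased:

    from collections import Counter

    def index_stream(symbols):
        # Trivial indexing: attach a position count to each stream element.
        return [(i, s) for i, s in enumerate(symbols)]

    def recover(received, n):
        # A deleted index never arrives; an index arriving more than once
        # (caused by insertions) is also treated as erased ("?").
        counts = Counter(i for i, _ in received)
        by_index = dict(received)
        return [by_index[i] if counts[i] == 1 else "?" for i in range(n)]

    sent = index_stream("HELLO")                 # [(0, 'H'), (1, 'E'), ...]
    corrupted = [p for p in sent if p[0] != 2]   # adversary deletes the third symbol
    print(recover(corrupted, 5))                 # ['H', 'E', '?', 'L', 'O']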

In many applications, however, this trivial indexing cannot be used, because having to attach a log(n)-bit long index description to each element of an n-long stream is prohibitively costly (throughout this paper all logarithms are binary). Consider for example an error correcting code of constant rate R over some potentially large but nonetheless constant size alphabet Σ, which encodes Rn log|Σ| bits into n symbols from Σ. Attaching a log(n)-bit index to each symbol would destroy the desirable property of having an alphabet which is independent of the block length and would furthermore reduce the rate of the code from R to Θ(R/log n), which approaches zero for large block lengths. For streams of unknown or infinite length such problems become even more pronounced.
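The rate calculation behind this observation is a one-line computation; with a constant-size alphabet Σ and a log(n)-bit index attached to each symbol, the rate becomes

    R' = R · log|Σ| / (log|Σ| + log n) = Θ(R / log n),

which indeed tends to zero as the block length n grows.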

This is where synchronization strings come to the rescue. Essentially, synchronization strings allow one to index every element in an infinite stream using only a constant size alphabet while achieving an arbitrarily good approximate reduction from synchronization errors to half-errors. In particular, using synchronization strings, k synchronization errors can be transformed into at most (1 + ε)k half-errors, using an alphabet of size independent of the stream length and in fact only polynomial in 1/ε. Moreover, these synchronization strings have simple constructions and fast and easy decoding procedures.

Attaching our synchronization strings to the codewords of any efficient error correcting code which tolerates the usual symbol corruptions and erasures transforms any such code into an efficiently decodable insdel code, while only requiring a negligible increase in the alphabet size. This allows the decades of intense research in coding theory for Hamming-type errors to be transferred into the much harder and less well understood insertion-deletion setting.

1.2 Synchronization Strings: Definition, Construction, and Decoding

Next, we want to briefly motivate and explain how we arrive at a natural definition of these magical indexing sequences over a finite alphabet and what intuition lies behind their efficient constructions and decoding procedures.

Suppose a sender has attached some indexing sequence S one-by-one to each element in a stream, and consider a time at which a receiver has received a corrupted sequence of the first l index descriptors, i.e., a corrupted version of the length-l prefix of S. When the receiver tries to guess or decode the current index, it should naturally consider all indexing symbols received so far and find the “best” prefix of S. This suggests that the prefix of length l of a synchronization string S acts as a codeword for the index position l, and that one should think of the set of prefixes of S as a code associated with the synchronization string S. Naturally, one would want such a code to have good distance properties between any two codewords under some distance measure. While edit distance, i.e., the number of insertions and deletions needed to transform one string into another, seems like the right notion of distance for insdel errors in general, the prefix nature of the codes under consideration guarantees that codewords for indices l and l' > l have edit distance exactly l' − l. This implies that even two very long codewords can have a tiny edit distance. On the one hand, this precludes synchronization codes with a large relative edit distance between their codewords. On the other hand, one should see this phenomenon as simply capturing the fact that at any time a simple insertion of an incorrect symbol carrying the correct next indexing symbol will lead to an unavoidable decoding error. Given this natural and unavoidable sensitivity of synchronization codes to recent corruptions, it makes sense to instead use a distance measure which captures the recent density of errors. In this spirit, we suggest the definition of a, to our knowledge, new string distance measure which we call relative suffix distance, which intuitively measures the worst fraction of insdel errors needed to transform suffixes, i.e., recently sent parts of two strings, into each other. This natural measure, in contrast to a similar measure defined in [2], turns out to induce a metric space on any set of strings.

With these natural definitions for an induced set of codewords and a natural distance metric associated with any such set, the next task is to design a string S for which the set of codewords has as large a minimum pairwise distance as possible. When looking for (infinite) sequences that induce such a set of codewords, and thus can be successfully used as synchronization strings, it becomes apparent that one is looking for highly irregular and non-self-similar strings over a fixed alphabet Σ. It turns out that the correct definition to capture these desired properties, which we call the ε-synchronization property, states that any two neighboring intervals of S with total length l should require at least (1 − ε)l insertions and deletions to transform one into the other, where 0 < ε < 1. A one-line calculation shows that this clean property also implies a large minimum relative suffix distance between any two codewords. Not surprisingly, random strings essentially satisfy this ε-synchronization property, except for local imperfections of self-similarity, such as symbols repeated twice in a row, which would naturally occur in random sequences about every |Σ| positions. This allows us to use the probabilistic method and the general Lovász Local Lemma to prove the existence of ε-synchronization strings. This also leads to an efficient randomized construction.

Finally, decoding any received string to the closest codeword, i.e., the prefix of the synchronization string with the smallest relative suffix distance, can easily be done in polynomial time, because the number of synchronization codewords is linear and not exponential in n, and the (edit) distance computations to each codeword individually can be done via the classical Wagner-Fischer dynamic programming approach.

1.3 More Sophisticated Decoding Procedures

All this provides an indexing solution which transforms any k synchronization errors into at most O(k) half-errors. This already leads to insdel codes which achieve a rate approaching 1 − Θ(δ) for any δ fraction of insdel errors, for δ below a fixed constant. While this is already a drastic improvement over the previously best high-rate codes from [10], whose rate is polynomially far from optimal and which worked only for sufficiently small δ, it is a far less strong result than the near-MDS codes we promised in Theorem 1.1 for every δ ∈ (0, 1).

We were able to improve upon the above strategy slightly by considering an alternative to the relative suffix distance measure, which we call relative suffix pseudo distance (RSPD). RSPD was introduced in [2] and, while neither being symmetric nor satisfying the triangle inequality, can act as a pseudo distance in the minimum-distance decoder. For any set of insdel errors consisting of k_i insertions and k_d deletions, this improved indexing solution leads to a number of half-errors whose dependence on the number of deletions k_d is essentially optimal; this already implies near-MDS codes for deletion-only channels but still falls short for general insdel errors. We leave open the question whether an improved pseudo distance definition can achieve an indexing solution with a negligible number of misdecodings for a minimum-distance decoder.

In order to achieve our main theorem we developed a different strategy. Fortunately, it turned out that achieving a better indexing solution and the desired insdel codes does not require any changes to the definition of synchronization codes, the indexing approach itself, or the encoding scheme; it solely requires a very different decoding strategy. In particular, instead of decoding indices in a streaming manner we consider more global decoding algorithms. We provide several such decoding algorithms in Section 6. In particular, we give a simple global decoding algorithm for which the number of misdecodings goes to zero as the parameter ε of the ε-synchronization string used goes to zero, irrespective of how many insdel errors are applied.

Our global decoding algorithms crucially build on another key property which we prove holds for any ε-synchronization string S, namely, that there is no monotone matching between S and itself which mismatches more than an ε fraction of indices. Besides being used in our proofs, considering this ε-self-matching property has another advantage: we show that this property is easier to achieve than the full ε-synchronization property, and that indeed a random string over an alphabet of size polynomial in 1/ε satisfies it with good probability. This means that, in the context of error correcting codes, one can even use a simple uniformly random string as a “synchronization string”. Lastly, we show that even approximate q-wise independence, for a suitable q, suffices for the desired ε-self-matching property, which, using the celebrated small sample space constructions from [24], also leads to a deterministic polynomial time construction.

Lastly, we provide simpler and faster global decoding algorithms for the setting of deletion-only and insertion-only corruptions. These algorithms are essentially greedy algorithms which run in linear time. They furthermore guarantee that their indexing decoding is error-free, i.e., they only output “I don’t know” for some indices but never produce an incorrectly decoded index. Such decoding schemes have the advantage that one can use them in conjunction with error correcting codes that efficiently recover from erasures (and not necessarily also symbol corruptions).

1.4 Organization of this Paper

The organization of this paper closely follows the flow of the high-level description above.

We start by giving more details on related work in Section 1.5 and introduce the notation used in the paper in Section 2, together with a formal introduction of the two different error types as well as (efficient) error correcting codes and insdel codes. In Section 3, we formalize the indexing problem and (approximate) solutions to it. Section 4 shows how any solution to the indexing problem can be used to transform any regular error correcting code into an insdel code. Section 5 introduces the relative suffix distance and ε-synchronization strings, proves the existence of ε-synchronization strings, and provides an efficient construction. Section 5.2 shows that the minimum suffix distance decoder is efficient and leads to a good indexing solution. We elaborate on the connection between ε-synchronization strings and the ε-self-matching property in Section 6.1 and provide our improved decoding algorithms in the remainder of Section 6.

1.5 Related Work

Shannon was the first to systematically study reliable communication. He introduced random error channels, defined information quantities, and gave probabilistic existence proofs of good codes. Hamming was the first to look at worst-case errors and code distances as introduced above. Simple counting arguments on the volume of balls around codewords, given in the 50's by Hamming and by Gilbert and Varshamov, produce simple bounds on the rate of q-ary codes with relative distance δ. In particular, they show the existence of codes with relative distance δ and rate at least 1 − H_q(δ), where H_q is the q-ary entropy function. This means that for any δ ∈ (0, 1) and ε > 0 there exist codes with relative distance δ and rate 1 − δ − ε over a sufficiently large alphabet. Concatenated codes and the generalized minimum distance decoding procedure introduced by Forney in 1966 led to the first codes which could recover from constant error fractions while having polynomial time encoding and decoding procedures. The rate achieved by concatenated codes for large alphabets with sufficiently small distance δ comes out to be 1 − O(√δ). On the other hand, for δ sufficiently close to one, one can achieve a constant rate of Θ((1 − δ)²). Algebraic geometry codes suggested by Goppa in 1975 later led to error correcting codes which for every ε > 0 achieve the optimal rate of 1 − δ − ε with an alphabet size polynomial in 1/ε, while being able to efficiently correct a δ fraction of half-errors [29].

While this answered the most basic questions, research since then has developed a tremendously powerful toolbox and selection of explicit codes. It attests to the importance of error correcting codes that over the last several decades this research direction has developed into the incredibly active field of coding theory with hundreds of researchers studying and developing better codes. A small and highly incomplete subset of important innovations include rateless codes, such as LT codes [20], which do not require fixing a desired distance at the time of encoding; explicit expander codes [27, 9], which allow linear time encoding and decoding; polar codes [14, 12], which can approach Shannon's capacity polynomially fast; network codes [19], which allow intermediate nodes in a network to recombine codewords; and efficiently list decodable codes [11], which allow one to list-decode codes of relative distance δ from a fraction of about δ symbol corruptions.

While error correcting codes for insertions and deletions have also been intensely studied, our understanding of them is much less well developed. We refer to the 2002 survey by Sloane [26] on single-deletion codes, the 2009 survey by Mitzenmacher [22] on codes for random deletions, and the most general 2010 survey by Mercier et al. [21] for the extensive work done around codes for synchronization errors, and only mention the results most closely related to Theorem 1.1 here: Insdel codes were first considered by Levenshtein [18] and since then many bounds and constructions for such codes have been given. However, while essentially the same volume and sphere packing arguments as for regular codes show that there exist insdel codes capable of correcting a δ fraction of insdel errors with rate approaching 1 − δ, no efficient constructions anywhere close to this rate-distance tradeoff are known. Even the construction of efficient insdel codes over a constant alphabet with any (tiny) constant relative distance and any (tiny) constant rate had to wait until Schulman and Zuckerman gave the first such code in 1999 [25]. Over the last two years Guruswami et al. provided new codes improving over this state of the art in the asymptotically small or large noise regimes, by giving the first codes which achieve a constant rate for noise rates going to one and codes which provide a rate going to one for an asymptotically small noise rate. In particular, [13] gave the first efficient codes over fixed alphabets to correct a deletion fraction approaching one, as well as efficient binary codes to correct a small constant fraction of deletions with rate approaching one. These codes could, however, only be efficiently decoded for deletions and not insertions. A follow-up work gave new and improved codes with similar rate-distance tradeoffs which can be efficiently decoded from insertions and deletions [10]. All of these constructions remain polynomially far from the optimal rate-distance tradeoff, putting the current state of the art for error correcting codes for insertions and deletions pretty much equal to what was known for regular error correcting codes 50 years ago, after Forney's 1965 doctoral thesis.

2 Definitions and Preliminaries

In this section, we provide the notation and definitions we will use throughout the rest of the paper.

2.1 String Notation and Edit Distance

String Notation. Let S and S' be two strings over alphabet Σ. We define S · S' to be their concatenation. For any positive integer k we define S^k to equal k copies of S concatenated together. For i, j ∈ {1, ..., |S|}, we denote the substring of S from the i-th index through and including the j-th index as S[i, j]. Such a consecutive substring is also called a factor of S. For i ∉ {1, ..., |S|} we define S[i] = ⊥ where ⊥ is a special symbol not contained in Σ. We refer to the substring from the i-th index through, but not including, the j-th index as S[i, j). The substrings S(i, j] and S(i, j) are similarly defined. Finally, S[i] denotes the i-th symbol of S and |S| is the length of S. Occasionally, the alphabets we use are the cross product of several alphabets, i.e., Σ = Σ_1 × ... × Σ_m. If S is a string over Σ then we write S = S_1 × ... × S_m, where S_i is the string over Σ_i consisting of the i-th coordinates of the symbols of S.

Edit Distance. Throughout this work, we rely on the well-known edit distance metric defined as follows.

Definition 2.1 (Edit distance).

The edit distance ED(c, c') between two strings c, c' ∈ Σ* is the minimum number of insertions and deletions required to transform c into c'.

It is easy to see that edit distance is a metric on any set of strings; in particular, it is symmetric and satisfies the triangle inequality. Furthermore, ED(c, c') = |c| + |c'| − 2 · LCS(c, c'), where LCS(c, c') is the length of the longest common subsequence of c and c'.
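For concreteness, here is a short Python implementation of this insertion-deletion edit distance via the classical dynamic program for the longest common subsequence (an illustration of the identity above, not code from the paper):

    def lcs_len(a, b):
        # Classic O(|a|*|b|) dynamic program for the longest common subsequence.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def edit_distance(a, b):
        # Insertion/deletion edit distance: ED(a, b) = |a| + |b| - 2 LCS(a, b).
        return len(a) + len(b) - 2 * lcs_len(a, b)

    assert edit_distance("synch", "sync") == 1
    assert edit_distance("ab", "ba") == 2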

We also use some string matching notation from  [2]:

Definition 2.2 (String matching).

Suppose that c and c' are two strings in Σ*, and suppose that * is a symbol not in Σ. Next, suppose that there exist two strings τ_1 and τ_2 in (Σ ∪ {*})* such that |τ_1| = |τ_2|, del(τ_1) = c, del(τ_2) = c', and τ_1[i] ≈ τ_2[i] for all i ∈ {1, ..., |τ_1|}. Here, del is a function that deletes every * in the input string, and a ≈ b holds if a = b or one of a or b is *. Then we say that τ = (τ_1, τ_2) is a string matching between c and c' (denoted τ : c → c'). We furthermore denote by sc(τ) the total number of *'s in τ_1 and τ_2.

Note that the edit distance ED(c, c') between strings c, c' ∈ Σ* is exactly equal to min_{τ : c → c'} sc(τ).
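The following Python sketch (with hypothetical helper names of our choosing) checks the conditions of Definition 2.2 and counts the *'s of a matching:

    def delete_stars(t):
        return [x for x in t if x != "*"]

    def is_string_matching(t1, t2, c1, c2):
        # Definition 2.2: equal lengths, del(t1) = c1, del(t2) = c2, and
        # positionwise agreement unless one of the two symbols is "*".
        return (len(t1) == len(t2)
                and delete_stars(t1) == list(c1)
                and delete_stars(t2) == list(c2)
                and all(a == b or "*" in (a, b) for a, b in zip(t1, t2)))

    def sc(t1, t2):
        # Total number of *'s; minimized over all matchings this equals ED(c1, c2).
        return sum(x == "*" for x in t1) + sum(x == "*" for x in t2)

    # A matching between c1 = "ab" and c2 = "b": align the b's, star out the a.
    t1, t2 = ["a", "b"], ["*", "b"]
    assert is_string_matching(t1, t2, "ab", "b")
    assert sc(t1, t2) == 1  # equals ED("ab", "b")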

2.2 Error Correcting Codes

Next we give a quick summary of the standard definitions and formalism around error correcting codes. This is mainly for completeness and we remark that readers already familiar with basic notions of error correcting codes might want to skip this part.

Codes, Distance, Rate, and Half-Errors

An error correcting code C is an injective function which takes an input string m over alphabet Σ_in of length k and generates a codeword C(m) of length n over alphabet Σ. The length n of a codeword is also called the block length. The two most important parameters of a code are its distance d and its rate R = (k log|Σ_in|)/(n log|Σ|). The rate measures what fraction of bits in the codewords produced by C carries non-redundant information about the input. The code distance d is simply the minimum Hamming distance between any two codewords, and the relative distance δ = d/n measures what fraction of output symbols need to be corrupted to transform one codeword into another.

It is easy to see that if a sender sends out a codeword C(m) of a code with relative distance δ, a receiver can uniquely recover m if she receives a codeword in which less than a δ fraction of symbols are affected by an erasure, i.e., replaced by a special “?” symbol. Similarly, a receiver can uniquely recover the input m if less than δ/2 symbol corruptions, in which a symbol is replaced by any other symbol from Σ, occurred. More generally, it is easy to see that a receiver can recover from any combination of s corruptions and e erasures as long as 2s + e < δn. This motivates defining half-errors to incorporate both erasures and symbol corruptions, where an erasure is counted as a single half-error and a symbol corruption is counted as two half-errors. In summary, any code of distance d can tolerate any error pattern of less than d half-errors.

We remark that in addition to studying codes with decoding guarantees for worst-case error patterns as above one can also look at more benign error models which assume a distribution over error patterns, such as errors occurring independently at random. In such a setting one looks for codes which allow unique recovery for typical error patterns, i.e., one wants to recover the input with probability tending to one rapidly as the block length n grows. While synchronization strings might have applications for such codes as well, this paper focuses exclusively on codes with good distance guarantees which tolerate an arbitrary (worst-case) error pattern.

Synchronization Errors

In addition to half-errors, we study synchronization errors, which consist of deletions, that is, a symbol being removed without replacement, and insertions, where a new symbol from Σ is added anywhere. It is clear that synchronization errors are strictly more general and harsh than half-errors (see Section 1.1.1). The above formalism of codes, rate, and distance works equally well for synchronization errors if one replaces the Hamming distance with edit distance. Instead of measuring the number of symbol corruptions required to transform one string into another, edit distance measures the minimum number of insertions and deletions to do so. An insertion-deletion error correcting code, or insdel code for short, of relative distance δ is a set of codewords for which at least δn insertions and deletions are needed to transform any codeword into another. Such a code can correct any combination of less than δn/2 insertions and deletions. We remark that it is possible for two codewords of length n to have edit distance up to 2n, e.g., if they share no symbols, putting the (minimum) relative edit distance of a code between zero and two and allowing for constant rate codes which can tolerate a fraction of insdel errors approaching one.

Efficient Codes

In addition to codes with a good minimum distance, one furthermore wants efficient algorithms for the encoding and error-correction tasks associated with the code. Throughout this paper we say a code is efficient if it has encoding and decoding algorithms running in time polynomial in the block length. While it is often not hard to show that random codes exhibit a good rate and distance, designing codes which can be decoded efficiently is much harder. We remark that most codes which can efficiently correct symbol corruptions are also efficient for half-errors. For insdel codes the situation is slightly different. While it remains true that any code that can be uniquely decoded from any fraction of deletions can also be decoded from the same fraction of insertions and deletions [18], doing so efficiently is often much easier for the deletion-only setting than for the fully general insdel setting.

3 The Indexing Problem

In this section, we formally define the indexing problem. In a nutshell, this problem is that of sending a suitably chosen string S of length n over an insertion-deletion channel such that the receiver will be able to correctly figure out the indices of most of the symbols he receives. This problem can be trivially solved by sending the string (1, 2, ..., n) over an alphabet of size n. Interesting solutions to the indexing problem, however, do almost as well while using an alphabet of constant size. While very intuitive and simple, the formalization of this problem and its solutions enables an easy use in many applications.

To set up an (n, δ)-indexing problem, we fix n, i.e., the number of symbols which are being sent, and δ, the maximum fraction of symbols that can be inserted or deleted. We further call the string S ∈ Σ^n that is being sent the synchronization string. Lastly, we describe the influence of the worst-case insertions and deletions which transform S into the related string S_τ in terms of a string matching τ = (τ_1, τ_2). In particular, τ is the string matching from S to S_τ such that del(τ_1) = S, del(τ_2) = S_τ, and sc(τ) ≤ nδ, i.e., the total number of insertions and deletions is at most nδ.

Definition 3.1 (-Indexing Algorithm).

The pair (S, D_S) consisting of a synchronization string S ∈ Σ^n and an algorithm D_S is called an (n, δ)-indexing algorithm over alphabet Σ if, for any set of at most nδ insertions and deletions represented by τ which alter S to a string S_τ, the algorithm D_S outputs either ⊥ or an index between 1 and n for every symbol in S_τ.

The symbol ⊥ here represents an “I don't know” response of the algorithm, while an index i output by D_S for the j-th symbol of S_τ should be interpreted as the (n, δ)-indexing algorithm guessing that this symbol was the i-th symbol of S. One seeks algorithms that decode as many indices as possible correctly. Naturally, one can only hope to correctly decode indices of symbols that were correctly transmitted. Next we give formal definitions of both notions:

Definition 3.2 (Correctly Decoded Index).

An indexing algorithm decodes index j correctly under τ if it outputs i for the j-th symbol of S_τ and this symbol stems from S[i] under τ, i.e., there exists a k such that τ_1[k] = S[i] is the i-th non-* symbol of τ_1 and τ_2[k] = S_τ[j] is the j-th non-* symbol of τ_2.

We remark that this definition counts any ⊥ response as an incorrect decoding.

Definition 3.3 (Successfully Transmitted Symbol).

For a string S_τ which was derived from a synchronization string S via τ, we call the j-th symbol of S_τ successfully transmitted if it stems from a symbol of S, i.e., if there exists a k such that the j-th non-* symbol of τ_2 is at position k and τ_1[k] ≠ *.

We now define the quality of an (n, δ)-indexing algorithm by counting the maximum number of misdecoded indices among those that were successfully transmitted. Note that the trivial indexing strategy with S = (1, 2, ..., n), which outputs for each symbol the symbol itself, has no misdecodings. One can therefore also interpret our quality definition as capturing how far from this ideal solution an algorithm is (a gap stemming most likely from the smaller alphabet which is used for S).

Definition 3.4 (Misdecodings of an -Indexing Algorithm).

Let (S, D_S) be an (n, δ)-indexing algorithm. We say this algorithm has at most x misdecodings if, for any τ corresponding to at most nδ insertions and deletions, the number of successfully transmitted symbols whose indices are incorrectly decoded is at most x.

Now, we introduce two further useful properties that an (n, δ)-indexing algorithm might have.

Definition 3.5 (Error-free Solution).

We call (S, D_S) an error-free (n, δ)-indexing algorithm with respect to a set of deletion or insertion patterns if every index it outputs is either ⊥ or correctly decoded. In particular, the algorithm never outputs an incorrect index, even for symbols which were not successfully transmitted.

It is noteworthy that error-free solutions are essentially only obtainable when dealing with the insertion-only or deletion-only setting. In both cases, the trivial solution with S = (1, 2, ..., n), which decodes any index that was received exactly once and outputs ⊥ otherwise, is error-free. We later give some algorithms which preserve this nice property, even over a smaller alphabet, and show how error-freeness can be useful in the context of error correcting codes.

Lastly, another very useful property of some (n, δ)-indexing algorithms is that their decoding process operates in a streaming manner, i.e., the decoding algorithm decides the index output for the j-th received symbol independently of any symbols received after it. While this property is not particularly useful for the error correcting block code application put forward in this paper, it is an extremely important and strong property which is crucial in several applications we know of, such as rateless error correcting codes, channel simulations, interactive coding, edit distance tree codes, and other settings.

Definition 3.6 (Streaming Solutions).

We call (S, D_S) a streaming solution if the decoded index for the j-th element of the received string S_τ only depends on S_τ[1, j].

Again, the trivial solution for the (n, δ)-indexing problem over an alphabet of size n with zero misdecodings can be made streaming by outputting for every received symbol the symbol itself as an index. This solution is also error-free for the deletion-only setting but not error-free for the insertion-only setting. In fact, it is easy to show that an algorithm cannot be both streaming and error-free in any setting which allows insertions.

Overall, the important characteristics of an (n, δ)-indexing algorithm are (a) its alphabet size |Σ|, (b) the bound x on the number of misdecodings, (c) the complexity of the decoding algorithm D_S, (d) the preprocessing complexity of constructing the string S, (e) whether the algorithm works for the insertion-only, the deletion-only, or the full insdel setting, and (f) whether the algorithm satisfies the streaming or error-freeness property. Table 1 gives a summary of the different solutions for the (n, δ)-indexing problem we give in this paper.

Algorithm   | Type    | Misdecodings | Error-free | Streaming | Complexity
------------|---------|--------------|------------|-----------|-----------
Section 5.2 | ins/del |              |            |           |
Section 6.3 | ins/del |              |            |           |
Section 6.4 | del     |              |            |           |
Section 6.5 | ins     |              |            |           |
Section 6.5 | del     |              |            |           |
Section 6.6 | ins/del |              |            |           |

Table 1: Properties and quality of (n, δ)-indexing algorithms with S being an ε-synchronization string; see the referenced sections for the precise quantitative guarantees

4 Insdel Codes via Indexing Algorithms

Next, we show how a good (n, δ)-indexing algorithm (S, D_S) over alphabet Σ_S allows one to transform any regular ECC C with block length n over alphabet Σ_C which can efficiently correct half-errors, i.e., symbol corruptions and erasures, into a good insdel code over alphabet Σ_C × Σ_S.

To this end, we simply attach S, symbol by symbol, to every codeword of C. On the decoding end, we first decode the indices of the symbols that arrived using the indexing part of each received symbol and then interpret the message parts as if they had arrived in the decoded order. Indices for which zero or multiple symbols were received are considered erased. We will refer to this procedure as the indexing procedure. Finally, the decoding algorithm for C is used. These two straightforward algorithms are formally described as Algorithm 1 and Algorithm 2.

Theorem 4.1.

If (S, D_S) guarantees at most x misdecodings for the (n, δ)-indexing problem, then the indexing procedure recovers the codeword sent up to k + 2x half-errors, where k ≤ nδ is the number of insertions and deletions, i.e., the half-error distance between the sent codeword and the one recovered by the indexing procedure is at most k + 2x. If (S, D_S) is error-free, the indexing procedure recovers the codeword sent up to k + x half-errors.

Proof.

Consider a set of insertions and deletions described by τ, consisting of k_d deletions and k_i insertions with k_d + k_i = k. Note that among the n encoded symbols, at most k_d were deleted and at most x of the successfully transmitted ones are decoded incorrectly. Therefore, at least n − k_d − x indices are decoded correctly. If the output consisted only of the correctly decoded indices of successfully transmitted symbols, it would contain up to k_d + x erasures and no symbol corruptions, resulting in a total of k_d + x half-errors. However, any symbol which is incorrectly decoded or inserted may cause a correctly decoded index to become an erasure, by making it appear multiple times, or change one of the original erasures into a corruption error, by making the indexing procedure mistakenly fill in the erased position. Each of the at most k_i + x such symbols increases the number of half-errors by at most one, for a total of at most k_d + x + k_i + x = k + 2x half-errors. For error-free indexing algorithms, no misdecoding results in an incorrect index, so the number of such symbols is k_i instead of k_i + x, leading to the reduced total of k + x half-errors in this case. ∎

This makes it clear that applying an ECC C which is resilient to k + 2x half-errors enables the receiver to fully recover the message m.

0:  Input: message m, synchronization string S = (S[1], ..., S[n])
1:  x ← C(m)
2:  for i = 1 to n do
3:      x'[i] ← (x[i], S[i])
4:  Output: x' = (x'[1], ..., x'[n])
Algorithm 1 Insertion Deletion Encoder
0:  Input: received string x'_τ = ((a_1, s_1), ..., (a_{n'}, s_{n'})), indexing algorithm D_S
1:  (i_1, ..., i_{n'}) ← D_S(s_1, ..., s_{n'})
2:  for j = 1 to n do
3:      if there is a unique k for which i_k = j then
4:          y[j] ← a_k
5:      else
6:          y[j] ← ?
7:  m ← decode y with the decoder of C
8:  Output: m
Algorithm 2 Insertion Deletion Decoder
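A runnable Python rendering of Algorithms 1 and 2; the hooks ecc_encode, ecc_decode, and index_decode stand in for the ECC C and the indexing algorithm D_S and are our own placeholder names:

    def insdel_encode(m, S, ecc_encode):
        # Algorithm 1: encode with the ECC, then attach S symbol by symbol.
        x = ecc_encode(m)                        # codeword of length n over Sigma_C
        return list(zip(x, S))                   # symbols over Sigma_C x Sigma_S

    def insdel_decode(received, n, index_decode, ecc_decode):
        # Algorithm 2: guess an index for every received symbol, treat indices
        # claimed zero or multiple times as erasures, then run the ECC decoder.
        guesses = index_decode([s for _, s in received])
        y = ["?"] * n
        for j in range(1, n + 1):
            hits = [k for k, g in enumerate(guesses) if g == j]
            if len(hits) == 1:
                y[j - 1] = received[hits[0]][0]
        return ecc_decode(y)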

The following theorem, a corollary of Theorem 4.1 and the definition of the indexing procedure, formally states how a good (n, δ)-indexing algorithm over alphabet Σ_S transforms any regular ECC over alphabet Σ_C which can efficiently correct half-errors into a good insdel code over alphabet Σ_C × Σ_S:

Theorem 4.2.

Given an (efficient) (n, δ)-indexing algorithm (S, D_S) over alphabet Σ_S with at most x misdecodings and decoding complexity T_{D_S}(n), and an (efficient) ECC C over alphabet Σ_C with rate R_C, encoding complexity T_E(n), and decoding complexity T_D(n) that corrects up to nδ + 2x half-errors, one obtains an insdel code that can be (efficiently) decoded from up to nδ insertions and deletions. The rate of this code is

R_C · log|Σ_C| / (log|Σ_C| + log|Σ_S|).

The encoding complexity remains T_E(n), the decoding complexity is T_D(n) + T_{D_S}(n), and the preprocessing complexity of constructing the code is the complexity of constructing S and C.
Furthermore, if (S, D_S) is error-free, then choosing a code C which can recover only from nδ + x erasures is sufficient to produce the same quality code.

Note that if one chooses Σ_S such that log|Σ_S| ≤ ε · log|Σ_C|, the rate loss due to the attached symbols will be negligible. With all this in place one can obtain Theorem 1.1 as a consequence of Theorem 4.2.

Proof of Theorem 1.1.

Given the δ and ε from the statement of Theorem 1.1, we choose a sufficiently small ε' > 0 and use Theorem 6.13 to construct a string S of length n over an alphabet Σ_S of size polynomial in 1/ε' satisfying the ε'-self-matching property. We then use the (n, δ)-indexing algorithm (S, D_S) given in Section 6.3 and line 2 of Table 1, whose number of misdecodings x becomes an arbitrarily small constant fraction of n as ε' decreases. Finally, we choose a near-MDS expander code C [9] which can efficiently correct up to nδ + 2x half-errors and has rate at least 1 − δ − ε/2, over an alphabet Σ_C large enough that the attached synchronization symbols decrease the rate by at most a (1 − ε/2) factor. This ensures that the final rate is indeed at least 1 − δ − ε and that the number of insdel errors which can be efficiently corrected is nδ. The encoding and decoding complexities are furthermore straightforward, as is the polynomial preprocessing time for constructing the code, given Theorem 6.13 and [9]. ∎

5 Synchronization Strings

In this section, we formally define and develop ε-synchronization strings, which can be used as the base synchronization string S in our (n, δ)-indexing algorithms.

As explained in Section 1.2, it makes sense to think of the prefixes of a synchronization string S as codewords encoding their length l, as the prefix S[1, l], or a corrupted version of it, will be exactly all the indexing information that has been received by the time the l-th symbol is communicated:

Definition 5.1 (Codewords Associated with a Synchronization String).

Given any synchronization string S, we define the set of codewords associated with S to be the set of prefixes of S, i.e., C_S = {S[1, l] : 1 ≤ l ≤ |S|}.

Next, we define a distance metric on any set of strings, which will be useful in quantifying how good a synchronization string and its associated set of codewords is:

Definition 5.2 (Relative Suffix Distance).

For any two strings c, c' ∈ Σ* we define their relative suffix distance as

RSD(c, c') = max_{k > 0} ED(c(|c| − k, |c|], c'(|c'| − k, |c'|]) / (2k),

i.e., the maximum over all k of the edit distance between the length-k suffixes of c and c', normalized by 2k.
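Since only pairs of equal-length suffixes matter, RSD can be computed by brute force; a short Python sketch, reusing the edit_distance function from Section 2.1 (illustration only):

    def suffix(s, k):
        return s[max(0, len(s) - k):]

    def rsd(c1, c2):
        # Relative suffix distance: worst edit distance between the k-suffixes,
        # normalized by 2k, over all suffix lengths k.
        m = max(len(c1), len(c2))
        return max(edit_distance(suffix(c1, k), suffix(c2, k)) / (2 * k)
                   for k in range(1, m + 1))

    assert rsd("ab", "ab") == 0.0
    assert rsd("aa", "ab") == 1.0   # strings ending in different symbols have RSD one
    assert rsd("xa", "ya") == 0.5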

Next we show that RSD is indeed a distance which satisfies all properties of a metric for any set of strings. To our knowledge, this metric is new. It is, however, similar in spirit to the suffix “distance” defined in [2], which unfortunately is non-symmetric and does not satisfy the triangle inequality but can otherwise be used in a similar manner as RSD in the specific context here (see also Section 6.6).

Lemma 5.3.

For any strings c, c', c'' ∈ Σ* we have

  • Symmetry: RSD(c, c') = RSD(c', c),

  • Non-Negativity and Normalization: 0 ≤ RSD(c, c') ≤ 1,

  • Identity of Indiscernibles: RSD(c, c') = 0 if and only if c = c', and

  • Triangle Inequality: RSD(c, c'') ≤ RSD(c, c') + RSD(c', c'').

In particular, RSD defines a metric on any set of strings.

Proof.

Symmetry and non-negativity follow directly from the symmetry and non-negativity of edit distance. Normalization follows from the fact that the edit distance between two length-k strings can be at most 2k. To see the identity of indiscernibles, note that RSD(c, c') = 0 if and only if for all k the edit distance of the k-suffixes of c and c' is zero, i.e., if for every k the k-suffixes of c and c' are identical. This is equivalent to c and c' being equal. Lastly, the triangle inequality also essentially follows from the triangle inequality for edit distance. To see this, let δ_1 = RSD(c, c') and δ_2 = RSD(c', c''). By the definition of RSD this implies that for all k the k-suffixes of c and c' have edit distance at most 2kδ_1 and the k-suffixes of c' and c'' have edit distance at most 2kδ_2. By the triangle inequality for edit distance, this implies that for every k the k-suffixes of c and c'' have edit distance at most 2k(δ_1 + δ_2), which implies that RSD(c, c'') ≤ δ_1 + δ_2. ∎

With these definitions in place, it remains to find synchronization strings whose prefixes induce a set of codewords with large RSD distance. It is easy to see that the RSD distance of any two strings ending in a different symbol is one. This makes the trivial synchronization string, which uses each symbol in Σ only once, induce an associated set of codewords of optimal minimum-RSD-distance one. Such trivial synchronization strings, however, are not interesting as they require an alphabet size linear in the length n. To find good synchronization strings over constant size alphabets, we give the following important definition of an ε-synchronization string. The parameter ε should be thought of as measuring how far a string is from the perfect synchronization string, i.e., a string of distinct symbols.

Definition 5.4 (-Synchronization String).

String S ∈ Σ^n is an ε-synchronization string if for every i < j < k ≤ n + 1 we have that ED(S[i, j), S[j, k)) > (1 − ε)(k − i). We call the set of prefixes of such a string an ε-synchronization code.
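Definition 5.4 can be checked by brute force over all triples i < j < k; the following Python sketch (using edit_distance from Section 2.1; cubically many checks, so for illustration only) does exactly that:

    from itertools import combinations

    def is_eps_synchronization(S, eps):
        # Check ED(S[i:j], S[j:k]) > (1 - eps) * (k - i) for all i < j < k.
        for i, j, k in combinations(range(len(S) + 1), 3):
            if edit_distance(S[i:j], S[j:k]) <= (1 - eps) * (k - i):
                return False
        return True

    assert is_eps_synchronization("abcd", 0.5)      # all symbols distinct
    assert not is_eps_synchronization("abab", 0.5)  # S[0:2] = S[2:4] violates the property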

The next lemma shows that the ε-synchronization string property is strong enough to imply a large minimum RSD distance between any two codewords associated with it.

Lemma 5.5.

If S is an ε-synchronization string, then RSD(S[1, i], S[1, j]) > 1 − ε for any i < j, i.e., any two codewords associated with S have RSD distance of at least 1 − ε.

Proof.

Let i < j. The ε-synchronization string property of S guarantees that

ED(S(2i − j, i], S(i, j]) > (1 − ε) · 2(j − i).

Note that this holds even if 2i − j < 0, using the convention from Section 2.1 that S[i'] = ⊥ for out-of-range indices i'. To finish the proof we note that the maximum in the definition of RSD includes the term k = j − i, which implies that RSD(S[1, i], S[1, j]) ≥ ED(S(2i − j, i], S(i, j]) / (2(j − i)) > 1 − ε. ∎

5.1 Existence and Construction

The next important step is to show that the ε-synchronization strings we just defined exist, particularly over alphabets whose size is independent of the length n. We show the existence of ε-synchronization strings of arbitrary length for any ε > 0 using an alphabet size which is only polynomially large in 1/ε. We remark that ε-synchronization strings can be seen as a strong generalization of square-free sequences, in which any two neighboring substrings of equal length only have to be different and not also far from each other in edit distance. Thue [28] famously showed the existence of arbitrarily long square-free strings over a ternary alphabet. Thue's methods for constructing such strings, however, turn out to be fundamentally too weak to prove the existence of ε-synchronization strings for any constant ε < 1.

Our existence proof requires the general Lovász local lemma which we recall here first:

Lemma 5.6 (General Lovász local lemma).

Let A = {A_1, ..., A_n} be a set of “bad” events. The directed graph G = (V, E) with V = A is called a dependency graph for this set of events if each event A_i is mutually independent of all the events {A_j : (A_i, A_j) ∉ E}.

Now, if there exists an assignment x : A → (0, 1) such that for all A_i we have

Pr(A_i) ≤ x(A_i) · ∏_{(A_i, A_j) ∈ E} (1 − x(A_j)),

then there exists a way to avoid all events simultaneously, and the probability for this to happen is bounded below by

Pr(¬A_1, ..., ¬A_n) ≥ ∏_{i=1}^{n} (1 − x(A_i)) > 0.

Theorem 5.7.

For any ε ∈ (0, 1) and any n, there exists an ε-synchronization string of length n over an alphabet of size O(ε^{-4}).

Proof.

Let S = T × R be a string of length n over the alphabet Σ = Σ_1 × Σ_2, obtained by concatenating, symbol by symbol, two strings T and R, where T is simply the periodic repetition of 1, 2, ..., t over Σ_1 = {1, ..., t} for t = Θ(ε^{-2}), and R is a uniformly random string of length n over an alphabet Σ_2 of size Θ(ε^{-2}). In particular, |Σ| = t · |Σ_2| = Θ(ε^{-4}).

We prove that S is an ε-synchronization string by showing that with positive probability S contains no bad triple, where (i, j, k) is a bad triple if ED(S[i, j), S[j, k)) ≤ (1 − ε)(k − i).

First, note that a triple (i, j, k) for which k − i ≤ t cannot be a bad triple, as S[i, k) consists of completely distinct symbols by courtesy of T. Therefore, it suffices to show that there is no bad triple (i, j, k) in S with k − i > t.

Let (i, j, k) be a bad triple and let a be the longest common subsequence of S[i, j) and S[j, k). It is straightforward to see that ED(S[i, j), S[j, k)) = (k − i) − 2|a|. Since (i, j, k) is a bad triple, we have that (k − i) − 2|a| ≤ (1 − ε)(k − i), which means that |a| ≥ ε(k − i)/2. With this observation in mind, we say that the interval S[i, k) is a bad interval if it contains a repeating subsequence of length at least ε(k − i)/2.

To prove the theorem, it suffices to show that a randomly generated string contains no bad intervals with non-zero probability. We first upper bound the probability that an interval of length l is bad:

Pr[S[i, i + l) is bad] ≤ (l choose εl) · |Σ_2|^{-εl/2} ≤ (e/ε)^{εl} · |Σ_2|^{-εl/2},

where the first inequality holds because if an interval of length l is bad, then it must contain a repeating subsequence of length εl/2. Any such subsequence can be specified via εl positions in the l-long interval, and the probability that a fixed choice of positions spells out a repeating subsequence in a random string is |Σ_2|^{-εl/2}. The second inequality comes from the fact that (l choose εl) ≤ (el/(εl))^{εl} = (e/ε)^{εl}.

The resulting inequality shows that the probability of an interval of length l being bad is bounded above by C^{-εl} for C = ε · |Σ_2|^{1/2}/e, which can be made an arbitrarily large constant by taking a sufficiently large alphabet size |Σ_2| = Θ(ε^{-2}).

To show that there is a non-zero probability that the uniformly random string R contains no bad interval of length t or larger, we use the general Lovász local lemma stated in Lemma 5.6. Note that the badness of an interval is mutually independent of the badness of all intervals that do not intersect it, and any interval of length l intersects at most l + l' intervals of length l'. We therefore need to find real numbers x_l ∈ (0, 1), one for each interval length l ≥ t, for which

Pr[an interval of length l is bad] ≤ x_l · ∏_{l' ≥ t} (1 − x_{l'})^{l + l'}.    (1)

We have seen that the left-hand side of Equation (1) can be upper bounded by C^{-εl}. We propose x_l = D^{-εl} for some constant D > 1. Using the fact that 1 − x ≥ e^{-2x} for x ∈ [0, 1/2], the product in Equation (1) can be lower bounded by

∏_{l' ≥ t} (1 − D^{-εl'})^{l + l'} ≥ exp(−2 ∑_{l' ≥ t} (l + l') · D^{-εl'}),

and a standard geometric series estimate shows that ∑_{l' ≥ t} (l + l') · D^{-εl'} ≤ c·l for all l ≥ t, where the constant c can be made smaller than ε, since t = Θ(ε^{-2}) makes D^{-εt} = D^{-Θ(1/ε)} decay much faster than the O(1/ε) factors that the summation produces. It therefore suffices to have

C^{-εl} ≤ D^{-εl} · e^{-2cl} for all l ≥ t,

which holds whenever C ≥ D · e^{2c/ε} = O(D). As C can be made arbitrarily large by increasing the constant in the alphabet size |Σ_2| = Θ(ε^{-2}), the condition of Lemma 5.6 can be satisfied, and the Lovász local lemma guarantees that with positive probability no bad interval, and hence no bad triple, exists. This completes the proof. ∎

Remarks on the alphabet size: Theorem 5.7 shows that for any ε there exist ε-synchronization strings over alphabets of size O(ε^{-4}). A polynomial dependence on ε is also necessary. In particular, there do not exist ε-synchronization strings over alphabets of size smaller than ε^{-1}. In fact, any consecutive substring of length ε^{-1} of an ε-synchronization string has to contain completely distinct symbols. This can be easily proven as follows: for the sake of contradiction, let S[i, k) be a substring of an ε-synchronization string with S[i] = S[k − 1] and k − i ≤ ε^{-1}. Then ED(S[i, k − 1), S[k − 1, k)) ≤ (k − i) − 2 < (1 − ε)(k − i), contradicting the ε-synchronization property. We believe that using the Lovász Local Lemma together with a more sophisticated non-uniform probability space, which avoids any repeated symbols within a small distance, allows avoiding the use of the periodic string T in our proof and improving the alphabet size to O(ε^{-2}). It seems much harder to improve the alphabet size to o(ε^{-2}) and we are not convinced that it is possible. This work thus leaves open the interesting question of closing the quadratic gap between O(ε^{-2}) and Ω(ε^{-1}) from either side.

Theorem 5.7 also implies an efficient randomized construction.

Lemma 5.8.

There exists a randomized algorithm which, for any ε ∈ (0, 1), constructs an ε-synchronization string of length n over an alphabet of size O(ε^{-4}) in expected time polynomial in n.

Proof.

Using the algorithmic framework for the Lovász local lemma given by Moser and Tardos [23] and the extensions by Haeupler et al. [16], one can get such a randomized algorithm from the proof of Theorem 5.7. The algorithm starts with a random string over an alphabet of size O(ε^{-4}). It then checks all intervals, together with all split points, for a violation of the ε-synchronization string property. For every interval this is an edit distance computation, which can be done in time O(n^2) using the classical Wagner-Fischer dynamic programming algorithm. If a violating interval is found, the symbols in this interval are assigned fresh random values. This is repeated until no more violations are found. [16] shows that this algorithm performs only a linear expected number of re-samplings, which gives an expected running time polynomial in n overall, as claimed. ∎
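A simplified Python sketch of this re-sampling construction (loosely following the Moser-Tardos framework; the parameterization and helper names are ours, and find_violation reuses edit_distance from Section 2.1):

    import random

    def find_violation(S, eps):
        # Return some triple (i, j, k) violating Definition 5.4, or None.
        n = len(S)
        for i in range(n):
            for j in range(i + 1, n):
                for k in range(j + 1, n + 1):
                    if edit_distance(S[i:j], S[j:k]) <= (1 - eps) * (k - i):
                        return (i, j, k)
        return None

    def construct_sync_string(n, eps, sigma):
        # Start from a uniformly random string and re-sample violating
        # intervals until none remain; per Lemma 5.8, an alphabet of size
        # sigma = O(eps**-4) keeps the expected number of rounds small.
        S = [random.randrange(sigma) for _ in range(n)]
        while (bad := find_violation(S, eps)) is not None:
            i, _, k = bad
            for pos in range(i, k):
                S[pos] = random.randrange(sigma)
        return S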

Lastly, since synchronization strings can be encoded and decoded in a streaming fashion, they have many important applications in which the length of the required synchronization string is not known in advance. In such a setting it is advantageous to have an infinite synchronization string over a fixed alphabet. In particular, since every consecutive substring of an ε-synchronization string is also an ε-synchronization string by definition, having an infinite ε-synchronization string also implies their existence for every length n, i.e., Theorem 5.7. Interestingly, a simple argument shows that the converse is true as well, i.e., the existence of an ε-synchronization string for every length n implies the existence of an infinite ε-synchronization string over the same alphabet:

Lemma 5.9.

For any ε ∈ (0, 1) there exists an infinite ε-synchronization string over an alphabet of size O(ε^{-4}).

Proof of Lemma 5.9.

Fix any ε ∈ (0, 1). According to Theorem 5.7 there exists an alphabet Σ of size O(ε^{-4}) such that there exists at least one ε-synchronization string over Σ for every length n. We will define an infinite string S with S[i] ∈ Σ for every i ∈ N for which the ε-synchronization property holds for every i < j < k. We define this string inductively. In particular, we fix an ordering on Σ and define S[1] to be the first symbol in this ordering such that an infinite number of ε-synchronization strings over Σ start with S[1]. Given that there is an infinite number of ε-synchronization strings over Σ, such a symbol exists. Furthermore, the set of ε-synchronization strings over Σ which start with S[1] remains infinite by definition, allowing us to define S[2] to be the lexicographically first symbol in Σ such that an infinite number of ε-synchronization strings over Σ start with S[1], S[2]. In the same manner, we inductively define S[i] to be the lexicographically first symbol in Σ for which there exists an infinite number of ε-synchronization strings over Σ starting with S[1], ..., S[i]. To see that the infinite string S defined in this manner does indeed satisfy the edit distance requirement of the ε-synchronization property defined in Definition 5.4, we note that for every i < j < k there exists, by definition, an ε-synchronization string, and in fact an infinite number of them, which contains S[1, k] and thus also S[i, k) as a consecutive substring, implying that indeed ED(S[i, j), S[j, k)) > (1 − ε)(k − i), as required. Our definition thus produces the unique lexicographically first infinite ε-synchronization string over Σ. ∎

We remark that any string produced by the randomized construction of Lemma 5.8 is guaranteed to be a correct ε-synchronization string (not just with high probability), since the algorithm only terminates once no violating interval remains. This randomized synchronization string construction is furthermore only needed once, as a pre-processing step. The encoder and decoder of any resulting error correcting code do not require any randomization. Furthermore, in Section 6 we will provide a deterministic polynomial time construction of a relaxed version of ε-synchronization strings that can still be used as a basis for good (n, δ)-indexing algorithms, thus leading to insdel codes with a deterministic polynomial time code construction as well.

It nonetheless remains interesting to obtain fast deterministic constructions of finite and infinite ε-synchronization strings. In a subsequent work we achieve such efficient deterministic constructions for ε-synchronization strings. Our constructions even produce the infinite ε-synchronization string proven to exist by Lemma 5.9, which is much less explicit: while for any ε and any n an ε-synchronization string of length n can in principle be found using an exponential time enumeration, there is no straightforward algorithm which follows the proof of Lemma 5.9 and, given an i, produces the i-th symbol of such an S in a finite amount of time (bounded by some function of i). Our constructions require significantly more work but in the end lead to an explicit deterministic construction of an infinite ε-synchronization string, for any ε > 0, for which the i-th symbol can be computed in time polylogarithmic in i – thus satisfying one of the strongest notions of constructiveness that can be achieved.

5.2 Decoding

We now provide an algorithm for decoding synchronization strings, i.e., an algorithm that, together with an ε-synchronization string, forms a solution to the indexing problem. In the beginning of Section 5, we introduced the notion of relative suffix distance between two strings. Lemma 5.5 stated a lower bound of 1 − ε on the relative suffix distance between any two distinct codewords associated with an ε-synchronization string, i.e., its prefixes. Hence, a natural decoding scheme for detecting the index of a received symbol is to find the prefix with the smallest relative suffix distance to the string received thus far. We call this algorithm the minimum relative suffix distance decoding algorithm.

We define the notion of relative suffix error density at index i, which captures the maximal density of errors over the suffixes of what has been sent up to the i-th symbol. We will introduce a very natural decoding approach for synchronization strings that simply decodes a received string by finding the codeword of a synchronization string (i.e., the prefix of the synchronization string) with minimum distance to the received string. We will show that this decoding procedure works correctly as long as the relative suffix error density is smaller than (1 − ε)/2. Then, we will show that if the adversary is allowed to perform nδ insertions or deletions, the relative suffix error density can be (1 − ε)/2 or larger upon arrival of at most 2nδ/(1 − ε) successfully transmitted symbols. Finally, we will deduce that this decoding scheme decodes the indices of received symbols correctly for all but 2nδ/(1 − ε) of the successfully transmitted symbols. Formally, we claim that:

Theorem 5.10.

Any ε-synchronization string of length n, along with the minimum relative suffix distance decoding algorithm, forms a solution to the (n, δ)-indexing problem that guarantees at most 2nδ/(1 − ε) misdecodings. This decoding algorithm is streaming and can be implemented so that it works in polynomial time.
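The decoder itself is short; a Python sketch building on the rsd function from Section 5 (brute force, illustration only):

    def min_rsd_decode(S, received_so_far):
        # Decode the current position as the prefix of S with minimum
        # relative suffix distance to what has been received so far.
        best_l, best_d = None, float("inf")
        for l in range(1, len(S) + 1):
            d = rsd(S[:l], received_so_far)
            if d < best_d:
                best_l, best_d = l, d
        return best_l

Running this once per received symbol gives a streaming decoder; sharing the underlying dynamic programming tables across calls yields the polynomial running time claimed in Theorem 5.10.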

Before proceeding to the formal statement and the proofs of the claims above, we first provide the following useful definitions.

Definition 5.11 (Error Count Function).

Let S be a string sent over an insertion-deletion channel. We denote the error count from index i to index j with ec(i, j) and define it to be the number of insertions and deletions applied to the communication from the moment S[i] is sent until the moment S[j] is sent. ec(i, j) counts a potential deletion of S[i]; however, it does not count a potential deletion of S[j].

Definition 5.12 (Relative Suffix Error Density).

Let string S be sent over an insertion-deletion channel and let ec denote the corresponding error count function. We define the relative suffix error density of the communication at the time S[i] is sent as:

max_{k ≥ 1} ec(i − k + 1, i) / k.
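As a small sanity check of this definition, consider a single error and watch its density fade (the error count function ec below is a hypothetical example, not from the paper):

    def rsed(ec, i):
        # Relative suffix error density at the time S[i] is sent: the worst
        # error density over all suffixes ending at index i (1-indexed).
        return max(ec(i - k + 1, i) / k for k in range(1, i + 1))

    ec = lambda a, b: 1 if a <= 3 <= b else 0   # one error while S[3] is in transit
    assert rsed(ec, 3) == 1.0    # density 1 over the length-1 suffix
    assert rsed(ec, 6) == 0.25   # the error is amortized over longer suffixes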

The following lemma relates the suffix distance between the message sent by the sender and the message received by the receiver, at any point of a communication over an insertion-deletion channel, to the relative suffix error density of the communication at that point.

Lemma 5.13.

Let string S be sent over an insertion-deletion channel and let the corrupted string S_τ be received on the other end. At any point during the communication, the relative suffix distance between the prefix of S sent so far and the string received so far is at most the relative suffix error density of the communication at that point.

Proof.

Let