An improved bound on the fraction of correctable deletions

A preliminary conference version of this paper [2], with a weaker bound of 1-2/(k+1) on the fraction of correctable deletions, was presented at the 2016 ACM-SIAM Symposium on Discrete Algorithms (SODA) in January 2016.

Abstract

We consider codes over fixed alphabets against worst-case symbol deletions. For any fixed $k \ge 2$, we construct a family of codes over alphabet of size $k$ with positive rate, which allow efficient recovery from a worst-case deletion fraction approaching $1 - \frac{2}{k+\sqrt{k}}$. In particular, for binary codes, we are able to recover a fraction of deletions approaching $1/(\sqrt{2}+1) = \sqrt{2}-1 \approx 0.414$. Previously, even non-constructively the largest deletion fraction known to be correctable with positive rate was $1 - \Theta(1/\sqrt{k})$, and around $0.17$ for the binary case.

Our result pins down the largest fraction of correctable deletions for $k$-ary codes as $1 - \Theta(1/k)$, since $1 - 1/k$ is an upper bound even for the simpler model of erasures where the locations of the missing symbols are known.

Closing the gap between $\sqrt{2}-1$ and $1/2$ for the limit of worst-case deletions correctable by binary codes remains a tantalizing open question.

1 Introduction

This work concerns error-correcting codes capable of correcting worst-case deletions. Specifically, consider a fixed alphabet $[k] = \{1, 2, \dots, k\}$, and suppose we transmit a sequence of $n$ symbols from $[k]$ over a channel that can adversarially delete an arbitrary fraction $p$ of symbols, resulting in a subsequence of length $(1-p)n$ being received at the other end. The locations of the deleted symbols are unknown to the receiver. The goal is to design a code $C \subseteq [k]^n$ such that every $c \in C$ can be uniquely recovered from any of its subsequences caused by up to $pn$ deletions. Equivalently, for $c \ne c' \in C$, the length of the longest common subsequence of $c, c'$, which we denote by $\operatorname{LCS}(c, c')$, must be less than $(1-p)n$.

In this work, we are interested in the question of correcting as large a fraction $p$ of deletions as possible with codes of positive rate (bounded away from $0$ for $n \to \infty$). That is, we would like $|C| \ge \exp(\Omega_k(n))$ so that the code incurs only a constant factor redundancy (this factor could depend on $k$, which we think of as fixed).

Denote by $p^*(k)$ the limit superior of all $p \in [0,1]$ such that there is a positive rate code family over alphabet $[k]$ that can correct a fraction $p$ of deletions. The value of $p^*(k)$ is not known for any value of $k$. Clearly, $p^*(k) \le 1 - 1/k$: indeed, one can delete all but $n/k$ occurrences of the most frequent symbol in a word to leave one of $k$ possible subsequences, and therefore only trivial codes with $k$ codewords can correct a fraction $1 - 1/k$ of deletions. This trivial limit remains the best known upper bound on $p^*(k)$. We note that this upper bound holds even for the simpler model of erasures where the locations of the missing symbols are known at the receiver (this follows from the so-called Plotkin bound in coding theory).
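The pigeonhole step behind the trivial upper bound can be made concrete with a small sketch. The function name `constant_subsequence` is hypothetical, and the code assumes the input word uses at most `k` distinct symbols.

```python
from collections import Counter
from math import ceil


def constant_subsequence(word, k):
    """Return (symbol, run), where run repeats symbol ceil(n/k) times.

    Assumes `word` uses at most k distinct symbols. By pigeonhole, the most
    frequent symbol occurs at least ceil(n/k) times, so `run` is a
    subsequence of `word` obtained by deleting every other position.
    """
    n = len(word)
    symbol, count = Counter(word).most_common(1)[0]
    target = ceil(n / k)
    if count < target:
        raise ValueError("word uses more than k distinct symbols")
    return symbol, [symbol] * target


# Every length-6 word over 3 symbols collapses to one of only 3 possible runs.
symbol, run = constant_subsequence([1, 2, 1, 3, 2, 1], k=3)
print(symbol, run)
```

Since an adversary can always force one of only `k` received words this way, more than `k` codewords cannot be told apart once the deletion fraction reaches about 1 - 1/k.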

Whether the trivial upper bound can be improved, or whether there are in fact codes capable of correcting deletion fractions approaching $1 - 1/k$, is an outstanding open question concerning deletion codes and the combinatorics of longest common subsequences. Perhaps the most notable of these is the $k = 2$ (binary) case. The current best lower bound on $p^*(2)$ is around $0.17$. This bound comes from the random code, in view of the fact that the expected $\operatorname{LCS}$ of two random words in $\{0,1\}^n$ is at most $0.8263n$ [8]. As the $\operatorname{LCS}$ of two random words in $\{0,1\}^n$ is at least $0.788n$, one cannot prove any lower bound on $p^*(2)$ better than $0.22$ using the random code. Kiwi, Loebl, and Matoušek [7] showed that, as $k \to \infty$, we have $\mathbb{E}[\operatorname{LCS}(w, w')] \sim \frac{2}{\sqrt{k}} n$ for two random words $w, w' \in [k]^n$. This was used in [6] to deduce $p^*(k) \ge 1 - O(1/\sqrt{k})$.

The above discussion only dealt with the existence of deletion codes. Turning to explicit and efficiently decodable constructions, Schulman and Zuckerman [11] constructed constant-rate binary codes which are efficiently decodable from a small constant fraction of worst-case deletions. This was improved in [6]; in the new codes, the rate approaches . Specifically, it was shown that one can correct a fraction of deletions with rate about . In terms of correcting a larger fraction of deletions, codes that are efficiently decodable from a fraction of errors over a sized alphabet were also given in [6].

Our focus in this work is exclusively on the worst-case model of deletions. For random deletions, it is known that reliable communication at positive rate is possible for deletion fractions approaching $1$ even in the binary case. We refer the reader interested in coding against random deletions to the survey by Mitzenmacher [9].

1.1 Our results

Here we state our results informally, omitting the precise computational efficiency guarantees, and omitting the important technical properties of constructed codes related to the “span” of common subsequences (see Section 2 for the definition). The precise statements are in Subsection 4.2 and in Section 5.

Our first result is a construction of codes which are combinatorially capable of correcting a larger fraction of deletions than was previously known to be possible.

Theorem 1 (Informal).

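For all integers $k \ge 2$, $p^*(k) \ge 1 - \frac{2}{k+\sqrt{k}}$. Furthermore, for any desired $\varepsilon > 0$, there is an efficiently constructible family of $k$-ary codes of rate $r(k, \varepsilon) > 0$ such that the $\operatorname{LCS}$ of any two distinct codewords is less than $\frac{2}{k+\sqrt{k}} + \varepsilon$ fraction of the code length. In particular, there are explicit binary codes that can correct a $\sqrt{2} - 1 - \varepsilon$ fraction of deletions, for any fixed $\varepsilon > 0$.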

Note that, together with the trivial upper bound $p^*(k) \le 1 - 1/k$, the result pins down the asymptotics of $1 - p^*(k)$ to $\Theta(1/k)$ as $k \to \infty$. Interestingly, our result shows that deletions are easier to correct than errors (for worst-case models), as one cannot correct a fraction $1/4$ of worst-case errors with positive rate in the binary case.

In our second result we construct codes with the above guarantee together with an efficient algorithm to recover from deletions:

Theorem 2 (Informal).

For any integer $k \ge 2$ and any $\varepsilon > 0$, there is an efficiently constructible family of $k$-ary codes of rate $r(k, \varepsilon) > 0$ that can be decoded in polynomial (in fact near-linear) time from a fraction $1 - \frac{2}{k+\sqrt{k}} - \varepsilon$ of deletions.

1.2 Our techniques

All our results are based on code concatenation, which uses an outer code over a large alphabet with desirable properties, and then further encodes the codeword symbols by a judicious inner code. The inner code comes in two variants: a clean, simpler form, and a dirty, more complicated form that needs a slightly more involved analysis but gives better bounds. For simplicity, let us here describe the clean construction, which when analyzed gives the slightly worse bound of $1 - \frac{2}{k+1}$ as compared to $1 - \frac{2}{k+\sqrt{k}}$. This weaker bound appears in the preliminary conference version [2] of this paper.

The innermost code consists of words of the form for integers with dividing , where stands for the letter repeated times. Informally, we think of these words as oscillating with amplitude (this can be made precise via Fourier transform for example, but we won’t need it in our analysis). The crucial property, that was observed in [4], is that two such words have a long common subsequence only if their amplitudes are close. This property was also exploited in [3] to show a certain weak limitation of deletion codes, namely that in any set of words in , some two of them have an LCS at least .

The effective use of these codes as inner codes in a concatenation scheme relies on a property stronger than absence of long common subsequences between codewords. Informally, the property amounts to absence of long common subsequences between subwords of codewords. For the precise notion, consult the definition of a span in the next section and the statement of Theorem 4 in the following section. Using this, we are able to show that if the outer code has a small value, then the LCS of the concatenated code approaches a fraction of the block length.

For the outer code, the simplest choice is the random code. This gives the existential result (Theorem 14). Using the explicit construction of codes to correct a large fraction of deletions over fixed alphabets from [6] gives us a polynomial (in fact near-linear) time deterministic construction (Theorem 16). While the outer code from [6] is also efficiently decodable from deletions, it is not clear how to exploit this to decode the concatenated code efficiently.

To obtain codes that are also efficiently decodable, we employ another level of concatenation, using Reed–Solomon codes at the outermost level, and the above explicit concatenated code itself as the inner code. The combinatorial LCS property of these codes is established similarly, and is in fact easier, as we may assume (by indexing each position) that all symbols in an outer codeword are distinct, and therefore the corresponding inner codewords are distinct. To decode the resulting concatenated code, we try to decode the inner code (by brute-force) for many different contiguous subwords of the received subsequence. A small fraction of these are guaranteed to succeed in producing the correct Reed–Solomon symbol. The decoding is then completed via list decoding of Reed–Solomon codes. The approach here is inspired by the algorithm for list decoding binary codes from a deletion fraction approaching in [6]. Our goal here is to recover the correct message uniquely, but by virtue of the combinatorial guarantee, there can be at most one codeword with the received word as a subsequence, so we can go over the (short) list and identify the correct codeword. Note that list decoding is used as an intermediate algorithmic primitive even though our goal is unique decoding; this is similar to [5] that gave an algorithm to decode certain low-rate concatenated codes up to half the Gilbert–Varshamov bound via a list decoding approach.

2 Preliminaries

A word is a sequence of symbols from a finite alphabet. For the problems of this paper, only the size of the alphabet and the length of the word are important. So, we will often use $[k]$ for a canonical $k$-letter alphabet, and consider the words indexed by $[n]$. In this case, the set of words of length $n$ over alphabet $[k]$ will be denoted $[k]^n$. We treat symbols in a word as distinguishable. So, if $x$ denotes the second 1 in the word 21011 and we delete the subword 10, the variable $x$ now refers to the first 1 in the word 211.

Below we define some terminology about subsequences that we will use throughout the paper:

  • A subsequence in a word $w$ is any word obtained from $w$ by deleting one or more symbols. In contrast, a subword is a subsequence made of several consecutive symbols of $w$.

  • The span of a subsequence $w'$ in a word $w$ is the length of the smallest subword containing the subsequence. We denote it by $\operatorname{span}_w w'$, or simply by $\operatorname{span} w'$ when no ambiguity can arise.

  • A common subsequence between words $w_1$ and $w_2$ is a pair $(w_1', w_2')$ of subsequences $w_1'$ in $w_1$ and $w_2'$ in $w_2$ that are equal as words, i.e., $\operatorname{len} w_1' = \operatorname{len} w_2'$ and the $i$'th symbols of $w_1'$ and $w_2'$ are equal for each $i$, $1 \le i \le \operatorname{len} w_1'$.

  • For words $w_1, w_2$, we denote by $\operatorname{LCS}(w_1, w_2)$ the length of the longest common subsequence of $w_1$ and $w_2$, i.e., the largest $j$ for which there is a common subsequence between $w_1$ and $w_2$ of length $j$.

A code $C$ of block length $n$ over the alphabet $[k]$ is simply a subset of $[k]^n$. We will also call such codes $k$-ary codes, with binary codes referring to the $k = 2$ case. The rate of $C$ equals $\frac{\log |C|}{n \log k}$.

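For a code $C \subseteq [k]^n$, its LCS value is defined as

$$\operatorname{LCS}(C) = \max_{c_1 \ne c_2 \in C} \operatorname{LCS}(c_1, c_2).$$

Note that a code $C \subseteq [k]^n$ is capable of recovering from $t$ worst-case deletions if and only if $\operatorname{LCS}(C) < n - t$.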
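The criterion above can be checked by brute force on small codes. The following is a minimal sketch using the textbook dynamic program for the longest common subsequence; the function names are hypothetical, and the quadratic-time pairwise check is only meant for toy examples.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def code_lcs(code):
    """LCS value of a code: the maximum LCS over pairs of distinct codewords."""
    return max(lcs_length(c1, c2) for c1, c2 in combinations(code, 2))


def corrects_deletions(code, t):
    """True if every codeword is determined by any subsequence after t deletions."""
    n = len(code[0])
    return code_lcs(code) < n - t


repetition = ["0000", "1111"]
print(code_lcs(repetition), corrects_deletions(repetition, 3))
```

The repetition code in the usage line has LCS value 0, so it tolerates 3 deletions out of 4 symbols, but it has only two codewords and hence no positive rate as the length grows.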

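We define the span of a common subsequence $w = (w_1', w_2')$ of words $w_1$ and $w_2$ as

$$\operatorname{span} w = \operatorname{span}_{w_1} w_1' + \operatorname{span}_{w_2} w_2'.$$

The span will play an important role in our analysis of $\operatorname{LCS}$ of the codes we construct, by virtue of the fact that if $\operatorname{span} w \ge b \cdot \operatorname{len} w$ for every common subsequence $w$ of $w_1, w_2$, then $\operatorname{LCS}(w_1, w_2) \le \frac{1}{b}(\operatorname{len} w_1 + \operatorname{len} w_2)$. Our result will be based on a construction for which we can take $b \approx k + \sqrt{k}$ for long enough common subsequences of any distinct pair of codewords.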
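The span of a common subsequence can be computed directly from the positions it occupies in the two words. The sketch below assumes that the span of a common subsequence is the sum of the spans of its two sides; the helper names and the representation of a matching as index pairs are illustrative choices.

```python
def span(positions):
    """Length of the smallest subword covering the given sorted 0-based positions."""
    if not positions:
        return 0
    return positions[-1] - positions[0] + 1


def common_span(match):
    """Span of a common subsequence given as (i, j) index pairs.

    `match` must be strictly increasing in both coordinates; pair (i, j)
    means symbol i of the first word is matched to symbol j of the second.
    Assumption: the span of the pair is the sum of the two one-sided spans.
    """
    left = [i for i, _ in match]
    right = [j for _, j in match]
    return span(left) + span(right)


s1, s2 = "00001111", "01010101"
# The common subsequence 01111, placed as tightly as possible in s1.
match = [(3, 0), (4, 1), (5, 3), (6, 5), (7, 7)]
assert all(s1[i] == s2[j] for i, j in match)
print(common_span(match), len(match))
```

Because each one-sided span is at most the length of the word it lives in, a lower bound of the form span ≥ b · length immediately caps the length of any common subsequence at (len s1 + len s2)/b.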

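Concatenated codes. Our results heavily use the simple but useful idea of code concatenation. Given an outer code $C_{\mathrm{out}} \subseteq [K]^n$, and an injective map $\tau \colon [K] \to [k]^m$ defining the encoding function of an inner code $C_{\mathrm{in}}$, the concatenated code $C_{\mathrm{concat}} \subseteq [k]^{nm}$ is obtained by composing these codes as follows. If $(c_1, \dots, c_n) \in [K]^n$ is a codeword of $C_{\mathrm{out}}$, the corresponding codeword in $C_{\mathrm{concat}}$ is $(\tau(c_1), \dots, \tau(c_n)) \in [k]^{nm}$. The words $\tau(c_i)$ will be referred to as the inner blocks of the concatenated codeword, with the $i$'th block corresponding to the $i$'th outer codeword symbol.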
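Concatenation as described above amounts to a symbol-by-symbol substitution. A minimal sketch, with a hypothetical function name and a toy inner map chosen purely for illustration:

```python
def concatenate(outer_code, inner_map):
    """Replace each outer symbol by its inner block and join the blocks.

    outer_code: iterable of codewords, each a sequence of outer symbols.
    inner_map: dict from outer symbols to inner words (strings); it must be
    injective so that distinct outer symbols get distinct inner blocks.
    """
    if len(set(inner_map.values())) != len(inner_map):
        raise ValueError("inner map must be injective")
    return ["".join(inner_map[s] for s in codeword) for codeword in outer_code]


outer = [(1, 2), (2, 1)]              # toy outer code over the alphabet {1, 2}
inner = {1: "0011", 2: "0101"}        # toy injective inner map into binary words
print(concatenate(outer, inner))
```

The concatenated code has exactly as many codewords as the outer code, and its block length is the outer block length times the inner block length.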

3 Alphabet reduction for deletion codes

Fix to be the alphabet size of the desired deletion code. We shall show how to turn words over -letter alphabet, for , without large common subsequence into words over -letter alphabet without large common subsequence. More specifically, for any and large enough integer , we give a method to transform a deletion code with into a deletion code with . The transformation lets us transform a crude dependence between the alphabet size of the code and its LCS value (i.e., between and ), into a quantitatively strong one, namely . The code will in fact be obtained by concatenating with an inner -ary code with codewords, and therefore has the same cardinality as . The block length of will be much larger than , but the ratio will be bounded as a function of , and . The rate of will thus only be a constant factor smaller than that of .

Specifically, we prove the following.

Theorem 3.

Let be a code with , and let be an integer. Then there exists an integer satisfying , and an injective map such that the code for obtained by replacing each symbol in codewords of by its image under has the following property: if is a common subsequence between two distinct codewords , then

(1)

In particular, since , we have .

Thus, one can construct codes over a size alphabet with LCS value approaching by starting with an outer code with LCS value over any fixed size alphabet, and concatenating it with a constant-sized map. The span property will be useful in concatenated schemes to get longer, efficiently decodable codes.

The key to the above construction is the inner map, which comes in two variants, one “clean” and one “dirty” form. The former is simpler to describe, so we present it first.

3.1 The clean construction

The aim of the clean construction is to prove the following:

Theorem 4.

Let be a code with , and let be an integer. Then there exists an integer satisfying , and an injective map such that the code for obtained by replacing each symbol in codewords of by its image under has the following property: if is a common subsequence between two distinct codewords , then

In particular, since , we have .

We start by describing the way to encode symbols from the alphabet as words over that underlies Theorem 4. Let be constant to be chosen later. For an integer dividing , define the word of “amplitude ” to be

(2)

where $a^A$ stands for the letter $a$ repeated $A$ times. The crucial property of these words is that two of them, with amplitudes $A$ and $B$, have no long common subsequence if $A/B$ is large (or small); for the proof see one of [4, 3]. In the present work, we will need a more general “asymmetric” version of this observation: we will need to analyze common subsequences in subwords of the two words (which may be of different lengths).

Let be an integer to be chosen later. For a word over alphabet denote by the word obtained from via the substitution

(3)

to each symbol of . Note that . If a symbol is obtained by expanding symbol , then we say that is a parent of .
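To make the clean construction tangible, here is a small sketch. It assumes the amplitude-A word has the form (1^A 2^A … k^A)^(L/A), written here over the symbols of a string `alphabet`, and it takes the assignment of amplitudes to outer symbols as an explicit dictionary, since the exact substitution is a design parameter. All names are hypothetical.

```python
def amplitude_word(A, L, alphabet="01"):
    """Word of amplitude A: each letter repeated A times, cycled L/A times.

    Assumed form: (a_1^A a_2^A ... a_k^A)^(L/A), of total length k*L.
    """
    if L % A != 0:
        raise ValueError("A must divide L")
    period = "".join(letter * A for letter in alphabet)
    return period * (L // A)


def expand(word, amplitudes, L, alphabet="01"):
    """Substitute each outer symbol by the amplitude word assigned to it."""
    return "".join(amplitude_word(amplitudes[s], L, alphabet) for s in word)


def lcs_length(a, b):
    """Textbook dynamic program for the longest common subsequence length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


L = 64
slow, fast = amplitude_word(64, L), amplitude_word(1, L)
# Words of very different amplitude share only about half of their symbols.
print(len(slow), lcs_length(slow, fast))
```

With the binary alphabet, the amplitude-64 word is 0^64 1^64 and the amplitude-1 word is (01)^64; their longest common subsequence has length 65 out of 128, compared with 128 for two equal words.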

Analysis of clean construction

Lemma 5.

For a natural number , let be the (infinite) word

Let , where be natural numbers, and suppose is a common subsequence between and . Then

Proof.

The words and are concatenations of chunks, which are subwords of the form and respectively. A chunk in is spanned by subsequence if the span of contains at least one symbol of the chunk. Similarly, we define chunks spanned by in . We will estimate how many chunks are spanned by and by .

As a word, a common subsequence is of the form where and the exponents are positive. The subsequence spans at least chunks in . Similarly, spans at least chunks in . Therefore the total number of symbols in chunks spanned by in both and in is at least

We then estimate according to whether :

In both cases we have

Note that the chunks spanned by are distinct from chunks spanned by for . So, the total number of symbols in all chunks spanned by subsequence in both and is at least

The total span of might be smaller since the first and the last chunks in each of and might not be fully spanned. Subtracting to account for that gives the stated result. ∎

Let be a common subsequence between and . We say that the ’th symbol in is well-matched if the parents of and of are the same letter of . A common subsequence is badly-matched if none of its symbols are well-matched; see Figure 1 below for an example.

Figure 1: A badly-matched common subsequence between the expansions of two words. In the drawing, the top word has symbols 1, 3, 3, 2, 1 and the bottom word has symbols 2, 2, 1, 3, 1; each symbol is expanded into a block of 8 binary symbols, and line segments join the matched symbols.

Lemma 6.

Suppose are words over alphabet and is a badly-matched common subsequence between and as defined in (3). Then

Proof.

We subdivide the common subsequence into subsequences such that, for each and each , the symbols matched by in belong to the expansion of the same symbol in . We choose the subdivision to be a coarsest one with this property (see Figure 2 below for an example). That implies that pairs of symbols of and matched by and by are different. In particular, expansions of at least symbols of and (except possibly the expansions of the leftmost and rightmost symbols of each of them) are fully contained in the spans of and . Therefore, we have

Since is badly-matched, by the preceding lemma we then have

The lemma then follows from collecting together the two terms involving , and then dividing by . ∎

Figure 2: Partition of the common subsequence from Figure 1 into subsequences as in the proof of Lemma 6. The drawing repeats Figure 1 with each of the seven parts of the partition shaded in gray.

The next step is to drop the assumption in Lemma 6 that the common subsequence is badly-matched. By doing so we incur an error term involving .

Lemma 7.

Suppose are words over alphabet and is a common subsequence between and . Then

Proof.

Without loss of generality, the subsequence is locally optimal, i.e., every alteration of that increases also increases . Define an auxiliary bipartite graph whose two parts are the symbols in and the symbols in . For each well-matched symbol in we join the parent symbols in and by an edge.

We may assume that each vertex in has degree at most . Indeed, suppose a symbol is adjacent to three symbols with being in between and . Then we alter by first removing all matches between and , and then completely matching with . The alteration does not increase , and the result is a common subsequence that is at least as long as , and whose auxiliary graph has fewer edges. We can then repeat this process until no vertex has degree exceeding .

Consider a maximum-sized matching in . On one hand, it has at most edges. On the other hand, since the maximum degree of is at most , the maximum-sized matching has at least edges. Hence, .

Remove from all well-matched symbols to obtain a common subsequence . The new subsequence satisfies

It is also clear that is a badly-matched common subsequence. From the previous lemma

Since , the lemma follows. ∎

We are now ready to prove Theorem 4 by picking parameters suitably.

Proof of Theorem 4.

Recall that we are starting with a code with . Given and an integer , pick parameters

in the construction (2) and (3). Define and as , and let , where , be the code obtained as in the statement of Theorem 4. Note that by our choice of parameters.

By Lemma 7, we can conclude that any common subsequence of two distinct codewords of satisfies

Since and , the right hand side is at least , as desired. ∎

Remark 1 (Bottleneck for analysis).

We now explain why the analysis in Theorem 4 is limited to proving correctability of a $1/3$ fraction of deletions for binary codes (a similar argument holds for larger alphabet size $k$). Imagine subwords of length of of the form and respectively, where and . Then the word can be matched fully with (because the latter strings oscillate at a higher frequency than ), and similarly can be matched fully with . Thus we can find a common subsequence of length between the encoded bit strings and , even if and share no common subsequence.

3.2 Dirty construction

We now turn to the more complicated “dirty” construction in which small runs of dirt are interspersed in the long runs of a single symbol from the clean construction.

Dirty construction, binary case

To convey the intuition for the dirty construction let us look more closely at what happened in the binary case. We were looking for subsequences of

and

where both and are large numbers but is much larger than . We are interested in subsequences with small span. Looking more closely at the proof of Lemma 5 we see that such subsequences are obtained by taking every symbol of and discarding essentially half the symbols of as to not interrupt the very long runs in . Now suppose we introduce some “dirt” in by introducing, in the very long stretches of 1’s, some infrequent 2’s, say a 2 every 10th symbol (and similarly some infrequent 1’s in the long stretches of 2’s). Then, during construction of the LCS, when running into such a sporadic 2 we can either try to include it or discard it. As is a large number it is easy to see that while we are matching a 1-segment of we cannot profit by matching the sporadic 2’s. It is also not difficult to see that while passing through a 2-segment of it is not profitable to match more than one sporadic 2 as matching two consecutive sporadic 2’s forces us to drop the ten 1’s in between the two matched 2’s in . The net effect is that introducing some dirt hardly enables us to expand the LCS but does increase the span. We need to introduce dirt in all codewords and it should not look too similar in any two codewords. The way to achieve this is by introducing such dirty runs of different but short lengths in all codewords. Let us turn to a more formal description.
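As a toy illustration of the dirt described above, the sketch below builds a long run of one symbol with the other symbol placed at every 10th position. The function name, the fixed period, and the exact placement of the dirt are assumptions made for illustration only; the formal definition in the text governs the actual construction.

```python
def dirty_run(main, dirt, length, period=10):
    """Run of `main` of the given length with `dirt` at every `period`-th position.

    Hypothetical helper: positions period, 2*period, ... (1-based) hold the
    dirt symbol, all other positions hold the main symbol.
    """
    return "".join(dirt if (i + 1) % period == 0 else main for i in range(length))


run = dirty_run("1", "2", 30)
print(run)             # three blocks of nine 1's, each followed by a single 2
print(run.count("2"))  # dirt makes up one tenth of the run
```

Matching two consecutive dirt symbols of such a run against a long clean stretch of 2's forces the nine 1's between them to be skipped, which is the trade-off the paragraph above describes.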

For the sake of readability we below assume that some real numbers defined are integers. Rounding these numbers to the closest integer only introduces lower order term errors. It is also not difficult to see that we can pick parameters such that all numbers are indeed integers.

Let be a such that . The reason for the upper limit on will be clarified in Remark 2 after the analysis. We define “ dirty ones at amplitude ” to be the string

and let us write this as leaving implicit. We have an analogous string and we allow with the natural interpretation. Remember that in our clean solution, was coded by

In the dirty construction we replace this by

(4)

where is an integer that can be written in the form for an integer , and

(5)

We introduce dirt where the amplitude of the dirt decreases with . We call a string of the form a segment of . The reason for the general length increase by a factor is to accommodate dirt of frequencies that are well separated.

Lemma 8.

Let be the string (or ) and let be a subsequence of , then

Proof.

As is symmetric in 1 and 2 we can assume that . Note that consists of substrings of the form and and copies of each. For the ’th subword of ones (ignoring if it is of length or ), let us assume that 1’s are contained in . Then the span of this subsequence in is at least . Similarly if the ’th string of 2’s contains symbols from then its span in is at least . Summing these inequalities, if is the total number of 1’s in and is the number of 2’s, then the span of in is at least

where the last term comes because we lose in the span for each substring of identical symbols in and there are such substrings.

As the length of is it is sufficient to establish that

(6)

We know that both and are in the range . Since we have and thus it is sufficient to establish (6) for , but in this case it follows from . ∎

The above lemma is the main ingredient in establishing the following lemma.

Lemma 9.

Let be a subsequence of and for , then, provided ,

Proof.

We have that consists of substrings of each of the form

Now partition into substrings according to how it intersects these substrings of . The number of such strings is at most . We want to apply Lemma 8 and we need to address the fact that each might intersect more than one segment of (recall that a segment of is a substring of the form or ). As only has different segments, by refining the partition slightly we can obtain substrings for with , where each satisfies the hypothesis of Lemma 8 with , and . We therefore obtain the inequality

(7)

We have a total of inequalities and as and , summing (7) for the values of gives

Now as we can conclude that

and using , , and , the lemma follows. ∎

Let us slightly abuse notation and in this section let the word obtained from a word via the substitution

(8)

to each symbol of as opposed to (3). As Lemma 9 tells us that subsequences of codings of unequal symbols have a large span, we have the following analog of Lemma 6.

Lemma 10.

Suppose are words over alphabet and is a badly-matched common subsequence between and as defined in (8). Then

Proof.

We use the same subdivision as in the proof of Lemma 6. We have

Since is badly-matched, by the preceding lemma we then have

The lemma then follows from collecting together the two terms involving , and then dividing by . ∎

The transition to allow some well-matched symbols is done as in the clean construction and we get the lemma below. The proof is analogous to that of Lemma 7; in particular, we remove the well-matched symbols, which shortens by at most , and the rest of the proof is essentially identical.

Lemma 11.

Suppose are words over alphabet and is a common subsequence between and . Then

We are now ready to prove the alphabet reduction claim (Theorem 3) via concatenation with the dirty construction at the inner level.

Proof of Theorem 3 (for binary case).

All that remains to be done is to pick parameters suitably. We set to the smallest number greater than such that it can be written in the form for an integer and , and we use this value of . It is not difficult to see that this is possible with . Define (recall that ) and as (as defined in (8)), and let , where , be the code obtained as in the statement of Theorem 3.

By Lemma 11, we can conclude that any common subsequence of two distinct codewords of satisfies

Since , the right hand side is at least , as claimed in (1). ∎

Remark 2.

For the level of dirt discussed here, i.e., , the analysis is optimal for the same reason as the clean one is optimal, as the analysis shows that the dirt is dropped in forming the subsequence. Indeed, in the clean construction the efficient LCS of length spans symbols in the high frequency string and symbols in the low frequency string. Introducing dirt increases the second number to for a total span of . If the value of is larger, then the efficient LCS is obtained by using all symbols, including the dirt, in the low frequency (high amplitude) string. In the high frequency string it spans around

symbols (half of the time we are taking the most common symbol, moving at speed and half the time the other symbol moving at speed ). Thus in this case the total span is and the threshold of for was chosen to maximize .

Dirty construction, general case

Let us give the highlights of the general construction for alphabet size . In this case we define “ dirty ones at frequency ” to be the string

where we assume that is a positive number bounded from above by . We denote this string by and we have analogous dirty strings of other symbols.

The extension of Lemma 8 is as follows.

Lemma 12.

For , let be a string of the form and let be a subsequence of , then,

The proof of this lemma follows along the lines of Lemma 8 with some obvious modifications. If we let be the number of occurrences of ’s in and the total number of other symbols we get a lower bound for the span of the form

By the upper bound on we have