
# Optimal coding for the deletion channel with small deletion probability

Yashodhan Kanoria (Department of Electrical Engineering, Stanford University) and Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University)
###### Abstract

The deletion channel is the simplest point-to-point communication channel that models lack of synchronization. Input bits are deleted independently with probability $d$, and when they are not deleted, they are not affected by the channel. Despite significant effort, little is known about the capacity of this channel, and even less about optimal coding schemes. In this paper we develop a new systematic approach to this problem, by demonstrating that capacity can be computed in a series expansion for small deletion probability. We compute three leading terms of this expansion, and find an input distribution that achieves capacity up to this order. This constitutes the first optimal coding result for the deletion channel.

The key idea employed is the following: We understand perfectly the deletion channel with deletion probability $d = 0$. It has capacity 1 and the optimal input distribution is i.i.d. Bernoulli($1/2$). It is natural to expect that the channel with small deletion probabilities has a capacity that varies smoothly with $d$, and that the optimal input distribution is obtained by smoothly perturbing the i.i.d. Bernoulli($1/2$) process. Our results show that this is indeed the case. We think that this general strategy can be useful in a number of capacity calculations.

## 1 Introduction

The (binary) deletion channel accepts bits as inputs, and deletes each transmitted bit independently with probability $d$. Computing or providing systematic approximations to its capacity is one of the outstanding problems in information theory [1]. An important motivation comes from the need to understand synchronization errors and optimal ways to cope with them.

In this paper we suggest a new approach. We demonstrate that capacity can be computed in a series expansion for small deletion probability, by computing the first two orders of such an expansion. Our main result is the following.

###### Theorem 1.1.

Let $C(d)$ be the capacity of the deletion channel with deletion probability $d$. Then, for small $d$ and any $\epsilon > 0$,

$$C(d) = 1 + d\log d - A_1 d + A_2 d^2 + O(d^{3-\epsilon}), \tag{1}$$

where

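$$
\begin{aligned}
A_1 &\equiv \log(2e) - \sum_{l=1}^{\infty} 2^{-l-1}\, l \log l \approx 1.15416377, \\
A_2 &= c_3 + c_4 + \frac{1}{4\ln 2}\left(2 + \frac{3}{2}c_2^2 + \sum_{l=1}^{\infty} 2^{-l}(l\ln l)^2 - c_2\sum_{l=1}^{\infty} 2^{-l}\, l^2 \ln l\right) \approx 1.67814594, \\
c_2 &\equiv \sum_{l=1}^{\infty} 2^{-l}\, l \ln l \approx 1.78628364, \\
c_3 &\equiv \frac{1}{2}\left(-1 + \sum_{l=3}^{\infty} 2^{-l}\left\{\binom{l}{2}\log\binom{l}{2} - l^2\log l + (l-1)(l-3)\log(l-1) + (l-2)\log(l-2)\right\}\right) \approx -0.88636960, \\
c_4 &\equiv \sum_{j=4}^{\infty} 2^{-(2+j)}(j-1)(j-3)\, h\!\left(\frac{1}{j-1}\right) + \sum_{i=2}^{\infty}\sum_{j=4}^{\infty} 2^{-(i+j+1)}(i+j-1)(j-3)\, h\!\left(\frac{i+1}{i+j-1}\right) \approx 0.69001321.
\end{aligned}
$$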

Here $h(\cdot)$ is the binary entropy function, i.e., $h(p) \equiv -p\log p - (1-p)\log(1-p)$.

Further, the binary stationary source defined by the property that the times at which it switches from $0$ to $1$ or vice versa form a renewal process with holding time distribution $p_L^{\dagger}(l) = 2^{-l}\left(1 + d\,(l\ln l - c_2 l/2)\right)$ achieves rate within $O(d^{3-\epsilon})$ of capacity.

Given a binary sequence, we will call ‘runs’ its maximal blocks of contiguous $0$’s or $1$’s. We shall refer to binary sources such that the switch times form a renewal process as sources (or processes) with i.i.d. runs.
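
The notion of a run can be made concrete with a few lines of Python. This is only an illustrative sketch; the function name `run_lengths` is ours and does not appear in the paper.

```python
from itertools import groupby

def run_lengths(bits):
    """Return the lengths of the maximal blocks of equal symbols in `bits`."""
    return [len(list(group)) for _, group in groupby(bits)]

# The sequence 0 0 1 1 1 0 splits into the runs 00, 111, 0.
example = [0, 0, 1, 1, 1, 0]
print(run_lengths(example))  # [2, 3, 1]
```

A source with i.i.d. runs is then one for which the entries of this list (away from the boundaries of the observation window) are independent draws from a single run-length distribution.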

The ‘rate’ of a given binary source is the maximum rate at which information can be transmitted through the deletion channel using input sequences distributed as the source. A formal definition is provided below (see Definition 2.3). Logarithms denoted by $\log$ here (and in the rest of the paper) are understood to be in base $2$. While one might be skeptical about the concrete meaning of asymptotic expansions of the type (1), they often prove surprisingly accurate. For instance at ( of the input symbols are deleted), the expression in Eq. (1) (dropping the error term $O(d^{3-\epsilon})$) is larger than the best lower bound [2] by about bits. The lower bound of [2] is derived using a Markov source and ‘jigsaw’ decoding. Our asymptotic analysis implies that the loss in rate due to restricting to Markov sources and jigsaw decoding (cf. Theorem 6.1 and Remark 6.2), to leading order, is . Hence, we estimate that our asymptotic approach incurs an error of about bits for computing the capacity at .

More importantly, asymptotic expansions can provide useful design insight. Theorem 1.1 shows that the stationary process consisting of i.i.d. runs with the specified run length distribution achieves capacity to within $O(d^{3-\epsilon})$. In comparison, the best performing approach tried before this was to use a first order Markov source for coding [2]. We are able to show, in fact, that this approach incurs a loss that is , which is the same order as the loss incurred by the trivial approach of using i.i.d. Bernoulli($1/2$)!

###### Remark 1.2.

In this work, we prove rigorous upper and lower bounds on capacity that match up to quadratic order in $d$ (cf. Theorem 1.1), but without explicitly evaluating the constants in the error terms. It would be very interesting to obtain explicit expressions for these constants.

Before this work, there was no non-trivial optimal coding result known for the deletion channel (the trivial exception is the case $d = 0$, for which the i.i.d. Bernoulli($1/2$) process achieves capacity). Further terms in the capacity expansion can be expected to supply even more detailed information about the optimal coding scheme and allow us to achieve capacity to higher orders.

We think that the strategy adopted here might be useful in other information theory problems. The underlying philosophy is that whenever capacity is known for a specific value of the channel parameter, and the corresponding optimal input distribution is unique and well characterized, it should be possible to compute an asymptotic expansion around that value. In the present context the special channel is the perfect channel, i.e. the deletion channel with deletion probability $d = 0$. The corresponding input distribution is the i.i.d. Bernoulli($1/2$) process.

### 1.1 Related work

Dobrushin [3] proved a coding theorem for the deletion channel, and other channels with synchronization errors. He showed that the maximum rate of reliable communication is given by the maximal mutual information per bit, and proved that this can be achieved through a random coding scheme. This characterization has so far found limited use in proving concrete estimates. An important exception is provided by the work of Kirsch and Drinea [4], who use Dobrushin's coding theorem to prove lower bounds on the capacity of channels with deletions and duplications. We will also use Dobrushin's theorem in a crucial way, although most of our effort will be devoted to proving upper bounds on the capacity.

Several capacity bounds have been developed over the last few years, following alternative approaches, and are surveyed in [1]. In particular, it has been proved that as [5]. The papers [6, 7] improve the upper bound in this limit obtaining . However, determining the asymptotic behavior in this limit (i.e. finding a constant such that ) is an open problem. When applied to the small $d$ regime, none of the known upper bounds actually captures the correct behavior as stated in Eq. (1). A simple calculation shows that the first upper bound in [8] has asymptotics of . Another work [6] shows that as . As we show in the present paper, this behavior can be controlled exactly, up to the third leading term of the expansion.

A short version of this paper was presented at the 2010 International Symposium on Information Theory (ISIT) [9]. At the same conference, Kalai, Mitzenmacher and Sudan [10] presented a result analogous to Theorem 1.1. The proof is based on a counting argument, very different from the techniques employed here. Also, the result of [10] is not the same as in Theorem 1.1, since only the term of the series is established in [10]. Theorem 1.1 improves on our ISIT result [9], which contained only the first two terms in the series expansion, but not the order $d^2$ term. Also, we obtain a non-trivial coding scheme for the first time in this paper. The trivial i.i.d. Bernoulli($1/2$) coding scheme is enough to achieve capacity up to linear order, as shown in our conference paper [9].

### 1.2 Numerical illustration of results

We can numerically evaluate the expression in Eq. (1) (dropping the error term) to obtain estimates of capacity for small deletion probabilities.

$$C_{\text{est}} = 1 + d\log d - A_1 d + A_2 d^2.$$

The values of $C_{\text{est}}$ are presented in Table 1 and Figure 1. We compare with the best known numerical lower bounds [2] and upper bounds [6, 8].

We stress here that $C_{\text{est}}$ is neither an upper nor a lower bound on capacity. It is an estimate based on taking the leading terms of the asymptotic expansion of capacity for small $d$, and is expected to be accurate for small values of $d$. Indeed, we see that for $d$ larger than , our estimate exceeds the upper bound. This simply indicates that we should not use $C_{\text{est}}$ as an estimate for such large $d$. We believe that $C_{\text{est}}$ provides an excellent estimate of capacity for .
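
The estimate can be evaluated with a short script. The sketch below assumes the constants of Theorem 1.1 are read as written in the code comments (with `log` in base 2 and `ln` natural); the truncation point `N` and all function names are our own choices, not part of the paper.

```python
import math

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

N = 200  # truncation of the infinite sums; terms decay like 2**-l

# c2 = sum_l 2^-l l ln l
c2 = sum(2.0**-l * l * math.log(l) for l in range(1, N))
# A1 = log(2e) - sum_l 2^(-l-1) l log l
A1 = math.log2(2 * math.e) - sum(2.0**(-l - 1) * l * math.log2(l) for l in range(1, N))

def c3_term(l):
    # Summand of c3 (without the 2^-l weight); the binomial is C(l, 2).
    b = math.comb(l, 2)
    return (b * math.log2(b) - l**2 * math.log2(l)
            + (l - 1) * (l - 3) * math.log2(l - 1)
            + (l - 2) * math.log2(l - 2))

c3 = 0.5 * (-1 + sum(2.0**-l * c3_term(l) for l in range(3, N)))

# c4 = single sum over j plus double sum over (i, j)
c4 = sum(2.0**-(2 + j) * (j - 1) * (j - 3) * h(1 / (j - 1)) for j in range(4, N))
c4 += sum(2.0**-(i + j + 1) * (i + j - 1) * (j - 3) * h((i + 1) / (i + j - 1))
          for i in range(2, N) for j in range(4, N))

s1 = sum(2.0**-l * (l * math.log(l))**2 for l in range(1, N))
s2 = sum(2.0**-l * l**2 * math.log(l) for l in range(1, N))
A2 = c3 + c4 + (2 + 1.5 * c2**2 + s1 - c2 * s2) / (4 * math.log(2))

def c_est(d):
    """Three-term estimate 1 + d log d - A1 d + A2 d^2 (no error term)."""
    return 1 + d * math.log2(d) - A1 * d + A2 * d**2

print(A1, A2, c2, c3, c4)  # compare with the values quoted in Theorem 1.1
print(c_est(0.01), c_est(0.05))
```

The computed constants can be compared against the decimal values quoted in Theorem 1.1 as a sanity check on the reading of the formulas.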

### 1.3 Notation

We borrow $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$ notation from the computer science literature. We define these as follows to fit our needs. Let and . We say:

• We say if there is a constant such that for all .

• We say if there is a constant such that for all .

• We say if there are constants such that for all .

Throughout this paper, we adhere to the convention that the constants above should not depend on the processes etc. under consideration, if there are such processes.

### 1.4 Outline of the paper

Section 2 contains the basic definitions and results necessary for our approach to estimating the capacity of the deletion channel. We show that it is sufficient to consider stationary ergodic input sources, and define their corresponding rate (mutual information per bit). Capacity is obtained by maximizing this quantity over stationary processes. In Section 3, we present an informal argument that contains the basic intuition leading to our main result (Theorem 1.1), and allows us to correctly guess the optimal input distribution. Section 4 states a small number of core lemmas, and shows that they imply Theorem 1.1. Finally, Section 5 states several technical results (proved in appendices) and uses them to prove the core lemmas. We conclude with a short discussion, including a list of open problems, in Section 6.

## 2 Preliminaries

For the reader’s convenience, we restate here some known results that we will use extensively, along with some definitions and auxiliary lemmas.

Consider a sequence of channels, where the $n$-th channel allows exactly $n$ input bits, and deletes each bit independently with probability $d$. The output of the $n$-th channel for input $X^n$ is a binary vector denoted by $Y(X^n)$. The length of $Y(X^n)$ is a binomial random variable. We want to find the maximum rate at which we can send information over this sequence of channels with vanishingly small error probability.
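
The channel itself is easy to simulate. The following is a minimal sketch of one channel use on a block of input bits; the function name `deletion_channel` and the optional `rng` argument are our own conventions.

```python
import random

def deletion_channel(bits, d, rng=None):
    """Delete each input bit independently with probability d.

    The surviving bits are returned in their original order; the receiver
    does not learn which positions were deleted.
    """
    rng = rng or random.Random()
    return [b for b in bits if rng.random() >= d]

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(20)]   # i.i.d. Bernoulli(1/2) input
y = deletion_channel(x, d=0.1, rng=rng)
print(len(x), len(y))  # output length is Binomial(n, 1 - d)
```

Because only the surviving bits are observed, the decoder sees a subsequence of the input with no markers at the deleted positions, which is what makes the capacity problem hard.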

The following characterization follows from [3].

###### Theorem 2.1.

Let

$$C_n \equiv \frac{1}{n}\max_{p_{X^n}} I(X^n; Y(X^n)).$$

Then, the following limit exists

$$C \equiv \lim_{n\to\infty} C_n = \inf_{n\ge 1} C_n, \tag{2}$$

and is equal to the capacity of the deletion channel.

A further useful remark is that, in computing capacity, we can assume $X^n$ to be $n$ consecutive coordinates of a stationary ergodic process. We denote by $\mathcal{S}$ the class of stationary and ergodic processes that take binary values.

###### Lemma 2.2.

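Let $\mathbb{X} = \{X_i\}_{i\in\mathbb{Z}}$ be a stationary and ergodic process, with $X_i$ taking values in $\{0,1\}$. Then the limit $I(\mathbb{X}) \equiv \lim_{n\to\infty}\frac{1}{n} I(X^n; Y(X^n))$ exists and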

$$C = \max_{\mathbb{X}\in\mathcal{S}} I(\mathbb{X}).$$

We use the following natural definition of the rate achieved by a stationary ergodic process.

###### Definition 2.3.

For stationary and ergodic $\mathbb{X}$, we call $I(\mathbb{X})$ the rate achieved by $\mathbb{X}$.

Proofs of Theorem 2.1 and Lemma 2.2 are provided in Appendix A.

Given a stationary process $\mathbb{X}$, it is convenient to consider it from the point of view of a ‘uniformly random’ block/run. Intuitively, this corresponds to choosing a large integer $n$ and selecting as reference point the beginning of a uniformly random block in $X^n$. Notice that this approach naturally discounts longer blocks for finite $n$. While such a procedure can be made rigorous by taking the limit $n\to\infty$, it is more convenient to make use of the notion of Palm measure from the theory of point processes [11, 12], which is, in this case, particularly easy to define. To a binary source $\mathbb{X}$, we can associate in a bijective way a subset of times, namely the times $t$ such that $X_t$ is the first bit of a run. The Palm measure is then the distribution of $\mathbb{X}$ conditional on the event that the reference point belongs to this subset.

We denote by $L$ the length of the block starting at the reference point under the Palm measure, and denote by $p_L$ its distribution. As an example, if $\mathbb{X}$ is the i.i.d. Bernoulli($1/2$) process, we have $p_L = p_L^*$ where $p_L^*(l) \equiv 2^{-l}$. We will also call $p_L$ the block-perspective run length distribution or simply the run length distribution, and let

$$\mu(\mathbb{X}) \equiv \mathbb{E}[L] = \sum_{l=1}^{\infty} p_L(l)\, l,$$

be its average. Let $L_0$ be the length of the block containing bit $X_0$ in the stationary process $\mathbb{X}$. A standard calculation [11, 12] yields $\mathbb{P}(L_0 = l) = l\,p_L(l)/\mu(\mathbb{X})$. Since $L_0$ is well defined and almost surely finite (by ergodicity), we necessarily have $\mu(\mathbb{X}) < \infty$.
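
As a worked instance for the i.i.d. Bernoulli($1/2$) process, assuming the standard size-biasing relation between the block-perspective law and the law of the run covering a fixed bit (the symbol $L_0$ for the latter is used here only as a label):

```latex
% Bernoulli(1/2) source: p_L^*(l) = 2^{-l}
\mu = \sum_{l \ge 1} l\, 2^{-l} = 2,
\qquad
\mathbb{P}(L_0 = l) = \frac{l\, p_L^*(l)}{\mu} = l\, 2^{-l-1},
\qquad
\mathbb{E}[L_0] = \sum_{l \ge 1} l^2\, 2^{-l-1} = 3 .
```

The run covering a fixed bit is thus longer on average (3 versus 2), which is the sense in which the block perspective discounts long runs.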

In our main result, Theorem 1.1, a special role is played by processes $\mathbb{X}$ such that the associated switch times form a stationary renewal process. We will refer to such an $\mathbb{X}$ as a process with i.i.d. runs.

## 3 Intuition behind the main theorem

In this section, we provide a heuristic/non-rigorous explanation for our main result. The aim is to build intuition and motivate our approach, without getting bogged down in the numerous technical difficulties that arise. In fact, we focus here on heuristically deriving the optimal input process, and do not actually obtain the quadratic term of the capacity expansion. We find this process by computing various quantities to leading order and using the following observation (cf. Remark 4.2).

Key Observation: The process that achieves capacity for small $d$ should be ‘close’ to the Bernoulli($1/2$) process, since $H(\mathbb{X})$ must be close to $1$.

We have

$$I(X^n; Y(X^n)) = H(Y) - H(Y \mid X^n). \tag{3}$$

Let $D^n$ be a binary vector containing a one at position $i$ if and only if $X_i$ is deleted from the input vector. We can write

$$H(Y \mid X^n) = H(Y, D^n \mid X^n) - H(D^n \mid X^n, Y).$$

But $Y$ is a function of $(X^n, D^n)$, leading to $H(Y, D^n \mid X^n) = H(D^n \mid X^n) = n\,h(d)$, where we used the fact that $D^n$ is i.i.d. Bernoulli($d$), independent of $X^n$. It follows that

$$H(Y \mid X^n) = n\,h(d) - H(D^n \mid X^n, Y). \tag{4}$$

The term $H(D^n \mid X^n, Y)$ represents ambiguity in the location of deletions, given the input and output strings. Now, since $d$ is small, we expect that most deletions occur in ‘isolation’, i.e., far away from other deletions. Make the (incorrect) assumption that all deletions occur such that no three consecutive runs have more than one deletion in total. In this case, we can unambiguously associate runs in $Y$ with runs in $X^n$. Ambiguity in the location of a deletion occurs if and only if a deletion occurs in a run of length $l > 1$. In this case, each of $l$ locations is equally likely for the deletion, leading to a contribution of $\log l$ to $H(D^n \mid X^n, Y)$. Now, a run of length $l$ should suffer a deletion with probability $\approx l d$. Thus, we expect

$$\frac{1}{n} H(D^n \mid X^n, Y) \approx \frac{d}{\mu(\mathbb{X})}\sum_{l=1}^{\infty} p_L(l)\, l \log l.$$

We know that $\mathbb{X}$ is close to the i.i.d. Bernoulli($1/2$) process, implying $p_L$ is close to $p_L^*$ and $\mu(\mathbb{X})$ is close to $2$. This leads to

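$$
\begin{aligned}
\frac{1}{n} H(D^n \mid X^n, Y) &\approx \frac{d}{2}\sum_{l=1}^{\infty} p_L(l)\, l\log l + \frac{d\,(\mu(\mathbb{X}) - 2)}{4}\sum_{l=1}^{\infty} p_L^*(l)\, l\log l \\
&= \frac{d}{2}\left[-\frac{c_2}{\ln 2} + \sum_{l=1}^{\infty} p_L(l)\, l\left(\log l - \frac{c_2}{2\ln 2}\right)\right].
\end{aligned}
\tag{5}
$$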

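Consider $H(Y)$. Now, if the input $X^n$ is drawn from a stationary process $\mathbb{X}$, we expect the output $Y$ to also be a segment of some stationary process $\mathbb{Y}$. (It turns out that this is the case.) Moreover, we expect that the channel output has about $n(1-d)$ bits, leading to $\lim_{n\to\infty}\frac{1}{n}H(Y) = (1-d)\,H(\mathbb{Y})$. Denote the run length distribution in $\mathbb{Y}$ by $q_L$. Define $\mu(\mathbb{Y}) \equiv \sum_{l=1}^{\infty} q_L(l)\, l$. Let $L_Y$ denote the length of a random run drawn according to $q_L$. It is not hard to see that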

$$H(\mathbb{Y}) \le H(L_Y)/\mu(\mathbb{Y}),$$

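with equality iff $\mathbb{Y}$ consists of i.i.d. runs, which occurs iff $\mathbb{X}$ consists of i.i.d. runs. Define $q_L^*(l) \equiv 2^{-l}$. An explicit calculation yields $H(L_Y) = \mu(\mathbb{Y}) - D(q_L\|q_L^*)$. We know that $\mathbb{Y}$ is close to the i.i.d. Bernoulli($1/2$) process, implying $\mu(\mathbb{Y})$ is close to $2$ and $D(q_L\|q_L^*)$ is small. Thus,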

$$\lim_{n\to\infty}\frac{1}{n}H(Y) = (1-d)\,H(\mathbb{Y}) \le (1-d)\left(1 - \frac{D(q_L\|q_L^*)}{\mu(\mathbb{Y})}\right) \approx 1 - d - \frac{D(q_L\|q_L^*)}{2}.$$

Notice that an i.i.d. Bernoulli($1/2$) input results in an i.i.d. Bernoulli($1/2$) output from the deletion channel. The following is made precise in Lemma 5.9: Let $\delta$ be the ‘distance’ between $p_L$ and $p_L^*$. Then a short calculation tells us that the distance between $p_L$ and $q_L$ should be $O(\delta d)$. In other words $p_L$ and $q_L$ are very nearly equal to each other.

So we obtain, to leading order,

$$\lim_{n\to\infty}\frac{1}{n}H(Y) \lesssim 1 - d - D(p_L\|p_L^*)/2, \tag{6}$$

with (approximate) equality iff $\mathbb{X}$ consists of i.i.d. runs.

Putting Eqs. (3), (4), (5) and (6) together, we have

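$$
\begin{aligned}
I(\mathbb{X}) &= \lim_{n\to\infty}\frac{1}{n} I(X^n; Y) \\
&\lesssim 1 - d - \frac{D(p_L\|p_L^*)}{2} - h(d) + \frac{d}{2}\left[-\frac{c_2}{\ln 2} + \sum_{l=1}^{\infty} p_L(l)\, l\left(\log l - \frac{c_2}{2\ln 2}\right)\right].
\end{aligned}
$$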

Since this (approximate) upper bound on $I(\mathbb{X})$ depends on the input $\mathbb{X}$ only through $p_L$, we choose $\mathbb{X}$ consisting of i.i.d. runs so that (approximate) equality holds.

We expect $p_L$ to be close to $p_L^*$. A Taylor expansion gives

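$$
\begin{aligned}
D(p_L\|p_L^*) &= \sum_{l=1}^{\infty} p_L(l)\left(l + \log p_L(l)\right) \\
&\approx \frac{1}{\ln 2}\sum_{l=1}^{\infty}\left(\left(p_L(l) - 2^{-l}\right) + 2^{l-1}\left(p_L(l) - 2^{-l}\right)^2\right) \\
&= \frac{1}{\ln 2}\sum_{l=1}^{\infty} 2^{l-1}\left(p_L(l) - 2^{-l}\right)^2.
\end{aligned}
$$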

Thus, we want to maximize

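$$-\frac{1}{2\ln 2}\sum_{l=1}^{\infty} 2^{l-1}\left(p_L(l) - 2^{-l}\right)^2 + \frac{d}{2}\left[\sum_{l=1}^{\infty} p_L(l)\, l\left(\log l - \frac{c_2}{2\ln 2}\right)\right],$$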

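subject to $\sum_{l=1}^{\infty} p_L(l) = 1$, in order to achieve the largest possible $I(\mathbb{X})$. A simple calculation tells us that the maximizing distribution is $p_L^{\dagger}(l) = 2^{-l}\left(1 + d\,(l\ln l - c_2 l/2)\right)$.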
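
The maximization can be sketched with a Lagrange multiplier. The derivation below assumes the objective is the quadratic penalty $-\frac{1}{2\ln 2}\sum_l 2^{l-1}\delta_l^2$ plus the linear term $\frac{d}{2}\sum_l p_L(l)\, l\,(\log l - \frac{c_2}{2\ln 2})$; the symbols $\delta_l \equiv p_L(l) - 2^{-l}$ and $\lambda$ are our notation.

```latex
\begin{aligned}
0 &= -\frac{2^{l-1}}{\ln 2}\,\delta_l
     + \frac{d}{2}\, l\left(\log l - \frac{c_2}{2\ln 2}\right) - \lambda
  && \text{(stationarity in } p_L(l)\text{)} \\
\delta_l &= 2^{-l}\, d\left(l\ln l - \frac{c_2}{2}\, l\right) - 2^{1-l}\,\lambda\ln 2 \\
0 &= \sum_{l\ge 1}\delta_l = d\,(c_2 - c_2) - 2\lambda\ln 2
  && \text{(using } \textstyle\sum_l 2^{-l} l\ln l = c_2,\ \sum_l 2^{-l} l = 2\text{)} \\
\lambda &= 0
\quad\Longrightarrow\quad
p_L(l) = 2^{-l}\left(1 + d\left(l\ln l - \frac{c_2}{2}\, l\right)\right).
\end{aligned}
```

The normalization constraint is satisfied automatically, which is why the constant $c_2$ appears in the run-length distribution.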

## 4 Proof of the main theorem: Outline

In this section we provide the proof of Theorem 1.1 after stating the key lemmas involved. We defer the proof of the lemmas to the next section. Sections 5.1-5.4 develop the technical machinery we use, and the proofs of the lemmas are in Section 5.6.

Given a (possibly infinite) binary sequence, a run of $0$’s (of $1$’s) is a maximal subsequence of consecutive $0$’s ($1$’s), i.e. a subsequence of $0$’s bordered by $1$’s (respectively, of $1$’s bordered by $0$’s). The first step consists in proving achievability by estimating $I(\mathbb{X})$ for a process $\mathbb{X}$ having i.i.d. runs with appropriately chosen distribution.

###### Lemma 4.1.

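Let $\mathbb{X}^{\dagger}$ be the process consisting of i.i.d. runs with distribution $p_L^{\dagger}(l) \equiv 2^{-l}\left(1 + d\,(l\ln l - c_2 l/2)\right)$. Then for any $\epsilon > 0$, we have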

$$I(\mathbb{X}^{\dagger}) = 1 + d\log d - A_1 d + A_2 d^2 + O(d^{3-\epsilon}).$$

Lemma 4.1 is proved in Section 5.6.

Lemma 2.2 allows us to restrict our attention to stationary ergodic processes in proving the converse. For a process $\mathbb{X}$, we denote by $H(\mathbb{X})$ its entropy rate. Define

$$H(\mathbb{Y}_{\mathbb{X}}) \equiv \lim_{n\to\infty}\frac{H(Y(X^n))}{n(1-d)}. \tag{7}$$

A simple argument shows that this limit exists and is bounded above by $1$ for any stationary process $\mathbb{X}$ and any $d$, with $H(\mathbb{Y}_{\mathbb{X}}) = 1$ iff $\mathbb{X}$ is the i.i.d. Bernoulli($1/2$) process.

In light of Lemma 4.1, we can restrict consideration to processes satisfying whence :

###### Remark 4.2.

There exists such that for all , if , we have and hence also  .

We define a ‘super-run’ next.

###### Definition 4.3.

A super-run consists of a maximal contiguous sequence of runs such that all runs in the sequence after the first one (on the left) have length one. We divide a realization of into super-runs . Here is the super-run including the bit at position 1.

See Table 2 for an example showing division into super-runs.
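
Definition 4.3 can be illustrated on a finite window. The sketch below follows our reading of the definition (a run of length one always attaches to the super-run on its left, so a new super-run starts at every run of length at least two); the function name `super_runs` is ours, and the first group of a finite window may be only a fragment of a longer super-run.

```python
from itertools import groupby

def super_runs(bits):
    """Group the runs of `bits` into super-runs, given as lists of run lengths."""
    lengths = [len(list(group)) for _, group in groupby(bits)]
    groups = []
    for length in lengths:
        if groups and length == 1:
            # A run of length one extends the current super-run.
            groups[-1].append(length)
        else:
            # A longer run (or the very first run) starts a new super-run.
            groups.append([length])
    return groups

# Runs of 000|1|0|11|0|1|0000 have lengths 3,1,1,2,1,1,4.
print(super_runs("0001011010000"))  # [[3, 1, 1], [2, 1, 1], [4]]
```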

Denote by $\mathcal{S}$ the set of all stationary ergodic processes and by $\mathcal{S}_{L^*}$ the set of stationary ergodic processes such that, with probability one, no super-run has length larger than $L^*$.

Our next lemma tightens the constraint given by Remark 4.2 further for processes in $\mathcal{S}_{L^*}$.

###### Lemma 4.4.

Consider any and constant . There exists such that the following happens for any . For any , if

$$I(\mathbb{X}) \ge C - \kappa\, d^{2-(\epsilon/2)},$$

then

$$H(\mathbb{Y}_{\mathbb{X}}) \ge 1 - d^{2-\epsilon}.$$

We show an upper bound for the restricted class of processes $\mathcal{S}_{L^*}$.

###### Lemma 4.5.

For any there exists and such that the following happens. If , for any ,

$$I(\mathbb{X}) \le 1 + d\log d - A_1 d + A_2 d^2 + \kappa\, d^{3-\epsilon}.$$

Finally, we show a suitable reduction from the class $\mathcal{S}$ to the class $\mathcal{S}_{L^*}$.

###### Lemma 4.6.

For any there exists such that the following happens for all , and all . For any such that and for any , there exists such that

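$$I(\mathbb{X}) \le I(\mathbb{X}_{L^*}) + d^{\gamma-\epsilon}(L^*)^{-1}\log L^*, \tag{8}$$

$$H(\mathbb{Y}_{\mathbb{X}}) \ge H(\mathbb{Y}_{\mathbb{X}_{L^*}}) - d^{\gamma-\epsilon}(L^*)^{-1}\log L^*. \tag{9}$$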

Lemmas 4.4, 4.5 and 4.6 are proved in Section 5.6.

The proof of Theorem 1.1 follows from these lemmas with Lemma 4.6 being used twice.

###### Proof of Theorem 1.1.

Lemma 4.1 shows achievability. For the converse, we start with a process such that . By Remark 4.2, for any and . Use Lemma 4.6, with , and . It follows that for ,

$$I(\mathbb{X}_{L^*}) > C - d^{2-2\delta}, \qquad H(\mathbb{Y}_{\mathbb{X}}) \ge H(\mathbb{Y}_{\mathbb{X}_{L^*}}) - d^{2-2\delta}.$$

We now use Lemma 4.4 which yields and hence, by Eq. (9), for small . Now, we can use Lemma 4.6 again with , , . We obtain

$$I(\mathbb{X}_{L^*}) \ge C - d^{3-4\delta}.$$

Finally, using Lemma 4.5, we get the required upper bound on $C$. ∎

## 5 Proofs of the Lemmas

In Section 5.1 we show that, for any stationary ergodic $\mathbb{X}$ that achieves a rate close to capacity, the run-length distribution must be close to the distribution obtained for the i.i.d. Bernoulli($1/2$) process. In Section 5.2, we suitably rewrite the rate achieved by a stationary ergodic process as the sum of three terms. In Section 5.3 we construct a modified deletion process that allows accurate estimation of in the small $d$ limit. Section 5.4 proves a key bound on that leads directly to Lemma 4.4. Finally, in Section 5.6 we present proofs of the Lemmas quoted in Section 4 using the tools developed.

We will often write for the random vector where the ’s are distributed according to the process .

### 5.1 Characterization in terms of runs

Let be the number of runs in . Let be the run lengths ( being the length of the intersection of that run with ). It is clear that (where one bit is needed to remove the ambiguity). By ergodicity almost surely as . Also implies . Further, . If is the entropy rate of the process , by taking the limit, it is easy to deduce that

$$H(\mathbb{X}) \le \frac{H(L)}{\mathbb{E}[L]}, \tag{10}$$

with equality if and only if $\mathbb{X}$ is a process with i.i.d. runs with common distribution $p_L$.

We know that given $\mathbb{E}[L] = \mu$, the probability distribution with largest possible entropy is geometric with mean $\mu$, i.e. $p_L(l) = (1 - 1/\mu)^{l-1}(1/\mu)$ for all $l \ge 1$, leading to

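$$\frac{H(L)}{\mathbb{E}[L]} \le -\left(1 - \frac{1}{\mu}\right)\log\left(1 - \frac{1}{\mu}\right) - \frac{1}{\mu}\log\frac{1}{\mu} \equiv h(1/\mu). \tag{11}$$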

Here we introduced the notation $h(p) \equiv -p\log p - (1-p)\log(1-p)$ for the binary entropy function.
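
The equality case of Eq. (11) can be checked numerically: for a geometric run-length law with mean $\mu$, the ratio $H(L)/\mathbb{E}[L]$ should equal $h(1/\mu)$. The sketch below truncates the geometric law after a fixed number of terms; the function names and the truncation length are our own choices.

```python
import math

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def geometric_entropy_per_bit(mu, terms=2000):
    """H(L)/E[L] in bits for p(l) = (1/mu) * (1 - 1/mu)**(l - 1), l >= 1."""
    q = 1.0 / mu
    probs = [q * (1 - q) ** (l - 1) for l in range(1, terms + 1)]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    mean = sum(l * p for l, p in enumerate(probs, start=1))
    return entropy / mean

for mu in (1.5, 2.0, 3.0):
    print(mu, geometric_entropy_per_bit(mu), h(1 / mu))
```

For $\mu = 2$ the geometric law is $2^{-l}$ and the ratio equals $1$ bit per symbol, the entropy rate of the i.i.d. Bernoulli($1/2$) process.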

Using this, we are able to obtain sharp bounds on $p_L$ and $\mu(\mathbb{X})$.

###### Lemma 5.1.

There exists such that the following occurs. For any and , if is such that , we have

$$|\mu(\mathbb{X}) - 2| \le 7 d^{\beta/2}. \tag{12}$$
###### Proof.

By Eqs. (10) and (11), we have . By Pinsker’s inequality , and therefore . The claim follows from simple calculus. ∎

###### Lemma 5.2.

There exists and such that the following occurs for any and . For any such that , we have

$$\sum_{l=1}^{\infty}\left|p_L(l) - \frac{1}{2^l}\right| \le \kappa' d^{\beta/2}. \tag{13}$$
###### Proof.

Let $p_L^*(l) \equiv 2^{-l}$ and recall that $\mu(\mathbb{X}) = \sum_{l=1}^{\infty} p_L(l)\, l$. An explicit calculation yields

$$H(L) = \mu(\mathbb{X}) - D(p_L\|p_L^*). \tag{14}$$

Now, by Pinsker’s inequality,

$$D(p_L\|p_L^*) \ge \frac{2}{\ln 2}\,\|p_L - p_L^*\|_{\rm TV}^2. \tag{15}$$

Combining Lemma 5.1, and Eqs. (10), (14) and (15), we get the desired result. ∎
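
The explicit calculation behind Eq. (14) is a one-line expansion of the relative entropy, assuming the reference law is $p_L^*(l) = 2^{-l}$ and logarithms are in base 2:

```latex
D(p_L \,\|\, p_L^*)
  = \sum_{l \ge 1} p_L(l)\,\log\frac{p_L(l)}{2^{-l}}
  = \sum_{l \ge 1} l\, p_L(l) + \sum_{l \ge 1} p_L(l)\log p_L(l)
  = \mu(\mathbb{X}) - H(L).
```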

For the rest of Section 5.1, we only state our technical estimates, deferring proofs to Appendix B.

We now state a tighter bound on probabilities of large run lengths. We will find this useful, for instance, to control the number of bit flips in going from a general process to one having bounded run lengths.

###### Lemma 5.3.

There exists such that the following occurs: Consider any , and define . For all , if is such that , we have

$$\sum_{l=\ell}^{\infty} l\, p_L(l) \le 20 d^{\beta}. \tag{16}$$

We use $L(k)$ to denote the vector of lengths of a randomly selected block of $k$ consecutive runs (a ‘$k$-block’). Formally, $L(k)$ is the vector of lengths of the first $k$ runs starting from bit , under the Palm measure introduced in Section 2.

###### Corollary 5.4.

There exists such that the following occurs: Consider any positive integer and any , and define . For all , if is such that , we have

$$\sum_{l_1+\dots+l_k \ge k\ell} (l_1 + \dots + l_k)\, p_{L(k)}(l_1,\dots,l_k) \le 20 k^2 d^{\beta}. \tag{17}$$

Clearly, . We have

A stronger form of Lemma 5.2 follows.

###### Lemma 5.5.

Let . For the same and as in Lemma 5.2, the following occurs. Consider any positive integer and any . For all , if is such that , we have

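$$\sum_{l_1=1}^{\infty}\sum_{l_2=1}^{\infty}\cdots\sum_{l_k=1}^{\infty}\left|p_{L(k)}(l_1,\dots,l_k) - p^*_{L(k)}(l_1,\dots,l_k)\right| \le \kappa'\sqrt{k}\, d^{\beta/2}.$$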

We now relate the run-length distribution in and in (as ). For this, we first need a characterization of in terms of a stationary ergodic process. Let be an i.i.d. Bernoulli, independent of . Construct as follows. Look at . Delete bits corresponding to . The bits remaining are in order. Similarly, in delete bits corresponding to . The bits remaining are in order.

###### Proposition 5.6.

The process $\mathbb{Y}$ is stationary and ergodic for any stationary ergodic $\mathbb{X}$.

Notice on the other hand that are not jointly stationary.

The channel output is then where . It is easy to check that

$$H(\mathbb{Y}) = H(\mathbb{Y}_{\mathbb{X}})$$

(cf. Eq. (7)). We will henceforth use $H(\mathbb{Y})$ instead of the more cumbersome notation $H(\mathbb{Y}_{\mathbb{X}})$.

Let $q_L$ denote the block perspective run-length distribution for $\mathbb{Y}$. Denote by $q_{L(k)}$ the block perspective distribution for $k$-blocks in $\mathbb{Y}$. Lemmas 5.1, 5.2, 5.3, 5.5 and Corollary 5.4 hold for any stationary ergodic process, hence they hold true if we replace $\mathbb{X}$ with $\mathbb{Y}$.

In proving the upper bound, it turns out that we are able to establish a bound of for and small , but no corresponding bound for . Next, we establish that if is close to , this leads to tight control over the tail for . This is a corollary of Lemma 5.3.

###### Lemma 5.7.

There exists such that the following occurs: Consider any , and define . For all , if , we have

$$\sum_{l=2\ell}^{\infty} l\, p_L(l) \le 80 d^{\gamma}.$$

Note that $p_L$ refers to the block length distribution of $\mathbb{X}$, not $\mathbb{Y}$.

###### Corollary 5.8.

There exists such that the following occurs: Consider any positive integer and , and define . For all , if , we have

$$\sum_{l_1+\dots+l_k \ge 2k\ell} (l_1 + \dots + l_k)\, p_{L(k)}(l_1,\dots,l_k) \le 80 k^2 d^{\gamma}.$$

Consider $\mathbb{X}$ being i.i.d. Bernoulli($1/2$). Clearly, this corresponds to $\mathbb{Y}$ also i.i.d. Bernoulli($1/2$). Hence, each has the same run length distribution $p_L^*$. This happens irrespective of the deletion probability $d$. Now suppose $\mathbb{X}$ is not i.i.d. Bernoulli($1/2$) but approximately so, in the sense that $H(\mathbb{X})$ is close to $1$. The next lemma establishes that, in this case also, the run length distribution of $\mathbb{Y}$ is very close to that of $\mathbb{X}$, for small run lengths and small $d$.

###### Lemma 5.9.

There exist a function and constants , such that the following happens, for any , and .
(i) For all , for all such that , and all , we have

$$|p_L(l) - q_L(l)| \le \kappa_1 d^{1+\beta/2-\epsilon}.$$

(ii) For all and all such that , we have

$$|\mu(\mathbb{X}) - \mu(\mathbb{Y})| \le \kappa_2 d^{1+\beta/2}. \tag{18}$$

Let us emphasize that do not depend at all on , whereas does not depend on in the above lemma. Analogous comments apply to the remaining lemmas in this section.

As before, we are able to generalize this result to blocks of consecutive runs.

###### Lemma 5.10.

There exist a function and a constant such that the following happens, for any , and .

For all , for all integers and such that , and all such that , we have

$$|p_{L(k)}(l_1,\dots,l_k) - q_{L(k)}(l_1,\dots,l_k)| \le \kappa' d^{1+\beta/2-\epsilon}.$$

In proving the lower bound, we have , but no corresponding bound for . The next lemma allows us to get tight control over the tail of .

###### Lemma 5.11.

For any , there exists such that the following occurs: Consider any , and define . For all , if , we have

 ∞∑l=ℓ lqL(l) ≤d