Fixed Error Probability Asymptotics For Erasure and List Decoding
Abstract
We derive the optimum second-order coding rates, known as second-order capacities, for erasure and list decoding. For erasure decoding over discrete memoryless channels, we show that the second-order capacity is $\sqrt{V}\,\Phi^{-1}(\epsilon_t)$, where $V$ is the channel dispersion and $\epsilon_t$ is the total error probability, i.e., the sum of the erasure and undetected error probabilities. This total error probability, like the other error probabilities in this paper, is fixed at a nonzero constant. We show numerically that the expected rate at finite blocklength for erasure decoding can exceed the finite blocklength channel coding rate. We also show that the analogous result holds for lossless source coding with decoder side information, i.e., Slepian-Wolf coding. For list decoding, we consider list codes of deterministic size scaling as $2^{\ell\sqrt{n}}$ and show that the corresponding second-order capacity is $\sqrt{V}\,\Phi^{-1}(\epsilon)+\ell$, where $\epsilon$ is the permissible error probability, i.e., the probability that the true message is not in the list. We also consider lists of polynomial size $n^\alpha$ and derive bounds on the third-order coding rate in terms of the order of the polynomial $\alpha$. These bounds are tight for symmetric and singular channels. The direct parts of the coding theorems leverage the simple threshold decoder, and the converses are proved using variants of the hypothesis testing converse.
I Introduction
In many communication scenarios, it is advantageous to allow the decoder the option of either not deciding at all or putting out more than one estimate of the message. These options are known as erasure and list decoding respectively and have been studied extensively in the information theory literature [1, 2, 3, 4, 5, 6, 7]. The erasure and list options generally allow for smaller undetected error probabilities, so these options are useful in practice.
In this paper, we revisit the problem of erasure and list decoding from the viewpoint of second- and third-order asymptotics. The study of second-order asymptotics for fixed (non-vanishing) error probability was initiated by Strassen [8, Thm. 1.2], who showed for well-behaved discrete memoryless channels (DMCs) that the maximum number of codewords at (average or maximum) error probability $\epsilon$, namely $M^*(n,\epsilon)$, satisfies
$$\log M^*(n,\epsilon) = nC + \sqrt{nV}\,\Phi^{-1}(\epsilon) + O(\log n), \qquad (1)$$
where $C$ and $V$ are respectively the capacity and the dispersion of the channel, and $\Phi^{-1}$ is the inverse of the standard Gaussian cumulative distribution function (cdf). Also see Kemperman [9, Thm. 11.3] for the corresponding result for constant composition codes. This line of work has been revisited by numerous authors recently, who considered various other channel models such as the additive white Gaussian noise (AWGN) channel [10, 11], bounds on the third-order (logarithmic) term [12, 13, 14, 15], and extensions to multiterminal problems such as the two-encoder Slepian-Wolf problem and the multiple-access channel [16].
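As a numerical illustration of the expansion (1), the following Python sketch evaluates the normal approximation $nC + \sqrt{nV}\,\Phi^{-1}(\epsilon)$ for a binary symmetric channel. This is our own illustration: the function names and the crossover probability passed to them are not taken from the paper, and the $O(\log n)$ term is dropped.

```python
import math
from statistics import NormalDist

def bsc_capacity_dispersion(p):
    """Capacity C (bits/use) and dispersion V of a BSC with crossover probability p."""
    hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy
    C = 1 - hb
    # Known closed form for the BSC: V = p(1-p) * (log2((1-p)/p))^2
    V = p * (1 - p) * math.log2((1 - p) / p) ** 2
    return C, V

def normal_approx_log_M(n, eps, p):
    """Strassen-style approximation nC + sqrt(nV) * Phi^{-1}(eps), dropping O(log n)."""
    C, V = bsc_capacity_dispersion(p)
    return n * C + math.sqrt(n * V) * NormalDist().inv_cdf(eps)
```

For $\epsilon < 1/2$ the second-order term is negative, quantifying the backoff from capacity at finite blocklength.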
I-A Main Contributions
In this paper, for erasure decoding, we consider constant undetected and total (sum of undetected and erasure) error probabilities (numbers strictly between $0$ and $1$) and we obtain the analogue of the second-order term in (1). We show that the coefficient of the second-order term, termed the second-order capacity, is $\sqrt{V}\,\Phi^{-1}(\epsilon_t)$ where $\epsilon_t$ is the total error probability. The second-order capacity is thus completely independent of the undetected error probability. We then compute the expected rate at finite blocklength allowing erasures and show that it can exceed the finite blocklength rate without the erasure option, i.e., usual channel coding. We show that these results carry over in a straightforward manner to the problem of lossless source coding with side information, i.e., the Slepian-Wolf problem [17], which was previously studied for the case without the erasure option [16].
For list decoding, we consider lists of deterministic size of order $2^{\ell\sqrt{n}}$ and show that the second-order capacity is $\sqrt{V}\,\Phi^{-1}(\epsilon)+\ell$, where $\epsilon$ is the permissible error probability. We also consider lists of polynomial size $n^\alpha$ and demonstrate bounds on the third-order term in terms of the degree $\alpha$ of the polynomial. These bounds turn out to be tight for channels that are symmetric and singular, a canonical example being the binary erasure channel [14]. To the best of the authors' knowledge, this is the first time that lists of size other than constant or exponential have been considered in the literature. Practically, the advantage of smaller lists is that they result in lower search complexity for the true message within the decoded list.
I-B Related Work
Previously, the study of erasure and list decoding has been primarily from the error exponents perspective. We summarize some existing works here. Forney [1] derived optimal decision rules by generalizing the Neyman-Pearson lemma and also proved exponential upper bounds for the error probabilities using Gallager's Chernoff bounding techniques [18]. Shannon-Gallager-Berlekamp [2] proved exponential lower bounds for the error probabilities and also considered lists of exponential size. They showed that the sphere-packing error exponent (evaluated at the code rate minus the normalized list size) is an upper bound on the reliability function [19, Ex. 10.28]. Bounds for the error probabilities were derived by Telatar [3] using a general decoder parametrized by an asymmetric relation which is a function of the channel law. Blinovsky [4] studied the exponents of the list decoding problem at low (and even zero) rate. Csiszár-Körner [19, Thm. 10.11] present exponential upper bounds for universally attainable erasure decoding using the method of types. Moulin [5] generalized the treatment there and presented improved error exponents under some conditions. Recently, Merhav also considered alternative methods of analysis [6] and expurgated exponents for these problems [7]. The same author also derived erasure and list exponents for the Slepian-Wolf problem [20].
Another related line of work concerns constant error probability non-asymptotic fundamental limits of channel coding with various forms of feedback. Polyanskiy-Poor-Verdú [21] studied various incremental redundancy schemes and derived performance bounds under receiver and transmitter confirmation. In incremental redundancy schemes, one is allowed to transmit a sequence of coded symbols with numerous opportunities for confirmation before the codeword is resent if necessary. In contrast, in our study of channel coding with the erasure option in Sec. III-B, we analyze the expected performance of the easily implementable Forney-style [1] single-codeword repetition scheme, which allows for confirmation (or erasure) at the end of a complete codeword block and repeats the same message until the erasure event no longer occurs. We compare a quantity termed the expected rate to the ordinary channel coding rate. Inspired by [21], Chen et al. [22] derived non-asymptotic bounds for feedback systems using incremental redundancy with noiseless transmitter confirmation. Williamson-Chen-Wesel [23] also improved on the bounds in [21] for variable-length feedback coding.
I-C Structure of Paper
This paper is structured as follows: In Sec. II, we set the stage by introducing our notation and defining relevant quantities such as the second-order capacity. In Sec. III, we state our main results for channel coding with the erasure and list options. In Sec. IV, we show that the channel coding results carry over to the Slepian-Wolf problem [17]. All the proofs are detailed in Sec. V.
II Problem Setting and Main Definitions
Let $W$ be a random transformation (channel) from a discrete input alphabet $\mathcal{X}$ to a discrete output alphabet $\mathcal{Y}$. We denote length-$n$ deterministic (resp. random) strings $\mathbf{x}$ (resp. $\mathbf{X}$) by lower-case (resp. upper-case) boldface. If $W^n(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^n W(y_i|x_i)$ for every $(\mathbf{x}, \mathbf{y})$ and the sets $\mathcal{X}$ and $\mathcal{Y}$ are finite, $W$ is said to be a DMC. We focus on DMCs in this paper but extensions to other channels such as the AWGN channel are straightforward. For a sequence $\mathbf{x} \in \mathcal{X}^n$, its type is the empirical distribution $P_{\mathbf{x}}(a) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i = a\}$. For a finite alphabet $\mathcal{X}$, let $\mathscr{P}(\mathcal{X})$ and $\mathscr{P}_n(\mathcal{X})$ be the set of probability mass functions and types [19] (types with denominator at most $n$) respectively. For two sequences $(\mathbf{x}, \mathbf{y})$, we say that $V$ is a conditional type of $\mathbf{y}$ given $\mathbf{x}$ if $P_{\mathbf{x}}(a) V(b|a) = P_{\mathbf{x}\mathbf{y}}(a, b)$ for all $(a, b)$. (The conditional type is not unique if $P_{\mathbf{x}}(a) = 0$ for some $a$.) The shell of a sequence $\mathbf{x}$, denoted as $T_V(\mathbf{x})$, is the set of all $\mathbf{y}$ such that the joint type of $(\mathbf{x}, \mathbf{y})$ is $P_{\mathbf{x}} \times V$. The set of all conditional types $V$ for which $T_V(\mathbf{x})$ is non-empty for some $\mathbf{x}$ with type $P$ is denoted as $\mathscr{V}_n(\mathcal{Y}; P)$. All logs are to the base $2$ with the understanding that $0 \log 0 = 0$.
For information-theoretic quantities, we will mostly follow the notation in Csiszár and Körner [19]. We denote the information capacity of the DMC as $C = \max_P I(P, W)$. We let $\Pi$ be the set of capacity-achieving input distributions. If $(X, Y)$ has joint distribution $P \times W$, define $PW$ to be the induced output distribution and
$$V(P, W) = \sum_{x} P(x) \sum_{y} W(y|x) \left[ \log \frac{W(y|x)}{PW(y)} - D\big(W(\cdot|x) \,\|\, PW\big) \right]^2 \qquad (2)$$
to be the conditional information variance. The dispersion of the DMC [8, 11, 10] is defined as
$$V = \min_{P \in \Pi} V(P, W). \qquad (3)$$
We will assume throughout that the DMC satisfies $V > 0$. For integers $a \le b$, we denote $[a : b] = \{a, \ldots, b\}$ and abbreviate $[m] = [1 : m]$. Let $\Phi$ be the cdf of a standard Gaussian and $\Phi^{-1}$ its inverse. We now define erasure codes.
Definition 1.
An $(n, M)$ erasure code for $W$ is a pair of mappings $f$ and $\varphi$ such that $f : [M] \to \mathcal{X}^n$ and $\varphi : \mathcal{Y}^n \to [M] \cup \{\mathrm{e}\}$. The disjoint decoding regions are denoted as $\mathcal{D}_m = \varphi^{-1}(m)$; the erasure region is denoted as $\mathcal{D}_{\mathrm{e}} = \varphi^{-1}(\mathrm{e})$; and the conditional undetected, erasure and total error probabilities are defined as
(4)  
(5)  
(6) 
Note that $\epsilon_t = \epsilon_u + \epsilon_e$. Typically, the code is designed so that $\epsilon_u \ll \epsilon_e$, as the cost of making an undetected error is much higher than that of declaring an erasure.
Definition 2.
An $(n, M, \epsilon_u, \epsilon_t)$ erasure code for $W$ is an $(n, M)$ erasure code for the same channel where

If ,
(7) 
If
(8) 
If ,
(9) 
If ,
(10)
In Definition 2, we consider erasure codes with constraints on the undetected and total error probabilities, similar to [6, 1, 5]. An alternative formulation would be to consider erasure codes in which the constraint is placed on the erasure probability $\epsilon_e$ instead. We find the former formulation more traditional, and the analysis is also somewhat easier.
Definition 3.
A number is an achievable erasure second-order coding rate for the DMC with capacity if there exists a sequence of erasure codes such that
(11)  
(12)  
(13) 
The erasure second-order capacity is the supremum of all achievable erasure second-order coding rates.
We now turn our attention to codes which allow their decoders to output a list of messages. Let be the set of subsets of of size . Furthermore, we use the notation to denote the set of subsets of of size not exceeding .
Definition 4.
An list code for is a pair of mappings such that and . The (not necessarily disjoint) decoding regions are denoted as and the conditional error probability is defined as
(14) 
Definition 5.
An list code for is an list code for the same channel where if ,
(15) 
or if ,
(16) 
Definition 6.
A number is an achievable list second-order coding rate for the DMC with capacity if there exists a sequence of list codes such that in addition to (11), the following hold
(17)  
(18) 
The list second-order capacity is the supremum of all achievable list second-order coding rates.
According to (17), we stipulate that the list size grows as $2^{\ell\sqrt{n}}$. This differs from previous works on list decoding in which the list size is either constant [18, 7] or exponential [19, 3, 5, 2, 6, 7, 4, 20]. This scaling affects the second-order (dispersion) term. To understand how the list size may affect the higher-order terms, consider the following definition.
Definition 7.
A number is an achievable list third-order coding rate for the DMC with capacity and positive dispersion if there exists a sequence of list codes such that
(19)  
(20)  
(21) 
The list third-order capacity is the supremum of all achievable list third-order coding rates.
Inequality (20) implies that the size of the list grows polynomially; in particular, it scales as $n^\alpha$. This scaling affects the third-order (logarithmic) term studied by a number of authors [13, 12, 15, 14] in the context of ordinary channel coding (without list decoding). By (19), if a third-order coding rate is achievable, there exists a sequence of codes of sizes satisfying
(22) 
and having list sizes and average error probabilities not exceeding the prescribed values. The more stringent condition on the sequence of error probabilities in (21) (relative to (18)) is imposed because the error probability appears as the argument of the $\Phi^{-1}$ function, which forms part of the coefficient of the second-order term in the asymptotic expansion. If the weaker condition (18) were in place, the approximation error between the actual error probabilities and the target would affect the third-order term, which is the object of study here. The stronger condition in (21) ensures that the third-order (logarithmic) term is unaffected by this approximation error.
III Main Results for Channel Coding
In this section, we summarize the main results of this paper concerning channel coding with an erasure or list option. For simplicity, we assume that the DMC satisfies $V > 0$, though our results can be extended in a straightforward manner to the case where $V = 0$.
III-A Decoding with Erasure Option
Theorem 1.
For any ,
(23) 
where can be any element in .
A few comments are in order: First, Theorem 1 implies that if $M^*(n, \epsilon_u, \epsilon_t)$ is the maximum number of codewords that can be transmitted over $W$ with undetected and total error probabilities $\epsilon_u$ and $\epsilon_t$ respectively, then
(24) 
We see that the backoff from the capacity at blocklength $n$ is approximately $\sqrt{V/n}\,\Phi^{-1}(1-\epsilon_t)$, independent of $\epsilon_u$. (This backoff is positive for $\epsilon_t < 1/2$.) Observe that the second-order term does not depend on $\epsilon_u$, the undetected error probability. Only the total error probability $\epsilon_t$ comes into play in the asymptotic characterization. In fact, in the proof, we first argue that it suffices to show that $\sqrt{V}\,\Phi^{-1}(\epsilon_t)$ is an achievable erasure second-order coding rate when the undetected error probability vanishes asymptotically. Clearly, any erasure second-order coding rate achievable with vanishing undetected error probability is also achievable for any positive undetected error probability.
Second, in the direct part of the proof of Theorem 1, we use threshold decoding, i.e., we declare that a message is sent if its empirical mutual information is higher than a threshold. If no message's empirical mutual information exceeds the threshold, then an erasure is declared. This simple rule, though universal, is not the optimal one (in terms of minimizing the total error while holding the undetected error fixed). The optimal rule was derived using a generalized version of the Neyman-Pearson lemma by Forney [1, Thm. 1] and is stated as
(25) 
for some threshold . However, this rule appears to be difficult for secondorder analysis and is more amenable to error exponent analysis [5, 7]. Because the likelihood of the second most likely codeword is usually much higher than the rest (excluding the first), Forney also suggested the simpler but, in general, suboptimal rule [1, Eq. (11a)]
(26) 
We analyzed this rule in the asymptotic setting (i.e., as the blocklength tends to infinity) in the same way as one analyzes the random coding union (RCU) bound [10, Thm. 16] [12, Sec. 7] for ordinary channel coding, but the analysis is more involved than that of threshold decoding, which suffices for proving the second-order achievability of (23). What was somewhat surprising to the authors is that the optimal decoding scheme in (25) (a generalization of the Neyman-Pearson lemma to arbitrary non-negative measures) and the suboptimal scheme in (26) are somewhat difficult to analyze, but the simpler empirical mutual information thresholding rule can be shown to be second-order optimal. This is in contrast to error exponent analysis of erasure decoding, where the rules (25) and (26) and their variants are ubiquitously analyzed in the literature, e.g., [19, 3, 5, 6].
Third, the converse is based on Strassen's idea [8, Eq. (4.18)], establishing a clear link between point-to-point channel coding and binary hypothesis testing. Also see Kemperman's general converse bound [9, Lem. 3.1] and, for a more modern treatment, the various forms of the meta-converse in [10, Sec. III-E]. The hypothesis testing converse technique only depends on the total error probability, explaining the presence of $\epsilon_t$ and not $\epsilon_u$ in (23).
III-B The Expected Rate and An Example
We now compare the expected rate achieved using decoding with the erasure option to ordinary channel coding. Define
(27) 
as the erasure probability and note from the assumption of Theorem 1 that $\epsilon_e \le \epsilon_t$. Now consider sending $k$ independent blocks of information, each of length $n$. Because transmission succeeds (no erasure declared) with probability $1 - \epsilon_e$, the total number of bits we can transmit in each block is well approximated by the random variable
(28) 
This random variable has expectation
(29) 
Fix $\delta > 0$. By Hoeffding's inequality, the total number of bits we can transmit over the $k$ blocks is in the interval
(30) 
with probability exceeding $1 - 2\exp(-2k\delta^2)$. This reduction in rate in (29)-(30) by the factor $1 - \epsilon_e$ in the so-called single-codeword repetition scheme was first observed by Forney [1, Eq. (49)]. Essentially, one may use an automatic repeat request (ARQ) scheme to resend the entire block of information if there is an erasure. (Forney [1] calls this class of retransmission schemes decision feedback schemes, but nowadays the term ARQ is more common.)
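The accounting in (28)-(30) can be sketched numerically. The following Python fragment is our own illustration (the number of blocks, bits per block, and erasure probability passed in are hypothetical): it simulates the single-codeword repetition scheme and computes a Hoeffding-style confidence interval for the delivered bits.

```python
import random

def simulate_delivered_bits(k, bits_per_block, p_erasure, seed=0):
    """Total bits delivered over k blocks when each block is independently
    erased with probability p_erasure (an erased block delivers 0 bits and
    the same message is simply retransmitted in a later block)."""
    rng = random.Random(seed)
    return sum(0 if rng.random() < p_erasure else bits_per_block
               for _ in range(k))

def hoeffding_interval(k, bits_per_block, p_erasure, delta):
    """Interval containing the total delivered bits with probability at
    least 1 - 2*exp(-2*k*delta**2), by Hoeffding's inequality applied to
    the k bounded i.i.d. per-block contributions."""
    mean = k * (1 - p_erasure) * bits_per_block
    slack = k * delta * bits_per_block
    return mean - slack, mean + slack
```

Dividing the interval endpoints by the total number of channel uses recovers the expected-rate picture of (29)-(30).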
For ordinary channel coding with error probability , we can send approximately
(31) 
bits over the $k$ independent blocks and so, dividing by the total number of channel uses, the non-asymptotic channel coding rate (the analogue of (29)) can be approximated [10] by
(32) 
In the analysis in (28)-(32), we have assumed that the Gaussian approximation is sufficiently accurate. It was numerically shown in [10] that the Gaussian approximation is accurate for some channels (such as the binary symmetric channel, binary erasure channel and additive white Gaussian noise channel) at moderate blocklengths and error probabilities. Also see Fig. 1 for a precise quantification of the accuracy of the Gaussian approximation in our setting.
Clearly, if the error probabilities are held constant,
(33) 
so there is no advantage in allowing for erasures asymptotically. However, in the finite blocklength (by this we mean the per-block blocklength $n$) regime, for "moderate" erasure probabilities, we may have
(34) 
so erasure decoding may be advantageous in expectation. We illustrate the difference between the two rates with a concrete example.
Example: In Fig. 1, we consider a binary symmetric channel (BSC) with a fixed crossover probability; the corresponding capacity and dispersion are measured in bits/channel use. (For the BSC, the dispersion does not depend on the choice of capacity-achieving input distribution.) We keep the undetected error probability fixed and vary the erasure probability over an interval. We chose two blocklengths. We observe that for the shorter blocklength and a moderate erasure probability, the gain of coding with the erasure option over ordinary channel coding is rather pronounced. This can be seen by comparing either the Gaussian approximations or the finite blocklength bounds. This gain is reduced if (i) the erasure probability is increased, because we retransmit the whole block more often on average via the use of decision feedback, or (ii) the blocklength becomes large, so the second-order term becomes less significant (cf. (33)). We also note from the left plot that the Gaussian approximation is reasonably accurate when compared to the dependence-testing [10, Thm. 34] and meta-converse [10, Thm. 35] finite blocklength bounds under the current settings. The DT bound is especially close to the Gaussian approximation when the performance of coding with erasures peaks.
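The comparison in this example can be reproduced at the level of the Gaussian approximation. In the Python sketch below, the parameters (capacity and dispersion roughly matching a BSC(0.11), undetected error $10^{-4}$, erasure probability $0.1$, blocklength $500$) are assumed values for illustration, not the paper's exact settings. The expected rate with erasures is the throughput factor $(1-\epsilon_e)$ times the rate achievable at total error $\epsilon_u+\epsilon_e$.

```python
import math
from statistics import NormalDist

def gaussian_rate(n, C, V, eps):
    """Normal approximation C + sqrt(V/n) * Phi^{-1}(eps) to the coding rate."""
    return C + math.sqrt(V / n) * NormalDist().inv_cdf(eps)

def expected_rate_with_erasures(n, C, V, eps_u, eps_e):
    """Single-codeword repetition: succeed with probability 1 - eps_e at the
    rate achievable with total error eps_u + eps_e."""
    return (1 - eps_e) * gaussian_rate(n, C, V, eps_u + eps_e)

# Assumed BSC(0.11)-like parameters (C and V in bits, rounded for illustration).
C, V = 0.5, 0.89
n, eps_u, eps_e = 500, 1e-4, 0.1
R_erasure = expected_rate_with_erasures(n, C, V, eps_u, eps_e)
R_plain = gaussian_rate(n, C, V, eps_u)  # ordinary coding at error eps_u
```

Under these assumed numbers the erasure option wins, because $\Phi^{-1}(\epsilon_u+\epsilon_e)$ is much less negative than $\Phi^{-1}(\epsilon_u)$, which more than compensates for the $(1-\epsilon_e)$ throughput loss.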
Finally, we remark that the advantage of coding with the erasure option over ordinary channel coding was also shown from the error exponents perspective by Forney [1]. More precisely, Forney [1, Eq. (55)] showed that, for rates near and below capacity, the feedback exponent has a steeper slope than, and thus improves on, the ordinary exponent in the same region.
III-C List Decoding
Theorem 2.
For any , we have the second-order result
(35) 
where . Furthermore, if and , we also have the third-order result
(36) 
If the DMC is symmetric in the Gallager sense (the set of channel outputs can be partitioned into subsets such that, within each subset, every row of the matrix of transition probabilities is a permutation of every other row and every column is a permutation of every other column [18, Sec. 4.5, pp. 94]) and singular (for all $(x, x', y)$ with $W(y|x)W(y|x') > 0$, it holds that $W(y|x) = W(y|x')$), then we can make the stronger statement
(37) 
Theorem 2 shows that if we allow the list size to grow as $2^{\ell\sqrt{n}}$, then the second-order capacity is increased by $\ell$. This concurs with the intuition we obtain from the analysis of error exponents for list decoding by Shannon-Gallager-Berlekamp [2]. See also the exercises in Gallager [18, Ex. 5.20] and Csiszár-Körner [19, Ex. 10.28], which are concerned solely with first-order (capacity) analysis.
For the third-order result in (36), we observe that there is a gap. This is, in part, due to the fact that we use threshold decoding to decide which messages belong to the list. This decoding strategy appears to be suboptimal in the third-order sense for DMCs [12, Sec. 7] [24, Thm. 53] and AWGN channels [15]. It also appears that one needs to use a version of maximum-likelihood (ML) decoding as in (26) and analyze an analogue of the RCU bound [10, Thm. 16] carefully to obtain the additional term in the direct part. Whether this can be done for channel coding with list decoding to obtain a tight third-order result for general DMCs is an open question. Nonetheless, for singular and symmetric channels considered by Altuğ and Wagner [14], such as the BEC, the converse (upper bound) can be tightened, and this results in the conclusive result in (37).
A byproduct of the proof of Theorem 2 is the following non-asymptotic converse bound for list decoding, which may be of independent interest. Kostina-Verdú [25, Thm. 4] developed and used a version of this bound for the purposes of fixed error asymptotics of joint source-channel coding, just as Csiszár [26] also used a list decoder in his study of the error exponents for joint source-channel coding.
Proposition 3.
Let be the best (smallest) type-II error in a (deterministic) hypothesis test between and subject to the type-I error being no larger than . Every list code for satisfies
(38) 
This bound immediately reduces to the so-called meta-converse in [10, Sec. III-E] by setting the list size to one. We will see that the ratio of the number of messages to the list size plays a critical role in both the converse and direct parts. This is also evident in existing works such as Shannon-Gallager-Berlekamp [2, Sec. IV], but the non-asymptotic bound in (38) appears to be novel.
IV An Extension to Slepian-Wolf Coding
In this section, we show that the techniques developed are also applicable to lossless source coding with decoder side information, i.e., the Slepian-Wolf problem [17]. Let $(X, Y)$ be a correlated source where the alphabets $\mathcal{X}$ and $\mathcal{Y}$ are finite. The assumption that $\mathcal{X}$ and $\mathcal{Y}$ are finite sets can be dispensed with at the cost of non-universality in the coding scheme [27].
Definition 8.
An $(n, M)$ code for the correlated source is defined by two functions $f : \mathcal{X}^n \to [M]$ and $\varphi : [M] \times \mathcal{Y}^n \to \mathcal{X}^n \cup \{\mathrm{e}\}$, where $\mathrm{e}$ represents the erasure symbol.
Definition 9.
An code for the correlated source is an code satisfying
(39)  
(40) 
The parameters $\epsilon_u$ and $\epsilon_t$ are known as the undetected and total error probabilities respectively.
We consider discrete, stationary and memoryless sources in which the source alphabet is $\mathcal{X}$ and the side-information alphabet is $\mathcal{Y}$.
Definition 10.
A number is said to be an achievable second-order coding rate for the correlated source if there exists a sequence of codes such that
(41)  
(42)  
(43) 
The infimum of all achievable second-order coding rates is the optimum second-order coding rate.
Hence, we are allowing the erasure option for the Slepian-Wolf problem, but we restrict the undetected and total errors to be at most $\epsilon_u$ and $\epsilon_t$ respectively. The second-order asymptotics for the Slepian-Wolf problem with two encoders and without erasures was studied by Tan and Kosut in [16].
Define $V = \mathrm{Var}\big({-\log P_{X|Y}(X|Y)}\big)$ to be the conditional varentropy [28] of the stationary, memoryless source. The following result parallels Theorem 1 pertaining to channel coding with the erasure option.
Theorem 4.
For any ,
(44) 
The proof of Theorem 4 can be found in Section V-C. The direct part uses random binning [29] and thresholding of the empirical conditional entropy [16], and the converse uses a non-asymptotic information spectrum converse bound by Miyake and Kanaya [30]. Also see [31, Thm. 7.2.2].
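To make the flavor of the direct part concrete, here is a toy Python sketch (our own, not the paper's construction) of Slepian-Wolf coding with an erasure option for binary sources: the encoder assigns $x^n$ to one of `num_bins` bins via a fixed hash shared with the decoder (a deterministic stand-in for random binning), and the decoder outputs the unique sequence in the bin whose empirical conditional entropy given $y^n$ is below a threshold, declaring an erasure otherwise.

```python
import itertools
import math
from collections import Counter

def empirical_cond_entropy(xs, ys):
    """Empirical conditional entropy H(x|y), in bits, of the joint type of (xs, ys)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marg_y = Counter(ys)
    return -sum((c / n) * math.log2(c / marg_y[y]) for (_, y), c in joint.items())

def bin_index(xs, num_bins):
    """Deterministic stand-in for random binning (binary sequences only)."""
    return int("".join(map(str, xs)), 2) % num_bins

def sw_erasure_decode(b, ys, num_bins, threshold):
    """Return the unique sequence in bin b with empirical conditional entropy
    at most `threshold`; return None (erasure) if there is no such sequence
    or more than one."""
    cands = [xs for xs in itertools.product((0, 1), repeat=len(ys))
             if bin_index(xs, num_bins) == b
             and empirical_cond_entropy(xs, ys) <= threshold]
    return cands[0] if len(cands) == 1 else None
```

The brute-force search over all $2^n$ sequences is of course only feasible for toy blocklengths; it serves to illustrate the decision rule, not an efficient decoder.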
Again, we observe that the optimum second-order coding rate does not depend on the undetected error probability $\epsilon_u$ as long as it is smaller than the total error probability $\epsilon_t$. Hence, the observation we made in Fig. 1, namely that at finite blocklengths the expected performance with the erasure option can exceed that without the erasure option, also applies in the Slepian-Wolf setting.
V Proofs
In this section, we provide the proofs of the theorems in the paper.
V-A Decoding with Erasure Option: Proof of Theorem 1
V-A.1 Converse Part
Recall that is the best (smallest) type-II error in a hypothesis test (without randomization) of versus subject to the condition that the type-I error is no larger than . This function was studied extensively in [32]. Let us fix any erasure code for . This means that
(45) 
Note that only the total error comes into play in (45), and thus the second-order capacity in (23) only depends on $\epsilon_t$. In essence, an erasure code is also a channel code with the same number of codewords and maximum error probability at most the total error probability, so any converse for usual channel coding also applies here, with the error probability for usual channel coding being the total error in erasure decoding. We describe some details of the converse proof to make the paper self-contained.
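On finite alphabets, the deterministic type-II error quantity underlying (45) can be computed by a Neyman-Pearson-style greedy sweep. The sketch below is our own illustration (the function name and interface are not from the paper): it sorts outcomes by likelihood ratio and accumulates the smallest $Q$-mass of an acceptance set whose $P$-mass reaches the required level.

```python
import math

def beta_deterministic(p_target, P, Q):
    """Smallest Q(A) over deterministic acceptance sets A built greedily by
    descending likelihood ratio P(y)/Q(y), subject to P(A) >= p_target.
    P and Q are dicts over a common finite alphabet."""
    order = sorted(P, key=lambda y: P[y] / Q[y] if Q[y] > 0 else math.inf,
                   reverse=True)
    p_acc = q_acc = 0.0
    for y in order:
        if p_acc >= p_target:
            break
        p_acc += P[y]
        q_acc += Q[y]
    return q_acc
```

This greedy construction mirrors the Neyman-Pearson lemma restricted to deterministic tests; without randomization it may slightly overshoot the type-I constraint, which is consistent with the "without randomization" caveat above.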
We now assume that the code is constant composition, i.e., all codewords are of the same type $P$. This only leads to a penalty of $O(\log n)$ in the code size, which does not affect the second-order term. Now, for any permutation invariant output distribution (this means that the distribution is unchanged when its argument is permuted by any permutation), we have
(46)  
(47)  
(48)  
(49) 
where (46) follows because are disjoint; (47) follows from (45); (48) uses the definition of ; and finally (49) follows from the fact that does not depend on for permutation invariant (which is what we choose to be) and constant composition codes [8]. We also used to denote any element in the type class . Choose to be the product distribution . Since satisfies (13), for every , there exists sufficiently large such that . Hence, from (49),
(50) 
Now, we use [8, Sec. 2] to assert that for all ,
(51) 
Hence, by putting (50) and (51) together, we obtain
(52) 
By using the usual continuity arguments (e.g. [13, Lem. 7]),
(53) 
Thus, by letting ,
(54) 
The converse proof is complete for the case by taking . Note that even though is, in general, discontinuous at (i.e., when ), we may let and use the fact that to assert that as desired.
To get to the setting, first we let be the maximum number of codewords in an erasure code with undetected and total error probabilities respectively under the setting. By using an expurgation argument as in [10, Eq. (284)], it is easy to show that for all ,
(55) 
Setting proves the claim for all the other three cases.
V-A.2 Direct Part
It suffices to prove that is an achievable erasure second-order coding rate for . Indeed, here the total error . However, any achievable second-order coding rate is also an achievable second-order coding rate for any . Furthermore, by an expurgation argument similar to (55), the same statement can be proved under the setting.
For this proof, we show that the second-order capacity in (23) is also universally attainable, i.e., the code does not require channel knowledge. A simpler proof based on thresholding the likelihood can also be used; however, channel knowledge is then required. See Sec. V-B.3 for an analogue of the alternative strategy.
Fix a type . Generate codewords uniformly at random from the type class . Denote the random codebook as . The number is to be chosen later. Let be some threshold to be chosen later. At the receiver, we decode to if and only if is the unique message to satisfy
(56) 
where is the empirical mutual information of , i.e., the mutual information of the random variables whose distribution is the joint type . Assume as usual that the true message is sent. We use the following elementary result, which is shown in the proof of the packing lemma [19, Lem. 10.1].
Lemma 5.
Let be any type. Let and be selected independently and uniformly at random from the type class . Let be the channel output when is the input, i.e. . Then, for every and every ,
(57) 
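The empirical mutual information used in the threshold rule (56) is a simple function of the joint type. Below is a minimal Python sketch (the function names are ours) of this quantity and of the resulting erasure-declaring threshold decoder, which outputs a message only when it is the unique one clearing the threshold.

```python
import math
from collections import Counter

def empirical_mutual_information(xs, ys):
    """Mutual information (bits) of a pair distributed as the joint type of (xs, ys)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(n * c / (px[x] * py[y]))
               for (x, y), c in joint.items())

def threshold_decode(codebook, ys, gamma):
    """Return the unique message index whose codeword's empirical mutual
    information with ys exceeds gamma; otherwise declare an erasure (None)."""
    hits = [m for m, xs in enumerate(codebook)
            if empirical_mutual_information(xs, ys) > gamma]
    return hits[0] if len(hits) == 1 else None
```

Note that the rule is universal in the sense emphasized above: neither function consults the channel law, only the joint type of the codeword and the received sequence.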
The undetected error probability is bounded as
(58)  
(59)  
(60) 
where (59) follows from the union bound and the fact that the codewords are generated in an identical manner, and (60) follows from Lemma 5 noting that is independent of , the channel output when is the input.
The erasure event can be expressed as
(61) 
where
(62)  
(63) 
Clearly, we have that
(64)  
(65) 
Note that the probability of was already bounded in (58)–(60). Hence it remains to upper bound the probability of defined in (64). We let the random conditional type of given be . Then, we have
(66)  
(67) 
where the final step follows by a Taylor expansion. We can also bound the remainder term uniformly [33], yielding
(68) 
Wang-Ingber-Kochman [33] computed the relevant first-, second- and third-order statistics of the random variable, allowing us to apply the Berry-Esseen theorem [34, Ch. XVI.5] to the probability in (68), leading to