# Fixed Error Probability Asymptotics For Erasure and List Decoding

Vincent Y. F. Tan and Pierre Moulin

V. Y. F. Tan is with the Department of Electrical and Computer Engineering (ECE) and the Department of Mathematics, National University of Singapore (Email: vtan@nus.edu.sg). P. Moulin is with the Department of Electrical and Computer Engineering (ECE), University of Illinois at Urbana-Champaign (Email: moulin@ifp.uiuc.edu).

###### Abstract

We derive the optimum second-order coding rates, known as second-order capacities, for erasure and list decoding. For erasure decoding for discrete memoryless channels, we show that the second-order capacity is $\sqrt{V}\Phi^{-1}(\epsilon_t)$, where $V$ is the channel dispersion and $\epsilon_t$ is the total error probability, i.e., the sum of the erasure and undetected error probabilities. This total error probability, like the other error probabilities in this paper, is fixed at a non-zero constant. We show numerically that the expected rate at finite blocklength for erasure decoding can exceed the finite blocklength channel coding rate. We also show that the analogous result holds for lossless source coding with decoder side information, i.e., Slepian-Wolf coding. For list decoding, we consider list codes of deterministic sizes that scale as $\exp(\sqrt{n}\,l)$ and show that the corresponding second-order capacity is $l+\sqrt{V}\Phi^{-1}(\epsilon)$, where $\epsilon$ is the permissible error probability, i.e., the probability that the true message is not in the list. We also consider lists of polynomial size $n^{\alpha}$ and derive bounds on the third-order coding rate in terms of the order of the polynomial $\alpha$. These bounds are tight for symmetric and singular channels. The direct parts of the coding theorems leverage the simple threshold decoder, and the converses are proved using variants of the hypothesis testing converse.

## I Introduction

In many communication scenarios, it is advantageous to allow the decoder to have the option of either not deciding at all or putting out more than one estimate of the message. These are known as erasure and list decoding respectively and have been studied extensively in the information theory literature [1, 2, 3, 4, 5, 6, 7]. The erasure and list options generally allow for smaller undetected error probabilities, so these options are useful in practice.

In this paper, we revisit the problem of erasure and list decoding from the viewpoint of second- and third-order asymptotics. The study of second-order asymptotics for fixed (non-vanishing) error probability was first done by Strassen [8, Thm. 1.2], who showed for a well-behaved discrete memoryless channel (DMC) $W$ that the maximum number of codewords at (average or maximum) error probability $\epsilon$, namely $M^*(W^n,\epsilon)$, satisfies

$$\log M^*(W^n,\epsilon)=nC+\sqrt{nV}\,\Phi^{-1}(\epsilon)+O(\log n), \tag{1}$$

where $C$ and $V$ are respectively the capacity and the dispersion of $W$, and $\Phi^{-1}$ is the inverse of the standard Gaussian cumulative distribution function (cdf). Also see Kemperman [9, Thm. 11.3] for the corresponding result for constant composition codes. This line of work has been revisited by numerous authors recently, who considered various other channel models such as the additive white Gaussian noise (AWGN) channel [10, 11], bounds on the third-order (logarithmic) term [12, 13, 14, 15] and extensions to multi-terminal problems such as the two-encoder Slepian-Wolf problem and the multiple-access channel.
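To give a feel for the magnitudes in (1), the following Python sketch evaluates the Gaussian approximation $nC+\sqrt{nV}\,\Phi^{-1}(\epsilon)$ with the $O(\log n)$ term dropped. The function name `normal_approx_log_m` and the numerical values of the capacity and dispersion used below are illustrative assumptions, not quantities taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_log_m(n, eps, capacity, dispersion):
    """Gaussian approximation n*C + sqrt(n*V)*Phi^{-1}(eps) of log M*(W^n, eps).

    The O(log n) remainder in (1) is ignored, so this is only a sketch of the
    first two terms of the expansion.
    """
    return n * capacity + sqrt(n * dispersion) * NormalDist().inv_cdf(eps)


# Hypothetical channel with C = 0.5 and V = 0.9 (in bits and bits^2).
approx_bits = normal_approx_log_m(n=1000, eps=1e-3, capacity=0.5, dispersion=0.9)
print(f"approximate log M* at n=1000, eps=1e-3: {approx_bits:.1f} bits")
```

For $\epsilon<1/2$ the second term is negative, so the approximation lies below $nC$; for $\epsilon>1/2$ it lies above.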

### I-A Main Contributions

In this paper, for erasure decoding, we consider constant undetected and total (sum of undetected and erasure) error probabilities (numbers between $0$ and $1$) and we obtain the analogue of the second-order term in (1). We show that the coefficient of the second-order term, termed the second-order capacity, is $\sqrt{V_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)$, where $\epsilon_t$ is the total error probability. The second-order capacity is thus completely independent of the undetected error probability. We then compute the expected rate at finite blocklength allowing erasures and show that it can exceed the finite blocklength rate without the erasure option, i.e., usual channel coding. We show that these results carry over in a straightforward manner to the problem of lossless source coding with side information, i.e., the Slepian-Wolf problem, which was previously studied for the case without the erasure option.

For list decoding, we consider lists of deterministic size of order $\exp(\sqrt{n}\,l)$ and show that the second-order capacity is $l+\sqrt{V_{\epsilon}}\,\Phi^{-1}(\epsilon)$, where $\epsilon$ is the permissible error probability. We also consider lists of polynomial size $n^{\alpha}$ and demonstrate bounds on the third-order term in terms of the degree of the polynomial $\alpha$. These bounds turn out to be tight for channels that are symmetric and singular, a canonical example being the binary erasure channel (BEC). To the best of the authors' knowledge, this is the first time that lists of size other than constant or exponential have been considered in the literature. Practically, the advantage of smaller lists is that they result in lower search complexity for the true message within the decoded list.

### I-B Related Work

Previously, erasure and list decoding have been studied primarily from the error exponents perspective. We summarize some existing works here. Forney [1] derived optimal decision rules by generalizing the Neyman-Pearson lemma and also proved exponential upper bounds for the error probabilities using Gallager's Chernoff bounding techniques. Shannon-Gallager-Berlekamp [2] proved exponential lower bounds for the error probabilities and also considered lists of exponential size. They showed that the sphere packing error exponent (evaluated at the code rate minus the exponential growth rate of the list size) is an upper bound on the reliability function [19, Ex. 10.28]. Bounds for the error probabilities were derived by Telatar using a general decoder parametrized by an asymmetric relation which is a function of the channel law. Blinovsky studied the exponents of the list decoding problem at low (and even zero) rate. Csiszár-Körner [19, Thm. 10.11] present exponential upper bounds for universally attainable erasure decoding using the method of types. Moulin generalized the treatment there and presented improved error exponents under some conditions. Recently, Merhav also considered alternative methods of analysis and expurgated exponents for these problems. The same author also derived erasure and list exponents for the Slepian-Wolf problem.

Another related line of work concerns constant error probability non-asymptotic fundamental limits of channel coding with various forms of feedback. Polyanskiy-Poor-Verdú studied various incremental redundancy schemes and derived performance bounds under receiver- and transmitter-confirmation. In incremental redundancy schemes, one is allowed to transmit a sequence of coded symbols with numerous opportunities for confirmation before the codeword is resent if necessary. In contrast, in our study of channel coding with the erasure option in Sec. III-B, we analyze the expected performance of the easily implementable Forney-style [1] single-codeword repetition scheme that allows for confirmation (or erasure) at the end of a complete codeword block and repeats the same message until the erasure event no longer occurs. We compare a quantity termed the expected rate to the ordinary channel coding rate. Inspired by the work of Polyanskiy-Poor-Verdú, Chen et al. derived non-asymptotic bounds for feedback systems using incremental redundancy with noiseless transmitter confirmation. Williamson-Chen-Wesel also improved on the bounds of Polyanskiy-Poor-Verdú for variable-length feedback coding.

### I-C Structure of Paper

This paper is structured as follows: In Sec. II, we set the stage by introducing our notation and defining relevant quantities such as the second-order capacity. In Sec. III, we state our main results for channel coding with the erasure and list options. In Sec. IV, we show that the channel coding results carry over to the Slepian-Wolf problem. All the proofs are detailed in Sec. V.

## II Problem Setting and Main Definitions

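Let $W$ be a random transformation (channel) from a discrete input alphabet $\mathcal{X}$ to a discrete output alphabet $\mathcal{Y}$. We denote length-$n$ deterministic (resp. random) strings $\mathbf{x}=(x_1,\ldots,x_n)\in\mathcal{X}^n$ (resp. $\mathbf{X}=(X_1,\ldots,X_n)$) by lower case (resp. upper case) boldface. If $W^n$ satisfies $W^n(\mathbf{y}|\mathbf{x})=\prod_{i=1}^n W(y_i|x_i)$ for every $(\mathbf{x},\mathbf{y})\in\mathcal{X}^n\times\mathcal{Y}^n$ and the sets $\mathcal{X}$ and $\mathcal{Y}$ are finite, $W$ is said to be a DMC. We focus on DMCs in this paper but extensions to other channels such as the AWGN channel are straightforward. For a sequence $\mathbf{x}\in\mathcal{X}^n$, its type is the empirical distribution $P_{\mathbf{x}}(a)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i=a\}$. For a finite alphabet $\mathcal{X}$, let $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}_n(\mathcal{X})$ be the set of probability mass functions and $n$-types (types with denominator at most $n$) respectively. For two sequences $(\mathbf{x},\mathbf{y})\in\mathcal{X}^n\times\mathcal{Y}^n$, we say that $V$ is a conditional type of $\mathbf{y}$ given $\mathbf{x}$ if $P_{\mathbf{x}}(a)V(b|a)=P_{\mathbf{x},\mathbf{y}}(a,b)$ for all $(a,b)\in\mathcal{X}\times\mathcal{Y}$. ($V$ is not unique if $P_{\mathbf{x}}(a)=0$ for some $a\in\mathcal{X}$.) The $V$-shell of a sequence $\mathbf{x}$, denoted as $\mathcal{T}_V(\mathbf{x})$, is the set of all $\mathbf{y}$ such that the joint type of $(\mathbf{x},\mathbf{y})$ is $P_{\mathbf{x}}\times V$. The set of all conditional types $V$ for which $\mathcal{T}_V(\mathbf{x})$ is non-empty for some $\mathbf{x}$ with type $P$ is denoted as $\mathcal{V}_n(\mathcal{Y};P)$. All logs are to the base $2$ with the understanding that $\exp(t)=2^t$.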

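For information-theoretic quantities, we will mostly follow the notation in Csiszár and Körner [19]. We denote the information capacity of the DMC $W$ as $C=\max_{P\in\mathcal{P}(\mathcal{X})} I(P,W)$. We let $\Pi:=\{P\in\mathcal{P}(\mathcal{X}): I(P,W)=C\}$ be the set of capacity-achieving input distributions. If $(X,Y)$ has joint distribution $P\times W$, define $PW(y):=\sum_{x}P(x)W(y|x)$ to be the induced output distribution and

$$V(P,W):=\mathbb{E}_X\left[\mathrm{Var}\left(\log\frac{W(Y|X)}{PW(Y)}\,\Big|\,X\right)\right] \tag{2}$$

to be the conditional information variance. The $\epsilon$-dispersion of the DMC [8, 11, 10] is defined as

$$V_\epsilon:=\begin{cases} V_{\min}:=\min_{P\in\Pi}V(P,W) & \epsilon<1/2 \\ V_{\max}:=\max_{P\in\Pi}V(P,W) & \epsilon\ge 1/2. \end{cases} \tag{3}$$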
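The quantities $I(P,W)$ and $V(P,W)$ in (2) are straightforward to evaluate numerically for a given input distribution. The sketch below is one possible implementation under stated assumptions: the channel is represented as a row-stochastic nested list `W[x][y]`, logs are taken to base 2, and the function names are hypothetical. It does not perform the optimization over the capacity-achieving set needed for (3); for symmetric channels such as the BSC or BEC, the uniform input is capacity-achieving, so evaluating at the uniform $P$ suffices there.

```python
from math import log2


def output_distribution(P, W):
    """Induced output distribution PW(y) = sum_x P(x) W(y|x)."""
    return [sum(P[x] * W[x][y] for x in range(len(P))) for y in range(len(W[0]))]


def mutual_information(P, W):
    """I(P, W) in bits; terms with zero probability are skipped."""
    PW = output_distribution(P, W)
    return sum(
        P[x] * W[x][y] * log2(W[x][y] / PW[y])
        for x in range(len(P))
        for y in range(len(PW))
        if P[x] > 0 and W[x][y] > 0
    )


def conditional_information_variance(P, W):
    """V(P, W) of (2): the average over x of Var(log W(Y|x)/PW(Y)) under W(.|x)."""
    PW = output_distribution(P, W)
    total = 0.0
    for x, px in enumerate(P):
        if px == 0:
            continue
        # Pairs (probability, information density) for outputs reachable from x.
        dens = [(W[x][y], log2(W[x][y] / PW[y])) for y in range(len(PW)) if W[x][y] > 0]
        mean = sum(w * i for w, i in dens)
        total += px * sum(w * (i - mean) ** 2 for w, i in dens)
    return total


# Binary symmetric channel with an illustrative crossover probability q = 0.11.
q = 0.11
bsc = [[1 - q, q], [q, 1 - q]]
uniform = [0.5, 0.5]
print(mutual_information(uniform, bsc), conditional_information_variance(uniform, bsc))
```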

We will assume throughout that the DMC satisfies . For integers , we denote and . Let be the cdf of a standard Gaussian and its inverse. We now define erasure codes.

###### Definition 1.

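An $M$-erasure code for $W$ is a pair of mappings $(f,\varphi)$ such that $f:[M]\to\mathcal{X}$ and $\varphi:\mathcal{Y}\to[0:M]$. The disjoint decoding regions are denoted as $\mathcal{D}_m:=\varphi^{-1}(m)$ for $m\in[M]$; the erasure region is denoted as $\mathcal{D}_0:=\varphi^{-1}(0)$; and the conditional undetected, erasure and total error probabilities are defined as

$$\begin{aligned} \lambda_u(m) &:=\sum_{\tilde{m}\in[M]\setminus\{m\}}W(\mathcal{D}_{\tilde{m}}|f(m)) && (4)\\ \lambda_e(m) &:=W(\mathcal{D}_0|f(m)) && (5)\\ \lambda_t(m) &:=\sum_{\tilde{m}\in[0:M]\setminus\{m\}}W(\mathcal{D}_{\tilde{m}}|f(m)) && (6) \end{aligned}$$

Note that $\lambda_t(m)=\lambda_u(m)+\lambda_e(m)$. Typically, the code is designed so that $\lambda_u(m)\ll\lambda_e(m)$, as the cost of making an undetected error is much higher than that of declaring an erasure.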
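As a concrete illustration of (4)–(6), the sketch below enumerates all channel outputs for a toy erasure code. Everything in it is an assumption made for illustration: a length-3 binary repetition code with $M=2$, a binary symmetric channel with crossover probability `q`, and a hypothetical `strict_decoder` that erases unless the received word is exactly a codeword. The decoder returns `0` for an erasure and a message index in $[M]$ otherwise.

```python
from itertools import product


def channel_prob(y, x, q):
    """Probability that a BSC(q) maps the input word x to the output word y."""
    flips = sum(a != b for a, b in zip(x, y))
    return q ** flips * (1 - q) ** (len(x) - flips)


def erasure_error_probs(codebook, decoder, q):
    """Return a list of (lambda_u, lambda_e, lambda_t), one triple per message."""
    n = len(codebook[0])
    results = []
    for m, x in enumerate(codebook, start=1):
        lam_u = lam_e = 0.0
        for y in product((0, 1), repeat=n):
            m_hat = decoder(y)
            p = channel_prob(y, x, q)
            if m_hat == 0:          # y falls in the erasure region D_0
                lam_e += p
            elif m_hat != m:        # y falls in a wrong decoding region
                lam_u += p
        results.append((lam_u, lam_e, lam_u + lam_e))
    return results


codebook = [(0, 0, 0), (1, 1, 1)]


def strict_decoder(y):
    """Decode only on an exact codeword match; otherwise declare an erasure."""
    if y == (0, 0, 0):
        return 1
    if y == (1, 1, 1):
        return 2
    return 0


for lam_u, lam_e, lam_t in erasure_error_probs(codebook, strict_decoder, q=0.1):
    print(lam_u, lam_e, lam_t)
```

With this conservative decoder the undetected error probability is $q^3$, far smaller than the erasure probability, which is the regime described above.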

###### Definition 2.

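An $(M,\epsilon_u,\epsilon_t)_{a,b}$-erasure code for $W$ is an $M$-erasure code for the same channel where

1. If $(a,b)=(\max,\max)$,

   $$\max_{m\in[M]}\lambda_u(m)\le\epsilon_u,\qquad \max_{m\in[M]}\lambda_t(m)\le\epsilon_t. \tag{7}$$
2. If $(a,b)=(\max,\mathrm{ave})$,

   $$\max_{m\in[M]}\lambda_u(m)\le\epsilon_u,\qquad \frac{1}{M}\sum_{m\in[M]}\lambda_t(m)\le\epsilon_t. \tag{8}$$
3. If $(a,b)=(\mathrm{ave},\max)$,

   $$\frac{1}{M}\sum_{m\in[M]}\lambda_u(m)\le\epsilon_u,\qquad \max_{m\in[M]}\lambda_t(m)\le\epsilon_t. \tag{9}$$
4. If $(a,b)=(\mathrm{ave},\mathrm{ave})$,

   $$\frac{1}{M}\sum_{m\in[M]}\lambda_u(m)\le\epsilon_u,\qquad \frac{1}{M}\sum_{m\in[M]}\lambda_t(m)\le\epsilon_t. \tag{10}$$

In Definition 2, we consider erasure codes with constraints on the undetected and total error probabilities similar to [6, 1, 5]. An alternate formulation would be to consider $(M,\epsilon_u,\epsilon_e)_{a,b}$-erasure codes where $\epsilon_e$ is the erasure probability. We find the former formulation more traditional and the analysis is also somewhat easier.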

###### Definition 3.

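A number $r\in\mathbb{R}$ is an $(\epsilon_u,\epsilon_t)_{a,b}$-achievable erasure second-order coding rate for the DMC $W$ with capacity $C$ if there exists a sequence of $(M_n,\epsilon_{u,n},\epsilon_{t,n})_{a,b}$-erasure codes for $W^n$ such that

$$\begin{aligned} \liminf_{n\to\infty}\frac{1}{\sqrt{n}}(\log M_n-nC) &\ge r, && (11)\\ \limsup_{n\to\infty}\epsilon_{u,n} &\le\epsilon_u,\quad\text{and} && (12)\\ \limsup_{n\to\infty}\epsilon_{t,n} &\le\epsilon_t. && (13) \end{aligned}$$

The $(\epsilon_u,\epsilon_t)_{a,b}$-erasure second-order capacity $r^*_{\mathrm{era},a,b}(\epsilon_u,\epsilon_t)$ is the supremum of all $(\epsilon_u,\epsilon_t)_{a,b}$-achievable erasure second-order coding rates.

We now turn our attention to codes which allow their decoders to output a list of messages. Let $\binom{[M]}{L}$ be the set of subsets of $[M]$ of size $L$. Furthermore, we use the notation $\binom{[M]}{\le L}$ to denote the set of subsets of $[M]$ of size not exceeding $L$.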

###### Definition 4.

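An $(M,L)$-list code for $W$ is a pair of mappings $(f,\varphi)$ such that $f:[M]\to\mathcal{X}$ and $\varphi:\mathcal{Y}\to\binom{[M]}{L}$. The (not-necessarily disjoint) decoding regions are denoted as $\mathcal{D}_m:=\{y\in\mathcal{Y}: m\in\varphi(y)\}$ and the conditional error probability is defined as

$$\lambda(m):=W(\mathcal{Y}\setminus\mathcal{D}_m|f(m)). \tag{14}$$

###### Definition 5.

An $(M,L,\epsilon)_a$-list code for $W$ is an $(M,L)$-list code for the same channel where if $a=\max$,

$$\max_{m\in[M]}\lambda(m)\le\epsilon, \tag{15}$$

or if $a=\mathrm{ave}$,

$$\frac{1}{M}\sum_{m\in[M]}\lambda(m)\le\epsilon. \tag{16}$$

###### Definition 6.

A number $r\in\mathbb{R}$ is an $(l,\epsilon)_a$-achievable list second-order coding rate for the DMC $W$ with capacity $C$ if there exists a sequence of $(M_n,L_n,\epsilon_n)_a$-list codes for $W^n$ such that, in addition to (11), the following hold:

$$\begin{aligned} \limsup_{n\to\infty}\frac{1}{\sqrt{n}}\log L_n &\le l,\quad\text{and} && (17)\\ \limsup_{n\to\infty}\epsilon_n &\le\epsilon. && (18) \end{aligned}$$

The $(l,\epsilon)_a$-list second-order capacity $r^*_{\mathrm{list},a}(l,\epsilon)$ is the supremum of all $(l,\epsilon)_a$-achievable list second-order coding rates.

According to (17), we stipulate that the list size grows as $\exp(\sqrt{n}\,l)$. This differs from previous works on list decoding in which the list size is either constant [18, 7] or exponential [19, 3, 5, 2, 6, 7, 4, 20]. This scaling affects the second-order (dispersion) term. To understand how the list size may affect the higher-order term, consider the following definition.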

###### Definition 7.

A number $s\in\mathbb{R}$ is an $(\alpha,\epsilon)$-achievable list third-order coding rate for the DMC $W$ with capacity $C$ and positive $\epsilon$-dispersion $V_\epsilon$ if there exists a sequence of $(M_n,L_n,\epsilon_n)_{\mathrm{ave}}$-list codes for $W^n$ such that

$$\begin{aligned} \liminf_{n\to\infty}\frac{1}{\log n}\left(\log M_n-nC-\sqrt{nV_{\epsilon_n}}\,\Phi^{-1}(\epsilon_n)\right) &\ge s, && (19)\\ \limsup_{n\to\infty}\frac{\log L_n}{\log n} &\le\alpha,\quad\text{and} && (20)\\ \limsup_{n\to\infty}\sqrt{n}\,(\epsilon_n-\epsilon) &<\infty. && (21) \end{aligned}$$

The $(\alpha,\epsilon)$-list third-order capacity $s^*_{\mathrm{list}}(\alpha,\epsilon)$ is the supremum of all $(\alpha,\epsilon)$-achievable list third-order coding rates.

Inequality (20) implies that the size of the list grows polynomially and in particular it scales as $O(n^{\alpha})$. This scaling affects the third-order (logarithmic) term studied by a number of authors [13, 12, 15, 14] in the context of ordinary channel coding (without list decoding). By (19), if $s$ is $(\alpha,\epsilon)$-achievable, there exists a sequence of codes of sizes $M_n$ satisfying

$$\log M_n\ge nC+\sqrt{nV_{\epsilon_n}}\,\Phi^{-1}(\epsilon_n)+s\log n+o(\log n), \tag{22}$$

and having list sizes $O(n^{\alpha})$ and average error probabilities not exceeding $\epsilon+O(1/\sqrt{n})$. The more stringent condition on the sequence of error probabilities in (21) (relative to (18)) is imposed because $\epsilon_n$ appears as the argument of the $\Phi^{-1}$ function, which forms part of the coefficient of the second-order term in the asymptotic expansion of $\log M_n$. If the weaker condition (18) were in place, the approximation error between $\epsilon_n$ and the target $\epsilon$, which is only of the order $o(1)$, would affect the third-order term, which is the object of study here. The stronger condition in (21) ensures that the third-order (logarithmic) term is unaffected by the approximation error between $\epsilon_n$ and $\epsilon$, which is now of the order $O(1/\sqrt{n})$.
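The role of condition (21) can be checked numerically. The sketch below, with hypothetical function names and illustrative parameter values, computes the shift $\sqrt{nV}\,(\Phi^{-1}(\epsilon+\text{gap})-\Phi^{-1}(\epsilon))$ that a perturbation of the error probability induces in the second-order term. When the gap is $c/\sqrt{n}$ the shift stays bounded, approaching $c\sqrt{V}/\phi(\Phi^{-1}(\epsilon))$ by a first-order Taylor expansion (with $\phi$ the standard Gaussian density), so it is $o(\log n)$. A gap decaying more slowly, such as $c/n^{1/4}$, produces a shift that outgrows $\log n$.

```python
from math import log2, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def second_order_shift(n, eps, gap, dispersion):
    """Change in sqrt(n V) Phi^{-1}(.) when eps is perturbed to eps + gap."""
    return sqrt(n * dispersion) * (
        _STD_NORMAL.inv_cdf(eps + gap) - _STD_NORMAL.inv_cdf(eps)
    )


def limiting_shift(eps, c, dispersion):
    """Limit of the shift as n grows when gap = c / sqrt(n)."""
    return c * sqrt(dispersion) / _STD_NORMAL.pdf(_STD_NORMAL.inv_cdf(eps))


eps, c, V = 0.1, 0.1, 1.0  # illustrative values only
for n in (10**4, 10**6, 10**8):
    fast = second_order_shift(n, eps, c / sqrt(n), V)    # gap of order n^{-1/2}
    slow = second_order_shift(n, eps, c / n**0.25, V)    # gap of order n^{-1/4}
    print(n, round(fast, 4), round(slow / log2(n), 4))
```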

## III Main Results for Channel Coding

In this section, we summarize the main results of this paper concerning channel coding with an erasure or list option. For simplicity, we assume that the DMC satisfies $V_{\min}>0$, though our results can be extended in a straightforward manner to the case where $V_{\min}=0$.

### III-A Decoding with Erasure Option

###### Theorem 1.

For any $0\le\epsilon_u<\epsilon_t<1$,

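$$r^*_{\mathrm{era},a,b}(\epsilon_u,\epsilon_t)=\sqrt{V_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t), \tag{23}$$

where $(a,b)$ can be any element in $\{\max,\mathrm{ave}\}^2$.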

The proof of Theorem 1 can be found in Section V-A.

A few comments are in order: First, Theorem 1 implies that if $M^*(W^n;\epsilon_u,\epsilon_t)$ is the maximum number of codewords that can be transmitted over $W^n$ with undetected and total error probabilities $\epsilon_u$ and $\epsilon_t$ respectively, then

$$\log M^*(W^n;\epsilon_u,\epsilon_t)=nC+\sqrt{nV_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)+o(\sqrt{n}). \tag{24}$$

We see that the backoff from the capacity at blocklength $n$ is approximately $-\sqrt{V_{\epsilon_t}/n}\,\Phi^{-1}(\epsilon_t)$, independent of $\epsilon_u$. (This backoff is positive for $\epsilon_t<1/2$.) Observe that the second-order term does not depend on $\epsilon_u$, the undetected error probability. Only the total error probability comes into play in the asymptotic characterization of $\log M^*(W^n;\epsilon_u,\epsilon_t)$. In fact, in the proof, we first argue that it suffices to show that $\sqrt{V_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)$ is an achievable $(0,\epsilon_t)_{a,b}$-erasure second-order coding rate for $W$, i.e., the undetected error probability is asymptotically $0$. Clearly, any achievable $(0,\epsilon_t)_{a,b}$-erasure second-order coding rate is also an achievable $(\epsilon_u,\epsilon_t)_{a,b}$-erasure second-order coding rate for any $\epsilon_u\in[0,\epsilon_t)$.

Second, in the direct part of the proof of Theorem 1, we use threshold decoding, i.e., we declare that message $m$ is sent if it is the unique message whose empirical mutual information with the channel output exceeds a threshold. If no message's empirical mutual information exceeds the threshold, or if more than one does, an erasure is declared. This simple rule, though universal, is not the optimal one (in terms of minimizing the total error while holding the undetected error fixed). The optimal rule was derived using a generalized version of the Neyman-Pearson lemma by Forney [1, Thm. 1] and it is stated as

$$\mathcal{D}^*_m:=\left\{\mathbf{y}: W^n(\mathbf{y}|f(m))\ge\psi\sum_{\tilde{m}\in[M]\setminus\{m\}}W^n(\mathbf{y}|f(\tilde{m}))\right\}, \tag{25}$$

for some threshold $\psi$. However, this rule appears to be difficult to handle in a second-order analysis and is more amenable to error exponent analysis [5, 7]. Because the likelihood of the second most likely codeword is usually much higher than the rest (excluding the first), Forney also suggested the simpler but, in general, suboptimal rule [1, Eq. (11a)]

$$\mathcal{D}'_m:=\left\{\mathbf{y}: W^n(\mathbf{y}|f(m))\ge\psi\max_{\tilde{m}\in[M]\setminus\{m\}}W^n(\mathbf{y}|f(\tilde{m}))\right\}. \tag{26}$$

We analyzed this rule in the asymptotic setting (i.e., as $n$ tends to infinity) in the same way as one analyzes the random coding union (RCU) bound [10, Thm. 16] [12, Sec. 7] for ordinary channel coding, but the analysis is more involved than that of threshold decoding, which suffices for proving the second-order achievability of (23). What was somewhat surprising to the authors is that the optimal decoding scheme in (25) (a generalization of the Neyman-Pearson lemma to arbitrary non-negative measures) and the suboptimal scheme in (26) are somewhat difficult to analyze, but the simpler empirical mutual information thresholding rule can be shown to be second-order optimal. This is in contrast to error exponent analysis of erasure decoding where the rules (25) and (26) and their variants are ubiquitously analyzed in the literature, e.g., [19, 3, 5, 6].

Third, the converse is based on Strassen's idea [8, Eq. (4.18)], establishing a clear link between point-to-point channel coding and binary hypothesis testing. Also see Kemperman's general converse bound [9, Lem. 3.1] and, for a more modern treatment, the various forms of the meta-converse in [10, Sec. III-E]. The hypothesis testing converse technique only depends on the total error probability, explaining the presence of $\epsilon_t$ and not $\epsilon_u$ in (23).

Finally, we remark that Theorem 1 (as well as Theorem 2 to follow, bar the statement in (37)) carries over verbatim to the AWGN channel, where the capacity and dispersion are $C=\frac{1}{2}\log(1+S)$ and $V=\log^2 e\cdot\frac{S(S+2)}{2(S+1)^2}$ respectively and $S$ is the signal-to-noise ratio (SNR) of the AWGN channel.

### III-B The Expected Rate and An Example

We now compare the expected rate achieved using decoding with the erasure option to ordinary channel coding. Define

$$\epsilon_e:=\epsilon_t-\epsilon_u \tag{27}$$

as the erasure probability and note from the assumption of Theorem 1 that $\epsilon_e>0$. Now consider sending $b$ independent blocks of information each of length $n$. Because transmission succeeds (no erasure declared) with probability $1-\epsilon_e$, the number of bits per channel use we can transmit in each block is well approximated by the random variable

$$R_e^{(n)}:=\begin{cases} C+\sqrt{V_{\epsilon_t}/n}\,\Phi^{-1}(\epsilon_t) & \text{w.p. } 1-\epsilon_e \\ 0 & \text{w.p. } \epsilon_e. \end{cases} \tag{28}$$

This random variable has expectation

$$\mathbb{E}\big[R_e^{(n)}\big]=(1-\epsilon_e)\left(C+\sqrt{V_{\epsilon_t}/n}\,\Phi^{-1}(\epsilon_t)\right). \tag{29}$$

Fix $\delta>0$. By Hoeffding's inequality, the total number of bits we can transmit over the $b$ blocks is in the interval

$$\left[(1-\epsilon_e-\delta)\cdot b\cdot\left(nC+\sqrt{nV_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)\right),\ (1-\epsilon_e+\delta)\cdot b\cdot\left(nC+\sqrt{nV_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)\right)\right] \tag{30}$$

with probability exceeding $1-2\exp(-2b\delta^2)$. This reduction in rate in (29)–(30) by the factor of $1-\epsilon_e$ in the so-called single-codeword repetition scheme was first observed by Forney [1, Eq. (49)]. Essentially, one may use an automatic repeat request (ARQ) scheme to resend the entire block of information if there is an erasure. (Forney [1] calls this class of retransmission schemes decision feedback schemes in his paper, but nowadays the term ARQ is more common.)

For ordinary channel coding with error probability $\epsilon_u$, we can send approximately

$$b\cdot\left(nC+\sqrt{nV_{\epsilon_u}}\,\Phi^{-1}(\epsilon_u)\right) \tag{31}$$

bits over the $b$ independent blocks and so, dividing by $nb$, the non-asymptotic channel coding rate (analogue of (29)) can be approximated by

$$R_c^{(n)}:=C+\sqrt{\frac{V_{\epsilon_u}}{n}}\,\Phi^{-1}(\epsilon_u). \tag{32}$$

In the analysis in (28)–(32), we have assumed that the Gaussian approximation is sufficiently accurate. It was shown numerically that the Gaussian approximation is accurate for some channels (such as the binary symmetric channel, binary erasure channel and additive white Gaussian noise channel) at moderate blocklengths and error probabilities. Also see Fig. 1 for a precise quantification of the accuracy of the Gaussian approximation in our setting.

Clearly if $\epsilon_u$ and $\epsilon_e>0$ are constants,

$$C=\lim_{n\to\infty}R_c^{(n)}>\lim_{n\to\infty}\mathbb{E}\big[R_e^{(n)}\big]=(1-\epsilon_e)C, \tag{33}$$

so there is no advantage in allowing for erasures asymptotically. However, in the finite blocklength regime (by this we mean the per-block blocklength $n$), for "moderate" $\epsilon_e$, we may have

$$R_c^{(n)}<\mathbb{E}\big[R_e^{(n)}\big], \tag{34}$$

so erasure decoding may be advantageous in expectation. We illustrate the difference between $R_c^{(n)}$ and $\mathbb{E}[R_e^{(n)}]$ with a concrete example.

Fig. 1: Comparison of non-asymptotic rates and Gaussian approximations for channel coding with and without the erasure option. Observe that $\mathbb{E}[R_e^{(n)}]$ can be larger than $R_c^{(n)}$ for finite $n$. On the left plot, DT Expected Rate and MC Expected Rate respectively stand for the dependence-testing (DT) [10, Thm. 34] and meta-converse (MC) [10, Thm. 35] finite blocklength bounds for the expected rate with the given parameters $(\epsilon_e,\epsilon_u,n,q)$. DT Channel Coding and MC Channel Coding respectively stand for the DT and MC finite blocklength bounds for channel coding with ordinary decoding with the given parameters $(\epsilon_u,n,q)$. The finite blocklength bounds are difficult to compute numerically for large blocklengths and small error probabilities and so they are not shown on the right plot.

Example: In Fig. 1, we consider a binary symmetric channel (BSC) with crossover probability $q$, so $C=1-h(q)$ bits/channel use and $V=q(1-q)\log^2\frac{1-q}{q}$ bits$^2$/channel use, where $h$ is the binary entropy function. (For the BSC, $V_\epsilon$ does not depend on $\epsilon$.) We keep the undetected error probability $\epsilon_u$ fixed and vary the erasure probability $\epsilon_e$ over an interval, for two blocklengths $n$. We observe that at the smaller blocklength and a moderate erasure probability, the gain of coding with the erasure option over ordinary channel coding is rather pronounced. This can be seen by comparing either the Gaussian approximations or the finite blocklength bounds. This gain is reduced if (i) $\epsilon_e$ is increased, because we retransmit the whole block more often on average via the use of decision feedback, or (ii) $n$ becomes large, so the second-order term becomes less significant (cf. (33)). We also note from the left plot that the Gaussian approximation is reasonably accurate when compared to the dependence-testing [10, Thm. 34] and meta-converse [10, Thm. 35] finite blocklength bounds under the current settings. The DT bound is especially close to the Gaussian approximation when the performance of coding with erasures peaks.
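The comparison between (29) and (32) for the BSC can be reproduced at the level of the Gaussian approximations with a few lines of Python. This is a sketch under stated assumptions: the parameter values `q = 0.11`, `eps_u = 1e-6`, `n = 1000` and the grid of erasure probabilities are chosen here for illustration and are not claimed to be those used in Fig. 1, and the finite blocklength DT and MC bounds are not computed. The function names are hypothetical.

```python
from math import log2, sqrt
from statistics import NormalDist


def bsc_capacity(q):
    """C = 1 - h(q) in bits per channel use."""
    return 1.0 + q * log2(q) + (1 - q) * log2(1 - q)


def bsc_dispersion(q):
    """V = q(1-q) log^2((1-q)/q); for the BSC it does not depend on eps."""
    return q * (1 - q) * log2((1 - q) / q) ** 2


def channel_coding_rate(n, q, eps_u):
    """Gaussian approximation (32) of the rate without the erasure option."""
    return bsc_capacity(q) + sqrt(bsc_dispersion(q) / n) * NormalDist().inv_cdf(eps_u)


def expected_rate_with_erasures(n, q, eps_u, eps_e):
    """Gaussian approximation (29) of the expected rate, with eps_t = eps_u + eps_e."""
    eps_t = eps_u + eps_e
    per_block = bsc_capacity(q) + sqrt(bsc_dispersion(q) / n) * NormalDist().inv_cdf(eps_t)
    return (1 - eps_e) * per_block


q, eps_u, n = 0.11, 1e-6, 1000  # illustrative parameters, not taken from Fig. 1
print("no erasures:", round(channel_coding_rate(n, q, eps_u), 4))
for eps_e in (1e-3, 1e-2, 1e-1):
    print("eps_e =", eps_e, "->", round(expected_rate_with_erasures(n, q, eps_u, eps_e), 4))
```

Under these assumed parameters the expected rate with erasures exceeds the ordinary rate at $n=1000$, while for very large $n$ the ordering reverses, in line with (33).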

Finally, we remark that the advantage of coding with the erasure option over ordinary channel coding was also shown from the error exponents perspective by Forney in [1]. More precisely, Forney [1, Eq. (55)] showed that the feedback exponent has slope $-1$ for rates near and below capacity, thus improving on the ordinary exponent which has slope $0$ in the same region.

### III-C List Decoding

###### Theorem 2.

For any , we have the second-order result

$$r^*_{\mathrm{list},a}(l,\epsilon)=l+\sqrt{V_\epsilon}\,\Phi^{-1}(\epsilon), \tag{35}$$

where . Furthermore, if and , we also have the third-order result

$$\alpha\le s^*_{\mathrm{list}}(\alpha,\epsilon)\le\frac{1}{2}+\alpha. \tag{36}$$

If the DMC is symmetric in the Gallager sense and singular, then we can make the stronger statement

$$s^*_{\mathrm{list}}(\alpha,\epsilon)=\alpha. \tag{37}$$

Here, a DMC is symmetric in the Gallager sense [18, Sec. 4.5, pp. 94] if the set of channel outputs can be partitioned into subsets such that within each subset, the matrix of transition probabilities satisfies the following: every row (resp. column) is a permutation of every other row (resp. column). A DMC is singular if for all $(x,\bar{x},y)$ with $W(y|x)W(y|\bar{x})>0$ it is true that $W(y|x)=W(y|\bar{x})$.

The proof of Theorem 2 which is partly based on Proposition 3 below can be found in Section V-B.

Theorem 2 shows that if we allow the list to grow as $\exp(\sqrt{n}\,l)$, then the second-order capacity is increased by $l$. This concurs with the intuition we obtain from the analysis of error exponents for list decoding by Shannon-Gallager-Berlekamp [2]. See also the exercises in Gallager [18, Ex. 5.20] and Csiszár-Körner [19, Ex. 10.28], which are concerned solely with first-order (capacity) analysis.

For the third-order result in (36), we observe that there is a gap. This is, in part, due to the fact that we use threshold decoding to decide which messages belong to the list. This decoding strategy appears to be suboptimal in the third-order sense for DMCs [12, Sec. 7] [24, Thm. 53] and AWGN channels. It also appears that one needs to use a version of maximum-likelihood (ML) decoding as in (26) and analyze an analogue of the RCU bound [10, Thm. 16] carefully to obtain the additional $\frac{1}{2}\log n$ for the direct part. Whether this can be done for channel coding with list decoding to obtain a tight third-order result for general DMCs is an open question. Nonetheless, for singular and symmetric channels considered by Altuğ and Wagner, such as the BEC, the converse (upper bound) can be tightened and this results in the conclusive result in (37).

A by-product of the proof of Theorem 2 is the following non-asymptotic converse bound for list decoding, which may be of independent interest. Kostina-Verdú [25, Thm. 4] developed and used a version of this bound for the purposes of fixed error asymptotics of joint source-channel coding, just as Csiszár also used a list decoder in his study of the error exponents for joint source-channel coding.

###### Proposition 3.

Let $\beta_\alpha(P,Q)$ be the best (smallest) type-II error in a (deterministic) hypothesis test between $P$ and $Q$ subject to the type-I error being no larger than $1-\alpha$. Every $(M,L,\epsilon)_{\mathrm{ave}}$-list code for $W$ satisfies

$$\frac{M}{L}\le\inf_{Q\in\mathcal{P}(\mathcal{Y})}\sup_{P\in\mathcal{P}(\mathcal{X})}\frac{1}{\beta_{1-\epsilon}(P\times W,P\times Q)}. \tag{38}$$

This bound immediately reduces to the so-called meta-converse in [10, Sec. III-E] by setting $L=1$. We will see that the ratio $M/L$ plays a critical role in both the converse and direct parts. This is also evident in existing works such as Shannon-Gallager-Berlekamp [2, Sec. IV] but the non-asymptotic bound in (38) appears to be novel.
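For small alphabets, the function $\beta_\alpha(P,Q)$ appearing in Proposition 3 can be evaluated by exhaustive search over deterministic tests. The sketch below does exactly that and nothing more: it is a brute-force illustration whose cost is exponential in the alphabet size, the function name `beta` and the tolerance parameter are assumptions, and it does not carry out the optimization over $P$ and $Q$ in (38).

```python
from itertools import combinations


def beta(alpha, P, Q, tol=1e-12):
    """Smallest Q(A) over deterministic acceptance regions A with P(A) >= alpha.

    Accepting on A gives type-I error 1 - P(A) <= 1 - alpha and type-II error Q(A).
    P and Q are probability vectors on the same finite alphabet; tol absorbs
    floating-point rounding in the constraint check.
    """
    k = len(P)
    best = 1.0
    for size in range(k + 1):
        for region in combinations(range(k), size):
            if sum(P[i] for i in region) >= alpha - tol:
                best = min(best, sum(Q[i] for i in region))
    return best


# Toy distributions chosen for illustration.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
print(beta(0.75, P, Q))
```

In this toy example, the only regions with $P$-mass at least $0.75$ are $\{0,1\}$ and the full alphabet, so the minimum $Q$-mass is attained by $\{0,1\}$.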

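## IV An Extension to Slepian-Wolf Coding

In this section, we show that the techniques developed are also applicable to lossless source coding with decoder side information, i.e., the Slepian-Wolf problem. Let $(X,Y)$ with joint distribution $P_{XY}$ be a correlated source where the alphabets $\mathcal{X}$ and $\mathcal{Y}$ are finite. The assumption that $\mathcal{X}$ and $\mathcal{Y}$ are finite sets can be dispensed with at the cost of non-universality in the coding scheme.

###### Definition 8.

An $M$-code for the correlated source $(X,Y)$ is defined by two functions $f:\mathcal{X}\to[M]$ and $\varphi:[M]\times\mathcal{Y}\to\mathcal{X}\cup\{\mathrm{e}\}$, where $\mathrm{e}$ represents the erasure symbol.

###### Definition 9.

An $(M,\epsilon_u,\epsilon_t)$-code for the correlated source $(X,Y)$ is an $M$-code satisfying

$$\begin{aligned} \sum_{x,y}P_{XY}(x,y)\mathbf{1}\{\varphi(f(x),y)\notin\{x,\mathrm{e}\}\} &\le\epsilon_u,\quad\text{and} && (39)\\ \sum_{x,y}P_{XY}(x,y)\mathbf{1}\{\varphi(f(x),y)\ne x\} &\le\epsilon_t. && (40) \end{aligned}$$

The parameters $\epsilon_u$ and $\epsilon_t$ are known as the undetected and total error probabilities respectively.

We consider discrete, stationary and memoryless sources in which the source alphabet is $\mathcal{X}^n$ and the side-information alphabet is $\mathcal{Y}^n$.

###### Definition 10.

A number $r\in\mathbb{R}$ is said to be an $(\epsilon_u,\epsilon_t)$-achievable second-order coding rate for the correlated source $(X,Y)$ if there exists a sequence of $(M_n,\epsilon_{u,n},\epsilon_{t,n})$-codes for $(X^n,Y^n)$ such that

$$\begin{aligned} \limsup_{n\to\infty}\frac{1}{\sqrt{n}}\big(\log M_n-nH(X|Y)\big) &\le r && (41)\\ \limsup_{n\to\infty}\epsilon_{u,n} &\le\epsilon_u && (42)\\ \limsup_{n\to\infty}\epsilon_{t,n} &\le\epsilon_t. && (43) \end{aligned}$$

The infimum of all $(\epsilon_u,\epsilon_t)$-achievable second-order coding rates is the optimum second-order coding rate $r^*_{\mathrm{era}}(\epsilon_u,\epsilon_t)$.

Hence, we are allowing the erasure option for the Slepian-Wolf problem but we restrict the undetected and total errors to be at most $\epsilon_u$ and $\epsilon_t$ respectively. The second-order asymptotics for the Slepian-Wolf problem with two encoders without erasures was studied by Tan and Kosut.

Define $V(X|Y):=\mathrm{Var}\big(-\log P_{X|Y}(X|Y)\big)$ to be the conditional varentropy of the stationary, memoryless source $(X,Y)$. The following result parallels Theorem 1 pertaining to channel coding with the erasure option.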

###### Theorem 4.

For any $0\le\epsilon_u<\epsilon_t<1$,

$$r^*_{\mathrm{era}}(\epsilon_u,\epsilon_t)=\sqrt{V(X|Y)}\,\Phi^{-1}(1-\epsilon_t). \tag{44}$$

The proof of Theorem 4 can be found in Section V-C. The direct part uses random binning and thresholding of the empirical conditional entropy, and the converse uses a non-asymptotic information spectrum converse bound by Miyake and Kanaya. Also see [31, Thm. 7.2.2].

Again, we observe that the optimum second-order coding rate does not depend on the undetected error probability as long as it is smaller than the total error probability $\epsilon_t$. Hence, the observation we made in Fig. 1, namely that at finite blocklengths the expected performance with the erasure option can exceed that without the erasure option, also applies in the Slepian-Wolf setting.
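The quantities entering Theorem 4 can be computed directly from a joint pmf. The following sketch, with hypothetical function names, evaluates $H(X|Y)$ and $V(X|Y)=\mathrm{Var}(-\log P_{X|Y}(X|Y))$ in bits and the resulting approximation $nH(X|Y)+\sqrt{nV(X|Y)}\,\Phi^{-1}(1-\epsilon_t)$ of $\log M_n$, ignoring the $o(\sqrt{n})$ terms. The doubly symmetric binary source used as input is an illustrative choice, not an example from the paper.

```python
from math import log2, sqrt
from statistics import NormalDist


def conditional_entropy_and_varentropy(P_XY):
    """Return (H(X|Y), V(X|Y)) in bits for a joint pmf given as P_XY[x][y]."""
    num_y = len(P_XY[0])
    P_Y = [sum(row[y] for row in P_XY) for y in range(num_y)]
    # Pairs (probability, -log P_{X|Y}(x|y)) over the support of P_XY.
    terms = [(p, -log2(p / P_Y[y])) for row in P_XY for y, p in enumerate(row) if p > 0]
    H = sum(p * h for p, h in terms)
    V = sum(p * (h - H) ** 2 for p, h in terms)
    return H, V


def slepian_wolf_log_m(n, eps_t, P_XY):
    """Second-order approximation n H(X|Y) + sqrt(n V(X|Y)) Phi^{-1}(1 - eps_t)."""
    H, V = conditional_entropy_and_varentropy(P_XY)
    return n * H + sqrt(n * V) * NormalDist().inv_cdf(1 - eps_t)


# Doubly symmetric binary source with crossover probability 0.1 (illustrative).
dsbs = [[0.45, 0.05], [0.05, 0.45]]
print(conditional_entropy_and_varentropy(dsbs))
print(slepian_wolf_log_m(n=1000, eps_t=0.01, P_XY=dsbs))
```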

## V Proofs

In this section, we provide the proofs of the theorems in the paper.

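### V-A Decoding with Erasure Option: Proof of Theorem 1

#### V-A1 Converse Part

Recall that $\beta_\alpha(P,Q)$ is the best (smallest) type-II error in a hypothesis test (without randomization) of $P$ versus $Q$ subject to the condition that the type-I error is no larger than $1-\alpha$. This function was studied extensively in [10]. Let us fix any $(M_n,\epsilon_{u,n},\epsilon_{t,n})_{\max,\max}$-erasure code for $W^n$. This means that

$$\min_{m\in[M_n]}W^n(\mathcal{D}_m|f(m))\ge 1-\epsilon_{t,n}. \tag{45}$$

Note that only the total error comes into play in (45) and thus the second-order capacity in (23) only depends on $\epsilon_t$. In essence, an $(M_n,\epsilon_{u,n},\epsilon_{t,n})_{\max,\max}$-erasure code for $W^n$ is an $(M_n,\epsilon_{t,n})_{\max}$-channel code for $W^n$ (a channel code for $W^n$ with $M_n$ codewords and maximum error probability at most $\epsilon_{t,n}$), so any converse for usual channel coding also applies here, with the error probability for usual channel coding being the total error in erasure decoding. We describe some details of the converse proof to make the paper self-contained.

We now assume that the code is constant composition, i.e., all codewords are of the same type $P\in\mathcal{P}_n(\mathcal{X})$. This only leads to a penalty of $O(\log n)$ in $\log M_n$, which does not affect the second-order term. Now, for any permutation invariant output distribution $Q\in\mathcal{P}(\mathcal{Y}^n)$ (this means that $Q(\mathbf{y})=Q(\mathbf{y}_\pi)$ for every permutation $\pi$), we have

$$\begin{aligned} 1 &\ge\sum_{m=1}^{M_n}Q(\mathcal{D}_m) && (46)\\ &\ge\sum_{m=1}^{M_n}\ \min_{\mathcal{D}\subset\mathcal{Y}^n:\,W^n(\mathcal{D}|f(m))\ge 1-\epsilon_{t,n}}Q(\mathcal{D}) && (47)\\ &=\sum_{m=1}^{M_n}\beta_{1-\epsilon_{t,n}}\big(W^n(\cdot|f(m)),Q\big) && (48)\\ &=M_n\,\beta_{1-\epsilon_{t,n}}\big(W^n(\cdot|\mathbf{x}),Q\big), && (49) \end{aligned}$$

where (46) follows because the decoding regions $\{\mathcal{D}_m\}_{m\in[M_n]}$ are disjoint; (47) follows from (45); (48) uses the definition of $\beta_{1-\epsilon_{t,n}}$; and finally (49) follows from the fact that $\beta_{1-\epsilon_{t,n}}(W^n(\cdot|\mathbf{x}),Q)$ does not depend on $\mathbf{x}\in\mathcal{T}_P$ for permutation invariant $Q$ (which is what we choose $Q$ to be) and constant composition codes. We also used $\mathbf{x}$ to denote any element in the type class $\mathcal{T}_P$. Choose $Q$ to be the product distribution $(PW)^n$. Since $\epsilon_{t,n}$ satisfies (13), for every $\eta>0$, there exists a sufficiently large $n$ such that $\epsilon_{t,n}\le\epsilon_t+\eta$. Hence, from (49),

$$M_n\le\beta^{-1}_{1-\epsilon_t-\eta}\big(W^n(\cdot|\mathbf{x}),Q\big). \tag{50}$$

Now, we use [8, Sec. 2] to assert that for all $\epsilon\in(0,1)$,

$$-\log\beta_{1-\epsilon}\big(W^n(\cdot|\mathbf{x}),Q\big)=nI(P,W)+\sqrt{nV(P,W)}\,\Phi^{-1}(\epsilon)+O(\log n). \tag{51}$$

Hence, by putting (50) and (51) together, we obtain

$$\log M_n\le\max_{P\in\mathcal{P}(\mathcal{X})}\left\{nI(P,W)+\sqrt{nV(P,W)}\,\Phi^{-1}(\epsilon_t+\eta)\right\}+O(\log n). \tag{52}$$

By using the usual continuity arguments (e.g. [13, Lem. 7]),

$$\log M_n\le nC+\sqrt{nV_{\epsilon_t+\eta}}\,\Phi^{-1}(\epsilon_t+\eta)+O(\log n). \tag{53}$$

Thus, by normalizing by $\sqrt{n}$ and letting $n\to\infty$,

$$\limsup_{n\to\infty}\frac{1}{\sqrt{n}}(\log M_n-nC)\le\sqrt{V_{\epsilon_t+\eta}}\,\Phi^{-1}(\epsilon_t+\eta). \tag{54}$$

The converse proof is complete for the case $(a,b)=(\max,\max)$ by taking $\eta\downarrow 0$. Note that even though $\epsilon\mapsto V_\epsilon$ is, in general, discontinuous at $\epsilon=1/2$ (i.e., when $V_{\min}\ne V_{\max}$), we may let $\eta\downarrow 0$ and use the fact that $\Phi^{-1}(1/2)=0$ to assert that $\sqrt{V_{\epsilon_t+\eta}}\,\Phi^{-1}(\epsilon_t+\eta)\to\sqrt{V_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)$ as desired.

To get to the $(a,b)=(\mathrm{ave},\mathrm{ave})$ setting, first we let $M^*_{a,b}(W^n,\epsilon_u,\epsilon_t)$ be the maximum number of codewords in an erasure code with undetected and total error probabilities $\epsilon_u$ and $\epsilon_t$ respectively under the $(a,b)$-setting. By using an expurgation argument as in [10, Eq. (284)], it is easy to show that for all $\tau>1$,

$$M^*_{\mathrm{ave},\mathrm{ave}}(W^n,\epsilon_u,\epsilon_t)\le\left(\frac{1}{1-1/\tau}\right)^2M^*_{\max,\max}(W^n,\tau\epsilon_u,\tau\epsilon_t). \tag{55}$$

Setting $\tau=1+\frac{1}{\sqrt{n}}$, so that the prefactor contributes only $O(\log n)$ to $\log M_n$ and the error probabilities are inflated by $O(1/\sqrt{n})$, proves the claim for the other three cases.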

#### V-A2 Direct Part

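It suffices to prove that $\sqrt{V_{\epsilon_t}}\,\Phi^{-1}(\epsilon_t)$ is an achievable $(0,\epsilon_t)_{\mathrm{ave},\mathrm{ave}}$-erasure second-order coding rate for $W$. Indeed, here the undetected error vanishes, so the total error $\epsilon_t$ coincides asymptotically with the erasure probability. However, any achievable $(0,\epsilon_t)_{a,b}$-second-order coding rate is also an achievable $(\epsilon_u,\epsilon_t)_{a,b}$-second-order coding rate for any $\epsilon_u\in[0,\epsilon_t)$. Furthermore, by an expurgation argument similar to (55), the same statement can be proved under the $(\max,\max)$ setting.

For this proof, we show that the second-order capacity in (23) is also universally attainable, i.e., the code does not require channel knowledge. A simpler proof based on thresholding the likelihood can also be used; however, channel knowledge is then required. See Sec. V-B3 for an analogue of the alternative strategy.

Fix a type $P\in\mathcal{P}_n(\mathcal{X})$. Generate $M_n$ codewords uniformly at random from the type class $\mathcal{T}_P$. Denote the random codebook as $\mathcal{C}:=\{\mathbf{X}(m):m\in[M_n]\}$. The number $M_n$ is to be chosen later. Let $\gamma$ be some threshold to be chosen later. At the receiver, we decode to $\hat{m}$ if and only if $\hat{m}$ is the unique message to satisfy

$$\hat{I}(\mathbf{X}(\hat{m})\wedge\mathbf{Y})\ge\gamma \tag{56}$$

where $\hat{I}(\mathbf{x}\wedge\mathbf{y})$ is the empirical mutual information of $(\mathbf{x},\mathbf{y})$, i.e., the mutual information of the random variables $(\tilde{X},\tilde{Y})$ whose distribution is the joint type $P_{\mathbf{x},\mathbf{y}}$. Assume as usual that the true message is $m=1$. We use the following elementary result which is shown in the proof of the packing lemma [19, Lem. 10.1].

###### Lemma 5.

Let $P$ be any $n$-type. Let $\mathbf{X}$ and $\bar{\mathbf{X}}$ be selected independently and uniformly at random from the type class $\mathcal{T}_P$. Let $\mathbf{Y}$ be the channel output when $\mathbf{X}$ is the input, i.e., $\mathbf{Y}\,|\,\{\mathbf{X}=\mathbf{x}\}\sim W^n(\cdot|\mathbf{x})$. Then, for every $n$ and every $\gamma$,

$$\Pr\big[\hat{I}(\bar{\mathbf{X}}\wedge\mathbf{Y})>\gamma\big]\le(n+1)^{|\mathcal{X}|+|\mathcal{X}||\mathcal{Y}|}\exp(-n\gamma). \tag{57}$$

The undetected error probability is bounded as

$$\begin{aligned} \Pr[\mathcal{E}_u] &\le\Pr\Big[\max_{m\in[M_n]\setminus\{1\}}\hat{I}(\mathbf{X}(m)\wedge\mathbf{Y})\ge\gamma\Big] && (58)\\ &\le(M_n-1)\Pr\big[\hat{I}(\mathbf{X}(2)\wedge\mathbf{Y})\ge\gamma\big] && (59)\\ &\le(M_n-1)(n+1)^{|\mathcal{X}||\mathcal{Y}|+|\mathcal{X}|}\exp(-n\gamma), && (60) \end{aligned}$$

where (59) follows from the union bound and the fact that the codewords are generated in an identical manner, and (60) follows from Lemma 5, noting that $\mathbf{X}(2)$ is independent of $\mathbf{Y}$, the channel output when $\mathbf{X}(1)$ is the input.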
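The decoding rule (56) only needs the joint type of each codeword with the received word, which is why it is universal. A minimal sketch of the rule follows, assuming sequences are given as equal-length tuples, logs are to base 2, and an erasure is signalled by returning `0`; the function names and the toy codebook are hypothetical.

```python
from collections import Counter
from math import log2


def empirical_mutual_information(x, y):
    """Mutual information (bits) of the joint type of the sequences x and y."""
    n = len(x)
    joint = Counter(zip(x, y))
    type_x = Counter(x)
    type_y = Counter(y)
    # Each term is P_xy(a,b) * log2( P_xy(a,b) / (P_x(a) P_y(b)) ) written with counts.
    return sum(
        (c / n) * log2(c * n / (type_x[a] * type_y[b]))
        for (a, b), c in joint.items()
    )


def threshold_decode(codebook, y, gamma):
    """Return the unique message index (1-based) meeting the threshold, else 0."""
    hits = [
        m for m, x in enumerate(codebook, start=1)
        if empirical_mutual_information(x, y) >= gamma
    ]
    return hits[0] if len(hits) == 1 else 0


toy_codebook = [(0, 0, 1, 1), (0, 1, 0, 1)]
print(threshold_decode(toy_codebook, (0, 0, 1, 1), gamma=0.5))  # unique hit
print(threshold_decode(toy_codebook, (0, 1, 1, 0), gamma=0.5))  # no hit: erasure
```

Both ways of returning `0` correspond to an erasure: no codeword meets the threshold, or at least two do.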

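The erasure event can be expressed as

$$\mathcal{E}_e:=\mathcal{E}_e^{(1)}\cup\mathcal{E}_e^{(2)} \tag{61}$$

where

$$\begin{aligned} \mathcal{E}_e^{(1)} &:=\big\{\hat{I}(\mathbf{X}(m)\wedge\mathbf{Y})<\gamma,\ \forall\,m\in[M_n]\big\},\quad\text{and} && (62)\\ \mathcal{E}_e^{(2)} &:=\big\{\hat{I}(\mathbf{X}(m)\wedge\mathbf{Y})\ge\gamma\ \text{for at least 2 messages } m\in[M_n]\big\}. && (63) \end{aligned}$$

Clearly, we have that

$$\begin{aligned} \mathcal{E}_e^{(1)} &\subset\mathcal{F}_e^{(1)}:=\big\{\hat{I}(\mathbf{X}(1)\wedge\mathbf{Y})<\gamma\big\},\quad\text{and} && (64)\\ \mathcal{E}_e^{(2)} &\subset\mathcal{F}_e^{(2)}:=\Big\{\max_{m\in[M_n]\setminus\{1\}}\hat{I}(\mathbf{X}(m)\wedge\mathbf{Y})\ge\gamma\Big\}. && (65) \end{aligned}$$

Note that the probability of $\mathcal{F}_e^{(2)}$ was already bounded in (58)–(60). Hence it remains to upper bound the probability of $\mathcal{F}_e^{(1)}$ defined in (64). We let the random conditional type of $\mathbf{Y}$ given $\mathbf{X}(1)$ be $U$. Then, we have

$$\begin{aligned} \Pr\big[\mathcal{F}_e^{(1)}\big] &\le\Pr\big[\hat{I}(\mathbf{X}(1)\wedge\mathbf{Y})\le\gamma\big] && (66)\\ &=\Pr\Big[I(P,W)+\sum_{x,y}\big(U(y|x)-W(y|x)\big)I'_W(y|x)+O(\|U-W\|^2)\le\gamma\Big] && (67) \end{aligned}$$

where the final step follows by Taylor expanding $U\mapsto I(P,U)$ around $U=W$, and $I'_W(y|x):=\frac{\partial I(P,U)}{\partial U(y|x)}\Big|_{U=W}$. We can also bound the remainder term uniformly, yielding

$$\Pr\big[\mathcal{F}_e^{(1)}\big]\le\Pr\Big[I(P,W)+\sum_{x,y}\big(U(y|x)-W(y|x)\big)I'_W(y|x)\le\gamma+O\Big(\frac{\log n}{n}\Big)\Big]+O(n^{-2}). \tag{68}$$

Wang-Ingber-Kochman computed the relevant first-, second- and third-order statistics of the random variable $\sum_{x,y}(U(y|x)-W(y|x))I'_W(y|x)$, allowing us to apply the Berry-Esseen theorem [34, Ch. XVI.5] to the probability in (68), leading to

$$\Pr\big[\mathcal{F}_e^{(1)}\big]\le\Phi\left(\frac{\gamma+O\big(\frac{\log n}{n}\big)-I(P,W)}{\sqrt{V(P,W)/n}}\right)+O(n^{-1/2}).$$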