# Asymptotic Estimates in Information Theory with Non-Vanishing Error Probabilities

Vincent Y. F. Tan
Department of Electrical and Computer Engineering
Department of Mathematics
National University of Singapore
Singapore 119077
Email: vtan@nus.edu.sg

### Abstract

This monograph presents a unified treatment of single- and multi-user problems in Shannon’s information theory where we depart from the requirement that the error probability decays asymptotically to zero in the blocklength. Instead, the error probabilities for various problems are bounded above by a non-vanishing constant and the spotlight is shone on achievable coding rates as functions of the growing blocklengths. This represents the study of asymptotic estimates with non-vanishing error probabilities.

In Part I, after reviewing the fundamentals of information theory, we discuss Strassen’s seminal result for binary hypothesis testing, where the type-I error probability is non-vanishing and the rate of decay of the type-II error probability with a growing number of independent observations is characterized. In Part II, we use this basic hypothesis testing result to develop second- and, sometimes, even third-order asymptotic expansions for point-to-point communication. In Part III, we consider network information theory problems for which the second-order asymptotics are known. These problems include some classes of channels with random state, the multiple-encoder distributed lossless source coding (Slepian-Wolf) problem and special cases of the Gaussian interference and multiple-access channels. We conclude by discussing avenues for further research.

## Part I Fundamentals

### Chapter 1 Introduction

Claude E. Shannon’s epochal “A Mathematical Theory of Communication” [141] marks the dawn of the digital age. In his seminal paper, Shannon laid the mathematical foundations of all communication systems today. It is not an exaggeration to say that his work has had a tremendous impact on communications engineering and beyond, in fields as diverse as statistics, economics, biology and cryptography, just to name a few.

It has been more than 65 years since Shannon’s landmark work was published. Along with impressive research advances in the field of information theory, numerous excellent books on various aspects of the subject have been written. The author’s favorites include Cover and Thomas [33], Gallager [56], Csiszár and Körner [39], Han [67], Yeung [189] and El Gamal and Kim [49]. Is there sufficient motivation to consolidate and present another aspect of information theory systematically? It is the author’s hope that the answer is in the affirmative.

To motivate why this is so, let us recapitulate two of Shannon’s major contributions in his 1948 paper. First, Shannon showed that to reliably compress a discrete memoryless source (DMS) $X_1,\ldots,X_n$ where each $X_i$ has the same distribution as a common random variable $X$, it is sufficient to use $H(X)$ bits per source symbol in the limit of large blocklengths $n$, where $H(X)$ is the Shannon entropy of the source. By reliable, it is meant that the probability of incorrect decoding of the source sequence tends to zero as the blocklength $n$ grows. Second, Shannon showed that it is possible to reliably transmit a message over a discrete memoryless channel (DMC) $W$ as long as the message rate is smaller than the capacity of the channel $C(W)$. Similarly to the source compression scenario, by reliable, one means that the probability of incorrectly decoding the message tends to zero as $n$ grows.

There is, however, substantial motivation to revisit the criterion of having error probabilities vanish asymptotically. To state Shannon’s source compression result more formally, let us define $M^*(P^n,\varepsilon)$ to be the minimum code size for which the length-$n$ DMS $P^n$ is compressible to within an error probability $\varepsilon$. Then, Theorem 3 of Shannon’s paper [141], together with the strong converse for lossless source coding [49, Ex. 3.15], states that

$$\lim_{n\to\infty}\frac{1}{n}\log M^*(P^n,\varepsilon) = H(X) \quad \text{bits per source symbol}. \tag{1.1}$$

Similarly, denoting $M^*_{\mathrm{ave}}(W^n,\varepsilon)$ as the maximum code size for which it is possible to communicate over a DMC $W^n$ such that the average error probability is no larger than $\varepsilon$, Theorem 11 of Shannon’s paper [141], together with the strong converse for channel coding [180, Thm. 2], states that

$$\lim_{n\to\infty}\frac{1}{n}\log M^*_{\mathrm{ave}}(W^n,\varepsilon) = C(W) \quad \text{bits per channel use}. \tag{1.2}$$

In many practical communication settings, one does not have the luxury of being able to design an arbitrarily long code, so one must settle for a non-vanishing, and hence finite, error probability $\varepsilon > 0$. In this finite blocklength and non-vanishing error probability setting, how close can one hope to get to the asymptotic limits $H(X)$ and $C(W)$? This is, in general, a difficult question because exact evaluations of $\log M^*(P^n,\varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n,\varepsilon)$ are intractable, apart from a few special sources and channels.

In the early years of information theory, Dobrushin [45], Kemperman [91] and, most prominently, Strassen [152] studied approximations to $\log M^*(P^n,\varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n,\varepsilon)$. These beautiful works were largely forgotten until recently, when interest in so-called Gaussian approximations was revived by Hayashi [75, 76] and Polyanskiy-Poor-Verdú [122, 123].[^1] Strassen showed that the limiting statement in (1.1) may be refined to yield the asymptotic expansion

[^1]: Some of the results in [122, 123] were already announced by S. Verdú in his Shannon lecture at the 2007 International Symposium on Information Theory (ISIT) in Nice, France.

$$\log M^*(P^n,\varepsilon) = nH(X) - \sqrt{nV(X)}\,\Phi^{-1}(\varepsilon) - \frac{1}{2}\log n + O(1), \tag{1.3}$$

where $V(X)$ is known as the source dispersion or the varentropy, terms introduced by Kostina-Verdú [97] and Kontoyiannis-Verdú [95]. In (1.3), $\Phi^{-1}$ is the inverse of the Gaussian cumulative distribution function. Observe that the first-order term in the asymptotic expansion above, namely $nH(X)$, coincides with the (first-order) fundamental limit shown by Shannon. From this expansion, one sees that if the error probability is fixed to $\varepsilon$, the extra rate above the entropy we have to pay for operating at finite blocklength $n$ with admissible error probability $\varepsilon$ is approximately $\sqrt{V(X)/n}\,\Phi^{-1}(1-\varepsilon)$. Thus, the quantity $V(X)$, which is a function of the source distribution just like the entropy $H(X)$, quantifies how fast the rates of optimal source codes converge to $H(X)$. Similarly, for well-behaved DMCs, under mild conditions, Strassen showed that the limiting statement in (1.2) may be refined to

$$\log M^*_{\mathrm{ave}}(W^n,\varepsilon) = nC(W) + \sqrt{nV_\varepsilon(W)}\,\Phi^{-1}(\varepsilon) + O(\log n), \tag{1.4}$$

where $V_\varepsilon(W)$ is a channel parameter known as the $\varepsilon$-channel dispersion, a term introduced by Polyanskiy-Poor-Verdú [123]. Thus the backoff from capacity at finite blocklength $n$ and average error probability $\varepsilon$ is approximately $\sqrt{V_\varepsilon(W)/n}\,\Phi^{-1}(1-\varepsilon)$.
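As an illustration, the first two terms of (1.4) can be evaluated numerically. The following is a minimal sketch (not from the monograph; the function names and the choice of a binary symmetric channel with crossover probability 0.11 are this sketch's own assumptions), working in nats and using Python's `statistics.NormalDist` for $\Phi^{-1}$:

```python
import math
from statistics import NormalDist

def bsc_capacity(p):
    """Capacity of a BSC(p) in nats: log 2 - H_b(p)."""
    hb = -p * math.log(p) - (1 - p) * math.log(1 - p)  # binary entropy in nats
    return math.log(2) - hb

def bsc_dispersion(p):
    """Channel dispersion of a BSC(p) in nats^2: p(1-p) log^2((1-p)/p)."""
    return p * (1 - p) * math.log((1 - p) / p) ** 2

def normal_approx_log_M(n, p, eps):
    """First two terms of (1.4): n*C + sqrt(n*V) * Phi^{-1}(eps), in nats."""
    C, V = bsc_capacity(p), bsc_dispersion(p)
    return n * C + math.sqrt(n * V) * NormalDist().inv_cdf(eps)

# Example: BSC(0.11) at blocklength 1000 and error probability 1e-3.
n, p, eps = 1000, 0.11, 1e-3
rate_bits = normal_approx_log_M(n, p, eps) / (n * math.log(2))
print(f"approx rate: {rate_bits:.3f} bits/use "
      f"(capacity {bsc_capacity(p) / math.log(2):.3f} bits/use)")
```

The backoff from the capacity of roughly 0.5 bits/use visible here is exactly the $\sqrt{V_\varepsilon(W)/n}\,\Phi^{-1}(1-\varepsilon)$ term.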

#### 1.1 Motivation for this Monograph

It turns out that Gaussian approximations (the first two terms of (1.3) and (1.4)) are good proxies for the true non-asymptotic fundamental limits ($\log M^*(P^n,\varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n,\varepsilon)$) at moderate blocklengths and moderate error probabilities for some channels and sources, as shown by Polyanskiy-Poor-Verdú [123] and Kostina-Verdú [97]. For error probabilities that are not too small, the Gaussian approximation is often better than that provided by traditional error exponent or reliability function analysis [39, 56], where the code rate is fixed (below the first-order fundamental limit) and the exponential decay of the error probability is analyzed. Recent refinements to error exponent analysis using exact asymptotics [10, 11, 135] or saddlepoint approximations [137] are alternative proxies for the non-asymptotic fundamental limits. The accuracy of the Gaussian approximation in practical regimes of errors and finite blocklengths gives us motivation to study refinements to the first-order fundamental limits of other single- and multi-user problems in Shannon theory.

The study of asymptotic estimates with non-vanishing error probabilities—or more succinctly, fixed error asymptotics—also uncovers several interesting phenomena that are not observable from studies of first-order fundamental limits in single- and multi-user information theory [33, 49]. This analysis may give engineers deeper insight into the design of practical communication systems. A non-exhaustive list includes:

1. Shannon showed that separating the tasks of source and channel coding is optimal rate-wise [141]. As we see in Section 4.5.2 (and similarly to the case of error exponents [35]), this is not the case when the probability of excess distortion of the source is allowed to be non-vanishing.

2. Shannon showed that feedback does not increase the capacity of a DMC [142]. It is known, however, that variable-length feedback [125] and full output feedback [8] improve on the fixed error asymptotics of DMCs.

3. It is known that the entropy can be achieved universally for fixed-to-variable length almost lossless source coding of a DMS [192], i.e., the source statistics do not have to be known. The redundancy has also been studied for prefix-free codes [27]. In the fixed error setting (a setting complementary to [27]), it was shown by Kosut and Sankar [100, 101] that universality imposes a penalty in the third-order term of the asymptotic expansion in (1.3).

4. Han showed that the output from any source encoder at the optimal coding rate with asymptotically vanishing error appears almost completely random [68]. This is the so-called folklore theorem. Hayashi [75] showed that the analogue of the folklore theorem does not hold when we consider the second-order terms in asymptotic expansions (i.e., the second-order asymptotics).

5. Slepian and Wolf showed that separate encoding of two correlated sources incurs no loss rate-wise compared to the situation where side information is also available at all encoders [151]. As we shall see in Chapter 6, the fixed error asymptotics in the vicinity of a corner point of the polygonal Slepian-Wolf region suggests that side-information at the encoders may be beneficial.

None of the aforementioned books [33, 39, 49, 56, 67, 189] focus exclusively on the situation where the error probabilities of various Shannon-theoretic problems are upper bounded by a constant $\varepsilon \in (0,1)$ and asymptotic expansions or second-order terms are sought. This is what this monograph attempts to do.

#### 1.2 Preview of this Monograph

This monograph is organized as follows: In the remaining parts of this chapter, we recap some quantities in information theory and results in the method of types [37, 39, 74], a particularly useful tool for the study of discrete memoryless systems. We also mention some probability bounds that will be used throughout the monograph. Most of these bounds are based on refinements of the central limit theorem, and are collectively known as Berry-Esseen theorems [17, 52]. In Chapter 2, our study of asymptotic expansions of the form (1.3) and (1.4) begins in earnest by revisiting Strassen’s work [152] on binary hypothesis testing where the probability of false alarm is constrained to not exceed a positive constant. We find it useful to revisit the fundamentals of hypothesis testing as many information-theoretic problems such as source and channel coding are intimately related to hypothesis testing.

Part II of this monograph begins our study of information-theoretic problems starting with lossless and lossy compression in Chapter 3. We emphasize, in the first part of this chapter, that (fixed-to-fixed length) lossless source coding and binary hypothesis testing are, in fact, the same problem, and so the asymptotic expansions developed in Chapter 2 may be directly employed for the purpose of lossless source coding. Lossy source coding, however, is more involved. We review the recent works in [86] and [97], where the authors independently derived asymptotic expansions for the logarithm of the minimum size of a source code that reproduces symbols up to a certain distortion, with some admissible probability of excess distortion. Channel coding is discussed in Chapter 4. In particular, we study the approximation in (1.4) for both discrete memoryless and Gaussian channels. We make it a point here to be precise about the third-order term. We state conditions on the channel under which the coefficient of the $\log n$ term can be determined exactly. This leads to some new insights concerning optimum codes for the channel coding problem. Finally, we marry source and channel coding in the study of source-channel transmission where the probability of excess distortion in reproducing the source is non-vanishing.

Part III of this monograph contains a sparse sampling of fixed error asymptotic results in network information theory. The problems we discuss here have conclusive second-order asymptotic characterizations (analogous to the second terms in the asymptotic expansions in (1.3) and (1.4)). They include some channels with random state (Chapter 5), such as Costa’s writing on dirty paper [30], mixed DMCs [67, Sec. 3.3], and quasi-static single-input-multiple-output (SIMO) fading channels [18]. Under the fixed error setup, we also consider the second-order asymptotics of the Slepian-Wolf [151] distributed lossless source coding problem (Chapter 6), the Gaussian interference channel (IC) in the strictly very strong interference regime [22] (Chapter 7), and the Gaussian multiple access channel (MAC) with degraded message sets (Chapter 8). The MAC with degraded message sets is also known as the cognitive [44] or asymmetric [72, 167, 128] MAC (A-MAC). Chapter 9 concludes with a brief summary of other results, together with open problems in this area of research. A dependence graph of the chapters in the monograph is shown in Fig. 1.1.

This area of information theory—fixed error asymptotics—is vast and, at the same time, rapidly expanding. The results described herein are not meant to be exhaustive and are somewhat dependent on the author’s understanding of the subject and his preferences at the time of writing. However, the author has made it a point to ensure that the results herein are conclusive in nature. This means that the problem is solved in the information-theoretic sense, in that an operational quantity is equated to an information quantity. In terms of asymptotic expansions such as (1.3) and (1.4), by solved, we mean that either the second-order term is known or, better still, both the second- and third-order terms are known. Having articulated this, the author confesses that there are many relevant information-theoretic problems that can be considered solved in the fixed error setting, but have not found their way into this monograph either due to space constraints or because it was difficult to meld them seamlessly with the rest of the story.

#### 1.3 Fundamentals of Information Theory

In this section, we review some basic information-theoretic quantities. As with every article published in Foundations and Trends in Communications and Information Theory, the reader is expected to have some background in information theory. Nevertheless, the only prerequisite required to appreciate this monograph is information theory at the level of Cover and Thomas [33]. We will also make extensive use of the method of types, for which excellent expositions can be found in [37, 39, 74]. The measure-theoretic foundations of probability will not be needed, in order to keep the exposition accessible to as wide an audience as possible.

##### 1.3.1 Notation

The notation we use is reasonably standard and generally follows the books by Csiszár-Körner [39] and Han [67]. Random variables (e.g., $X$) and their realizations (e.g., $x$) are in upper and lower case respectively. Random variables that take on finitely many values have alphabets (supports) that are denoted by calligraphic font (e.g., $\mathcal{X}$). The cardinality of the finite set $\mathcal{X}$ is denoted as $|\mathcal{X}|$. Let the random vector $X^n$ be the vector of random variables $(X_1,\ldots,X_n)$. We use bold face $\mathbf{x} = (x_1,\ldots,x_n)$ to denote a realization of $X^n$. The set of all distributions (probability mass functions) supported on alphabet $\mathcal{X}$ is denoted as $\mathscr{P}(\mathcal{X})$. The set of all conditional distributions (i.e., channels) with the input alphabet $\mathcal{X}$ and the output alphabet $\mathcal{Y}$ is denoted by $\mathscr{P}(\mathcal{Y}|\mathcal{X})$. The joint distribution induced by a marginal distribution $P\in\mathscr{P}(\mathcal{X})$ and a channel $V\in\mathscr{P}(\mathcal{Y}|\mathcal{X})$ is denoted as $P\times V$, i.e.,

$$(P\times V)(x,y) := P(x)V(y|x). \tag{1.5}$$

The marginal output distribution induced by $P$ and $V$ is denoted as $PV$, i.e.,

$$PV(y) := \sum_{x\in\mathcal{X}} P(x)V(y|x). \tag{1.6}$$

If $X$ has distribution $P$, we sometimes write this as $X \sim P$.

Vectors are indicated in lower case bold face (e.g., $\mathbf{a}$) and matrices in upper case bold face (e.g., $\mathbf{A}$). If we write $\mathbf{a} \le \mathbf{b}$ for two vectors $\mathbf{a}$ and $\mathbf{b}$ of the same length, we mean that $a_i \le b_i$ for every coordinate $i$. The transpose of $\mathbf{a}$ is denoted as $\mathbf{a}'$. The vector of all zeros and the identity matrix are denoted as $\mathbf{0}$ and $\mathbf{I}$ respectively. We sometimes make the lengths and sizes explicit. The $\ell_p$-norm (for $p \ge 1$) of a vector $\mathbf{a} = (a_1,\ldots,a_k)$ is denoted as $\|\mathbf{a}\|_p := \big(\sum_{i=1}^k |a_i|^p\big)^{1/p}$.

We use standard asymptotic notation [29]: $a_n = O(b_n)$ if and only if (iff) $\limsup_{n\to\infty} |a_n/b_n| < \infty$; $a_n = o(b_n)$ iff $\lim_{n\to\infty} a_n/b_n = 0$; $a_n = \Omega(b_n)$ iff $b_n = O(a_n)$; $a_n = \omega(b_n)$ iff $b_n = o(a_n)$; and $a_n = \Theta(b_n)$ iff $a_n = O(b_n)$ and $a_n = \Omega(b_n)$. Finally, $a_n \sim b_n$ iff $\lim_{n\to\infty} a_n/b_n = 1$.

##### 1.3.2 Information-Theoretic Quantities

Information-theoretic quantities are denoted in the usual way [39, 49]. All logarithms and exponential functions are to the base $e$. The entropy of a discrete random variable $X$ with probability distribution $P\in\mathscr{P}(\mathcal{X})$ is denoted as

$$H(X) = H(P) := -\sum_{x\in\mathcal{X}} P(x)\log P(x). \tag{1.7}$$

For the sake of clarity, we will sometimes make the dependence on the distribution explicit. Similarly, given a pair of random variables $(X,Y)$ with joint distribution $P\times V$, the conditional entropy of $Y$ given $X$ is written as

$$H(Y|X) = H(V|P) := -\sum_{x\in\mathcal{X}} P(x)\sum_{y\in\mathcal{Y}} V(y|x)\log V(y|x). \tag{1.8}$$

The joint entropy is denoted as

$$H(X,Y) := H(X) + H(Y|X), \quad\text{or} \tag{1.9}$$
$$H(P\times V) := H(P) + H(V|P). \tag{1.10}$$

The mutual information is a measure of the correlation or dependence between random variables $X$ and $Y$. It is interchangeably denoted as

$$I(X;Y) := H(Y) - H(Y|X), \quad\text{or} \tag{1.11}$$
$$I(P,V) := H(PV) - H(V|P). \tag{1.12}$$

Given three random variables $(X,Y,Z)$ with joint distribution $P\times V\times W$, where $V\in\mathscr{P}(\mathcal{Y}|\mathcal{X})$ and $W\in\mathscr{P}(\mathcal{Z}|\mathcal{X}\times\mathcal{Y})$, the conditional mutual information is

$$I(Y;Z|X) := H(Z|X) - H(Z|X,Y), \quad\text{or} \tag{1.13}$$
$$I(V,W|P) := \sum_{x\in\mathcal{X}} P(x)\, I\big(V(\cdot|x), W(\cdot|x,\cdot)\big). \tag{1.14}$$

A particularly important quantity is the relative entropy (or Kullback-Leibler divergence [102]) between $P$ and $Q$, which are distributions on the same finite support set $\mathcal{X}$. It is defined as the expectation with respect to $P$ of the log-likelihood ratio $\log\frac{P(x)}{Q(x)}$, i.e.,

$$D(P\|Q) := \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}. \tag{1.15}$$

Note that if there exists an $x\in\mathcal{X}$ for which $P(x) > 0$ while $Q(x) = 0$, then the relative entropy $D(P\|Q) = \infty$. If, for every $x\in\mathcal{X}$, $Q(x) = 0$ implies $P(x) = 0$, we say that $P$ is absolutely continuous with respect to $Q$ and denote this relation by $P \ll Q$. In this case, the relative entropy is finite. It is well known that $D(P\|Q) \ge 0$ and equality holds if and only if $P = Q$. Additionally, the conditional relative entropy between $V$ and $W$ given $P$ is defined as

$$D(V\|W|P) := \sum_{x\in\mathcal{X}} P(x)\, D\big(V(\cdot|x)\,\big\|\,W(\cdot|x)\big). \tag{1.16}$$

The mutual information is a special case of the relative entropy. In particular, we have

$$I(P,V) = D(P\times V\,\|\,P\times PV) = D(V\,\|\,PV|P). \tag{1.17}$$

Furthermore, if $U_{\mathcal{X}}$ is the uniform distribution on $\mathcal{X}$, i.e., $U_{\mathcal{X}}(x) = 1/|\mathcal{X}|$ for all $x\in\mathcal{X}$, we have

$$D(P\|U_{\mathcal{X}}) = -H(P) + \log|\mathcal{X}|. \tag{1.18}$$

The definition of relative entropy can be extended to the case where $Q$ is not necessarily a probability measure. In this case, non-negativity does not hold in general. An important property we exploit is the following: If $\mu$ denotes the counting measure (i.e., $\mu(x) = 1$ for all $x\in\mathcal{X}$), then, similarly to (1.18),

$$D(P\|\mu) = -H(P). \tag{1.19}$$
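These identities are easy to verify numerically. Below is a small self-contained sketch (the helper names are illustrative, not from the text) that checks (1.18) and (1.19) on a three-letter alphabet, using natural logarithms as in the rest of the chapter:

```python
import math

def entropy(P):
    """Shannon entropy H(P) in nats: -sum_x P(x) log P(x)."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def kl_divergence(P, Q):
    """Relative entropy D(P||Q); Q may be any positive measure on the support."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items() if p > 0)

X = ['a', 'b', 'c']
P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
U = {x: 1 / len(X) for x in X}   # uniform distribution U_X
mu = {x: 1.0 for x in X}         # counting measure (not a probability measure)

# Identity (1.18): D(P||U_X) = log|X| - H(P)
assert abs(kl_divergence(P, U) - (math.log(len(X)) - entropy(P))) < 1e-12
# Identity (1.19): D(P||mu) = -H(P)  (negative, since mu is not a probability)
assert abs(kl_divergence(P, mu) + entropy(P)) < 1e-12
print("identities (1.18) and (1.19) verified")
```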

#### 1.4 The Method of Types

For finite alphabets, a particularly convenient tool in information theory is the method of types [37, 39, 74]. For a sequence $\mathbf{x} = (x_1,\ldots,x_n) \in \mathcal{X}^n$ in which $\mathcal{X}$ is finite, its type or empirical distribution is the probability mass function

$$P_{\mathbf{x}}(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{x_i = x\}, \qquad \forall\, x\in\mathcal{X}. \tag{1.20}$$

Throughout, we use the notation $\mathbb{1}\{\cdot\}$ to mean the indicator function, i.e., this function equals $1$ if the clause within the braces is true and $0$ otherwise. The set of types formed from length-$n$ sequences in $\mathcal{X}^n$ is denoted as $\mathscr{P}_n(\mathcal{X})$. This is clearly a subset of $\mathscr{P}(\mathcal{X})$. The type class of $P$, denoted as $\mathcal{T}_P$, is the set of all sequences of length $n$ whose type is $P$, i.e.,

$$\mathcal{T}_P := \{\mathbf{x}\in\mathcal{X}^n : P_{\mathbf{x}} = P\}. \tag{1.21}$$

It is customary to indicate the dependence of $\mathcal{T}_P$ on the blocklength $n$, but we suppress this dependence for the sake of conciseness throughout. For a sequence $\mathbf{x}\in\mathcal{T}_P$, the set of all sequences $\mathbf{y}\in\mathcal{Y}^n$ such that $(\mathbf{x},\mathbf{y})$ has joint type $P\times V$ is the $V$-shell, denoted as $\mathcal{T}_V(\mathbf{x})$. In other words,

$$\mathcal{T}_V(\mathbf{x}) := \{\mathbf{y}\in\mathcal{Y}^n : P_{\mathbf{x},\mathbf{y}} = P\times V\}. \tag{1.22}$$

The conditional distribution $V$ is also known as the conditional type of $\mathbf{y}$ given $\mathbf{x}$. Let $\mathscr{V}_n(\mathcal{Y};P)$ be the set of all $V\in\mathscr{P}(\mathcal{Y}|\mathcal{X})$ for which the $V$-shell of a sequence of type $P$ is non-empty.

We will oftentimes find it useful to consider information-theoretic quantities of empirical distributions. All such quantities are denoted using hats. So, for example, the empirical entropy of a sequence $\mathbf{x}\in\mathcal{X}^n$ is denoted as

$$\hat{H}(\mathbf{x}) := H(P_{\mathbf{x}}). \tag{1.23}$$

The empirical conditional entropy of $\mathbf{y}$ given $\mathbf{x}$, where $\mathbf{y}\in\mathcal{T}_V(\mathbf{x})$, is denoted as

$$\hat{H}(\mathbf{y}|\mathbf{x}) := H(V|P_{\mathbf{x}}). \tag{1.24}$$

The empirical mutual information of a pair of sequences $(\mathbf{x},\mathbf{y})$ with joint type $P_{\mathbf{x}}\times V$ is denoted as

$$\hat{I}(\mathbf{x}\wedge\mathbf{y}) := I(P_{\mathbf{x}}, V). \tag{1.25}$$
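The empirical quantities above are straightforward to compute. A minimal sketch (function names are illustrative, not from the text) for the type and the empirical entropy of a sequence:

```python
import math
from collections import Counter

def empirical_distribution(x):
    """Type P_x of a sequence: P_x(a) = (1/n) * #{i : x_i = a}."""
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}

def empirical_entropy(x):
    """Empirical entropy hat-H(x) = H(P_x), in nats."""
    P = empirical_distribution(x)
    return -sum(p * math.log(p) for p in P.values())

x = "aabba"
print(empirical_distribution(x))          # type of the sequence
print(round(empirical_entropy(x), 4))     # H(0.6, 0.4) in nats
```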

The following lemmas form the basis of the method of types. The proofs can be found in [37, 39].

###### Lemma 1.1 (Type Counting).

The sets $\mathscr{P}_n(\mathcal{X})$ and $\mathscr{V}_n(\mathcal{Y};P)$ for $P\in\mathscr{P}_n(\mathcal{X})$ satisfy

$$|\mathscr{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|}, \quad\text{and}\quad |\mathscr{V}_n(\mathcal{Y};P)| \le (n+1)^{|\mathcal{X}||\mathcal{Y}|}. \tag{1.26}$$

In fact, it is easy to check that $|\mathscr{P}_n(\mathcal{X})| = \binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$, but (1.26) or its slightly stronger version

$$|\mathscr{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|-1} \tag{1.27}$$

usually suffices for our purposes in this monograph. This key property says that the number of types is polynomial in the blocklength $n$.
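Both the exact count and the bound (1.27) can be checked by brute force for small $n$ and $|\mathcal{X}|$. The sketch below (illustrative, not part of the monograph) enumerates all sequences and compares:

```python
from itertools import product
from math import comb

def num_types(n, alphabet_size):
    """Number of types of length-n sequences: compositions of n into |X| parts."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)

n, k = 10, 3
exact = num_types(n, k)                      # binomial(n+|X|-1, |X|-1)
assert exact <= (n + 1) ** (k - 1)           # the bound (1.27)

# Brute-force check: collect the distinct count vectors of all k^n sequences.
types = {tuple(seq.count(a) for a in range(k)) for seq in product(range(k), repeat=n)}
assert len(types) == exact
print(f"|P_n(X)| = {exact} <= (n+1)^(|X|-1) = {(n + 1) ** (k - 1)}")
```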

###### Lemma 1.2 (Size of Type Class).

For a type $P\in\mathscr{P}_n(\mathcal{X})$, the type class $\mathcal{T}_P$ satisfies

$$|\mathscr{P}_n(\mathcal{X})|^{-1}\exp\big(nH(P)\big) \le |\mathcal{T}_P| \le \exp\big(nH(P)\big). \tag{1.28}$$

For a conditional type $V\in\mathscr{V}_n(\mathcal{Y};P)$ and a sequence $\mathbf{x}\in\mathcal{T}_P$, the $V$-shell $\mathcal{T}_V(\mathbf{x})$ satisfies

$$|\mathscr{V}_n(\mathcal{Y};P)|^{-1}\exp\big(nH(V|P)\big) \le |\mathcal{T}_V(\mathbf{x})| \le \exp\big(nH(V|P)\big). \tag{1.29}$$

This lemma says that, on the exponential scale,

$$|\mathcal{T}_P| \cong \exp\big(nH(P)\big), \quad\text{and}\quad |\mathcal{T}_V(\mathbf{x})| \cong \exp\big(nH(V|P)\big), \tag{1.30}$$

where we used the notation $a_n \cong b_n$ to mean equality up to a polynomial, i.e., there exist polynomials $p_n$ and $q_n$ such that $p_n^{-1} b_n \le a_n \le q_n b_n$. We now consider probabilities of sequences. Throughout, for a distribution $Q\in\mathscr{P}(\mathcal{X})$, we let $Q^n$ be the product distribution, i.e.,

$$Q^n(\mathbf{x}) := \prod_{i=1}^n Q(x_i). \tag{1.31}$$
###### Lemma 1.3 (Probability of Sequences).

If $\mathbf{x}\in\mathcal{T}_P$ and $\mathbf{y}\in\mathcal{T}_V(\mathbf{x})$, then

$$Q^n(\mathbf{x}) = \exp\big(-nD(P\|Q) - nH(P)\big), \quad\text{and} \tag{1.32}$$
$$W^n(\mathbf{y}|\mathbf{x}) = \exp\big(-nD(V\|W|P) - nH(V|P)\big). \tag{1.33}$$

This, together with Lemma 1.2, leads immediately to the final lemma in this section.

###### Lemma 1.4 (Probability of Type Classes).

For a type $P\in\mathscr{P}_n(\mathcal{X})$,

$$|\mathscr{P}_n(\mathcal{X})|^{-1}\exp\big(-nD(P\|Q)\big) \le Q^n(\mathcal{T}_P) \le \exp\big(-nD(P\|Q)\big). \tag{1.34}$$

For a conditional type $V\in\mathscr{V}_n(\mathcal{Y};P)$ and a sequence $\mathbf{x}\in\mathcal{T}_P$, we have

$$|\mathscr{V}_n(\mathcal{Y};P)|^{-1}\exp\big(-nD(V\|W|P)\big) \le W^n\big(\mathcal{T}_V(\mathbf{x})\,\big|\,\mathbf{x}\big) \le \exp\big(-nD(V\|W|P)\big). \tag{1.35}$$

The interpretation of this lemma is that the probability that a random i.i.d. (independently and identically distributed) sequence generated from $Q$ belongs to the type class $\mathcal{T}_P$ is exponentially small with exponent $D(P\|Q)$, i.e.,

$$Q^n(\mathcal{T}_P) \cong \exp\big(-nD(P\|Q)\big). \tag{1.36}$$

The bounds in (1.35) can be interpreted similarly.
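Lemma 1.4 can be verified exactly for small blocklengths, since $Q^n(\mathcal{T}_P)$ is a multinomial probability. The sketch below (helper names are illustrative; the lower bound uses the weaker type-count bound $(n+1)^{|\mathcal{X}|-1} \ge |\mathscr{P}_n(\mathcal{X})|$) checks (1.34) for a binary example:

```python
import math

def type_class_prob(counts, Q):
    """Exact Q^n(T_P) for the type with the given integer symbol counts."""
    n = sum(counts.values())
    size = math.factorial(n)       # multinomial coefficient |T_P| ...
    prob = 1.0
    for a, c in counts.items():
        size //= math.factorial(c)
        prob *= Q[a] ** c          # ... times Q^n of any single sequence in T_P
    return size * prob

def kl(P, Q):
    return sum(p * math.log(p / Q[a]) for a, p in P.items() if p > 0)

n = 12
counts = {'0': 9, '1': 3}                  # type P = (3/4, 1/4)
P = {a: c / n for a, c in counts.items()}
Q = {'0': 0.5, '1': 0.5}

exact = type_class_prob(counts, Q)
upper = math.exp(-n * kl(P, Q))            # upper bound in (1.34)
lower = upper / (n + 1) ** (len(Q) - 1)    # lower bound via (1.27)
assert lower <= exact <= upper
print(f"{lower:.4f} <= Q^n(T_P) = {exact:.4f} <= {upper:.4f}")
```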

#### 1.5 Probability Bounds

In this section, we summarize some bounds on probabilities that we use extensively in the sequel. For a random variable $X$, we let $\mathrm{E}[X]$ and $\mathrm{Var}(X)$ be its expectation and variance respectively. To emphasize that the expectation is taken with respect to a random variable with distribution $P$, we sometimes make this explicit by using a subscript, i.e., $\mathrm{E}_P[X]$ or $\mathrm{E}_{X\sim P}[X]$.

##### 1.5.1 Basic Bounds

We start with the familiar Markov and Chebyshev inequalities.

###### Proposition 1.1 (Markov’s inequality).

Let $X$ be a real-valued non-negative random variable. Then for any $a > 0$, we have

$$\Pr(X \ge a) \le \frac{\mathrm{E}[X]}{a}. \tag{1.37}$$

If we let the random variable above be the non-negative random variable $(X-\mu)^2$, where $\mu = \mathrm{E}[X]$, and set $a = b^2\sigma^2$, we obtain Chebyshev’s inequality.

###### Proposition 1.2 (Chebyshev’s inequality).

Let $X$ be a real-valued random variable with mean $\mu$ and variance $\sigma^2$. Then for any $b > 0$, we have

$$\Pr\big(|X-\mu| \ge b\sigma\big) \le \frac{1}{b^2}. \tag{1.38}$$

We now consider a collection of real-valued random variables that are i.i.d. In particular, let $\{X_i\}_{i=1}^{\infty}$ be a collection of independent random variables where each $X_i$ has distribution $P$ with zero mean and finite variance $\sigma^2$.

###### Proposition 1.3 (Weak Law of Large Numbers).

For every $\epsilon > 0$, we have

$$\lim_{n\to\infty}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^n X_i\right| > \epsilon\right) = 0. \tag{1.39}$$

Consequently, the average $\frac{1}{n}\sum_{i=1}^n X_i$ converges to zero in probability.

This follows by applying Chebyshev’s inequality to the random variable $\frac{1}{n}\sum_{i=1}^n X_i$. In fact, under mild conditions, the convergence to zero in (1.39) occurs exponentially fast. See, for example, Cramér’s theorem in [43, Thm. 2.2.3].

##### 1.5.2 Central Limit-Type Bounds

In preparation for the next result, we denote the probability density function (pdf) of a univariate Gaussian with mean $\mu$ and variance $\sigma^2$ as

$$\mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}. \tag{1.40}$$

We will also denote this as $\mathcal{N}(\mu,\sigma^2)$ if the argument $x$ is unnecessary. A standard Gaussian distribution is one in which the mean $\mu = 0$ and the standard deviation $\sigma = 1$. In the multivariate case, the pdf is

$$\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}, \tag{1.41}$$

where $\mathbf{x}\in\mathbb{R}^k$. A standard multivariate Gaussian distribution is one in which the mean is $\mathbf{0}$ and the covariance is the identity matrix $\mathbf{I}_{k\times k}$.

For the univariate case, the cumulative distribution function (cdf) of the standard Gaussian is denoted as

$$\Phi(y) := \int_{-\infty}^{y} \mathcal{N}(x;0,1)\,\mathrm{d}x. \tag{1.42}$$

We also find it convenient to introduce the inverse of $\Phi$ as

$$\Phi^{-1}(\varepsilon) := \sup\{y\in\mathbb{R} : \Phi(y) \le \varepsilon\}, \tag{1.43}$$

which evaluates to the usual inverse for $\varepsilon\in(0,1)$ and extends continuously to take values $\pm\infty$ for $\varepsilon$ outside $(0,1)$. These monotonically increasing functions are shown in Fig. 1.2.

If the scaling in front of the sum in the statement of the law of large numbers in (1.39) is $\frac{1}{\sqrt{n}}$ instead of $\frac{1}{n}$, the resultant random variable converges in distribution to a Gaussian random variable. As in Proposition 1.3, let $\{X_i\}_{i=1}^{\infty}$ be a collection of i.i.d. random variables where each $X_i$ has zero mean and finite variance $\sigma^2$.

###### Proposition 1.4 (Central Limit Theorem).

For any $a\in\mathbb{R}$, we have

$$\lim_{n\to\infty}\Pr\left(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i < a\right) = \Phi(a). \tag{1.44}$$

In other words,

$$\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i \;\stackrel{\mathrm{d}}{\longrightarrow}\; Z, \tag{1.45}$$

where $\stackrel{\mathrm{d}}{\longrightarrow}$ means convergence in distribution and $Z \sim \mathcal{N}(0,1)$ is the standard Gaussian random variable.

Throughout the monograph, in the evaluation of the non-asymptotic bounds, we will use a more quantitative version of the central limit theorem known as the Berry-Esseen theorem [17, 52]. See Feller [54, Sec. XVI.5] for a proof.

###### Theorem 1.1 (Berry-Esseen Theorem (i.i.d. Version)).

Assume that the third absolute moment is finite, i.e., $T := \mathrm{E}[|X_1|^3] < \infty$. For every $n\in\mathbb{N}$, we have

$$\sup_{a\in\mathbb{R}}\left|\Pr\left(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i < a\right) - \Phi(a)\right| \le \frac{6\,T}{\sigma^3\sqrt{n}}. \tag{1.46}$$

Remarkably, the Berry-Esseen theorem says that the convergence in the central limit theorem in (1.44) is uniform in $a$. Furthermore, the convergence of the distribution function of $\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i$ to the Gaussian cdf occurs at a rate of $O(1/\sqrt{n})$. The constant of proportionality in the $O$-notation depends only on the variance $\sigma^2$ and the third absolute moment $T$, and not on any other statistics of the random variables.
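The $O(1/\sqrt{n})$ rate can be observed numerically. The sketch below (not from the monograph) computes the exact worst-case deviation for centered Bernoulli$(1/2)$ summands, where the cdf of the normalized sum is available in closed form, and compares it against a Berry-Esseen bound with the constant $0.4748$ that is valid in the i.i.d. case (Shevtsova's refinement; that specific constant is external to this text):

```python
import math
from math import comb

# Centered Bernoulli(1/2): X_i in {-1/2, +1/2}; sigma^2 = 1/4, E|X_i|^3 = 1/8.
n = 100
sigma, T = 0.5, 0.125

def Phi(y):
    """Standard Gaussian cdf via the error function."""
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

# Exact sup-deviation between the cdf of (1/(sigma*sqrt(n))) * sum X_i and Phi.
pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
dev, cdf = 0.0, 0.0
for k in range(n + 1):
    a = (k - n / 2) / (sigma * math.sqrt(n))   # normalized lattice point
    dev = max(dev, abs(cdf - Phi(a)))          # approaching the atom from below
    cdf += pmf[k]
    dev = max(dev, abs(cdf - Phi(a)))          # at the atom
print(f"sup deviation = {dev:.4f}, "
      f"bound = {0.4748 * T / (sigma ** 3 * math.sqrt(n)):.4f}")
```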

There are many generalizations of the Berry-Esseen theorem. One which we will need is the relaxation of the assumption that the random variables are identically distributed. Let $\{X_i\}_{i=1}^n$ be a collection of independent random variables where each random variable $X_i$ has zero mean, variance $\sigma_i^2$ and third absolute moment $T_i := \mathrm{E}[|X_i|^3]$. We respectively define the average variance and average third absolute moment as

$$\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2, \quad\text{and}\quad T := \frac{1}{n}\sum_{i=1}^n T_i. \tag{1.47}$$
###### Theorem 1.2 (Berry-Esseen Theorem (General Version)).

For every $n\in\mathbb{N}$, we have

$$\sup_{a\in\mathbb{R}}\left|\Pr\left(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i < a\right) - \Phi(a)\right| \le \frac{6\,T}{\sigma^3\sqrt{n}}. \tag{1.48}$$

Observe that, as with the i.i.d. version of the Berry-Esseen theorem, the remainder term scales as $O(1/\sqrt{n})$.

The proof of the following theorem uses the Berry-Esseen theorem (among other techniques). This theorem is proved in Polyanskiy-Poor-Verdú [123, Lem. 47]. Together with its variants, this theorem is useful for obtaining third-order asymptotics for binary hypothesis testing and other coding problems with non-vanishing error probabilities.

###### Theorem 1.3.

Assume the same setup as in Theorem 1.2. For any $\gamma > 0$, we have

$$\mathrm{E}\left[\exp\left(-\sum_{i=1}^n X_i\right)\mathbb{1}\left\{\sum_{i=1}^n X_i > \gamma\right\}\right] \le 2\left(\frac{\log 2}{\sqrt{2\pi}} + \frac{12\,T}{\sigma^2}\right)\frac{\exp(-\gamma)}{\sigma\sqrt{n}}. \tag{1.49}$$

It is trivial to see that the expectation in (1.49) is upper bounded by $\exp(-\gamma)$, since the integrand is at most $\exp(-\gamma)$ on the indicated event. The additional factor of $O(1/(\sigma\sqrt{n}))$ is crucial in proving coding theorems with better third-order terms. Readers familiar with strong large deviation theorems or exact asymptotics (see, e.g., [23, Thms. 3.3 and 3.5] or [43, Thm. 3.7.4]) will notice that (1.49) is in the same spirit as the theorem by Bahadur and Ranga-Rao [13]. There are two advantages of (1.49) compared to strong large deviation theorems. First, the bound is purely in terms of $\sigma$ and $T$, and second, one does not have to differentiate between lattice and non-lattice random variables. The disadvantage of (1.49) is that the constant is worse, but this will not concern us as we focus on asymptotic results in this monograph; hence constants do not affect the main results.

For multi-terminal problems that we encounter in the latter parts of this monograph, we will require vector (or multidimensional) versions of the Berry-Esseen theorem. The following is due to Götze [63].

###### Theorem 1.4 (Vector Berry-Esseen Theorem I).

Let $X_1^k, \ldots, X_n^k$ be independent $\mathbb{R}^k$-valued random vectors with zero mean. Let

$$\mathbf{S}_n^k = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i^k. \tag{1.50}$$

Assume that $\mathbf{S}_n^k$ has the following statistics:

$$\mathrm{Cov}(\mathbf{S}_n^k) = \mathrm{E}\big[\mathbf{S}_n^k(\mathbf{S}_n^k)'\big] = \mathbf{I}_{k\times k}, \quad\text{and}\quad \xi := \frac{1}{n}\sum_{i=1}^n \mathrm{E}\big[\|X_i^k\|_2^3\big]. \tag{1.51}$$

Let $Z^k$ be a standard Gaussian random vector, i.e., its distribution is $\mathcal{N}(\mathbf{0},\mathbf{I}_{k\times k})$. Then, for all $n\in\mathbb{N}$, we have

$$\sup_{\mathscr{C}\in\mathfrak{C}_k}\big|\Pr(\mathbf{S}_n^k\in\mathscr{C}) - \Pr(Z^k\in\mathscr{C})\big| \le \frac{c_k\,\xi}{\sqrt{n}}, \tag{1.52}$$

where $\mathfrak{C}_k$ is the family of all convex subsets of $\mathbb{R}^k$, and where $c_k$ is a constant that depends only on the dimension $k$.

Theorem 1.4 can be applied to random vectors that are independent but not necessarily identically distributed. The constant $c_k$ can be upper bounded by a constant proportional to $k^{1/4}$ if the random vectors are i.i.d., a result by Bentkus [15]. However, its precise value will not be of concern to us in this monograph. Observe that the scalar versions of the Berry-Esseen theorem (in Theorems 1.1 and 1.2) are special cases (apart from the constant) of the vector version in which the family of convex subsets is restricted to the family of semi-infinite intervals $\{(-\infty, a] : a\in\mathbb{R}\}$.

We will frequently encounter random vectors with non-identity covariance matrices. The following modification of Theorem 1.4 is due to Watanabe-Kuzuoka-Tan [177, Cor. 29].

###### Corollary 1.1 (Vector Berry-Esseen Theorem II).

Assume the same setup as in Theorem 1.4, except that $\mathrm{Cov}(\mathbf{S}_n^k) = \mathbf{V}$, a positive definite matrix. Then, for all $n\in\mathbb{N}$, we have

$$\sup_{\mathscr{C}\in\mathfrak{C}_k}\big|\Pr(\mathbf{S}_n^k\in\mathscr{C}) - \Pr(Z^k\in\mathscr{C})\big| \le \frac{c_k\,\xi}{\lambda_{\min}(\mathbf{V})^{3/2}\sqrt{n}}, \tag{1.53}$$

where $\lambda_{\min}(\mathbf{V})$ is the smallest eigenvalue of $\mathbf{V}$.

The final probability bound is a quantitative version of the so-called multivariate delta method [174, Thm. 5.15]. Numerous similar statements of varying generalities have appeared in the statistics literature (e.g., [24, 175]). The simple version we present was shown by MolavianJazi and Laneman [112] who extended ideas in Hoeffding and Robbins’ paper [81, Thm. 4] to provide rates of convergence to Gaussianity under appropriate technical conditions. This result essentially says that a differentiable function of a normalized sum of independent random vectors also satisfies a Berry-Esseen-type result.

###### Theorem 1.5 (Berry-Esseen Theorem for Functions of i.i.d. Random Vectors).

Assume that $X_1^k,\ldots,X_n^k$ are $\mathbb{R}^k$-valued, zero-mean, i.i.d. random vectors with positive definite covariance matrix $\mathrm{Cov}(X_1^k)$ and finite third absolute moment $\xi := \mathrm{E}[\|X_1^k\|_2^3]$. Let $\mathbf{f} = (f_1,\ldots,f_l)$ be a vector-valued function from $\mathbb{R}^k$ to $\mathbb{R}^l$ that is also twice continuously differentiable in a neighborhood of $\mathbf{0}$. Let $\mathbf{J}\in\mathbb{R}^{l\times k}$ be the Jacobian matrix of $\mathbf{f}$ evaluated at $\mathbf{0}$, i.e., its elements are

$$J_{ij} = \frac{\partial f_i(\mathbf{x})}{\partial x_j}\bigg|_{\mathbf{x}=\mathbf{0}}, \tag{1.54}$$

where $i\in\{1,\ldots,l\}$ and $j\in\{1,\ldots,k\}$. Then, for every $n\in\mathbb{N}$, we have

$$\sup_{\mathscr{C}\in\mathfrak{C}_l}\left|\Pr\left(\mathbf{f}\bigg(\frac{1}{n}\sum_{i=1}^n X_i^k\bigg)\in\mathscr{C}\right) - \Pr(Z^l\in\mathscr{C})\right| \le \frac{c}{\sqrt{n}}, \tag{1.55}$$

where $c > 0$ is a finite constant, and $Z^l$ is a Gaussian random vector in $\mathbb{R}^l$ with mean vector and covariance matrix respectively given as

$$\mathrm{E}[Z^l] = \mathbf{f}(\mathbf{0}), \quad\text{and}\quad \mathrm{Cov}(Z^l) = \frac{\mathbf{J}\,\mathrm{Cov}(X_1^k)\,\mathbf{J}'}{n}. \tag{1.56}$$

In particular, the inequality in (1.55) implies that

$$\sqrt{n}\left(\mathbf{f}\bigg(\frac{1}{n}\sum_{i=1}^n X_i^k\bigg) - \mathbf{f}(\mathbf{0})\right) \;\stackrel{\mathrm{d}}{\longrightarrow}\; \mathcal{N}\big(\mathbf{0},\, \mathbf{J}\,\mathrm{Cov}(X_1^k)\,\mathbf{J}'\big), \tag{1.57}$$

which is a canonical statement in the study of the multivariate delta method [174, Thm. 5.15].

Finally, we remark that Ingber-Wang-Kochman [87] used a result similar to that of Theorem 1.5 to derive second-order asymptotic results for various Shannon-theoretic problems. However, they analyzed the behavior of functions of distributions instead of functions of random vectors as in Theorem 1.5.

### Chapter 2 Binary Hypothesis Testing

In this chapter, we review asymptotic expansions in simple (non-composite) binary hypothesis testing when one of the two error probabilities is non-vanishing. We find this useful, as many coding theorems we encounter in subsequent chapters can be stated in terms of quantities related to binary hypothesis testing. For example, as pointed out in Csiszár and Körner [39, Ch. 1], fixed-to-fixed length lossless source coding and binary hypothesis testing are intimately connected through the relation between relative entropy and entropy in (1.18). Another example is in point-to-point channel coding, where a powerful non-asymptotic converse theorem [152, Eq. (4.29)], [123, Sec. III-E], [164, Prop. 6] can be stated in terms of the so-called $\varepsilon$-hypothesis testing divergence and the $\varepsilon$-information spectrum divergence (cf. Proposition 4.4). The properties of these two quantities, as well as the relation between them, are discussed. Using various probabilistic limit theorems, we also evaluate these quantities in the asymptotic setting for product distributions. A corollary of the results presented is the familiar Chernoff-Stein lemma [39, Thm. 1.2], which asserts that the exponential rate of decay, with a growing number of observations, of the type-II error probability for a non-vanishing type-I error probability in a binary hypothesis test of $P$ against $Q$ is the relative entropy $D(P\|Q)$.

The material in this chapter is based largely on the seminal work by Strassen [152, Thm. 3.1]. The exposition is based on the more recent works by Polyanskiy-Poor-Verdú [123, App. C], Tomamichel-Tan [164, Sec. III] and Tomamichel-Hayashi [163, Lem. 12].

#### 2.1 Non-Asymptotic Quantities and Their Properties

Consider the simple (non-composite) binary hypothesis test:

 $H_0: Z \sim P, \qquad H_1: Z \sim Q,$ (2.1)

where $P$ and $Q$ are two probability distributions on the same space $\mathcal{Z}$. We assume that the space $\mathcal{Z}$ is finite to keep the subsequent exposition simple. The notation in (2.1) means that under the null hypothesis $H_0$, the random variable $Z$ is distributed as $P$ while under the alternative hypothesis $H_1$, it is distributed according to a different distribution $Q$. We would like to study the optimal performance of a hypothesis test in terms of the distributions $P$ and $Q$.

There are several ways to measure the performance of a hypothesis test which, in precise terms, is a mapping $\delta$ from the observation space $\mathcal{Z}$ to $[0,1]$. If the observation $z$ is such that $\delta(z) = 0$, this means the test favors the null hypothesis $H_0$. Conversely, $\delta(z) = 1$ means that the test favors the alternative hypothesis $H_1$ (or alternatively, rejects the null hypothesis $H_0$); in general, $\delta(z)$ is the probability with which the test declares $H_1$ upon observing $z$. If $\delta(z) \in \{0,1\}$ for all $z \in \mathcal{Z}$, the test is called deterministic; otherwise it is called randomized. Traditionally, there are three quantities that are of interest for a given test $\delta$. The first is the probability of false alarm

 $P_{\mathrm{FA}} := \sum_{z\in\mathcal{Z}} \delta(z) P(z) = \mathsf{E}_P[\delta(Z)].$ (2.2)

The second is the probability of missed detection

 $P_{\mathrm{MD}} := \sum_{z\in\mathcal{Z}} (1-\delta(z)) Q(z) = \mathsf{E}_Q[1-\delta(Z)].$ (2.3)

The third is the probability of detection, which is one minus the probability of missed detection, i.e.,

 $P_{\mathrm{D}} := \sum_{z\in\mathcal{Z}} \delta(z) Q(z) = \mathsf{E}_Q[\delta(Z)].$ (2.4)

The probabilities of false alarm and missed detection are traditionally called the type-I and type-II errors respectively in the statistics literature. The probability of detection and the probability of false alarm are also called the power and the significance level respectively. The “holy grail” is, of course, to design a test such that $P_{\mathrm{FA}} = 0$ while $P_{\mathrm{D}} = 1$, but this is clearly impossible unless $P$ and $Q$ are mutually singular measures.

Since misses are usually more costly than false alarms, let us fix a number $\varepsilon \in (0,1)$ that represents a tolerable probability of false alarm (type-I error). Then define $\beta_{1-\varepsilon}(P,Q)$, the smallest type-II error in the binary hypothesis test (2.1) with type-I error not exceeding $\varepsilon$, i.e.,

 $\beta_{1-\varepsilon}(P,Q) := \inf_{\delta:\,\mathcal{Z}\to[0,1]} \big\{ \mathsf{E}_Q[1-\delta(Z)] : \mathsf{E}_P[\delta(Z)] \le \varepsilon \big\}.$ (2.5)

Observe that the condition $\mathsf{E}_P[\delta(Z)] \le \varepsilon$ constrains the probability of false alarm to be no greater than $\varepsilon$. Thus, we are searching over all tests $\delta$ satisfying this constraint such that the probability of missed detection is minimized. Intuitively, $\beta_{1-\varepsilon}(P,Q)$ quantifies, in a non-asymptotic fashion, the performance of an optimal hypothesis test between $P$ and $Q$.
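
By the Neyman-Pearson lemma, the infimum in (2.5) is attained by a (possibly randomized) likelihood-ratio test, which favors $H_1$ on the symbols where $Q(z)/P(z)$ is largest and randomizes on the boundary symbol so that the type-I budget $\varepsilon$ is spent exactly. For a finite alphabet this yields a simple greedy computation; the following Python sketch (the function name is ours) takes $P$ and $Q$ as lists of probabilities.

```python
def beta_1_minus_eps(P, Q, eps):
    """Smallest type-II error E_Q[1 - delta(Z)] over randomized tests
    delta with type-I error E_P[delta(Z)] <= eps, as in (2.5)."""
    # Neyman-Pearson: decide H1 first on symbols with the largest Q/P
    # (symbols with P(z) = 0 cost nothing in type-I budget).
    ratio = lambda z: Q[z] / P[z] if P[z] > 0 else float('inf')
    budget, detected = eps, 0.0
    for z in sorted(range(len(P)), key=ratio, reverse=True):
        if P[z] <= budget:              # whole symbol: delta(z) = 1
            budget -= P[z]
            detected += Q[z]
        else:                           # boundary symbol: randomize
            detected += Q[z] * budget / P[z]
            break
    return 1.0 - detected
```

For instance, $\beta_{1-\varepsilon}(P,P) = 1-\varepsilon$ (the constant test $\delta \equiv \varepsilon$ is optimal when the hypotheses are indistinguishable), while $\beta_{1-\varepsilon}(P,Q) = 0$ whenever $P$ and $Q$ are mutually singular.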

A related quantity is the $\varepsilon$-hypothesis testing divergence

 $D_h^{\varepsilon}(P\|Q) := -\log\frac{\beta_{1-\varepsilon}(P,Q)}{1-\varepsilon}.$ (2.6)

This is a measure of the distinguishability of $P$ from $Q$. As can be seen from (2.6), $D_h^{\varepsilon}(P\|Q)$ and $\beta_{1-\varepsilon}(P,Q)$ are simple functions of each other. We prefer to express the results in this monograph mostly in terms of $D_h^{\varepsilon}(P\|Q)$ because it shares similar properties with the usual relative entropy $D(P\|Q)$, as is evidenced from the following lemma.

###### Lemma 2.1 (Properties of $D_h^{\varepsilon}$).

The $\varepsilon$-hypothesis testing divergence satisfies the positive definiteness condition [48, Prop. 3.2], i.e.,

 $D_h^{\varepsilon}(P\|Q) \ge 0.$ (2.7)

Equality holds if and only if $P = Q$. In addition, it also satisfies the data processing inequality [173, Lem. 1], i.e., for any channel $W$,

 $D_h^{\varepsilon}(PW\|QW) \le D_h^{\varepsilon}(P\|Q).$ (2.8)
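
As a numerical sanity check (not a proof) of the data processing inequality (2.8), one can compute $D_h^{\varepsilon}$ on both sides for a small example. The sketch below recomputes $\beta_{1-\varepsilon}$ from its definition (2.5) via the Neyman-Pearson lemma so as to be self-contained; the particular $P$, $Q$, $\varepsilon$ and channel $W$ are arbitrary illustrative choices, and all function names are ours.

```python
from math import log

def beta(P, Q, eps):
    # Optimal (Neyman-Pearson) test: favor H1 where Q/P is largest.
    ratio = lambda z: Q[z] / P[z] if P[z] > 0 else float('inf')
    budget, detected = eps, 0.0
    for z in sorted(range(len(P)), key=ratio, reverse=True):
        if P[z] <= budget:
            budget -= P[z]
            detected += Q[z]
        else:
            detected += Q[z] * budget / P[z]
            break
    return 1.0 - detected

def D_h(P, Q, eps):
    # eps-hypothesis testing divergence, as in (2.6)
    return -log(beta(P, Q, eps) / (1.0 - eps))

def push(P, W):
    # Output distribution PW(y) = sum_z P(z) W(y|z); W[z] is row z.
    return [sum(P[z] * W[z][y] for z in range(len(P)))
            for y in range(len(W[0]))]

P, Q, eps = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5], 0.2
W = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # a 3-input, 2-output channel
assert D_h(push(P, W), push(Q, W), eps) <= D_h(P, Q, eps) + 1e-12
```

Passing $P$ and $Q$ through the channel can only make them harder to distinguish, which is what the final assertion checks on this example.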

While the $\varepsilon$-hypothesis testing divergence occurs naturally and frequently in coding problems, it is usually hard to analyze directly. Thus, we now introduce an equally important quantity. Define the $\varepsilon$-information spectrum divergence as

 $D_s^{\varepsilon}(P\|Q) := \sup\Big\{ R \in \mathbb{R} : P\Big( \log\frac{P(Z)}{Q(Z)} \le R \Big) \le \varepsilon \Big\}.$ (2.9)

Just as in information spectrum analysis [67], this quantity places the distribution of the log-likelihood ratio $\log\frac{P(Z)}{Q(Z)}$ (where $Z \sim P$), and not just its expectation, in the most prominent role. See Fig. 2.1 for an interpretation of the definition in (2.9).

As we will see, the $\varepsilon$-information spectrum divergence is intimately related to the $\varepsilon$-hypothesis testing divergence (cf. Lemma 2.4). The former is, however, easier to compute. Note that if $P$ and $Q$ are product measures, then by virtue of the fact that $\log\frac{P(Z)}{Q(Z)}$ is a sum of independent random variables, one can estimate the probability in (2.9) using various probability tail bounds. This we do in the following section.
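
For a finite alphabet the supremum in (2.9) can also be evaluated exactly, since the log-likelihood ratio takes only finitely many values: $D_s^{\varepsilon}(P\|Q)$ is the smallest value of $\log\frac{P(z)}{Q(z)}$ at which the cumulative $P$-probability first exceeds $\varepsilon$. A minimal Python sketch (the function name is ours; we assume $Q(z) > 0$ wherever $P(z) > 0$):

```python
from math import log

def D_s(P, Q, eps):
    """eps-information spectrum divergence of (2.9):
    sup { R : P( log P(Z)/Q(Z) <= R ) <= eps }."""
    # Sort the log-likelihood-ratio values and accumulate P-mass until
    # the CDF of the LLR under P first exceeds eps.
    llrs = sorted((log(p / q), p) for p, q in zip(P, Q) if p > 0)
    cum = 0.0
    for value, p in llrs:
        cum += p
        if cum > eps:
            return value
    return float('inf')
```

For example, $D_s^{\varepsilon}(P\|P) = 0$ for any $\varepsilon \in (0,1)$, since the log-likelihood ratio is identically zero.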

We now state two useful properties of $D_s^{\varepsilon}$. The proofs of these lemmas are straightforward and can be found in [164, Sec. III.A].

###### Lemma 2.2 (Sifting from a convex combination).

Let $P$ be a distribution and let $Q = \sum_k \alpha_k Q_k$ be an at most countable convex combination of distributions $\{Q_k\}$ with non-negative weights $\{\alpha_k\}$ summing to one. Then,

 $D_s^{\varepsilon}(P\|Q) \le \inf_k \Big\{ D_s^{\varepsilon}(P\|Q_k) + \log\frac{1}{\alpha_k} \Big\}.$ (2.10)

In particular, Lemma 2.2 tells us that if there exists some $\gamma > 0$ such that $\tilde{Q}(z) \le \gamma Q(z)$ for all $z \in \mathcal{Z}$, then

 $D_s^{\varepsilon}(P\|\tilde{Q}) \ge D_s^{\varepsilon}(P\|Q) - \log\gamma.$ (2.11)
###### Lemma 2.3 (“Symbol-wise” relaxation of $D_s^{\varepsilon}$).

Let $W$ and $V$ be two channels from $\mathcal{X}$ to $\mathcal{Y}$ and let $P$ be a distribution on $\mathcal{X}$. Then,

 $D_s^{\varepsilon}(P\times W \,\|\, P\times V) \le \sup_{x\in\mathcal{X}} D_s^{\varepsilon}\big(W(\cdot|x) \,\big\|\, V(\cdot|x)\big).$ (2.12)

One can readily toggle between the $\varepsilon$-hypothesis testing divergence and the $\varepsilon$-information spectrum divergence because they satisfy the bounds in the following lemma. The proof of this lemma mimics that of [163, Lem. 12].

###### Lemma 2.4 (Relation between divergences).

For every $\varepsilon \in (0,1)$ and every $\eta \in (0, 1-\varepsilon)$, we have

 $D_s^{\varepsilon}(P\|Q) - \log\frac{1}{1-\varepsilon} \le D_h^{\varepsilon}(P\|Q)$ (2.13)
 $D_h^{\varepsilon}(P\|Q) \le D_s^{\varepsilon+\eta}(P\|Q) + \log\frac{1-\varepsilon}{\eta}.$ (2.14)
###### Proof.

The following proof is based on that for [163, Lem. 12]. For the lower bound in (2.13), consider the likelihood ratio test

 $\delta(z) := \mathbf{1}\Big\{ \log\frac{P(z)}{Q(z)} \le \gamma \Big\}, \quad\text{where}\quad \gamma := D_s^{\varepsilon}(P\|Q) - \xi$ (2.15)

for some small $\xi > 0$. This test clearly satisfies $\mathsf{E}_P[\delta(Z)] \le \varepsilon$ by the definition of the $\varepsilon$-information spectrum divergence. On the other hand,

 $\mathsf{E}_Q[1-\delta(Z)] = \sum_{z\in\mathcal{Z}} Q(z)\,\mathbf{1}\Big\{\log\frac{P(z)}{Q(z)} > \gamma\Big\}$ (2.16)
 $\le \sum_{z\in\mathcal{Z}} P(z)\exp(-\gamma)\,\mathbf{1}\Big\{\log\frac{P(z)}{Q(z)} > \gamma\Big\}$ (2.17)
 $\le \sum_{z\in\mathcal{Z}} P(z)\exp(-\gamma)$ (2.18)
 $\le \exp(-\gamma).$ (2.19)

As a result, $\beta_{1-\varepsilon}(P,Q) \le \exp(-\gamma)$, so by the definition of $D_h^{\varepsilon}(P\|Q)$ in (2.6), we have

 $D_h^{\varepsilon}(P\|Q) \ge \gamma - \log\frac{1}{1-\varepsilon} = D_s^{\varepsilon}(P\|Q) - \xi - \log\frac{1}{1-\varepsilon}$