Asymptotic Estimates in Information Theory with Non-Vanishing Error Probabilities
Contents
 Abstract
 I Fundamentals

II Point-to-Point Communication

3 Source Coding
 3.1 Lossless Source Coding: Non-Asymptotic Bounds
 3.2 Lossless Source Coding: Asymptotic Expansions
 3.3 Second-Order Asymptotics of Lossless Source Coding via the Method of Types
 3.4 Lossy Source Coding: Non-Asymptotic Bounds
 3.5 Lossy Source Coding: Asymptotic Expansions
 3.6 Second-Order Asymptotics of Lossy Source Coding via the Method of Types
 4 Channel Coding


III Network Information Theory
 5 Channels with Random State
 6 Distributed Lossless Source Coding
 7 A Special Class of Gaussian Interference Channels
 8 A Special Class of Gaussian Multiple Access Channels
 9 Summary, Other Results, Open Problems
 Acknowledgements
Abstract
This monograph presents a unified treatment of single- and multi-user problems in Shannon’s information theory where we depart from the requirement that the error probability decays asymptotically in the blocklength. Instead, the error probabilities for various problems are bounded above by a non-vanishing constant and the spotlight is shone on achievable coding rates as functions of the growing blocklengths. This represents the study of asymptotic estimates with non-vanishing error probabilities.
In Part I, after reviewing the fundamentals of information theory, we discuss Strassen’s seminal result for binary hypothesis testing, where the type-I error probability is non-vanishing and the rate of decay of the type-II error probability with growing number of independent observations is characterized. In Part II, we use this basic hypothesis testing result to develop second- and sometimes even third-order asymptotic expansions for point-to-point communication. In Part III, we consider network information theory problems for which the second-order asymptotics are known. These problems include some classes of channels with random state, the multiple-encoder distributed lossless source coding (Slepian-Wolf) problem and special cases of the Gaussian interference and multiple-access channels. Finally, we discuss avenues for further research.
Part I Fundamentals
Chapter 1 Introduction
Claude E. Shannon’s epochal “A Mathematical Theory of Communication” [141] marks the dawn of the digital age. In his seminal paper, Shannon laid the theoretical and mathematical foundations of all communication systems today. It is not an exaggeration to say that his work has had a tremendous impact on communications engineering and beyond, in fields as diverse as statistics, economics, biology and cryptography, just to name a few.
It has been more than 65 years since Shannon’s landmark work was published. Along with impressive research advances in the field of information theory, numerous excellent books on various aspects of the subject have been written. The author’s favorites include Cover and Thomas [33], Gallager [56], Csiszár and Körner [39], Han [67], Yeung [189] and El Gamal and Kim [49]. Is there sufficient motivation to consolidate and present another aspect of information theory systematically? It is the author’s hope that the answer is in the affirmative.
To motivate why this is so, let us recapitulate two of Shannon’s major contributions in his 1948 paper. First, Shannon showed that to reliably compress a discrete memoryless source (DMS) $X_1, X_2, \ldots$, where each $X_i$ has the same distribution as a common random variable $X$, it is sufficient to use $H(X)$ bits per source symbol in the limit of large blocklengths $n$, where $H(X)$ is the Shannon entropy of the source. By reliable, it is meant that the probability of incorrect decoding of the source sequence tends to zero as the blocklength $n$ grows. Second, Shannon showed that it is possible to reliably transmit a message over a discrete memoryless channel (DMC) $W$ as long as the message rate $R$ is smaller than the capacity $C(W)$ of the channel. Similarly to the source compression scenario, by reliable, one means that the probability of incorrectly decoding the message tends to zero as $n$ grows.
There is, however, substantial motivation to revisit the criterion of having error probabilities vanish asymptotically. To state Shannon’s source compression result more formally, let us define $M^*_{\mathrm{s}}(n,\varepsilon)$ to be the minimum code size for which the length-$n$ DMS $X^n$ is compressible to within an error probability $\varepsilon \in (0,1)$. Then, Theorem 3 of Shannon’s paper [141], together with the strong converse for lossless source coding [49, Ex. 3.15], states that
$$\lim_{n\to\infty} \frac{1}{n}\log M^*_{\mathrm{s}}(n,\varepsilon) = H(X), \qquad \forall\,\varepsilon\in(0,1). \qquad (1.1)$$
Similarly, denoting $M^*_{\mathrm{c}}(n,\varepsilon)$ as the maximum code size for which it is possible to communicate over a DMC $W$ such that the average error probability is no larger than $\varepsilon$, Theorem 11 of Shannon’s paper [141], together with the strong converse for channel coding [180, Thm. 2], states that
$$\lim_{n\to\infty} \frac{1}{n}\log M^*_{\mathrm{c}}(n,\varepsilon) = C(W), \qquad \forall\,\varepsilon\in(0,1). \qquad (1.2)$$
In many practical communication settings, one does not have the luxury of being able to design an arbitrarily long code, so one must settle for a non-vanishing, and hence finite, error probability $\varepsilon > 0$. In this finite blocklength and non-vanishing error probability setting, how close can one hope to get to the asymptotic limits $H(X)$ and $C(W)$? This is, in general, a difficult question because exact evaluations of $\log M^*_{\mathrm{s}}(n,\varepsilon)$ and $\log M^*_{\mathrm{c}}(n,\varepsilon)$ are intractable, apart from a few special sources and channels.
In the early years of information theory, Dobrushin [45], Kemperman [91] and, most prominently, Strassen [152] studied approximations to $\log M^*_{\mathrm{s}}(n,\varepsilon)$ and $\log M^*_{\mathrm{c}}(n,\varepsilon)$. These beautiful works were largely forgotten until recently, when interest in so-called Gaussian approximations was revived by Hayashi [75, 76] and Polyanskiy-Poor-Verdú [122, 123].^1 (^1 Some of the results in [122, 123] were already announced by S. Verdú in his Shannon lecture at the 2007 International Symposium on Information Theory (ISIT) in Nice, France.) Strassen showed that the limiting statement in (1.1) may be refined to yield the asymptotic expansion
$$\log M^*_{\mathrm{s}}(n,\varepsilon) = nH(X) + \sqrt{nV(X)}\,\Phi^{-1}(\varepsilon) + O(\log n), \qquad (1.3)$$
where $V(X)$, the variance of the self-information $-\log P(X)$, is known as the source dispersion or the varentropy, terms introduced by Kostina-Verdú [97] and Kontoyiannis-Verdú [95]. In (1.3), $\Phi^{-1}$ is the inverse of the Gaussian cumulative distribution function. Observe that the first-order term in the asymptotic expansion above, namely $nH(X)$, coincides with the (first-order) fundamental limit shown by Shannon. From this expansion, one sees that if the error probability is fixed to $\varepsilon$, the deviation from the entropy incurred by operating at finite blocklength $n$ with admissible error probability $\varepsilon$ is approximately $\sqrt{V(X)/n}\,\Phi^{-1}(\varepsilon)$, a rate reduction when $\varepsilon < 1/2$ since $\Phi^{-1}(\varepsilon) < 0$ there. Thus, the quantity $V(X)$, which is a function of the source distribution just like the entropy $H(X)$, quantifies how fast the rates of optimal source codes converge to $H(X)$. Similarly, for well-behaved DMCs, under mild conditions, Strassen showed that the limiting statement in (1.2) may be refined to
$$\log M^*_{\mathrm{c}}(n,\varepsilon) = nC(W) + \sqrt{nV_\varepsilon(W)}\,\Phi^{-1}(\varepsilon) + O(\log n), \qquad (1.4)$$
where $V_\varepsilon(W)$ is a channel parameter known as the channel dispersion, a term introduced by Polyanskiy-Poor-Verdú [123]. Thus the backoff from capacity at finite blocklength $n$ and average error probability $\varepsilon$ is approximately $\sqrt{V_\varepsilon(W)/n}\,\Phi^{-1}(1-\varepsilon)$.
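As a concrete illustration of the first two terms of (1.4), the following sketch evaluates the Gaussian approximation for a binary symmetric channel, whose capacity and dispersion have well-known closed forms. The crossover probability, the target error probability and the bisection-based inverse cdf are our own illustrative choices, not quantities from the text.

```python
import math

def h2(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def Phi(x):
    """Standard Gaussian cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Phi_inv(eps):
    """Inverse Gaussian cdf by bisection (adequate for illustration)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# BSC with crossover probability p: C = log 2 - h2(p), V = p(1-p) log^2((1-p)/p)
p, eps = 0.11, 1e-3
C = math.log(2) - h2(p)
V = p * (1 - p) * math.log((1 - p) / p) ** 2

for n in [100, 1000, 10000]:
    # first two terms of (1.4), normalized by n, converted to bits
    rate = C + math.sqrt(V / n) * Phi_inv(eps)
    print(n, rate / math.log(2))
```

Since $\Phi^{-1}(\varepsilon) < 0$ for $\varepsilon < 1/2$, the approximate rate sits below capacity and approaches it like $1/\sqrt{n}$.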
1.1 Motivation for this Monograph
It turns out that the Gaussian approximations (the first two terms of (1.3) and (1.4)) are good proxies to the true non-asymptotic fundamental limits ($\log M^*_{\mathrm{s}}(n,\varepsilon)$ and $\log M^*_{\mathrm{c}}(n,\varepsilon)$) at moderate blocklengths and moderate error probabilities for some channels and sources, as shown by Polyanskiy-Poor-Verdú [123] and Kostina-Verdú [97]. For error probabilities that are not too small, the Gaussian approximation is often better than that provided by traditional error exponent or reliability function analysis [39, 56], where the code rate is fixed (below the first-order fundamental limit) and the exponential decay of the error probability is analyzed. Recent refinements to error exponent analysis using exact asymptotics [10, 11, 135] or saddlepoint approximations [137] are alternative proxies to the non-asymptotic fundamental limits. The accuracy of the Gaussian approximation in practical regimes of errors and finite blocklengths gives us motivation to study refinements to the first-order fundamental limits of other single- and multi-user problems in Shannon theory.
The study of asymptotic estimates with non-vanishing error probabilities—or more succinctly, fixed-error asymptotics—also uncovers several interesting phenomena that are not observable from studies of first-order fundamental limits in single- and multi-user information theory [33, 49]. This analysis may give engineers deeper insight into the design of practical communication systems. A non-exhaustive list includes:

It is known that the entropy $H(X)$ can be achieved universally for fixed-to-variable length almost-lossless source coding of a DMS [192], i.e., the source statistics do not have to be known. The redundancy has also been studied for prefix-free codes [27]. In the fixed-error setting (a setting complementary to [27]), it was shown by Kosut and Sankar [100, 101] that universality imposes a penalty in the third-order term of the asymptotic expansion in (1.3).

Han showed that the output from any source encoder at the optimal coding rate with asymptotically vanishing error appears almost completely random [68]. This is the so-called folklore theorem. Hayashi [75] showed that the analogue of the folklore theorem does not hold when we consider the second-order terms in asymptotic expansions (i.e., the second-order asymptotics).

Slepian and Wolf showed that separate encoding of two correlated sources incurs no loss rate-wise compared to the situation where side information is also available at all encoders [151]. As we shall see in Chapter 6, the fixed-error asymptotics in the vicinity of a corner point of the polygonal Slepian-Wolf region suggest that side information at the encoders may be beneficial.
None of the aforementioned books [33, 39, 49, 56, 67, 189] focuses exclusively on the situation where the error probabilities of various Shannon-theoretic problems are upper bounded by a non-vanishing constant $\varepsilon$ and asymptotic expansions or second-order terms are sought. This is what this monograph attempts to do.
1.2 Preview of this Monograph
This monograph is organized as follows: In the remaining parts of this chapter, we recap some quantities in information theory and results in the method of types [37, 39, 74], a particularly useful tool for the study of discrete memoryless systems. We also mention some probability bounds that will be used throughout the monograph. Most of these bounds are based on refinements of the central limit theorem, and are collectively known as Berry-Esseen theorems [17, 52]. In Chapter 2, our study of asymptotic expansions of the form (1.3) and (1.4) begins in earnest by revisiting Strassen’s work [152] on binary hypothesis testing, where the probability of false alarm is constrained to not exceed a positive constant. We find it useful to revisit the fundamentals of hypothesis testing as many information-theoretic problems, such as source and channel coding, are intimately related to hypothesis testing.
Part II of this monograph begins our study of information-theoretic problems, starting with lossless and lossy compression in Chapter 3. We emphasize, in the first part of this chapter, that (fixed-to-fixed length) lossless source coding and binary hypothesis testing are, in fact, the same problem, and so the asymptotic expansions developed in Chapter 2 may be directly employed for the purpose of lossless source coding. Lossy source coding, however, is more involved. We review the recent works in [86] and [97], where the authors independently derived asymptotic expansions for the logarithm of the minimum size of a source code that reproduces symbols up to a certain distortion, with some admissible probability of excess distortion. Channel coding is discussed in Chapter 4. In particular, we study the approximation in (1.4) for both discrete memoryless and Gaussian channels. We make it a point here to be precise about the third-order term. We state conditions on the channel under which the coefficient of the $\log n$ term can be determined exactly. This leads to some new insights concerning optimum codes for the channel coding problem. Finally, we marry source and channel coding in the study of source-channel transmission, where the probability of excess distortion in reproducing the source is non-vanishing.

[Fig. 1.1: Dependence graph of the chapters in this monograph: 1. Introduction; 2. Hypothesis Testing; 3. Source Coding; 4. Channel Coding; 5. Channels with State; 6. Slepian-Wolf; 7. Gaussian IC; 8. Gaussian A-MAC.]
Part III of this monograph contains a sparse sampling of fixed-error asymptotic results in network information theory. The problems we discuss here have conclusive second-order asymptotic characterizations (analogous to the second terms in the asymptotic expansions in (1.3) and (1.4)). They include some channels with random state (Chapter 5), such as Costa’s writing on dirty paper [30], mixed DMCs [67, Sec. 3.3], and quasi-static single-input multiple-output (SIMO) fading channels [18]. Under the fixed-error setup, we also consider the second-order asymptotics of the Slepian-Wolf [151] distributed lossless source coding problem (Chapter 6), the Gaussian interference channel (IC) in the strictly very strong interference regime [22] (Chapter 7), and the Gaussian multiple access channel (MAC) with degraded message sets (Chapter 8). The MAC with degraded message sets is also known as the cognitive [44] or asymmetric [72, 167, 128] MAC (A-MAC). Chapter 9 concludes with a brief summary of other results, together with open problems in this area of research. A dependence graph of the chapters in the monograph is shown in Fig. 1.1.
This area of information theory—fixed-error asymptotics—is vast and, at the same time, rapidly expanding. The results described herein are not meant to be exhaustive and are somewhat dependent on the author’s understanding of the subject and his preferences at the time of writing. However, the author has made it a point to ensure that the results herein are conclusive in nature. This means that the problem is solved in the information-theoretic sense, in that an operational quantity is equated to an information quantity. In terms of asymptotic expansions such as (1.3) and (1.4), by solved, we mean that either the second-order term is known or, better still, both the second- and third-order terms are known. Having articulated this, the author confesses that there are many relevant information-theoretic problems that can be considered solved in the fixed-error setting, but have not found their way into this monograph, either due to space constraints or because it was difficult to meld them seamlessly with the rest of the story.
1.3 Fundamentals of Information Theory
In this section, we review some basic information-theoretic quantities. As with every article published in Foundations and Trends in Communications and Information Theory, the reader is expected to have some background in information theory. Nevertheless, the only prerequisite required to appreciate this monograph is information theory at the level of Cover and Thomas [33]. We will also make extensive use of the method of types, for which excellent expositions can be found in [37, 39, 74]. The measure-theoretic foundations of probability will not be needed, so as to keep the exposition accessible to as wide an audience as possible.
1.3.1 Notation
The notation we use is reasonably standard and generally follows the books by Csiszár-Körner [39] and Han [67]. Random variables (e.g., $X$) and their realizations (e.g., $x$) are in upper and lower case respectively. Random variables that take on finitely many values have alphabets (supports) that are denoted by calligraphic font (e.g., $\mathcal{X}$). The cardinality of the finite set $\mathcal{X}$ is denoted as $|\mathcal{X}|$. Let the random vector $X^n$ be the vector of random variables $(X_1,\ldots,X_n)$. We use bold face $\mathbf{x} = (x_1,\ldots,x_n)$ to denote a realization of $X^n$. The set of all distributions (probability mass functions) supported on alphabet $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. The set of all conditional distributions (i.e., channels) with the input alphabet $\mathcal{X}$ and the output alphabet $\mathcal{Y}$ is denoted by $\mathcal{P}(\mathcal{Y}|\mathcal{X})$. The joint distribution induced by a marginal distribution $P\in\mathcal{P}(\mathcal{X})$ and a channel $W\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ is denoted as $P\times W$, i.e.,
$$(P\times W)(x,y) := P(x)\,W(y|x). \qquad (1.5)$$
The marginal output distribution induced by $P$ and $W$ is denoted as $PW$, i.e.,
$$(PW)(y) := \sum_{x\in\mathcal{X}} P(x)\,W(y|x). \qquad (1.6)$$
If $X$ has distribution $P$, we sometimes write this as $X \sim P$.
Vectors are indicated in lower case bold face (e.g., $\mathbf{a}$) and matrices in upper case bold face (e.g., $\mathbf{A}$). If we write $\mathbf{a}\le\mathbf{b}$ for two vectors $\mathbf{a}$ and $\mathbf{b}$ of the same length, we mean that $a_i \le b_i$ for every coordinate $i$. The transpose of $\mathbf{A}$ is denoted as $\mathbf{A}'$. The vector of all zeros and the identity matrix are denoted as $\mathbf{0}$ and $\mathbf{I}$ respectively. We sometimes make the lengths and sizes explicit, e.g., $\mathbf{0}_k$ and $\mathbf{I}_{k\times k}$. The $\ell_p$ norm (for $p\ge 1$) of a vector $\mathbf{v}$ is denoted as $\|\mathbf{v}\|_p$.
We use standard asymptotic notation [29]: $a_n \in O(b_n)$ if and only if (iff) $\limsup_{n\to\infty} |a_n/b_n| < \infty$; $a_n \in \Omega(b_n)$ iff $b_n \in O(a_n)$; $a_n \in \Theta(b_n)$ iff $a_n \in O(b_n)\cap\Omega(b_n)$; $a_n \in o(b_n)$ iff $\lim_{n\to\infty} |a_n/b_n| = 0$; and $a_n \in \omega(b_n)$ iff $\lim_{n\to\infty} |a_n/b_n| = \infty$. Finally, $a_n \sim b_n$ iff $\lim_{n\to\infty} a_n/b_n = 1$.
1.3.2 Information-Theoretic Quantities
Information-theoretic quantities are denoted in the usual way [39, 49]. All logarithms and exponential functions are to the base $\mathrm{e}$. The entropy of a discrete random variable $X$ with probability distribution $P\in\mathcal{P}(\mathcal{X})$ is denoted as
$$H(X) := -\sum_{x\in\mathcal{X}} P(x)\log P(x). \qquad (1.7)$$
For the sake of clarity, we will sometimes make the dependence on the distribution explicit and write $H(P)$ for $H(X)$. Similarly, given a pair of random variables $(X,Y)$ with joint distribution $P_{XY}$, the conditional entropy of $Y$ given $X$ is written as
$$H(Y|X) := -\sum_{x\in\mathcal{X}} P_X(x)\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log P_{Y|X}(y|x). \qquad (1.8)$$
The joint entropy is denoted as
$$H(X,Y) := -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{XY}(x,y)\log P_{XY}(x,y) \qquad (1.9)$$
$$= H(X) + H(Y|X). \qquad (1.10)$$
The mutual information is a measure of the correlation or dependence between random variables $X$ and $Y$. It is interchangeably denoted as
$$I(X;Y) := H(X) - H(X|Y) \qquad (1.11)$$
$$= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{XY}(x,y)\log\frac{P_{XY}(x,y)}{P_X(x)P_Y(y)}. \qquad (1.12)$$
Given three random variables $(X,Y,Z)$ with joint distribution $P_{XYZ}$, where $P_{XY|Z}$, $P_{X|Z}$ and $P_{Y|Z}$ denote the induced conditional distributions, the conditional mutual information is
$$I(X;Y|Z) := H(X|Z) - H(X|Y,Z) \qquad (1.13)$$
$$= \sum_{z\in\mathcal{Z}} P_Z(z)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{XY|Z}(x,y|z)\log\frac{P_{XY|Z}(x,y|z)}{P_{X|Z}(x|z)\,P_{Y|Z}(y|z)}. \qquad (1.14)$$
A particularly important quantity is the relative entropy (or Kullback-Leibler divergence [102]) between $P$ and $Q$, which are distributions on the same finite support set $\mathcal{X}$. It is defined as the expectation with respect to $P$ of the log-likelihood ratio $\log\frac{P(x)}{Q(x)}$, i.e.,
$$D(P\|Q) := \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}. \qquad (1.15)$$
Note that if there exists an $x\in\mathcal{X}$ for which $P(x) > 0$ while $Q(x) = 0$, then the relative entropy $D(P\|Q) = \infty$. If, for every $x\in\mathcal{X}$, $Q(x) = 0$ implies $P(x) = 0$, we say that $P$ is absolutely continuous with respect to $Q$ and denote this relation by $P \ll Q$. In this case, the relative entropy is finite. It is well known that $D(P\|Q) \ge 0$ and equality holds if and only if $P = Q$. Additionally, the conditional relative entropy between $V, W\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ given $P\in\mathcal{P}(\mathcal{X})$ is defined as
$$D(V\|W|P) := \sum_{x\in\mathcal{X}} P(x)\,D\big(V(\cdot|x)\,\big\|\,W(\cdot|x)\big). \qquad (1.16)$$
The mutual information is a special case of the relative entropy. In particular, we have
$$I(X;Y) = D(P_{XY}\|P_X\times P_Y) = D(P_{Y|X}\|P_Y|P_X). \qquad (1.17)$$
Furthermore, if $U$ is the uniform distribution on $\mathcal{X}$, i.e., $U(x) = 1/|\mathcal{X}|$ for all $x\in\mathcal{X}$, we have
$$D(P\|U) = \log|\mathcal{X}| - H(P). \qquad (1.18)$$
The definition of relative entropy can be extended to the case where $Q$ is not necessarily a probability measure. In this case non-negativity does not hold in general. An important property we exploit is the following: If $\mu$ denotes the counting measure (i.e., $\mu(x) = 1$ for all $x\in\mathcal{X}$), then similarly to (1.18),
$$D(P\|\mu) = -H(P). \qquad (1.19)$$
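The identity (1.18) relating relative entropy to entropy can be checked numerically; the distribution below is an arbitrary example of ours.

```python
import math

def entropy(P):
    """Shannon entropy in nats of a pmf given as a dict."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def rel_entropy(P, Q):
    """D(P||Q) in nats; assumes P is absolutely continuous w.r.t. Q."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items() if p > 0)

P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
U = {x: 1 / 3 for x in P}  # uniform distribution on the same alphabet

# D(P||U) = log|X| - H(P), as in (1.18)
print(rel_entropy(P, U), math.log(3) - entropy(P))
```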
1.4 The Method of Types
For finite alphabets, a particularly convenient tool in information theory is the method of types [37, 39, 74]. For a sequence $x^n = (x_1,\ldots,x_n)\in\mathcal{X}^n$ in which $\mathcal{X}$ is finite, its type or empirical distribution is the probability mass function
$$P_{x^n}(a) := \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{x_i = a\}, \qquad a\in\mathcal{X}. \qquad (1.20)$$
Throughout, we use the notation $\mathbb{1}\{\mathrm{clause}\}$ to mean the indicator function, i.e., this function equals $1$ if the clause is true and $0$ otherwise. The set of types formed from length-$n$ sequences in $\mathcal{X}^n$ is denoted as $\mathcal{P}_n(\mathcal{X})$. This is clearly a subset of $\mathcal{P}(\mathcal{X})$. The type class of $P\in\mathcal{P}_n(\mathcal{X})$, denoted as $\mathcal{T}_P$, is the set of all sequences of length $n$ for which their type is $P$, i.e.,
$$\mathcal{T}_P := \{x^n\in\mathcal{X}^n : P_{x^n} = P\}. \qquad (1.21)$$
It is customary to indicate the dependence of $\mathcal{T}_P$ on the blocklength $n$ but we suppress this dependence for the sake of conciseness throughout. For a sequence $x^n\in\mathcal{X}^n$ and a conditional distribution $V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$, the set of all sequences $y^n\in\mathcal{Y}^n$ such that $(x^n,y^n)$ has joint type $P_{x^n}\times V$ is the $V$-shell, denoted as $\mathcal{T}_V(x^n)$. In other words,
$$\mathcal{T}_V(x^n) := \{y^n\in\mathcal{Y}^n : P_{x^n,y^n} = P_{x^n}\times V\}. \qquad (1.22)$$
The conditional distribution $V$ is also known as the conditional type of $y^n$ given $x^n$. Let $\mathcal{V}_n(\mathcal{Y};P)$ be the set of all $V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ for which the $V$-shell of a sequence of type $P$ is non-empty.
We will oftentimes find it useful to consider information-theoretic quantities of empirical distributions. All such quantities are denoted using hats. So, for example, the empirical entropy of a sequence $x^n\in\mathcal{X}^n$ is denoted as
$$\hat{H}(x^n) := H(P_{x^n}). \qquad (1.23)$$
The empirical conditional entropy of $y^n\in\mathcal{Y}^n$ given $x^n\in\mathcal{X}^n$, where $x^n\in\mathcal{T}_P$ and $y^n\in\mathcal{T}_V(x^n)$, is denoted as
$$\hat{H}(y^n|x^n) := H(V|P). \qquad (1.24)$$
The empirical mutual information of a pair of sequences $(x^n,y^n)$ with joint type $P\times V$ is denoted as
$$\hat{I}(x^n \wedge y^n) := I(P,V). \qquad (1.25)$$
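The empirical entropy (1.23) is just the entropy of the type; a minimal sketch (the sequences are our own examples):

```python
import math
from collections import Counter

def empirical_entropy(xs):
    """Entropy, in nats, of the type of the sequence xs, as in (1.23)."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

print(empirical_entropy("aabb"))  # type (1/2, 1/2), so log 2
print(empirical_entropy("aaaa"))  # degenerate type, so 0
```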
Lemma 1.1 (Type Counting).
The sets $\mathcal{P}_n(\mathcal{X})$ and $\mathcal{V}_n(\mathcal{Y};P)$ for $P\in\mathcal{P}_n(\mathcal{X})$ satisfy
$$|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|}, \qquad |\mathcal{V}_n(\mathcal{Y};P)| \le (n+1)^{|\mathcal{X}||\mathcal{Y}|}. \qquad (1.26)$$
In fact, it is easy to check that $|\mathcal{P}_n(\mathcal{X})| = \binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$, but (1.26) or its slightly stronger version
$$|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|-1} \qquad (1.27)$$
usually suffices for our purposes in this monograph. This key property says that the number of types is polynomial in the blocklength $n$.
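The polynomial bound (1.26) can be checked by brute force for tiny alphabets and blocklengths; the alphabet and the range of $n$ below are our own choices.

```python
from itertools import product
from collections import Counter

def num_types(n, alphabet):
    """Count the empirical distributions of length-n sequences by enumeration."""
    seen = set()
    for seq in product(alphabet, repeat=n):
        counts = Counter(seq)
        seen.add(tuple(counts[a] for a in alphabet))
    return len(seen)

alphabet = ('0', '1', '2')  # |X| = 3
for n in range(2, 6):
    t = num_types(n, alphabet)
    print(n, t, (n + 1) ** len(alphabet))  # |P_n(X)| vs the bound in (1.26)
```

The exact count is the number of compositions, $\binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$, far below the polynomial bound.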
Lemma 1.2 (Size of Type Class).
For a type $P\in\mathcal{P}_n(\mathcal{X})$, the type class $\mathcal{T}_P$ satisfies
$$(n+1)^{-|\mathcal{X}|}\exp\big(nH(P)\big) \le |\mathcal{T}_P| \le \exp\big(nH(P)\big). \qquad (1.28)$$
For a conditional type $V\in\mathcal{V}_n(\mathcal{Y};P)$ and a sequence $x^n\in\mathcal{T}_P$, the $V$-shell $\mathcal{T}_V(x^n)$ satisfies
$$(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\exp\big(nH(V|P)\big) \le |\mathcal{T}_V(x^n)| \le \exp\big(nH(V|P)\big). \qquad (1.29)$$
This lemma says that, on the exponential scale,
$$|\mathcal{T}_P| \doteq \exp\big(nH(P)\big), \qquad (1.30)$$
where we used the notation $a_n \doteq b_n$ to mean equality up to a polynomial, i.e., there exist polynomials $p_n$ and $q_n$ such that $p_n^{-1} b_n \le a_n \le q_n b_n$. We now consider probabilities of sequences. Throughout, for a distribution $Q\in\mathcal{P}(\mathcal{X})$, we let $Q^n$ be the $n$-fold product distribution, i.e.,
$$Q^n(x^n) := \prod_{i=1}^n Q(x_i). \qquad (1.31)$$
Lemma 1.3 (Probability of Sequences).
If $x^n\in\mathcal{T}_Q$ for $Q\in\mathcal{P}_n(\mathcal{X})$ and $P\in\mathcal{P}(\mathcal{X})$,
$$P^n(x^n) = \exp\Big(-n\big(D(Q\|P) + H(Q)\big)\Big), \qquad (1.32)$$
$$Q^n(x^n) = \exp\big(-nH(Q)\big). \qquad (1.33)$$
This, together with Lemma 1.2, leads immediately to the final lemma in this section.
Lemma 1.4 (Probability of Type Classes).
For a type $Q\in\mathcal{P}_n(\mathcal{X})$ and any $P\in\mathcal{P}(\mathcal{X})$,
$$(n+1)^{-|\mathcal{X}|}\exp\big(-nD(Q\|P)\big) \le P^n(\mathcal{T}_Q) \le \exp\big(-nD(Q\|P)\big). \qquad (1.34)$$
For a conditional type $V\in\mathcal{V}_n(\mathcal{Y};P)$, a sequence $x^n\in\mathcal{T}_P$ and any channel $W\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$, we have
$$(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\exp\big(-nD(V\|W|P)\big) \le W^n\big(\mathcal{T}_V(x^n)\,\big|\,x^n\big) \le \exp\big(-nD(V\|W|P)\big). \qquad (1.35)$$
The interpretation of this lemma is that the probability that a random i.i.d. (independently and identically distributed) sequence $X^n$ generated from $P^n$ belongs to the type class $\mathcal{T}_Q$ is exponentially small with exponent $D(Q\|P)$, i.e.,
$$P^n(\mathcal{T}_Q) \doteq \exp\big(-nD(Q\|P)\big). \qquad (1.36)$$
The bounds in (1.35) can be interpreted similarly.
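For a binary alphabet, the probability $P^n(\mathcal{T}_Q)$ in (1.34) can be computed exactly from a binomial coefficient, making the exponent $D(Q\|P)$ visible; the distributions below are our own example.

```python
import math

def rel_entropy(P, Q):
    """D(P||Q) in nats for pmfs given as dicts."""
    return sum(p * math.log(p / Q[a]) for a, p in P.items() if p > 0)

P = {'0': 0.7, '1': 0.3}  # i.i.d. generating distribution
Q = {'0': 0.5, '1': 0.5}  # the type with n/2 ones (n even)

D = rel_entropy(Q, P)
for n in [10, 100, 1000]:
    k = n // 2
    # exact P^n(T_Q): number of sequences of type Q times the per-sequence probability
    prob = math.comb(n, k) * (0.7 ** k) * (0.3 ** (n - k))
    print(n, -math.log(prob) / n, D)  # normalized exponent vs D(Q||P)
```

The gap between the normalized exponent and $D(Q\|P)$ is of order $(\log n)/n$, consistent with the polynomial factor in (1.34).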
1.5 Probability Bounds
In this section, we summarize some bounds on probabilities that we use extensively in the sequel. For a random variable $X$, we let $\mathbb{E}[X]$ and $\mathrm{Var}(X)$ be its expectation and variance respectively. To emphasize that the expectation is taken with respect to a random variable with distribution $P$, we sometimes make this explicit by using a subscript, i.e., $\mathbb{E}_P[X]$ or $\mathrm{Var}_P(X)$.
1.5.1 Basic Bounds
We start with the familiar Markov and Chebyshev inequalities.
Proposition 1.1 (Markov’s inequality).
Let $X$ be a real-valued non-negative random variable. Then for any $a > 0$, we have
$$\Pr(X \ge a) \le \frac{\mathbb{E}[X]}{a}. \qquad (1.37)$$
If we let $X$ above be the non-negative random variable $(Y - \mathbb{E}[Y])^2$, we obtain Chebyshev’s inequality.
Proposition 1.2 (Chebyshev’s inequality).
Let $Y$ be a real-valued random variable with mean $\mu$ and variance $\sigma^2$. Then for any $b > 0$, we have
$$\Pr\big(|Y - \mu| \ge b\sigma\big) \le \frac{1}{b^2}. \qquad (1.38)$$
We now consider a collection of real-valued random variables that are i.i.d. In particular, let $X_1,\ldots,X_n$ be a collection of independent random variables where each $X_i$ has distribution $P$ with zero mean and finite variance $\sigma^2$.
Proposition 1.3 (Weak Law of Large Numbers).
For every $\epsilon > 0$, we have
$$\lim_{n\to\infty}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^n X_i\right| > \epsilon\right) = 0. \qquad (1.39)$$
Consequently, the average $\frac{1}{n}\sum_{i=1}^n X_i$ converges to $0$ (the common mean) in probability.
1.5.2 Central Limit-Type Bounds
In preparation for the next result, we denote the probability density function (pdf) of a univariate Gaussian as
$$\mathcal{N}(x;\mu,\sigma^2) := \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \qquad (1.40)$$
We will also denote this as $\mathcal{N}(\mu,\sigma^2)$ if the argument $x$ is unnecessary. A standard Gaussian distribution is one in which the mean $\mu = 0$ and the standard deviation $\sigma = 1$. In the multivariate case, the pdf is
$$\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) := \frac{1}{\sqrt{(2\pi)^k|\boldsymbol{\Sigma}|}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), \qquad (1.41)$$
where $\mathbf{x}\in\mathbb{R}^k$. A standard multivariate Gaussian distribution is one in which the mean is $\mathbf{0}$ and the covariance is the identity matrix $\mathbf{I}$.
For the univariate case, the cumulative distribution function (cdf) of the standard Gaussian is denoted as
$$\Phi(y) := \int_{-\infty}^{y} \mathcal{N}(x;0,1)\,\mathrm{d}x. \qquad (1.42)$$
We also find it convenient to introduce the inverse of $\Phi$ as
$$\Phi^{-1}(\varepsilon) := \sup\{y\in\mathbb{R} : \Phi(y) \le \varepsilon\}, \qquad (1.43)$$
which evaluates to the usual inverse for $\varepsilon\in(0,1)$ and extends continuously to take values $\pm\infty$ for $\varepsilon$ outside $(0,1)$. These monotonically increasing functions are shown in Fig. 1.2.
If the scaling in front of the sum in the statement of the law of large numbers in (1.39) is $1/\sqrt{n}$ instead of $1/n$, the resultant random variable converges in distribution to a Gaussian random variable. As in Proposition 1.3, let $X_1,\ldots,X_n$ be a collection of i.i.d. random variables where each $X_i$ has zero mean and finite variance $\sigma^2$.
Proposition 1.4 (Central Limit Theorem).
For any $a\in\mathbb{R}$, we have
$$\lim_{n\to\infty}\Pr\left(\frac{1}{\sqrt{n}\,\sigma}\sum_{i=1}^n X_i \le a\right) = \Phi(a). \qquad (1.44)$$
In other words,
$$\frac{1}{\sqrt{n}\,\sigma}\sum_{i=1}^n X_i \xrightarrow{\;\mathrm{d}\;} Z, \qquad (1.45)$$
where $\xrightarrow{\;\mathrm{d}\;}$ means convergence in distribution and $Z \sim \mathcal{N}(0,1)$ is the standard Gaussian random variable.
Throughout the monograph, in the evaluation of the non-asymptotic bounds, we will use a more quantitative version of the central limit theorem known as the Berry-Esseen theorem [17, 52]. See Feller [54, Sec. XVI.5] for a proof.
Theorem 1.1 (Berry-Esseen Theorem (i.i.d. Version)).
Assume that the third absolute moment is finite, i.e., $T := \mathbb{E}\big[|X_1|^3\big] < \infty$. For every $n\in\mathbb{N}$, we have
$$\sup_{a\in\mathbb{R}}\left|\Pr\left(\frac{1}{\sqrt{n}\,\sigma}\sum_{i=1}^n X_i \le a\right) - \Phi(a)\right| \le \frac{6\,T}{\sigma^3\sqrt{n}}. \qquad (1.46)$$
Remarkably, the Berry-Esseen theorem says that the convergence in the central limit theorem in (1.44) is uniform in $a$. Furthermore, the convergence of the distribution function of $\frac{1}{\sqrt{n}\,\sigma}\sum_{i=1}^n X_i$ to the Gaussian cdf occurs at a rate of $O(1/\sqrt{n})$. The constant of proportionality in the $O(\cdot)$ notation depends only on the variance $\sigma^2$ and the third absolute moment $T$, and not on any other statistics of the random variables.
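For Bernoulli sums, the left-hand side of (1.46) can be computed exactly from the binomial cdf, since the supremum is attained at the jump points of the empirical distribution function; the parameter choices below are ours.

```python
import math

def Phi(a):
    """Standard Gaussian cdf."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def be_gap(n, p=0.3):
    """sup_a |Pr(normalized Bernoulli(p) sum <= a) - Phi(a)|, computed exactly."""
    sigma = math.sqrt(p * (1 - p))
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        a = (k - n * p) / (math.sqrt(n) * sigma)  # standardized jump location
        gap = max(gap, abs(cdf - Phi(a)))  # just below the jump
        cdf += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        gap = max(gap, abs(cdf - Phi(a)))  # at the jump
    return gap

for n in [10, 100, 1000]:
    print(n, be_gap(n), be_gap(n) * math.sqrt(n))  # the scaled gap stays bounded
```

The scaled gap remaining bounded as $n$ grows is exactly the $O(1/\sqrt{n})$ convergence rate asserted by the theorem.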
There are many generalizations of the Berry-Esseen theorem. One which we will need is the relaxation of the assumption that the random variables are identically distributed. Let $X_1,\ldots,X_n$ be a collection of independent random variables where each random variable $X_i$ has zero mean, variance $\sigma_i^2 > 0$ and third absolute moment $T_i := \mathbb{E}\big[|X_i|^3\big] < \infty$. We respectively define the average variance and average third absolute moment as
$$\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2, \qquad T := \frac{1}{n}\sum_{i=1}^n T_i. \qquad (1.47)$$
Theorem 1.2 (Berry-Esseen Theorem (General Version)).
For every $n\in\mathbb{N}$, we have
$$\sup_{a\in\mathbb{R}}\left|\Pr\left(\frac{1}{\sqrt{n}\,\sigma}\sum_{i=1}^n X_i \le a\right) - \Phi(a)\right| \le \frac{6\,T}{\sigma^3\sqrt{n}}. \qquad (1.48)$$
Observe that, as with the i.i.d. version of the Berry-Esseen theorem, the remainder term scales as $O(1/\sqrt{n})$.
The proof of the following theorem uses the Berry-Esseen theorem (among other techniques). This theorem is proved in Polyanskiy-Poor-Verdú [123, Lem. 47]. Together with its variants, this theorem is useful for obtaining third-order asymptotics for binary hypothesis testing and other coding problems with non-vanishing error probabilities.
Theorem 1.3.
Assume the same setup as in Theorem 1.2. For any $\lambda\in\mathbb{R}$, we have
$$\mathbb{E}\left[\exp\left(-\sum_{i=1}^n X_i\right)\mathbb{1}\left\{\sum_{i=1}^n X_i > \lambda\right\}\right] \le 2\left(\frac{\log 2}{\sqrt{2\pi}} + \frac{12\,T}{\sigma^2}\right)\frac{\mathrm{e}^{-\lambda}}{\sigma\sqrt{n}}. \qquad (1.49)$$
It is trivial to see that the expectation in (1.49) is upper bounded by $\mathrm{e}^{-\lambda}$. The additional factor of $1/\sqrt{n}$ is crucial in proving coding theorems with better third-order terms. Readers familiar with strong large deviation theorems or exact asymptotics (see, e.g., [23, Thms. 3.3 and 3.5] or [43, Thm. 3.7.4]) will notice that (1.49) is in the same spirit as the theorem by Bahadur and Ranga Rao [13]. There are two advantages of (1.49) compared to strong large deviation theorems. First, the bound is purely in terms of $\sigma^2$ and $T$, and second, one does not have to differentiate between lattice and non-lattice random variables. The disadvantage of (1.49) is that the constant is worse, but this will not concern us as we focus on asymptotic results in this monograph, hence constants do not affect the main results.
For multi-terminal problems that we encounter in the latter parts of this monograph, we will require vector (or multidimensional) versions of the Berry-Esseen theorem. The following is due to Götze [63].
Theorem 1.4 (Vector Berry-Esseen Theorem I).
Let $\mathbf{X}_1,\ldots,\mathbf{X}_n$ be independent $\mathbb{R}^k$-valued random vectors with zero mean. Let
$$\mathbf{S}_n := \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{X}_i. \qquad (1.50)$$
Assume that $\mathbf{S}_n$ has the following statistics:
$$\mathrm{Cov}(\mathbf{S}_n) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\mathbf{X}_i\mathbf{X}_i'\big] = \mathbf{I}, \qquad \xi := \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\|\mathbf{X}_i\|_2^3\big]. \qquad (1.51)$$
Let $\mathbf{Z}$ be a standard Gaussian random vector, i.e., its distribution is $\mathcal{N}(\mathbf{0},\mathbf{I})$. Then, for all $n\in\mathbb{N}$, we have
$$\sup_{\mathscr{C}\in\mathfrak{C}_k}\big|\Pr(\mathbf{S}_n\in\mathscr{C}) - \Pr(\mathbf{Z}\in\mathscr{C})\big| \le \frac{c_k\,\xi}{\sqrt{n}}, \qquad (1.52)$$
where $\mathfrak{C}_k$ is the family of all convex subsets of $\mathbb{R}^k$, and where $c_k$ is a constant that depends only on the dimension $k$.
Theorem 1.4 can be applied to random vectors that are independent but not necessarily identically distributed. The constant $c_k$ can be upper bounded by $400\,k^{1/4}$ if the random vectors are i.i.d., a result by Bentkus [15]. However, its precise value will not be of concern to us in this monograph. Observe that the scalar versions of the Berry-Esseen theorem (in Theorems 1.1 and 1.2) are special cases (apart from the constant) of the vector version in which the family of convex subsets is restricted to the family of semi-infinite intervals $\{(-\infty,a] : a\in\mathbb{R}\}$.
We will frequently encounter random vectors with non-identity covariance matrices. The following modification of Theorem 1.4 is due to Watanabe-Kuzuoka-Tan [177, Cor. 29].
Corollary 1.1 (Vector Berry-Esseen Theorem II).
Assume the same setup as in Theorem 1.4, except that $\mathrm{Cov}(\mathbf{S}_n) = \mathbf{V}$, a positive definite matrix, and $\mathbf{Z} \sim \mathcal{N}(\mathbf{0},\mathbf{V})$. Then, for all $n\in\mathbb{N}$, we have
$$\sup_{\mathscr{C}\in\mathfrak{C}_k}\big|\Pr(\mathbf{S}_n\in\mathscr{C}) - \Pr(\mathbf{Z}\in\mathscr{C})\big| \le \frac{c_k\,\xi}{\lambda_{\min}(\mathbf{V})^{3/2}\sqrt{n}}, \qquad (1.53)$$
where $\lambda_{\min}(\mathbf{V})$ is the smallest eigenvalue of $\mathbf{V}$.
The final probability bound is a quantitative version of the so-called multivariate delta method [174, Thm. 5.15]. Numerous similar statements of varying generalities have appeared in the statistics literature (e.g., [24, 175]). The simple version we present was shown by MolavianJazi and Laneman [112], who extended ideas in Hoeffding and Robbins’ paper [81, Thm. 4] to provide rates of convergence to Gaussianity under appropriate technical conditions. This result essentially says that a differentiable function of a normalized sum of independent random vectors also satisfies a Berry-Esseen-type result.
Theorem 1.5 (Berry-Esseen Theorem for Functions of i.i.d. Random Vectors).
Assume that $\mathbf{X}_1,\ldots,\mathbf{X}_n$ are $\mathbb{R}^k$-valued, zero-mean, i.i.d. random vectors with positive definite covariance matrix $\mathbf{V}$ and finite third absolute moment $\xi := \mathbb{E}\big[\|\mathbf{X}_1\|_2^3\big]$. Let $\mathbf{f} : \mathbb{R}^k \to \mathbb{R}^l$ be a vector-valued function that is also twice continuously differentiable in a neighborhood of $\mathbf{0}$. Let $\mathbf{J}\in\mathbb{R}^{l\times k}$ be the Jacobian matrix of $\mathbf{f}$ evaluated at $\mathbf{0}$, i.e., its elements are
$$J_{ij} := \frac{\partial f_i(\mathbf{x})}{\partial x_j}\bigg|_{\mathbf{x}=\mathbf{0}}, \qquad (1.54)$$
where $i\in\{1,\ldots,l\}$ and $j\in\{1,\ldots,k\}$. Then, for every $n\in\mathbb{N}$, we have
$$\sup_{\mathbf{y}\in\mathbb{R}^l}\left|\Pr\left(\sqrt{n}\left(\mathbf{f}\Big(\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\Big) - \mathbf{f}(\mathbf{0})\right) \le \mathbf{y}\right) - \Pr(\mathbf{Z}\le\mathbf{y})\right| \le \frac{c}{\sqrt{n}}, \qquad (1.55)$$
where $c$ is a finite constant, and $\mathbf{Z}$ is a Gaussian random vector in $\mathbb{R}^l$ with mean vector and covariance matrix respectively given as
$$\mathbb{E}[\mathbf{Z}] = \mathbf{0}, \qquad \mathrm{Cov}(\mathbf{Z}) = \mathbf{J}\,\mathbf{V}\,\mathbf{J}'. \qquad (1.56)$$
Chapter 2 Binary Hypothesis Testing
In this chapter, we review asymptotic expansions in simple (non-composite) binary hypothesis testing when one of the two error probabilities is non-vanishing. We find this useful, as many coding theorems we encounter in subsequent chapters can be stated in terms of quantities related to binary hypothesis testing. For example, as pointed out in Csiszár and Körner [39, Ch. 1], fixed-to-fixed length lossless source coding and binary hypothesis testing are intimately connected through the relation between relative entropy and entropy in (1.18). Another example is in point-to-point channel coding, where a powerful non-asymptotic converse theorem [152, Eq. (4.29)] [123, Sec. III-E] [164, Prop. 6] can be stated in terms of the so-called hypothesis testing divergence and the information spectrum divergence (cf. Proposition 4.4). The properties of these two quantities, as well as the relation between them, are discussed. Using various probabilistic limit theorems, we also evaluate these quantities in the asymptotic setting for product distributions. A corollary of the results presented is the familiar Chernoff-Stein lemma [39, Thm. 1.2], which asserts that the exponential rate of decay, in the number of observations, of the type-II error probability for a non-vanishing type-I error probability in a binary hypothesis test of $P$ against $Q$ is the relative entropy $D(P\|Q)$.
The material in this chapter is based largely on the seminal work by Strassen [152, Thm. 3.1]. The exposition is based on the more recent works by Polyanskiy-Poor-Verdú [123, App. C], Tomamichel-Tan [164, Sec. III] and Tomamichel-Hayashi [163, Lem. 12].
2.1 Non-Asymptotic Quantities and Their Properties
Consider the simple (non-composite) binary hypothesis test
$$\mathrm{H}_0 : Z \sim P \qquad \text{vs.} \qquad \mathrm{H}_1 : Z \sim Q, \qquad (2.1)$$
where $P$ and $Q$ are two probability distributions on the same space $\mathcal{Z}$. We assume that the space $\mathcal{Z}$ is finite to keep the subsequent exposition simple. The notation in (2.1) means that under the null hypothesis $\mathrm{H}_0$, the random variable $Z$ is distributed as $P$, while under the alternative hypothesis $\mathrm{H}_1$, it is distributed according to a different distribution $Q$. We would like to study the optimal performance of a hypothesis test in terms of the distributions $P$ and $Q$.
There are several ways to measure the performance of a hypothesis test which, in precise terms, is a mapping $\delta : \mathcal{Z} \to [0,1]$ from the observation space to $[0,1]$, where $\delta(z)$ is the probability with which the test favors the null hypothesis upon observing $z$. If the observation $z$ is such that $\delta(z) = 1$, this means the test favors the null hypothesis $\mathrm{H}_0$. Conversely, $\delta(z) = 0$ means that the test favors the alternative hypothesis $\mathrm{H}_1$ (or alternatively, rejects the null hypothesis $\mathrm{H}_0$). If $\delta(z)\in\{0,1\}$ for all $z\in\mathcal{Z}$, the test is called deterministic, otherwise it is called randomized. Traditionally, there are three quantities that are of interest for a given test $\delta$. The first is the probability of false alarm
$$\mathrm{P}_{\mathrm{FA}}(\delta) := \mathbb{E}_P\big[1-\delta(Z)\big]. \qquad (2.2)$$
The second is the probability of missed detection
$$\mathrm{P}_{\mathrm{MD}}(\delta) := \mathbb{E}_Q\big[\delta(Z)\big]. \qquad (2.3)$$
The third is the probability of detection, which is one minus the probability of missed detection, i.e.,
$$\mathrm{P}_{\mathrm{D}}(\delta) := 1 - \mathrm{P}_{\mathrm{MD}}(\delta) = \mathbb{E}_Q\big[1-\delta(Z)\big]. \qquad (2.4)$$
The probabilities of false alarm and missed detection are traditionally called the type-I and type-II errors respectively in the statistics literature. The probability of detection and the probability of false alarm are also called the power and the significance level respectively. The “holy grail” is, of course, to design a test $\delta$ such that $\mathrm{P}_{\mathrm{FA}}(\delta) = 0$ while $\mathrm{P}_{\mathrm{D}}(\delta) = 1$, but this is clearly impossible unless $P$ and $Q$ are mutually singular measures.
Since misses are usually more costly than false alarms, let us fix a number $\varepsilon\in(0,1)$ that represents a tolerable probability of false alarm (type-I error). Then define the smallest type-II error in the binary hypothesis test (2.1) with type-I error not exceeding $\varepsilon$, i.e.,
$$\beta_{1-\varepsilon}(P,Q) := \min\big\{\mathbb{E}_Q[\delta(Z)] : \delta \text{ such that } \mathbb{E}_P[\delta(Z)] \ge 1-\varepsilon\big\}. \qquad (2.5)$$
Observe that $\mathbb{E}_P[\delta(Z)] \ge 1-\varepsilon$ constrains the probability of false alarm to be no greater than $\varepsilon$. Thus, we are searching over all tests $\delta$ satisfying this constraint such that the probability of missed detection is minimized. Intuitively, $\beta_{1-\varepsilon}(P,Q)$ quantifies, in a non-asymptotic fashion, the performance of an optimal hypothesis test between $P$ and $Q$.
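For small finite alphabets, the quantity defined in (2.5) can be computed directly via the Neyman-Pearson lemma: favor the null hypothesis on outcomes with the largest likelihood ratio first, randomizing on the boundary outcome. A minimal sketch (the distributions are our own example):

```python
import math

def beta(eps, P, Q):
    """Smallest E_Q[delta] over tests with E_P[delta] >= 1 - eps (Neyman-Pearson)."""
    need = 1 - eps  # P-mass the test must retain for the null hypothesis
    b = 0.0
    support = [z for z in P if P[z] > 0]
    # take outcomes in decreasing order of the likelihood ratio P(z)/Q(z)
    for z in sorted(support,
                    key=lambda z: P[z] / Q[z] if Q[z] > 0 else math.inf,
                    reverse=True):
        take = min(1.0, need / P[z])  # randomize on the boundary outcome
        b += take * Q[z]
        need -= P[z]
        if need <= 0:
            break
    return b

P = {'a': 0.5, 'b': 0.4, 'c': 0.1}
Q = {'a': 0.2, 'b': 0.3, 'c': 0.5}
print(beta(0.1, P, Q))  # approximately 0.5: the test keeps {a, b}
```

Loosening the false-alarm constraint (larger ε) can only shrink the attainable type-II error, as the greedy construction makes plain.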
A related quantity is the hypothesis testing divergence
$$D_{\mathrm{h}}^{\varepsilon}(P\|Q) := \log\frac{1-\varepsilon}{\beta_{1-\varepsilon}(P,Q)}. \qquad (2.6)$$
This is a measure of the distinguishability of $P$ from $Q$. As can be seen from (2.6), $D_{\mathrm{h}}^{\varepsilon}(P\|Q)$ and $\beta_{1-\varepsilon}(P,Q)$ are simple functions of each other. We prefer to express the results in this monograph mostly in terms of $D_{\mathrm{h}}^{\varepsilon}$ because it shares similar properties with the usual relative entropy $D(P\|Q)$, as is evidenced from the following lemma.
Lemma 2.1 (Properties of $D_{\mathrm{h}}^{\varepsilon}$).
While the hypothesis testing divergence occurs naturally and frequently in coding problems, it is usually hard to analyze directly. Thus, we now introduce an equally important quantity. Define the information spectrum divergence as
$$D_{\mathrm{s}}^{\varepsilon}(P\|Q) := \sup\left\{R\in\mathbb{R} : \Pr\left(\log\frac{P(Z)}{Q(Z)} \le R\right) \le \varepsilon\right\}, \qquad Z \sim P. \qquad (2.9)$$
Just as in information spectrum analysis [67], this quantity places the distribution of the log-likelihood ratio $\log\frac{P(Z)}{Q(Z)}$ (where $Z \sim P$), and not just its expectation, in the most prominent role. See Fig. 2.1 for an interpretation of the definition in (2.9).
As we will see, the information spectrum divergence is intimately related to the hypothesis testing divergence (cf. Lemma 2.4). The former is, however, easier to compute. Note that if $P$ and $Q$ are product measures, then by virtue of the fact that $\log\frac{P(Z)}{Q(Z)}$ is a sum of independent random variables, one can estimate the probability in (2.9) using various probability tail bounds. This we do in the following section.
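For i.i.d. binary $P$ and $Q$, the log-likelihood ratio in (2.9) is determined by the number of ones, so the information spectrum divergence of the product distributions follows exactly from the binomial distribution. A sketch with our own parameter choices, whose normalized value approaches $D(P\|Q)$ in accordance with the Chernoff-Stein lemma:

```python
import math

p, q, eps = 0.7, 0.4, 0.1  # P = Bern(p), Q = Bern(q)

def D_s(n):
    """D_s^eps(P^n || Q^n) computed exactly from the binomial pmf under P^n."""
    lr1 = math.log(p / q)              # LLR contribution of a one
    lr0 = math.log((1 - p) / (1 - q))  # LLR contribution of a zero
    # values of the LLR sum and their probabilities under P^n
    pts = sorted((k * lr1 + (n - k) * lr0,
                  math.comb(n, k) * p ** k * (1 - p) ** (n - k))
                 for k in range(n + 1))
    cdf = 0.0
    for val, prob in pts:
        if cdf + prob > eps:  # sup{R : Pr(LLR <= R) <= eps} is the next jump
            return val
        cdf += prob
    return math.inf

D = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
for n in [10, 100, 400]:
    print(n, D_s(n) / n, D)  # normalized divergence vs D(P||Q)
```

The shortfall below $D(P\|Q)$ at finite $n$ is the second-order $\sqrt{V/n}\,\Phi^{-1}(\varepsilon)$ effect that this monograph studies.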
We now state two useful properties of $D_{\mathrm{s}}^{\varepsilon}$. The proofs of these lemmas are straightforward and can be found in [164, Sec. III.A].
Lemma 2.2 (Sifting from a convex combination).
Let $\varepsilon\in(0,1)$ and let $Q = \sum_k \alpha_k Q_k$ be an at most countable convex combination of distributions $Q_k$ with non-negative weights $\alpha_k$ summing to one. Then,
$$D_{\mathrm{s}}^{\varepsilon}(P\|Q) \le D_{\mathrm{s}}^{\varepsilon}(P\|Q_k) + \log\frac{1}{\alpha_k} \qquad \text{for every } k. \qquad (2.10)$$
In particular, Lemma 2.2 tells us that if there exists some $\gamma > 0$ such that $\alpha_k \ge \gamma$ for all $k$, then
$$D_{\mathrm{s}}^{\varepsilon}(P\|Q) \le \min_k D_{\mathrm{s}}^{\varepsilon}(P\|Q_k) + \log\frac{1}{\gamma}. \qquad (2.11)$$
Lemma 2.3 (“Symbol-wise” relaxation of $D_{\mathrm{s}}^{\varepsilon}$).
Let $V$ and $W$ be two channels from $\mathcal{X}$ to $\mathcal{Y}$ and let $P\in\mathcal{P}(\mathcal{X})$. Then,
$$D_{\mathrm{s}}^{\varepsilon}(P\times V\|P\times W) \le \max_{x\in\mathrm{supp}(P)} D_{\mathrm{s}}^{\varepsilon}\big(V(\cdot|x)\,\big\|\,W(\cdot|x)\big). \qquad (2.12)$$
One can readily toggle between the hypothesis testing divergence and the information spectrum divergence because they satisfy the bounds in the following lemma. The proof of this lemma mimics that of [163, Lem. 12].
Lemma 2.4 (Relation between divergences).
For every $\varepsilon\in(0,1)$ and every $\eta\in(0,1-\varepsilon)$, we have
$$D_{\mathrm{s}}^{\varepsilon}(P\|Q) + \log(1-\varepsilon) \le D_{\mathrm{h}}^{\varepsilon}(P\|Q), \qquad (2.13)$$
$$D_{\mathrm{h}}^{\varepsilon}(P\|Q) \le D_{\mathrm{s}}^{\varepsilon+\eta}(P\|Q) + \log\frac{1-\varepsilon}{\eta}. \qquad (2.14)$$
Proof.
The following proof is based on that for [163, Lem. 12]. For the lower bound in (2.13), consider the likelihood ratio test
$$\delta(z) := \mathbb{1}\left\{\log\frac{P(z)}{Q(z)} > R\right\}, \qquad R := D_{\mathrm{s}}^{\varepsilon}(P\|Q) - \eta, \qquad (2.15)$$
for some small $\eta > 0$. This test clearly satisfies $\mathbb{E}_P[1-\delta(Z)] \le \varepsilon$ by the definition of the information spectrum divergence. On the other hand,
$$\mathbb{E}_Q[\delta(Z)] = \sum_{z\,:\,\log\frac{P(z)}{Q(z)} > R} Q(z) \qquad (2.16)$$
$$\le \sum_{z\,:\,\log\frac{P(z)}{Q(z)} > R} P(z)\,\mathrm{e}^{-R} \qquad (2.17)$$
$$\le \mathrm{e}^{-R} \qquad (2.18)$$
$$= \exp\big(-D_{\mathrm{s}}^{\varepsilon}(P\|Q) + \eta\big). \qquad (2.19)$$
As a result, by the definition of $D_{\mathrm{h}}^{\varepsilon}(P\|Q)$ in (2.6), we have