# Beta-Beta Bounds: Finite-Blocklength Analog of the Golden Formula

## Abstract

It is well known that the mutual information between two random variables can be expressed as the difference of two relative entropies that depend on an auxiliary distribution, a relation sometimes referred to as the golden formula. This paper is concerned with a finite-blocklength extension of this relation. This extension consists of two elements: (i) a finite-blocklength channel-coding converse bound by Polyanskiy and Verdú (2014), which involves the ratio between two Neyman-Pearson functions (beta-beta converse bound), and (ii) a novel beta-beta channel-coding achievability bound, expressed again as the ratio between two Neyman-Pearson functions.

To demonstrate the usefulness of this finite-blocklength extension of the golden formula, the beta-beta achievability and converse bounds are used to obtain a finite-blocklength extension of Verdú’s (2002) wideband-slope approximation. The proof parallels the elegant derivation in Verdú (2002), with the beta-beta bounds used in place of the golden formula.

The beta-beta (achievability) bound is also shown to be useful in cases where the capacity-achieving output distribution is not a product distribution due to, e.g., a cost constraint or structural constraints on the codebook, such as orthogonality or constant composition. As an example, the bound is used to characterize the channel dispersion of the additive exponential-noise channel and to obtain a finite-blocklength achievability bound (the tightest to date) for multiple-input multiple-output Rayleigh-fading channels with perfect channel state information at the receiver.

## 1 Introduction

### 1.1 Background

In his landmark 1948 paper [1], Shannon established the noisy channel coding theorem, which expresses the fundamental limit of reliable data transmission over a noisy channel in terms of the *mutual information* between the input and the output of the channel. More specifically, for stationary memoryless channels, the maximum rate at which data can be transmitted reliably is the channel capacity

$$C = \max_{P_X} I(X;Y).$$

Here, reliable transmission means that the probability of error can be made arbitrarily small by mapping the information bits into sufficiently long codewords. In the nonasymptotic regime of fixed blocklength (fixed codeword size), there exists a tension between rate and error probability, which is partly captured by the many nonasymptotic (finite-blocklength) bounds available in the literature [2]. In many of these bounds, the role of the mutual information is taken by the so-called *(mutual) information density*^{1}

$$\imath(x;y) = \log \frac{dP_{Y|X=x}}{dP_Y}(y),$$
or information spectrum [6]. While the various properties enjoyed by the mutual information make the evaluation of capacity relatively simple, computing the finite-blocklength bounds that involve the information density (e.g., the information-spectrum bounds [2]) can be very challenging.
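As a concrete illustration (a toy example, not from the paper), the sketch below evaluates the information density $\imath(x;y) = \log \frac{dP_{Y|X=x}}{dP_Y}(y)$ for a binary symmetric channel with uniform input, and checks that averaging it over the joint distribution recovers the mutual information:

```python
import math

# Toy example: information density of a BSC(delta) with uniform input.
delta = 0.11
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {(x, y): (1 - delta) if x == y else delta
               for x in (0, 1) for y in (0, 1)}
p_y = {y: sum(p_x[x] * p_y_given_x[(x, y)] for x in (0, 1)) for y in (0, 1)}

def info_density(x, y):
    """i(x;y) = log dP_{Y|X=x}/dP_Y evaluated at y (in nats)."""
    return math.log(p_y_given_x[(x, y)] / p_y[y])

# Averaging the information density over P_X P_{Y|X} recovers I(X;Y),
# which for the uniform-input BSC equals log 2 minus the binary entropy of delta.
mi = sum(p_x[x] * p_y_given_x[(x, y)] * info_density(x, y)
         for x in (0, 1) for y in (0, 1))
print(mi / math.log(2))  # mutual information in bits
```

The channel parameter `delta = 0.11` is an arbitrary choice for illustration.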

One well-known property of the mutual information is that it can be expressed as a difference of relative entropies involving an arbitrary output distribution $Q_Y$ [8]:

$$I(X;Y) = D(P_{Y|X}\,\|\,Q_Y \,|\, P_X) - D(P_Y\,\|\,Q_Y).$$

Here, $D(\cdot\|\cdot)$ stands for the relative entropy. This identity—also known as the *golden formula*—has found many applications in information theory. One canonical application is to establish upper bounds on channel capacity [9]. Indeed, substituting the golden formula into the definition of capacity, we get the alternative expression

$$C = \max_{P_X} \big[ D(P_{Y|X}\,\|\,Q_Y \,|\, P_X) - D(P_Y\,\|\,Q_Y) \big],$$

valid for every $Q_Y$, from which an upper bound on $C$ can be obtained by dropping the term $-D(P_Y\,\|\,Q_Y)$. The golden formula is also used in the derivation of the capacity per unit cost [10] and the wideband slope [11], in the Blahut-Arimoto algorithm [12], and in Gallager’s formula for the minimax redundancy in universal source coding [14]. Furthermore, it is useful for characterizing the properties of good channel codes [15], and it is often used in statistics to prove lower bounds on the minimax risk via Fano’s inequality (see [17] and [18]).

The main purpose of this paper is to provide a finite-blocklength analog of the golden formula that is helpful in deriving nonasymptotic results in the same way in which the golden formula is helpful in the asymptotic case.^{2}

However, the resulting bounds are not very useful, because it is difficult to decouple the two random variables involved. Instead of tweaking the information-spectrum bounds, we derive a finite-blocklength analog of the golden formula from first principles.

To summarize our contribution, we need to first introduce some notation. We consider an abstract channel that consists of an input set $\mathcal{X}$, an output set $\mathcal{Y}$, and a random transformation $P_{Y|X}$. An $(M, \epsilon)$ code for the channel comprises a message set $\{1, \dots, M\}$, an encoder $f: \{1, \dots, M\} \to \mathcal{X}$, and a decoder $g: \mathcal{Y} \to \{1, \dots, M\} \cup \{e\}$ (where $e$ denotes the error event) that satisfies the *average* error probability constraint

$$\frac{1}{M} \sum_{j=1}^{M} \Pr\big[ g(Y) \neq j \,\big|\, X = f(j) \big] \leq \epsilon.$$

In practical applications, we often take $\mathcal{X}$ and $\mathcal{Y}$ to be $n$-fold Cartesian products of two alphabets, and the channel to be a sequence of conditional probabilities $P_{Y^n|X^n}$. We shall refer to an $(M, \epsilon)$ code for the channel $P_{Y^n|X^n}$ as an $(n, M, \epsilon)$ code.

Binary hypothesis testing, which we introduce next, will play an important role. Given two probability measures $P$ and $Q$ on a common measurable space $\mathcal{W}$, we define a randomized test between $P$ and $Q$ as a random transformation $P_{Z|W}: \mathcal{W} \to \{0, 1\}$, where $0$ indicates that the test chooses $Q$. The optimal performance achievable among all such randomized tests is given by the Neyman-Pearson function $\beta_\alpha(P, Q)$, which is defined as

$$\beta_\alpha(P, Q) = \min \int P_{Z|W}(1 \,|\, w)\, Q(dw),$$

where the minimum is over all tests satisfying

$$\int P_{Z|W}(1 \,|\, w)\, P(dw) \geq \alpha.$$

The Neyman-Pearson lemma [19] provides the test that attains this minimum. This test, which we shall refer to as the Neyman-Pearson test, involves thresholding the Radon-Nikodym derivative $dP/dQ$ of $P$ with respect to $Q$.
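For distributions on a finite alphabet, the Neyman-Pearson function can be computed exactly: sort outcomes by likelihood ratio, include them greedily until the power constraint is met, and randomize on the boundary outcome. The sketch below (toy distributions, not from the paper) implements this:

```python
# Minimal sketch of the Neyman-Pearson test for two discrete distributions
# with strictly positive probabilities. Outcomes are sorted by the likelihood
# ratio P[i]/Q[i]; the test randomizes on the boundary outcome.
def beta_alpha(alpha, P, Q):
    """Return beta_alpha(P, Q): the minimal Q-probability of deciding "P"
    among all tests whose P-probability of deciding "P" is at least alpha."""
    order = sorted(range(len(P)), key=lambda i: -(P[i] / Q[i]))
    power, beta = 0.0, 0.0
    for i in order:
        if power + P[i] < alpha:
            power += P[i]
            beta += Q[i]
        else:
            gamma = (alpha - power) / P[i]   # randomization at the threshold
            return beta + gamma * Q[i]
    return beta

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
b = beta_alpha(0.9, P, Q)   # = 0.75 for these toy distributions
```

The greedy order is exactly the thresholding of $dP/dQ$ prescribed by the Neyman-Pearson lemma; the randomization `gamma` handles the atom on which the threshold falls.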

### 1.2 Finite-Blocklength Analog of the Golden Formula

A first step towards establishing a finite-blocklength analog of the golden formula was recently provided by Polyanskiy and Verdú, who proved that every $(M, \epsilon)$ code satisfies the following converse bound [16]:

Here, $P_X$ and $P_Y$ denote the empirical input and output distributions induced by the code for the case of uniformly distributed messages. The proof is a refinement of the meta-converse theorem [5], in which the decoder is used as a suboptimal hypothesis test. We shall refer to this bound as the beta-beta converse bound. Note that the minimax version of the meta-converse bound [5] can be recovered as a special case.

In this paper, we provide the following achievability counterpart: for every $0 < \epsilon < 1$ and every input distribution $P_X$, there exists an $(M, \epsilon)$ code such that

where $P_Y$ denotes the distribution of the channel output $Y$ induced by $P_X$ through $P_{Y|X}$. The proof of the achievability bound above relies on Shannon’s random coding technique and on a suboptimal decoder that is based on the Neyman-Pearson test. Hypothesis testing is used twice in the proof: to relate the decoding error probability to a Neyman-Pearson function, and to perform a change of measure from $P_Y$ to an auxiliary distribution $Q_Y$. Fig. ? gives a pictorial summary of the connections between the beta-beta bounds and various capacity and nonasymptotic bounds. The analogy between the beta-beta bounds and the golden formula follows because, for product distributions $P^{(n)}$ and $Q^{(n)}$,

$$-\frac{1}{n} \log \beta_\alpha\big(P^{(n)}, Q^{(n)}\big) \to D(P\,\|\,Q)$$

as $n \to \infty$ by Stein’s lemma [20]. For example, one can prove that the capacity $C$ is achievable as follows: set the input and auxiliary output distributions to be product distributions, take the log of both sides of the achievability bound, use Stein’s lemma, and optimize over $P_X$.
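The convergence in Stein’s lemma can be observed numerically. The sketch below (a toy choice of Bernoulli distributions, not from the paper) computes $\beta_\alpha$ between $n$-fold products exactly, exploiting the fact that the likelihood ratio depends only on the number of ones, and compares $-\frac{1}{n}\log\beta_\alpha$ with $D(P\|Q)$:

```python
import math
from math import comb

# Stein's lemma illustration for products of Bernoulli(p) vs Bernoulli(q):
# -(1/n) log beta_alpha(P^n, Q^n) approaches D(P||Q) as n grows.
p, q, alpha = 0.7, 0.4, 0.5

def beta_product(n):
    # Since p > q, the likelihood ratio is increasing in the number of ones k,
    # so the Neyman-Pearson test thresholds on k (largest k first).
    power = beta = 0.0
    for k in range(n, -1, -1):
        Pk = comb(n, k) * p**k * (1 - p)**(n - k)   # P^n(k ones)
        Qk = comb(n, k) * q**k * (1 - q)**(n - k)   # Q^n(k ones)
        if power + Pk < alpha:
            power += Pk
            beta += Qk
        else:
            beta += (alpha - power) / Pk * Qk        # randomize at threshold
            break
    return beta

D = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
for n in (10, 100, 1000):
    print(n, -math.log(beta_product(n)) / n, D)
```

With $\alpha = 0.5$ the second-order (square-root) correction vanishes, so the convergence to $D(P\|Q) \approx 0.184$ nats is visible already at moderate $n$.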

### 1.3 Applications

To demonstrate that these bounds are the natural nonasymptotic equivalent of the golden formula, we use them to provide a finite-blocklength extension of the *wideband slope* approximation by Verdú [11]. More specifically, we derive a second-order characterization of the minimum energy per bit required to transmit $k$ bits with error probability $\epsilon$ and rate $R$ on additive white Gaussian noise (AWGN) channels and also on Rayleigh-fading channels with perfect channel state information available at the receiver (CSIR). Our result implies that the minimum energy per bit (expressed in dB) can be approximated by an affine function of the rate $R$. Furthermore, the slope of this function coincides with the wideband slope by Verdú. Our proof parallels the elegant derivation by Verdú [11], except that the role of the golden formula is taken by the beta-beta achievability and converse bounds. Numerical evaluations demonstrate the accuracy of the resulting approximation.

The beta-beta achievability bound is also useful in situations where $P_{Y^n}$ is not a product distribution (although the underlying channel law is stationary and memoryless), for example due to a cost constraint, or a structural constraint on the channel inputs, such as orthogonality or constant composition. In such cases, traditional achievability bounds such as Feinstein’s bound [2] or the dependence-testing (DT) bound [5], which are expressed in terms of the information density, become difficult to evaluate. In contrast, the beta-beta bound requires the evaluation of a Neyman-Pearson function between $P_{X^n Y^n}$ and $P_{X^n} Q_{Y^n}$, which factorizes when $Q_{Y^n}$ is chosen to be a product distribution. This allows for an analytical computation. Furthermore, the Neyman-Pearson function between $P_{Y^n}$ and $Q_{Y^n}$, which captures the cost of the change of measure from $P_{Y^n}$ to $Q_{Y^n}$, can be evaluated or bounded even when a closed-form expression for $P_{Y^n}$ is not available. To illustrate these points:

- We use the beta-beta achievability bound to characterize the channel dispersion [5] of the additive exponential-noise channel introduced in [21].

- We evaluate the beta-beta achievability bound for the case of multiple-input multiple-output (MIMO) Rayleigh-fading channels with perfect CSIR. In this case, it yields the tightest known achievability bound.

We conclude the paper by providing a generalization of the beta-beta bounds to the case of list decoding.

### 1.4 Notation

For an input distribution $P_X$ and a channel $P_{Y|X}$, we let $P_Y$ denote the distribution of $Y$ induced by $P_X$ through $P_{Y|X}$. The distribution of a circularly symmetric complex Gaussian random vector with covariance matrix $\mathsf{A}$ is denoted by $\mathcal{CN}(\mathbf{0}, \mathsf{A})$. We denote by $\chi^2_k(\mu)$ the noncentral chi-squared distribution with $k$ degrees of freedom and noncentrality parameter $\mu$; furthermore, $\mathrm{Exp}(\mu)$ stands for the exponential distribution with mean $\mu$. The Frobenius norm of a matrix $\mathsf{A}$ is denoted by $\|\mathsf{A}\|_F$. For a vector $\mathbf{x}$, we let $\|\mathbf{x}\|_1$, $\|\mathbf{x}\|_2$, and $\|\mathbf{x}\|_\infty$ denote its $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, respectively. The real part and the complex conjugate of a complex number $x$ are denoted by $\Re\{x\}$ and $x^*$, respectively. For a set $\mathcal{S}$, we use $|\mathcal{S}|$ and $\mathcal{S}^c$ to denote the set cardinality and the set complement, respectively. Finally, $\lceil \cdot \rceil$ denotes the ceiling function, and the superscript $^H$ stands for Hermitian transposition.

## 2 The Achievability Bound

In this section, we formally state our achievability bound.

## 3 Relation to Existing Achievability Bounds

We next discuss the relation between Theorem ? and the achievability bounds available in the literature.

#### The κβ bound

The κβ bound is based on Feinstein’s maximal coding approach and on a suboptimal threshold decoder. By further lower-bounding the κ term in the κβ bound using [22], we can relax it to the following bound:

which holds under the *maximum* error probability constraint

Here, $\mathsf{F}$ denotes the permissible set of codewords, and $Q_Y$ is an arbitrary distribution on $\mathcal{Y}$. Because of the similarity between the two bounds, one can interpret the beta-beta achievability bound as the average-error-probability counterpart of the κβ bound. For the case in which $\beta_\alpha(P_{Y|X=x}, Q_Y)$ does not depend on $x \in \mathsf{F}$, a further relaxation using [5] recovers the κβ bound under the weaker average-error-probability formalism. However, for the case in which $\beta_\alpha(P_{Y|X=x}, Q_Y)$ does depend on $x$, the beta-beta achievability bound can be both easier to analyze and numerically tighter than the κβ bound (see Section 5.2 for an example).

#### The dependence-testing (DT) bound

Setting $Q_Y = P_Y$ in the beta-beta achievability bound and rearranging terms, we obtain

Further using the Neyman-Pearson lemma, we conclude that this is equivalent to a weakened version of the DT bound in which $(M-1)/2$ is replaced by $M-1$. Since this weakened version of the DT bound implies both Shannon’s bound [23] and the bound in [24], the beta-beta achievability bound implies these two bounds as well.
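For concreteness, the DT bound of [5], $\epsilon \le \mathbb{E}\big[\exp\big(-[\imath(X^n;Y^n) - \log\frac{M-1}{2}]^+\big)\big]$, can be evaluated exactly for a BSC with i.i.d. uniform inputs, since the information density then depends only on the number of flipped bits. The following sketch (arbitrary toy parameters, not from the paper) does this:

```python
import math
from math import comb

# DT bound for a BSC(delta) with i.i.d. uniform inputs: the information
# density of a received sequence with t flipped bits is
#   i = (n - t) log(2(1-delta)) + t log(2 delta),  t ~ Binomial(n, delta).
def dt_bound(n, M, delta):
    thr = math.log((M - 1) / 2)
    bound = 0.0
    for t in range(n + 1):
        p_t = comb(n, t) * delta**t * (1 - delta)**(n - t)
        i_density = (n - t) * math.log(2 * (1 - delta)) + t * math.log(2 * delta)
        bound += p_t * math.exp(-max(0.0, i_density - thr))
    return bound

# Example: blocklength 200, 2^60 messages (rate 0.3 bits/ch. use), BSC(0.05)
eps = dt_bound(200, 2**60, 0.05)
```

As expected, the bound grows with the number of messages $M$ at fixed blocklength, reflecting the rate versus error-probability tension discussed in the introduction.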

A cost-constrained version of the DT bound, in which all codewords belong to a given set $\mathsf{F}$, can be found in [5]. A slightly weakened version of this bound (with $(M-1)/2$ replaced by $M-1$) is

This bound can be derived from the beta-beta achievability bound by a suitable choice of $Q_Y$ and by using the following bounds:

Here, the first step follows from the data-processing inequality for $\beta_\alpha$ (see, e.g., [25]) and straightforward computations, and the second step follows from [26].

#### Refinements of the DT/Feinstein bound through change of measure

The idea of bounding a probability computed under $P_Y$ via a simpler-to-analyze distribution $Q_Y$ has been applied previously in the literature to evaluate the DT bound and Feinstein’s bound. For example, the following achievability bound is suggested in [27]:

which is equivalent to

This bound can be obtained from the beta-beta achievability bound by using that

The following variation of the Feinstein bound is used in [28] to deal with inputs subject to a cost constraint:

This bound can be obtained from the beta-beta achievability bound by using [5] to lower-bound one Neyman-Pearson function and to upper-bound the other.

## 4 Wideband Slope at Finite Blocklength

### 4.1 Background

Many communication systems (such as deep-space communication and sensor networks) operate in the low-power regime, where both the spectral efficiency and the energy per bit are low. As shown in [11], the key asymptotic performance metrics for additive noise channels in the low-power regime are the normalized minimum energy per bit $E_b/N_0|_{\min}$ (where $N_0$ denotes the noise power per channel use) and the slope $S_0$ of the spectral-efficiency curve at $E_b/N_0|_{\min}$ (the *wideband slope*).^{3} These two quantities determine the first-order behavior of the minimum energy per bit as a function of the spectral efficiency.^{4}

For a wide range of power-constrained channels including the AWGN channel and the fading channel with/without CSIR, it is well known that the minimum energy per bit satisfies [11]

$$\frac{E_b}{N_0}\bigg|_{\min} = \log_e 2 = -1.59 \text{ dB}.$$

Furthermore, it is shown in [11] that $S_0 = 2$ for AWGN channels, and $S_0 = 2\,(\mathbb{E}[|H|^2])^2 / \mathbb{E}[|H|^4]$ for stationary ergodic fading channels with perfect CSIR, where $H$ is distributed as one of the fading coefficients.

The asymptotic expansion above is derived in [11] in the limit of infinite blocklength, where $k$ denotes the number of information bits. Recently, it was shown in [31] that the minimum energy per bit necessary to transmit a finite number of information bits over an AWGN channel with error probability $\epsilon$ and with no constraint on the blocklength satisfies

Furthermore, it was shown in [32] that the expansion is valid also for block-memoryless Rayleigh-fading channels (perfect CSIR), under a suitable condition on the blocklength.

In this section, we study the tradeoff between energy per bit and spectral efficiency in the regime where not only $k$ and $\epsilon$, but also the blocklength $n$ is finite. In particular, we are interested in the minimum energy per bit necessary to transmit $k$ information bits with rate $R$ bits/ch. use and with error probability not exceeding $\epsilon$. The asymptotic quantities discussed above can be recovered from this nonasymptotic quantity in the appropriate limits:

### 4.2 AWGN Channel

We consider the complex-valued AWGN channel

$$Y_i = X_i + Z_i, \qquad i = 1, \dots, n,$$

where the $Z_i$ are i.i.d. $\mathcal{CN}(0, N_0)$. We assume that every codeword $x^n$ satisfies the power constraint

$$\|x^n\|_2^2 \leq nP.$$

For notational convenience, we shall set $N_0 = 1$. Hence, $P$ can be interpreted as the SNR at the receiver.

We next evaluate the minimum energy per bit in the asymptotic regime of large $k$ and small $R$. Motivated by the wideband-slope approximation, we shall approximate the minimum energy per bit (expressed in dB) by an affine function of the rate $R$. The beta-beta bounds turn out to be key tools to derive the asymptotic approximation given in Theorem ? below.

Thus, in the regime of large $k$, the gap in dB between the minimum energy per bit and the asymptotic limit is—up to first order—proportional to the rate $R$ and is independent of $\epsilon$.

We now provide some intuition for the expansion (a rigorous proof is given in Appendix Section 8). The asymptotic expression relies on the following Taylor-series expansion of the channel capacity $C(P)$ as a function of the SNR $P$ when $P \to 0$:

$$C(P) = \dot{C}(0)\, P + \frac{1}{2} \ddot{C}(0)\, P^2 + o(P^2).$$

Here, $\dot{C}(0)$ and $\ddot{C}(0)$ denote the first and second derivative, respectively, of the function $C(P)$ (in nats per channel use) at $P = 0$. In particular, the first term determines the minimum energy per bit and the second term yields the wideband slope [11]:

$$\frac{E_b}{N_0}\bigg|_{\min} = \frac{\log_e 2}{\dot{C}(0)}, \qquad S_0 = \frac{2\,[\dot{C}(0)]^2}{-\ddot{C}(0)}.$$
Both $\dot{C}(0)$ and $\ddot{C}(0)$ can be computed directly, without knowledge of $C(P)$. Indeed, set $Q_Y = P_{Y|X=0}$ in the golden formula. One can show that [11]

and that for both AWGN channels and for fading channels with perfect CSIR,

where the minimization is over all input distributions that achieve $\dot{C}(0)$ and satisfy the power constraint. In other words, $\dot{C}(0)$ is determined solely by $D(P_{Y|X}\,\|\,Q_Y|P_X)$, and $\ddot{C}(0)$ is determined solely by $D(P_Y\,\|\,Q_Y)$.

Moving to the nonasymptotic case, let $R^*(n, \epsilon, P)$ be the maximum coding rate for a given blocklength $n$, error probability $\epsilon$, and SNR $P$. Then, the approximation above turns out to be equivalent to the following asymptotic expression (see Appendix Section 8.1 for the proof of this equivalence):

as $n \to \infty$ and $P \to 0$. In view of this equivalence, it is natural to use the beta-beta achievability and converse bounds, since they are nonasymptotic versions of the golden formula. Indeed, we obtain from them that

Next, we choose $Q_{Y^n}$ to be a product distribution. The analysis in [31] implies that

which yields the first two terms in the expansion.

Furthermore, one can show through a large-deviation analysis that

Here, the minimization is taken with respect to all input distributions for which the first term is close to its optimal value. Substituting these two expansions, we recover the dominant terms in the approximation.

One may attempt to establish the expansion by using the normal approximation [5]

$$R^*(n, \epsilon, P) \approx C(P) - \sqrt{\frac{V(P)}{n}}\, Q^{-1}(\epsilon)$$

and then by Taylor-expanding the capacity $C(P)$ and the channel dispersion $V(P)$ for $P \to 0$. However, there are two major drawbacks with this approach. First, establishing the normal approximation is challenging for fading channels (see [33] and the remarks after Theorem ?), so this approach would work only in the AWGN case. Second, one needs to verify that the approximation error term is uniform in $P$, which is nontrivial.
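For reference, the normal approximation is straightforward to evaluate in the AWGN case. The sketch below uses the complex-AWGN dispersion $V(P) = \frac{P(P+2)}{(1+P)^2}\log_2^2 e$ given in [5] (blocklength, error probability, and SNR below are arbitrary choices):

```python
import math
from statistics import NormalDist

# Normal approximation R* ~ C(P) - sqrt(V(P)/n) * Qinv(eps) for the
# complex AWGN channel, with C and V in bits per channel use.
def normal_approx(n, eps, P):
    C = math.log2(1 + P)
    V = (P * (P + 2) / (1 + P)**2) * (math.log2(math.e))**2  # dispersion [5]
    Qinv = NormalDist().inv_cdf(1 - eps)   # inverse Gaussian Q-function
    return C - math.sqrt(V / n) * Qinv

R = normal_approx(n=1000, eps=1e-3, P=1.0)   # SNR 0 dB
print(R)   # bits per channel use, below the capacity log2(2) = 1
```

As the blocklength grows, the back-off from capacity shrinks like $1/\sqrt{n}$, which is the behavior the uniformity issue mentioned above concerns as $P \to 0$.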

### 4.3 Rayleigh-Fading Channels With Perfect CSIR

We next consider the Rayleigh-fading channel

$$Y_i = H_i X_i + Z_i, \qquad i = 1, \dots, n,$$

where both the fading coefficients $H_i$ and the noise samples $Z_i$ are independent and identically distributed (i.i.d.) $\mathcal{CN}(0, 1)$ random variables. We assume that the channel coefficients are known to the receiver but not to the transmitter. Furthermore, we assume that every codeword $x^n$ satisfies the power constraint $\|x^n\|_2^2 \leq nP$. The wideband slope of this channel is [11]

$$S_0 = \frac{2\,(\mathbb{E}[|H|^2])^2}{\mathbb{E}[|H|^4]} = 1,$$

where $H \sim \mathcal{CN}(0, 1)$.
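The value $S_0 = 1$ follows because $|H|^2$ is exponentially distributed with unit mean when $H \sim \mathcal{CN}(0,1)$, so $\mathbb{E}[|H|^4] = 2$. This can be checked by Monte Carlo simulation (sample size and seed are arbitrary choices):

```python
import random, math

# Monte Carlo check that S0 = 2 (E|H|^2)^2 / E|H|^4 = 1 for H ~ CN(0,1):
# |H|^2 is Exp(1), hence E|H|^2 = 1 and E|H|^4 = 2.
random.seed(0)
n = 200000
samples = []
for _ in range(n):
    hr = random.gauss(0, math.sqrt(0.5))   # real part, variance 1/2
    hi = random.gauss(0, math.sqrt(0.5))   # imaginary part, variance 1/2
    samples.append(hr**2 + hi**2)          # |H|^2

m2 = sum(samples) / n                      # estimate of E|H|^2, ~ 1
m4 = sum(s**2 for s in samples) / n        # estimate of E|H|^4, ~ 2
S0 = 2 * m2**2 / m4
print(S0)   # close to 1
```

The halved wideband slope relative to the AWGN value $S_0 = 2$ quantifies the low-power penalty caused by the fading fluctuations.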

Theorem ? below characterizes the minimum energy per bit for the Rayleigh-fading channel in the asymptotic limit of large $k$ and small $R$.

A few remarks are in order:

As in the AWGN case, the minimum energy per bit over the Rayleigh-fading channel with perfect CSIR satisfies

Again we observe that, in the low spectral efficiency regime, the gap in dB between the minimum energy per bit and the asymptotic limit is—up to first order—proportional to the rate $R$ and is independent of $\epsilon$. Furthermore, the gap in the fading case coincides with that in the AWGN case up to higher-order terms.

Unlike the asymptotic wideband approximation, which holds for all fading distributions (see [11]), our result in Theorem ? relies on the Gaussianity of the fading coefficients, and does not necessarily hold for other fading distributions. In fact, as observed in [32], there are fading distributions for which the minimum energy per bit does not converge to $-1.59$ dB when $R \to 0$ and $\epsilon$ is fixed.

For the case of nonvanishing rate $R$ (or, equivalently, nonvanishing SNR $P$), a normal approximation for the maximum rate achievable over the channel when CSIR is available is reported in [33]. This approximation relies on an additional constraint on the norm of every codeword. In contrast, Theorem ? does not require this additional constraint.

One of the challenges that one has to address when establishing a nonasymptotic converse bound is that the variance of the information density depends on the transmitted codeword (see [22]). In order to obtain a tight converse bound at fixed rate, one needs to expurgate the codewords whose norm is far from that of the codewords of a Gaussian code (see [22] and [33]). However, in the limit of interest in this paper, this expurgation procedure is not needed, since the dominant term in the asymptotic expansion of the variance of the information density does not depend on the codeword. Furthermore, the wideband slope is also insensitive to the input distribution. Indeed, to achieve the wideband slope of a fading channel with perfect CSIR, QPSK inputs are as good as Gaussian inputs [11].

### 4.4 Numerical Results

In Fig. ?, we present a comparison between the approximation (with the higher-order terms omitted), the beta-beta achievability bound, and the beta-beta converse bound. In both the achievability and the converse bound, $Q_{Y^n}$ is chosen to be the product distribution obtained from the capacity-achieving output distribution of the channel. For the parameters considered in Fig. ?, the approximation is accurate. Fig. ? provides a similar comparison for the Rayleigh-fading channel. In this case, the converse bound is difficult to compute (due to the need to perform an optimization over all input distributions) and is not plotted.

## 5 Other Applications of Theorem ?

### 5.1 The Additive Exponential-Noise Channel

Consider the additive exponential-noise channel

$$Y_i = X_i + N_i, \qquad i = 1, \dots, n,$$

where the $N_i$ are i.i.d. $\mathrm{Exp}(\lambda)$-distributed. As in [21], we require each codeword to be nonnegative and to satisfy

$$\frac{1}{n} \sum_{i=1}^{n} x_i \leq \sigma.$$
The additive exponential-noise channel can be used to model a communication system where information is conveyed through the arrival time of packets, and where each packet goes through a single-server queue with exponential service time [34]. It also models a fast-varying phase-noise channel combined with an energy detector at the receiver [35].

The capacity of this channel under the input constraints specified above is given by [21]

$$C = \log\Big(1 + \frac{\sigma}{\lambda}\Big)$$

and is achieved by the input distribution according to which $X$ takes the value zero with probability $\lambda/(\lambda+\sigma)$ and follows an $\mathrm{Exp}(\lambda+\sigma)$ distribution conditioned on it being positive. Furthermore, the capacity-achieving output distribution is $\mathrm{Exp}(\lambda+\sigma)$. A discrete counterpart of the exponential-noise channel is studied in [36], where a lower bound on the maximum coding rate is derived.
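The structure of the optimal input from [21] can be checked by simulation: sampling the mixed input distribution and adding exponential noise should produce an output whose statistics match an exponential distribution with mean $\lambda + \sigma$. The sketch below does this for the arbitrary choice $\lambda = 1$, $\sigma = 2$:

```python
import random, math

# Additive exponential-noise channel Y = X + N, N ~ Exp(lam) (mean lam),
# input mean constraint sigma. Optimal input [21]: X = 0 w.p. lam/(lam+sigma),
# else Exp(lam+sigma); the induced output is Exp(lam+sigma).
random.seed(1)
lam, sigma = 1.0, 2.0
C = math.log(1 + sigma / lam)              # capacity, log 3 nats here

n = 200000
ys = []
for _ in range(n):
    if random.random() < lam / (lam + sigma):
        x = 0.0
    else:
        x = random.expovariate(1 / (lam + sigma))   # expovariate takes the rate
    ys.append(x + random.expovariate(1 / lam))      # add Exp(lam) noise

print(C, sum(ys) / n)   # output mean should be close to lam + sigma
```

The empirical output mean is close to $\lambda + \sigma = 3$, and the tail probability $\Pr[Y > \lambda+\sigma]$ is close to $e^{-1}$, consistent with an $\mathrm{Exp}(\lambda+\sigma)$ output.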

Theorem ? below characterizes the dispersion of this channel.

### 5.2 MIMO Block-Fading Channel With Perfect CSIR

In this section, we use the beta-beta achievability bound to characterize the maximum coding rate achievable over a Rayleigh MIMO block-fading channel. The channel is assumed to stay constant over each coherence interval and to change independently across coherence intervals. The input-output relation within one coherence interval is given by

Here, $\mathsf{X}$ and $\mathsf{Y}$ are the transmitted and received matrices, respectively; the entries of the fading matrix $\mathsf{H}$ and of the noise matrix $\mathsf{Z}$ are i.i.d. $\mathcal{CN}(0, 1)$. We assume that $\mathsf{H}$ and $\mathsf{Z}$ are independent, that they take on independent realizations over successive coherence intervals, and that they do not depend on the transmitted matrices. The channel matrices are assumed to be known to the receiver but not to the transmitter. We shall also assume that each codeword spans an integer number of coherence intervals. Finally, each codeword is constrained to satisfy

#### Capacity and dispersion

In the asymptotic limit of large blocklength for a fixed SNR $P$, the capacity of this channel is given by [38]

$$C = \mathbb{E}\left[\log\det\left(\mathsf{I} + \frac{P}{m_t}\, \mathsf{H}\mathsf{H}^H\right)\right],$$

where $m_t$ denotes the number of transmit antennas. If either $m_t = 1$ or the number of receive antennas $m_r$ exceeds one, the capacity is achieved by a unique input distribution, under which the matrix $\mathsf{X}$ has i.i.d. $\mathcal{CN}(0, P/m_t)$ entries [38]. In this case, we denote the capacity-achieving input distribution by $P^*_{\mathsf{X}}$. If $m_t > 1$ and $m_r = 1$ (i.e., a multiple-input single-output channel), the capacity-achieving input distribution is not unique [39]. The capacity-achieving output distribution is always unique and is denoted by $P^*_{\mathsf{Y}}$.^{5}
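The ergodic-capacity expression $\mathbb{E}[\log\det(\mathsf{I} + \frac{P}{m_t}\mathsf{H}\mathsf{H}^H)]$ is easy to estimate by Monte Carlo simulation. The sketch below (a $2\times 2$ channel at 0 dB SNR, both arbitrary choices) evaluates the $2\times 2$ determinant by hand to stay dependency-free:

```python
import random, math

# Monte Carlo estimate of E[log2 det(I + (P/mt) H H^H)] for a 2x2 Rayleigh
# MIMO channel with i.i.d. CN(0,1) entries (perfect CSIR).
random.seed(2)

def cn01():
    """One CN(0,1) sample: real and imaginary parts N(0, 1/2)."""
    return complex(random.gauss(0, math.sqrt(0.5)), random.gauss(0, math.sqrt(0.5)))

def sample_capacity(P, mt=2):
    H = [[cn01(), cn01()], [cn01(), cn01()]]
    s = P / mt
    # G = I + s * H H^H is 2x2 Hermitian positive definite; det by hand.
    g00 = 1 + s * (abs(H[0][0])**2 + abs(H[0][1])**2)
    g11 = 1 + s * (abs(H[1][0])**2 + abs(H[1][1])**2)
    g01 = s * (H[0][0] * H[1][0].conjugate() + H[0][1] * H[1][1].conjugate())
    det = g00 * g11 - abs(g01)**2            # real and >= 1
    return math.log2(det)

n = 50000
C = sum(sample_capacity(P=1.0) for _ in range(n)) / n
print(C)   # ergodic capacity in bits per channel use at SNR 0 dB
```

For larger antenna arrays one would replace the hand-coded determinant with a linear-algebra routine; the $2\times 2$ case suffices to illustrate the expectation being characterized.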

The channel dispersion for the single-antenna case with perfect CSIR was derived in [22]. This result was extended to multiple-antenna block-fading channels in [39] and [33]. In particular, it was shown in [33] that^{6}

Here,