
# Exponential Error Bounds on Parameter Modulation–Estimation for Discrete Memoryless Channels

Neri Merhav
###### Abstract

We consider the problem of modulation and estimation of a random parameter to be conveyed across a discrete memoryless channel. Upper and lower bounds are derived for the best achievable exponential decay rate of a general moment of the estimation error, $E|\hat{U}-U|^\rho$, $\rho > 0$, when both the modulator and the estimator are subjected to optimization. These exponential error bounds turn out to be intimately related to error exponents of channel coding and to channel capacity. While in general there is some gap between the upper and the lower bound, they asymptotically coincide both for very small and for very large values of the moment power $\rho$. This means that our achievability scheme, which is based on simple quantization of $U$ followed by channel coding, is nearly optimum in both limits. Some additional properties of the bounds are discussed and demonstrated, and finally, an extension to the case of a multidimensional parameter vector is outlined, with the principal conclusion that our upper and lower bounds asymptotically coincide also for high dimensionality.

Index Terms: Parameter estimation, modulation, discrete memoryless channels, error exponents, random coding, data processing theorem.

Department of Electrical Engineering

Technion - Israel Institute of Technology

Technion City, Haifa 32000, ISRAEL

E–mail: merhav@ee.technion.ac.il

## 1 Introduction

Consider the problem of conveying the value of a parameter $u$ across a given discrete memoryless channel

$$p(\boldsymbol{y}|\boldsymbol{x}) = \prod_{t=1}^{n} p(y_t|x_t), \qquad (1)$$

where $\boldsymbol{x} = (x_1,\ldots,x_n)$ and $\boldsymbol{y} = (y_1,\ldots,y_n)$ are the channel input and output vectors, respectively. Our main interest, in this work, is in the following questions: How well can one estimate $u$ based on $\boldsymbol{y}$ when one is allowed to optimize, not only the estimator, but also the modulator, that is, the function that maps $u$ into a channel input vector $\boldsymbol{x}$? How fast does the estimation error decay as a function of $n$ when the best modulator and estimator are used?

In principle, this problem, which is the discrete–time analogue of the classical problem of “waveform communication” (in the terminology of [15, Chap. 8]), can be viewed both from the information–theoretic and the estimation–theoretic perspectives. Classical results in neither of these disciplines, however, seem to suggest satisfactory answers.

From the information–theoretic point of view, if the parameter is random, call it $U$, this is actually a problem of joint source–channel coding, where the source emits a single variable (or a fixed number of them when $U$ is a vector), whereas the channel is allowed to be used many times ($n$ is large). The separation theorem of classical information theory asserts that asymptotic optimality of separate source and channel coding is guaranteed in the limit of long blocks. However, it refers to a regime of long blocks both in source coding and channel coding, whereas here the source block length is 1, and so, there is no hope to compress the source with performance that comes close to the rate–distortion function.

In the realm of estimation theory, on the other hand, there is a rich literature on Bayesian and non–Bayesian bounds, mostly concerning the mean square error (MSE) in estimating parameters from signals corrupted by an additive white Gaussian noise (AWGN) channel, as well as other channels (see, e.g., [12] and the introductions of [1], [2], and [14] for overviews on these bounds). Most of these bounds lend themselves to calculation for a given modulator and therefore they may give insights concerning optimum estimation for this specific modulator. They may not, however, be easy to use for the derivation of universal lower bounds, namely, lower bounds that depend neither on the modulator nor on the estimator, which are relevant when both optimum modulators and optimum estimators are sought. Two exceptions to this rule (although usually, not presented as such) are families of bounds that stem from generalized data processing theorems (DPT’s) [5], [6], [11], [16], [18], henceforth referred to as “DPT bounds”, and bounds based on hypothesis testing and channel coding considerations [1], [3], [17], henceforth called “channel–coding bounds.”

In this paper, we use both the channel–coding techniques and DPT techniques in order to derive lower bounds on general moments of the estimation error, $E|\hat{U}-U|^\rho$, where $U$ is a random parameter, $\hat{U}$ is its estimate, and the power $\rho$ is an arbitrary positive real (not necessarily an integer). It turns out that when the modulator is subjected to optimization, $E|\hat{U}-U|^\rho$ can decay exponentially rapidly as a function of $n$, and so, our focus is on the best achievable exponential rate of decay as a function of $\rho$, which we shall denote by $E(\rho)$, that is,

$$\inf E|\hat{U}-U|^\rho \approx e^{-nE(\rho)}, \qquad (2)$$

where the infimum is over all modulators and estimators. (This is still an informal and non–rigorous description; more precise definitions will be given in the sequel.) Interestingly, both the upper and lower bounds on $E(\rho)$ are intimately related to well–known exponential error bounds associated with channel coding, such as Gallager's random coding exponent (for small values of $\rho$) and the expurgated exponent function (for large values of $\rho$). In other words, we establish an estimation–theoretic meaning to these error exponent functions. In particular, under certain conditions, our channel–coding upper bound on $E(\rho)$ (corresponding to a lower bound on $E|\hat{U}-U|^\rho$) can be presented as

$$\overline{\overline{E}}(\rho) = \begin{cases} E_0(\rho) & \rho < \rho_0 \\ E_{ex}(0) & \rho \ge \rho_0 \end{cases} \qquad (3)$$

where $E_0(\rho) = \max_q E_0(\rho,q)$, $E_0(\rho,q)$ being Gallager's function, $E_{ex}(0)$ is the expurgated exponent at zero rate, and $\rho_0$ is the value of $\rho$ for which $E_0(\rho) = E_{ex}(0)$ (so that $\overline{\overline{E}}(\rho)$ is continuous). In addition, we derive a DPT bound and discuss its advantages and disadvantages compared to the above bound.

We also suggest a lower bound, $\underline{\underline{E}}(\rho)$, on $E(\rho)$ (associated with upper bounds on $E|\hat{U}-U|^\rho$), which is achieved by a simple, separation–based modulation and estimation scheme. While there is a certain gap between $\underline{\underline{E}}(\rho)$ and $\overline{\overline{E}}(\rho)$ for every finite $\rho$, it turns out that this gap disappears (in the sense that the ratio between the two bounds tends to unity) both for large $\rho$ and for small $\rho$, and so, we have exact asymptotics of $E(\rho)$ in these two extremes: For large $\rho$, $E(\rho)$ tends to $E_{ex}(0)$, and for small $\rho$, $E(\rho) \approx \rho C$, where $C$ is the channel capacity. Our simple achievability scheme is then nearly optimum at both extremes, which means that a separation theorem essentially holds for very small and for very large values of $\rho$, in spite of the earlier discussion (see also [7, Section III.D]). The results are demonstrated for the example of a "very noisy channel" [4, Example 3, pp. 147–149], [13, pp. 155–158], which is convenient to analyze, as it admits closed–form expressions.

Finally, we suggest an extension of our results to the case of a multidimensional parameter vector $\boldsymbol{u} = (u_1,\ldots,u_d)$. It turns out that the effect of the dimension $d$ is in reducing the effective value of $\rho$ by a factor of $d$. In other words, $E(\rho)$ is replaced by $E(\rho/d)$, and the extension of the achievability result is straightforward. This means that for fixed $\rho$, the limit of large $d$ (where the effective value $\rho/d$ is very small) also admits exact asymptotics, where $E(\rho/d) \approx \rho C/d$.

The outline of the paper is as follows. In Section 2, we define the problem formally and we establish notation conventions. In Section 3, we derive our main upper and lower bounds based on channel coding considerations. In Section 4, we derive our DPT bound and discuss it. Section 5 is devoted to the example of the very noisy channel, and finally, in Section 6 the multidimensional case is considered.

## 2 Notation Conventions and Problem Formulation

Throughout this paper, scalar random variables (RV's) will be denoted by capital letters, their sample values will be denoted by the respective lower case letters, and their alphabets will be denoted by the respective calligraphic letters. A similar convention will apply to random vectors and their sample values, which will be denoted with the same symbols in a bold face font. For example, $u$ is a realization of a random variable $U$, whereas $\boldsymbol{u} \in {\cal U}^n$ ($n$ being a positive integer and ${\cal U}^n$ being the $n$–th Cartesian power of ${\cal U}$) is a realization of a random vector $\boldsymbol{U} = (U_1,\ldots,U_n)$.

Let $U$ be a uniformly distributed random variable over the interval $[-1/2,+1/2]$, which we will also denote by ${\cal U}$. (This specific assumption concerning the density of $U$ and its support is made for convenience only; our results extend to more general densities.) We refer to $U$ as the parameter to be conveyed from the source to the destination, via a given noisy channel. A given realization of $U$ will be denoted by $u$.

A discrete memoryless channel (DMC) is characterized by a matrix of conditional probabilities $\{p(y|x),\ x\in{\cal X},\ y\in{\cal Y}\}$, where the channel input and output alphabets, ${\cal X}$ and ${\cal Y}$, are assumed finite. (The finite alphabet assumption is made mainly for reasons of simplicity; the extension to continuous alphabets is possible, though some caution should be exercised at several places.) When a DMC is fed by an input vector $\boldsymbol{x}\in{\cal X}^n$, it produces an output vector $\boldsymbol{y}\in{\cal Y}^n$ according to

$$p(\boldsymbol{y}|\boldsymbol{x}) = \prod_{t=1}^{n} p(y_t|x_t). \qquad (4)$$

A modulator is a measurable mapping $f_n: {\cal U} \to {\cal X}^n$ and an estimator is a mapping $g_n: {\cal Y}^n \to {\cal U}$. The random vector $f_n(U)$ will also be denoted by $\boldsymbol{X}$. Similarly, the random variable $g_n(\boldsymbol{Y})$ will also be denoted by $\hat{U}$. Our basic figure of merit for communication systems is the expectation of the $\rho$–th power of the estimation error, i.e., $E|\hat{U}-U|^\rho$, where $\rho$ is a positive real (not necessarily an integer) and $E\{\cdot\}$ is the expectation operator with respect to (w.r.t.) the randomness of $U$ and $\boldsymbol{Y}$. The capability of attaining an exponential decay in $n$ by certain choices of a modulator $f_n$ and an estimator $g_n$ motivates the definition of the following exponential rates

$$\overline{E}(\rho) = \limsup_{n\to\infty}\left[-\frac{1}{n}\ln\left(\inf_{f_n,g_n} E\{|\hat{U}-U|^\rho\}\right)\right] \qquad (5)$$

and

$$\underline{E}(\rho) = \liminf_{n\to\infty}\left[-\frac{1}{n}\ln\left(\inf_{f_n,g_n} E\{|\hat{U}-U|^\rho\}\right)\right]. \qquad (6)$$

This paper is basically about the derivation of upper bounds on $\overline{E}(\rho)$ and lower bounds on $\underline{E}(\rho)$, with special interest in situations where these upper and lower bounds come close to each other.

## 3 Upper and Lower Bounds Based on Channel Coding

Let $q = \{q(x),\ x\in{\cal X}\}$ be a given probability vector of a random variable $X$ taking on values in ${\cal X}$, and let $\{p(y|x)\}$ define the given DMC. Let $E_0(\rho,q)$ be the Gallager function [4, p. 138, eq. (5.6.14)], [13, p. 133, eq. (3.1.18)], defined as

$$E_0(\rho,q) = -\ln\left(\sum_{y\in{\cal Y}}\left[\sum_{x\in{\cal X}} q(x)p(y|x)^{1/(1+\rho)}\right]^{1+\rho}\right), \qquad \rho \ge 0. \qquad (7)$$

Next, we define

$$E_0(\rho) = \max_q E_0(\rho,q), \qquad (8)$$

where the maximum is over the entire simplex of probability vectors, and let $\bar{E}_0(\rho)$ be the upper concave envelope (UCE) of $E_0(\rho)$. (While the Gallager function $E_0(\rho,q)$ is known to be concave in $\rho$ for every fixed $q$ [13, p. 134, eq. (3.2.5a)], we are not aware of an argument asserting that $E_0(\rho)$ is concave in general. On the other hand, there are many situations where $E_0(\rho)$ is, in fact, concave, and then $\bar{E}_0(\rho) = E_0(\rho)$; for example, when the achiever of $\max_q E_0(\rho,q)$ is independent of $\rho$, like in the case of the binary input output–symmetric (BIOS) channel [13, p. 153].) Next define

$$E_x(\varrho) = -\varrho\ln\left(\sum_{x,x'\in{\cal X}} q(x)q(x')\left[\sum_{y\in{\cal Y}}\sqrt{p(y|x)p(y|x')}\right]^{1/\varrho}\right) \qquad (9)$$

where the parameter $\varrho$ should be distinguished from the power $\rho$ of the estimation error under discussion. The expurgated exponent function [4, p. 153, eq. (5.7.11)], [13, p. 146, eq. (3.3.13)] is defined as

$$E_{ex}(R) = \sup_{\varrho\ge1}\left[E_x(\varrho) - \varrho R\right]. \qquad (10)$$

It is well known (and a straightforward exercise to show) that

$$E_{ex}(0) = \sup_{\varrho\ge1} E_x(\varrho) = \lim_{\varrho\to\infty} E_x(\varrho) = -\sum_{x,x'\in{\cal X}} q(x)q(x')\ln\left[\sum_{y\in{\cal Y}}\sqrt{p(y|x)p(y|x')}\right]. \qquad (11)$$

Finally, define

$$\overline{\overline{E}}(\rho) = \begin{cases} \bar{E}_0(\rho) & \rho \le \rho_0 \\ E_{ex}(0) & \rho > \rho_0 \end{cases} \qquad (12)$$

where $\rho_0$ is the (unique) solution to the equation $\bar{E}_0(\rho) = E_{ex}(0)$.
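To make these definitions concrete, the quantities entering $\overline{\overline{E}}(\rho)$ can be evaluated numerically. The sketch below (an illustration added here, not part of the original development) computes $E_0(\rho,q)$ of eq. (7), $E_{ex}(0)$ of eq. (11), and the crossover point $\rho_0$ for a binary symmetric channel (BSC), assuming, by symmetry, that the uniform input distribution achieves the relevant maxima:

```python
import numpy as np

def gallager_E0(rho, q, P):
    """Gallager's function, eq. (7): -ln sum_y [sum_x q(x) p(y|x)^{1/(1+rho)}]^{1+rho}."""
    inner = (q[:, None] * P ** (1.0 / (1.0 + rho))).sum(axis=0)
    return -np.log((inner ** (1.0 + rho)).sum())

def expurgated_zero_rate(q, P):
    """E_ex(0), eq. (11): -sum_{x,x'} q(x) q(x') ln B(x,x'), B = Bhattacharyya coefficient."""
    B = np.sqrt(P[:, None, :] * P[None, :, :]).sum(axis=-1)
    return -(q[:, None] * q[None, :] * np.log(B)).sum()

# BSC with crossover probability 0.05; rows of P are inputs, columns are outputs.
p = 0.05
P = np.array([[1 - p, p], [p, 1 - p]])
q = np.array([0.5, 0.5])            # uniform q, optimal here by symmetry

Eex0 = expurgated_zero_rate(q, P)
# rho0 solves E0(rho) = E_ex(0); E0 is increasing in rho, so bisect.
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if gallager_E0(mid, q, P) < Eex0:
        lo = mid
    else:
        hi = mid
rho0 = 0.5 * (lo + hi)

def E_upper(rho):
    """The channel-coding upper bound of eq. (12) (E0 concave for this channel)."""
    return gallager_E0(rho, q, P) if rho <= rho0 else Eex0
```

For the BSC with $p = 0.05$ this yields $E_{ex}(0) \approx 0.415$ nats; the bound follows $E_0(\rho)$ up to $\rho_0$ and then stays at that constant.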

Our first theorem (see Appendix A for the proof) asserts that $\overline{\overline{E}}(\rho)$ is an upper bound on the best achievable exponential decay rate of the $\rho$–th moment of the estimation error.

###### Theorem 1

Let $U$ be uniformly distributed over $[-1/2,+1/2]$ and let $\{p(y|x)\}$ be a given DMC. Then, for every $\rho > 0$,

$$\overline{E}(\rho) \le \overline{\overline{E}}(\rho). \qquad (13)$$

We now proceed to present a lower bound to $E(\rho)$. Let $R_-$ be the smallest rate $R$ such that $E_{ex}(R)$ is attained with $\varrho = 1$, and let $R_+$ denote the largest rate $R$ such that the random coding exponent

$$E_r(R) = \max_{0\le\varrho\le1}\left[E_0(\varrho) - \varrho R\right] \qquad (14)$$

is attained for $\varrho = 1$. (For example, in the case of the BSC with a crossover parameter $p$, both $R_+$ and $R_-$ admit closed–form expressions [13, pp. 151–152].) Next, define

$$\rho_+ = \frac{E_0(1) - R_+}{R_+} \qquad (15)$$

$$\rho_- = \frac{E_0(1) - R_-}{R_-} \qquad (16)$$

and finally,

$$\underline{\underline{E}}(\rho) = \begin{cases} \displaystyle\sup_{0\le\varrho\le1}\frac{\rho E_0(\varrho)}{\varrho+\rho} & \rho \le \rho_+ \\ \displaystyle\frac{\rho E_0(1)}{1+\rho} & \rho_+ < \rho \le \rho_- \\ \displaystyle\sup_{\varrho\ge1}\frac{\rho E_x(\varrho)}{\varrho+\rho} & \rho > \rho_- \end{cases} \qquad (17)$$

Our next theorem (see Appendix B for the proof) tells us that $\underline{\underline{E}}(\rho)$ is a lower bound on the best attainable exponential decay rate of $E|\hat{U}-U|^\rho$.

###### Theorem 2

Let $U$ be uniformly distributed over $[-1/2,+1/2]$ and let $\{p(y|x)\}$ be a given DMC. Then, for every $\rho > 0$,

$$\underline{E}(\rho) \ge \underline{\underline{E}}(\rho). \qquad (18)$$

The derivations of both $\overline{\overline{E}}(\rho)$ and $\underline{\underline{E}}(\rho)$ rely on channel coding considerations. In particular, the derivation of $\overline{\overline{E}}(\rho)$ builds strongly on the method of [7], which extends the derivation of the Ziv–Zakai bound [17] and the Chazan–Zakai–Ziv bound [3]. While the two latter bounds are based on considerations associated with binary hypothesis testing, here and in [7], the general idea is extended to exponentially many hypotheses pertaining to channel decoding.

We see that both bounds exhibit different types of behavior in different ranges of $\rho$ (i.e., "phase transitions"), but in a different manner. For both $\overline{\overline{E}}(\rho)$ and $\underline{\underline{E}}(\rho)$, the behavior is related to the ordinary Gallager function in some range of small $\rho$, and to the expurgated exponent in a certain range of large $\rho$.

As can be seen in the proof of Theorem 2 (Appendix B), the communication system that achieves $\underline{\underline{E}}(\rho)$ works as follows (see also [7], [8]): Define

$$R(\rho) = \frac{\underline{\underline{E}}(\rho)}{\rho} = \begin{cases} \displaystyle\sup_{0\le\varrho\le1}\frac{E_0(\varrho)}{\varrho+\rho} & \rho \le \rho_+ \\ \displaystyle\frac{E_0(1)}{1+\rho} = \frac{E_x(1)}{1+\rho} & \rho_+ < \rho \le \rho_- \\ \displaystyle\sup_{\varrho\ge1}\frac{E_x(\varrho)}{\varrho+\rho} & \rho > \rho_- \end{cases} \qquad (19)$$

Construct a uniform grid of $M = e^{nR(\rho)}$ evenly spaced points along ${\cal U}$, denoted $\{u_1,\ldots,u_M\}$. If $\rho > \rho_-$, assign to each grid point a codeword of a code of rate $R(\rho)$ that achieves the expurgated exponent (see [4, Theorem 5.7.1] or [13, Theorem 3.3.1]). If $\rho \le \rho_-$, do the same with a code that achieves the random coding exponent $E_r(R)$ (see [4, p. 139, Corollary 1] or [13, Theorem 3.2.1]). Given $u$, let $f_n(u)$ be the codeword that is assigned to the grid point which is closest to $u$. Given $\boldsymbol{y}$, let $\hat{U} = g_n(\boldsymbol{y})$ be the grid point that corresponds to the codeword that has been decoded, based on $\boldsymbol{y}$, using the ML decoder for the given DMC.
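A minimal sketch of the grid part of this scheme (illustrative only; the channel code and the ML decoder are abstracted away, and all names are ours): with $M = \lceil e^{nR}\rceil$ grid points, the quantization error alone never exceeds $e^{-nR}/2$, so absent decoding errors the $\rho$–th moment already decays at exponential rate $\rho R$:

```python
import numpy as np

def make_grid(n, R):
    """Uniform grid of M = ceil(e^{nR}) points covering [-1/2, 1/2)."""
    M = int(np.ceil(np.exp(n * R)))
    return -0.5 + (np.arange(M) + 0.5) / M   # cell midpoints, spacing 1/M

def quantize(u, grid):
    """Nearest grid point; error is at most half the spacing, 1/(2M) <= e^{-nR}/2."""
    return grid[np.argmin(np.abs(grid - u))]

n, R = 60, 0.1                     # the rate R = R(rho) is chosen per eq. (19)
grid = make_grid(n, R)
u = 0.1234
u_hat = quantize(u, grid)
assert abs(u_hat - u) <= np.exp(-n * R) / 2
```

In the full scheme, each grid point is mapped to a codeword; a decoding error, whose probability decays with the code's error exponent at rate $R$, replaces $\hat{U}$ by a wrong grid point, and the overall exponent is the balance struck by eq. (19).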

Let us examine the behavior of these bounds as $\rho\to\infty$ and as $\rho\to0$. For very large values of $\rho$, where the upper bound is obviously given by $E_{ex}(0)$, the lower bound is given by

$$\lim_{\rho\to\infty}\underline{\underline{E}}(\rho) = \lim_{\rho\to\infty}\sup_{\varrho\ge1}\frac{\rho E_x(\varrho)}{\varrho+\rho} \qquad (20)$$
$$\ge \lim_{\rho\to\infty}\frac{\rho E_x(\sqrt{\rho})}{\sqrt{\rho}+\rho} \qquad (21)$$
$$= \lim_{\rho\to\infty} E_x(\sqrt{\rho}) = E_{ex}(0), \qquad (22)$$

which means that for large $\rho$ all the exponents asymptotically coincide:

$$\lim_{\rho\to\infty}\underline{\underline{E}}(\rho) = \lim_{\rho\to\infty}\underline{E}(\rho) = \lim_{\rho\to\infty}\overline{E}(\rho) = \lim_{\rho\to\infty}\overline{\overline{E}}(\rho) = E_{ex}(0). \qquad (23)$$

In the achievability scheme described above, $R(\rho)$ is then a very low coding rate. On the other hand, for very small values of $\rho$, where $\overline{\overline{E}}(\rho) \approx \rho C$, $C$ being the channel capacity, we have

$$\lim_{\rho\to0}\frac{\underline{\underline{E}}(\rho)}{\rho} = \lim_{\rho\to0}\sup_{0\le\varrho\le1}\frac{E_0(\varrho)}{\varrho+\rho} \qquad (24)$$
$$\ge \lim_{\rho\to0}\frac{E_0(\sqrt{\rho})}{\sqrt{\rho}+\rho} \qquad (25)$$
$$= \lim_{\rho\to0}\frac{E_0(\sqrt{\rho})}{\sqrt{\rho}}\cdot\frac{1}{1+\sqrt{\rho}} \qquad (26)$$
$$= \lim_{\rho\to0}\frac{E_0(\sqrt{\rho})}{\sqrt{\rho}} = C, \qquad (27)$$

which means that for small $\rho$ all the exponents behave like $\rho C$, i.e.,

$$\lim_{\rho\to0}\frac{\underline{\underline{E}}(\rho)}{\rho} = \lim_{\rho\to0}\frac{\underline{E}(\rho)}{\rho} = \lim_{\rho\to0}\frac{\overline{E}(\rho)}{\rho} = \lim_{\rho\to0}\frac{\overline{\overline{E}}(\rho)}{\rho} = C. \qquad (28)$$

It is then interesting to observe that not only channel–coding error exponents, but also channel capacity, plays a role in the characterization of the best achievable modulation–estimation performance. In the achievability scheme described above, $R(\rho)$ is then a very high coding rate, very close to the capacity $C$.
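Both limits can be checked numerically. The sketch below (illustrative, not from the paper) evaluates the first and third branches of eq. (19) on a grid for a BSC with crossover $0.05$, assuming by symmetry that the uniform input distribution is optimal, and compares them with $\rho C$ and $E_{ex}(0)$, respectively:

```python
import numpy as np

p = 0.05
P = np.array([[1 - p, p], [p, 1 - p]])
q = np.array([0.5, 0.5])
C = np.log(2) + p * np.log(p) + (1 - p) * np.log(1 - p)   # BSC capacity in nats

def E0(v):
    inner = (q[:, None] * P ** (1 / (1 + v))).sum(axis=0)
    return -np.log((inner ** (1 + v)).sum())

def Ex(v):
    # E_x of eq. (9) for the BSC with uniform q; B is the Bhattacharyya coefficient.
    B = 2 * np.sqrt(p * (1 - p))
    return -v * np.log(0.5 * (1 + B ** (1 / v)))

Eex0 = -0.5 * np.log(2 * np.sqrt(p * (1 - p)))            # eq. (11) for the BSC

def lower_small(rho):   # rho times the first branch of eq. (19)
    return rho * max(E0(v) / (v + rho) for v in np.linspace(1e-6, 1.0, 2001))

def lower_large(rho):   # rho times the third branch of eq. (19)
    return max(rho * Ex(v) / (v + rho) for v in np.logspace(0, 3, 2001))

print(lower_small(1e-4) / (1e-4 * C))   # approaches 1 as rho -> 0
print(lower_large(1e4) / Eex0)          # approaches 1 as rho -> infinity
```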

## 4 Upper Bound Based on Data Processing Inequalities

We next derive an alternative upper bound on $E(\rho)$ that is based on generalized data processing inequalities, following Ziv and Zakai [18] and Zakai and Ziv [16]. The idea behind these works is that it is possible to define generalized mutual information functionals satisfying a DPT, by replacing the negative logarithm function of the ordinary mutual information by a general convex function. This makes it possible to obtain tighter distortion bounds for communication systems with short block length.

In [6] it was shown that the following generalized mutual information functional, between two generic random variables $A$ and $B$, admits a DPT for every positive integer $k$ and for every vector $(\alpha_1,\ldots,\alpha_k)$ whose components are non–negative and sum to unity:

$$\tilde{I}(A;B) = -E\left\{\sum_{b\in{\cal B}}\prod_{i=1}^{k} p(b|A_i)^{\alpha_i}\right\} = -\sum_{b\in{\cal B}}\prod_{i=1}^{k}\sum_{a_i\in{\cal A}} q(a_i)p(b|a_i)^{\alpha_i}, \qquad (29)$$

where $A_1,\ldots,A_k$ are independent copies of $A$.

In particular, since $U \to \boldsymbol{Y} \to \hat{U}$ is a Markov chain, then by the generalized DPT,

$$\tilde{I}(U;\hat{U}) \le \tilde{I}(U;\boldsymbol{Y}). \qquad (30)$$

The idea is to further upper bound $\tilde{I}(U;\boldsymbol{Y})$ and to further lower bound $\tilde{I}(U;\hat{U})$ subject to the constraint $E|\hat{U}-U|^\rho = D$, which leads to a generalized rate–distortion function, and thereby to obtain an inequality on $D$. Specifically, $\tilde{I}(U;\boldsymbol{Y})$ is upper bounded as follows:

$$\tilde{I}(U;\boldsymbol{Y}) = -\sum_{\boldsymbol{y}\in{\cal Y}^n}\prod_{i=1}^{k}\int_{-1/2}^{+1/2}{\rm d}u_i\, p(\boldsymbol{y}|f_n(u_i))^{\alpha_i} \qquad (31)$$
$$= -\sum_{\boldsymbol{y}\in{\cal Y}^n}\prod_{i=1}^{k}\int_{-1/2}^{+1/2}{\rm d}u_i\prod_{t=1}^{n} p(y_t|[f_n(u_i)]_t)^{\alpha_i} \qquad (32)$$
$$= -\prod_{t=1}^{n}\sum_{y\in{\cal Y}}\prod_{i=1}^{k}\int_{-1/2}^{+1/2}{\rm d}u_i\, p(y|[f_n(u_i)]_t)^{\alpha_i} \qquad (33)$$
$$\le -\min_q\prod_{t=1}^{n}\sum_{y\in{\cal Y}}\prod_{i=1}^{k}\sum_{x_i\in{\cal X}} q(x_i)p(y|x_i)^{\alpha_i} \qquad (34)$$
$$= -\min_q\left[\sum_{y\in{\cal Y}}\prod_{i=1}^{k}\sum_{x_i\in{\cal X}} q(x_i)p(y|x_i)^{\alpha_i}\right]^n \qquad (35)$$
$$= -\exp\left\{-n\max_q E(\alpha_1,\ldots,\alpha_k,q)\right\}, \qquad (36)$$

where $[f_n(u)]_t$ denotes the $t$–th component of the vector $f_n(u)$, and where

$$E(\alpha_1,\ldots,\alpha_k,q) = -\ln\left[\sum_{y\in{\cal Y}}\prod_{i=1}^{k}\left(\sum_{x_i\in{\cal X}} q(x_i)p(y|x_i)^{\alpha_i}\right)\right]. \qquad (37)$$

Note that for $k = 1+\varrho$ and $\alpha_1 = \cdots = \alpha_k = \frac{1}{1+\varrho}$ ($\varrho$ – integer),

$$E\left(\frac{1}{1+\varrho},\ldots,\frac{1}{1+\varrho},q\right) = E_0(\varrho,q). \qquad (38)$$

In Appendix C we show that

$$\min\left\{\tilde{I}(U;\hat{U}):\ E|\hat{U}-U|^\rho = D\right\} \stackrel{\Delta}{=} \tilde{R}(D) \ge -c\cdot D^{\sum_{i=1}^k \zeta_\rho(\alpha_i)} \qquad (39)$$

where $c$ is a constant that depends solely on $\rho$, $k$, and $\{\alpha_i\}$, and where

$$\zeta_\rho(\alpha) = \begin{cases} \alpha & 0\le\alpha\le\frac{1}{1+\rho} \\ \frac{1-\alpha}{\rho} & \frac{1}{1+\rho}\le\alpha\le1 \end{cases} = \min\left\{\alpha,\ \frac{1-\alpha}{\rho}\right\}. \qquad (40)$$

The function $\tilde{R}(D)$ in eq. (39) is referred to as a "generalized rate–distortion function" in the terminology of [18] and [16]. Thus, from the generalized DPT,

$$E|\hat{U}-U|^\rho \equiv D \ge c'\cdot e^{-n\overline{\overline{E}}_{\rm DPT}(\rho)} \qquad (41)$$

where $c'$ is another constant and

$$\overline{\overline{E}}_{\rm DPT}(\rho) \stackrel{\Delta}{=} \inf_{k>1}\ \inf_{\alpha_1,\ldots,\alpha_k}\ \frac{\sup_q E(\alpha_1,\ldots,\alpha_k,q)}{\sum_{i=1}^k \zeta_\rho(\alpha_i)}. \qquad (42)$$
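To give a feel for the optimization in eq. (42), the sketch below evaluates $E(\alpha_1,\alpha_2,q)$ of eq. (37) and the denominator $\sum_i\zeta_\rho(\alpha_i)$ of eq. (40) for $k = 2$ on a BSC. For simplicity it fixes the uniform $q$ (reasonable here by symmetry) rather than performing the inner maximization over $q$, so it illustrates the mechanics of the bound rather than its exact value:

```python
import numpy as np

def zeta(rho, a):
    """eq. (40): zeta_rho(alpha) = min{alpha, (1 - alpha)/rho}."""
    return np.minimum(a, (1.0 - a) / rho)

def E_alpha(alphas, q, P):
    """eq. (37): -ln sum_y prod_i [sum_x q(x) p(y|x)^{alpha_i}]."""
    prod = np.ones(P.shape[1])
    for a in alphas:
        prod *= (q[:, None] * P ** a).sum(axis=0)
    return -np.log(prod.sum())

p = 0.05
P = np.array([[1 - p, p], [p, 1 - p]])
q = np.array([0.5, 0.5])
rho = 1.0

# k = 2, alpha = (a, 1-a): grid search over the objective of eq. (42).
objective = lambda a: E_alpha([a, 1 - a], q, P) / (zeta(rho, a) + zeta(rho, 1 - a))
best = min(objective(a) for a in np.linspace(0.01, 0.5, 50))
```

For $\rho = 1$, the choice $a = 1/2$ reproduces $E_0(1,q)$ divided by $2\zeta_1(1/2) = 1$, consistently with eq. (38); the grid search can only improve on (i.e., lower) that value.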

As an example, assume that the channel is such that the function $E_0(\rho)$ is concave, so that $\bar{E}_0(\rho) = E_0(\rho)$, and hence $\overline{\overline{E}}(\rho) = E_0(\rho)$ for all $\rho \le \rho_0$. Now, let $\rho$ be an integer (for example, $\rho = 1$ is always a legitimate choice). Then,

$$\overline{\overline{E}}(\rho) = E_0(\rho) \qquad (43)$$
$$= \frac{\sup_q E\left(\frac{1}{1+\rho},\ldots,\frac{1}{1+\rho},q\right)}{(1+\rho)\,\zeta_\rho\!\left(\frac{1}{1+\rho}\right)} \qquad (44)$$
$$\ge \inf_{k>1}\ \inf_{\alpha_1,\ldots,\alpha_k}\ \frac{\sup_q E(\alpha_1,\ldots,\alpha_k,q)}{\sum_{i=1}^k\zeta_\rho(\alpha_i)} \qquad (45)$$
$$= \overline{\overline{E}}_{\rm DPT}(\rho). \qquad (46)$$

Thus, at least in this case, the DPT bound $\overline{\overline{E}}_{\rm DPT}(\rho)$ is guaranteed to be no worse than the channel–coding bound $\overline{\overline{E}}(\rho)$. Nonetheless, in our numerical studies, we have not found an example where the DPT bound strictly improves on the channel–coding bound, i.e., where $\overline{\overline{E}}_{\rm DPT}(\rho) < \overline{\overline{E}}(\rho)$, and it remains an open question whether the DPT bound can offer an improvement in any situation, thanks to its additional degrees of freedom. It should be pointed out that the vector $(\alpha_1,\ldots,\alpha_k)$ that achieves $\overline{\overline{E}}_{\rm DPT}(\rho)$ is not always given by $\alpha_i \equiv \frac{1}{1+\rho}$, because the function $E(\alpha_1,\ldots,\alpha_k,q)$ is not convex in $(\alpha_1,\ldots,\alpha_k)$. At any rate, in all cases where the two bounds are equivalent, namely, $\overline{\overline{E}}_{\rm DPT}(\rho) = \overline{\overline{E}}(\rho)$, this is interesting in its own right, since the two bounds are obtained by two different techniques that are based on completely different considerations. One advantage of the DPT approach is that it seems to lend itself more comfortably to extensions that account for moments of more general functions of the estimation error, i.e., $E\{F(|\hat{U}-U|)\}$, for a large class of monotonically increasing functions $F$. On the other hand, the optimization associated with the calculation of the DPT bound is not trivial.

## 5 Example: Very Noisy Channel

As an example, we consider the so–called very noisy channel, which is characterized by

$$p(y|x) = p(y)[1+\epsilon(x,y)], \qquad |\epsilon(x,y)| \ll 1, \quad \forall x\in{\cal X},\ y\in{\cal Y}. \qquad (47)$$

As is shown in [13, pp. 155–158], to the first order, we have the following relations:

$$C = \frac{1}{2}\max_q\sum_{x,y} q(x)p(y)\epsilon^2(x,y) \qquad (48)$$

$$E_0(\varrho) = \frac{\varrho}{1+\varrho}\cdot C, \qquad (49)$$

and therefore

$$E_r(R) = \max_{0\le\varrho\le1}\left(\frac{\varrho}{1+\varrho}\cdot C - \varrho R\right) = \begin{cases} \frac{C}{2} - R & R \le \frac{C}{4} \\ \left(\sqrt{C}-\sqrt{R}\right)^2 & \frac{C}{4} \le R \le C \end{cases} \qquad (50)$$
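Gallager's closed–form expression for this maximization, $E_r(R) = C/2 - R$ for $R \le C/4$ and $E_r(R) = (\sqrt{C}-\sqrt{R})^2$ for $C/4 \le R \le C$, is easy to verify against a direct numerical maximization over $\varrho$ (an illustrative check, with $C = 1$):

```python
import numpy as np

def Er_numeric(R, C=1.0, grid=100001):
    """Directly maximize rho/(1+rho)*C - rho*R over rho in [0, 1], as in eq. (50)."""
    v = np.linspace(0.0, 1.0, grid)
    return np.max(v / (1 + v) * C - v * R)

def Er_closed(R, C=1.0):
    """Closed form: the unconstrained maximizer is sqrt(C/R) - 1, clipped to [0, 1]."""
    return C / 2 - R if R <= C / 4 else (np.sqrt(C) - np.sqrt(R)) ** 2

for R in [0.05, 0.25, 0.5, 0.9]:
    assert abs(Er_numeric(R) - Er_closed(R)) < 1e-6
```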

As for the expurgated exponent, we have

$$E_x(\varrho) = E_0(1) = \frac{C}{2} \qquad (51)$$

and so,

$$E_{ex}(R) = \sup_{\varrho\ge1}\left[E_x(\varrho) - \varrho R\right] = \frac{C}{2} - R, \qquad (52)$$

which means that expurgation does not help for very noisy channels. This implies that $\rho_0 = 1$, and so

$$\overline{\overline{E}}(\rho) = \begin{cases} \frac{\rho}{1+\rho}\cdot C & \rho \le 1 \\ \frac{C}{2} & \rho > 1 \end{cases} \qquad (53)$$

As for the lower bound, we have the following: For $\rho < 1$,

$$\underline{\underline{E}}(\rho) = \sup_{0\le\varrho\le1}\frac{\rho}{\rho+\varrho}\cdot\frac{\varrho}{1+\varrho}\cdot C = \frac{\rho}{(1+\sqrt{\rho})^2}\cdot C. \qquad (54)$$

The same result is obtained, of course, from the solution $\varrho^* = \sqrt{\rho}$ to the equation $\frac{{\rm d}}{{\rm d}\varrho}\left[\frac{E_0(\varrho)}{\varrho+\rho}\right] = 0$. For $\rho \ge 1$,

$$\underline{\underline{E}}(\rho) = \sup_{\varrho\ge1}\frac{\rho E_x(\varrho)}{\varrho+\rho} = \sup_{\varrho\ge1}\frac{\rho}{\varrho+\rho}\cdot\frac{C}{2} = \frac{\rho}{1+\rho}\cdot\frac{C}{2}. \qquad (55)$$

Thus, in summary,

$$\underline{\underline{E}}(\rho) = \begin{cases} \frac{\rho}{(1+\sqrt{\rho})^2}\cdot C & \rho < 1 \\ \frac{\rho}{1+\rho}\cdot\frac{C}{2} & \rho \ge 1 \end{cases} \qquad (56)$$

We see how the bounds asymptotically coincide (in the sense that $\underline{\underline{E}}(\rho)/\overline{\overline{E}}(\rho) \to 1$) both for very large values of $\rho$ and for very small values of $\rho$ (see Fig. 1).
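The convergence of the two bounds, eqs. (53) and (56), is easy to tabulate in units of $C$ (an illustrative sketch):

```python
import numpy as np

def upper(rho):
    """eq. (53) in units of C."""
    return rho / (1 + rho) if rho <= 1 else 0.5

def lower(rho):
    """eq. (56) in units of C."""
    return rho / (1 + np.sqrt(rho)) ** 2 if rho < 1 else 0.5 * rho / (1 + rho)

for rho in [1e-4, 1e-2, 1.0, 1e2, 1e4]:
    print(rho, lower(rho) / upper(rho))   # the ratio tends to 1 at both extremes
```

The ratio dips to $1/2$ at $\rho = 1$ and climbs back toward $1$ on both sides, matching Fig. 1.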

As for the DPT bound, we have the following approximate analysis:

$$e^{-\sup_q E(\alpha_1,\ldots,\alpha_k,q)} = \inf_q\sum_{y\in{\cal Y}}\prod_{i=1}^{k}\left[\sum_{x_i\in{\cal X}} q(x_i)p(y|x_i)^{\alpha_i}\right] \qquad (57)$$
$$= \inf_q\sum_{y\in{\cal Y}}\prod_{i=1}^{k}\left[\sum_{x_i\in{\cal X}} q(x_i)p(y)^{\alpha_i}[1+\epsilon(x_i,y)]^{\alpha_i}\right] \qquad (58)$$
$$= \inf_q\sum_{y\in{\cal Y}} p(y)\prod_{i=1}^{k}\left[\sum_{x_i\in{\cal X}} q(x_i)[1+\epsilon(x_i,y)]^{\alpha_i}\right] \qquad (59)$$
$$\approx \inf_q\sum_{y\in{\cal Y}} p(y)\prod_{i=1}^{k}\left(\sum_{x_i\in{\cal X}} q(x_i)\left[1+\alpha_i\epsilon(x_i,y)-\frac{1}{2}\alpha_i(1-\alpha_i)\epsilon^2(x_i,y)\right]\right) \qquad (60)$$
$$= \inf_q\sum_{y\in{\cal Y}} p(y)\prod_{i=1}^{k}\left[1-\frac{1}{2}\alpha_i(1-\alpha_i)\sum_{x_i\in{\cal X}} q(x_i)\epsilon^2(x_i,y)\right] \qquad (61)$$
$$\approx \inf_q\sum_{y\in{\cal Y}} p(y)\left[1-\frac{1}{2}\sum_{i=1}^{k}\alpha_i(1-\alpha_i)\sum_{x_i\in{\cal X}} q(x_i)\epsilon^2(x_i,y)\right] \qquad (62)$$
$$= 1-\frac{1}{2}\sum_{i=1}^{k}\alpha_i(1-\alpha_i)\sup_q\sum_{x\in{\cal X}}\sum_{y\in{\cal Y}} q(x)p(y)\epsilon^2(x,y) \qquad (63)$$
$$\approx 1-C\sum_{i=1}^{k}\alpha_i(1-\alpha_i) \qquad (64)$$
$$= 1-C\left(1-\sum_{i=1}^{k}\alpha_i^2\right), \qquad (65)$$

where in the fifth line, we have used the identity $\sum_{x\in{\cal X}} q(x)\epsilon(x,y) = 0$ for all $y\in{\cal Y}$ [13, p. 156, eq. (3.4.28)]. Thus,

$$\sup_q E(\alpha_1,\ldots,\alpha_k,q) = -\ln\left[1-C\left(1-\sum_{i=1}^{k}\alpha_i^2\right)\right] \approx C\left(1-\sum_{i=1}^{k}\alpha_i^2\right), \qquad (66)$$

and then

$$\overline{\overline{E}}_{\rm DPT}(\rho) \approx C\cdot\inf_{k>1}\ \inf_{\alpha_1,\ldots,\alpha_k}\ \frac{1-\sum_{i=1}^k\alpha_i^2}{\sum_{i=1}^k\zeta_\rho(\alpha_i)}. \qquad (67)$$

The very same expressions are obtained for the continuous–time AWGN channel with unlimited bandwidth, where $C = P/N_0$, $P$ being the signal power and $N_0$ being the one–sided noise spectral density. For $\rho = 1$ and $k = 2$, setting $\alpha_1 = \alpha$ and $\alpha_2 = 1-\alpha$, we have:

$$\overline{\overline{E}}_{\rm DPT}(1) \le C\cdot\inf_{0\le\alpha\le1}\frac{1-\alpha^2-(1-\alpha)^2}{2\min\{\alpha,\ 1-\alpha\}} \qquad (68)$$
$$= C\cdot\inf_{0\le\alpha\le1/2}\frac{2\alpha(1-\alpha)}{2\alpha} = \frac{C}{2}, \qquad (69)$$

which agrees with $\overline{\overline{E}}(1) = C/2$. For other values of $\rho$ and for $k > 2$, the minimization can be carried out in a similar manner, and larger values of $k$ may further tighten the bound.
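The $k = 2$ minimization of eqs. (68)–(69) is straightforward to reproduce numerically (an illustrative sketch, everything in units of $C$):

```python
import numpy as np

def dpt_vn_k2(rho, grid=20001):
    """Minimize (1 - a^2 - (1-a)^2) / (zeta_rho(a) + zeta_rho(1-a)), eq. (67) with k = 2."""
    a = np.linspace(1e-4, 0.5, grid)          # by symmetry, a <= 1/2 suffices
    zeta = lambda t: np.minimum(t, (1 - t) / rho)
    vals = (1 - a**2 - (1 - a)**2) / (zeta(a) + zeta(1 - a))
    return vals.min()

print(dpt_vn_k2(1.0))   # 0.5, i.e. C/2, attained at a = 1/2
```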

## 6 Extension to the Multidimensional Case

Consider now the case of a parameter vector $\boldsymbol{u} = (u_1,\ldots,u_d)$, uniformly distributed across the unit hypercube $[-1/2,+1/2]^d$. A reasonable figure of merit in this case would be a linear combination of the moments $E\{|\hat{U}_i-U_i|^\rho\}$, $i = 1,\ldots,d$. Since each one of these terms is exponential in $n$, it makes sense to let the coefficients of this linear combination also be exponential functions of $n$, as otherwise, the results will be exponentially insensitive to the choice of the coefficients. This means that we consider the criterion

$$\sum_{i=1}^{d} e^{nr_i}\cdot E\{|\hat{U}_i-U_i|^\rho\}, \qquad (70)$$

where, without loss of generality, we take $r_i \ge 0$, $i = 1,\ldots,d$.

The derivation below is an extension of the derivation of the channel coding bound, given in Appendix A for the case $d = 1$. Therefore, a reader who is interested in the details is advised to read Appendix A first, or otherwise to skip directly to the final result in eq. (81) and the discussion that follows.

Let us define $R_i = (r_i+\gamma)/\rho$ for some constant $\gamma \ge 0$. Consider the following chain of inequalities:

$$\sum_{i=1}^{d} e^{nr_i}\cdot E\{|\hat{U}_i-U_i|^\rho\} \ge \sum_{i=1}^{d} e^{nr_i}\cdot e^{-n\rho R_i}\Pr\{|\hat{U}_i-U_i| \ge e^{-nR_i}\} \qquad (71)$$
$$= \sum_{i=1}^{d} e^{-n(\rho R_i - r_i)}\Pr\{|\hat{U}_i-U_i| \ge e^{-nR_i}\} \qquad (72)$$
$$= e^{-\gamma n}\sum_{i=1}^{d}\Pr\{|\hat{U}_i-U_i| \ge e^{-n(r_i+\gamma)/\rho}\} \qquad (73)$$
$$\ge e^{-\gamma n}\cdot\Pr\bigcup_{i=1}^{d}\left\{|\hat{U}_i-U_i| \ge e^{-n(r_i+\gamma)/\rho}\right\} \qquad (74)$$
$$\ge e^{-\gamma n}\cdot\exp\left\{-nE_{sl}\left(\frac{1}{\rho}\left[\sum_{i=1}^{d}\frac{r_i+\gamma}{d}\right]\right)\right\}, \qquad (75)$$

where the second line follows from Chebychev's inequality, the fifth line follows from the union bound, and the last line follows from the same arguments as in [7, Sect. IV.A]. Maximizing over $\gamma \ge 0$, we get

$$\sum_{i=1}^{d} e^{nr_i}\cdot E\{|\hat{U}_i-U_i|^\rho\} \ge \exp\left\{-n\min_{\gamma\ge0}\left[\gamma + E_{sl}\left(\frac{1}{\rho}\left[\sum_{i=1}^{d}\frac{r_i+\gamma}{d}\right]\right)\right]\right\}. \qquad (76)$$

Defining $R = \frac{1}{\rho}\left[\frac{1}{d}\sum_{i=1}^d r_i + \gamma\right]$ and $R_{\min} = \frac{1}{\rho d}\sum_{i=1}^d r_i$, the above minimization at the exponent becomes equivalent to

$$\min_{R\ge R_{\min}}\left[\rho R - \frac{1}{d}\sum_{i=1}^{d} r_i + E_{sl}(R)\right] \qquad (77)$$
$$= \min_{R\ge R_{\min}}\Big[\rho d$$