Information-Theoretic Lower Bounds on Bayes Risk in Decentralized Estimation

Aolin Xu and Maxim Raginsky  This work was supported by the NSF under grant CCF-1017564, CAREER award CCF-1254041, by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, and by ONR under grant N00014-12-1-0998. The material in this paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Hong Kong, June 2015. The authors are with the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801, USA. E-mail: {aolinxu2,maxim}@illinois.edu.
Abstract

We derive lower bounds on the Bayes risk in decentralized estimation, where the estimator does not have direct access to the random samples generated conditionally on the random parameter of interest, but only to the data received from local processors that observe the samples. The received data are subject to communication constraints, due to quantization and the noise in the communication channels from the processors to the estimator. We first derive general lower bounds on the Bayes risk using information-theoretic quantities, such as mutual information, information density, small ball probability, and differential entropy. We then apply these lower bounds to the decentralized case, using strong data processing inequalities to quantify the contraction of information due to communication constraints. We treat the cases of a single processor and of multiple processors, where the samples observed by different processors may be conditionally dependent given the parameter, for noninteractive and interactive communication protocols. Our results recover and improve recent lower bounds on the Bayes risk and the minimax risk for certain decentralized estimation problems, where previously only conditionally independent sample sets and noiseless channels have been considered. Moreover, our results provide a general way to quantify the degradation of estimation performance caused by distributing resources to multiple processors, which is only discussed for specific examples in existing works.

Index Terms: Bayes risk, decentralized estimation, small ball probability, Neyman–Pearson converse, strong data processing inequalities

I Introduction

I-A Decentralized estimation

In decentralized estimation, the estimator does not have direct access to the samples generated according to the parameter of interest, but only to the data received from local processors that observe the samples. In this paper, we consider a general model of decentralized estimation, where each local processor observes a set of samples generated according to a common random parameter, quantizes the samples to a fixed-length binary message, then encodes and sends the message to the estimator over an independent and possibly noisy communication channel. When the communication channels are noiseless and feedback from the estimator to the local processors is available, the processors can operate in an interactive protocol by taking turns to send messages, where the message sent by each processor can depend on the previous messages sent by the other processors. An estimate is then computed based on the messages received from the local processors. The estimation performance is measured by the expected distortion between the parameter and its estimate, with respect to some distortion function. The minimum possible expected distortion is defined as the Bayes risk. We derive lower bounds on the Bayes risk for this estimation problem, and thereby gain insight into the fundamental limits of decentralized estimation.

There are three types of constraints inherent in decentralized estimation. The first, and the most fundamental one, is the statistical constraint, determined by the joint distribution of the parameter and the samples. The statistical constraint exists even in centralized estimation, where the estimator can directly observe the samples. To study how the estimation performance is limited by the statistical constraint, we start by deriving lower bounds on the Bayes risk for centralized estimation in Section II. The results obtained in Section II apply to decentralized estimation as well, but, more importantly, they also serve as the basis for the refined lower bounds for decentralized estimation in Sections IV and V.

The second is the communication constraint, due to the separation between the local processors and the estimator. The communication constraint arises even when there is only one local processor. It can be caused by the finite precision of analog-to-digital conversion, limitations on the storage of intermediate results, limited transmission blocklength, channel noise, etc. In Section IV, we present a detailed study of decentralized estimation with a single processor and reveal the influence of the communication constraint on the estimation performance. Section III contains background information on strong data processing inequalities, the major tool used in our analysis of the communication constraint.

The third constraint appears when there is more than one local processor. It is the penalty of decentralization, caused by distributing the samples and communication resources among multiple processors. We study decentralized estimation with multiple processors in Section V, where we show that, regardless of whether or not the sample sets seen by different local processors are conditionally independent given the parameter, the degradation of estimation performance becomes more pronounced when the resources are distributed to more processors. We also provide lower bounds on the Bayes risk for interactive protocols, where the processors take turns to send their messages, and each processor sends one message based on its sample set and the previous messages sent by other processors.

I-B Method of analysis

Our method of analysis is information-theoretic in nature. The major quantity we examine is the conditional mutual information between the parameter and its estimate, given a judiciously chosen auxiliary random variable.

We first lower-bound this quantity in terms of the estimation performance, such as the probability of excess distortion or the expected distortion. The lower bounds also depend on the a priori uncertainty about the parameter, measured either by its small ball probability or by its differential entropy. Any such lower bound can be viewed as a generalization of Fano’s inequality, which indicates the least amount of information about the parameter that must be contained in the estimate in order to achieve a certain estimation performance. We also analyze the probability of excess distortion and the expected distortion via the distribution of the conditional information density.

On the other hand, the various constraints inherent in decentralized estimation impose upper bounds on this conditional mutual information. The statistical constraint implies that it is upper-bounded by the conditional mutual information between the parameter and the samples. The communication constraint further implies that the amount of information about the parameter contained in the estimator’s indirect observation of the samples will be a contraction of the amount contained in the samples. We use strong data processing inequalities to quantify this contraction of information and to couple the communication constraint and the statistical constraint together in the upper bounds. When there are multiple processors, strong data processing inequalities also give an upper bound that decreases as the samples and communication resources are distributed to more processors, which reflects the penalty of decentralization. In addition, we rely on a cutset analysis that chooses the conditioning random variable to consist of all the samples seen by only a subset of the processors; this choice is useful for analyzing the situation where the processors observe sample sets that are dependent conditionally on the parameter.

Finally, by combining the upper and lower bounds on the conditional mutual information, we obtain lower bounds on the Bayes risk.

I-C Related works

The early works on the fundamental limits of decentralized estimation mainly focused on the asymptotic setting, e.g., determining the error exponent in multiterminal hypothesis testing with fixed quantization rates. Those works are surveyed by Han and Amari [1]. In recent years, the focus has shifted towards determining explicit dependence of the estimation performance on the communication constraint (see, e.g., [2, 3, 4, 5, 6] and references therein). For instance, Zhang et al. [2] and Duchi et al. [3] derived lower bounds on the minimax risk of several decentralized estimation problems with noiseless communication channels. Their results also provide lower bounds on the number of bits needed in quantization to achieve the same minimax rate as in the centralized estimation. Garg et al. [4] extended the lower bound for interactive protocols in [2], which centered on the one-dimensional Gaussian location model, to the setting of high-dimensional Gaussian location models. Braverman et al. [5] presented lower bounds for decentralized estimation of a sparse multivariate Gaussian mean. Their derivation is based on a “distributed data processing inequality,” which quantifies the information loss in decentralized binary hypothesis testing under the Gaussian location model. Shamir [6] showed that the analysis of several decentralized estimation and online learning problems can be reduced to a certain meta-problem involving discrete parameter estimation with interactive protocols, and derived minimax lower bounds for this meta-problem.

The main idea underlying all of the above works is that one has to quantify the contraction of information due to the communication constraint; however, this is often done in a case-by-case manner for each particular problem, and the resulting contraction coefficients are generally not sharp. Additionally, these works only consider the situation where the sample sets are conditionally independent given the parameter and where the communication channels connecting the processors to the estimator are noiseless.

By contrast, we derive general lower bounds on the Bayes risk, which automatically serve as lower bounds on the minimax risk. We use strong data processing inequalities as a unifying general method for quantifying the contraction of mutual information in decentralized estimation. Our results apply to general priors, sample generating models, and distortion functions. When particularized to the examples in the existing works, our results can lead to sharper lower bounds on both the Bayes and the minimax risk. For example, we improve the lower bound for the mean estimation on the unit cube studied in [2], as well as the lower bound for the meta-problem of Shamir [6]. Moreover, we consider the situations where the sample sets are conditionally dependent and where the communication channels are noisy. We also provide a general way to quantify the degradation of estimation performance caused by distributing resources to multiple processors, which is only discussed for specific examples in existing works.

I-D Notation

In this paper, all logarithms are binary, unless stated otherwise. A vector like $(a_1,\ldots,a_m)$ may be abbreviated as $a^m$; for $i \le j$, $a_i^j \triangleq (a_i,\ldots,a_j)$. For an integer $m$, $[m] \triangleq \{1,\ldots,m\}$. For functions $f$ and $g$, $f \lesssim g$ means that $f = O(g)$, while $f \gtrsim g$ means that $f = \Omega(g)$. We use $h_2(\cdot)$ and $d_2(\cdot\|\cdot)$ to denote the binary entropy and the binary relative entropy functions.

II Bayes risk lower bounds for centralized estimation

In the standard Bayesian estimation framework, a family of distributions on an observation space is indexed by a parameter, and the parameter space is endowed with a prior distribution. Given the parameter, a sample is generated from the corresponding distribution. In centralized estimation, the unknown random parameter is estimated directly from the sample via an estimator. Given a non-negative distortion function, the Bayes risk for estimating the parameter from the sample with respect to this distortion function is defined as

(1)

In this section, we derive lower bounds on the Bayes risk in the context of centralized estimation. These bounds serve as lower bounds for the decentralized setting as well, but they can also be used to derive refined lower bounds for decentralized estimation, as shown in Sections IV and V. We first present lower bounds on the Bayes risk based on small ball probability, mutual information, and information density in Sections II-A and II-B. These lower bounds apply to estimation problems with an arbitrary joint distribution of the parameter and the sample and an arbitrary distortion function, and also provide generalizations of Fano’s inequality, as discussed in Section II-C. Next, in Section II-D, we present a lower bound based on mutual information and differential entropy, which applies to parameter estimation problems in Euclidean space, with distortion functions given by a power of a norm of the estimation error.

II-A Lower bounds based on mutual information and small ball probability

The small ball probability of the parameter with respect to the distortion function is defined as

(2)

Given another random variable jointly distributed with the parameter, the conditional small ball probability of the parameter given this random variable is defined as

(3)

These two quantities measure the spread of the prior distribution and of the conditional distribution of the parameter, respectively. The smaller the small ball probability, the more spread out the corresponding distribution is w.r.t. the distortion function. We give a lower bound on the probability of excess distortion in terms of conditional mutual information and conditional small ball probability:

Lemma 1.

For any estimate of the parameter, any distortion threshold, and any auxiliary random variable,

(4)
Proof:

The inequality (4) is a direct consequence of a lower bound, obtained in [7], on the conditional mutual information in terms of the excess-distortion probability and the conditional small ball probability.

In Appendix A, we present an alternative unified proof of Lemmas 1 and 2 using properties of the Neyman–Pearson function. ∎

Our first lower bound on the Bayes risk for centralized estimation is an immediate consequence of Lemma 1:

Theorem 1.

The Bayes risk for estimating the parameter based on the sample with respect to the distortion function satisfies

(5)

In particular,

(6)
Proof:

For an arbitrary estimator,

(7)

by the data processing inequality. It follows from Lemma 1 that

(8)

Theorem 1 follows from Markov’s inequality and from the arbitrariness of the estimator, the distortion threshold, and the auxiliary random variable. ∎

Remark 1.

Precise evaluation of the expected conditional small ball probability in Theorem 1 can be difficult. The following technique may sometimes be useful: suppose we can upper-bound the small ball probability by some increasing function of the radius, which has an inverse function. Given a target value, choosing a suitable radius such that

(9)

guarantees

(10)

It then follows from Theorem 1 that

(11)
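As a concrete instance of the quantities in (2)–(3) and of the technique in Remark 1, the following minimal sketch computes the small ball probability of a scalar Gaussian prior under absolute distortion, a simple increasing upper bound on it, and the inverse of that upper bound. The Gaussian prior, its standard deviation, and the function names are our own illustrative assumptions, not taken from the text.

```python
from math import erf, sqrt, pi

sigma = 1.0  # assumed prior standard deviation (illustrative choice)

def small_ball(rho):
    # L(rho) = sup_w P(|W - w| <= rho) for W ~ N(0, sigma^2); by symmetry and
    # unimodality the supremum is attained at w = 0.
    return erf(rho / (sigma * sqrt(2)))

def psi(rho):
    # Increasing upper bound psi(rho) >= L(rho), obtained by bounding the
    # Gaussian density by its peak value 1/(sigma*sqrt(2*pi)).
    return min(1.0, 2 * rho / (sigma * sqrt(2 * pi)))

def psi_inv(t):
    # Inverse of psi on [0, 1], as required by the technique in Remark 1.
    return t * sigma * sqrt(2 * pi) / 2

for rho in [0.1, 0.5, 1.0]:
    assert small_ball(rho) <= psi(rho) + 1e-12
    print(f"rho={rho:.1f}  L(rho)={small_ball(rho):.4f}  psi(rho)={psi(rho):.4f}")
print("psi_inv(0.25) =", psi_inv(0.25))
```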

A similar methodology for deriving lower bounds on the Bayes risk has been recently proposed by Chen et al. [8], who obtained unconditional lower bounds similar to (6) in terms of general f-informativities [9] and a quantity essentially the same as the small ball probability. However, as will be shown later, the conditional lower bound (5) can lead to tighter results compared to the unconditional version (6), and is also useful in the context of decentralized estimation problems.

For the problem of estimating the parameter based on samples that are conditionally i.i.d. given the parameter, we can choose the conditioning random variable in (5) to be a conditionally independent copy of the sample set — that is, a copy that is independent of the observed samples given the parameter and has the same conditional distribution. This choice leads to

(12)

We then need to evaluate or upper-bound the conditional mutual information and the expected conditional small ball probability appearing in (12). For example, in the smooth parametric case, where the model is a subset of a finite-dimensional exponential family and the prior has a density supported on a compact subset of the parameter space, it was shown by Clarke and Barron [10, 11] that

(13)

where the expression involves the differential entropy of the prior and the Fisher information matrix about the parameter contained in a single sample. When (13) holds, we have

(14)
(15)

meaning that the conditional mutual information in (12) is asymptotically independent of the sample size. Upper-bounding the expected conditional small ball probability is more problem-specific. We give two examples below, in both of which we consider the absolute distortion, so that the Bayes risk is the minimum mean absolute error (MMAE). A benefit of lower-bounding the MMAE is that the square of the resulting lower bound also serves as a lower bound on the minimum mean squared error (MMSE).

Example 1 (Estimating Gaussian mean with Gaussian prior).

Consider the case where the parameter has a Gaussian prior, each sample equals the parameter plus Gaussian noise that is i.i.d. across samples and independent of the parameter, and the distortion is the absolute error.

From the conditional lower bound (12), we get the following lower bound for Example 1:

Corollary 1.

In Example 1, the Bayes risk is lower bounded by

(16)
Proof:

Appendix B. ∎

Note that the MMAE in Example 1 is upper-bounded by

(17)

which is achieved by the posterior mean estimator. Thus the non-asymptotic lower bound on the Bayes risk in (16) captures the correct dependence on the sample size, and is off from the true Bayes risk by a constant factor. If we apply the unconditional lower bound (6) to Example 1, we can only get an asymptotic lower bound

(18)

which differs from the upper bound by a logarithmic factor in the sample size. This example shows that the conditional lower bound (5) can provide tighter results than its unconditional counterpart (6).
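For reference, the following sketch evaluates the exact MMAE of the Gaussian location model with a Gaussian prior under an assumed parametrization of our own (unit prior and noise variances, not the values used in the original example), and checks the scaling discussed above.

```python
import numpy as np

# Assumed parametrization (our own illustrative choice, not from the text):
# W ~ N(0, sw2), X_i = W + Z_i with Z_i ~ N(0, s2) i.i.d. given W.
sw2, s2 = 1.0, 1.0

def mmae(n):
    # The posterior of W given X^n is Gaussian with variance 1/(n/s2 + 1/sw2);
    # its mean (= median) minimizes the expected absolute error, so
    # MMAE = sqrt(2/pi) * posterior standard deviation.
    post_var = 1.0 / (n / s2 + 1.0 / sw2)
    return np.sqrt(2.0 / np.pi) * np.sqrt(post_var)

for n in [10, 100, 1000]:
    print(f"n={n:5d}  MMAE={mmae(n):.4f}  sqrt(n)*MMAE={np.sqrt(n) * mmae(n):.4f}")
# sqrt(n)*MMAE approaches a constant, matching the Theta(1/sqrt(n)) behavior of
# the upper bound (17) and of the lower bound (16), up to constant factors.
```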

Example 2 (Estimating Bernoulli bias with uniform prior).

Consider the example where the parameter is uniformly distributed on the unit interval, the samples are i.i.d. Bernoulli with bias equal to the parameter, conditionally on the parameter, and the distortion is the absolute error.

Corollary 2.

In Example 2, the Bayes risk is lower bounded by

(19)
Proof:

Appendix B. ∎

Note that the MMAE in Example 2 is upper bounded by

(20)

achieved by the sample mean estimator. Thus, the lower bound in (19) asymptotically captures the correct dependence on the sample size, and is off from the true Bayes risk by a constant factor.
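A similar check for Example 2: the sketch below estimates the mean absolute error of the sample-mean estimator by Monte Carlo under the uniform-prior Bernoulli model and verifies the same square-root scaling; the trial count and seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def mae_sample_mean(n, trials=200_000):
    # W ~ Uniform(0,1); X_1,...,X_n ~ Bern(W) i.i.d. given W; estimator = sample mean.
    w = rng.random(trials)
    xbar = rng.binomial(n, w) / n
    return np.mean(np.abs(w - xbar))

for n in [10, 100, 1000]:
    m = mae_sample_mean(n)
    print(f"n={n:5d}  E|W - Xbar|={m:.4f}  sqrt(n)*E|W - Xbar|={np.sqrt(n) * m:.4f}")
# The scaled values stabilize, consistent with the Theta(1/sqrt(n)) rate implied
# by the upper bound (20) and the lower bound (19).
```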

II-B Lower bounds based on information density and small ball probability

For a joint distribution of the parameter, an observation, and an auxiliary random variable, define the conditional information density as

(21)

We give a lower bound on the probability of excess distortion in terms of conditional information density and conditional small ball probability:

Lemma 2.

For any estimate of the parameter based on the sample, any distortion threshold, and any auxiliary random variable,

(22)
Proof:

The proof, inspired by the metaconverse technique from [12], is given in Appendix A. ∎

Our second Bayes risk lower bound for centralized estimation is a consequence of Lemma 2:

Theorem 2.

The Bayes risk for estimating the parameter based on the sample with respect to the distortion function satisfies

(23)

In particular,

(24)
Proof:

With the aid of Markov’s inequality, (22) leads to the inequality

(25)

The lower bound in (23) then follows. ∎

We give a high-dimensional example to illustrate the usefulness of Theorem 2:

Example 3 (Estimating a multidimensional Gaussian mean with a uniform prior on a ball).

Consider the case where the parameter is distributed uniformly on a ball, each sample equals the parameter plus Gaussian noise that is i.i.d. across samples and independent of the parameter, and the distortion is the norm of the estimation error.

Corollary 3.

In Example 3, for any choice of the free parameters in the bound, the Bayes risk is lower bounded by

(26)
Proof:

Appendix C. ∎

Note that the Bayes risk in Example 3 is upper bounded by

(27)

achieved by the sample mean estimator. Thus, the lower bound in (26) captures the correct dependence on both the sample size and the dimension, and is off from the true Bayes risk by a constant factor. Moreover, by squaring (26), we get a lower bound on the MMSE that also captures the correct dependence on both quantities.

II-C Generalizations of Fano’s inequality

The lower bounds on the probability of excess distortion in Lemmas 1 and 2 can be viewed as generalizations of Fano’s inequality.

When the parameter takes values in a countable set and the distortion is the indicator of an estimation error, specializing (4) without conditioning on the auxiliary random variable recovers the following generalization of Fano’s inequality due to Han and Verdú [13]:

(28)

Similarly, specializing (22) without conditioning on the auxiliary random variable, we get

(29)

When the parameter is uniformly distributed on a finite set, (28) reduces to the usual Fano’s inequality

(30)

while (29) reduces to the Poor–Verdú bound [14]

(31)
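To make the discrete case concrete, the following snippet evaluates the commonly used weakened form of Fano’s inequality for a parameter uniform on a finite set; the function name and the numbers are our own illustrative choices.

```python
import numpy as np

def fano_lower_bound(mutual_info_bits, M):
    # Weakened form of Fano's inequality for a parameter uniform on M values:
    # P(error) >= 1 - (I(W; estimate) + 1) / log2(M).
    return max(0.0, 1.0 - (mutual_info_bits + 1.0) / np.log2(M))

# E.g., 10 bits of information about a parameter uniform on 2^20 values:
print(fano_lower_bound(10.0, 2 ** 20))  # -> 0.45
```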

When the parameter is continuous, Eqs. (4) and (22) provide continuum generalizations of Fano’s inequality. For example, when the parameter takes values in a bounded subset of Euclidean space and the distortion is a norm of the estimation error, (4) leads to

(32)

which is also obtained by Chen et al. [8], and generalizes the result of Duchi and Wainwright [15]. Similarly, (22) leads to

(33)

II-D Lower bounds based on mutual information and differential entropy

For the problem of estimating a real-valued parameter with respect to the quadratic distortion, it can be shown (see, e.g., [16, Lemma 5]) that, provided the parameter has finite differential entropy,

(34)

Upper-bounding the mutual information between the parameter and its estimate by the mutual information between the parameter and the sample, we obtain a lower bound on the MMSE

(35)
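As an illustration of (34)–(35), the following sketch evaluates the entropy-based MMSE bound in a Gaussian location model, where it happens to be tight. The parametrization (unit prior and noise variances) is our own assumption, and the bound is written in the standard form $(1/2\pi e)\,2^{2(h(W) - I(W;X^n))}$, which is how we read (35); the constant should be checked against the original statement.

```python
import numpy as np

# Assumed Gaussian location model (our own parametrization): W ~ N(0, sw2),
# X_i = W + Z_i with Z_i ~ N(0, s2) i.i.d. given W.
sw2, s2, n = 1.0, 1.0, 50

h_W = 0.5 * np.log2(2 * np.pi * np.e * sw2)   # differential entropy of W, in bits
I_WXn = 0.5 * np.log2(1 + n * sw2 / s2)       # I(W; X^n) for this model, in bits

# Entropy-based MMSE bound: MMSE >= (1/(2*pi*e)) * 2^(2*(h(W) - I(W; X^n))).
mmse_lb = (1 / (2 * np.pi * np.e)) * 2 ** (2 * (h_W - I_WXn))
mmse_true = sw2 * s2 / (s2 + n * sw2)         # exact MMSE in this model

print(mmse_lb, mmse_true)  # both equal sw2*s2/(s2 + n*sw2): the bound is tight here
```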

More generally, for the problem of estimating a parameter taking values in a finite-dimensional Euclidean space, the Shannon lower bound on the rate-distortion function (see, e.g., [17, Chap. 4.8]) can be used to show that, if the distortion is an arbitrary norm of the estimation error raised to an arbitrary positive power, then

(36)

where the constant involves the volume of the unit ball of the chosen norm and the gamma function. For example, this method can be used to recover the lower bounds of Seidler [18] for the problem of estimating a finite-dimensional parameter with respect to squared weighted norms, and gives tight lower bounds on the Bayes risk and the minimax risk in high-dimensional estimation problems [19, Lec. 13]. A simple extension of (36) via an auxiliary random variable gives

(37)

As a consequence, we obtain a lower bound on the Bayes risk in terms of conditional mutual information and conditional differential entropy:

Theorem 3.

For an arbitrary norm on the parameter space and an arbitrary positive exponent, the Bayes risk for estimating the parameter based on the sample with respect to the corresponding distortion function satisfies

(38)

In particular, for estimating a real-valued parameter with respect to the absolute distortion,

(39)

The advantage of Theorem 3 is that its unconditional version can yield tighter Bayes risk lower bounds than the unconditional version of Theorem 1. For example, consider the case where the parameter is uniformly distributed on an interval and is estimated based on the sample with respect to the absolute distortion. Choosing the upper-bounding function in Remark 1 appropriately and optimizing the radius in (11), the unconditional version of Theorem 1 yields an asymptotic lower bound

(40)

By contrast, the unconditional version of Theorem 3 yields a tighter and non-asymptotic lower bound

(41)

III Mutual information contraction via SDPI

While the results in Section II all apply to general estimation problems, either centralized or decentralized, the results in terms of mutual information (Theorems 1 and 3) are particularly amenable to tightening in the context of decentralized estimation. For example, Theorem 1 reveals two sources of the difficulty of estimating the parameter: the spread of the prior distribution or of its conditional counterpart, captured by the corresponding small ball probability, and the amount of information about the parameter contained in the sample, captured by the corresponding mutual information. When an estimator does not have direct access to the sample, but can only receive information about it from one or more local processors, the amount of information about the parameter contained in the estimator’s indirect observations will contract relative to the amount contained in the sample. The contraction is caused by the communication constraints between the local processors and the estimator, such as finite precision of analog-to-digital conversion, storage limitations for intermediate results, limited transmission blocklength, channel noise, etc.

We will quantify this contraction of mutual information through strong data processing inequalities, or SDPI’s, for the relative entropy (see [20] and references therein). Given a stochastic kernel (or channel) $K$ with input alphabet $\mathsf{X}$ and output alphabet $\mathsf{Y}$, and a reference input distribution $P$ on $\mathsf{X}$, we say that $K$ satisfies an SDPI at $P$ with constant $c \in [0,1)$ if $D(QK \| PK) \le c\, D(Q \| P)$ for any other input distribution $Q$ on $\mathsf{X}$. Here, $QK$ denotes the marginal distribution of the channel output when the input has distribution $Q$. The SDPI constants of $K$ are defined by

$$\eta_{\rm KL}(P, K) \triangleq \sup_{Q:\, Q \neq P} \frac{D(QK \| PK)}{D(Q \| P)} \qquad \text{and} \qquad \eta_{\rm KL}(K) \triangleq \sup_{P} \eta_{\rm KL}(P, K).$$

It is shown in [21] that the SDPI constants are also the maximum contraction ratios of mutual information in a Markov chain: for a Markov chain $U \to X \to Y$,

(42)

if the joint distribution of $X$ and $Y$ is fixed, and

(43)

if only the channel is fixed. This fact leads to the following SDPI’s for mutual information:

(44)

It is generally hard to compute the SDPI constant for an arbitrary pair of input distribution and channel, except in some special cases:

  • For the binary symmetric channel $\mathrm{BSC}(p)$, $\eta_{\rm KL}(\mathrm{BSC}(p)) = (1-2p)^2$ [22] (a numerical check of this value is sketched after this list).

  • For the binary erasure channel with erasure probability $\varepsilon$, $\eta_{\rm KL} = 1-\varepsilon$.

  • If $X$ and $Y$ are jointly Gaussian with correlation coefficient $\rho$, then [23]

    (45)
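As a sanity check on the BSC value quoted above, the following sketch numerically approximates the supremum in the definition of $\eta_{\rm KL}$ over a grid of binary input distributions; all function names are ours, and the grid maximum only approaches the exact constant from below.

```python
import numpy as np

def kl(p, q):
    # Binary KL divergence D(Bern(p) || Bern(q)), in nats.
    eps = 1e-12
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bsc_output(a, p):
    # P(Y = 1) when the input is Bern(a) and the channel is BSC(p).
    return a * (1 - p) + (1 - a) * p

def eta_kl_bsc(p, grid=np.linspace(0.01, 0.99, 199)):
    # Approximate sup_{Q != P} D(QK || PK) / D(Q || P) over a grid of binary
    # input distributions P = Bern(a), Q = Bern(b).
    best = 0.0
    for a in grid:
        for b in grid:
            if abs(a - b) < 1e-9:
                continue
            best = max(best, kl(bsc_output(b, p), bsc_output(a, p)) / kl(b, a))
    return best

p = 0.1
print(eta_kl_bsc(p), (1 - 2 * p) ** 2)  # the grid maximum approaches (1 - 2p)^2 = 0.64
```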

In the remainder of this section, we collect a few upper bounds and properties of the SDPI constants, which will be used in the sequel. The first upper bound is due to Cohen et al. [24]:

Lemma 3.

Define the Dobrushin contraction coefficient of a stochastic kernel by

(46)

Then

(47)
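Lemma 3 is easy to apply numerically: for a channel with finite alphabets, the Dobrushin coefficient is the largest total-variation distance between two rows of the transition matrix. A minimal sketch (function names ours):

```python
import numpy as np
from itertools import combinations

def dobrushin_coefficient(K):
    # K is a row-stochastic matrix with K[x, y] = P(Y = y | X = x); the Dobrushin
    # coefficient is the largest total-variation distance between two rows.
    K = np.asarray(K, dtype=float)
    return max(0.5 * np.abs(K[i] - K[j]).sum()
               for i, j in combinations(range(K.shape[0]), 2))

p = 0.1
bsc = np.array([[1 - p, p], [p, 1 - p]])
print(dobrushin_coefficient(bsc))  # 1 - 2p = 0.8, an upper bound on eta_KL = (1 - 2p)^2
```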

The next upper bound is proved in [20, Remark 3.2] for arbitrary f-divergences:

Lemma 4.

Suppose there exist a constant and a probability distribution such that the following holds (in Markov chain theory, this is known as a Doeblin minorization condition):

(48)

Then

(49)

Lemma 4 leads to the following property:

Lemma 5.

For a given joint distribution of a pair of random variables, suppose there is a constant such that the forward channel satisfies

(50)

Then the SDPI constants of the forward channel and the backward channel satisfy

(51)
Proof:

To prove the claim for the forward channel, take the minorizing distribution in Lemma 4 to be the marginal distribution of the channel output; the condition in Lemma 4 is then satisfied by the assumption in (50). To prove the claim for the backward channel, consider any pair of input and output values. Then

(52)
(53)
(54)
(55)

where (54) uses the assumption in (50). Applying Lemma 4 to the backward channel, we get the result. ∎

In decentralized estimation, we will encounter a particular SDPI constant associated with the joint distribution of the parameter and the samples. The following lemma gives an upper bound on this SDPI constant, which is often easier to compute:

Lemma 6.

If the three random variables in question form a Markov chain, then

(56)

In particular, the intermediate random variable can be any sufficient statistic of the sample for estimating the parameter.

Proof:

It suffices to show that, for any auxiliary random variable such that the required Markov chain holds,

(57)

Indeed, by the definition of the SDPI constant and the Markov chain assumption,

(58)
(59)

which proves (57) and the lemma. ∎

For product input distributions and product channels, the SDPI constant tensorizes [21] (see [20] for a more general result for other f-divergences):

Lemma 7.

For distributions $P_1, \ldots, P_n$ on alphabets $\mathsf{X}_1, \ldots, \mathsf{X}_n$ and channels $K_1, \ldots, K_n$ with input alphabets $\mathsf{X}_1, \ldots, \mathsf{X}_n$,

$$\eta_{\rm KL}\big(P_1 \otimes \cdots \otimes P_n,\, K_1 \otimes \cdots \otimes K_n\big) = \max_{1 \le i \le n} \eta_{\rm KL}(P_i, K_i).$$

Finally, the following lemma due to Polyanskiy and Wu [25] gives an SDPI for multiple uses of a channel:

Lemma 8.

Consider sending a message through multiple uses of a memoryless channel with feedback, where each channel input is produced by an encoder from the message and the previously received channel outputs. Then, for any random variable that is conditionally independent of the channel inputs and outputs given the message,

(60)

In particular, the result holds when the channel is used multiple times without feedback.

Proof:

For each number of channel uses, consider the mutual information between this random variable and the channel outputs received so far. Then

(61)
(62)
(63)
(64)

where (62) follows from the Markov chain structure of the encoding and a conditional version of the SDPI [16, Lemma 1]; (64) follows from another Markov chain implied by the problem setup. Unrolling the above recursive upper bound and noting that the mutual information is zero before any channel output is received, we get (60). ∎

Using the same proof technique, it can be shown [16, Lemma 2] that, for the product of multiple copies of a channel,

(65)
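Under our reading, (65) bounds the contraction constant of repeated channel uses by a quantity of the form $1 - (1-\eta)^n$; the following sketch tabulates this quantity to show how repeated uses weaken the per-use contraction. The numbers are illustrative.

```python
def product_contraction_bound(eta, n):
    # Bound of the form 1 - (1 - eta)^n on the contraction constant of n uses
    # of a channel whose single-use SDPI constant is eta.
    return 1 - (1 - eta) ** n

eta = 0.64  # e.g., a BSC with crossover probability 0.1
for n in [1, 2, 5, 10, 20]:
    print(n, round(product_contraction_bound(eta, n), 4))
# The bound increases towards 1: repeated channel uses gradually wash out the
# per-use contraction of information.
```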

IV Decentralized estimation: single processor setup

We start the discussion of decentralized estimation with the single-processor setup. Consider the following decentralized estimation problem with one local processor, shown schematically in Fig. 1:

Fig. 1: Model of decentralized estimation (single processor).
  • The parameter of interest is an unknown random quantity (discrete or continuous, scalar or vector) with a given prior distribution.

  • Conditional on the parameter, the samples are independently drawn from the corresponding distribution.

  • The local processor observes the samples and maps them to a message of a fixed number of bits.

  • The encoder maps the message to a codeword of a fixed blocklength, and transmits it over a discrete memoryless channel (DMC). We allow the possibility of feedback from the estimator to the processor, in which case each channel input may also depend on the previously received channel outputs.

  • The estimator computes an estimate of the parameter based on the received channel outputs.

The Bayes risk in the single processor setup is defined as

(66)

which depends on the problem specification, including the prior, the sample-generating model, the number of samples, the message length, the channel, and the blocklength. We can use the unconditional versions of Theorems 1 and 3 to obtain lower bounds on this Bayes risk, by replacing the sample with the received channel output. To reveal the dependence of the Bayes risk on the various elements of the problem specification, we need an upper bound on the mutual information between the parameter and its estimate that does not depend on the particular choice of the quantizer and the encoder:

Theorem 4.

In decentralized estimation with a single processor, for any choice of the quantizer and the encoder,

(67)

where the capacity term is the Shannon capacity of the communication channel, and

(68)
Proof:

When the channel is used with feedback, the problem setup gives rise to a Markov chain from the parameter through the samples, the message, and the channel inputs to the channel outputs. As a consequence of Lemma 8, we have

(69)

Alternatively,

(70)
(71)
(72)

where (71) is from the SDPI in (44); (72) holds because the mutual information is upper-bounded by the entropy of the quantized message, which is at most its length in bits. Lastly, from the SDPI and following the proof that feedback does not increase the capacity of a discrete memoryless channel [26],

We complete the proof for the case with feedback by taking the minimum of the three resulting estimates to get the tightest bound on the mutual information between the parameter and its estimate.

When the channel is used without feedback, we have the Markov chain from the parameter through the samples and the message to the channel outputs. In this case, (69) holds as a consequence of the SDPI. The rest of the proof for this case is the same as in the case with feedback. ∎

Note that, with the ordinary data processing inequality, we can only get the upper bound

(73)

where the first term reflects the statistical constraint due to the finite number of samples, the second term reflects the communication constraint due to the quantization, and the third term reflects the communication constraint due to the noisy channel. All of these terms are tightened in (67) via multiplication by various contraction coefficients. Thus, using the SDPI, we can tighten the results of Theorems 1 and 3 in the setting of decentralized estimation by quantifying the communication constraint, and by coupling the statistical constraint and the communication constraint together.
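To illustrate how these upper bounds translate into risk bounds, the following sketch combines the looser bound (73) with the entropy-based bound (35) in an assumed Gaussian location model with an assumed bit budget and channel capacity; Theorem 4 would shrink the mutual-information bound further by the contraction coefficients, and hence can only strengthen the resulting risk lower bound. All parameter values below are our own illustrative choices, not taken from the paper.

```python
import numpy as np

# Assumed toy setup (our own illustrative choices, not from the paper's examples):
# W ~ N(0, sw2), X_i = W + Z_i with Z_i ~ N(0, s2) i.i.d. given W, a b-bit
# quantized message, and nc uses of a channel of Shannon capacity C bits/use.
sw2, s2, n = 1.0, 1.0, 1000
b, nc, C = 3, 10, 0.5

I_WXn = 0.5 * np.log2(1 + n * sw2 / s2)  # statistical constraint: I(W; X^n)
I_ub = min(I_WXn, b, nc * C)             # ordinary data processing bound, as in (73)

# Entropy-based risk bound in the spirit of (35):
h_W = 0.5 * np.log2(2 * np.pi * np.e * sw2)
mmse_lb = (1 / (2 * np.pi * np.e)) * 2 ** (2 * (h_W - I_ub))

mmse_centralized = sw2 * s2 / (s2 + n * sw2)  # exact MMSE with direct access to X^n

print(I_WXn, I_ub)                # ~4.98 bits in the samples, but only 3 bits get through
print(mmse_lb, mmse_centralized)  # the bit budget, not the sample size, limits the risk
```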

Next we study a few examples of this problem setup to illustrate the effectiveness of using Theorem 4 to derive lower bounds on the Bayes risk.

IV-A Transmitting a bit over a BSC

Example 4.

Consider the case where the parameter takes one of two values with equal probabilities, the local processor directly observes the parameter, and communicates its value to the estimator through repeated uses of a binary symmetric channel. Formally, the parameter is