On Binary Distributed Hypothesis Testing


Abstract

We consider the problem of distributed binary hypothesis testing of two sequences that are generated by an i.i.d. doubly-binary symmetric source. Each sequence is observed by a different terminal. The two hypotheses correspond to different levels of correlation between the two source components, i.e., the crossover probability between the two. The terminals communicate with a decision function via rate-limited noiseless links. We analyze the tradeoff between the exponential decay of the two error probabilities associated with the hypothesis test and the communication rates. We first consider the side-information setting where one encoder is allowed to send the full sequence. For this setting, previous work exploits the fact that a decoding error of the source does not necessarily lead to an erroneous decision upon the hypothesis. We provide improved achievability results by carrying out a tighter analysis of the effect of binning error; the results are also more complete as they cover the full exponent tradeoff and all possible correlations. We then turn to the setting of symmetric rates for which we utilize Körner-Marton coding to generalize the results, with little degradation with respect to the performance with a one-sided constraint (side-information setting).


1 Introduction

We consider the distributed hypothesis testing (DHT) problem, where there are two distributed sources, X and Y, and the hypotheses are given by

$H_0:\ (X,Y) \sim P^0_{XY}$ (1a)
$H_1:\ (X,Y) \sim P^1_{XY}$ (1b)

where $P^0_{XY}$ and $P^1_{XY}$ are different joint distributions of X and Y. The test is performed based on information sent from two distributed terminals (over noiseless links), each observing i.i.d. realizations of a different source, where the rate of the information sent from each terminal is constrained. This setup, introduced in [[1], [2]], gives rise to a tradeoff between the information rates and the probabilities of the two types of error events. In this work we focus on the exponents of these error probabilities, with respect to the number of observations $n$.

When at least one of the marginal distributions depends on the hypothesis, a test can be constructed based only on the type of the corresponding sequence. Although this test may not be optimal, it results in non-trivial performance (positive error exponents) with zero rate. In contrast, when the marginal distributions are the same under both hypotheses, a positive exponent cannot be achieved using a zero-rate scheme, see [[3]].

One may achieve positive exponents while maintaining low rates, by effectively compressing the sources and then basing the decision upon their compressed versions. Indeed, many of the works that have considered the distributed hypothesis testing problem bear close relation to the distributed compression problem.

Ahlswede and Csiszár [[4]] have suggested a scheme based on compression without taking advantage of the correlation between the sources; Han [[5]] proposed an improved scheme along the same lines. Correlation between the sources is exploited by Shimokawa et al. [[6], [7]] to further reduce the coding rate, incorporating random binning following the Slepian-Wolf [[8]] and Wyner-Ziv [[9]] schemes. Rahman and Wagner [[10]] generalized this setting and also derived an outer bound. They also give a “quantize and bin” interpretation to the results of [[6]]. Other related works include [[11], [12], [13], [14], [15]]. See [[16], [10]] for further references.

We note that in spite of considerable efforts over the years, the problem remains open. In many cases, the gap between the achievability results and the few known outer bounds is still large. Specifically, some of the stronger results are specific to testing against independence (i.e., under one of the hypotheses and are independent), or specific to the case where one of the error exponents is zero (“Stein’s-Lemma” setting). The present work significantly goes beyond previous works, extending and improving the achievability bounds. Nonetheless, the refined analysis comes at a price. Namely, in order to facilitate analysis, we choose to restrict attention to a simple source model.

To that end, we consider the case where $(X,Y)$ is a doubly symmetric binary source (DSBS). That is, X and Y are each binary and symmetric. Let $Z = X \oplus Y$ be the modulo-two difference between the sources. We consider the following two hypotheses:

$H_0:\ Z \sim \mathrm{Ber}(p_0)$ (2a)
$H_1:\ Z \sim \mathrm{Ber}(p_1)$ (2b)

where we assume throughout that $p_0 < p_1 \le 1/2$. Note that a sufficient statistic for hypothesis testing in this case is the weight (which is equivalent to the type) of the noise sequence Z. Under communication rate constraints, a plausible approach would be to use a distributed compression scheme that allows lossy reconstruction of the sequence Z, and then base the decision upon that sequence.
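As a concrete illustration of this sufficient statistic, the following Python sketch implements the centralized threshold test on the normalized weight of Z; the parameter values and the function names (dsbs_sample, weight_threshold_test) are illustrative choices rather than notation used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dsbs_sample(n, p, rng):
    """Draw one DSBS block: X symmetric, Z ~ Ber(p) i.i.d., Y = X xor Z."""
    x = rng.integers(0, 2, size=n)
    z = (rng.random(n) < p).astype(int)
    return x, x ^ z

def weight_threshold_test(x, y, t):
    """Decide H0 (return 0) iff the normalized weight of Z = X xor Y is at most t."""
    return 0 if np.mean(x ^ y) <= t else 1

n, p0, p1 = 10_000, 0.05, 0.25
t = 0.12                            # any threshold between p0 and p1 trades off the two errors
x, y = dsbs_sample(n, p0, rng)      # data generated under H0
print(weight_threshold_test(x, y, t))   # typically 0: the weight concentrates around p0
```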

We first consider a one-sided rate constraint. That is, the Y-encoder is allocated the full rate of one bit per source sample, so that the Y sequence is available as side information at the decision function. In this case, compression of Z amounts to compression of X; a random binning scheme is optimal for this task of compression, lossless or lossy. Indeed, in this case, the best known achievability result is due to [[6]], which basically employs a random binning scheme.

A natural question that arises when using binning as part of the distributed hypothesis testing scheme is the effect of a "bin decoding error" on the decision error between the hypotheses. The connection between these two errors is non-trivial, as a bin decoding error inherently results in a "large" noise reconstruction error, much in common with errors in channel coding (in the context of syndrome decoding). Specifically, when a binning error occurs, the reconstruction of the noise sequence Z is roughly consistent with an i.i.d. Bernoulli(1/2) distribution. Thus, if one feeds the weight of this reconstructed sequence to a simple threshold test, it would typically result in deciding that the noise was distributed according to $p_1$, regardless of whether that is the true distribution or not. This effect causes an asymmetry between the two error probabilities associated with the hypothesis test. Indeed, as the Stein exponent corresponds to highly asymmetric error probabilities, the exponent derived in [[6]] may be interpreted as taking advantage of this effect.
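The following sketch illustrates this effect under the worst-case model described above, in which the output of an erroneous bin decoding is replaced by a uniformly drawn sequence; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p0, t = 10_000, 0.05, 0.12

# True situation: H0 holds, so Z ~ Ber(p0) and Y = X xor Z.
x = rng.integers(0, 2, size=n)
y = x ^ (rng.random(n) < p0).astype(int)

# Worst-case model of a bin-decoding error: the decoder outputs an unrelated
# sequence x_hat, modeled here as uniformly distributed.
x_hat = rng.integers(0, 2, size=n)

print(np.mean(x ^ y))        # ~0.05: the true noise weight would pass the threshold t
print(np.mean(x_hat ^ y))    # ~0.5 : the reconstructed noise looks Ber(1/2), so the test declares H1
```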

The contribution of the present work is twofold. First we extend and strengthen the results of [[6]]. By explicitly considering and leveraging the properties of good codes, we bound the probability that the sequence Z happens to be such that it is very close to some wrong yet "legitimate" X, much like an undetected error event in erasure decoding [[17]]. This allows us to derive achievability results for the full tradeoff region, namely the tradeoff between the error exponents corresponding to the two types of hypothesis testing errors.

The second contribution is in considering a symmetric-rate constraint. For this case, the optimal distributed compression scheme for Z is the Körner-Marton scheme [[18]], which requires each of the users to communicate at a rate equal to the entropy of Z; hence, the sum-rate is strictly smaller than that of Slepian-Wolf, unless Z is symmetric. Thus, the Körner-Marton scheme is a natural candidate for this setting. Indeed, it was observed in [[4], [16]] that a standard information-theoretic solution such as Slepian-Wolf coding may not always be the way to go, and [[16]] mentions the Körner-Marton scheme in this respect. Further, Shimokawa and Amari [[19]] point out the possible application of the Körner-Marton scheme to distributed parameter estimation in a similar setting, and a similar observation is made in [[20]]. However, to the best of our knowledge, the present work is the first to propose an actual Körner-Marton-based scheme for distributed hypothesis testing and to analyze its performance. Notably, the performance tradeoff obtained recovers the achievable tradeoff derived for a one-sided constraint.

The rest of this paper is organized as follows. In Section 2 we formally state the problem, define notation and present some basic results. Sections 3 and 4 provide necessary background: the former surveys known results for the case of a one-sided rate constraint, while the latter provides definitions and properties of good linear codes. In Section 5 we present the derivation of a new achievable exponent tradeoff region. Then, in Section 6 we present our results for a symmetric-rate constraint. Numerical results and comparisons appear in Section 7. Finally, Section 8 concludes the paper.

2 Problem Statement and Notations

2.1 Problem Statement

Figure 1: Problem setup.

Consider the setup depicted in Figure 1. X and Y are random vectors of blocklength $n$, drawn from the (finite) source alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. Recalling the hypothesis testing problem (1), we have two possible i.i.d. distributions. In the sequel we will take a less standard notational approach, and define the hypothesis by a random variable $H$ which takes the values $\{H_0, H_1\}$, and assume a probability distribution function over it; therefore $H = H_j$ refers to hypothesis $H_j$ of (1) and (2). We still use for the distribution under $H_j$ (for $j \in \{0,1\}$) the shortened notation $P^j$. Namely, for any $\mathbf{x} \in \mathcal{X}^n$ and $\mathbf{y} \in \mathcal{Y}^n$, and for $j \in \{0,1\}$,
$P^j(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{n} P^j_{XY}(x_i, y_i).$

A scheme for the problem is defined as follows.

Definition 1

A scheme consists of encoders $f_X$ and $f_Y$, which are mappings from the sets of length-$n$ source vectors to the message sets $\mathcal{M}_X$ and $\mathcal{M}_Y$:

$f_X:\ \mathcal{X}^n \to \mathcal{M}_X$ (3a)
$f_Y:\ \mathcal{Y}^n \to \mathcal{M}_Y$ (3b)

and a decision function, which is a mapping from the set of possible message pairs to one of the hypotheses:

$g:\ \mathcal{M}_X \times \mathcal{M}_Y \to \{H_0, H_1\}.$ (4)
Definition 2

For a given scheme $(f_X, f_Y, g)$, denote the decision given the source pair $(\mathbf{x}, \mathbf{y})$ by

$\hat{H}(\mathbf{x}, \mathbf{y}) = g\left(f_X(\mathbf{x}), f_Y(\mathbf{y})\right).$ (5)

The decision error probabilities of the scheme are given by

$\epsilon_0 = \Pr\left(\hat{H} = H_1 \mid H = H_0\right)$ (6a)
$\epsilon_1 = \Pr\left(\hat{H} = H_0 \mid H = H_1\right)$ (6b)
Definition 3

For any rates $R_X, R_Y \ge 0$ and exponents $E_0, E_1 \ge 0$, the exponent pair $(E_0, E_1)$ is said to be achievable at rates $(R_X, R_Y)$ if there exists a sequence of schemes

$\left(f_X^{(n)}, f_Y^{(n)}, g^{(n)}\right)$ (7)

with corresponding sequences of message sets $\mathcal{M}_X^{(n)}$ and $\mathcal{M}_Y^{(n)}$ and error probabilities $\epsilon_0^{(n)}$, $\epsilon_1^{(n)}$, such that

$\limsup_{n \to \infty} \frac{1}{n} \log \left|\mathcal{M}_X^{(n)}\right| \le R_X$ (8a)
$\limsup_{n \to \infty} \frac{1}{n} \log \left|\mathcal{M}_Y^{(n)}\right| \le R_Y$ (8b)
and
$\liminf_{n \to \infty} -\frac{1}{n} \log \epsilon_0^{(n)} \ge E_0, \qquad \liminf_{n \to \infty} -\frac{1}{n} \log \epsilon_1^{(n)} \ge E_1.$ (8c)

The achievable exponent region is the closure of the set of all achievable exponent pairs.

The case where only one of the error probabilities decays exponentially is of special interest; we call the resulting quantity the Stein exponent after Stein's Lemma (see, e.g., [[21], Chapter 12]). When $\epsilon_1$ is exponential, the Stein exponent is defined as:

(9)

The Stein exponent for the case where $\epsilon_0$ is exponential is defined similarly.

In this work we will concentrate on two special cases of rate constraints, for which we can make the notation more concise.

  1. One-sided constraint, where only the rate of the X-encoder is constrained, i.e., $R_X = R$ while $R_Y$ is unconstrained. We shall denote the corresponding achievable region and Stein exponents accordingly.

  2. Symmetric constraint, where both encoders are subject to the same rate constraint, i.e., $R_X = R_Y = R$. Again, we denote the corresponding achievable region and Stein exponents accordingly.

Note that for any $R$, the region achievable under a symmetric constraint is contained in the region achievable under a one-sided constraint with the same $R$.

Whenever considering a specific source distribution, we will take $(X,Y)$ to be a DSBS. Recalling (2), that means that X and Y are binary and symmetric, and the "noise" $Z = X \oplus Y$ satisfies:

$Z \sim \mathrm{Ber}(p_j) \quad \text{under hypothesis } H_j, \ j \in \{0,1\},$ (10)

for some parameters $p_0 < p_1 \le 1/2$ (note that there is some loss of generality in assuming that both probabilities are on the same side of $1/2$).
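The DSBS model can be simulated directly: X is symmetric, Z is Bernoulli with the crossover probability of the active hypothesis, and $Y = X \oplus Z$. The sketch below (with illustrative values for $p_0$ and $p_1$) also shows that the marginals of X and Y do not depend on the hypothesis, so only the disagreement rate is informative.

```python
import numpy as np

def dsbs(n, p, rng):
    """DSBS with crossover p: X ~ Ber(1/2) i.i.d., Z ~ Ber(p) i.i.d., Y = X xor Z."""
    x = rng.integers(0, 2, size=n)
    z = (rng.random(n) < p).astype(int)
    return x, x ^ z

rng = np.random.default_rng(2)
p0, p1 = 0.05, 0.25          # illustrative values with p0 < p1 <= 1/2
for p in (p0, p1):
    x, y = dsbs(100_000, p, rng)
    # marginals of X and Y are ~1/2 under both hypotheses; only the disagreement rate differs
    print(p, np.mean(x), np.mean(y), np.mean(x ^ y))
```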

2.2 Further Notations

The following notation for probability distribution functions is demonstrated for random variables X and Y over alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. The probability distribution function of X is denoted by $P_X$, and the conditional probability distribution function of Y given X is denoted by $P_{Y|X}$. A composition of $P_X$ and $P_{Y|X}$ is denoted by $P_X \times P_{Y|X}$, leading to the following joint probability distribution function of X and Y:

$P_{XY}(x, y) = P_X(x) \cdot P_{Y|X}(y|x)$ (11)

for any pair $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

The Shannon entropy of a random variable X is denoted by $H(X)$, and the Kullback-Leibler divergence of a pair of probability distribution functions $P$ and $Q$ is denoted by $D(P \| Q)$. The mutual information of a pair of random variables X and Y is denoted by $I(X;Y)$. The similar conditional functionals of the entropy, divergence and mutual information are defined by an expectation over the a-priori distribution: the conditional entropy of Y given X is denoted by

$H(Y|X) = \sum_{x \in \mathcal{X}} P_X(x) \, H(Y \mid X = x).$ (12)

The divergence of a pair of conditional probability distribution functions $P_{Y|X}$ and $Q_{Y|X}$ is denoted by

$D\left(P_{Y|X} \,\middle\|\, Q_{Y|X} \,\middle|\, P_X\right) = \sum_{x \in \mathcal{X}} P_X(x) \, D\left(P_{Y|X}(\cdot|x) \,\middle\|\, Q_{Y|X}(\cdot|x)\right).$

The conditional mutual information of a pair of random variables X and Y given a random variable W is denoted by $I(X;Y|W)$,

and notice that it is equal to

$D\left(P_{Y|XW} \,\middle\|\, P_{Y|W} \,\middle|\, P_{XW}\right).$

If there is a Markov chain $W \leftrightarrow X \leftrightarrow Y$, then we can omit the $W$ from $P_{Y|XW}$ and the expression becomes

$D\left(P_{Y|X} \,\middle\|\, P_{Y|W} \,\middle|\, P_{XW}\right).$

Since we concentrate on the binary case, we need the following. Denote the binary divergence of a pair $(a, b)$, where $0 \le a, b \le 1$, by

$D_b(a \| b) \triangleq a \log \frac{a}{b} + (1-a) \log \frac{1-a}{1-b},$ (13)

which is the Kullback-Leibler divergence of the pair of probability distributions $(a, 1-a)$ and $(b, 1-b)$. Denote the binary entropy of $a \in [0, 1]$ by

$h_b(a) \triangleq -a \log a - (1-a) \log (1-a),$ (14)

which is the entropy function of the probability distribution $(a, 1-a)$. Denote the Gilbert-Varshamov relative distance of a code of rate $R$ by

$\delta_{GV}(R) \triangleq h_b^{-1}(1-R),$ (15)

where $h_b^{-1}(\cdot)$ is the inverse of the binary entropy function restricted to $[0, 1/2]$.

The operator $\oplus$ denotes addition over the binary field. The operator $\ominus$ is equivalent to $\oplus$ over the binary field, but we nevertheless keep both for the sake of consistency.

The Hamming weight of a vector $\mathbf{x} \in \{0,1\}^n$ is denoted by

$w_H(\mathbf{x}) \triangleq \sum_{i=1}^{n} \mathbb{1}(x_i \ne 0),$ (16)

where $\mathbb{1}(\cdot)$ denotes the indicator function, and the sum is over the reals. The normalized Hamming weight of this vector is denoted by

$\bar{w}_H(\mathbf{x}) \triangleq \frac{w_H(\mathbf{x})}{n}.$ (17)

Denote the $n$-dimensional Hamming ball with center c and normalized radius $r$ by

$\mathcal{B}(\mathbf{c}, r) \triangleq \left\{\mathbf{x} \in \{0,1\}^n : \bar{w}_H(\mathbf{x} \ominus \mathbf{c}) \le r\right\}.$ (18)

The binary convolution of $a, b \in [0, 1]$ is defined by

$a * b \triangleq a(1-b) + (1-a)b.$ (19)
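For later numerical use, a minimal Python sketch of these binary quantities follows; the function names (hb, db, gv, conv) are ad-hoc, SciPy is assumed for the root finding, and logarithms are taken to base 2.

```python
import numpy as np
from scipy.optimize import brentq

def hb(a):
    """Binary entropy h_b(a) in bits."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

def db(a, b):
    """Binary divergence D_b(a || b) between Ber(a) and Ber(b), in bits."""
    out = 0.0
    if a > 0:
        out += a * np.log2(a / b)
    if a < 1:
        out += (1 - a) * np.log2((1 - a) / (1 - b))
    return out

def gv(R):
    """Gilbert-Varshamov relative distance: the root of h_b(d) = 1 - R on [0, 1/2]."""
    return brentq(lambda d: hb(d) - (1 - R), 1e-12, 0.5)

def conv(a, b):
    """Binary convolution a * b = a(1-b) + (1-a)b (crossover of two cascaded BSCs)."""
    return a * (1 - b) + (1 - a) * b

print(hb(0.11), db(0.25, 0.05), gv(0.5), conv(0.1, 0.2))
```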
Definition 4 (Bernoulli Noise)

A Bernoulli random variable X with $\Pr(X = 1) = p$ is denoted by $X \sim \mathrm{Ber}(p)$. An $n$-dimensional random vector Z with i.i.d. entries $Z_i \sim \mathrm{Ber}(p)$ for $i = 1, \ldots, n$ is called a Bernoulli noise, and denoted by

$\mathbf{Z} \sim \mathrm{Ber}(n, p).$ (20)
Definition 5 (Fixed-Type Noise)

Denote the set of vectors with type $q$ by

$\mathcal{T}_q \triangleq \left\{\mathbf{z} \in \{0,1\}^n : \bar{w}_H(\mathbf{z}) = q\right\}.$ (21a)

A noise

$\mathbf{Z} \sim \mathrm{Uniform}\left(\mathcal{T}_q\right)$ (21b)

is called an $n$-dimensional fixed-type noise of type $q$.

For any two sequences, and , we write if . We write if .

For any two sequences of random vectors (), we write

(22)

if

(23)

uniformly over , that is,

(24)

uniformly over . We write if

(25)

uniformly over .

The set of non-negative integers is denoted by $\mathbb{Z}_{\ge 0}$, and the set of natural numbers, i.e., $\{1, 2, \ldots\}$, by $\mathbb{N}$.

2.3 Some Basic Results

When the rate is not constrained, the decision function has access to the full source sequences. The optimal tradeoff of the two types of errors is given by the following decision function, depending on a threshold parameter $T$ (Neyman-Pearson [[22]]),

$\hat{H}(\mathbf{x}, \mathbf{y}) = \begin{cases} H_0, & \text{if } \dfrac{P^0(\mathbf{x}, \mathbf{y})}{P^1(\mathbf{x}, \mathbf{y})} \ge T \\ H_1, & \text{otherwise.} \end{cases}$ (26)
Proposition 1 (Unconstrained Case)

Consider the hypothesis testing problem as defined in Section 2.1, where there is no rate constraint. Then $(E_0, E_1)$ is achievable if and only if there exists a distribution function $Q_{XY}$ over the pair $(X, Y)$ such that

$E_0 \le D\left(Q_{XY} \,\middle\|\, P^0_{XY}\right) \quad \text{and} \quad E_1 \le D\left(Q_{XY} \,\middle\|\, P^1_{XY}\right).$ (27)

For proof, see e.g. [[21]]. Note that in fact rates equal to the logarithms of the alphabet sizes suffice.

For the DSBS, the Neyman-Pearson decision function is a threshold on the weight of the noise sequence. We denote it (with some abuse of notation) by $\hat{H}(\mathbf{x}, \mathbf{y}) = T_t(\mathbf{x} \oplus \mathbf{y})$,

where $T_t(\cdot)$ is a threshold test,

$T_t(\mathbf{z}) = \begin{cases} H_0, & \text{if } \bar{w}_H(\mathbf{z}) \le t \\ H_1, & \text{otherwise.} \end{cases}$ (28)

It leads to the following performance:

Corollary 1 (Unconstrained Case, DSBS)

For the DSBS, in the unconstrained case the achievable exponent region consists of all pairs $(E_0, E_1)$ satisfying, for some $p_0 \le t \le p_1$,

$E_0 \le D_b(t \| p_0), \qquad E_1 \le D_b(t \| p_1).$ (29)
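Assuming the standard form of this tradeoff, namely that the boundary is traced by $E_0 = D_b(t \| p_0)$ and $E_1 = D_b(t \| p_1)$ as the threshold $t$ sweeps $[p_0, p_1]$, the region can be tabulated as in the sketch below (parameters are illustrative).

```python
import numpy as np

def db(a, b):
    """Binary divergence D_b(a || b) in bits."""
    out = 0.0
    if a > 0:
        out += a * np.log2(a / b)
    if a < 1:
        out += (1 - a) * np.log2((1 - a) / (1 - b))
    return out

p0, p1 = 0.05, 0.25                       # illustrative crossover probabilities
for t in np.linspace(p0, p1, 6):          # sweep the threshold of the weight test
    print(f"t={t:.3f}  E0={db(t, p0):.4f}  E1={db(t, p1):.4f}")
# As t -> p0 the exponent under H0 vanishes while E1 approaches D_b(p0||p1) (a Stein exponent),
# and symmetrically as t -> p1.
```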

We now note a time-sharing result, which is general to any given achievable set.

Proposition 2 (time-sharing)

Suppose that the pair $(E_0, E_1)$ is achievable at rates $(R_X, R_Y)$. Then, for any $0 \le \alpha \le 1$,

$(\alpha E_0, \alpha E_1) \text{ is achievable at rates } (\alpha R_X, \alpha R_Y).$ (30)

The proof is standard, by noting that any scheme may be applied to an $\alpha$-portion of the source blocks, ignoring the remaining samples. Applying this technique to Corollary 1, we have a simple scheme where each encoder sends only a fraction of its observed vector.

Corollary 2

Consider the DSBS hypothesis testing problem as defined in Section 2.1. For any rate constraint $R$ and any $p_0 \le t \le p_1$,

(31)

Specializing to Stein’s exponents, we have:

(32a)
(32b)

Of course, we may apply the same result to the one-sided constraint case and the corresponding Stein exponents.
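Under the assumption that a rate-$R$ binary encoder conveys a fraction $\alpha = \min(R, 1)$ of its samples losslessly and discards the rest, the time-shared exponents are simply the unconstrained ones scaled by $\alpha$; the sketch below (illustrative parameters, hypothetical helper db as before) evaluates this scaling.

```python
import numpy as np

def db(a, b):
    """Binary divergence D_b(a || b) in bits."""
    out = 0.0
    if a > 0:
        out += a * np.log2(a / b)
    if a < 1:
        out += (1 - a) * np.log2((1 - a) / (1 - b))
    return out

p0, p1, t = 0.05, 0.25, 0.12
for R in (0.25, 0.5, 1.0):
    alpha = min(R, 1.0)       # assumed fraction of samples conveyed in full at rate R
    print(R, alpha * db(t, p0), alpha * db(t, p1))
```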

3 One-Sided Constraint: Previous Results

In this section we review previous results for the case of a one-sided rate constraint. We first present them for general distributions and then specialize to the DSBS.

3.1 General Sources

Ahlswede and Csiszár established the following achievable Stein exponent.

Proposition 3 ([[4], Theorem 5])

For any ,

(33)

where and are the marginals of

and

respectively.

The first term of (33) reflects the contribution of the type of X (which can be conveyed with zero rate), while the second reflects the contribution of the lossy version of X sent with rate $R$. Interestingly, this exponent is optimal for the case where X and Y are independent under one of the hypotheses, known as testing against independence.

Han has improved upon this exponent by conveying the joint type of the source sequence and its quantized version to the decision function.

Proposition 4 ([[5], Theorems 2,3])

For any ,

(34a)
where
(34b)

and where and are the marginals of

(35a)
and
(35b)

respectively.

The following result by Shimokawa et al. gives a tighter achievable bound by using the side information Y when encoding X.

Proposition 5 ([[6], Corollary III.2],[[16], Theorem 4.3])
Define
(36a)
where and are the marginals of the distributions defined in (35a) and (35b), respectively. Then, for any ,
(36b)

Notice that there are cases in which the bound of the last proposition is no greater than the bound of Proposition 4. Therefore, the overall bound is obtained by taking the maximum of the two.

It is worth pointing out that for the distributed rate-distortion problem, the bound in Proposition 5 is in general suboptimal [[23]].

A non-trivial outer bound was derived by Rahman and Wagner [[10]] using additional information at the decoder, which does not exist in the original problem.

Proposition 6 ([[10], Corollary 5])

Suppose that

(37a)
Consider a pair of conditional distributions and such that
(37b)
and such that under the distribution
(37c)
Then, for any ,
(37d)

3.2 Specializing to the DSBS

We now specialize the results of Section 3.1 to the DSBS. Throughout, we choose the auxiliary variable to be connected to X by a binary symmetric channel; with some abuse of notation, we use the same symbols for the specialized quantities. Due to symmetry, we conjecture that this choice of auxiliary variable is optimal, up to time sharing that can be applied according to Proposition 2; we do not explicitly write the time-sharing expressions.

The connection between the general and DSBS-specific results can be shown formally; however, we follow a direction that is more relevant to this work, providing for each result a direct interpretation that explains how it can be obtained for the DSBS. In doing so, we follow the interpretations of Rahman and Wagner [[10]].

The Ahlswede-Csiszár scheme of Proposition 3 amounts to quantization of the source X, without using Y as side information.

Corollary 3 (Proposition 3, DSBS with symmetric auxiliary)

For any ,

(38a)
where
(38b)

This performance can be obtained as follows. The encoder quantizes X using a code that is rate-distortion optimal under the Hamming distortion measure; specifically, averaging over the random quantizer, the source and reconstruction are jointly distributed according to the RDF-achieving test channel. That is, the reconstruction is obtained from the source X by a BSC whose crossover probability satisfies the rate-distortion function at rate $R$. The decision function can be seen as two-stage: first the source difference sequence is estimated by adding (modulo two) the reconstruction to Y, and then a threshold is applied to the weight of that estimate, as if it were the true noise. Notice that, given the hypothesis, this estimate is Bernoulli with parameter equal to the binary convolution of the test-channel crossover probability with the true crossover probability; the exponents are thus the probabilities of such a vector to fall inside or outside a Hamming sphere around the origin. As Proposition 3 relates to a Stein exponent, the threshold is set arbitrarily close to the convolution of the test-channel crossover probability with $p_1$, resulting in the exponent above; one can easily generalize to an achievable exponent region.
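A numerical sketch of this interpretation follows: the test-channel crossover (called D below) is obtained from $R = 1 - h_b(D)$, the estimated noise has crossover $D * p_j$ under hypothesis $j$, and a Stein-type exponent is then a binary divergence between the two convolved parameters. The symbol names and the final divergence expression are assumptions consistent with the description above, not a transcription of (38).

```python
import numpy as np
from scipy.optimize import brentq

hb   = lambda a: 0.0 if a in (0.0, 1.0) else -a * np.log2(a) - (1 - a) * np.log2(1 - a)
conv = lambda a, b: a * (1 - b) + (1 - a) * b

def db(a, b):
    out = 0.0
    if a > 0:
        out += a * np.log2(a / b)
    if a < 1:
        out += (1 - a) * np.log2((1 - a) / (1 - b))
    return out

p0, p1, R = 0.05, 0.25, 0.5
D = brentq(lambda d: 1 - hb(d) - R, 1e-12, 0.5)   # test-channel crossover: R = 1 - h_b(D)
q0, q1 = conv(D, p0), conv(D, p1)                 # estimated-noise crossover under H0 / H1
print(D, q0, q1, db(q1, q0))                      # last value: Stein-type exponent of this sketch
```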

The Han scheme of Proposition 4 amounts (for the DSBS) to a similar approach, using a more favorable quantization scheme. In order to express its performance, we use the following exponent, which is explicitly evaluated in Appendix 9. While it is a bit more general than what we need at this point, this definition will allow us to present later results in a unified manner.

Definition 6

Fix some parameters . Let be a sequence of vectors such that . Let and let . Then:

(39)
Corollary 4 (Proposition 4, DSBS with symmetric auxiliary)

For any ,

(40a)
where
(40b)

One can show that this exponent improves upon that of Corollary 3, where the improvement is strict except for testing against independence, for which the Ahlswede-Csiszár scheme is already optimal. The improvement comes from having a quantization error that is fixed-type (recall Definition 5) rather than Bernoulli. Thus, the estimated noise is "mixed" uniform-Bernoulli; the probability of that noise to enter a ball around the origin is reduced with respect to that of the Bernoulli noise of Corollary 3.

The Shimokawa et al. scheme of Proposition 5 is similar in the DSBS case, except that the compression of X now uses side-information. Namely, Wyner-Ziv style binning is used. When the bin is not correctly decoded, a decision error may occur. The resulting performance is given in the following.

Corollary 5 (Proposition 5, DSBS with symmetric auxiliary)

For any ,

(41a)
where
(41b)

This exponent can be thought of as follows. The encoder performs fixed-type quantization as in Han's scheme, except that the quantization type may now correspond to a description rate that exceeds the communication rate $R$. The quantization indices are distributed into bins; as the rate of the bin index is $R$, each bin contains indices at the excess rate. The decision function decodes the bin index using the side information Y, and then proceeds as in Han's scheme.

The two terms in the minimization (41a) represent the events of decision error combined with bin-decoding success and with bin-decoding error, respectively. The first is as before, hence the use of the same exponent. For the second, it can be shown that, as a worst-case assumption, a reconstruction resulting from a decoding error is uniformly distributed over all binary sequences. By considering volumes, the probability of such a reconstruction falling inside a sphere of normalized radius $t$ is at most $2^{-n(1 - h_b(t))}$; a union bound over the sequences in the bin then gives the second term.
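The volume argument can be checked numerically: the sketch below compares the exact log-probability that a uniform $n$-bit vector falls in a Hamming ball of normalized radius $t$ with the exponent $1 - h_b(t)$ (parameters are illustrative).

```python
import numpy as np
from math import comb, log2

def ball_log2_prob(n, t):
    """Exact log2-probability that a uniform n-bit vector has weight at most t*n."""
    r = int(np.floor(t * n))
    return log2(sum(comb(n, k) for k in range(r + 1))) - n

hb = lambda a: -a * np.log2(a) - (1 - a) * np.log2(1 - a)

n, t = 200, 0.15
print(-ball_log2_prob(n, t) / n)   # exact exponent of the event
print(1 - hb(t))                   # volume-bound exponent 1 - h_b(t), matching to first order
```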

Remark 1

It may be better not to use binning altogether (thus avoiding binning errors), i.e., the exponent of Corollary 5 is not always higher than that of Corollary 4.

Remark 2

An important special case of this scheme is when lossless compression is used, and Wyner-Ziv coding reduces to a side-information case of Slepian-Wolf coding. This amounts to forcing the reconstruction to equal X itself. If no binning error occurs, we are in the same situation as in the unconstrained case. Thus, we have the exponent:

(42a)
where
(42b)

We have thus seen various combinations of quantization and binning; Table 1 summarizes the different possible schemes.

Coding component                Lossless    Lossy
Oblivious to side information   TS          Q + TS [[4], [5]]
Using side information          Bin + TS    Q + Bin + TS [[6]]
Table 1: Summary of possible schemes. TS stands for time-sharing, Q stands for quantization, Bin stands for binning.

An upper bound is obtained by specializing the Rahman-Wagner outer bound of Proposition 6 to the DSBS.

Corollary 6 (Proposition 6, DSBS with symmetric additional information)
(43)

where .

We note that it seems plausible that the exponent given by Corollary 4 is an upper bound for the general problem, i.e.,

(44)

Next we compare the performance of these coding schemes, in order to understand the effect of each of their components on the performance.

4 Background: Linear Codes and Error Exponents

In this section we define code ensembles that will be used in the sequel, and present their properties. Although the specific properties of linear codes are not required until Section 6, we put an emphasis on such codes already; this simplifies the proofs of some properties we need to show, and also helps to present the different results in a more unified manner.

4.1 Linear Codes

Definition 7 (Linear Code)

We define a linear code $\mathcal{C}$ via a $k \times n$ generating matrix G over the binary field. This induces the linear codebook:

$\mathcal{C} = \left\{\mathbf{u} G : \mathbf{u} \in \{0,1\}^k\right\},$ (45)

where $\mathbf{u}$ is a row vector. Assuming that all rows of G are linearly independent, there are $2^k$ codewords in $\mathcal{C}$, so the code rate is

$R = \frac{k}{n}.$ (46)

Clearly, for any rate (up to 1), there exists a linear code of this rate asymptotically as $n \to \infty$.

A linear code is also called a parity-check code, and may be specified by a (binary) parity-check matrix H. The code contains all the $n$-length binary row vectors c whose syndrome

$\mathbf{s} = \mathbf{c} H^T$ (47)

is equal to the all-zero row vector, i.e.,

$\mathcal{C} = \left\{\mathbf{c} \in \{0,1\}^n : \mathbf{c} H^T = \mathbf{0}\right\}.$ (48)

Given some general syndrome s, denote the coset of s by

$\mathcal{C}_{\mathbf{s}} \triangleq \left\{\mathbf{c} \in \{0,1\}^n : \mathbf{c} H^T = \mathbf{s}\right\}.$ (49)

The minimum Hamming distance quantizer of a vector $\mathbf{x}$ with respect to a code $\mathcal{C}$ is given by

$Q_{\mathcal{C}}(\mathbf{x}) = \operatorname*{arg\,min}_{\mathbf{c} \in \mathcal{C}} w_H(\mathbf{x} \ominus \mathbf{c}).$ (50)

For any syndrome s with respect to the code $\mathcal{C}$, the decoding function gives the coset leader, the minimum Hamming weight vector within the coset of s:

$\hat{\mathbf{z}}(\mathbf{s}) = \operatorname*{arg\,min}_{\mathbf{z} \in \mathcal{C}_{\mathbf{s}}} w_H(\mathbf{z}).$ (51)
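A minimal sketch of syndrome computation and coset-leader decoding follows, using the parity-check matrix of the [7,4] Hamming code as an illustrative example and a brute-force search in place of an efficient decoder.

```python
import numpy as np
from itertools import product

# Parity-check matrix of the [7,4] Hamming code (one common choice).
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=int)
n = H.shape[1]

def syndrome(x, H):
    """Row-vector syndrome s = x H^T over GF(2)."""
    return x @ H.T % 2

def coset_leader(s, H):
    """Brute-force search for a minimum-weight vector whose syndrome is s."""
    best = None
    for bits in product((0, 1), repeat=n):
        z = np.array(bits)
        if np.array_equal(syndrome(z, H), s) and (best is None or z.sum() < best.sum()):
            best = z
    return best

x = np.array([1, 0, 1, 1, 0, 0, 1])   # an arbitrary word
s = syndrome(x, H)
print(s, coset_leader(s, H))          # the coset leader is a lowest-weight member of the coset of s
```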

Maximum-likelihood decoding of a parity-check code over a BSC with crossover probability smaller than $1/2$ amounts to syndrome decoding [[24], Theorem 6.1.1]. The basic "Voronoi" set is given by

$\Omega_0 = \left\{\mathbf{x} \in \{0,1\}^n : w_H(\mathbf{x}) \le w_H(\mathbf{x} \ominus \mathbf{c}) \ \ \forall \mathbf{c} \in \mathcal{C}\right\}.$ (52)

The ML decision region of any codeword $\mathbf{c} \in \mathcal{C}$ is equal to a translate of $\Omega_0$, i.e.,

$\Omega_{\mathbf{c}} = \left\{\mathbf{x} \in \{0,1\}^n : \mathbf{x} \ominus \mathbf{c} \in \Omega_0\right\}$ (53)
$= \mathbf{c} \oplus \Omega_0.$ (54)

4.2 Properties of Linear Codes

Definition 8 (Normalized Distance Distribution)

The normalized distance (or weight) distribution of a linear code $\mathcal{C}$ for a parameter $0 \le \delta \le 1$ is defined to be the fraction of codewords $\mathbf{c} \in \mathcal{C}$ with normalized weight at most $\delta$, i.e.,

$\Gamma_{\mathcal{C}}(\delta) = \frac{1}{|\mathcal{C}|} \sum_{\mathbf{c} \in \mathcal{C}} \mathbb{1}\left(\bar{w}_H(\mathbf{c}) \le \delta\right),$ (55)

where $\mathbb{1}(\cdot)$ is the indicator function.

Definition 9 (Normalized Minimum Distance)

The normalized minimum distance of a linear code $\mathcal{C}$ is defined as

$\delta_{\min}(\mathcal{C}) = \frac{1}{n} \min_{\mathbf{c} \in \mathcal{C} \setminus \{\mathbf{0}\}} w_H(\mathbf{c}).$ (56)
Definition 10 (Normalized Covering Radius)

The normalized covering radius of a code $\mathcal{C}$ is the smallest integer $r$ such that every vector $\mathbf{x} \in \{0,1\}^n$ is covered by a Hamming ball with radius $r$ and center at some codeword, normalized by the blocklength, i.e.:

$\rho_{\mathrm{cover}}(\mathcal{C}) = \frac{1}{n} \max_{\mathbf{x} \in \{0,1\}^n} \min_{\mathbf{c} \in \mathcal{C}} w_H(\mathbf{x} \ominus \mathbf{c}).$ (57)
Definition 11 (Normalized Packing Radius)

The normalized packing radius of a linear code $\mathcal{C}$ is defined to be half the normalized minimum distance of its codewords, i.e.,

$\rho_{\mathrm{pack}}(\mathcal{C}) = \frac{\delta_{\min}(\mathcal{C})}{2}.$ (58)
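These quantities can be computed by brute force for a small code; the sketch below does so for an illustrative [7,4] generator matrix (helper names are ad-hoc).

```python
import numpy as np
from itertools import product

# Generator matrix of a small [7,4] code (illustrative choice).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=int)
k, n = G.shape

codebook = np.array([np.array(u) @ G % 2 for u in product((0, 1), repeat=k)])
weights = codebook.sum(axis=1)

delta_min = weights[weights > 0].min() / n     # normalized minimum distance (code is linear)
packing = delta_min / 2                        # normalized packing radius

# Normalized covering radius: worst-case distance from any n-bit vector to the code.
cover = max(min(int(((v ^ c) != 0).sum()) for c in codebook)
            for v in (np.array(b) for b in product((0, 1), repeat=n))) / n

def dist_distribution(delta):
    """Fraction of codewords with normalized weight at most delta."""
    return float(np.mean(weights <= delta * n))

print(delta_min, packing, cover, dist_distribution(0.5))
```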

4.3 Good Linear Codes

We need two notions of goodness of codes, as follows.