On Binary Distributed Hypothesis Testing

Eli Haim and Yuval Kochman
Abstract

We consider the problem of distributed binary hypothesis testing of two sequences that are generated by an i.i.d. doubly symmetric binary source. Each sequence is observed by a different terminal. The two hypotheses correspond to different levels of correlation between the two source components, i.e., the crossover probability between the two. The terminals communicate with a decision function via rate-limited noiseless links. We analyze the tradeoff between the exponential decay of the two error probabilities associated with the hypothesis test and the communication rates. We first consider the side-information setting where one encoder is allowed to send the full sequence. For this setting, previous work exploits the fact that a decoding error of the source does not necessarily lead to an erroneous decision upon the hypothesis. We provide improved achievability results by carrying out a tighter analysis of the effect of binning error; the results are also more complete as they cover the full exponent tradeoff and all possible correlations. We then turn to the setting of symmetric rates for which we utilize Körner-Marton coding to generalize the results, with little degradation with respect to the performance with a one-sided constraint (side-information setting).

I Introduction

We consider the distributed hypothesis testing (DHT) problem, where there are two distributed sources, $X$ and $Y$, and the hypotheses are given by

$$H_0: (X,Y) \sim P^{(0)}_{X,Y} \tag{1a}$$
$$H_1: (X,Y) \sim P^{(1)}_{X,Y}, \tag{1b}$$

where $P^{(0)}_{X,Y}$ and $P^{(1)}_{X,Y}$ are different joint distributions of $X$ and $Y$. The test is performed based on information sent from two distributed terminals (over noiseless links), each observing $n$ i.i.d. realizations of a different source, where the rate of the information sent from each terminal is constrained. This setup, introduced in [Berger1979Spring, AhlswedeCsiszar1981], gives rise to a tradeoff between the information rates and the probabilities of the two types of error events. In this work we focus on the exponents of these error probabilities, with respect to the number of observations $n$.

When at least one of the marginal distributions depends on the hypothesis, a test can be constructed based only on the type of the corresponding sequence. Although this test may not be optimal, it results in non-trivial performance (positive error exponents) with zero rate. In contrast, when the marginal distributions are the same under both hypotheses, a positive exponent cannot be achieved using a zero-rate scheme, see [ShalabyP1992].

One may achieve positive exponents while maintaining low rates, by effectively compressing the sources and then basing the decision upon their compressed versions. Indeed, many of the works that have considered the distributed hypothesis testing problem bear close relation to the distributed compression problem.

Ahlswede and Csiszár [AhlswedeCsiszar1986] have suggested a scheme based on compression without taking advantage of the correlation between the sources; Han [Han1987] proposed an improved scheme along the same lines. Correlation between the sources is exploited by Shimokawa et al. [ShimokawaHanAmari1994ISIT, Shimokawa:1994:MScThesis] to further reduce the coding rate, incorporating random binning following the Slepian-Wolf [SlepianWolf73] and Wyner-Ziv [WynerZiv76] schemes. Rahman and Wagner [RahmanWagner2012] generalized this setting and also derived an outer bound. They also give a “quantize and bin” interpretation to the results of [ShimokawaHanAmari1994ISIT]. Other related works include [HanKobayashi1989, Amari2011, Polyanskiy2012ISIT, Katz2017, Katz2016Asilomar]. See [HanAmari1998, RahmanWagner2012] for further references.

We note that in spite of considerable efforts over the years, the problem remains open. In many cases, the gap between the achievability results and the few known outer bounds is still large. Specifically, some of the stronger results are specific to testing against independence (i.e., under one of the hypotheses $X$ and $Y$ are independent), or specific to the case where one of the error exponents is zero (“Stein’s-Lemma” setting). The present work significantly goes beyond previous works, extending and improving the achievability bounds. Nonetheless, the refined analysis comes at a price. Namely, in order to facilitate analysis, we choose to restrict attention to a simple source model.

To that end, we consider the case where $(X,Y)$ is a doubly symmetric binary source (DSBS). That is, $X$ and $Y$ are each binary and symmetric. Let $Z \triangleq Y \ominus X$ be the modulo-two difference between the sources. (Notice that in this binary case, the uniform marginals mean that $Z$ is necessarily independent of $X$.) We consider the following two hypotheses:

$$H_0: Z \sim \mathrm{Ber}(p_0) \tag{2a}$$
$$H_1: Z \sim \mathrm{Ber}(p_1), \tag{2b}$$

where we assume throughout that $p_0 \le p_1 \le 1/2$. Note that a sufficient statistic for hypothesis testing in this case is the weight (which is equivalent to the type) of the noise sequence $\mathbf{Z}$. Under communication rate constraints, a plausible approach would be to use a distributed compression scheme that allows lossy reconstruction of the sequence $\mathbf{Z}$, and then base the decision upon that sequence.
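To make the role of the noise weight concrete, the following is a minimal sketch of the unconstrained-rate case, where the decision function sees both sequences. The function names (`sample_dsbs`, `normalized_weight`, `threshold_test`) and the threshold parameter `t` are illustrative assumptions, not notation from the paper.

```python
import random

def sample_dsbs(n, p, rng):
    """Draw n i.i.d. pairs from a DSBS: x is uniform, y = x XOR z, z ~ Ber(p)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    z = [1 if rng.random() < p else 0 for _ in range(n)]
    y = [xj ^ zj for xj, zj in zip(x, z)]
    return x, y

def normalized_weight(x, y):
    """Normalized Hamming weight of the noise sequence z = x XOR y."""
    return sum(xj ^ yj for xj, yj in zip(x, y)) / len(x)

def threshold_test(x, y, t):
    """Decide H = 0 if the normalized noise weight is at most t, else H = 1.

    Assumes p0 < p1, so a lighter noise sequence favors H = 0.
    """
    return 0 if normalized_weight(x, y) <= t else 1

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility of the demo
    x, y = sample_dsbs(1000, 0.1, rng)
    print(normalized_weight(x, y), threshold_test(x, y, t=0.2))
```

The test depends on the pair of sequences only through the weight of their modulo-two difference, which is why a reconstruction of the noise sequence is enough for the decision.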

We first consider a one-sided rate constraint. That is, the $Y$-encoder is allocated the full rate of one bit per source sample, so that the $\mathbf{Y}$ sequence is available as side information at the decision function. In this case, compression of $\mathbf{Z}$ amounts to compression of $\mathbf{X}$; a random binning scheme is optimal for this task of compression, lossless or lossy. (More precisely, it gives the optimal coding rates, as well as the best known error exponents when the rate is not too high.) Indeed, in this case, the best known achievability result is due to [ShimokawaHanAmari1994ISIT], which basically employs a random binning scheme. (Interestingly, when $p_1 = 1/2$, i.e., testing against independence, the simple scheme of [AhlswedeCsiszar1986], which ignores the side information altogether, is optimal.)

A natural question that arises when using binning as part of the distributed hypothesis testing scheme is the effect of a “bin decoding error” on the decision error between the hypotheses. The connection between these two errors is non-trivial as a bin decoding error inherently results in a “large” noise reconstruction error, much in common with errors in channel coding (in the context of syndrome decoding). Specifically, when a binning error occurs, the reconstruction of the noise sequence $\mathbf{Z}$ is roughly consistent with an i.i.d. Bernoulli$(1/2)$ distribution. Thus, if one feeds the weight of this reconstructed sequence to a simple threshold test, it would typically result in deciding that the noise was distributed according to $p_1$, regardless of whether that is the true distribution or not. This effect causes an asymmetry between the two error probabilities associated with the hypothesis test. Indeed, as the Stein exponent corresponds to highly asymmetric error probabilities, the exponent derived in [ShimokawaHanAmari1994ISIT] may be interpreted as taking advantage of this effect. (Another interesting direction, not pursued in this work, is to change the problem formulation to allow declaring an “erasure” when the probability of a bin decoding error exceeds a certain threshold.)

The contribution of the present work is twofold. First, we extend and strengthen the results of [ShimokawaHanAmari1994ISIT]. By explicitly considering and leveraging the properties of good codes, we bound the probability that the sequence $\mathbf{Z}$ happens to be such that $\mathbf{Y}$ is very close to some wrong yet “legitimate” $\mathbf{X}$, much like an undetected error event in erasure decoding [Forney68]. This allows us to derive achievability results for the full tradeoff region, namely the tradeoff between the error exponents corresponding to the two types of hypothesis testing errors.

The second contribution is in considering a symmetric-rate constraint. For this case, the optimal distributed compression scheme for $Z$ is the Körner-Marton scheme [KornerMarton79], which requires each of the users to communicate at a rate $H(Z)$; hence, the sum-rate is strictly smaller than that of Slepian-Wolf, unless $Z$ is symmetric. Thus, the Körner-Marton scheme is a natural candidate for this setting. Indeed, it was observed in [AhlswedeCsiszar1986, HanAmari1998] that a standard information-theoretic solution such as Slepian-Wolf coding may not always be the way to go, and [HanAmari1998] mentions the Körner-Marton scheme in this respect. Further, Shimokawa and Amari [ShimokawaAmari:ISIT:1995] point out the possible application of the Körner-Marton scheme to distributed parameter estimation in a similar setting, and a similar observation is made in [GamalL15AllertonArxiv]. However, to the best of our knowledge, the present work is the first to propose an actual Körner-Marton-based scheme for distributed hypothesis testing and to analyze its performance. Notably, the performance tradeoff obtained recovers the achievable tradeoff derived for a one-sided constraint.
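The structural fact that makes Körner-Marton coding attractive here is linearity: if both encoders apply the same binary linear map, the decision function can add the two messages modulo two and obtain the syndrome of the noise sequence without recovering either source sequence. The sketch below illustrates only this identity. The (7,4) Hamming parity-check matrix and the helper names `syndrome` and `xor` are assumptions chosen for a toy example; the schemes in the paper rely on good linear codes at long blocklengths.

```python
# Parity-check matrix of the (7,4) Hamming code, used purely as a toy example.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(parity_check, v):
    """Syndrome of the binary vector v under the given parity-check matrix."""
    return [sum(h * b for h, b in zip(row, v)) % 2 for row in parity_check]

def xor(a, b):
    """Element-wise addition over the binary field."""
    return [ai ^ bi for ai, bi in zip(a, b)]

x = [1, 0, 1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 0, 0, 1]   # differs from x in a single position
z = xor(x, y)               # the "noise" sequence

# Each terminal sends only its own syndrome; the decision function adds them.
combined = xor(syndrome(H, x), syndrome(H, y))
assert combined == syndrome(H, z)
```

Because the combined message depends on the sources only through their modulo-two difference, each terminal needs a rate matched to the noise rather than to its own uniform sequence.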

The rest of this paper is organized as follows. In Section II we formally state the problem, define notations and present some basic results. Sections LABEL:sec:related_results and LABEL:sec:linear_codes_and_EE provide necessary background: the former surveys known results for the case of a one-sided rate constraint, while the latter provides definitions and properties of good linear codes. In Section LABEL:sec:one_user_constrained_case we present the derivation of a new achievable exponents tradeoff region. Then, in Section LABEL:sec:symmetric_rate_constraint we present our results for a symmetric-rate constraint. Numerical results and comparisons appear in Section LABEL:sec:performance_comparison. Finally, Section LABEL:sec:future_work concludes the paper.

II Problem Statement and Notations

II-A Problem Statement

Consider the setup depicted in Figure 1. $\mathbf{X}$ and $\mathbf{Y}$ are random vectors of blocklength $n$, drawn from the (finite) source alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. Recalling the hypothesis testing problem (1), we have two possible i.i.d. distributions. In the sequel we will take a less standard notational approach, and define the hypotheses by a random variable $H$ which takes the values $0, 1$, and assume a probability distribution function $P_{X,Y|H}$; therefore, $H = i$ refers to $H_i$ of (1) and (2). (We do not assume any given distribution over $H$, as we are always interested in probabilities given the hypotheses.) We still use for the distribution $P_{X,Y|H=i}$ (for $i = 0, 1$) the shortened notation $P^{(i)}_{X,Y}$. Namely, for any $\mathbf{x} \in \mathcal{X}^n$ and $\mathbf{y} \in \mathcal{Y}^n$, and for $i \in \{0, 1\}$,

$$P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y} \mid H = i) = \prod_{j=1}^{n} P^{(i)}_{X,Y}(x_j, y_j).$$

A scheme for the problem is defined as follows.

Definition 1

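A scheme $\Upsilon \triangleq (\phi_X, \phi_Y, \psi)$ consists of encoders $\phi_X$ and $\phi_Y$, which are mappings from the sets of length-$n$ source vectors to the message sets $\mathcal{M}_X$ and $\mathcal{M}_Y$:

$$\phi_X: \mathcal{X}^n \mapsto \mathcal{M}_X \tag{3a}$$
$$\phi_Y: \mathcal{Y}^n \mapsto \mathcal{M}_Y, \tag{3b}$$

and a decision function, which is a mapping from the set of possible message pairs to one of the hypotheses:

$$\psi: \mathcal{M}_X \times \mathcal{M}_Y \mapsto \{0, 1\}. \tag{4}$$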
Definition 2

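For a given scheme $\Upsilon$, denote the decision given the pair $(\mathbf{X}, \mathbf{Y})$ by

$$\hat{H} \triangleq \psi(\phi_X(\mathbf{X}), \phi_Y(\mathbf{Y})). \tag{5}$$

The decision error probabilities of $\Upsilon$ are given by

$$\epsilon_i \triangleq P(\hat{H} \neq H \mid H = i), \quad i = 0, 1. \tag{6}$$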
Definition 3

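For any $E_0$ and $E_1$, the exponent pair $(E_0, E_1)$ is said to be achievable at rates $(R_X, R_Y)$ if there exists a sequence of schemes

$$\Upsilon^{(n)} \triangleq (\phi_X^{(n)}, \phi_Y^{(n)}, \psi^{(n)}), \quad n = 1, 2, \ldots \tag{7}$$

with corresponding sequences of message sets $\mathcal{M}_X^{(n)}$ and $\mathcal{M}_Y^{(n)}$ and error probabilities $\epsilon_i^{(n)}$, $i \in \{0, 1\}$, such that

$$\limsup_{n \to \infty} \frac{1}{n} \log \left| \mathcal{M}_X^{(n)} \right| \le R_X \tag{8a}$$
$$\limsup_{n \to \infty} \frac{1}{n} \log \left| \mathcal{M}_Y^{(n)} \right| \le R_Y, \tag{8b}$$

and

$$\liminf_{n \to \infty} -\frac{1}{n} \log \epsilon_i^{(n)} \ge E_i, \quad i = 0, 1. \tag{8c}$$

(All logarithms are taken to the base 2, and all rates are in units of bits per sample.)

The achievable exponent region $\mathcal{C}(R_X, R_Y)$ is the closure of the set of all achievable exponent pairs. (For simplicity of notation, we omit here and in subsequent definitions the explicit dependence on the distributions $(P^{(0)}_{X,Y}, P^{(1)}_{X,Y})$.)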

The case where only one of the error probabilities decays exponentially is of special interest; we call the resulting quantity the Stein exponent after Stein’s Lemma (see, e.g., [CoverBook, Chapter 12]). When $\epsilon_1^{(n)}$ is the error probability that decays exponentially, the Stein exponent is defined as:

$$\sigma_1(R_X, R_Y) \triangleq \sup_{E_0 > 0} \left\{ E_1 : \exists (E_0, E_1) \in \mathcal{C}(R_X, R_Y) \right\}. \tag{9}$$

$\sigma_0(R_X, R_Y)$ is defined similarly.

In this work we concentrate on two special cases of rate constraints, for which we can make the notation more concise.

1. One-sided constraint where . We shall denote the achievable region and Stein exponents as , and .

2. Symmetric constraint where . We shall denote the achievable region and Stein exponents as , and .

Note that for any we have that .

Whenever considering a specific source distribution, we will take $(X, Y)$ to be a DSBS. Recalling (2), that means that $X$ and $Y$ are binary symmetric, and the “noise” $Z$ satisfies:

$$P(Z = 1 \mid H = i) = p_i, \quad i = 0, 1 \tag{10}$$

for some parameters $p_0$ and $p_1$ (note that there is loss of generality in assuming that both probabilities are on the same side of $1/2$).

II-B Further Notations

The following notations of probability distribution functions are demonstrated for random variables $X$ and $Y$ over alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. The probability distribution function of a random variable $X$ is denoted by $P_X$, and the conditional probability distribution function of a random variable $Y$ given a random variable $X$ is denoted by $P_{Y|X}$. A composition of $P_X$ and $P_{Y|X}$ is denoted by $P_X P_{Y|X}$, leading to the following joint probability distribution function of $X$ and $Y$:

$$(P_X P_{Y|X})(x, y) \triangleq P_X(x) P_{Y|X}(y|x), \tag{11}$$

for any pair $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

The Shannon entropy of a random variable $X$ is denoted by $H(P_X)$, and the Kullback-Leibler divergence of a pair of probability distribution functions $(P_X, P_Y)$ is denoted by $D(P_X \| P_Y)$. The mutual information of a pair of random variables $(X, Y)$ is denoted by $I(P_X, P_{Y|X})$. The similar conditional functionals of the entropy, divergence and mutual information are defined by an expectation over the a-priori distribution: the conditional entropy of a random variable $X$ given a random variable $Z$ over an alphabet $\mathcal{Z}$ is denoted by

$$H(P_{X|Z} \mid P_Z) \triangleq \sum_{z \in \mathcal{Z}} P_Z(z) \sum_{x \in \mathcal{X}} P_{X|Z}(x|z) \log \frac{1}{P_{X|Z}(x|z)}. \tag{12}$$

The divergence of a pair of conditional probability distribution functions $P_{X|Z}$ and $P_{Y|Z}$ is denoted by

$$D(P_{X|Z} \| P_{Y|Z} \mid P_Z).$$

The conditional mutual information of a pair of random variables $(X, Y)$ given a random variable $Z$ is denoted by

$$I(P_{X|Z}, P_{Y|X,Z} \mid P_Z),$$

and notice that it is equal to

$$H(P_{X|Z} \mid P_Z) - H(P_{X|Y,Z} \mid P_Z P_{Y|Z}).$$

If there is a Markov chain $Z \leftrightarrow X \leftrightarrow Y$, then we can omit the $Z$ from $P_{Y|X,Z}$ and the expression becomes

$$I(P_{X|Z}, P_{Y|X} \mid P_Z).$$

Since we concentrate on a binary case, we need the following. Denote the binary divergence of a pair of probabilities $(p, q)$ by

$$D_b(p \| q) \triangleq p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}, \tag{13}$$

which is the Kullback-Leibler divergence of the pair of probability distributions $((p, 1-p), (q, 1-q))$. Denote the binary entropy of $p$ by

$$H_b(p) \triangleq p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}, \tag{14}$$

which is the entropy function of the probability distribution $(p, 1-p)$. Denote the Gilbert-Varshamov relative distance of a code of rate $R$ by

$$\delta_{GV}(R) \triangleq H_b^{-1}(1 - R). \tag{15}$$

The operator $\oplus$ denotes addition over the binary field. The operator $\ominus$ is equivalent to the $\oplus$ operator over the binary field, but nevertheless, we keep both for the sake of consistency.

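The Hamming weight of a vector $\mathbf{u} = (u_1, \ldots, u_n) \in \{0, 1\}^n$ is denoted by

$$w_H(\mathbf{u}) = \sum_{i=1}^{n} \mathbb{1}\{u_i = 1\}, \tag{16}$$

where $\mathbb{1}\{\cdot\}$ denotes the indicator function, and the sum is over the reals. The normalized Hamming weight of this vector is denoted by

$$\delta_H(\mathbf{u}) = \frac{1}{n} w_H(\mathbf{u}). \tag{17}$$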

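Denote the $n$-dimensional Hamming ball with center $\mathbf{c}$ and normalized radius $r$ by

$$B_n(\mathbf{c}, r) \triangleq \left\{ \mathbf{x} \in \{0, 1\}^n \mid \delta_H(\mathbf{x} \ominus \mathbf{c}) \le r \right\}. \tag{18}$$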

The binary convolution of $p$ and $q$ is defined by

$$p * q \triangleq (1 - p) q + p (1 - q). \tag{19}$$
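As a small illustration of definitions (16)-(19), the following sketch implements the Hamming weight, the Hamming-ball membership test and the binary convolution for vectors given as lists of 0/1 integers. The function names are illustrative assumptions. The binary convolution equals the probability that the modulo-two sum of two independent Bernoulli variables with parameters $p$ and $q$ is one.

```python
def hamming_weight(u):
    """Number of ones in the binary vector u, as in (16)."""
    return sum(1 for ui in u if ui == 1)

def normalized_hamming_weight(u):
    """Hamming weight divided by the vector length, as in (17)."""
    return hamming_weight(u) / len(u)

def in_hamming_ball(x, c, r):
    """True if x lies in the Hamming ball of normalized radius r around c, as in (18)."""
    difference = [xi ^ ci for xi, ci in zip(x, c)]
    return normalized_hamming_weight(difference) <= r

def binary_convolution(p, q):
    """p * q = (1 - p) q + p (1 - q), as in (19)."""
    return (1 - p) * q + p * (1 - q)

if __name__ == "__main__":
    print(hamming_weight([1, 0, 1, 1]))
    print(in_hamming_ball([1, 0, 1, 0], [1, 1, 1, 0], 0.25))
    print(binary_convolution(0.1, 0.2))
```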
Definition 4 (Bernoulli Noise)

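A Bernoulli random variable $Z$ with $P(Z = 1) = p$ is denoted by $Z \sim \mathrm{Ber}(p)$. An $n$-dimensional random vector $\mathbf{Z}$ with i.i.d. entries $Z_i \sim \mathrm{Ber}(p)$ for $i = 1, \ldots, n$ is called a Bernoulli noise, and denoted by

$$\mathbf{Z} \sim \mathrm{BerV}(n, p). \tag{20}$$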
Definition 5 (Fixed-Type Noise)

Denote the set of vectors with type $a$ by

$$T_n(a) \triangleq \left\{ \mathbf{x} \in \{0, 1\}^n : \delta_H(\mathbf{x}) = a \right\}. \tag{21a}$$

A noise (21b)

is called an $n$-dimensional fixed-type noise of type $a$.

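For any two sequences, $\{a_n\}$ and $\{b_n\}$, we write $a_n \doteq b_n$ if $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$. We write $a_n \mathrel{\dot{\le}} b_n$ if $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} \le 0$.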

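For any two sequences of random vectors $\mathbf{X}_n, \mathbf{Y}_n \in \mathcal{X}^n$ ($n = 1, 2, \ldots$), we write

$$\mathbf{X}_n \overset{D}{\doteq} \mathbf{Y}_n \tag{22}$$

if

$$P_{\mathbf{X}_n}(\mathbf{x}_n) \doteq P_{\mathbf{Y}_n}(\mathbf{x}_n) \tag{23}$$

uniformly over $\mathbf{x}_n \in \mathcal{X}^n$, that is,

$$\lim_{n \to \infty} \frac{1}{n} \log \frac{P_{\mathbf{X}_n}(\mathbf{x}_n)}{P_{\mathbf{Y}_n}(\mathbf{x}_n)} = 0 \tag{24}$$

uniformly over $\mathbf{x}_n \in \mathcal{X}^n$. We write $\mathbf{X}_n \overset{D}{\mathrel{\dot{\le}}} \mathbf{Y}_n$ if

$$\lim_{n \to \infty} \frac{1}{n} \log \frac{P_{\mathbf{X}_n}(\mathbf{x}_n)}{P_{\mathbf{Y}_n}(\mathbf{x}_n)} \le 0 \tag{25}$$

uniformly over $\mathbf{x}_n \in \mathcal{X}^n$.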

The set of non-negative integers is denoted by $\mathbb{Z}_+$, and the set of natural numbers, i.e., $\{1, 2, \ldots\}$, by $\mathbb{N}$.

II-C Some Basic Results

When the rate is not constrained, the decision function has access to the full source sequences. The optimal tradeoff of the two types of errors is given by the following decision function, depending on the parameter $T$ (Neyman-Pearson [NeymanPearson1933]):

$$\psi(\mathbf{x}, \mathbf{y}) = \begin{cases} 0, & P^{(0)}_{X,Y}(\mathbf{x}, \mathbf{y}) \ge T \cdot P^{(1)}_{X,Y}(\mathbf{x}, \mathbf{y}) \\ 1, & \text{otherwise.} \end{cases}$$

(In order to achieve the full Neyman-Pearson tradeoff, special treatment of the case of equality is needed. As this issue has no effect on error exponents, we ignore it.)
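For the DSBS, the likelihood-ratio test with threshold $T$ reduces to a threshold test on the noise weight. The short derivation below is a worked sketch under the additional assumption $0 < p_0 < p_1 \le 1/2$; the symbols $w$ and $t$ are introduced here for illustration only.

```latex
% Let z = x \ominus y and w = w_H(z). Under hypothesis i, X is uniform and Z ~ BerV(n, p_i), so
\begin{align*}
P^{(i)}_{X,Y}(\mathbf{x}, \mathbf{y}) &= 2^{-n}\, p_i^{\,w} (1 - p_i)^{\,n - w}, \qquad i = 0, 1. \\
\intertext{Taking logarithms of the condition $P^{(0)}_{X,Y} \ge T \cdot P^{(1)}_{X,Y}$ gives}
w \log \frac{p_0 (1 - p_1)}{p_1 (1 - p_0)} + n \log \frac{1 - p_0}{1 - p_1} &\ge \log T. \\
\intertext{Since $p_0 < p_1 \le 1/2$, the coefficient of $w$ is negative, so the test decides $0$ if and only if}
\delta_H(\mathbf{z}) = \frac{w}{n} &\le t \triangleq
\frac{\log \frac{1 - p_0}{1 - p_1} - \frac{1}{n} \log T}{\log \frac{p_1 (1 - p_0)}{p_0 (1 - p_1)}}.
\end{align*}
```

This is consistent with the earlier remark that the weight of the noise sequence is a sufficient statistic: varying $T$ only moves the weight threshold $t$.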
