On Binary Distributed Hypothesis Testing
Abstract
We consider the problem of distributed binary hypothesis testing of two sequences that are generated by an i.i.d. doubly symmetric binary source. Each sequence is observed by a different terminal. The two hypotheses correspond to different levels of correlation between the two source components, i.e., different crossover probabilities between the two. The terminals communicate with a decision function via rate-limited noiseless links. We analyze the tradeoff between the exponential decay of the two error probabilities associated with the hypothesis test and the communication rates. We first consider the side-information setting where one encoder is allowed to send the full sequence. For this setting, previous work exploits the fact that a decoding error of the source does not necessarily lead to an erroneous decision upon the hypothesis. We provide improved achievability results by carrying out a tighter analysis of the effect of binning errors; the results are also more complete as they cover the full exponent tradeoff and all possible correlations. We then turn to the setting of symmetric rates, for which we utilize Körner-Marton coding to generalize the results, with little degradation with respect to the performance with a one-sided constraint (side-information setting).
I Introduction
We consider the distributed hypothesis testing (DHT) problem, where there are two distributed sources, $X$ and $Y$, and the hypotheses are given by
(1a) $H_0:\ (X,Y) \sim P^{(0)}_{XY}$
(1b) $H_1:\ (X,Y) \sim P^{(1)}_{XY}$
where $P^{(0)}_{XY}$ and $P^{(1)}_{XY}$ are different joint distributions of $X$ and $Y$. The test is performed based on information sent from two distributed terminals (over noiseless links), each observing i.i.d. realizations of a different source, where the rate of the information sent from each terminal is constrained. This setup, introduced in [Berger1979Spring, AhlswedeCsiszar1981], gives rise to a tradeoff between the information rates and the probabilities of the two types of error events. In this work we focus on the exponents of these error probabilities with respect to the number of observations $n$.
When at least one of the marginal distributions depends on the hypothesis, a test can be constructed based only on the type of the corresponding sequence. Although this test may not be optimal, it results in non-trivial performance (positive error exponents) with zero rate. In contrast, when the marginal distributions are the same under both hypotheses, a positive exponent cannot be achieved using a zero-rate scheme, see [ShalabyP1992].
One may achieve positive exponents while maintaining low rates, by effectively compressing the sources and then basing the decision upon their compressed versions. Indeed, many of the works that have considered the distributed hypothesis testing problem bear close relation to the distributed compression problem.
Ahlswede and Csiszár [AhlswedeCsiszar1986] have suggested a scheme based on compression without taking advantage of the correlation between the sources; Han [Han1987] proposed an improved scheme along the same lines. Correlation between the sources is exploited by Shimokawa et al. [ShimokawaHanAmari1994ISIT, Shimokawa:1994:MScThesis] to further reduce the coding rate, incorporating random binning following the Slepian-Wolf [SlepianWolf73] and Wyner-Ziv [WynerZiv76] schemes. Rahman and Wagner [RahmanWagner2012] generalized this setting and also derived an outer bound. They also give a "quantize and bin" interpretation to the results of [ShimokawaHanAmari1994ISIT]. Other related works include [HanKobayashi1989, Amari2011, Polyanskiy2012ISIT, Katz2017, Katz2016Asilomar]. See [HanAmari1998, RahmanWagner2012] for further references.
We note that in spite of considerable efforts over the years, the problem remains open. In many cases, the gap between the achievability results and the few known outer bounds is still large. Specifically, some of the stronger results are specific to testing against independence (i.e., under one of the hypotheses $X$ and $Y$ are independent), or specific to the case where one of the error exponents is zero (the "Stein's-lemma" setting). The present work goes significantly beyond previous works, extending and improving the achievability bounds. Nonetheless, the refined analysis comes at a price: in order to facilitate analysis, we choose to restrict attention to a simple source model.
To that end, we consider the case where $(X, Y)$ is a doubly symmetric binary source (DSBS). That is, $X$ and $Y$ are each binary and symmetric. Let $Z = X \oplus Y$ be the modulo-two difference between the sources.^{1}Notice that in this binary case, the uniform marginals mean that $Z$ is necessarily independent of $X$. We consider the following two hypotheses:
(2a) $H_0:\ Z \sim \mathrm{Ber}(p_0)$
(2b) $H_1:\ Z \sim \mathrm{Ber}(p_1)$
where we assume throughout that $0 \le p_0 < p_1 \le 1/2$. Note that a sufficient statistic for hypothesis testing in this case is the weight (which is equivalent to the type) of the noise sequence Z. Under communication rate constraints, a plausible approach would be to use a distributed compression scheme that allows lossy reconstruction of the sequence Z, and then base the decision upon that sequence.
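Since the weight of Z is a sufficient statistic, the unconstrained test amounts to comparing the normalized weight of the noise sequence to a threshold lying between $p_0$ and $p_1$. The following is a minimal Monte Carlo sketch of this test; all parameter values ($n$, $p_0$, $p_1$, the threshold) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000              # blocklength (illustrative)
p0, p1 = 0.05, 0.25   # crossover probabilities under H0 / H1 (illustrative)
t = 0.15              # weight threshold, chosen between p0 and p1
trials = 2000

def decide(z, t):
    """Decide H0 iff the normalized Hamming weight of z is below the threshold."""
    return 0 if z.mean() < t else 1

# X is uniform; Y = X xor Z, with Z i.i.d. Bernoulli(p_h) under hypothesis h.
x = rng.integers(0, 2, size=(trials, n))
z0 = (rng.random((trials, n)) < p0).astype(int)   # noise under H0
z1 = (rng.random((trials, n)) < p1).astype(int)   # noise under H1
y0, y1 = x ^ z0, x ^ z1

# The decision only depends on z = x xor y (the reconstructed noise).
eps0 = np.mean([decide(x[i] ^ y0[i], t) == 1 for i in range(trials)])  # type-0 error
eps1 = np.mean([decide(x[i] ^ y1[i], t) == 0 for i in range(trials)])  # type-1 error
```

For a blocklength of $n = 1000$ both empirical error probabilities are essentially zero, reflecting the exponential decay of the two error probabilities in $n$.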
We first consider a one-sided rate constraint. That is, the $Y$ encoder is allocated the full rate of one bit per source sample, so that the Y sequence is available as side information at the decision function. In this case, compression of Z amounts to compression of X; a random binning scheme is optimal for this task of compression, lossless or lossy.^{2}More precisely, it gives the optimal coding rates, as well as the best known error exponents when the rate is not too high. Indeed, in this case, the best known achievability result is due to [ShimokawaHanAmari1994ISIT], which basically employs a random binning scheme.^{3}Interestingly, when $p_1 = 1/2$ (testing against independence), the simple scheme of [AhlswedeCsiszar1986], which ignores the side-information altogether, is optimal.
A natural question that arises when using binning as part of the distributed hypothesis testing scheme is the effect of a "bin decoding error" on the decision error between the hypotheses. The connection between these two errors is non-trivial, as a bin decoding error inherently results in a "large" noise reconstruction error, much in common with errors in channel coding (in the context of syndrome decoding). Specifically, when a binning error occurs, the reconstruction of the noise sequence Z is roughly consistent with an i.i.d. Bernoulli distribution. Thus, if one feeds the weight of this reconstructed sequence to a simple threshold test, it would typically result in deciding in favor of the hypothesis with the larger crossover probability, regardless of whether that is the true distribution or not. This effect causes an asymmetry between the two error probabilities associated with the hypothesis test. Indeed, as the Stein exponent corresponds to highly asymmetric error probabilities, the exponent derived in [ShimokawaHanAmari1994ISIT] may be interpreted as taking advantage of this effect.^{4}Another interesting direction, not pursued in this work, is to change the problem formulation to allow declaring an "erasure" when the probability of a bin decoding error exceeds a certain threshold.
The contribution of the present work is twofold. First, we extend and strengthen the results of [ShimokawaHanAmari1994ISIT]. By explicitly considering and leveraging the properties of good codes, we bound the probability that the sequence Z happens to be such that the observation is very close to some wrong yet "legitimate" $X$ sequence, much like an undetected error event in erasure decoding [Forney68]. This allows us to derive achievability results for the full tradeoff region, namely the tradeoff between the error exponents corresponding to the two types of hypothesis testing errors.
The second contribution is in considering a symmetric-rate constraint. For this case, the optimal distributed compression scheme for $Z$ is the Körner-Marton scheme [KornerMarton79], which requires each of the users to communicate at a rate $H(Z)$; hence, the sum-rate is strictly smaller than that of Slepian-Wolf, unless $Z$ is symmetric. Thus, the Körner-Marton scheme is a natural candidate for this setting. Indeed, it was observed in [AhlswedeCsiszar1986, HanAmari1998] that a standard information-theoretic solution such as Slepian-Wolf coding may not always be the way to go, and [HanAmari1998] mentions the Körner-Marton scheme in this respect. Further, Shimokawa and Amari [ShimokawaAmari:ISIT:1995] point out the possible application of the Körner-Marton scheme to distributed parameter estimation in a similar setting, and a similar observation is made in [GamalL15AllertonArxiv]. However, to the best of our knowledge, the present work is the first to propose an actual Körner-Marton-based scheme for distributed hypothesis testing and to analyze its performance. Notably, the performance tradeoff obtained nearly recovers the achievable tradeoff derived for a one-sided constraint.
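To illustrate the Körner-Marton idea in miniature (this is not the scheme analyzed in the paper, which uses good linear codes of large blocklength): each terminal sends only the syndrome of its observation with respect to a common linear code. By linearity, the XOR of the two syndromes equals the syndrome of $Z = X \oplus Y$, from which a low-weight $Z$ can be decoded without ever reconstructing $X$ or $Y$. A sketch with the (7,4) Hamming code, whose syndrome decoder corrects any single error:

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code: column j holds the binary
# expansion of j+1 (MSB first), so the syndrome of a single error at
# position j spells out the number j+1.
H = np.array([[(j + 1) >> k & 1 for j in range(7)] for k in (2, 1, 0)])

def syndrome(v):
    return H.dot(v) % 2

def decode_noise(s):
    """Return the minimum-weight (weight <= 1) noise pattern with syndrome s."""
    z = np.zeros(7, dtype=int)
    pos = 4 * s[0] + 2 * s[1] + s[2]
    if pos > 0:
        z[pos - 1] = 1
    return z

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=7)           # arbitrary source sequence
z = np.zeros(7, dtype=int); z[3] = 1     # noise of weight 1
y = (x + z) % 2

# Each terminal communicates only 3 bits (its syndrome) instead of 7.
s_x, s_y = syndrome(x), syndrome(y)
s_z = (s_x + s_y) % 2                    # equals syndrome(z) by linearity
z_hat = decode_noise(s_z)
# The decision function can now apply a weight threshold to z_hat.
```

Here the weight-1 noise is recovered exactly from the two syndromes; in the regime of interest, longer codes with rate close to $H(Z)$ play the role of the Hamming code.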
The rest of this paper is organized as follows. In Section II we formally state the problem, define notations and present some basic results. Sections LABEL:sec:related_results and LABEL:sec:linear_codes_and_EE provide necessary background: the former surveys known results for the case of a one-sided rate constraint, while the latter provides definitions and properties of good linear codes. In Section LABEL:sec:one_user_constrained_case we present the derivation of a new achievable exponent tradeoff region. Then, in Section LABEL:sec:symmetric_rate_constraint we present our results for a symmetric-rate constraint. Numerical results and comparisons appear in Section LABEL:sec:performance_comparison. Finally, Section LABEL:sec:future_work concludes the paper.
II Problem Statement and Notations
II-A Problem Statement
Consider the setup depicted in Figure 1. X and Y are random vectors of blocklength $n$, drawn from the (finite) source alphabets $\mathcal{X}^n$ and $\mathcal{Y}^n$, respectively. Recalling the hypothesis testing problem (1), we have two possible i.i.d. distributions. In the sequel we take a less standard notational approach, and define the hypotheses by a random variable $H$ which takes the values $\{0, 1\}$; thus $H = i$ refers to $H_i$ of (1) and (2).^{5}We do not assume any given distribution over $H$, as we are always interested in probabilities given the hypotheses. We still use for the distribution under $H = i$ (for $i \in \{0, 1\}$) the shortened notation $P^{(i)}$. Namely, for any $x \in \mathcal{X}^n$ and $y \in \mathcal{Y}^n$, and for $i \in \{0, 1\}$,
$P^{(i)}(x, y) \triangleq \Pr(X = x, Y = y \mid H = i) = \prod_{j=1}^{n} P^{(i)}_{XY}(x_j, y_j).$
A scheme for the problem is defined as follows.
Definition 1
A scheme $(\phi_X, \phi_Y, \psi)$ consists of encoders $\phi_X$ and $\phi_Y$, which are mappings from the set of length-$n$ source vectors to the message sets $\mathcal{M}_X$ and $\mathcal{M}_Y$:
(3a) $\phi_X: \mathcal{X}^n \to \mathcal{M}_X$
(3b) $\phi_Y: \mathcal{Y}^n \to \mathcal{M}_Y$
and a decision function, which is a mapping from the set of possible message pairs to one of the hypotheses:
(4) $\psi: \mathcal{M}_X \times \mathcal{M}_Y \to \{0, 1\}.$
Definition 2
For a given scheme $(\phi_X, \phi_Y, \psi)$, denote the decision given the pair $(x, y)$ by
(5) $\hat{H}(x, y) = \psi(\phi_X(x), \phi_Y(y)).$
The decision error probabilities of the scheme are given by
(6a) $\epsilon_0 \triangleq \Pr(\hat{H} = 1 \mid H = 0)$
(6b) $\epsilon_1 \triangleq \Pr(\hat{H} = 0 \mid H = 1).$
Definition 3
For any $R_X, R_Y \ge 0$, the exponent pair $(E_0, E_1)$ is said to be achievable at rates $(R_X, R_Y)$ if there exists a sequence of schemes
(7) $(\phi_X^{(n)}, \phi_Y^{(n)}, \psi^{(n)}), \quad n = 1, 2, \ldots$
with corresponding sequences of message sets $\mathcal{M}_X^{(n)}$ and $\mathcal{M}_Y^{(n)}$ and error probabilities $\epsilon_0^{(n)}$, $\epsilon_1^{(n)}$, such that^{6}All logarithms are taken to base 2, and all rates are in units of bits per sample.
(8a) $\limsup_{n \to \infty} \frac{1}{n} \log |\mathcal{M}_X^{(n)}| \le R_X$
(8b) $\limsup_{n \to \infty} \frac{1}{n} \log |\mathcal{M}_Y^{(n)}| \le R_Y$
and
(8c) $\liminf_{n \to \infty} -\frac{1}{n} \log \epsilon_i^{(n)} \ge E_i, \quad i = 0, 1.$
The achievable exponent region is the closure of the set of all achievable exponent pairs.^{7}For simplicity of notation we omit here and in subsequent definitions the explicit dependence on the distributions $P^{(0)}_{XY}, P^{(1)}_{XY}$.
The case where only one of the error probabilities decays exponentially is of special interest; we call the resulting quantity the Stein exponent, after Stein's lemma (see, e.g., [CoverBook, Chapter 12]). When $\epsilon_0$ is required to decay exponentially while $\epsilon_1$ is only required to vanish, the Stein exponent is defined as:
(9) $\sigma_0(R_X, R_Y) \triangleq \sup \{E_0 : (E_0, 0) \text{ is achievable at rates } (R_X, R_Y)\}.$
$\sigma_1$ is defined similarly.
We will concentrate in this work on two special cases of rate constraints, for which the notation can be made more concise.
- One-sided constraint, where only $R_X = R$ is constrained ($R_Y$ is unconstrained). We shall denote the achievable region and Stein exponents as , and .
- Symmetric constraint, where $R_X = R_Y = R$. We shall denote the achievable region and Stein exponents as , and .
Note that for any $R$, the region achievable under a symmetric constraint is contained in the region achievable under a one-sided constraint.
Whenever considering a specific source distribution, we will take $(X, Y)$ to be a DSBS. Recalling (2), that means that $X$ and $Y$ are binary and symmetric, and the "noise" $Z = X \oplus Y$ satisfies:
(10) $Z \sim \mathrm{Ber}(p_j) \quad \text{under hypothesis } H_j, \ j \in \{0, 1\},$
for some parameters $0 \le p_0 < p_1 \le 1/2$ (note that there is no loss of generality in assuming that both probabilities are on the same side of $1/2$).
II-B Further Notations
The following notations of probability distribution functions are demonstrated for random variables $A$ and $B$ over alphabets $\mathcal{A}$ and $\mathcal{B}$, respectively. The probability distribution function of a random variable $A$ is denoted by $P_A$, and the conditional probability distribution function of a random variable $B$ given a random variable $A$ is denoted by $P_{B|A}$. A composition of $P_A$ and $P_{B|A}$ is denoted by $P_A \times P_{B|A}$, leading to the following joint probability distribution function of $A$ and $B$:
(11) $P_{AB}(a, b) = P_A(a) \cdot P_{B|A}(b \mid a)$
for any pair $a \in \mathcal{A}$ and $b \in \mathcal{B}$.
The Shannon entropy of a random variable $A$ is denoted by $H(A)$, and the Kullback-Leibler divergence of a pair of probability distribution functions $P$ and $Q$ is denoted by $D(P \| Q)$. The mutual information of a pair of random variables $A$ and $B$ is denoted by $I(A; B)$. The corresponding conditional functionals of the entropy, divergence and mutual information are defined by an expectation over the a-priori distribution: the conditional entropy of a random variable $B$ given a random variable $A$ is denoted by
(12) $H(B \mid A) = \sum_{a \in \mathcal{A}} P_A(a) H(B \mid A = a).$
The divergence of a pair of conditional probability distribution functions $P_{B|A}$ and $Q_{B|A}$ is denoted by
$D(P_{B|A} \| Q_{B|A} \mid P_A) = \sum_{a \in \mathcal{A}} P_A(a) D(P_{B|A=a} \| Q_{B|A=a}).$
The conditional mutual information of a pair of random variables $A$ and $B$ given a random variable $C$ is denoted by
$I(A; B \mid C) = \sum_{c \in \mathcal{C}} P_C(c) I(A; B \mid C = c),$
and notice that it is equal to
$I(A; B \mid C) = H(B \mid C) - H(B \mid A, C).$
If there is a Markov chain $C - A - B$, then we can omit the $C$ from the last conditional entropy and the expression becomes
$I(A; B \mid C) = H(B \mid C) - H(B \mid A).$
Since we concentrate on a binary case, we need the following. Denote the binary divergence of a pair $(q, p)$, where $0 \le q, p \le 1$, by
(13) $D_b(q \| p) \triangleq q \log \frac{q}{p} + (1 - q) \log \frac{1 - q}{1 - p},$
which is the Kullback-Leibler divergence of the pair of probability distributions $(q, 1-q)$ and $(p, 1-p)$. Denote the binary entropy of $0 \le p \le 1$ by
(14) $h_b(p) \triangleq -p \log p - (1 - p) \log (1 - p),$
which is the entropy function of the probability distribution $(p, 1-p)$. Denote the Gilbert-Varshamov relative distance of a code of rate $R$, $0 \le R \le 1$, by
(15) $\delta_{GV}(R) \triangleq h_b^{-1}(1 - R),$
where $h_b^{-1}: [0, 1] \to [0, 1/2]$ is the inverse of the binary entropy function restricted to $[0, 1/2]$.
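These quantities are straightforward to compute numerically; $\delta_{GV}$ has no closed form, but since $h_b$ is strictly increasing on $[0, 1/2]$ it can be inverted by bisection. A sketch (function names are ours, logarithms to base 2 as in the paper):

```python
from math import log2

def h_b(p):
    """Binary entropy in bits; h_b(0) = h_b(1) = 0 by continuity."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def d_b(q, p):
    """Binary divergence D_b(q || p) in bits (assumes 0 < p < 1)."""
    out = 0.0
    if q > 0:
        out += q * log2(q / p)
    if q < 1:
        out += (1 - q) * log2((1 - q) / (1 - p))
    return out

def delta_gv(R, tol=1e-12):
    """Gilbert-Varshamov distance: the delta in [0, 1/2] with h_b(delta) = 1 - R."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h_b(mid) < 1 - R:
            lo = mid     # entropy too small: delta lies to the right
        else:
            hi = mid
    return (lo + hi) / 2
```

Sanity checks: $\delta_{GV}(0) = 1/2$, $\delta_{GV}(1) = 0$, and $D_b(q \| p) = 0$ iff $q = p$.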
The operator $\oplus$ denotes addition over the binary field. The operator $\ominus$ is equivalent to the operator $\oplus$ over the binary field, but nevertheless, we keep both for the sake of consistency.
The Hamming weight of a vector $z \in \{0, 1\}^n$ is denoted by
(16) $w_H(z) \triangleq \sum_{i=1}^{n} \mathbb{1}\{z_i \neq 0\},$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function, and the sum is over the reals. The normalized Hamming weight of this vector is denoted by
(17) $w(z) \triangleq \frac{1}{n} w_H(z).$
Denote the $n$-dimensional Hamming ball with center c and normalized radius $\delta$ by
(18) $\mathcal{B}(c, \delta) \triangleq \{z \in \{0, 1\}^n : w(z \ominus c) \le \delta\}.$
The binary convolution of $p, q \in [0, 1]$ is defined by
(19) $p * q \triangleq p(1 - q) + (1 - p)q.$
Definition 4 (Bernoulli Noise)
A Bernoulli random variable $Z$ with $\Pr(Z = 1) = p$ is denoted by $Z \sim \mathrm{Ber}(p)$. An $n$-dimensional random vector Z with i.i.d. entries $Z_i \sim \mathrm{Ber}(p)$ for $i = 1, \ldots, n$ is called a Bernoulli noise, and denoted by
(20) $\mathbf{Z} \sim \mathrm{Ber}(n, p).$
Definition 5 (Fixed-Type Noise)
Denote the set of $n$-dimensional binary vectors with type (normalized Hamming weight) $q$ by
(21a) $\mathcal{T}(n, q) \triangleq \{z \in \{0, 1\}^n : w(z) = q\}.$
A noise
(21b) $\mathbf{Z} \sim \mathrm{Unif}(\mathcal{T}(n, q))$
is called an $n$-dimensional fixed-type noise of type $q$.
For any two sequences, $a_n$ and $b_n$, we write $a_n \doteq b_n$ if $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$. We write $a_n \mathrel{\dot\le} b_n$ if $\limsup_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} \le 0$.
For any two sequences of random vectors $\mathbf{Z}_n, \tilde{\mathbf{Z}}_n$ ($n = 1, 2, \ldots$), we write
(22) $\mathbf{Z}_n \doteq \tilde{\mathbf{Z}}_n$
if
(23) $\Pr(\mathbf{Z}_n = z) \doteq \Pr(\tilde{\mathbf{Z}}_n = z)$
uniformly over $z$, that is,
(24) $\lim_{n \to \infty} \frac{1}{n} \log \frac{\Pr(\mathbf{Z}_n = z)}{\Pr(\tilde{\mathbf{Z}}_n = z)} = 0$
uniformly over $z$. We write $\mathbf{Z}_n \mathrel{\dot\le} \tilde{\mathbf{Z}}_n$ if
(25) $\limsup_{n \to \infty} \frac{1}{n} \log \frac{\Pr(\mathbf{Z}_n = z)}{\Pr(\tilde{\mathbf{Z}}_n = z)} \le 0$
uniformly over $z$.
The set of non-negative integers is denoted by $\mathbb{Z}^+$, and the set of natural numbers, i.e., $\{1, 2, \ldots\}$, by $\mathbb{N}$.
II-C Some Basic Results
When the rate is not constrained, the decision function has access to the full source sequences. The optimal tradeoff between the two types of errors is given by the following decision function, depending on the parameter $T$ (Neyman-Pearson [NeymanPearson1933]):^{8}In order to achieve the full Neyman-Pearson tradeoff, special treatment of the case of equality is needed. As this issue has no effect on error exponents, we ignore it.
$\hat{H}(x, y) = \begin{cases} 0, & P^{(0)}_{X,Y}(x, y) \ge T \cdot P^{(1)}_{X,Y}(x, y) \\ 1, & \text{otherwise.} \end{cases}$
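For the DSBS, $P^{(i)}(x, y) = 2^{-n} p_i^{w_H(z)} (1 - p_i)^{n - w_H(z)}$ with $z = x \oplus y$, so the likelihood ratio is monotone in the weight of $z$ and the Neyman-Pearson test reduces to a threshold on that weight, as noted above. A quick numerical check of this equivalence (parameter values are illustrative):

```python
from math import log

n = 20
p0, p1 = 0.1, 0.3   # illustrative crossover probabilities, p0 < p1
T = 1.0             # Neyman-Pearson threshold

def log_lr(w):
    """log of P0/P1 for a pair (x, y) with w = Hamming weight of x xor y.
    The 2^{-n} factor from the uniform marginal of X cancels."""
    return w * log(p0 / p1) + (n - w) * log((1 - p0) / (1 - p1))

# log_lr is strictly decreasing in w (since p0 < p1), so the rule
# "decide H0 iff P0 >= T * P1" is equivalent to "decide H0 iff w <= t"
# for a suitable integer threshold t.
np_decisions = [0 if log_lr(w) >= log(T) else 1 for w in range(n + 1)]
t = max((w for w in range(n + 1) if np_decisions[w] == 0), default=-1)
assert np_decisions == [0 if w <= t else 1 for w in range(n + 1)]
```

Sweeping $T$ over the positive reals sweeps the threshold $t$ over all integer values, tracing out the full error tradeoff of Section II-C.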