Hypotheses testing by convex optimization

Abstract

We discuss a general approach to hypothesis testing. The main “building block” of the proposed construction is a test for a pair of hypotheses in the situation where each particular hypothesis states that the vector of parameters identifying the distribution of observations belongs to a convex compact set associated with the hypothesis. This test, under appropriate assumptions, is provably nearly optimal and is yielded by a solution to a convex optimization problem, so that the construction admits computationally efficient implementation. We further demonstrate that our assumptions are satisfied in several important and interesting applications. Finally, we show how our approach can be applied to rather general testing problems encompassing several classical statistical settings.

1 Introduction

In this paper we promote a unified approach to a class of decision problems, based on Convex Programming. Our main building block (which we believe is important in its own right) is a construction, based on Convex Programming (and thus computationally efficient), which allows, under appropriate assumptions, to build a provably nearly optimal test for deciding between a pair of composite hypotheses on the distribution of an observed random variable. Our approach is applicable in several important situations, primarily those where the observation (a) comes from a Gaussian distribution on $\mathbb R^m$ parameterized by its expectation, the covariance matrix being fixed once and for all, (b) is an $m$-dimensional vector with independent Poisson entries, parameterized by the collection of intensities of the entries, (c) is a randomly selected point from a given $m$-point set $\{1,...,m\}$, with the straightforward parametrization of the distribution by the vector of probabilities for the observation to take values $1,...,m$, (d) comes from a “direct product of the outlined observation schemes,” e.g., is a collection of independent realizations of a random variable described by (a)-(c). In contrast to the rather restrictive assumptions on the families of distributions we are able to handle, we are very flexible as far as the hypotheses are concerned: all we require from a hypothesis is that it corresponds to a convex and compact set in the “universe” of parameters of the family of distributions we are working with.

As a consequence, the spirit of the results to follow is quite different from that of a “classical” statistical inquiry, where one assumes that the signals underlying noisy observations belong to some “regularity classes” and the goal is to characterize analytically the minimax rates of detection for those classes. With our approach allowing for highly diverse hypotheses, an attempt to describe analytically the quality of a statistical routine seems to be pointless. For instance, in the two-hypotheses case, all we know in advance is that the test yielded by our construction, assuming the latter applicable, is provably nearly optimal, with explicit specification of what “nearly” means presented in Theorem 2.1.ii. By itself, this “near optimality” usually is not all we need — we would like to know what actually are the performance guarantees (say, probability of wrong detection, or the number of observations sufficient to make an inference satisfying given accuracy and/or reliability specifications). The point is that with our approach, rather detailed information of this sort can be obtained by efficient situation-oriented computation. In this respect our approach follows the one of [35, 36, 7, 9, 11, 37] where what we call below “simple tests” were used to test composite hypotheses represented by convex sets of distributions1; later this approach was successfully applied to nonparametric estimation of signals and functionals [10, 18, 19, 12]. On the other hand, what follows can be seen as a continuation of another line of research focusing on testing [14, 15, 31] and on a closely related problem of estimating linear functionals [29, 30, 17] in white noise model. In the present paper we propose a general framework which mirrors that of [32]. 
Here the novelty (to the best of our understanding, essential) is in applying the techniques of the latter paper to hypothesis testing rather than to estimating linear forms, which allows us to naturally encompass and extend the aforementioned approaches and to obtain provably good tests for the observation schemes mentioned in (a) – (d). We strongly believe that this approach makes it possible to handle a diverse spectrum of applications, and in this paper our focus is on efficiently implementable testing routines² and related elements of the “calculus of tests”.

The contents and organization of the paper are as follows. We start with near-optimal testing of pairs of hypotheses, both in its general form and for particular cases of (a) – (d) (section 2). We then demonstrate (section 3) that our tests (same as other tests of similar structure) for deciding on pairs of hypotheses are well suited for “aggregation,” via Convex Programming and simple Linear Algebra, into tests with efficiently computable performance guarantees deciding on composite hypotheses. In the concluding section 4 our focus is on applications. Here we illustrate the implementation of the approaches developed in the preceding sections by building models and carrying out numerical experimentation for several statistical problems including Positron Emission Tomography, detection and identification of signals in a convolution model, Markov chain related inferences, and some others.

In all experiments optimization was performed using Mosek optimization software [1]. The proofs missing in the main body of the paper can be found in the appendix.

2 Situation and Main result

In the sequel, given a parametric family $\mathcal P=\{P_\mu,\ \mu\in\mathcal M\}$ of probability distributions on a space $\Omega$ and an observation $\omega\sim P_\mu$ with unknown $\mu\in\mathcal M$, we intend to test some composite hypotheses about the parameter $\mu$. In the situation to be considered in this paper, provably near-optimal testing reduces to Convex Programming, and we start with describing this situation.

2.1 Assumptions and goal

In what follows, we make the following assumptions on our “observation environment:”

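1. $\mathcal M\subset\mathbb R^m$ is a convex set which coincides with its relative interior;

2. $\Omega$ is a Polish (i.e., separable complete metric) space equipped with a Borel $\sigma$-additive $\sigma$-finite measure $P$, $\mathrm{supp}(P)=\Omega$, and the distributions $P_\mu$, $\mu\in\mathcal M$, possess densities $p_\mu(\omega)$ w.r.t. $P$. We assume that

• $p_\mu(\omega)$ is continuous in $\mu\in\mathcal M$, $\omega\in\Omega$, and is positive;

• the densities $p_\mu(\cdot)$ are “locally uniformly summable:” for every compact set $M\subset\mathcal M$, there exists a Borel function $p^M(\cdot)$ on $\Omega$ such that $\int_\Omega p^M(\omega)P(d\omega)<\infty$ and $p_\mu(\omega)\le p^M(\omega)$ for all $\mu\in M$, $\omega\in\Omega$;

3. We are given a finite-dimensional linear space $\mathcal F$ of continuous functions on $\Omega$ containing constants such that $\ln(p_\mu(\cdot)/p_\nu(\cdot))\in\mathcal F$ whenever $\mu,\nu\in\mathcal M$.
Note that the latter assumption implies that the distributions $P_\mu$, $\mu\in\mathcal M$, belong to an exponential family.

4. For every $\phi\in\mathcal F$, the function $F_\phi(\mu)=\ln\left(\int_\Omega\exp\{\phi(\omega)\}p_\mu(\omega)P(d\omega)\right)$ is well defined and concave in $\mu\in\mathcal M$.

In the just described situation, where assumptions 1-4 hold, we refer to the collection $\mathcal O=((\Omega,P),\{p_\mu(\cdot):\mu\in\mathcal M\},\mathcal F)$ as a good observation scheme.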

Now suppose that, on top of a good observation scheme, we are given two nonempty convex compact sets $X\subset\mathcal M$, $Y\subset\mathcal M$. Given an observation $\omega\sim P_\mu$ with some unknown $\mu\in\mathcal M$ known to belong either to $X$ (hypothesis $H_X$) or to $Y$ (hypothesis $H_Y$), our goal is to decide which of the two hypotheses takes place. Let $T(\cdot)$ be a test, i.e. a Borel function on $\Omega$ taking values in $\{-1,1\}$, which receives on input an observation $\omega$ (along with the data participating in the description of $H_X$ and $H_Y$). Given observation $\omega$, the test accepts $H_X$ and rejects $H_Y$ when $T(\omega)=1$, and accepts $H_Y$ and rejects $H_X$ when $T(\omega)=-1$. The quality of the test is characterized by its error probabilities, that is, the probabilities of erroneously rejecting each of the hypotheses:

$$\epsilon_X=\sup_{x\in X}P_x\{\omega:T(\omega)=-1\},\qquad\epsilon_Y=\sup_{y\in Y}P_y\{\omega:T(\omega)=1\},$$

and we define the risk of the test as the maximal error probability:

$$\max[\epsilon_X,\epsilon_Y].$$

In the sequel, we focus on simple tests. By definition, a simple test is specified by a detector $\phi(\cdot)\in\mathcal F$; it accepts $H_X$, the observation being $\omega$, if $\phi(\omega)\ge0$, and accepts $H_Y$ otherwise. We define the risk of a detector $\phi$ on $(H_X,H_Y)$ as the smallest $\epsilon$ such that

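$$\int_\Omega\exp\{-\phi(\omega)\}p_x(\omega)P(d\omega)\le\epsilon\ \ \forall x\in X,\qquad\int_\Omega\exp\{\phi(\omega)\}p_y(\omega)P(d\omega)\le\epsilon\ \ \forall y\in Y.\tag{1}$$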

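For a simple test with detector $\phi$ we have

$$\epsilon_X=\sup_{x\in X}P_x\{\omega:\phi(\omega)<0\},\qquad\epsilon_Y=\sup_{y\in Y}P_y\{\omega:\phi(\omega)\ge0\},$$

and, by the Markov inequality applied to $\exp\{\mp\phi(\omega)\}$, the risk of such a test does not exceed the risk of the detector $\phi$.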
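To make the relation between the risk of a detector and the error probabilities of the associated simple test concrete, here is a minimal Python sketch on a three-point observation space. The function names, the detector values and the two finite families of distributions are illustrative assumptions and are not taken from the paper; suprema are taken over the listed distributions only.

```python
import math

def detector_risk(phi, X, Y):
    """Smallest eps satisfying (1) for a detector phi on {0, ..., m-1}.

    X and Y are finite lists of probability vectors (an assumption made
    for illustration), so the integrals in (1) become finite sums.
    """
    risk_x = max(sum(math.exp(-f) * p for f, p in zip(phi, x)) for x in X)
    risk_y = max(sum(math.exp(f) * q for f, q in zip(phi, y)) for y in Y)
    return max(risk_x, risk_y)

def simple_test_errors(phi, X, Y):
    """Error probabilities of the simple test: accept H_X iff phi(omega) >= 0."""
    eps_x = max(sum(p for f, p in zip(phi, x) if f < 0) for x in X)
    eps_y = max(sum(q for f, q in zip(phi, y) if f >= 0) for y in Y)
    return eps_x, eps_y

phi = [1.0, 0.0, -1.0]                     # hypothetical detector values
X = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]     # hypothetical distributions under H_X
Y = [[0.1, 0.2, 0.7]]                      # hypothetical distribution under H_Y
eps_x, eps_y = simple_test_errors(phi, X, Y)
risk = detector_risk(phi, X, Y)
print(eps_x, eps_y, risk)
```

In this toy instance both error probabilities stay below the detector risk, as the bound above predicts; the gap also shows that the detector risk can be a conservative estimate of the actual errors.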

2.2 Main result

We are about to show that in the situation in question, a detector which is efficiently computable via Convex Programming results in a nearly optimal test. The precise statement is as follows:

Theorem 2.1

In the just described situation and under the above assumptions,

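(i) The function

$$\Phi(\phi,[x;y])=\ln\left(\int_\Omega\exp\{-\phi(\omega)\}p_x(\omega)P(d\omega)\right)+\ln\left(\int_\Omega\exp\{\phi(\omega)\}p_y(\omega)P(d\omega)\right):\ \mathcal F\times(X\times Y)\to\mathbb R\tag{2}$$

is continuous on its domain, is convex in $\phi(\cdot)\in\mathcal F$, concave in $[x;y]\in X\times Y$, and possesses a saddle point $(\phi_*(\cdot),[x_*;y_*])$ (min in $\phi\in\mathcal F$, max in $[x;y]\in X\times Y$) on $\mathcal F\times(X\times Y)$. W.l.o.g. $\phi_*$ can be assumed to satisfy the relation³

$$\int_\Omega\exp\{-\phi_*(\omega)\}p_{x_*}(\omega)P(d\omega)=\int_\Omega\exp\{\phi_*(\omega)\}p_{y_*}(\omega)P(d\omega).\tag{3}$$

Denoting the common value of the two quantities in (3) by $\varepsilon_\star$, the saddle point value

$$\min_{\phi\in\mathcal F}\max_{[x;y]\in X\times Y}\Phi(\phi,[x;y])$$

is $2\ln(\varepsilon_\star)$, and the risk of the simple test associated with the detector $\phi_*$ on the composite hypotheses $H_X$, $H_Y$ is $\le\varepsilon_\star$. Moreover, for every $a\in\mathbb R$, for the test with the detector $\phi_*^a(\cdot)\equiv\phi_*(\cdot)-a$, the probabilities $\epsilon_X$ to reject $H_X$ when the hypothesis is true and $\epsilon_Y$ to reject $H_Y$ when the hypothesis is true can be upper-bounded as

$$\epsilon_X\le\exp\{a\}\varepsilon_\star,\qquad\epsilon_Y\le\exp\{-a\}\varepsilon_\star.\tag{4}$$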

(ii) Let $\epsilon\ge0$ be such that there exists a (whatever) test for deciding between two simple hypotheses

$$(A):\ \omega\sim p(\cdot):=p_{x_*}(\cdot),\qquad(B):\ \omega\sim q(\cdot):=p_{y_*}(\cdot)\tag{5}$$

with the sum of error probabilities $\le2\epsilon$. Then

$$\varepsilon_\star\le2\sqrt{\epsilon(1-\epsilon)}.$$

In other words, if the simple hypotheses $(A)$, $(B)$ can be decided, by a whatever test, with the sum of error probabilities $2\epsilon$, then the risk of the simple test with detector $\phi_*$ on the composite hypotheses $H_X$, $H_Y$ does not exceed $2\sqrt{\epsilon(1-\epsilon)}$.

(iii) The detector $\phi_*$ specified in (i) is readily given by the $[x;y]$-component $[x_*;y_*]$ of the associated saddle point of $\Phi$, specifically,

$$\phi_*(\cdot)=\tfrac12\ln\left(p_{x_*}(\cdot)/p_{y_*}(\cdot)\right).\tag{6}$$

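Remark. At this point let us make a small summary of the properties of simple tests in the problem setting and under the assumptions of section 2.1:

• One has

$$\varepsilon_\star=\exp({\rm Opt}/2)=\rho(x_*,y_*),$$

where ${\rm Opt}$ is the saddle point value and $[x_*;y_*]$ is the $[x;y]$-component of the saddle point solution of (2), and

$$\rho(x,y)=\int_\Omega\sqrt{p_x(\omega)p_y(\omega)}\,P(d\omega)$$

is the Hellinger affinity of distributions $p_x$ and $p_y$ [34, 37];

• the optimal detector $\phi_*$ as in (6) satisfies (1) with $\epsilon=\varepsilon_\star$;

• the simple test with detector $\phi_*$ can be “skewed”, by using instead of $\phi_*$ the detector $\phi_*^a(\cdot)=\phi_*(\cdot)-a$, to attain error probabilities of the test $\epsilon_X\le e^{a}\varepsilon_\star$ and $\epsilon_Y\le e^{-a}\varepsilon_\star$.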
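The three properties just listed can be checked numerically for a pair of simple hypotheses on a finite set. The sketch below is only an illustration: the two distributions `p` and `q` and all function names are assumptions, and the pair $(p,q)$ plays the role of $(p_{x_*},p_{y_*})$.

```python
import math

def hellinger_affinity(p, q):
    """rho(p, q) = sum_i sqrt(p_i * q_i) for distributions on a finite set."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def detector(p, q, shift=0.0):
    """phi_*(.) - shift, with phi_* = 0.5 * ln(p / q) as in (6)."""
    return [0.5 * math.log(a / b) - shift for a, b in zip(p, q)]

def exp_moments(phi, p, q):
    """The two left-hand sides of (1) evaluated at the single pair (p, q)."""
    lhs_x = sum(math.exp(-f) * a for f, a in zip(phi, p))
    lhs_y = sum(math.exp(f) * b for f, b in zip(phi, q))
    return lhs_x, lhs_y

p = [0.5, 0.3, 0.2]    # hypothetical p_{x_*}
q = [0.2, 0.3, 0.5]    # hypothetical p_{y_*}
rho = hellinger_affinity(p, q)
balanced = exp_moments(detector(p, q), p, q)           # both equal rho, cf. (3)
a = -0.5                                               # skewing parameter
skewed = exp_moments(detector(p, q, shift=a), p, q)    # (e^a rho, e^-a rho), cf. (4)
print(rho, balanced, skewed)
```

A negative `a` lowers the bound on the probability of rejecting $H_X$ at the price of a proportionally larger bound for $H_Y$; the product of the two bounds stays equal to $\varepsilon_\star^2$.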

As we will see in an instant, the properties (i) – (iii) of simple tests allow us to “propagate” the near-optimality property of the tests to the case of repeated observations and multiple testing, and underlie all further developments.

Of course, the proposed setting and construction of simple tests are by no means unique. For instance, any test $\bar T$ in the problem of deciding between $H_X$ and $H_Y$, with the risk bounded with $\bar\epsilon\in(0,1/2)$, gives rise to the detector

$$\bar\phi(\omega)=\tfrac12\ln\left(\frac{1-\bar\epsilon}{\bar\epsilon}\right)\bar T(\omega)$$

(recall that $\bar T(\omega)=1$ when $\bar T$, as applied to observation $\omega$, accepts $H_X$, and $\bar T(\omega)=-1$ otherwise). One can easily see that the risk of $\bar\phi$ satisfies the bounds of (1) with

$$\epsilon=2\sqrt{\bar\epsilon(1-\bar\epsilon)}.$$

In other words, in the problem of deciding upon $H_X$ and $H_Y$, any test $\bar T$ with the risk $\bar\epsilon$ brings about a simple test with detector $\bar\phi$, albeit with a larger risk $\epsilon$.
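The claim about $\bar\phi$ reduces to a two-point computation: under a distribution for which the test errs with probability $e\le\bar\epsilon$, the exponential moment equals $(1-e)e^{-c}+e\,e^{c}$ with $c=\tfrac12\ln((1-\bar\epsilon)/\bar\epsilon)$. The following sketch, with hypothetical function names and an assumed value $\bar\epsilon=0.1$, evaluates this quantity.

```python
import math

def detector_scale(eps_bar):
    """c such that phi_bar = c * T_bar; assumes 0 < eps_bar < 1/2."""
    return 0.5 * math.log((1.0 - eps_bar) / eps_bar)

def exp_moment(eps_bar, err):
    """E exp{-phi_bar} under a distribution for which T_bar errs w.p. err."""
    c = detector_scale(eps_bar)
    return (1.0 - err) * math.exp(-c) + err * math.exp(c)

eps_bar = 0.1
worst = exp_moment(eps_bar, eps_bar)                   # error probability at its cap
milder = exp_moment(eps_bar, 0.05)                     # smaller actual error
bound = 2.0 * math.sqrt(eps_bar * (1.0 - eps_bar))     # 2*sqrt(eps_bar*(1-eps_bar))
print(worst, milder, bound)
```

The moment is increasing in the actual error probability when $\bar\epsilon<1/2$, so the worst case is attained at $e=\bar\epsilon$, where it coincides with the stated bound.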

2.3 Basic examples

We list here some situations where our assumptions are satisfied and thus Theorem 2.1 is applicable.

Gaussian observation scheme

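In the Gaussian observation scheme we are given an observation $\omega\in\mathbb R^m$, $\omega\sim\mathcal N(\mu,\Sigma)$, with unknown parameter $\mu\in\mathbb R^m$ and known covariance matrix $\Sigma$. Here the family $\mathcal P$ is defined with $(\Omega,P)$ being $\mathbb R^m$ with the Lebesgue measure, $p_\mu=\mathcal N(\mu,\Sigma)$, $\mathcal M=\mathbb R^m$, and $\mathcal F=\{\phi(\omega)=a^T\omega+b:\ a\in\mathbb R^m,\ b\in\mathbb R\}$ is the space of all affine functions on $\mathbb R^m$. Taking into account that

$$\ln\left(\int_{\mathbb R^m}e^{a^T\omega+b}p_\mu(\omega)d\omega\right)=b+a^T\mu+\tfrac12a^T\Sigma a,$$

we conclude that the Gaussian observation scheme is good. The test yielded by Theorem 2.1 is particularly simple in this case: assuming that the nonempty convex compact sets $X$, $Y$ do not intersect⁴, and that the covariance matrix $\Sigma$ of the distribution of observation is nondegenerate, we get

$$\begin{array}{c}\phi_*(\omega)=\xi^T\omega-\alpha,\quad\xi=\tfrac12\Sigma^{-1}[x_*-y_*],\quad\alpha=\tfrac12\xi^T[x_*+y_*],\\[4pt]\varepsilon_\star=\exp\left(-\tfrac18(x_*-y_*)^T\Sigma^{-1}(x_*-y_*)\right)\\[4pt]\left[[x_*;y_*]\in\mathop{\rm Argmax}\limits_{x\in X,y\in Y}\left[\psi(x,y)=-\tfrac14(x-y)^T\Sigma^{-1}(x-y)\right]\right].\end{array}\tag{7}$$

One can easily verify that the error probabilities $\epsilon_X$ and $\epsilon_Y$ of the associated simple test do not exceed ${\rm Erf}\left(\tfrac12\|\Sigma^{-1/2}(x_*-y_*)\|_2\right)$, where ${\rm Erf}$ is the error function:

$${\rm Erf}(t)=(2\pi)^{-1/2}\int_t^\infty\exp\{-s^2/2\}ds.$$

Moreover, in the case in question the sum of the error probabilities of our test is exactly the minimal, over all possible tests, sum of error probabilities when deciding between the simple hypotheses stating that $\mu=x_*$ and $\mu=y_*$.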
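As an illustration of (7), the sketch below builds the detector in a special case where the closest points are available in closed form: $X$ and $Y$ are assumed to be disjoint Euclidean balls and $\Sigma=\sigma^2I$. These choices, the function names and the numerical values are assumptions made for the example only; for general convex sets the pair $[x_*;y_*]$ has to be found by a convex optimization solver.

```python
import math

def gaussian_tail(t):
    """Erf(t) in the convention used here: P{N(0,1) > t}."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def ball_detector(cx, rx, cy, ry, sigma):
    """Detector (7) for X, Y Euclidean balls and Sigma = sigma^2 * I."""
    gap = [a - b for a, b in zip(cx, cy)]
    dist = math.sqrt(sum(g * g for g in gap))
    if dist <= rx + ry:
        raise ValueError("the balls must not intersect")
    u = [g / dist for g in gap]                       # unit vector pointing from Y to X
    x_star = [a - rx * ui for a, ui in zip(cx, u)]    # point of X closest to Y
    y_star = [b + ry * ui for b, ui in zip(cy, u)]    # point of Y closest to X
    xi = [(a - b) / (2.0 * sigma ** 2) for a, b in zip(x_star, y_star)]
    alpha = 0.5 * sum(k * (a + b) for k, a, b in zip(xi, x_star, y_star))
    d = dist - rx - ry                                # equals ||x_star - y_star||_2
    eps_star = math.exp(-d * d / (8.0 * sigma ** 2))  # risk bound from (7)
    err_bound = gaussian_tail(d / (2.0 * sigma))      # bound on eps_X and eps_Y
    return xi, alpha, eps_star, err_bound

xi, alpha, eps_star, err_bound = ball_detector([3.0, 0.0], 1.0, [-3.0, 0.0], 1.0, 1.0)
print(xi, alpha, eps_star, err_bound)
```

In this instance the Gaussian tail bound on the individual error probabilities is noticeably smaller than $\varepsilon_\star$, which illustrates that $\varepsilon_\star$ is an upper bound on the risk rather than its exact value.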

Remarks. Consider the simple situation where the covariance matrix is proportional to the identity matrix: $\Sigma=\sigma^2I$ (the case of general $\Sigma$ reduces to this “standard case” by a simple change of variables). In this case, in order to construct the optimal test, one should find the closest in the Euclidean distance points $x_*\in X$ and $y_*\in Y$, so that the affine form $\phi_*(\cdot)$ strongly separates $X$ and $Y$. On the other hand, testing in the white Gaussian noise between the closed half-spaces $\{u:\xi^Tu\ge\xi^Tx_*\}$ and $\{u:\xi^Tu\le\xi^Ty_*\}$ (which contain $X$ and $Y$, respectively) is exactly the same as deciding on two simple hypotheses stating that $\mu=x_*$ and $\mu=y_*$. Though this result is almost self-evident, it seems to have been first noticed in [14] in the problem of testing in the white noise model, and then exploited in [15, 31] in the context of hypothesis testing, which is important to us.

As far as numerical implementation of the testing routines is concerned, numerical stability of the proposed test is an important issue. For instance, it may be useful to know the testing performance when the optimization problem in (7) is not solved to exact optimality, or when errors may be present in the description of the sets $X$ and $Y$. Note that one can easily bound the error of the obtained test in terms of the magnitude of violation of the first-order optimality conditions for (7), which read:

$$(y_*-x_*)^T\Sigma^{-1}(x-x_*)+(x_*-y_*)^T\Sigma^{-1}(y-y_*)\le0\quad\forall x\in X,\ y\in Y.$$

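Now assume that instead of the optimal test we have at our disposal an “approximated” simple test associated with

$$\tilde\phi(\omega)=\tilde\xi^T\omega-\tilde\alpha,\quad\tilde\xi=\tfrac12\Sigma^{-1}[\tilde x-\tilde y],\quad\tilde\alpha=\tfrac12\tilde\xi^T[\tilde x+\tilde y],$$

where $\tilde x\in X$, $\tilde y\in Y$ satisfy

$$(\tilde y-\tilde x)^T\Sigma^{-1}(x-\tilde x)+(\tilde x-\tilde y)^T\Sigma^{-1}(y-\tilde y)\le\delta\quad\forall x\in X,\ y\in Y,\tag{8}$$

with some $\delta>0$. This implies the following bound for the risk of the test with detector $\tilde\phi$:

$$\max[\epsilon_X,\epsilon_Y]\le\tilde\epsilon={\rm Erf}\left(\tfrac12\|\Sigma^{-1/2}(\tilde x-\tilde y)\|_2-\frac{\delta}{\|\Sigma^{-1/2}(\tilde x-\tilde y)\|_2}\right).\tag{9}$$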

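Indeed, (8) implies (take $y=\tilde y$) that $\tilde\xi^T(x-\tilde x)\ge-\frac\delta2$ for all $x\in X$. As a result,

$$\tilde\xi^Tx-\tilde\alpha=\tilde\xi^T(x-\tilde x)+\tilde\xi^T\Sigma\tilde\xi\ge-\frac\delta2+\tilde\xi^T\Sigma\tilde\xi\quad\forall x\in X,$$

and for all $x\in X$,

$${\rm Prob}_x\{\tilde\phi(\omega)<0\}={\rm Prob}_x\{\tilde\xi^T(\omega-x)<-\tilde\xi^Tx+\tilde\alpha\}\le{\rm Prob}\left\{\|\Sigma^{1/2}\tilde\xi\|_2\,\eta<-\|\Sigma^{1/2}\tilde\xi\|_2^2+\frac\delta2\right\},$$

where $\eta\sim\mathcal N(0,1)$. We conclude that

$$\epsilon_X=\sup_{x\in X}{\rm Prob}_x\{\tilde\phi(\omega)<0\}\le{\rm Erf}\left(\|\Sigma^{1/2}\tilde\xi\|_2-\frac{\delta}{2\|\Sigma^{1/2}\tilde\xi\|_2}\right),$$

which, in view of $\|\Sigma^{1/2}\tilde\xi\|_2=\tfrac12\|\Sigma^{-1/2}(\tilde x-\tilde y)\|_2$, implies the bound (9) for $\epsilon_X$. The corresponding bound for $\epsilon_Y$ is obtained in the same way.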
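The bound (9) is straightforward to evaluate once $\tilde x$, $\tilde y$ and the violation level $\delta$ are known. The sketch below does this under the simplifying assumption $\Sigma=\sigma^2I$; the function names and the numbers are illustrative and not part of the paper.

```python
import math

def gaussian_tail(t):
    """Erf(t) in the convention used here: P{N(0,1) > t}."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def approx_risk_bound(x_t, y_t, sigma, delta):
    """Right-hand side of (9), assuming Sigma = sigma^2 * I."""
    # n = ||Sigma^{-1/2} (x_t - y_t)||_2
    n = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_t, y_t))) / sigma
    return gaussian_tail(0.5 * n - delta / n)

exact = approx_risk_bound([2.0, 0.0], [-2.0, 0.0], 1.0, 0.0)   # delta = 0
loose = approx_risk_bound([2.0, 0.0], [-2.0, 0.0], 1.0, 0.4)   # delta = 0.4
print(exact, loose)
```

With $\delta=0$ the value coincides with the bound for the exact test, and the guarantee degrades gracefully as $\delta$ grows.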

Discrete observation scheme

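Assume that we observe a realization of a random variable $\omega$ taking values in $\{1,...,m\}$ with probabilities $\mu_i$:

$$\mu_i={\rm Prob}\{\omega=i\},\quad i=1,...,m.$$

The just described Discrete observation scheme corresponds to $(\Omega,P)$ being $\{1,...,m\}$ with the counting measure, $p_\mu(\omega)=\mu_\omega$, and $\mathcal M=\{\mu\in\mathbb R^m:\ \mu_i>0,\ \sum_{i=1}^m\mu_i=1\}$. In this case $\mathcal F=\mathbb R^m$, and for $\phi\in\mathbb R^m$,

$$\ln\left(\sum_{\omega=1}^me^{\phi_\omega}\mu_\omega\right)$$

is concave in $\mu\in\mathcal M$. We conclude that the Discrete observation scheme is good. Furthermore, assuming that the convex compact sets $X\subset\mathcal M$, $Y\subset\mathcal M$ (recall that in this case $\mathcal M$ is the relative interior of the standard simplex in $\mathbb R^m$) do not intersect, we get

$$\begin{array}{c}\phi_*(\omega)=\ln\left(\sqrt{[x_*]_\omega/[y_*]_\omega}\right),\quad\varepsilon_\star=\exp\{{\rm Opt}/2\}=\rho(x_*,y_*),\\[4pt]\left[[x_*;y_*]\in\mathop{\rm Argmax}\limits_{x\in X,y\in Y}\left[\psi(x,y)=2\ln\rho(x,y)\right],\ {\rm Opt}=\psi(x_*,y_*)\right],\end{array}\tag{10}$$

where $\rho(x,y)=\sum_{\ell=1}^m\sqrt{x_\ell y_\ell}$ is the Hellinger affinity of distributions $x$ and $y$. One has $\varepsilon_\star=\rho(x_*,y_*)=1-h^2(x_*,y_*)$, the Hellinger affinity of the sets $X$ and $Y$, where

$$h^2(x,y)=\tfrac12\sum_{\ell=1}^m\left(\sqrt{x_\ell}-\sqrt{y_\ell}\right)^2$$

is the Hellinger distance between distributions $x$ and $y$. Thus the result of Theorem 2.1, as applied to the Discrete observation model, allows for the following simple interpretation: to construct the simple test one should find the closest in Hellinger distance points $x_*\in X$ and $y_*\in Y$; then the risk of the likelihood ratio test $\phi_*$ for distinguishing $x_*$ from $y_*$, as applied to our testing problem, is bounded with $\rho(x_*,y_*)=1-h^2(x_*,y_*)$, the Hellinger affinity of sets $X$ and $Y$.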
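The recipe "find the closest in Hellinger distance points, then use the likelihood ratio detector" can be sketched in a toy case where $X$ and $Y$ are line segments in the simplex. Since $\rho(x,y)$ is jointly concave, a nested ternary search over the two segment parameters suffices here; this search, the segment endpoints and all names below are assumptions made for illustration, and a general instance would be handed to a convex optimization solver instead.

```python
import math

def affinity(x, y):
    """Hellinger affinity rho(x, y) of two probability vectors."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))

def point(p0, p1, t):
    """Point (1 - t) * p0 + t * p1 of the segment [p0, p1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(p0, p1)]

def ternary_max(f, iters=60):
    """Maximize a concave function f over [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    t = 0.5 * (lo + hi)
    return t, f(t)

def closest_pair(x0, x1, y0, y1):
    """Approximate maximizer of rho over X = [x0, x1], Y = [y0, y1]."""
    def inner(s):
        x = point(x0, x1, s)
        return ternary_max(lambda t: affinity(x, point(y0, y1, t)))
    s, _ = ternary_max(lambda s: inner(s)[1])
    t, rho = inner(s)
    return point(x0, x1, s), point(y0, y1, t), rho

def risk_at_vertices(x_s, y_s, xs, ys):
    """Detector risk (1); the sums are linear in x and y, so vertices suffice."""
    phi = [0.5 * math.log(a / b) for a, b in zip(x_s, y_s)]
    rx = max(sum(math.exp(-f) * p for f, p in zip(phi, x)) for x in xs)
    ry = max(sum(math.exp(f) * q for f, q in zip(phi, y)) for y in ys)
    return max(rx, ry)

x0, x1 = [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]    # hypothetical endpoints of X
y0, y1 = [0.1, 0.3, 0.6], [0.2, 0.3, 0.5]    # hypothetical endpoints of Y
x_s, y_s, rho = closest_pair(x0, x1, y0, y1)
risk = risk_at_vertices(x_s, y_s, [x0, x1], [y0, y1])
print(rho, 1.0 - rho, risk)
```

In this example the closest points are the endpoints `x1` and `y1`, and the computed detector risk over both segments does not exceed the affinity $\rho(x_*,y_*)$ up to the accuracy of the search.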

Remarks. The Discrete observation scheme considered in this section is a simple particular case, that of finite $\Omega$, of the result of [8, 9] on distinguishing convex sets of distributions. Roughly, the situation considered in those papers is as follows: let $\Omega$ be a Polish space, $P$ be a $\sigma$-finite $\sigma$-additive Borel measure on $\Omega$, and $p(\cdot)$ be a density w.r.t. $P$ of the probability distribution of observation $\omega$. Note that the corresponding observation scheme (with $\mathcal M$ being the set of densities with respect to $P$ on $\Omega$) does not satisfy the premise of section 2.1 because the linear space $\mathcal F$ spanned by constants and functions of the form $\ln(p(\cdot)/q(\cdot))$, $p,q\in\mathcal M$, is not finite-dimensional. Now assume that we are given two non-overlapping convex closed subsets $X$, $Y$ of the set of probability densities with respect to $P$ on $\Omega$. Observe that for every positive Borel function $\psi(\cdot):\Omega\to\mathbb R$, the detector $\phi$ given by $\phi(\omega)=\ln(\psi(\omega))$ for evident reasons satisfies the relation

$$\max_{p\in X,q\in Y}\left[\int_\Omega e^{-\phi(\omega)}p(\omega)P(d\omega),\ \int_\Omega e^{\phi(\omega)}q(\omega)P(d\omega)\right]\le\epsilon,\qquad\epsilon=\max\left[\sup_{p\in X}\int_\Omega\psi^{-1}(\omega)p(\omega)P(d\omega),\ \sup_{q\in Y}\int_\Omega\psi(\omega)q(\omega)P(d\omega)\right].$$

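Let now

$${\rm Opt}=\max_{p\in X,q\in Y}\left\{\rho(p,q)=\int_\Omega\sqrt{p(\omega)q(\omega)}\,P(d\omega)\right\},\tag{11}$$

which is an infinite-dimensional convex program with respect to $p\in X$ and $q\in Y$. Assuming the program solvable with an optimal solution composed of distributions $p_*(\cdot)$, $q_*(\cdot)$ which are positive, and setting $\psi_*(\omega)=\sqrt{p_*(\omega)/q_*(\omega)}$, under some “regularity assumptions” (see, e.g., Proposition 4.2 of [9]) the optimality conditions for (11) read:

$$\min_{p\in X,q\in Y}\left[\int_\Omega\psi_*^{-1}(\omega)[p_*(\omega)-p(\omega)]P(d\omega)+\int_\Omega\psi_*(\omega)[q_*(\omega)-q(\omega)]P(d\omega)\right]=0.$$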

In other words,

$$\max_{p\in X}\int_\Omega\psi_*^{-1}(\omega)p(\omega)P(d\omega)\le\int_\Omega\psi_*^{-1}(\omega)p_*(\omega)P(d\omega)={\rm Opt},$$

and similarly,

$$\max_{q\in Y}\int_\Omega\psi_*(\omega)q(\omega)P(d\omega)\le\int_\Omega\psi_*(\omega)q_*(\omega)P(d\omega)={\rm Opt},$$

so that for our $\phi_*=\ln\psi_*$, we have $\epsilon={\rm Opt}$.

Note that, although this approach is not restricted to the Discrete case per se, when $\Omega$ is not finite, the optimization problem in (11) is generally computationally intractable (the optimal detectors can be constructed explicitly for some special sets of distributions, see [9, 11]).

The bound $\varepsilon_\star$ for the risk of the simple test can be compared to the testing affinity $\pi(X,Y)$ between $X$ and $Y$,

$$\pi(X,Y)=\max_{x\in X,y\in Y}\left\{\pi(x,y)=\sum_{\ell=1}^m\min[x_\ell,y_\ell]\right\},$$

which is the least possible sum of error probabilities when distinguishing between $X$ and $Y$ (cf. [35, 37]). The corresponding minimax test is a simple test with detector $\bar\phi(\cdot)$, defined according to

$$\bar\phi(\omega)=\ln\left(\sqrt{[\bar x]_\omega/[\bar y]_\omega}\right),\qquad[\bar x;\bar y]\in\mathop{\rm Argmax}\limits_{x\in X,y\in Y}\left[\sum_{\ell=1}^m\min[x_\ell,y_\ell]\right].$$

Unfortunately, this test cannot be easily extended to the case where repeated observations (e.g., independent realizations $\omega_k$, $k=1,...,K$, of $\omega$) are available. In [27] such an extension has been proposed in the case where $X$ and $Y$ are dominated by bi-alternating capacities (see, e.g., [28, 5, 13, 3], and references therein); explicit constructions of the test were proposed for some special sets of distributions [26, 42, 41]. On the other hand, as we shall see in section 2.4, the simple test $\phi_*$ allows for a straightforward generalization to the repeated observations case with the same (near-)optimality guarantees as those of Theorem 2.1.ii.

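Finally, same as in the Gaussian observation scheme, the risk of a simple test with detector $\tilde\phi(\omega)=\ln\left(\sqrt{[\tilde x]_\omega/[\tilde y]_\omega}\right)$, defined by a pair of distributions $[\tilde x;\tilde y]\in X\times Y$, can be assessed through the magnitude of violation by $\tilde x$ and $\tilde y$ of the first-order optimality conditions for the optimization problem in (10). Indeed, assume that

$$\sum_{\ell=1}^m\sqrt{\frac{\tilde y_\ell}{\tilde x_\ell}}\,(x_\ell-\tilde x_\ell)+\sum_{\ell=1}^m\sqrt{\frac{\tilde x_\ell}{\tilde y_\ell}}\,(y_\ell-\tilde y_\ell)\le\delta\quad\forall x\in X,\ y\in Y.$$

We conclude that

$$\begin{array}{rcl}\epsilon_X&\le&\max\limits_{x\in X}\sum\limits_{\ell=1}^me^{-\tilde\phi_\ell}x_\ell=\max\limits_{x\in X}\sum\limits_{\ell=1}^m\sqrt{\dfrac{\tilde y_\ell}{\tilde x_\ell}}\,x_\ell\le\sum\limits_{\ell=1}^m\sqrt{\tilde y_\ell\tilde x_\ell}+\delta,\\[8pt]\epsilon_Y&\le&\max\limits_{y\in Y}\sum\limits_{\ell=1}^me^{\tilde\phi_\ell}y_\ell=\max\limits_{y\in Y}\sum\limits_{\ell=1}^m\sqrt{\dfrac{\tilde x_\ell}{\tilde y_\ell}}\,y_\ell\le\sum\limits_{\ell=1}^m\sqrt{\tilde x_\ell\tilde y_\ell}+\delta,\end{array}$$

so that the risk of the test is bounded with $\rho(\tilde x,\tilde y)+\delta$.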

Poisson observation scheme

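Suppose that we are given $m$ realizations of independent Poisson random variables

$$\omega_i\sim{\rm Poisson}(\mu_i)$$

with parameters $\mu_i>0$, $i=1,...,m$. The Poisson observation scheme is given by $(\Omega,P)$ being $\mathbb Z_+^m$ with the counting measure, $p_\mu(\omega)=\prod_{i=1}^m\frac{\mu_i^{\omega_i}}{\omega_i!}e^{-\mu_i}$, where $\mu\in\mathcal M={\rm int}\,\mathbb R_+^m$, and, similarly to the Gaussian case, $\mathcal F$ is comprised of the restrictions onto $\mathbb Z_+^m$ of affine functions: $\mathcal F=\{\phi(\omega)=a^T\omega+b:\ a\in\mathbb R^m,\ b\in\mathbb R\}$. Since

$$\ln\left(\sum_{\omega\in\mathbb Z_+^m}\exp(a^T\omega+b)p_\mu(\omega)\right)=\sum_{i=1}^m(e^{a_i}-1)\mu_i+b$$

is concave in $\mu$, we conclude that the Poisson observation scheme is good.

Assume now that, same as above, in the Poisson observation scheme, the convex compact sets $X\subset\mathcal M$, $Y\subset\mathcal M$ do not intersect. Then the data associated with the simple test yielded by Theorem 2.1 is as follows:

$$\begin{array}{c}\phi_*(\omega)=\xi^T\omega-\alpha,\quad\xi_\ell=\tfrac12\ln\left([x_*]_\ell/[y_*]_\ell\right),\quad\alpha=\tfrac12\sum_{\ell=1}^m[x_*-y_*]_\ell,\quad\varepsilon_\star=\exp\{{\rm Opt}/2\}\\[4pt]\left[[x_*;y_*]\in\mathop{\rm Argmax}\limits_{x\in X,y\in Y}\left[\psi(x,y)=-2h^2(x,y)\right],\ {\rm Opt}=\psi(x_*,y_*)\right],\end{array}\tag{12}$$

where $h^2(x,y)=\tfrac12\sum_{\ell=1}^m\left(\sqrt{x_\ell}-\sqrt{y_\ell}\right)^2$ is the Hellinger distance between $x\in\mathbb R_+^m$ and $y\in\mathbb R_+^m$.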
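The data in (12) can be verified against the closed-form log-moment generating function displayed above. In the sketch below, the pair `x_s`, `y_s` stands in for $[x_*;y_*]$ and is simply assumed to be given (hypothetical values, hypothetical function names); the code checks that both exponential moments of the detector at this pair equal $\varepsilon_\star=\exp\{-h^2(x_*,y_*)\}$.

```python
import math

def poisson_detector(x_s, y_s):
    """xi, alpha and eps_star of (12) for a given pair [x_s; y_s]."""
    xi = [0.5 * math.log(a / b) for a, b in zip(x_s, y_s)]
    alpha = 0.5 * sum(a - b for a, b in zip(x_s, y_s))
    h2 = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x_s, y_s))
    return xi, alpha, math.exp(-h2)

def log_mgf(a, b, mu):
    """ln E exp{a^T omega + b} for independent omega_i ~ Poisson(mu_i)."""
    return sum((math.exp(ai) - 1.0) * m for ai, m in zip(a, mu)) + b

x_s, y_s = [4.0, 1.0], [1.0, 2.0]                          # hypothetical intensities
xi, alpha, eps_star = poisson_detector(x_s, y_s)
lhs_x = math.exp(log_mgf([-k for k in xi], alpha, x_s))    # E_x exp{-phi_*}
lhs_y = math.exp(log_mgf(xi, -alpha, y_s))                 # E_y exp{+phi_*}
print(eps_star, lhs_x, lhs_y)
```

Both moments agree with $\varepsilon_\star$, which is the Poisson counterpart of relation (3).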

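Remark. Let $\tilde\phi(\omega)=\tilde\xi^T\omega-\tilde\alpha$ be a detector generated by $[\tilde x;\tilde y]\in X\times Y$, namely, such that

$$\tilde\xi_\ell=\tfrac12\ln(\tilde x_\ell/\tilde y_\ell),\qquad\tilde\alpha=\tfrac12\sum_{\ell=1}^m(\tilde x_\ell-\tilde y_\ell).$$

We assume that $[\tilde x;\tilde y]$ is an approximate solution to (12) in the sense that the first-order optimality condition of (12) is “$\delta$-satisfied”:

$$\sum_{\ell=1}^m\left[\left(\sqrt{\tilde y_\ell/\tilde x_\ell}-1\right)(x_\ell-\tilde x_\ell)+\left(\sqrt{\tilde x_\ell/\tilde y_\ell}-1\right)(y_\ell-\tilde y_\ell)\right]\le\delta\quad\forall x\in X,\ y\in Y.$$

One can easily verify that the risk of the test associated with $\tilde\phi$ is bounded with $\exp\{-h^2(\tilde x,\tilde y)+\delta\}$ (cf. the corresponding bounds for the Gaussian and Discrete observation schemes).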

2.4 Repeated observations

Good observation schemes admit naturally defined direct products. To simplify presentation, we start with explaining the corresponding construction in the case of stationary repeated observations described as follows.

K-repeated stationary observation scheme

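We are given a good observation scheme $\mathcal O=((\Omega,P),\{p_\mu(\cdot):\mu\in\mathcal M\},\mathcal F)$ and a positive integer $K$, along with the same sets $X$, $Y$ as above. Instead of a single realization $\omega\sim p_\mu(\cdot)$, we now observe a sample of $K$ independent realizations $\omega_k\sim p_\mu(\cdot)$, $k=1,...,K$. Formally, this corresponds to the observation scheme $\mathcal O^{(K)}$ with the observation space $\Omega^{(K)}=\{\omega^K=(\omega_1,...,\omega_K):\omega_k\in\Omega\ \forall k\}$ equipped with the measure $P^{(K)}=P\times...\times P$, the family $\{p^{(K)}_\mu(\omega^K)=\prod_{k=1}^Kp_\mu(\omega_k),\ \mu\in\mathcal M\}$ of densities of repeated observations w.r.t. $P^{(K)}$, and $\mathcal F^{(K)}=\{\phi^{(K)}(\omega^K)=\sum_{k=1}^K\phi(\omega_k),\ \phi\in\mathcal F\}$. The components $X$, $Y$ of our setup are the same as for the original single-observation scheme, and the composite hypotheses we intend to decide upon state now that the $K$-element observation $\omega^K$ comes from a distribution $p^{(K)}_\mu(\cdot)$ with $\mu\in X$ (hypothesis $H_X$) or with $\mu\in Y$ (hypothesis $H_Y$).

It is immediately seen that the just described $K$-repeated observation scheme is good (i.e., satisfies all our assumptions), provided that the “single observation” scheme we start with is so. Moreover, the detectors $\phi_*$, $\phi_*^{(K)}$ and risk bounds $\varepsilon_\star$, $\varepsilon_\star^{(K)}$ given by Theorem 2.1 as applied to the original and the $K$-repeated observation schemes are linked by the relations

$$\phi_*^{(K)}(\omega_1,...,\omega_K)=\sum_{k=1}^K\phi_*(\omega_k),\qquad\varepsilon_\star^{(K)}=(\varepsilon_\star)^K.\tag{13}$$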

As a result, the “near-optimality claim” Theorem 2.1.ii can be reformulated as follows:

Proposition 2.1

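Assume that for some integer $\bar K\ge1$ and some $\epsilon\in(0,1/4)$, the hypotheses $H_X$, $H_Y$ can be decided, by a whatever procedure utilising $\bar K$ observations, with error probabilities $\le\epsilon$. Then with

$$K_+=\left\rfloor\frac{2\bar K}{1-\frac{2\ln[2]}{\ln[1/\epsilon]}}\right\lfloor$$

observations, $\rfloor a\lfloor$ being the smallest integer $\ge a$, the simple test with the detector $\phi_*^{(K_+)}$ decides between $H_X$ and $H_Y$ with risk $\le\epsilon$.

Indeed, applying (13) with $K=\bar K$ and utilizing Theorem 2.1.ii, we get $(\varepsilon_\star)^{\bar K}\le2\sqrt{\epsilon(1-\epsilon)}\le2\sqrt\epsilon$ and therefore, by the same (13), $\varepsilon_\star^{(K)}=(\varepsilon_\star)^K\le[2\sqrt\epsilon]^{K/\bar K}$ for all $K$. Thus, $\varepsilon_\star^{(K_+)}\le\epsilon$, and therefore the conclusion of the Proposition follows from Theorem 2.1.i as applied to observations $\omega^{K_+}$.

We see that the “suboptimality ratio” (i.e., the ratio $K_+/\bar K$) of the proposed test when $\epsilon$-reliable testing is sought is close to 2 for small $\epsilon$.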
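The sample size $K_+$ of Proposition 2.1 is easy to tabulate. The sketch below computes it and checks, in the worst case allowed by Theorem 2.1.ii, that the risk bound $(\varepsilon_\star)^{K_+}$ of (13) indeed falls below $\epsilon$; the function names and the test values $\bar K=10$, $\epsilon=0.01$ are illustrative assumptions.

```python
import math

def k_plus(k_bar, eps):
    """Smallest integer >= 2*k_bar / (1 - 2*ln(2)/ln(1/eps)); needs 0 < eps < 1/4."""
    if not 0.0 < eps < 0.25:
        raise ValueError("eps must lie in (0, 1/4)")
    return math.ceil(2.0 * k_bar / (1.0 - 2.0 * math.log(2.0) / math.log(1.0 / eps)))

def risk_after(eps_star, k):
    """Risk bound of the k-repeated simple test, relation (13)."""
    return eps_star ** k

k_bar, eps = 10, 0.01
k = k_plus(k_bar, eps)
# Largest single-observation eps_star compatible with Theorem 2.1.ii.
worst_eps_star = (2.0 * math.sqrt(eps * (1.0 - eps))) ** (1.0 / k_bar)
print(k, k / k_bar, risk_after(worst_eps_star, k))
```

For these values the ratio $K_+/\bar K$ is 2.9; it approaches 2 only slowly, since the correction term decays like $1/\ln(1/\epsilon)$.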

Non-stationary repeated observations

We are about to define the notion of a general-type direct product of good observation schemes. The situation now is as follows: we are given good observation schemes

 Ok=((Ωk,Pk),Mk⊂Rmk,{pk,μk(⋅):μk∈Mk},Fk),k=1,...,K

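and observe a sample $\omega^K=(\omega_1,...,\omega_K)$ of realizations $\omega_k\in\Omega_k$ drawn independently of each other from the distributions with densities, w.r.t. $P_k$, being $p_{k,\mu_k}(\cdot)$, for a collection $\mu^K=(\mu_1,...,\mu_K)$ with $\mu_k\in\mathcal M_k$, $1\le k\le K$. Setting

$$\begin{array}{rcl}\Omega^K&=&\Omega_1\times...\times\Omega_K=\{\omega^K=(\omega_1,...,\omega_K):\omega_k\in\Omega_k\ \forall k\le K\},\\P^K&=&P_1\times...\times P_K,\\\mathcal M^K&=&\mathcal M_1\times...\times\mathcal M_K=\{\mu^K=(\mu_1,...,\mu_K):\mu_k\in\mathcal M_k\ \forall k\le K\},\\p_{\mu^K}(\omega^K)&=&p_{1,\mu_1}(\omega_1)p_{2,\mu_2}(\omega_2)...p_{K,\mu_K}(\omega_K)\quad[\mu^K\in\mathcal M^K,\ \omega^K\in\Omega^K],\\\mathcal F^K&=&\{\phi^K(\omega^K)=\phi_1(\omega_1)+\phi_2(\omega_2)+...+\phi_K(\omega_K):\Omega^K\to\mathbb R:\ \phi_k(\cdot)\in\mathcal F_k\ \forall k\le K\},\end{array}$$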

we get an observation scheme which we call the direct product of and denote . It is immediately seen that this scheme is good. Note that the already defined stationary repeated observation scheme deals with a special case of the direct product construction, the one where all factors in the product are identical to each other, and where, in addition, we replace with its “diagonal part” .

Let , where, for every ,

 Ok=((Ωk,Pk),Mk,{pμk(⋅):μk∈Mk},Fk)

is a good observation scheme, specifically, either Gaussian, or Discrete, or Poisson (see section 2.3). To simplify notation, we assume that all Poisson factors are “scalar,” that is, is drawn from Poisson distribution with parameter .5 For

 ϕK(ωK)=K∑k