A procedure for the change point problem in parametric models based on ϕ-divergence test-statistics

Abstract

This paper studies the change point problem for a general parametric, univariate or multivariate family of distributions. An information theoretic procedure, based on general divergence measures, is developed for testing the hypothesis of the existence of a change. To compare the accuracy of the new test-statistic with that of existing ones, a simulation study is performed for the special case of a univariate discrete model. Finally, the procedure proposed in this paper is illustrated through a classical change-point example.

MSC: primary 62F03; 62F05; secondary 62H15

Keywords: Change point; Information criterion; Divergence; Wald test-statistic; General distributions.

1 Introduction

The change point problem has been considered and studied by several authors over the last five decades. Change point analysis is a statistical tool for determining whether a change has taken place at a point of a sequence of observations, such that the observations are described by one distribution up to that point and by another distribution after that point. Change point analysis concerns the detection of a change and the estimation of the point at which the distribution changes. Both the single change point problem and the multiple change points problem have been studied in the literature, depending on whether one or more change points are present in a sequence of random variables. Several methods, parametric or non-parametric, have been developed to approach the solution of this problem, while the range of applications of change point analysis is broad. Applications can be encountered in many areas such as statistical quality control, public health, medicine, finance, biomedical signal processing, meteorology, seismology, etc. The monograph by Chen and Gupta (2000) summarizes recent developments in parametric change-point analysis.

Typical situations encountered in the literature of parametric multiple change points analysis are as follows: Let X1,...,XK be independent d-variate observations (d ≥ 1) and let (X(d), A, Pθ), θ ∈ Θ ⊆ ℝm, be the statistical space associated with the random variable (r.v.) Xi, i = 1,...,K. The probability density function of Xi with respect to a σ-finite measure μ is given by fθ(x), x ∈ X(d), θ ∈ Θ. For simplicity, μ is either the Lebesgue measure or a counting measure. We adopt in the sequel the formulation of the multiple change point problem as it appeared in Srivastava and Worsley (1986) and Chen and Gupta (2000, 2004). Based on these authors, suppose that the K adjacent observations are grouped in q groups, so that X1,...,Xk1 are in the first group, Xk1+1,...,Xk2 are in the second group, and we continue in a similar manner until Xkq−1+1,...,Xkq are in the q-th group.

Consider the model for changes in the parameters. This is formulated as a problem of testing the following hypotheses,

 H0: θ1=θ2=...=θK (=θ0, θ0 unknown), (1)

versus the alternative

 H1: θ1=...=θk1≠θk1+1=...=θk2≠...≠θkq−1+1=...=θkq=θK,

where 0 = k0 < k1 < ... < kq−1 < kq = K, q−1 is the unknown number of changes and k1,...,kq−1 are the unknown positions of the change points. The above hypotheses can be equivalently stated in the form

 H0:Xi are described by fθ0, i=1,...,K and θ0 unknown, (2)

versus the alternative

 H1:Xkj+1,Xkj+2,...,Xkj+1, j=0,...,q−1 are described by fθj+1,

with k0 = 0 and kq = K.

There is an extensive bibliography on the subject and several methods to address the change point problem have appeared in the literature. Among them are the generalized likelihood ratio test, Bayesian solutions of the problem, information criterion approaches, the cumulative sum method, etc. Based on these methods, several papers discuss the change-point problem in specific probabilistic models, like the univariate and multivariate normal distribution, the gamma model and the exponential model. For instance, Sen and Srivastava (1980) focused on the single change-point problem. Moreover, they consider that within each section the distributions are the same, while the distribution in a section differs from that in the preceding and the following section in mean vector or covariance matrix. For an exposition of these methods and their application to specific distributions we refer to the monograph and the survey paper by Chen and Gupta (2000, 2001) and the references therein.

It has been proposed in these and other treatments (cf., for instance, Vostrikova (1981)) that, in order to study the multiple change point problem, which is formulated by (1) or (2), we just need to test the single change point hypothesis and then repeat the procedure for each subsequence. Hence, we turn to the testing of (2) against the alternative,

 H1: Xi≡fθ0, i=1,...,κ and Xi≡fθ1, i=κ+1,...,K, (3)

where the symbol ≡ is used to denote that the observations on the left follow the parametric density on the right. In (3), κ represents the position of a single change point, which is supposed to be unknown. A general description of this technique in the detection of the changes is summarized in the following steps by Chen and Gupta (2001). First we test for no change point versus one change point, that is, we test the null hypothesis given by (2) versus the alternative given by (3). Here, κ is the unknown location of the single change point. If H0 is not rejected, then the procedure is finished and there is no change point. If H0 is rejected, then there is a change point and we continue with step 2. In the second step we test separately, for a change, the two subsequences before and after the change point found in the first step. In the sequel, we repeat these two steps until no further subsequences have change points. At the end of the procedure, the collection of change point locations found by the previous steps constitutes the set of the change points.
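The stepwise procedure just described can be sketched in code. The following is a minimal sketch, not the authors' implementation: `single_change_test` is a hypothetical interface standing in for any single change point test that returns a p-value together with the estimated change location.

```python
def binary_segmentation(x, single_change_test, alpha=0.05, offset=0):
    """Recursively locate all change points with a generic single change
    point test (hypothetical interface: returns (p_value, k_hat))."""
    if len(x) < 2:
        return []
    p_value, k_hat = single_change_test(x)
    if p_value >= alpha:   # step 1: H0 not rejected, no change point here
        return []
    # step 2: test the subsequences before and after the detected change
    left = binary_segmentation(x[:k_hat], single_change_test, alpha, offset)
    right = binary_segmentation(x[k_hat:], single_change_test, alpha,
                                offset + k_hat)
    return left + [offset + k_hat] + right
```

The recursion mirrors the two steps above: it stops on any subsequence for which the single change hypothesis is not rejected, and the collected locations form the final set of change points.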

The subject of change point analysis is twofold: on the one hand, to detect whether there are one or more changes in a sequence of observations; on the other hand, to estimate the number of changes and their corresponding locations. In this paper we will develop an information theoretic procedure, based on divergence measures, in order to study the change point problem. The background is a general parametric, univariate or multivariate family of distributions. We describe formally the framework and the problem in Section 2, and the main results are presented in Section 3. In Section 4 we focus our interest on a specific distribution, the binomial distribution, and a simulation study is performed in order to compare the accuracy of the new test-statistic with some pre-existing test-statistics. In the final Section 5, the general results of this paper are illustrated by means of the well-known Lindisfarne scribes data set.

2 Information theoretic procedure

Consider now the single change point problem, that is the problem of testing the pair of hypotheses

 H0 : Xi≡fθ0, i=1,...,K, (4a)

 H1 : Xi≡fθ0, i=1,...,κ and Xi≡fθ1, i=κ+1,...,K, (4b)

which are presented by (2) and (3), respectively. In the above formulation, θ0 and θ1 are unknown. Since κ is the unknown location of the single change point, we will consider all the candidate points k∈{1,...,K−1}. Let ˆθ(K)0,k denote the maximum likelihood estimator (MLE) of θ0 based on the random sample X1,...,Xk from fθ0 and let ˆθ(K)1,k denote the MLE of θ1 based on the random sample Xk+1,...,XK from fθ1. If the hypothesis H1 is true, then there is a difference between the probabilistic models fˆθ(K)0,k and fˆθ(K)1,k, which causes a large value of any measure of the distance between fˆθ(K)0,k and fˆθ(K)1,k. Given that the ϕ-divergence is a broad family of distance measures between probability distributions, the ϕ-divergence between fˆθ(K)0,k and fˆθ(K)1,k is large if H1 is true and hence it can be used to decide whether the candidate point k in (4b) is a change point (κ=k). Taking into account that the MLEs ˆθ(K)0,k and ˆθ(K)1,k of θ0 and θ1, respectively, depend on the candidate change point k, we adopt the following notation for the ϕ-divergence between fˆθ(K)0,k and fˆθ(K)1,k,
 D(k)ϕ = D(k)ϕ(fˆθ(K)0,k, fˆθ(K)1,k) = ∫X(d) fˆθ(K)1,k(x) ϕ(fˆθ(K)0,k(x)/fˆθ(K)1,k(x)) dμ(x), (5)

provided that the convex function ϕ satisfies some additional conditions (see page 408 in Pardo (2006)) which ensure the existence of the above integral. Moreover, we consider convex functions ϕ which satisfy ϕ(1) = 0 and ϕ′′(1) > 0. Large values of D(k)ϕ support the existence of a change point and therefore large values of D(k)ϕ suggest rejection of the null hypothesis H0. Hence D(k)ϕ can be used as a test statistic for testing the hypotheses (4a)-(4b). Then, motivated by the fact that large values of D(k)ϕ are in favor of H1, a test for the existence of a single change point, that is for the hypotheses (4a)-(4b), should be based on the ϕ-divergence test statistic,

 T(K)ϕ=maxk∈{1,...,K−1}T(K)ϕ(k), (6)

where

 T(K)ϕ(k) = (k(K−k)/K) (2/ϕ′′(1)) D(k)ϕ(fˆθ(K)0,k, fˆθ(K)1,k). (7)

Moreover, the unknown position κ of the change point is estimated by ˆκϕ such that

 ˆκϕ = argmaxk∈{1,...,K−1} T(K)ϕ(k) = argmaxk∈{1,...,K−1} (k(K−k)/K) D(k)ϕ(fˆθ(K)0,k, fˆθ(K)1,k). (8)

Based on the above discussion, H0 in (4a) is rejected for T(K)ϕ > c, where c is a constant to be determined by the null distribution of T(K)ϕ. Hence, in order to use T(K)ϕ of (6) for testing the hypotheses (4a)-(4b), knowledge of the distribution of T(K)ϕ under H0 is necessary.

There are two important reasons why working directly with the test-statistic T(K)ϕ, defined in (6), is avoided: on the one hand, its asymptotic distribution is not an easy to handle random variable (see for instance Theorems 1.2 and 1.3 in Gombay and Horváth (1989)) and, on the other hand, in practice changes located very close to the ends of the sequence are very difficult to detect. Let N(ϵ) be the set of all integers k such that ϵK ≤ k ≤ (1−ϵ)K, with ϵ > 0 small enough. We shall modify (6) to be maximized in N(ϵ), i.e.

 ϵT(K)ϕ=maxk∈N(ϵ)T(K)ϕ(k), (9)

and in the same manner (8) becomes

 ϵˆκϕ = argmaxk∈N(ϵ) T(K)ϕ(k) = argmaxk∈N(ϵ) (k(K−k)/K) D(k)ϕ(fˆθ(K)0,k, fˆθ(K)1,k). (10)
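The restricted maximization in (9) and (10) amounts to a simple scan over the trimmed candidate set. The sketch below assumes a generic callable `T` computing the per-candidate statistic T(K)ϕ(k); both names are illustrative, not part of the paper's notation.

```python
def trimmed_scan(K, T, eps=0.05):
    """Maximize a per-candidate statistic T(k) over the trimmed candidate
    set N(eps) = {k : eps*K <= k <= (1-eps)*K}, returning the maximum and
    the argmax, as in (9) and (10)."""
    candidates = [k for k in range(1, K) if eps * K <= k <= (1 - eps) * K]
    k_hat = max(candidates, key=T)   # epsilon-hat(kappa)_phi of (10)
    return T(k_hat), k_hat           # (epsilon-T_phi^(K), location estimate)
```

The choice of eps trades off detectable change locations against the ease of calibrating the limiting distribution T(m,eps).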

3 Main result

In order to get the asymptotic distribution of the family of test-statistics ϵT(K)ϕ, given in (9), we shall assume the usual regularity assumptions for the multiparameter Central Limit Theorem (see for instance Theorem 5.2.2 in Sen and Singer (1993)):

(i)

The parameter space, Θ, is either ℝm or a rectangle in ℝm.

(ii)

For all θ, θ′ ∈ Θ with θ ≠ θ′,

 μ({x∈X(d):fθ(x)≠fθ′(x)})>0.
(iii)

For θ ∈ Θ,

 ∂fθ(x)/∂θi and ∂2fθ(x)/∂θi∂θj, i,j∈{1,...,m},

exist almost everywhere and are such that

 |∂fθ(x)/∂θi| ≤ Hi(x) and |∂2fθ(x)/∂θi∂θj| ≤ Gij(x), i,j∈{1,...,m},

where

 ∫X(d) Hi(x) dμ(x) < ∞ and ∫X(d) Gij(x) dμ(x) < ∞, i,j∈{1,...,m}.
(iv)

Denoting ℓ(x;θ) = log fθ(x),

 ∂ℓ(x;θ)/∂θi and ∂2ℓ(x;θ)/∂θi∂θj, i,j∈{1,...,m},

exist almost everywhere and are such that the Fisher information matrix IF(θ) is finite and positive definite. In addition, ψ(δ) → 0 as δ → 0, where

 ψ(δ) = Eθ[ sup{h: ∥h∥≤δ} ∥ ∂2ℓ(x;θ+h)/∂θ∂θT − ∂2ℓ(x;θ)/∂θ∂θT ∥ ],

with θ + h ∈ Θ, where ∥·∥ is the Euclidean norm.

Theorem 1

Under H0 in (4a) and the previous regularity assumptions (i)-(iv), the asymptotic distribution of (9) is given by

 ϵT(K)ϕ ⟶L(K→∞) Tm,ϵ, (11)

where m = dim(Θ),

 Tm,ϵ = sup s∈(1,(1−ϵ)^2/ϵ^2) (1/s) ∥W(m)0(s)∥^2, (12)

with W(m)0(s) being an m-dimensional vector of independent Brownian bridges and ϵ ∈ (0, 1/2).

Proof. According to the properties of the MLEs we know that

 √k (ˆθ(K)0,k − θ0) ⟶L(k→∞) N(0, IF(θ0)^{−1}),  √(K−k) (ˆθ(K)1,k − θ1) ⟶L((K−k)→∞) N(0, IF(θ1)^{−1}),

where IF(θ), θ ∈ Θ, is the Fisher information matrix. If we set λ(K)k = k/K, then

 √(k(K−k)/K) (ˆθ(K)0,k − θ0) ⟶L(K→∞) N(0, (1−λ(K)k) IF(θ0)^{−1}),  √(k(K−k)/K) (ˆθ(K)1,k − θ1) ⟶L(K→∞) N(0, λ(K)k IF(θ1)^{−1}).

This means that under H0, i.e. θ0 = θ1,

 √(k(K−k)/K) (ˆθ(K)0,k − ˆθ(K)1,k) ⟶L(K→∞) N(0, IF(θ0)^{−1}),

and hence we can construct a Wald-type test-statistic as follows

 Q(K)k = (k(K−k)/K) (ˆθ(K)0,k − ˆθ(K)1,k)^T ˆIF(θ0) (ˆθ(K)0,k − ˆθ(K)1,k), (13)

where ˆIF(θ0) is any consistent estimator of IF(θ0). From Theorem 1 in Hawkins (1987) we know that

 maxk∈N(ϵ) Q(K)k ⟶L(K→∞) Tm,ϵ.

In addition from Pardo (2006), page 443, we have

 T(K)ϕ(k) = (k(K−k)/K) (2/ϕ′′(1)) Dϕ(fˆθ(K)0,k, fˆθ(K)1,k) = Q(K)k + oP(1),

where

 Dϕ(fˆθ(K)0,k, fˆθ(K)1,k) = ∫ fˆθ(K)1,k(x) ϕ(fˆθ(K)0,k(x)/fˆθ(K)1,k(x)) dμ(x).

With both results we conclude (11).

Remark 2

If we compare (13) with formula (2.3) in Hawkins (1987), the two apparently are not equivalent, because in our case ˆIF(θ0) appears instead of the matrix of formula (2.3). This difference is associated with the way of understanding the Fisher information matrix; in fact, our Wald test-statistic coincides with the empirical stochastic process defined at the beginning of Section 3 in Hawkins (1987).

Remark 3

The probability distribution function of the random variable Tm,ϵ, ϵ ∈ (0, 1/2), given in (12), can be found in Sen (1981, page 397) and De Long (1981). The computation of the probability distribution function is complex; however, it is possible to approximate the p-value of the test in which the distribution of Tm,ϵ is considered under the null hypothesis. In Estrella (2003), for instance,

 ˜p−value(x,ϵ) = (1/Γ(m/2)) (x/2)^{m/2} exp{−x/2} ( log((1−ϵ)^2/ϵ^2)(1−m/x) + 2/x ), (14)

with Γ(·) being the Gamma function, is proposed as an approximation of

 p−value(x,ϵ) = Pr(Tm,ϵ > x) = Pr( sup s∈(1,(1−ϵ)^2/ϵ^2) (1/√s) ∥W(m)0(s)∥ > √x ) = (1/Γ(m/2)) (x/2)^{m/2} exp{−x/2} ( log((1−ϵ)^2/ϵ^2)(1−m/x) + 2/x + O(1/x^2) ).

When calibrating the approximation for a univariate parameter (m = 1), we can compare the exact quantiles of T1,ϵ, tabulated in the references above, with the values obtained from (14). Carrying out this comparison for the usual quantile orders shows that (14) approximates the exact quantiles very closely; in particular, the approximation is very accurate at the quantile order which is in practice of major interest.
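For a given m and ϵ the approximation (14) is easy to evaluate numerically; the following sketch implements it directly using only the Gamma function (the function name `approx_p_value` is ours).

```python
import math

def approx_p_value(x, m, eps):
    """Estrella's (2003) approximation (14) to Pr(T_{m,eps} > x):
    (1/Gamma(m/2)) (x/2)^{m/2} e^{-x/2}
        * [log((1-eps)^2/eps^2) (1 - m/x) + 2/x]."""
    log_term = math.log((1 - eps) ** 2 / eps ** 2)
    return ((x / 2) ** (m / 2) * math.exp(-x / 2) / math.gamma(m / 2)
            * (log_term * (1 - m / x) + 2 / x))
```

Approximate quantiles of Tm,ϵ can then be obtained by solving ˜p−value(x,ϵ) = α for x, e.g. by bisection.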

4 Simulation Study

In this section we are going to focus on the change point analysis for a particular discrete probability model, the binomial model. For this special case we will give an explicit expression for the divergence based test-statistics. Their accuracy will be compared by simulation with respect to pre-existing test-statistics. In this context, suppose we are dealing with a sequence of independent binomial r.v.'s ˜Xi, i = 1,...,K, with ˜Xi having ni trials and success probability θi, for which we are interested in testing (1). In order to do that we are going to consider the associated sequence of independent Bernoulli r.v.'s Xih, h = 1,...,ni, i = 1,...,K, whose probability mass function (p.m.f.) is given by pθi(x) = θi^x (1−θi)^{1−x}, x ∈ {0,1}. If we denote the cumulative number of Bernoulli trials up to the k-th binomial r.v. by

 Nk = ∑^k_{i=1} ni,

the change points for the binomial sequence {˜Xi}, located at k1,...,kq−1, correspond to the points Nk1,...,Nkq−1 for the Bernoulli sequence {Xih}. Hence, {Xih} is the only sequence of r.v.'s which are strictly identically distributed within groups, but the change points of interest are located in {N1,...,NK−1}. This means that we can construct the test-statistic by considering a sequence of i.i.d. r.v.'s, but in addition we restrict the set of possible change points to {Nk : k = 1,...,K−1}, rather than considering one-step change points. When the change point is located at Nk, the MLEs of θ0 and θ1 are given by

 ˆθ(K)0,k = Yk/Nk, ˆθ(K)1,k = (YK−Yk)/(NK−Nk), Yk = ∑^k_{i=1} ˜Xi = ∑^k_{i=1} ∑^{ni}_{h=1} Xih.

The likelihood ratio test-statistic is given by S(K) = maxk∈{1,...,K−1} S(K)k, where

 S(K)k = 2[ Nk( ˆθ(K)0,k log(ˆθ(K)0,k/ˆθ(K)0,K) + (1−ˆθ(K)0,k) log((1−ˆθ(K)0,k)/(1−ˆθ(K)0,K)) ) + (NK−Nk)( ˆθ(K)1,k log(ˆθ(K)1,k/ˆθ(K)0,K) + (1−ˆθ(K)1,k) log((1−ˆθ(K)1,k)/(1−ˆθ(K)0,K)) ) ]. (15)

Two important papers which cover S(K) are Worsley (1983) and Horváth (1989). The expression they give for S(K)k is not exactly the same, but it is equivalent to (15) (see formula (3.22) in Horváth and Serbinowska (1995)). Horváth (1989) found that the normalization of S(K) based on the Darling-Erdős formula,

 G(K) = (2 log NK)^{1/2} (S(K))^{1/2} − 2 log NK − (1/2) log log NK + (1/2) log π,

is asymptotically distributed as an Extreme Value (Gumbel) random variable. In addition, in Theorem 1.2 of Horváth and Serbinowska (1995), a modified version of the likelihood ratio test-statistic was given, ˜S(K) = maxk∈{1,...,K−1} ˜S(K)k, where

 ˜S(K)k = (Nk(NK−Nk)/NK^2) S(K)k.
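The computation of S(K)k in (15) and of the modified statistic ˜S(K)k above can be sketched as follows; the helper names are ours, and all estimates are assumed to lie strictly inside (0,1).

```python
import math

def lrt_statistic(y_k, n_k, y_K, n_K):
    """S_k^(K) of (15); theta_hat_{0,K} = Y_K/N_K is the pooled MLE under
    H0.  Assumes every estimate lies strictly inside (0, 1)."""
    th0k = y_k / n_k                    # MLE before the candidate point
    th1k = (y_K - y_k) / (n_K - n_k)    # MLE after the candidate point
    th0K = y_K / n_K                    # pooled MLE under H0
    def part(n, th):   # one group's contribution to the log-likelihood ratio
        return n * (th * math.log(th / th0K)
                    + (1 - th) * math.log((1 - th) / (1 - th0K)))
    return 2 * (part(n_k, th0k) + part(n_K - n_k, th1k))

def modified_lrt(y_k, n_k, y_K, n_K):
    """tilde S_k^(K) = (N_k (N_K - N_k) / N_K^2) S_k^(K)."""
    return n_k * (n_K - n_k) / n_K ** 2 * lrt_statistic(y_k, n_k, y_K, n_K)
```

The weight Nk(NK−Nk)/NK^2 damps the statistic near the ends of the sequence, which is what yields the Brownian-bridge limit instead of the Darling-Erdős normalization.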

The asymptotic distribution of ˜S(K) is the supremum in t ∈ (0,1) of the square of a standard univariate Brownian bridge (its probability distribution function is tabulated in Kiefer (1959)). We consider the version of the Wald test-statistic Q(K) = maxk∈{1,...,K−1} Q(K)k, with

 Q(K)k = (Nk(NK−Nk)/NK) (ˆθ(K)0,k − ˆθ(K)1,k)^2 ˆIF(θ0),

where the consistent estimator of IF(θ0) is given by

 ˆIF(θ0) = (Nk/NK) IF(ˆθ(K)0,k) + ((NK−Nk)/NK) IF(ˆθ(K)1,k) = (Nk/NK) [1/(ˆθ(K)0,k(1−ˆθ(K)0,k))] + ((NK−Nk)/NK) [1/(ˆθ(K)1,k(1−ˆθ(K)1,k))].
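Putting together the MLEs and the pooled estimator ˆIF(θ0), the Wald statistic Q(K)k for a candidate change point at Nk can be computed as follows (a sketch with illustrative names, assuming both estimates lie strictly inside (0,1)).

```python
def wald_statistic(y_k, n_k, y_K, n_K):
    """Q_k^(K) for the binomial model: MLEs before and after the candidate
    change point Nk, plus the pooled consistent estimate of I_F(theta0)."""
    th0 = y_k / n_k                    # hat(theta)^(K)_{0,k} = Y_k / N_k
    th1 = (y_K - y_k) / (n_K - n_k)    # hat(theta)^(K)_{1,k}
    i_hat = ((n_k / n_K) / (th0 * (1 - th0))
             + ((n_K - n_k) / n_K) / (th1 * (1 - th1)))
    return n_k * (n_K - n_k) / n_K * (th0 - th1) ** 2 * i_hat
```

Note that Q(K)k vanishes whenever the two subsample proportions coincide, and grows with both the squared difference of the proportions and the effective sample size Nk(NK−Nk)/NK.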

Finally, in order to give an explicit expression for the divergence based test-statistics we are going to focus on a particular family of divergences, the power divergences (see Read and Cressie (1988)), for which ϕ(x) = ϕλ(x) = (x^{λ+1} − x − λ(x−1))/(λ(1+λ)), if λ(1+λ) ≠ 0, and ϕλ(x) = limu→λ ϕu(x), if λ(1+λ) = 0. That is, for each λ we obtain a different divergence measure between the p.m.f.s pθ0 and pθ1,

 Dλ(pθ0,pθ1) = (1/(λ(1+λ))) ( θ0^{λ+1}/θ1^λ + (1−θ0)^{λ+1}/(1−θ1)^λ − 1 ), if λ(1+λ) ≠ 0.

When λ = 0 the power divergence coincides with the so-called Kullback divergence

 D0(pθ0,pθ1) = DKull(pθ0,pθ1) = θ0 log(θ0/θ1) + (1−θ0) log((1−θ0)/(1−θ1)),

and when λ = −1 the power divergence coincides with the modified Kullback divergence DKull(pθ1,pθ0). Hence, the power-divergence based test-statistics take the form T(K)λ = maxk∈{1,...,K−1} Tλ(ˆθ(K)0,k, ˆθ(K)1,k), where

 Tλ(ˆθ(K)0,k, ˆθ(K)1,k) = (2Nk(NK−Nk)/NK) Dλ(pˆθ(K)0,k, pˆθ(K)1,k),

that is

 Tλ(ˆθ(K)0,k, ˆθ(K)1,k) = (Nk(NK−Nk)/NK) (2/(λ(1+λ))) ( (ˆθ(K)0,k)^{λ+1}/(ˆθ(K)1,k)^λ + (1−ˆθ(K)0,k)^{λ+1}/(1−ˆθ(K)1,k)^λ − 1 ), for λ(1+λ) ≠ 0, (16)

and

 T0(ˆθ(K)0,k, ˆθ(K)1,k) = (2Nk(NK−Nk)/NK) ( ˆθ(K)0,k log(ˆθ(K)0,k/ˆθ(K)1,k) + (1−ˆθ(K)0,k) log((1−ˆθ(K)0,k)/(1−ˆθ(K)1,k)) ). (17)
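Formulas (16) and (17) can be sketched as a single routine; `power_divergence_stat` is an illustrative name, the case λ = −1 (where λ(1+λ) = 0) is not handled, and the estimates are assumed to lie strictly inside (0,1).

```python
import math

def power_divergence_stat(lam, th0, th1, n_k, n_K):
    """T_lambda of (16), with lambda = 0 handled by its Kullback-divergence
    limit (17); lambda = -1 is excluded (lam*(1+lam) = 0)."""
    w = n_k * (n_K - n_k) / n_K          # weight N_k (N_K - N_k) / N_K
    if lam == 0:                         # formula (17)
        d = (th0 * math.log(th0 / th1)
             + (1 - th0) * math.log((1 - th0) / (1 - th1)))
    else:                                # formula (16), lam*(1+lam) != 0
        d = (th0 ** (lam + 1) / th1 ** lam
             + (1 - th0) ** (lam + 1) / (1 - th1) ** lam
             - 1) / (lam * (1 + lam))
    return 2 * w * d
```

The statistic is zero when the two subsample proportions coincide, and the λ = 0 branch is the continuous limit of the general branch, so small λ values give essentially the Kullback version.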

Assuming that there is a monotone, continuous function g(·) such that

 limK→∞ maxk∈N(ϵ) | Nk(NK−Nk)/NK − g(k(K−k)/K) | = 0,

the asymptotic distribution of T(K)λ and Q(K), for all λ, is the supremum of the univariate tied-down Bessel process, i.e. (12) with m = 1. This assumption is very similar to the one given in Horváth and Serbinowska (1995) for the asymptotic distribution of ˜S(K).

A simulation study is performed in order to compare the accuracy of the proposed power divergence type tests with respect to pre-existing test-statistics. In this context we apply the test-statistics G(K), ˜S(K), Q(K) and T(K)λ, for several values of λ, with a number of replications for each scenario. The design is essentially the same as the one of the study performed in Horváth and Serbinowska (1995):