# Nonparametric maximum likelihood approach to multiple change-point problems

Nankai University, University of Hong Kong, Nankai University and Nankai University
C. Zou
Z. Wang
Institute of Statistics
Nankai University
Tianjin 300071
China
G. Yin
Department of Statistics
and Actuarial Science
University of Hong Kong
Hong Kong
L. Feng
School of Mathematical Sciences
Nankai University
Tianjin 300071
China
Received November 2013; revised February 2014.
###### Abstract

In multiple change-point problems, different data segments often follow different distributions, for which the changes may occur in the mean, scale or the entire distribution from one segment to another. Without the need to know the number of change-points in advance, we propose a nonparametric maximum likelihood approach to detecting multiple change-points. Our method does not impose any parametric assumption on the underlying distributions of the data sequence, and is thus suitable for detecting any changes in the distributions. The number of change-points is determined by the Bayesian information criterion, and the locations of the change-points can be estimated via the dynamic programming algorithm together with the intrinsic order structure of the likelihood function. Under some mild conditions, we show that the new method provides consistent estimation with an optimal rate. We also suggest a prescreening procedure to exclude most of the irrelevant points prior to the implementation of the nonparametric likelihood method. Simulation studies show that the proposed method has satisfactory performance in identifying multiple change-points in terms of estimation accuracy and computation time.

doi: 10.1214/14-AOS1210. The Annals of Statistics, Volume 42, Issue 3 (2014), 970–1002.

Running title: Nonparametric detection of multiple change-points

Changliang Zou (nk.chlzou@gmail.com), Guosheng Yin (gyin@hku.hk), Long Feng (flnankai@126.com) and Zhaojun Wang (zjwang@nankai.edu.cn).

Supported in part by the NNSF of China Grants 11131002, 11101306 and 11371202, the RFDP of China Grant 20110031110002, the Foundation for the Author of National Excellent Doctoral Dissertation of PR China 201232, New Century Excellent Talents in University, and a Grant (784010) from the Research Grants Council of Hong Kong.

AMS subject classifications: Primary 62G05; secondary 62G20. Keywords: BIC, change-point estimation, Cramér–von Mises statistic, dynamic programming, empirical distribution function, goodness-of-fit test.

## 1 Introduction

The literature devoted to change-point models is vast, particularly in the areas of economics, genome research, quality control, and signal processing. When there are notable changes in a sequence of data, we can typically break the sequence into several data segments, so that the observations within each segment are relatively homogeneous. In conventional change-point problems, the posited models for different data segments often share the same structure but differ in parameter values. However, the underlying distributions are typically unknown, so parametric methods potentially suffer from model misspecification. Least-squares fitting is the standard choice for the multiple change-point problem (MCP), but its performance often deteriorates when the error follows a heavy-tailed distribution or when the data contain outliers.

Without imposing any parametric modeling assumption, we consider the multiple change-point problem (MCP) based on independent data $X_1, \ldots, X_n$, such that

 $X_i \sim F_k(x), \qquad \tau_{k-1} \le i \le \tau_k - 1,\ k = 1, \ldots, K_n + 1,$ (1)

where $K_n$ is the true number of change-points, the $\tau_k$'s are the locations of these change-points with the convention $\tau_0 = 1$ and $\tau_{K_n+1} = n + 1$, and $F_k$ is the cumulative distribution function (C.D.F.) of segment $k$, satisfying $F_k \ne F_{k+1}$. The number of change-points $K_n$ is allowed to grow with the sample size $n$.

Although extensive research has been conducted to estimate the number of change-points and the locations $\tau_k$, most of the work assumes that the $F_k$'s belong to some known parametric functional families or that they differ only in their locations (or scales). For a comprehensive coverage of single change-point problems, see Csörgő and Horváth (1997). The standard approach to the MCP is based on least-squares or likelihood methods via a dynamic programming (DP) algorithm in conjunction with a selection procedure such as the Bayesian information criterion (BIC) for determining the number of change-points [Yao (1988); Yao and Au (1989); Chen and Gupta (1997); Bai and Perron (1998, 2003); Braun, Braun and Müller (2000); Hawkins (2001); Lavielle (2005)]. By reframing the MCP in a variable selection context, Harchaoui and Lévy-Leduc (2010) proposed a penalized least-squares criterion with a LASSO-type penalty [Tibshirani (1996)]. Chen and Zhang (2012) developed a graph-based approach to detecting change-points, which is applicable to high-dimensional and non-Euclidean data. Other recent developments in this area include Rigaill (2010), Killick, Fearnhead and Eckley (2012) and Arlot, Celisse and Harchaoui (2012).

Our goal is to develop an efficient nonparametric procedure for the MCP in (1) without imposing any parametric structure on the $F_k$'s; virtually any salient difference between two successive C.D.F.'s (say, $F_k$ and $F_{k+1}$) would ensure detection of the change-point asymptotically. In the nonparametric context, most of the existing work focuses on the single change-point problem by using some seminorm of the difference between the pre- and post-change empirical distributions at the change-point [Darkhovskh (1976); Carlstein (1988); Dümbgen (1991)]. Guan (2004) studied a semiparametric change-point model based on the empirical likelihood, and applied the method to detect the change from a distribution to a weighted one. Zou et al. (2007) proposed another empirical likelihood approach without assuming any relationship between the two distributions. However, extending these methods to the MCP is not straightforward. Lee (1996) proposed to use the weighted empirical measure to detect two different nonparametric distributions over a window of observations, and then run the window through the full data sequence to detect the number of change-points. Although the approach of Lee (1996) is simple and easy to implement, our simulation studies show that even with elaborately chosen tuning parameters the estimates of the locations $\tau_k$ as well as the number of change-points are not satisfactory. This may be partly due to the "local" nature of the running window, so the information in the data is not fully and efficiently utilized. Matteson and James (2014) proposed a new estimation method, ECP, under multivariate settings, which is based on hierarchical clustering by recursively applying a single change-point estimation procedure.

Observing the connection between multiple change-points and goodness-of-fit tests, we propose a nonparametric maximum likelihood approach to the MCP. Our proposed nonparametric multiple change-point detection (NMCD) procedure can be regarded as a nonparametric counterpart of the classical least-squares MCP method [Yao (1988)]. Under some mild conditions, we demonstrate that the NMCD can achieve the optimal rate, $O_p(1)$, for the estimation of the change-points without any distributional assumptions. Due to the use of empirical distribution functions, technical arguments for controlling the supremum of the nonparametric likelihood function are nontrivial and are interesting in their own right. As a matter of fact, some techniques regarding the empirical process have been nicely integrated with the MCP methodologies. In addition, our theoretical results are applicable to the situation with a diverging number of change-points, that is, when the number of change-points, $K_n$, grows as the sample size $n$ goes to infinity. This substantially enlarges the scope of applicability of the proposed method, from a traditional fixed dimensionality to a more challenging high-dimensional setting.

In the proposed NMCD procedure, the number of change-points, $K_n$, is determined by the BIC. Given $L$, the DP algorithm utilizes the intrinsic order structure of the likelihood to recursively compute the maximizer of the objective function with a complexity of $O(Ln^2)$. To exclude most of the irrelevant points, we also suggest an initial screening procedure so that the NMCD is implemented in a much lower-dimensional space. Compared with existing parametric and nonparametric approaches, the proposed NMCD has satisfactory performance in identifying multiple change-points in terms of estimation accuracy and computation time. It offers robust and effective detection regardless of whether the $F_k$'s differ in location, scale, or shape.

The remainder of the paper is organized as follows. In Section 2, we first describe how to recast the MCP in (1) into a maximization problem and then introduce our nonparametric likelihood method followed by its asymptotic properties. The algorithm and practical implementation are presented in Section 3. The numerical performance and comparisons with other existing methods are presented in Section 4. Section 5 contains a real data example to illustrate the application of our NMCD method. Several remarks draw the paper to its conclusion in Section 6. Technical proofs are provided in the Appendix, and the proof of a corollary and additional simulation results are given in the supplementary material [Zou et al. (2014)].

## 2 Nonparametric multiple change-point detection

### 2.1 NMCD method

Assume that $X_1, \ldots, X_n$ are independent and identically distributed from $F_0$, and let $\hat F_n$ denote the empirical C.D.F. of the sample; then $n\hat F_n(u) \sim \mathrm{Binomial}(n, F_0(u))$. If we regard the sample as binary data with the probability of success $F_0(u)$, this leads to the nonparametric maximum log-likelihood $n\{\hat F_n(u)\log \hat F_n(u) + (1-\hat F_n(u))\log(1-\hat F_n(u))\}$.

In the context of (1), we can write the joint log-likelihood for a candidate set of change-points $(\tau_1', \ldots, \tau_L')$ as

 $L_u(\tau_1',\ldots,\tau_L') = \sum_{k=0}^{L} (\tau_{k+1}'-\tau_k')\bigl\{\hat F_{\tau_k'}^{\tau_{k+1}'}(u)\log \hat F_{\tau_k'}^{\tau_{k+1}'}(u) + \bigl(1-\hat F_{\tau_k'}^{\tau_{k+1}'}(u)\bigr)\log\bigl(1-\hat F_{\tau_k'}^{\tau_{k+1}'}(u)\bigr)\bigr\},$ (2.1)

where $\hat F_{\tau_k'}^{\tau_{k+1}'}(u)$ is the empirical C.D.F. of the subsample $\{X_{\tau_k'}, \ldots, X_{\tau_{k+1}'-1}\}$, with $\tau_0' = 1$ and $\tau_{L+1}' = n + 1$. To estimate the change-points $(\tau_1', \ldots, \tau_L')$, we can maximize (2.1) in an integrated form

 $R_n(\tau_1',\ldots,\tau_L') = \int_{-\infty}^{\infty} L_u(\tau_1',\ldots,\tau_L')\,dw(u),$ (3)

where $w(\cdot)$ is some positive weight function so that $R_n$ is finite, and the integral is used to combine all the information across $u$. The rationale for using (3) can be clearly seen from the behavior of its population counterpart. For simplicity, we assume that there exists only one change-point $\tau_1$, and let $q_1 = \tau_1/n$ and $\tau_1' = \lceil n\theta \rceil$ with $\theta \in (0,1)$. Through differentiation with respect to $\theta$, it can be verified that the limiting function of $n^{-1}L_u(\tau_1')$,

 $Q_u(\theta) = \theta\bigl\{F^{(1)}_\theta(u)\log F^{(1)}_\theta(u) + \bigl(1-F^{(1)}_\theta(u)\bigr)\log\bigl(1-F^{(1)}_\theta(u)\bigr)\bigr\} + (1-\theta)\bigl\{F^{(2)}_\theta(u)\log F^{(2)}_\theta(u) + \bigl(1-F^{(2)}_\theta(u)\bigr)\log\bigl(1-F^{(2)}_\theta(u)\bigr)\bigr\},$

increases as $\theta$ approaches $q_1$ from both sides, where

 $F^{(1)}_\theta(u) = \dfrac{\min(q_1,\theta)F_1(u) + \max(\theta-q_1,0)F_2(u)}{\min(q_1,\theta) + \max(\theta-q_1,0)}$ and $F^{(2)}_\theta(u) = \dfrac{\max(q_1-\theta,0)F_1(u) + \min(1-\theta,1-q_1)F_2(u)}{\max(q_1-\theta,0) + \min(1-\theta,1-q_1)},$

are the limits of the pre- and post-change empirical C.D.F.'s, respectively. This implies that the function $\int Q_u(\theta)\,dw(u)$ attains its local maximum at the true location of the change-point, $\theta = q_1$.

###### Remark

The log-likelihood function (2.1) is essentially related to the two-sample goodness-of-fit (GOF) test statistic based on the nonparametric likelihood ratio [Einmahl and McKeague (2003); Zhang (2006)]. To see this, let $X_1, \ldots, X_n$ be independent, and suppose that $X_1, \ldots, X_{n_1}$ have a common continuous distribution function $F_1$, and $X_{n_1+1}, \ldots, X_n$ have $F_2$. We are interested in testing the null hypothesis that $F_1(u) = F_2(u)$ for all $u$ against the alternative that $F_1(u) \ne F_2(u)$ for some $u$. For each fixed $u$, a natural approach is to apply the likelihood ratio test,

 $G_u = n_1\biggl\{\hat F_1^{n_1+1}(u)\log\frac{\hat F_1^{n_1+1}(u)}{\hat F_n(u)} + \bigl(1-\hat F_1^{n_1+1}(u)\bigr)\log\frac{1-\hat F_1^{n_1+1}(u)}{1-\hat F_n(u)}\biggr\} + (n-n_1)\biggl\{\hat F_{n_1+1}^{n+1}(u)\log\frac{\hat F_{n_1+1}^{n+1}(u)}{\hat F_n(u)} + \bigl(1-\hat F_{n_1+1}^{n+1}(u)\bigr)\log\frac{1-\hat F_{n_1+1}^{n+1}(u)}{1-\hat F_n(u)}\biggr\},$

where $\hat F_n$ corresponds to the empirical C.D.F. of the pooled sample. By noting that $n\hat F_n(u) = n_1\hat F_1^{n_1+1}(u) + (n-n_1)\hat F_{n_1+1}^{n+1}(u)$, $G_u$ would be of the same form as (2.1) with $L = 1$, up to a constant which does not depend on the segmentation point $n_1$. Einmahl and McKeague (2003) considered using an integrated version of $G_u$ to test whether there is at most one change-point.

In the two-sample GOF test, Zhang (2002, 2006) demonstrated that by choosing appropriate weight functions we can produce new omnibus tests that are generally much more powerful than the conventional ones, such as the Kolmogorov–Smirnov, Cramér–von Mises and Anderson–Darling test statistics. If we take $dw(u) = \{\hat F_n(u)(1-\hat F_n(u))\}^{-1}\,d\hat F_n(u)$, and also note that $L_u$ is zero for $u < X_{(1)}$ and $u \ge X_{(n)}$, where $X_{(1)} \le \cdots \le X_{(n)}$ represent the order statistics, the objective function in (3) can be rewritten as

 $R_n(\tau_1',\ldots,\tau_L') = \int_{X_{(1)}}^{X_{(n)}} L_u(\tau_1',\ldots,\tau_L')\{\hat F_n(u)(1-\hat F_n(u))\}^{-1}\,d\hat F_n(u) = n\sum_{k=0}^{L}\sum_{l=2}^{n-1}(\tau_{k+1}'-\tau_k')\frac{\hat F_{kl}\log\hat F_{kl} + (1-\hat F_{kl})\log(1-\hat F_{kl})}{l(n-l)},$ (4)

where $\hat F_{kl} = \hat F_{\tau_k'}^{\tau_{k+1}'}(X_{(l)})$. As recommended by Zhang (2002), we apply a common "continuity correction" by replacing each $\hat F_{kl}$ with a slightly shrunk version bounded away from 0 and 1, for all $k$ and $l$, so that the logarithms remain finite.
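As a concrete illustration, the integrated likelihood (4) can be computed directly from empirical C.D.F.'s. The sketch below is ours (Python); the specific continuity correction used is an illustrative assumption, not necessarily the paper's exact choice:

```python
import numpy as np

def nmcd_objective(x, taus):
    """Nonparametric log-likelihood R_n of a candidate segmentation, as in eq. (4).

    `taus` are interior change-point indices, so segment k is x[t_k:t_{k+1}]
    with t_0 = 0 and t_{L+1} = n.  The sum runs over the interior order
    statistics X_(l), l = 2, ..., n-1, weighted by {l(n-l)}^{-1}.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    bounds = [0] + sorted(taus) + [n]
    grid = np.sort(x)[1:n - 1]         # interior pooled order statistics
    l = np.arange(2, n)                # their ranks, 2..n-1
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = np.sort(x[a:b])
        m = b - a
        # empirical CDF of the segment evaluated at the pooled order statistics
        F = np.searchsorted(seg, grid, side="right") / m
        # continuity correction (one simple variant): keep F strictly in (0, 1)
        F = (F * m + 0.5) / (m + 1.0)
        total += m * np.sum((F * np.log(F) + (1 - F) * np.log(1 - F)) / (l * (n - l)))
    return n * total
```

With a genuine change in the data, the objective is larger at the true split than at a misplaced one, which is what the DP search exploits.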

To determine $L$ in the MCP, we observe that $x\log x + (1-x)\log(1-x)$ is a convex function in $x$, and thus

 $\max_{\tau_1'<\cdots<\tau_L'} R_n(\tau_1',\ldots,\tau_L') \le \max_{\tau_1'<\cdots<\tau_{L+1}'} R_n(\tau_1',\ldots,\tau_{L+1}'),$

which means that the maximum log-likelihood is a nondecreasing function in $L$. Hence, we can use Schwarz's Bayesian information criterion (BIC) to strike a balance between the likelihood and the number of change-points by incorporating a penalty for large $L$. More specifically, we identify the value of $K_n$ by minimizing

 $-\max_{\tau_1'<\cdots<\tau_L'} R_n(\tau_1',\ldots,\tau_L') + L\zeta_n,$ (5)

where $\zeta_n$ is a proper sequence going to infinity. Yao (1988) used the BIC with a $\log n$-type penalty to select the number of change-points and showed its consistency in the least-squares framework. However, the traditional BIC tends to select a model with some spurious change-points. Detailed discussions on the choice of $\zeta_n$ and other tuning parameters are given in Section 3.2.

### 2.2 Asymptotic theory

In the context of change-point estimation, it is well known that the points around the true change-point cannot be distinguished asymptotically with a fixed change magnitude. In the least-squares fitting, the total variation with perfect segmentation is asymptotically equivalent to that with an estimate of the change-point in a neighborhood of the true change-point [Yao and Au (1989)]. For example, suppose that there is only one change-point $\tau$ with a fixed change size; then we can only achieve $\hat\tau - \tau = O_p(1)$ as $n \to \infty$, where $\hat\tau$ denotes the maximum likelihood estimator (MLE) of $\tau$ [see Chapter 1 of Csörgő and Horváth (1997)]. For single change-point nonparametric models, consistency rates of various orders were obtained by Darkhovskh (1976) and, almost surely (a.s.), by Carlstein (1988), while Dümbgen (1991) achieved a rate of $O_p(1)$. The estimator in Lee (1996) is shown to be consistent a.s., with an explicit a.s. rate for the differences between the estimated and true locations of change-points.

Let $G_n(K_n)$ denote the set of estimates of the change-points using the proposed NMCD. The next theorem establishes the desirable property of the NMCD estimator when $K_n$ is prespecified: $G_n(K_n)$ is asymptotically close to the true change-point set. Let $C_{K_n}(\delta_n)$ contain all the sets in the $\delta_n$-neighborhood of the true locations,

 $C_{K_n}(\delta_n) = \bigl\{(\tau_1',\ldots,\tau_{K_n}') : 1<\tau_1'<\cdots<\tau_{K_n}'\le n,\ |\tau_s'-\tau_s|\le\delta_n \text{ for } 1\le s\le K_n\bigr\},$

where $\delta_n$ is some positive sequence. Denote $F_{r,\theta}(u) = \theta F_r(u) + (1-\theta)F_{r+1}(u)$ for $0 \le \theta \le 1$. For $1 \le r \le K_n$, define

 $\eta(u; F_r, F_{r,\theta}) = F_r(u)\log\biggl(\frac{F_r(u)}{F_{r,\theta}(u)}\biggr) + (1-F_r(u))\log\biggl(\frac{1-F_r(u)}{1-F_{r,\theta}(u)}\biggr),$

which is the Kullback–Leibler distance between two Bernoulli distributions with respective success probabilities $F_r(u)$ and $F_{r,\theta}(u)$. Hence, whenever $F_r(u) \ne F_{r,\theta}(u)$, and accordingly whenever $F_r(u) \ne F_{r+1}(u)$ with $\theta < 1$, $\eta(u; F_r, F_{r,\theta})$ is strictly larger than zero. Furthermore, for $1 \le r \le K_n$, define

 $\eta_r(u) = \eta(u; F_r, F_{r,1/2}) + \eta(u; F_{r+1}, F_{r,1/2}).$
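The building block $\eta$ is just the Kullback–Leibler divergence between two Bernoulli laws; a minimal helper (ours, for illustration) makes this explicit:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1.

    This is eta(u; F_r, F_{r,theta}) with p = F_r(u) and q = F_{r,theta}(u):
    it is zero iff p == q and strictly positive otherwise.
    """
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```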

To establish the consistency of the proposed NMCD, the following assumptions are imposed:

(A1) $F_1, \ldots, F_{K_n+1}$ are continuous, and $F_r \ne F_{r+1}$ for $1 \le r \le K_n$.

(A2) Let $\lambda_n = \min_{1\le k\le K_n+1}(\tau_k - \tau_{k-1})$ denote the minimal segment length; $\lambda_n \to \infty$ at a suitable rate as $n \to \infty$.

(A3) $\hat F_n(u) \to F(u)$ a.s. uniformly in $u$, where $F$ is the C.D.F. of the pooled sample.

(A4) $\inf_{1\le r\le K_n} \int \eta_r(u)\,dw(u)$ is a positive constant.

Assumption (A1) is required in some exponential tail inequalities as detailed in the proof of Lemma 2, although the $F_r$'s can be discrete or mixed distributions in practice. Assumption (A2) is a standard requirement for the theoretical development in the MCP, which allows the change-points to be asymptotically distinguishable. Assumption (A3) is a technical condition that is trivially satisfied by the Glivenko–Cantelli theorem when $K_n$ is finite. Generally, it can be replaced by the conditions that the limit $F$ of the pooled empirical C.D.F. exists and that $\sup_u|\hat F_n(u) - F(u)|$ converges to 0 a.s.; by the Dvoretzky–Kiefer–Wolfowitz inequality, the latter holds under mild restrictions. Assumption (A4) means that the smallest signal strength among all the changes is bounded away from zero.

We may consider relaxing assumption (A2) by allowing the required minimal segment length to shrink when the change magnitudes grow with $n$. It is intuitive that if two successive distributions are very different, then we do not need a very long segment to locate the change-point. For the mean change problem, Niu and Zhang (2012) and Hao, Niu and Zhang (2013) revealed that in order to obtain consistency, a condition of the form $\lambda_n\Delta_n^2 \gtrsim \log n$ is required, where $\Delta_n$ is the minimal jump size at the change-points (similar to the signal strength in (A4)). In our nonparametric setting, such an extension warrants future investigation.

###### Theorem 1

Under assumptions (A1)–(A4), if $\delta_n \to \infty$ and $\delta_n/\lambda_n \to 0$, then

 $\Pr\{G_n(K_n) \in C_{K_n}(\delta_n)\} \to 1 \qquad \text{as } n \to \infty.$

Under the classical mean change-point model, Yao and Au (1989) studied the property of the least-squares estimator,

 $\mathop{\arg\min}_{\tau_1'<\cdots<\tau_{K_n}'} \sum_{k=1}^{K_n+1}\sum_{i=\tau_{k-1}'}^{\tau_k'-1}\{X_i - \hat\mu(\tau_{k-1}',\tau_k')\}^2,$ (6)

where $\hat\mu(\tau_{k-1}', \tau_k')$ denotes the average of the observations $\{X_{\tau_{k-1}'}, \ldots, X_{\tau_k'-1}\}$. It is well known that the least-squares estimator is consistent with the optimal rate $O_p(1)$ when the number of change-points is known (and does not depend on $n$) and the change magnitudes are fixed; see Hao, Niu and Zhang (2013) and the references therein. Under a similar setting with $K_n$ known and fixed, we can establish the same rate of $O_p(1)$ for our nonparametric approach.

###### Corollary 1

Under assumptions (A1), (A2) and (A4), $\hat\tau_s - \tau_s = O_p(1)$ for $1 \le s \le K_n$.

The proof is similar to that of Theorem 1 and is provided in the supplementary material [Zou et al. (2014)]. With the knowledge of $K_n$, we can obtain an optimal rate of $O_p(1)$ without specifying the distributions, which is consistent with the single change-point case in Dümbgen (1991).

The next theorem establishes the consistency of the NMCD procedure with the BIC in (5). Let $\hat K_n$ denote the minimizer of (5) over $0 \le L \le \bar K_n$, where $\bar K_n$ is an upper bound on the true number of change-points.

###### Theorem 2

Under assumptions (A1)–(A4), with $\zeta_n$ of a suitably large order, $\Pr(\hat K_n = K_n) \to 1$ as $n \to \infty$.

It is remarkable that in the conventional setting where $K_n$ is bounded, we can use $\zeta_n$ of a slightly larger order than the $\log n$ used in its least-squares counterpart in Yao (1988). In conjunction with Theorem 1, this result implies that $\Pr\{G_n(\hat K_n) \in C_{K_n}(\delta_n)\} \to 1$ with a fixed number of change-points.

## 3 Implementation of NMCD

### 3.1 Algorithm

One important property of the proposed maximum likelihood approach is that (2.1) is separable. The optimum for splitting the $n$ cases into $L+1$ segments conceptually consists of first finding the rightmost change-point $\tau_L'$, and then finding the remaining change-points from the fact that they constitute the optimum for splitting the cases before $\tau_L'$ into $L$ segments. This separability is called Bellman's "principle of optimality" [Bellman and Dreyfus (1962)]. Thus, (2.1) can be maximized via the DP algorithm, and fitting such a nonparametric MCP model is straightforward and fast. The total computational complexity is $O(Ln^2)$ for a given $L$; see Hawkins (2001) and Bai and Perron (2003) for pseudo-code of the DP. Hawkins (2001) suggested using the DP on a grid of values. Harchaoui and Lévy-Leduc (2010) proposed using a LASSO-type penalized estimator to achieve a reduced version of the least-squares method. Niu and Zhang (2012) developed a screening and ranking algorithm to detect DNA copy number variations in the MCP framework.
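The Bellman recursion can be sketched generically. In this sketch (ours), the per-segment fitness is an arbitrary callable; we later test it with a simple least-squares cost standing in for the nonparametric likelihood, since the recursion itself is identical:

```python
import numpy as np

def dp_segment(cost, n, L):
    """Bellman DP: optimally split points 0..n-1 into L+1 segments.

    `cost(a, b)` returns the (to-be-maximized) fitness of segment [a, b).
    Requires O(L * n^2) cost evaluations, matching the DP complexity
    discussed in the text.  Returns (best value, L interior change-points).
    """
    V = np.full((L + 1, n + 1), -np.inf)   # V[k][j]: best value for x[0:j], k+1 segments
    back = np.zeros((L + 1, n + 1), dtype=int)
    for j in range(1, n + 1):
        V[0][j] = cost(0, j)
    for k in range(1, L + 1):
        for j in range(k + 1, n + 1):
            for t in range(k, j):          # t = start of the last (rightmost) segment
                v = V[k - 1][t] + cost(t, j)
                if v > V[k][j]:
                    V[k][j], back[k][j] = v, t
    taus, j = [], n                        # backtrack the change-points
    for k in range(L, 0, -1):
        j = back[k][j]
        taus.append(j)
    return V[L][n], sorted(taus)
```

The "rightmost change-point first" structure of the principle of optimality appears directly in the inner loop over `t` and in the backtracking pass.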

Due to the DP's computational complexity, which is quadratic in $n$, an optimal segmentation of a very long sequence could be computationally intensive; for example, DNA sequences nowadays are often extremely long [Fearnhead and Vasileiou (2009)]. To alleviate the computational burden, we introduce a preliminary screening step which excludes most of the irrelevant points; as a consequence, the NMCD is implemented in a much lower-dimensional space.

#### Screening algorithm

(i) Choose an appropriate integer $m$, which is the length of each subsequence of the data, and initialize the estimated change-point set $\mathcal{O} = \varnothing$.

(ii) Initialize $D_i = 0$ for $i \le m$ and $i > n - m$; for $m < i \le n - m$, update $D_i$ to be the Cramér–von Mises two-sample test statistic for the samples $\{X_{i-m+1}, \ldots, X_i\}$ and $\{X_{i+1}, \ldots, X_{i+m}\}$.

(iii) For each $i$, if $D_i$ is the maximum of the $D_j$'s in its local neighborhood $\{j : |j - i| \le m\}$, update $\mathcal{O} \leftarrow \mathcal{O} \cup \{i\}$.

Intuitively speaking, this screening step finds the most influential points that have the largest local jump sizes as quantified by the Cramér–von Mises statistic, and thus helps to avoid including too many candidate points around the true change-point. As a result, we obtain a candidate change-point set $\mathcal{O}$, of which the cardinality $|\mathcal{O}|$ is usually much smaller than $n$. Finally, we run the NMCD procedure within the set $\mathcal{O}$ using the DP algorithm to find the solution of

 $\mathop{\arg\max}_{\tau_1'<\cdots<\tau_L',\ \tau_k'\in\mathcal{O}} R_n(\tau_1',\ldots,\tau_L').$
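The screening steps can be sketched as follows. This is our reading of the algorithm: the exact window endpoints and local-maximum rule are assumptions, and the Cramér–von Mises statistic is computed in a simple unnormalized two-sample form:

```python
import numpy as np

def cvm_2samp(a, b):
    """Two-sample Cramér–von Mises statistic in a simple form:
    (n1*n2/N^2) * sum over pooled points of (F_a - F_b)^2."""
    a, b = np.sort(a), np.sort(b)
    z = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(a, z, side="right") / len(a)
    Fb = np.searchsorted(b, z, side="right") / len(b)
    n1, n2 = len(a), len(b)
    return n1 * n2 / (n1 + n2) ** 2 * np.sum((Fa - Fb) ** 2)

def screen_candidates(x, m):
    """Screening sketch: score each interior point by the CvM statistic of its
    two flanking windows of length m, then keep local maxima within distance m."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    D = np.zeros(n)
    for i in range(m, n - m):
        D[i] = cvm_2samp(x[i - m:i], x[i:i + m])
    keep = [i for i in range(m, n - m)
            if D[i] > 0 and D[i] == D[max(0, i - m):i + m + 1].max()]
    return D, keep
```

Points with locally maximal two-sample discrepancy form the candidate set $\mathcal{O}$ on which the DP is then run.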

Apparently, the screening procedure is fast because it mainly requires calculating Cramér–von Mises statistics. In contrast, Lee (1996) used a thresholding step to determine the number of change-points. The main difference between Lee (1996) and Niu and Zhang (2012) lies in the choice of the local test statistic; the former uses some seminorm of empirical distribution functions and the latter is based on the two-sample mean difference.

We next clarify how to choose the window length $m$; the following proposition formally establishes the consistency of the screening procedure.

###### Proposition 1

Under assumptions (A1)–(A2), if $m \to \infty$ and $m/\lambda_n \to 0$, then we have $\Pr\{\mathcal{O} \in H_l(\delta_n)\} \to 1$, where

 $H_l(\delta_n) = \bigl\{\mathcal{O} : \text{for each } \tau_r \text{ there exists at least a } \tau_s' \in \mathcal{O} \text{ so that } |\tau_s' - \tau_r| \le \delta_n\bigr\}.$

This result follows by verifying condition (A3) in Lee (1996); see Example II of Dümbgen (1991). With probability tending to one, the screening algorithm includes at least one point in the $\delta_n$-neighborhood of each true location by choosing an appropriate $m$. Given a candidate set $\mathcal{O}$, the computation of NMCD reduces to an optimization over $\mathcal{O}$, whose cost is quadratic in $|\mathcal{O}|$ in conjunction with the BIC. Both the R and FORTRAN codes for implementing the entire procedure are available from the authors upon request.

### 3.2 Selection of tuning parameters

We propose to take $dw(u) = \{\hat F_n(u)(1-\hat F_n(u))\}^{-1}\,d\hat F_n(u)$, which is found to be more powerful than simply using $dw(u) = d\hat F_n(u)$. The weight $\{\hat F_n(u)(1-\hat F_n(u))\}^{-1}$ attains its minimum at $\hat F_n(u) = 1/2$, that is, when $u$ is the median of the sample. Intuitively, when two successive distributions mainly differ in their centers, both choices of $w$ would be powerful because a large portion of observations are around the center. However, if the difference between two adjacent distributions lies in their tails, using $dw(u) = d\hat F_n(u)$ may not work well because only very limited information is included in the integral of (3). In contrast, our weight is larger for the more extreme observations (far away from the median).

To better understand this, we analyze the term $\int \eta_1(u)\,dw(u)$, which reflects the detection ability to a large extent. Consider a special case with $F_1(u) = u$ for $0 < u < 1$ and $F_2(u) = u/2$ for $0 < u < 2$, and thus

 $\eta_1(u) = \Bigl(u\log 2 + (1-u)\log\frac{1-u}{1-u/2}\Bigr)I(0 < u < 1) + \cdots.$

It is easy to check that $\int \eta_1(u)\,dw(u)$ with the weight $dw(u) = \{F(u)(1-F(u))\}^{-1}\,dF(u)$ is unbounded, while the counterpart with $dw(u) = dF(u)$ is finite. Consequently, the NMCD procedure would be more powerful by using the weight $\{F(u)(1-F(u))\}^{-1}\,dF(u)$.
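A stylized numerical check of this contrast (ours): using only the term of $\eta_1$ recovered above, and the weight $\{u(1-u)\}^{-1}$ as a stand-in for the pooled-C.D.F. weight, the truncated weighted integral grows without bound as the truncation shrinks, while the unweighted one stabilizes:

```python
import numpy as np

def eta1_term(u):
    # the recovered portion of eta_1(u) for F1(u) = u, F2(u) = u/2, on 0 < u < 1
    return u * np.log(2.0) + (1.0 - u) * np.log((1.0 - u) / (1.0 - u / 2.0))

def truncated_integrals(eps, npts=400001):
    """Integrate eta1_term over (eps, 1-eps), with and without the
    {u(1-u)}^{-1} weight (a stand-in for the paper's pooled-CDF weight)."""
    u = np.linspace(eps, 1.0 - eps, npts)
    f = eta1_term(u)
    return np.trapz(f / (u * (1.0 - u)), u), np.trapz(f, u)

for eps in (1e-2, 1e-4, 1e-6):
    weighted, unweighted = truncated_integrals(eps)
    print(f"eps={eps:g}  weighted={weighted:.3f}  unweighted={unweighted:.6f}")
```

As $u \to 1$ the recovered term tends to $\log 2 \ne 0$ while the weight blows up like $(1-u)^{-1}$, so the weighted integral diverges logarithmically in the truncation level.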

Under the assumption that $K_n \le \bar K_n$, with $\bar K_n$ fixed, we establish the consistency of the BIC in (5) for model selection. The choice of $\zeta_n$ depends on $\bar K_n$ and $\lambda_n$, which are unknown. The value of $\bar K_n$ depends on the practical consideration of how many change-points are to be identified, while $\lambda_n$ reflects the length of the smallest segment. For practical use, we take $\bar K_n$ to be fixed and recommend a $\zeta_n$ that grows slightly faster than $\log n$. A small value of $\zeta_n$ helps to prevent underfitting, as one is often reluctant to miss any important change-point. The performance of NMCD is insensitive to the choice of $\bar K_n$, as long as $\bar K_n$ is not too small, which is also to avoid underfitting. We suggest $\bar K_n = |\mathcal{O}|$, that is, the cardinality of the candidate change-point set in the screening algorithm.

## 4 Simulation studies

### 4.1 Model setups

To evaluate the finite-sample performance of the proposed NMCD procedure, we conduct extensive simulation studies, and also make comparisons with existing methods. We calculate the distances between the estimated set $\hat{\mathcal{G}}_n$ and the true change-point set $\mathcal{C}_t$ [Boysen et al. (2009)],

 $\xi(\hat{\mathcal{G}}_n\|\mathcal{C}_t) = \sup_{b\in\mathcal{C}_t}\inf_{a\in\hat{\mathcal{G}}_n}|a-b| \quad\text{and}\quad \xi(\mathcal{C}_t\|\hat{\mathcal{G}}_n) = \sup_{b\in\hat{\mathcal{G}}_n}\inf_{a\in\mathcal{C}_t}|a-b|,$

which quantify the over-segmentation error and the under-segmentation error, respectively. A desirable estimator should be able to balance both quantities. In addition, we consider the average Rand index [Fowlkes and Mallows (1983)], which measures the discrepancy of two sets from an average viewpoint.
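The two distances translate directly into code (names are ours):

```python
def seg_distances(est, truth):
    """The two segmentation distances from the text:
    first  = sup over true points of the distance to the nearest estimate,
    second = sup over estimates of the distance to the nearest true point.
    Both `est` and `truth` must be nonempty sequences of indices."""
    d_truth_to_est = max(min(abs(a - b) for a in est) for b in truth)
    d_est_to_truth = max(min(abs(a - b) for a in truth) for b in est)
    return d_truth_to_est, d_est_to_truth
```

For example, an estimate that misses a true change-point entirely inflates the first distance, while a spurious extra estimate inflates the second.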

Following model (I) introduced by Donoho and Johnstone (1995), we generate the Blocks datasets, which contain $K_n = 11$ change-points:

 Model (I): $X_i = \sum_{j=1}^{K_n} h_j J(nt_i - \tau_j) + \sigma\varepsilon_i$, $J(x) = \{1+\operatorname{sgn}(x)\}/2$,
 $\{\tau_j/n\} = \{0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81\}$,
 $\{h_j\} = \{2.01, -2.51, 1.51, -2.01, 2.51, -2.11, 1.05, 2.16, -1.56, 2.56, -2.11\}$,

where the $t_i$'s are equally spaced covariates in $[0,1]$. Three error distributions for $\varepsilon_i$ are considered: the standard normal $N(0,1)$, Student's $t$ distribution with three degrees of freedom $t(3)$, and the standardized (zero mean and unit variance) chi-squared distribution with one degree of freedom $\chi^2(1)$. The Blocks datasets, as depicted in the top three plots of Figure A.1 in the supplementary material [Zou et al. (2014)], are generally considered difficult for multiple change-point estimation due to highly heterogeneous segment levels and lengths.
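A sketch of generating model (I) data. The default noise level here is an arbitrary illustration, not the paper's setting; note that $J$ with $\operatorname{sgn}(0) = 0$ gives a half-step exactly at a boundary index, as the formula specifies:

```python
import numpy as np

def blocks_data(n, sigma=0.5, rng=None):
    """Simulate the Blocks signal of model (I) with K_n = 11 change-points."""
    rng = rng or np.random.default_rng()
    taus = np.array([0.1, 0.13, 0.15, 0.23, 0.25, 0.40,
                     0.44, 0.65, 0.76, 0.78, 0.81]) * n
    h = np.array([2.01, -2.51, 1.51, -2.01, 2.51, -2.11,
                  1.05, 2.16, -1.56, 2.56, -2.11])
    t = np.arange(1, n + 1) / n            # equally spaced covariates in (0, 1]
    J = lambda x: (1 + np.sign(x)) / 2     # step function from the model
    mu = sum(hj * J(n * t - tj) for hj, tj in zip(h, taus))
    return mu + sigma * rng.standard_normal(n)
```

Setting `sigma=0` returns the noiseless piecewise-constant mean, which is handy for checking segment levels.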

In a more complicated setting with both location and scale changes, we consider model (II) with $K_n = 4$:

 Model (II): $X_i = \sum_{j=1}^{K_n} h_j J(nt_i - \tau_j) + \sigma\varepsilon_i\prod_{j=1}^{S_i} v_j$ with $S_i = \sum_{j=1}^{K_n} J(nt_i - \tau_j)$,
 $\{h_j\} = \{3, 0, -2, 0\}$, $\{\tau_j/n\} = \{0.20, 0.40, 0.65, 0.85\}$ and $\{v_j\} = \{1, 5, 1, 0.25\}$,

where all the other setups are the same as those of model (I). As shown by the bottom three plots in Figure A.1, there are two location changes and two scale changes.

In addition, we include a simulation study when the distributions differ in the skewness and kurtosis. In particular, we consider

 Model (III): $X_i \sim F_j(x)$, $\{\tau_j/n\} = \{0.20, 0.50, 0.75\}$, $j = 1, 2, 3, 4,$

where $F_1, \ldots, F_4$ correspond to the standard normal, a standardized chi-squared (with zero mean and unit variance), a standardized Student's $t$, and the standard normal distribution, respectively. Because there is no mean or variance difference between the $F_j$'s, as depicted in the left panel of Figure A.4, the estimation for such a change-point problem is rather difficult. All the simulation results are obtained with 1000 replications.

### 4.2 Calibration of tuning parameters

To study the sensitivity to the choice of $\zeta_n$, Figure 1(a) shows the estimation performance against the constant factor in $\zeta_n$ under model (I). Clearly, the estimation is reasonably good with a value of around 1. For more adaptive model selection, a data-adaptive complexity penalty as in Shen and Ye (2002) could be considered.

In the screening procedure, the choice of $m$ needs to balance the computation and underfitting. By Proposition 1, $m$ should be small relative to $\lambda_n$, while $\lambda_n$ is typically unknown. In practice, we recommend a default value of $m$ that grows slowly with $n$. Figure 1(b) shows the curves of under-segmentation errors versus the value of $m$ under model (I). In a neighborhood of the recommended choice, our method provides a reasonably effective reduction of the subset and the performance is relatively stable. In general, we do not recommend too large a value of $m$, so as to avoid underfitting. From the results shown in Section 4.6, the choices of $\zeta_n$ and $m$ also work well when the number of change-points increases with the sample size.

### 4.3 Comparison between NMCD and PL

First, under model (I) with location changes only, we compare NMCD with the parametric likelihood (PL) method, which coincides with the classical least-squares method in (6) under the normality assumption [Yao (1988)]. We also consider a variant of NMCD using the unweighted choice $dw(u) = d\hat F_n(u)$ (abbreviated as NMCD*). The comparison is conducted both with and without knowing the true number of change-points $K_n$. Table 1 presents the average values of $\xi(\hat{\mathcal{G}}_n\|\mathcal{C}_t)$ and $\xi(\mathcal{C}_t\|\hat{\mathcal{G}}_n)$ for sample sizes up to $n = 1000$ when $K_n$ is known to be 11. To gain more insight, we also present the standard deviations of the two distances in parentheses. Simulation results with other values of $n$ can be found in the supplementary material [Zou et al. (2014)].

As expected, the PL has superior efficiency in the case with normal errors, since the parametric model is correctly specified. The NMCD procedure also offers satisfactory performance, and the differences in the two distance values between NMCD and PL are extremely small, while both methods significantly outperform the NMCD* procedure. For the cases with $t(3)$ and $\chi^2(1)$ errors, the NMCD procedure almost uniformly outperforms the PL in terms of estimation accuracy of the locations. Not only are the two distance values smaller, but the corresponding standard deviations are also much smaller using the NMCD.

Next, we consider the case of unknown $K_n$, for which both the NMCD and PL procedures are implemented by setting an upper bound on the number of change-points and using the BIC to choose the number of change-points. The average values of the distances $\xi(\hat{\mathcal{G}}_n\|\mathcal{C}_t)$ and $\xi(\mathcal{C}_t\|\hat{\mathcal{G}}_n)$ are tabulated in Table 2. In addition, we also present the average values of $\hat K_n$ with standard deviations in parentheses, which reflect the overall estimation accuracy of the number of change-points. Clearly, the two methods have comparable performance under the normal error, while the proposed NMCD significantly outperforms PL in terms of both distances for the two nonnormal cases, because the efficiency of the BIC used in PL relies heavily on the parametric assumption. When we compare the results across Tables 1 and 2, the standard deviations for the distance measures increase from the known-$K_n$ to the unknown-$K_n$ cases, as estimating $K_n$ further enlarges the variability.

We turn to the comparison between NMCD and PL under model (II) in which both location and scale changes are exhibited. In this situation, the standard least-squares method (6) does not work well because it is constructed for location changes only. To further allow for scale changes under the PL method, we consider

 $\mathop{\arg\min}_{\tau_1'<\cdots<\tau_{K_n}'} \sum_{k=0}^{K_n}(\tau_{k+1}'-\tau_k')\log\hat\sigma_k^2,$ (7)

where $\hat\sigma_k^2$ is the sample variance of the observations in the $k$th segment, and the BIC is modified accordingly. The bottom panels of Tables 1 and 2 tabulate the two distance measures when $K_n$ is specified in advance and when it is estimated by the BIC, respectively. Clearly, the NMCD method delivers satisfactory detection performance in the normal case and performs much better than the PL method in the two nonnormal cases. Therefore, the conclusion remains that the PL method is generally sensitive to model specification, while the NMCD does not depend on any parametric modeling assumption and is thus much more robust.
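Criterion (7) is straightforward to evaluate for a candidate segmentation; a hedged sketch (ours), taking the segment variance as the plain maximum-likelihood variance:

```python
import numpy as np

def pl_scale_criterion(x, taus):
    """Least-squares-type criterion (7), allowing location and scale changes:
    sum over segments of (segment length) * log(segment variance).
    Smaller is better; minimized over candidate segmentations."""
    x = np.asarray(x, dtype=float)
    bounds = [0] + sorted(taus) + [len(x)]
    return sum((b - a) * np.log(np.var(x[a:b]))
               for a, b in zip(bounds[:-1], bounds[1:]) if b > a)
```

A split at a genuine variance change yields segments that are individually more homogeneous, hence a smaller criterion value than a misplaced split.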

### 4.4 Comparisons of NMCD with other nonparametric methods

We consider the methods of Lee (1996) and Matteson and James (2014), as they also make no assumptions regarding the nature of the changes. The NMCD is implemented with the initial nonparametric screening procedure, and $\hat K_n$ is selected by the BIC. In both our screening procedure and Lee's (1996) method, the same window length is used, and the threshold value of the latter is chosen accordingly. The ECP method of Matteson and James (2014) is implemented using the "ecp" R package with the false alarm rate 0.05.

Table 3 shows the comparison results based on